Corpora
Textual corpora that document language use are invaluable for research in many areas of linguistics, and for collecting the statistical information needed to build natural language processing applications. MILA has collected or acquired a number of Hebrew corpora from various domains. All are available in plain text format, and most are also available in tokenized, morphologically analyzed, and morphologically disambiguated versions.
All corpora follow the standards developed by MILA.
| Corpus | Description | # Tokens | # Types |
|---|---|---|---|
| HaAretz | News and articles from the HaAretz news website, 1990-1991. | 11,097,790 | 305,545 |
| Arutz 7 | News and articles from the Arutz 7 news website, 2001-2006. | 15,107,618 | 323,943 |
| TheMarker | Articles from the financial newspaper TheMarker, May to October 2002. | 692,919 | 62,216 |
| HaKnesset | Session protocols of the Knesset (Israeli Parliament), January 2004 to November 2005. | 15,066,731 | 204,967 |
| Wikipedia | Articles from the Hebrew Wikipedia online encyclopedia, 2010. | 133,271,332 | 1,716,031 |
| Doctors | Articles from the Doctors medical website. | 232,695 | |
| Infomed | Question and answer discussions from the Infomed website's medical forum, January 2006 - September 2007. | 189,586 | |
| Nature of Healing | Articles and forum discussions from the Nature of Healing neuropathy medical website. | 75,969 | |
| To Be Healthy | Articles and forum discussions from the To Be Healthy (L'Hiyot Bari, 2b-bari) medical website. | 839,899 | |
| Tapuz People Forum | Forum discussions from the Tapuz People website, on a variety of subjects. | 1,397,173 | |
| Hebrew CHILDES | Spoken Hebrew conversations between children and between children and adults. | | |
| Spoken Israeli Hebrew | Spoken Hebrew conversations and parts of the Corpus of Spoken Israeli Hebrew (CoSIH). | 92,838 | 11,635 |
| Hebrew Dotted Text | Articles from beginner-Hebrew newspapers Shaar LaMatchil and Yanshuf. Text includes dots (niqqud/vocalization). | Shaar LaMatchil: 8,419; Yanshuf: 11,946 | Shaar LaMatchil: 4,811; Yanshuf: 6,002 |
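
In the table, a token is one running word occurrence and a type is one distinct word form. The sketch below illustrates how such counts could be computed from a plain text file. It assumes simple whitespace tokenization, which is only an approximation: the tokenizer actually used to produce the figures above is not described here, so the function name `token_type_counts` and the splitting rule are illustrative assumptions, and the results will not necessarily match the published numbers.

```python
from collections import Counter
from typing import Tuple


def token_type_counts(text: str) -> Tuple[int, int]:
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are separated by whitespace. A real Hebrew
    tokenizer would also handle punctuation and prefixed particles.
    """
    tokens = text.split()
    # Each distinct token string counts as one type.
    types = Counter(tokens)
    return len(tokens), len(types)


if __name__ == "__main__":
    # Three running words, two of them identical.
    sample = "שלום עולם שלום"
    n_tokens, n_types = token_type_counts(sample)
    print(n_tokens, n_types)
```

To apply this to one of the plain text corpora, read the file with `open(path, encoding="utf-8").read()` and pass the result to `token_type_counts`; the encoding is also an assumption and should be checked against the downloaded files.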
