Textual corpora that document language use are invaluable for research in various areas of linguistics, as well as for collecting the statistical information needed to build a variety of natural language processing applications. MILA has collected or acquired a number of Hebrew corpora from various domains. All are available in plain text format, and most also have tokenized, morphologically analyzed, and morphologically disambiguated versions.
All corpora follow the standards developed by MILA.
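The table below reports a token count and a type count for each corpus. As a rough illustration of what those two figures measure, the following sketch counts tokens (running words) and types (distinct word forms) in a plain-text string. It assumes naive whitespace tokenization, which is not MILA's tokenizer, and `count_tokens_and_types` is a hypothetical helper name introduced only for this example.

```python
from collections import Counter


def count_tokens_and_types(text):
    """Return (token count, type count) for whitespace-separated text.

    Assumption: tokens are split on whitespace only. MILA's own
    tokenization standard is not reproduced here, so counts obtained
    this way would differ from the published figures.
    """
    tokens = text.split()      # every running word is one token
    types = Counter(tokens)    # one entry per distinct word form
    return len(tokens), len(types)


# Illustrative sentence: "the boy saw the dog and the dog saw the boy"
sample = "הילד ראה את הכלב והכלב ראה את הילד"
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)
```

Because Hebrew attaches prefixes such as the conjunction and the definite article to the following word, whitespace splitting treats each prefixed form as a separate type, which is one reason a morphologically analyzed version of a corpus can be more useful than raw counts.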
|Corpus||Description||# Tokens||# Types|
|HaAretz||News and articles from the HaAretz news website, 1990-1991.||11,097,790||305,545|
|Arutz 7||News and articles from the Arutz 7 news website, 2001-2006.||15,107,618||323,943|
|TheMarker||Articles from TheMarker, a financial newspaper, May - October 2002.||692,919||62,216|
|HaKnesset||Session protocols of the Knesset (Israeli Parliament) during January 2004 - November 2005.||15,066,731||204,967|
|Wikipedia||Articles from the Hebrew Wikipedia online encyclopedia, 2010.||133,271,332||1,716,031|
|Doctors||Articles from the Doctors medical website.||232,695|
|Infomed||Question and answer discussions from the Infomed website's medical forum, January 2006 - September 2007.||189,586|
|Nature of Healing||Articles and forum discussions from the Nature of Healing neuropathy medical website.||75,969|
|To Be Healthy||Articles and forum discussions from the To Be Healthy (L'Hiyot Bari, 2b-bari) medical website.||839,899|
|Tapuz People Forum||Forum discussions from the Tapuz People website, on a variety of subjects.||1,397,173|
|Hebrew CHILDES||Spoken Hebrew conversations among children and between children and adults.|
|Spoken Israeli Hebrew||Spoken Hebrew conversations and parts of the Corpus of Spoken Israeli Hebrew (CoSIH).||92,838||11,635|
|Hebrew Dotted Text||Articles from beginner-Hebrew newspapers Shaar LaMatchil and Yanshuf. Text includes dots (niqqud/vocalization).||Shaar LaMatchil: 8,419||Shaar LaMatchil: 4,811|