Corpora

Textual corpora that document language use are invaluable for research in various areas of linguistics, as well as for collecting the statistical information needed to build a variety of natural language processing applications. MILA has collected or acquired a number of Hebrew corpora from various domains. All are available in plain text format, and most also have tokenized, morphologically analyzed, and morphologically disambiguated versions.

All corpora follow the standards developed by MILA.

| Corpus | Description | # Tokens | # Types |
|---|---|---|---|
| HaAretz | News and articles from the HaAretz news website, 1990-1991. | 11,097,790 | 305,545 |
| Arutz 7 | News and articles from the Arutz 7 news website, 2001-2006. | 15,107,618 | 323,943 |
| TheMarker | Articles from the TheMarker financial newspaper, May - October 2002. | 692,919 | 62,216 |
| HaKnesset | Session protocols of the Knesset (Israeli Parliament) during January 2004 - November 2005. | 15,066,731 | 204,967 |
| Wikipedia | Articles from the Hebrew Wikipedia online encyclopedia, 2010. | 133,271,332 | 1,716,031 |
| Wikipedia 2013 | Articles from the Hebrew Wikipedia online encyclopedia, 2013. | | |
| Doctors | Articles from the Doctors medical website. | 232,695 | |
| Infomed | Question and answer discussions from the Infomed website's medical forum, January 2006 - September 2007. | 189,586 | |
| Nature of Healing | Articles and forum discussions from the Nature of Healing neuropathy medical website. | 75,969 | |
| To Be Healthy | Articles and forum discussions from the To Be Healthy (L'Hiyot Bari, 2b-bari) medical website. | 839,899 | |
| Tapuz People Forum | Forum discussions from the Tapuz People website, on a variety of subjects. | 1,397,173 | |
| Hebrew CHILDES | Spoken Hebrew conversations between children and between children and adults. | | |
| Spoken Israeli Hebrew | Spoken Hebrew conversations and parts of the Corpus of Spoken Israeli Hebrew (CoSIH). | 92,838 | 11,635 |
| Hebrew Dotted Text | Articles from beginner-Hebrew newspapers Shaar LaMatchil and Yanshuf. Text includes dots (niqqud/vocalization). | Shaar LaMatchil: 8,419; Yanshuf: 11,946 | Shaar LaMatchil: 4,811; Yanshuf: 6,002 |
| Dependency parsed corpora | A dependency parsed corpus. The corpus is part of the Hebrew Wikipedia corpus and the dependencies were created by Yoav Goldberg’s automatic dependency parser. | 65,541,436 | |
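
For readers unfamiliar with the # Tokens and # Types figures, the sketch below shows how the two counts differ: tokens are all running word occurrences, while types are the distinct word forms among them. It is only an illustration that assumes simple whitespace splitting; the function name `count_tokens_and_types` is hypothetical, and MILA's own tokenizer almost certainly applies different rules, so this will not reproduce the published figures.

```python
from collections import Counter


def count_tokens_and_types(text):
    """Return (token_count, type_count) for whitespace-separated text.

    Assumption: tokens are whitespace-delimited strings. This is a
    simplification and not MILA's tokenization scheme.
    """
    tokens = text.split()
    type_counts = Counter(tokens)  # one entry per distinct word form
    return len(tokens), len(type_counts)


# Three running words, two of them identical: 3 tokens, 2 types.
sample = "שלום עולם שלום"
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)
```

The same function could be applied to the contents of any of the plain text corpus files once read into a string, keeping in mind that punctuation attached to words would be counted as part of the word form under this assumption.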