The HaAretz Corpus contains news articles published in the HaAretz daily newspaper in 1990-1991.
A small subset of this corpus has been hand-annotated with full word segmentation and morpho-syntactic analysis as part of the Treebank project.
-
Plain Text
Password-protected (please register to access; free for non-commercial use).
-
Tokenized Text in XML
The XML schema follows MILA's corpus standards.
Password-protected (please register to access; free for non-commercial use).
-
Morphologically Disambiguated Text in XML
Tokenized text tagged with all possible morphological analyses and manually disambiguated (240 files).
Inappropriate analyses (for the sentence context) are given a score of 0, and appropriate analyses are given a positive score.
The XML schema follows MILA's corpus standards.
Password-protected (please register to access; free for non-commercial use).
-
Morphologically Disambiguated Text in XML
The entire corpus. Tokenized text tagged with all possible morphological analyses.
Inappropriate analyses (for the sentence context) are given a score of 0, and appropriate analyses are given a positive score.
The XML schema follows MILA's corpus standards.
Password-protected (please register to access; free for non-commercial use).
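The scoring convention described above (0 for analyses that do not fit the sentence context, a positive score for those that do) can be applied with a short filter. The following is a minimal sketch only: the element and attribute names (`token`, `surface`, `analysis`, `id`, `score`) are assumptions made for illustration and should be checked against MILA's corpus standards before use. The sample document is invented.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def appropriate_analyses(xml_text):
    """List (surface, [analysis ids]) keeping only analyses with score > 0.

    Assumes (hypothetically) that each <token> has a 'surface' attribute and
    <analysis> children carrying 'id' and 'score' attributes.
    """
    root = ET.fromstring(xml_text)
    result = []
    for tok in root.iter():
        if local(tok.tag) != "token":
            continue
        kept = [
            a.get("id")
            for a in tok
            if local(a.tag) == "analysis" and float(a.get("score", "0")) > 0
        ]
        result.append((tok.get("surface"), kept))
    return result


# Invented sample in the assumed layout, not taken from the corpus.
SAMPLE = """<corpus><sentence>
<token surface="a"><analysis id="1" score="0"/><analysis id="2" score="1"/></token>
<token surface="b"><analysis id="1" score="0.5"/></token>
</sentence></corpus>"""

for surface, ids in appropriate_analyses(SAMPLE):
    print(surface, ids)
```

A token whose analyses all score 0 would come back with an empty list, which makes such cases easy to spot.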
-
Statistics
For the manually disambiguated text (240 files).
Corpus Statistics
- 11,097,790 tokens total
- 305,545 types total
- 500 most frequent tokens, by frequency
- All tokens appearing 10+ times, by frequency
- 1000 most frequent bigrams, by frequency
- All bigrams, by frequency
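Counts of this kind can be reproduced from a token list with a few lines of code. This is a rough sketch rather than the procedure used to produce the files above: the function name `corpus_statistics` and its parameters are hypothetical, and it assumes that bigrams are counted over one flat token sequence, ignoring sentence and file boundaries.

```python
from collections import Counter


def corpus_statistics(tokens, top_n=500, min_count=10, top_bigrams=1000):
    """Compute token, type, and bigram frequency summaries for a token list."""
    unigrams = Counter(tokens)
    # Adjacent pairs over the flat sequence (assumption: no boundary handling).
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        "tokens": len(tokens),    # running words
        "types": len(unigrams),   # distinct word forms
        "top_tokens": unigrams.most_common(top_n),
        "frequent_tokens": [
            (t, c) for t, c in unigrams.most_common() if c >= min_count
        ],
        "top_bigrams": bigrams.most_common(top_bigrams),
        "all_bigrams": bigrams.most_common(),
    }


# Toy input for illustration only.
stats = corpus_statistics("a b a b c".split(), top_n=2, min_count=2)
print(stats["tokens"], stats["types"])
```

For a corpus of this size, the tokens would normally be streamed from the files instead of being held in one list.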
License
This resource can be used freely for research purposes only (please register to access password-protected files). For copyright reasons, this corpus is unavailable for commercial usage. Any publication resulting from the use of this corpus should refer to it as "The MILA HaAretz Corpus" and cite:
Alon Itai and Shuly Wintner. "Language Resources for Hebrew." Language Resources and Evaluation 42(1):75-98, March 2008.
Register to access the password-protected corpus files for non-commercial purposes...
