Arutz 7 Corpus

The Arutz 7 Corpus contains news and articles collected from the Arutz 7 website between 2001 and 2006.

Every day during 2001-2006, the front page of Arutz 7 was scanned for updated news and articles, and new material was downloaded. The relevant text was extracted from the downloaded pages, and then analyzed for document structure (paragraph, sentence and token segmentation).


  • Plain Text
    Use cp1255 encoding to properly view the files.
    Password-protected. Please register to access (free for non-commercial use).
  • Tokenized Text in XML
    The XML schema follows MILA's corpus standards.
    Password-protected. Please register to access (free for non-commercial use).
  • Morphologically Disambiguated Text in XML
    Tokenized text tagged with all possible morphological analyses.
    Inappropriate analyses (for the sentence context) are given a score of 0, and appropriate analyses are given a positive score.
    The XML schema follows MILA's corpus standards.
    Password-protected. Please register to access (free for non-commercial use).
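
The two usage notes in the list above (the cp1255 encoding of the plain-text files, and the score convention in the morphologically disambiguated files) can be illustrated with a minimal Python sketch. The element and attribute names used here (token, surface, analysis, id, score) are assumptions made for illustration only; consult MILA's corpus standards for the actual schema. The function names are hypothetical as well.

```python
import xml.etree.ElementTree as ET


def decode_cp1255(raw: bytes) -> str:
    """Decode the raw bytes of a plain-text corpus file (Windows-1255 Hebrew)."""
    return raw.decode("cp1255")


def appropriate_analyses(xml_text: str) -> list[tuple[str, list[str]]]:
    """Return, for each token, its surface form and the ids of the analyses
    whose score is positive (analyses with a score of 0 are dropped).

    ASSUMPTION: tokens are <token surface="..."> elements that contain
    <analysis id="..." score="..."> children. The real schema may differ.
    """
    root = ET.fromstring(xml_text)
    result = []
    for token in root.iter("token"):
        kept = [
            analysis.get("id")
            for analysis in token.iter("analysis")
            if float(analysis.get("score", "0")) > 0
        ]
        result.append((token.get("surface"), kept))
    return result


# Hypothetical miniature input, not taken from the corpus.
sample_xml = (
    '<sentence>'
    '<token surface="שלום">'
    '<analysis id="1" score="0"/>'
    '<analysis id="2" score="1"/>'
    '</token>'
    '</sentence>'
)
filtered = appropriate_analyses(sample_xml)
text = decode_cp1255("שלום".encode("cp1255"))
```

When reading a file from disk, passing encoding="cp1255" to Python's built-in open() has the same effect as decode_cp1255.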

Corpus Statistics

License

This resource can be used freely for research purposes only (please register to access the password-protected files). For copyright reasons, this corpus is not available for commercial use. Any publication resulting from the use of this corpus should refer to it as "The MILA Arutz 7 Corpus" and cite:

Alon Itai and Shuly Wintner. "Language Resources for Hebrew." Language Resources and Evaluation 42(1):75-98, March 2008.
