
Hebrew Corpus Of Arutz7 Newswires

License

Original content is copyright of Arutz 7.

Annotated corpus files, statistics and other information provided on this page are all copyright of the Mila Knowledge Center.

The information provided on this page is free. The Hebrew corpus, its resources and its products are all licensed under the GNU Free Documentation License.

Background

The corpus contains news and articles from Arutz 7 dating from 2001 onward; it was updated daily until 2006. The text is available as HTML, as plain ASCII text, and as tokenized text in XML format. It is also possible to obtain XML versions of the text that are morphologically annotated (with all possible analyses for each token) and morphologically disambiguated (with the correct morphological analysis in context).

Every day, the front page of Arutz 7 is scanned for updated news and articles, and new material is downloaded. The relevant text is extracted from the downloaded pages and then analyzed for document structure (paragraph, sentence and token segmentation). The texts are then represented in XML according to the Hebrew Corpus schema.
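
The segmentation step described above can be illustrated with a minimal sketch. It is not the Mila pipeline: the function name, the element names (article, paragraph, sentence, token) and the "surface" attribute are all assumptions made for illustration and are not taken from the Hebrew Corpus schema, and the splitting rules are deliberately naive.

```python
import re
import xml.etree.ElementTree as ET


def segment_to_xml(text):
    """Split plain text into paragraphs, sentences and tokens, wrapped in XML.

    Element and attribute names here are hypothetical placeholders,
    not the actual Hebrew Corpus schema.
    """
    article = ET.Element("article")
    # Paragraphs: separated by one or more blank lines.
    for para_text in re.split(r"\n\s*\n", text.strip()):
        para_text = para_text.strip()
        if not para_text:
            continue
        paragraph = ET.SubElement(article, "paragraph")
        # Sentences (naive): break on whitespace that follows '.', '!' or '?'.
        for sent_text in re.split(r"(?<=[.!?])\s+", para_text):
            if not sent_text:
                continue
            sentence = ET.SubElement(paragraph, "sentence")
            # Tokens (naive): runs of word characters, or single punctuation marks.
            for tok in re.findall(r"\w+|[^\w\s]", sent_text):
                ET.SubElement(sentence, "token", {"surface": tok})
    return article


if __name__ == "__main__":
    sample = "First sentence. Second one!\n\nNew paragraph here."
    print(ET.tostring(segment_to_xml(sample), encoding="unicode"))
```

A real segmenter for Hebrew news text would need to handle abbreviations, quotation marks, numbers and similar cases that these regular expressions ignore.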

The corpus contains 323,943 types.
The corpus contains 15,107,618 tokens.
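
For readers unfamiliar with the terms: tokens are usually the running word occurrences in a text, and types are the distinct word forms among them. The sketch below shows that distinction on a toy token list; the function name is hypothetical, and how Mila actually counted types (for example, whether punctuation or numbers are included) is not stated on this page.

```python
from collections import Counter


def type_token_counts(tokens):
    """Return (number of distinct types, total number of tokens)."""
    counts = Counter(tokens)
    return len(counts), sum(counts.values())


if __name__ == "__main__":
    n_types, n_tokens = type_token_counts("a b a c b a".split())
    print(n_types, n_tokens)

    # Type/token ratio for the figures reported on this page.
    print(323_943 / 15_107_618)
```

With the figures reported above, the ratio works out to roughly 0.02 types per token.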

Downloads

Researchers interested in obtaining access to the resources described on this page are kindly asked to fill out the registration form.

Other corpora that can be downloaded from the Mila site are:

  • The Medical Corpus
  • The Spoken Israeli Hebrew Corpus
  • The Knesset Protocols Corpus
  • The Economic Corpus (TheMarker Corpus)
  • Hebrew With Nikud Corpus



    Copyright (C) Mila. All Rights Reserved.