Hebrew Corpus Of Arutz7 Newswires

License

Original content is copyright of Arutz 7. Annotated corpus files, statistics and other information provided on this page are copyright of the Mila Knowledge Center. The information provided on this page is free. The Hebrew corpus, its resources and products are all licensed under the GNU Free Documentation License.

Background

The corpus contains news and articles from Arutz 7 since 2001; it was updated daily until 2006. The text is available as HTML, as plain ASCII text, and as tokenized text in XML format. It is also possible to obtain an XML version of the text that is morphologically annotated (with all possible analyses) and morphologically disambiguated (with the correct morphological analysis in context).

Every day, the front page of Arutz 7 is scanned for updated news and articles, and new material is downloaded. The relevant text is extracted from the downloaded pages and then analyzed for document structure (paragraph, sentence and token segmentation). The texts are then represented in XML according to the Hebrew Corpus schema.
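The document-structure step (paragraph, sentence and token segmentation followed by XML output) can be illustrated with a minimal Python sketch. This is not the pipeline used to build the corpus: the splitting rules are deliberately naive, and the element and attribute names (`article`, `paragraph`, `sentence`, `token`, `surface`) are hypothetical placeholders, not names taken from the Hebrew Corpus schema.

```python
import re
import xml.etree.ElementTree as ET


def segment_to_xml(text):
    """Segment plain text into paragraphs, sentences and tokens as XML.

    All element and attribute names are hypothetical; they are NOT
    taken from the actual Hebrew Corpus schema.
    """
    article = ET.Element("article")
    # Paragraphs: blocks of text separated by one or more blank lines.
    for p_text in re.split(r"\n\s*\n", text.strip()):
        if not p_text.strip():
            continue
        paragraph = ET.SubElement(article, "paragraph")
        # Sentences: naive split after '.', '!' or '?' followed by whitespace.
        for s_text in re.split(r"(?<=[.!?])\s+", p_text.strip()):
            if not s_text:
                continue
            sentence = ET.SubElement(paragraph, "sentence")
            # Tokens: runs of word characters, or single punctuation marks.
            for surface in re.findall(r"\w+|[^\w\s]", s_text):
                ET.SubElement(sentence, "token", surface=surface)
    return article


# Small made-up Hebrew sample: two paragraphs, three sentences in total.
sample = "שלום עולם. זה מבחן!\n\nפסקה שנייה."
root = segment_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

A real segmenter for Hebrew news text would need to handle abbreviations, quotation marks and numbers, which this sketch ignores.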
The corpus contains 323,943 types.
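For readers unfamiliar with the term, a type count is usually understood as the number of distinct word forms, as opposed to the total number of running tokens. The following Python sketch shows the distinction on a toy string. It assumes a simple word-character tokenizer and a hypothetical helper name, `count_tokens_and_types`; it is not the method used to compute the figure above.

```python
import re
from collections import Counter


def count_tokens_and_types(text):
    """Return (number of tokens, number of distinct types) in the text.

    Tokenization here is a plain word-character match, which is an
    assumption made for illustration only.
    """
    tokens = re.findall(r"\w+", text)
    counts = Counter(tokens)  # maps each distinct form to its frequency
    return len(tokens), len(counts)


n_tokens, n_types = count_tokens_and_types("a b a c b a")
print(n_tokens, n_types)
```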
Downloads

Researchers interested in obtaining access to the resources described on this page are kindly asked to fill in the registration form.
Other corpora that can be downloaded from the Mila site are: