The Wikipedia Corpus contains the text of Hebrew Wikipedia as of 2010. Hebrew Wikipedia is an online, free-content encyclopedia that is openly editable and collaboratively written. Founded in July 2003, it grew to over 100,000 articles by January 2010.
To reduce file processing time, Wikipedia pages with fewer than 2000 bytes (about 1000 UTF-8 Hebrew characters) were removed from the corpus, as they contained little linguistic information.
-
Plain Text
Use cp1255 encoding to properly view the files.
Password-protected. Please register to access (free for non-commercial use).
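For readers who prefer to work in UTF-8, the cp1255 note above can be handled with a short script. This is a minimal sketch, not part of the corpus distribution: the function names `read_cp1255` and `convert_to_utf8` are hypothetical, and the file paths are whatever you saved the downloaded files as.

```python
from pathlib import Path

def read_cp1255(path):
    """Return the decoded text of a cp1255-encoded file."""
    # errors="replace" substitutes U+FFFD for any byte that cp1255 leaves
    # undefined, instead of aborting on the first bad byte (an assumption
    # about how you want stray bytes handled).
    return Path(path).read_text(encoding="cp1255", errors="replace")

def convert_to_utf8(src, dst):
    """Re-encode a cp1255 file as UTF-8; return the number of characters written."""
    text = read_cp1255(src)
    Path(dst).write_text(text, encoding="utf-8")
    return len(text)
```

After conversion, the files open correctly in any editor or tool that defaults to UTF-8.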
-
Tokenized Text in XML
The XML schema follows MILA's corpus standards.
Password-protected. Please register to access (free for non-commercial use).
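The schema itself is documented in MILA's corpus standards and is not reproduced here. As a first look at a downloaded file, the sketch below streams through the XML and counts how often each element name occurs, without assuming any particular element names. The function name `count_elements` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_elements(source):
    """Stream an XML file (path or binary file object) and count element names."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop any "{namespace}" prefix so the names are easier to read.
        counts[elem.tag.rsplit("}", 1)[-1]] += 1
        # Discard the element's text and children once it has been counted,
        # which keeps memory use modest on large corpus files.
        elem.clear()
    return counts
```

Streaming with `iterparse` avoids loading a whole file into memory, which matters for a corpus of this size.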
-
Morphologically-Analyzed Text in XML
Tokenized text tagged with all possible morphological analyses.
The XML schema follows MILA's corpus standards.
Password-protected. Please register to access (free for non-commercial use).
Corpus Statistics
- 133,271,332 tokens total
- 1,716,031 types total
- Please note that a non-negligible portion of the types are non-Hebrew words (especially English, Arabic, and Chinese), numbers, typos, and words that were not tokenized properly (because they contained word-internal punctuation or were two words run together). Without punctuation and non-Hebrew words, there are 96,658,599 tokens and 1,022,036 types.
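To put the two sets of counts side by side, the following sketch computes how much the filtering removes and the type/token ratio before and after. The figures are the ones listed above; the variable and function names are illustrative only.

```python
# Counts as reported in the corpus statistics.
ALL_TOKENS, ALL_TYPES = 133_271_332, 1_716_031
FILTERED_TOKENS, FILTERED_TYPES = 96_658_599, 1_022_036

def type_token_ratio(types, tokens):
    """Distinct word forms divided by running words."""
    return types / tokens

removed_tokens = ALL_TOKENS - FILTERED_TOKENS  # 36,612,733
removed_types = ALL_TYPES - FILTERED_TYPES     # 693,995

print(f"tokens removed: {removed_tokens:,} ({removed_tokens / ALL_TOKENS:.1%})")
print(f"types removed:  {removed_types:,} ({removed_types / ALL_TYPES:.1%})")
print(f"TTR, all:       {type_token_ratio(ALL_TYPES, ALL_TOKENS):.4f}")
print(f"TTR, filtered:  {type_token_ratio(FILTERED_TYPES, FILTERED_TOKENS):.4f}")
```

The filter removes a larger share of the types than of the tokens, which is consistent with the note that many types are noise.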
License
This resource can be used freely for research purposes only (please register to access password-protected files). For copyright reasons, this corpus is unavailable for commercial usage. Any publication resulting from the use of this corpus should refer to it as "The MILA Wikipedia Corpus" and cite:
Alon Itai and Shuly Wintner. "Language Resources for Hebrew." Language Resources and Evaluation 42(1):75-98, March 2008.
