The Hebrew Dotted Text Corpus consists mainly of news items and articles from the Shaar LaMatchil newspaper, plus a single article from the Yanshuf newspaper. Both are periodicals written for adults learning Hebrew at a beginner level. Unusually, the corpus text is written with dots (niqqud, i.e. vocalization marks), which resolve ambiguities in pronunciation.
Shaar LaMatchil
- Plain Text
  Use cp1255 encoding to properly view the files.
  Password-protected. Please register to access (free for non-commercial use).
- Tokenized Text in XML
  The XML schema follows MILA's corpus standards.
  Password-protected. Please register to access (free for non-commercial use).
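Because the plain-text files are encoded in cp1255 (Windows-1255) rather than UTF-8, many editors will show garbled characters by default. The following is a minimal sketch of one way to decode such a file in Python and save a UTF-8 copy. The function names and file paths are hypothetical and not part of the corpus distribution.

```python
from pathlib import Path


def read_cp1255(path):
    """Decode a Windows-1255 (cp1255) file into a Python str."""
    return Path(path).read_text(encoding="cp1255")


def convert_to_utf8(src, dst):
    """Write a UTF-8 copy of a cp1255 file and return the decoded text."""
    text = read_cp1255(src)
    Path(dst).write_text(text, encoding="utf-8")
    return text
```

A call such as `convert_to_utf8("article.txt", "article.utf8.txt")` (hypothetical file names) would leave the original file untouched and produce a copy that most modern tools display correctly, including the niqqud marks.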
Shaar LaMatchil Corpus Statistics
- 8,419 tokens total
- 8,337 Hebrew-only tokens (excluding numbers, punctuation, etc.)
- 4,811 types total
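To illustrate how counts of this kind can be derived, here is a rough sketch that computes total tokens, Hebrew-only tokens, and types from a decoded text. It assumes simple whitespace tokenization and a character-range definition of "Hebrew-only"; both are assumptions made for illustration, and MILA's own tokenizer may well differ, so the sketch is not expected to reproduce the published figures exactly.

```python
import re

# Assumption: a "Hebrew-only" token is made up solely of characters in
# U+0591-U+05C7 (points and accents; this range also holds a few marks
# such as maqaf) and Hebrew letters U+05D0-U+05EA.
HEBREW_ONLY = re.compile(r"[\u0591-\u05C7\u05D0-\u05EA]+")
# Require at least one actual Hebrew letter in the token.
HAS_LETTER = re.compile(r"[\u05D0-\u05EA]")


def corpus_stats(text):
    """Return token, Hebrew-only token, and type counts for a text."""
    tokens = text.split()  # assumption: whitespace tokenization
    hebrew = [
        t for t in tokens
        if HEBREW_ONLY.fullmatch(t) and HAS_LETTER.search(t)
    ]
    return {
        "tokens": len(tokens),
        "hebrew_tokens": len(hebrew),
        "types": len(set(tokens)),  # distinct token forms
    }
```

From the figures listed above, the type/token ratio for Shaar LaMatchil is 4,811 / 8,419, or about 0.57.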
Yanshuf
- Plain Text
  Use cp1255 encoding to properly view the files.
  Password-protected. Please register to access (free for non-commercial use).
- Tokenized Text in XML
  The XML schema follows MILA's corpus standards.
  Password-protected. Please register to access (free for non-commercial use).
Yanshuf Corpus Statistics
- 11,946 tokens total
- 6,002 types total
License
This resource can be used freely for research purposes only (please register to access password-protected files). For copyright reasons, the corpus is not available for commercial use. Any publication resulting from the use of this corpus should refer to it as "The MILA Dotted Text Corpus" and cite:
Alon Itai and Shuly Wintner. "Language Resources for Hebrew." Language Resources and Evaluation 42(1):75-98, March 2008.
