The Spoken Israeli Hebrew Corpus contains transcripts of spoken Hebrew conversations and parts of The Corpus of Spoken Israeli Hebrew (CoSIH). Speakers' names and identifying details have been removed or changed to preserve anonymity. The original text was provided by Shlomo Izre'el and Esti Borochovsky Bar Aba of Tel Aviv University, from the CoSIH project.
-
Plain Text
Open the files with cp1255 (Windows-1255) encoding so that the Hebrew text displays correctly.
Password-protected. Please register to access (free for non-commercial use).
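
As a minimal sketch of handling the cp1255 encoding, the Python snippet below decodes a plain-text file and optionally rewrites it as UTF-8. The function names and file paths are hypothetical and not part of the corpus distribution; only the `cp1255` codec name comes from the note above.

```python
from pathlib import Path


def read_cp1255(path):
    """Return the contents of a cp1255-encoded file as a Python str."""
    # "cp1255" is Python's codec name for Windows-1255 (Hebrew).
    return Path(path).read_text(encoding="cp1255")


def convert_to_utf8(src, dst):
    """Re-encode a cp1255 file as UTF-8 (hypothetical helper)."""
    Path(dst).write_text(read_cp1255(src), encoding="utf-8")
```

For example, `convert_to_utf8("transcript.txt", "transcript.utf8.txt")` would produce a UTF-8 copy, assuming a downloaded file named `transcript.txt`.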
-
Tokenized Text in XML
The XML schema follows MILA's corpus standards.
Password-protected. Please register to access (free for non-commercial use).
-
Morphologically-Analyzed Text in XML
Tokenized text tagged with all possible morphological analyses.
The XML schema follows MILA's corpus standards.
Password-protected. Please register to access (free for non-commercial use).
Corpus Statistics
- 92,838 tokens total
- 11,635 types total
- 500 most frequent tokens, by frequency
- All tokens appearing 10+ times, by frequency
- 3000 most frequent bigrams, by frequency
- All bigrams, by frequency
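
Statistics of this kind can be approximated from a list of tokens. The sketch below is illustrative only: it assumes the tokens are already available as a Python list (for instance from a simple whitespace split, which is not MILA's tokenizer), and it counts bigrams across the whole sequence without regard to utterance boundaries, so its results need not match the published figures. All function names are hypothetical.

```python
from collections import Counter


def corpus_statistics(tokens):
    """Compute token, type, and bigram counts for a list of tokens."""
    token_counts = Counter(tokens)
    # Pair each token with its successor to form adjacent bigrams.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        "tokens": len(tokens),        # total running tokens
        "types": len(token_counts),   # distinct tokens
        "token_counts": token_counts,
        "bigram_counts": bigram_counts,
    }


def tokens_with_min_count(token_counts, minimum=10):
    """List (token, count) pairs with count >= minimum, most frequent first."""
    return [(t, c) for t, c in token_counts.most_common() if c >= minimum]
```

With these helpers, `stats["token_counts"].most_common(500)` would give the 500 most frequent tokens, and `stats["bigram_counts"].most_common(3000)` the 3000 most frequent bigrams.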
License
This resource can be used freely for research purposes only (please register to access password-protected files). For copyright reasons, this corpus is unavailable for commercial usage. Any publication resulting from the use of this corpus should refer to it as "The MILA Spoken Israeli Hebrew Corpus" and cite:
Alon Itai and Shuly Wintner. "Language Resources for Hebrew." Language Resources and Evaluation 42(1):75-98, March 2008.
