Tokenization

Text tokenization divides text into meaningful units such as words, sentences, and paragraphs. Many languages, including Hebrew, have explicit boundary markers for words (spaces and some punctuation marks) and sentences (periods), but these markers are sometimes ambiguous: a period, for example, may end a sentence or appear inside a number or an abbreviation.
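To make the ambiguity concrete, here is a minimal Python sketch of a naive tokenizer that relies only on the explicit boundary markers. It is an illustration, not the MILA tool's algorithm: the function names and splitting rules are hypothetical and chosen for brevity.

```python
import re

def naive_paragraphs(text):
    """Split text into paragraphs at blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def naive_sentences(paragraph):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def naive_tokens(sentence):
    """Return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# A decimal number is split at its period into three tokens.
print(naive_tokens("3.5"))

# The Hebrew abbreviation for "Dr." is split at its quotation mark.
print(naive_tokens('ד"ר'))
```

In both cases the naive rules break up a unit that a reader would treat as a single token. A practical tokenizer needs extra handling for numbers, abbreviations, and similar cases.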

The MILA Hebrew Tokenization Tool divides undotted Hebrew input text into tokens, sentences, and paragraphs. Its XML output follows MILA's standards for corpora.


  • Online Demo
    Segments undotted Hebrew text into tokens.

  • Full Program
    Segments input into tokens, sentences, and paragraphs.
    XML output follows MILA's XML standards for corpora.
    Password-protected. Please register to access (free for non-commercial use).

Credits

  • Developed by Dalia Bojan.
  • Maintained by Slava Demender, MILA Research Engineer (contact).

License

For non-commercial research purposes, this tool is licensed under the GNU General Public License (GPL). Any publications resulting from the use of this tool should refer to it as "The MILA Hebrew Tokenization Tool" and cite:

Alon Itai and Shuly Wintner. "Language Resources for Hebrew." Language Resources and Evaluation 42(1):75-98, March 2008. [BibTeX]

To gain password access to this tool for non-commercial purposes, please register. For commercial usage, please contact MILA to inquire about terms.