Text tokenization divides text into meaningful units like words, sentences, and paragraphs. Many languages, including Hebrew, have explicit boundary markers for words (spaces and some punctuation marks) and sentences (periods), but these are sometimes ambiguous.
The MILA Hebrew Tokenization Tool takes undotted (unvocalized) Hebrew text as input and divides it into tokens, sentences, and paragraphs. Its XML output follows MILA's standards for corpora.
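To illustrate why explicit boundary markers are not enough on their own, here is a minimal Python sketch of a naive segmenter that relies only on blank lines, sentence-final punctuation, and whitespace. It is not the MILA tool's algorithm and does not produce MILA's XML format; the function name `naive_segment` and the regular expressions are assumptions chosen purely for illustration.

```python
import re


def naive_segment(text):
    """Split text into paragraphs -> sentences -> tokens using surface markers only.

    Hypothetical baseline for illustration; NOT the MILA tool's method.
    """
    # Paragraph boundary: one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Sentence boundary: whitespace that follows '.', '!' or '?'.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)
        # Token: a run of word characters, or a single punctuation mark.
        result.append([re.findall(r"\w+|[^\w\s]", s) for s in sentences if s])
    return result


if __name__ == "__main__":
    sample = "שלום עולם. מה שלומך?\n\nזה משפט."
    for number, paragraph in enumerate(naive_segment(sample), start=1):
        for sentence in paragraph:
            print(number, sentence)

    # The abbreviation ד"ר ("Dr.") contains a quotation mark, so the naive
    # rule wrongly breaks it into three tokens.
    print(naive_segment('ד"ר כהן'))
```

The last call shows the ambiguity: the quotation mark inside the abbreviation ד"ר is treated as a separate punctuation token, so one word becomes three tokens. A period used in an abbreviation would likewise trigger a false sentence break. Handling such cases is the kind of problem a dedicated tokenizer has to address.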
Online Demo (segments into tokens)
Segments input into tokens, sentences, and paragraphs.
XML output follows MILA's XML standards for corpora.
Password-protected. Please register to access (free for non-commercial use).
- Developed by Dalia Bojan.
- Maintained by Yossi Jacob, MILA Research Engineer (contact).
For non-commercial research purposes, this tool is licensed under the GNU General Public License (GPL). Any publications resulting from the use of this tool should refer to it as "The MILA Hebrew Tokenization Tool" and cite: