Hebrew Treebank Project Version 2.0

License
Original content is copyright of Haaretz. The contents of this page, the annotation guides, the parse trees, their products and their various formats are all published under the GNU Free Documentation License.

Background
The Hebrew Treebank Version 2.0 contains 6500 sentences of news items from the Ha'aretz daily newspaper, with full word segmentation and morpho-syntactic analysis. Morphological features that are not directly relevant for syntactic structures, like roots, templates and patterns, are not analyzed. A version of the Treebank with only the morphological level is also supplied. Version 2.0 of the Hebrew Treebank contains three enhancements and improvements as compared to previous versions:
Tag set: The Hebrew Treebank was designed with a tag set that is as close as possible to that of the English Penn Treebank. A significant difference from English is that in the Hebrew Treebank the annotated words are separated into morphemes, which may have different POS tags and syntactic positions in the tree. For example, the two words BBIT HGDWL ("in the big house") are analyzed as five different morphemes:
B    H    BIT    H    GDWL
in   the  house  the  big
Note that the first occurrence of the morpheme H ("the") is covert in the word BBIT. Segmentation into morphemes makes it possible to analyze different morphemes of the same word as belonging to different constituents in the tree:
[B [[H BIT] [H GDWL]]]
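As a rough illustration of the bracketing above, the following Python sketch reads the example (written with balanced brackets) into nested lists and recovers the five morphemes in order. The helper names `parse_brackets` and `leaves` are hypothetical and are not part of any treebank tooling; the treebank files themselves use a different, labeled format.

```python
def parse_brackets(text):
    """Parse a square-bracketed string into nested Python lists of tokens.

    Hypothetical helper for illustration only.
    """
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    stack = [[]]  # stack[0] collects the top-level constituents
    for tok in tokens:
        if tok == "[":
            stack.append([])
        elif tok == "]":
            if len(stack) < 2:
                raise ValueError("unbalanced ']'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '['")
    return stack[0]


def leaves(node):
    """Return the morphemes of a nested-list tree, left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node:
        result.extend(leaves(child))
    return result


# The example from the text: B attaches outside the two noun-phrase parts,
# even though B, H and BIT come from the single written word BBIT.
tree = parse_brackets("[B [[H BIT] [H GDWL]]]")[0]
print(tree)          # ['B', [['H', 'BIT'], ['H', 'GDWL']]]
print(leaves(tree))  # ['B', 'H', 'BIT', 'H', 'GDWL']
```

The nested result makes the point of the paragraph concrete: the morphemes of BBIT end up at different depths of the structure.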
Another significant difference from English is the relatively free order of constituents in Hebrew.
*NEW* Father-child dependencies
Version 2.0 of the Hebrew Treebank uses special annotation features to mark all cases where the morpho-syntactic features of a node are inherited from one or more of its children. We refer to such cases as "father-child dependencies". In a structure where X is a node and Y is one of X's children, a dependency between X and Y is marked by adding to Y the feature DEP_ We distinguish six classes of dependency features.
Further documentation
For a detailed description of the linguistic annotation scheme of the previous version (without the dependency annotation), see Sima'an et al. (2001). For the full conventions used in Version 2.0, with examples, see the annotator guidelines.
Download - original text in Hebrew
Download - treebank files WEB Viewing Version
Download - treebank files SEMTAGS Version
Download - treebank files transliterated Version
Software
The following software was used for the development of the treebank:
Comments
Null trees: the original text was automatically segmented into sentences. In some cases, a single sentence in the original text was split into multiple sentences, so the annotation process had to rejoin them. To keep the tree numbers synchronized with the automatically segmented sentence numbers, null trees were inserted. For example, if the first sentence was split into three sentences, #1, #2 and #3, then in the treebank tree #1 includes all three sentence parts, while tree #2 and tree #3 are null trees. A null tree looks like this:
S
and is represented as:
((S (yyDOT yyDOT)))
Duplicate sentences: sentences 24-36 in the original text do not appear in the treebank, since they are repeated in sentences 1358-1370. Sentences 1249-1293 do not appear in the treebank, since they are repeated in sentences 1204-1248.
Missing trees: the following sentences currently do not appear in the treebank (represented by null trees): 552, 772, 1350, 1382, 2044, 3206
Transliteration
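Returning to the null-tree convention described under Comments: a minimal Python sketch of how null trees might be detected so that tree numbers can still be matched to sentence numbers. It assumes the trees have already been read into a list of bracketed strings, one per tree; the file layout is not specified here, so that loading step is left out. The function names and the non-null toy tree are hypothetical.

```python
import re

# The null-tree representation quoted in the Comments section.
NULL_TREE = "((S (yyDOT yyDOT)))"

# Split a bracketed tree into parentheses and symbols, ignoring whitespace.
_TOKEN = re.compile(r"\(|\)|[^\s()]+")


def is_null_tree(tree_text):
    """True if tree_text is the null tree, regardless of spacing."""
    return _TOKEN.findall(tree_text) == _TOKEN.findall(NULL_TREE)


def null_tree_numbers(trees):
    """Return the 1-based tree numbers whose tree is a null tree."""
    return [n for n, tree in enumerate(trees, start=1) if is_null_tree(tree)]


# Toy data mirroring the example in the text: the first sentence was split
# into three parts, so tree #1 holds the content and #2, #3 are null trees.
# The first string is a made-up placeholder, not a real treebank tree.
trees = [
    "((S (X toy) (yyDOT yyDOT)))",
    "((S (yyDOT yyDOT)))",
    "((S  (yyDOT  yyDOT)))",
]
print(null_tree_numbers(trees))  # [2, 3]
```

Comparing token lists instead of raw strings keeps the check robust to spacing differences between file versions, which is an assumption about the data and not a documented property.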