Produced by Yoav Goldberg, Bar Ilan University
Based on a dump of the Ben Yehuda Project provided by Asaf Bartov, Feb 2014.
Sentence splitting was done using a simple heuristic script. Details about the tokenization/tagging/parsing and the annotation scheme are available in the tech report about the Hebrew Wikipedia corpus.
The raw data, as provided by Asaf Bartov. The data does not include diacritic marks.
Mapping from file names to authors and titles. Provided by Asaf Bartov.
Raw data after removal of some template elements and sentence splitting. One sentence per line. Paragraphs separated by blank lines.
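The split-sentence format described above (one sentence per line, paragraphs separated by blank lines) can be read with a few lines of Python. This is a minimal sketch, not part of the release; the function name and the file path are placeholders.

```python
# Minimal sketch for reading the split-sentence format:
# one sentence per line, paragraphs separated by blank lines.
def read_paragraphs(path):
    paragraphs = []
    current = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                current.append(line)       # another sentence in this paragraph
            elif current:
                paragraphs.append(current) # blank line closes the paragraph
                current = []
    if current:                            # file may not end with a blank line
        paragraphs.append(current)
    return paragraphs
```

Each returned element is one paragraph, given as a list of sentence strings.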
The data in sentences.tgz tokenized using the hebtokenizer.py script.
Tagged, lemmatized and parsed data.
Tagging/lemmatizing is performed using Meni Adler's tagger, and parsing is performed using a state-of-the-art dependency parser (EasyFirst parser with error-exploration training). The tagger relies on morphological analyses provided by the MILA morphological analyzer as well as an additional statistical model. The parser is trained on the tb-v3 version of the Hebrew treebank. The annotation scheme follows the guidelines in Yoav Goldberg's PhD thesis, with dependency labels assigned by Reut Tsarfaty according to her Universal Stanford Dependencies scheme.
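For working with the tagged and parsed files, a reader along the following lines may be useful. This is a hedged sketch: it assumes a CoNLL-style tab-separated layout (one token per line, blank line between sentences) with id/form/lemma/pos/head/deprel columns, which is a common convention for dependency output but is not a documented property of this release; check the actual files and adjust the column indices.

```python
# Assumed CoNLL-style columns: id, form, lemma, pos, head, deprel.
# This layout is an assumption about the release format, not documented here.
from collections import namedtuple

Token = namedtuple("Token", "id form lemma pos head deprel")

def read_conll(lines):
    """Yield one sentence at a time as a list of Token tuples."""
    sent = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:               # blank line ends the current sentence
            if sent:
                yield sent
                sent = []
            continue
        cols = line.split("\t")
        sent.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                          int(cols[4]), cols[5]))
    if sent:                       # flush a trailing sentence with no blank line
        yield sent
```

A head index of 0 conventionally marks the root of the dependency tree.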
For questions, please contact Yoav Goldberg (first.last@gmail).
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. If you use it in academic research, please cite the tech report above.