Syntactically processed version of the Ben Yehuda Project corpus, V.0.9

Produced by Yoav Goldberg, Bar Ilan University

Based on a dump of the Ben Yehuda Project provided by Asaf Bartov, Feb 2014.

Sentence splitting was done using a simple heuristic script. Details about the tokenization, tagging, and parsing, and about the annotation scheme, are available in the tech report about the Hebrew Wikipedia corpus.
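The splitting script itself is not reproduced here, but a heuristic of this kind can be sketched roughly as follows (an illustration only, assuming sentences end at sentence-final punctuation and that paragraph boundaries are already known):

    import re

    # Illustrative sketch only; not the actual script used to produce the corpus.
    # Assumes a sentence ends at '.', '!' or '?' followed by whitespace.
    SENT_END = re.compile(r'(?<=[.!?])\s+')

    def split_sentences(paragraph):
        """Split one paragraph of raw text into sentences."""
        return [s.strip() for s in SENT_END.split(paragraph) if s.strip()]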

Download

Raw data
The raw data as provided by Asaf Bartov. The data does not include diacritic marks.

pseudocatalogue.csv
Mapping from file names to authors and titles. Provided by Asaf Bartov.

sentences.tgz
Raw data after removal of some template elements and sentence splitting. One sentence per line. Paragraphs separated by blank lines.
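A minimal sketch for reading files in this format (assuming UTF-8 encoding, which is not stated above):

    def read_paragraphs(path):
        """Yield each paragraph as a list of sentences, for files with one
        sentence per line and a blank line between paragraphs."""
        paragraph = []
        with open(path, encoding='utf-8') as f:  # assuming UTF-8
            for line in f:
                line = line.strip()
                if line:
                    paragraph.append(line)
                elif paragraph:
                    yield paragraph
                    paragraph = []
        if paragraph:
            yield paragraph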

tokenized.tgz
The data in sentences.tgz tokenized using the hebtokenizer.py script.
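Assuming the tokenized files keep one sentence per line with tokens separated by spaces (an assumption about the script's output, not confirmed here), a simple frequency count looks like:

    from collections import Counter

    def token_counts(path):
        """Count token frequencies in a tokenized file, assuming
        whitespace-separated tokens, one sentence per line."""
        counts = Counter()
        with open(path, encoding='utf-8') as f:
            counts = Counter(tok for line in f for tok in line.split())
        return counts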

parsed.tgz
Tagged, lemmatized and parsed data.
Tagging and lemmatization are performed with Meni Adler's tagger, and parsing with a state-of-the-art dependency parser (the EasyFirst parser with error-exploration training). The tagger relies on morphological analyses provided by the MILA morphological analyzer as well as an additional statistical model. The parser is trained on the tb-v3 version of the Hebrew treebank. The annotation scheme follows the guidelines in Yoav Goldberg's PhD thesis, with dependency labels assigned by Reut Tsarfaty according to her Universal Stanford Dependencies scheme.
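The exact column layout is documented in the tech report; assuming the files follow the common CoNLL-style convention of tab-separated columns with blank lines between sentences (a sketch under that assumption, not a confirmed description of the format), they can be read like this:

    def read_parsed(path):
        """Yield each parsed sentence as a list of rows, one row per token.
        Assumes CoNLL-style tab-separated columns and blank lines between
        sentences; consult the tech report for the actual column inventory."""
        sentence = []
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.rstrip('\n')
                if line:
                    sentence.append(line.split('\t'))
                elif sentence:
                    yield sentence
                    sentence = []
        if sentence:
            yield sentence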


For questions, please contact Yoav Goldberg (first.last@gmail).

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. If you use it in academic research, it would be nice if you cited the tech report above.