Produced by Yoav Goldberg, Bar Ilan University
Based on a dump of Hebrew Wikipedia as of September 2013.
Details about the processing and the annotation scheme are availale in the tech report.
Raw Sentences (191MB)
The raw sentences, after filtering of tables and other "non-sentence" material. In UTF-8 formatted text, one sentence per line. Very basic sentence splitting. Processing of the wikipedia dump was done by Oded Avraham.
# num of sentences: $ zcat wikipedia.raw.gz |wc -l 3833140
Parsed Sentences (819MB)
The raw corpus after being tokenized using Yoav Goldberg's hebtokenizer.py script, tagged using Meni Adler's tagger, and parsed using a state-of-the-art dependency parser (EasyFirst parser with error-exploration training). The tagger relies on morphological analyses provided by the MILA morphological analyzer as well as an additional statistical model. The parser is trained on the tb-v3 version of the Hebrew treebank. The annotation scheme follows the guidelines in Yoav Goldberg's PhD thesis, with dependency labels assigned by Reut Tsarfaty according to her Universal Stanford Dependencies scheme.
For questions, please contact Yoav Goldberg (first.last@gmail).
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. If used in academic research, it would be nice if you cite the tech report above.