DIRECT - Directional Distributional Term-Similarity Resource
DIRECT is a resource of directional distributional term-similarity rules, extracted automatically using the inclusion relation as described in (Kotlerman et al., JNLE-DLS 2010). It has been shown to perform better than other automatically constructed distributional resources in two lexical expansion tasks: Information Extraction (IE) and keyword-based Text Categorization (TC).
Lili Kotlerman, Ido Dagan, Idan Szpektor and Maayan Zhitomirsky-Geffet. Directional Distributional Similarity for Lexical Inference. Special Issue of Natural Language Engineering on Distributional Lexical Semantics (JNLE-DLS). Cambridge University Press, 2010. To appear.
Lili Kotlerman, Ido Dagan, Idan Szpektor and Maayan Zhitomirsky-Geffet. Directional Distributional Similarity for Lexical Expansion. In Proceedings of ACL, 2009.
- Contact: Lili Kotlerman, lili.dav @ gmail.com
Description in more detail
- Most of the rules are lexical entailment rules, in which the meaning of the rule's left-hand-side implies the meaning of its right-hand-side. For instance: koala-->animal, bread-->food, imprisonment-->arrest, wedding-->marriage. In addition, the resource contains so-called context rules, in which the left-hand-side does not entail the right-hand-side but is instead an indicative word for contexts in which the right-hand-side tends to occur. For example, court-->lawyer.
- The scores assigned to the rules tend to behave as follows:
- Given a term, its lexical entailments are ranked higher than its looser context terms.
- The correct direction of a term relation can usually be determined from the scores, e.g. koala-->animal will be scored higher than animal-->koala. Similarly high scores in both directions usually indicate synonyms.
- Corpus: Reuters RCV-1
- Features: To characterize a term we used words related to it via a syntactic dependency relation.
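As a rough illustration of how the score behaviour described above might be used, the sketch below compares the score of a rule with the score of its reverse. The function name `relation_direction`, the `min_score` and `ratio` cut-offs, and the sample scores are all assumptions made for this example; none of them come from the resource itself.

```python
def relation_direction(scores, a, b, min_score=0.1, ratio=0.8):
    """Guess how terms a and b relate from the two directional scores.

    scores maps (lhs, rhs) pairs to rule scores; missing rules count as 0.
    min_score and ratio are illustrative cut-offs, not values from the resource.
    """
    forward = scores.get((a, b), 0.0)   # score of a-->b
    backward = scores.get((b, a), 0.0)  # score of b-->a
    high, low = max(forward, backward), min(forward, backward)
    if high < min_score:
        return "unrelated"
    if low >= ratio * high:
        # Similarly high scores in both directions usually indicate synonyms.
        return "synonyms"
    return f"{a}-->{b}" if forward > backward else f"{b}-->{a}"


# Made-up scores, for illustration only.
sample_scores = {
    ("koala", "animal"): 0.30,
    ("animal", "koala"): 0.05,
    ("car", "automobile"): 0.40,
    ("automobile", "car"): 0.38,
}

print(relation_direction(sample_scores, "animal", "koala"))
print(relation_direction(sample_scores, "car", "automobile"))
```

Suitable cut-offs would have to be tuned on the actual rule-base; the values above only make the example concrete.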
Each rule-base is available as a zipped tab-delimited .txt file with one rule per line: left-hand-side<TAB>right-hand-side<TAB>score
Terms are lowercased lemmas (possibly multi-word) as detected by the Minipar parser.
All numerical digits are replaced with @ symbols.
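The file format can be read with a few lines of code. The following is a minimal sketch, assuming one rule per line and a score that Python's `float` can parse; the names `Rule`, `parse_rule` and `normalize_term` are hypothetical helpers, and the sample line is made up. `normalize_term` only mirrors the lowercasing and digit masking, so that a query term can be matched against the file; it does not reproduce Minipar's lemmatization.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    lhs: str      # entailing term (left-hand-side)
    rhs: str      # entailed term (right-hand-side)
    score: float  # rule score


def parse_rule(line: str) -> Rule:
    """Parse one 'lhs<TAB>rhs<TAB>score' line into a Rule."""
    lhs, rhs, score = line.rstrip("\r\n").split("\t")
    return Rule(lhs, rhs, float(score))


def normalize_term(term: str) -> str:
    """Lowercase a query term and mask every digit with '@'."""
    return "".join("@" if ch.isdigit() else ch for ch in term.lower())


# Made-up line, for illustration only; multi-word terms keep their spaces.
rule = parse_rule("new york\tcity\t0.25\n")
print(rule.lhs, rule.rhs, rule.score)
print(normalize_term("Boeing 747"))
```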
- Up to 200 rules per right-hand-side term:
- Nouns: over 1.5 million rules - download (zip - 20 MB)
- Verbs: over 50,000 rules - download (zip - 0.5 MB)
- Up to 1000 rules per right-hand-side term:
- Nouns: over 6 million rules - download (zip - 105 MB)
- Verbs: over 1 million rules - download (zip - 17 MB)
Using the resource
- Download the .zip file you need and extract the tab-delimited .txt file (called file.txt below)
- Create a table myTable in some database
CREATE TABLE myTable ( lhs VARCHAR(512) NOT NULL, rhs VARCHAR(512) NOT NULL, score DECIMAL(64,30) UNSIGNED NOT NULL ) ;
- Load the rules from the file file.txt into the table you created
LOAD DATA INFILE "file.txt" INTO TABLE myTable ;
- Suggested indexes: a primary key on (lhs, rhs) and a second index on rhs
- Querying the DB
The following query finds, for a given right-hand-side term TERM, the entailing (left-hand-side) terms whose score is above THRESHOLD, ordered by descending score:
- SELECT lhs, score FROM myTable WHERE rhs = "TERM" AND score > THRESHOLD ORDER BY score DESC
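For readers who prefer to try the workflow without setting up a database server, the steps above can be sketched with Python's built-in `sqlite3` module. This is a stand-in under stated assumptions, not the setup described here: SQLite replaces the SQL dialect used in the statements above, the column types are simplified to TEXT and REAL, the file is assumed to be UTF-8 encoded, and the function names `load_rules` and `entailing_terms` and the index name `idx_rhs` are hypothetical. The table and column names follow the ones used above.

```python
import csv
import sqlite3


def load_rules(conn, path):
    """Create myTable with the suggested indexes and load a tab-delimited rule file."""
    conn.execute(
        "CREATE TABLE myTable ("
        " lhs TEXT NOT NULL, rhs TEXT NOT NULL, score REAL NOT NULL,"
        " PRIMARY KEY (lhs, rhs))"
    )
    # Second index, so that lookups by right-hand-side term are fast.
    conn.execute("CREATE INDEX idx_rhs ON myTable (rhs)")
    # Assumption: the extracted file is UTF-8 and has exactly three fields per line.
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        conn.executemany(
            "INSERT INTO myTable (lhs, rhs, score) VALUES (?, ?, ?)",
            ((lhs, rhs, float(score)) for lhs, rhs, score in reader),
        )
    conn.commit()


def entailing_terms(conn, term, threshold):
    """Return (lhs, score) pairs for rules lhs-->term scoring above threshold, best first."""
    cur = conn.execute(
        "SELECT lhs, score FROM myTable WHERE rhs = ? AND score > ? ORDER BY score DESC",
        (term, threshold),
    )
    return cur.fetchall()
```

A call such as `entailing_terms(conn, "animal", 0.1)` then returns the expansion candidates for "animal". Passing TERM and THRESHOLD as query parameters rather than pasting them into the SQL string avoids quoting problems with terms that contain quotes or spaces.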