DIRECT - Directional Distributional Term-Similarity Resource

This is a resource of directional distributional term-similarity rules, automatically extracted using the inclusion relation as described in (Kotlerman et al., JNLE-DLS 2010). It has been shown to outperform other automatically constructed distributional resources on two lexical expansion tasks: Information Extraction (IE) and keyword-based Text Categorization (TC).


  • References:
    • Lili Kotlerman, Ido Dagan, Idan Szpektor and Maayan Zhitomirsky-Geffet. Directional Distributional Similarity for Lexical Inference. Special Issue of Natural Language Engineering on Distributional Lexical Semantics (JNLE-DLS). Cambridge University Press, 2010. To appear.
    • Lili Kotlerman, Ido Dagan, Idan Szpektor and Maayan Zhitomirsky-Geffet. Directional Distributional Similarity for Lexical Expansion. In Proceedings of ACL, 2009.
  • Contact: Lili Kotlerman, lili.dav @ gmail.com

Description in more detail

  • Most of the rules are lexical entailment rules, in which the meaning of the rule's left-hand-side implies the meaning of its right-hand-side. For instance: koala-->animal, bread-->food, imprisonment-->arrest, wedding-->marriage. In addition, the resource contains so-called context rules, in which there is no entailment of meaning; instead, the left-hand-side is a word indicative of contexts in which the right-hand-side tends to occur. For example, court-->lawyer.
  • The scores assigned to the rules tend to behave as follows:
    • Given a term, its lexical entailments are ranked higher than the looser context terms.
    • The correct direction of a term relation can usually be determined from the scores, e.g. koala-->animal will be scored higher than animal-->koala. Similarly high scores in both directions usually indicate synonyms.
  • Corpus: Reuters RCV-1
  • Features: To characterize a term we used words related to it via a syntactic dependency relation.
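
The score behavior described above can be turned into a simple direction check. The following is only a minimal sketch, not part of the resource: the function name, the two cutoffs (high, margin) and the toy scores are all hypothetical, and suitable cutoffs would have to be tuned on the actual rule-base.

```python
def relation_direction(scores, a, b, high, margin):
    """Compare the scores of a-->b and b-->a.

    scores: dict mapping (lhs, rhs) to a score; missing rules count as 0.0.
    high:   hypothetical cutoff above which a score counts as "high".
    margin: hypothetical maximum gap for two scores to count as "similar".
    """
    forward = scores.get((a, b), 0.0)
    backward = scores.get((b, a), 0.0)
    # Similarly high scores in both directions usually indicate synonyms.
    if forward >= high and backward >= high and abs(forward - backward) <= margin:
        return "synonyms"
    # Otherwise the higher-scored direction is taken as the likely one.
    if forward > backward:
        return f"{a}-->{b}"
    if backward > forward:
        return f"{b}-->{a}"
    return "undecided"


# Toy scores, invented for illustration only (not taken from the resource).
toy_scores = {
    ("koala", "animal"): 0.30,
    ("animal", "koala"): 0.02,
    ("car", "automobile"): 0.25,
    ("automobile", "car"): 0.24,
}

print(relation_direction(toy_scores, "animal", "koala", high=0.2, margin=0.05))
print(relation_direction(toy_scores, "car", "automobile", high=0.2, margin=0.05))
```

Since these are tendencies rather than guarantees, the result is best treated as a heuristic signal.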

Download

Each rule-base is available as a zipped, tab-delimited .txt file with three columns: left-hand-side, right-hand-side, score.
Left-hand-side and right-hand-side are lowercased lemmas (possibly multi-word) as detected by the Minipar parser.
All numerical digits are replaced with @ symbols.
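
As an illustration of the file format, the sketch below reads rules from tab-delimited lines and normalizes a lookup term the same way (lowercasing, digits replaced with @). The helper names and the sample lines are hypothetical, and the sketch does not lemmatize: the lemmas in the resource come from Minipar, so terms would still need to be lemmatized separately before lookup.

```python
import csv
import io
import re


def normalize_term(term):
    """Lowercase a term and replace every ASCII digit with '@'."""
    return re.sub(r"[0-9]", "@", term.lower())


def read_rules(lines):
    """Yield (lhs, rhs, score) tuples from tab-delimited lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        lhs, rhs, score = row
        yield lhs, rhs, float(score)


# Sample lines invented for illustration; real files follow the same layout.
sample = "koala\tanimal\t0.3\nboeing @@@\taircraft\t0.12\n\n"
rules = list(read_rules(io.StringIO(sample)))

print(normalize_term("Boeing 747"))
print(rules)
```

For a real rule-base, an open file object for the extracted .txt file can be passed to read_rules in place of the StringIO object.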

  • Up to 200 rules per right-hand-side term:
      - Nouns: over 1.5 million rules - download (zip - 20 MB)
      - Verbs: over 50,000 rules - download (zip - 0.5 MB)
  • Up to 1000 rules per right-hand-side term:
      - Nouns: over 6 million rules - download (zip - 105 MB)
      - Verbs: over 1 million rules - download (zip - 17 MB)

Using the resource

  • Installation
    • Download the .zip file you need and extract the tab-delimited .txt file (called file.txt below)
    • Create a table myTable in some database
      CREATE TABLE myTable ( lhs VARCHAR(512) NOT NULL, rhs VARCHAR(512) NOT NULL, score DECIMAL(64,30) UNSIGNED NOT NULL ) ;
    • Load the rules from the file file.txt into the table you created
      LOAD DATA INFILE "file.txt" INTO TABLE myTable ;
    • Suggested indexes: a primary key on (lhs, rhs) and a secondary index on rhs
  • Querying the DB
    • The following query finds, for a given right-hand-side term TERM, the entailing (left-hand-side) terms whose score is above THRESHOLD, ordered by descending score
      SELECT lhs, score FROM myTable WHERE rhs = "TERM" AND score > THRESHOLD ORDER BY score DESC ;
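
The installation and query steps above can be tried end to end without a database server. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module and an in-memory database instead of MySQL, so the column types (TEXT, REAL) differ from the CREATE TABLE statement above, and the function names, index name and toy rules are hypothetical.

```python
import sqlite3


def build_rule_table(conn, rules):
    """Create myTable with the suggested indexes and insert the rules."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS myTable ("
        " lhs TEXT NOT NULL, rhs TEXT NOT NULL, score REAL NOT NULL,"
        " PRIMARY KEY (lhs, rhs))"
    )
    # Secondary index on rhs, since queries filter on the right-hand-side.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_rhs ON myTable (rhs)")
    conn.executemany(
        "INSERT INTO myTable (lhs, rhs, score) VALUES (?, ?, ?)", rules
    )
    conn.commit()


def entailing_terms(conn, term, threshold):
    """Return (lhs, score) pairs for rules lhs-->term scoring above threshold."""
    cur = conn.execute(
        "SELECT lhs, score FROM myTable"
        " WHERE rhs = ? AND score > ? ORDER BY score DESC",
        (term, threshold),
    )
    return cur.fetchall()


# Toy rules invented for illustration only (not taken from the resource).
toy_rules = [
    ("koala", "animal", 0.30),
    ("dog", "animal", 0.45),
    ("zoo", "animal", 0.05),
    ("bread", "food", 0.40),
]

conn = sqlite3.connect(":memory:")
build_rule_table(conn, toy_rules)
print(entailing_terms(conn, "animal", 0.1))
```

Passing TERM and THRESHOLD as query parameters rather than pasting them into the SQL string avoids quoting problems with multi-word terms or terms that contain quotation marks.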