Lexical Reference Rules from Wikipedia


This is a rule-base of about 8 million lexical reference rules extracted from Wikipedia, as described in our ACL paper from 2009 (see reference below). A lexical reference rule is a directional relation identifying a concrete reference from its left-hand-side (lhs) to its right-hand-side (rhs), for instance Bentley --> luxury car, physician --> medicine and Abbey Road --> The Beatles. This relation is more specific than similarity or association rules and more general than the regular lexical relations (such as synonymy and hyponymy).

This rule-base has been shown to perform better than other automatically constructed baselines in a couple of lexical expansion and matching tasks. It yields performance comparable to WordNet while providing largely complementary information.

When using this rule-base, please cite the following publication: Eyal Shnarch, Libby Barak, Ido Dagan. 2009. Extracting Lexical Reference Rules from Wikipedia. In Proceedings of ACL.

Acquiring the resource

The rule-base is available either as a database dump or as CSV files.

DB tables

  1. terms:

    Includes one row per term (possibly a multi-word expression) that participates in a rule. Note that the terms in this table are case sensitive, so the same term in upper case and in lower case can be distinguished. However, MySQL string comparisons are case insensitive under many collations, so a SELECT query may not preserve this distinction unless the comparison is made case sensitive.

    Indexes: primary on id, term and another one on term.

  2. rules:

    Includes one row per rule. Each rule has a left-hand-side (lhs) and a right-hand-side (rhs). The rules are directional; the lhs makes a reference to the rhs. A bidirectional rule is represented by two separate rows in this table.

    The method column indicates which of the 5 extraction methods described in the paper extracted the rule (1-Redirect, 2-Be Complement, 3-Parenthesis, 4-Link, 5-All Nouns). For a rule that was extracted by more than one method, this column contains all method indicators separated by @.

    The pattern column holds a string representation of the syntactic path connecting the lhs to the rhs (only for rules that were extracted by the All Nouns method).

    The rule_perc column indicates the percentile rank of the pattern of the rule (again, only for rules that were extracted by the All Nouns method). You can use this value to split the long list of All Nouns rules into groups. In our paper we split this list into 3 parts.

    Indexes: primary on lhs, rhs and another one on rhs.

  3. rules_counts:

    Gathers some statistical counters for each rule:

    • The number of Wikipedia articles containing the lhs of the rule.
    • The number of Wikipedia articles containing the rhs of the rule.
    • The number of Wikipedia articles containing both the lhs and the rhs.

    Note: for a multi-word expression, all words must appear contiguously in an article.

    The Dice coefficient of the rule, calculated from the above counters, can be used for rule filtering:

      Dice = 2 * count(both) / (count(lhs) + count(rhs))

    Indexes: primary on lhs, rhs, one on dice and another one on condDice.
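
The method encoding and the Dice score described for these tables can be sketched in Python as follows. This is a minimal illustration and not code shipped with the resource: the helper names decode_methods and dice are hypothetical, the method field is assumed to hold integer indicators separated by @, and the counts in the example are made up.

```python
# Method indicators as listed in the description of the rules table.
METHODS = {
    1: "Redirect",
    2: "Be Complement",
    3: "Parenthesis",
    4: "Link",
    5: "All Nouns",
}


def decode_methods(method_field):
    """Split a method value such as '1@4' into method names.

    Hypothetical helper; assumes integer indicators separated by '@'.
    """
    return [METHODS[int(part)] for part in str(method_field).split("@") if part]


def dice(lhs_count, rhs_count, rule_count):
    """Dice coefficient: 2 * count(both) / (count(lhs) + count(rhs))."""
    total = lhs_count + rhs_count
    if total == 0:
        return 0.0  # assumption: score a rule with no occurrences as 0
    return 2.0 * rule_count / total


if __name__ == "__main__":
    # Made-up values, for illustration only.
    print(decode_methods("1@4"))   # ['Redirect', 'Link']
    print(dice(200, 600, 100))     # 0.25
```

The dice column already stores this score, so recomputing it is mainly useful as a sanity check on the counters.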

Querying the DB

A query that finds rules whose lhs is TERM and Dice score is above THRESHOLD:

SELECT lhs.term, rhs.term, r.method, rc.dice, rc.lhs_count, rc.rhs_count, rc.rule_count, r.pattern, r.rule_perc
FROM terms lhs, rules r, terms rhs, rules_counts rc
WHERE rhs.id = r.rhs AND lhs.id = r.lhs AND r.lhs = rc.lhs AND r.rhs = rc.rhs AND lhs.term = 'TERM' AND rc.dice > THRESHOLD;
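
For programmatic access, a query of this kind can be run with bound parameters instead of pasting TERM and THRESHOLD into the SQL string. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with a tiny in-memory database rather than the MySQL dump, the column types are guessed, the id column of terms is assumed to be the key that rules.lhs and rules.rhs refer to, and every row and count is an invented toy value. The names find_rules and build_toy_db are hypothetical.

```python
import sqlite3

# Toy schema. Column names follow the table descriptions; types are assumptions,
# and the condDice column is omitted because it is not described.
SCHEMA = """
CREATE TABLE terms (id INTEGER PRIMARY KEY, term TEXT);
CREATE TABLE rules (lhs INTEGER, rhs INTEGER, method TEXT, pattern TEXT,
                    rule_perc REAL, PRIMARY KEY (lhs, rhs));
CREATE TABLE rules_counts (lhs INTEGER, rhs INTEGER, lhs_count INTEGER,
                           rhs_count INTEGER, rule_count INTEGER, dice REAL,
                           PRIMARY KEY (lhs, rhs));
"""

# Same joins as the query above, written with explicit JOINs and placeholders.
QUERY = """
SELECT lhs.term, rhs.term, r.method, rc.dice
FROM rules r
JOIN terms lhs ON lhs.id = r.lhs
JOIN terms rhs ON rhs.id = r.rhs
JOIN rules_counts rc ON rc.lhs = r.lhs AND rc.rhs = r.rhs
WHERE lhs.term = ? AND rc.dice > ?
ORDER BY rc.dice DESC
"""


def find_rules(conn, term, threshold):
    """Return rules whose lhs is `term` and whose Dice score exceeds `threshold`."""
    return conn.execute(QUERY, (term, threshold)).fetchall()


def build_toy_db():
    """Create an in-memory database with invented rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO terms VALUES (?, ?)",
                     [(1, "Bentley"), (2, "luxury car"), (3, "car")])
    conn.executemany("INSERT INTO rules VALUES (?, ?, ?, ?, ?)",
                     [(1, 2, "2@4", None, None), (1, 3, "4", None, None)])
    conn.executemany("INSERT INTO rules_counts VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 2, 200, 600, 100, 0.25), (1, 3, 200, 9800, 50, 0.01)])
    return conn


if __name__ == "__main__":
    for row in find_rules(build_toy_db(), "Bentley", 0.1):
        print(row)
```

SQLite uses ? as its placeholder; MySQL drivers for Python commonly use %s instead, so the query text would need that adjustment when run against the actual dump.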