Lexical Reference Rules from Wikipedia

This is a rule-base of about 8 million lexical reference rules extracted from Wikipedia, as described in (Shnarch et al., 2009). A lexical reference rule is a directional relation identifying a concrete reference from its left-hand-side (lhs) to its right-hand-side (rhs), for instance Bentley --> luxury car, physician --> medicine and Abbey Road --> The Beatles. This relation is more specific than similarity or association and more general than the standard lexical relations (e.g. synonymy, hyponymy).

This rule-base has been shown to perform better than other automatically constructed baselines in a couple of lexical expansion and matching tasks. It yields performance comparable to WordNet while providing largely complementary information.

When using this rule-base, please cite the following publication: Eyal Shnarch, Libby Barak, Ido Dagan. 2009. Extracting Lexical Reference Rules from Wikipedia. In Proceedings of ACL.


Contact: Eyal Shnarch, shey @ cs.biu.ac.il



    Acquiring the resource

    The rule-base is available either as a database dump or as CSV files.

    • To import it into a MySQL DB follow these steps:
      • Download the file wiki_dump.zip and extract wiki_dump.sql from it.
      • Create a database named wikiRules (or any other name).
      • Run the following MySQL import command from the bin directory of your MySQL (e.g. C:\Program Files\MySQL\MySQL Server 5.1\bin):
        mysql -uusername -ppassword wikiRules < wiki_dump.sql
        (Note that there should be no space between -u and the username or between -p and the password, nor between the minus sign and the u or p.)

      This command should create all tables and indexes and insert all the data. Bear in mind that the tables contain a few million rows, so this may take a while.

    • To obtain the 3 tables of the database as CSV (Comma Separated Values) files, download the file tables.zip.
      For better performance, remember to add indexes on the tables as described below.



    DB tables

    • terms:

    • Includes one row per term (possibly a multi-word expression) that participates in a rule. Note that the terms in this table are case sensitive, which makes it possible to distinguish between upper-case and lower-case variants of a term. However, string comparisons in MySQL SELECT queries are not always case sensitive (this depends on the collation of the column), so this distinction may be lost when querying.

      Indexes: primary key on (id, term) and a secondary index on term.

    • rules:

    • Includes one row per rule. Each rule has a left-hand-side (lhs) and a right-hand-side (rhs). The rules are directional; the lhs makes a reference to the rhs. A bidirectional rule is represented by two separate rows in this table.

      The method column indicates which of the 5 extraction methods described in the paper extracted the rule (1-Redirect, 2-Be Complement, 3-Parenthesis, 4-Link, 5-All Nouns). For a rule that was extracted by more than one method, this column contains all method indicators separated by @.

      The pattern column holds a string representation of the syntactic path connecting the lhs to the rhs (only for rules that were extracted by the All Nouns method).

      The rule_perc column indicates the percentile rank of the pattern of the rule (again, only for rules that were extracted by the All Nouns method). You can use this value to split the long list of All Nouns rules into groups. In our paper we split this list into 3 parts.

      Indexes: primary key on (lhs, rhs) and a secondary index on rhs.
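
The method and rule_perc columns described above could be handled along the following lines. This is only a minimal sketch: it assumes the method column stores the numeric codes as text (for example "1@4"), and it assumes that splitting the All Nouns rules into equally sized groups by rule_perc is acceptable. The cut points actually used in the paper are not given here, and the function names parse_methods and split_by_percentile are made up for illustration.

```python
# Method codes as listed in the description of the method column.
METHOD_NAMES = {
    1: "Redirect",
    2: "Be Complement",
    3: "Parenthesis",
    4: "Link",
    5: "All Nouns",
}


def parse_methods(method_field):
    """Turn a method value such as '1@4' into a list of method names.

    Assumption: the indicators stored in the column are the numeric codes.
    """
    codes = [part.strip() for part in str(method_field).split("@") if part.strip()]
    return [METHOD_NAMES[int(code)] for code in codes]


def split_by_percentile(rules, parts=3):
    """Sort (rule, rule_perc) pairs by rule_perc and cut them into `parts` groups.

    Assumption: groups of (nearly) equal size; earlier groups receive the
    leftover items when the list does not divide evenly.
    """
    ordered = sorted(rules, key=lambda pair: pair[1])
    size, extra = divmod(len(ordered), parts)
    groups, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        groups.append(ordered[start:end])
        start = end
    return groups


print(parse_methods("1@4"))
```

Because the grouping only depends on the ordering of rule_perc values, it works whether the percentile is stored on a 0-1 or a 0-100 scale.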

    • rules_counts:

    • Stores several statistical counters for each rule:

      • The number of Wikipedia articles containing the lhs of the rule.
      • The number of Wikipedia articles containing the rhs of the rule.
      • The number of Wikipedia articles containing both the lhs and the rhs.

      • Note: for a multi-word expression, all of its words must appear contiguously in an article for the article to be counted.

      • The Dice coefficient of the rule, calculated from the above counters, can be used for rule filtering:

        Dice = 2 * count(both) / (count(rhs) + count(lhs))

      Indexes: primary key on (lhs, rhs) and a secondary index on dice.
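
As a worked illustration of the Dice coefficient, the short sketch below recomputes it from the three counters. The function name and the example counts are invented for illustration and are not taken from the rule-base; returning 0.0 when both counts are zero is also an assumption.

```python
def dice(lhs_count, rhs_count, both_count):
    """Dice coefficient: 2 * count(both) / (count(lhs) + count(rhs))."""
    total = lhs_count + rhs_count
    if total == 0:
        # Assumption: treat a rule with no occurrences at all as score 0.
        return 0.0
    return 2.0 * both_count / total


# Made-up counts: lhs in 200 articles, rhs in 50, both together in 25.
score = dice(200, 50, 25)
print(score)  # 2 * 25 / 250 = 0.2
```

The score ranges from 0 (the two terms never co-occur) to 1 (they always co-occur), so a threshold on it keeps only rules whose sides tend to appear in the same articles.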



      Querying the DB

      A query that finds rules whose lhs is TERM and Dice score is above THRESHOLD:

      SELECT lhs.term, rhs.term, r.method, rc.dice, rc.lhs_count, rc.rhs_count, rc.rule_count, r.pattern, r.rule_perc
      FROM terms lhs, rules r, terms rhs, rules_counts rc
      WHERE rhs.id = r.rhs AND lhs.id = r.lhs AND r.lhs = rc.lhs AND r.rhs = rc.rhs AND lhs.term = 'TERM' AND rc.dice > THRESHOLD;
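
The query above can be tried from a program with TERM and THRESHOLD passed as parameters. The sketch below is self-contained and therefore runs against an in-memory SQLite database filled with a few made-up rows, not against the real MySQL dump. Table and column names are taken from the query; the column types, the toy counts and methods, and the helper names build_toy_db and find_rules are assumptions for illustration only.

```python
import sqlite3

# Same join as the query above, with placeholders for TERM and THRESHOLD.
QUERY = """
SELECT lhs.term, rhs.term, r.method, rc.dice
FROM terms lhs, rules r, terms rhs, rules_counts rc
WHERE rhs.id = r.rhs AND lhs.id = r.lhs
  AND r.lhs = rc.lhs AND r.rhs = rc.rhs
  AND lhs.term = ? AND rc.dice > ?
"""


def build_toy_db():
    """Create an in-memory database with invented rows (not real rule-base data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE terms (id INTEGER, term TEXT);
        CREATE TABLE rules (lhs INTEGER, rhs INTEGER, method TEXT,
                            pattern TEXT, rule_perc REAL);
        CREATE TABLE rules_counts (lhs INTEGER, rhs INTEGER, lhs_count INTEGER,
                                   rhs_count INTEGER, rule_count INTEGER, dice REAL);
    """)
    conn.executemany("INSERT INTO terms VALUES (?, ?)",
                     [(1, "Bentley"), (2, "luxury car"), (3, "car")])
    conn.executemany("INSERT INTO rules VALUES (?, ?, ?, ?, ?)",
                     [(1, 2, "2", None, None), (1, 3, "4", None, None)])
    conn.executemany("INSERT INTO rules_counts VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 2, 200, 50, 25, 0.2), (1, 3, 200, 9800, 100, 0.02)])
    return conn


def find_rules(conn, term, threshold):
    """Return the rules whose lhs is `term` and whose Dice score exceeds `threshold`."""
    return conn.execute(QUERY, (term, threshold)).fetchall()


conn = build_toy_db()
for row in find_rules(conn, "Bentley", 0.1):
    print(row)
```

With a MySQL client library the query text stays the same apart from that library's own placeholder syntax. Passing the term as a parameter also avoids quoting problems for terms that contain apostrophes.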