Extracting a Lexical Entailment Rule-base from Wikipedia
Eyal Shnarch, Libby Barak, Ido Dagan
Bar Ilan University

Entailment - What is it and what is it good for?
• Question Answering: “Which luxury cars are produced in Britain?”
• Information Retrieval: “The Beatles”

Lexical Entailment
• Lexical Entailment rules model such lexical relations
• Part of the Textual Entailment paradigm, a generic framework for semantic inference
• Encompasses a variety of relations:
  • Synonymy: Hypertension → Elevated blood-pressure
  • IS-A: Jim Carrey → actor
  • Predicates: Crime and Punishment → Fyodor Dostoyevsky
  • Reference: Abbey Road → The Beatles

What was done so far?
• Lexical database, made for computational consumption, an NLP resource: WordNet
  • Costly, needs experts, many years of development (since 1985)
• Distributional similarity
  • Country and State share similar contexts
  • But so do Nurse and Doctor, Bear and Tiger: low precision
• Patterns:
  • NP1 such as NP2: luxury car such as Jaguar
  • NP1 and other NP2: dogs and other domestic pets
  • Low coverage, mainly IS-A patterns

Our approach – Utilize Definitions
• Pen: an instrument for writing or drawing with ink.
  • pen is-an instrument
  • pen is used for writing / drawing
  • ink is part of pen
• Source of definitions:
  • Dictionary: describes language terms, slow growth
  • Encyclopedia: contains knowledge, proper names, events, concepts, rapid growth
• We chose Wikipedia
  • Very dynamic, constantly growing and being updated
  • Covers a vast range of domains
  • Gaining popularity in research: AAAI 2008 workshop

Extraction Types
• Be-complement
  • a noun in the position of a complement of the verb ‘be’
• All-Nouns
  • all nouns in the definition
  • these differ in their likelihood of being entailed

Ranking All-Nouns Rules
[Figure: dependency path connecting the title to a noun in the definition; node and edge labels: film, subj, vrel, title, directed, by-subj, by, pcomp-n, noun]
• The likelihood of entailment depends greatly on the syntactic path connecting the title and the noun.
  • The path is taken from the parse tree of the definition
• An unsupervised entailment likelihood score for a syntactic path p within a definition:
• Split Def-N into Def-Ntop and Def-Nbot
  • Indicative of rule reliability: the precision of Def-Ntop rules is much higher than that of Def-Nbot rules.
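
The split into Def-Ntop and Def-Nbot can be sketched as follows, assuming a likelihood score per syntactic path is already available. The slide does not preserve the score formula or the cut-off, so the path strings, the scores and the 0.5 threshold below are hypothetical placeholders, not the authors' values.

```python
from typing import Dict, List, Tuple

# A rule is (title, noun, path). Path strings here are made-up labels.
Rule = Tuple[str, str, str]


def split_by_path_score(
    rules: List[Rule],
    path_scores: Dict[str, float],
    threshold: float,
) -> Tuple[List[Rule], List[Rule]]:
    """Split All-Nouns rules into a high-scoring and a low-scoring group.

    Rules whose path has no known score are treated as score 0.0
    (an assumption of this sketch).
    """
    top: List[Rule] = []
    bot: List[Rule] = []
    for rule in rules:
        score = path_scores.get(rule[2], 0.0)
        if score >= threshold:
            top.append(rule)
        else:
            bot.append(rule)
    return top, bot


# Hypothetical inputs for illustration only.
example_rules = [
    ("Some Film", "film", "path_a"),
    ("Some Film", "director", "path_b"),
    ("Some Film", "year", "path_c"),
]
example_scores = {"path_a": 0.9, "path_b": 0.2}

def_n_top, def_n_bot = split_by_path_score(example_rules, example_scores, 0.5)
```

Keeping the threshold as a parameter makes it easy to trade precision for coverage when deciding which All-Nouns rules to keep.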

Extraction Types
• Redirect
• Parenthesis
• Link

Ranking Rules by Supervised Learning
• An alternative approach for deciding which rules to select out of all extracted rules.
• Each rule is represented by:
  • 6 binary features: one for each extraction type
  • 2 binary features: one for each side of the rule, indicating whether it is a named entity (NE)
  • 2 numerical features: the co-occurrence of the rule's sides and the number of times the rule was extracted
  • 1 numerical feature: the score of the path for the Def-N extraction type
• A manually annotated set was used to train SVMlight
  • The J parameter was varied to obtain different recall-precision tradeoffs
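
The rule representation described above (6 + 2 + 2 + 1 = 11 features) might be encoded as in the sketch below. The field names and the exact list of six extraction-type labels are assumptions made for illustration; the slides name the types only informally, and the feature order is arbitrary.

```python
from dataclasses import dataclass
from typing import List, Set

# Assumed labels for the six extraction types (not confirmed by the slides).
EXTRACTION_TYPES = [
    "Be-Comp", "Def-Ntop", "Def-Nbot", "Redirect", "Parenthesis", "Link",
]


@dataclass
class Rule:
    lhs: str                  # left-hand side of the rule
    rhs: str                  # right-hand side of the rule
    extracted_by: Set[str]    # extraction types that produced the rule
    lhs_is_ne: bool           # is the left side a named entity?
    rhs_is_ne: bool           # is the right side a named entity?
    cooccurrence: float       # co-occurrence of the two sides
    extraction_count: int     # number of times the rule was extracted
    path_score: float = 0.0   # path score, used for Def-N rules only


def to_features(rule: Rule) -> List[float]:
    """Encode a rule as an 11-dimensional feature vector."""
    type_flags = [1.0 if t in rule.extracted_by else 0.0 for t in EXTRACTION_TYPES]
    ne_flags = [float(rule.lhs_is_ne), float(rule.rhs_is_ne)]
    numeric = [rule.cooccurrence, float(rule.extraction_count), rule.path_score]
    return type_flags + ne_flags + numeric


# Hypothetical feature values, for illustration only.
example = Rule(
    lhs="Abbey Road", rhs="The Beatles",
    extracted_by={"Link"}, lhs_is_ne=True, rhs_is_ne=True,
    cooccurrence=0.3, extraction_count=1,
)
features = to_features(example)
```

A vector like this could then be written out in the input format of whichever SVM implementation is used for training.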

Results and Evaluation
• The obtained knowledge base includes:
  • About 10 million rules
    • For comparison: Snow’s extension to WordNet includes 400,000 relations.
  • More than 2.4 million distinct RHSs
    • Mostly named entities and specific concepts, as expected from an encyclopedia
  • 18% of the rules were extracted by more than one extraction type
• Two evaluation types:
  • Rule-based: rule correctness relative to human judgment
  • Within a real application: the utility of the extracted rules for lexical expansion in keyword-based text categorization

Rule-base Evaluation
• Randomly sampled 830 rules and annotated them for correctness
  • Inter-annotator agreement reached a Kappa of 0.7
• Precision: the percentage of correct rules
• Est. # of correct rules: the number of rules annotated as correct, scaled up from the sample to the full set of extracted rules.
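
The two measures on this slide can be sketched as a short calculation. This assumes the estimate scales the sample count up by the ratio of total rules to sampled rules; the numbers in the example are hypothetical and are not results from the talk.

```python
def precision(num_correct: int, num_annotated: int) -> float:
    """Fraction of annotated rules that were judged correct."""
    if num_annotated == 0:
        raise ValueError("no annotated rules")
    return num_correct / num_annotated


def estimated_correct_rules(num_correct: int, num_annotated: int, total_rules: int) -> float:
    """Scale the number of correct rules in the sample up to the full rule set.

    Assumes the sample was drawn uniformly at random from the total_rules rules.
    """
    return precision(num_correct, num_annotated) * total_rules


# Hypothetical numbers for illustration only.
sample_precision = precision(60, 100)
estimate = estimated_correct_rules(60, 100, 1000)
```

If sampling is done separately per extraction type, the same calculation would be applied to each type with its own sample size and total.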

Supervised Learning Evaluation
• 5-fold cross validation on the annotated sample:
• Although it considers additional information, performance is almost identical to that obtained using only the extraction types.
• Further research is needed to improve our current feature set and classification performance.

Text Categorization Evaluation
• Represent each category by a feature vector of its characteristic terms.
  • The characteristic terms should entail the category name.
• Compare the term-based feature vector of a classified document with the feature vectors of all categories.
• Assign the document to the category that yields the highest cosine similarity score (single-class classification).
• 20 Newsgroups collection
• 3 baselines: No expansions, WordNet, WikiBL, [Snow]
• Also evaluated the union of Wikipedia and WordNet
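
The classification step can be sketched as below: each category is a bag of characteristic terms, and a document is assigned to the category with the highest cosine similarity. The category names, their terms, the uniform term weights and the whitespace tokenization are all hypothetical simplifications, not the setup used in the evaluation.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(doc_tokens: List[str], categories: Dict[str, Dict[str, float]]) -> str:
    """Return the single category whose vector is most similar to the document."""
    doc_vector = dict(Counter(doc_tokens))
    return max(categories, key=lambda name: cosine(doc_vector, categories[name]))


# Hypothetical category vectors: the category name plus terms assumed to entail it.
category_vectors = {
    "baseball": {"baseball": 1.0, "pitcher": 1.0, "inning": 1.0},
    "space": {"space": 1.0, "nasa": 1.0, "orbit": 1.0},
}

predicted = classify("the pitcher threw a perfect inning".split(), category_vectors)
```

In this setting, lexical expansion amounts to adding to each category vector the terms that entail the category name, so the quality of the rule base directly affects which documents match.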

Promising Directions for Future Work
• Learning semantic relations in addition to taxonomical relations (hyponymy, synonymy):
  • Fine-grained LE relations are important for inference

Promising Directions for Future Work
• Natural Types, naturally phrased entities:
  • 56,000 terms entail Album
  • 31,000 terms entail Politician
  • 11,000 terms entail Footballer
  • 20,000 terms entail Actor
  • 15,000 terms entail Actress
  • 4,000 terms entail American Actor

Conclusions
• First large-scale rule base directed to cover LE.
• Learning an ontology, which is very important knowledge for reasoning systems (one of the conclusions of the first 3 RTE benchmarks).
• Automatically extracting lexical entailment rules from an unstructured source.
• Comparable results, on a real NLP task, to a costly manually crafted resource such as WordNet.
Thank You

Inference System
t: Strong sales were shown for Abbey Road in 1969.
grammar rule: passive to active
Abbey Road showed strong sales in 1969.
lexical entailment rule: Abbey Road → The Beatles
The Beatles showed strong sales in 1969.
lexico-syntactic rule: show strong sales → gain commercial success
h: The Beatles gained commercial success in 1969.
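
The last two steps of this derivation can be imitated with plain string rewriting, as in the sketch below. This is a toy illustration only: a real inference system would apply rules over parsed structures, and the passive-to-active grammar rule is skipped here because it needs a parser. The rule strings are written in past tense so that they match the sentence literally.

```python
from typing import List, Tuple


def apply_rules(text: str, rules: List[Tuple[str, str]]) -> List[str]:
    """Apply each (lhs, rhs) rewrite rule in order and record every intermediate text."""
    steps = [text]
    for lhs, rhs in rules:
        if lhs in text:
            text = text.replace(lhs, rhs)
            steps.append(text)
    return steps


# Start from the sentence obtained after the passive-to-active step.
start = "Abbey Road showed strong sales in 1969."
rewrite_rules = [
    ("Abbey Road", "The Beatles"),                          # lexical entailment rule
    ("showed strong sales", "gained commercial success"),   # lexico-syntactic rule
]

derivation = apply_rules(start, rewrite_rules)
hypothesis = derivation[-1]
```

Recording the intermediate sentences makes it possible to show which rule licensed each step of the derivation from t to h.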
