Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison [Etzioni et al., 2004]
Outline
• Introduction
• Paper structure
• KnowItAll System
• Rule Learning (RL)
• Subclass Extraction (SE)
• List Extraction (LE)
• Experiments
• Conclusion
Introduction
• Information extraction from the Web (~ Web mining)
• A useful prerequisite for this talk: information granularity, from fine to coarse
• Fine: 1 piece of information at 1 location (job posting)
• 10 pieces of information at 100 locations (HP digital camera)
• Coarse: 1,000 pieces of information at 100,000 locations (cities of the world)
Paper structure
• Presentation of an existing Web-mining system
• The authors' intuition of a « recall problem »
• Proposal of three possible improvements
• Definition of a metric for the « quantification of success »
• Evaluation of the proposed improvements
KnowItAll System
• Autonomous, domain-independent system that extracts facts, concepts, and relationships from the Web.
1. Focus (e.g., city)
2. Pattern instantiation: NP1 such as NP2 = « city such as »; Plural(NP1) such as NP2-List = « cities such as »
3. Search + passage retrieval: « … a city such as Sudbury, north of the Great Lakes … »; « … cities such as Chicago, New York, Atlanta and Orlando … »
4. Assessor: PMI-IR = Hits(Atlanta AND city) / Hits(Atlanta)
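The steps above can be sketched in a few lines of Python. This is only an illustration, not the KnowItAll implementation: the function names are hypothetical, a capitalized-word regex stands in for real noun-phrase chunking, and the hit counts are invented rather than taken from a search engine.

```python
import re

def instantiate_patterns(focus: str, plural: str) -> list[str]:
    """Build the two search phrases shown on the slide for a class name."""
    return [f"{focus} such as", f"{plural} such as"]

def extract_candidates(phrase: str, passage: str) -> list[str]:
    """Return capitalized word sequences found after `phrase` in a passage."""
    match = re.search(re.escape(phrase) + r"\s+(.*)", passage)
    if not match:
        return []
    # Crude stand-in for an NP chunker: runs of capitalized words.
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", match.group(1))

def pmi(hits_both: int, hits_instance: int) -> float:
    """PMI-IR ratio from the slide: Hits(instance AND class) / Hits(instance)."""
    return hits_both / hits_instance if hits_instance else 0.0

phrases = instantiate_patterns("city", "cities")
passage = "… cities such as Chicago, New York, Atlanta and Orlando …"
candidates = extract_candidates(phrases[1], passage)
print(candidates)  # ['Chicago', 'New York', 'Atlanta', 'Orlando']

# Invented hit counts, for illustration only.
print(pmi(hits_both=600, hits_instance=1000))  # 0.6
```

A regex this loose also picks up any later capitalized phrase in the passage, which is one reason an assessor step is needed after extraction.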
Rule Learning (RL)
• Goal: increase the recall of KnowItAll
• Patterns: “city, such as Boston”; “mega-city such as Mexico”; “within a city, such as Rice University”
• Facts (with likelihood): PMI(Boston, city) = 0.60; PMI(Mexico, city) = 0.56; PMI(Rice University, city) = 0.24
Rule Learning (RL)
• Facts (most probable) → contexts in which they occur: « of Boston College »; « the Boston Globe »; « a Boston Parking Space »; « headquartered in Boston »; « Crime in Mexico continues »; « Mexico City Hotels »; « headquartered in Mexico »
• New patterns: « headquartered in NP »
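As a rough sketch of how new patterns could be proposed from these contexts, the code below replaces each seed with NP and records which seeds every left context was seen with. The function name, the whitespace tokenization, and the two-word context window are assumptions made for illustration; only the example strings come from the slide.

```python
from collections import defaultdict

def candidate_patterns(occurrences, seeds, max_width=2):
    """Map each left-context pattern (ending in NP) to the seeds it matched."""
    pattern_seeds = defaultdict(set)
    for text in occurrences:
        tokens = text.split()
        for i, token in enumerate(tokens):
            if token not in seeds:
                continue
            # Try every left context of 1..max_width words before the seed.
            for width in range(1, max_width + 1):
                if i - width < 0:
                    break
                prefix = " ".join(t.lower() for t in tokens[i - width:i])
                pattern_seeds[f"{prefix} NP"].add(token)
    return dict(pattern_seeds)

occurrences = [
    "of Boston College",
    "the Boston Globe",
    "a Boston Parking Space",
    "headquartered in Boston",
    "Crime in Mexico continues",
    "Mexico City Hotels",
    "headquartered in Mexico",
]
patterns = candidate_patterns(occurrences, {"Boston", "Mexico"})
print(sorted(patterns["headquartered in NP"]))  # ['Boston', 'Mexico']
print(sorted(patterns["of NP"]))                # ['Boston']
```

Only « headquartered in NP » and the very general « in NP » are shared by both seeds here, so a quality estimate is still needed to separate useful rules from overly broad ones.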
Rule Learning (RL)
• Estimating rule quality
• Heuristic 1: discard every substring that appears with only a single seed.
• Heuristic 2: rule precision = (c + k) / (c + n + m)
• c is the number of times the rule matches a seed
• n is the number of times the rule matches a known negative example
• k / m is the prior estimate of the rule's precision (PMI tests)
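A minimal sketch of the two heuristics follows. The precision formula is not legible on the slide; the code assumes a Laplace-style smoothed estimate, (c + k) / (c + n + m), which is consistent with the definitions of c, n and k / m given above. Function names and the example counts are hypothetical.

```python
def drop_single_seed_patterns(pattern_seeds):
    """Heuristic 1: keep only patterns seen with at least two distinct seeds."""
    return {p: s for p, s in pattern_seeds.items() if len(s) >= 2}

def estimated_precision(c, n, k, m):
    """Heuristic 2 (assumed form): smooth c / (c + n) toward the prior k / m."""
    return (c + k) / (c + n + m)

# Hypothetical pattern-to-seeds map, as a rule learner might produce.
pattern_seeds = {
    "headquartered in NP": {"Boston", "Mexico"},
    "of NP": {"Boston"},
    "crime in NP": {"Mexico"},
}
print(sorted(drop_single_seed_patterns(pattern_seeds)))  # ['headquartered in NP']

# Invented counts: 8 seed matches, 2 negative matches, prior k / m = 1 / 2.
print(estimated_precision(c=8, n=2, k=1, m=2))  # 0.75
# With no evidence at all, the estimate falls back to the prior.
print(estimated_precision(c=0, n=0, k=1, m=2))  # 0.5
```

The smoothing keeps a rule that matched a single seed once from being scored as perfectly precise.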
Subclass Extraction (SE)
• Goal: increase the recall of KnowItAll
• Focus: scientist
• Pattern: « scientist such as NP »
• Matches: « … scientist such as Arthur Noyes … »; « … scientist such as Isaac Newton … »; « … scientist such as Sandra Steingraber … »
Subclass Extraction (SE)
• Using the facts already found, apply the reverse pattern « N such as Arthur Noyes »: « chemist such as Arthur Noyes »; « biologist such as Sandra Steingraber »
• Assess candidate subclasses with a PMI test and a morphology test (suffix « ist »)
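The reverse pattern and the morphology test can be illustrated as below. This is a sketch under assumptions: the function name is hypothetical, the passages are invented apart from the two quoted on the slide, a single lowercase word stands in for N, and the PMI test is left out.

```python
import re

def candidate_subclasses(passages, instances, focus, suffix="ist"):
    """Find words N in 'N such as <instance>' that pass the suffix test."""
    found = set()
    for instance in instances:
        pattern = re.compile(r"\b([a-z]+) such as " + re.escape(instance))
        for passage in passages:
            for noun in pattern.findall(passage):
                # Morphology test from the slide: same suffix as the focus class.
                if noun != focus and noun.endswith(suffix):
                    found.add(noun)
    return sorted(found)

passages = [
    "a chemist such as Arthur Noyes",
    "a biologist such as Sandra Steingraber",
    "a scientist such as Isaac Newton",
    "a person such as Isaac Newton",  # invented passage, rejected by the suffix test
]
instances = ["Arthur Noyes", "Isaac Newton", "Sandra Steingraber"]
print(candidate_subclasses(passages, instances, focus="scientist"))
# ['biologist', 'chemist']
```

Each accepted subclass can then be used as a new focus, so that « chemist such as NP » retrieves instances the original « scientist such as NP » pattern would miss.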
List Extraction (LE)
• Goal: increase the recall of KnowItAll
• Find Web pages that contain a set (k = 4) of randomly chosen known facts: « chicago AND boston AND mexico AND buenos aires »
• Repeat 5,000-10,000 times
• In each document, try to find « a list »
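The query-generation step might look like the sketch below. The function name, the fixed random seed, and the extra city names in the fact list are assumptions for illustration; only k = 4 and the AND-joined query shape come from the slide.

```python
import random

def make_queries(facts, k=4, n_queries=5, seed=0):
    """Build conjunctive queries from k randomly sampled known facts each."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [" AND ".join(rng.sample(facts, k)) for _ in range(n_queries)]

facts = ["chicago", "boston", "mexico", "buenos aires", "atlanta", "orlando"]
for query in make_queries(facts):
    print(query)
```

On the slide the loop runs 5,000 to 10,000 times; `n_queries` is kept small here only to make the output readable.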
List Extraction (LE)
• Use a Web page « wrapper », i.e., a classifier that identifies positive nodes (elements of the list) and negative nodes (all the remaining HTML markup)
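To make the positive/negative distinction concrete, here is a deliberately simplified stand-in for a wrapper. It is not a learned classifier: it assumes that every text node sharing an enclosing tag with a known seed is a list element. The class and function names and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class TextNodeCollector(HTMLParser):
    """Record each non-empty text node together with its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []  # (enclosing tag, text)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.nodes.append((self.stack[-1], text))

def extract_list(html, seeds):
    """Return text nodes whose enclosing tag also encloses a known seed."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    seed_tags = {tag for tag, text in collector.nodes if text in seeds}
    return [text for tag, text in collector.nodes if tag in seed_tags]

page = (
    "<html><body><h1>Big cities</h1>"
    "<ul><li>Chicago</li><li>Boston</li><li>Lima</li></ul>"
    "<p>Contact us</p></body></html>"
)
print(extract_list(page, {"Chicago", "Boston"}))  # ['Chicago', 'Boston', 'Lima']
```

Here the `li` nodes are treated as positive and the `h1` and `p` nodes as negative, so Lima is proposed as a new candidate.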
List Extraction (LE)
• Quality of a new fact = number of lists in which it appears
• PMI can also be used to assess the quality (LE+A)
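The list-count score is simple enough to sketch directly. The function name and the example lists below are invented for illustration, and the PMI variant (LE+A) is not shown.

```python
from collections import Counter

def score_by_list_count(lists, known):
    """Score each new candidate by the number of extracted lists containing it."""
    counts = Counter()
    for items in lists:
        # A set ensures a candidate is counted at most once per list.
        counts.update(set(items) - known)
    return counts

lists = [
    ["Chicago", "Boston", "Lima"],
    ["Boston", "Lima", "Quito"],
    ["Lima", "Home", "Contact"],
]
scores = score_by_list_count(lists, known={"Chicago", "Boston"})
print(scores["Lima"], scores["Quito"], scores["Home"])  # 3 1 1
```

Navigation text such as "Home" tends to appear in few lists, so it scores low without any further filtering.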
Experiments
• How to calculate the recall improvement?
• True recall cannot be calculated (the complete set of correct facts is unknown)
• The size of the extracted set of facts can be used instead
• But how to make sure the set is pure?
• Sort facts by probability
• Use only high-quality facts (e.g., prob > 0.9)
• Manually assess a sample
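The recall proxy described above amounts to counting facts above a probability threshold. The sketch below uses invented probabilities and a hypothetical function name; manual assessment of a sample is not modeled.

```python
def high_quality_facts(facts, threshold=0.9):
    """Sort facts by probability and keep those strictly above the threshold."""
    ranked = sorted(facts.items(), key=lambda item: item[1], reverse=True)
    return [name for name, prob in ranked if prob > threshold]

# Invented probabilities, for illustration only.
baseline = {"Boston": 0.97, "Chicago": 0.95, "Rice University": 0.24}
with_le = {**baseline, "Lima": 0.93, "Quito": 0.91, "Home": 0.10}

print(len(high_quality_facts(baseline)))  # 2
print(len(high_quality_facts(with_le)))   # 4
```

Comparing the two counts gives a relative measure of recall between system variants, provided the thresholded set has been checked to be nearly pure.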
Conclusion
• KnowItAll is an information extraction system (coarse IE)
• The only input is a one-word « focus » (city, scientist, movie, …)
• Pattern instantiation, passage retrieval, PMI-IR test
• RL, SE and LE improve extraction recall
• Overall, LE gives the greatest improvement
• SE was notably good on the « scientist » task
Conclusion
http://knowitall-1.cs.washington.edu/dbinterface/knowitall2/default.asp