Methods for Domain-Independent Information Extraction from the Web

Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison [Etzioni et al., 2004]

Outline
• Introduction
• Paper structure
• KnowItAll System
• Rule Learning (RL)
• Subclass Extraction (SE)
• List Extraction (LE)
• Experiments
• Conclusion

Introduction
• Information extraction from the Web (~ Web mining)
• A useful prerequisite for this talk: information granularity, from fine to coarse
  • Fine: 1 piece of information, 1 location (job posting)
  • Intermediate: 10 pieces of information, 100 locations (HP digital camera)
  • Coarse: 1,000 pieces of information, 100,000 locations (cities of the world)

Paper structure
• Presentation of an existing Web mining system
• The authors' intuition of a « recall problem »
• Proposal of three possible improvements
• Definition of a metric for the « quantification of success »
• Evaluation of the proposed improvements

KnowItAll System
• Autonomous, domain-independent system that extracts facts, concepts, and relationships from the Web.
• Step 1: Focus (e.g., city)
• Step 2: Pattern instantiation
  • NP1 such as NP2 = « city such as »
  • Plural(NP1) such as NP2-List = « cities such as »
• Step 3: Search + passage retrieval
  • « … a city such as Sudbury, north of the Great Lakes … »
  • « … cities such as Chicago, New York, Atlanta and Orlando … »
• Step 4: Assessor: PMI-IR ≈ Hits(Atlanta AND city) / Hits(Atlanta)
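The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical (not KnowItAll's), the pluralizer and the noun-phrase matcher are deliberately naive, and the `hits` table stands in for search-engine hit counts with invented numbers.

```python
import re

def instantiate_patterns(focus):
    """Build the two 'such as' search phrases for a one-word focus."""
    # Naive pluralizer (assumption): city -> cities, scientist -> scientists.
    plural = focus[:-1] + "ies" if focus.endswith("y") else focus + "s"
    return [f"{focus} such as", f"{plural} such as"]

def extract_candidates(passage, phrase):
    """Return capitalized name candidates that follow the phrase."""
    match = re.search(re.escape(phrase) + r"\s+(.*)", passage)
    if not match:
        return []
    # Split a simple enumeration such as "Chicago, New York, Atlanta and Orlando".
    parts = re.split(r",\s*|\s+and\s+", match.group(1))
    names = []
    for part in parts:
        m = re.match(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*", part.strip())
        if m:
            names.append(m.group(0))
    return names

def pmi_ir(instance, focus, hits):
    """PMI-IR score: Hits(instance AND focus) / Hits(instance)."""
    alone = hits[instance]
    return hits[f"{instance} AND {focus}"] / alone if alone else 0.0

# Invented hit counts, for illustration only.
hits = {"Atlanta": 1000, "Atlanta AND city": 600}

phrases = instantiate_patterns("city")
cities = extract_candidates(
    "cities such as Chicago, New York, Atlanta and Orlando", phrases[1])
score = pmi_ir("Atlanta", "city", hits)
```

In a real system the passages would come from search-engine results and the hit counts from live queries; both are replaced by literals here.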

Rule Learning (RL)
• Goal: increase the recall of KnowItAll
• Patterns:
  • “city, such as Boston”
  • “mega-city such as Mexico”
  • “within a city, such as Rice University”
• Facts (with likelihood):
  • PMI(Boston, city) = 0.60
  • PMI(Mexico, city) = 0.56
  • PMI(Rice University, city) = 0.24

Rule Learning (RL)
• Facts (most probable) and the contexts in which they occur:
  • « of Boston College »
  • « the Boston Globe »
  • « a Boston Parking Space »
  • « headquartered in Boston »
  • « Crime in Mexico continues »
  • « Mexico City Hotels »
  • « headquartered in Mexico »
• New patterns: « headquartered in NP »
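The step from contexts to new patterns can be sketched as follows. This is a minimal illustration, not the learner from the paper: it assumes a pattern is simply the two words preceding a fact, and `candidate_patterns` and `width` are hypothetical names.

```python
from collections import defaultdict

def candidate_patterns(contexts, seeds, width=2):
    """Map each 'prefix NP' pattern to the set of seed facts it was seen with."""
    seen = defaultdict(set)
    for text in contexts:
        words = text.split()
        for seed in seeds:
            seed_words = seed.split()
            n = len(seed_words)
            for i in range(len(words) - n + 1):
                # Keep only occurrences with at least `width` words of left context.
                if words[i:i + n] == seed_words and i >= width:
                    prefix = " ".join(words[i - width:i]).lower()
                    seen[f"{prefix} NP"].add(seed)
    return dict(seen)

contexts = [
    "of Boston College", "the Boston Globe", "a Boston Parking Space",
    "headquartered in Boston", "Crime in Mexico continues",
    "Mexico City Hotels", "headquartered in Mexico",
]
patterns = candidate_patterns(contexts, ["Boston", "Mexico"])
```

On the slide's examples, « headquartered in NP » is the only candidate seen with both facts, which is why it stands out as a new pattern.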

Rule Learning (RL)
• Estimating rule quality
• Heuristic 1: remove all substrings that match only a single seed.
• Heuristic 2: rule precision = (c + k) / (c + n + m)
  • c is the number of times the rule matches a seed
  • n is the number of times the rule matches a known negative example
  • k / m is the prior estimate of the rule's precision (PMI tests)
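The two heuristics can be combined in a short sketch. It assumes the precision estimate takes the smoothed form (c + k) / (c + n + m), which is consistent with the definitions of c, n and k / m above; the rule statistics and the prior (k = 1, m = 2) are invented, and the function names are hypothetical.

```python
def estimated_precision(c, n, k, m):
    """Smoothed precision (c + k) / (c + n + m); k / m acts as the prior."""
    return (c + k) / (c + n + m)

def select_rules(stats, k, m, top=1):
    """stats maps rule -> (distinct seeds matched, c, n)."""
    # Heuristic 1: drop rules that matched only a single seed.
    kept = {rule: (c, n) for rule, (seeds, c, n) in stats.items() if seeds > 1}
    # Heuristic 2: rank the remaining rules by estimated precision.
    ranked = sorted(
        kept,
        key=lambda rule: estimated_precision(*kept[rule], k, m),
        reverse=True,
    )
    return ranked[:top]

# Invented counts, for illustration only.
stats = {
    "headquartered in NP": (2, 2, 0),
    "crime in NP": (1, 1, 0),
    "the NP": (2, 5, 20),
}
best = select_rules(stats, k=1, m=2)
```

The prior keeps a rule with very few matches from receiving an extreme score: with c = 2 and n = 0 the estimate is 0.75 rather than 1.0.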

Subclass Extraction (SE)
• Goal: increase the recall of KnowItAll
• Focus: scientist
• Pattern: « scientist such as NP »
  • … scientist such as Arthur Noyes
  • … scientist such as Isaac Newton
  • … scientist such as Sandra Steingraber

Subclass Extraction (SE)
• Using the facts already found, apply the reverse pattern: « N such as Arthur Noyes »
  • « chemist such as Arthur Noyes »
  • « biologist such as Sandra Steingraber »
• Assess candidate subclasses with the PMI trick and a morphology test (« ist »)
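A minimal sketch of the reverse pattern and the morphology test, under simplifying assumptions: the passages are invented, singularization just strips a trailing "s", the PMI check is omitted, and the function names are hypothetical.

```python
import re

def candidate_subclasses(passages, instances):
    """Collect the word N from '<N> such as <known instance>'."""
    found = set()
    for passage in passages:
        for name in instances:
            m = re.search(r"(\w+) such as " + re.escape(name), passage)
            if m:
                word = m.group(1).lower()
                # Crude singularization (assumption): biologists -> biologist.
                found.add(word[:-1] if word.endswith("s") else word)
    return found

def morphology_test(word, suffix="ist"):
    """Keep words that share the focus's suffix, e.g. scient-ist / chem-ist."""
    return word.endswith(suffix)

instances = ["Arthur Noyes", "Isaac Newton", "Sandra Steingraber"]
passages = [
    "a chemist such as Arthur Noyes",
    "biologists such as Sandra Steingraber",
    "a scientist such as Isaac Newton",
    "people such as Isaac Newton",
]
subclasses = {
    w for w in candidate_subclasses(passages, instances)
    if w != "scientist" and morphology_test(w)
}
```

Here « people » is rejected by the morphology test and the focus itself is excluded, leaving « chemist » and « biologist » as candidate subclasses.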

List Extraction (LE)
• Goal: increase the recall of KnowItAll
• Find Web pages that contain a set (k = 4) of randomly chosen facts, e.g. « chicago AND boston AND mexico AND buenos aires »
• Repeat 5,000 to 10,000 times
• In each document, try to find « a list »
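Building the search queries can be sketched as below. The fact list, the `repeats` value and the function name are assumptions for illustration; the slide gives k = 4 and 5,000 to 10,000 repetitions.

```python
import random

def list_queries(facts, k=4, repeats=3, seed=0):
    """Build `repeats` conjunctive queries, each from k randomly chosen facts."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return [" AND ".join(rng.sample(facts, k)) for _ in range(repeats)]

facts = ["chicago", "boston", "mexico", "buenos aires", "atlanta", "orlando"]
queries = list_queries(facts)
```

Each query would then be sent to a search engine, and the returned pages scanned for lists.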

List Extraction (LE)
• Use a Web page « wrapper », i.e. a classifier that identifies positive nodes (elements of the list) and negative nodes (all the remaining HTML markup)

List Extraction (LE)
• Quality of a new fact = number of lists in which it appears
• PMI can also be used to assess the quality (LE+A)
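Ranking new facts by the number of lists in which they appear can be sketched with a counter. The extracted lists below are invented, and `rank_by_list_count` is a hypothetical name.

```python
from collections import Counter

def rank_by_list_count(lists, known):
    """Score each new item by the number of extracted lists containing it."""
    counts = Counter()
    for items in lists:
        for item in set(items):  # an item counts once per list
            if item not in known:
                counts[item] += 1
    return counts.most_common()

# Invented extraction results, for illustration only.
lists = [
    ["Chicago", "Boston", "Lima"],
    ["Boston", "Lima", "Quito"],
    ["Lima", "Home", "Chicago"],
]
ranking = rank_by_list_count(lists, known={"Chicago", "Boston"})
```

An item such as « Home », picked up from page navigation, appears in few lists and ends up at the bottom of the ranking; the LE+A variant would additionally run the PMI test on each candidate.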

Experiments
• How can the recall improvement be calculated?
  • The true recall cannot be calculated (it is unknown)
  • The size of the set of extracted facts can be used instead
• But how can we make sure the set is pure?
  • Sort facts by probability
  • Use only high-quality facts (e.g., prob > 0.9)
  • Manually assess a sample
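The measurement above can be sketched as follows. The probabilities and manual labels are invented, and the function names are hypothetical; the sketch only shows the thresholding and the sample check, not the paper's actual evaluation.

```python
def high_quality(facts, threshold=0.9):
    """Return facts with probability above the threshold, most probable first."""
    ordered = sorted(facts.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, prob in ordered if prob > threshold]

def sample_precision(labels):
    """Fraction of manually checked facts that were judged correct."""
    return sum(labels) / len(labels)

# Invented probabilities and manual judgments, for illustration only.
facts = {"Boston": 0.97, "Mexico": 0.93, "Rice University": 0.24, "Atlanta": 0.95}
kept = high_quality(facts)
precision = sample_precision([True, True, False, True])
```

The size of `kept` plays the role of the recall proxy, and the sample precision indicates how pure that set is.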

Experiments
• [Figure: experimental results; not preserved in this transcript]

Conclusion
• KnowItAll is an information extraction system (coarse-grained IE)
• The only input is a one-word « focus » (city, scientist, movie, …)
• Pattern instantiation, passage retrieval, PMI-IR test
• RL, SE and LE improve extraction recall
• Overall, LE gives the greatest improvement
• SE was notably good on the « scientist » task

Conclusion
http://knowitall-1.cs.washington.edu/dbinterface/knowitall2/default.asp
