Open Information Extraction from the Web
Oren Etzioni
KnowItAll Project (2003…)
Rob Bart, Janara Christensen, Tony Fader, Tom Lin, Alan Ritter, Michael Schmitz, Dr. Niranjan Balasubramanian, Dr. Stephen Soderland, Prof. Mausam, Prof. Dan Weld
PhD alumni: Michele Banko, Prof. Michael Cafarella, Prof. Doug Downey, Ana-Maria Popescu, Stefan Schoenmackers, and Prof. Alex Yates
Funding: DARPA, IARPA, NSF, ONR, Google.
Outline
• A “scruffy” view of Machine Reading
• Open IE (overview, progress, new demo)
• Critique of Open IE
• Future work: Open, Open IE
I. Machine Reading (Etzioni, AAAI ’06)
• “MR is an exploratory, open-ended, serendipitous process”
• “In contrast with many NLP tasks, MR is inherently unsupervised”
• “Very large scale”
• “Forming generalizations based on extracted assertions”
No Ontology… Ontology Free!
Lessons from DB/KR Research
• Declarative KR is expensive & difficult
• Formal semantics is at odds with broad scope and distributed authorship
• KBs are brittle: “can only be used for tasks whose knowledge needs have been anticipated in advance” (Halevy, IJCAI ’03)
A fortiori for KBs extracted from text!
Machine Reading at Web Scale
• A “universal ontology” is impossible
• Global consistency is like world peace
• Micro-ontologies: do they scale? How do they interconnect?
• Ontological “glass ceiling”: limited vocabulary, pre-determined predicates
• Swamped by reading at scale!
II. Open vs. Traditional IE
How is Open IE possible?
Semantic Tractability Hypothesis
There is an easy-to-understand subset of English whose relations and arguments can be characterized syntactically (Banko, ACL ’08; Fader, EMNLP ’11; Etzioni, IJCAI ’11)
• The characterization is compact and domain independent
• It covers 85% of binary, verb-based relations
A sketch of such a syntactic characterization appears below.
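As a concrete illustration, here is a minimal sketch of a ReVerb-style syntactic characterization (roughly the pattern V | VP | VW*P over part-of-speech tags). This is not the actual ReVerb implementation; it assumes NLTK with its standard tokenizer and tagger models downloaded.

```python
# Minimal sketch of a ReVerb-style relation-phrase pattern: a verb,
# optionally followed by intervening words and ending in a preposition
# or particle (V | VP | VW*P). Illustrative only; the real ReVerb
# extractor is a separate open-source Java system.
import re
import nltk  # assumes punkt and the averaged perceptron tagger are installed

def relation_phrases(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    def symbol(tag):
        if tag.startswith("VB") or tag == "MD":
            return "V"  # verb or modal
        if tag in ("IN", "TO", "RP"):
            return "P"  # preposition, infinitive marker, or particle
        if tag[:2] in ("NN", "JJ", "RB", "PR", "DT"):
            return "W"  # noun / adjective / adverb / pronoun / determiner
        return "O"
    symbols = "".join(symbol(t) for _, t in tagged)
    # One or more verbs, optionally followed by W* and a closing P.
    for m in re.finditer(r"V+(?:W*P)?", symbols):
        yield " ".join(w for w, _ in tagged[m.start():m.end()])

print(list(relation_phrases("Einstein was born in Ulm.")))
# -> ['was born in']
```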
Sample of Extracted Relations and Sample Relation Phrases (tables omitted)
Number of Relations (chart omitted)
TextRunner (2007)
• First Web-scale Open IE system
• Distant supervision + CRF models of relations (Arg1, Relation Phrase, Arg2)
• 1,000,000,000 distinct extractions
Relation Extraction from the Web (figure omitted)
Open IE (2012)
“After beating the Heat, the Celtics are now the ‘top dog’ in the NBA.” → (the Celtics, beat, the Heat)
“If he wins 5 key states, Romney will be president.” → extraction qualified by its context (counterfactual: “if he wins 5 key states”)
• Open-source ReVerb extractor
• Synonym detection
• Parser-based Ollie extractor (Mausam, EMNLP ’12)
• Verbs → nouns and more
• Analyze context (beliefs, counterfactuals)
• Sophistication of IE is a major focus
But what about entities, types, ontologies?
Towards “Ontologized” Open IE
• Link arguments to Freebase (Lin, AKBC ’12), when possible!
• Associate types with arguments
• No Noun Phrase Left Behind (Lin, EMNLP ’12)
System Architecture
Input: Web corpus
1. Extractor (relation-independent extraction) → raw tuples:
(XYZ Corp.; acquired; Go Inc.), (XYZ; buyout of; Go Inc.), (oranges; contain; Vitamin C), (Einstein; was born in; Ulm), (Albert Einstein; born in; Ulm), (Einstein Bros.; sell; bagels)
2. Assessor (synonyms, confidence): XYZ Corp. = XYZ; Albert Einstein = Einstein ≠ Einstein Bros. → extractions with support counts:
Acquire(XYZ Corp., Go Inc.) [7], BornIn(Albert Einstein, Ulm) [5], Sell(Einstein Bros., bagels) [1], Contain(oranges, Vitamin C) [1]
3. Query processor: index in Lucene; link entities. (DEMO)
A sketch of the assessor’s confidence counting follows.
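The assessor idea fits in a few lines: normalize synonymous argument strings, then count distinct supporting extractions. This is an illustration of the idea, not the project’s actual code; the synonym table is hypothetical.

```python
# Sketch of the assessor stage: merge synonymous argument strings and
# use the number of supporting extractions as a confidence count,
# as in "Acquire(XYZ Corp., Go Inc.) [7]". Hypothetical data.
from collections import Counter

SYNONYMS = {"XYZ": "XYZ Corp.", "Einstein": "Albert Einstein"}  # learned, in reality

def normalize(arg):
    return SYNONYMS.get(arg, arg)

def assess(raw_tuples):
    counts = Counter((normalize(a1), rel, normalize(a2))
                     for a1, rel, a2 in raw_tuples)
    return counts.most_common()  # more independent support -> higher confidence

raw = [("XYZ Corp.", "acquired", "Go Inc."),
       ("XYZ", "acquired", "Go Inc."),
       ("Einstein", "was born in", "Ulm")]
for (a1, rel, a2), n in assess(raw):
    print(f"{rel}({a1}, {a2}) [{n}]")
```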
III. Critique of Open IE
• Lack of formal ontology/vocabulary
• Inconsistent extractions
• Can it support reasoning?
• What’s the point of Open IE?
Perspectives on Open IE
• “Search Needs a Shakeup” (Etzioni, Nature ’11)
• Textual resources
• Reasoning over extractions
A. New Paradigm for Search
“Moving Up the Information Food Chain” (Etzioni, AAAI ’96)
Retrieval → Extraction:
• Snippets, docs → entities, relations
• Keyword queries → questions
• List of docs → answers
Essential for smartphones! (Siri meets Watson)
Case Study over Yelp Reviews
• Map the review corpus to (attribute, value) pairs: (sushi = fresh), (parking = free)
• Natural-language queries: “Where’s the best sushi in Seattle?”
• Sort results via sentiment analysis: exquisite > very good > so-so
A toy sketch of this pipeline follows.
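The sketch below assumes attributes and values co-occur in simple “X is Y” patterns and that sentiment weights come from a small hand-made lexicon; RevMiner’s actual extraction and sentiment scale are learned from data.

```python
# Toy sketch (not RevMiner itself) of mapping review text to
# (attribute, value) pairs and ranking values on a hypothetical
# sentiment scale, e.g. exquisite > very good > so-so.
import re

# Hypothetical sentiment lexicon; illustrative weights only.
SENTIMENT = {"exquisite": 3.0, "fresh": 2.0, "very good": 2.0, "so-so": 0.5}

def attribute_values(review):
    # Toy pattern: "<attribute> is/was <value>", e.g. "sushi is fresh".
    for attr, value in re.findall(r"(\w+) (?:is|was) ([\w-]+)", review.lower()):
        yield attr, value

def rank(reviews, attribute):
    pairs = [(a, v) for r in reviews for a, v in attribute_values(r)
             if a == attribute]
    return sorted(pairs, key=lambda p: SENTIMENT.get(p[1], 0), reverse=True)

reviews = ["The sushi is fresh and the parking is free.",
           "Honestly the sushi was so-so."]
print(rank(reviews, "sushi"))  # [('sushi', 'fresh'), ('sushi', 'so-so')]
```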
RevMiner: Extractive Interface to 400K Yelp Reviews (Huang, UIST ’12)
revminer.com
B. Public Textual Resources (Leveraging Open IE)
• 94M rel-grams: n-grams, but over relations in text, e.g. (police investigate X) followed by (police charge Y) (Balasubramanian, AKBC ’12)
• 600K relation phrases (Fader, EMNLP ’11)
• Relation meta-data:
  • 50K domain/range constraints for relations (Ritter, ACL ’10)
  • 10K functional relations (Lin, EMNLP ’10)
• 30K learned Horn clauses (Schoenmackers, EMNLP ’10)
• CLEAN (Berant, ACL ’12): 10M entailment rules (coming soon), precision double that of DIRT
See openie.cs.washington.edu
C. Reasoning over Extractions
Starting from 1,000,000,000 extractions:
• Identify synonyms (Yates & Etzioni, JAIR ’09)
• Linear-time first-order Horn-clause inference (Schoenmackers, EMNLP ’08)
• Transitive inference (Berant, ACL ’11)
• Learn argument types via a generative model (Ritter, ACL ’10)
Unsupervised, probabilistic model for identifying synonyms
• Objects, e.g. P(Bill Clinton = President Clinton): count shared (relation, arg2) pairs
• Relations, e.g. P(acquired = bought): count shared (arg1, arg2) pairs
• Exploits functions and mutual recursion
• Next step: unify with …
A toy version of the shared-context idea is sketched below.
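A toy version of the idea (not the actual probabilistic model): the more (relation, arg2) contexts two object strings share, the more likely they are synonyms.

```python
# Sketch: two object strings are likely synonyms when they share many
# (relation, arg2) contexts across the extraction corpus. Jaccard
# overlap stands in for the probabilistic model on the slide.
from collections import defaultdict

extractions = [
    ("Bill Clinton", "was elected", "president"),
    ("President Clinton", "was elected", "president"),
    ("Bill Clinton", "was born in", "Hope"),
    ("President Clinton", "was born in", "Hope"),
    ("Bill Gates", "founded", "Microsoft"),
]

contexts = defaultdict(set)
for a1, rel, a2 in extractions:
    contexts[a1].add((rel, a2))

def synonymy_score(x, y):
    cx, cy = contexts[x], contexts[y]
    return len(cx & cy) / len(cx | cy)

print(synonymy_score("Bill Clinton", "President Clinton"))  # 1.0
print(synonymy_score("Bill Clinton", "Bill Gates"))          # 0.0
```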
Scalable Textual Inference
Desiderata for inference:
• In text → probabilistic inference
• On the Web → linear in |Corpus|
Using argument distributions of textual relations:
• Inference provably linear
• Empirically linear!
Extractions → Domain/Range
• Much previous work (Resnik, Pantel, etc.)
• Utilize generative topic models
Analogy: extractions of relation R ↔ a document; domain/range of R ↔ its topics
Relations as Documents
TextRunner extractions, grouped by relation:
• born_in: (Sergey Brin, Moscow), (Bill Gates, Seattle), (Einstein, March), (Sergey Brin, 1973), (Einstein, Ulm)
• founded_in: (Google, 1998), (Microsoft, Albuquerque), (Microsoft, 1973)
• headquartered_in: (Microsoft, Redmond), (Google, Mountain View)
Generative Story [LinkLDA, Erosheva et al. 2004]
• For each relation (e.g., X born_in Y), randomly pick a distribution over types: P(Topic1 | born_in) = 0.5, P(Topic2 | born_in) = 0.3, …
• For each extraction, pick a type for arg1 and a type for arg2 (two separate sets of type distributions), e.g. Person born_in Location
• Then pick the arguments based on their types, e.g. Sergey Brin born_in Moscow
A toy version of this analogy is sketched below.
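A minimal sketch of the “relations as documents” analogy, using plain LDA from gensim in place of the LinkLDA model on the slide: each relation becomes a document whose words are its argument strings, and the learned topics stand in for argument types.

```python
# Sketch: treat each relation as a "document" of its arg2 strings and
# let LDA topics approximate argument types (domain/range). Plain LDA,
# not LinkLDA; the data is a toy version of the previous slide.
from gensim import corpora, models

relation_args = {
    "born_in": ["moscow", "seattle", "ulm", "1973", "march"],
    "founded_in": ["1998", "1973", "albuquerque"],
    "headquartered_in": ["redmond", "mountain_view"],
}

docs = list(relation_args.values())
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=50, random_state=0)

for rel, bow in zip(relation_args, corpus):
    # Per-relation topic mixture, analogous to P(type | relation).
    print(rel, lda.get_document_topics(bow))
```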
Examples of Learned Domain/Range
• elect(Country, Person)
• predict(Expert, Event)
• download(People, Software)
• invest(People, Assets)
• was-born-in(Person, Location OR Date)
Summary: Trajectory of Open IE
openie.cs.washington.edu
IV. Future: Open Open IE
• Open input: ingest tuples from any source as (Tuple, Source, Confidence); see the sketch below
• Linked open output:
  • Extractions → Linked Open Data (LOD) cloud
  • Relation normalization
  • Use LOD best practices
• Specialized reasoners
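A minimal sketch of what such a uniform input record might look like; the field names are illustrative, not a published schema.

```python
# Hypothetical record for "open input": a tuple from any source,
# carrying provenance and confidence. Illustrative field names only.
from dataclasses import dataclass

@dataclass(frozen=True)
class OpenTuple:
    arg1: str
    relation: str
    arg2: str
    source: str      # e.g., a URL or corpus identifier
    confidence: float

t = OpenTuple("oranges", "contain", "Vitamin C",
              source="http://example.com/nutrition", confidence=0.93)
print(t)
```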
Conclusions
• An ontology is not necessary for reasoning
• Open IE is “gracefully” ontologized
• Open IE is boosting text analysis
• LOD has distribution & scale (but not text) = opportunity
Thank you
Questions
• Why open?
• What’s next?
• Dimensions for analyzing systems
• What’s worked, what’s failed? (lessons)
• What can we learn from Watson?
• What can we learn from DB/KR? (Alon)
Questions
• What extraction mechanism is used?
• What corpus?
• What input knowledge?
• Role for people / manual labeling?
• Form of the extracted knowledge?
• Size/scope of extracted knowledge?
• What reasoning is done?
• Most unique aspect?
• Biggest challenge?
Scalability Notes
• Interoperability and distributed authorship vs. a monolithic system
• Open IE meets RDF:
  • Need URIs for predicates. How to obtain them?
  • What about errors in mapping to URIs?
  • Ambiguity? Uncertainty?
Reasoning
• NELL: inter-class constraints to generate negative examples
Dimensions of Scalability
• Corpus size
• Syntactic coverage over text
• Semantic coverage over text
• Time, belief, n-ary relations, etc.
• Number of entities, relations
• Ability to reason
• How much CPU?
• How much manual effort?
• Bounding, ceiling effect, ontological glass ceiling
Examples of Limiting Assumptions
• NELL: “apple” has a single meaning
  • Single atom per entity
  • Global computation to add an entity
  • Can’t be sure
• LOD:
  • Best practice
  • Same-as links
Risks for a Scalable System
• Limited semantics, reasoning
• No reasoning…
LOD triples in Aug 2011: 31,634,213,770
The following statement appears in the last paragraph of the W3C Linked Library Data Group Final Report:
“… Linked Data follows an open-world assumption: the assumption that data cannot generally be assumed to be complete and that, in principle, more data may become available for any given entity.”
Entity Linking an Extraction Corpus
Example: “Einstein quit his job at the patent office” (8 source sentences)
1. String Match: obtain candidate entities from Wikipedia article texts and measure string similarity; an exact string match is the best match, but also consider known aliases, alternate capitalization, edit distance, word overlap, substring/superstring, and potential abbreviations. Candidates: US Patent Office, EU Patent Office, Japan Patent Office, Swiss Patent Office, Patent
2. Prominence Prior: the number of links in Wikipedia to the entity’s article (1,281; 168; 56; 101; and 4,620 inlinks for the candidates above)
3. Context Match: cosine similarity between the “document” formed from the extraction’s source sentences (“Einstein quit his job at the patent office.”, “In 1909, Einstein quit his job at the patent office.”, …) and each candidate’s article; collective linking vs. linking one extraction at a time
Link Score is a function of (String Match Score, Prominence Prior Score, Context Match Score), e.g. String Match Score × ln(Prominence Prior Score) × Context Match Score; here the very high context match picks out the Swiss Patent Office despite its lower prominence.
Link Ambiguity ∝ 2nd-highest Link Score / top Link Score
A 2.53 GHz machine links 15 million text arguments in ~3 days (60+ per second): faster and higher precision.
A worked sketch of the score combination follows.
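A worked sketch of the example score combination from the slide, with illustrative numbers rather than the system’s actual scores:

```python
# Sketch of the slide's example link-score combination:
# stringMatch * ln(prominence) * contextMatch. Scores are made up
# to reproduce the slide's outcome (Swiss Patent Office wins).
import math

candidates = {
    # name: (string_match, inlink_count, context_match)
    "US Patent Office":    (0.6, 1281, 0.1),
    "Swiss Patent Office": (0.6,  101, 0.9),
    "Patent":              (0.3, 4620, 0.1),
}

def link_score(string_match, inlinks, context_match):
    return string_match * math.log(inlinks) * context_match

scored = sorted(((link_score(*v), name) for name, v in candidates.items()),
                reverse=True)
best, runner_up = scored[0], scored[1]
print("top link:", best[1])            # Swiss Patent Office
# Link ambiguity: how close the runner-up is to the top choice.
print("ambiguity:", runner_up[0] / best[0])
```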
Q/A with Linked Extractions
Leverages KBs by linking textual arguments to entities found in the knowledge base.
• Ambiguous entities: “I need to learn about Titanic the ship for my homework.” Linking separates extractions about the ship (“The Titanic set sail from Southampton”, “RMS Titanic weighed about 26 kt”, “The Titanic was built for safety and comfort”, “The Titanic sank in 12,460 feet of water”, 1,902 more) from the rest (“Titanic earned more than $1 billion worldwide”, “The Titanic sank in 1912”, “The Titanic was released in 1998”, “Titanic represents the state-of-the-art in special effects”, “Titanic was built in Belfast”, 3,761 more)
• Typed search: “Which sports originated in China?” From all “X originated in China” extractions (noodles, printmaking, soy beans, Wushu, Taoism, Ping Pong, 534 more), Freebase’s Sports type selects Golf, Soccer, Karate, Dragon Boating, Wushu, Ping Pong (14 more)
• Linked resources: Freebase aliases connect surface forms, e.g. “Dragon Boating” ↔ “Dragon Boat Racing”, “Ping Pong” ↔ “Table Tennis”
Linked Extractions Support Reasoning
In addition to question answering, linking can also benefit:
• Functions [Ritter et al., 2008; Lin et al., 2010]
• Other relation properties [Popescu 2007; Lin et al., CSK 2010]
• Inference [Schoenmackers et al., 2008; Berant et al., 2011]
• Knowledge-base population [Dredze et al., 2010]
• Concept-level annotations [Christensen and Pasca, 2012]
• … basically anything using the output of extraction
Other Web text containing entities (e.g., query logs) can also be linked to enable new experiences.
Challenges
Single-sentence extraction:
• He believed the plan will work
• John Glenn was the first American in space
• Obama was elected President in 2008.
• American president Barack Obama asserted…
• ??