Sparse Information Extraction: Unsupervised Language Models to the Rescue Doug Downey, Stef Schoenmackers, Oren Etzioni Turing Center University of Washington
Answering Questions on the Web Q: Who has won a best actor Oscar for playing a villain? Q: Which nanotechnology companies are hiring? Q: What’s the general consensus on the IBM T40? Q: What kills bacteria? . . . No single Web page contains the answer.
Open Information Extraction
• Compile time:
  • Parse every sentence on the Web
  • Extract key information
• Query time:
  • Synthesize extractions in response to queries
Challenges:
• Topics of interest not known in advance
• No hand-tagged examples
TextRunner [Banko et al 2007]
At compile time…
"…and when Thomas Edison invented the light bulb around the early 1900s…"
"…end of the 19th century when Thomas Edison and Joseph Swan invented a light bulb using carbon fiber…"
…
=> Invented(Thomas Edison, light bulb)
TextRunner [Banko et al 2007] answers in real time – e.g., the query "invented".
Live demo at: www.cs.washington.edu/research/textrunner
Problem: Sparse Extractions
Frequent extractions, e.g., (Thomas Edison, light bulb), tend to be correct.
Sparse extractions, e.g., (A. Church, lambda calculus) or (drug companies, diseases), are a mixture of correct and incorrect.
Assessing Sparse Extractions
Task: Identify which sparse extractions are correct.
Challenge: No hand-tagged examples.
Strategy:
• Build a model of how common extractions occur in text
• Rank sparse extractions by fit to the model
• The distributional hypothesis: elements of the same relation tend to appear in similar contexts. [Brin, 1998; Riloff & Jones 1999; Agichtein & Gravano, 2000; Etzioni et al. 2005; Pasca et al. 2006; Pantel et al. 2006]
Our contribution: Unsupervised language models.
• Methods for mitigating sparsity
• Precomputed – scalable to Open IE
The REALM Architecture
RElation Assessment using Language Models
Input: Set of extractions for relation R: E_R = {(arg1_1, arg2_1), …, (arg1_M, arg2_M)}
• Seeds: S_R = the s most frequent pairs in E_R (assume these are correct)
• Output: ranking of (arg1, arg2) ∈ E_R − S_R by distributional similarity to the seed pairs (seed1, seed2) in S_R
Distributional Similarity
Naïve approach – find sentences containing seed1 & seed2 or arg1 & arg2, then compare the context distributions:
P(w_b, …, w_e | seed1, seed2) vs. P(w_b, …, w_e | arg1, arg2)
But e − b can be large => many parameters and sparse data => inaccuracy.
N-gram Language Models
Compute phrase probabilities over n consecutive words: P(w_i, …, w_{i+n−1})
Obtained by counting over a corpus: more frequent phrases receive higher probability.
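To make the counting idea concrete, here is a minimal sketch (not the authors' implementation) of estimating n-gram phrase probabilities by relative frequency over a tokenized corpus; the function names and the toy corpus are illustrative assumptions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every length-n phrase (n-gram) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def phrase_probability(corpus_tokens, phrase):
    """Estimate P(w_i, ..., w_{i+n-1}) as the phrase's relative frequency
    among all n-grams of the same length in the corpus."""
    n = len(phrase)
    counts = ngram_counts(corpus_tokens, n)
    total = sum(counts.values())
    return counts[tuple(phrase)] / total if total else 0.0

# Toy corpus: the model assigns higher probability to more frequent phrases.
corpus = "cities such as Seattle and cities such as Chicago".split()
print(phrase_probability(corpus, ["cities", "such", "as"]))   # seen twice
print(phrase_probability(corpus, ["such", "as", "Seattle"]))  # seen once
```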
Distributional Similarity in REALM
Two steps for assessing R(arg1, arg2):
• Typechecking: ensure arg1 and arg2 are of the proper type for R – e.g., rejects MayorOf(Intel, Santa Clara). Leverages all occurrences of each argument.
• Relation Assessment: ensure R actually holds between arg1 and arg2 – e.g., rejects MayorOf(Giuliani, Seattle).
Both steps use pre-computed language models => scales to Open IE.
Typechecking and HMM-T
Task: For each extraction (arg1, arg2) ∈ E_R, determine whether arg1 and arg2 are of the proper type for R.
Solution: Assume the seed_j ∈ S_R are of the proper type, and rank each arg_j by distributional similarity to the seed_j.
Computing distributional similarity:
• Offline, train a Hidden Markov Model (HMM) of the corpus
• At query time, measure the distance between arg_j and seed_j in the HMM's N-dimensional latent state space.
HMM Language Model
k = 1 case: latent states t_i, t_{i+1}, t_{i+2}, t_{i+3} generate the observed words w_i, w_{i+1}, w_{i+2}, w_{i+3} (e.g., "cities such as Seattle").
Offline training: learn P(w | t) and P(t_i | t_{i−1}, …, t_{i−k}) to maximize the probability of the corpus (using EM).
HMM-T
The trained HMM gives a "distributional summary" of each word w: an N-dimensional latent-state distribution P(t | w).
Typecheck each argument by comparing state distributions: rank extractions in ascending order of f(arg), a distance between the argument's and the seeds' state distributions, summed over the arguments (sketched below).
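The slide does not spell out f, so the following is only a hedged sketch: it assumes f(arg) is the KL divergence from the nearest seed's state distribution P(t | seed) to P(t | arg), with per-argument scores summed as described above. The function names and the choice of KL are assumptions, not the paper's exact formula.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the N latent HMM states,
    given as equal-length lists of probabilities (smoothed to avoid log 0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def type_score(arg_dist, seed_dists):
    """Distance from an argument's state distribution P(t | arg) to the closest
    seed's distribution; lower means a better type match (assumed form of f)."""
    return min(kl_divergence(seed_dist, arg_dist) for seed_dist in seed_dists)

def rank_by_type(extractions, state_dist, seeds1, seeds2):
    """Rank (arg1, arg2) pairs in ascending order of f(arg1) + f(arg2),
    where state_dist maps each word to its latent-state distribution."""
    def f(pair):
        a1, a2 = pair
        return (type_score(state_dist[a1], [state_dist[s] for s in seeds1]) +
                type_score(state_dist[a2], [state_dist[s] for s in seeds2]))
    return sorted(extractions, key=f)
```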
Previous n-gram technique (1)
1) Form a context vector for each extracted argument from the n-grams surrounding it, e.g., for Chicago in "… cities such as Chicago , Boston , …", "But Chicago isn't the best …", and "… Los Angeles and Chicago . …": contexts such as "such as <x> , Boston", "But <x> isn't the", "Angeles and <x> ."
2) Compute dot products between extractions and seeds in this context space [cf. Ravichandran et al. 2005].
Previous n-gram technique (2)
Example: Miami's contexts ("visited X and other", "X and other cities", …) versus Twisp's ("when he visited X", "he visited X and", …).
Problems:
• Vectors are large
• Intersections are sparse
(A sketch of this baseline follows.)
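A minimal sketch of the baseline just described (illustrative only; the sentences, window length n, and function names are assumptions): build a sparse context vector for each argument by replacing it with <x> in every n-gram that contains it, then compare arguments by dot product. For rare arguments like Twisp, the vectors rarely share exact contexts, so the dot product is often 0.

```python
from collections import Counter

def context_vector(sentences, word, n=4):
    """Sparse context vector: counts of every length-n window that contains the
    word, with the word itself replaced by the placeholder <x>."""
    vec = Counter()
    for sent in sentences:
        toks = sent.split()
        for i in range(len(toks) - n + 1):
            window = toks[i:i + n]
            if word in window:
                vec[" ".join("<x>" if t == word else t for t in window)] += 1
    return vec

def dot(v1, v2):
    """Similarity = dot product over the contexts the two vectors share."""
    return sum(count * v2[ctx] for ctx, count in v1.items())

sentences = ["tourists visited Miami and other cities last summer",
             "when he visited Twisp he stopped for lunch"]
miami = context_vector(sentences, "Miami")
twisp = context_vector(sentences, "Twisp")
print(sorted(miami))      # e.g. 'visited <x> and other', '<x> and other cities', ...
print(dot(miami, twisp))  # 0 – the sparse vectors share no exact contexts
```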
Compressing Context Vectors
Instead of Miami's sparse context vector, use the latent state distribution P(t | Miami) over states t = 1, 2, …, N.
The latent state distribution P(t | w) is:
• Compact (efficient – 10-50x less data retrieved)
• Dense (accurate – 23-46% error reduction)
Example: N-Grams on Sparse Data
Is Pickerington of the same type as Chicago? The corpus contains "Chicago , Illinois" and "Pickerington , Ohio".
Chicago's context vector contains "<x> , Illinois"; Pickerington's contains "<x> , Ohio". The vectors share no contexts => n-grams say no, the dot product is 0!
Example: HMM-T on Sparse Data
The HMM generalizes: in "Chicago , Illinois" and "Pickerington , Ohio", Illinois and Ohio are assigned similar latent states, so Chicago and Pickerington end up with similar state distributions and typecheck as the same type.
HMM-T Limitations
Learning iterations take time proportional to (corpus size × T^(k+1)), where T = number of latent states and k = HMM order.
We use limited values T = 20, k = 3 (a transition table of 20^4 = 160,000 entries):
• Sufficient for typechecking (Santa Clara is a city)
• Too coarse for relation assessment (Santa Clara is where Intel is headquartered)
Relation Assessment
• Type checking isn't enough: "NY Mayor Giuliani toured downtown Seattle." has arguments of the right types, yet MayorOf(Giuliani, Seattle) does not hold.
• Want: how do the arguments behave in relation to each other?
REL-GRAMS (1)
A standard n-gram language model, P(w_i, w_{i−1}, …, w_{i−k}), is a poor fit: arg1 and arg2 are often far apart, so k must be large => inaccurate.
REL-GRAMS (2)
Relational Language Model (REL-GRAMS): for any two arguments e1, e2, model P(w_i, w_{i−1}, …, w_{i−k} | w_i = e1, e1 near e2).
k can be small – REL-GRAMS still captures entity relationships.
• Mitigate sparsity with the BM25 metric (from IR)
• Combine with HMM-T by multiplying ranks.
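The combination step is simple: multiply each extraction's rank under HMM-T by its rank under REL-GRAMS and sort by the product. A minimal sketch of that step (the example pairs and the tie-breaking behavior are illustrative assumptions):

```python
def combine_by_rank_product(ranking_a, ranking_b):
    """Combine two rankings of the same extractions (best first) by multiplying
    each extraction's 1-based ranks; a lower product means a better final rank."""
    rank_a = {x: i + 1 for i, x in enumerate(ranking_a)}
    rank_b = {x: i + 1 for i, x in enumerate(ranking_b)}
    return sorted(ranking_a, key=lambda x: rank_a[x] * rank_b[x])

# Hypothetical rankings for MayorOf from the two assessors (pairs are illustrative):
hmm_t_ranking     = [("Giuliani", "New York"), ("Giuliani", "Seattle"), ("Intel", "Santa Clara")]
rel_grams_ranking = [("Giuliani", "New York"), ("Intel", "Santa Clara"), ("Giuliani", "Seattle")]
print(combine_by_rank_product(hmm_t_ranking, rel_grams_ranking))
```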
Experiments
Task: Re-rank sparse TextRunner extractions for Conquered, Founded, Headquartered, and Merged.
REALM vs.:
• TextRunner (TR) – frequency ordering (equivalent to PMI [Etzioni et al, 2005] and Urns [Downey et al, 2005])
• Pattern Learning (PL) – based on Snowball [Agichtein 2000]
• HMM-T and REL-GRAMS in isolation
Results Metric: Area under precision-recall curve. REALM reduces missing area by 39% over nearest competitor.
Conclusions
Sparse extractions are common, even on the Web.
Language models can assess sparse extractions:
• Accurate
• Scalable
Future work:
• Other language modeling techniques
Web Fact-Finding Who has won three or more Academy Awards?
Web Fact-Finding
Problem: the user has to pick the right words, often a tedious process:
"world foosball champion in 1998" – 0 hits
"world foosball champion" 1998 – 2 hits, no answer
What if I could just ask for P(x) in "x was world foosball champion in 1998"?
How far can language modeling and the distributional hypothesis take us?
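As a rough illustration of that idea (purely hypothetical, not part of REALM): rank candidate answers x by the probability of the filled-in phrase under a language model such as the count-based one sketched earlier. The helper name and candidate list are placeholders.

```python
def rank_candidates(phrase_prob, template, candidates):
    """Rank candidate fillers x by the probability of the template with <x> replaced
    by x. `phrase_prob` is any function scoring a list of tokens (e.g., an n-gram model)."""
    return sorted(candidates,
                  key=lambda x: phrase_prob(template.replace("<x>", x).split()),
                  reverse=True)

# Hypothetical usage (candidate names are placeholders, not real answers):
# rank_candidates(ngram_model, "<x> was world foosball champion in 1998",
#                 ["candidate A", "candidate B"])
```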
KnowItAll Hypothesis vs. Distributional Hypothesis
Example contexts: "X and other cities", "he visited X and", "cities such as X", "X soundtrack", "X lodging"; example entities: Miami, Twisp, Star Wars.
TextRunner in real time: the query "invent", ranked by frequency.
REALM improves precision of the top 20 extractions by an average of 90%.
Improving TextRunner: Example (1) – "headquartered", top 10
TextRunner top 10 (precision 40%): company, Palo Alto; held company, Santa Cruz; storage hardware and software, Hopkinton; Northwestern Mutual, Tacoma; 1997, New York City; Google, Mountain View; PBS, Alexandria; Linux provider, Raleigh; Red Hat, Raleigh; TI, Dallas
REALM top 10 (precision 100%): Tarantella, Santa Cruz; International Business Machines Corporation, Armonk; Mirapoint, Sunnyvale; ALD, Sunnyvale; PBS, Alexandria; General Dynamics, Falls Church; Jupitermedia Corporation, Darien; Allegro, Worcester; Trolltech, Oslo; Corbis, Seattle
Improving TextRunner: Example (2) – "conquered", top 10
TextRunner top 10 (precision 60%): Great, Egypt; conquistador, Mexico; Normans, England; Arabs, North Africa; Great, Persia; Romans, part; Romans, Greeks; Rome, Greece; Napoleon, Egypt; Visigoths, Suevi Kingdom
REALM top 10 (precision 90%): Arabs, Rhodes; Arabs, Istanbul; Assyrians, Mesopotamia; Great, Egypt; Assyrians, Kassites; Arabs, Samarkand; Manchus, Outer Mongolia; Vandals, North Africa; Arabs, Persia; Moors, Lagos