Graph-Based Methods for "Open Domain" Information Extraction
William W. Cohen
Machine Learning Dept. and Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Joint work with Richard Wang
Traditional IE vs. Open Domain IE
Traditional IE:
• Goal: recognize people, places, companies, times, dates, … in NL text
• Supervised learning from a corpus completely annotated with the target entity class (e.g., "people")
• Linear-chain CRFs
• Language- and genre-specific extractors
Open Domain IE:
• Goal: recognize arbitrary entity sets in text, given minimal information about the entity class
  • Example 1: "ICML, NIPS"
  • Example 2: "machine learning conferences"
• Semi-supervised learning from very large corpora (WWW)
• Graph-based learning methods
• Techniques are largely language-independent (!): the graph abstraction fits many languages
Outline • History • Open-domain IE by pattern-matching • The bootstrapping-with-noise problem • Bootstrapping as a graph walk • Open-domain IE as finding nodes “near” seeds on a graph • Set expansion - from a few clean seeds • Iterative set expansion – from many noisy seeds • Relational set expansion • Multilingual set expansion • Iterative set expansion – from a concept name alone
History: Open-domain IE by pattern-matching (Hearst, 92)
• Start with seeds: "NIPS", "ICML"
• Look through a corpus for certain patterns:
  • … "at NIPS, AISTATS, KDD and other learning conferences…"
  • … "on PC of KDD, SIGIR, … and…"
• Expand from seeds to new instances
• Repeat… until ___
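To make the pattern-matching step concrete, here is a minimal sketch of one bootstrapping pass over a single "X, Y and other Z" pattern; the regex and function names are my own illustrations, not Hearst's or the talk's implementation:

```python
# Minimal sketch of one bootstrapping step over an "X, Y and other Z" pattern.
# The regex and names are illustrative assumptions, not the talk's code.
import re

PATTERN = re.compile(r"at ((?:[A-Z]\w+, )*[A-Z]\w+) and other (\w+ \w+)")

def expand_once(sentences, seeds):
    found = set(seeds)
    for sent in sentences:
        m = PATTERN.search(sent)
        if m:
            items = set(m.group(1).split(", "))
            if items & found:            # pattern anchored by a known seed
                found |= items           # absorb the co-listed instances
    return found

print(expand_once(
    ["... at NIPS, AISTATS, KDD and other learning conferences ..."],
    {"NIPS", "ICML"}))
# -> {'NIPS', 'ICML', 'AISTATS', 'KDD'}
```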
Bootstrapping as graph proximity
[Figure: a graph linking instance nodes (NIPS, AISTATS, KDD, SIGIR, SNOWBIRD, …) through the pattern contexts that mention them: "…at NIPS, AISTATS, KDD and other learning conferences…", "on PC of KDD, SIGIR, … and…", "For skiers, NIPS, SNOWBIRD,… and…"]
• Shorter paths ~ earlier iterations
• Many paths ~ additional evidence
Set Expansion for Any Language (SEAL) – (Wang & Cohen, ICDM 07) • Basic ideas • Dynamically build the graph using queries to the web • Constrain the graph to be as useful as possible • Be smart about queries • Be smart about “patterns”: use clever methods for finding meaningful structure on web pages
System Architecture
• Fetcher: download web pages from the Web that contain all the seeds
• Extractor: learn wrappers from web pages
• Ranker: rank entities extracted by wrappers
Example: seeds {Canon, Nikon, Olympus} expand to {Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …}
The Extractor • Learn wrappers from web documents and seeds on the fly • Utilize semi-structured documents • Wrappers defined at character level • Very fast • No tokenization required; thus language independent • Wrappers derived from doc d applied to d only • See ICDM 2007 paper for details
.. Generally <a href="finance/ford">Ford</a> sales … compared to <a href="finance/honda">Honda</a> while <a href="finance/gm">General Motors</a> and <a href="finance/bentley">Bentley</a> ….
• Find the prefix preceding each seed occurrence and write it in reverse order:
  • ford1: /ecnanif"=ferh a< yllareneG …
  • ford2: >"drof/ecnanif"=ferh a< yllareneG …
  • honda1: /ecnanif"=ferh a< ot derapmoc …
  • honda2: >"adnoh/ecnanif"=ferh a< ot derapmoc …
• Organize these into a trie, tagging each node with the set of seed occurrences it covers.
[Figure: the trie over the reversed prefixes; the root covers {f1, f2, h1, h2}, the node for >" covers {f2, h2} before branching to {f2} (drof/…) and {h2} (adnoh/…), and the node for /ecnanif"=ferh a< covers {f1, h1} before branching to {f1} (… yllareneG) and {h1} (… ot derapmoc)]
A left context for a valid wrapper is a trie node tagged with at least one instance of each seed: here, the node for /ecnanif"=ferh a< (tagged {f1, h1}) and the node for >" (tagged {f2, h2}).
The corresponding right context is the longest common prefix of the text following those seed instances. For the node tagged {f2, h2}, the instances are followed by "</a> sales …" and "</a> while …", so the right context is </a>.
Nice properties:
• There are relatively few nodes in the trie: O((#seeds) × (document length))
• You can tag every node with the complete set of seeds that it covers
• You can rank or filter nodes by any predicate over this set of seeds, e.g.:
  • covers all seed instances that appear on the page?
  • covers at least one instance of each seed?
  • covers at least k instances, instances with weight > w, …
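The construction above is compact enough to sketch end to end. The following is a hedged illustration with helper names of my own choosing: instead of materializing the trie, it grows shared reversed prefixes recursively (equivalent for this purpose), keeps the maximal nodes covering at least one instance of each seed, and pairs each left context with the longest common prefix of the text following the covered instances:

```python
# Compact sketch of character-level wrapper learning on one document.
# Helper names are mine; the real SEAL extractor builds an explicit trie.
from collections import defaultdict

def occurrences(doc, seeds, max_ctx=60):
    """(seed, reversed left context, end position) for each seed occurrence."""
    occs = []
    for seed in seeds:
        i = doc.find(seed)
        while i >= 0:
            occs.append((seed, doc[max(0, i - max_ctx):i][::-1], i + len(seed)))
            i = doc.find(seed, i + 1)
    return occs

def common_prefix(strings):
    out = []
    for chars in zip(*strings):
        if len(set(chars)) > 1:
            break
        out.append(chars[0])
    return "".join(out)

def learn_wrappers(doc, seeds, max_ctx=60):
    """Yield (left, right) contexts covering >= 1 occurrence of each seed."""
    wrappers = []
    all_seeds = set(seeds)

    def grow(depth, items):
        # items share a reversed prefix of length `depth`; extend while every
        # child node still covers at least one occurrence of each seed
        children = defaultdict(list)
        for occ in items:
            if len(occ[1]) > depth:
                children[occ[1][depth]].append(occ)
        extended = False
        for sub in children.values():
            if {s for s, _, _ in sub} == all_seeds:
                grow(depth + 1, sub)
                extended = True
        if not extended and depth > 0:
            left = items[0][1][:depth][::-1]                 # un-reverse
            right = common_prefix([doc[e:e + max_ctx] for _, _, e in items])
            wrappers.append((left, right))

    occs = occurrences(doc, seeds, max_ctx)
    if {s for s, _, _ in occs} == all_seeds:     # every seed must appear
        grow(0, occs)
    return wrappers
```

Applying a learned wrapper back to the same page is then just a match for left + candidate + right; on the Ford/Honda example above, this sketch recovers the left context "> and the right context </a>.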
[Figure: screenshot of an extraction result list with noisy items flagged: "I am noise", "Me too!"]
Differences from prior work • Fast character-level wrapper learning • Language-independent • Trie structure allows flexibility in goals • Cover one copy of each seed, cover all instances of seeds, … • Works well for semi-structured pages • Lists and tables, pull-down menus, JavaScript data structures, Word documents, … • High-precision, low-recall data integration vs. high-precision, low-recall information extraction
The Ranker • Rank candidate entity mentions based on “similarity” to seeds • Noisy mentions should be ranked lower • Random Walk with Restart (GW) • …?
Google's PageRank
• Inlinks are "good" (recommendations)
• Inlinks from a "good" site are better than inlinks from a "bad" site
  • but inlinks from sites with many outlinks are not as "good"…
• "Good" and "bad" are relative.
[Figure: a small graph of web sites linking to one another]
Google's PageRank
• Imagine a "pagehopper" that always either
  • follows a random link, or
  • jumps to a random page
Google's PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html)
• Imagine a "pagehopper" that always either
  • follows a random link, or
  • jumps to a random page
• PageRank ranks pages by the amount of time the pagehopper spends on a page:
  • or, if there were many pagehoppers, PageRank is the expected "crowd size"
Personalized PageRank (aka Random Walk with Restart)
• Imagine a "pagehopper" that always either
  • follows a random link, or
  • jumps to a particular page
Personalized PageRank / Random Walk with Restart
• Imagine a "pagehopper" that always either
  • follows a random link, or
  • jumps to a particular page P0
• This ranks pages by the total number of paths connecting them to P0
  • … with each path downweighted exponentially with length
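In equations (a standard RWR formulation; the notation is mine, not the slides'): with W the row-normalized link matrix, e_0 the indicator vector of the start page P0, and α the restart probability, the pagehopper's stationary distribution p satisfies the recurrence below, and unrolling it makes the exponential path-length downweighting explicit:

```latex
\mathbf{p} = (1-\alpha)\, W^{\top}\mathbf{p} + \alpha\, \mathbf{e}_0
\qquad\Longrightarrow\qquad
\mathbf{p} = \alpha \sum_{k \ge 0} (1-\alpha)^{k}\, (W^{\top})^{k}\, \mathbf{e}_0
```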
The Ranker • Rank candidate entity mentions based on “similarity” to seeds • Noisy mentions should be ranked lower • Random Walk with Restart (GW) • On what graph?
Building a Graph
• A graph consists of a fixed set of…
  • Node types: {seeds, document, wrapper, mention}
  • Labeled directed edges: {find, derive, extract}
    • Each edge asserts that a binary relation r holds
    • Each edge has an inverse relation r-1 (so the graph is cyclic)
• Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions
[Figure: the seeds "ford", "nissan", "toyota" find documents such as northpointcars.com and curryauto.com, which derive Wrappers #1–#4, which in turn extract mentions: "honda" 26.1%, "chevrolet" 22.5%, "acura" 34.6%, "volvo chicago" 8.4%, "bmw pittsburgh" 8.4%]
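A hedged sketch of ranking by Random Walk with Restart over this graph; the edge list, plain power-iteration solver, and node names are illustrative assumptions rather than SEAL's actual implementation:

```python
import numpy as np

def rwr_scores(edges, start_nodes, alpha=0.15, iters=50):
    """edges: (src, dst) pairs; every edge also gets its inverse edge."""
    nodes = sorted({n for e in edges for n in e})
    idx = {n: i for i, n in enumerate(nodes)}
    W = np.zeros((len(nodes), len(nodes)))
    for a, b in edges:
        W[idx[a], idx[b]] = 1.0   # forward relation (find/derive/extract)
        W[idx[b], idx[a]] = 1.0   # inverse relation r-1 makes the graph cyclic
    W /= W.sum(axis=1, keepdims=True)           # row-normalize out-degrees
    e0 = np.zeros(len(nodes))
    for s in start_nodes:
        e0[idx[s]] = 1.0 / len(start_nodes)
    p = e0.copy()
    for _ in range(iters):
        p = (1 - alpha) * W.T @ p + alpha * e0  # restart toward the seeds
    return {n: p[idx[n]] for n in nodes}

# Toy run mirroring the slide's example graph:
edges = [("seeds", "northpointcars.com"), ("seeds", "curryauto.com"),
         ("northpointcars.com", "wrapper1"), ("curryauto.com", "wrapper2"),
         ("wrapper1", "honda"), ("wrapper1", "chevrolet"),
         ("wrapper2", "chevrolet"), ("wrapper2", "acura")]
print(rwr_scores(edges, ["seeds"]))
```

Mentions reached by many short paths from the seeds (here "chevrolet", extracted by both wrappers) score higher than mentions reached by a single path.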
Differences from prior work • Graph-based distances vs. bootstrapping • Graph constructed on-the-fly • So it’s not different? • But there is a clear principle about how to combine results from earlier/later rounds of bootstrapping • i.e., graph proximity • Fewer parameters to consider • Robust to “bad wrappers”
Evaluation Method
• Mean Average Precision (MAP)
  • Commonly used for evaluating ranked lists in IR
  • Contains recall- and precision-oriented aspects
  • Sensitive to the entire ranking
  • Mean of the average precisions of the ranked lists
• For a ranked list L of extracted mentions:
  AvgPrec(L) = (1 / #TrueEntities) × Σ_r Prec(r) × NewEntity(r)
  where #TrueEntities is the total number of true entities in the dataset, Prec(r) is the precision at rank r, and NewEntity(r) = 1 iff (a) the mention extracted at rank r matches some true mention and (b) no other extracted mention at a rank less than r is of the same entity as the one at r.
• Evaluation procedure (per dataset):
  1. Randomly select three true entities and use their first listed mentions as seeds
  2. Expand the three seeds obtained from step 1
  3. Repeat steps 1 and 2 five times
  4. Compute MAP over the five ranked lists
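A hedged sketch of the per-list average precision defined above; representing the gold standard as a dict from correct mention strings to entities is my simplification of the actual evaluation harness:

```python
def average_precision(ranked_mentions, gold, num_true_entities):
    """gold maps each correct mention string to its true entity."""
    seen, num_correct, ap = set(), 0, 0.0
    for r, mention in enumerate(ranked_mentions, start=1):
        entity = gold.get(mention)        # None for an incorrect mention
        if entity is not None:
            num_correct += 1              # feeds Prec(r) = num_correct / r
            if entity not in seen:        # conditions (a) and (b) both hold
                seen.add(entity)
                ap += num_correct / r
    return ap / num_true_entities

# e.g., gold = {"honda": "H", "acura": "A"}, two true entities:
# average_precision(["honda", "fordd", "acura"], gold, 2) -> (1/1 + 2/3)/2
```

MAP is then the mean of this value over the five ranked lists.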
Experimental Results: 3 seeds • Vary: [Extractor] + [Ranker] + [Top N URLs] • Extractor: • E1: Baseline Extractor (longest common context for all seed occurrences) • E2: Smarter Extractor (longest common context for 1 occurrence of each seed) • Ranker: { EF: Baseline (Most Frequent), GW: Graph Walk } • N URLs: { 100, 200, 300 }
Side-by-side comparisons
[Figure: comparison against Talukdar, Brants, Liberman & Pereira, CoNLL 2006]
Side-by-side comparisons
[Figure: comparison against Bayesian Sets (Ghahramani & Heller, NIPS 2005): EachMovie vs. WWW and NIPS vs. WWW]
Why does SEAL do so well?
• Free-text wrappers are only 10–15% of all wrappers learned, e.g.: "Used [...] Van Pricing", "Used [...] Engines", "Bell Road [...]", "Alaska [...] dealership", "www.sunnyking[...].com", "engine [...] used engines", "accessories, [...] parts", "is better [...] or"
• Hypotheses:
  • More information appears in semi-structured documents than in free text
  • More semi-structured documents can be (partially) understood with character-level wrappers than with HTML-level wrappers
Outline • History • Open-domain IE by pattern-matching • The bootstrapping-with-noise problem • Bootstrapping as a graph walk • Open-domain IE as finding nodes "near" seeds on a graph • Set expansion - from a few clean seeds • Iterative set expansion – from many noisy seeds • Relational set expansion • Multilingual set expansion • Iterative set expansion – from a concept name alone
Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008) • Makes several calls to SEAL; each call… • Expands a couple of seeds • Aggregates statistics • Evaluate iSEAL using… • Two iterative processes • Supervised vs. Unsupervised (Bootstrapping) • Two seeding strategies • Fixed Seed Size vs. Increasing Seed Size • Five ranking methods
iSEAL (Fixed Seed Size, Supervised) • Start from the initial seeds; each call to SEAL expands a small batch of them • … finally, rank nodes by proximity to the seeds in the full graph (a sketch of this loop follows below) • Refinement (ISS): increase the size of the seed set for each expansion over time: 2, 3, 4, 4, … • Variant (Bootstrap): use high-confidence extractions when seeds run out
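As a control-flow sketch, assuming a seal_expand(seeds) placeholder that runs one SEAL call and returns graph edges, and reusing the rwr_scores ranker sketched earlier:

```python
import random

def iseal(initial_seeds, seal_expand, rounds=5, batch=2):
    edges = []
    for _ in range(rounds):
        # Fixed Seed Size, supervised: every call expands `batch` trusted seeds.
        # ISS variant: grow `batch` over the rounds (2, 3, 4, 4, ...).
        # Bootstrap variant: when trusted seeds run out, draw instead from
        # the highest-confidence extractions found so far.
        seeds = random.sample(initial_seeds, batch)
        edges.extend(seal_expand(seeds))            # aggregate statistics
    # finally, rank nodes by proximity to the seeds in the full graph
    edges += [("seeds", s) for s in initial_seeds]
    return rwr_scores(edges, ["seeds"])
```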
Ranking Methods • Random Walk with Restart • H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006. • PageRank • L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998. • Bayesian Sets (over the flattened graph) • Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005. • Wrapper Length • Weights each item by the length of the common contextual string between that item and the seeds • Wrapper Frequency • Weights each item by the number of wrappers that extract it
Little difference between the ranking methods in the supervised case (all seeds correct); large differences when bootstrapping. Increasing the seed size {2, 3, 4, 4, …} makes all ranking methods improve steadily in the bootstrapping case.
Outline • History • Open-domain IE by pattern-matching • The bootstrapping-with-noise problem • Bootstrapping as a graph walk • Open-domain IE as finding nodes “near” seeds on a graph • Set expansion - from a few clean seeds • Iterative set expansion – from many noisy seeds • Relational set expansion • Multilingual set expansion • Iterative set expansion – from a concept name alone
Relational Set Expansion [Wang & Cohen, EMNLP 2009] • Seed examples are pairs: • e.g., audi::germany, acura::japan, … • Extension: find wrappers in which pairs of seeds occur (sketched below) • With specific left & right contexts • In a specific order (audi before germany, …) • With a specific string between them • Variant of the trie-based algorithm
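A hedged sketch of learning one relational wrapper from seed pairs; regex matching stands in for the trie-based variant, and requiring an identical middle string across pairs is a simplification of "a specific string between them":

```python
import re

def _lcp(strs):
    """Longest common prefix of a list of strings."""
    out = ""
    for chars in zip(*strs):
        if len(set(chars)) > 1:
            break
        out += chars[0]
    return out

def relational_wrapper(doc, seed_pairs, ctx=40):
    """One (left, middle, right) wrapper covering every seed pair, or None."""
    hits = []
    for x, y in seed_pairs:               # e.g., ("audi", "germany")
        m = re.search(re.escape(x) + "(.{0,%d}?)" % ctx + re.escape(y), doc)
        if m is None:
            return None                   # some pair never co-occurs here
        hits.append((doc[max(0, m.start() - ctx):m.start()],  # left of x
                     m.group(1),                              # between x and y
                     doc[m.end():m.end() + ctx]))             # right of y
    left = _lcp([h[0][::-1] for h in hits])[::-1]  # longest common *suffix*
    middles = {h[1] for h in hits}
    right = _lcp([h[2] for h in hits])
    return (left, middles.pop(), right) if len(middles) == 1 else None
```

Extraction then scans a page for left + X + middle + Y + right and proposes (X, Y) as a new pair.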
Results
[Figure: ranked-list quality after the first iteration vs. after the tenth iteration]
Outline • History • Open-domain IE by pattern-matching • The bootstrapping-with-noise problem • Bootstrapping as a graph walk • Open-domain IE as finding nodes “near” seeds on a graph • Set expansion - from a few clean seeds • Iterative set expansion – from many noisy seeds • Relational set expansion • Multilingual set expansion • Iterative set expansion – from a concept name alone
Multilingual Set Expansion • Basic idea (a sketch follows below): • Expand in language 1 (English) with seeds s1,s2 to S1 • Expand in language 2 (Spanish) with seeds t1,t2 to T1 • Find the first seed s3 in S1 that has a translation t3 in T1 • Expand in language 1 (English) with seeds s1,s2,s3 to S2 • Find the first seed t4 in T1 that has a translation s4 in S2 • Expand in language 2 (Spanish) with seeds t1,t2,t4 to T2 • Continue…
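A hedged control-flow sketch of this loop; expand(lang, seeds) stands for one SEAL run in the given language, and has_translation(x, items) for the translation check sketched two slides below. Both are placeholders for the real components:

```python
def multilingual_expand(expand, has_translation, s_seeds, t_seeds, rounds=4):
    S_seeds, T_seeds = list(s_seeds), list(t_seeds)
    S, T = expand("en", S_seeds), expand("es", T_seeds)
    for _ in range(rounds):
        # grow the English seed set with the first expansion result whose
        # translation appears on the Spanish side
        s_new = next((s for s in S
                      if s not in S_seeds and has_translation(s, T)), None)
        if s_new:
            S_seeds.append(s_new)
            S = expand("en", S_seeds)
        # symmetric step: grow the Spanish seeds against the refreshed S
        t_new = next((t for t in T
                      if t not in T_seeds and has_translation(t, S)), None)
        if t_new:
            T_seeds.append(t_new)
            T = expand("es", T_seeds)
    return S, T
```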
Multilingual Set Expansion • What’s needed: • Set expansion in two languages • A way to decide if s is a translation of t
Multilingual Set Expansion • Submit s as a query and ask for results in language T • Find chunks in language T in the snippets that frequently co-occur with s • Chunks are bounded by a change in character set (e.g., English to Chinese) or by punctuation • Rank chunks by a combination of proximity and frequency • Consider the top 3 chunks t1, t2, t3 as likely translations of s (a sketch follows below)
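A hedged sketch of the chunk-scoring step. The snippets would come from a search API queried with s and restricted to language T (that call is not shown), and the exact way proximity and frequency are combined below is my assumption, since the slide only names the two signals:

```python
import re
from collections import Counter

def translation_candidates(s, snippets, top_k=3):
    scores = Counter()
    # chunks end at punctuation or at a Latin/CJK character-set change
    chunk_re = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z][A-Za-z ]*[A-Za-z]")
    for snip in snippets:
        pos = snip.find(s)
        if pos < 0:
            continue
        for m in chunk_re.finditer(snip):
            chunk = m.group()
            if chunk != s:
                # frequency accumulates over snippets; nearer chunks count more
                scores[chunk] += 1.0 / (1.0 + abs(m.start() - pos))
    return [c for c, _ in scores.most_common(top_k)]
```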