Explore graph-based learning methods for open-domain information extraction to recognize entities like people, places, companies, and dates in natural language text. Learn about supervised and semi-supervised learning approaches, bootstrapping strategies, and similarity measures for text mining.
Graph-Based Methods for “Open Domain” Information Extraction • William W. Cohen • Machine Learning Dept. and Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Traditional IE vs Open Domain IE
Traditional IE: • Goal: recognize people, places, companies, times, dates, … in NL text • Supervised learning from a corpus completely annotated with the target entity class (e.g. “people”) • Linear-chain CRFs • Language- and genre-specific extractors
Open Domain IE: • Goal: recognize arbitrary entity sets in text • Minimal info about entity class • Example 1: “ICML, NIPS” • Example 2: “Machine learning conferences” • Semi-supervised learning from very large corpora (WWW) • Graph-based learning methods • Techniques are largely language-independent (!) • Graph abstraction fits many languages
Outline • History • Open-domain IE by pattern-matching • The bootstrapping-with-noise problem • Bootstrapping as a graph walk • Open-domain IE as finding nodes “near” seeds on a graph • Approach 1: A “natural” graph derived from a smaller corpus + learned similarity • Approach 2: A carefully-engineered graph derived from huge corpus
History: Open-domain IE by pattern-matching (Hearst, 92) • Start with seeds: “NIPS”, “ICML” • Look through a corpus for certain patterns: • … “at NIPS, AISTATS, KDD and other learning conferences…” • Expand from seeds to new instances • Repeat….until ___ • “on PC of KDD, SIGIR, … and…”
Bootstrapping as graph proximity • [Figure: graph over the terms NIPS, SNOWBIRD, AISTATS, SIGIR, KDD, …, linked through the contexts “…at NIPS, AISTATS, KDD and other learning conferences…”, “For skiers, NIPS, SNOWBIRD,… and…”, “on PC of KDD, SIGIR, … and…”, “… AISTATS,KDD,…”] • Shorter paths ~ earlier iterations • Many paths ~ additional evidence
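The idea that bootstrapping behaves like proximity on a graph can be illustrated with a small sketch. The contexts, the `decay` parameter, and the scoring rule (decayed path counts) are assumptions made for illustration; they are not the method used in the talk.

```python
from collections import defaultdict

# Hypothetical coordinate-list contexts, loosely echoing the slide's examples.
contexts = {
    "c1": ["NIPS", "AISTATS", "KDD"],
    "c2": ["NIPS", "SNOWBIRD"],
    "c3": ["KDD", "SIGIR"],
    "c4": ["AISTATS", "KDD"],
}

def build_graph(contexts):
    """Link two terms when they share a context; edge weight = number of shared contexts."""
    graph = defaultdict(lambda: defaultdict(int))
    for terms in contexts.values():
        for a in terms:
            for b in terms:
                if a != b:
                    graph[a][b] += 1
    return graph

def proximity(graph, seeds, steps=2, decay=0.5):
    """Sum decayed path weights from the seeds: more paths add evidence, longer paths count less."""
    frontier = {s: 1.0 for s in seeds}
    score = defaultdict(float)
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, mass in frontier.items():
            for nbr, w in graph[node].items():
                nxt[nbr] += decay * mass * w
        for node, mass in nxt.items():
            score[node] += mass
        frontier = nxt
    return {n: s for n, s in score.items() if n not in seeds}

graph = build_graph(contexts)
scores = proximity(graph, seeds={"NIPS"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph KDD is reached from NIPS by several paths and so outscores SNOWBIRD, which is reached by only one; SIGIR, two steps away, scores lowest.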
Outline • Open-domain IE as finding nodes “near” seeds on a graph • Approach 1: A “natural” graph derived from a smaller corpus + learned similarity, “with” Einat Minkov (Nokia) • Approach 2: A carefully-engineered graph derived from huge corpus (e.g.’s above), “with” Richard Wang (CMU ?)
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008) • [Figure: dependency parse over the words boys, like, playing, all, kinds, cars; edge labels nsubj, partmod, prep.with, det, prep.of; POS tags NN, NN, VB, VB, DT, NN] • A dependency-parsed sentence is naturally represented as a tree
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008) Dependency parsed corpus is “naturally” represented as a graph
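A minimal sketch of how parsed sentences might be merged into one graph: each token becomes a mention node tied to a shared word-type node, and every edge gets an inverse. The node naming scheme, the `-inv` suffix, the toy parses, and the helper `follow` are assumptions for illustration only.

```python
from collections import defaultdict

def add_sentence(graph, sent_id, tokens, edges):
    """Add one parsed sentence; edges are (source_index, label, target_index) triples."""
    mention = [f"{w}#{sent_id}.{i}" for i, w in enumerate(tokens)]
    for w, m in zip(tokens, mention):
        graph[w].append(("mention", m))        # word type -> this occurrence
        graph[m].append(("mention-inv", w))    # inverse edge back to the word type
    for src, label, dst in edges:
        graph[mention[src]].append((label, mention[dst]))
        graph[mention[dst]].append((label + "-inv", mention[src]))

def follow(graph, start, labels):
    """Return the set of nodes reached from start by following the given edge labels in order."""
    frontier = {start}
    for label in labels:
        frontier = {dst for node in frontier for lab, dst in graph[node] if lab == label}
    return frontier

graph = defaultdict(list)
# Hypothetical parses; the nsubj edge runs from subject to verb, as drawn on the slides.
add_sentence(graph, 1, ["girls", "like", "dolls"], [(0, "nsubj", 1)])
add_sentence(graph, 2, ["boys", "like", "cars"], [(0, "nsubj", 1)])

reached = follow(graph, "girls",
                 ["mention", "nsubj", "mention-inv", "mention", "nsubj-inv", "mention-inv"])
```

Because both sentences share the word-type node `like`, the two parse trees are joined, and `reached` contains `boys` as well as `girls`.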
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008) • Open IE Goal: • Find “coordinate terms” (e.g., girl/boy, dolls/cars) in the graph, or find • A similarity measure S such that S(girl,boy) is high • What about off-the-shelf similarity measures: • PPR/RWR • Hitting time • Commute time • … ?
Personalized PR/RWR • The graph: nodes, node types, edge labels, edge weights • A query language: Q: {seed nodes Vq, output node type} • Returns a list of nodes (of the requested type) ranked by the graph walk probabilities • Graph walk parameters: edge weights Θ, walk length K and reset probability γ • M[x,y] = prob. of reaching y from x in one step: the edge weight from x to y, over the total outgoing weight from x • “Personalized PageRank”: reset probability biased towards the initial distribution • Approximate with power iteration, cut off after a fixed number of iterations K
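The walk described above can be sketched as a truncated power iteration. The toy edge weights and the function names are assumptions; `gamma` and `K` stand for the reset probability γ and walk length K named on the slide.

```python
def transition(weights):
    """M[x][y] = edge weight from x to y, over the total outgoing weight from x."""
    M = {}
    for x, out in weights.items():
        total = sum(out.values())
        M[x] = {y: w / total for y, w in out.items()}
    return M

def personalized_pagerank(weights, v0, gamma=0.5, K=6):
    """Run K steps of v <- gamma * v0 + (1 - gamma) * v M, starting from v0."""
    M = transition(weights)
    v = dict(v0)
    for _ in range(K):
        nxt = {x: gamma * p for x, p in v0.items()}   # reset mass returns to the seeds
        for x, p in v.items():
            for y, m in M.get(x, {}).items():         # nodes without out-edges drop their mass
                nxt[y] = nxt.get(y, 0.0) + (1 - gamma) * p * m
        v = nxt
    return v

# Hypothetical weighted graph in which every node has at least one outgoing edge.
weights = {"a": {"b": 2.0, "c": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0, "b": 1.0}}
scores = personalized_pagerank(weights, v0={"a": 1.0}, gamma=0.5, K=6)
```

Filtering `scores` by node type and sorting by value would give the ranked list a query returns.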
[Figure: graph path girls -[mention]-> girls1 -[nsubj]-> like1 -[mention-1]-> like -[mention]-> like2 -[nsubj-1]-> boys2 -[mention-1]-> boys]
[Figure: the path above, plus a second path girls -[mention]-> girls1 -[nsubj]-> like1 -[partmod]-> playing1 -[mention-1]-> playing -[mention]-> … -[mention-1]-> boys]
[Figure: graph path over the nodes girls, girls1, like1, playing1, dolls1, dolls, with edge labels mention, nsubj, mention-1, Prep.with, mention-1] • Useful but not our goal here…
Learning a better similarity metric • Task T (query class) • Seed words (“girl”, “boy”, …) • Queries a, b, …, q, each with its relevant answers • GRAPH WALK: each query returns a ranked list of nodes (node rank 1, 2, 3, 4, …, 10, 11, 12, …, 50) • Potential new instances of the target concept (“doll”, “child”, “toddler”, …)
Learning methods • Weight tuning – weights learned per edge type [Diligenti et al, 2005] • Reranking – re-order the retrieved list using global features of all paths from source to destination [Minkov et al, 2006] • FEATURES (example nodes: boys, dolls) • Edge label sequences: nsubj.nsubj-inv; nsubj partmod prep.in; nsubj partmod partmod-inv nsubj-inv • Lexical unigrams: “like”, “playing” • …
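The edge-label-sequence features used for reranking can be sketched as follows. The toy graph, the function name `path_features`, and the `max_len` cutoff are assumptions for illustration; the actual feature set in the cited work may differ.

```python
from collections import Counter

def path_features(graph, source, target, max_len=4):
    """Count the edge-label sequences of all walks (up to max_len edges) from source to target."""
    feats = Counter()
    stack = [(source, ())]
    while stack:
        node, labels = stack.pop()
        if node == target and labels:
            feats[" ".join(labels)] += 1
        if len(labels) < max_len:
            for label, nxt in graph.get(node, []):
                stack.append((nxt, labels + (label,)))
    return feats

# Hypothetical, simplified graph (mention/word-type layers collapsed for brevity).
graph = {
    "girls": [("nsubj", "like")],
    "boys": [("nsubj", "like")],
    "like": [("nsubj-inv", "girls"), ("nsubj-inv", "boys"), ("partmod", "playing")],
    "playing": [("partmod-inv", "like"), ("prep.with", "dolls")],
    "dolls": [("prep.with-inv", "playing")],
}

boys_feats = path_features(graph, "girls", "boys")
dolls_feats = path_features(graph, "girls", "dolls")
```

Here `boys_feats` includes the sequences `nsubj nsubj-inv` and `nsubj partmod partmod-inv nsubj-inv`, while `dolls_feats` includes `nsubj partmod prep.with`; a reranker would learn a weight for each such sequence.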
Learning methods: Path-Constrained Graph Walk • [Figure: path tree rooted at Vq (“girls”), with edges nsubj, nsubj-inv, partmod, partmod-inv, prep.in, nsubj-inv passing through nodes x1, x2, x3 to boys and dolls] • PCW (summary): for each node x, learn • P(x→z : relevant(z) | history(Vq,x)) • History(Vq,x) = seq of edge labels leading from Vq to x, with all histories stored in a tree • Example paths: boys: nsubj.nsubj-inv; dolls: nsubj partmod prep.in; boys: nsubj partmod partmod-inv nsubj-inv
City and person name extraction • City names: Vq = {sydney, stamford, greenville, los_angeles} • Person names: Vq = {carter, dave_kingman, pedro_ramos, florio} • 10 queries (4 seeds each) for each task • Train queries q1-q5 / test queries q6-q10 • Extract nodes of type NE • GW: 6 steps, uniform/learned weights • Reranking: top 200 nodes (using learned weights) • Path trees: 20 correct / 20 incorrect; threshold 0.5
[Figure: precision vs. rank on MUC; panels “City names” and “Person names”. Annotations for city names: edge labels conj-and, prep-in, nn, appos …; paths prep-in-inv conj-and, nn-inv nn; lexical features LEX.“based”, LEX.“downtown”. Annotations for person names: edge labels subj, obj, poss, nn …; paths nsubj nsubj-inv, appos nn-inv; lexical features LEX.“mr”, LEX.“president”]
Vector-space models • Co-occurrence vectors (counts; window: +/- 2) • Dependency vectors [Padó & Lapata, Comp Ling 07] • A path value function: • Length-based value: 1 / length(path) • Relation based value: subj-5, obj-4, obl-3, gen-2, else-1 • Context selection function: • Minimal: verbal predicate-argument (length 1) • Medium: coordination, genitive construction, noun compounds (<=3) • Maximal: combinations of the above (<=4) • Similarity function: • Cosine • Lin • Only score the top nodes retrieved with reranking (~1000 overall)
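For comparison with the graph walks, the simplest baseline above (co-occurrence counts in a ±2 window, compared by cosine) can be sketched as below. The two toy sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words appearing within +/- window positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

sentences = [
    ["girls", "like", "playing", "with", "dolls"],
    ["boys", "like", "playing", "with", "cars"],
]
vecs = cooccurrence_vectors(sentences, window=2)
sim_girls_boys = cosine(vecs["girls"], vecs["boys"])
sim_girls_dolls = cosine(vecs["girls"], vecs["dolls"])
```

In this toy corpus `girls` and `boys` share all their context words, so they score higher than `girls` and `dolls`, which share only one.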
GWs – Vector models • [Figure: precision vs. rank on MUC; panels “City names” and “Person names”] • The graph-based methods are best (syntactic + learning)
GWs – Vector models • [Figure: precision vs. rank on MUC + AP; panels “City names” and “Person names”] • The advantage of the graph-based models diminishes with the amount of data • This is hard to evaluate at high ranks (manual labeling)
Outline • Open-domain IE as finding nodes “near” seeds on a graph • Approach 1: A “natural” graph derived from a smaller corpus + learned similarity, “with” Einat Minkov (CMU Nokia) • Approach 2: A carefully-engineered graph derived from huge corpus, “with” Richard Wang (CMU ?)
Set Expansion for Any Language (SEAL) – (Wang & Cohen, ICDM 07) • Basic ideas • Dynamically build the graph using queries to the web • Constrain the graph to be as useful as possible • Be smart about queries • Be smart about “patterns”: use clever methods for finding meaningful structure on web pages
System Architecture • Example seeds: Canon, Nikon, Olympus • Fetcher: download web pages from the Web that contain all the seeds • Extractor: learn wrappers from web pages • Ranker: rank entities extracted by wrappers • Example output: Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …
The Extractor • Learn wrappers from web documents and seeds on the fly • Utilize semi-structured documents • Wrappers defined at character level • Very fast • No tokenization required; thus language independent • Wrappers derived from doc d applied to d only • See ICDM 2007 paper for details
Example page text: .. Generally <a ref="finance/ford">Ford</a> sales … compared to <a ref="finance/honda">Honda</a> while <a href="finance/gm">General Motors</a> and <a href="finance/bentley">Bentley</a> ….
• Find prefix of each seed and put in reverse order: • ford1: /ecnanif"=fer a> yllareneG … • Ford2: >"drof/ /ecnanif"=fer a> yllareneG … • honda1: /ecnanif"=fer a> ot derapmoc … • Honda2: >"adnoh/ /ecnanif"=fer a> ot …
• Organize these into a trie, tagging each node with a set of seeds: [Trie: root {f1,f2,h1,h2}; branch /ecnanif"=fer a> {f1,h1}, with leaves yllareneG … {f1} and ot derapmoc … {h1}; branch >" {f2,h2}, with leaves drof/ /ecnanif"=fer a> yllareneG.. {f2} and adnoh/ /ecnanif"=fer a> ot .. {h2}]
• A left context for a valid wrapper is a node tagged with one instance of each seed.
• The corresponding right context is the longest common suffix of the corresponding seed instances: for {f1,h1}, "> (from ">Ford</a> sales … and ">Honda</a> while …); for {f2,h2}, </a>
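A simplified sketch of character-level wrapper learning on a single page. It replaces the trie with a brute-force pass over every choice of one occurrence per seed, and it reads the right context as the longest string shared at the start of the text following each occurrence; both are assumptions, as are the helper names and the example page. The ICDM 2007 paper gives the actual algorithm.

```python
import itertools
import os
import re

def occurrences(doc, seed):
    """Start offsets of every case-insensitive occurrence of seed in doc."""
    return [m.start() for m in re.finditer(re.escape(seed), doc, re.IGNORECASE)]

def common_suffix(strings):
    """Longest common suffix, computed as the common prefix of the reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def learn_wrappers(doc, seeds, min_len=1):
    """Return (left, right) contexts shared by one occurrence of each seed."""
    wrappers = set()
    per_seed = [[(seed, pos) for pos in occurrences(doc, seed)] for seed in seeds]
    for combo in itertools.product(*per_seed):  # one occurrence of each seed
        left = common_suffix([doc[:pos] for _seed, pos in combo])
        right = os.path.commonprefix([doc[pos + len(seed):] for seed, pos in combo])
        if len(left) >= min_len and len(right) >= min_len:
            wrappers.add((left, right))
    return wrappers

def apply_wrapper(doc, left, right):
    """Extract every string that appears between the left and right contexts."""
    return set(re.findall(re.escape(left) + "(.+?)" + re.escape(right), doc))

# Hypothetical page, modeled on the slide's example.
doc = ('Generally <a href="finance/ford">Ford</a> sales compared to '
       '<a href="finance/honda">Honda</a> while '
       '<a href="finance/gm">General Motors</a> and '
       '<a href="finance/bentley">Bentley</a> lag behind.')
wrappers = learn_wrappers(doc, ["ford", "honda"])
mentions = set().union(*(apply_wrapper(doc, left, right) for left, right in wrappers))
```

On this page two wrappers survive, one bracketing the URL slug and one bracketing the link text, and applying them back to the same page also extracts `gm`, `bentley`, `General Motors`, and `Bentley`.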
[Figure callouts marking noisy extractions: “I am noise”, “Me too!”]
The Ranker • Rank candidate entity mentions based on “similarity” to seeds • Noisy mentions should be ranked lower • Random Walk with Restart (GW) • …as before… • What’s the graph?
Building a Graph • A graph consists of a fixed set of… • Node Types: {seeds, document, wrapper, mention} • Labeled Directed Edges: {find, derive, extract} • Each edge asserts that a binary relation r holds • Each edge has an inverse relation r-1 (graph is cyclic) • Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions • [Figure: example graph with seeds “ford”, “nissan”, “toyota”; documents northpointcars.com and curryauto.com; Wrappers #1 to #4; edges labeled find, derive, extract; mentions “acura” 34.6%, “honda” 26.1%, “chevrolet” 22.5%, “volvo chicago” 8.4%, “bmw pittsburgh” 8.4%]
Evaluation Method • Mean Average Precision • Commonly used for evaluating ranked lists in IR • Contains recall and precision-oriented aspects • Sensitive to the entire ranking • Mean of average precisions for each ranked list, where L = ranked list of extracted mentions, r = rank • Prec(r) = precision at rank r • Conditions checked for the mention at rank r: (a) the extracted mention at r matches any true mention; (b) there exists no other extracted mention at rank less than r that is of the same entity as the one at r • # True Entities = total number of true entities in this dataset • Evaluation Procedure (per dataset) • 1. Randomly select three true entities and use their first listed mentions as seeds • 2. Expand the three seeds obtained from step 1 • 3. Repeat steps 1 and 2 five times • 4. Compute MAP for the five ranked lists
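The slide's formula did not survive extraction, so the sketch below rests on an assumption: average precision sums Prec(r) over the ranks that satisfy conditions (a) and (b) and divides by # True Entities, with Prec(r) taken as the fraction of the top r mentions that match a true mention. The data and function names are hypothetical.

```python
def average_precision(ranked, mention_to_entity, n_true_entities):
    """Average precision of one ranked list, counting each true entity once."""
    seen_entities = set()
    hits = 0
    total = 0.0
    for r, mention in enumerate(ranked, start=1):
        entity = mention_to_entity.get(mention)
        if entity is None:            # condition (a) fails: not a true mention
            continue
        hits += 1
        if entity not in seen_entities:   # condition (b): first mention of this entity
            seen_entities.add(entity)
            total += hits / r             # Prec(r) under the stated assumption
    return total / n_true_entities

def mean_average_precision(runs):
    """Mean of the average precisions over several (ranked, mapping, n_true) runs."""
    return sum(average_precision(*run) for run in runs) / len(runs)

# Hypothetical ranked list; three true entities exist, one of them never extracted.
mention_to_entity = {"honda": "honda", "Honda Motor": "honda", "acura": "acura"}
ranked = ["honda", "volvo chicago", "acura", "Honda Motor"]
ap = average_precision(ranked, mention_to_entity, n_true_entities=3)
```

In this example “Honda Motor” at rank 4 adds nothing because its entity was already found at rank 1, and the missing third entity lowers the score through the denominator.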
Experimental Results: 3 seeds • Vary: [Extractor] + [Ranker] + [Top N URLs] • Extractor: • E1: Baseline Extractor (longest common context for all seed occurrences) • E2: Smarter Extractor (longest common context for 1 occurrence of each seed) • Ranker: { EF: Baseline (Most Frequent), GW: Graph Walk } • N URLs: { 100, 200, 300 }
Side by side comparisons • Talukdar, Brants, Liberman, Pereira, CoNLL 06
Side by side comparisons • EachMovie vs WWW • NIPS vs WWW • Ghahramani & Heller, NIPS 2005
Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008) • Makes several calls to SEAL, each call… • Expands a couple of seeds • Aggregates statistics • Evaluate iSEAL using… • Two iterative processes • Supervised vs. Unsupervised (Bootstrapping) • Two seeding strategies • Fixed Seed Size vs. Increasing Seed Size • Five ranking methods
iSEAL (Fixed Seed Size, Supervised) • [Figure label: Initial Seeds] • Finally rank nodes by proximity to seeds in the full graph • Refinement (ISS): Increase size of seed set for each expansion over time: 2,3,4,4,… • Variant (Bootstrap): use high-confidence extractions when seeds run out
Ranking Methods • Random Graph Walk with Restart: H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006. • PageRank: L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998. • Bayesian Sets (over flattened graph): Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005. • Wrapper Length: weights each item based on the length of the common contextual string of that item and the seeds • Wrapper Frequency: weights each item based on the number of wrappers that extract the item
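The two simplest rankers can be sketched directly from their one-line descriptions. The wrapper tuples are hypothetical, and summing context lengths across wrappers is an assumed aggregation; the slide does not say how lengths are combined.

```python
from collections import defaultdict

# Hypothetical wrappers: (left context, right context, set of extracted items).
wrappers = [
    (' <a href="finance/', '">', {"ford", "honda", "gm"}),
    ('">', "</a>", {"ford", "honda", "gm", "bentley"}),
    ("<li>", "</li>", {"ford", "home"}),
]

def wrapper_frequency(wrappers):
    """Score each item by the number of wrappers that extract it."""
    scores = defaultdict(int)
    for _left, _right, items in wrappers:
        for item in items:
            scores[item] += 1
    return dict(scores)

def wrapper_length(wrappers):
    """Score each item by the total context length of the wrappers that extract it."""
    scores = defaultdict(int)
    for left, right, items in wrappers:
        for item in items:
            scores[item] += len(left) + len(right)
    return dict(scores)

freq = wrapper_frequency(wrappers)
length = wrapper_length(wrappers)
by_frequency = sorted(freq, key=freq.get, reverse=True)
```

Neither score uses the graph structure, which may be one reason these baselines behave differently from the graph walk once noisy seeds enter during bootstrapping.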
• Little difference between ranking methods for the supervised case (all seeds correct); large differences when bootstrapping • Increasing seed size {2,3,4,4,…} makes all ranking methods improve steadily in the bootstrapping case
Current work • Start with name of concept (e.g., “NFL teams”) • Look for (language-dependent) patterns: • “… for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, …)” • Take most frequent answers as seeds • Run bootstrapping iSEAL with seed sizes 2,3,4,4….
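The first three steps above can be sketched with a single regular expression. The pattern, the snippets, and the function name are assumptions for illustration; the real system presumably uses a richer set of language-dependent patterns.

```python
import re
from collections import Counter

def seeds_from_patterns(concept, snippets, n_seeds=2):
    """Pick the most frequent items found in 'CONCEPT (e.g., A, B, ...)' patterns."""
    pattern = re.compile(re.escape(concept) + r"\s*\(e\.g\.,\s*([^)]*)\)", re.IGNORECASE)
    counts = Counter()
    for text in snippets:
        for match in pattern.finditer(text):
            for item in match.group(1).split(","):
                item = item.strip(" .…")   # drop padding and trailing ellipses
                if item:
                    counts[item] += 1
    return [item for item, _ in counts.most_common(n_seeds)]

# Hypothetical text snippets containing the concept name.
snippets = [
    "… for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, …)",
    "Fans of NFL teams (e.g., Pittsburgh Steelers, Dallas Cowboys) travel widely.",
    "Rivalries among NFL teams (e.g., New York Giants, Pittsburgh Steelers) run deep.",
]
seeds = seeds_from_patterns("NFL teams", snippets, n_seeds=2)
```

The resulting list would then serve as the initial seed set for the bootstrapping run.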