Graph-Based Methods for “Open Domain” Information Extraction William W. Cohen Machine Learning Dept. and Language Technologies Institute School of Computer Science Carnegie Mellon University
Traditional IE vs. Open-Domain IE
Traditional IE goal: recognize people, places, companies, times, dates, … in NL text
• Supervised learning from a corpus completely annotated with the target entity class (e.g. “people”)
• Linear-chain CRFs
• Language- and genre-specific extractors
Open-domain IE goal: recognize arbitrary entity sets in text, given minimal info about the entity class
• Example 1: “ICML, NIPS”
• Example 2: “Machine learning conferences”
• Semi-supervised learning from very large corpora (WWW)
• Graph-based learning methods
• Techniques are largely language-independent (!): the graph abstraction fits many languages
Outline • History • Open-domain IE by pattern-matching • The bootstrapping-with-noise problem • Bootstrapping as a graph walk • Open-domain IE as finding nodes “near” seeds on a graph • Approach 1: A “natural” graph derived from a smaller corpus + learned similarity • Approach 2: A carefully-engineered graph derived from huge corpus
History: Open-domain IE by pattern-matching (Hearst, 92) • Start with seeds: “NIPS”, “ICML” • Look through a corpus for certain patterns: • … “at NIPS, AISTATS, KDD and other learning conferences…” • … “on PC of KDD, SIGIR, … and…” • Expand from seeds to new instances • Repeat… until ___
Bootstrapping as graph proximity
[Figure: a graph linking entities (NIPS, SNOWBIRD, AISTATS, KDD, SIGIR, …) to the contexts that mention them, e.g. “…at NIPS, AISTATS, KDD and other learning conferences…”, “For skiers, NIPS, SNOWBIRD, … and…”, “on PC of KDD, SIGIR, … and…”]
Shorter paths ~ earlier iterations; many paths ~ additional evidence
Outline • Open-domain IE as finding nodes “near” seeds on a graph • Approach 1: A “natural” graph derived from a smaller corpus + learned similarity (“with” Einat Minkov, Nokia) • Approach 2: A carefully-engineered graph derived from a huge corpus (the examples above) (“with” Richard Wang, CMU ?)
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
[Figure: dependency parse of “boys like playing with all kinds of cars”, with edges nsubj, partmod, prep.with, det, prep.of and POS tags NN, VB, DT.]
A dependency-parsed sentence is naturally represented as a tree
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008) A dependency-parsed corpus is “naturally” represented as a graph
Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008) • Open IE goal: • Find “coordinate terms” (e.g., girl/boy, dolls/cars) in the graph; equivalently, find a • Similarity measure S such that S(girl, boy) is high • What about off-the-shelf similarity measures? • PPR/RWR • Hitting time • Commute time • …
Personalized PR/RWR
The graph: nodes (with node types) and labeled, weighted directed edges.
A query: Q = {source nodes, target node type}; returns a list of nodes (of the target type) ranked by graph-walk probabilities.
Graph-walk parameters: edge weights Θ, walk length K, and reset probability γ.
M[x,y] = prob. of reaching y from x in one step: the edge weight from x to y, over the total outgoing weight from x.
“Personalized PageRank”: the reset probability is biased toward the initial distribution, i.e. v_{t+1} = γ·v_0 + (1 − γ)·Mᵀ·v_t.
Approximate with power iteration, cut off after a fixed number of iterations K.
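To make this concrete, here is a minimal Python sketch of the K-step power-iteration approximation described above. M, gamma (γ), and K follow the slide's notation; the 3-node toy graph and the function name rwr are invented for illustration.

```python
import numpy as np

def rwr(M, v0, gamma=0.5, K=6):
    """Random walk with restart, approximated by K power-iteration steps.

    M[x, y]: probability of stepping from node x to node y
             (edge weight from x to y over total outgoing weight of x).
    v0:      initial ("query") distribution over nodes.
    gamma:   reset probability; the walk jumps back to v0 with prob. gamma.
    """
    v = v0.copy()
    for _ in range(K):
        v = gamma * v0 + (1.0 - gamma) * (M.T @ v)
    return v  # rank nodes of the target type by these scores

# Toy example: 3 nodes, query distribution concentrated on node 0.
M = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
v0 = np.array([1.0, 0.0, 0.0])
print(rwr(M, v0))
```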
[Figure: graph walks connecting coordinate terms through shared contexts.
girls → girls1 → like1 → like → like2 → boys2 → boys (edge labels: mention, nsubj, mention-1, mention, nsubj-1, mention-1)
girls → girls1 → like1 → playing1 → playing → … → boys (edge labels: mention, nsubj, partmod, mention-1, mention, …, mention-1)
girls → girls1 → like1 → playing1 → dolls1 → dolls (edge labels: mention, nsubj, partmod, prep.with, mention-1)]
The last walk reaches a related (not coordinate) term: useful, but not our goal here…
Learning a better similarity metric
Task T (query class): seed words (“girl”, “boy”, …) give queries q, a, b, …, each paired with its relevant answers.
GRAPH WALK: each query returns a ranked list of nodes (node rank 1, node rank 2, …, node rank 50).
The highly-ranked nodes are potential new instances of the target concept (“doll”, “child”, “toddler”, …).
Learning methods
• Weight tuning: weights learned per edge type [Diligenti et al., 2005]
• Reranking: re-order the retrieved list using global features of all paths from source to destination [Minkov et al., 2006]
Example features for the paths from “boys” to “dolls”:
• Edge-label sequences: nsubj.nsubj-inv; nsubj.partmod.prep.in; nsubj.partmod.partmod-inv.nsubj-inv
• Lexical unigrams: “like”, “playing”
• …
Learning methods: Path-Constrained Graph Walks
[Figure: a path tree rooted at Vq = “girls”, e.g. Vq → x1 → x2 → x3 via nsubj, partmod, prep.in, with nsubj-inv and partmod-inv branches reaching “boys” and “dolls”.]
PCW (summary): for each node x, learn
• P(x → z : relevant(z) | history(Vq, x))
• history(Vq, x) = the sequence of edge labels leading from Vq to x, with all histories stored in a tree
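A minimal Python sketch of this idea, assuming histories are edge-label sequences whose relevant/irrelevant endpoint counts are learned from labeled walks; the class name PathTree and its methods are hypothetical, not the paper's implementation.

```python
from collections import defaultdict

class PathTree:
    """Estimate P(destination is relevant | history of edge labels from Vq)."""

    def __init__(self):
        self.pos = defaultdict(int)  # history -> count of relevant endpoints
        self.neg = defaultdict(int)  # history -> count of irrelevant endpoints

    def observe(self, history, relevant):
        (self.pos if relevant else self.neg)[tuple(history)] += 1

    def p_relevant(self, history):
        h = tuple(history)
        n = self.pos[h] + self.neg[h]
        return self.pos[h] / n if n else 0.5  # uninformative prior if unseen

# Train from labeled paths (e.g. 20 correct / 20 incorrect, as in the
# experiments that follow).
tree = PathTree()
tree.observe(["nsubj", "nsubj-inv"], relevant=True)
tree.observe(["nsubj", "partmod", "prep.in"], relevant=False)

# At walk time, only expand nodes whose history looks promising.
threshold = 0.5
print(tree.p_relevant(["nsubj", "nsubj-inv"]) >= threshold)  # True
```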
City and person name extraction
City names: Vq = {sydney, stamford, greenville, los_angeles}
Person names: Vq = {carter, dave_kingman, pedro_ramos, florio}
• 10 queries (4 seeds each) for each task
• Train on queries q1–q5 / test on queries q6–q10
• Extract nodes of type NE
• GW: 6 steps, uniform/learned weights
• Reranking: top 200 nodes (using learned weights)
• Path trees: trained on 20 correct / 20 incorrect paths; threshold 0.5
[Results on MUC: precision-at-rank curves for city names and person names.]
• Highly-weighted edge types: conj-and, prep-in, nn, appos, … (cities); subj, obj, poss, nn, … (persons)
• Highly-weighted edge-label sequences: prep-in-inv.conj-and, nn-inv.nn (cities); nsubj.nsubj-inv, appos.nn-inv (persons)
• Highly-weighted lexical features: LEX.”based”, LEX.”downtown” (cities); LEX.”mr”, LEX.”president” (persons)
Vector-space models • Co-occurrence vectors (counts; window: +/- 2) • Dependency vectors [Padó & Lapata, Comp. Ling. 2007] • A path value function: • Length-based value: 1 / length(path) • Relation-based value: subj = 5, obj = 4, obl = 3, gen = 2, else = 1 • A context selection function: • Minimal: verbal predicate-argument paths (length 1) • Medium: adds coordination, genitive constructions, noun compounds (length ≤ 3) • Maximal: combinations of the above (length ≤ 4) • A similarity function: • Cosine • Lin • Only the top nodes retrieved with reranking are scored (~1000 overall)
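For concreteness, a small Python sketch of the dependency-vector baseline with the length-based path value (1 / length(path)) and cosine similarity; the example contexts and helper names are invented for illustration.

```python
from collections import Counter
from math import sqrt

def add_context(vec, path, context_word):
    # Length-based path value: a context reached by a longer
    # dependency path contributes less weight.
    vec[context_word] += 1.0 / len(path)

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = lambda w: sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# Toy vectors for "girl" and "boy" built from hypothetical parsed contexts.
girl, boy = Counter(), Counter()
add_context(girl, ["nsubj"], "like")                # girl -nsubj-> like
add_context(boy,  ["nsubj"], "like")                # boy  -nsubj-> like
add_context(boy,  ["nsubj", "partmod"], "playing")  # longer path, value 0.5
print(cosine(girl, boy))  # a high score suggests coordinate terms
```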
GWs vs. vector models
[Results on MUC: precision-at-rank curves for city names and person names.]
• The graph-based methods (syntax + learning) are best
GWs vs. vector models
[Results on MUC + AP: precision-at-rank curves for city names and person names.]
• The advantage of the graph-based models diminishes with the amount of data
• This is hard to evaluate at high ranks (requires manual labeling)
Outline • Open-domain IE as finding nodes “near” seeds on a graph • Approach 1: A “natural” graph derived from a smaller corpus + learned similarity (“with” Einat Minkov, CMU → Nokia) • Approach 2: A carefully-engineered graph derived from a huge corpus (“with” Richard Wang, CMU ?)
Set Expansion for Any Language (SEAL) – (Wang & Cohen, ICDM 07) • Basic ideas • Dynamically build the graph using queries to the web • Constrain the graph to be as useful as possible • Be smart about queries • Be smart about “patterns”: use clever methods for finding meaningful structure on web pages
System Architecture
• Fetcher: download web pages from the Web that contain all the seeds
• Extractor: learn wrappers from web pages
• Ranker: rank entities extracted by wrappers
Example: the seeds {Canon, Nikon, Olympus} expand to {Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …}
The Extractor • Learn wrappers from web documents and seeds on the fly • Utilize semi-structured documents • Wrappers defined at character level • Very fast • No tokenization required; thus language independent • Wrappers derived from doc d applied to d only • See ICDM 2007 paper for details
The Extractor: example
Document:
.. Generally <a href=“finance/ford”>Ford</a> sales … compared to <a href=“finance/honda”>Honda</a> while <a href=“finance/gm”>General Motors</a> and <a href=“finance/bentley”>Bentley</a> ….
• Find the prefix of each seed occurrence and put it in reverse order:
• ford1: /ecnanif“=ferh a> yllareneG …
• Ford2: >”drof/ecnanif“=ferh a> yllareneG …
• honda1: /ecnanif“=ferh a> ot derapmoc …
• Honda2: >”adnoh/ecnanif“=ferh a> ot …
• Organize these into a trie, tagging each node with the set of seed occurrences that share that reversed prefix (e.g. the node for /ecnanif“=ferh a> is tagged {f1, f2, h1, h2}).
• A left context for a valid wrapper is a node tagged with at least one occurrence of each seed.
• The corresponding right context is the longest common string following those seed instances: here “>Ford</a> sales …” and “>Honda</a> while …” yield </a>.
A runnable sketch of this procedure follows.
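A hedged Python sketch of the character-level induction just described, assuming one occurrence per seed (the E2 variant evaluated later): the left context is the longest common suffix of the text before each seed (equivalently, the longest common prefix of the reversed prefixes, i.e. the trie view), and the right context is the longest common prefix of the text after each seed. Function names are illustrative, not SEAL's actual code.

```python
def common_prefix(strings):
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix

def learn_wrapper(doc, seeds):
    """Left/right character contexts shared by one occurrence of each seed."""
    lefts, rights = [], []
    for seed in seeds:
        i = doc.find(seed)              # one occurrence per seed
        if i < 0:
            return None                 # page must contain all the seeds
        lefts.append(doc[:i][::-1])     # reversed prefix, as in the trie
        rights.append(doc[i + len(seed):])
    left = common_prefix(lefts)[::-1]   # longest common suffix of prefixes
    right = common_prefix(rights)
    return left, right

doc = ('.. Generally <a href="finance/ford">Ford</a> sales compared to '
       '<a href="finance/honda">Honda</a> while <a href="finance/gm">'
       'General Motors</a> and <a href="finance/bentley">Bentley</a> ..')
print(learn_wrapper(doc, ["Ford", "Honda"]))   # ('">', '</a> ')
```

Applied back to the same page, the learned wrapper ('">', '</a> ') extracts “General Motors” and “Bentley” as new candidates.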
[Figure: some of the extracted candidate mentions are noise.]
The Ranker • Rank candidate entity mentions based on “similarity” to seeds • Noisy mentions should be ranked lower • Random Walk with Restart (GW) • …as before… • What’s the graph?
Building a Graph
• A graph consists of a fixed set of…
• Node types: {seed, document, wrapper, mention}
• Labeled directed edges: {find, derive, extract}
• Each edge asserts that a binary relation r holds
• Each edge has an inverse relation r-1 (so the graph is cyclic)
• Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions.
[Figure: seeds “ford”, “nissan”, “toyota” find documents (northpointcars.com, curryauto.com), which derive wrappers #1–#4, which extract mentions, ranked “acura” 34.6%, “honda” 26.1%, “chevrolet” 22.5%, “volvo chicago” 8.4%, “bmw pittsburgh” 8.4%.]
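Putting the graph and the walk together: a Python sketch that builds a tiny version of the figure's graph with uniform edge weights and ranks nodes by proximity to the seed. The node set and the uniform weighting are illustrative assumptions; the walk is the same power iteration sketched earlier.

```python
import numpy as np

# Nodes: a seed, a document, a wrapper, and two candidate mentions.
nodes = ["ford", "northpointcars.com", "wrapper1", "honda", "chevrolet"]
idx = {n: i for i, n in enumerate(nodes)}
edges = [("ford", "northpointcars.com"),      # find
         ("northpointcars.com", "wrapper1"),  # derive
         ("wrapper1", "honda"),               # extract
         ("wrapper1", "chevrolet")]           # extract

A = np.zeros((len(nodes), len(nodes)))
for u, v in edges:
    A[idx[u], idx[v]] = 1.0
    A[idx[v], idx[u]] = 1.0           # inverse relation r-1: the graph is cyclic
M = A / A.sum(axis=1, keepdims=True)  # one-step transition probabilities

v0 = np.zeros(len(nodes)); v0[idx["ford"]] = 1.0   # restart at the seed
gamma, K, v = 0.5, 6, v0.copy()
for _ in range(K):                    # random walk with restart (power iteration)
    v = gamma * v0 + (1.0 - gamma) * (M.T @ v)
print(sorted(zip(nodes, v), key=lambda p: -p[1]))
```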
Evaluation Method
• Mean Average Precision (MAP)
• Commonly used for evaluating ranked lists in IR
• Has both recall- and precision-oriented aspects
• Sensitive to the entire ranking
• MAP = mean of the average precisions of the ranked lists. For a ranked list L of extracted mentions:
AP(L) = (1 / #true entities) × Σ_r Prec(r) × NewEntity(r)
where Prec(r) = precision at rank r, #true entities = total number of true entities in the dataset, and NewEntity(r) = 1 iff (a) the mention extracted at rank r matches some true mention and (b) no mention at a rank less than r refers to the same entity.
• Evaluation procedure (per dataset):
1. Randomly select three true entities and use their first listed mentions as seeds
2. Expand the three seeds obtained in step 1
3. Repeat steps 1 and 2 five times and compute MAP over the five ranked lists
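A small Python sketch of AP under this definition, assuming a mapping from true mention strings to entity ids; only correct, “fresh” ranks (conditions (a) and (b)) contribute Prec(r), and in this sketch duplicate mentions of an already-seen entity are simply skipped. Names are illustrative.

```python
def average_precision(ranked, truth):
    """ranked: extracted mentions, best first; truth: true mention -> entity id."""
    seen, correct, ap = set(), 0, 0.0
    for r, mention in enumerate(ranked, start=1):
        entity = truth.get(mention)
        if entity is not None and entity not in seen:  # conditions (a) and (b)
            seen.add(entity)
            correct += 1
            ap += correct / r          # Prec(r) at each fresh, correct rank
    n_true = len(set(truth.values()))  # total true entities in the dataset
    return ap / n_true if n_true else 0.0

truth = {"Ford": "ford", "Honda": "honda",
         "GM": "gm", "General Motors": "gm"}
print(average_precision(["Ford", "noise", "Honda"], truth))  # (1 + 2/3) / 3 ≈ 0.56
```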
Experimental Results: 3 seeds
• Vary: [Extractor] + [Ranker] + [Top N URLs]
• Extractor:
• E1: baseline extractor (longest common context of all seed occurrences)
• E2: smarter extractor (longest common context of one occurrence of each seed)
• Ranker: {EF: baseline (most frequent), GW: graph walk}
• N URLs: {100, 200, 300}
Side-by-side comparisons
[Figure: comparison with Talukdar, Brants, Liberman & Pereira (CoNLL 2006).]
Side-by-side comparisons
[Figures: Bayesian Sets (Ghahramani & Heller, NIPS 2005) on EachMovie vs. SEAL on the WWW, and on NIPS vs. the WWW.]
Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008)
• Makes several calls to SEAL; each call…
• Expands a couple of seeds
• Aggregates statistics
• Evaluate iSEAL using…
• Two iterative processes: supervised vs. unsupervised (bootstrapping)
• Two seeding strategies: fixed seed size vs. increasing seed size
• Five ranking methods
iSEAL (Fixed Seed Size, Supervised)
[Figure: the initial seeds are split into small batches; each SEAL call expands one batch, and the resulting graphs are merged.]
• Finally, rank nodes by proximity to the seeds in the full graph
• Refinement (ISS): increase the size of the seed set for each expansion over time: 2, 3, 4, 4, …
• Variant (Bootstrap): use high-confidence extractions as seeds when the original seeds run out
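A hedged sketch of the iSEAL control loop with the ISS schedule (2, 3, 4, 4, …) and the bootstrap variant. The function seal_expand stands in for a full SEAL call (fetch, extract, rank); its interface is an assumption, not the paper's API.

```python
def iseal(initial_seeds, seal_expand, iterations=5, bootstrap=False):
    """seal_expand(seeds) is assumed to return {candidate: score} for one call."""
    seed_sizes = [2, 3, 4]            # ISS schedule: 2, 3, 4, then 4 forever
    pool, scores = list(initial_seeds), {}
    for it in range(iterations):
        k = seed_sizes[min(it, len(seed_sizes) - 1)]
        if len(pool) < k:
            if not bootstrap:
                break                 # supervised: stop when seeds run out
            # Bootstrap variant: reuse high-confidence extractions as seeds.
            best = sorted(scores, key=scores.get, reverse=True)
            pool += best[:k - len(pool)]
        batch, pool = pool[:k], pool[k:]
        for cand, s in seal_expand(batch).items():
            scores[cand] = scores.get(cand, 0.0) + s   # aggregate statistics
    # Finally, rank all candidates by their aggregated proximity scores.
    return sorted(scores, key=scores.get, reverse=True)

# Toy stand-in for a real SEAL call.
demo = lambda seeds: {s.lower(): 1.0 for s in seeds}
print(iseal(["ICML", "NIPS", "KDD", "SIGIR", "AISTATS"], demo, iterations=2))
```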
Ranking Methods
• Random Walk with Restart: H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. ICDM 2006.
• PageRank: L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.
• Bayesian Sets (over the flattened graph): Z. Ghahramani and K. A. Heller. Bayesian sets. NIPS 2005.
• Wrapper Length: weights each item by the length of the common contextual string of that item and the seeds
• Wrapper Frequency: weights each item by the number of wrappers that extract it
• Little difference between ranking methods in the supervised case (all seeds correct); large differences when bootstrapping
• Increasing the seed size ({2, 3, 4, 4, …}) makes all ranking methods improve steadily in the bootstrapping case
Current work
• Start with the name of a concept (e.g., “NFL teams”)
• Look for (language-dependent) patterns:
• “… for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, …)”
• Take the most frequent answers as seeds
• Run bootstrapping iSEAL with seed sizes 2, 3, 4, 4, …