Information Extraction with Conditional Random Fields. Andrew McCallum University of Massachusetts Amherst Joint work with Aron Culotta, Wei Li and Ben Wellner Bruce Croft, James Allan, David Pinto, Xing Wei John Lafferty (CMU), Fernando Pereira (UPenn).
IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several centuries old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
IE from Research Papers [McCallum et al ‘99]
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
What is “Information Extraction”? As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...

IE output:
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
What is “Information Extraction”? As a family of techniques: Information Extraction = segmentation + classification + association + clustering

Applied to the same news excerpt, segmentation and classification yield the mentions:
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

After association and clustering:
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
IE in Context
Create ontology → Spider the document collection → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search → Data mining (Prediction, Outlier detection, Decision support)
In parallel: label training data → train extraction models.
Issues that arise
Application issues:
• Directed spidering
• Page filtering
• Sequence segmentation
• Segment classification
• Record association
• Co-reference
Scientific issues:
• Learning more than 100k parameters from limited and noisy training data
• Taking advantage of rich, interdependent features
• Deciding which input feature representation to use
• Efficient inference in models with many interrelated predictions
• Clustering massive data sets
Outline
✓ The need. The task.
• Review of Random Field Models for IE.
• Recent work
• New results for Table Extraction
• New training procedure using Feature Induction
• New extended model for Co-reference Resolution
• New model for Factorial Sequence Labeling
• Conclusions
Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
Graphical model: a finite-state model with hidden state sequence S_t-1, S_t, S_t+1, ... and observation sequence O_t-1, O_t, O_t+1, ...; transitions move between states, and each state generates an observation o1 o2 o3 o4 ...
Parameters, for all states S = {s1, s2, ...}:
• Start state probabilities: P(s_1)
• Transition probabilities: P(s_t | s_t-1)
• Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize probability of training observations (with a prior).
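The HMM parameterization above can be made concrete in a few lines. In this sketch, the states ("bg", "name"), words, and probabilities are toy values chosen for illustration, not anything from the talk:

```python
# Toy HMM; all states, words, and numbers here are illustrative.
start = {"bg": 0.8, "name": 0.2}                      # P(s1)
trans = {("bg", "bg"): 0.9, ("bg", "name"): 0.1,      # P(st | st-1)
         ("name", "bg"): 0.6, ("name", "name"): 0.4}
emit = {("bg", "Yesterday"): 0.5, ("bg", "spoke"): 0.5,   # P(ot | st)
        ("name", "Ben"): 0.7, ("name", "Rosenberg"): 0.3}

def joint_prob(states, obs):
    """P(s, o) = P(s1) P(o1|s1) * prod_t P(st|st-1) P(ot|st)."""
    p = start[states[0]] * emit[(states[0], obs[0])]
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        p *= trans[(prev, cur)] * emit[(cur, o)]
    return p

p = joint_prob(["bg", "name", "name", "bg"],
               ["Yesterday", "Ben", "Rosenberg", "spoke"])
```

Note that the model multiplies one transition and one emission factor per time step; this factorization is what makes dynamic-programming inference possible.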
IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Ben Rosenberg spoke this example sentence.
and a trained HMM with states such as: person name, location name, background.
Find the most likely state sequence (Viterbi):
  Yesterday [Ben Rosenberg]_person-name spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Ben Rosenberg
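A minimal Viterbi decoder for such an HMM can be sketched as below, in log-space to avoid underflow. The toy states, words, and probabilities are illustrative assumptions, not the talk's trained model:

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return the most likely state sequence under an HMM (log-space)."""
    logp = lambda x: math.log(x) if x > 0 else -1e9   # guard unseen emissions
    V = [{s: logp(start[s]) + logp(emit[s].get(obs[0], 0)) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda q: V[-1][q] + logp(trans[q][s]))
            row[s] = V[-1][prev] + logp(trans[prev][s]) + logp(emit[s].get(o, 0))
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["bg", "name"]
start = {"bg": 0.8, "name": 0.2}
trans = {"bg": {"bg": 0.9, "name": 0.1}, "name": {"bg": 0.6, "name": 0.4}}
emit = {"bg": {"Yesterday": 0.5, "spoke": 0.5},
        "name": {"Ben": 0.7, "Rosenberg": 0.3}}
path = viterbi(["Yesterday", "Ben", "Rosenberg", "spoke"],
               states, start, trans, emit)
```

Here the decoder tags "Ben Rosenberg" with the "name" state, which is exactly the extraction step the slide describes.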
We Want More than an Atomic View of Words
We would like a richer representation of text: many arbitrary, overlapping features of the words, e.g. for the token "Wisniewski" (part of a noun phrase, ends in "-ski"):
• identity of word
• ends in "-ski"
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• last person name was female
• next two words are "and Associates"
Problems with a Richer Representation and a Joint Model
These arbitrary features are not independent:
• Multiple levels of granularity (chars, words, phrases)
• Multiple dependent modalities (words, formatting, layout)
• Past & future
Two choices:
1. Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
2. Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
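The "over-counting" under choice 1 is easy to demonstrate: if a naïve-Bayes-style model treats several copies of the same dependent evidence as independent, the posterior becomes badly overconfident. A toy illustration (the log-odds value is an arbitrary assumption):

```python
import math

def posterior(log_odds, n_copies):
    """Posterior when one piece of evidence is (wrongly) counted n times."""
    return 1.0 / (1.0 + math.exp(-log_odds * n_copies))

p_once = posterior(1.0, 1)    # evidence counted once
p_thrice = posterior(1.0, 3)  # same evidence counted three times
```

Counting the same evidence three times pushes the posterior from about 0.73 toward 0.95, even though no new information arrived; with the hard max in Viterbi, such inflated scores can flip the decoded path.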
Conditional Sequence Models
• We prefer a model trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).
• It can examine features, but is not responsible for generating them.
• We don't have to explicitly model their dependencies.
• We don't "waste modeling effort" trying to generate what we are given at test time anyway.
From HMMs to CRFs: Conditional Finite-State Sequence Models
[McCallum, Freitag & Pereira, 2000] [Lafferty, McCallum & Pereira, 2001]
Joint (HMM):
  P(s, o) = Π_t P(s_t | s_t-1) P(o_t | s_t)
Conditional:
  P(s | o) = Π_t P(s_t | s_t-1, o_t)
  where P(s_t | s_t-1, o_t) = (1 / Z(o_t, s_t-1)) exp( Σ_k λ_k f_k(s_t, s_t-1, o_t) )
(A super-special case of Conditional Random Fields.)
Conditional Random Fields (CRFs) [Lafferty, McCallum & Pereira, 2001]
A linear chain of states S_t, ..., S_t+4 conditioned on the entire observation sequence o = o_t, ..., o_t+4: Markov on s, conditional dependency on o.
  P(s | o) = (1 / Z_o) Π_t exp( Σ_k λ_k f_k(s_t, s_t-1, o, t) )
The Hammersley-Clifford theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph.
Assuming the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.
Set parameters by maximum likelihood, using a gradient-based optimization method on the log-likelihood L.
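To make the form above concrete, here is a brute-force linear-chain CRF that computes p(s|o) exactly by enumerating every state sequence to obtain Z_o. The feature functions, weights, and state names are toy assumptions for illustration only (real implementations use the dynamic program, not enumeration):

```python
import itertools
import math

STATES = ["O", "NAME"]

def features(prev_s, s, obs, t):
    """Binary feature functions f_k(s_t, s_t-1, o, t), as name -> value."""
    w = obs[t]
    return {"cap&%s" % s: float(w[0].isupper()),
            "lc&%s" % s: float(w[0].islower()),
            "trans:%s->%s" % (prev_s, s): 1.0}

weights = {"cap&NAME": 2.0, "lc&NAME": -1.0}   # unlisted features weigh 0

def score(seq, obs):
    """The exponent: sum over t and k of lambda_k * f_k(...)."""
    total, prev = 0.0, "START"
    for t, s in enumerate(seq):
        for name, v in features(prev, s, obs, t).items():
            total += weights.get(name, 0.0) * v
        prev = s
    return total

def prob(seq, obs):
    """p(s|o) = exp(score(s, o)) / Z_o, with Z_o summed over all sequences."""
    Z = sum(math.exp(score(q, obs))
            for q in itertools.product(STATES, repeat=len(obs)))
    return math.exp(score(seq, obs)) / Z

obs = ["Ben", "Rosenberg", "spoke"]
best = max(itertools.product(STATES, repeat=len(obs)),
           key=lambda q: prob(q, obs))
```

Under these toy weights, the most probable labeling tags the two capitalized words NAME and leaves "spoke" untagged; note that features are free to look at capitalization, neighbors, or anything else in o without the model having to generate o.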
Training CRFs
• Methods:
• iterative scaling (quite slow)
• conjugate gradient (much faster)
• limited-memory quasi-Newton methods, e.g. L-BFGS (super fast)
• Each iteration has complexity comparable to standard Baum-Welch.
[Sha & Pereira 2002] & [Malouf 2002]
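All of these optimizers follow the same gradient, which for CRFs has the classic form "observed feature counts minus model-expected feature counts." A tiny brute-force illustration with a single feature and plain gradient ascent follows; the data and feature are toy assumptions, and there is no regularizer, so the weight simply keeps growing toward the separable optimum:

```python
import itertools
import math

STATES = ["O", "N"]

def feat_count(seq, obs):
    """One toy feature: number of capitalized words labeled N."""
    return sum(1.0 for s, w in zip(seq, obs) if s == "N" and w[0].isupper())

def expected_count(w, obs):
    """Model expectation of the feature, by brute-force enumeration."""
    seqs = list(itertools.product(STATES, repeat=len(obs)))
    scores = [w * feat_count(q, obs) for q in seqs]
    Z = sum(math.exp(s) for s in scores)
    return sum(math.exp(s) / Z * feat_count(q, obs)
               for q, s in zip(seqs, scores))

obs, gold = ["Ben", "spoke"], ("N", "O")
w = 0.0
for _ in range(200):
    # dL/dw = observed count - expected count
    w += 0.5 * (feat_count(gold, obs) - expected_count(w, obs))
```

L-BFGS and conjugate gradient use exactly this gradient but choose much better step directions, which is why they converge in far fewer iterations than the naive loop above.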
General CRFs vs. HMMs • More general and expressive modeling technique • Comparable computational efficiency for inference • Features may be arbitrary functions of any or all observations • Parameters need not fully specify generation of observations; require less training data • Easy to incorporate domain knowledge
Main Point #1 Conditional probability sequence models give great flexibility regarding features used, and have efficient dynamic-programming-based algorithms for inference.
Outline
✓ The need. The task.
✓ Review of Random Field Models for IE.
• Recent work
• New results for Table Extraction
• New training procedure using Feature Induction
• New extended model for Co-reference Resolution
• New model for Factorial Sequence Labeling
• Conclusions
Table Extraction from Government Reports Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households. Milk Cows and Production of Milk and Milkfat: United States, 1993-95 -------------------------------------------------------------------------------- : : Production of Milk and Milkfat 2/ : Number :------------------------------------------------------- Year : of : Per Milk Cow : Percentage : Total :Milk Cows 1/:-------------------: of Fat in All :------------------ : : Milk : Milkfat : Milk Produced : Milk : Milkfat -------------------------------------------------------------------------------- : 1,000 Head --- Pounds --- Percent Million Pounds : 1993 : 9,589 15,704 575 3.66 150,582 5,514.4 1994 : 9,500 16,175 592 3.66 153,664 5,623.7 1995 : 9,461 16,451 602 3.66 155,644 5,694.3 -------------------------------------------------------------------------------- 1/ Average number during year, excluding heifers not yet fresh. 2/ Excludes milk sucked by calves.
Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, SIGIR 2003]
100+ documents from www.fedstats.gov, like the milk-production report above.
Labels (12 in all), assigned to each line by a CRF:
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...
Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with previous line
• ...
• Conjunctions of all previous features, with time offsets {0,0}, {-1,0}, {0,1}, {1,2}
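A sketch of how per-line features of this kind might be computed; the function and feature names here are our own, not the paper's code:

```python
def line_features(line, prev_line=""):
    """A few per-line features in the spirit of the slide's list."""
    n = max(len(line), 1)
    return {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith(" "),
        "has_5_spaces": " " * 5 in line,
        # some whitespace column is shared with the previous line
        "ws_aligns_with_prev": any(
            i < len(prev_line) and line[i] == " " and prev_line[i] == " "
            for i in range(len(line))),
    }

f = line_features("1993  :    9,589   15,704",
                  prev_line="Year  :    Milk    Milkfat")
```

On a data row like the one above, the digit percentage is high and the whitespace columns line up with the header row: exactly the kind of evidence that distinguishes Table Data Row lines from prose.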
Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, SIGIR 2003]

Method | Line labels (% correct) | Table segments (F1)
HMM | 65% | 64%
Stateless MaxEnt | 85% | -
CRF w/out conjunctions | 52% | 68%
CRF | 95% | 92%

Error reduction of the CRF over the HMM: Δerror = 85% on line labels, Δerror = 77% on table segments.
Main Point #2
Conditional Random Fields were more accurate in practice than a generative model on a table extraction task (and others), but still not on some others, such as Hindi NER.
Outline
✓ The need. The task.
✓ Review of Random Field Models for IE.
• Recent work
✓ New results for Table Extraction
• New training procedure using Feature Induction
• New extended model for Co-reference Resolution
• New model for Factorial Sequence Labeling
• Conclusions
A key remaining question for CRFs, given their freedom to include the kitchen sink's worth of features, is what features to include?
Feature Induction for CRFs [McCallum, 2003, UAI] • Begin with knowledge of atomic features, but no features yet in the model. • Consider many candidate features, including atomic and conjunctions. • Evaluate each candidate feature. • Add to the model some that are ranked highest. • Train the model.
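The loop above can be sketched on a much simpler model. The toy below does greedy induction for a binary logistic model, which mirrors the CRF procedure in a key respect: the current weights are held fixed, only the candidate's weight is fit (here by a few 1-D Newton steps), and candidates are ranked by the resulting log-likelihood gain. The data set and feature names are invented for illustration:

```python
import math

# Toy training set: feature sets with binary labels (all illustrative).
data = [({"cap", "short"}, 1), ({"cap"}, 1), ({"short"}, 0),
        (set(), 0), ({"digit"}, 0), ({"cap", "digit"}, 1)]

def loglik(weights, extra=()):
    ll = 0.0
    for feats, y in data:
        z = sum(weights.get(f, 0.0) for f in feats)
        z += sum(w for f, w in extra if f in feats)
        p = 1.0 / (1.0 + math.exp(-z))
        ll += math.log(p if y else 1.0 - p)
    return ll

def gain(weights, cand):
    """Fit only the candidate's weight (1-D Newton); report likelihood gain."""
    w = 0.0
    for _ in range(25):
        g = h = 0.0
        for feats, y in data:
            if cand in feats:
                z = sum(weights.get(f, 0.0) for f in feats) + w
                p = 1.0 / (1.0 + math.exp(-z))
                g += y - p            # gradient of the log-likelihood
                h -= p * (1.0 - p)    # Hessian (always negative)
        if abs(h) < 1e-9:
            break
        w -= g / h
    return loglik(weights, [(cand, w)]) - loglik(weights), w

model = {}
for _ in range(2):   # induce two features greedily
    cands = sorted({"cap", "short", "digit"} - set(model))
    best = max(cands, key=lambda c: gain(model, c)[0])
    model[best] = gain(model, best)[1]
```

Because "cap" perfectly predicts the label in this toy data, it is selected first with a large weight; later rounds then rank the remaining candidates by how much they improve on the model that already contains it.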
Candidate Feature Evaluation [McCallum, 2003, UAI]
Common method: information gain. True optimization criterion: likelihood of the training data.
The technical meat is in how to calculate this efficiently for CRFs:
• Mean-field approximation
• Emphasize error instances (related to boosting)
• Newton's method to set λ
Named Entity Recognition [McCallum & Li, 2003, CoNLL]
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22 South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Labels and examples:
PER: Yayuk Basuki, Innocent Butare
ORG: 3M, KDP, Leicestershire
LOC: Leicestershire, Nirmal Hriday, The Oval
MISC: Java, Basque, 1,000 Lakes Rally
Automatically Induced Features [McCallum & Li, 2003, CoNLL]

Index | Feature
0 | inside-noun-phrase (o_t-1)
5 | stopword (o_t)
20 | capitalized (o_t+1)
75 | word=the (o_t)
100 | in-person-lexicon (o_t-1)
200 | word=in (o_t+2)
500 | word=Republic (o_t+1)
711 | word=RBI (o_t) & header=BASEBALL
1027 | header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298 | company-suffix-word (firstmention_t+2)
4040 | location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_t-1)
4945 | moderately-rare-first-name (o_t-1) & very-common-last-name (o_t)
4474 | word=the (o_t-2) & word=of (o_t)
Named Entity Extraction Results [McCallum & Li, 2003, CoNLL]

Method | F1 | # features
CRFs without feature induction | 75% | ~1M
CRFs with feature induction (based on likelihood gain) | 90% | ~60k
Noun Phrase Segmentation
Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/O signed/O a/B tentative/I agreement/I extending/O its/B contract/I with/O Boeing/B Co./I to/O provide/O structural/B parts/I for/O Boeing/B 's/B 747/I jetliners./I

Method | Number of features
Combination of 8 SVMs [Kudo & Matsumoto 2001] | several million?
CRF with hand-engineered features [Sha & Pereira 2002] | 3.8 million -- 800 thousand
CRF with feature induction | 25 thousand

Statistical tie for best performance in the world.
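Reading the phrases back off such a B/I/O tagging is a small exercise; the helper below (our own naming, not from the cited papers) collects each B-initiated span:

```python
def bio_phrases(tags, words):
    """Collect phrases: B begins a phrase, I continues it, O is outside."""
    phrases, cur = [], []
    for tag, word in zip(tags, words):
        if tag == "B":
            if cur:
                phrases.append(" ".join(cur))   # flush the previous phrase
            cur = [word]
        elif tag == "I" and cur:
            cur.append(word)
        else:
            if cur:
                phrases.append(" ".join(cur))
            cur = []
    if cur:
        phrases.append(" ".join(cur))
    return phrases

tags = "B I I B I I O O O".split()
words = "Rockwell International Corp. 's Tulsa unit said it signed".split()
found = bio_phrases(tags, words)
```

On the first clause of the slide's example this yields the two noun phrases "Rockwell International Corp." and "'s Tulsa unit"; note that the B tag on "'s" is what separates two adjacent phrases.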
Optical Character Recognition (OCR) [Taskar, Guestrin, Koller, ICML 2003]
16x8 images of handwritten chars from 150 human subjects; char sequences forming words, average length 8. Train on ~600, test on ~5500.
Maximum-margin Markov networks (M3).
[Bar chart: classification error (roughly 10-30%) compared across LogReg, CRF, CRF with feature induction, SVM (linear, quadratic, cubic), and M3 (linear, quadratic, cubic).]
Main Point #3
One of CRFs' chief challenges, choosing the feature set, can also be addressed in a formal way, based on maximum likelihood. For CRFs (and many other methods), information gain is not the appropriate metric for selecting features.
Outline
✓ The need. The task.
✓ Review of Random Field Models for IE.
• Recent work
✓ New results for Table Extraction
✓ New training procedure using Feature Induction
• New extended model for Co-reference Resolution
• New model for Factorial Sequence Labeling
• Conclusions
IE in Context
Create ontology → Spider the document collection → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search → Data mining (Prediction, Outlier detection, Decision support)
In parallel: label training data → train extraction models.
Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"
Input: a news article with named-entity "mentions" tagged:
  Today Secretary of State Colin Powell met with ... he ... Condoleezza Rice ... Mr Powell ... she ... Powell ... President Bush ... Rice ... Bush ...
Output: the number of entities, N = 3, and their mention clusters:
  #1: Secretary of State Colin Powell, he, Mr. Powell, Powell
  #2: Condoleezza Rice, she, Rice
  #3: President Bush, Bush
Inside the Traditional Solution: a Pair-wise Affinity Metric
Mention (3): ". . . Powell . . ."  Mention (4): ". . . Mr Powell . . ."  Coreferent, Y/N?

Feature | Fires? | Weight
Two words in common | N | 29
One word in common | Y | 13
"Normalized" mentions are string identical | Y | 39
Capitalized word in common | Y | 17
> 50% character tri-gram overlap | Y | 19
< 25% character tri-gram overlap | N | -34
In same sentence | Y | 9
Within two sentences | Y | 8
Further than 3 sentences apart | N | -1
"Hobbs distance" < 3 | Y | 11
Number of entities in between two mentions = 0 | N | 12
Number of entities in between two mentions > 4 | N | -3
Font matches | Y | 1
Default | Y | -19

OVERALL SCORE = 98 > threshold = 0, so the pair is merged.
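The scheme boils down to a weighted sum of fired features compared against a threshold. A cut-down sketch using a few of the slide's features and weights; the string heuristics are our simplifications (e.g. raw string identity in place of the slide's "normalized" identity):

```python
def affinity(m1, m2):
    """Weighted sum of a few pair-wise features (weights from the slide)."""
    w1, w2 = set(m1.split()), set(m2.split())
    common = w1 & w2
    score = -19                                   # default
    if len(common) >= 2:
        score += 29                               # two words in common
    elif len(common) == 1:
        score += 13                               # one word in common
    if m1 == m2:
        score += 39                               # mentions string identical
    if any(w[0].isupper() for w in common):
        score += 17                               # capitalized word in common
    return score

def coreferent(m1, m2, threshold=0):
    return affinity(m1, m2) > threshold
```

Under this cut-down metric, "Mr Powell" and "Powell" score 11 and merge, while "Powell" and "she" share no surface evidence and stay apart; each decision is made looking only at one pair.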
The Problem
Pair-wise merging decisions are being made independently from each other:
• "Mr Powell" vs "Powell": affinity = 98 → Y
• "Powell" vs "she": affinity = 11 → Y
• "Mr Powell" vs "she": affinity = -104 → N
Affinity measures are noisy and imperfect, and the three decisions above are transitively inconsistent. They should be made in relational dependence with each other.
A UC Berkeley Solution [Russell 2001], [Pasula et al 2002]
(Applied to citation matching, and to object correspondence in vision.)
A generative model with per-entity variables (id, surname, gender, age, ...) and per-mention variables (context, words, distance, fonts, ...), for N entities.
Issues:
1) The generative model makes it difficult to use complex features.
2) The number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify the model structure during inference (MCMC).
A Markov Random Field for Co-reference [McCallum & Wellner, 2003, ICML submitted]
Make pair-wise merging decisions in dependent relation to each other by:
- calculating a joint probability,
- including all edge weights (here "Mr Powell"-"Powell": 45, "Mr Powell"-"she": -30, "Powell"-"she": 11),
- adding dependence on consistent triangles.
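The difference from pair-wise thresholding can be shown by brute force over the slide's three mentions and edge weights: score every partition by summing edge weights for within-cluster pairs (and their negations for split pairs), then take the best. The scoring convention below is a simplification of the model's joint probability:

```python
from itertools import product

mentions = ["Mr Powell", "Powell", "she"]
weights = {("Mr Powell", "Powell"): 45,
           ("Mr Powell", "she"): -30,
           ("Powell", "she"): 11}

def partition_score(labels):
    """Sum w_ij for same-cluster pairs, -w_ij for split pairs."""
    idx = {m: i for i, m in enumerate(mentions)}
    return sum(w if labels[idx[a]] == labels[idx[b]] else -w
               for (a, b), w in weights.items())

best = max(product(range(3), repeat=3), key=partition_score)
clusters = {m: best[i] for i, m in enumerate(mentions)}
```

The best partition groups "Mr Powell" with "Powell" and leaves "she" separate, even though the "Powell"-"she" edge is positive on its own: the strong negative "Mr Powell"-"she" evidence overrules it once all edges are considered jointly, which is exactly the point of the relational model.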