Information Extraction, Conditional Random Fields, and Social Network Analysis

Information Extraction, Conditional Random Fields, and Social Network Analysis. Andrew McCallum, Computer Science Department, University of Massachusetts Amherst. Joint work with Aron Culotta, Charles Sutton, Ben Wellner, Khashayar Rohanimanesh, Wei Li, Andres Corrada, Xuerui Wang.

Goal: Mine actionable knowledge from unstructured text.

Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1

A Portal for Job Openings

Job Openings: Category = High Tech Keyword = Java Location = U.S.

Data Mining the Extracted Job Information

IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

IE from Cargo Container Ship Manifests Cargo Tracking Div. US Navy

IE from Research Papers [McCallum et al ‘99]

IE from Research Papers

Mining Research Papers [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004] [Giles et al]

What is “Information Extraction”
As a family of techniques: Information Extraction = segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation


What is “Information Extraction”
As a family of techniques: Information Extraction = segmentation + classification + association + clustering
Applied to the same October 14, 2002 news passage, the four steps yield:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Larger Context
Document collection → Spider → Filter → IE → Database → Data Mining → Actionable knowledge
IE: Segment, Classify, Associate, Cluster
Data Mining: Discover patterns (entity types, links / relations, events)
Actionable knowledge: Prediction, Outlier detection, Decision support

Outline
• Examples of IE and Data Mining.
• Conditional Random Fields and Feature Induction.
• Joint inference: Motivation and examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Joint Segmentation and Co-ref (Iterated Conditional Samples)
• Two example projects
  • Email, contact address book, and Social Network Analysis
  • Research Paper search and analysis

Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
Graphical model / finite state model: a chain of states S_{t-1}, S_t, S_{t+1}, ... linked by transitions, where each state S_t emits an observation O_t.
Generates: a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8
Parameters, for all states S = {s1, s2, …}:
• Start state probabilities: P(s_t)
• Transition probabilities: P(s_t | s_{t-1})
• Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: Maximize probability of training observations (w/ prior)

IE with Hidden Markov Models
Given a sequence of observations: Yesterday Rich Caruana spoke this example sentence.
and a trained HMM with states: person name, location name, background
Find the most likely state sequence (Viterbi): Yesterday Rich Caruana spoke this example sentence.
Any words said to be generated by the designated “person name” state are extracted as a person name:
Person name: Rich Caruana
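The decoding step above can be sketched in a few lines. This is a minimal illustration, not the trained model from the talk: the two-state model, all probabilities, and the `unk` floor for unseen words are made-up assumptions chosen so the example sentence decodes as on the slide.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most likely state sequence for obs under an HMM (log-space)."""
    # best[t][s] = log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], unk))
             for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s].get(o, unk)))
            ptr[s] = prev
        best.append(scores)
        back.append(ptr)
    path = [max(states, key=lambda s: best[-1][s])]
    for ptr in reversed(back):          # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]

def extract(tokens, path, label):
    """Join maximal runs of tokens whose decoded state equals label."""
    spans, cur = [], []
    for tok, state in zip(tokens, path):
        if state == label:
            cur.append(tok)
        elif cur:
            spans.append(" ".join(cur))
            cur = []
    if cur:
        spans.append(" ".join(cur))
    return spans

# Toy, hand-set parameters (assumptions for illustration only).
STATES = ["background", "person"]
START = {"background": 0.9, "person": 0.1}
TRANS = {"background": {"background": 0.8, "person": 0.2},
         "person": {"background": 0.4, "person": 0.6}}
EMIT = {"background": {w: 0.2 for w in ["Yesterday", "spoke", "this", "example", "sentence"]},
        "person": {"Rich": 0.4, "Caruana": 0.4}}

tokens = ["Yesterday", "Rich", "Caruana", "spoke", "this", "example", "sentence"]
path = viterbi(tokens, STATES, START, TRANS, EMIT)
names = extract(tokens, path, "person")
print(names)
```

With these toy numbers, any path that assigns "Rich" or "Caruana" to the background state pays the `unk` penalty, so the decoder labels exactly those two tokens as a person name.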

We want More than an Atomic View of Words
Would like richer representation of text: many arbitrary, overlapping features of the words.
• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• last person name was female
• next two words are “and Associates”
Example: a single observation O_t is “Wisniewski”, is part of a noun phrase, and ends in “-ski”.
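As a rough sketch of what "many arbitrary, overlapping features" looks like in code, the function below computes a handful of the listed features for one token. The function name, feature strings, and the `city_names` argument are hypothetical; features that need external resources (WordNet, formatting, a noun-phrase chunker) are left out.

```python
def word_features(tokens, t, city_names=frozenset()):
    """Return the set of binary features that fire for tokens[t]."""
    w = tokens[t]
    feats = {"word=" + w.lower()}             # identity of word
    if w[:1].isupper():
        feats.add("is_capitalized")
    if w.lower().endswith("ski"):
        feats.add("ends_in_ski")
    if w in city_names:
        feats.add("in_city_list")
    if tokens[t + 1:t + 3] == ["and", "Associates"]:
        feats.add("next_two=and_Associates")  # looks ahead at future observations
    return feats

example = word_features(["Mr", "Wisniewski", "and", "Associates"], 1)
print(sorted(example))
```

Note that several features fire for the same token and are clearly correlated, which is exactly the situation the following slides discuss.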

Problems with Richer Representation and a Joint Model
These arbitrary features are not independent.
• Multiple levels of granularity (chars, words, phrases)
• Multiple dependent modalities (words, formatting, layout)
• Past & future
Two choices:
• Ignore the dependencies. This causes “over-counting” of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
• Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!

Conditional Sequence Models
• We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o):
  • Can examine features, but not responsible for generating them.
  • Don’t have to explicitly model their dependencies.
  • Don’t “waste modeling effort” trying to generate what we are given at test time anyway.

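From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Joint: a chain of states S_{t-1}, S_t, S_{t+1}, ..., each state S_t generating its observation O_t.
Conditional: the same chain of states S_{t-1}, S_t, S_{t+1}, ..., conditioned on the observations O_{t-1}, O_t, O_{t+1}, ...
(A super-special case of Conditional Random Fields.)
Set parameters by maximum likelihood, using an optimization method on the gradient of the likelihood L.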
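For reference, the two model families can be written side by side in their standard textbook forms. The notation here is an assumption of this sketch rather than something recovered from the slide: f_k are feature functions, λ_k their weights, Z(o) the normalizer, and P(s_1 | s_0) is read as the start-state probability.

```latex
% Joint (HMM): generates both states and observations
P(s, o) = \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)

% Conditional (linear-chain CRF): scores states given the observations
P(s \mid o) = \frac{1}{Z(o)} \prod_{t=1}^{T}
  \exp\Big( \sum_{k} \lambda_k\, f_k(s_{t-1}, s_t, o, t) \Big),
\qquad
Z(o) = \sum_{s'} \prod_{t=1}^{T}
  \exp\Big( \sum_{k} \lambda_k\, f_k(s'_{t-1}, s'_t, o, t) \Big)
```

Choosing indicator features for state transitions and for word emissions, with weights equal to the logs of the HMM probabilities, recovers the HMM's conditional distribution, which is one way to read "super-special case".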

Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
1. FSM special-case: linear chain among unknowns, parameters tied across time steps. States S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4} over observations O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}.
2. In general: CRFs = "Conditionally-trained Markov Network", arbitrary structure among unknowns.
3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: Parameters tied across hits from SQL-like queries ("clique templates").

Training CRFs
Gradient of the training objective for each feature weight = (feature count using correct labels) - (feature count using predicted labels) - (smoothing penalty)
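Written out, the three terms take the following standard form. This is a sketch under assumptions not stated on the slide: the smoothing penalty is taken to be a zero-mean Gaussian prior with variance σ², C_k counts how often feature f_k fires along a labeled sequence, and i indexes training sequences.

```latex
L = \sum_i \log P_\Lambda\big(s^{(i)} \mid o^{(i)}\big) - \sum_k \frac{\lambda_k^2}{2\sigma^2}

\frac{\partial L}{\partial \lambda_k}
  = \underbrace{\sum_i C_k\big(s^{(i)}, o^{(i)}\big)}_{\text{count using correct labels}}
  - \underbrace{\sum_i \sum_{s} P_\Lambda\big(s \mid o^{(i)}\big)\, C_k\big(s, o^{(i)}\big)}_{\text{count using predicted labels}}
  - \underbrace{\frac{\lambda_k}{\sigma^2}}_{\text{smoothing penalty}},
\qquad
C_k(s, o) = \sum_t f_k(s_{t-1}, s_t, o, t)
```

At the optimum the gradient is zero, so the expected feature counts under the model match the observed counts, up to the pull of the prior.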

Linear-chain CRFs vs. HMMs • Comparable computational efficiency for inference • Features may be arbitrary functions of any or all observations • Parameters need not fully specify generation of observations; can require less training data • Easy to incorporate domain knowledge

IE from Research Papers [McCallum et al ‘99]

IE from Research Papers
Method                             Field-level F1   Reference
Hidden Markov Models (HMMs)        75.6             [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7             [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9             [Peng, McCallum, 2004]
Error reduced by about 40% from SVMs to CRFs (10.3 to 6.1).

Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
      :            :          Production of Milk and Milkfat 2/
      :   Number   :-------------------------------------------------------
 Year :     of     :   Per Milk Cow    :  Percentage   :       Total
      :Milk Cows 1/:-------------------: of Fat in All :------------------
      :            :  Milk  : Milkfat  : Milk Produced :  Milk   : Milkfat
--------------------------------------------------------------------------------
      : 1,000 Head     --- Pounds ---       Percent        Million Pounds
      :
 1993 :    9,589     15,704     575          3.66        150,582   5,514.4
 1994 :    9,500     16,175     592          3.66        153,664   5,623.7
 1995 :    9,461     16,451     602          3.66        155,644   5,694.3
--------------------------------------------------------------------------------
 1/ Average number during year, excluding heifers not yet fresh.
 2/ Excludes milk sucked by calves.

Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from www.fedstats.gov
A CRF assigns one label to each line of the report (the milk production report is the running example).
Labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...
Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
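The per-line features listed above are simple to compute. The sketch below is one possible reading, not the cited system's code: the function names are hypothetical, and "whitespace aligns with prev." is approximated here as "a run of two or more spaces covers at least one column in both lines".

```python
import re

def gap_columns(line):
    """Columns covered by runs of 2+ spaces (trailing whitespace ignored)."""
    cols = set()
    for m in re.finditer(r" {2,}", line.rstrip()):
        cols.update(range(m.start(), m.end()))
    return cols

def line_features(line, prev_line=""):
    """Layout features for one text line, given the line before it."""
    n = max(len(line), 1)  # avoid division by zero on empty lines
    return {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": line[:1] in (" ", "\t"),
        "has_5_spaces": " " * 5 in line,
        "aligns_with_prev": bool(gap_columns(line) & gap_columns(prev_line)),
    }

row_1993 = "1993 :     9,589     15,704"
row_1994 = "1994 :     9,500     16,175"
prose = "Marketings include whole milk sold to plants"

data_feats = line_features(row_1994, prev_line=row_1993)
prose_feats = line_features(prose, prev_line=row_1994)
print(data_feats)
print(prose_feats)
```

A data row scores high on digits and aligns with the row above it, while a prose line is mostly alphabetic with no wide gaps; conjunctions of such features across neighbouring lines are what the last bullet refers to.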

Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Method                    Line labels, percent correct   Table segments, F1
HMM                       65 %                           64 %
Stateless MaxEnt          85 %                           -
CRF w/out conjunctions    52 %                           68 %
CRF                       95 %                           92 %

Feature Induction for CRFs [McCallum, 2003, UAI]
• Begin with knowledge of atomic features, but no features yet in the model.
• Consider many candidate features, including atomic and conjunctions.
• Evaluate each candidate feature.
• Add to the model some that are ranked highest.
• Train the model.
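The loop above can be sketched as a greedy procedure. Everything concrete here is an assumption for illustration: `gain` and `train` are caller-supplied placeholders (a real system would score candidates by likelihood gain and retrain the CRF), candidates are limited to atomic features and pairwise conjunctions of atomic features, and the toy gain table is invented.

```python
from itertools import combinations

def induce_features(atomic, gain, train, rounds=2, per_round=1):
    """Greedy feature induction: add the top-ranked candidates, then retrain."""
    model_features = []                       # no features in the model yet
    for _ in range(rounds):
        # Candidates: atomic features plus pairwise conjunctions of them.
        candidates = set(atomic)
        candidates |= {f"{a} & {b}" for a, b in combinations(sorted(atomic), 2)}
        candidates -= set(model_features)
        # Rank by gain, highest first; ties are broken alphabetically.
        ranked = sorted(candidates, key=lambda f: (-gain(f, model_features), f))
        model_features.extend(ranked[:per_round])
        train(model_features)                 # re-estimate weights with new features
    return model_features

# Invented gains; this toy scorer ignores the current model, a real one would not.
TOY_GAIN = {"capitalized": 3.0,
            "in-person-lexicon": 2.0,
            "capitalized & in-person-lexicon": 5.0}
trained_on = []
chosen = induce_features(
    ["capitalized", "in-person-lexicon", "stopword"],
    gain=lambda f, model: TOY_GAIN.get(f, 0.0),
    train=lambda feats: trained_on.append(list(feats)),
)
print(chosen)
```

Keeping `gain` as a parameter makes the choice of evaluation criterion explicit, which is the topic of the next slide.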

Candidate Feature Evaluation [McCallum, 2003, UAI]
Common method: Information Gain
True optimization criterion: Likelihood of training data
• Technical meat is in how to calculate this efficiently for CRFs
  • Mean field approximation
  • Emphasize error instances (related to Boosting)
  • Newton's method to set λ

Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Labels and examples:
PER: Yayuk Basuki, Innocent Butare
ORG: 3M, KDP, Cleveland
LOC: Cleveland, Nirmal Hriday, The Oval
MISC: Java, Basque, 1,000 Lakes Rally

Automatically Induced Features [McCallum & Li, 2003, CoNLL]
Index   Feature
0       inside-noun-phrase (o_{t-1})
5       stopword (o_t)
20      capitalized (o_{t+1})
75      word=the (o_t)
100     in-person-lexicon (o_{t-1})
200     word=in (o_{t+2})
500     word=Republic (o_{t+1})
711     word=RBI (o_t) & header=BASEBALL
1027    header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298    company-suffix-word (firstmention_{t+2})
4040    location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4945    moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)
4474    word=the (o_{t-2}) & word=of (o_t)

Named Entity Extraction Results [McCallum & Li, 2003, CoNLL]
Method                                                  F1
HMMs (BBN's Identifinder)                               73%
CRFs w/out Feature Induction                            83%
CRFs with Feature Induction (based on LikelihoodGain)   90%

Outline
• Examples of IE and Data Mining.
• Conditional Random Fields and Feature Induction.
• Joint inference: Motivation and examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Joint Segmentation and Co-ref (Iterated Conditional Samples)
• Two example projects
  • Email, contact address book, and Social Network Analysis
  • Research Paper search and analysis

Problem: Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities. KD begins from a populated DB, unaware of where the data came from or of its inherent uncertainties. IE is unaware of emerging patterns and regularities in the DB. The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

Solution:
Document collection → Spider → Filter → IE → Database → Data Mining → Actionable knowledge
IE: Segment, Classify, Associate, Cluster
Data Mining: Discover patterns (entity types, links / relations, events)
Actionable knowledge: Prediction, Outlier detection, Decision support
IE passes Uncertainty Info forward to Data Mining; Data Mining passes Emerging Patterns back to IE.

Solution: Unified Model
Document collection → Spider → Filter → Probabilistic Model → Actionable knowledge
The Probabilistic Model spans both IE (Segment, Classify, Associate, Cluster) and Data Mining (Discover patterns: entity types, links / relations, events).
Actionable knowledge: Prediction, Outlier detection, Decision support
Discriminatively-trained undirected graphical models:
• Conditional Random Fields [Lafferty, McCallum, Pereira]
• Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]
Complex Inference and Learning: just what we researchers like to sink our teeth into!

1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]
Layers, top to bottom: Named-entity tag, Noun-phrase boundaries, Part-of-speech, English words
But errors cascade: must be perfect at every stage to do well.
Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data.
Inference: Tree reparameterization BP [Wainwright et al, 2002]

2. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]
… Senator Joe Green said today … . Green ran for …
In a linear chain, the dependency among similar, distant mentions is ignored.
14% reduction in error on most repeated field in email seminar announcements.
Inference: Tree reparameterization BP [Wainwright et al, 2002]
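One way to picture the extra structure is as a list of "skip edges" added on top of the linear chain. The sketch below assumes a simple similarity rule, identical capitalized tokens, and a hypothetical function name; it only builds the edge list and does not perform inference.

```python
from collections import defaultdict

def skip_edges(tokens):
    """Return index pairs (i, j), i < j, linking identical capitalized tokens."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            positions[tok].append(i)
    edges = []
    for idxs in positions.values():
        # connect every pair of occurrences of the same token
        edges.extend((i, j) for a, i in enumerate(idxs) for j in idxs[a + 1:])
    return sorted(edges)

tokens = ["Senator", "Joe", "Green", "said", "today", ".", "Green", "ran", "for"]
edges = skip_edges(tokens)
print(edges)
```

Here the two occurrences of "Green" are linked, so evidence from "Senator Joe Green" can influence the label of the later, more ambiguous "Green".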

3. Joint co-reference among all pairs: Affinity Matrix CRF
“Entity resolution”, “Object correspondence”
Mentions: . . . Mr Powell . . . , . . . Powell . . . , . . . she . . .
Each pair of mentions carries a Y/N coreference decision; the three pairwise affinities shown are 45, -99, and 11.
~25% reduction in error on co-reference of proper nouns in newswire.
Inference: Correlation clustering graph partitioning [Bansal, Blum, Chawla, 2002]
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

Coreference Resolution
AKA "record linkage", "database record deduplication", "entity resolution", "object correspondence", "identity uncertainty"
Input: News article, with named-entity "mentions" tagged
Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . .
Output: Number of entities, N = 3
#1: Secretary of State Colin Powell, he, Mr. Powell, Powell
#2: Condoleezza Rice, she, Rice
#3: President Bush, Bush

Inside the Traditional Solution
Pair-wise Affinity Metric
Mention (3): . . . Powell . . .    Mention (4): . . . Mr Powell . . .
Y/N?   Feature                                          Weight
N      Two words in common                              29
Y      One word in common                               13
Y      "Normalized" mentions are string identical       39
Y      Capitalized word in common                       17
Y      > 50% character tri-gram overlap                 19
N      < 25% character tri-gram overlap                 -34
Y      In same sentence                                 9
Y      Within two sentences                             8
N      Further than 3 sentences apart                   -1
Y      "Hobbs Distance" < 3                             11
N      Number of entities in between two mentions = 0   12
N      Number of entities in between two mentions > 4   -3
Y      Font matches                                     1
Y      Default                                          -19
OVERALL SCORE = 98 > threshold = 0
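The overall score is just the sum of the weights of the features that fire. The short check below copies the Y/N column and the weights from the slide; the variable names are only for illustration.

```python
# (fires?, feature, weight), copied from the slide for the Powell / Mr Powell pair
FEATURES = [
    (False, "Two words in common", 29),
    (True,  "One word in common", 13),
    (True,  '"Normalized" mentions are string identical', 39),
    (True,  "Capitalized word in common", 17),
    (True,  "> 50% character tri-gram overlap", 19),
    (False, "< 25% character tri-gram overlap", -34),
    (True,  "In same sentence", 9),
    (True,  "Within two sentences", 8),
    (False, "Further than 3 sentences apart", -1),
    (True,  '"Hobbs Distance" < 3', 11),
    (False, "Number of entities in between two mentions = 0", 12),
    (False, "Number of entities in between two mentions > 4", -3),
    (True,  "Font matches", 1),
    (True,  "Default", -19),
]
THRESHOLD = 0

score = sum(weight for fires, _, weight in FEATURES if fires)
same_entity = score > THRESHOLD
print(score, same_entity)
```

The nine active weights add up to the slide's overall score of 98, which is above the threshold, so this pair is merged.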

The Problem
Pair-wise merging decisions are being made independently from each other. They should be made in relational dependence with each other.
Example among the mentions . . . Mr Powell . . . , . . . Powell . . . , and . . . she . . . : affinity = 98 (Y), affinity = -104 (N), affinity = 11 (Y).
Affinity measures are noisy and imperfect.
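A tiny brute-force example shows why the decisions need to be made together. The assignment of the three affinities to specific pairs is an assumption of this sketch (the slide's diagram is not recoverable), and exhaustive enumeration stands in for the approximate graph-partitioning inference used in practice; it is only feasible for a handful of mentions.

```python
def partitions(items):
    """Yield every way to split items into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                       # first forms its own cluster
        for i in range(len(part)):                   # or joins an existing cluster
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def within_cluster_affinity(part, affinity):
    """Sum the affinities of all mention pairs placed in the same cluster."""
    total = 0
    for cluster in part:
        for a, x in enumerate(cluster):
            for y in cluster[a + 1:]:
                total += affinity[frozenset((x, y))]
    return total

mentions = ["Mr Powell", "Powell", "she"]
affinity = {                                         # pair assignment is assumed
    frozenset(("Mr Powell", "Powell")): 98,
    frozenset(("Mr Powell", "she")): -104,
    frozenset(("Powell", "she")): 11,
}

# Independent thresholding at 0 says Y, N, Y, which is not a valid clustering.
independent = {pair: a > 0 for pair, a in affinity.items()}

best = max(partitions(mentions), key=lambda p: within_cluster_affinity(p, affinity))
best_score = within_cluster_affinity(best, affinity)
print(best, best_score)
```

Under these assumed edges, merging all three mentions scores 98 - 104 + 11 = 5, while grouping "Mr Powell" with "Powell" and leaving "she" alone scores 98, so the joint decision overrides the weak positive link to "she".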
