Information Extraction Sunita Sarawagi IIT Bombay http://www.it.iitb.ac.in/~sunita
Information Extraction (IE) & Integration
The Extraction task: given
• E: a set of structured elements
• S: an unstructured source
extract all instances of E from S.
• Many versions involving many source types
• Actively researched in varied communities
• Several tools and techniques
• Several commercial applications
IE from free-format text
• Classical Named Entity Recognition: extract person, location, organization names
  "According to Robert Callahan, president of Eastern's flight attendants union, the past practice of Eastern's parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier's terms."
• Several applications
  • News tracking: monitor events
  • Bio-informatics: protein and gene names from publications
  • Customer care: part number, problem description from emails in help centers
Problem definition
Source: a concatenation of structured elements with limited reordering and some missing fields
• Example: addresses, bibliographic records
  Address: "156 Hillside ctype Scenic drive Powai Mumbai 400076", segmented into House number, Building, Road, Area, City, Zip
  Citation: "P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.", segmented into Author, Year, Title, Journal, Volume, Page
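As a concrete illustration, a sketch of what the target output for the address example might look like; the exact field boundaries are my reading of the slide's annotations:

# One possible target segmentation of the example address
address = "156 Hillside ctype Scenic drive Powai Mumbai 400076"
segmented = {
    "House number": "156",
    "Building":     "Hillside ctype",
    "Road":         "Scenic drive",
    "Area":         "Powai",
    "City":         "Mumbai",
    "Zip":          "400076",
}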
Relation Extraction: Disease Outbreaks
• Extract structured relations from text
  "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
[Figure: news text passed through an information extraction system (e.g., NYU's Proteus) to produce a table of disease outbreaks reported in The New York Times]
Personal Information Systems
• Automatically add a bibtex entry for a paper I download
• Integrate a resume arriving in email with the candidates database
[Figure: a personal information space linking Papers, Files, People, Emails, Web pages, Projects, and Resumes]
Hand-Coded Methods
• Easy to construct in many cases
  • e.g., to recognize prices, phone numbers, zip codes, conference names, etc.
• Easier to debug & maintain
  • Especially if written in a "high-level" language (as is usually the case), e.g. [from Avatar]:
    ContactPattern ← RegularExpression(Email.body, "can be reached at")
    PersonPhone ← Precedes(Person, Precedes(ContactPattern, Phone, D), D)
• Easier to incorporate / reuse domain knowledge
• Can be quite labor intensive to write
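A minimal sketch of what such a rule amounts to in plain Python regular expressions rather than the Avatar rule language shown above; the phone-number pattern and the crude name pattern are my own illustrative assumptions:

import re

# Assumed phone pattern for illustration; real systems use far richer patterns.
PHONE = r"(?:\+?\d{1,3}[-\s])?(?:\(\d{3}\)\s?|\d{3}[-\s])\d{3}[-\s]?\d{4}"

# "ContactPattern": the phrase "can be reached at" in the email body.
# "PersonPhone": a capitalized name preceding it, followed by a phone number.
person_phone = re.compile(
    r"(?P<person>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"   # crude person-name pattern
    r".{0,40}?can be reached at\s*"                # the contact pattern
    r"(?P<phone>" + PHONE + r")"
)

body = "Robert Callahan can be reached at (404) 555-0123 during office hours."
m = person_phone.search(body)
if m:
    print(m.group("person"), "->", m.group("phone"))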
Example of Hand-Coded Entity Tagger [Ramakrishnan G., 2005; slides from Doan et al., SIGMOD 2006]
Rule 1: finds person names with a salutation (e.g. Dr. Laura Haas) followed by two capitalized words
  <token>INITIAL</token> <token>DOT</token> <token>CAPSWORD</token> <token>CAPSWORD</token>
Rule 2: finds person names where two capitalized words are present in a Person dictionary
  <token>PERSONDICT, CAPSWORD</token> <token>PERSONDICT, CAPSWORD</token>
CAPSWORD: a word starting with an uppercase letter whose second letter is lowercase, i.e. \p{Upper}\p{Lower}[\p{Alpha}]{1,25} (DeWitt satisfies it; DEWITT does not)
DOT: the character '.'
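A rough Python equivalent of Rule 1, with ASCII character classes standing in for the Unicode \p{...} classes; the INITIAL pattern and the example sentence are illustrative assumptions, not code from the cited system:

import re

INITIAL  = r"[A-Z][a-z]{0,3}"            # salutation-like token, e.g. "Dr", "Prof" (assumed)
CAPSWORD = r"[A-Z][a-z][A-Za-z]{1,25}"   # ASCII rendering of \p{Upper}\p{Lower}[\p{Alpha}]{1,25}

# Rule 1: INITIAL DOT CAPSWORD CAPSWORD, e.g. "Dr. Laura Haas"
rule1 = re.compile(rf"\b{INITIAL}\.\s+{CAPSWORD}\s+{CAPSWORD}\b")

print(rule1.findall("Please welcome Dr. Laura Haas and DR. JOHN DOE."))
# -> ['Dr. Laura Haas']   (the all-caps variant does not satisfy CAPSWORD)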
Hand Coded Rule Example: Conference Name

# These are subordinate patterns
my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)";                           # A word starting with a capital letter and ending with 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)";  # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))";  # Conference abbreviations like "(SIGMOD'06)"

# The actual pattern we search for. A typical conference name this pattern will find is
# "3rd International Conference on Blah Blah Blah (ICBBB-05)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";

##############################################################
# Given a <dbworldMessage>, look for the conference pattern
##############################################################
lookForPattern($dbworldMessage, $fullNamePattern);

##############################################################
# In a given <file>, look for occurrences of <pattern>
# <pattern> is a regular expression
##############################################################
sub lookForPattern {
    my ($file, $pattern) = @_;
Some Hand Coded Entity Taggers • FRUMP [DeJong 82] • CIRCUS / AutoSlog [Riloff 93] • SRI FASTUS [Appelt, 1996] • MITRE Alembic (available for use) • Alias-I LingPipe (available for use) • OSMX [Embley, 2005] • DBLife [Doan et al, 2006] • Avatar [Jayram et al, 2006]
Learning models for extraction
• Rule-based extractors
  • For each label, build two classifiers, one for each of its two boundaries (start and end)
  • Each classifier: a sequence of rules
  • Each rule: a conjunction of predicates, e.g. if the previous token is a last name, the current token is ".", and the next token is an article, then mark the start of a title (see the sketch after this slide)
  • Examples: Rapier, GATE, LP2 & several more
• Critique of rule-based approaches
  • Cannot output meaningful uncertainty values
  • Brittle
  • Limited flexibility in the clues that can be exploited
  • Not very good at combining several weak clues
  • (Pro) Somewhat easier to tune
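A minimal sketch of one such boundary rule as a conjunction of token predicates; the predicate names and the tiny dictionaries are illustrative assumptions, not taken from Rapier, GATE, or LP2:

# Hypothetical "start of Title" rule: previous token is a last name,
# current token is ".", next token is an article.
LAST_NAMES = {"Singh", "Dordick", "Clark"}      # stand-in for a name dictionary
ARTICLES   = {"a", "an", "the", "A", "An", "The"}

def title_starts_after(tokens, i):
    """True if position i+1 should be marked as the start of a Title."""
    return (
        0 < i < len(tokens) - 1
        and tokens[i - 1] in LAST_NAMES
        and tokens[i] == "."
        and tokens[i + 1] in ARTICLES
    )

tokens = "J . S . Dordick . The protein engineering study".split()
print([i + 1 for i in range(len(tokens)) if title_starts_after(tokens, i)])
# -> [6], i.e. the title starts at the token "The"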
Statistical models of IE
• Generative models like HMMs
  • Intuitive
  • Very restricted feature sets ⇒ lower accuracy
  • Output probabilities are highly skewed (as with their counterpart, naïve Bayes)
• Conditional (discriminative) models
  • Local models: maximum entropy models
  • Global models: Conditional Random Fields
  • Conditional models output meaningful probabilities, are flexible, generalize well, and are getting increasingly popular
  • State-of-the-art!
IE with Hidden Markov Models
• Probabilistic models for IE
[Figure: an HMM with one state per element (Title, Author, Journal, Year), transition probabilities between states, and per-state emission probabilities over symbols]
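A toy version of such an HMM, just to show the shape of the model; the states follow the bibliographic example, but the probabilities are invented for illustration:

# A toy bibliographic HMM: one state per element, illustrative probabilities only.
states = ["Author", "Year", "Title", "Journal"]

start_prob = {"Author": 0.9, "Year": 0.05, "Title": 0.05, "Journal": 0.0}

trans_prob = {                     # P(next state | current state)
    "Author":  {"Author": 0.5, "Year": 0.4, "Title": 0.1, "Journal": 0.0},
    "Year":    {"Author": 0.0, "Year": 0.1, "Title": 0.9, "Journal": 0.0},
    "Title":   {"Author": 0.0, "Year": 0.0, "Title": 0.8, "Journal": 0.2},
    "Journal": {"Author": 0.0, "Year": 0.0, "Title": 0.0, "Journal": 1.0},
}

emit_prob = {                      # P(word | state): a multinomial per state
    "Author":  {"Dordick": 0.2, "Clark": 0.2, ".": 0.3, "<other>": 0.3},
    "Year":    {"1993": 0.5, "(": 0.25, ")": 0.25},
    "Title":   {"Protein": 0.1, "Engineering": 0.1, "<other>": 0.8},
    "Journal": {"J.Amer.": 0.3, "Chem.": 0.3, "Soc.": 0.3, "<other>": 0.1},
}

# The joint probability of a labelled sequence factorizes over the chain:
words = ["Dordick", "1993", "Protein"]
path  = ["Author", "Year", "Title"]
p = start_prob[path[0]] * emit_prob[path[0]][words[0]]
for i in range(1, len(words)):
    p *= trans_prob[path[i - 1]][path[i]] * emit_prob[path[i]][words[i]]
print(p)   # 0.9 * 0.2 * (0.4 * 0.5) * (0.9 * 0.1) = 0.00324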
HMM Structure
• Naïve model: one state per element
• Nested model: each element is itself an HMM
HMM Dictionary
• For each word (= feature), associate the probability of emitting that word: a multinomial model
• More advanced models use overlapping features of a word, e.g.
  • part of speech
  • capitalized or not
  • type: number, letter, word, etc.
• Maximum entropy models (McCallum 2000)
Learning model parameters
• When the training data defines a unique path through the HMM:
  • Transition probabilities: P(state i → state j) = (number of transitions from i to j) / (total number of transitions out of state i)
  • Emission probabilities: P(symbol k | state i) = (number of times k is generated in state i) / (total number of symbols emitted from state i)
• When the training data defines multiple paths:
  • A more general EM-like algorithm (Baum-Welch)
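A small sketch of these maximum-likelihood counts for the fully labelled case; the toy training sequences below are invented for illustration:

from collections import Counter, defaultdict

# Each training example: a list of (word, state) pairs, i.e. a unique path through the HMM.
training = [
    [("156", "House"), ("Scenic", "Road"), ("drive", "Road"), ("Mumbai", "City"), ("400076", "Pin")],
    [("22", "House"), ("Hill", "Road"), ("road", "Road"), ("Powai", "Area"), ("Mumbai", "City"), ("400076", "Pin")],
]

trans_counts = defaultdict(Counter)   # trans_counts[i][j] = number of transitions i -> j
emit_counts  = defaultdict(Counter)   # emit_counts[i][w]  = number of times state i emits w

for seq in training:
    for (word, state) in seq:
        emit_counts[state][word] += 1
    for (_, s1), (_, s2) in zip(seq, seq[1:]):
        trans_counts[s1][s2] += 1

trans_prob = {i: {j: c / sum(cs.values()) for j, c in cs.items()} for i, cs in trans_counts.items()}
emit_prob  = {i: {w: c / sum(cs.values()) for w, c in cs.items()} for i, cs in emit_counts.items()}

print(trans_prob["Road"])   # {'Road': 0.5, 'City': 0.25, 'Area': 0.25}
print(emit_prob["City"])    # {'Mumbai': 1.0}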
Using the HMM to segment
• Find the highest-probability path through the HMM
• Viterbi: a dynamic programming algorithm, quadratic in the number of states and linear in the sequence length
[Figure: Viterbi trellis for "115 Grant street Mumbai 400070", one column per token, one row per state (House, Road, City, Pin)]
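A compact Viterbi sketch over a toy address model; the recurrence is the standard one, but the states, probabilities, and smoothing constant are illustrative assumptions:

def viterbi(words, states, start_p, trans_p, emit_p, unk=1e-6):
    # V[t][s]: probability of the best path ending in state s at position t
    # (kept in plain probability space for brevity; real code would use logs)
    V = [{s: start_p.get(s, 0.0) * emit_p[s].get(words[0], unk) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            prev, p = max(((r, V[t - 1][r] * trans_p[r].get(s, 0.0)) for r in states),
                          key=lambda x: x[1])
            V[t][s] = p * emit_p[s].get(words[t], unk)
            back[t][s] = prev
    # Trace back from the best final state
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states  = ["House", "Road", "City", "Pin"]
start_p = {"House": 1.0}
trans_p = {
    "House": {"Road": 1.0},
    "Road":  {"Road": 0.5, "City": 0.5},
    "City":  {"City": 0.5, "Pin": 0.5},
    "Pin":   {"Pin": 1.0},
}
emit_p = {
    "House": {"115": 0.5, "156": 0.5},
    "Road":  {"Grant": 0.4, "street": 0.4, "drive": 0.2},
    "City":  {"Mumbai": 0.9, "Powai": 0.1},
    "Pin":   {"400070": 0.5, "400076": 0.5},
}

print(viterbi("115 Grant street Mumbai 400070".split(), states, start_p, trans_p, emit_p))
# -> ['House', 'Road', 'Road', 'City', 'Pin']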
Comparative Evaluation
• Naïve model – one state per element in the HMM
• Independent HMM – one HMM per element
• Rule learning method – Rapier
• Nested model – each state in the Naïve model replaced by an HMM
Results: Comparative Evaluation
The Nested model does best in all three cases (from Borkar 2001)
HMM approach: summary
• Inter-element sequencing → outer HMM transitions
• Intra-element sequencing → inner HMM
• Element length → multi-state inner HMM
• Characteristic words → dictionary
• Non-overlapping tags → global optimization
Statistical models of IE
• Generative models like HMMs
  • Intuitive
  • Very restricted feature sets ⇒ lower accuracy
  • Output probabilities are highly skewed (as with their counterpart, naïve Bayes)
• Conditional (discriminative) models
  • Local models: maximum entropy models
  • Global models: Conditional Random Fields
  • Conditional models output meaningful probabilities, are flexible, generalize well, and are getting increasingly popular
  • State-of-the-art!
Basic chain model for extraction
[Figure: the sentence "My review of Fermat's last theorem by S. Singh" as tokens x1…x9, each with a label y1…y9 predicted independently per position: the independent model]
Features
• The word as-is
• Orthographic word properties: capitalized? digit? ends with a dot?
• Part of speech: noun?
• Match in a dictionary: appears in a dictionary of people names? appears in a list of stop-words?
• Fire these for each label, and for
  • the token itself,
  • tokens up to W positions to the left or right, or
  • concatenations of tokens
(a sketch follows this slide)
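A small sketch of such position-wise features, represented here as indicator strings fired jointly with the current label; the naming scheme and the tiny dictionaries are my own illustrative choices:

PEOPLE_DICT = {"Singh", "Fermat"}          # stand-in for a dictionary of people names
STOP_WORDS  = {"my", "of", "by", "the"}

def token_features(tokens, i, label):
    """Indicator features for position i, fired jointly with the current label."""
    w = tokens[i]
    feats = [
        f"word={w.lower()}|y={label}",
        f"is_capitalized={w[0].isupper()}|y={label}",
        f"is_digit={w.isdigit()}|y={label}",
        f"ends_with_dot={w.endswith('.')}|y={label}",
        f"in_people_dict={w.strip('.') in PEOPLE_DICT}|y={label}",
        f"is_stopword={w.lower() in STOP_WORDS}|y={label}",
    ]
    # Same idea for neighbouring tokens (window W = 1)
    for offset in (-1, +1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats.append(f"word@{offset:+d}={tokens[j].lower()}|y={label}")
    return feats

tokens = "My review of Fermat's last theorem by S. Singh".split()
print(token_features(tokens, 8, "Author"))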
Basic chain model for extraction
[Figure: the same sentence "My review of Fermat's last theorem by S. Singh" with labels y1…y9, now linked in a chain: a global conditional model over Pr(y1, y2, …, y9 | x)]
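For reference, the standard way to write such a chain-structured conditional model, consistent with the Pr(y1, y2, …, y9 | x) above; f and w are the per-position features and weights discussed on the next slide:

\Pr(y_1,\dots,y_n \mid x)
  = \frac{1}{Z(x)} \exp\!\Big(\sum_{i=1}^{n} \mathbf{w}\cdot\mathbf{f}(y_{i-1}, y_i, x, i)\Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big(\sum_{i=1}^{n} \mathbf{w}\cdot\mathbf{f}(y'_{i-1}, y'_i, x, i)\Big)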
Features
• A feature vector for each position (user provided): a function of the previous label, the i-th label, and word i & its neighbors
• Examples: see the "features with weights" slide below
• Parameters: a weight for each feature, i.e. a weight vector (machine learnt)
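A tiny sketch of how the learnt weights and the per-position features combine into an unnormalized score for a candidate labelling, assuming the indicator-string features of the earlier sketch; the weights are made up:

# Hypothetical learnt weights for a few indicator features; unseen features get weight 0.
weights = {
    "in_people_dict=True|y=Author": 2.1,
    "is_capitalized=True|y=Author": 0.7,
    "is_stopword=True|y=Other":     1.3,
    "word=review|y=Title":          1.8,
}

def sequence_score(feature_lists):
    """Sum of w . f(y_{i-1}, y_i, x, i) over all positions (unnormalized log-score)."""
    return sum(weights.get(f, 0.0) for feats in feature_lists for f in feats)

# Feature lists for a candidate labelling, one list per position (abbreviated here).
candidate = [
    ["is_stopword=True|y=Other"],                                     # "My"
    ["word=review|y=Title"],                                          # "review"
    ["in_people_dict=True|y=Author", "is_capitalized=True|y=Author"], # "Singh"
]
print(sequence_score(candidate))   # 1.3 + 1.8 + 2.1 + 0.7 = 5.9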
Transforming real-world extraction
• Partition each label into different parts?
• Independent extraction per label?
• Per-token tags: Begin, Continue, End, Unique, Other (see the sketch after this slide)
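A sketch of this per-token encoding for one labelled fragment; the encoding function is a generic Begin/Continue/End/Unique/Other scheme based on the tag names on the slide:

def encode(segments):
    """segments: list of (list_of_tokens, label); label None means unlabeled text."""
    tags = []
    for tokens, label in segments:
        if label is None:
            tags += [(t, "Other") for t in tokens]
        elif len(tokens) == 1:
            tags += [(tokens[0], f"{label}-Unique")]
        else:
            tags += [(tokens[0], f"{label}-Begin")]
            tags += [(t, f"{label}-Continue") for t in tokens[1:-1]]
            tags += [(tokens[-1], f"{label}-End")]
    return tags

segments = [
    (["Fermat's", "last", "theorem"], "Title"),
    (["by"], None),
    (["S.", "Singh"], "Author"),
]
for token, tag in encode(segments):
    print(token, tag)
# Fermat's Title-Begin / last Title-Continue / theorem Title-End /
# by Other / S. Author-Begin / Singh Author-End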
Examples: features with weights (publications domain). In practice a large number of such features is used.
Typical numbers
• Seminar announcements (CMU): speaker, location, timings
  • SVMs for start/end boundaries, 250 training examples
  • F1: 85% for speaker and location, 92% for timings (Finn & Kushmerick '04)
• Job postings in newsgroups: 17 fields (title, location, company, language, etc.)
  • 150 training examples
  • F1: 84% overall with LP2 (Lavelli et al. '04)
Publications
• Cora dataset
  • Paper headers: extract title, author, affiliation, address, email, abstract
    • 94% F1 with CRFs, 76% F1 with HMMs
  • Paper citations: extract title, author, date, editor, booktitle, pages, institution
    • 91% F1 with CRFs, 78% F1 with HMMs
(Peng & McCallum 2004)