SIMS 290-2: Applied Natural Language Processing Marti Hearst October 13, 2004
Today • Finish hand-built rule systems • Machine Learning approaches to information extraction • Sliding Windows • Rule-learners (older) • Feature-based ML (more recent) • IE tools
Two kinds of NE approaches
Knowledge Engineering (rule based)
• developed by experienced language engineers
• make use of human intuition
• requires only small amount of training data
• development could be very time consuming
• some changes may be hard to accommodate
Learning Systems
• use statistics or other machine learning
• developers do not need LE expertise
• requires large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus
• annotators are cheap (but you get what you pay for!)
Adapted from slides by Cunningham & Bontcheva
Baseline: list lookup approach • System that recognises only entities stored in its lists (gazetteers) • Advantages: simple, fast, language independent, easy to retarget (just create new lists) • Disadvantages: impossible to enumerate all names; lists must be collected and maintained; cannot deal with name variants; cannot resolve ambiguity Adapted from slides by Cunningham & Bontcheva
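A minimal sketch of this baseline in Python; the toy gazetteer entries and the greedy longest-match policy are illustrative assumptions, not any particular system's implementation:

  # Multi-word gazetteer entries mapped to entity types (toy examples).
  GAZETTEER = {
      ("New", "York"): "LOCATION",
      ("United", "Nations"): "ORGANIZATION",
      ("Marti", "Hearst"): "PERSON",
  }
  MAX_LEN = max(len(entry) for entry in GAZETTEER)

  def lookup_entities(tokens):
      """Return (start, end, type) spans for gazetteer matches,
      preferring the longest match at each position."""
      spans, i = [], 0
      while i < len(tokens):
          for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
              if tuple(tokens[i:i + n]) in GAZETTEER:
                  spans.append((i, i + n, GAZETTEER[tuple(tokens[i:i + n])]))
                  i += n  # greedy: no ambiguity resolution, as noted above
                  break
          else:
              i += 1
      return spans

  print(lookup_entities("Marti Hearst visited New York".split()))
  # [(0, 2, 'PERSON'), (3, 5, 'LOCATION')]

Note that greedy longest match sidesteps, but does not solve, the ambiguity problems listed above.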
Creating Gazetteer Lists • Online phone directories and yellow pages for person and organisation names (e.g. [Paskaleva02]) • Location lists • US GEOnet Names Server (GNS) data – 3.9 million locations with 5.37 million names (e.g., [Manov03]) • UN site: http://unstats.un.org/unsd/citydata • Global Discovery database from Europa Technologies Ltd, UK (e.g., [Ignat03]) • Automatic collection from annotated training data Adapted from slides by Cunningham & Bontcheva
Rule-based Example: FACILE • FACILE – used in MUC-7 [Black et al 98] • Uses Inxight's LinguistiX tools for tagging and morphological analysis • Database for external information, role similar to a gazetteer • Linguistic info per token, encoded as a feature vector: • Text offsets • Orthographic pattern (first/all capitals, mixed, lowercase) • Token and its normalised form • Syntax – category and features • Semantics – from database or morphological analysis • Morphological analyses • Example:

  (1192 1196 10 T C "Mrs." "mrs." (PROP TITLE) (^PER_CIV_F) (("Mrs." "Title" "Abbr")) NIL)

where PER_CIV_F means female civilian (from the database). Adapted from slides by Cunningham & Bontcheva
FACILE • Context-sensitive rules written in a special rule notation, executed by an interpreter • Writing such rules directly in Perl proved too error-prone and hard • Rules of the kind A => B\C/D, where: • A is a set of attribute-value expressions with an optional score; the attributes refer to elements of the input token feature vector • B and D are the left and right context respectively and can be empty • B, C, D are sequences of attribute-value pairs and Kleene regular expression operations; variables are also supported • Example (empty left and right contexts):

  [syn=NP, sem=ORG] (0.9) =>
      \ [norm="university"], [token="of"], [sem=REGION|COUNTRY|CITY] / ;

Adapted from slides by Cunningham & Bontcheva
FACILE # Rule for the mark-up of person names when the first name is not present or known from the gazetteers, e.g. 'Mr J. Cass':

  [SYN=PROP, SEM=PER, FIRST=_F, INITIALS=_I, MIDDLE=_M, LAST=_S]
      # _F, _I, _M, _S are variables that transfer info from the RHS
  =>
      [SEM=TITLE_MIL|TITLE_FEMALE|TITLE_MALE]
      \ [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
        [ORTH=C|A, SYN=PROP, TOKEN=_F]?,
        [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
        [SYN=NAME, TOKEN=_M]?,
        [ORTH=C|A|O, SYN=PROP, TOKEN=_S, SOURCE!=RULE]  # proper name, not recognised by a rule
      / ;

Adapted from slides by Cunningham & Bontcheva
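As a rough illustration of what the rule interpreter does with such constraint sequences, here is a minimal Python matcher; the dict-based token encoding, the omission of optional "?" elements, variables, and contexts, and the example tokens are all simplifying assumptions, not FACILE's implementation:

  # Tokens are feature dicts; a constraint maps attributes to sets of
  # allowed values, mirroring disjunctions like SEM=TITLE_MIL|TITLE_MALE.
  def matches(token, constraint):
      return all(token.get(attr) in allowed
                 for attr, allowed in constraint.items())

  def match_at(tokens, constraints, start):
      """Return the end index if every constraint matches consecutively
      from `start`, else None."""
      if start + len(constraints) > len(tokens):
          return None
      for offset, c in enumerate(constraints):
          if not matches(tokens[start + offset], c):
              return None
      return start + len(constraints)

  # 'Mr J. Cass' as token feature vectors (invented encoding for illustration)
  tokens = [
      {"SEM": "TITLE_MALE", "TOKEN": "Mr"},
      {"SYN": "NAME", "ORTH": "I", "TOKEN": "J."},
      {"SYN": "PROP", "ORTH": "C", "TOKEN": "Cass"},
  ]
  person_rule = [
      {"SEM": {"TITLE_MIL", "TITLE_FEMALE", "TITLE_MALE"}},
      {"SYN": {"NAME"}, "ORTH": {"I", "O"}},
      {"SYN": {"PROP"}, "ORTH": {"C", "A", "O"}},
  ]
  print(match_at(tokens, person_rule, 0))  # 3 -> tokens 0..2 form one PER name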
FACILE • Preference mechanism: • The rule with the highest score is preferred • Longer matches are preferred to shorter matches • The result is always a single semantic categorisation for each named entity in the text • Evaluation (MUC-7 scores): • Organization: 86% precision, 66% recall • Person: 90% precision, 88% recall • Location: 81% precision, 80% recall • Dates: 93% precision, 86% recall Adapted from slides by Cunningham & Bontcheva
Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm, 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement
Slide adapted from William Cohen's
A Naïve Bayes Sliding Window Model [Freitag 1997]

  … 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
    |---- prefix ----||----- contents -----||----- suffix -----|
     w_{t-m} … w_{t-1}   w_t … w_{t+n}       w_{t+n+1} … w_{t+n+m}

• Estimate Pr(LOCATION | window) using Bayes' rule
• Try all "reasonable" windows (vary length, position)
• Assume independence for length, prefix words, suffix words, content words
• Estimate from data quantities like Pr("Place" in prefix | LOCATION)
• If Pr("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it
• Other examples of sliding windows: [Baluja et al 2000] (decision tree over individual words and their context)
Slide adapted from William Cohen's
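A minimal sketch of scoring one window under those independence assumptions; the tiny probability tables, smoothing constant, prior, and threshold are all invented for illustration and would really be estimated from annotated data:

  import math

  # Toy log Pr(word appears in field | LOCATION) tables.
  LOG_P_PREFIX = {"place": math.log(0.30), ":": math.log(0.20)}
  LOG_P_CONTENTS = {"wean": math.log(0.15), "hall": math.log(0.20)}
  LOG_P_SUFFIX = {"speaker": math.log(0.10)}
  LOG_P_UNSEEN = math.log(1e-4)   # crude smoothing for unseen words
  LOG_PRIOR = math.log(0.01)      # Pr(LOCATION) for an arbitrary window

  def window_score(prefix, contents, suffix):
      """Log of Pr(LOCATION) * Pr(prefix, contents, suffix | LOCATION)
      under the naive independence assumptions."""
      score = LOG_PRIOR
      for words, table in ((prefix, LOG_P_PREFIX),
                           (contents, LOG_P_CONTENTS),
                           (suffix, LOG_P_SUFFIX)):
          score += sum(table.get(w.lower(), LOG_P_UNSEEN) for w in words)
      return score

  s = window_score(["Place", ":"], ["Wean", "Hall"], ["Speaker"])
  if s > math.log(1e-8):          # arbitrary extraction threshold
      print("extract contents as LOCATION:", s)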
Naïve Bayes Sliding Window Results

Domain: CMU UseNet Seminar Announcements (the example shown above)

Field          F1
Person Name:   30%
Location:      61%
Start Time:    98%

Slide adapted from William Cohen's
SRV: a realistic sliding-window-classifier IE system [Freitag AAAI ‘98] • What windows to consider? • all windows containing as many tokens as the shortest example, but no more tokens than the longest example • How to represent a classifier? It might: • Restrict the length of the window • Restrict the vocabulary or formatting used before/after/inside the window • Restrict the relative order of tokens • Use inductive logic programming techniques to express all these… <title>Course Information for CS213</title> <h1>CS 213 C++ Programming</h1> Slide adapted from William Cohen's
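A minimal sketch of that window-enumeration heuristic, with the length bounds taken from the training examples (the function name and token-list interface are illustrative):

  def candidate_windows(tokens, min_len, max_len):
      """Yield every (start, end) span whose length lies between the
      shortest and longest training example, per the heuristic above."""
      for start in range(len(tokens)):
          for length in range(min_len, max_len + 1):
              if start + length <= len(tokens):
                  yield (start, start + length)

  tokens = "CS 213 C++ Programming".split()
  print(list(candidate_windows(tokens, 1, 2)))
  # [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]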
SRV: a rule-learner for sliding-window classification • Primitive predicates used by SRV: • token(X,W), allLowerCase(W), numerical(W), … • nextToken(W,U), previousToken(W,V) • HTML-specific predicates: • inTitleTag(W), inH1Tag(W), inEmTag(W),… • emphasized(W) = “inEmTag(W) or inBTag(W) or …” • tableNextCol(W,U) = “U is some token in the column after the column W is in” • tablePreviousCol(W,V), tableRowHeader(W,T),… Slide adapted from William Cohen's
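A minimal sketch of a few of the non-HTML predicates as plain functions (the token-list representation is an illustrative stand-in for SRV's relational encoding):

  def all_lower_case(w):              # allLowerCase(W)
      return w.isalpha() and w.islower()

  def numerical(w):                   # numerical(W)
      return w.isdigit()

  def next_token(tokens, i):          # nextToken(W, U)
      return tokens[i + 1] if i + 1 < len(tokens) else None

  def previous_token(tokens, i):      # previousToken(W, V)
      return tokens[i - 1] if i > 0 else None

  tokens = "CS 213 C++ Programming".split()
  print(numerical(tokens[1]), next_token(tokens, 1))  # True C++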
Automatic Pattern-Learning Systems

[Diagram: a Trainer takes language input plus answers and produces a Model; a Decoder then applies that Model to new language input to produce answers.]

• Pros: • Portable across domains • Tend to have broad coverage • Robust in the face of degraded input • Automatically finds appropriate statistical patterns • System knowledge not needed by those who supply the domain knowledge • Cons: • Annotated training data, and lots of it, is needed • Isn't necessarily better or cheaper than a hand-built solution • Examples: Riloff et al. AutoSlog (UMass); Soderland WHISK (UMass); Mooney et al. Rapier (UTexas): • learn lexico-syntactic patterns from templates
Rapier [Califf & Mooney, AAAI-99] • Rapier learns three regex-style patterns for each slot: Pre-filler pattern Filler pattern Post-filler pattern Slide adapted from Chris Manning's
Features for IE Learning Systems • Part of speech: syntactic role of a specific word • Semantic Classes: Synonyms or other related words • “Price” class: price, cost, amount, … • “Month” class: January, February, March, …, December • “US State” class: Alaska, Alabama, …, Washington, Wyoming • WordNet: large on-line thesaurus containing (among other things) semantic classes Slide adapted from Chris Manning's
Rapier rule matching example

“…sold to the bank for an undisclosed amount…”
  POS:    vb  pr  det  nn  pr  det  jj  nn
  SClass:                               price

  Pre-filler pattern      Filler pattern            Post-filler pattern
  1) tag: {nn,nnp}        1) word: “undisclosed”    1) sem: price
  2) list: length 2          tag: jj

“…paid Honeywell an undisclosed price…”
  POS:    vb  nnp  det  jj  nn
  SClass:                   price

Slide adapted from Chris Manning's
Rapier Rules: Details • Rapier rule := • pre-filler pattern • filler pattern • post-filler pattern • pattern := subpattern + • subpattern := constraint + • constraint := • Word - exact word that must be present • Tag - matched word must have given POS tag • Class - semantic class of matched word • Can specify disjunction with “{…}” • List length N - between 0 and N words satisfying other constraints Slide adapted from Chris Manning's
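A minimal sketch of matching one such rule against a tagged sentence; the (word, tag, semclass) token tuples, the lazy handling of "list: length N", and the whole matcher are simplifications for illustration, not Rapier's actual implementation (which, for instance, resolves semantic classes via WordNet):

  # Token = (word, POS tag, semantic class or None).
  def ok(tok, c):
      w, t, s = tok
      return (("word" not in c or w.lower() == c["word"]) and
              ("tag" not in c or t in c["tag"]) and
              ("sem" not in c or s == c["sem"]))

  def match_seq(tokens, i, pattern):
      """Match pattern elements left to right from position i; a
      {'list': N} element absorbs 0..N arbitrary tokens.  Returns every
      possible end position (empty list if nothing matches)."""
      if not pattern:
          return [i]
      head, rest = pattern[0], pattern[1:]
      if "list" in head:
          ends = []
          for skip in range(head["list"] + 1):
              if i + skip <= len(tokens):
                  ends.extend(match_seq(tokens, i + skip, rest))
          return ends
      if i < len(tokens) and ok(tokens[i], head):
          return match_seq(tokens, i + 1, rest)
      return []

  def extract(tokens, pre, fill, post):
      """Yield filler spans where some pre-filler match ends exactly at the
      span start and the post-filler matches right after the span."""
      for start in range(len(tokens) + 1):
          pre_ok = any(end == start
                       for b in range(start + 1)
                       for end in match_seq(tokens, b, pre))
          if not pre_ok:
              continue
          for mid in match_seq(tokens, start, fill):
              if match_seq(tokens, mid, post):
                  yield tokens[start:mid]

  sent = [("paid", "vb", None), ("Honeywell", "nnp", None), ("an", "det", None),
          ("undisclosed", "jj", None), ("price", "nn", "price")]
  pre = [{"tag": {"nn", "nnp"}}, {"list": 2}]
  fill = [{"word": "undisclosed", "tag": {"jj"}}]
  post = [{"sem": "price"}]
  print(list(extract(sent, pre, fill, post)))  # [[('undisclosed', 'jj', None)]]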
Rapier’s Learning Algorithm • Input: set of training examples (list of documents annotated with “extract this substring”) • Output: set of rules • Init: Rules = a rule that exactly matches each training example • Repeat several times: • Seed: Select M examples randomly and generate the K most-accurate maximally-general filler-only rules (prefiller = postfiller = match anything) • Grow: Repeat for N = 1, 2, 3, … Try to improve the K best rules by adding N context words of prefiller or postfiller context • Keep: Rules = Rules ∪ the best of the K rules, minus subsumed rules Slide adapted from Chris Manning's
Learning example (one iteration)

2 examples:
  ‘…located in Atlanta, Georgia…’
  ‘…offices in Kansas City, Missouri…’

[Diagram: the rule space runs from maximally general rules (low precision, high recall) to maximally specific rules (high precision, low recall); Init, Seed, and Grow search it for an appropriately general rule (high precision, high recall).]

Slide adapted from Chris Manning's
Rapier results: Precision vs. # training examples [chart omitted] Slide adapted from William Cohen's
Rapier results: Recall vs. # training examples [chart omitted] Slide adapted from William Cohen's
Summary: Rule-learning approaches to sliding-window classification • SRV, Rapier, and WHISK [Soderland KDD ‘97] • Representations for classifiers allow restricting the relationships between tokens, etc. • Representations are carefully chosen subsets of even more powerful representations • Use of these “heavyweight” representations is complicated, but seems to pay off in results Slide adapted from William Cohen's
Successors to MUC • CoNLL: Conference on Computational Natural Language Learning • Different topics each year • 2002, 2003: Language-independent NER • 2004: Semantic Role recognition • 2001: Identify clauses in text • 2000: Chunking boundaries • http://cnts.uia.ac.be/conll2003/ (also conll2004, conll2002…) • Sponsored by SIGNLL, the Special Interest Group on Natural Language Learning of the Association for Computational Linguistics • ACE: Automated Content Extraction • Entity Detection and Tracking • Sponsored by NIST • http://wave.ldc.upenn.edu/Projects/ACE/ • Several others recently • See http://cnts.uia.ac.be/conll2003/ner/
CoNLL-2003 • Goal: identify boundaries and types of named entities • People, Organizations, Locations, Misc. • Experiment with incorporating external resources (gazetteers) and unlabeled data • Data: • Uses IOB notation • 4 pieces of info for each term: Word, POS, Chunk, EntityType
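A fragment in this format looks roughly like the following (one token per line: word, POS tag, chunk tag, entity tag; illustrative fragment in the style of the shared-task documentation):

  U.N.      NNP  I-NP  I-ORG
  official  NN   I-NP  O
  Ekeus     NNP  I-NP  I-PER
  heads     VBZ  I-VP  O
  for       IN   I-PP  O
  Baghdad   NNP  I-NP  I-LOC
  .         .    O     O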
Details on Training/Test Sets: Reuters Newswire + European Corpus Initiative [table of corpus statistics omitted] Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003
Summary of Results • 16 systems participated • Machine Learning Techniques • Combinations of Maximum Entropy Models (5) + Hidden Markov Models (4) + Winnow/Perceptron (4) • Others used once were Support Vector Machines, Conditional Random Fields, Transformation-Based learning, AdaBoost, and memory-based learning • Combining techniques often worked well • Features • Choice of features is at least as important as ML method • Top-scoring systems used many types • No one feature stands out as essential (other than words) Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003
Use of External Information • Improvements from using gazetteers vs. unlabeled data were nearly equal • Gazetteers were less useful for German than for English (the English lists were of higher quality) Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003
Precision, Recall, and F-Scores [results table omitted; * marks systems whose scores are not significantly different] Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003
Combining Results • What happens if we combine the results of all of the systems? • Used a majority-vote of 5 systems for each set • English: F = 90.30 (14% error reduction of best system) • German: F = 74.17 (6% error reduction of best system) • Top four systems in more detail …
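Before that, a minimal sketch of the per-token majority vote just described (the token-level voting granularity and the arbitrary tie-breaking are assumptions):

  from collections import Counter

  def majority_vote(predictions):
      """Combine aligned per-token label sequences from several systems by
      taking the most common label at each position (ties broken arbitrarily)."""
      return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]

  systems = [["I-PER", "O", "I-LOC"],
             ["I-PER", "I-ORG", "I-LOC"],
             ["O", "O", "I-LOC"]]
  print(majority_vote(systems))  # ['I-PER', 'O', 'I-LOC']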
Zhang and Johnson • Experimented with the effects of different features • Used a learning method they developed called Robust Risk Minimization • Related to the Winnow method • Used it to predict the class label t_i associated with each token w_i • Estimate P(t_i = c | x_i) for every possible class label c, where x_i is a feature vector associated with token i • x_i can include information about previous tags • Found that relatively simple, language-independent features get you much of the way
Zhang and Johnson • Simple features include: • The tokens themselves, in window of +/- 2 • The previous 2 predicted tags • The conjunction of the previous tag and the current token • Initial capitalization of tokens, in window of +/- 2 • More elaborate features include: • Word “shape” information: initial caps, all caps, all digits, digits containing punctuation • Token prefix (len 3-4) and suffix (len 1-4) • POS • Chunking info (chunk bag-of-words at current token) • Marked up entities from training data • Other dictionaries
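A minimal sketch of computing the simple features listed above for one token position (the feature names, padding symbols, and window handling are illustrative, not Zhang & Johnson's exact encoding):

  def simple_features(tokens, prev_tags, i):
      """Token window, previous predicted tags, conjunction, and
      capitalization features for position i."""
      feats = {}
      for d in range(-2, 3):                          # tokens in window +/- 2
          j = i + d
          tok = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
          feats[f"tok[{d:+d}]"] = tok
          feats[f"cap[{d:+d}]"] = tok[:1].isupper()   # initial capitalization
      t1 = prev_tags[i - 1] if i >= 1 else "<S>"      # previous 2 predicted tags
      t2 = prev_tags[i - 2] if i >= 2 else "<S>"
      feats["tag[-1]"], feats["tag[-2]"] = t1, t2
      feats["tag[-1]&tok[0]"] = t1 + "|" + tokens[i]  # conjunction feature
      return feats

  tokens = "U.N. official Ekeus heads for Baghdad .".split()
  print(simple_features(tokens, ["I-ORG", "O"], 2)["tag[-1]&tok[0]"])  # 'O|Ekeus'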
Language independent [figure omitted]
Florian, Ittycheriah, Jing, Zhang • Combined four machine learning algorithms • The best-performing was the Zhang & Johnson RRM • Voting algorithm • Giving them all equal-weight votes worked well • So did using the RRM algorithm to choose among them • English F-measure went from 89.94 to 91.63 • Did well with the supplied features; did even better with some complex additional features: • The output of 2 other NER systems • Trained on 1.7M annotated words in 32 categories • A list of gazetteers • Improved English F-measure to 93.9 • (21% error reduction)
Effects of Unknown Words • Florian et al. note that German is harder • Has more unknown words • All nouns are capitalized
Klein, Smarr, Nguyen, Manning • Standard approach for unknown words is to extract features like suffixes, prefixes, and capitalization • Idea: use all-character n-grams, rather than words, as the primary representation • Integrates unknown words seamlessly into the model • Improved results of their classifier by 25%
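A minimal sketch of generating such character n-grams with boundary markers (the padding characters and fixed n are illustrative choices):

  def char_ngrams(word, n=4):
      """All character n-grams of `word` with boundary markers, so prefixes
      and suffixes fall out of the same representation."""
      s = "<" + word + ">"
      return [s[i:i + n] for i in range(max(len(s) - n + 1, 1))]

  print(char_ngrams("Grace"))  # ['<Gra', 'Grac', 'race', 'ace>']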
Balancing n-grams with Other Evidence • Example: “morning at Grace Road” • Need the classifier to determine that “Grace” is part of a location rather than a person • Used a Conditional Markov Model (aka Maximum Entropy Model) • Also added other “shape” information • “20-month” -> d-x • “Italy” -> Xx
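A plausible implementation of the shape mapping shown above (one of several reasonable variants; the regex-based collapsing is an assumption):

  import re

  def word_shape(w):
      """Collapse a token to a coarse shape: runs of uppercase letters -> 'X',
      lowercase -> 'x', digits -> 'd'; other characters kept as-is."""
      shape = re.sub(r"[A-Z]+", "X", w)
      shape = re.sub(r"[a-z]+", "x", shape)
      shape = re.sub(r"[0-9]+", "d", shape)
      return shape

  print(word_shape("Italy"))     # 'Xx'
  print(word_shape("20-month"))  # 'd-x'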
Chieu and Ng • Used a Maximum Entropy approach • Estimates probabilities based on the principle of making as few assumptions as possible • But allows specification of constraints between features and outcomes (derived from training data) • Used a rich feature set, like those already discussed • Interesting additional features: • Lists derived from the training set • “Global” features: look at how the words appear elsewhere within the document • The paper doesn't say which of these features do well
Lists Derived from Training Data • UNI (useful unigrams): • Top 20 words that precede instances of that class • Computed using a correlation metric • UBI (useful bigrams): pairs of preceding words • CITY OF, ARRIVES IN • The bigram has a higher probability of preceding the class than the unigram • CITY OF is better evidence than just OF • NCS (useful name-class suffixes): • Tokens that frequently terminate a class • INC, COMMITTEE
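A minimal sketch of collecting UNI candidates from annotated spans; for simplicity this ranks by raw counts where Chieu and Ng use a correlation metric, and the data layout is an assumption:

  from collections import Counter

  def useful_unigrams(docs, k=20):
      """docs: list of (tokens, spans) pairs, where spans are (start, end)
      offsets of one entity class.  Returns the k most frequent words that
      immediately precede a span of that class."""
      counts = Counter()
      for tokens, spans in docs:
          for start, _end in spans:
              if start > 0:
                  counts[tokens[start - 1].lower()] += 1
      return [w for w, _ in counts.most_common(k)]

  doc = ("flew to the city of Atlanta".split(), [(5, 6)])
  print(useful_unigrams([doc], k=5))  # ['of']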
Using Other Occurrences within the Document • Zone: • Where is the token from? (headline, author, body) • Unigrams • If UNI holds for an occurrence of w elsewhere • Bigrams • If UBI holds for an occurrence of w elsewhere • Suffix • If NCS holds for an occurrence of w elsewhere • InitCaps • A way to check if a word is capitalized due to its position in the sentence or not. Also, check the first word in a sequence of capitalized words. • Even News Broadcasting Corp., noted for its accurate reporting, made the erroneous announcement.
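A minimal sketch of the global InitCaps idea: checking whether a word occurs capitalized somewhere other than sentence-initially in the same document (a simplified reading of the feature, with an assumed sentence-list representation):

  def informative_caps(word, sentences):
      """True if `word` appears with an initial capital at a non-initial
      position in some sentence, so its capitalization is not just an
      artifact of sentence position (e.g. 'Even' in the example above)."""
      return any(tok == word and tok[0].isupper() and j > 0
                 for sent in sentences for j, tok in enumerate(sent))

  doc = [["Even", "News", "Broadcasting", "Corp.", "made", "the", "announcement"],
         ["It", "was", "even", "reported", "by", "News", "outlets"]]
  print(informative_caps("Even", doc))  # False: 'Even' only starts a sentence
  print(informative_caps("News", doc))  # True: capitalized mid-sentence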
MUC Redux • Task: fill slots of templates • MUC-4 (1992) • All systems hand-engineered • One MUC-6 entry used learning; failed miserably