Learning in NLP: When can we reduce or avoid annotation cost? Tutorial at RANLP 2003 Ido Dagan Bar Ilan University, Israel
Introduction • Motivations for learning in NLP • NLP requires huge amounts of diverse types of knowledge – learning makes knowledge acquisition more feasible, automatically or semi-automatically • Much of language behavior is preferential in nature, so need to acquire both quantitative and qualitative knowledge
Introduction (cont.) • Apparently, empirical modeling obtains (so far) mostly “first-order” approximation of linguistic behavior • Often, learning models that are more complex computationally improve results only to a modest extent • Often, several learning models obtain comparable results • Proper linguistic modeling seems crucial
Information Units of Interest - Examples • Explicit units: • Documents • Lexical units: words, terms (surface/base form) • Implicit (hidden) units – human stipulation: • Word senses, name types • Document categories • Lexical syntactic units: part of speech tags • Syntactic relationships between words – parsing • Semantic concepts and relationships
Tasks and Applications • Supervised/classification: identify hidden units (concepts) of explicit units • Syntactic analysis, word sense disambiguation, name classification, categorization, … • Unsupervised: identify relationships and properties of explicit units (terms, docs) • Association, topicality, similarity, clustering • Combinations
Data and Representations • Frequencies of units • Co-occurrence frequencies • Between all relevant types of units (term-doc, term-term, term-category, sense-term, etc.) • Representations and modeling • Sequences • Feature sets/vectors
Characteristics of Learning in NLP • Very high dimensionality • Sparseness of data and relevant modeling • Addressing the basic problems of language: • Ambiguity – of concepts and features • One way to say many things • Variability • Many ways to say the same thing
Supervised Classification • Hidden concept is defined by a set of labeled training examples (category, sense) • Classification is based on entailment of the hidden concept by related elements/features • Example: two senses of “sentence”: • word, paragraph, description Sense1 • judge, court, lawyer Sense2 • Single or multiple concepts per example • Word sense vs. document categories
Supervised Tasks and Features • Typical Classification Tasks: • Lexical: Word sense disambiguation, target word selection in translation, name-type classification, accent restoration, text categorization (notice task similarity) • Syntactic: POS tagging, PP-attachment, parsing • Hybrid: anaphora resolution, information extraction • Features (“feature engineering”): • Adjacent context: words, POS, … • In various relationships – distance, syntactic • possibly generalized to classes • Other: morphological, orthographic, syntactic
Learning to Classify • Two possibilities for acquiring “entailment” relationships: • Manually: by an expert (“rules”) • time consuming, difficult – “expert system” approach • Automatically: concept is defined by a set of training examples • training quantity/quality • Training: learn entailment of concept by features of training examples (a model) • Classification: apply model to new examples
Supervised Learning Scheme (flow diagram): "Labeled" Examples → Training Algorithm → Classification Model; New Examples + Classification Model → Classification Algorithm → Classifications
Learning Approaches • Model-based: define entailment relations and their strengths by training algorithm • Statistical/Probabilistic: model is composed of probabilities (scores) computed from training statistics • Iterative feedback/search (neural network): start from some model, classify training examples, and correct model according to feedback • Memory-based: no training algorithm and model - classify by matching to raw training (compare to unsupervised tasks)
Motivation of Tutorial Theme: Reducing or Avoiding Manual Labeling • Basic supervised setting – requires large manually labeled training corpora • Annotation is often very expensive • Many results rely on standard training materials, which were assembled through dedicated projects and evaluation frameworks • Penn Treebank, Brown Corpus, Semcor, TREC, MUC and SenseEval evaluations, CoNLL shared tasks. • Limited applicability for settings not covered by the generic resources • Different languages, specialized domains, full scope of word senses, text categories, … • Severely hurts industrial applicability
Tutorial Scope • Obtaining some (noisy) labeled data without manual annotation • Exploiting bilingual resources • Generalizations by unsupervised methods • Bootstrapping • Unsupervised clustering as an alternative to supervised classes • Expectation-Maximization (EM) for detecting underlying structures/concepts • Selective sampling • These approaches are demonstrated for basic statistical and probabilistic learning models Some of these approaches might be perceived as unsupervised learning, though they actually address supervised tasks of identifying externally imposed classes (“unsupervised” training)
Sources • Major literature sources: • Foundations of Statistical Natural Language Processing, by Manning & Schutze, MIT Press, 2000 (2nd printing with corrections) • Articles (see bibliography) • Additional slide credits: • Prof. Shlomo Argamon, Chicago
Evaluation • Evaluation mostly based on (subjective) human judgment of relevancy/correctness • In some cases – the task is objective (e.g. OCR), or evaluation applies a mathematical criterion (likelihood) • Basic measure for classification – accuracy • Cross validation – different training/test splits • In many tasks (extraction, multiple classes per instance, …) most instances are “negative”; hence using recall/precision measures, following information retrieval (IR) tradition
Evaluation: Recall/Precision • Recall: #correct extracted / total correct • Precision: #correct extracted / total extracted • Recall/precision curve – by varying the number of extracted items, assuming the items are sorted by decreasing score (plot: precision [0–1] on the y-axis against recall [0–1] on the x-axis)
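The recall/precision curve above can be sketched in a few lines of code. The scored items and gold labels here are invented for illustration:

```python
# A minimal sketch of computing a recall/precision curve from scored items.
def recall_precision_curve(scored, total_correct):
    """Recall and precision at each cutoff, items sorted by decreasing score."""
    scored = sorted(scored, key=lambda x: -x[0])
    curve, correct = [], 0
    for k, (score, is_correct) in enumerate(scored, start=1):
        correct += is_correct          # cumulative correct among top-k items
        curve.append((correct / total_correct,   # recall at cutoff k
                      correct / k))              # precision at cutoff k
    return curve

# Illustrative data: 4 extracted items, 3 correct items in the gold standard
points = recall_precision_curve(
    [(0.9, True), (0.8, True), (0.6, False), (0.4, True)], total_correct=3)
```

Each cutoff trades recall against precision; plotting the points gives the curve described above.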
Simple Examples for Statistics-based Classification • Based on class-feature counts – from labeled data • Contingency table:

         C    ~C
   f     a     b
  ~f     c     d

• We will see several examples of simple models based on these statistics
Prepositional-Phrase Attachment • Simplified version of Hindle & Rooth (1993) [MS 8.3] • Setting: V NP-chunk PP • Moscow sent soldiers into Afghanistan • ABC breached an agreement with XYZ • Motivation for the classification task: • Attachment is often a problem for (full) parsers • Augment shallow/chunk parsers
Relevant Probabilities • P(prep|n) vs. P(prep|v) • The probability of having the preposition prep attached to an occurrence of the noun n (the verb v) • Notice: a single feature for each class • Example: P(into|send) vs. P(into|soldier) • Decision measured by the log-likelihood ratio λ = log2( P(prep|v) / P(prep|n) ) • Positive λ → verb attachment; negative λ → noun attachment
Estimating Probabilities • Based on attachment counts from a training corpus • Maximum likelihood estimates • How to count from an unlabeled ambiguous corpus? (Circularity problem) • Some cases are unambiguous: • The road to London is long • Moscow sent him to Afghanistan
Heuristic Bootstrapping and Ambiguous Counting • Produce initial estimates (model) by counting all unambiguous cases • Apply the initial model to all ambiguous cases; count each case under the resulting attachment if |λ| is greater than a threshold • E.g. |λ|>2, meaning one attachment is at least 4 times more likely than the other • Consider each remaining ambiguous case as a 0.5 count for each attachment. • Likely n-p and v-p pairs would “pop up” in the ambiguous counts, while incorrect attachments are likely to accumulate low counts
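The attachment decision described above can be sketched as follows. All counts are invented, and the 0.5 add-in is a crude stand-in for the smoothed/heuristic counting of ambiguous cases described in the slides:

```python
# Hindle & Rooth-style attachment decision sketch; counts are hypothetical.
from math import log2

# Attachment counts (from unambiguous cases plus bootstrapped ambiguous counts)
verb_prep = {("send", "into"): 86}
noun_prep = {("soldier", "into"): 1}
verb_total = {"send": 300}
noun_total = {"soldier": 200}

def attachment_lambda(v, n, p):
    """log2 odds of verb vs. noun attachment for preposition p."""
    p_v = verb_prep.get((v, p), 0.5) / verb_total[v]   # crude smoothing
    p_n = noun_prep.get((n, p), 0.5) / noun_total[n]
    return log2(p_v / p_n)

lam = attachment_lambda("send", "soldier", "into")
decision = "verb" if lam > 0 else "noun"
```

With these counts λ is well above a threshold like |λ| > 2, so the case would also be counted confidently during the bootstrapping pass.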
Example Decision • Moscow sent soldiers into Afghanistan • Verb attachment is 70 times more likely
Hindle & Rooth Evaluation • H&R results for a somewhat richer model: • 80% correct if we always make a choice • 91.7% precision for 55.2% recall, when requiring |λ|>3 for classification. • Notice that the probability ratio doesn’t distinguish between decisions made based on high vs. low frequencies.
Possible Extensions • Consider a-priori structural preference for “low” attachment (to noun) • Consider lexical head of the PP: • I saw the bird with the telescope • I met the man with the telescope • Such additional factors can be incorporated easily, assuming their independence • Addressing more complex types of attachments, such as chains of several PP’s • Similar attachment ambiguities within noun compounds: [N [N N]] vs. [[N N] N]
Classify by Best Single Feature: Decision List • Training: for each feature, measure its “entailment score” for each class, and register the class with the highest score • Sort all features by decreasing score • Classification: for a given example, identify the highest entailment score among all “active” features, and select the appropriate class • Test the features in decreasing score order until the first success, then output the relevant class • Default decision: the majority class • For multiple classes per example: may apply a threshold on the feature-class entailment score • Suitable when relatively few strong features indicate the class (compare to manually written rules)
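The decision-list algorithm above can be sketched as follows; the feature scores are invented for illustration (cf. the “sentence” example earlier):

```python
# Minimal decision-list classifier sketch; scores and features are hypothetical.
def train_decision_list(feature_class_scores, default):
    """Sort (feature, class, score) rules by decreasing entailment score."""
    rules = sorted(feature_class_scores, key=lambda r: -r[2])
    return rules, default

def classify(example_features, rules, default):
    """Return the class of the highest-scoring feature active in the example."""
    for feature, cls, score in rules:
        if feature in example_features:
            return cls               # first success wins
    return default                   # majority-class fallback

rules, default = train_decision_list(
    [("court", "sense2", 4.1), ("word", "sense1", 3.2), ("judge", "sense2", 2.5)],
    default="sense1")
label = classify({"judge", "word"}, rules, default)
```

Note that only the single strongest active feature decides; weaker active features ("judge" here) are ignored entirely.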
Example: Accent Restoration • (David Yarowsky, 1994): for French and Spanish • Classes: alternative accent restorations for words in text without accent marking • Labeled training generated from accented texts • Example: côte (coast) vs. côté (side) • A variant of the general word sense disambiguation problem - “one sense per collocation” motivates using decision lists • Similar tasks (with available training): • Capitalization restoration in ALL-CAPS text • Homograph disambiguation in speech synthesis (wind as noun and verb)
Accent Restoration - Features • Word form collocation features: • Single words in window: ±1, ±k (20-50) • Word pairs at <-1,+1>, <-2,-1>, <+1,+2> (complex features) • Easy to implement
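The word-form collocation features above can be sketched as a simple extraction function; the feature names and window size are illustrative choices, not Yarowsky's exact inventory:

```python
# Sketch of collocation feature extraction around a target token position i.
def collocation_features(tokens, i, k=20):
    feats = set()
    # single adjacent words
    if i > 0:
        feats.add(("w-1", tokens[i-1]))
    if i + 1 < len(tokens):
        feats.add(("w+1", tokens[i+1]))
    # single words anywhere in a +/-k window
    for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
        if j != i:
            feats.add(("win", tokens[j]))
    # word pairs at <-2,-1>, <-1,+1>, <+1,+2>
    if i > 1:
        feats.add(("pair-2-1", tokens[i-2], tokens[i-1]))
    if 0 < i < len(tokens) - 1:
        feats.add(("pair-1+1", tokens[i-1], tokens[i+1]))
    if i + 2 < len(tokens):
        feats.add(("pair+1+2", tokens[i+1], tokens[i+2]))
    return feats

feats = collocation_features("the cote of the sea".split(), 1)
```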
Accent Restoration - Features • Local syntax-based features (for Spanish) • Use a morphological analyzer • Lemmatized features - generalizing over inflections • POS of adjacent words as features • Some word classes (primarily time terms, to help with tense ambiguity for unaccented words in Spanish)
Accent Restoration – Decision Score • Probabilities estimated from training statistics, taken from a corpus with accents • Smoothing - add a small constant to all counts • Pruning: • Remove redundancies for efficiency: remove specific features that score lower than their generalization (domingo - WEEKDAY, w1w2 – w1) • Cross validation: remove features that cause more errors than correct classifications on held-out data
Probabilistic Estimation - Smoothing • Counts are obtained from a sample of the probability space • Maximum Likelihood Estimate is proportional to sample counts – it assigns 0 probability to unobserved events • Smoothing discounts observed events, leaving probability “mass” to unobserved events: a discounted estimate for observed events, a positive estimate for unobserved events
Accent Restoration – Results • Agreement with an accented test corpus for ambiguous words: 98% • Vs. 93% for the baseline of the most frequent form • The accented test corpus also includes errors • Worked well for most of the highly ambiguous cases (see random sample in next slide) • Results slightly better than Naive Bayes (weighting multiple features) • Consistent with a related study on binary homograph disambiguation, where combining multiple features almost always agrees with using the single best feature • Incorporating many low-confidence features may introduce noise that overrides the strong features
Related Application: Anaphora Resolution (Dagan, Justeson, Lappin, Leass, Ribak 1995) • The terrorist pulled the grenade from his pocket and threw it at the policeman – it = ? • Traditional AI-style approach: manually encoded semantic preferences/constraints, e.g. a Cause_movement action (throw, drop) takes a Weapon (grenade, bombs) in the <object – verb> relation
Statistical Approach • Statistics can be acquired from unambiguous (non-anaphoric) occurrences in a raw (English) corpus (cf. PP attachment) • Corpus (text collection) counts: <verb–object: throw-grenade> 20 times, <verb–object: throw-pocket> 1 time → “semantic” judgment: it = grenade • Semantic confidence combined with syntactic preferences • “Language modeling” for disambiguation
Word Sense Disambiguation • Many words have multiple meanings • E.g., river bank vs. financial bank • Problem: Assign the proper sense to each ambiguous word in text • Applications: • Machine translation • Information retrieval (mixed evidence) • Semantic interpretation of text
Approaches • Supervised learning: Learn from a pre-tagged corpus (Semcor, SenseEval) • all sense-occurrences are hidden – vs. PP and anaphora • Bilingual-based methods Obtain sense labels by mapping to another language • Dictionary-Based Learning Learn to distinguish senses based on dictionary entries • Unsupervised Learning Automatically cluster word occurrences into different senses
Using an Aligned Bilingual Corpus • Goal: get sense tagging cheaply • Use correlations between phrases in two languages to disambiguate. E.g., interest = ‘legal share’ (acquire an interest) vs. ‘attention’ (show interest); in German: Beteiligung erwerben vs. Interesse zeigen • For each occurrence of an ambiguous word, determine which sense applies according to the aligned translation • Limited to senses that are discriminated by the other language; suitable for disambiguation in translation • Gale, Church and Yarowsky (1992) – Bayesian model
Evaluation Settings • Train and test on pre-tagged (or bilingual) texts • Difficult to come by • Artificial data – pseudo-senses – cheap to train and test: ‘merge’ two words to form an ‘ambiguous’ word with two ‘senses’ • E.g., replace all occurrences of door and of window with doorwindow and see if the system figures out which is which • Useful for developing sense disambiguation methods
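The pseudo-word construction above is easy to sketch; the word pair here follows the slide's door/window example:

```python
# Sketch of building pseudo-sense evaluation data by merging two words.
def make_pseudo_word_corpus(tokens, w1="door", w2="window"):
    """Replace w1/w2 with a merged pseudo-word, keeping the gold labels."""
    merged = w1 + w2
    out, gold = [], []
    for t in tokens:
        if t in (w1, w2):
            out.append(merged)
            gold.append(t)      # the hidden "sense" is the original word
        else:
            out.append(t)
    return out, gold

tokens, gold = make_pseudo_word_corpus("open the door near the window".split())
```

A disambiguation system is then scored by how often it recovers the original word behind each occurrence of the pseudo-word, with no manual annotation needed.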
Performance Bounds • How good is (say) 83.2%?? • Evaluate performance relative to lower and upper bounds: • Baseline performance: how well does the simplest “reasonable” algorithm do? E.g., compare to selecting the most frequent sense • Human performance: what percentage of the time do people agree on classification? • Nature of the senses used impacts accuracy levels
Word Sense Disambiguation for Machine Translation • I bought soap bars → sense1 (‘chafisa’); I bought window bars → sense2 (‘sorag’) • Corpus (text collection) counts: Sense1: <noun-noun: soap-bar> 20 times, <noun-noun: chocolate-bar> 15 times; Sense2: <noun-noun: window-bar> 17 times, <noun-noun: iron-bar> 22 times • Features: co-occurrence within distinguished syntactic relations • “Hidden” senses – manual labeling required(?)
Solution: Dictionary-based Mapping to Target Language (Dagan and Itai 1994) • English(-English)-Hebrew dictionary: bar1 → ‘chafisa’, bar2 → ‘sorag’, soap → ‘sabon’, window → ‘chalon’ • Map ambiguous “relations” to the second language (all possibilities) and count in a Hebrew corpus: <noun-noun: soap-bar> → ‘chafisat-sabon’ 20 times, ‘sorag-sabon’ 0 times; <noun-noun: window-bar> → ‘chafisat-chalon’ 0 times, ‘sorag-chalon’ 15 times • Exploiting differences in ambiguity between the languages • Principle – intersecting redundancies
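The dictionary-based mapping can be sketched as follows. The dictionary and Hebrew-corpus counts are taken from the slide's example (transliterations simplified to the base forms):

```python
# Sketch of mapping an ambiguous source relation to all target alternatives
# and picking the sense whose translation is most frequent in the target corpus.
dictionary = {"bar": ["chafisa", "sorag"], "soap": ["sabon"], "window": ["chalon"]}
corpus_counts = {("chafisa", "sabon"): 20, ("sorag", "sabon"): 0,
                 ("chafisa", "chalon"): 0, ("sorag", "chalon"): 15}

def map_and_count(head, modifier):
    """All target translations of a source noun-noun relation, with counts."""
    return {(h, m): corpus_counts.get((h, m), 0)
            for h in dictionary[head] for m in dictionary[modifier]}

def best_sense(head, modifier):
    counts = map_and_count(head, modifier)
    return max(counts, key=counts.get)[0]   # head translation = chosen sense

sense_soap = best_sense("bar", "soap")       # soap-bar
sense_window = best_sense("bar", "window")   # window-bar
```

The intersecting redundancies do the work: each target alternative that survives with a high count simultaneously disambiguates both words of the relation.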
The Selection Model • Constructed to choose (classify) the right translation for a complete relation, rather than for each individual word separately • since both words in a relation might be ambiguous, with their translations dependent upon each other • Assuming a multinomial model, under certain linguistic assumptions • The multinomial variable: a source relation • Each alternative translation of the relation is a possible outcome of the variable
An Example Sentence • A Hebrew sentence with 3 ambiguous words, and the alternative English translations of each (example shown as a figure in the original slide)
Selection Model • We would like to use as a classification score the log of the odds ratio between the most probable relation i and all other alternatives (in particular, the second most probable one j) • Estimation is based on smoothed counts • A potential problem: the odds ratio for probabilities doesn’t reflect the absolute counts from which the probabilities were estimated • E.g., a count of 3 vs. a (smoothed) 0 • Solution: use a one-sided confidence interval (lower bound) for the odds ratio
Selection Model (cont.) • The distribution of the log of the odds ratio (across samples) converges to a normal distribution • Selection “confidence” score for a single relation – the lower bound for the odds ratio • The most probable translation i for the relation is selected if Conf(i), the lower bound for the log odds ratio, exceeds θ • Notice the roles of θ vs. α, and the impact of n1, n2
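The lower bound can be sketched as follows. This is a hedged reconstruction: the variance term sqrt(1/n1 + 1/n2) is the standard normal approximation for the log of a ratio of counts, and the counts and z value are illustrative:

```python
# One-sided lower confidence bound for the log odds ratio ln(p1/p2),
# estimated from counts n1 >= n2 > 0 of the two most probable alternatives.
from math import log, sqrt

def conf(n1, n2, z=1.645):                    # z for a one-sided alpha = 0.05
    return log(n1 / n2) - z * sqrt(1.0 / n1 + 1.0 / n2)

# Same 3:1 ratio, very different confidence:
high = conf(300, 100)   # large counts -> tight bound, well above zero
low = conf(3, 1)        # small counts -> heavy penalty, bound below zero
```

This is exactly the point of the slide: the raw odds ratio is identical in both cases, but the lower bound separates high-frequency evidence from low-frequency evidence, so a threshold θ on Conf(i) trades coverage for precision.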
Handling Multiple Relations in a Sentence: Constraint Propagation • Compute Conf(i) for each ambiguous source relation • Pick the source relation with the highest Conf(i). If Conf(i) < θ, or if no source relations are left, then stop; otherwise, select word translations according to target relation i and remove the source relation from the list • Propagate the translation constraints: remove any target relation that contradicts the selections made; remove source relations that have now become unambiguous • Go to step 2 • Notice the similarity to the decision list algorithm
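The greedy loop above can be sketched as follows. The data structure (each source relation paired with the confidence of its best target alternative and the word translations it implies) and all example values are hypothetical simplifications:

```python
# Sketch of greedy constraint propagation over ambiguous source relations.
def propagate(relations, theta):
    """relations: {source: (conf, {word: translation})}; process by confidence."""
    chosen = {}                       # word -> selected translation
    remaining = dict(relations)
    while remaining:
        src = max(remaining, key=lambda r: remaining[r][0])
        conf, translations = remaining.pop(src)
        if conf < theta:
            break                     # all remaining relations are below threshold
        # adopt the translations unless they contradict earlier selections
        if all(chosen.get(w, t) == t for w, t in translations.items()):
            chosen.update(translations)
    return chosen

chosen = propagate(
    {"rel1": (3.0, {"bar": "sorag", "window": "chalon"}),
     "rel2": (1.2, {"bar": "chafisa"}),   # contradicts rel1's choice for "bar"
     "rel3": (0.5, {"x": "y"})},          # below the confidence threshold
    theta=1.0)
```

As on the slide, this mirrors a decision list: the most confident decision is applied first and then constrains all later, weaker decisions.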