
Supervised Classification of Feature-based Instances


rosetta

Presentation Transcript


  1. Supervised Classification of Feature-based Instances

  2. Simple Examples for Statistics-based Classification • Based on class-feature counts • Contingency table of counts (rows: feature f present or absent; columns: class C or ~C): row f: a (C), b (~C); row ~f: c (C), d (~C) • We will see several examples of simple models based on these statistics

  3. Prepositional-Phrase Attachment • Simplified version of Hindle & Rooth (1993) [MS 8.3] • Setting: V NP-chunk PP • Moscow sent soldiers into Afghanistan • ABC breached an agreement with XYZ • Motivation for the classification task: • Attachment is often a problem for (full) parsers • Augment shallow/chunk parsers

  4. Relevant Probabilities • P(prep|n) vs. P(prep|v) • The probability of having the preposition prep attached to an occurrence of the noun n (the verb v). • Notice: a single feature for each class • Example: P(into|send) vs. P(into|soldier) • Decision measured by the (log) likelihood ratio: λ(v, n, prep) = log2 [P(prep|v) / P(prep|n)] • Positive/negative λ ⇒ verb/noun attachment

  5. Estimating Probabilities • Based on attachment counts from a training corpus • Maximum likelihood estimates: P(prep|n) = count(prep attached to n) / count(n), and likewise for v • How to count from an unlabeled ambiguous corpus? (Circularity problem) • Some cases are unambiguous: • The road to London is long • Moscow sent him to Afghanistan
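The estimation and decision steps of slides 4 and 5 can be sketched as follows. This is a minimal illustration, not Hindle & Rooth's implementation: the function names and all counts are hypothetical, and the base-2 logarithm is inferred from slide 6, where |λ| > 2 corresponds to a factor of 4.

```python
import math
from collections import Counter

def mle(attach_counts, head_counts, head, prep):
    """Maximum likelihood estimate P(prep | head) = C(prep attached to head) / C(head)."""
    total = head_counts[head]
    return attach_counts[(head, prep)] / total if total else 0.0

def log_likelihood_ratio(p_verb, p_noun):
    """lambda = log2(P(prep|v) / P(prep|n)); positive values favour verb attachment.
    Both probabilities must be non-zero, so real data would need smoothing."""
    return math.log2(p_verb / p_noun)

# Hypothetical counts, for illustration only.
attach_counts = Counter({("send", "into"): 8, ("soldier", "into"): 1})
head_counts = Counter({"send": 100, "soldier": 200})

p_v = mle(attach_counts, head_counts, "send", "into")     # 8 / 100
p_n = mle(attach_counts, head_counts, "soldier", "into")  # 1 / 200
lam = log_likelihood_ratio(p_v, p_n)
decision = "verb" if lam > 0 else "noun"
```

With these made-up counts the verb estimate is 16 times the noun estimate, so λ is about 4 and the PP is attached to the verb.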

  6. Heuristic Bootstrapping and Ambiguous Counting • Produce initial estimates (model) by counting all unambiguous cases • Apply the initial model to all ambiguous cases; count each case under the resulting attachment if |λ| is greater than a threshold • E.g. |λ|>2, meaning one attachment is at least 4 times more likely than the other • Consider each remaining ambiguous case as a 0.5 count for each attachment. • Likely n-p and v-p pairs would “pop up” in the ambiguous counts, while incorrect attachments are likely to accumulate low counts
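The bootstrapping procedure on this slide can be sketched roughly as below. The data layout, the `head_totals` argument (occurrence counts of each noun and verb) and the `floor` value that stands in for smoothing are assumptions made for the example, not details from the slides.

```python
import math
from collections import defaultdict

def bootstrap_counts(unambiguous, ambiguous, head_totals, threshold=2.0, floor=1e-6):
    """unambiguous: (head, prep) pairs with known attachment.
    ambiguous: (verb, noun, prep) triples.
    Returns attachment counts {(head, prep): float}."""
    counts = defaultdict(float)
    # Step 1: initial model from the unambiguous cases only.
    for head, prep in unambiguous:
        counts[(head, prep)] += 1.0
    initial = dict(counts)  # frozen snapshot used to score the ambiguous cases

    def prob(head, prep):
        # 'floor' avoids log(0); it is a hypothetical stand-in for real smoothing.
        return max(initial.get((head, prep), 0.0) / head_totals[head], floor)

    for verb, noun, prep in ambiguous:
        lam = math.log2(prob(verb, prep) / prob(noun, prep))
        if lam > threshold:        # Step 2: confident verb attachment
            counts[(verb, prep)] += 1.0
        elif lam < -threshold:     # Step 2: confident noun attachment
            counts[(noun, prep)] += 1.0
        else:                      # Step 3: split the count evenly
            counts[(verb, prep)] += 0.5
            counts[(noun, prep)] += 0.5
    return dict(counts)

# Hypothetical toy data.
unambiguous = [("send", "into")] * 4 + [("road", "to")] * 3
ambiguous = [("send", "soldier", "into"), ("see", "man", "with")]
head_totals = {"send": 10, "soldier": 10, "road": 10, "see": 10, "man": 10}
result = bootstrap_counts(unambiguous, ambiguous, head_totals)
```

In this toy run the first ambiguous case is counted for the verb, while the second, for which the initial model has no evidence, contributes 0.5 to each attachment.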

  7. Example Decision • Moscow sent soldiers into Afghanistan • Verb attachment is 70 times more likely

  8. Hindle & Rooth Evaluation • H&R results for a somewhat richer model: • 80% correct if we always make a choice • 91.7% precision for 55.2% recall, when requiring |λ|>3 for classification. • Notice that the probability ratio doesn’t distinguish between decisions made based on high vs. low frequencies.

  9. Possible Extensions • Consider a-priori structural preference for “low” attachment (to noun) • Consider lexical head of the PP: • I saw the bird with the telescope • I met the man with the telescope • Such additional factors can be incorporated easily, assuming their independence • Addressing more complex types of attachments, such as chains of several PP’s • Similar attachment ambiguities within noun compounds: [N [N N]] vs. [[N N] N]

  10. Classify by Best Single Feature: Decision List • Training: for each feature, measure its “entailment score” for each class, and register the class with the highest score • Sort all features by decreasing score • Classification: for a given example, identify the highest entailment score among all “active” features, and select the appropriate class • Test all features for the class in decreasing score order, until first success → output the relevant class • Default decision: the majority class • For multiple classes per example: may apply a threshold on the feature-class entailment score • Suitable when relatively few strong features indicate class (compare to manually written rules)
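A decision list along these lines can be sketched in a few lines of Python. The slide does not define the "entailment score"; the sketch assumes a smoothed estimate of P(class | feature), and the feature names and labels are invented for the example.

```python
from collections import Counter, defaultdict

def train_decision_list(examples, alpha=0.1):
    """examples: iterable of (set_of_features, label).
    Returns (rules sorted by decreasing score, majority class)."""
    class_counts = Counter()
    feat_class = defaultdict(Counter)
    for feats, label in examples:
        class_counts[label] += 1
        for f in feats:
            feat_class[f][label] += 1
    classes = sorted(class_counts)
    rules = []
    for f, counts in feat_class.items():
        total = sum(counts.values())
        best = max(classes, key=lambda c: counts[c])
        # Assumed score: add-alpha smoothed P(best class | feature).
        score = (counts[best] + alpha) / (total + alpha * len(classes))
        rules.append((score, f, best))
    rules.sort(key=lambda r: r[0], reverse=True)
    default = class_counts.most_common(1)[0][0]
    return rules, default

def classify(rules, default, feats):
    """Return the class of the highest-scoring active feature, else the default."""
    for _score, f, label in rules:
        if f in feats:
            return label
    return default

# Hypothetical training data: context words around an unaccented "cote".
examples = [
    ({"de", "la"}, "COAST"), ({"la", "ouest"}, "COAST"),
    ({"du", "autre"}, "SIDE"), ({"a", "de"}, "SIDE"), ({"du", "a"}, "SIDE"),
]
rules, default = train_decision_list(examples)
```

Only the first matching rule decides, so a weak feature such as "de" (split evenly between the classes here) never overrides a strong one such as "du".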

  11. Example: Accent Restoration • (David Yarowsky, 1994): for French and Spanish • Classes: alternative accent restorations for words in text without accent marking • Example: côte (coast) vs. côté (side) • A variant of the general word sense disambiguation problem - “one sense per collocation” motivates using decision lists • Similar tasks: • Capitalization restoration in ALL-CAPS text • Homograph disambiguation in speech synthesis (wind as noun and verb)

  12. Accent Restoration - Features • Word form collocation features: • Single words in window: ±1, ±k (20-50) • Word pairs at <-1,+1>, <-2,-1>, <+1,+2> (complex features) • Easy to implement

  13. Accent Restoration - Features • Local syntactic-based features (for Spanish) • Use a morphological analyzer • Lemmatized features - generalizing over inflections • POS of adjacent words as features • Some word classes (primarily time terms, to help with tense ambiguity for unaccented words in Spanish)

  14. Accent Restoration – Decision Score • Probabilities estimated from training statistics, taken from a corpus with accents • Smoothing - add small constant to all counts • Pruning: • Remove redundancies for efficiency: remove specific features that score lower than their generalization (domingo - WEEKDAY, w1w2 – w1) • Cross validation: remove features that cause more errors than correct classifications on held-out data

  15. “Add-1/Add-Constant” Smoothing
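Add-constant smoothing is usually written as (count + k) / (total + k · V), where V is the number of possible outcomes and k = 1 gives add-1 (Laplace) smoothing. The sketch below uses that standard form; the function and parameter names are illustrative, and the exact variant shown on the original slide is not recoverable from the transcript.

```python
def add_constant_estimate(count, total, num_outcomes, k=1.0):
    """Smoothed probability estimate: (count + k) / (total + k * num_outcomes).
    k = 1 is add-1 smoothing; a smaller k moves less mass to unseen events."""
    return (count + k) / (total + k * num_outcomes)

# Two possible outcomes, 10 observations, one outcome never seen.
p_unseen = add_constant_estimate(0, 10, 2)   # 1 / 12
p_seen = add_constant_estimate(10, 10, 2)    # 11 / 12
```

The smoothed estimates still sum to 1 over all outcomes, and an unseen event no longer gets probability zero.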

  16. Accent Restoration – Results • Agreement with accented test corpus for ambiguous words: 98% • Vs. 93% for baseline of most frequent form • Accented test corpus also includes errors • Worked well for most of the highly ambiguous cases (see random sample in next slide) • Results slightly better than Naive Bayes (weighing multiple features) • Consistent with related study on binary homograph disambiguation, where combining multiple features almost always agrees with using a single best feature • Incorporating many low-confidence features may introduce noise that would override the strong features

  17. Accent Restoration – Tough Examples

  18. Related Application: Anaphora Resolution • (Dagan, Justeson, Lappin, Leass, Ribak 1995) • The terrorist pulled the grenade from his pocket and threw it at the policeman • it → ? • Traditional AI-style approach: manually encoded semantic preferences/constraints • (Diagram: concept hierarchies Weapon → Bombs → grenade and Actions → Cause_movement → throw, drop, linked by an <object – verb> constraint)

  19. Statistical Approach • “Semantic” judgment from a corpus (text collection): <verb–object: throw-grenade> 20 times, <verb–object: throw-pocket> 1 time • it → grenade • Statistics can be acquired from unambiguous (non-anaphoric) occurrences in raw (English) corpus (cf. PP attachment) • Semantic confidence combined with syntactic preferences • “Language modeling” for disambiguation

  20. Word Sense Disambiguation for Machine Translation • I bought soap bars: sense1 (‘chafisa’) or sense2 (‘sorag’)? • I bought window bars: sense1 (‘chafisa’) or sense2 (‘sorag’)? • Corpus (text collection): • Sense1: <noun-noun: soap-bar> 20 times, <noun-noun: chocolate-bar> 15 times • Sense2: <noun-noun: window-bar> 17 times, <noun-noun: iron-bar> 22 times • Features: co-occurrence within distinguished syntactic relations • “Hidden” senses – manual labeling required(?)

  21. Solution: Mapping to Target Language • English(-English)-Hebrew dictionary: bar1 → ‘chafisa’, bar2 → ‘sorag’, soap → ‘sabon’, window → ‘chalon’ • Map ambiguous “relations” to the second language (all possibilities) and count them in a Hebrew corpus: • <noun-noun: soap-bar>: (1) <noun-noun: ‘cahfisat-sabon’> 20 times, (2) <noun-noun: ‘sorag-sabon’> 0 times • <noun-noun: window-bar>: (1) <noun-noun: ‘cahfisat-chalon’> 0 times, (2) <noun-noun: ‘sorag-chalon’> 15 times • Exploiting the difference in ambiguities between the two languages • Principle – intersecting redundancies (Dagan and Itai 1994)
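The mapping idea on this slide can be sketched as a lookup over all candidate target relations. The function name and data layout are assumptions for illustration; the counts are the ones on the slide, and the dictionary forms ‘chafisa’/‘sorag’ are used without modelling Hebrew construct-state inflection.

```python
def rank_translations(source_relation, dictionary, target_counts):
    """Generate every target-language candidate for a two-word source relation
    and rank the candidates by their frequency in the target-language corpus."""
    w1, w2 = source_relation
    candidates = [(t1, t2) for t1 in dictionary[w1] for t2 in dictionary[w2]]
    return sorted(candidates, key=lambda rel: target_counts.get(rel, 0), reverse=True)

# Bilingual dictionary and Hebrew corpus counts, taken from the slide.
dictionary = {"bar": ["chafisa", "sorag"], "soap": ["sabon"], "window": ["chalon"]}
target_counts = {
    ("sabon", "chafisa"): 20, ("sabon", "sorag"): 0,
    ("chalon", "chafisa"): 0, ("chalon", "sorag"): 15,
}

best_soap = rank_translations(("soap", "bar"), dictionary, target_counts)[0]
best_window = rank_translations(("window", "bar"), dictionary, target_counts)[0]
```

No sense-labelled data is needed: the target-language corpus alone separates the two senses of "bar".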

  22. The Selection Model • Constructed to choose (classify) the right translation for a complete relation rather than for each individual word at a time • since both words in a relation might be ambiguous, having their translations dependent upon each other • Assuming a multinomial model, under certain linguistic assumptions • The multinomial variable: a source relation • Each alternative translation of the relation is a possible outcome of the variable

  23. An Example Sentence • A Hebrew sentence with 3 ambiguous words: • The alternative translations to English:

  24. Example - Relational Representation

  25. Selection Model • We would like to use as a classification score the log of the odds ratio between the most probable relation i and all other alternatives (in particular, the second most probable one j): • Estimation is based on smoothed counts • A potential problem: the odds ratio for probabilities doesn’t reflect the absolute counts from which the probabilities were estimated. • E.g., a count of 3 vs. (smoothed) 0 • Solution: using a one sided confidence interval (lower bound) for the odds ratio

  26. Confidence Interval (for a proportion) • Given an estimate, what is the confidence that the estimate is “correct”, or at least close enough to the true value?

  27. Confidence Interval (cont.) • Approximating by normal distribution: the distribution of the sampled proportion (across samples) approaches a normal distribution for large n.

  28. Confidence Interval (cont.)
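For reference, the normal-approximation interval for a proportion is p̂ ± z · sqrt(p̂(1 − p̂)/n). The sketch below computes it with the Python standard library; the function name and the 95% default are illustrative choices, and the approximation is only reasonable for large n, as the slide notes.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Two-sided normal-approximation interval for a sampled proportion."""
    p_hat = successes / n
    # z is the standard normal quantile that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

low, high = proportion_confidence_interval(50, 100)
low_big, high_big = proportion_confidence_interval(5000, 10000)
```

The same proportion estimated from 100 times more data gives an interval 10 times narrower, which is the effect the selection model exploits.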

  29. Selection Model (cont.) • The distribution of the log of the odds ratio (across samples) converges to normal distribution • Selection “confidence” score for a single relation - the lower bound for the odds-ratio: • The most probable translation i for the relation is selected if Conf(i), the lower bound for the log odds ratio, exceeds θ. • Notice roles of θ vs. α, and impact of n1, n2
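One way to realize such a lower bound is ln(n1/n2) − z(1−α) · sqrt(1/n1 + 1/n2), where n1 and n2 are the counts of the two most frequent alternatives and 1/n1 + 1/n2 is the usual delta-method approximation to the variance of the log ratio. The sketch below follows that form; the default α and the replacement of zero counts by 0.5 are assumptions made for the example, not values taken from the slides.

```python
import math
from statistics import NormalDist

def conf(n1, n2, alpha=0.1, zero_count=0.5):
    """One-sided lower bound on ln(p1 / p2), estimated from counts n1 >= n2.
    Zero counts are replaced by 'zero_count' (an assumed smoothing constant)."""
    n1 = n1 if n1 > 0 else zero_count
    n2 = n2 if n2 > 0 else zero_count
    z = NormalDist().inv_cdf(1 - alpha)
    return math.log(n1 / n2) - z * math.sqrt(1 / n1 + 1 / n2)

# Same smoothed odds ratio (6 : 1), very different amounts of evidence.
weak = conf(3, 0)      # 3 vs. smoothed 0.5
strong = conf(60, 10)
```

Both cases have the same estimated odds ratio, yet the bound is negative for the small counts and clearly positive for the large ones. A smaller α widens the margin, while θ sets how large the bound must be before a selection is made.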

  30. Handling Multiple Relations in a Sentence: Constraint Propagation • (1) Compute Conf(i) for each ambiguous source relation. • (2) Pick the source relation with highest Conf(i). If Conf(i) < θ, or if no source relations left, then stop; otherwise, select word translations according to target relation i and remove the source relation from the list. • (3) Propagate the translation constraints: remove any target relation that contradicts the selections made; remove source relations that now become unambiguous. • (4) Go to step 2. • Notice similarity to the decision list algorithm
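The loop on this slide can be sketched as below. The data layout is invented for the example, a plain smoothed log ratio stands in for Conf(i), and recording the forced translation of a relation that becomes unambiguous is a design choice of this sketch rather than something the slide specifies.

```python
import math

def log_ratio(n1, n2, zero_count=0.5):
    """Stand-in confidence score: log ratio of the two highest counts."""
    return math.log(max(n1, zero_count) / max(n2, zero_count))

def select_translations(relations, theta, conf_fn=log_ratio):
    """relations: {relation_id: [(assignment, count), ...]} where assignment
    maps each source word of the relation to one candidate translation."""
    chosen = {}
    pending = {rid: list(alts) for rid, alts in relations.items()}
    while pending:
        scored = []
        for rid, alts in pending.items():
            ranked = sorted(alts, key=lambda a: a[1], reverse=True)
            second = ranked[1][1] if len(ranked) > 1 else 0
            scored.append((conf_fn(ranked[0][1], second), rid, ranked[0][0]))
        best_conf, rid, assignment = max(scored, key=lambda s: s[0])
        if best_conf < theta:
            break                       # no remaining relation is confident enough
        chosen.update(assignment)
        del pending[rid]
        # Propagate: drop alternatives that contradict the selections made so far.
        for other in list(pending):
            pending[other] = [(a, c) for a, c in pending[other]
                              if all(chosen.get(w, t) == t for w, t in a.items())]
            if len(pending[other]) <= 1:
                for a, _count in pending[other]:
                    chosen.update(a)    # assumption: record the forced translation
                del pending[other]
    return chosen

# Hypothetical sentence with words A, B (ambiguous) and C (one translation).
relations = {
    "A-B": [({"A": "a1", "B": "b1"}, 20), ({"A": "a1", "B": "b2"}, 2),
            ({"A": "a2", "B": "b1"}, 1), ({"A": "a2", "B": "b2"}, 0)],
    "B-C": [({"B": "b1", "C": "c1"}, 3), ({"B": "b2", "C": "c1"}, 4)],
}
selected = select_translations(relations, theta=1.0)
```

Taken alone, relation B-C would slightly prefer b2, but the confident decision on A-B fixes B to b1 first and the contradicting alternative is removed.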

  31. Selection Algorithm Example

  32. Evaluation Results • Results - Hebrew→English translation: • Coverage: ~70% • Precision within coverage: ~90% • ~20% improvement over choosing most frequent translation (95% statistical confidence for an improvement relative to this common baseline)

  33. Analysis • Correct selections capture: • Clear semantic preferences: sign/seal treaty • Lexical collocation usage: peace treaty/contract • No selection: • Mostly: no statistics for any alternative (data sparseness) • investigator/researcher of corruption • Also: similar statistics for several alternatives • Solutions: • Consult more features in remote (vs. syntactic) context: prime minister … take position/job • Class/similarity-based generalizations (corruption-crime)

  34. Analysis (cont.) • Confusing multiple sources (senses) for the same target relation: • ‘sikkuy’ (chance/prospect) ‘kattan’ (small/young) • Valid (frequent) target relations: • small chance - correct • young prospect – incorrect, due to - • “Young prospect” is the translation of another Hebrew expression – ‘tikva’ (hope) ‘zeira’ (young) • The “soundness” assumption of the multinomial model is violated: • Assume counting the generated target relations corresponds to sampling the source relation, hence assuming a known 1:n mapping (also completeness – another source of errors) • Potential solutions: bilingual corpus, “reverse” translation

  35. Sense Translation Model: Summary • Classification instance: a relation with multiple words, rather than a single word at a time, to capture immediate (“circular”) dependencies. • Make local decisions, based on a single feature • Taking into account statistical confidence of decisions • Constraint propagation for multiple dependent classifications (remote dependencies) • Decision list style rationale – classifying by a single piece of high-confidence evidence is simpler, and may work better, than considering all weaker evidence simultaneously • Computing statistical confidence for a combination of multiple events is difficult; easier to perform for each event at a time • Statistical classification scenario (model) constructed for the linguistic setting • Important to identify explicitly the underlying model assumptions, and to analyze the resulting errors

  36. Word Sense Disambiguation • Many words have multiple meanings • E.g., river bank, financial bank • Problem: Assign proper sense to each ambiguous word in text • Applications: • Machine translation • Information retrieval (mixed evidence) • Semantic interpretation of text

  37. Compare to POS Tagging? • Idea: Treat sense disambiguation like POS tagging, just with “semantic tags” • The problems differ: • POS tags depend on specific structural cues - mostly neighboring, and thus dependent, tags • Senses depend on semantic context – less structured, longer distance dependency → many relatively independent/unstructured features

  38. Approaches • Supervised learning: learn from a pre-tagged corpus • Dictionary-based learning: learn to distinguish senses from dictionary entries • Unsupervised learning: automatically cluster word occurrences into different senses

  39. Using an Aligned Bilingual Corpus • Goal: get sense tagging cheaply • Use correlations between phrases in two languages to disambiguate • E.g., interest = ‘legal share’ (acquire an interest) or ‘attention’ (show interest); in German: Beteiligung erwerben vs. Interesse zeigen • For each occurrence of an ambiguous word, determine which sense applies according to the aligned translation • Limited to senses that are discriminated by the other language; suitable for disambiguation in translation • Gale, Church and Yarowsky (1992)

  40. Evaluation • Train and test on pre-tagged (or bilingual) texts • Difficult to come by • Artificial data – cheap to train and test: ‘merge’ two words to form an ‘ambiguous’ word with two ‘senses’ • E.g., replace all occurrences of door and of window with doorwindow and see if the system figures out which is which • Useful to develop sense disambiguation methods
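The artificial-data trick can be sketched in a few lines. The function name and the token-list representation are assumptions for the example.

```python
def make_pseudoword_corpus(tokens, word_a, word_b):
    """Replace every occurrence of word_a or word_b by the merged pseudoword.
    Returns the rewritten tokens and the gold 'senses' as (position, original word)."""
    pseudo = word_a + word_b
    merged, gold = [], []
    for position, token in enumerate(tokens):
        if token in (word_a, word_b):
            gold.append((position, token))  # the original word is the gold sense
            merged.append(pseudo)
        else:
            merged.append(token)
    return merged, gold

tokens = "open the door and close the window".split()
merged, gold = make_pseudoword_corpus(tokens, "door", "window")
```

The gold labels come for free from the original text, so any amount of training and test data can be generated without manual tagging.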

  41. Performance Bounds • How good is (say) 83.2%?? • Evaluate performance relative to lower and upper bounds: • Baseline performance: how well does the simplest “reasonable” algorithm do? E.g., compare to selecting the most frequent sense • Human performance: what percentage of the time do people agree on classification? • Nature of the senses used impacts accuracy levels
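The most-frequent-sense lower bound mentioned on this slide can be computed as in the sketch below; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def most_frequent_sense_baseline(train_labels, test_labels):
    """Always predict the majority sense seen in training; return it with its test accuracy."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return majority, correct / len(test_labels)

train = ["financial"] * 7 + ["river"] * 3
test = ["financial", "river", "financial", "financial"]
majority, accuracy = most_frequent_sense_baseline(train, test)
```

A system's accuracy is then read against this figure: a reported 83.2% means little if the baseline is already close to it.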
