This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License. CS 479, Section 1: Natural Language Processing. Lecture #19: Word Sense Disambiguation. Thanks to Dan Klein of UC Berkeley for many of the materials used in this lecture.
Announcements • Project #2, Part 1 • Early: today • Due: Wednesday • Mid-term Exam • How’d it go? • Follow-up on Friday
Objectives • Review the idea behind joint, generative models. • Discuss the challenge of word sense disambiguation and approach it as a classification problem. • Explore knowledge sources (“features”) that might help us do a better job of disambiguating word senses. • Motivate the need for conditional models, trained discriminatively. • Prepare to see maximum entropy training.
Joint / Generative Models • Models for text categorization (Naïve Bayes, class-conditional LMs) • These are joint models == generative models: • Break complex structure down into derivation steps • “factors” or “local models” • Each step is a categorical choice (at least for our current purposes), conditioned on specified context • How to estimate parameters from labeled data? • Collect counts and smooth • Backbone of much of statistical NLP • [Diagram: Naïve Bayes as a generative model — class c generates the word sequence START, w1, w2, …, wn]
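To make “collect counts and smooth” concrete, here is a minimal Python sketch (not from the slides) of Naïve Bayes parameter estimation with add-α smoothing; the function names and the `alpha` hyperparameter are illustrative choices.

```python
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs, alpha=1.0):
    """Estimate Naive Bayes parameters by collecting counts and
    smoothing them (add-alpha). labeled_docs: iterable of (words, label)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)  # word_counts[label][word] = count
    vocab = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    n_docs = sum(class_counts.values())
    vocab_size = len(vocab)
    prior = {c: class_counts[c] / n_docs for c in class_counts}
    totals = {c: sum(word_counts[c].values()) for c in class_counts}
    def likelihood(w, c):
        # Smoothed estimate: never zero, even for unseen words.
        return (word_counts[c][w] + alpha) / (totals[c] + alpha * vocab_size)
    return prior, likelihood
```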
Today • Conditional models • Trained discriminatively • Motivated by the problem of word sense disambiguation
Word Senses • http://www.youtube.com/watch?v=algmKOzTSGE
Word Senses • Words have multiple distinct meanings, or senses: • plant: living plant, manufacturing plant, … • title: name of a work, ownership document, form of address, material at the start of a film, … • Many levels of sense distinctions • Homonymy: totally unrelated meanings (river bank, money bank) • Polysemy: related meanings (star in sky, star on TV) • Systematic polysemy: productive meaning extensions (e.g., an organization’s name also denoting its building) or metaphor • Sense distinctions can be extremely subtle (or not) • Granularity of senses needed depends a lot on the task • Why is it important to model word senses? • Translation, parsing, information retrieval?
Word Sense Disambiguation • Example: living plant vs. manufacturing plant • How do we tell these senses apart? • “context”: “The plant which had previously sustained the town’s economy shut down after an extended labor strike.” • Is this just text classification using words from nearby text, where each word sense represents a class? • Solution: run our Naïve Bayes classifier? • Naïve Bayes classification works OK for noun senses • 90% on classic, shockingly easy examples (line, interest, star) • 80% on Senseval-1 nouns • 70% on Senseval-1 verbs – harder!
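A minimal sketch (not from the slides) of running WSD as Naïve Bayes classification over a bag of nearby words; `prior` and `likelihood` stand in for parameters estimated as in the earlier sketch, and all names are illustrative.

```python
import math

def nb_disambiguate(context_words, senses, prior, likelihood):
    """Pick the sense maximizing log P(sense) + sum log P(word | sense).
    Assumes smoothed (nonzero) likelihoods, e.g. from train_naive_bayes."""
    best_sense, best_score = None, float("-inf")
    for s in senses:
        score = math.log(prior[s])
        for w in context_words:
            score += math.log(likelihood(w, s))
        if score > best_score:
            best_sense, best_score = s, score
    return best_sense

# nb_disambiguate("the plant shut down after an extended labor strike".split(),
#                 ["living_plant", "manufacturing_plant"], prior, likelihood)
```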
Others’ Approaches to WSD • Supervised learning • Most WSD systems do some kind of supervised learning • Many competing classification techniques perform about the same • It’s all about the knowledge sources – “features” – you rely on • Problem: limited labeled training data available • Unsupervised learning • Bootstrapping (Yarowsky 95); remember “one sense per discourse”? • Clustering • Indirect supervision • From thesauri • From WordNet • From parallel corpora
Resources: WordNet • Hand-built (but large) hierarchy of word senses • Basically a hierarchical thesaurus
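WordNet is easy to explore programmatically. A quick sketch (not part of the lecture) using NLTK’s WordNet corpus reader lists the noun senses of “plant”; it assumes NLTK is installed and the WordNet data has been downloaded.

```python
# Assumes: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

# Enumerate the noun senses WordNet records for "plant".
for synset in wn.synsets("plant", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())

# Output begins roughly:
#   plant.n.01 - buildings for carrying on industrial labor
#   plant.n.02 - (botany) a living organism lacking the power of locomotion
```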
Resources • SensEval & SemEval • An ongoing series of WSD competitions • Training / test sets for a wide range of words, difficulties, and parts of speech • Bake-offs where lots of labs try many competing approaches • SemCor • A big chunk of the Brown corpus annotated with WordNet senses • FrameNet • Lexical database of English that is both human- and machine-readable • Annotated examples of how words are used in actual texts • For the student: a dictionary of more than 10,000 word senses, most of them with annotated examples that show the meaning and usage • For the NLP researcher: more than 170,000 manually annotated sentences providing a unique training dataset for semantic role labeling • For the linguist: a valence dictionary, with uniquely detailed evidence for the combinatorial properties of a core set of the English vocabulary • Semantic frame: a description of a type of event, relation, or entity and the participants in it • Other Resources • “Open Mind Common Sense”: open source fact base • Open-domain information extraction from the web • Parallel corpora • Flat thesauri (e.g., Roget’s)
Back to Verb WSD • Why are verbs harder? • Verbal senses are less topical • More sensitive to structure, argument choice • Verb Example: “Serve”
Hacks! Weighted Windows with NB • Distance conditioning • Some words are important only when they are nearby • [Example windows: “… as … point … court … serve … game …” (tennis sense cued by words anywhere in a wide window) vs. “… serve as …” (function sense cued only by the immediately adjacent word)] • Distance weighting • Nearby words should get a larger vote • [Plot: boost(i) as a function of relative position i — the boost decays as the context word gets farther from the target]
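A small sketch of distance-weighted voting over a context window; the 1/(1+distance) boost is an illustrative decay, not necessarily the one used in the lecture.

```python
def weighted_window_votes(tokens, target_index, window=10):
    """Distance-weighted context votes: each word within the window
    votes with weight 1 / (1 + distance), so nearby words count more."""
    votes = {}
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i == target_index:
            continue
        boost = 1.0 / (1 + abs(i - target_index))
        votes[tokens[i]] = votes.get(tokens[i], 0.0) + boost
    return votes
```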
Better Features • There are smarter features: • Argument selectional preference: • serve NP[meals] vs. serve NP[papers] vs. serve NP[country] • Subcategorization: • [function] serve PP[as] • [enable] serve VP[to] • [tennis] serve <intransitive> • [food] serve NP {PP[to]} • Can capture poorly (but robustly) with local windows • … but we can also use a parser and get these features explicitly • Other constraints (Yarowsky, 95) • One-sense-per-discourse • only true for broad topical distinctions • One-sense-per-collocation • pretty reliable when it kicks in: manufacturing plant, flowering plant
Knowledge Sources • [Diagram: Naïve Bayes model — class c generates words w1, w2, …, wn — applied to the window “… point … court … serve … game …”] • Can we use our friend, Naïve Bayes?
Complex Features with Naïve Bayes? • Example: “Washington County jail served 11,166 meals last month - a figure that translates to feeding some 120 people three times daily for 31 days.” • So we have a decision to make based on a set of cues: • context:jail, context:county, context:feeding, … • local-context:jail, local-context:meals • subcat:NP, direct-object-head:meals • Not conditionally independent, given the class! • Not clear how to build a generative derivation for these: • Choose a topic, then decide on having a transitive usage, then pick “meals” to be the object’s head, then generate the other words? • How about the words that appear in multiple features? • Hard to make this work (though maybe possible) • No real reason to try
A Discriminative Approach • View WSD as a discrimination task • Use a conditional model: P(sense | context:jail, context:county, context:feeding, …, local-context:jail, local-context:meals, subcat:NP, direct-object-head:meals, …) • Have to estimate a categorical distribution (over senses) where there are a huge number of things to condition on • Many feature-based classification techniques exist • We tend to need and prefer methods that provide distributions over classes (why?)
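As a preview of where this is headed, a minimal sketch of a conditional model: score each sense by a weighted vote over the active features, then exponentiate and normalize (softmax) to get a distribution over classes. Indexing weights by (feature, sense) pairs is an illustrative choice; this log-linear form is what maximum entropy training (next lecture) fits.

```python
import math

def conditional_sense_probs(active_features, senses, weights):
    """P(sense | features) via a log-linear model: exponentiate each
    sense's weighted feature vote and normalize over all senses."""
    scores = {s: sum(weights.get((f, s), 0.0) for f in active_features)
              for s in senses}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

# conditional_sense_probs({"context:jail", "subcat:NP",
#                          "direct-object-head:meals"},
#                         ["food_sense", "function_sense"], weights)
```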
Feature Representations • Features are functions f_i of the ambiguous word w and its context that indicate the occurrences of certain patterns in the context • Feature values can be binary (indicator / predicate) or counts • Example: “Washington County jail served 11,166 meals last month - a figure that translates to feeding some 120 people three times daily for 31 days.” • context:jail = 1 • context:county = 1 • context:feeding = 1 • context:game = 0 • … • local-context:jail = 1 • local-context:meals = 1 • … • subcat:NP = 1 • subcat:PP = 0 • … • object-head:meals = 1 • object-head:ball = 0
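A sketch of extracting the word-based indicator features above; the context:/local-context: naming follows the slide, while the window size is an illustrative choice. Syntactic features such as subcat: would need a parser and are omitted.

```python
def extract_features(tokens, target_index, local_window=2):
    """Binary indicator features for the ambiguous token at target_index:
    every other word fires a context: feature, and words within
    local_window positions also fire a local-context: feature."""
    feats = set()
    for i, tok in enumerate(tokens):
        if i == target_index:
            continue
        feats.add("context:" + tok.lower())
        if abs(i - target_index) <= local_window:
            feats.add("local-context:" + tok.lower())
    return feats

# extract_features("Washington County jail served 11,166 meals last month".split(), 3)
# -> includes context:washington, ..., local-context:jail, local-context:meals
```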
Linear Classifiers • For a pair (c, w), we take a weighted vote for each class: vote(s) = Σ_i λ_i f_i(s, c, w), summing each feature’s weight λ_i when the feature fires for sense s • There are many ways to set these weights • Perceptron: • find a currently misclassified example • nudge the weights in the direction of a correct classification • Other discriminative methods usually work in the same way: • try out various weights • until you maximize some objective that relies on the truth
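A compact sketch of the perceptron recipe above: classify by weighted vote, and on a mistake nudge weights toward the correct classification. The (feature, sense) weight indexing and the learning rate are illustrative assumptions.

```python
def perceptron_update(weights, feats, gold_sense, senses, lr=1.0):
    """One perceptron step: score each sense by a weighted vote over the
    active features; if the top-scoring sense is wrong, boost weights
    for the gold sense and lower them for the wrong guess."""
    def vote(s):
        return sum(weights.get((f, s), 0.0) for f in feats)
    guess = max(senses, key=vote)
    if guess != gold_sense:
        for f in feats:
            weights[(f, gold_sense)] = weights.get((f, gold_sense), 0.0) + lr
            weights[(f, guess)] = weights.get((f, guess), 0.0) - lr
    return guess
```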
Next • Next class: • How to estimate those weights • Maximum Entropy Models