Word Sense Disambiguation MAS.S60 Catherine Havasi Rob Speer
Banks? • The edge of a river • “I fished on the bank of the Mississippi.” • A financial institution • “Bank of America failed to return my call.” • The building that houses the financial institution • “The bank burned down last Thursday.” • A “biological repository” • “I gave blood at the blood bank.”
Word Sense Disambiguation • Most NLP tasks need WSD • “Played a lot of pool last night… my bank shot is improving!” • Usually keyed to WordNet senses • “I hit the ball with the bat.”
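To see what “keying to WordNet” means in practice, here is a minimal sketch using NLTK's WordNet interface (assumes nltk is installed and the wordnet corpus has been downloaded):

```python
# List the WordNet senses of "bank" with NLTK.
# Assumes: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())
# bank.n.01 - sloping land (especially the slope beside a body of water)
# depository_financial_institution.n.01 - a financial institution ...
```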
Types • “All words” • Guess the WN synset • Lexical Subset • A small number of pre-defined words • Coarse Word Sense • All words, but more intuitive senses • IAA (inter-annotator agreement) is 75–80% for the all-words task with WordNet; 90% for simple binary tasks
What is a Coarse Word Sense? • How many word senses does the word “bag” have in WordNet? • 9 noun senses, 5 verb senses • Coarse WSD: 6 nouns, 2 verbs • A Coarse WordNet: 6,000 words (Navigli and Litkowski, 2006) • These distinctions are hard even for humans (Snyder and Palmer, 2004) • Fine Grained IAA: 72.5% • Coarse Grained IAA: 86.4%
“Bag”: Noun • 1. A coarse sense containing: • bag (a flexible container with a single opening) • bag, handbag, pocketbook, purse (a container used for carrying money and small personal items or accessories) • bag, bagful (the quantity that a bag will hold) • bag, traveling bag, travelling bag, grip, suitcase (a portable rectangular container for carrying clothes) • 2. bag (the quantity of game taken in a particular period) • 3. base, bag (a place that the runner must touch before scoring) • 4. bag, old bag (an ugly or ill-tempered woman) • 5. udder, bag (mammary gland of bovids (cows and sheep and goats)) • 6. cup of tea, bag, dish (an activity that you like or at which you are superior)
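The fine-grained sense counts above can be reproduced directly from WordNet; a short check with NLTK (WordNet 3.0 should match the counts on the slide):

```python
# Count WordNet senses of "bag" by part of speech.
from nltk.corpus import wordnet as wn

nouns = wn.synsets('bag', pos=wn.NOUN)
verbs = wn.synsets('bag', pos=wn.VERB)
print(len(nouns), 'noun senses;', len(verbs), 'verb senses')
# WordNet 3.0: 9 noun senses; 5 verb senses
```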
Frequent Ingredients • Open Mind Word Expert • WordNet • eXtended WordNet (XWN) • SemCor 3.0 (“brown1” and “brown2”) • ConceptNet
No training set, no problem • Julia Hockenmaier’s “Pseudoword” evaluation • Pick two random words • Say, “banana” and “door” • Combine them into one pseudoword • “BananaDoor” • Replace all instances of either word in your corpus with the new pseudoword • Evaluate as usual; the original word is the gold label • A bit easier than real WSD… (see the sketch below)
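A minimal sketch of the pseudoword construction; the tokenization is deliberately naive, and the function name is ours, not Hockenmaier's:

```python
# Build a pseudoword evaluation set: conflate two unrelated words
# into one pseudoword, keeping the true word as the gold "sense".
def make_pseudoword_corpus(sentences, w1='banana', w2='door',
                           pseudo='bananadoor'):
    examples = []
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok in (w1, w2):
                context = tokens[:i] + [pseudo] + tokens[i + 1:]
                examples.append((' '.join(context), tok))
    return examples

corpus = ['I ate a banana for breakfast', 'Please close the door']
for text, gold in make_pseudoword_corpus(corpus):
    print(gold, '=>', text)
# banana => i ate a bananadoor for breakfast
# door => please close the bananadoor
```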
The “Flip-flop” Method • Brown et al., 1991 • Find a single feature, or set of features, that disambiguates the words – think of the indicator features in a named entity recognizer
Standard Techniques • Naïve Bayes (notice a trend) • Bag of words • Priors come from sense frequencies; likelihoods from co-occurring word counts (decision rule below) • Unsupervised clustering techniques • Expectation Maximization (EM) • Yarowsky
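The standard Naïve Bayes decision rule for WSD, for a sense s of target word w with bag-of-words context w₁,…,wₙ:

```latex
\hat{s} = \operatorname*{arg\,max}_{s \in \mathrm{senses}(w)}
          P(s) \prod_{i=1}^{n} P(w_i \mid s)
```

P(s) is estimated from sense frequencies in the training data, and P(wᵢ | s) from smoothed co-occurrence counts.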
Yarowsky (slides from Julia Hockenmaier)
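Yarowsky's method bootstraps from a handful of seed collocations, alternately labeling data and harvesting new collocation rules. A heavily simplified sketch; the real algorithm ranks decision-list rules by smoothed log-likelihood ratio and adds a one-sense-per-discourse pass:

```python
from collections import Counter, defaultdict

def yarowsky(contexts, seeds, rounds=5):
    """contexts: token lists around the ambiguous word;
    seeds: {collocation_word: sense}. Returns one label per context."""
    rules = dict(seeds)                       # collocation -> sense
    labels = [None] * len(contexts)
    for _ in range(rounds):
        # 1. Label every context whose tokens match a known rule.
        for i, ctx in enumerate(contexts):
            hits = [rules[t] for t in ctx if t in rules]
            if hits:
                labels[i] = Counter(hits).most_common(1)[0][0]
        # 2. Harvest new rules: words that co-occur with only one sense.
        seen = defaultdict(set)
        for ctx, lab in zip(contexts, labels):
            if lab:
                for t in ctx:
                    seen[t].add(lab)
        rules.update({t: next(iter(s)) for t, s in seen.items()
                      if len(s) == 1})
    return labels

contexts = [['deposit', 'money', 'bank'], ['river', 'bank', 'muddy'],
            ['bank', 'loan', 'money'], ['fished', 'bank', 'river']]
print(yarowsky(contexts, {'money': 'finance', 'river': 'water'}))
# ['finance', 'water', 'finance', 'water']
```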
Using OMCS • Created a blend using a large number of resources • Created an ad hoc category for a word and its surroundings in the sentence • Find which word sense is most similar to that category • Keep the system machinery as general as possible.
Adding Associations • ConceptNet was included in two forms: • Concept vs. feature matrices • Concept-to-concept associations • Associations help to represent topic areas • If the document mentions computer-related words, expect more computer-related word senses
Calculating the Right Sense • “I put my money in the bank”
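The slides don't spell out the computation, so here is a toy sketch of the “most similar sense” step: represent the sentence context and each candidate sense as vectors in a shared space, then pick the nearest sense by cosine similarity. The sense vectors below are hand-made stand-ins for the real blended ConceptNet/WordNet space; only the selection logic is shown:

```python
import math

def cosine(u, v):
    dot = sum(u.get(d, 0.0) * v.get(d, 0.0) for d in set(u) | set(v))
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Hand-made stand-ins for vectors from the blended semantic space.
sense_vectors = {
    'bank/finance': {'money': 1.0, 'account': 0.8, 'loan': 0.7},
    'bank/river':   {'water': 1.0, 'shore': 0.8, 'fish': 0.6},
}

def disambiguate(context_words):
    # Ad hoc category: a bag-of-words vector over the sentence context.
    context = {w: 1.0 for w in context_words}
    return max(sense_vectors,
               key=lambda s: cosine(context, sense_vectors[s]))

print(disambiguate(['i', 'put', 'my', 'money', 'in', 'the']))
# bank/finance
```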
SemEval Task 7 • 14 different systems were submitted in 2007 • Baseline: most frequent sense • Spoiler!: Our system would have placed 4th • Top three systems: • NUS-PT: parallel corpora with SVM (Chan et al., 2007) • NUS-ML: Bayesian LDA with specialized features (Cai et al., 2007) • LCC-WSD: multiple-methods approach with end-to-end system and corpora (Novischi et al., 2007)
Parallel Corpora • IMVHO the “right” way to do it • Different senses of a word usually translate to different words in another language • Use parallel corpora to find those instances • Like European Parliament or UN proceedings
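A toy illustration of the idea: the translation aligned to the ambiguous word reveals its sense. The alignment pairs and the translation-to-sense table here are invented for illustration:

```python
# Harvest sense labels from word-aligned parallel text (toy data).
SENSE_OF_TRANSLATION = {'banque': 'bank/finance', 'rive': 'bank/river'}

aligned = [
    # (English sentence, French word aligned to "bank")
    ('I deposited cash at the bank', 'banque'),
    ('We walked along the bank of the Seine', 'rive'),
]

training_data = [(en, SENSE_OF_TRANSLATION[fr]) for en, fr in aligned]
for sentence, sense in training_data:
    print(sense, '<-', sentence)
```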
Gold standards are overrated • Rada Mihalcea, 2007: “Using Wikipedia for Automatic Word Sense Disambiguation”
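The core of Mihalcea's trick: Wikipedia links whose anchor text is the ambiguous word point at different articles, so the link target acts as a free sense label. A sketch that handles only the simple [[Target|anchor]] wikitext form:

```python
import re

# Matches [[Target]] and [[Target|anchor text]] wiki links.
LINK = re.compile(r'\[\[([^\]|]+)(?:\|([^\]]+))?\]\]')

def sense_examples(wikitext, word):
    examples = []
    for m in LINK.finditer(wikitext):
        target, anchor = m.group(1), m.group(2) or m.group(1)
        if anchor.lower() == word:
            snippet = wikitext[max(0, m.start() - 40):m.end() + 40]
            examples.append((snippet, target))   # (context, sense label)
    return examples

text = 'She cashed the check at the [[Bank (financial institution)|bank]].'
print(sense_examples(text, 'bank'))
```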
Lab: making a simple supervised WSD classifier • Big thanks to some guy with a blog (Jim Plush) • Training data: Wikipedia articles surrounding “Apple” (the fruit) and “Apple Inc.” • Test data: hand-classified tweets about apples and Apple products • Use familiar features + Naïve Bayes to get > 90% accuracy (see the sketch below) • Optional: use it with tweetstream to show only tweets about apples (the fruit)
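A minimal version of the lab classifier. In the lab the training texts come from the two Wikipedia articles; the tiny training set below is made up, and only the pipeline shape matters:

```python
import nltk

def features(text):
    # Bag-of-words features: the simplest thing that works.
    return {w: True for w in text.lower().split()}

train = [
    ('apple trees bear sweet fruit in autumn', 'fruit'),
    ('the orchard grew apple and pear varieties', 'fruit'),
    ('apple announced a new iphone model today', 'company'),
    ('apple stock rose after the earnings call', 'company'),
]

classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train])

print(classifier.classify(features('sweet apple fruit picked in autumn')))
# should print 'fruit'; with the real Wikipedia training data and
# hand-labeled tweets, this setup reaches the > 90% noted above
```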
Slide Thanks • James Pustejovsky, Gerard Bakx, Julia Hockenmaier • Manning and Schütze