
Multilingual Word Sense Disambiguation using Wikipedia



Presentation Transcript


  1. Multilingual Word Sense Disambiguation using Wikipedia • Bharath Dandala (University of North Texas), Rada Mihalcea (University of North Texas), Razvan Bunescu (Ohio University) • IJCNLP, Oct 16, 2013

  2. Word Sense Disambiguation • Select the correct sense of a word based on its context. • The word bar has multiple senses: bar (establishment), bar (landform), bar (law), bar (music), bar (counter). • Example: "Sumner was admitted to the bar at the age of twenty-three, and entered private practice in Boston."

  3. Word Sense Disambiguation • Use a repository of senses such as WordNet: • Static resource, short glosses, too fine-grained. • Unsupervised: • Similarity between context and sense definition or gloss. • Supervised: • Train on text manually tagged with word senses. • Limited amount of manually labeled data. • Use Wikipedia for WSD: • Large sense repository, continuously growing. • Large training dataset. • Support for multilingual WSD.

  4. Three WSD Systems • WikiMonoSense: • Address the sense-tagged data bottleneck problem by using Wikipedia hyperlinks as a source of sense annotations. • Use the sense-annotated corpora to train monolingual WSD classifiers. • WikiTransSense: • The sense-tagged corpus extracted for the reference language is machine translated into a number of supporting languages. • The word alignments between the reference sentences and the supporting translations are used to generate complementary features in our first approach to multilingual WSD.

  5. Three WSD Systems • WikiMuSense: • The reliance on machine translation (MT) is significantly reduced during training for this second approach to multilingual WSD. • Sense-tagged corpora in the supporting languages are created through the interlingual links available in Wikipedia.

  6. Wikipedia for WSD • wiki markup: '''Palermo''' is a city in [[Southern Italy]], the [[capital city | capital]] of the [[autonomous area | autonomous region]] of [[Sicily]]. • rendered html: Palermo is a city in Southern Italy, the capital of the autonomous region of Sicily. • possible Wikipedia senses of the anchor capital: capital (economics), capital city, human capital, capital (architecture), financial capital
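
The piped links in the wiki markup above are what the approach reads as sense annotations: the anchor text is the ambiguous surface form, and the linked article title names its sense. Below is a minimal sketch of that extraction step in Python; the function and regular expression are illustrative, not the authors' implementation.

```python
import re

# [[Title]] or [[Title | anchor]] links in raw wiki markup.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def extract_sense_annotations(wiki_text):
    """Return (anchor_text, linked_title) pairs, i.e. word occurrences
    labeled with the Wikipedia article (sense) they link to."""
    annotations = []
    for match in WIKI_LINK.finditer(wiki_text):
        title = match.group(1).strip()
        anchor = (match.group(2) or match.group(1)).strip()
        annotations.append((anchor, title))
    return annotations

text = ("'''Palermo''' is a city in [[Southern Italy]], the "
        "[[capital city | capital]] of the [[autonomous area | "
        "autonomous region]] of [[Sicily]].")
print(extract_sense_annotations(text))
# [('Southern Italy', 'Southern Italy'), ('capital', 'capital city'),
#  ('autonomous region', 'autonomous area'), ('Sicily', 'Sicily')]
```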

  7. A Monolingual Dataset through Wikipedia Links • Collect all WP titles that are linked from the anchor word bar. => Bar (law), Bar (music), Bar (establishment), … • Create a sense repository from all titles that have sufficient support in WP (ignore named entities, resolve redirects): => {Bar (law), Bar (music), Bar (establishment), Bar (counter), Bar (landform)} • Use a subset of ambiguous words from Senseval 2 & 3: • Avoid words with only one Wikipedia label. => English (30), Spanish (25), Italian (25), German (25).
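
A sketch of how such a sense repository might be assembled, assuming the linked titles observed for an anchor word, a redirect map, and a named-entity list are already available; the support threshold is illustrative, since the slide only says "sufficient support".

```python
from collections import Counter

def build_sense_repository(linked_titles, redirects, named_entities,
                           min_support=5):
    """linked_titles: titles observed as link targets for one anchor word
    (e.g. 'bar'); redirects: {alias -> canonical title}; named_entities:
    titles to ignore. Returns the titles with enough annotated examples."""
    counts = Counter()
    for title in linked_titles:
        canonical = redirects.get(title, title)   # resolve redirects
        if canonical not in named_entities:       # ignore named entities
            counts[canonical] += 1
    return {title for title, n in counts.items() if n >= min_support}
```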

  8. The WikiMonoSense Learning Framework • For each word, use WP links as examples and train a classifier to distinguish between alternative senses: • each WP sense acts as a different label in the classification model. • Each word context is represented as a vector of features: • Current word and its part-of-speech. • Local context of three words to the left and to the right. • Parts-of-speech of the surrounding words. • Verb and noun before and after the ambiguous word. • A global context implemented through sense-specific keywords, determined as a list of all words occurring at least three times in the contexts defining a certain word sense.
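
A minimal sketch of this feature extractor, assuming Penn Treebank style POS tags and a precomputed keyword list for the target word (the union of its sense-specific keyword lists); all names are illustrative.

```python
def extract_features(tokens, pos_tags, i, keywords):
    """Features for the ambiguous word at position i: the word and its POS,
    a +/-3 token window with POS tags, the nearest preceding/following verb
    and noun, and indicators for the sense-specific keywords."""
    feats = {"word": tokens[i].lower(), "pos": pos_tags[i]}
    for off in range(-3, 4):                      # local +/-3 window
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats[f"w{off:+d}"] = tokens[j].lower()
            feats[f"p{off:+d}"] = pos_tags[j]
    for step in (-1, 1):                          # nearest verb/noun each side
        j = i + step
        while 0 <= j < len(tokens):
            if pos_tags[j].startswith("VB") and f"verb{step:+d}" not in feats:
                feats[f"verb{step:+d}"] = tokens[j].lower()
            if pos_tags[j].startswith("NN") and f"noun{step:+d}" not in feats:
                feats[f"noun{step:+d}"] = tokens[j].lower()
            j += step
    lowered = {t.lower() for t in tokens}
    for kw in keywords:                           # global-context keywords
        if kw in lowered:
            feats[f"kw={kw}"] = 1
    return feats
```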

  9. A Multilingual Dataset through Machine Translation • Treat each of the 4 languages as a reference language: • Use Google Translate to translate the data from the reference language into the other 3 supporting languages. • Translate into French as an additional supporting language. => each reference sentence is translated into 4 supporting languages.
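
A sketch of this expansion step, assuming a translate(text, src, tgt) helper backed by an MT service (Google Translate in the slides); the helper and language codes are illustrative.

```python
# Supporting languages per reference language: the other three of
# English/Spanish/Italian/German, plus French.
SUPPORTING = {"en": ["es", "it", "de", "fr"],
              "es": ["en", "it", "de", "fr"],
              "it": ["en", "es", "de", "fr"],
              "de": ["en", "es", "it", "fr"]}

def expand_with_translations(sentences, ref_lang, translate):
    """Pair each sense-annotated reference sentence with its translations
    into the four supporting languages."""
    return [{"ref": sentence,
             "translations": {tgt: translate(sentence, ref_lang, tgt)
                              for tgt in SUPPORTING[ref_lang]}}
            for sentence in sentences]
```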

  10. Benefits of Machine Translation • Knowledge of the target word translation can help in disambiguation: • Two different senses of the target ambiguous word may be translated into different words in the supporting language. • Assuming access to word alignments. • Features extracted from the translated sentence can be used to enrich the feature space: • For example, the two senses "(unit)" and "(establishment)" of the English word "bar" translate to the same German word "Bar". • In cases like this, words in the context of the German translation may help in identifying the correct English meaning.

  11. The WikiTransSense Learning Framework • Extract the same type of features Φ as in WikiMonoSense. • Append the features from the supporting languages to the vector of features from the reference language: • Φ'EN = [ΦEN | ΦSP ; ΦIT ; ΦDE ; ΦFR]. • Train a multilingual WSD classifier using the augmented feature vectors.
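
A sketch of the feature augmentation, using dictionary features with language prefixes rather than explicit vector concatenation; the effect is the same as the Φ' construction above once the dictionaries are vectorized.

```python
def augment_features(ref_feats, supporting_feats):
    """Append supporting-language features (keyed by language) to the
    reference-language features, mirroring Phi' = [Phi_ref | Phi_supp...]."""
    combined = {f"ref:{k}": v for k, v in ref_feats.items()}
    for lang, feats in supporting_feats.items():
        combined.update({f"{lang}:{k}": v for k, v in feats.items()})
    return combined

# e.g. for English as the reference language:
# augment_features(phi_en, {"es": phi_es, "it": phi_it,
#                           "de": phi_de, "fr": phi_fr})
```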

  12. A Multilingual Dataset through Wikipedia Interlingual Links • Wikipedia articles on the same topic in different languages are often connected through interlingual links. • Use the interlingual links to project the sense repository in the reference language onto a sense repository in the supporting language. • Given that the reference sense repository for the word "bar" in English is: • EN = {bar (establishment), bar (landform), bar (law), bar (music)} • the projected supporting sense repository in German will be: • DE = {Bar (Lokal), Sandbank, NIL, Takt (Musik)} • Use the projected repositories in the supporting languages to train additional WSD classifiers for the reference language senses.
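
A sketch of the projection step, assuming a {reference title -> supporting title} map extracted from Wikipedia's interlingual (language) links; it reproduces the English-to-German example from the slide.

```python
def project_sense_repository(ref_senses, interlingual_links):
    """Map each reference-language sense title to its counterpart article
    in the supporting language, or NIL when no interlingual link exists."""
    return {s: interlingual_links.get(s, "NIL") for s in ref_senses}

en_senses = ["bar (establishment)", "bar (landform)",
             "bar (law)", "bar (music)"]
links_en_de = {"bar (establishment)": "Bar (Lokal)",
               "bar (landform)": "Sandbank",
               "bar (music)": "Takt (Musik)"}
print(project_sense_repository(en_senses, links_en_de))
# {'bar (establishment)': 'Bar (Lokal)', 'bar (landform)': 'Sandbank',
#  'bar (law)': 'NIL', 'bar (music)': 'Takt (Musik)'}
```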

  13. Two Problematic Issues for Interlingual Links • There may be reference language senses that do not have interlingual links to the supporting language: • randomly sample a number of examples for that sense in the reference language. • use Google Translate (GT) to create examples in the supporting language. • The distribution of examples per sense in the corpus for the supporting language may be different from the corresponding distribution for the reference language: • use the distribution of the reference language as the true distribution and compute the number of examples to consider per sense from the supporting languages, following [Agirre & Martinez, 2004].
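
A sketch of the second point: matching the supporting-language corpus to the reference sense distribution. This is a simplified proportional allocation; the paper computes the counts following [Agirre & Martinez, 2004], which is not reproduced here.

```python
def examples_per_sense(ref_counts, supporting_available, total=None):
    """Decide how many supporting-language examples to keep per sense so
    that the proportions follow the reference-language distribution,
    capped by what is actually available."""
    total = total if total is not None else sum(supporting_available.values())
    ref_total = sum(ref_counts.values())
    return {sense: min(round(total * n / ref_total),
                       supporting_available.get(sense, 0))
            for sense, n in ref_counts.items()}
```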

  14. The WikiMuSense Learning Framework • Given an ambiguous word in the reference language, at training time: • Train a probabilistic classifier PR for the reference language: • use the same WP sense repository developed for WikiMonoSense and WikiTransSense. • Train a probabilistic classifier PS for each supporting language: • use the reference sense repository projected into the supporting language. • Use the same types of features as in WikiMonoSense for each classifier. • Five probabilistic classifiers: • One for the reference language (PR). • Four for the supporting languages (PS).
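
A sketch of the training stage with scikit-learn, reusing the feature dictionaries from the WikiMonoSense extractor; Naive Bayes stands in here for a generic probabilistic classifier, since the slide does not name one.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_language_classifier(feature_dicts, sense_labels):
    """Fit one probabilistic classifier (P_R or one of the P_S) for a
    single language from its sense-annotated examples."""
    model = make_pipeline(DictVectorizer(), MultinomialNB())
    model.fit(feature_dicts, sense_labels)
    return model

# p_r = train_language_classifier(ref_feature_dicts, ref_senses)
# p_s = {lang: train_language_classifier(f, s) for lang, (f, s) in supp.items()}
```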

  15. The WikiMuSense Learning Framework • Given an ambiguous word in the reference language, at test time: • Use GT to translate the reference sentence into all supporting languages. • Run the probabilistic classifier PR on the reference sentence and the classifiers PS on the supporting sentences. • Combine the 5 probabilistic outputs into one disambiguation score: • DR = the set of training examples in reference language R. • DS = the set of training examples in supporting language S. • WSD = select the sense that maximizes the combined score.
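
The slide defines |DR| and |DS| but does not reproduce the combination formula, so the sketch below uses a hypothetical corpus-size-weighted average of the per-sense probabilities; it also assumes the supporting classifiers were trained with reference-language sense labels (the projection only gathers their examples).

```python
import numpy as np

def disambiguate(ref_feats, supp_feats, p_r, p_s, d_sizes):
    """Combine P_R and the four P_S into one disambiguation score.
    d_sizes holds |D_R| (key 'ref') and |D_S| per supporting language.
    The size-weighted average below is an assumption, not the paper's
    exact formula."""
    senses = list(p_r.classes_)
    score = d_sizes["ref"] * p_r.predict_proba([ref_feats])[0]
    for lang, model in p_s.items():
        probs = dict(zip(model.classes_,
                         model.predict_proba([supp_feats[lang]])[0]))
        score = score + d_sizes[lang] * np.array(
            [probs.get(s, 0.0) for s in senses])
    return senses[int(np.argmax(score / sum(d_sizes.values())))]
```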

  16. WikiMuSense vs. WikiTransSense • WikiMuSense significantly reduces the # of sentence translations required to create the multilingual dataset. • Features extracted from each supporting language are more diverse, as the sentences are natural rather than translated: • although this may lead to a potential mismatch between the training and testing distributions (training uses natural Wikipedia sentences, while the test-time supporting sentences are machine translated).

  17. Experimental Evaluation • Used a subset of ambiguous words from Senseval 2 & 3: • Avoid words with only one Wikipedia label. => English (30), Spanish (25), Italian (25), German (25).

  18. Experimental Evaluation: Macro & Micro

  19. Experimental Evaluation: Macro Results • WikiMonoSense better than MFS on 76 out of 105 words: • Average relative error reduction of 44%, 38%, 44%, and 28% (one figure per language). • WikiTransSense better than MFS on 83 out of 105 words: • Average relative error reduction over WikiMonoSense of 13.7%: • shows the utility of using features from translated contexts. • WikiMuSense better than MFS on 89 out of 105 words: • Average relative error reduction over WikiMonoSense of 16.5%: • multilingual Wikipedia data can successfully replace the MT component during training.

  20. Varying the Number of Supporting Languages

  21. Varying the Amount of Supporting Language Data • Dip likely due to suboptimal combination of classifiers in WikiMuSense. • [Future work]: train weights for each supporting language.

  22. Varying the Amount of Supporting Language Data • Peak likely due to suboptimal combination of classifiers in WikiMuSense. • [Future work]: train weights for each supporting language. • # of supporting examples = # of reference examples.

  23. Future Work • Train weights for each supporting language when combining classifier outputs in WikiMuSense. • Reduce the number of translations in WikiMuSense by choosing, from the 280 languages in WP, those supporting languages with the largest number of examples per sense. • Directly exploit the distributions used inside an MT system: • eliminate MT altogether from WikiMuSense.

  24. Conclusion • WikiMonoSense: • Use Wikipedia hyperlinks to train monolingual WSD classifiers. • WikiTransSense: • The sense-tagged corpus extracted for the reference language is machine translated into a number of supporting languages. • Use aligned sentences to generate additional features in a first approach to multilingual WSD. • WikiMuSense: • Use the Wikipedia interlingual links to reduce reliance on MT. • Train and combine multiple probabilistic classifiers in a second approach to multilingual WSD.

  25. Questions?
