Multilingual Word Sense Disambiguation using Wikipedia Bharath Dandala (University of North Texas) Rada Mihalcea (University of North Texas) Razvan Bunescu (Ohio University) IJCNLP, Oct 16, 2013
Word Sense Disambiguation • Select the correct sense of a word based on the context: • The word bar has multiple senses: bar (establishment), bar (landform), bar (law), bar (music), bar (counter). Sumner was admitted to the bar at the age of twenty-three, and entered private practice in Boston.
Word Sense Disambiguation • Use a repository of senses such as WordNet: • Static resource, short glosses, too fine-grained. • Unsupervised: • Similarity between context and sense definition or gloss. • Supervised: • Train on text manually tagged with word senses. • Limited amount of manually labeled data. • Use Wikipedia for WSD: • Large sense repository, continuously growing. • Large training dataset. • Support for multilingual WSD.
Three WSD Systems • WikiMonoSense: • Address the sense-tagged data bottleneck problem by using Wikipedia hyperlinks as a source of sense annotations. • Use the sense-annotated corpora to train monolingual WSD classifiers. • WikiTransSense: • The sense-tagged corpus extracted for the reference language is machine translated into a number of supporting languages. • The word alignments between the reference sentences and the supporting translations are used to generate complementary features in our first approach to multilingual WSD.
Three WSD Systems • WikiMuSense: • The reliance on machine translation (MT) is significantly reduced during training for this second approach to multilingual WSD. • Sense-tagged corpora in the supporting languages are created through the interlingual links available in Wikipedia.
Wikipedia for WSD ’’’Palermo’’’ is a city in [[Southern Italy]], the [[capital city | capital]] of the [[autonomous area | autonomous region]] of [[Sicily]]. wiki Palermo is a city in Southern Italy, the capital of the autonomous region of Sicily. html capital (economics) capital city human capital capital (architecture) financial capital
A Monolingual Dataset through Wikipedia Links • Collect all WP titles that are linked from the anchor word bar. => Bar (law), Bar (music), Bar (establishment), … • Create a sense repository from all titles that have sufficient support in WP (ignore named entities, resolve redirects): => {Bar (law), Bar (music), Bar (establishment), Bar (counter), Bar (landform)} • Use a subset of ambiguous words from Senseval 2 & 3: • Avoid words with only one Wikipedia label. => English (30), Spanish (25), Italian (25), German (25).
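The repository-building step above can be sketched as follows; the `min_support` threshold is an assumption (the slides only say "sufficient support"), and the redirect table is illustrative:

```python
from collections import Counter

def build_sense_repository(linked_titles, redirects=None, min_support=5):
    """Collect the WP titles linked from one anchor word, resolve
    redirects, and keep titles with enough occurrences to act as senses."""
    redirects = redirects or {}
    counts = Counter(redirects.get(t, t) for t in linked_titles)
    return {title for title, n in counts.items() if n >= min_support}

# Toy link occurrences for the anchor word "bar":
links = ["Bar (law)"] * 6 + ["Bar (music)"] * 5 + ["Bar association"] * 2
repo = build_sense_repository(links, redirects={"Bar association": "Bar (law)"})
print(repo)  # {'Bar (law)', 'Bar (music)'}
```

Rarely-linked titles fall below the threshold and are dropped from the sense inventory.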
The WikiMonoSense Learning Framework • For each word, use WP links as examples and train a classifier to distinguish between alternative senses: • each WP sense acts as a different label in the classification model. • Each word context is represented as a vector of features: • Current word and its part-of-speech. • Local context of three words to the left and to the right. • Parts-of-speech of the surrounding words. • Verb and noun before and after the ambiguous word. • A global context implemented through sense-specific keywords, determined as a list of all words occurring at least three times in the contexts defining a certain word sense.
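The local-context part of this feature vector can be sketched as below; the feature names (`w0`, `w-1`, …) are illustrative, not the paper's:

```python
def extract_features(tokens, pos_tags, i, window=3):
    """Local-context features for the ambiguous word at index i: the
    word itself, its POS tag, and the words/POS tags within a window
    of three positions to the left and right."""
    feats = {"w0": tokens[i].lower(), "p0": pos_tags[i]}
    for off in range(1, window + 1):
        if i - off >= 0:
            feats[f"w-{off}"] = tokens[i - off].lower()
            feats[f"p-{off}"] = pos_tags[i - off]
        if i + off < len(tokens):
            feats[f"w+{off}"] = tokens[i + off].lower()
            feats[f"p+{off}"] = pos_tags[i + off]
    return feats

tokens = ["Sumner", "was", "admitted", "to", "the", "bar"]
tags = ["NNP", "VBD", "VBN", "TO", "DT", "NN"]
print(extract_features(tokens, tags, 5))
```

Each such dict becomes one training example, labeled with the WP title the link points to.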
A Multilingual Dataset through Machine Translation • Treat each of the 4 languages as a reference language: • Use Google Translate to translate the data from the reference language into the other 3 supporting languages. • Translate into French as an additional supporting language. => each reference sentence is translated into 4 supporting languages.
Benefits of Machine Translation • Knowledge of the target word translation can help in disambiguation: • Two different senses of the target ambiguous word may be translated into different words in the supporting language. • Assuming access to word alignments. • Features extracted from the translated sentence can be used to enrich the feature space: • For example, the two senses "(unit)" and "(establishment)" of the English word "bar" translate to the same German word "Bar". • In cases like this, words in the context of the German translation may help in identifying the correct English meaning.
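Given word alignments from the MT output, the first benefit reduces to reading off which supporting-language word the ambiguous token aligned to. A small sketch, with hypothetical names:

```python
def translation_features(alignment, target_index):
    """Expose the supporting-language translation of the ambiguous word
    (looked up via the word alignment) as an extra feature.
    `alignment` maps source token indices to aligned target words."""
    translated = alignment.get(target_index)
    return {"trans": translated.lower()} if translated else {}

# "bar" at index 5 aligned to German "Anwaltschaft" points toward bar (law):
alignment = {0: "Sumner", 5: "Anwaltschaft"}
print(translation_features(alignment, 5))  # {'trans': 'anwaltschaft'}
```

When the translation collapses two senses (as with "Bar"), this feature is uninformative and the translated context words carry the signal instead.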
The WikiTransSense Learning Framework • Extract the same type of features Φ as in WikiMonoSense. • Append features from supporting languages to the vector of features from the reference language: • Φ′EN = [ΦEN ; ΦSP ; ΦIT ; ΦDE ; ΦFR]. • Train a multilingual WSD classifier using the augmented feature vectors.
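With dict-valued features, the concatenation Φ′EN = [ΦEN ; ΦSP ; ΦIT ; ΦDE ; ΦFR] amounts to prefixing each feature with its language so the languages occupy disjoint dimensions. A sketch (prefix scheme is an assumption):

```python
def augment(phi_ref, phi_by_language):
    """Concatenate the reference-language features with each supporting
    language's features, namespacing by language code."""
    out = {f"EN:{k}": v for k, v in phi_ref.items()}
    for lang, phi in phi_by_language.items():
        out.update({f"{lang}:{k}": v for k, v in phi.items()})
    return out

phi_en = {"w0": "bar", "w-1": "the"}
print(augment(phi_en, {"DE": {"w0": "anwaltschaft"}, "FR": {"w0": "barreau"}}))
```

A single classifier trained on these augmented vectors sees the reference and translated contexts jointly.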
A Multilingual Dataset through Wikipedia Interlingua Links • Wikipedia articles on the same topic in different languages are often connected through interlingua links. • Use interlingua links to project the sense repository in the reference language onto a sense repository in the supporting language. • Given that the reference sense repository for the word "bar" in English is: • EN = {bar (establishment), bar (landform), bar (law), bar (music)} • The projected supporting sense repository in German will be: • DE = {Bar (Lokal), Sandbank, NIL, Takt (Musik)} • Use projected repositories in supporting languages to train additional WSD classifiers for reference language senses.
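The projection step is a straight lookup through the interlingua link table, with NIL for senses whose article has no counterpart in the supporting language (function name is illustrative):

```python
def project_senses(ref_repository, interlingua_links):
    """Project a reference-language sense repository into a supporting
    language via WP interlingua links; unlinked senses map to 'NIL'."""
    return {s: interlingua_links.get(s, "NIL") for s in ref_repository}

en = ["bar (establishment)", "bar (landform)", "bar (law)", "bar (music)"]
links = {"bar (establishment)": "Bar (Lokal)",
         "bar (landform)": "Sandbank",
         "bar (music)": "Takt (Musik)"}
print(project_senses(en, links))
# {'bar (establishment)': 'Bar (Lokal)', 'bar (landform)': 'Sandbank',
#  'bar (law)': 'NIL', 'bar (music)': 'Takt (Musik)'}
```

The NIL cases are exactly the first problematic issue addressed on the next slide.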
Two Problematic Issues for Interlingua Links • There may be reference language senses that do not have interlingua links to the supporting language: • randomly sample a number of examples for that sense in the reference language. • use GT to create examples in the supporting language. • The distribution of examples per sense in the corpus for the supporting language may be different from the corresponding distribution for the reference language: • use the distribution of the reference language as the true distribution and calculate the number of examples to be considered per sense from the supporting languages, following [Agirre & Martinez, 2004].
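The second fix, matching the supporting corpus to the reference sense distribution, can be sketched as proportional allocation (a simplification of the sampling scheme attributed to [Agirre & Martinez, 2004]):

```python
def supporting_counts(ref_counts, supporting_total):
    """Number of supporting-language examples to keep per sense so that
    the sense distribution mirrors the reference language's."""
    ref_total = sum(ref_counts.values())
    return {sense: round(supporting_total * n / ref_total)
            for sense, n in ref_counts.items()}

ref = {"bar (law)": 60, "bar (music)": 30, "bar (landform)": 10}
print(supporting_counts(ref, 50))
# {'bar (law)': 30, 'bar (music)': 15, 'bar (landform)': 5}
```

Senses over-represented in the supporting Wikipedia are subsampled down to these targets.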
The WikiMuSense Learning Framework • Given an ambiguous word in the reference language, at training time: • Train a probabilistic classifier PR for the reference language: • use the same WP sense repository developed for WikiMonoSense and WikiTransSense. • Train a probabilistic classifier PS for each supporting language: • use the reference sense repository projected into the supporting language. • Use the same types of features as in WikiMonoSense for each classifier. • Five probabilistic classifiers: • One for the reference language (PR). • Four for the supporting languages (PS).
The WikiMuSense Learning Framework • Given an ambiguous word in the reference language, at test time: • Use GT to translate the reference sentence into all supporting languages. • Run the probabilistic classifier PR on the reference sentence and the classifiers PS on the supporting sentences. • Combine the 5 probabilistic outputs into one disambiguation score: • DR = the set of training examples in the reference language R. • DS = the set of training examples in supporting language S. • WSD = select the sense that maximizes the score P.
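The slide names |DR| and |DS| but the exact combination formula is not shown here; a plausible sketch, under the assumption that each classifier's posterior is weighted by its training-set size, is:

```python
def disambiguate(p_ref, ps_list, n_ref, ns_list):
    """Combine the reference classifier's posterior with the supporting
    classifiers' posteriors, weighting each by its number of training
    examples (|DR|, |DS|). The weighting scheme is an assumption; the
    slides only say the five outputs are combined into one score."""
    total = n_ref + sum(ns_list)
    score = {}
    for sense, p in p_ref.items():
        s = n_ref * p
        for p_s, n_s in zip(ps_list, ns_list):
            s += n_s * p_s.get(sense, 0.0)
        score[sense] = s / total
    return max(score, key=score.get)

p_en = {"bar (law)": 0.6, "bar (music)": 0.4}   # PR output
p_de = {"bar (law)": 0.2, "bar (music)": 0.8}   # one PS output
print(disambiguate(p_en, [p_de], n_ref=100, ns_list=[50]))  # 'bar (music)'
```

With this weighting, languages with more sense-tagged data pull the combined score harder.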
WikiMuSense vs. WikiTransSense • WikiMuSense significantly reduces the # of sentence translations required to create the multilingual dataset. • Features extracted from each supporting language are more diverse, as the sentences are natural rather than translated: • although this may lead to a potential mismatch between the training and testing distributions.
Experimental Evaluation • Used a subset of ambiguous words from Senseval 2 & 3: • Avoid words with only one Wikipedia label. => English (30), Spanish (25), Italian (25), German (25).
Experimental Evaluation: Macro Results • WikiMonoSense better than MFS on 76 out of 105 words: • Average relative error reduction of 44%, 38%, 44%, and 28%. • WikiTransSense better than MFS on 83 out of 105 words: • Average relative error reduction over WikiMonoSense of 13.7%. • shows the utility of using features from translated contexts. • WikiMuSense better than MFS on 89 out of 105 words: • Average relative error reduction over WikiMonoSense of 16.5%. • shows that multilingual WP data can successfully replace the MT component during training.
Varying the Amount of Supporting Language Data • Dip likely due to suboptimal combination of classifiers in WikiMuSense. • [Future Work]: train weights for each supporting language.
Varying the Amount of Supporting Language Data • Peak likely due to suboptimal combination of classifiers in WikiMuSense. • [Future Work]: train weights for each supporting language. • Peak occurs where # of supporting examples = # of reference examples.
Future Work • Train weights for each supporting language when combining classifier outputs in WikiMuSense. • Reduce the number of translations in WikiMuSense by choosing from the 280 languages in WP those supporting languages with the largest number of examples per sense. • Exploit directly the distributions used inside an MT system: • eliminate MT altogether from WikiMuSense.
Conclusion • WikiMonoSense: • Use Wikipedia hyperlinks to train monolingual WSD classifiers. • WikiTransSense: • The sense-tagged corpus extracted for the reference language is machine translated into a number of supporting languages. • Use aligned sentences to generate additional features in a first approach to multilingual WSD. • WikiMuSense: • Use the Wikipedia interlingual links to reduce reliance on MT. • Train and combine multiple probabilistic classifiers in a second approach to multilingual WSD.