JIGSAW: an Algorithm for Word Sense Disambiguation. Dipartimento di Informatica, University of Bari. Pierpaolo Basile (basilepp@di.uniba.it), Giovanni Semeraro (semeraro@di.uniba.it)
Word Sense Disambiguation • Word Sense Disambiguation (WSD) is the problem of selecting a sense for a word from a set of predefined possibilities • The sense inventory usually comes from a dictionary or thesaurus • Approaches: knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping
All Words WSD • Attempt to disambiguate all open-class words in a text: • “He put his suit over the back of the chair” • How? • Knowledge-based approaches • Use information from dictionaries • Position in a semantic network • Use discourse properties • Minimally supervised approaches • Most frequent sense
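As a concrete reference point for the "most frequent sense" heuristic mentioned above, here is a minimal sketch using NLTK's WordNet interface; it is not part of JIGSAW, and the function name is ours.

```python
# Minimal most-frequent-sense baseline, assuming NLTK and its WordNet data
# are installed (nltk.download('wordnet')). Not part of JIGSAW itself.
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=wn.NOUN):
    """NLTK lists synsets roughly by frequency, so the first one is the MFS."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

print(most_frequent_sense("chair"))  # e.g. Synset('chair.n.01')
```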
JIGSAW • Knowledge-based WSD algorithm • Disambiguation of words in a text by exploiting WordNet senses • Combination of three different strategies to disambiguate nouns, verbs, adjectives and adverbs • Main motivation: the effectiveness of a WSD algorithm is strongly influenced by the POS tag of the target word
JIGSAW algorithm • Input: document d = {w1, w2, ... , wh} • Output: list of WordNet synsets X = {s1, s2, ... , sk} • each element si is obtained by disambiguating the target word wi based on the information obtained from WordNet about the words in its context • context C of the target word: a window of n words to the left and n words to the right, for a total of 2n surrounding words • For each word, JIGSAW adopts a different strategy based on its POS tag
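A hedged sketch of this top-level loop, assuming Penn-Treebank-style POS tags and NLTK's WordNet. The names (context_window, jigsaw, first_sense) are ours, not the authors' API, and the POS-specific strategies are stubbed with a first-sense fallback.

```python
# Sketch of JIGSAW's control flow: a 2n-word context window around each
# target word and a POS-based dispatch. Names are illustrative assumptions.
from nltk.corpus import wordnet as wn

def context_window(words, i, n):
    """n words to the left and n words to the right of position i."""
    return words[max(0, i - n):i] + words[i + 1:i + 1 + n]

def first_sense(word, pos, ctx):
    """Stand-in for the POS-specific strategies sketched on later slides;
    it ignores the context and simply returns the first (most frequent) sense."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

def jigsaw(tagged_words, n=3):
    """tagged_words: list of (word, Penn-Treebank-style POS tag) pairs."""
    words = [w for w, _ in tagged_words]
    senses = []
    for i, (w, tag) in enumerate(tagged_words):
        ctx = context_window(words, i, n)
        if tag.startswith("NN"):
            senses.append(first_sense(w, wn.NOUN, ctx))   # JIGSAW_nouns in the real algorithm
        elif tag.startswith("VB"):
            senses.append(first_sense(w, wn.VERB, ctx))   # JIGSAW_verbs
        elif tag.startswith("JJ"):
            senses.append(first_sense(w, wn.ADJ, ctx))    # JIGSAW_others
        elif tag.startswith("RB"):
            senses.append(first_sense(w, wn.ADV, ctx))    # JIGSAW_others
        else:
            senses.append(None)                           # closed-class words are skipped
    return senses

print(jigsaw([("He", "PRP"), ("put", "VBD"), ("his", "PRP$"), ("suit", "NN"),
              ("over", "IN"), ("the", "DT"), ("back", "NN"),
              ("of", "IN"), ("the", "DT"), ("chair", "NN")]))
```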
JIGSAW_nouns: the idea • Based on the Resnik [Resnik95] algorithm for disambiguating noun groups • Given a set of nouns W = {w1, w2, ... , wn} from document d: • each wi has an associated sense inventory Si = {si1, si2, ... , sik} of possible senses • Goal: assign to each wi the most appropriate sense sih ∈ Si, according to the similarity of wi with the other nouns in W
JIGSAW_nouns: semantic similarity • Example: "The white cat is hunting the mouse", target word w = cat, context C = {mouse} • Sense inventories: Wcat = {cat#1, cat#2} (synset offsets 02037721, 00847815) and T = {mouse#1, mouse#2} (offsets 02244530, 03651364) • cat#1: feline mammal…; cat#2: computerized axial tomography…; mouse#1: any of numerous small rodents…; mouse#2: a hand-operated electronic device… • [Figure] The highest pairwise similarity score (0.726) links cat#1 and mouse#1, while pairs involving the other senses score much lower (0.107, 0.0)
JIGSAW_nouns: MSS support • [Figure] For Wcat = {cat#1, cat#2} and T = {mouse#1, mouse#2}, the MostSpecificSubsumer is {placental_mammal} • Compute the MostSpecificSubsumer (MSS) between words • Give more importance to senses that are hyponyms of the MSS • Combine the MSS support with the semantic similarity score
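A small sketch of the MSS idea for the cat/mouse example, using NLTK's WordNet API. The "support" is reduced here to a boolean hyponymy check; the actual weighting JIGSAW applies is not given on the slide.

```python
# Most Specific Subsumer (lowest common hypernym) for the cat/mouse example.
# Senses below the MSS would receive extra support; how much is not shown
# on the slide, so only the membership test is illustrated.
from nltk.corpus import wordnet as wn

def mss(s1, s2):
    """Most specific subsumer of two synsets."""
    common = s1.lowest_common_hypernyms(s2)
    return common[0] if common else None

def is_supported(sense, subsumer):
    """True if `sense` sits below `subsumer` in the IS-A hierarchy."""
    return any(subsumer in path for path in sense.hypernym_paths())

subsumer = mss(wn.synset('cat.n.01'), wn.synset('mouse.n.01'))
print(subsumer)  # Synset('placental.n.01')

# Animal senses of "cat" fall under the MSS and get support;
# non-animal senses (e.g. the CAT-scan sense) do not.
for sense in wn.synsets('cat', pos=wn.NOUN):
    print(sense.name(), is_supported(sense, subsumer))
```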
Differences between JIGSAW_nouns and Resnik • Leacock-Chodorow measure to calculate similarity (instead of Information Content) • a Gaussian factor G, which takes into account the distance between words in the text • a factor R, which takes into account the synset frequency score in WordNet • a parameterized search for the MSS (Most Specific Subsumer) between two concepts
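A hedged sketch of how the first two modifications might be combined, using NLTK's Leacock-Chodorow measure. The Gaussian width and the multiplicative combination are our assumptions, and the frequency factor R is omitted, since the slide does not give the exact formula.

```python
# Leacock-Chodorow similarity between noun senses, damped by a Gaussian
# factor over word distance in the text. sigma is an illustrative choice.
import math
from nltk.corpus import wordnet as wn

def gaussian_factor(pos_i, pos_j, sigma=3.0):
    """G: context words far from the target in the text count less."""
    return math.exp(-((pos_i - pos_j) ** 2) / (2.0 * sigma ** 2))

def weighted_similarity(sense_i, sense_j, pos_i, pos_j):
    """Distance-weighted Leacock-Chodorow similarity of two noun senses."""
    sim = sense_i.lch_similarity(sense_j) or 0.0   # be defensive if no score
    return sim * gaussian_factor(pos_i, pos_j)

cat1 = wn.synset('cat.n.01')
mouse1 = wn.synset('mouse.n.01')
# "The white cat is hunting the mouse": cat at position 2, mouse at position 6
print(weighted_similarity(cat1, mouse1, 2, 6))
```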
JIGSAW_verbs: the idea • Try to establish a relation between verbs and nouns (which belong to distinct IS-A hierarchies in WordNet) • Verb wi is disambiguated using: • the nouns in the context C of wi • the nouns in the description (gloss + WordNet usage examples) of each candidate synset for wi
JIGSAW_verbs: algorithm [1/4] • For each candidate synset sik of wi: • compute nouns(i, k): the set of nouns in the description of sik • for each wj in C and each synset sik, compute the highest similarity maxjk • maxjk is the highest similarity value for wj with respect to the nouns related to the k-th sense of wi (using the Leacock-Chodorow measure)
JIGSAW_verbs: algorithm [2/4] • Example: "I play basketball and soccer", wi = play, C = {basketball, soccer} • (70) play -- (participate in games or sport; "We played hockey all afternoon"; "play cards"; "Pele played for the Brazilian teams in many important matches") • (29) play -- (play on an instrument; "The band played all night long") • … • nouns(play,1): game, sport, hockey, afternoon, card, team, match • nouns(play,2): instrument, band, night • … • nouns(play,35): …
JIGSAW_verbs: algorithm [3/4] • wi = play, C = {basketball, soccer}, nouns(play,1) = {game, sport, hockey, afternoon, card, team, match} • [Figure] Every sense of the context noun (basketball1 … basketballh) is compared with every sense of the nouns in nouns(play,1) (game1 … gamek, sport1 … sportm, …) • MAXbasketball = max Sim(wi, basketball) over wi ∈ nouns(play,1)
JIGSAW_verbs: algorithm [4/4] • finally, an overall similarity score φ(i, k) between sik and the whole context C is computed • the synset assigned to wi is the one with the highest value of φ(i, k)
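Putting the four steps together, here is a hedged sketch using NLTK. nltk.pos_tag stands in for whatever tagger the authors applied to glosses, and the plain average used in place of the real φ(i, k) is an assumption, since the exact weighting (Gaussian and frequency factors) is not shown on these slides.

```python
# Sketch of JIGSAW_verbs: nouns(i, k) from gloss + usage examples, max_jk per
# context noun via Leacock-Chodorow, and a simple average as a stand-in for
# phi(i, k). Requires nltk.download('punkt', 'averaged_perceptron_tagger', 'wordnet').
import nltk
from nltk.corpus import wordnet as wn

def description_nouns(synset):
    """Nouns in the synset's gloss and WordNet usage examples (nouns(i, k))."""
    text = synset.definition() + " " + " ".join(synset.examples())
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [w.lower() for w, t in tagged if t.startswith("NN")]

def max_jk(context_noun, gloss_nouns):
    """Highest similarity between a context noun and any noun in nouns(i, k)."""
    best = 0.0
    for cs in wn.synsets(context_noun, pos=wn.NOUN):
        for g in gloss_nouns:
            for gs in wn.synsets(g, pos=wn.NOUN):
                best = max(best, cs.lch_similarity(gs) or 0.0)
    return best

def disambiguate_verb(verb, context_nouns):
    """Return the candidate verb synset with the highest aggregate score."""
    if not context_nouns:
        return None
    best_synset, best_score = None, -1.0
    for synset in wn.synsets(verb, pos=wn.VERB):
        gloss_nouns = description_nouns(synset)
        if not gloss_nouns:
            continue
        score = sum(max_jk(c, gloss_nouns) for c in context_nouns) / len(context_nouns)
        if score > best_score:
            best_synset, best_score = synset, score
    return best_synset

# "I play basketball and soccer"
print(disambiguate_verb("play", ["basketball", "soccer"]))
```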
JIGSAW_others • Based on the WSD algorithm proposed by Banerjee and Pedersen [Banerjee02] (inspired by Lesk) • Idea: compute the overlap between the glosses of each candidate sense of the target word and the glosses of all the words in its context • assign the synset with the highest overlap score • if ties occur, the most common synset in WordNet is chosen
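A minimal sketch of the gloss-overlap idea, assuming NLTK's WordNet. Plain bag-of-words overlap on glosses is a simplification of Banerjee and Pedersen's adapted measure, and the tie-break relies on NLTK returning synsets in frequency order.

```python
# Gloss-overlap (Lesk-style) scoring for adjectives/adverbs: the candidate
# sense whose gloss shares the most words with the context glosses wins;
# on ties the earlier (more common) sense is kept.
from nltk.corpus import wordnet as wn

def gloss_bag(word):
    """All gloss words of all senses of `word`, as a set of lowercase tokens."""
    bag = set()
    for s in wn.synsets(word):
        bag.update(s.definition().lower().split())
    return bag

def disambiguate_other(word, context, pos=wn.ADJ):
    context_bag = set()
    for c in context:
        context_bag |= gloss_bag(c)
    best, best_overlap = None, -1
    for s in wn.synsets(word, pos=pos):      # candidates in frequency order
        overlap = len(set(s.definition().lower().split()) & context_bag)
        if overlap > best_overlap:           # strict ">" keeps the more common sense on ties
            best, best_overlap = s, overlap
    return best

print(disambiguate_other("cold", ["weather", "winter"], pos=wn.ADJ))
```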
Evaluation • EVALITA All-Words Task • disambiguate all words in a text • Dataset • 16 texts in Italian • about 5000 words (annotated with ItalWordNet senses) • Processing (META) • WSD requires other NLP steps: • text normalization and tokenization • part-of-speech tagging (based on ACOPOST) • lemmatization (based on the Morph-it! resource)
Evaluation (Results) • Comments: • results are encouraging, considering that our system exploits only ItalWordNet • the pre-processing phases (lemmatization and POS-tagging) introduce errors: • 77.66% lemmatization precision • 76.23% POS-tagging precision
Conclusions • Knowledge-based WSD algorithm • A different strategy for each POS tag • Uses only WordNet (ItalWordNet) and some heuristics • Advantage: the same strategy works for both Italian and English [Basile07] • Drawback: low precision (at present)
Future Work • Include new knowledge sources: • the Web (e.g. topic signatures) • Wikipedia (e.g. Wikipedia-based similarity) • avoid resources that are available for only a few languages • Use other heuristics: • statistical distribution of senses instead of WordNet frequency • WordNet Domains
References • S. Banerjee and T. Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In CICLing 2002: Proc. 3rd Int'l Conf. on Computational Linguistics and Intelligent Text Processing, pages 136–145, London, UK, 2002. Springer-Verlag. • P. Basile, M. de Gemmis, A.L. Gentile, P. Lops, and G. Semeraro. JIGSAW algorithm for word sense disambiguation. In SemEval-2007: 4th Int'l Workshop on Semantic Evaluations, pages 398–401. ACL Press, 2007. • C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, pages 305–332. MIT Press, 1998. • P. Resnik. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora, pages 54–68. Association for Computational Linguistics, 1995.
Results (detail) • Comments: • the polysemy of verbs is high • proper nouns generally have only one sense • the lemmatizer and POS-tagger perform worst on adjectives
JIGSAW_nouns: the idea • Inspired by the approach proposed by Resnik [Resnik95] • [Figure] Each noun wi in [w1, w2, … wn] has a list of candidate synsets [si1, si2, … sik], and a score is computed for each candidate (e.g. [0.4 0.3 0.5], [0.2 0.3 0.4], [0.6 0.1 0.2]) • The most plausible assignment of senses to a set of co-occurring nouns is the one that maximizes the relatedness of meanings among the senses chosen • Relatedness is measured by computing a score for each sij: the confidence with which sij is the most appropriate synset for wi • The score is a way to assign credit to word senses based on similarity with co-occurring words (support) [Resnik95]
JIGSAW_nouns: the support • [Figure] For each pair of candidate synsets of co-occurring words, a semantic similarity score is computed (example scores: 0.56, 0.15) together with their most specific subsumer (MSS) • The more similar two words are, the more informative is the most specific concept that subsumes both of them
JIGSAW_nouns: the idea (example) • [Figure] Fragment of the WordNet IS-A hierarchy: placental mammal → carnivore → feline, felid → cat (feline mammal), and placental mammal → rodent → mouse (rodent) • For cat and mouse, the most specific subsumer is MSS = placental mammal • The senses compatible with the MSS (cat as feline mammal, mouse as rodent) receive the highest support scores (e.g. 0.56), while the other senses score 0.0
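To make the score matrix in the figure concrete, here is a small sketch that builds a pairwise sense-by-sense similarity matrix for cat and mouse with NLTK's Leacock-Chodorow measure. The printed numbers will not match the 0.56/0.0 values on the slide, which come from a different, normalized measure; only the ranking pattern is comparable.

```python
# Pairwise similarity matrix between the candidate senses of "cat" and "mouse".
# The row for cat.n.01 (feline mammal) peaks at mouse.n.01 (rodent),
# mirroring the support pattern sketched in the figure.
from nltk.corpus import wordnet as wn

cat_senses = wn.synsets('cat', pos=wn.NOUN)
mouse_senses = wn.synsets('mouse', pos=wn.NOUN)

for c in cat_senses:
    row = [round(c.lch_similarity(m) or 0.0, 2) for m in mouse_senses]
    print(c.name(), row)
```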