Evaluating Word Sense Induction and Disambiguation Methods
Ioannis P. Klapaftis • Suresh Manandhar
Poster by Sumedh Masulkar
Guided by Prof. Amitabh Mukherjee

INTRODUCTION
Word Sense Induction (WSI) is the task of automatically deducing the senses or uses of a word with multiple meanings (the target word) directly from text, without relying on any external resources such as dictionaries or sense-tagged data. It is also known as unsupervised Word Sense Disambiguation (WSD), since WSI methods automatically disambiguate the ambiguous occurrences of a given word. This paper presents a thorough description of the SemEval-2010 WSI task and a new evaluation setting for sense induction methods.
WSD was first formulated as a computational task during the early days of machine translation in the 1940s, making it one of the oldest problems in NLP. Applications of WSI include differentiating homonyms in Web Information Retrieval (IR), where it helps to develop a theory of text-based IR. WSI is also useful in search engines for returning better results for a query, and in the online translation of websites and texts, making content available to different people in different languages.

PREVIOUS WORKS
Agirre et al. (2006) propose evaluating and optimizing the parameters of an unsupervised graph-based WSD algorithm.
Aitor Soroa and Eneko Agirre (SemEval-2007), Evaluating Word Sense Induction and Discrimination Systems, propose a method for comparing sense-induction and discrimination systems with one another, and for comparing them to supervised and knowledge-based systems.
Ioannis P. Klapaftis and Suresh Manandhar (SemEval-2010), Evaluation Setting for Word Sense Induction and Disambiguation Systems: the F-score, in the setting of the SemEval-2007 WSI task, suffers from the matching problem, which does not allow (1) the assessment of the entire membership of clusters, and (2) the evaluation of all clusters in a given solution. The authors present V-Measure as a means of objectively assessing WSI methods in an unsupervised setting, together with a small modification of the supervised evaluation.
Marianna Apidianaki and Tim Van de Cruys, A Quantitative Evaluation of Global Word Sense Induction, compare the performance of such algorithms to an algorithm that uses a 'global' approach, i.e. the different senses of a particular word are determined by comparing them to, and demarcating them from, the senses of other words in a full-blown word space model.

APPROACH
The authors present a new evaluation setting that assesses participating systems' performance according to the skewness of the target words' distribution of senses, showing that some methods are able to perform well above the Most Frequent Sense (MFS) baseline even on highly skewed distributions.
In SemEval-2007 the unsupervised evaluation used the F-score, the harmonic mean of precision and recall. SemEval-2010 introduces a new measure, V-Measure, the harmonic mean of h (homogeneity) and c (completeness).
The authors define the skewness of a distribution, where x_i refers to the frequency of sense i, i.e. the number of target word instances that have been tagged with sense i in the gold standard, x̄ refers to the mean of the distribution, and N is the total number of target word instances.
For a given class (noun or verb), the three categories were generated by the following process (see the sketches after this list):
1. The skewness of each target word was calculated.
2. Target words were sorted according to their skewness.
3. All target words were assigned to one skewness category, so that the three categories have roughly the same total number of target word instances.
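A minimal sketch of this three-step grouping, under stated assumptions: the sense frequencies are invented toy data rather than the SemEval-2010 gold standard, the skewness is taken to be the standard third-moment skewness as implemented in scipy.stats.skew, and the greedy split in step 3 is one possible reading of the procedure, not the authors' code.

# Sketch of the skewness-based grouping described above (toy data, not the authors' code).
from scipy.stats import skew

# target word -> frequency of each gold-standard sense (the x_i values); invented examples
sense_freqs = {
    "promotion.n": [81, 6, 2],    # highly skewed towards one sense
    "cell.n":      [40, 35, 30],  # fairly balanced
    "operate.v":   [60, 25, 10],
}

# 1. Compute the skewness of each target word's sense distribution.
skewness = {w: skew(freqs) for w, freqs in sense_freqs.items()}

# 2. Sort target words according to their skewness.
ordered = sorted(sense_freqs, key=lambda w: skewness[w])

# 3. Assign each word to one of three categories so that the categories
#    hold roughly the same total number of target word instances.
total = sum(sum(f) for f in sense_freqs.values())
categories, current, seen = [[], [], []], 0, 0
for word in ordered:
    seen += sum(sense_freqs[word])
    categories[current].append(word)
    if seen > (current + 1) * total / 3 and current < 2:
        current += 1

print(skewness)
print(categories)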
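For the V-Measure mentioned above, the following sketch computes homogeneity, completeness, and their harmonic mean with scikit-learn on invented gold and induced sense labels; it only illustrates the measure and is not the official task scorer (the paired F-Score, the other unsupervised measure, is not shown here).

# Toy illustration of V-Measure = harmonic mean of homogeneity and completeness.
from sklearn.metrics import homogeneity_completeness_v_measure

# Gold-standard senses of ten instances of a target word (invented).
gold    = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
# Sense clusters induced by a hypothetical WSI system for the same instances.
induced = [1, 1, 1, 2, 2, 2, 2, 3, 3, 1]

h, c, v = homogeneity_completeness_v_measure(gold, induced)
print(f"homogeneity={h:.3f} completeness={c:.3f} V-Measure={v:.3f}")
# V-Measure = 2*h*c / (h + c): high only when clusters are both pure and complete.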
RESULTS
[The Results (1) and Results (2) panels of the poster are figures/tables and are not reproduced here.]
As in the official evaluation, we observe that systems generating a higher number of clusters achieve a high V-Measure, although their performance does not increase monotonically as the number of clusters grows.
Given that performance in the paired F-Score seems to be even more biased towards a small number of clusters than V-Measure is towards a high number of clusters, the paired F-Score does not offer any discriminative information among the three skewness categories.

CONCLUSIONS
This paper presented the main differences of the SemEval-2010 task from the corresponding SemEval-2007 WSI challenge, and subsequently evaluated the participating systems in terms of their unsupervised (V-Measure, paired F-Score) and supervised performance according to the skewness of the target words' distribution of senses. The results seem to justify the authors' claims, although the paired F-Score does not offer any discriminative information among the three skewness categories. This is nevertheless an important evaluation setting, in which the results of systems can be interpreted in terms of the number of generated clusters and the distribution of target word instances within the clusters.

CONTACT
Sumedh Masulkar
sumedh@iitk.ac.in
(+91) 8005463472