Annotating Words using WordNet Semantic Glosses. Julian Szymański, Department of Computer Systems Architecture, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Poland, julian.szymanski@eti.pg.gda.pl. Włodzisław Duch, Department of Informatics, Nicolaus Copernicus University, Toruń, Poland; School of Computer Engineering, Nanyang Technological University, Singapore; Google: W. Duch
Outline Motivation for Word Sense Disambiguation “Semantic Glosses” approach SG algorithm SG in action Aggregated results from small experiments Conclusions, problems and (possible) solutions Deliverables
Introduction Ambiguity of natural language is the source of many problems in automatic text processing. It is quite evident, for example, in classification or clustering of documents represented by features derived from word frequencies. Automatic semantic annotation is still a great challenge, requiring a solution to the word sense disambiguation (WSD) problem. WSD addresses many issues: How to distinguish and represent word meanings? How to create the Semantic Web? Manually: introduce elementary atoms of meaning, set the level of granularity of senses and their relations to each other. Synonyms and/or homonyms must be considered when acquiring word senses in an automatic way. So far the most successful: Latent Semantic Indexing. Semantic annotations make it possible to go beyond the bag-of-words representation.
Our approach Focus on word sense disambiguation during the initial text processing phase: map words from texts to structures that carry elementary meanings and may be treated as semantic atoms (senses). WordNet synsets group words into sets of synonyms related to word definitions, provide sense identifiers, and record semantic relations between synsets. Employ synsets in order to use the WordNet semantic network formed by relations between synsets. Text annotated at a higher abstraction level can be clustered better because similarities between texts become clearer. Enhance document representation with superordinate categories. This works even better for clustering, simulating the spreading of neural activation responsible for associations and simple inferences taking place in the reader’s brain. The main issue is how to map words into synsets.
Semantic Atlas spirit: 79 words, 69 cliques = minimal units with specific meaning. Synset = collection of synonyms in WordNet. http://dico.isc.cnrs.fr/en/index.html
Typical approaches to WSD for selecting the proper sense of a given word employ a hierarchy of taxonomical relations, or analyse the context of the disambiguated word to find features that allow its proper meaning to be selected (e.g. the Lesk algorithm). Starting with version 3.0, WordNet also provides a semantically annotated, disambiguated gloss corpus. Glosses are short definitions providing the proper meanings of words and thus of whole synsets. The gloss annotations also cover concepts and collocations (multiword forms), tagging discontinuous spans of text. For example, “personal or business relationship” is converted to “personal_relationship”, “business_relationship”. Glosses have been linked manually to the context-appropriate sense in WordNet, disambiguating the corpus. The Semantic Glosses (SG) approach employs relations between synsets, or more precisely relations obtained from references between synsets that are related to their definitions. They form a network of conceptually related synsets, as opposed to a structured hierarchy.
The algorithm The disambiguated word W is mapped onto its possible meanings (synsets) {Ts(W)}. For each synset in the {Ts(W)} set, retrieve all synsets Tgs that may be derived from its glosses. Rank all Ts synsets according to the number of relations with glosses in Tgs.
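The three steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the toy tables `SENSES` and `GLOSS_LINKS`, the synset identifiers and the function name `rank_senses` are all hypothetical, and the "number of relations" is approximated here by counting how many synsets referenced from a candidate's gloss are also possible senses of the context words.

```python
from collections import Counter

# Hypothetical toy data: candidate synsets for each word form.
SENSES = {
    "shell": ["shell.n.carapace", "shell.n.eggshell", "shell.n.ammunition"],
    "turtle": ["turtle.n.reptile"],
    "animal": ["animal.n.organism"],
    "egg": ["egg.n.ovum"],
    "bird": ["bird.n.vertebrate"],
    "gun": ["gun.n.weapon"],
}

# Hypothetical toy data: synsets referenced from each synset's annotated gloss.
GLOSS_LINKS = {
    "shell.n.carapace": {"turtle.n.reptile", "animal.n.organism"},
    "shell.n.eggshell": {"egg.n.ovum", "bird.n.vertebrate"},
    "shell.n.ammunition": {"gun.n.weapon", "explosive.n.substance"},
}


def rank_senses(word, context_words):
    """Rank the candidate synsets Ts(W) of `word` by gloss-link overlap with the context."""
    # Assumption: context words are not disambiguated, so all their senses count.
    context_synsets = set()
    for w in context_words:
        context_synsets.update(SENSES.get(w, []))

    scores = Counter()
    for ts in SENSES.get(word, []):              # step 1: candidate senses Ts(W)
        tgs = GLOSS_LINKS.get(ts, set())         # step 2: synsets Tgs derived from the gloss
        scores[ts] = len(tgs & context_synsets)  # step 3: count shared relations
    # Highest score first; ties are broken by synset id to keep the order deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(rank_senses("shell", ["turtle", "animal"]))
```

In a real setting the two tables would be filled from the WordNet annotated gloss corpus; the toy data only makes the ranking step visible.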
Example First, create test sets for multi-sense words. Each sense has its own text. We compare our approach (SG) against the Stanford parser (SP).
Aggregated results The evaluation of the SG approach has been performed on a test set of eight multi-sense words. For the different senses of these words, 51 test texts have been prepared and manually annotated with the proper senses.
Conclusions I Good: The algorithm that employs semantically annotated glosses provides quite promising results. So far it has been evaluated only on a small test set of 8 multi-sense words (51 different meanings). As the preliminary results are promising, the method is now being tested on a larger scale, and many improvements will be introduced.
Conclusions: problems Different meanings of the same word in one sentence, e.g.: Turtle’s shells provide protection to parts of the animal body, like egg shell protects birds’ embryo. The first ‘shell’ is related to the turtle shell, the second to the egg shell. Disambiguating such cases is relatively easy for humans because, using semantic memory, collocations are easily discovered and require a much smaller context for proper sense classification. Experiments with variable context length, dependent on the number of identical words with different meanings in one sentence, will be performed to check how to deal with such difficulties.
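One way such a variable context length might be tried is sketched below. The helper names `adaptive_radius` and `context_windows`, the rule that halves the window as occurrences double, and the simplified token list are assumptions made for illustration only; the slides do not specify how the context length would be chosen.

```python
def adaptive_radius(tokens, target, base_radius=4):
    """Hypothetical heuristic: shrink the window when the target word repeats."""
    occurrences = tokens.count(target)
    return max(1, base_radius // max(1, occurrences))


def context_windows(tokens, target, radius):
    """Return one (position, context_tokens) pair per occurrence of `target`."""
    windows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - radius):i]
            right = tokens[i + 1:i + 1 + radius]
            windows.append((i, left + right))
    return windows


# Simplified, lemmatized version of the example sentence (an assumption).
tokens = "turtle shell protect animal body like egg shell protect bird embryo".split()

radius = adaptive_radius(tokens, "shell")
for position, context in context_windows(tokens, "shell", radius):
    print(position, context)
```

With a narrower window each occurrence of "shell" sees mostly its own neighbours ("turtle" for the first, "egg" for the second), so each occurrence can be ranked separately.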
Conclusions: more problems Some WordNet synsets are larger and have more relations than others; the distribution is very uneven. This causes a preference for larger synsets that may confuse many algorithms, degrading results for meanings that correspond to synsets with a small number of relations. To simulate the effects of spreading activation, weighted relations between synsets may be introduced, describing patterns of more and less important activations.
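The size bias described above, and one possible way to reduce it, can be illustrated with a small sketch. The function `weighted_score`, the relation weights and the normalization by total outgoing weight are assumptions chosen for illustration; the slides only state that weighted relations may be introduced.

```python
def weighted_score(gloss_links, context_synsets, normalize=True):
    """Score a candidate synset from its weighted gloss links.

    gloss_links: dict mapping a linked synset id to a relation weight.
    context_synsets: set of synset ids found in the context.
    """
    total = sum(gloss_links.values())
    if total == 0:
        return 0.0
    matched = sum(w for s, w in gloss_links.items() if s in context_synsets)
    # Assumption: dividing by the total weight offsets the advantage of large synsets.
    return matched / total if normalize else matched


# Hypothetical candidates: one with many gloss links, one with few.
large = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 1.0, "f": 1.0}
small = {"x": 1.0, "y": 1.0}
context = {"a", "b", "x"}

print(weighted_score(large, context, normalize=False),
      weighted_score(small, context, normalize=False))
print(weighted_score(large, context), weighted_score(small, context))
```

With raw counts the large synset scores higher simply because it has more links; after normalization the small synset, half of whose links match the context, comes first.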
A few more ideas Explore the use of WordNet structural information given in predefined relations, which extends the network of relations between synsets. Use references between glosses obtained from higher-order relations, which should have smaller weights. Employ additional relations obtained by mining Wikipedia hyperlinks to introduce more relations between synsets. This task first requires a mapping between WordNet synsets and Wikipedia articles. Results of the semi-automatic approach to performing such a mapping are quite good. Challenge: use of negative knowledge about the words present in glosses that do not appear in the wider context.
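The idea of higher-order gloss references with smaller weights could be prototyped along these lines. The function `expand_with_decay`, the decay factor of 0.5, the depth limit and the toy link table are hypothetical choices for illustration, not values taken from the slides.

```python
from collections import deque


def expand_with_decay(start, links, max_depth=2, decay=0.5):
    """Collect synsets reachable from `start` through gloss links.

    First-order links get weight 1.0; each further hop multiplies the weight by `decay`.
    Breadth-first search ensures every synset keeps the weight of its shortest path.
    """
    weights = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in sorted(links.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                weights[nxt] = decay ** depth
                queue.append((nxt, depth + 1))
    return weights


# Hypothetical chain of gloss references.
links = {
    "shell.n.carapace": {"turtle.n.reptile"},
    "turtle.n.reptile": {"reptile.n.vertebrate"},
    "reptile.n.vertebrate": {"animal.n.organism"},
}

print(expand_with_decay("shell.n.carapace", links))
```

The resulting weights could then replace the plain relation counts when ranking candidate synsets.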
Deliverables The application for disambiguation and evaluation can be downloaded free of charge from: http://kask.eti.pg.gda.pl/semagloss/annotations.zip This project also resulted in the development of an API in C# and Java for the WordNet semantically annotated gloss corpus. The API is available for download at http://kask.eti.pg.gda.pl/semagloss/index.html Associating WordNet with Wikipedia: http://kask.eti.pg.gda.pl/CompWiki => WordNet tab.
Thank you for lending your ears http://kask.eti.pg.gda.pl/CompWiki Google: W Duch => Papers