An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee 2007-08-17 yonggulee@hotmail.com
Contents • Introduction • Related Works • Research Goals • Effective Word Sense Disambiguation Model and Evaluation • Conclusion
Introduction • Word Sense Disambiguation (WSD) • The problem of selecting a sense for a word from a set of predefined possibilities. • “Intermediate task” which is not an end in itself, but rather is necessary at one level or another. • Obviously essential for language understanding applications. • Machine translation • Information retrieval and hypertext navigation • Content and thematic analysis • Speech processing and text processing
Related works(1/3) • Approaches to WSD • Knowledge-Based Disambiguation • use of external lexical resources such as dictionaries and thesauri • discourse properties • Corpus-based Disambiguation • Hybrid Disambiguation
Related works(2/3) • Corpus-based Disambiguation • Supervised Disambiguation • based on a labeled training set • the learning system has: • a training set of feature-encoded inputs AND • their appropriate sense label (category) • Unsupervised Disambiguation • based on unlabeled corpora • The learning system has: • a training set of feature-encoded inputs BUT • NOT their appropriate sense label (category)
Related works(3/3) • Lexical Resources for WSD • Machine readable format • Machine Readable Dictionaries (MRD): Longman, Oxford, etc. • Thesauri and semantic networks: Roget's Thesaurus, WordNet, etc. • Sense tagged data • Senseval-1, 2, 3 (www.senseval.org) • Provides sense annotated data for many languages, for several tasks • Languages: English, Romanian, Chinese, Basque, Spanish, etc. • Tasks: Lexical Sample, All words, etc. • SemCor, Hector, etc.
Research Motivation • Manual sense tagging • Labor-intensive and costly • Limited availability of sense-tagged corpora • Apart from English, most languages have few corpora for WSD. • Coverage of sense-tagged words • Some corpora tag the senses of only one or a few words. • “Line” corpus, “interest” corpus, etc. • With a supervised disambiguation method, only words that appear in the sense-tagged corpus can be disambiguated.
Research Goals • Minimize or eliminate the cost of manual labeling. • Automatic sense tagging using MRD and heuristic rules • Improve the performance of word sense disambiguation. • Using supervised disambiguation • Naïve Bayes classifier
Effective Word Sense Disambiguation Model • Automatic Tagging Technique • Experimental Environment • Evaluation of Automatic Tagging Technique • Evaluation of Sense Classification • Evaluation of Fusion Method
An Outline Diagram for the Proposed Research • [Diagram] Two stages: Automatic Sense Tagging and Training; Sense Classification • Components shown: Collection, Dictionary, Key Word Extraction, Collocation Extraction, Context Extraction of Target Word, Auto Sense Tagging Module, Sense Tagging, Training Set, Test Context, Naïve Bayes Classifier, Classify Word Sense, Evaluation
Automatic Tagging Technique • Dictionary Information-based Method • Collocation Overlap-based Method • Data Fusion Method • Dictionary Information-based Method + Collocation Overlap-based Method
Dictionary Information-based Method(1/2) • Extract the necessary information from the dictionary. • Heuristic 1: One Sense per Collocation / One Sense per Discourse • Telephone line, 景氣展望/Gyeonggi-jeonmang (economic prospect) • Heuristic 2: Use of the corresponding Chinese characters • 감자/Gamja: 柑子 (Potato) / 減資 (Reduction of capital) • Heuristic 3: Co-occurrence of synonyms, antonyms and related terms • Heuristic 4: Occurrence of derived words
Dictionary Information-based Method(2/2) • Heuristic 5: Co-occurrence of key features extracted from the definition of the target word entry, as in Lesk (1986). • Algorithm: • Retrieve from the MRD all sense definitions of the word to be disambiguated • Determine the overlap between each sense definition and the current context • Choose the sense with the highest overlap
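The three steps of the definition-overlap heuristic can be sketched in Python as follows. This is a minimal illustration, not the study's implementation: the function name `lesk_overlap`, the simple set-intersection overlap measure, and the toy English definitions are all assumptions made for the example.

```python
def lesk_overlap(definitions, context_words):
    """Pick the sense whose definition shares the most words with the context.

    definitions: dict mapping a sense label to a list of definition tokens.
    context_words: iterable of tokens surrounding the target word.
    Returns (best_sense, overlap_count); best_sense is None if nothing overlaps.
    """
    context = set(context_words)
    best_sense, best_overlap = None, 0
    for sense, tokens in definitions.items():
        # Overlap = number of distinct words shared by definition and context.
        overlap = len(context & set(tokens))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


# Toy, made-up definitions for the word "line" (not taken from the study's dictionary).
definitions = {
    "cord": ["rope", "string", "fishing", "cord"],
    "phone": ["telephone", "connection", "call", "wire"],
}
context = ["she", "waited", "on", "the", "telephone", "for", "a", "call"]
print(lesk_overlap(definitions, context))  # ('phone', 2)
```

Returning `None` when no definition overlaps leaves the occurrence untagged, which seems a reasonable choice when the tags are later used as training data.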
Collocation Overlap-based Method • Semantic similarity metric using collocation overlap • Algorithm: • Retrieve keywords from all MRD sense definitions of the word to be disambiguated • Extract collocation words of the keywords from the test collection, subject to a threshold • Extract collocation words of the target word from the test collection • Determine the overlap between the two sets of collocation words (steps 2 and 3) • Choose the sense with the highest overlap
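A rough Python sketch of the collocation-overlap steps is given below. Several details are assumptions, since the slide does not specify them: "collocation" is approximated as co-occurrence within the same document, the threshold is a minimum co-occurrence count, and the target word's collocates are taken from a single context. The names `collocates` and `collocation_overlap_sense` and the toy data are hypothetical.

```python
from collections import Counter


def collocates(word, documents, threshold=1):
    """Words co-occurring with `word` in at least `threshold` documents."""
    counts = Counter()
    for doc in documents:
        tokens = set(doc)
        if word in tokens:
            counts.update(tokens - {word})
    return {w for w, c in counts.items() if c >= threshold}


def collocation_overlap_sense(target_collocates, sense_keywords, documents, threshold=1):
    """Score each sense by the overlap between the target's collocates and
    the collocates of that sense's definition keywords."""
    scores = {}
    for sense, keywords in sense_keywords.items():
        keyword_collocates = set()
        for kw in keywords:
            keyword_collocates |= collocates(kw, documents, threshold)
        scores[sense] = len(set(target_collocates) & keyword_collocates)
    best = max(scores, key=scores.get)  # ties go to the first sense listed
    return best, scores


# Toy collection and keywords (illustrative only).
documents = [
    ["bank", "river", "water", "shore"],
    ["river", "water", "fish"],
    ["bank", "money", "loan", "interest"],
    ["money", "loan", "deposit"],
]
sense_keywords = {"river_bank": ["river"], "financial_bank": ["money"]}
target_collocates = {"approved", "loan", "deposit"}  # words around one occurrence of "bank"
print(collocation_overlap_sense(target_collocates, sense_keywords, documents))
```

Raising `threshold` shrinks each keyword's collocate set to words that co-occur repeatedly, trading coverage for less noisy overlap counts.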
Feature Selection • By document frequency • Test Collection -> docDF • Definitions as documents -> dicDF • docDF <= 5000 & dicDF <= 300
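The document-frequency filter can be sketched as below, assuming that a feature is kept only when both conditions on the slide hold (docDF <= 5000 and dicDF <= 300). The function names and the tiny thresholds used in the example are illustrative, not taken from the study.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # set() so a word counts once per document
    return df


def select_features(candidates, doc_df, dic_df, max_doc_df=5000, max_dic_df=300):
    """Keep candidates that are not too frequent in either source."""
    return [
        w for w in candidates
        if doc_df[w] <= max_doc_df and dic_df[w] <= max_dic_df
    ]


# Toy data: the "collection" and the dictionary definitions treated as documents.
collection = [["a", "b"], ["a", "c"], ["a", "d"]]
dictionary_definitions = [["a", "b"], ["b", "c"], ["b"]]
doc_df = document_frequency(collection)              # docDF
dic_df = document_frequency(dictionary_definitions)  # dicDF
print(select_features(["a", "b", "c", "d"], doc_df, dic_df, max_doc_df=2, max_dic_df=2))
# ['c', 'd']
```

Here "a" is dropped for being too frequent in the collection and "b" for being too frequent across definitions; a `Counter` returns 0 for unseen words, so words absent from one source pass that source's test.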
Sense Classification : Naïve Bayes Classifier* • Algorithm: * source: Manning and Schütze. 1999. Foundations of Statistical Natural Language Processing.
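For reference, the Naïve Bayes decision rule described by Manning and Schütze chooses the sense s that maximizes log P(s) plus the sum of log P(w | s) over the context words w. The Python sketch below is one way to implement that rule; the class name `NaiveBayesWSD`, the add-alpha smoothing, and the toy training data are assumptions, since the slide does not show the estimation details.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesWSD:
    def __init__(self, alpha=1.0):
        # Add-alpha smoothing is an assumption; the slide does not state one.
        self.alpha = alpha

    def fit(self, contexts, senses):
        """contexts: list of token lists; senses: parallel list of sense labels."""
        self.sense_counts = Counter(senses)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, sense in zip(contexts, senses):
            self.word_counts[sense].update(words)
            self.vocab.update(words)
        self.total = len(senses)
        return self

    def score(self, words, sense):
        """log P(sense) + sum of log P(word | sense) over known context words."""
        log_prob = math.log(self.sense_counts[sense] / self.total)
        denom = sum(self.word_counts[sense].values()) + self.alpha * len(self.vocab)
        for w in words:
            if w in self.vocab:  # words never seen in training are skipped
                log_prob += math.log((self.word_counts[sense][w] + self.alpha) / denom)
        return log_prob

    def predict(self, words):
        return max(self.sense_counts, key=lambda s: self.score(words, s))


# Toy training data (illustrative only).
train_contexts = [["telephone", "call"], ["call", "busy"], ["fishing", "rope"]]
train_senses = ["phone", "phone", "cord"]
clf = NaiveBayesWSD().fit(train_contexts, train_senses)
print(clf.predict(["busy", "telephone"]))  # phone
print(clf.predict(["rope"]))               # cord
```

Working in log space avoids numerical underflow when contexts contain many words.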
Experimental Environment(1/2) • Test Collection • All 127,641 articles published in three Korean daily newspapers in 2004 • Preprocessed with a part-of-speech tagger and lexical analysis • Evaluation • Accuracy
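Accuracy here is taken to mean the fraction of evaluated instances whose predicted sense matches the reference sense. A minimal sketch follows; the function name and the choice to return 0.0 for an empty input are assumptions.

```python
def accuracy(predicted, gold):
    """Fraction of instances where the predicted sense equals the gold sense."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


print(accuracy(["s1", "s2", "s1", "s1"], ["s1", "s2", "s2", "s1"]))  # 0.75
```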
Evaluation of Automatic Sense Tagging • Dictionary Information-based Method • By Rule
Evaluation of Automatic Sense Tagging • Collocation Overlap-based Method • Performance by threshold
Auto Tagging Result of Top 30 • By Target Words
Auto Tagging Result of Top 30 • By Information type
Build a Classifier • Train set: 600 • Test set: the remaining instances • Window size: 50 bytes • Rule for building the train set • Automatic sense tagging contains errors. • To reduce errors and improve the tagging accuracy of the train set, the information types with the highest accuracy are used first.
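The train-set rule can be sketched as a sort by the tagging accuracy of each information type, followed by a cut at the desired size. This is only an illustration: the function name, the tuple layout, and the accuracy figures are hypothetical, and the slide does not say whether 600 is counted per target word or overall.

```python
def build_training_set(tagged, type_accuracy, size=600):
    """Split auto-tagged instances into train and test sets.

    tagged: list of (context, sense, info_type) tuples from automatic tagging.
    type_accuracy: dict mapping info_type to its measured tagging accuracy.
    Instances tagged by more accurate information types are taken first.
    """
    # sorted() is stable, so instances of the same type keep their original order.
    ranked = sorted(tagged, key=lambda item: type_accuracy[item[2]], reverse=True)
    return ranked[:size], ranked[size:]


# Made-up accuracies and instances, for illustration only.
type_accuracy = {"hanja": 0.95, "synonym": 0.80, "definition": 0.60}
tagged = [
    ("ctx1", "sense_a", "definition"),
    ("ctx2", "sense_b", "hanja"),
    ("ctx3", "sense_a", "synonym"),
    ("ctx4", "sense_b", "hanja"),
]
train, test = build_training_set(tagged, type_accuracy, size=2)
print([c for c, _, _ in train])  # ['ctx2', 'ctx4']
print([c for c, _, _ in test])   # ['ctx3', 'ctx1']
```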
Sense Classification- Collocation Overlap-based Method • By rank
Sense Classification- Collocation Overlap-based Method • By target words
Data Fusion of Two Auto Tagging Methods • Dictionary Information-based Method: Using all the information types except definitions • Collocation Overlap-based Method: Using only the information type of Top 10
Results of the Auto Tagging Method in Data Fusion –Information Type
Conclusion(1/2) • The performance of the automatic tagging technique differed depending on the type of information source in the dictionary. • Feature selection is needed for frequently occurring keywords extracted from the dictionary.
Conclusion(2/2) • The word sense disambiguation model using the automatic tagging method based on dictionary information performed comparably to the supervised learning method using manually tagged data. • The WSD model using the data fusion technique, which combines the two automatic tagging methods, outperforms the model using a single tagging method.