An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee 2007-08-17 yonggulee@hotmail.com

Contents • Introduction • Related Works • Research Goals • Effective Word Sense Disambiguation Model and Evaluation • Conclusion

Introduction • Word Sense Disambiguation (WSD) • The problem of selecting a sense for a word from a set of predefined possibilities. • An “intermediate task” that is not an end in itself but is necessary at one level or another. • Essential for language understanding applications: • Machine translation • Information retrieval and hypertext navigation • Content and thematic analysis • Speech processing and text processing

Related Works(1/3) • Approaches to WSD • Knowledge-Based Disambiguation • Use of external lexical resources such as dictionaries and thesauri • Discourse properties • Corpus-based Disambiguation • Hybrid Disambiguation

Related Works(2/3) • Corpus-based Disambiguation • Supervised Disambiguation • Based on a labeled training set • The learning system has: • a training set of feature-encoded inputs AND • their appropriate sense label (category) • Unsupervised Disambiguation • Based on unlabeled corpora • The learning system has: • a training set of feature-encoded inputs BUT • NOT their appropriate sense label (category)

Related Works(3/3) • Lexical Resources for WSD • Machine-readable format • Machine Readable Dictionaries (MRD): Longman, Oxford, etc. • Thesauri and semantic networks: Roget's Thesaurus, WordNet, etc. • Sense-tagged data • Senseval-1, 2, 3 (www.senseval.org) • Provides sense-annotated data for many languages and several tasks • Languages: English, Romanian, Chinese, Basque, Spanish, etc. • Tasks: Lexical Sample, All Words, etc. • SemCor, Hector, etc.

Research Motivation • Manual sense tagging • Labor-intensive and costly • Limited availability of sense-tagged corpora • Apart from English, few languages have corpora for WSD. • Coverage of sense-tagged words • Some corpora have sense tags for only one or a few words. • “Line” corpus, “interest” corpus, etc. • With a supervised disambiguation method, only words that appear in the sense-tagged corpus can be disambiguated.

Research Goals • Minimize or eliminate the cost of manual labeling. • Automatic sense tagging using MRD and heuristic rules • Improve the performance of word sense disambiguation. • Using supervised disambiguation • Naïve Bayes classifier

Effective Word Sense Disambiguation Model • Automatic Tagging Technique • Experimental Environment • Evaluation of Automatic Tagging Technique • Evaluation of Sense Classification • Evaluation of Fusion Method

An Outline Diagram for the Proposed Research • [Diagram with two stages: Automatic Sense Tagging and Training; Sense Classification] • Components shown: Dictionary, Collection, Key Word Extraction, Collocation Extraction, Auto Sense Tagging Module, Sense Tagging, Training Set, Naïve Bayes Classifier, Test Context, Context Extraction of Target Word, Classify Word Sense, Evaluation

Automatic Tagging Technique • Dictionary Information-based Method • Collocation Overlap-based Method • Data Fusion Method • Dictionary Information-based Method + Collocation Overlap-based Method

Dictionary Information-based Method(1/2) • Extract the necessary information from the dictionary. • Heuristic 1: One Sense per Collocation / One Sense per Discourse • Telephone line, 景氣展望/Gyeonggi-jeonmang (economic prospect) • Heuristic 2: Use of the corresponding Chinese characters • 감자/Gamja : 柑子(Potato)/減資(Reduction of capital) • Heuristic 3: Co-occurrence of synonyms, antonyms, and related terms • Heuristic 4: Occurrence of derived words

Dictionary Information-based Method(2/2) • Heuristic 5: Co-occurrence of key features extracted from the definition in the target word's entry, as in Lesk (1986). • Algorithm: • Retrieve from the MRD all sense definitions of the words to be disambiguated • Determine the overlap between each sense definition and the current context • Choose the sense with the highest overlap
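
The three steps of Heuristic 5 can be sketched as a simplified Lesk-style overlap count. This is a minimal illustration, not the author's implementation: the function name `lesk_tag`, the toy English definitions, and the tie-breaking rule (the first sense with the highest overlap wins, and no tag is assigned when nothing overlaps) are all assumptions.

```python
def lesk_tag(context_tokens, sense_definitions):
    """Return the sense whose definition shares the most words with the context."""
    context = set(context_tokens)
    best_sense, best_overlap = None, 0
    for sense, definition_tokens in sense_definitions.items():
        overlap = len(context & set(definition_tokens))
        if overlap > best_overlap:  # strict ">" keeps the first sense on ties
            best_sense, best_overlap = sense, overlap
    return best_sense  # None when no definition overlaps the context


# Hypothetical toy definitions for the English word "line" (not from the slides).
definitions = {
    "telephone": ["connection", "telephone", "call", "wire"],
    "queue": ["people", "waiting", "row", "queue"],
}
context = ["the", "telephone", "call", "was", "cut", "off"]
print(lesk_tag(context, definitions))  # prints: telephone
```

Returning `None` instead of guessing leaves an occurrence untagged, which suits a pipeline where the tags later serve as training data.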

Collocation Overlap-based Method • Semantic similarity metric using collocation overlap • Algorithm: • 1. Retrieve keywords from all MRD sense definitions of the words to be disambiguated • 2. Extract collocation words of the keywords from the test collection, filtered by a threshold • 3. Extract collocation words of the target words from the test collection • 4. Determine the overlap between the collocation word sets from steps 2 and 3 • 5. Choose the sense with the highest overlap
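
The collocation overlap steps can be sketched as follows. The slides do not define "collocation" or the threshold precisely, so this sketch makes several assumptions: collocates are words that share a sentence with the keyword, the threshold is a minimum sentence count (`min_count`), and the target-side collocates are the other words in the context being tagged. All function names and data are hypothetical.

```python
from collections import Counter


def collocates(word, sentences, min_count=1):
    """Words that co-occur with `word` in at least `min_count` sentences."""
    counts = Counter()
    for tokens in sentences:
        if word in tokens:
            counts.update(set(tokens) - {word})
    return {w for w, c in counts.items() if c >= min_count}


def tag_by_collocation_overlap(target, context, sense_keywords, sentences, min_count=1):
    """Pick the sense whose keyword collocates overlap most with the target's context."""
    target_side = set(context) - {target}
    scores = {}
    for sense, keywords in sense_keywords.items():
        keyword_side = set()
        for keyword in keywords:
            keyword_side |= collocates(keyword, sentences, min_count)
        scores[sense] = len(keyword_side & target_side)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # no tag without any overlap


# Hypothetical toy collection and definition keywords (not from the slides).
sentences = [
    ["phone", "call", "operator", "busy"],
    ["phone", "operator", "number"],
    ["queue", "ticket", "waiting", "crowd"],
    ["queue", "crowd", "ticket"],
]
sense_keywords = {"telephone": ["phone"], "queue": ["queue"]}
context = ["line", "operator", "busy", "number"]
print(tag_by_collocation_overlap("line", context, sense_keywords, sentences))
```

Raising `min_count` keeps only collocates seen repeatedly, which trades coverage for precision in the same way the slides' threshold is described as doing.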

Feature Selection • By document frequency • Test Collection -> docDF • Definitions as documents -> dicDF • docDF <= 5000 & dicDF <= 300
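
The document-frequency filter can be sketched as below. It assumes the thresholds mean "keep a keyword only if it appears in at most 5000 collection documents and at most 300 dictionary definitions"; the function names and the toy data are hypothetical, and the example uses much smaller thresholds so the effect is visible.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    return df


def select_features(candidates, doc_df, dic_df, max_doc_df=5000, max_dic_df=300):
    """Keep candidates that are not too frequent in either source."""
    return [w for w in candidates
            if doc_df[w] <= max_doc_df and dic_df[w] <= max_dic_df]


# Hypothetical toy data: articles give docDF, definitions give dicDF.
articles = [["bank", "money", "the"], ["the", "river", "bank"], ["the", "loan"]]
entries = [["the", "money", "institution"], ["the", "river", "side"]]
doc_df = document_frequency(articles)
dic_df = document_frequency(entries)
print(select_features(["the", "money", "river"], doc_df, dic_df,
                      max_doc_df=2, max_dic_df=1))  # prints: ['money', 'river']
```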

Sense Classification : Naïve Bayes Classifier* • Algorithm: * source: Manning and Schütze. 1999. Foundations of Statistical Natural Language Processing.
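
The algorithm on this slide is not preserved in the transcript. As a stand-in, here is a minimal sketch of a Naïve Bayes sense classifier in the general form described by Manning and Schütze: choose the sense s that maximizes log P(s) plus the sum of log P(w | s) over the context words. The add-one smoothing, the handling of unseen words, and all names and data are assumptions, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(tagged_contexts):
    """tagged_contexts: iterable of (sense, context_tokens) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for sense, tokens in tagged_contexts:
        sense_counts[sense] += 1
        word_counts[sense].update(tokens)
        vocabulary.update(tokens)
    return sense_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the sense maximizing log P(s) + sum of log P(w | s)."""
    sense_counts, word_counts, vocabulary = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        # Add-one smoothing (an assumption): every vocabulary word gets +1.
        denom = sum(word_counts[sense].values()) + len(vocabulary)
        score = math.log(count / total)
        for w in tokens:
            if w in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical toy training set (not from the slides).
train = [
    ("telephone", ["phone", "call", "operator"]),
    ("telephone", ["call", "number"]),
    ("queue", ["waiting", "crowd", "ticket"]),
]
model = train_naive_bayes(train)
print(classify(["call", "operator"], model))  # prints: telephone
```

Working in log space avoids numeric underflow when contexts contain many words.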

Experimental Environment(1/2) • Test Collection • Includes all 127,641 articles published in three Korean daily newspapers in 2004 • Processed with a part-of-speech tagger and lexical analysis • Evaluation • Accuracy

Target Word for WSD

Evaluation of Automatic Sense Tagging • Dictionary Information-based Method • By Rule

Results of Feature Selection - Words

Results of Feature Selection - Rule

Evaluation of Automatic Sense Tagging • Collocation Overlap-based Method • Performance by threshold

Auto Tagging Result of Top 30 • By Target Words

Auto Tagging Result of Top 30 • By Information type

Comparison of Two Auto Tagging Methods

Build a Classifier • Train set: 600 • Test set: the rest • Window size: 50 bytes • Rule for making the train set • Automatic sense tagging contains errors. • To reduce errors and improve the tagging accuracy of the train set, instances tagged by the information types with the highest accuracy are used first.
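
The rule for building the train set can be sketched as a sort by the accuracy of the information type that produced each automatic tag, followed by a cut at the train-set size. The field names, the accuracy figures, and the function name are hypothetical; the slide does not say whether 600 is counted per target word or overall, so the sketch simply takes a single list.

```python
def split_train_test(tagged, type_accuracy, train_size=600):
    """Put instances tagged by the most accurate information types into the train set."""
    ranked = sorted(tagged,
                    key=lambda inst: type_accuracy[inst["info_type"]],
                    reverse=True)  # stable sort keeps corpus order within a type
    return ranked[:train_size], ranked[train_size:]


# Hypothetical accuracies per information type (illustrative values only).
type_accuracy = {"collocation": 0.9, "synonym": 0.8, "definition": 0.6}
tagged = [
    {"id": 1, "sense": "A", "info_type": "definition"},
    {"id": 2, "sense": "B", "info_type": "collocation"},
    {"id": 3, "sense": "A", "info_type": "synonym"},
    {"id": 4, "sense": "B", "info_type": "collocation"},
]
train_set, test_set = split_train_test(tagged, type_accuracy, train_size=3)
print([inst["id"] for inst in train_set])  # prints: [2, 4, 3]
```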

Sense Classification - Dictionary Information-based Method

Sense Classification - Collocation Overlap-based Method • By rank

Sense Classification - Collocation Overlap-based Method • By target words

Comparison of Two Sense Classifications

Data Fusion of Two Auto Tagging Methods • Dictionary Information-based Method: uses all information types except definitions • Collocation Overlap-based Method: uses only the information type of the Top 10
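
The slide states which information each method contributes to the fusion but not how the two outputs are combined. The sketch below shows one possible combination under loudly stated assumptions: the dictionary-based tag is preferred unless it came from a definition, and the collocation-based tag is used as a fallback only when its rank is within the top 10. The priority order, the `rank` field, and the function name are all hypothetical.

```python
def fuse_tags(dictionary_tag, collocation_tag, top_k=10):
    """Combine two automatic tags for one occurrence.

    dictionary_tag:  (sense, info_type) or None
    collocation_tag: (sense, rank) or None
    """
    if dictionary_tag is not None:
        sense, info_type = dictionary_tag
        if info_type != "definition":  # definitions are excluded in the fusion
            return sense
    if collocation_tag is not None:
        sense, rank = collocation_tag
        if rank <= top_k:  # assumed meaning of "Top 10"
            return sense
    return None  # leave the occurrence untagged


print(fuse_tags(("economy", "collocation"), ("river", 3)))  # prints: economy
print(fuse_tags(("economy", "definition"), ("river", 3)))   # prints: river
```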

Results of the Auto Tagging Method in Data Fusion - Words

Results of the Auto Tagging Method in Data Fusion - Information Type

Comparison of the Three Auto Tagging Methods

Sense Classification in Data Fusion - Words

Comparison of Three WSD Methods

Conclusion(1/2) • The performance of the automatic tagging technique differed depending on the type of information source in the dictionary. • Feature selection is needed for frequently used keywords extracted from the dictionary.

Conclusion(2/2) • The word sense disambiguation model using the automatic tagging method based on dictionary information showed performance comparable to the supervised learning method using manual tagging information. • The WSD model using a data fusion technique that combines the two automatic tagging methods outperforms the models using a single tagging method.

Q&A
