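Word Sense Disambiguation: Selecting and Combining Features for Machine Learning Algorithms. This paper was written by Prof. 高照明 and 高紹航 and published at ROCLING 2007; Patty Liu then presented it at an SLP Lab meeting. 高紹航, Department of Computer Science and Information Engineering, National Taiwan University. 高照明, Department of Foreign Languages and Literatures, National Taiwan University. Presented by Patty Liu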
Outline • Word Sense Disambiguation • Senseval-2 • Bayesian Classification • Forward Sequential Selection Algorithm • The features we applied • Result
Word Sense Disambiguation • A word may have more than one sense • Ex. bank: financial institution (銀行), riverbank (河堤), repository (庫) • The task of WSD is to automatically identify the correct sense in a given context.
Senseval-2 • Published in 2001 • Senseval-2 English lexical sample • 73 different target words, including nouns, verbs, and adjectives.
Senseval http://www.senseval.org/
Senseval-2 Competition http://193.133.140.102/senseval2/
Corpora of Senseval-2 Competition http://193.133.140.102/senseval2/Results/guidelines.htm#rawdata
Example: ..\corpora\english-lex-sample\train\eng-lex-sample.training.xml
<instance id="art.40001" docsrc="bnc_ACN_245">
<answer instance="art.40001" senseid="art%1:06:00::"/>
<context>Their multiscreen projections of slides and film loops have featured in orbital parties, at the Astoria and Heaven, in Rifat Ozbek's 1988/89 fashion shows, and at Energy's recent Docklands all-dayer.From their residency at the Fridge during the first summer of love, Halo used slide and film projectors to throw up a collage of op-art patterns, film loops of dancers like E-Boy and Wumni, and unique fractals derived from video feedback.&bquo;We're not aware of creating a visual identify for the house scene, because we're right in there.We see a dancer at a rave, film him later that week, and project him at the next rave.&equo;[hi]Ben Lewis [/hi] Halo can be contacted on 071 738 3248. [ptr][/p] [caption] <head>Art</head>you can dance to from the creative group called Halo [/caption] [/div2] [div2] [head] </context>
</instance>
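A training instance like the one above could be read roughly as follows. This is only a sketch: SAMPLE is a shortened, hypothetical instance, and parse_instance is an illustrative helper, not part of any Senseval tool. The real corpus contains entities such as &bquo; that a strict XML parser rejects unless they are declared or replaced first.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical instance (undeclared entities removed for the parser).
SAMPLE = """<instance id="art.40001" docsrc="bnc_ACN_245">
<answer instance="art.40001" senseid="art%1:06:00::"/>
<context>film loops of dancers <head>Art</head> you can dance to</context>
</instance>"""


def parse_instance(xml_text):
    """Return (instance id, sense id, left context, target word, right context)."""
    inst = ET.fromstring(xml_text)
    context = inst.find("context")
    head = context.find("head")            # the target word is marked with <head>
    left = (context.text or "").split()    # text before <head>
    right = (head.tail or "").split()      # text after </head>
    sense = inst.find("answer").get("senseid")
    return inst.get("id"), sense, left, head.text, right


print(parse_instance(SAMPLE))
```

The left and right token lists are what window-based features would be computed from.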
Bayesian Classification • Suppose the target word has k senses, s1, s2, …, sk • Find s’ such that P(s’ | c) is maximum, where c is the context or features of the target word
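One common way to realize this decision rule is a naive Bayes classifier, which picks the sense maximizing P(s) times the product of P(f | s) over the context features f. The slides do not say which independence assumption or smoothing was used, so the class below is a sketch under those assumptions; the class name, the add-alpha smoothing, and the toy sense labels are all illustrative.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesWSD:
    """Picks the sense s maximizing log P(s) + sum over features of log P(f | s)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # smoothing constant (assumption)
        self.sense_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, instances):
        # instances: iterable of (features, sense) pairs
        for features, sense in instances:
            self.sense_counts[sense] += 1
            for f in features:
                self.feature_counts[sense][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, features):
        total = sum(self.sense_counts.values())
        v = len(self.vocab) + 1                     # +1 reserves mass for unseen features
        best_sense, best_score = None, -math.inf
        for sense, n in self.sense_counts.items():
            denom = sum(self.feature_counts[sense].values()) + self.alpha * v
            score = math.log(n / total)             # log prior P(s)
            for f in features:
                count = self.feature_counts[sense][f]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


# Toy training data with made-up sense labels.
train = [
    (["money", "loan"], "bank%finance"),
    (["money", "account"], "bank%finance"),
    (["river", "water"], "bank%river"),
]
model = NaiveBayesWSD().fit(train)
print(model.predict(["money"]))           # bank%finance
print(model.predict(["river", "water"]))  # bank%river
```

Working in log space avoids underflow when many features are multiplied together.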
Forward Sequential Selection Algorithm • Used in feature selection • First let S be the empty set • Add the best single feature to S, then iteratively add the best feature from the remaining feature set until performance can no longer be improved • The final S is approximately the best feature set
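The greedy loop described above can be sketched as follows. The evaluate callback stands in for training and scoring a classifier on a feature subset; toy_score and its numbers are made up for illustration and are not results from the paper.

```python
def forward_selection(features, evaluate):
    """Greedy forward selection.

    features: iterable of candidate feature names.
    evaluate: callable mapping a set of feature names to a score (higher is better).
    """
    selected = set()
    best_score = float("-inf")
    remaining = set(features)
    while remaining:
        # Score every one-feature extension of the current set.
        scored = [(evaluate(selected | {f}), f) for f in sorted(remaining)]
        score, feature = max(scored)
        if score <= best_score:
            break                      # no remaining feature improves performance
        selected.add(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


def toy_score(subset):
    # Hypothetical additive gains, purely for demonstration.
    gains = {"F1": 0.50, "F2": 0.05, "F4": 0.03, "F8": -0.02}
    return sum(gains[f] for f in subset)


selected, score = forward_selection(["F1", "F2", "F4", "F8"], toy_score)
print(sorted(selected))  # ['F1', 'F2', 'F4']
```

Because each step only looks one feature ahead, the result is a local optimum, which is why the slide calls the final S only approximately the best feature set.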
The features we applied • We tried nine features, named F1, F2, …, F9
The features we applied-F1 • The words around the target word, excluding stop words such as “is” and “a” • Best window size is 3
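A minimal sketch of an F1-style extractor is shown below. Several details are assumptions, since the slide does not specify them: the stop list is illustrative, the window is counted in raw tokens before stop words are filtered out, and words are lowercased.

```python
STOP_WORDS = {"is", "a", "the", "of"}  # illustrative list, not the one used in the paper


def f1_features(tokens, target_index, window=3, stop_words=STOP_WORDS):
    """Unordered set of non-stop words within `window` tokens of the target."""
    lo = max(0, target_index - window)
    hi = target_index + window + 1
    return {
        tok.lower()
        for i, tok in enumerate(tokens[lo:hi], start=lo)
        if i != target_index and tok.lower() not in stop_words
    }


tokens = "She studied modern art at a design school".split()
print(f1_features(tokens, tokens.index("art"), window=3))
```

For this sentence the result contains she, studied, modern, at, and design; "a" is dropped as a stop word and "school" falls outside the window.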
The features we applied-F2 • Similar to F1, but includes each word's position relative to the target word • Stop words are included • For example, “The art of design” • {(The, -1), (of, 1), (design, 2)}
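The positional features in the example above can be generated with a few lines. The function name and signature are illustrative.

```python
def f2_features(tokens, target_index, window=1):
    """Set of (word, offset) pairs, where offset is the position relative to the target."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    return {
        (tokens[i], i - target_index)
        for i in range(lo, hi)
        if i != target_index
    }


tokens = "The art of design".split()
print(f2_features(tokens, 1, window=2))
```

With window=2 this reproduces the slide's example set {(The, -1), (of, 1), (design, 2)}.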
Window Size Test of F2 • Best window size is 1
The features we applied-F3 • Similar to F2, but uses part-of-speech tags instead of words • “The art of design” • design: (n, 2) • Best window size is 1
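An F3-style extractor might look like the sketch below. It assumes the sentence has already been tagged by some POS tagger; the tag names (det, n, prep) follow the slide's informal notation rather than a specific tagset.

```python
def f3_features(tagged_tokens, target_index, window=1):
    """Set of (pos_tag, offset) pairs for words near the target.

    tagged_tokens: list of (word, pos_tag) pairs from any POS tagger.
    """
    lo = max(0, target_index - window)
    hi = min(len(tagged_tokens), target_index + window + 1)
    return {
        (tagged_tokens[i][1], i - target_index)
        for i in range(lo, hi)
        if i != target_index
    }


tagged = [("The", "det"), ("art", "n"), ("of", "prep"), ("design", "n")]
print(f3_features(tagged, 1, window=2))
```

With window=2 the output includes ("n", 2), which matches the slide's entry for "design".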
The features we applied-F4 • N-grams containing the target word • “The art of design” • {(The-art), (art-of), (The-art-of), (art-of-design), (The-art-of-design)} • Best window size is 3
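One reading of the example above is "every contiguous n-gram of length two or more that contains the target and stays inside the window". The sketch below implements that reading; how the window size bounds the n-grams is an assumption, as the slide does not spell it out.

```python
def f4_features(tokens, target_index, window=3):
    """Contiguous n-grams (n >= 2) that contain the target word."""
    lo = max(0, target_index - window)
    hi = min(len(tokens) - 1, target_index + window)
    ngrams = set()
    for start in range(lo, target_index + 1):
        for end in range(target_index, hi + 1):
            if end > start:                          # skip the bare target word
                ngrams.add("-".join(tokens[start:end + 1]))
    return ngrams


tokens = "The art of design".split()
print(sorted(f4_features(tokens, 1)))
```

For "The art of design" this yields the same five n-grams as the slide. Passing a sequence of POS tags instead of words, for example ["det", "n", "prep", "n"], produces tag n-grams such as n-prep-n.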
The features we applied-F5 • Similar to F4, but uses part-of-speech tags instead of words • For example, (n-prep-n) for art-of-design • Best window size is 4
The features we applied-F6 • Use Word Sketch in the Sketch Engine to extract all possible collocations involving the target word • Best window size is 5 • Best dependency types are {modifiers, object, n_modifier, a_modifier, and/or, modifier}
Word Sketch in the Sketch Engine http://www.sketchengine.co.uk/auth/
The features we applied-F7 • Use the Stanford parser to identify dependency relations, e.g. object_of, modifies • The precision is 54.6% • The Stanford parser was developed by Klein and Manning in 2003
Stanford Parser http://nlp.stanford.edu:8080/parser/
Some output of Stanford parser • det(government-2, The-1) • nsubj(established-4, government-2) • advmod(established-4, first-3) • amod(system-8, modern-5) • amod(system-8, criminal-6) • nn(system-8, investigation-7) • dobj(established-4, system-8) • prep(system-8, in-9) • pobj(in-9, 1946-10)
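Output in this textual form can be turned into features for a target word with a small parser. The sketch below is an assumption about how such features might be encoded, as (relation, role of the other word, other word); it is not the paper's exact representation, and both function names are illustrative.

```python
import re

# Matches lines such as "nsubj(established-4, government-2)".
DEP_RE = re.compile(r"(\w+)\((.+?)-(\d+), (.+?)-(\d+)\)")


def parse_dependencies(lines):
    """Parse lines into (relation, governor, gov_index, dependent, dep_index)."""
    deps = []
    for line in lines:
        m = DEP_RE.fullmatch(line.strip())
        if m:
            rel, gov, gov_i, dep, dep_i = m.groups()
            deps.append((rel, gov, int(gov_i), dep, int(dep_i)))
    return deps


def relations_with(deps, target_index):
    """Features for every dependency that involves the word at target_index."""
    feats = set()
    for rel, gov, gov_i, dep, dep_i in deps:
        if gov_i == target_index:
            feats.add((rel, "dependent", dep))
        elif dep_i == target_index:
            feats.add((rel, "governor", gov))
    return feats


output = [
    "det(government-2, The-1)",
    "nsubj(established-4, government-2)",
    "amod(system-8, modern-5)",
    "dobj(established-4, system-8)",
    "prep(system-8, in-9)",
    "pobj(in-9, 1946-10)",
]
deps = parse_dependencies(output)
print(relations_with(deps, 8))  # relations involving "system"
```

Matching on the word index rather than the word form keeps repeated words in the same sentence apart.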
The features we applied-F8 • Use the top HowNet semantic features of the words before and after the target word • The precision is 47.2% • HowNet was developed by 董振東 (Zhendong Dong) • Example: HowNet representation of 醫生 ‘doctor’ • {human| 人:HostOf={Occupation| 職位},domain={medical| 醫},{doctor| 醫治:agent={~}}}
The features we applied-F9 • First use the Stanford parser to identify words that have dependency relations with the target word • Then use the top HowNet semantic features of those words as features • The precision is 54.1%
Result • Best feature set is {F1, F2, F4, F7} • The precision is 61.2% • The best performance in Senseval-2 is 64.2%
Conclusion • In terms of collocation types, the object and modifier features play a more important role in WSD than the subject feature.
Future Research • New features and other machine learning algorithms, such as SVM and CRF, might improve performance.