Discriminating Word Senses Using McQuitty’s Similarity Analysis
Amruta Purandare
University of Minnesota, Duluth
Advisor: Dr. Ted Pedersen
Research supported by a National Science Foundation (NSF) Faculty Early Career Development Award (#0092784)
Discriminating “line”
They will begin line formation before ceremony
Connect modem to any jack on your line
Quit printing after the last line of each file
Your line will not get tied while you are connected to net
Stand balanced and comfortable during line up
Lines that do not fit a page are truncated
New line service provides reliable connections
Pages are separated by line feed characters
They stand far right when in line formation

Grouped by sense:

They will begin line formation before ceremony
Stand balanced and comfortable during line up
They stand far right when in line formation

Your line will not get tied while you are connected to net
Connect modem to any jack on your line
New line service provides reliable connections

Quit printing after the last line of each file
Lines that do not fit a page are truncated
Pages are separated by line feed characters

Introduction
• What is Word Sense Discrimination?
• Unsupervised learning
[Diagram labels: Training, Features, Test, Feature Vectors, similarity matrix, Clusters, evaluate]

Representing context
• Features (from training)
    • Bigrams
    • Unigrams
    • Second Order Co-occurrences / SOCs (Schütze, 1998)
    • Mixture
• Feature vectors (Binary)
• Measuring similarity
    • Cosine
    • Match
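
A minimal sketch of the binary feature vectors and the two similarity measures named above. It assumes that Match is the count of features two contexts share and that Cosine is that count scaled by the vector lengths; the feature list, the sentences, and the function names `binary_vector`, `match`, and `cosine` are hypothetical and are not taken from the toolkit.

```python
import math

def binary_vector(context_tokens, features):
    """Return a 0/1 vector marking which features occur in the context."""
    present = set(context_tokens)
    return [1 if f in present else 0 for f in features]

def match(u, v):
    """Assumed definition: number of features present in both vectors."""
    return sum(a & b for a, b in zip(u, v))

def cosine(u, v):
    """Match count scaled by the vector lengths (for 0/1 vectors, |u|^2 = sum(u))."""
    norm = math.sqrt(sum(u)) * math.sqrt(sum(v))
    return match(u, v) / norm if norm else 0.0

# Hypothetical unigram features selected from training data.
features = ["modem", "jack", "page", "printing", "formation"]

a = binary_vector("connect modem to any jack on your line".split(), features)
b = binary_vector("new line service for your modem".split(), features)

print(a, b)          # the two binary feature vectors
print(match(a, b))   # shared features
print(cosine(a, b))  # shared features, length-normalized
```

Computing either measure for every pair of test contexts gives the similarity matrix that the clustering step consumes.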
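
McQuitty’s method
• Pedersen & Bruce, 1997
• Agglomerative
• UPGMA / Average Link
• Stopping rules
    • Number of clusters
    • Score cutoff
[Figure: after a merge, the new cluster's similarity to another cluster is (x + y)/2, the average of the two merged clusters' similarities x and y]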
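
The merge rule and the two stopping rules can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `mcquitty`, its parameters `n_clusters` and `cutoff`, and the similarity values are all hypothetical. The simple (x + y)/2 update used here is the one the clustering literature often calls WPGMA.

```python
def mcquitty(sim, n_clusters=1, cutoff=None):
    """Agglomerative clustering on a symmetric similarity matrix (list of lists).

    Stops when n_clusters remain, or when the best available merge score
    falls below cutoff (if a cutoff is given).
    """
    # Each active cluster id maps to the indices of its members.
    clusters = {i: [i] for i in range(len(sim))}
    # Similarities between active clusters, keyed by (smaller id, larger id).
    s = {(i, j): sim[i][j] for i in clusters for j in clusters if i < j}

    while len(clusters) > n_clusters and s:
        (a, b), score = max(s.items(), key=lambda kv: kv[1])
        if cutoff is not None and score < cutoff:
            break
        # Merge b into a: similarity to every other cluster k becomes (x + y) / 2.
        for k in clusters:
            if k in (a, b):
                continue
            x = s.pop((min(a, k), max(a, k)))
            y = s.pop((min(b, k), max(b, k)))
            s[(min(a, k), max(a, k))] = (x + y) / 2
        del s[(a, b)]
        clusters[a].extend(clusters.pop(b))

    return sorted(sorted(members) for members in clusters.values())

# Hypothetical similarity matrix for four contexts.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

print(mcquitty(sim, n_clusters=2))   # stop on number of clusters
print(mcquitty(sim, cutoff=0.85))    # stop on score cutoff
```

With the number-of-clusters rule the loop runs until two clusters remain; with the score cutoff it stops as soon as the best remaining merge is weaker than the threshold, so the number of clusters is decided by the data.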

Evaluation
[Table: clusters c1 to c4 cross-tabulated against sense1 (Maj), sense2, sense3, sense4; cell counts not preserved]

Evaluation
Accuracy = 38/55 = 0.69
[Table: same cross-tabulation with sense columns reordered as sense3, sense4, sense1, sense2; cell counts not preserved]
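
One way to read the reordered table is as a search for the one-to-one assignment of clusters to senses that puts the most instances on the diagonal, with accuracy being that diagonal total divided by the number of instances. The sketch below assumes that reading; the function name `best_mapping_accuracy` is hypothetical and the counts are made up, since the slide's cell values were not preserved.

```python
from itertools import permutations

def best_mapping_accuracy(table):
    """table[c][s] = number of instances of sense s placed in cluster c (square table).

    Tries every one-to-one cluster-to-sense assignment and keeps the one
    with the largest number of agreeing instances.
    """
    n = len(table)
    total = sum(sum(row) for row in table)
    best = max(
        sum(table[c][perm[c]] for c in range(n))
        for perm in permutations(range(n))
    )
    return best, total, best / total

# Hypothetical counts, NOT the table from the slides.
table = [
    [5, 1, 0],
    [1, 0, 6],
    [0, 4, 3],
]

correct, total, accuracy = best_mapping_accuracy(table)
print(correct, total, accuracy)
```

Exhaustive search over permutations is fine for a handful of senses; for larger tables an assignment solver would be the usual replacement.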

Majority Sense Classifier
Maj. = 17/55 = 0.31 (sense2)
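
The baseline arithmetic can be sketched as below: assign every instance to the most frequent sense and report the fraction that is correct. Only the 17 out of 55 for sense2 comes from the slide; the split of the remaining 38 instances across the other senses is made up for illustration, and `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline(gold_senses):
    """Return the most frequent sense and the accuracy of always guessing it."""
    sense, count = Counter(gold_senses).most_common(1)[0]
    return sense, count / len(gold_senses)

# 17 instances of sense2 out of 55 (from the slide); the other counts are hypothetical.
gold = ["sense2"] * 17 + ["sense1"] * 14 + ["sense3"] * 13 + ["sense4"] * 11

sense, maj = majority_baseline(gold)
print(sense, round(maj, 2))
```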

Scope of the experiments
• 584 experiments (73 × 4 × 2)
• 73 Words: 72 Senseval-2, LINE
• 4 Features: Bigrams, Unigrams, SOCs, Mix
• 2 Similarity Measures: Match, Cosine
• Window = 5
    • for Bigrams and SOCs
• Frequency cutoff = 2
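
The experiment count follows from crossing every word with every feature type and every similarity measure. A short sketch of that grid, using placeholder word indices rather than the actual word list:

```python
from itertools import product

n_words = 73  # 72 Senseval-2 words plus LINE
features = ["bigrams", "unigrams", "SOCs", "mix"]
measures = ["match", "cosine"]

# One experiment per (word, feature type, similarity measure) combination.
experiments = list(product(range(n_words), features, measures))
print(len(experiments))
```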

Senseval-2 Results, POS-wise
POS:  29 nouns | 28 verbs | 15 adjectives
Maj:  0.57     | 0.51     | 0.64
[Chart: number of words of each POS for which an experiment obtained accuracy above the majority baseline; values not preserved]

Senseval-2 Results, Feature-wise
SOC | UNI | BI
32  | 18  | 38
72 words × 2 measures = 144

Senseval-2 Results, Measure-wise
COS | MAT
49  | 39
72 words × 3 features = 216

Line Results
Maj = 0.16 (uniform distribution over 6 senses)

Sample Confusion Table (fine.soc.cos)
S0 = elegant
S1 = small grained
S2 = superior
S3 = satisfactory
S4 = thin
[Table: cluster-by-sense counts not preserved; total = 60]
Precision = 36/60 = 60.00%

Conclusions
• Small set of SOCs was powerful
    • Half the number of unigrams/bigrams
• Scaling done by Cosine helps!
• Need more training data!
• Need to improve feature…
    • Selection (Tests of association)
    • Extraction (Stemming)
    • Matching (Fuzzy matching)
  …strategies for bigrams
• Explore new features
    • POS
    • Collocations

Recent work
• PDL implementation
• Cluto - Clustering Toolkit http://www-users.cs.umn.edu/~karypis/cluto
    • 6 clustering methods, 12 merging criteria
• Plans
    • Comparing clustering in similarity space vs. vector space (Schütze, 1998)
    • Stopping rules

Sense labeling

formation:
They will begin line formation before ceremony
Stand balanced and comfortable during line up
They stand far right when in line formation

phone:
Your line will not get tied while you are connected to net
Connect modem to any jack on your line
New line service provides reliable connections

text:
Quit printing after the last line of each file
Lines that do not fit a page are truncated
Pages are separated by line feed characters

Software Packages
• SenseClusters (Our Discrimination Toolkit) http://www.d.umn.edu/~tpederse/senseclusters.html
• PDL (Used to implement clustering algorithms) http://pdl.perl.org/
• NSP (Used for extracting features) http://www.d.umn.edu/~tpederse/nsp.html
• SenseTools (Used for preprocessing, feature matching) http://www.d.umn.edu/~tpederse/sensetools.html
• Cluto (Clustering Toolkit) http://www-users.cs.umn.edu/~karypis/cluto