Discriminating Word Senses Using McQuitty’s Similarity Analysis

Amruta Purandare, University of Minnesota, Duluth • Advisor: Dr. Ted Pedersen • Research supported by National Science Foundation (NSF) Faculty Early Career Development Award (#0092784)

Discriminating “line” • They will begin line formation before ceremony • Connect modem to any jack on your line • Quit printing after the last line of each file • Your line will not get tied while you are connected to net • Stand balanced and comfortable during line up • Lines that do not fit a page are truncated • New line service provides reliable connections • Pages are separated by line feed characters • They stand far right when in line formation

• They will begin line formation before ceremony • Stand balanced and comfortable during line up • They stand far right when in line formation
• Your line will not get tied while you are connected to net • Connect modem to any jack on your line • New line service provides reliable connections
• Quit printing after the last line of each file • Lines that do not fit a page are truncated • Pages are separated by line feed characters

Introduction • What is Word Sense Discrimination? • Unsupervised learning • [Figure: pipeline with stages Training, Features, Test, Feature Vectors, Similarity matrix, Clusters, Evaluate]

Representing context • Features (from training): bigrams, unigrams, second-order co-occurrences/SOCs (Schütze, 1998), mixture • Feature vectors (binary) • Measuring similarity: cosine, match
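A minimal sketch of the representation above, assuming unigram features and taking the match coefficient to be the count of features two binary vectors share. The feature list and the helper names (`binary_vector`, `match`, `cosine`) are hypothetical and chosen only for illustration.

```python
import math

def binary_vector(context_tokens, features):
    """Return a 0/1 vector marking which features occur in the context."""
    present = set(context_tokens)
    return [1 if f in present else 0 for f in features]

def match(u, v):
    """Match coefficient (assumed definition): features present in both vectors."""
    return sum(a & b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine for binary vectors: the match count scaled by the vector sizes."""
    denom = math.sqrt(sum(u) * sum(v))
    return match(u, v) / denom if denom else 0.0

# Hypothetical feature set; a real one would come from training data.
features = ["modem", "jack", "page", "printing", "connections"]
a = binary_vector("connect modem to any jack on your line".split(), features)
b = binary_vector("new line service provides reliable connections".split(), features)
c = binary_vector("connect your modem for reliable connections".split(), features)

print(match(a, c), cosine(a, c))
print(match(a, b), cosine(a, b))
```

Cosine divides the match count by the geometric mean of the two vectors' feature counts, so contexts that contain many features do not dominate the similarity matrix.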

Feature examples

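McQuitty’s method • Pedersen & Bruce, 1997 • Agglomerative • UPGMA / Average Link • Stopping rules: number of clusters, score cutoff • [Figure: two clusters with similarities x and y to a third cluster are merged; the merged cluster's similarity to it is (x + y)/2]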
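The agglomerative procedure can be sketched as follows, assuming a symmetric similarity matrix as input and using the (x + y)/2 update from the slide. The function name `mcquitty_cluster` and its parameters `num_clusters` and `score_cutoff` are hypothetical names for the two stopping rules, not the interface of any toolkit mentioned here.

```python
def mcquitty_cluster(sim, num_clusters=1, score_cutoff=None):
    """Merge the most similar pair of clusters until a stopping rule fires.

    sim is a symmetric matrix of pairwise similarities between instances.
    Returns the clusters as sorted lists of instance indices.
    """
    n = len(sim)
    clusters = {i: [i] for i in range(n)}
    # Similarities between active clusters, keyed by (smaller id, larger id).
    s = {(i, j): sim[i][j] for i in range(n) for j in range(i + 1, n)}

    def get(a, b):
        return s[(a, b)] if a < b else s[(b, a)]

    next_id = n
    while len(clusters) > max(num_clusters, 1):
        (a, b), best = max(s.items(), key=lambda kv: kv[1])
        # Score cutoff: stop when even the best pair is not similar enough.
        if score_cutoff is not None and best < score_cutoff:
            break
        # Similarity of the merged cluster to every other cluster: (x + y) / 2.
        others = [k for k in clusters if k not in (a, b)]
        merged_sims = {k: (get(a, k) + get(b, k)) / 2 for k in others}
        s = {p: v for p, v in s.items() if a not in p and b not in p}
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        for k, v in merged_sims.items():
            s[(k, next_id)] = v  # next_id is larger than every existing id
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())

# Made-up similarities between four contexts.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.4, 0.3],
    [0.2, 0.4, 1.0, 0.8],
    [0.1, 0.3, 0.8, 1.0],
]
print(mcquitty_cluster(sim, num_clusters=2))
print(mcquitty_cluster(sim, score_cutoff=0.85))
```

Because the update needs only the similarity matrix, the original feature vectors are not consulted again after the pairwise similarities have been computed.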

Evaluation • [Table: clusters c2, c3, c1, c4 against senses sense1 (Maj), sense2, sense3, sense4]

Evaluation • Accuracy = 38/55 = 0.69 • [Table: senses reordered as sense3, sense4, sense1, sense2]

Majority Sense Classifier • Maj. = 17/55 = 0.31 • [Figure label: sense2]
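One way to compute the two numbers above is sketched here, assuming accuracy is taken over the one-to-one assignment of clusters to senses that puts the most instances on the diagonal. The confusion table is made up for illustration and is not the table from the slides; `best_accuracy` and `majority_baseline` are hypothetical helper names.

```python
from itertools import permutations

def best_accuracy(confusion):
    """confusion[c][s] = number of instances of sense s placed in cluster c.

    Tries every one-to-one cluster-to-sense assignment (fine for a handful
    of senses) and returns the best fraction of correctly placed instances.
    """
    n = len(confusion)
    total = sum(map(sum, confusion))
    best = max(
        sum(confusion[c][perm[c]] for c in range(n))
        for perm in permutations(range(n))
    )
    return best / total

def majority_baseline(confusion):
    """Accuracy of labeling every instance with the most frequent sense."""
    total = sum(map(sum, confusion))
    sense_totals = [sum(column) for column in zip(*confusion)]
    return max(sense_totals) / total

# Made-up table: 3 clusters (rows) by 3 senses (columns), 20 instances.
confusion = [
    [1, 6, 0],
    [5, 1, 1],
    [0, 2, 4],
]
print(best_accuracy(confusion))
print(majority_baseline(confusion))
```

A clustering is only informative when its accuracy exceeds the majority baseline, which is the comparison reported in the results slides.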

Experimental Data

Scope of the experiments • 584 experiments (73 × 4 × 2) • 73 words: 72 Senseval-2, LINE • 4 features: bigrams, unigrams, SOCs, mix • 2 similarity measures: match, cosine • Window = 5 (for bigrams and SOCs) • Frequency cutoff = 2
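The window and frequency cutoff settings can be illustrated with a short sketch. It assumes a window of 5 means a word is paired with each of the up to four words that follow it, and that a pair is kept when it occurs at least twice; the exact semantics of the feature extraction tool may differ. `windowed_bigrams` is a hypothetical helper, not part of NSP.

```python
from collections import Counter

def windowed_bigrams(sentences, window=5, min_freq=2):
    """Count ordered word pairs within a window and apply a frequency cutoff."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, first in enumerate(tokens):
            # Pair each word with the words that follow it inside the window.
            for second in tokens[i + 1 : i + window]:
                counts[(first, second)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_freq}

# Made-up training contexts.
sentences = [
    "connect modem to any jack on your line",
    "connect the modem to your line",
]
feats = windowed_bigrams(sentences)
print(feats[("connect", "modem")])
print(("modem", "your") in feats)
```

With `window=2` only adjacent words are paired, so ("connect", "modem") would occur once in this toy data and fall below the cutoff.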

Senseval-2 Results, POS-wise • 29 nouns (Maj = 0.57) • 28 verbs (Maj = 0.51) • 15 adjectives (Maj = 0.64) • [Chart: number of words of each POS for which an experiment obtained accuracy above the majority baseline]

Senseval-2 Results, feature-wise • [Chart: labels SOC, UNI, BI; values 32, 18, 38] • 72 words × 2 measures = 144

Senseval-2 Results, measure-wise • [Chart: labels COS, MAT; values 49, 39] • 72 words × 3 features = 216

Line Results • Maj = 0.16 (uniform distribution over 6 senses)

Sample Confusion Table (fine.soc.cos) • S0 = elegant • S1 = small-grained • S2 = superior • S3 = satisfactory • S4 = thin • Total = 60 • Precision = 36/60 = 60.00%

Conclusions • Small set of SOCs was powerful • Half the number of unigrams/bigrams • Scaling done by cosine helps! • Need more training data! • Need to improve feature selection (tests of association), extraction (stemming), and matching (fuzzy matching) strategies for bigrams • Explore new features: POS, collocations

Recent work • PDL implementation • Cluto (Clustering Toolkit) http://www-users.cs.umn.edu/~karypis/cluto • 6 clustering methods, 12 merging criteria • Plans: comparing clustering in similarity space vs. vector space (Schütze, 1998); stopping rules

Sense labeling
formation • They will begin line formation before ceremony • Stand balanced and comfortable during line up • They stand far right when in line formation
phone • Your line will not get tied while you are connected to net • Connect modem to any jack on your line • New line service provides reliable connections
text • Quit printing after the last line of each file • Lines that do not fit a page are truncated • Pages are separated by line feed characters

Software Packages • SenseClusters (Our Discrimination Toolkit) http://www.d.umn.edu/~tpederse/senseclusters.html • PDL (Used to implement clustering algorithms) http://pdl.perl.org/ • NSP (Used for extracting features) http://www.d.umn.edu/~tpederse/nsp.html • SenseTools (Used for preprocessing, feature matching) http://www.d.umn.edu/~tpederse/sensetools.html • Cluto (Clustering Toolkit) http://www-users.cs.umn.edu/~karypis/cluto
