Language Independent Methods of Clustering Similar Contexts (with applications)
Ted Pedersen
University of Minnesota, Duluth
http://www.d.umn.edu/~tpederse
tpederse@d.umn.edu
EACL-2006 Tutorial
Language Independent Methods
• Do not utilize syntactic information
  • No parsers, part of speech taggers, etc. required
• Do not utilize dictionaries or other manually created lexical resources
• Based on lexical features selected from corpora
• Assumption: word segmentation can be done by looking for white spaces between strings
• No manually annotated data of any kind; the methods are completely unsupervised in the strictest sense
Clustering Similar Contexts
• A context is a short unit of text
  • often a phrase to a paragraph in length, although it can be longer
• Input: N contexts
• Output: K clusters
  • where the contexts within each cluster are more similar to each other than to the contexts found in other clusters
Applications
• Headed contexts (contain target word)
  • Name Discrimination
  • Word Sense Discrimination
• Headless contexts
  • Email Organization
  • Document Clustering
  • Paraphrase identification
• Clustering Sets of Related Words
Tutorial Outline
• Identifying lexical features
  • Measures of association & tests of significance
• Context representations
  • First & second order
• Dimensionality reduction
  • Singular Value Decomposition
• Clustering
  • Partitional techniques
  • Cluster stopping
  • Cluster labeling
• Hands On Exercises
General Info
• Please fill out the short survey
• Break from 4:00-4:30pm
• Finish at 6pm
• Reception tonight at 7pm at Castle (?)
• Slides and video from the tutorial will be posted (I will send you email when that is ready)
• Questions are welcome
  • Now, or via email to me or the SenseClusters list
• Comments, observations, criticisms are all welcome
• The Knoppix CD will give you Linux and SenseClusters when your computer is booted from the CD
SenseClusters
• A package for clustering contexts
  • http://senseclusters.sourceforge.net
  • SenseClusters Live! (Knoppix CD)
• Integrates with various other tools
  • Ngram Statistics Package
  • CLUTO
  • SVDPACKC
Many thanks…
• Amruta Purandare (M.S., 2004)
  • Founding developer of SenseClusters (2002-2004)
  • Now a PhD student in Intelligent Systems at the University of Pittsburgh
  • http://www.cs.pitt.edu/~amruta/
• Anagha Kulkarni (M.S., 2006, expected)
  • Enhancing SenseClusters since Fall 2004!
  • http://www.d.umn.edu/~kulka020/
• National Science Foundation (USA) for supporting Amruta, Anagha, and me via CAREER award #0092784
Background and Motivations
Headed and Headless Contexts
• A headed context includes a target word
  • Our goal is to cluster the target words based on their surrounding contexts
  • The target word is the center of the context and of our attention
• A headless context has no target word
  • Our goal is to cluster the contexts based on their similarity to each other
  • The focus is on the context as a whole
Headed Contexts (input)
• I can hear the ocean in that shell.
• My operating system shell is bash.
• The shells on the shore are lovely.
• The shell command line is flexible.
• The oyster shell is very hard and black.
Headed Contexts (output)
• Cluster 1:
  • My operating system shell is bash.
  • The shell command line is flexible.
• Cluster 2:
  • The shells on the shore are lovely.
  • The oyster shell is very hard and black.
  • I can hear the ocean in that shell.
Headless Contexts (input)
• The new version of Linux is more stable and has better support for cameras.
• My Chevy Malibu has had some front end troubles.
• Osborne made one of the first personal computers.
• The brakes went out, and the car flew into the house.
• With the price of gasoline, I think I’ll be taking the bus more often!
Headless Contexts (output)
• Cluster 1:
  • The new version of Linux is more stable and has better support for cameras.
  • Osborne made one of the first personal computers.
• Cluster 2:
  • My Chevy Malibu has had some front end troubles.
  • The brakes went out, and the car flew into the house.
  • With the price of gasoline, I think I’ll be taking the bus more often!
Web Search as Application
• Web search results are headed contexts
  • The search term is the target word (found in snippets)
• Web search results are often disorganized – two people sharing the same name, two organizations sharing the same abbreviation, etc. often have their pages “mixed up”
• If you click on search results or follow links in the pages found, you will encounter headless contexts too…
Name Discrimination
George Millers!
Email Foldering as Application
• Email (public or private) is made up of headless contexts
  • Short, usually focused…
• Cluster similar email messages together
  • Automatic email foldering
  • Take all messages from a sent-mail file or inbox and organize them into categories
Clustering News as Application
• News articles are headless contexts
  • Entire article or first paragraph
  • Short, usually focused
• Cluster similar articles together
What is it to be “similar”?
• You shall know a word by the company it keeps
  • Firth, 1957 (Studies in Linguistic Analysis)
• Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis)
  • Harris, 1968 (Mathematical Structures of Language)
• Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis)
  • Miller and Charles, 1991 (Language and Cognitive Processes)
• Various extensions…
  • Similar contexts will have similar meanings, etc.
  • Names that occur in similar contexts will refer to the same underlying person, etc.
General Methodology
• Represent contexts to be clustered using first or second order feature vectors
  • Lexical features
• Reduce dimensionality to make vectors more tractable and/or understandable
  • Singular value decomposition
• Cluster the context vectors
  • Find the number of clusters
  • Label the clusters
• Evaluate and/or use the contexts!
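To make the pipeline concrete, here is a minimal sketch of the same steps in Python. It is not SenseClusters itself (a Perl package built on NSP, SVDPACKC, and CLUTO); it simply uses scikit-learn's CountVectorizer, TruncatedSVD, and KMeans as stand-ins, with the toy "shell" contexts from earlier and K fixed at 2.

# Minimal sketch of the general methodology (not SenseClusters itself):
# lexical features -> SVD -> partitional clustering, via scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

contexts = [
    "my operating system shell is bash",
    "the shell command line is flexible",
    "the shells on the shore are lovely",
    "the oyster shell is very hard and black",
]

# 1) Represent each context as a first order lexical feature vector
#    (unigram and bigram counts).
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(contexts)

# 2) Reduce dimensionality with a truncated SVD (kept tiny for this toy data).
X_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# 3) Cluster the context vectors; K is simply fixed at 2 here, whereas the
#    tutorial also discusses finding K automatically (cluster stopping).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)

for label, context in zip(labels, contexts):
    print(label, context)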
Identifying Lexical Features
Measures of Association and Tests of Significance
What are features?
• Features represent the (hopefully) salient characteristics of the contexts to be clustered
• Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features
• Vectors/contexts that include many of the same features will be similar to each other
Where do features come from?
• In unsupervised clustering, it is common for the feature selection data to be the same data that is to be clustered
• This is not cheating, since data to be clustered does not have any labeled classes that can be used to assist feature selection
• It may also be necessary, since we may need to cluster all available data, and not hold out some for a separate feature identification step
  • Email or news articles
Feature Selection
• “Test” data – the contexts to be clustered
  • Assume that the feature selection data is the same as the test data, unless otherwise indicated
• “Training” data – a separate corpus of held-out feature selection data (that will not be clustered)
  • May be needed if you have a small number of contexts to cluster (e.g., web search results)
  • This sense of “training” is due to Schütze (1998)
Lexical Features
• Unigram – a single word that occurs more than a given number of times
• Bigram – an ordered pair of words that occur together more often than expected by chance
  • Consecutive, or may have intervening words
• Co-occurrence – an unordered bigram
• Target Co-occurrence – a co-occurrence where one of the words is the target word
Bigrams
• fine wine (window size of 2)
• baseball bat
• house of representatives (window size of 3)
• president of the republic (window size of 4)
• apple orchard
• Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?)
Co-occurrences
• tropics water
• boat fish
• law president
• train travel
• Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations
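As a rough illustration of these two feature types, the sketch below counts ordered pairs inside a small window (bigrams) and unordered pairs inside a larger window (co-occurrences) from whitespace-segmented text. This is not NSP code; the helper function and the particular window sizes are just illustrative choices within the ranges mentioned above.

# Illustrative only (not NSP): count word pairs within a window.
from collections import Counter

def pair_counts(tokens, window, ordered):
    """Count pairs (w1, w2) where w2 occurs within `window` tokens of w1."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            w2 = tokens[j]
            pair = (w1, w2) if ordered else tuple(sorted((w1, w2)))
            counts[pair] += 1
    return counts

text = "the president of the republic addressed the house of representatives"
tokens = text.split()   # word segmentation by white space, as assumed earlier

bigrams = pair_counts(tokens, window=4, ordered=True)         # small window (here 4)
cooccurrences = pair_counts(tokens, window=8, ordered=False)  # larger window (here 8)

print(bigrams[("president", "republic")])   # 1: from "president of the republic"
print(cooccurrences.most_common(5))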
Bigrams and Co-occurrences
• Pairs of words tend to be much less ambiguous than unigrams
  • “bank” versus “river bank” and “bank card”
  • “dot” versus “dot com” and “dot product”
• Trigrams and beyond occur much less frequently (Ngrams are very Zipfian)
• Unigrams are noisy, but bountiful
“occur together more often than expected by chance…”
• Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix
• Throw out bigrams that include one or two stop words
• Expected values are calculated, based on the model of independence and the observed values
  • How often would you expect these words to occur together, if they only occurred together by chance?
• If two words occur “significantly” more often than the expected value, then the words do not occur together by chance.
2x2 Contingency Table
[The original slides show the 2x2 contingency table for a word pair: the observed counts of the two words occurring together and apart, the marginal totals, and the expected counts derived from those totals under independence. The table graphics are not reproduced here.]
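As a stand-in for the missing table graphics, here is a purely hypothetical example for the bigram “fine wine”, with made-up counts chosen only to show how the expected values fall out of the marginal totals:

              wine      not wine    totals
  fine         150          850      1,000
  not fine     350       98,650     99,000
  totals       500       99,500    100,000

Expected counts under independence, E = (row total x column total) / N:
  E(fine, wine)         = 1,000  x 500    / 100,000 =      5
  E(fine, not wine)     = 1,000  x 99,500 / 100,000 =    995
  E(not fine, wine)     = 99,000 x 500    / 100,000 =    495
  E(not fine, not wine) = 99,000 x 99,500 / 100,000 = 98,505

The observed count of 150 is far above the expected 5, so under the measures that follow this (hypothetical) pair would score as strongly associated.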
Measures of Association
[The original slides give the formulas for the measures of association computed over the 2x2 table; the formula graphics are not reproduced here, but see the standard forms of G^2 and X^2 below.]
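The two statistics discussed on the next slides have these standard textbook definitions (written here in LaTeX notation; this is the usual form, not necessarily the exact typesetting of the original slides), where n_{ij} are the observed counts of the 2x2 table and m_{ij} the expected counts under independence:

X^2 = \sum_{i,j} \frac{(n_{ij} - m_{ij})^2}{m_{ij}}

G^2 = 2 \sum_{i,j} n_{ij} \ln \frac{n_{ij}}{m_{ij}}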
Interpreting the Scores…
• G^2 and X^2 are asymptotically approximated by the chi-squared distribution…
• This means… if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G^2 or X^2 scores for each resulting table, and plot the distribution of the scores, you *should* get a chi-squared distribution
Interpreting the Scores…
• Values above a certain level of significance can be considered grounds for rejecting the null hypothesis
  • H0: the words in the bigram are independent
• 3.841 (the chi-squared critical value for 1 degree of freedom, which is what a 2x2 table has) is associated with 95% confidence that the null hypothesis should be rejected
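For a quick check of these numbers, SciPy's chi-squared distribution gives both the critical value and the p-value for a score (a generic illustration, not something the tutorial itself relies on):

# Critical value and p-value for a 2x2 table (1 degree of freedom).
from scipy.stats import chi2

print(chi2.ppf(0.95, df=1))   # ~3.841, the 95% critical value quoted above
print(chi2.sf(3.841, df=1))   # ~0.05, the corresponding p-value
print(chi2.sf(20.0, df=1))    # a much larger G^2/X^2 score -> tiny p-value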
Measures of Association
• There are numerous measures of association that can be used to identify bigram and co-occurrence features
• Many of these are supported in the Ngram Statistics Package (NSP)
  • http://www.d.umn.edu/~tpederse/nsp.html
Measures Supported in NSP
• Log-likelihood Ratio (ll)
• True Mutual Information (tmi)
• Pearson’s Chi-squared Test (x2)
• Pointwise Mutual Information (pmi)
• Phi coefficient (phi)
• T-test (tscore)
• Fisher’s Exact Test (leftFisher, rightFisher)
• Dice Coefficient (dice)
• Odds Ratio (odds)
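NSP itself is a Perl package; purely to illustrate what a few of these measures compute, here is a small Python sketch over the hypothetical “fine wine” counts from the 2x2 table example above. These are the textbook formulas, which may differ in minor details (such as logarithm base) from NSP's implementations.

import math

# Hypothetical counts from the "fine wine" table above:
# n11 = together, n12/n21 = one word without the other, n22 = neither.
n11, n12, n21, n22 = 150, 850, 350, 98650
n1p, np1, N = n11 + n12, n11 + n21, n11 + n12 + n21 + n22
m11 = n1p * np1 / N                       # expected joint count under independence

pmi  = math.log2(n11 / m11)               # pointwise mutual information
dice = 2 * n11 / (n1p + np1)              # Dice coefficient
odds = (n11 * n22) / (n12 * n21)          # odds ratio

print(round(pmi, 2), round(dice, 3), round(odds, 1))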