Language Independent Methods of Clustering Similar Contexts (with applications) Ted Pedersen University of Minnesota, Duluth http://www.d.umn.edu/~tpederse tpederse@d.umn.edu EuroLAN-2005 Summer School
The Problem • A context is a short unit of text • often a phrase to a paragraph in length, although it can be longer • Input: N contexts • Output: K clusters • Where the contexts in each cluster are more similar to one another than to the contexts found in other clusters (see the toy sketch below) EuroLAN-2005 Summer School
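The following toy sketch is not part of the original tutorial; scikit-learn, k-means, and the example contexts are illustrative assumptions. It only shows the shape of the problem: N short contexts go in, K clusters of similar contexts come out.

```python
# Toy sketch of the clustering problem: N contexts in, K clusters out.
# scikit-learn and k-means are illustrative choices, not the tutorial's tools.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

contexts = [                       # N = 4 input contexts
    "My operating system shell is bash.",
    "The shell command line is flexible.",
    "The shells on the shore are lovely.",
    "The oyster shell is very hard and black.",
]

# Represent each context by its word counts (simple lexical features).
X = CountVectorizer().fit_transform(contexts)

# Ask for K = 2 clusters; contexts that share many features tend to group together.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for label, context in zip(labels, contexts):
    print(label, context)
```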
Language Independent Methods • Do not utilize syntactic information • No parsers, part-of-speech taggers, etc. required • Do not utilize dictionaries or other manually created lexical resources • Based on lexical features selected from corpora • No manually annotated data of any kind; the methods are completely unsupervised in the strictest sense • Assumption: word segmentation can be done by looking for white space between strings EuroLAN-2005 Summer School
Outline (Tutorial) • Background and motivations • Identifying lexical features • Measures of association & tests of significance • Context representations • First & second order • Dimensionality reduction • Singular Value Decomposition • Clustering methods • Agglomerative & partitional techniques • Cluster labeling • Evaluation techniques • Gold standard comparisons EuroLAN-2005 Summer School
Outline (Practical Session) • Headed contexts • Name Discrimination • Word Sense Discrimination • Abbreviations • Headless contexts • Email/Newsgroup Organization • Newspaper text • Identifying Sets of Related Words EuroLAN-2005 Summer School
SenseClusters • A package designed to cluster contexts • Integrates with various other tools • Ngram Statistics Package • Cluto • SVDPACKC • http://senseclusters.sourceforge.net EuroLAN-2005 Summer School
Many thanks… • Satanjeev (“Bano”) Banerjee (M.S., 2002) • Founding developer of the Ngram Statistics Package (2000-2001) • Now PhD student in the Language Technology Institute at Carnegie Mellon University http://www-2.cs.cmu.edu/~banerjee/ • Amruta Purandare (M.S., 2004) • Founding developer of SenseClusters (2002-2004) • Now PhD student in Intelligent Systems at the University of Pittsburgh http://www.cs.pitt.edu/~amruta/ • Anagha Kulkarni (M.S., 2006, expected) • Enhancing SenseClusters since Fall 2004! • http://www.d.umn.edu/~kulka020/ • National Science Foundation (USA) for supporting Bano, Amruta, Anagha and me (!) via CAREER award #0092784 EuroLAN-2005 Summer School
Practical Session • Experiment with SenseClusters • http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi • Has both a command line and web interface (above) • Can be installed on a Linux/Unix machine without too much work • http://senseclusters.sourceforge.net • Has some dependencies that must be installed, so having superuser access and/or sysadmin experience helps • Complete system (SenseClusters plus dependencies) is available on CD EuroLAN-2005 Summer School
Background and Motivations EuroLAN-2005 Summer School
Headed and Headless Contexts • A headed context includes a target word • Our goal is to collect multiple contexts that mention a particular target word in order to try to identify the different senses of that word • A headless context has no target word • Our goal is to identify the contexts that are similar to each other EuroLAN-2005 Summer School
Headed Contexts (input) • I can hear the ocean in that shell. • My operating system shell is bash. • The shells on the shore are lovely. • The shell command line is flexible. • The oyster shell is very hard and black. EuroLAN-2005 Summer School
Headed Contexts (output) • Cluster 1: • My operating system shell is bash. • The shell command line is flexible. • Cluster 2: • The shells on the shore are lovely. • The oyster shell is very hard and black. • I can hear the ocean in that shell. EuroLAN-2005 Summer School
Headless Contexts (input) • The new version of Linux is more stable and has better support for cameras. • My Chevy Malibu has had some front end troubles. • Osborne made one of the first personal computers. • The brakes went out, and the car flew into the house. • With the price of gasoline, I think I’ll be taking the bus more often! EuroLAN-2005 Summer School
Headless Contexts (output) • Cluster 1: • The new version of Linux is more stable and has better support for cameras. • Osborne made one of the first personal computers. • Cluster 2: • My Chevy Malibu has had some front end troubles. • The brakes went out, and the car flew into the house. • With the price of gasoline, I think I’ll be taking the bus more often! EuroLAN-2005 Summer School
Applications • Web search results are headed contexts • The term you search for is included in the snippet • Web search results are often disorganized – two people sharing the same name, two organizations sharing the same abbreviation, etc. often have their pages “mixed up” • Organizing web search results is an important problem. • If you click on search results or follow links in the pages found, you will encounter headless contexts too… EuroLAN-2005 Summer School
Applications • Email (public or private) is made up of headless contexts • Short, usually focused… • Cluster similar email messages together • Automatic email foldering • Take all messages from sent-mail file or inbox and organize into categories EuroLAN-2005 Summer School
Applications • News articles are another example of headless contexts • Entire article or first paragraph • Short, usually focused • Cluster similar articles together EuroLAN-2005 Summer School
Underlying Premise… • You shall know a word by the company it keeps • Firth, 1957 (Studies in Linguistic Analysis) • Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis) • Harris, 1968 (Mathematical Structures of Language) • Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis) • Miller and Charles, 1991 (Language and Cognitive Processes) • Various extensions… • Similar contexts will have similar meanings, etc. • Names that occur in similar contexts will refer to the same underlying person, etc. EuroLAN-2005 Summer School
Identifying Lexical Features Measures of Association and Tests of Significance EuroLAN-2005 Summer School
What are features? • Features represent the (hopefully) salient characteristics of the contexts to be clustered • Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features • Vectors/contexts that include many of the same features will be similar to each other EuroLAN-2005 Summer School
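As a minimal sketch of this idea (the feature list and contexts below are assumptions chosen for illustration), each context becomes a vector of feature counts, and contexts that share more features receive a higher cosine similarity:

```python
import math

# Hypothetical feature set; in practice these would be selected from corpora.
features = ["shell", "bash", "command", "line", "oyster", "shore"]

def to_vector(context, features):
    """Count how often each feature occurs in the context."""
    tokens = context.lower().split()
    return [tokens.count(f) for f in features]

def cosine(u, v):
    """Cosine similarity: higher when the vectors share more features."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

v1 = to_vector("my bash shell has a flexible command line", features)
v2 = to_vector("the shell command line is flexible", features)
v3 = to_vector("the oyster shell is very hard and black", features)

print(cosine(v1, v2))  # ~0.87: these contexts share shell, command, line
print(cosine(v1, v3))  # ~0.35: these contexts share only shell
```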
Where do features come from? • In unsupervised clustering, it is common for the feature selection data to be the same data that is to be clustered • This is not cheating, since data to be clustered does not have any labeled classes that can be used to assist feature selection • It may also be necessary, since we may need to cluster all available data, and not hold out some for a separate feature identification step • Email or news articles EuroLAN-2005 Summer School
Feature Selection • “Test” data – the contexts to be clustered • Assume that the feature selection data is the same as the test data, unless otherwise indicated • “Training” data – a separate corpus of held out feature selection data (that will not be clustered) • may need to use if you have a small number of contexts to cluster (e.g., web search results) • This sense of “training” due to Schütze (1998) EuroLAN-2005 Summer School
Lexical Features • Unigram – a single word that occurs more than a given number of times • Bigram – an ordered pair of words that occur together more often than expected by chance • Consecutive or may have intervening words • Co-occurrence – an unordered bigram • Target Co-occurrence – a co-occurrence where one of the words is the target word EuroLAN-2005 Summer School
Bigrams • fine wine (window size of 2) • baseball bat • house of representatives (window size of 3) • president of the republic (window size of 4) • apple orchard • Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?) EuroLAN-2005 Summer School
Co-occurrences • tropics water • boat fish • law president • train travel • Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations EuroLAN-2005 Summer School
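A rough sketch of collecting both kinds of pair features (the window sizes, sample text, and counting scheme here are assumptions, chosen in the spirit of the windows described above):

```python
from collections import Counter

text = ("the president of the republic visited the house of representatives "
        "to discuss the price of gasoline")
tokens = text.split()

BIGRAM_WINDOW = 4   # ordered pairs within a small window (up to 2 intervening words)
COOC_WINDOW = 8     # unordered pairs within a larger window

bigrams = Counter()
cooccurrences = Counter()

for i, w1 in enumerate(tokens):
    # Ordered pairs: w1 followed by w2 within the small window.
    for w2 in tokens[i + 1 : i + BIGRAM_WINDOW]:
        bigrams[(w1, w2)] += 1
    # Unordered pairs: w1 and w2 anywhere within the larger window.
    for w2 in tokens[i + 1 : i + COOC_WINDOW]:
        cooccurrences[frozenset((w1, w2))] += 1

print(bigrams.most_common(3))
print(cooccurrences.most_common(3))
```

In practice the candidate pairs would then be filtered: pairs made up of stop words are discarded, and the rest are kept only if they pass a frequency cutoff or a measure of association (see the following slides).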
Bigrams and Co-occurrences • Pairs of words tend to be much less ambiguous than unigrams • “bank” versus “river bank” and “bank card” • “dot” versus “dot com” and “dot product” • Trigrams and beyond occur much less frequently (Ngram distributions are very Zipfian) • Unigrams are noisy, but bountiful EuroLAN-2005 Summer School
“occur together more often than expected by chance…” • Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix • Throw out bigrams that include one or two stop words • Expected values are calculated, based on the model of independence and observed values • How often would you expect these words to occur together, if they only occurred together by chance? • If two words occur “significantly” more often than the expected value, then the words do not occur together by chance. EuroLAN-2005 Summer School
2x2 Contingency Table • [Worked examples: observed counts, marginal totals, and expected values for a bigram laid out in a 2x2 table] EuroLAN-2005 Summer School
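A hedged sketch of the computation the preceding slides describe (the cell counts below are made up for illustration): fill the 2x2 table with observed counts, derive the expected counts from the marginal totals under the independence model, and compute the log-likelihood ratio G^2 and Pearson's X^2.

```python
from math import log

# Observed counts for a bigram (w1, w2):
#              w2       not w2
#   w1        n11        n12
#   not w1    n21        n22
n11, n12, n21, n22 = 10, 20, 30, 940
total = n11 + n12 + n21 + n22
row = [n11 + n12, n21 + n22]   # marginal totals for w1 / not w1
col = [n11 + n21, n12 + n22]   # marginal totals for w2 / not w2

observed = [[n11, n12], [n21, n22]]
# Under independence, each expected cell is (row total * column total) / total.
expected = [[row[i] * col[j] / total for j in range(2)] for i in range(2)]

x2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(2))
g2 = 2 * sum(observed[i][j] * log(observed[i][j] / expected[i][j])
             for i in range(2) for j in range(2) if observed[i][j] > 0)

print(f"expected n11 = {expected[0][0]:.2f}, X^2 = {x2:.2f}, G^2 = {g2:.2f}")
```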
Measures of Association • [Formulas for the measures, e.g., the log-likelihood ratio (G^2) and Pearson’s chi-squared test (X^2); standard definitions are given below] EuroLAN-2005 Summer School
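For reference, the standard definitions of these two statistics over a 2x2 table (a sketch in standard notation, written as an assumption about what the slides presented):

```latex
% n_{ij} are the observed counts, n_{i+} and n_{+j} the marginal totals,
% n_{++} the grand total, and m_{ij} the expected counts under independence.
\[
  m_{ij} = \frac{n_{i+}\, n_{+j}}{n_{++}}, \qquad
  X^2 = \sum_{i,j} \frac{(n_{ij} - m_{ij})^2}{m_{ij}}, \qquad
  G^2 = 2 \sum_{i,j} n_{ij} \ln \frac{n_{ij}}{m_{ij}}
\]
```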
Interpreting the Scores… • G^2 and X^2 are asymptotically approximated by the chi-squared distribution… • This means… if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G^2 or X^2 scores for each resulting table, and plot the distribution of the scores, you *should* get the chi-squared distribution (with one degree of freedom for a 2x2 table) EuroLAN-2005 Summer School
Interpreting the Scores… • Values above a certain level of significance can be considered grounds for rejecting the null hypothesis • H0: the words in the bigram are independent • A score above 3.841 (the chi-squared critical value for one degree of freedom) corresponds to 95% confidence that the null hypothesis should be rejected EuroLAN-2005 Summer School
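A quick way to check the 3.841 threshold and turn a score into a p-value (a sketch; scipy is an assumption here, not one of the tutorial's tools):

```python
from scipy.stats import chi2

# Critical value of the chi-squared distribution with 1 degree of freedom at alpha = 0.05.
print(chi2.ppf(0.95, df=1))   # ~3.841

# p-value for an observed G^2 or X^2 score; below 0.05 means we reject
# the null hypothesis that the two words are independent.
score = 10.83
print(chi2.sf(score, df=1))   # ~0.001
```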
Measures of Association • There are numerous measures of association that can be used to identify bigram and co-occurrence features • Many of these are supported in the Ngram Statistics Package (NSP) • http://www.d.umn.edu/~tpederse/nsp.html EuroLAN-2005 Summer School
Measures Supported in NSP • Log-likelihood Ratio (ll) • True Mutual Information (tmi) • Pearson’s Chi-squared Test (x2) • Pointwise Mutual Information (pmi) • Phi coefficient (phi) • T-test (tscore) • Fisher’s Exact Test (leftFisher, rightFisher) • Dice Coefficient (dice) • Odds Ratio (odds) EuroLAN-2005 Summer School
NSP • We will explore NSP during the practical session • Integrated into SenseClusters; may also be used in stand-alone mode • Can be installed easily on a Linux/Unix system from the CD or by download from • http://www.d.umn.edu/~tpederse/nsp.html • I’m told it can also be installed on Windows (via Cygwin or ActivePerl), but I have no personal experience of this… EuroLAN-2005 Summer School
Summary • Identify lexical features based on frequency counts or measures of association – either in the data to be clustered or in a separate set of feature selection data • Language independent • Unigrams are usually selected only by frequency • Remember, there is no labeled data from which to learn, so unigrams are somewhat less effective as features than in the supervised case • Bigrams and co-occurrences can also be selected by frequency, or better yet by measures of association • Bigrams and co-occurrences need not be consecutive • Stop words should be eliminated • Frequency thresholds are helpful (e.g., a unigram or bigram that occurs only once may be too rare to be useful) EuroLAN-2005 Summer School
Related Work • Moore, 2004 (EMNLP) – follow-up to Dunning and Pedersen on log-likelihood and exact tests http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Moore.pdf • Pedersen, 1996 (SCSUG) – explanation of exact tests, and a comparison to log-likelihood http://arxiv.org/abs/cmp-lg/9608010 (also see Pedersen, Kayaalp, and Bruce, AAAI-1996) • Dunning, 1993 (Computational Linguistics) – introduces the log-likelihood ratio for collocation identification http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf EuroLAN-2005 Summer School