Language Independent Methods of Clustering Similar Contexts (with applications) Ted Pedersen University of Minnesota, Duluth tpederse@d.umn.edu http://www.d.umn.edu/~tpederse/SCTutorial.html IJCAI-2007 Tutorial
Language Independent Methods • Do not utilize syntactic information • No parsers, part-of-speech taggers, etc. required • Do not utilize dictionaries or other manually created lexical resources • Based on lexical features selected from corpora • Assumption: word segmentation can be done by looking for white space between strings • No manually annotated data; the methods are completely unsupervised in the strictest sense
A Note on Tokenization • Default tokenization is white-space separated strings • Can be redefined using regular expressions • e.g., character n-grams (4-grams) • any other valid regular expression
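A minimal sketch (not part of the original tutorial code) of the tokenization options described above, in plain Python: default white-space tokenization, a character 4-gram redefinition, and an arbitrary regular expression.

```python
import re

text = "The shell command line is flexible."

# Default: white-space separated strings
tokens = text.split()

# Redefined as character n-grams (here, overlapping 4-grams over the raw string)
char_4grams = [text[i:i + 4] for i in range(len(text) - 3)]

# Or any other valid regular expression, e.g. alphabetic strings only
alpha_tokens = re.findall(r"[A-Za-z]+", text)

print(tokens)
print(char_4grams[:5])
print(alpha_tokens)
```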
Clustering Similar Contexts • A context is a short unit of text • often a phrase to a paragraph in length, although it can be longer • Input: N contexts • Output: K clusters • where the contexts in each cluster are more similar to each other than to the contexts found in other clusters
Applications • Headed contexts (focus on target word) • Name Discrimination • Word Sense Discrimination • Headless contexts • Email Organization • Document Clustering • Paraphrase identification • Clustering Sets of Related Words
Tutorial Outline • Identifying Lexical Features • First Order Context Representation • native SC : context as vector of features • Second Order Context Representation • LSA : context as average of vectors of contexts • native SC : context as average of vectors of features • Dimensionality reduction • Clustering • Hands-On Experience
SenseClusters • A free package for clustering contexts • http://senseclusters.sourceforge.net • SenseClusters Live! (Knoppix CD) • Perl components that integrate other tools • Ngram Statistics Package • CLUTO • SVDPACKC • PDL
Many thanks… • Amruta Purandare (M.S., 2004) • Now PhD student in Intelligent Systems at the University of Pittsburgh • http://www.cs.pitt.edu/~amruta/ • Anagha Kulkarni (M.S., 2006) • Now PhD student at the Language Technologies Institute at Carnegie-Mellon University • http://www.cs.cmu.edu/~anaghak/ • Ted, Amruta, and Anagha were supported by the National Science Foundation (USA) via CAREER award #0092784
Background and Motivations
Headed and Headless Contexts • A headed context includes a target word • Our goal is to cluster the contexts in which the target word occurs, based on the words that surround it • The focus is on the target word and making distinctions among word meanings • A headless context has no target word • Our goal is to cluster the contexts based on their similarity to each other • The focus is on the context as a whole and making topic level distinctions
Headed Contexts (input) • I can hear the ocean in that shell. • My operating system shell is bash. • The shells on the shore are lovely. • The shell command line is flexible. • An oyster shell is very hard and black.
Headed Contexts (output) • Cluster 1: • My operating system shell is bash. • The shell command line is flexible. • Cluster 2: • The shells on the shore are lovely. • An oyster shell is very hard and black. • I can hear the ocean in that shell.
Headless Contexts (input) • The new version of Linux is more stable and has better support for cameras. • My Chevy Malibu has had some front-end troubles. • Osborne made one of the first personal computers. • The brakes went out, and the car flew into the house. • With the price of gasoline, I think I’ll be taking the bus more often!
Headless Contexts (output) • Cluster 1: • The new version of Linux is more stable and has better support for cameras. • Osborne made one of the first personal computers. • Cluster 2: • My Chevy Malibu has had some front-end troubles. • The brakes went out, and the car flew into the house. • With the price of gasoline, I think I’ll be taking the bus more often!
Web Search as Application • Snippets returned via Web search are headed contexts since they include the search term • Name ambiguity is a problem in Web search: results for a given name mix together different underlying entities • Group results into clusters where each cluster is associated with a unique underlying entity • Pages found by following search results can also be treated as headless contexts
Name Discrimination
George Millers! (Web search results for the name mix several different people named George Miller)
Email Foldering as Application • Email (public or private) is made up of headless contexts • Short, usually focused… • Cluster similar email messages together • Automatic email foldering • Take all messages from sent-mail file or inbox and organize into categories
Clustering News as Application • News articles are headless contexts • Entire article or first paragraph • Short, usually focused • Cluster similar articles together, can also be applied to blog entries and other shorter units of text
What does it mean to be “similar”? • You shall know a word by the company it keeps • Firth, 1957 (Studies in Linguistic Analysis) • Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis) • Harris, 1968 (Mathematical Structures of Language) • Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis) • Miller and Charles, 1991 (Language and Cognitive Processes) • Various extensions… • Similar contexts will have similar meanings, etc. • Names that occur in similar contexts will refer to the same underlying person, etc.
General Methodology • Represent contexts to be clustered using first- or second-order feature vectors • Lexical features • Reduce dimensionality to make vectors more tractable and/or understandable (optional) • Singular value decomposition • Cluster the context vectors • Find the number of clusters • Label the clusters • Evaluate and/or use the contexts!
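A minimal end-to-end sketch of this methodology, using scikit-learn as a stand-in for the feature extraction, SVD, and clustering components (SenseClusters itself uses NSP, SVDPACKC, and CLUTO); all parameter values are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

contexts = [
    "My operating system shell is bash.",
    "The shell command line is flexible.",
    "The shells on the shore are lovely.",
    "An oyster shell is very hard and black.",
]

# First-order representation: each context is a vector of lexical features
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(contexts)

# Optional dimensionality reduction via SVD
# (with tiny toy data, keep the number of dimensions very small)
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)

# Cluster the context vectors into K clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)

for context, label in zip(contexts, labels):
    print(label, context)
```

On toy data this small the split will not always be stable, but the intent is that the operating-system contexts and the seashell contexts end up in different clusters.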
Identifying Lexical Features Measures of Association and Tests of Significance
What are features? • Features are the salient characteristics of the contexts to be clustered • Each context is represented as a vector, where the dimensions are associated with features • Contexts that include many of the same features will be similar to each other
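A tiny worked example (not from the original slides) of the point above: contexts that share more features have higher cosine similarity between their feature vectors.

```python
import math

# Feature set (the dimensions of the vectors below)
features = ["shell", "bash", "command", "ocean", "shore"]

# Binary first-order vectors: 1 if the context contains the feature, 0 otherwise
ctx_a = [1, 1, 1, 0, 0]   # e.g. "my bash shell command ..."
ctx_b = [1, 1, 0, 0, 0]   # e.g. "the shell is bash ..."
ctx_c = [1, 0, 0, 1, 1]   # e.g. "shells on the ocean shore ..."

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

print(cosine(ctx_a, ctx_b))   # ~0.82: two shell-as-software contexts
print(cosine(ctx_a, ctx_c))   # ~0.33: less feature overlap, lower similarity
```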
Feature Selection Data • The contexts to cluster (evaluation/test data) • We may need to cluster all available data, and not hold out any for a separate feature identification step • A separate larger corpus (training data), esp. if we cluster a very small number of contexts • local training – corpus made up of headed contexts • global training – corpus made up of headless contexts • Feature selection data may be either the evaluation/test data, or a separate held-out set of training data
Feature Selection Data • Test / Evaluation data : contexts to be clustered • Assume that the feature selection data is the test data, unless otherwise indicated • Training data – a separate corpus of held-out feature selection data (that will not be clustered) • may need to use this if you have a small number of contexts to cluster (e.g., web search results) • This sense of “training” is due to Schütze (1998) • it does not mean labeled data • simply an extra quantity of text
Lexical Features • Unigram • a single word that occurs more than X times in the feature selection data and is not in the stop list • Stop list • words that will not be used as features • usually non-content words like the, and, or, it … • may be compiled manually • may be derived automatically from a corpus of text • any word that occurs in a relatively large percentage (>10-20%) of contexts may be considered a stop word
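A minimal sketch, in plain Python, of the unigram selection just described: keep words above a frequency cutoff, skip stop-listed words, and optionally derive a stop list automatically from the fraction of contexts a word appears in. The cutoff values are illustrative assumptions, not values prescribed by SenseClusters.

```python
from collections import Counter

contexts = [
    "the shell command line is flexible",
    "my operating system shell is bash",
    "the shells on the shore are lovely",
]

min_count = 2                                    # frequency cutoff (the slide's X, chosen arbitrarily here)
stop_list = {"the", "is", "are", "on", "my"}     # manually compiled stop list

word_counts = Counter(w for c in contexts for w in c.split())

# Alternative: derive stop words automatically as words that occur in a large
# fraction of contexts (the 50% threshold is illustrative)
context_counts = Counter(w for c in contexts for w in set(c.split()))
auto_stop = {w for w, n in context_counts.items() if n / len(contexts) > 0.5}

unigram_features = [w for w, n in word_counts.items()
                    if n >= min_count and w not in stop_list]

print(unigram_features)   # ['shell'] on this toy data
print(sorted(auto_stop))
```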
Lexical Features • Bigram • an ordered pair of words that may be consecutive, or have intervening words that are ignored • the pair occurs together more than X times and/or more often than expected by chance in the feature selection data • neither word in the pair may be in the stop list • Co-occurrence • an unordered bigram • Target Co-occurrence • a co-occurrence where one of the words is the target
Bigrams • Window Size of 2 • baseball bat, fine wine, apple orchard, bill clinton • Window Size of 3 • house of representatives, bottle of wine • Window Size of 4 • president of the republic, whispering in the wind • Selected using a small window size (2-4 words) • Objective is to capture a regular or localized pattern between two words (collocation?)
Co-occurrences • (president, law) • the president signed a bill into law today • that law is unjust, said the president • the president feels that the law was properly applied • Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations
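A rough sketch (illustrative, not the NSP implementation) of collecting word pairs within a window: a small window of 2-4 words targets bigram-like collocations, while a larger window of 7-10 words picks up looser co-occurrences such as (president, law).

```python
from collections import Counter

def pairs_in_window(tokens, window):
    """Count unordered word pairs whose positions are fewer than `window` apart."""
    counts = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            counts[tuple(sorted((tokens[i], tokens[j])))] += 1
    return counts

text = "the president signed a bill into law today"
tokens = text.split()

print(pairs_in_window(tokens, window=2))   # adjacent pairs only (bigram-like)
print(pairs_in_window(tokens, window=8))   # looser pairs, including ('law', 'president')
```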
Bigrams and Co-occurrences • Pairs of words tend to be much less ambiguous than unigrams • “bank” versus “river bank” and “bank card” • “dot” versus “dot com” and “dot product” • Trigrams and longer n-grams occur much less frequently (n-gram frequencies are very Zipfian) • Unigrams occur more frequently, but are noisy
“occur together more often than expected by chance…” • Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix • Expected values are calculated based on the model of independence and the observed values • How often would you expect these words to occur together, if they only occurred together by chance? • If two words occur “significantly” more often than the expected value, then we conclude that they do not occur together merely by chance.
2x2 Contingency Table
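The original slides show the table itself; as a stand-in, here is a sketch of the standard 2x2 layout for a bigram (word1, word2) with invented cell counts, showing how the expected values and the G^2 and X^2 scores discussed below are computed from the observed counts and marginal totals.

```python
import math

# Observed counts for the bigram (word1, word2); values are invented for illustration
n11 = 10      # word1 followed by word2
n12 = 20      # word1 followed by something other than word2
n21 = 40      # word2 preceded by something other than word1
n22 = 930     # neither word1 nor word2
n = n11 + n12 + n21 + n22

# Marginal totals
n1p, n2p = n11 + n12, n21 + n22          # rows: word1 present / absent
np1, np2 = n11 + n21, n12 + n22          # columns: word2 present / absent

# Expected values under the model of independence
m11 = n1p * np1 / n
m12 = n1p * np2 / n
m21 = n2p * np1 / n
m22 = n2p * np2 / n

observed = [n11, n12, n21, n22]
expected = [m11, m12, m21, m22]

# Pearson's chi-squared and the log-likelihood ratio (G^2)
x2 = sum((o - m) ** 2 / m for o, m in zip(observed, expected))
g2 = 2 * sum(o * math.log(o / m) for o, m in zip(observed, expected) if o > 0)

print(f"expected n11 = {m11:.2f}, X^2 = {x2:.2f}, G^2 = {g2:.2f}")
```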
Measures of Association
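For reference, the textbook forms of the two scores most discussed in this section, Pearson's chi-squared and the log-likelihood ratio (these are the standard definitions, not necessarily the exact notation used on the slides), stated in terms of the observed counts n_ij and expected counts m_ij from the 2x2 table:

```latex
X^2 = \sum_{i,j} \frac{(n_{ij} - m_{ij})^2}{m_{ij}},
\qquad
G^2 = 2 \sum_{i,j} n_{ij} \ln \frac{n_{ij}}{m_{ij}},
\qquad
m_{ij} = \frac{n_{i+} \, n_{+j}}{n}
```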
Interpreting the Scores… • G^2 and X^2 are asymptotically approximated by the chi-squared distribution… • This means… if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G^2 or X^2 scores for each resulting table, and plot the distribution of the scores, you *should* get a chi-squared distribution (with one degree of freedom for a 2x2 table)
Interpreting the Scores… • Scores above the critical value for a chosen level of significance are grounds for rejecting the null hypothesis • H0: the words in the bigram are independent • A score above 3.84 corresponds to 95% confidence that the null hypothesis should be rejected (p < 0.05 with one degree of freedom)
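A quick check of where the 3.84 threshold comes from, assuming SciPy is available (SciPy is not part of the tutorial's toolchain):

```python
from scipy.stats import chi2

# Critical value for 95% confidence with 1 degree of freedom (a 2x2 table)
print(chi2.ppf(0.95, df=1))   # ~3.841

# p-value for an observed G^2 or X^2 score (6.63 is just an example value)
print(chi2.sf(6.63, df=1))    # ~0.01
```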
Measures of Association • There are numerous measures of association that can be used to identify bigram and co-occurrence features • Many of these are supported in the Ngram Statistics Package (NSP) • http://www.d.umn.edu/~tpederse/nsp.html • NSP is integrated into SenseClusters
Measures Supported in NSP • Log-likelihood Ratio (ll) • True Mutual Information (tmi) • Pointwise Mutual Information (pmi) • Pearson’s Chi-squared Test (x2) • Phi coefficient (phi) • Fisher’s Exact Test (leftFisher) • T-test (tscore) • Dice Coefficient (dice) • Odds Ratio (odds)
Summary • Identify lexical features based on frequency counts or measures of association – either in the data to be clustered or in a separate set of feature selection data • Language independent • Unigrams are usually selected only by frequency • Remember, there is no labeled data from which to learn, so unigrams are somewhat less effective as features than in the supervised case • Bigrams and co-occurrences can also be selected by frequency, or better yet by measures of association • Bigrams and co-occurrences need not be consecutive • Stop words should be eliminated • Frequency thresholds are helpful (e.g., a unigram or bigram that occurs only once may be too rare to be useful)
References • Moore, 2004 (EMNLP) follow-up to Dunning and Pedersen on log-likelihood and exact tests http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Moore.pdf • Pedersen, Kayaalp, and Bruce, 1996 (AAAI) explanation of the exact conditional test, a stochastic simulation of exact tests http://www.d.umn.edu/~tpederse/Pubs/aaai96-cmpl.pdf • Pedersen, 1996 (SCSUG) explanation of exact tests for collocation identification, and comparison to log-likelihood http://arxiv.org/abs/cmp-lg/9608010 • Dunning, 1993 (Computational Linguistics) introduces log-likelihood ratio for collocation identification http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf
Context Representations First and Second Order Methods
Once features are selected… • We will have a set of unigrams, bigrams, co-occurrences, or target co-occurrences that we believe are somehow interesting and useful • We also have any frequency counts and measure-of-association scores that were used in their selection • Convert the contexts to be clustered into a vector representation based on these features
Possible Representations • First Order Features • Native SenseClusters • each context represented by a vector of features • Second Order Co-Occurrence Features • Native SenseClusters • each word in a context replaced by a vector of its co-occurring words, and these vectors averaged together • Latent Semantic Analysis • each feature in a context replaced by a vector of the contexts in which it occurs, and these vectors averaged together
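A rough sketch, not the SenseClusters implementation, of the second-order idea described above: build a word-by-word co-occurrence matrix from a corpus, then represent a context as the average of the co-occurrence vectors of the words it contains. The corpus, window size, and raw-count weighting are illustrative assumptions.

```python
import numpy as np

corpus = [
    "the shell command line is flexible",
    "my operating system shell is bash",
    "the shells on the shore are lovely",
    "an oyster shell is hard and black",
]

# Build a word-by-word co-occurrence matrix (window of +/- 2 words, illustrative)
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if i != j:
                cooc[index[w], index[words[j]]] += 1

def second_order_vector(context):
    """Average the co-occurrence vectors of the context's known words."""
    rows = [cooc[index[w]] for w in context.split() if w in index]
    return np.mean(rows, axis=0) if rows else np.zeros(len(vocab))

vec = second_order_vector("the shell is flexible")
print(len(vocab), vec[:5])
```

The LSA-style variant listed above is analogous, except that each feature is replaced by a vector of the contexts it occurs in rather than a vector of co-occurring words.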