Distributional Clustering of English Words Fernando Pereira, AT&T Bell Laboratories; Naftali Tishby, Dept. of Computer Science, Hebrew University; Lillian Lee, Dept. of Computer Science, Cornell University. Presenter: Juan Ramos, Dept. of Computer Science, Rutgers University, juramos@cs.rutgers.edu
Overview • Purpose: evaluate a method for clustering words according to their distribution in particular syntactic contexts. • Methodology: find the lowest-distortion set of word clusters to build models of word co-occurrence.
Applications • Scientific POV: lexical acquisition, i.e. learning the sense structure of words from usage. • Practical POV: classification addresses data sparseness in grammar models. • Clusters are derived from a large corpus of documents.
Definitions • Context: the function of a given word in its sentence, e.g. a noun serving as a direct object. • Sense class: hidden model describing word association tendencies, expressed as a mix of clusters and cluster probabilities given a word. • Cluster: probabilistic concept of a sense class.
Problem Setting • Restrict the problem to verbs (V) and nouns (N) in the main-verb/direct-object relationship. • f(v, n) = frequency of occurrence of the verb-noun pair (v, n). • Text must be pre-formatted to fit these specifications. • For a given noun n, the conditional distribution over verbs is p(v|n) = f(v, n) / sum_v' f(v', n).
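To make this concrete, here is a minimal Python sketch (not from the paper; the example pairs are hypothetical) that derives p(v|n) from raw verb-object counts:

```python
# A minimal sketch: estimating p(v|n) = f(v, n) / sum_v' f(v', n)
# from verb-object counts. The pairs below are made up for illustration.
from collections import Counter, defaultdict

pairs = [("fire", "gun"), ("fire", "missile"), ("fire", "employee"),
         ("hire", "employee"), ("hire", "worker")]

f = Counter(pairs)                 # f[(v, n)]: frequency of the (verb, noun) pair
noun_total = defaultdict(int)      # sum_v' f(v', n) for each noun n
for (v, n), count in f.items():
    noun_total[n] += count

p_v_given_n = {(v, n): count / noun_total[n] for (v, n), count in f.items()}
print(p_v_given_n[("fire", "employee")])   # 0.5: "employee" split between fire/hire
```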
Problem Setting cont. • Goal: create a set C of clusters and membership probabilities p(c|n). • Each c in C is associated with a cluster centroid distribution p(v|c). • p(v|c) is an average, over the nouns n belonging to c, of the distributions p(v|n).
Distributional Similarity • Given two distributions p, q, the KL divergence is D(p || q) = sum_x p(x) log(p(x)/q(x)). • D(p || q) = 0 if and only if p = q. • Small D(p || q) implies the two distributions are likely instances of the same centroid. • D(p || q) measures the information lost by using the centroid p(v|c) in place of a noun's own distribution.
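A minimal sketch of this distortion measure, assuming q(x) > 0 wherever p(x) > 0 (which holds in the clustering because each centroid is an average that includes the noun's own distribution):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)); zero exactly when p == q."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {"gun": 0.7, "employee": 0.3}
q = {"gun": 0.5, "employee": 0.5}
print(kl_divergence(p, q))   # positive, since p != q
print(kl_divergence(p, p))   # 0.0
```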
Theoretical Foundation • Given unstructured sets V and N, and training data X of independent verb-noun pairs. • Problem: learn the joint distribution of the pairs given X. • Not quite unsupervised, not quite supervised: the pairs carry no internal structure, yet the underlying distribution must be learned.
Distributional Clustering • Approximately decompose p(v|n) into p'(v|n) = sum_{c in C} p(c|n) * p(v|c). • p(c|n) = membership probability of n in c. • p(v|c) = probability of v given the centroid for c. • Assuming the marginals p(n) and p'(n) coincide, the joint becomes p'(n, v) = sum_{c in C} p(c) * p(n|c) * p(v|c).
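The decomposition is just a matrix product over clusters; a minimal numpy sketch with hypothetical two-noun, two-cluster, two-verb values:

```python
import numpy as np

p_c_given_n = np.array([[0.9, 0.1],    # memberships of noun 0 in clusters 0, 1
                        [0.2, 0.8]])   # memberships of noun 1
p_v_given_c = np.array([[0.8, 0.2],    # centroid of cluster 0 over verbs
                        [0.1, 0.9]])   # centroid of cluster 1 over verbs

# p'(v|n) = sum_c p(c|n) * p(v|c), one row per noun
p_v_given_n = p_c_given_n @ p_v_given_c
print(p_v_given_n.sum(axis=1))         # rows sum to 1: valid distributions
```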
Maximum Likelihood Cluster Centroids • Chosen to maximize the goodness of fit between the data and p'(n, v). • For a sequence S of pairs, the model log-probability is l(S) = sum_{(n,v) in S} log p'(n, v). • Maximize with respect to p(n|c) and p(v|c) by setting the variation of l(S) to zero.
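Given the decomposition above and a noun marginal p(n), l(S) is a direct sum over the sample; a short sketch with hypothetical values (S is a made-up list of (noun, verb) index pairs):

```python
import numpy as np

p_n = np.array([0.6, 0.4])                            # hypothetical noun marginal
p_c_given_n = np.array([[0.9, 0.1], [0.2, 0.8]])
p_v_given_c = np.array([[0.8, 0.2], [0.1, 0.9]])
p_nv = p_n[:, None] * (p_c_given_n @ p_v_given_c)     # joint p'(n, v)

S = [(0, 0), (0, 0), (1, 1)]                          # observed (noun, verb) indices
log_lik = sum(np.log(p_nv[n, v]) for n, v in S)       # l(S) = sum log p'(n, v)
print(log_lik)
```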
Maximum Entropy Cluster Membership • Assume independence between the variations of p(n|c) and p(v|c). • The p(n|c) are the Bayes inverses of the memberships p(c|n). • The p(v|c) that maximize l(S) also minimize the average distortion between the cluster model and the data.
Entropy Cluster Membership cont. • Average cluster distortion: <D> = sum_n sum_{c in C} p(c|n) * D(p(.|n) || p(.|c)). • Entropy: H = -sum_n sum_{c in C} p(c|n) * log p(c|n).
Entropy Cluster Membership cont. • Class and membership distributions: p(c|n) = exp(-beta * D(p(.|n) || p(.|c))) / Z(n), with p(n|c) obtained by Bayes inversion using Z(c). • Z(c) and Z(n) are normalization sums. • The previous equations reduce the log-likelihood maximization to a fixed-point condition: at the maximum, the variation of l(S) vanishes.
KL Distortion • Minimize the KL distortion by setting the variation of the KL distances to zero. • The result is a weighted average of the noun distributions: p(v|c) = sum_n p(n|c) * p(v|n).
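One fixed-point iteration couples the Gibbs membership update from the previous slide with this weighted-average centroid update. A minimal numpy sketch, assuming a uniform noun marginal p(n) and a fixed beta:

```python
import numpy as np

def kl_rows(P, q):
    """D(P_i || q) for each row P_i of P; assumes q > 0 everywhere."""
    safe = np.where(P > 0, P, 1.0)                   # avoid log(0); those terms are 0
    return np.where(P > 0, P * np.log(safe / q), 0.0).sum(axis=1)

def update(p_n_v, centroids, beta):
    # Membership: p(c|n) = exp(-beta * D(p_n || p_c)) / Z(n)
    dist = np.stack([kl_rows(p_n_v, c) for c in centroids], axis=1)  # (nouns, clusters)
    w = np.exp(-beta * dist)
    p_c_given_n = w / w.sum(axis=1, keepdims=True)
    # Centroid: p(v|c) = sum_n p(n|c) * p(v|n); under a uniform p(n),
    # the Bayes inverse p(n|c) is p(c|n) normalized over nouns.
    p_n_given_c = p_c_given_n / p_c_given_n.sum(axis=0)
    centroids = p_n_given_c.T @ p_n_v                                # (clusters, verbs)
    return p_c_given_n, dist, centroids
```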
Free Energy Function • Combined minimum-distortion and maximum-entropy estimation is equivalent to minimizing the free energy F = <D> - H/beta. • F determines <D> and H through partial derivatives: <D> = d(beta*F)/d(beta) and H = beta^2 * dF/d(beta). • The minimum of F balances the disordering maximum-entropy term against the ordering minimum-distortion term.
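Given the membership matrix and per-pair distortions returned by the sketch above, F is two sums; a minimal sketch (the small constant guarding log(0) is an implementation choice, not from the paper):

```python
import numpy as np

def free_energy(p_c_given_n, dist, beta):
    avg_D = (p_c_given_n * dist).sum()                        # <D>, distortion term
    H = -(p_c_given_n * np.log(p_c_given_n + 1e-12)).sum()    # membership entropy
    return avg_D - H / beta                                   # F = <D> - H/beta
```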
Hierarchical Clustering • The number of clusters is determined through a sequence of increases in beta. • Higher beta implies a more local influence of each noun on the definition of centroids. • Start with low beta and a single cluster c in C. • Search for the lowest beta that splits c into two or more leaf clusters. • Repeat until |C| reaches the desired size.
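A sketch of this annealing loop, reusing update() and kl_rows() from the earlier sketch; the beta schedule, perturbation size, separation threshold, and iteration count are hypothetical choices, not the paper's:

```python
import numpy as np

def anneal(p_n_v, max_clusters, rng=np.random.default_rng(0)):
    centroids = p_n_v.mean(axis=0, keepdims=True)        # start: one cluster, low beta
    beta = 1.0
    while len(centroids) < max_clusters:
        beta *= 1.5                                      # raise beta until clusters split
        noise = rng.uniform(0.99, 1.01, size=(2 * len(centroids), centroids.shape[1]))
        twins = np.repeat(centroids, 2, axis=0) * noise  # perturb each centroid into twins
        twins /= twins.sum(axis=1, keepdims=True)
        for _ in range(50):                              # iterate updates to a fixed point
            memb, dist, twins = update(p_n_v, twins, beta)
        keep = [twins[0]]                                # drop twins that never separated
        for c in twins[1:]:
            if all(kl_rows(c[None, :], k)[0] > 1e-3 for k in keep):
                keep.append(c)
        centroids = np.array(keep)
    return centroids
```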
Experimental Results • Task: classify the 64 nouns appearing most frequently as direct objects of the verb 'fire' in 1988 Associated Press newswire text, with |V| = 2147 verbs. • For the first splits, report the four words most similar to each cluster centroid and their KL distances. • Split 1: a cluster of 'fire' as discharging weapons vs. a cluster of 'fire' as dismissing employees. • Split 2: weapons as projectiles vs. weapons as guns.
Conclusions • Clustering is efficient, informative, and yields good predictions. • Future work: make the clustering method more rigorous; introduce human judgment, i.e. a more supervised approach; extend the model to other word relationships.