Distributional Clustering of English Words Fernando Pereira, AT&T Bell Laboratories; Naftali Tishby, Dept. of Computer Science, Hebrew University; Lillian Lee, Dept. of Computer Science, Cornell University. Presenter: Juan Ramos, Dept. of Computer Science, Rutgers University, juramos@cs.rutgers.edu
Overview • Purpose: evaluate a method for clustering words according to their distribution in particular syntactic contexts. • Methodology: find the lowest-distortion set of word clusters to build models of word co-occurrence.
Applications • Scientific POV: lexical acquisition, i.e. learning the sense structure of words from usage. • Practical POV: classification addresses data sparseness in grammar models. • Clusters are derived from a large corpus of documents.
Definitions • Context: the function of a given word in its sentence, e.g. a noun serving as a direct object. • Sense class: hidden model describing word association tendencies, expressed as a mix of clusters and cluster probabilities given a word. • Cluster: probabilistic concept of a sense class.
Problem Setting • Restrict the problem to verbs (V) and nouns (N) in the main-verb/direct-object relationship. • f(v, n) = frequency of occurrence of the verb-noun pair (v, n). • Text must be pre-formatted to fit these specifications. • For a given noun n, the conditional distribution over verbs is p(v|n) = f(v, n) / sum_v' f(v', n).
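To make this concrete, here is a minimal Python sketch (not from the paper; the example pairs are hypothetical) that derives p(v|n) from raw verb-object counts:

```python
# A minimal sketch: estimating p(v|n) = f(v, n) / sum_v' f(v', n)
# from verb-object counts. The pairs below are made up for illustration.
from collections import Counter, defaultdict

pairs = [("fire", "gun"), ("fire", "missile"), ("fire", "employee"),
         ("hire", "employee"), ("hire", "worker")]

f = Counter(pairs)                 # f[(v, n)]: frequency of the (verb, noun) pair
noun_total = defaultdict(int)      # sum_v' f(v', n) for each noun n
for (v, n), count in f.items():
    noun_total[n] += count

p_v_given_n = {(v, n): count / noun_total[n] for (v, n), count in f.items()}
print(p_v_given_n[("fire", "employee")])   # 0.5: "employee" split between fire/hire
```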
Problem Setting cont. • Goal: create a set C of clusters and membership probabilities p(c|n). • Each c in C is associated with a cluster centroid distribution p(v|c). • p(v|c) is an average, over the nouns n belonging to c, of the distributions p(v|n).
Distributional Similarity • Given two distributions p, q, the KL divergence is D(p || q) = sum_x p(x) log(p(x)/q(x)). • D(p || q) = 0 if and only if p = q. • Small D(p || q) implies the two distributions are likely instances of the same centroid. • D(p || q) measures the information lost by using the centroid p(v|c) in place of a noun's own distribution.
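A minimal sketch of this distortion measure, assuming q(x) > 0 wherever p(x) > 0 (which holds in the clustering because each centroid is an average that includes the noun's own distribution):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)); zero exactly when p == q."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {"gun": 0.7, "employee": 0.3}
q = {"gun": 0.5, "employee": 0.5}
print(kl_divergence(p, q))   # positive, since p != q
print(kl_divergence(p, p))   # 0.0
```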
Theoretical Foundation • Given unstructured sets V and N, and training data X of independent verb-noun pairs. • Problem: learn the joint distribution of the pairs given X. • Not quite unsupervised, not quite supervised: the pairs carry no internal structure, yet the underlying distribution must be learned.
Distributional Clustering • Approximately decompose p(v|n) into p'(v|n) = sum_{c in C} p(c|n) * p(v|c). • p(c|n) = membership probability of n in c. • p(v|c) = probability of v given the centroid for c. • Assuming the marginals p(n) and p'(n) coincide, the joint becomes p'(n, v) = sum_{c in C} p(c) * p(n|c) * p(v|c).
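The decomposition is just a matrix product over clusters; a minimal numpy sketch with hypothetical two-noun, two-cluster, two-verb values:

```python
import numpy as np

p_c_given_n = np.array([[0.9, 0.1],    # memberships of noun 0 in clusters 0, 1
                        [0.2, 0.8]])   # memberships of noun 1
p_v_given_c = np.array([[0.8, 0.2],    # centroid of cluster 0 over verbs
                        [0.1, 0.9]])   # centroid of cluster 1 over verbs

# p'(v|n) = sum_c p(c|n) * p(v|c), one row per noun
p_v_given_n = p_c_given_n @ p_v_given_c
print(p_v_given_n.sum(axis=1))         # rows sum to 1: valid distributions
```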
Maximum Likelihood Cluster Centroids • Chosen to maximize the goodness of fit between the data and p'(n, v). • For a sequence S of pairs, the model log-probability is l(S) = sum_{(n,v) in S} log p'(n, v). • Maximize with respect to p(n|c) and p(v|c) by setting the variation of l(S) to zero.
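Given the decomposition above and a noun marginal p(n), l(S) is a direct sum over the sample; a short sketch with hypothetical values (S is a made-up list of (noun, verb) index pairs):

```python
import numpy as np

p_n = np.array([0.6, 0.4])                            # hypothetical noun marginal
p_c_given_n = np.array([[0.9, 0.1], [0.2, 0.8]])
p_v_given_c = np.array([[0.8, 0.2], [0.1, 0.9]])
p_nv = p_n[:, None] * (p_c_given_n @ p_v_given_c)     # joint p'(n, v)

S = [(0, 0), (0, 0), (1, 1)]                          # observed (noun, verb) indices
log_lik = sum(np.log(p_nv[n, v]) for n, v in S)       # l(S) = sum log p'(n, v)
print(log_lik)
```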
Maximum Entropy Cluster Membership • Assume independence between the variations of p(n|c) and p(v|c). • The p(n|c) are the Bayes inverses of the memberships p(c|n). • The p(v|c) that maximize l(S) also minimize the average distortion between the cluster model and the data.
Entropy Cluster Membership cont. • Average cluster distortion: <D> = sum_n sum_{c in C} p(c|n) * D(p(.|n) || p(.|c)). • Entropy: H = -sum_n sum_{c in C} p(c|n) * log p(c|n).
Entropy Cluster Membership cont. • Class and membership distributions: p(c|n) = exp(-beta * D(p(.|n) || p(.|c))) / Z(n), with p(n|c) obtained by Bayes inversion using Z(c). • Z(c) and Z(n) are normalization sums. • The previous equations reduce the log-likelihood maximization to a fixed-point condition: at the maximum, the variation of l(S) vanishes.
KL Distortion • Minimize the KL distortion by setting the variation of the KL distances to zero. • The result is a weighted average of the noun distributions: p(v|c) = sum_n p(n|c) * p(v|n).
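One fixed-point iteration couples the Gibbs membership update from the previous slide with this weighted-average centroid update. A minimal numpy sketch, assuming a uniform noun marginal p(n) and a fixed beta:

```python
import numpy as np

def kl_rows(P, q):
    """D(P_i || q) for each row P_i of P; assumes q > 0 everywhere."""
    safe = np.where(P > 0, P, 1.0)                   # avoid log(0); those terms are 0
    return np.where(P > 0, P * np.log(safe / q), 0.0).sum(axis=1)

def update(p_n_v, centroids, beta):
    # Membership: p(c|n) = exp(-beta * D(p_n || p_c)) / Z(n)
    dist = np.stack([kl_rows(p_n_v, c) for c in centroids], axis=1)  # (nouns, clusters)
    w = np.exp(-beta * dist)
    p_c_given_n = w / w.sum(axis=1, keepdims=True)
    # Centroid: p(v|c) = sum_n p(n|c) * p(v|n); under a uniform p(n),
    # the Bayes inverse p(n|c) is p(c|n) normalized over nouns.
    p_n_given_c = p_c_given_n / p_c_given_n.sum(axis=0)
    centroids = p_n_given_c.T @ p_n_v                                # (clusters, verbs)
    return p_c_given_n, dist, centroids
```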
Free Energy Function • Combined minimum-distortion and maximum-entropy estimation is equivalent to minimizing the free energy F = <D> - H/beta. • F determines <D> and H through partial derivatives: <D> = d(beta*F)/d(beta) and H = beta^2 * dF/d(beta). • The minimum of F balances the disordering maximum-entropy term against the ordering minimum-distortion term.
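Given the membership matrix and per-pair distortions returned by the sketch above, F is two sums; a minimal sketch (the small constant guarding log(0) is an implementation choice, not from the paper):

```python
import numpy as np

def free_energy(p_c_given_n, dist, beta):
    avg_D = (p_c_given_n * dist).sum()                        # <D>, distortion term
    H = -(p_c_given_n * np.log(p_c_given_n + 1e-12)).sum()    # membership entropy
    return avg_D - H / beta                                   # F = <D> - H/beta
```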
Hierarchical Clustering • The number of clusters is determined through a sequence of increases in beta. • Higher beta implies a more local influence of each noun on the definition of centroids. • Start with low beta and a single cluster c in C. • Search for the lowest beta that splits c into two or more leaf clusters. • Repeat until |C| reaches the desired size.
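A sketch of this annealing loop, reusing update() and kl_rows() from the earlier sketch; the beta schedule, perturbation size, separation threshold, and iteration count are hypothetical choices, not the paper's:

```python
import numpy as np

def anneal(p_n_v, max_clusters, rng=np.random.default_rng(0)):
    centroids = p_n_v.mean(axis=0, keepdims=True)        # start: one cluster, low beta
    beta = 1.0
    while len(centroids) < max_clusters:
        beta *= 1.5                                      # raise beta until clusters split
        noise = rng.uniform(0.99, 1.01, size=(2 * len(centroids), centroids.shape[1]))
        twins = np.repeat(centroids, 2, axis=0) * noise  # perturb each centroid into twins
        twins /= twins.sum(axis=1, keepdims=True)
        for _ in range(50):                              # iterate updates to a fixed point
            memb, dist, twins = update(p_n_v, twins, beta)
        keep = [twins[0]]                                # drop twins that never separated
        for c in twins[1:]:
            if all(kl_rows(c[None, :], k)[0] > 1e-3 for k in keep):
                keep.append(c)
        centroids = np.array(keep)
    return centroids
```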
Experimental Results • Task: classify the 64 nouns appearing most frequently as direct objects of the verb 'fire' in 1988 Associated Press newswire text, with |V| = 2147 verbs. • For the first splits, report the four words most similar to each cluster centroid and their KL distances. • Split 1: a cluster of 'fire' as discharging weapons vs. a cluster of 'fire' as dismissing employees. • Split 2: weapons as projectiles vs. weapons as guns.
Conclusions • Clustering is efficient, informative, and yields good predictions. • Future work: make the clustering method more rigorous; introduce human judgment, i.e. a more supervised approach; extend the model to other word relationships.