420 likes | 530 Views
Lecture 7: Clustering and Decomposition of a Document to Authorial Components. Moshe Koppel and Navot Akiva. Clustering. The idea is to define “natural” clusters Position in space depends on feature set Need to have some notion of what you want in order to choose right feature types
E N D
Lecture 7: Clustering and Decomposition of a Document to Authorial Components Moshe Koppel and NavotAkiva
Clustering • The idea is to define “natural” clusters • Position in space depends on feature set • Need to have some notion of what you want in order to choose right feature types • e.g., cluster by topic: content words • e.g., cluster by author: style features • With many features, clusters aren’t so elegant
How Many Clusters? • Some methods decide in advance • Some methods try to optimize k based on data • e.g., in previous example, would want 3 clusters • Caution: some optimization criteria biased towards one giant cluster or many tiny ones
Flat Clustering: K-Means Algorithm Select K random docs {s1, s2,… sK} as seeds. Until clustering converges or other stopping criterion: For each doc di: Assign di to the cluster cjsuch that dist(xi, sj) is minimal. For each cluster cj: sj= (cj)(Update the seeds to the centroid of each cluster)
Pick seeds Reassign clusters Compute centroids Reassign clusters x x Compute centroids x x x x K-Means Example (K=2) Reassign clusters Converged!
Termination conditions • Several possibilities, e.g., • A fixed number of iterations. • Doc partition unchanged. • Centroid positions don’t change.
Convergence • Why should the K-means algorithm ever reach a fixed point? • A state in which clusters don’t change. • K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm. • EM is known to converge. • Number of iterations could be large.
Convergence of K-Means • Define goodness measure of cluster k as sum of squared distances from cluster centroid: Gk = Σi (di – ck)2 (sum over all di in cluster k) G = Σk Gk • Reassignment monotonically decreases G since each vector is assigned to the closest centroid.
Dependence on Seed Choice • Results depend on random seed selection. • Some tricks for avoiding bad seeds: • Choose seed far from any already chosen • Try out multiple seed sets If you start with B and E as centroids you converge to {A,B,C} and {D,E,F} If you start with D and F you converge to {A,B,D,E} and {C,F}
A Problem… • The methods we’ve discussed so far don’t always give very satisfying results. • Have a look at this example….
Spectral Methods • Because of examples like that, spectral methods have been developed. • Here’s the idea: Imagine weighted edges between points, where weight reflects similarity. Zero out all edges except k nearest neighbors with high weights. • Now make the “cheapest” cuts that result in separate components.
Spectral Clustering Properly formulated, the min Ncut problem is NP-hard. But there are some nice linear algebra tricks we can use to get reasonable approximate solutions: • Pre-processing • Construct a matrix representation of the dataset. • Decomposition • Compute eigenvalues and eigenvectors of the matrix. • Map each point to a lower-dimensional representation based on one or more eigenvectors. • Grouping • Assign points to two or more clusters, based on the new representation.
New Problem: Separating Document Components • Often documents consist of multiple authorial components. • Our object is to tease apart the components of a composite document.
Basic Idea • Divide the document into natural chunks (e.g., chapters, paragraphs) • Vectorize chunks using some feature set • Cluster the vectors
Classic Example: The Bible • Great historic and cultural interest • Much prior research on components • Has been manually tagged in every conceivable way
Test Case Let’s throw the chapters of Ezekiel and Jeremiah into a hat and see if we can separate them out. • Each is presumably the work of a single (primary) author. • There’s no reason to think it’s the same author. • They are thematically quite similar, though far from identical.
Clustering Jeremiah+Ezekiel • Chunks = chapters (no sequence info) • Features = bag of words • Cluster method = Ncut • # clusters = 2
Results using all words • Not too exciting. We must be picking up thematic or genre-related differences that cross books. • Let’s try using only function words.
Results using function words • Not any better. • Let’s try a new approach.
A Better Idea • Exploit the fact that different authors use different synonyms for the same idea (e.g., makel/mateh). • It would be really convenient if it turned out that Jeremiah and Ezekiel made consistently different choices for various synsets. • Note that words aren’t synonyms, rather word senses are synonyms. (For example mateh=staff is a synonym of makel, but mateh=tribe is not.)
Automatically Finding Synonyms • There are various clever methods for identifying synsets, but most are not exact enough for our purpose. • Conveniently, for the Bible, we have many useful tools, including careful translations and manual sense tagging. • We identify as synonyms word senses that are translated into the same English word (e.g., makel=staff and mateh=staff). (We even get sense disambiguation for free.) • Due to polysemy (in English), this method overshoots. We manually delete mistakes. (This is the only manual intervention we will ever do.)
Synonym Method • The usual similarity measures (e.g., cosine, inverse Euclidean distance) aren’t quite right. • If two docs use the same synonym, they are similar. • If two docs use opposite synonyms, they are different. • If one of the docs uses one of the synonyms, but the other doesn’t, cosine would regard them as different. But are they? • For measuring similarity, we only consider synsets represented in both docs.
Results using synonyms • Now we’re getting somewhere. • But we’re not done yet…
Core of a Cluster • Some chapters are in a cluster because they really belong there; some just have to be somewhere. • Let’s consider only chapters near one centroid and far from the other. These are the “cores” of the respective clusters. 36 2 12 3 36 7 4
Cluster cores Ezekiel 1, 10
Expanding the Core • Now that we have a core, we can use supervised methods (e.g., SVM) to learn a boundary. • In fact, we can use function words as our features. • Using synonyms will just get us back where we started. • And besides, FW are generally very reliable for supervised authorship attribution.
SVM expansion of core • Two training examples are “misclassed” by SVM. • Incredibly, these are Ezekiel 1 and 10, which were part of Jeremiah core, but are on Ezekiel side of optimal SVM boundary. • The only exception is Ezekiel 42, a non-core chapter which lies in the SVM margin.
Decomposing Unsegmented Text • Until now, we’ve assumed that our units (chapters) were all pure Ezekiel or pure Jeremiah. That’s a major unrealistic assumption. • Let’s create a munged “Jer-iel” book by alternately taking a random number of verses from each book until we run out of verses. • Can we tease that apart?
Chunks Might Be Mixed xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Chunks Might Be Mixed xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Decomposing Unsegmented Text • Start with some arbitrary chunking of the munged text. (Some of these chunks will be mixed.) • Now proceed exactly as above, until you have cluster cores. • With a bit of luck, the chunks in the core will be the “pure” chunks.
A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Decomposing Unsegmented Text • As above, use SVM to learn a classifier, using the core chunks as training examples. • But at the last step, classify individual verses instead of chapters. • Some smoothing techniques are necessary to: • Deal with the sparseness of features in individual verses • Take advantage of serial correlation
A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Multiple Books • The same method works if we create a munged book out of k>2 constituent books. • BUT, we need to be given the k. • So we don’t yet know how many components there are, only how to best split into a given number of components.
Open Question • Can we extend this method so it can be used also for texts that don’t have canonical translations like the Bible? • Can we use this method to identify plagiarized sections of a text?