CS728 Web Clustering II Lecture 14
K-Means • Assumes documents are real-valued vectors. • Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c: $\mu(c) = \frac{1}{|c|} \sum_{x \in c} x$ • Reassignment of instances to clusters is based on distance to the current cluster centroids. • (Or one can equivalently phrase it in terms of similarities)
K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, … sk} as seeds.
Until clustering converges or other stopping criterion:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj), the centroid of cj
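The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's own code: the names `sq_dist`, `centroid`, `kmeans`, `max_iter` and `seed` are assumptions, the distance d is taken to be squared Euclidean distance, and an empty cluster simply keeps its old seed.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    m = len(points)
    return [sum(col) / m for col in zip(*points)]

def kmeans(points, k, max_iter=100, seed=0):
    """Return (assignment, seeds); assignment[i] is the cluster index of points[i]."""
    rng = random.Random(seed)
    seeds = [list(p) for p in rng.sample(points, k)]  # k random instances as seeds
    assignment = None
    for _ in range(max_iter):
        # Reassignment: each instance goes to the cluster with the nearest seed.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(x, seeds[j])) for x in points
        ]
        if new_assignment == assignment:  # partition unchanged: converged
            break
        assignment = new_assignment
        # Update: move each seed to the centroid of its cluster.
        for j in range(k):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old seed
                seeds[j] = centroid(members)
    return assignment, seeds

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, seeds = kmeans(points, k=2)
```

On this toy data the two well-separated groups of three points should end up in different clusters; the stopping rule used here is the "doc partition unchanged" criterion.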
K-Means Example (K=2) • [Figure, panels in order: pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged.]
Termination conditions • Several possibilities, e.g., • A fixed number of iterations. • Doc partition unchanged. • Centroid positions don’t change. Does this mean that the docs in a cluster are unchanged?
Convergence • Why should the K-means algorithm ever reach a fixed point? • A state in which clusters don’t change. • K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm. • EM is known to converge. • Number of iterations could be large.
Convergence of K-Means • Define goodness measure of cluster k as sum of squared distances from cluster centroid: • $G_k = \sum_i (v_i - c_k)^2$ (sum over all $v_i$ in cluster k) • $G = \sum_k G_k$ • Reassignment never increases G, since each vector is assigned to the closest centroid. • Recomputation never increases any $G_k$ ($m_k$ is the number of members in cluster k), since for each component n: • $\sum_i (v_{in} - a)^2$ reaches its minimum where the derivative with respect to a vanishes: • $\sum_i -2(v_{in} - a) = 0$
Convergence of K-Means • $\sum_i -2(v_{in} - a) = 0$ • $\sum_i v_{in} = \sum_i a$ • $m_k a = \sum_i v_{in}$ • $a = (1/m_k) \sum_i v_{in} = c_{kn}$ • K-means typically converges quite quickly • But convergence is only to a local minimum
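The result of the derivation, that the mean minimizes the sum of squared deviations in each component, can be checked numerically. The values below are made up for illustration, and `sum_sq` is a hypothetical helper name.

```python
def sum_sq(values, a):
    """Sum of squared deviations of the values from a candidate centre a."""
    return sum((v - a) ** 2 for v in values)

values = [1.0, 2.0, 6.0]            # one component of the vectors in a cluster
mean = sum(values) / len(values)    # a = (1/m_k) * sum of the values

# Moving a away from the mean in either direction increases the sum.
candidates = [mean - 1.0, mean - 0.1, mean + 0.1, mean + 1.0]
mean_is_best = all(sum_sq(values, mean) < sum_sq(values, a) for a in candidates)
```

This checks only a handful of candidates, so it illustrates the claim rather than proving it; the proof is the derivative argument in the slides.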
Linear Time Complexity • Assume computing distance between two instances is O(m) where m is the dimensionality of the vectors. • Reassigning clusters: O(kn) distance computations, or O(knm). • Computing centroids: each instance vector gets added once to some centroid: O(nm). • Assume these two steps are each done once per iteration, for i iterations: O(iknm). • Linear in all relevant factors, assuming a fixed number of iterations; more efficient than hierarchical agglomerative methods, which need on the order of n^2 distances.
Seed Choice • Results can vary based on random seed selection. • Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings. • Select good seeds using a heuristic (e.g., doc least similar to any existing mean) • Try out multiple starting points • Initialize with the results of another method.
Example showing sensitivity to seeds • [Figure: six points labeled A to F; not recovered.] • If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}. • If you start with D and F, you converge to {A,B,D,E} and {C,F}. • Exercise: find a good approach for choosing starting points.
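One way to read the "doc least similar to any existing mean" heuristic is a farthest-first pass: start from one instance, then repeatedly add the instance that is farthest from all seeds chosen so far. The sketch below assumes squared Euclidean distance as the (inverse) similarity and uses the hypothetical names `farthest_first_seeds` and `first`; it is one possible answer to the exercise, not the intended one.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_seeds(points, k, first=0):
    """Pick k seeds: points[first], then repeatedly the point farthest from all seeds."""
    seeds = [points[first]]
    while len(seeds) < k:
        # Distance from each point to its nearest already-chosen seed.
        nearest = [min(sq_dist(p, s) for s in seeds) for p in points]
        # The point with the largest such distance is least similar to every seed.
        seeds.append(points[nearest.index(max(nearest))])
    return seeds

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
seeds = farthest_first_seeds(points, k=2)
```

A known weakness of this choice is that it tends to pick outliers as seeds, which is one reason to combine it with trying multiple starting points.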
How Many Clusters? • Number of clusters k is given • Partition n docs into predetermined number of clusters • Finding the “right” number of clusters is part of the problem • Given docs, partition into an “appropriate” number of subsets. • E.g., for query results - ideal value of k not known up front - though UI may impose limits. • Can usually take an algorithm for one flavor and convert to the other.
k not specified in advance • Say, the results of a query. • Solve an optimization problem: penalize having lots of clusters • application dependent, e.g., compressed summary of search results list. • Tradeoff between having more clusters (better focus within each cluster) and having too many clusters
k not specified in advance • Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid • Define the Total Benefit to be the sum of the individual doc Benefits. Why is there always a clustering of Total Benefit n?
Penalize lots of clusters • For each cluster, we have a Cost C. • Thus for a clustering with k clusters, the Total Cost is kC. • Define the Value of a clustering to be Total Benefit − Total Cost. • Find the clustering of highest Value, over all choices of k. • Total Benefit increases with increasing k, but we can stop when it doesn’t increase by “much”. The Cost term enforces this.
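The Value of a clustering can be computed directly from these definitions. The sketch below is illustrative only: `cosine`, `clustering_value` and `cost_per_cluster` are assumed names, documents are assumed to be non-zero vectors, and the toy data and the cost C = 0.5 are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clustering_value(docs, assignment, cost_per_cluster):
    """Total Benefit (sum of cosine similarities to centroids) minus k * C."""
    clusters = {}
    for doc, label in zip(docs, assignment):
        clusters.setdefault(label, []).append(doc)
    total_benefit = 0.0
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total_benefit += sum(cosine(d, centroid) for d in members)
    return total_benefit - cost_per_cluster * len(clusters)

docs = [[1, 0], [1, 0], [0, 1], [0, 1]]
C = 0.5
one_cluster = clustering_value(docs, [0, 0, 0, 0], C)
two_clusters = clustering_value(docs, [0, 0, 1, 1], C)
singletons = clustering_value(docs, [0, 1, 2, 3], C)
```

With every document in its own cluster, each document is its own centroid, so the Total Benefit equals n; the Cost term is what makes the two-cluster solution score higher than the singletons here.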
K-means issues, variations, etc. • Recomputing the centroid after every assignment (rather than after all points are re-assigned) can improve speed of convergence of K-means • Assumes clusters are spherical in vector space • Sensitive to coordinate changes, weighting etc. • Disjoint and exhaustive • Doesn’t have a notion of “outliers”
Soft Clustering • Clustering typically assumes that each instance is given a “hard” assignment to exactly one cluster. • Does not allow uncertainty in class membership or for an instance to belong to more than one cluster. • Soft clustering gives probabilities that an instance belongs to each of a set of clusters. • Each instance is assigned a probability distribution across a set of discovered categories (probabilities of all categories must sum to 1).
Expectation Maximization: Background • Assume the data comes from a mixture of distribution models. • How do we classify points and estimate the parameters of the models in the mixture at the same time? (Chicken and egg problem) • In a mixture of models, two targets are intertwined: • The parameters of the models • The assignment of each data point to the process that generates it
Intuition behind EM • Each step is easy assuming the other is solved • Knowing the assignment of each data point, we can estimate the parameters • Knowing the parameters of the distributions, we can assign each point to a model (e.g., by MLE) • This is what K-Means does
Key Factor in EM • Adaptive hard clustering: k-means. Each point is assigned to only one class at each step. • Adaptive soft clustering: EM. Each point is assigned to each class with a probability equal to the relative likelihood of that point belonging to the class.
Structure of EM Algorithm • Really a large class of algorithms • Initialization: pick start values for parameters • Iterate until parameters converge: • Expectation (E) step: calculate the responsibilities (weights) of each model for every data point, given the current parameters • Maximization (M) step: maximize a log-likelihood function, weighted by the responsibilities from the E step, to update the parameters of the models
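The E and M steps can be illustrated on a deliberately simple model. The sketch below assumes a one-dimensional mixture of equal-weight Gaussians with a fixed, shared standard deviation, so the only parameters are the component means; this model, the function name `em_gaussian_mixture_1d`, and the toy data are assumptions for illustration, not the text model used later in the lecture.

```python
import math

def em_gaussian_mixture_1d(data, means, n_iter=50, sigma=1.0):
    """EM for a 1-D mixture of equal-weight Gaussians with fixed, shared sigma."""
    means = list(means)  # initialization: start values for the parameters
    resp = []
    for _ in range(n_iter):
        # E step: responsibility of each component for each data point,
        # i.e. the relative likelihood of the point under each component.
        resp = []
        for x in data:
            likes = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in means]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M step: each mean becomes the responsibility-weighted average of the data,
        # which maximizes the weighted log-likelihood for this model.
        for j in range(len(means)):
            weight = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / weight
    return means, resp

data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
means, resp = em_gaussian_mixture_1d(data, means=[1.0, 9.0])
```

Each row of `resp` is a probability distribution over the components, which is the soft assignment; replacing it with a 1 for the most likely component and 0 elsewhere would turn the loop into k-means. The sketch runs a fixed number of iterations and does not guard against points so far from every mean that all likelihoods underflow to zero.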
Model based clustering and EM • Gives a soft variant of the K-means algorithm • Assume k clusters: {c1, c2, … ck} • Assume a probabilistic model of categories that allows computing P(ci | D) for each category, ci, for a given example document, D. • For text, typically assume a naïve Bayes category model. • Model Parameters • P(ci) – fraction of docs in class ci • P(wj | ci) – chance of seeing word wj in a doc in class ci • Naïve Bayes assumes conditional independence to simplify combined evidence calculations
EM Algorithm for text clustering • Iterative method for learning a probabilistic categorization model from unsupervised data. • Initially assume random assignment of docs to categories. • Learn an initial probabilistic model by estimating model parameters from this randomly labeled data. • Iterate the following two steps until convergence: • Expectation (E-step): Compute P(ci | D) for each example given the current model, and probabilistically re-label the examples based on these posterior probability estimates. • Maximization (M-step): Re-estimate the model parameters from the probabilistically re-labeled data.
EM Experiment on Web Docs [Soumen Chakrabarti] • Semi-supervised: corpus of labeled and unlabeled data • Take labeled corpus D, and randomly select a subset DK as a test set. • Use the set DU of unlabeled documents in the EM procedure. • Correct classification of a document <=> concealed class label = class with largest probability • Accuracy with unlabeled documents > accuracy without unlabeled documents • Keeping labeled set of same size • EM beats (supervised) naïve Bayes with same size of labeled document set • Largest boost for small size of labeled set • Comparable or poorer performance of EM for large labeled sets
Increasing DU while holding DK fixed also shows the advantage of using large unlabeled sets in the EM-like algorithm.
Purity
Summary • Covered two types of clustering • Flat, partitional clustering • Hierarchical, agglomerative clustering • Not covered – Spectral clustering based on eigenvectors • How many clusters? • Key issues • Representation of data points • Similarity/distance measure • HAC – simple but requires n^2 distances • K-means: the basic partitional algorithm – linear time • Model-based clustering and EM estimation
Scientific evaluation of clustering • Perhaps the most substantive issue in CS research, and clustering in particular: • how do you measure goodness? • Most measures focus on computational efficiency • Time and space • For application of clustering to web search: • Measure retrieval effectiveness
Approaches to evaluation • Anecdotal • User inspection • Ground “truth” comparison • Cluster retrieval • Purely quantitative measures • Probability of generating clusters found • Average distance between cluster members • Microeconomic / utility
Anecdotal evaluation • Probably the commonest (and surely the easiest) • “I wrote this clustering algorithm and look what it found!” • No benchmarks, no comparison possible • Any clustering algorithm will pick up the easy stuff like partition by languages • Generally, unclear scientific value.
User inspection • Induce a set of clusters or a navigation tree • Have subject matter experts evaluate the results and score them • some degree of subjectivity • Often combined with search results clustering • Not clear how reproducible across tests. • Expensive / time-consuming
Ground “truth” comparison • Take a union of docs from a taxonomy & cluster • Yahoo!, ODP, newspaper sections … • Compare clustering results to baseline • e.g., 80% of the clusters found map “cleanly” to taxonomy nodes • How would we measure this? • But is it the “right” answer? • There can be several equally right answers • For the docs given, the static prior taxonomy may be incomplete/wrong in places • the clustering algorithm may have gotten right things not in the static taxonomy “Subjective”
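One common way to score how cleanly clusters map onto taxonomy nodes is purity: give each cluster its majority ground-truth label and count the fraction of docs that carry their cluster's majority label. This is offered as one possible answer to "How would we measure this?", not necessarily the measure the lecture had in mind; the function name `purity` and the toy labels are assumptions.

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Fraction of docs whose true label is the majority label of their cluster."""
    by_cluster = {}
    for c, t in zip(cluster_labels, true_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    # For each cluster, count the docs that carry its most frequent true label.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)

clusters = [0, 0, 0, 1, 1, 1]
taxonomy = ["a", "a", "b", "b", "b", "c"]
score = purity(clusters, taxonomy)
```

Purity is 1 when every cluster contains docs from a single taxonomy node, but it also reaches 1 trivially when every doc is its own cluster, so it should be read alongside the number of clusters.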
Ground truth comparison • Divergent goals • Static taxonomy designed to be the “right” navigation structure • somewhat independent of corpus at hand • Clusters found have to do with vagaries of corpus • Also, docs put in a taxonomy node may not be the most representative ones for that topic • cf Yahoo!
Microeconomic viewpoint • Any algorithm - including clustering - is only as good as the economic utility it provides • For clustering: net economic gain produced by an approach (vs. another approach) • Strive for a concrete optimization problem • Carefully chosen performance metric • Examples • recommendation systems • need similarity/distance measure • clock time for interactive search • Expensive to test