CS728: Clustering the Web (Lecture 13)
What is clustering? • Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects • It is the most common form of unsupervised learning • Unsupervised learning = learning from raw data, as opposed to supervised learning, where the correct classification of examples is given • Clustering is a common and important task with many applications in Web science, IR, and elsewhere
Why cluster web documents? • Whole-web navigation: a better user interface • Improving recall in web search: better search results • Better navigation of search results: effective "user recall" will be higher • Speeding up retrieval: faster search
Yahoo! Tree Hierarchy • (Figure: the www.yahoo.com/Science subtree, with top-level categories such as agriculture, biology, physics, CS, and space, each subdivided further into topics such as dairy, crops, botany, cell, magnetism, AI, and missions.) • Research question: given a set of related objects (e.g., webpages), find the best tree decomposition, where "best" means it helps a user find/retrieve the object of interest.
Scatter/Gather: Method for Browsing a Large Collection (SIGIR '92), Cutting, Karger, Pedersen, and Tukey. Users browse a document collection interactively by selecting subsets of documents that are re-clustered on the fly.
For improving search recall • Cluster hypothesis: documents with similar text are related • Therefore, to improve search recall: • Cluster docs in the corpus a priori • When a query matches a doc D, also return other docs in the cluster containing D • The hope: the query "car" will also return docs containing "automobile", because clustering grouped docs containing "car" together with those containing "automobile".
For better navigation of search results • For grouping search results thematically • clusty.com / Vivisimo
For better navigation of search results • And more visually: Kartoo.com
Defining What Is Good Clustering • Internal criterion: A good clustering will produce high quality clusters in which: • the intra-class (that is, intra-cluster) similarity is high • the inter-class similarity is low • The measured quality of a clustering depends on both the document representation and the similarity measure used • External criterion: The quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes • Assessable with gold standard data
External Evaluation of Cluster Quality • Assesses clustering with respect to ground truth • Assume there are C gold-standard classes, while our clustering algorithm produces k clusters π1, π2, …, πk with n1, n2, …, nk members. • Simple measure: purity, the number of members of the dominant class in cluster πi divided by the size ni of cluster πi • Other measures: entropy of classes within clusters (or mutual information between classes and clusters)
Purity example (three clusters) • Cluster I: purity = (1/6) max(5, 1, 0) = 5/6 • Cluster II: purity = (1/6) max(1, 4, 1) = 4/6 • Cluster III: purity = (1/5) max(2, 0, 3) = 3/5
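As a concrete companion to the example above, here is a minimal Python sketch of the purity computation (the class labels 'x', 'o', 'd' and all names are illustrative, not from the lecture):

```python
from collections import Counter

def purity(cluster_ids, gold_labels):
    """Purity = (1/N) * sum over clusters of the size of the dominant gold class."""
    per_cluster = {}
    for c, y in zip(cluster_ids, gold_labels):
        per_cluster.setdefault(c, Counter())[y] += 1
    # Each cluster contributes the count of its majority class.
    dominant = sum(max(counts.values()) for counts in per_cluster.values())
    return dominant / len(gold_labels)

# The three-cluster example above: 5/6, 4/6, 3/5 per cluster.
cluster_ids = [1] * 6 + [2] * 6 + [3] * 5
gold_labels = (['x'] * 5 + ['o']) + (['x'] + ['o'] * 4 + ['d']) + (['x'] * 2 + ['d'] * 3)
print(purity(cluster_ids, gold_labels))   # (5 + 4 + 3) / 17, about 0.71
```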
Issues for clustering • Representation for clustering • Document representation • Vector space? Normalization? • Need a notion of similarity/distance • How many clusters? • Fixed a priori? • Completely data driven? • Avoid “trivial” clusters - too large or small • In an application, if a cluster's too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.
What makes docs “related”? • Ideal: semantic similarity. • Practical: statistical similarity • We will typically use cosine similarity (Pearson correlation). • Docs as vectors. • For many algorithms, easier to think in terms of a distance (rather than similarity) between docs. • We will describe algorithms in terms of cosine similarity.
Recall doc as vector • Each doc j is a vector of tfidf values, one component for each term. • Can normalize to unit length. • So we have a vector space • terms are axes - aka features • n docs live in this space • even with stemming, may have 20,000+ dimensions • do we really want to use all terms? • Different from using vector space for search. Why?
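As a small illustration of the representation (the toy corpus and function names are mine, not the lecture's), documents can be turned into unit-length tf-idf vectors so that cosine similarity is just a dot product:

```python
import math
from collections import Counter

docs = [
    "car insurance auto insurance".split(),
    "best car automobile prices".split(),
    "space missions to mars".split(),
]

# Inverse document frequency over the toy corpus.
df = Counter(term for doc in docs for term in set(doc))
idf = {t: math.log(len(docs) / d) for t, d in df.items()}

def tfidf_unit_vector(doc):
    """tf-idf weights for one document, normalized to unit length."""
    tf = Counter(doc)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(u, v):
    """For unit-length vectors, cosine similarity is the dot product."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

vectors = [tfidf_unit_vector(d) for d in docs]
print(cosine(vectors[0], vectors[1]))   # car-related docs: some similarity
print(cosine(vectors[0], vectors[2]))   # unrelated docs: 0.0
```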
Intuition • (Figure: documents D1–D4 plotted in a vector space with term axes t1, t2, t3.) • Postulate: documents that are "close together" in vector space talk about the same things.
Clustering Algorithms • Partitioning “flat” algorithms • Usually start with a random (partial) partitioning • Refine it iteratively • k means/medoids clustering • Model based clustering • Hierarchical algorithms • Bottom-up, agglomerative • Top-down, divisive
k-Clustering Algorithms • Given: a set of documents and the number k • Find: a partition into k clusters that optimizes the chosen partitioning criterion • Globally optimal: exhaustively enumerate all partitions • Effective heuristic methods: • Iterative k-means and k-medoids algorithms • Hierarchical methods – stop at level with k parts
How hard is clustering? • One idea: consider all possible clusterings and pick the one with the best inter- and intra-cluster distance properties • Suppose we are given n points and would like to cluster them into k clusters • How many possible clusterings are there? The number of partitions of n points into k non-empty clusters is the Stirling number of the second kind S(n, k), roughly k^n / k! for fixed k, so exhaustive search is infeasible.
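To make the blow-up concrete, here is a small sketch (names are illustrative) that counts k-clusterings via the standard Stirling recurrence S(n, k) = S(n−1, k−1) + k·S(n−1, k):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n labeled points into k non-empty clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # Point n either forms a new singleton cluster, or joins one of the
    # k clusters of a partition of the remaining n - 1 points.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

print(stirling2(10, 4))   # 34105
print(stirling2(20, 4))   # about 4.5e10 -- exhaustive enumeration is hopeless
```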
Clustering Criteria: Maximum Spacing • Spacing between clusters: the minimum distance between any pair of points lying in different clusters. • Maximum-spacing clustering: given an integer k, find a k-clustering of maximum spacing. • (Figure: a point set partitioned into k = 4 clusters, with the spacing indicated.)
Greedy Clustering Algorithm • Single-link k-clustering algorithm. • Form a graph on the vertex set U, corresponding to n clusters. • Find the closest pair of objects such that each object is in a different cluster, and add an edge between them. • Repeat n-k times until there are exactly k clusters. • Key observation. This procedure is precisely Kruskal's algorithm for Minimum-Cost Spanning Tree (except we stop when there are k connected components). • Remark. Equivalent to finding an MST and deleting the k-1 most expensive edges.
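A minimal sketch of this procedure, written as Kruskal's algorithm with an early stop (the union-find helper and the toy points are illustrative, not from the lecture):

```python
import math

def single_link_k_clustering(points, k):
    """Repeatedly merge the closest inter-cluster pair until k clusters remain."""
    n = len(points)
    parent = list(range(n))                  # union-find over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    # All pairwise distances in ascending order: Kruskal's edge ordering.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )

    clusters = n
    for _, i, j in edges:
        if clusters == k:                    # stop before dropping below k clusters
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            clusters -= 1

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(single_link_k_clustering(pts, 3))   # [[0, 1], [2, 3], [4]]
```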
Greedy Clustering Analysis • Theorem. Let C* denote the clustering C*1, …, C*k formed by deleting the k−1 most expensive edges of an MST. C* is a k-clustering of maximum spacing. • Pf. Let C denote some other clustering C1, …, Ck. • The spacing of C* is the length d* of the (k−1)st most expensive edge. • Since C is not C*, there is a pair pi, pj in the same cluster in C*, say C*r, but in different clusters in C, say Cs and Ct. • Some edge (p, q) on the pi–pj path in C*r spans two different clusters in C. • All edges on the pi–pj path have length ≤ d*, since Kruskal chose them. • The spacing of C is ≤ d*, since p and q are in different clusters. ▪
Hierarchical Agglomerative Clustering (HAC) • Greedy is one example of HAC • Assumes a similarity function for determining the similarity of two instances. • Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. • The history of merging forms a binary tree or hierarchy.
A Dendrogram: Hierarchical Clustering • Dendrogram: decomposes the data objects into several levels of nested partitioning (a tree of clusters). • A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
HAC Algorithm: Start with all instances in their own cluster. Until there is only one cluster: among the current clusters, determine the two clusters, ci and cj, that are most similar; replace ci and cj with a single cluster ci ∪ cj.
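A naive O(n³) sketch of this loop with a pluggable cluster-similarity rule (max over pairs gives single-link, min gives complete-link); all names and the toy data are illustrative:

```python
def hac(items, sim, cluster_sim=max):
    """Naive agglomerative clustering.

    items:       objects to cluster
    sim:         pairwise similarity between two items
    cluster_sim: max -> single-link, min -> complete-link
    Returns the merge history as a nested tuple of item indices.
    """
    clusters = [(i,) for i in range(len(items))]     # each item starts alone
    trees = list(range(len(items)))
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cluster_sim(sim(items[i], items[j])
                                for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        # Replace the pair ci, cj with the merged cluster ci U cj.
        clusters[a] = clusters[a] + clusters[b]
        trees[a] = (trees[a], trees[b])
        clusters.pop(b)
        trees.pop(b)
    return trees[0]

# Toy 1-D example with similarity = negative distance.
points = [0.0, 0.1, 2.0, 2.1, 9.0]
print(hac(points, lambda x, y: -abs(x - y)))   # (((0, 1), (2, 3)), 4)
```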
Hierarchical Clustering algorithms • Agglomerative (bottom-up): • Start with each document being a single cluster. • Eventually all documents belong to the same cluster. • Divisive (top-down): • Start with all documents belonging to the same cluster. • Eventually each document forms a cluster on its own. • Does not require the number of clusters k in advance • Needs a termination/readout condition • The final configuration in both the agglomerative and divisive cases is of no use by itself.
Dendrogram: Document Example • (Figure: a dendrogram over documents d1–d5, with merges such as {d1, d2}, {d4, d5}, and {d3, d4, d5}.) • As clusters agglomerate, docs are likely to fall into a hierarchy of "topics" or concepts.
Hierarchical Clustering • Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest? • Max spacing • Measure intercluster distances by distances of nearest pairs. • Euclidean spacing • each cluster has a centroid = average of its points. • Measure intercluster distances by distances of centroids.
“Closest pair” of clusters • Many variants to defining closest pair of clusters • Single-link or max spacing • Similarity of the most similar pair (same as Kruskal’s MST algorithm) • “Center of gravity” • Clusters whose centroids (centers of gravity) are the most similar • Average-link • Average similarity between pairs of elements • Complete-link • Similarity of the “furthest” points, the least similar
Impact of Choice of Similarity Measure • Single-link clustering • Can result in "straggly" (long and thin) clusters due to the chaining effect. • Appropriate in some domains, such as clustering islands: "Hawaii clusters" • Uses min distance / max similarity for the update • After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is: sim(ci ∪ cj, ck) = max(sim(ci, ck), sim(cj, ck))
Complete-Link Clustering • Uses the minimum similarity (max distance) over pairs: • Makes "tighter," more spherical clusters that are sometimes preferable. • After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is: sim(ci ∪ cj, ck) = min(sim(ci, ck), sim(cj, ck))
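These update rules are what allow the similarity of a newly merged cluster to every other cluster to be computed from stored values in constant time each, rather than recomputed over document pairs. A small sketch (the dictionary-of-dictionaries similarity table is my own illustrative choice):

```python
def merge_update(sim, ci, cj, linkage=max):
    """Update a cluster-similarity table after merging clusters ci and cj.

    sim:     dict of dicts, sim[a][b] = similarity between clusters a and b
    linkage: max -> single-link update, min -> complete-link update
    The merged cluster keeps ci's key; cj disappears.
    """
    for ck in list(sim):
        if ck in (ci, cj):
            continue
        s = linkage(sim[ci][ck], sim[cj][ck])   # constant time per other cluster
        sim[ci][ck] = sim[ck][ci] = s
        del sim[ck][cj]
    del sim[cj]
    sim[ci].pop(cj, None)

# Three clusters A, B, C; merge A and B under single-link.
sim = {
    "A": {"B": 0.9, "C": 0.2},
    "B": {"A": 0.9, "C": 0.5},
    "C": {"A": 0.2, "B": 0.5},
}
merge_update(sim, "A", "B", linkage=max)
print(sim)   # {'A': {'C': 0.5}, 'C': {'A': 0.5}}
```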
Computational Complexity • In the first iteration, all HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n²). • In each of the subsequent n−2 merging iterations, they must compute the similarity between the most recently created cluster and all other existing clusters (similarities among unmerged clusters are unchanged and can simply be stored). • To maintain overall O(n²) performance, computing the similarity to each other cluster must be done in constant time. • Otherwise it is O(n² log n), or O(n³) if done naively.
Key notion: cluster representative • We want a notion of a representative point in a cluster • The representative should be some sort of "typical" or central point in the cluster, e.g., • the point inducing the smallest radius over docs in the cluster • or the smallest sum of squared distances, etc. • the point that is the "average" of all docs in the cluster • Centroid or center of gravity
Example: n = 6, k = 3, merging the closest pair of centroids • (Figure: documents d1–d6, showing the centroid after the first merge step and after the second.)
Outliers in centroid computation • Can ignore outliers when computing the centroid. • What is an outlier? Lots of statistical definitions, e.g., a point whose moment relative to the centroid exceeds M times some cluster moment. • (Figure: a cluster with its centroid and one outlying point marked.)
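One way to realize this, sketched below with an illustrative threshold M: drop points whose distance to the provisional centroid exceeds M times the cluster's mean point-to-centroid distance, then recompute the centroid.

```python
import numpy as np

def robust_centroid(points, M=2.0):
    """Centroid that ignores outliers: exclude points farther than M times the
    mean point-to-centroid distance, then take the mean of what remains."""
    X = np.asarray(points, dtype=float)
    centroid = X.mean(axis=0)
    dists = np.linalg.norm(X - centroid, axis=1)
    keep = dists <= M * dists.mean()
    return X[keep].mean(axis=0)

cluster = [(0, 0), (1, 0), (0, 1), (1, 1), (50, 50)]   # one obvious outlier
print(robust_centroid(cluster))   # [0.5 0.5], not dragged toward (50, 50)
```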
Group-Average Clustering • Uses average similarity across all pairs within the merged cluster to measure the similarity of two clusters. • Compromise between single and complete link. • Two options: • Averaged across all ordered pairs in the merged cluster • Averaged over all pairs between the two original clusters • Some previous work has used one of these options; some the other. No clear difference in efficacy
Computing Group Average Similarity • Assume cosine similarity and vectors normalized to unit length. • Always maintain the sum of vectors in each cluster: s(c) = Σ_{d ∈ c} d. • Then the similarity of a merged cluster can be computed in constant time: sim(ci ∪ cj) = [ (s(ci) + s(cj)) · (s(ci) + s(cj)) − (ni + nj) ] / [ (ni + nj)(ni + nj − 1) ]
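A small sketch of this trick (NumPy and the function names are mine): given unit-length document vectors, the group-average similarity of a merged cluster needs only the stored sum vectors and cluster sizes.

```python
import numpy as np

def group_average_sim(sum_i, n_i, sum_j, n_j):
    """Average pairwise similarity within the merged cluster (self-pairs excluded),
    computed from sum vectors and sizes only.

    Assumes unit-length document vectors, so every self dot product equals 1."""
    s = sum_i + sum_j
    n = n_i + n_j
    return (s @ s - n) / (n * (n - 1))

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Two tiny clusters of unit vectors; maintain only their sums and sizes.
c1 = [unit([1.0, 0.0, 0.0]), unit([0.9, 0.1, 0.0])]
c2 = [unit([0.8, 0.2, 0.0])]
print(group_average_sim(sum(c1), len(c1), sum(c2), len(c2)))   # close to 1 here
```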
Medoid As Cluster Representative • The centroid does not have to be a document. • Medoid: A cluster representative that is one of the documents • For example: the document closest to the centroid • One reason this is useful • Consider the representative of a large cluster (>1000 documents) • The centroid of this cluster will be a dense vector • The medoid of this cluster will be a sparse vector • Compare: mean/centroid vs. median/medoid
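A minimal sketch of the "document closest to the centroid" choice (NumPy arrays stand in for sparse tf-idf vectors):

```python
import numpy as np

def medoid_index(doc_vectors):
    """Index of the document closest to the cluster centroid."""
    X = np.asarray(doc_vectors, dtype=float)
    centroid = X.mean(axis=0)             # dense, even if the docs are sparse
    dists = np.linalg.norm(X - centroid, axis=1)
    return int(dists.argmin())            # an actual document, hence sparse

docs = [(1.0, 0.0), (0.8, 0.2), (0.0, 1.0)]
print(medoid_index(docs))   # 1 -- the doc nearest the centroid (0.6, 0.4)
```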
Homework Exercise • Consider different agglomerative clustering methods for n points on a line. Explain how you could avoid n³ distance computations; how many does your scheme use?
Efficiency: "Using approximations" • In the standard algorithm, we must find the closest pair of centroids at each step • Approximation: instead, find a nearly closest pair • Use some data structure that makes this approximation easier to maintain • Simplistic example: maintain the closest pair based on distances in a projection onto a random line • (Figure: points projected onto a random line.)
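A toy sketch of the random-line idea (everything here is illustrative): project the centroids onto one random direction and take the closest pair of adjacent projections as an approximation of the closest pair.

```python
import numpy as np

def approx_closest_pair(centroids, seed=0):
    """Approximate the closest pair of centroids via projection on a random line."""
    rng = np.random.default_rng(seed)
    X = np.asarray(centroids, dtype=float)
    direction = rng.normal(size=X.shape[1])
    direction /= np.linalg.norm(direction)
    proj = X @ direction                     # 1-D coordinate along the line
    order = np.argsort(proj)
    # The true closest pair need not be adjacent in projection -- that is the
    # approximation -- but adjacent projections are cheap to maintain.
    gaps = np.diff(proj[order])
    i = int(gaps.argmin())
    return int(order[i]), int(order[i + 1])

centroids = [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9), (10, 0)]
print(approx_closest_pair(centroids))   # typically (0, 1) or (2, 3)
```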
The dual space • So far, we clustered docs based on their similarities in term space • For some applications, e.g., topic analysis for inducing navigation structures, can “dualize”: • use docs as axes • represent users or terms as vectors • proximity based on co-occurrence of usage • now clustering users or terms, not docs
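A tiny sketch of the dualization (the toy matrix is illustrative): transpose the document-term matrix so rows become terms described by the documents they occur in, then feed those rows to the same clustering machinery.

```python
import numpy as np

# Rows = documents, columns = terms (e.g., tf-idf weights).
doc_term = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.7],
])

# Dual view: rows = terms, described by the documents they occur in.
term_doc = doc_term.T

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Terms 0 and 1 co-occur in the same documents, so they are close in the dual
# space; the same clustering algorithms now group terms instead of documents.
print(cosine(term_doc[0], term_doc[1]))   # high
print(cosine(term_doc[0], term_doc[2]))   # 0.0
```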
Next time • Iterative clustering using K-means • Spectral clustering using eigenvalues