CS728 Clustering the Web Lecture 13
What is clustering? • Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects • It is the commonest form of unsupervised learning • Unsupervised learning = learning from raw data, as opposed to supervised learning, where the correct classification of examples is given • It is a common and important task that finds many applications in Web Science, IR and other places
Why cluster web documents? • Whole web navigation: better user interface • Improving recall in web search: better search results • Better navigation of search results: effective "user recall" will be higher • Speeding up retrieval: faster search
Yahoo! Tree Hierarchy • [Figure: a fragment of the Yahoo! directory tree rooted at www.yahoo.com/Science, with children such as agriculture, biology, physics, CS, space, and grandchildren such as dairy, crops, agronomy, botany, cell, evolution, forestry, magnetism, relativity, craft, missions, AI, HCI, courses] • CS Research Question: Given a set of related objects (e.g., webpages), find the best tree decomposition. Best: helps a user find/retrieve the object of interest.
Scatter/Gather: a method for browsing a large collection (SIGIR '92), Cutting, Karger, Pedersen, and Tukey • Users browse a document collection interactively by selecting subsets of documents that are re-clustered on the fly
For improving search recall • Cluster hypothesis: documents with similar text are related • Therefore, to improve search recall: • Cluster docs in corpus a priori • When a query matches a doc D, also return other docs in the cluster containing D • The hope: the query "car" will also return docs containing automobile, because clustering grouped together docs containing car with those containing automobile
For better navigation of search results • For grouping search results thematically • clusty.com / Vivisimo
For better navigation of search results • And more visually: Kartoo.com
Defining What Is Good Clustering • Internal criterion: A good clustering will produce high quality clusters in which: • the intra-class (that is, intra-cluster) similarity is high • the inter-class similarity is low • The measured quality of a clustering depends on both the document representation and the similarity measure used • External criterion: The quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes • Assessable with gold standard data
External Evaluation of Cluster Quality • Assesses clustering with respect to ground truth • Assume that there are C gold standard classes, while our clustering algorithm produces k clusters π1, π2, …, πk, where cluster πi has ni members • Simple measure: purity, the number of members of the dominant class in cluster πi divided by the size of the cluster: Purity(πi) = (1/ni) maxj |πi ∩ cj| • Others are entropy of classes in clusters (or mutual information between classes and clusters)
Purity • [Figure: three example clusters of points drawn from three classes] • Cluster I: Purity = 1/6 · max(5, 1, 0) = 5/6 • Cluster II: Purity = 1/6 · max(1, 4, 1) = 4/6 • Cluster III: Purity = 1/5 · max(2, 0, 3) = 3/5
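A minimal sketch of this computation in Python (the per-cluster class counts are read off the example above; the class labels themselves are assumptions):

# Purity: for each cluster, take the count of its dominant class and divide by
# the cluster size; overall purity weights each cluster by its size.

# Rows = clusters, columns = counts per gold-standard class.
counts = [
    [5, 1, 0],   # Cluster I
    [1, 4, 1],   # Cluster II
    [2, 0, 3],   # Cluster III
]

for name, row in zip(["I", "II", "III"], counts):
    print(f"Cluster {name}: purity = {max(row)}/{sum(row)} = {max(row) / sum(row):.2f}")

total = sum(sum(row) for row in counts)
overall = sum(max(row) for row in counts) / total
print(f"Overall purity = {overall:.2f}")   # (5 + 4 + 3) / 17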
Issues for clustering • Representation for clustering • Document representation • Vector space? Normalization? • Need a notion of similarity/distance • How many clusters? • Fixed a priori? • Completely data driven? • Avoid “trivial” clusters - too large or small • In an application, if a cluster's too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.
What makes docs "related"? • Ideal: semantic similarity. • Practical: statistical similarity • We will typically use cosine similarity (which coincides with the Pearson correlation for mean-centered vectors). • Docs as vectors. • For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs. • We will describe algorithms in terms of cosine similarity.
Recall doc as vector • Each doc j is a vector of tfidf values, one component for each term. • Can normalize to unit length. • So we have a vector space • terms are axes - aka features • n docs live in this space • even with stemming, may have 20,000+ dimensions • do we really want to use all terms? • Different from using vector space for search. Why?
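A small sketch of docs-as-vectors with cosine similarity, using scikit-learn's TfidfVectorizer purely as one convenient tf-idf implementation (the toy documents are assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the car was parked in the garage",
    "a new automobile model was announced",
    "cell biology and evolution courses",
]

# Each doc becomes a sparse tf-idf vector; rows are L2-normalized by default,
# so dot products between rows are cosine similarities.
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)                 # (number of docs, number of term axes)
print(cosine_similarity(X))    # pairwise doc-doc similarity matrix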
Intuition • [Figure: documents D1-D4 plotted as points in a three-term vector space with axes t1, t2, t3] • Postulate: Documents that are "close together" in vector space talk about the same things.
Clustering Algorithms • Partitioning “flat” algorithms • Usually start with a random (partial) partitioning • Refine it iteratively • k means/medoids clustering • Model based clustering • Hierarchical algorithms • Bottom-up, agglomerative • Top-down, divisive
k-Clustering Algorithms • Given: a set of documents and the number k • Find: a partition into k clusters that optimizes the chosen partitioning criterion • Globally optimal: exhaustively enumerate all partitions • Effective heuristic methods: • Iterative k-means and k-medoids algorithms • Hierarchical methods – stop at level with k parts
How hard is clustering? • One idea is to consider all possible clusterings, and pick the one that has best inter and intra cluster distance properties • Suppose we are given n points, and would like to cluster them into k-clusters • How many possible clusterings?
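For the record (standard combinatorics, not from the slides): the number of partitions of n points into exactly k non-empty clusters is the Stirling number of the second kind, which grows roughly like k^n/k!, so exhaustive enumeration is hopeless even for modest n:

$$S(n,k) \;=\; \frac{1}{k!}\sum_{j=0}^{k}(-1)^{j}\binom{k}{j}(k-j)^{n} \;\approx\; \frac{k^{n}}{k!}, \qquad \text{e.g. } S(20,4)\approx 4.5\times 10^{10}.$$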
Clustering Criteria: Maximum Spacing • [Figure: a k = 4 clustering of points, with the spacing marked between the two closest clusters] • Spacing between clusters is defined as the minimum distance between any pair of points in different clusters. • Clustering of maximum spacing: given an integer k, find a k-clustering of maximum spacing.
Greedy Clustering Algorithm • Single-link k-clustering algorithm. • Form a graph on the vertex set U, corresponding to n clusters. • Find the closest pair of objects such that each object is in a different cluster, and add an edge between them. • Repeat n-k times until there are exactly k clusters. • Key observation. This procedure is precisely Kruskal's algorithm for Minimum-Cost Spanning Tree (except we stop when there are k connected components). • Remark. Equivalent to finding an MST and deleting the k-1 most expensive edges.
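A minimal sketch of this greedy/Kruskal procedure using a union-find structure (Python; the toy points and Euclidean distance are assumptions):

import itertools, math

def single_link_k_clustering(points, k):
    """Kruskal-style greedy: merge the closest inter-cluster pair until k clusters remain."""
    n = len(points)
    parent = list(range(n))

    def find(i):                                   # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All O(n^2) candidate edges, sorted by Euclidean distance (Kruskal order).
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in itertools.combinations(range(n), 2))

    clusters = n
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                               # endpoints already in the same cluster
        if clusters == k:
            # First skipped merge = min distance between different clusters = spacing.
            return [find(x) for x in range(n)], d
        parent[ri] = rj                            # merge the two clusters
        clusters -= 1
    return [find(x) for x in range(n)], None       # only reached when k == 1

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
labels, spacing = single_link_k_clustering(points, k=3)
print(labels, spacing)                             # 3 clusters and their (max) spacing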
Greedy Clustering Analysis • Theorem. Let C* denote the clustering C*1, …, C*k formed by deleting the k-1 most expensive edges of an MST. C* is a k-clustering of max spacing. • Pf. Let C denote some other clustering C1, …, Ck. • The spacing of C* is the length d* of the (k-1)st most expensive edge. • Since C is not C*, there is a pair pi, pj in the same cluster in C*, say C*r, but in different clusters in C, say Cs and Ct. • Some edge (p, q) on the pi-pj path in C*r spans two different clusters in C. • All edges on the pi-pj path have length ≤ d*, since Kruskal chose them. • Spacing of C is ≤ d*, since p and q are in different clusters. ▪
Hierarchical Agglomerative Clustering (HAC) • Greedy is one example of HAC • Assumes a similarity function for determining the similarity of two instances. • Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. • The history of merging forms a binary tree or hierarchy.
A Dendrogram: Hierarchical Clustering • Dendrogram: decomposes the data objects into several levels of nested partitioning (a tree of clusters). • A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
HAC Algorithm Start with all instances in their own cluster. Until there is only one cluster: Among the current clusters, determine the two clusters, ci and cj, that are most similar. Replace ci and cj with a single cluster ci ∪ cj. (A naive code sketch of this loop follows below.)
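The naive O(n³) version of that loop, sketched in Python (the similarity function and toy data are assumptions; linkage=max gives single-link, linkage=min complete-link):

def hac(items, sim, linkage=max):
    """Naive HAC: start with singletons, repeatedly merge the most similar pair.
    Returns the merge history, which encodes the binary tree of merges."""
    clusters = [[i] for i in range(len(items))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):             # find the most similar pair of clusters
            for b in range(a + 1, len(clusters)):
                s = linkage(sim(items[i], items[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        history.append((clusters[a], clusters[b], s))
        clusters[a] = clusters[a] + clusters[b]    # replace ci and cj with ci ∪ cj
        del clusters[b]
    return history

points = [0.0, 0.1, 0.9, 1.0, 5.0]                 # 1-D toy "documents"
for left, right, s in hac(points, lambda x, y: -abs(x - y)):
    print(left, "+", right, "at similarity", round(s, 2))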
Hierarchical Clustering algorithms • Agglomerative (bottom-up): • Start with each document as its own cluster. • Eventually all documents belong to the same cluster. • Divisive (top-down): • Start with all documents in the same cluster. • Eventually each document forms a cluster on its own. • Does not require the number of clusters k in advance • Needs a termination/readout condition • The final state in both the agglomerative and the divisive case (one all-inclusive cluster, or all singletons) is of no use.
Dendrogram: Document Example • As clusters agglomerate, docs likely to fall into a hierarchy of "topics" or concepts. • [Figure: dendrogram over docs d1-d5, with merges d1,d2 and d4,d5, then d3,d4,d5]
Hierarchical Clustering • Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest? • Max spacing • Measure intercluster distances by distances of nearest pairs. • Euclidean spacing • each cluster has a centroid = average of its points. • Measure intercluster distances by distances of centroids.
“Closest pair” of clusters • Many variants to defining closest pair of clusters • Single-link or max spacing • Similarity of the most similar pair (same as Kruskal’s MST algorithm) • “Center of gravity” • Clusters whose centroids (centers of gravity) are the most similar • Average-link • Average similarity between pairs of elements • Complete-link • Similarity of the “furthest” points, the least similar
Impact of Choice of Similarity Measure • Single-Link Clustering • Can result in "straggly" (long and thin) clusters due to the chaining effect. • Appropriate in some domains, such as clustering islands: "Hawaii clusters" • Uses min distance / max similarity update • After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is: sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))
Complete-Link Clustering • Use minimum similarity (max distance) of pairs: sim(ci, cj) = min over x in ci, y in cj of sim(x, y) • Makes "tighter," more spherical clusters that are sometimes preferable. • After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is: sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))
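A small numeric check of the two update rules, stated in distance form (single-link takes the min over the two old clusters, complete-link the max); the example points are assumptions:

import math

A = [(0, 0), (0, 1)]        # cluster ci
B = [(5, 0)]                # cluster cj
C = [(2, 2), (3, 2)]        # another cluster ck

def pair_dists(X, Y):
    return [math.dist(x, y) for x in X for y in Y]

merged = A + B              # ci ∪ cj
print("single-link  :", min(pair_dists(merged, C)),
      "= min(", min(pair_dists(A, C)), ",", min(pair_dists(B, C)), ")")
print("complete-link:", max(pair_dists(merged, C)),
      "= max(", max(pair_dists(A, C)), ",", max(pair_dists(B, C)), ")")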
Computational Complexity • In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²). • In each of the subsequent n-2 merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters (similarities between the unchanged clusters can simply be stored). • In order to maintain an overall O(n²) performance, computing the similarity to each other cluster must be done in constant time. • Otherwise it is O(n² log n), or O(n³) if done naively.
Key notion: cluster representative • We want a notion of a representative point in a cluster • The representative should be some sort of "typical" or central point in the cluster, e.g., • the point inducing the smallest radius over the docs in the cluster • or the smallest sum of squared distances, etc. • the point that is the "average" of all docs in the cluster • Centroid or center of gravity
Example: n=6, k=3, closest pair of centroids • [Figure: docs d1-d6 in the plane, showing the centroid after the first merge step and the centroid after the second]
Outliers in centroid computation • Can ignore outliers when computing the centroid. • What is an outlier? • Lots of statistical definitions, e.g. the moment of a point to the centroid > M × some cluster moment. • [Figure: a cluster with its centroid and one distant outlier point]
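A minimal sketch of ignoring outliers when computing a centroid. The specific rule (drop points whose squared distance to the provisional centroid exceeds M times the cluster's mean squared distance) and the threshold M=2 are assumptions standing in for the "moment" criterion above:

import numpy as np

def robust_centroid(X, M=2.0):
    """Centroid that ignores points whose 'moment' (squared distance to the
    provisional centroid) exceeds M times the mean moment of the cluster."""
    X = np.asarray(X, dtype=float)
    c = X.mean(axis=0)                      # provisional centroid, outliers included
    sq = ((X - c) ** 2).sum(axis=1)         # squared distance of each point to it
    keep = sq <= M * sq.mean()              # drop points with moment > M * mean moment
    return X[keep].mean(axis=0)

cluster = [(0, 0), (0.2, 0.1), (-0.1, 0.3), (0.1, -0.2), (8, 9)]   # last point is an outlier
print(robust_centroid(cluster))             # close to (0.05, 0.05); the outlier is ignored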
Group-Average Clustering • Uses average similarity across all pairs within the merged cluster to measure the similarity of two clusters. • Compromise between single and complete link. • Two options: • Averaged across all ordered pairs in the merged cluster • Averaged over all pairs between the two original clusters • Some previous work has used one of these options; some the other. No clear difference in efficacy
Computing Group Average Similarity • Assume cosine similarity and vectors normalized to unit length. • Always maintain the sum of vectors in each cluster, s(c) = Σ over d in c of d. • Compute the similarity of a merged cluster in constant time: sim(ci ∪ cj) = [(s(ci) + s(cj)) · (s(ci) + s(cj)) - (Ni + Nj)] / [(Ni + Nj)(Ni + Nj - 1)], where Ni = |ci| (this is the version averaged over all pairs within the merged cluster).
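A sketch verifying that the cached-sum formula matches a brute-force average over all pairs of distinct docs in the merged cluster (random unit vectors as an assumed toy example):

import numpy as np

def group_average_sim(sum_i, n_i, sum_j, n_j):
    """Average pairwise cosine similarity inside the merged cluster, computed
    in O(dim) time from cached sum vectors of unit-length docs."""
    s = sum_i + sum_j
    n = n_i + n_j
    return (s @ s - n) / (n * (n - 1))

rng = np.random.default_rng(1)
ci = rng.normal(size=(3, 5)); ci /= np.linalg.norm(ci, axis=1, keepdims=True)
cj = rng.normal(size=(4, 5)); cj /= np.linalg.norm(cj, axis=1, keepdims=True)

docs = np.vstack([ci, cj])
pairwise = docs @ docs.T
n = len(docs)
brute = (pairwise.sum() - n) / (n * (n - 1))     # exclude the n self-pairs

fast = group_average_sim(ci.sum(axis=0), len(ci), cj.sum(axis=0), len(cj))
print(round(brute, 6), round(fast, 6))           # identical values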
Medoid As Cluster Representative • The centroid does not have to be a document. • Medoid: A cluster representative that is one of the documents • For example: the document closest to the centroid • One reason this is useful • Consider the representative of a large cluster (>1000 documents) • The centroid of this cluster will be a dense vector • The medoid of this cluster will be a sparse vector • Compare: mean/centroid vs. median/medoid
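A tiny sketch of picking the medoid as the document closest to the centroid (unit-length rows assumed; the toy term-vector matrix is an assumption):

import numpy as np

def medoid(X):
    """Index of the document closest to the cluster centroid.
    X: (n_docs, n_terms) array of unit-length document vectors."""
    centroid = X.mean(axis=0)           # dense, even when the docs are sparse
    sims = X @ centroid                 # cosine similarity of each doc to the centroid
    return int(np.argmax(sims))

X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.9, 0.1, 0.0, 0.0],
              [0.8, 0.2, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(medoid(X))                        # 1: a central doc, not the outlier at index 3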
Homework Exercise • Consider different agglomerative clustering methods of n points on a line. Explain how you could avoid O(n³) distance computations - how many will your scheme use?
Efficiency: "Using approximations" • In the standard algorithm, must find the closest pair of centroids at each step • Approximation: instead, find a nearly closest pair • use some data structure that makes this approximation easier to maintain • simplistic example: maintain the closest pair based on distances in the projection onto a random line • [Figure: points projected onto a random line]
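A rough sketch of that random-line idea: project all centroids onto one random direction and only compare centroids adjacent in that 1-D order, giving a cheap approximation to the closest pair (everything here, including the data, is an assumption):

import numpy as np

def approx_closest_pair(centroids, rng):
    """Approximate closest pair of centroids via projection onto a random line."""
    C = np.asarray(centroids, dtype=float)
    direction = rng.normal(size=C.shape[1])
    direction /= np.linalg.norm(direction)
    order = np.argsort(C @ direction)              # sort by 1-D projection
    # Only check pairs adjacent in the projected order: O(n log n), not O(n^2).
    best = None
    for a, b in zip(order[:-1], order[1:]):
        d = np.linalg.norm(C[a] - C[b])
        if best is None or d < best[0]:
            best = (d, int(a), int(b))
    return best

rng = np.random.default_rng(42)
centroids = rng.normal(size=(100, 20))
print(approx_closest_pair(centroids, rng))         # (distance, i, j), nearly closest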
The dual space • So far, we clustered docs based on their similarities in term space • For some applications, e.g., topic analysis for inducing navigation structures, can “dualize”: • use docs as axes • represent users or terms as vectors • proximity based on co-occurrence of usage • now clustering users or terms, not docs
Next time • Iterative clustering using K-means • Spectral clustering using eigenvalues