Clustering • Supervised vs. Unsupervised Learning • Examples of clustering in Web IR • Characteristics of clustering • Clustering algorithms • Cluster Labeling
Supervised vs. Unsupervised Learning • Supervised Learning • Goal: A program that performs a task as well as humans do. • TASK – well defined (the target function) • EXPERIENCE – training data provided by a human • PERFORMANCE – error/accuracy on the task • Unsupervised Learning • Goal: To find some kind of structure in the data. • TASK – vaguely defined • No EXPERIENCE • No PERFORMANCE (but there are some evaluation metrics)
What is Clustering? • Clustering is the most common form of Unsupervised Learning • Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects • It can be used in IR: • To improve recall in search applications • For better navigation of search results
Example 1: Improving Recall • Cluster hypothesis - Documents with similar text are related • Thus, when a query matches a document D, also return other documents in the cluster containing D.
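The recall-improving use of the cluster hypothesis can be sketched in a few lines. This is a hypothetical helper, assuming a precomputed assignment of documents to clusters; the names `cluster_of` and `members_of` are illustrative, not from the lecture.

```python
# Hypothetical sketch: when a query matches a document, also return the
# other documents in that document's cluster (cluster hypothesis).
def expand_with_clusters(matched_docs, cluster_of, members_of):
    """Return the matched docs plus all other docs in their clusters."""
    expanded = list(matched_docs)
    seen = set(matched_docs)
    for d in matched_docs:
        for other in members_of[cluster_of[d]]:
            if other not in seen:
                seen.add(other)
                expanded.append(other)
    return expanded

# Toy assignment: d1 and d2 share a cluster, d3 is alone.
cluster_of = {"d1": 0, "d2": 0, "d3": 1}
members_of = {0: ["d1", "d2"], 1: ["d3"]}
result = expand_with_clusters(["d1"], cluster_of, members_of)
print(result)  # ['d1', 'd2']
```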
Clustering Characteristics • Flat versus Hierarchical Clustering • Flat means dividing objects in groups (clusters) • Hierarchical means organize clusters in a subsuming hierarchy • Evaluating Clustering • Internal Criteria • The intra-cluster similarity is high (tightness) • The inter-cluster similarity is low (separateness) • External Criteria • Did we discover the hidden classes? (we need gold standard data for this evaluation)
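The internal criteria above (tightness and separateness) can be measured directly on vectors. A minimal sketch, assuming cosine similarity and two toy clusters; the helper names are illustrative.

```python
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def avg_pairwise_sim(vectors):
    """Average cosine over all pairs within one cluster (tightness)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Two toy clusters pointing in different directions.
cluster_a = [[1.0, 0.1], [0.9, 0.2]]
cluster_b = [[0.1, 1.0], [0.2, 0.9]]
intra = avg_pairwise_sim(cluster_a)  # high: intra-cluster similarity
inter = sum(cosine(u, v) for u in cluster_a for v in cluster_b) \
        / (len(cluster_a) * len(cluster_b))  # low: inter-cluster similarity
```

A good flat clustering makes `intra` high and `inter` low; external evaluation would instead compare the assignment against gold-standard classes.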
Clustering for Web IR • Representation for clustering • Document representation • Vector space? Normalization? • Need a notion of similarity/distance • How many clusters? • Fixed a priori? • Completely data driven? • Avoid “trivial” clusters - too large or small
Recall documents as vectors • Each doc j is a vector of tf-idf values, one component for each term. • Can normalize to unit length. • So we have a vector space • terms are axes - aka features • n docs live in this space • even with stemming, we may have 20,000+ dimensions
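The representation above can be sketched without any libraries. This is a minimal, illustrative tf-idf construction (using `log(N/df)` as the idf, one common variant), with each vector normalized to unit length.

```python
import math

def tfidf_vectors(docs):
    """Build tf-idf vectors (one component per term), normalized to unit length.
    `docs` is a list of token lists; uses idf = log(N / df), a common variant."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocab}  # document frequency
    vectors = []
    for d in docs:
        v = [d.count(t) * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v] if norm else v)
    return vocab, vectors

docs = [["web", "search", "web"], ["search", "engine"], ["cat", "dog"]]
vocab, vecs = tfidf_vectors(docs)
```

In a real Web IR setting the vocabulary runs to tens of thousands of terms, so sparse representations are used instead of dense lists.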
What makes documents related? • Ideal: semantic similarity. • Practical: statistical similarity, with documents represented as vectors. • We will describe the algorithms in terms of cosine similarity, also known as the normalized inner product.
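Cosine similarity, the normalized inner product mentioned above, is short enough to write out in full:

```python
def cosine(u, v):
    """Cosine similarity: the inner product of u and v, normalized
    by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine([1, 0], [1, 0]))  # 1.0 (identical direction)
print(cosine([1, 0], [0, 1]))  # 0.0 (orthogonal)
```

For unit-length (normalized) document vectors the cosine reduces to a plain dot product, which is why normalization is convenient.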
Intuition for relatedness • Documents that are “close together” in vector space talk about the same things. [Figure: documents D1–D4 plotted against term axes t1 and t2]
Clustering Algorithms • Partitioning “flat” algorithms • Usually start with a random (partial) partitioning • Refine it iteratively • k-means clustering • Model based clustering (we will not cover it) • Hierarchical algorithms • Bottom-up, agglomerative • Top-down, divisive (we will not cover it)
Partitioning “flat” algorithms • Partitioning method: Construct a partition of n documents into a set of k clusters • Given: a set of documents and the number k • Find: a partition of k clusters that optimizes the chosen partitioning criterion
K-means • Assumes documents are real-valued vectors. • Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c: μ(c) = (1/|c|) Σx∈c x • Reassignment of instances to clusters is based on distance to the current cluster centroids.
K-Means Algorithm Let d be the distance measure between instances. Select k random instances {s1, s2, … sk} as seeds. Until clustering converges or another stopping criterion holds: For each instance xi: Assign xi to the cluster cj such that d(xi, sj) is minimal. (Update the seeds to the centroid of each cluster) For each cluster cj: sj = μ(cj)
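The algorithm above can be sketched directly. A minimal k-means using squared Euclidean distance on list-of-list points; parameter names and the convergence test (centroids stop moving) follow the slide, everything else is illustrative.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch: assign each point to the nearest centroid,
    then move each centroid to the mean of its cluster; repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random instances as seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: centroid = mean of the cluster (keep old if empty).
        new = [[sum(xs) / len(c) for xs in zip(*c)] if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:  # converged: centroid positions did not change
            break
        centroids = new
    return centroids, clusters

pts = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
cents, clus = kmeans(pts, 2)
```

On these two well-separated pairs the algorithm recovers the obvious partition regardless of which points are drawn as seeds.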
K-means: Different Issues • When to stop? • When a fixed number of iterations is reached • When centroid positions do not change • Seed Choice • Results can vary based on random seed selection. • Try out multiple starting points • Example of sensitivity to seeds (points A–F in the figure): starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.
Hierarchical clustering • Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples. [Example dendrogram: animal → vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]
Hierarchical Agglomerative Clustering • We assume there is a similarity function that determines the similarity of two instances. Algorithm: Start with all instances in their own cluster. Until there is only one cluster: Among the current clusters, determine the two clusters, ci and cj, that are most similar. Replace ci and cj with a single cluster ci ∪ cj
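The HAC loop above can be sketched naively (recomputing pairwise similarities at every merge, so this is O(n^3); real implementations cache the similarity matrix). The `linkage` parameter anticipates the next slide: `max` gives single-link, `min` complete-link. `sim(a, b)` is any pairwise similarity, an assumption of this sketch.

```python
def hac(items, sim, linkage=max):
    """Naive agglomerative clustering. Returns the merge history as
    (cluster_i, cluster_j, similarity) tuples, most similar pairs first."""
    clusters = [[x] for x in items]
    merges = []
    while len(clusters) > 1:
        # Find the most similar pair of clusters under the chosen linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(sim(a, b)
                            for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [clusters[i] + clusters[j]]
    return merges

# 1-D toy data; similarity = negative distance, so closer is more similar.
merges = hac([0.0, 0.1, 5.0], lambda a, b: -abs(a - b))
```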
What is the most similar cluster? • Single-link • Similarity of the closest pair (the most cosine-similar) • Complete-link • Similarity of the “furthest” pair, the least cosine-similar • Group-average agglomerative clustering • Average cosine between pairs of elements • Centroid clustering • Similarity of the clusters’ centroids
Single link clustering 1) Use maximum similarity of pairs: sim(ci, cj) = max over x∈ci, y∈cj of sim(x, y) 2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is: sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))
Complete link clustering 1) Use minimum similarity of pairs: sim(ci, cj) = min over x∈ci, y∈cj of sim(x, y) 2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is: sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))
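The two merge update rules are easy to check on a toy similarity table; the values below are made up purely for illustration.

```python
# Toy symmetric similarity table over three clusters ci, cj, ck.
sim = {
    ("ci", "ck"): 0.8,
    ("cj", "ck"): 0.3,
}

# Single link: merged cluster's similarity to ck is the MAX of the two.
single = max(sim[("ci", "ck")], sim[("cj", "ck")])
# Complete link: merged cluster's similarity to ck is the MIN of the two.
complete = min(sim[("ci", "ck")], sim[("cj", "ck")])
print(single, complete)  # 0.8 0.3
```

This is why single-link tends to produce long “chained” clusters (one close pair suffices to merge), while complete-link produces tight, compact ones (every pair must be close).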
Major issue - labeling • After the clustering algorithm finds clusters, how can they be useful to the end user? • Need a concise label for each cluster • In search results, say “Animal” or “Car” in the jaguar example. • In topic trees (e.g., Yahoo), we need navigational cues. • Often done by hand, a posteriori.
How to Label Clusters • Show titles of typical documents • Titles are easy to scan • Authors create them for quick scanning! • But you can only show a few titles, which may not fully represent the cluster • Show words/phrases prominent in the cluster • More likely to fully represent the cluster • Use distinguishing words/phrases • But harder to scan
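The “distinguishing words” idea can be sketched with a simple score: words frequent inside the cluster but rare outside it. The scoring formula here is a hypothetical illustration, not a standard from the lecture.

```python
from collections import Counter

def label_cluster(cluster_docs, all_docs, top=3):
    """Pick words frequent in the cluster but rare elsewhere.
    Score = in-cluster count / (1 + out-of-cluster count) -- illustrative only."""
    inside = Counter(w for d in cluster_docs for w in d)
    outside_docs = [d for d in all_docs if d not in cluster_docs]
    outside = Counter(w for d in outside_docs for w in d)
    scores = {w: c / (1 + outside[w]) for w, c in inside.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda t: -t[1])[:top]]

# The jaguar example: two car docs, two animal docs.
docs = [["jaguar", "car", "engine"], ["jaguar", "speed", "car"],
        ["jaguar", "cat", "habitat"], ["jaguar", "animal", "cat"]]
labels = label_cluster(docs[2:], docs)
```

Note that “jaguar” itself scores low: it appears in both clusters, so it does not distinguish the animal cluster from the car cluster.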
Not covered in this lecture • Complexity: • Clustering is computationally expensive; implementations need to balance clustering quality against running time. • How to decide how many clusters are best? • Evaluating the “goodness” of clustering • There are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clusters.