Clustering Paolo Ferragina Dipartimento di Informatica Università di Pisa
Objectives of Cluster Analysis • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups • Competing objectives: intra-cluster distances are minimized, inter-cluster distances are maximized • Clustering is the commonest form of unsupervised learning
Notion of similarity/distance • Ideal: semantic similarity • Practical: term-statistical similarity • Docs as vectors • Similarity = cosine similarity, LSH,…
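To make the vector-space notion above concrete, here is a minimal sketch (not from the original slides) that builds term-frequency vectors for two toy documents and computes their cosine similarity; the toy documents, the vocabulary construction, and the `cosine_sim` helper are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def cosine_sim(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy documents (hypothetical), represented as term-frequency vectors
docs = ["clustering groups similar documents",
        "clustering finds groups of related documents"]
vocab = sorted({w for d in docs for w in d.split()})
counts = [Counter(d.split()) for d in docs]
vectors = [np.array([c[t] for t in vocab], dtype=float) for c in counts]

print(cosine_sim(vectors[0], vectors[1]))  # in [0, 1] for non-negative tf vectors
```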
Clustering Algorithms • Flat algorithms • Create a set of clusters • Usually start with a random (partial) partitioning • Refine it iteratively • K-means clustering • Hierarchical algorithms • Create a hierarchy of clusters (dendrogram) • Bottom-up, agglomerative • Top-down, divisive
Hard vs. soft clustering • Hard clustering: each document belongs to exactly one cluster • More common and easier to do • Soft clustering: each document can belong to more than one cluster • Makes more sense for applications like creating browsable hierarchies • News is a typical example • Search results are another example
Flat & Partitioning Algorithms • Given: a set of n documents and the number K • Find: a partition into K clusters that optimizes the chosen partitioning criterion • Globally optimal solution: intractable for many objective functions, since it would require exhaustively enumerating all partitions • Locally optimal solution: effective heuristic methods such as the K-means and K-medoids algorithms
Sec. 16.4 K-Means • Assumes documents are real-valued vectors • Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c: μ(c) = (1/|c|) Σ_{x ∈ c} x • Reassignment of instances to clusters is based on distance to the current cluster centroids
Sec. 16.4 K-Means Algorithm
Select K random docs {s1, s2, …, sK} as seed centroids.
Until the clustering converges (or another stopping criterion holds):
  For each doc di:
    Assign di to the cluster cr such that dist(di, sr) is minimal.
  For each cluster cj:
    sj = μ(cj)
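A minimal NumPy sketch of the loop above, under the assumption of Euclidean distance and random seeding; the `kmeans` function and the toy data at the end are illustrative, not part of the original slides.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    # X: (N, M) array of N documents as M-dimensional real-valued vectors
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # K random docs as seeds
    for _ in range(n_iters):
        # Reassignment: each doc goes to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Recomputation: each centroid becomes the mean of the docs assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(K)])
        if np.allclose(new_centroids, centroids):  # converged: centroids unchanged
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated blobs, K = 2
X = np.vstack([np.random.randn(50, 2) + 5, np.random.randn(50, 2) - 5])
labels, centroids = kmeans(X, K=2)
```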
Sec. 16.4 K-Means Example (K=2) • Pick seeds • Reassign clusters • Compute centroids • Reassign clusters • Compute centroids • … • Converged! [Figure: points and centroids moving over the iterations]
Sec. 16.4 Termination conditions • Several possibilities, e.g., • A fixed number of iterations. • Doc partition unchanged. • Centroid positions don’t change.
Sec. 16.4 Time Complexity • There are K centroids • Each doc/centroid consists of M dimensions • Computing the distance between two vectors takes O(M) time • Reassigning clusters: each doc is compared with all centroids, O(KNM) time • Computing centroids: each doc gets added once to some centroid, O(NM) time • Assuming these two steps are each done once per iteration, for I iterations: O(IKNM)
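As a rough, illustrative instance of the O(IKNM) bound (the numbers are not from the slides): with N = 10,000 docs, M = 1,000 dimensions, K = 10 clusters and I = 20 iterations, the reassignment step dominates and I · K · N · M = 20 · 10 · 10,000 · 1,000 = 2 · 10^9 basic operations.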
How Many Clusters? • Number of clusters K is given • Partition n docs into predetermined number of clusters • Finding the “right” number of clusters is part of the problem • Can usually take an algorithm for one flavor and convert to the other.
Bisecting K-means • Variant of K-means that can produce a partitional or a hierarchical clustering • SSE = Sum of Squared Errors of the points with respect to their cluster centroid
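A possible sketch of the bisecting variant, under the usual formulation (an assumption, not taken verbatim from the slides): repeatedly split the cluster with the largest SSE using plain 2-means until K clusters remain. It uses scikit-learn's KMeans; the `sse` and `bisecting_kmeans` helpers and the toy data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def sse(points, centroid):
    # Sum of squared errors of the points w.r.t. their centroid
    return float(((points - centroid) ** 2).sum())

def bisecting_kmeans(X, K, seed=0):
    clusters = [X]                              # start with one cluster holding all points
    while len(clusters) < K:
        # Pick the cluster with the largest SSE to split
        sses = [sse(c, c.mean(axis=0)) for c in clusters]
        worst = clusters.pop(int(np.argmax(sses)))
        # Bisecting step: split it with basic 2-means
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(worst)
        clusters.append(worst[km.labels_ == 0])
        clusters.append(worst[km.labels_ == 1])
    return clusters

X = np.vstack([np.random.randn(40, 2) + c for c in ((0, 0), (6, 0), (0, 6))])
parts = bisecting_kmeans(X, K=3)
```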
K-means Pros • Simple • Fast for low-dimensional data • Good performance on globular data Cons • K-means is restricted to data which has the notion of a center (centroid) • K-means cannot handle non-globular clusters, or clusters of different sizes and densities • K-means will not identify outliers
Ch. 17 Hierarchical Clustering • Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents, e.g. animal → vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean) • One approach: recursive application of a partitional clustering algorithm
Strengths of Hierarchical Clustering • No assumption of any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
Sec. 17.1 Hierarchical Agglomerative Clustering (HAC) • Starts with each doc in a separate cluster • Then repeatedly joins the closest pair of clusters, until there is only one cluster • The history of mergings forms a binary tree or hierarchy • The closest pair drives the mergings; how is it defined?
Sec. 17.2 Closest pair of clusters • Single-link • Similarity of the closest points, the most cosine-similar • Complete-link • Similarity of the farthest points, the least cosine-similar • Centroid • Clusters whose centroids are the closest (or most cosine-similar) • Average-link • Clusters whose average distance/cosine between pairs of elements is the smallest
How to Define Inter-Cluster Similarity • Single link (MIN) • Complete link (MAX) • Centroids • Average [Figure: proximity matrix of pairwise similarities between points p1, p2, p3, p4, p5, …; each criterion aggregates these entries across two clusters in a different way]
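A minimal sketch (not part of the original slides) showing how the four linkage criteria above can be tried with SciPy's hierarchical clustering routines; the toy points and the choice of Euclidean distance are illustrative assumptions. The final `fcluster` call also illustrates ‘cutting’ the dendrogram at a level that yields a desired number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

# Toy points (hypothetical): two loose groups of three points each
X = np.vstack([np.random.randn(3, 2), np.random.randn(3, 2) + 8])

# Proximity matrix: pairwise Euclidean distances (the slide's p1..p5 matrix)
prox = squareform(pdist(X))

# Each HAC variant differs only in how inter-cluster distance is aggregated
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # merge history (the dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # 'cut' into 2 clusters
    print(method, labels)
```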