Flat Clustering Adapted from Slides by Prabhakar Raghavan, Christopher Manning, Ray Mooney and Soumen Chakrabarti
Today’s Topic: Clustering • Document clustering • Motivations • Document representations • Success criteria • Clustering algorithms • Partitional (Flat) • Hierarchical (Tree)
What is clustering? • Clustering: the process of grouping a set of objects into classes of similar objects • Documents within a cluster should be similar. • Documents from different clusters should be dissimilar. • The commonest form of unsupervised learning • Unsupervised learning = learning from raw data, as opposed to supervised data where a classification of examples is given • A common and important task that finds many applications in IR and other places
A data set with clear cluster structure [Figure: 2D scatter plot with three well-separated groups of points] • How would you design an algorithm for finding the three clusters in this case?
Classification vs. Clustering • Classification: supervised learning • Clustering: unsupervised learning • Classification: Classes are human-defined and part of the input to the learning algorithm. • Clustering: Clusters are inferred from the data without human input. • However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .
Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clustering [Figure: excerpt of the Yahoo! directory tree rooted at www.yahoo.com/Science … (30), with top-level branches agriculture, biology, physics, CS, space, and lower-level topics including dairy, crops, agronomy, forestry, botany, cell, evolution, craft, magnetism, relativity, AI, HCI, courses, missions]
Navigational hierarchies: Manual vs. automatic creation • Note: Yahoo/MeSH are not examples of clustering. • But they are well-known examples of using a global hierarchy for navigation. • Some examples of global navigation/exploration based on clustering: • Cartia • ThemeScapes • Google News
Google News: automatic clustering gives an effective news presentation metaphor
Selection Metrics • Google News taps into its own unique ranking signals, which include • user clicks, • the estimated authority of a publication in a particular topic (possibly taking location into account), • freshness/recency, • geography, and more.
S/G Example: query on “star” • Encyclopedia text, clusters with sizes: 14 sports; 8 symbols; 47 film, tv; 68 film, tv (p); 7 music; 97 astrophysics; 67 astronomy (p); 12 stellar phenomena; 10 flora/fauna; 49 galaxies, stars; 29 constellations; 7 miscellaneous • Clustering and re-clustering are entirely automated
Scatter/Gather: Cutting, Pedersen, Tukey & Karger ’92, ’93; Hearst & Pedersen ’95 • How it works • Cluster sets of documents into general “themes”, like a table of contents • Display the contents of the clusters by showing topical terms and typical titles • User chooses subsets of the clusters and re-clusters the documents within • Resulting new groups have different “themes”
For visualizing a document collection and its themes • Wise et al., “Visualizing the Non-Visual”, PNNL • ThemeScapes, Cartia • [Mountain height = cluster size]
For improving search recall • Cluster hypothesis - Documents in the same cluster behave similarly with respect to relevance to information needs • Therefore, to improve search recall: • Cluster docs in corpus a priori • When a query matches a doc D, also return other docs in the cluster containing D • Example: The query “car” will also return docs containing automobile • Because clustering grouped together docs containing car with those containing automobile. Why might this happen?
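A minimal sketch of this cluster-expanded retrieval idea, assuming a precomputed doc-to-cluster mapping (all names here are hypothetical, and the inverted index is reduced to a plain dict for illustration):

```python
def expanded_results(query, index, doc_cluster, cluster_docs):
    """Return direct matches plus the other members of each matched doc's cluster.

    index: term -> set of matching doc ids (a stand-in for a real inverted index).
    doc_cluster: doc id -> cluster id, computed a priori over the corpus.
    cluster_docs: cluster id -> set of doc ids in that cluster.
    """
    direct = set(index.get(query, set()))
    expanded = set(direct)
    for d in direct:
        # e.g. a "car" match pulls in "automobile" docs from the same cluster
        expanded |= cluster_docs[doc_cluster[d]]
    return expanded

# Toy data: docs 1 and 2 were clustered together because they share context terms.
index = {"car": {1}}
doc_cluster = {1: "A", 2: "A"}
cluster_docs = {"A": {1, 2}}
print(expanded_results("car", index, doc_cluster, cluster_docs))  # {1, 2}
```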
Issues for clustering • Representation for clustering • Document representation • Vector space? Normalization? • Need a notion of similarity/distance • How many clusters? • Fixed a priori? • Completely data driven? • Avoid “trivial” clusters - too large or small
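As one concrete answer to the representation question, a sketch using scikit-learn’s TfidfVectorizer (assuming scikit-learn is available; the example documents are made up). Its rows are L2-normalized by default, which makes cosine similarity a plain dot product:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the car is fast",
        "an automobile on the road",
        "stars and galaxies"]
X = TfidfVectorizer().fit_transform(docs)  # sparse (n_docs, n_terms) matrix
print(X.shape)                             # each row is an L2-normalized tf-idf vector
```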
What makes docs “related”? • Ideal: semantic similarity. • Practical: statistical similarity • Docs as vectors. • For many algorithms, easier to think in terms of a distance (rather than similarity) between docs. • We can use cosine similarity (alternatively, Euclidean distance).
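To make these two options concrete, a minimal sketch in plain NumPy (the two toy term-frequency vectors are invented for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two document vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance; for unit-length vectors it is monotonically related to cosine."""
    return float(np.linalg.norm(a - b))

# Two toy term-frequency vectors over a hypothetical 4-term vocabulary
d1 = np.array([2.0, 1.0, 0.0, 1.0])
d2 = np.array([1.0, 1.0, 1.0, 0.0])
print(cosine_similarity(d1, d2))   # higher = more similar
print(euclidean_distance(d1, d2))  # lower = more similar
```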
More Applications of clustering … • Image Processing • Cluster images based on their visual content • Web • Cluster groups of users based on their access patterns on webpages • Cluster webpages based on their content • Bioinformatics • Cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality etc.) • …
Outliers • Outliers are objects that do not belong to any cluster or form clusters of very small cardinality • In some applications we are interested in discovering outliers, not clusters (outlier analysis) [Figure: a dense group of points labeled “cluster” and a few isolated points labeled “outliers”]
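One simple way to make this operational, sketched under the assumption that clusters are summarized by centroids and that the distance cutoff is chosen by hand:

```python
import numpy as np

def flag_outliers(X, centroids, cutoff):
    """Mark points whose distance to the nearest centroid exceeds the cutoff.

    X: (n_points, n_dims) array; centroids: (K, n_dims) array.
    Returns a boolean mask, True = outlier.
    """
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1) > cutoff
```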
Clustering Algorithms • Partitional (Flat) algorithms • Usually start with a random (partial) partition • Refine it iteratively • K means clustering • (Model based clustering) • Hierarchical (Tree) algorithms • Bottom-up, agglomerative • (Top-down, divisive)
Hard vs. soft clustering • Hard clustering: Each document belongs to exactly one cluster • More common and easier to do • Soft clustering: A document can belong to more than one cluster. • Makes more sense for applications like creating browsable hierarchies • You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
Partitioning Algorithms • Partitioning method: Construct a partition of n documents into a set of K clusters • Given: a set of documents and the number K • Find: a partition of K clusters that optimizes the chosen partitioning criterion • Globally optimal: exhaustively enumerate all partitions • Effective heuristic methods: K-means and K-medoids algorithms
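For K-means, the usual partitioning criterion is the residual sum of squares, i.e. the total squared distance of each document from its cluster’s centroid:

$$\mathrm{RSS} = \sum_{k=1}^{K} \sum_{\vec{x} \in c_k} \bigl\lVert \vec{x} - \vec{\mu}(c_k) \bigr\rVert^{2}$$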
K-Means • Assumes documents are real-valued vectors. • Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c. • Reassignment of instances to clusters is based on distance to the current cluster centroids. • (Or one can equivalently phrase it in terms of similarities)
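Written out, the centroid of a cluster c is the component-wise mean of its member vectors:

$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$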
K-Means Algorithm Select K random docs {s1, s2, … sK} as seeds. Until clustering converges or other stopping criterion: For each doc di: Assign di to the cluster cj such that dist(di, sj) is minimal. (Then update the seeds to the centroid of each cluster) For each cluster cj: sj = μ(cj)
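A runnable sketch of this loop in plain NumPy (an illustration, not the original authors’ code; empty clusters are reseeded from a random doc, one of several common fixes):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Lloyd's algorithm. X: (n_docs, n_dims) array of document vectors.
    Returns (centroids, assignments)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Select K random docs {s1, ..., sK} as seeds.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    assign = None
    for _ in range(max_iters):
        # Assign each doc di to the cluster cj with minimal dist(di, sj).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # converged: no doc changed cluster
        assign = new_assign
        # Update each seed sj to the centroid mu(cj) of its cluster.
        for k in range(K):
            members = X[assign == k]
            centroids[k] = members.mean(axis=0) if len(members) else X[rng.integers(len(X))]
    return centroids, assign
```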
K-Means Example (K=2) [Figure: animation of K-means on 2D points, centroids marked ×. Steps: pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged!]
Worked Example: Random selection of initial centroids [figure with the point set not shown] • Exercise: (i) Guess what the optimal clustering into two clusters is in this case; (ii) compute the centroids of the clusters
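Since the slide’s point set lives in the omitted figure, here is the exercise on a hypothetical stand-in, reusing the kmeans sketch from the algorithm slide:

```python
import numpy as np

# Hypothetical 2D points with two visible groups (the slide's actual points are not shown).
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
centroids, assign = kmeans(points, K=2, seed=0)
print(centroids)  # means of the two groups: about (1.5, 1.33) and (8.5, 8.33)
print(assign)     # e.g. [0 0 0 1 1 1] (cluster labels may be swapped)
```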