Flat Clustering Chien Chin Chen Department of Information Management National Taiwan University
What is Clustering (1/4) • Clustering algorithms group a set of documents into subsets or clusters. • Clusters are coherent internally, but different from each other. • In other words … • Documents within a cluster should be as similar as possible. • Documents in one cluster should be as dissimilar as possible from documents in other clusters. • Clustering is a form of unsupervised learning. • No human expert has assigned documents to classes or taught the system how to place documents in classes. • It is the distribution and makeup of the data that determine cluster membership.
What is Clustering (2/4) • For example, there are three distinct clusters of points in the following figure.
What is Clustering (3/4) • The key input to a clustering algorithm is the distance measure. • Different distance measures give rise to different clusterings (the outcome of clustering). • The most popular distance measure – Euclidean distance. • Structure: • Flat clustering: create a flat set of clusters without any explicit structure that would relate clusters to each other. • Hierarchical clustering: create a hierarchy of clusters. • Hard/soft clustering: • Hard clustering: each document is a member of exactly one cluster. • Soft clustering: a document’s assignment is a distribution over all clusters.
What is Clustering (4/4) • Partitional clustering: • An alternative definition of hard clustering. • Each document belongs to exactly one cluster. • Exhaustive clustering: • Each document must be assigned to a cluster. • Non-exhaustive clustering – some documents may be assigned to no cluster.
Clustering in Information Retrieval (1/6) • A general assumption when using clustering in information retrieval: • Clustering hypothesis – documents in the same cluster behave similarly with respect to relevance to information needs. • So … if there is a document from a cluster that is relevant to a search request, we assume that other documents from the same cluster are also relevant. • Many information retrieval applications benefit from clustering, for instance, search result clustering, cluster-based retrieval …
Clustering in Information Retrieval (2/6) • Search result clustering: • The default presentation of search results in information retrieval is a simple list. • It is often easier to scan a few coherent groups than many individual documents. • Particularly useful if a search term has different word senses.
Clustering in Information Retrieval (3/6) • Better user interface – Scatter-Gather [SIGIR92]: • Cluster the whole collection to get groups of documents that the user can select or gather. • The selected groups are merged and the resulting set is again clustered. • The process is repeated until a cluster of interest is found. • Problems: • The generated clusters are not as neatly organized as a manually constructed hierarchical tree. • Finding descriptive labels for clusters automatically is a difficult problem.
Clustering in Information Retrieval (5/6) • News document clustering: • News reading is not really search, but rather a process of selecting a subset of stories about recent events. • Need to frequently re-compute the clustering to make sure that users can access the latest breaking stories. • DARPA TDT (Topic Detection and Tracking) project. • Real-world applications: Google News and the Columbia NewsBlaster system.
Clustering in Information Retrieval (6/6) • Speed up search: • Search in vector space model amounts to finding the nearest neighbors to the query. • We have to compute the similarity of the query to every document in a large collection. • By clustering the collection … • We can find the clusters that are closest to the query and only consider documents from these clusters. • Since there are many fewer clusters than documents, finding the closest cluster is fast. • Moreover, documents matching a query are similar to each other, and tend to be in the same clusters.
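To make the idea concrete, here is a minimal sketch of cluster-based retrieval, assuming the documents, centroids, and query are dense NumPy vectors and that `cluster_members` maps each cluster to a list of document ids; the function name and the Euclidean ranking are illustrative choices, not part of the slides.

```python
import numpy as np

def search_via_clusters(query, centroids, cluster_members, doc_vectors, top_k=10):
    """Rank only the documents of the cluster whose centroid is closest
    to the query, instead of scanning the whole collection."""
    # Find the cluster whose centroid is nearest to the query (Euclidean).
    nearest = int(np.argmin(np.linalg.norm(centroids - query, axis=1)))
    members = cluster_members[nearest]  # document ids in that cluster
    # Compute query-document distances only for those members.
    dists = np.linalg.norm(doc_vectors[members] - query, axis=1)
    return [members[i] for i in np.argsort(dists)[:top_k]]
```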
Problem Statement (1/2) • We can define hard flat clustering as follows: • Given … • A set of documents D = {d1, …, dN}. • A desired number of clusters K. • An objective function that evaluates the quality of a clustering. • For instance, the average similarity between documents and their centroid. • We want … • To compute an assignment γ : D → {1, …, K} that minimizes (or maximizes) the objective function. • Sometimes, we require that none of the K clusters is empty.
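As an illustration of such an objective, the following sketch computes the average cosine similarity between each document and its cluster centroid (to be maximized); the function name and the NumPy encoding are assumptions, and the sketch relies on the side condition that no cluster is empty.

```python
import numpy as np

def avg_centroid_similarity(doc_vectors, assignment, K):
    """Example objective: average cosine similarity between each document
    and the centroid of its assigned cluster (higher is better).
    Assumes every cluster k in 0..K-1 has at least one member."""
    total = 0.0
    for k in range(K):
        members = doc_vectors[assignment == k]
        centroid = members.mean(axis=0)
        # Cosine similarity of every member to its centroid.
        sims = (members @ centroid) / (
            np.linalg.norm(members, axis=1) * np.linalg.norm(centroid))
        total += sims.sum()
    return total / len(doc_vectors)
```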
Problem Statement (2/2) • For documents, the similarity is usually topic similarity or high values on the same dimensions in the vector space model. [Figure: document vectors d1 and d2 have a similar topic, while di is quite different.]
Cardinality – The Number of Clusters • A difficult issue in clustering is determining the cardinality of a clustering (i.e., K). • The brute-force solution would be to enumerate all possible clusterings and pick the best. • However, this is not feasible: the number of ways to partition N documents into K clusters is approximately K^N / K!, which is exponential in N. • Often K is nothing more than a good guess based on experience or domain knowledge.
Starting Point • Most flat clustering algorithms (e.g., K-means) begin with an initial partitioning and refine the partition iteratively. • If the search starts at an unfavorable initial point, we may miss the global optimum. • Finding a good starting point is another important problem in flat clustering. [Figure: the objective function plotted over the space of partitions, showing a starting point from which iterative refinement reaches a local optimum instead of the global optimum.]
Evaluation of Clustering (1/6) • Internal criterion: • High intra-cluster similarity – documents within a cluster are similar. • Low inter-cluster similarity – documents from different clusters are dissimilar. • Note that good scores on an internal criterion do not necessarily translate into good effectiveness in an application. • The evaluation should correspond well to the user's perspective. • External criterion: • We evaluate how well the clustering matches the gold standard classes. • The gold standard is produced by human judges with a good level of inter-judge agreement. • Popular and objective!! • We introduce several external criteria of clustering quality.
Evaluation of Clustering (2/6) • Purity: • Each cluster is assigned to the class which is most frequent in the cluster. • Then measure the accuracy of the assignment:
$$\text{purity}(\Omega, C) = \frac{1}{N}\sum_{k}\max_{j}\,|\omega_k \cap c_j|$$
where Ω = {ω1, ω2, …, ωK} is the set of clusters and C = {c1, c2, …, cJ} is the set of classes.
Evaluation of Clustering (3/6) • Bias – high purity is easy to achieve when the number of clusters is large. • Purity is 1 if each document gets its own cluster.
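A small sketch of purity under the formula above, with `clusters` and `classes` as parallel lists of labels (an illustrative encoding); the second call demonstrates the singleton bias.

```python
from collections import Counter

def purity(clusters, classes):
    """Purity: map each cluster to its most frequent gold class, then
    count the fraction of documents that fall into that class.
    clusters[i] / classes[i]: cluster id and gold class of document i."""
    per_cluster = {}
    for w, c in zip(clusters, classes):
        per_cluster.setdefault(w, Counter())[c] += 1
    hits = sum(counts.most_common(1)[0][1] for counts in per_cluster.values())
    return hits / len(clusters)

print(purity([0, 0, 1, 1], ['a', 'a', 'b', 'b']))  # 1.0 -- perfect clustering
print(purity([0, 1, 2, 3], ['a', 'a', 'b', 'b']))  # 1.0 -- the singleton bias
```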
Evaluation of Clustering (4/6) • We can view clustering as a series of decisions, one for each of the N(N-1)/2 pairs of documents in the collection. • True positive (TP): assign two similar (same-class) documents to the same cluster. • True negative (TN): assign two dissimilar (different-class) documents to different clusters. • False positive (FP): assign two dissimilar documents to the same cluster. • False negative (FN): assign two similar documents to different clusters.
Evaluation of Clustering (5/6) • The Rand index (RI) measures the percentage of decisions that are correct:
$$\mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN}$$
• In the example (three clusters of sizes 6, 6, and 5):
$$TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 15 + 15 + 10 = 40$$
$$TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 10 + 6 + 3 + 1 = 20$$
Evaluation of Clustering (6/6) • RI is then (20 + 72) / (20 + 20 + 24 + 72) ≈ 0.68. • RI gives equal weight to false positives and false negatives. • F measure: • Based on the pairwise precision and recall:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$$
• Selecting β > 1 penalizes false negatives more strongly than false positives. • In information retrieval, evaluating clustering with F has the advantage that the measure is already familiar to the research community.
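A direct, if brute-force, sketch of RI that enumerates all N(N-1)/2 pairs; the list encoding of `clusters` and `classes` is an illustrative assumption.

```python
from itertools import combinations

def rand_index(clusters, classes):
    """Rand index: fraction of the N(N-1)/2 document pairs on which the
    clustering and the gold standard agree (the TP and TN decisions)."""
    agree, total = 0, 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        agree += same_cluster == same_class  # counts both TP and TN
        total += 1
    return agree / total
```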
K-means (1/11) • The most important flat clustering algorithm. • A cluster center is defined as the mean or centroid of the documents in a cluster ω:
$$\vec{\mu}(\omega) = \frac{1}{|\omega|}\sum_{\vec{x} \in \omega}\vec{x}$$
where $\vec{x}$ is a document vector, $\vec{\mu}(\omega)$ the centroid vector of cluster ω, and $|\omega|$ the size of ω.
K-means (2/11) • How well do the centroids represent the members of their clusters? • Residual sum of squares (RSS) – the squared distance of each vector from its centroid, summed over all vectors:
$$\mathrm{RSS}_k = \sum_{\vec{x} \in \omega_k} \left| \vec{x} - \vec{\mu}(\omega_k) \right|^2 \ \text{(for a single cluster } k\text{)}, \qquad \mathrm{RSS} = \sum_{k=1}^{K} \mathrm{RSS}_k$$
• RSS is the objective function in K-means and our goal is to minimize it.
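A one-function sketch of RSS, assuming the rows of `X` are document vectors and `assignment` is an integer array of cluster ids; it is reused by the later sketches.

```python
import numpy as np

def rss(X, assignment, centroids):
    """Residual sum of squares: squared distance of every vector in X
    from the centroid of its assigned cluster, summed over all vectors."""
    return sum(((X[assignment == k] - c) ** 2).sum()
               for k, c in enumerate(centroids))
```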
K-means (3/11) • The first step of K-means is to randomly select K documents as the seeds (the initial cluster centroids). • It then iteratively repeats two steps until a stopping criterion is met: • Re-assigning documents to the cluster with the closest centroid. • Re-computing each centroid based on the current members of its cluster.
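Putting the pieces together, here is a minimal K-means sketch in the same NumPy setting; it seeds with K random documents and uses the "assignments do not change" termination condition (with an iteration bound) discussed on the next slide. The empty-cluster handling is one possible choice, not prescribed by the slides.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Basic K-means: seed with K random documents, then alternate
    re-assignment and centroid re-computation until assignments stabilize."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assignment = None
    for _ in range(n_iter):
        # Re-assignment: each vector goes to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # no change between iterations -> converged
        assignment = new_assignment
        # Re-computation: each centroid becomes the mean of its members.
        for k in range(K):
            members = X[assignment == k]
            if len(members) > 0:  # keep the old centroid for an empty cluster
                centroids[k] = members.mean(axis=0)
    return assignment, centroids
```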
K-means (5/11) • Termination conditions: • A fixed number of iterations I has been completed. • Limits the runtime of the clustering algorithm. • In some cases, the quality of the clustering will be poor because of an insufficient number of iterations. • Assignment of documents to clusters does not change between iterations. • Equivalent to "centroids do not change between iterations". • Generally produces a good clustering, but the runtime may be unacceptably long. • RSS falls below a threshold. • Usually combined with a bound on the number of iterations to guarantee termination. • The decrease in RSS falls below a threshold. • For a small threshold, the clustering is close to convergence. • Again, this needs to be combined with a bound on the number of iterations to prevent very long runtimes.
K-means (6/11) • Will the two steps make the objective function converge? • First, RSS decreases in the re-assignment step: • Each vector is assigned to the closest centroid. • The distance it contributes to RSS decreases. • Second, we have to show that RSS decreases in the re-computation step. • To find the vector $\vec{v}$ that minimizes $\mathrm{RSS}_k$, we take the partial derivative with respect to each component $v_m$:
$$\mathrm{RSS}_k(\vec{v}) = \sum_{\vec{x} \in \omega_k} |\vec{v} - \vec{x}|^2 = \sum_{\vec{x} \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2$$
K-means (7/11) • Then …
$$\frac{\partial\, \mathrm{RSS}_k(\vec{v})}{\partial v_m} = \sum_{\vec{x} \in \omega_k} 2(v_m - x_m)$$
• Setting the partial derivative to zero, we get:
$$v_m = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} x_m$$
• This v_m is the component-wise definition of the centroid. • Thus, we minimize RSS_k when the old centroid is replaced with the new centroid. • RSS, the sum of the RSS_k, must then also decrease during re-computation.
K-means (8/11) • Since there is a finite set of possible clusterings, a monotonically decreasing algorithm will eventually arrive at a (local) minimum. • Note that, if there are several equidistant centroids, we assign the document to the cluster with the lowest index. • Otherwise, the algorithm can cycle forever in a loop of clusterings that have the same cost. • Outliers: • If an outlier is chosen as an initial seed, then no other vector is assigned to it during subsequent iterations. • Thus, we end up with a singleton cluster (a cluster with only one document). • There is probably a clustering with lower RSS (a better local minimum).
K-means (9/11) • Effective heuristics for seed selection: • Excluding outliers from the seed set. • Trying out multiple starting points and choosing the clustering with lowest cost. • Obtaining seeds from another method such as hierarchical clustering. • On a small random sample of the original collection.
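The second heuristic can be sketched by reusing the `kmeans` and `rss` functions from the earlier sketches: run from several random seeds and keep the lowest-cost clustering.

```python
def kmeans_restarts(X, K, n_starts=10):
    """Heuristic: run K-means from several random starting points and
    keep the clustering with the lowest RSS."""
    best = None
    for seed in range(n_starts):
        assignment, centroids = kmeans(X, K, seed=seed)
        cost = rss(X, assignment, centroids)
        if best is None or cost < best[0]:
            best = (cost, assignment, centroids)
    return best[1], best[2]  # assignment and centroids of the best run
```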
K-means (10/11) • The time complexity of K-means (N documents, M dimensions, K clusters): • In the re-computation step: • Each vector gets added to a centroid once, so the complexity is O(MN). • In the re-assignment step: • Each distance computation costs O(M). • The re-assignment computes KN distances, so O(KNM). • For a fixed number of iterations I, the overall complexity is O(IKNM).
K-means (11/11) • Thus, K-means is linear in all factors. • However, M is usually so large that distance computations become time consuming. • Solutions: • Truncate centroids to the most significant k terms (e.g., k = 1000). • This hardly decreases cluster quality while achieving a significant speedup of the re-assignment step. • K-medoids – compute medoids instead of centroids as cluster centers. • Medoid – the document vector that is closest to the centroid. • Since document vectors are sparse (unlike dense centroids), distance computations against medoids are fast (see the sketch below).
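A minimal sketch of the medoid definition in the same NumPy setting; in a full K-medoids implementation, the medoid would replace the centroid as the cluster center during re-assignment.

```python
import numpy as np

def medoid(members):
    """Medoid: the member vector closest to the cluster centroid; using it
    as the cluster center keeps the center as sparse as a real document."""
    centroid = members.mean(axis=0)
    return members[np.argmin(np.linalg.norm(members - centroid, axis=1))]
```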