
Presentation Transcript


1. Design and Evaluation of Clustering Approaches for Large Document Collections, The “BIC-Means” Method
Nikolaos Hourdakis
Technical University of Crete, Department of Electronic and Computer Engineering

2. Motivation
• Large document collections arise in many applications: digital libraries, the Web.
• There is growing interest in methods for more effective management of information: abstraction, browsing, classification, retrieval.
• Clustering is a means of achieving better organization of information: the data space is partitioned into groups of entities with similar content.

3. Outline
• Background
  • State-of-the-art clustering approaches: partitional, hierarchical methods
  • K-Means and its variants: Incremental K-Means, Bisecting Incremental K-Means
• Proposed method: BIC-Means
  • Bisecting Incremental K-Means using BIC as the stopping criterion
• Evaluation of clustering methods
• Application in Information Retrieval

4. Hierarchical Clustering (1/3)
• Produces a nested sequence of clusters.
• Two approaches:
  • Agglomerative: starting from singleton clusters, recursively merges the two most similar clusters until only one cluster remains.
  • Divisive (e.g., Bisecting K-Means): starting with all documents in the same root cluster, iteratively splits each cluster into K clusters.

  5. . . . . . 1 1 . . . . . . 4 . 6 . 2 2 3 . 3 . . . . . . . . 5 . . . . . 7 4 5 6 7 . . Hierarchical Clustering – Example (2/3) Nikos Hourdakis, MSc Thesis

6. Hierarchical Clustering (3/3)
• Organization and browsing of large document collections call for hierarchical clustering, but:
  • Agglomerative clustering has quadratic time complexity.
  • This is prohibitive for large data sets.

7. Partitional Clustering
• We focus on partitional clustering: K-Means, Incremental K-Means, Bisecting K-Means.
• At least as good as hierarchical methods.
• Low complexity, O(KN).
• Faster than hierarchical clustering for large document collections.

8. K-Means
• Randomly select K centroids.
• Repeat ITER times or until the centroids do not change:
  • Assign each instance to the cluster whose centroid is closest to it.
  • Re-compute the cluster centroids.
• Generates a flat partition of K clusters (K must be known in advance).
• The centroid is the mean of a group of instances (a minimal sketch follows below).
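A minimal sketch of the loop described above, assuming documents are rows of a NumPy array; function and variable names are illustrative, not from the thesis:

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Basic K-Means over an (N, M) array of document vectors."""
    rng = np.random.default_rng(seed)
    # Randomly select K instances as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each instance to the cluster whose centroid is closest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute each centroid as the mean of its cluster's instances.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids did not change
            break
        centroids = new_centroids
    return labels, centroids
```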

9. K-Means Example
[Figure: scatter plot of points being assigned to centroids (marked C) over successive iterations; the diagram itself does not survive in this transcript.]

10. K-Means demo (1/7): http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html

11. K-Means demo (2/7)

12. K-Means demo (3/7)

13. K-Means demo (4/7)

14. K-Means demo (5/7)

15. K-Means demo (6/7)

16. K-Means demo (7/7)

17. Comments
• No general proof of convergence.
• Converges to a local minimum of the distortion measure (the sum of the squared distances of the points from their nearest centroids): $\sum_{c=1}^{K} \sum_{d \in c} (d - \mu_c)^2$
• Too slow for practical databases.
• K-Means is fully deterministic once the initial centroids are selected.
• A bad choice of initial centroids leads to poor clusters.
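The distortion measure transcribed directly, assuming the k_means sketch above:

```python
def distortion(X, labels, centroids):
    """Sum of squared distances of points from their nearest centroids."""
    return sum(np.sum((X[labels == j] - centroids[j]) ** 2)
               for j in range(len(centroids)))
```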

18. Incremental K-Means (IK)
• In K-Means, new centroids are computed after each iteration (after all documents have been examined).
• In Incremental K-Means, each cluster centroid is updated as soon as a document is assigned to the cluster.
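The slide's formula is not reproduced in this transcript; a standard running-mean update consistent with this description is:

$$\mu \leftarrow \mu + \frac{d - \mu}{n + 1}$$

where $d$ is the newly assigned document vector and $n$ is the number of documents already in the cluster.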

19. Comments
• Not as sensitive as K-Means to the selection of initial centroids.
• Faster convergence; much faster in general.

20. Bisecting IK-Means (1/4)
• A hierarchical clustering solution is produced by recursively applying Incremental K-Means to a document collection.
• The documents are initially partitioned into two clusters.
• The algorithm iteratively selects and bisects each of the leaf clusters until singleton clusters are reached.

21. Bisecting IK-Means (2/4)
• Input: (d1, d2, …, dN)
• Output: a hierarchy of K clusters
• Start with all documents in one cluster C.
• Apply IK-Means to split C into K clusters (K = 2): C1, C2, …, CK become leaf clusters.
• Iteratively split each cluster Ci until K clusters or singleton clusters are produced at the leaves (a runnable sketch follows below).
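A minimal sketch of this recursion, reusing the k_means function above as a stand-in for Incremental K-Means (illustrative only):

```python
def bisecting_k_means(X, indices=None):
    """Recursively bisect clusters down to singleton leaves; returns a tree."""
    if indices is None:
        indices = np.arange(len(X))
    if len(indices) <= 1:  # singleton cluster: stop splitting
        return {"docs": indices, "children": None}
    labels, _ = k_means(X[indices], k=2)
    left, right = indices[labels == 0], indices[labels == 1]
    if len(left) == 0 or len(right) == 0:  # degenerate split: keep as leaf
        return {"docs": indices, "children": None}
    return {"docs": indices,
            "children": [bisecting_k_means(X, left),
                         bisecting_k_means(X, right)]}
```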

22. Bisecting IK-Means (3/4)
• The algorithm is exhaustive, terminating at singleton clusters (unless K is known).
• Terminating at singleton clusters is time-consuming, and singleton clusters are meaningless.
• Intermediate clusters are more likely to correspond to real classes.
• There is no criterion for stopping bisections before singleton clusters are reached.

23. Bayesian Information Criterion (BIC) (1/3)
• To prevent over-splitting, we define a strategy to stop the bisecting algorithm once meaningful clusters are reached.
• Bayesian Information Criterion (BIC), or Schwarz Criterion [Schwarz 1978].
• X-Means [Pelleg and Moore, 2000] used BIC for estimating the best K in a given range of values.

24. Bayesian Information Criterion (BIC) (2/3)
• In this work, we suggest using BIC as the splitting criterion: it decides whether or not a cluster should be split.
• It measures the improvement in cluster structure between a cluster and its two children clusters.
• We compute the BIC score of a cluster and of its two children clusters.

25. Bayesian Information Criterion (BIC) (3/3)
• If the BIC score of the produced children clusters is less than the BIC score of their parent cluster, we do not accept the split; we keep the parent cluster as it is.
• Otherwise, we accept the split and the algorithm proceeds in the same way at lower levels.

26. Example
• The BIC score of the parent cluster is less than the BIC score of the generated cluster structure, so we accept the bisection.
• Parent cluster: BIC(K=1) = 1980
• Two resulting clusters: BIC(K=2) = 2245

27. Computing BIC
• The BIC score of a data collection is defined as (Kass and Wasserman, 1995):

$$BIC(M_j) = \hat{l}_j(D) - \frac{p_j}{2} \log R$$

where $\hat{l}_j(D)$ is the log-likelihood of the data set D, $p_j = M \cdot K + 1$ is a function of the number of independent parameters, and R is the number of points.

28. Log-likelihood
• Given a cluster of points that induces a Gaussian distribution N(μ, σ²), the log-likelihood is the (log) probability that a neighborhood of data points follows this distribution.
• The log-likelihood of the data can be taken as a measure of the cohesiveness of a cluster: it estimates how close the points of the cluster are to the centroid.

29. Parameters pj
• Sometimes, due to the complexity of the data (many dimensions or many data points), the data may follow other distributions.
• We therefore penalize the log-likelihood by a function of the number of independent parameters, $\frac{p_j}{2} \log R$.

30. Notation
• $\mu_j$: coordinates of the j-th centroid
• $\mu_{(i)}$: centroid nearest to the i-th data point
• D: input set of data points
• $D_j$: set of data points that have $\mu_{(j)}$ as their closest centroid
• $R = |D|$ and $R_j = |D_j|$
• M: the number of dimensions
• $M_j$: family of alternative models (different models correspond to different clustering solutions)
• BIC scores the models and chooses the best among K models

31. Computing BIC (1/3)
• To compute the log-likelihood of the data we need the parameters of the Gaussian that models the data.
• We use the maximum likelihood estimate (MLE) of the variance (under the spherical Gaussian assumption), given below.
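Following X-Means [Pelleg and Moore, 2000], the MLE of the variance under the spherical Gaussian assumption is:

$$\hat{\sigma}^2 = \frac{1}{R - K} \sum_{i} \left( x_i - \mu_{(i)} \right)^2$$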

32. Computing BIC (2/3)
• Probability of a point $x_i$: a Gaussian with the estimated σ and, as mean, the cluster centroid nearest to $x_i$.
• Log-likelihood of the data: the sum of the log probabilities of the individual points, given below.
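In the same X-Means derivation, the point probability and the log-likelihood of the data are:

$$\hat{P}(x_i) = \frac{R_{(i)}}{R} \cdot \frac{1}{(2\pi\hat{\sigma}^2)^{M/2}} \exp\!\left( -\frac{\lVert x_i - \mu_{(i)} \rVert^2}{2\hat{\sigma}^2} \right)$$

$$l(D) = \log \prod_i \hat{P}(x_i) = \sum_i \log \hat{P}(x_i)$$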

33. Computing BIC (3/3)
• Focusing on the set $D_n$ of points which belong to centroid n, the log-likelihood takes the closed form given below.
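Again following X-Means, substituting the estimates above and restricting the sum to $D_n$ gives:

$$\hat{l}(D_n) = -\frac{R_n}{2}\log(2\pi) - \frac{R_n M}{2}\log\hat{\sigma}^2 - \frac{R_n - K}{2} + R_n \log R_n - R_n \log R$$

A sketch putting the pieces together (illustrative, reusing the NumPy setup from the k_means sketch above; not the thesis's exact implementation):

```python
def bic_score(X, labels, centroids):
    """BIC of a clustering solution (X-Means style, spherical Gaussians)."""
    R, M = X.shape
    K = len(centroids)
    # Pooled MLE of the variance under the spherical Gaussian assumption.
    sq = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(K))
    sigma2 = sq / max(R - K, 1)
    if sigma2 <= 0:
        return float("-inf")
    log_l = 0.0
    for j in range(K):
        Rn = np.sum(labels == j)
        if Rn == 0:
            continue
        log_l += (-Rn / 2 * np.log(2 * np.pi)
                  - Rn * M / 2 * np.log(sigma2)
                  - (Rn - K) / 2
                  + Rn * np.log(Rn) - Rn * np.log(R))
    p = M * K + 1  # number of independent parameters, as on slide 27
    return log_l - p / 2 * np.log(R)
```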

34. Proposed Method: BIC-Means (1/2)
• BIC-Means: Bisecting InCremental K-Means clustering incorporating BIC as the stopping criterion.
• A BIC splitting test is performed at each leaf cluster to prevent it from over-splitting.
• BIC-Means does not terminate at singleton clusters; it terminates when there are no separable clusters left according to BIC.

35. Proposed Method: BIC-Means (2/2)
• Combines the strengths of partitional and hierarchical clustering methods:
  • a hierarchical clustering structure,
  • low complexity (O(N·K)),
  • good clustering quality,
  • meaningful clusters at the leaves.

36. BIC-Means Algorithm
• Input: S = (d1, d2, …, dN)
• Output: a hierarchy of clusters.
1. Start with all documents in one cluster C.
2. Apply Incremental K-Means to split C into C1, C2.
3. Compute BIC for C and for (C1, C2).
4. If BIC(C) < BIC(C1, C2), put C1 and C2 in the queue; otherwise do not split C.
5. Repeat steps 2–4 until there is no separable leaf cluster in the queue according to BIC (a sketch follows below).
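A compact sketch of this loop, reusing k_means (as a stand-in for Incremental K-Means) and bic_score from above; illustrative, not the thesis implementation:

```python
from collections import deque

def bic_means(X):
    """Bisect clusters, keeping each split only if it improves the BIC score."""
    leaves, queue = [], deque([np.arange(len(X))])
    while queue:
        idx = queue.popleft()
        if len(idx) < 2:  # cannot bisect a singleton
            leaves.append(idx)
            continue
        labels, centroids = k_means(X[idx], k=2)
        if min(np.sum(labels == 0), np.sum(labels == 1)) == 0:
            leaves.append(idx)  # degenerate split: keep as leaf
            continue
        parent_bic = bic_score(X[idx], np.zeros(len(idx), dtype=int),
                               X[idx].mean(axis=0, keepdims=True))
        children_bic = bic_score(X[idx], labels, centroids)
        if parent_bic < children_bic:  # split improves the model: recurse
            queue.append(idx[labels == 0])
            queue.append(idx[labels == 1])
        else:  # keep the parent as a leaf cluster
            leaves.append(idx)
    return leaves
```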

37. Evaluation
• Evaluation of document clustering algorithms.
• Two data sets: OHSUMED (233,445 Medline documents) and Reuters-21578 (21,578 documents).
• Application of clustering to information retrieval:
  • evaluation of several cluster-based retrieval strategies,
  • comparison with retrieval by exhaustive search on OHSUMED.

38. F-Measure
• Measures how well the clusters approximate the data classes.
• The F-measure for cluster C and class T is defined as

$$F_{TC} = \frac{2 \, P_{TC} R_{TC}}{P_{TC} + R_{TC}}, \quad \text{where} \quad P_{TC} = \frac{N_{TC}}{N_C}, \quad R_{TC} = \frac{N_{TC}}{N_T},$$

with $N_{TC}$ the number of documents of class T in cluster C, $N_C = |C|$ and $N_T = |T|$.
• The F-measure of a class T is the maximum value it achieves over all clusters C: $F_T = \max_C F_{TC}$
• The F-measure of the clustering solution is the mean of $F_T$ over all classes.
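A direct transcription of this measure, assuming class labels as a dict doc → class and clusters as lists of document ids (names illustrative):

```python
from collections import Counter

def clustering_f_measure(class_of_doc, clusters):
    """Mean over classes of the best F score any cluster achieves for them."""
    class_sizes = Counter(class_of_doc.values())
    best = {t: 0.0 for t in class_sizes}
    for members in clusters:  # one list of document ids per cluster
        counts = Counter(class_of_doc[d] for d in members)
        for t, n_tc in counts.items():
            p, r = n_tc / len(members), n_tc / class_sizes[t]
            best[t] = max(best[t], 2 * p * r / (p + r))
    return sum(best.values()) / len(best)
```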

39. Comparison of Clustering Algorithms

40. Evaluation of Incremental K-Means

41. MeSH Representation of Documents
• We use MeSH terms to describe medical documents (OHSUMED).
• Each document is represented by a vector of MeSH terms (multi-word terms instead of single-word terms).
• This leads to a more compact representation (each vector contains fewer terms, about 20).
• A sequential approach is used to extract MeSH terms from OHSUMED documents.

42. Bisecting Incremental K-Means – Clustering Quality

43. Speed of Clustering

44. Evaluation of BIC-Means

45. Speed of Clustering

46. Comments
• BIC-Means is much faster than Bisecting Incremental K-Means, as it is not an exhaustive algorithm.
• It achieves approximately the same F-measure as the exhaustive bisecting approach.
• It is therefore better suited for clustering large document collections.

47. Application of Clustering to Information Retrieval
• We demonstrate that it is possible to reduce the size of the search (and therefore the retrieval response time) on large data sets (OHSUMED).
• BIC-Means is applied to the entire OHSUMED collection.
• Each document is represented by MeSH terms.
• We chose 61 queries from the original OHSUMED query set developed by Hersh et al.
• Each OHSUMED document has been judged for relevance to a query.

48. Query – Document Similarity
• Similarity is defined as the cosine of the angle θ between the document and query vectors.
[Figure: two vectors d1 and d2 separated by angle θ.]
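In standard vector-space form, consistent with the slide:

$$sim(d, q) = \cos\theta = \frac{\vec{d} \cdot \vec{q}}{\lVert \vec{d} \rVert \, \lVert \vec{q} \rVert}$$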

49. Information Retrieval Methods
• Method 1: search the M clusters closest to the query, computing similarity between each cluster centroid and the query (a sketch follows below).
• Method 2: search the M clusters closest to the query, where each cluster is represented by the 20 most frequent terms of its centroid.
• Method 3: search the M clusters whose centroids contain the terms of the query.
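A minimal sketch of Method 1, assuming NumPy vectors for the query, centroids, and documents (names illustrative):

```python
def search_top_clusters(query, centroids, doc_vecs, cluster_members, m=10):
    """Rank clusters by centroid-query cosine, then score only their documents."""
    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return (a @ b) / denom if denom else 0.0
    # Pick the M clusters whose centroids are most similar to the query.
    ranked = sorted(range(len(centroids)),
                    key=lambda j: cos(query, centroids[j]), reverse=True)[:m]
    # Search only inside the selected clusters instead of the whole collection.
    candidates = [d for j in ranked for d in cluster_members[j]]
    return sorted(candidates, key=lambda d: cos(query, doc_vecs[d]), reverse=True)
```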

50. Method 1: search the M clusters closest to the query (compute similarity between cluster centroid and query).
