Design and Evaluation of Clustering Approaches for Large Document Collections, The "BIC-Means" Method
Nikolaos Hourdakis
Technical University of Crete, Department of Electronic and Computer Engineering
Motivation • Large document collections arise in many applications. • Digital libraries, the Web. • There is growing interest in methods for more effective management of information. • Abstraction, browsing, classification, retrieval. • Clustering is a means for achieving better organization of information. • The data space is partitioned into groups of entities with similar content.
Outline • Background • State-of-the-art clustering approaches • Partitional, hierarchical methods • K-Means and its variants • Incremental K-Means, Bisecting Incremental K-Means • Proposed method: BIC-Means • Bisecting Incremental K-Means using BIC as stopping criterion. • Evaluation of clustering methods • Application in Information Retrieval
Hierarchical Clustering (1/3) • Nested sequence of clusters. • Two approaches: • Agglomerative: Starting from singleton clusters, recursively merges the two most similar clusters until there is only one cluster. • Divisive (e.g., Bisecting K-Means): Starting with all documents in the same root cluster, iteratively splits each cluster into K clusters.
Hierarchical Clustering – Example (2/3) • [Figure: dendrogram showing seven points grouped into a nested sequence of clusters.]
Hierarchical Clustering (3/3) • Organization and browsing of large document collections call for hierarchical clustering, but: • Agglomerative clustering has quadratic time complexity. • Prohibitive for large data sets.
Partitional Clustering • We focus on partitional clustering: • K-Means, • Incremental K-Means, • Bisecting K-Means. • At least as good as hierarchical methods. • Low complexity, O(KN). • Faster than hierarchical methods for large document collections.
K-Means • Randomly select K centroids. • Repeat ITER times or until the centroids do not change: • Assign each instance to the cluster whose centroid is closest. • Re-compute the cluster centroids. • Generates a flat partition of K clusters (K must be known in advance). • A centroid is the mean of a group of instances.
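To make the procedure concrete, here is a minimal NumPy sketch of K-Means (an illustration only, not the thesis implementation; function and variable names are ours). It also returns the distortion measure discussed in the Comments slide further below.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Flat partition of the rows of X into k clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(iters):
        # Assign each instance to the cluster whose centroid is closest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute each centroid as the mean of the instances assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids did not change
            break
        centroids = new_centroids
    # Distortion: sum of squared distances of points to their nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    distortion = (dists[np.arange(len(X)), labels] ** 2).sum()
    return labels, centroids, distortion
```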
K-Means Example • [Figure: data points partitioned among three centroids marked C.]
K-Means demo: http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html • [Figures: seven screenshots (1/7–7/7) from the online demo.]
Comments • No proof of convergence. • Converges to a local minimum of the distortion measure (the sum of squared distances of the points from their nearest centroids): $\sum_{j=1}^{K}\sum_{d \in D_j}(d - \mu_j)^2$. • Too slow for practical databases. • K-Means is fully deterministic once the initial centroids are selected. • A bad choice of initial centroids leads to poor clusters.
Incremental K-Means (IK) • In K-Means, new centroids are computed after each iteration (after all documents have been examined). • In Incremental K-Means, each cluster centroid is updated as soon as a document is assigned to the cluster: $\mu \leftarrow \frac{n\mu + d}{n + 1}$, where $n$ is the number of documents already in the cluster and $d$ is the newly assigned document vector.
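A minimal sketch of this running-mean update (class and method names are ours): each document assignment updates the centroid in O(M) time, so centroids adjust as documents stream in.

```python
import numpy as np

class IncrementalCluster:
    """Cluster whose centroid is updated after every single assignment."""
    def __init__(self, dim):
        self.n = 0                       # documents assigned so far
        self.centroid = np.zeros(dim)

    def assign(self, doc):
        # Running-mean update: mu <- (n * mu + d) / (n + 1)
        self.centroid = (self.n * self.centroid + doc) / (self.n + 1)
        self.n += 1
```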
Comments • Not as sensitive as K-Means to the selection of the initial centroids. • Faster convergence; much faster in general.
Bisecting IK-Means (1/4) • A hierarchical clustering solution is produced by recursively applying Incremental K-Means to a document collection. • The documents are initially partitioned into two clusters. • The algorithm iteratively selects and bisects each one of the leaf clusters until singleton clusters are reached.
Bisecting IK-Means (2/4) • Input: (d1, d2, …, dN). • Output: a hierarchy of clusters. • Place all documents in one cluster C. • Apply IK-Means to split C into K = 2 clusters C1, C2 (leaf clusters). • Iteratively split each leaf cluster Ci until K clusters or singleton clusters are produced at the leaves.
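A compact sketch of this loop (our illustration: `split` stands in for a K = 2 Incremental K-Means run on a subset of the data, and picking the largest leaf first is one common selection strategy, not one prescribed by the thesis):

```python
import numpy as np

def bisecting_ikmeans(X, split):
    """Recursively bisect leaf clusters until only singletons remain.

    `split(X, idx)` bisects the documents indexed by `idx` and
    returns two index arrays (e.g. via Incremental K-Means, K = 2).
    """
    leaves = [np.arange(len(X))]              # all documents in one root cluster
    hierarchy = []                            # (parent, left child, right child)
    while any(len(leaf) > 1 for leaf in leaves):
        i = max(range(len(leaves)), key=lambda j: len(leaves[j]))
        leaf = leaves.pop(i)                  # select a leaf cluster to bisect
        left, right = split(X, leaf)          # bisect into two children
        hierarchy.append((leaf, left, right))
        leaves += [left, right]
    return hierarchy, leaves
```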
Bisecting IK-Means (3/4) • The algorithm is exhaustive, terminating at singleton clusters (unless K is known). • Terminating at singleton clusters: • Is time consuming. • Singleton clusters are meaningless. • Intermediate clusters are more likely to correspond to real classes. • There is no criterion for stopping bisections before singleton clusters are reached.
Bayesian Information Criterion (BIC) (1/3) • To prevent over-splitting, we define a strategy to stop the bisecting algorithm when meaningful clusters are reached. • Bayesian Information Criterion (BIC), or Schwarz Criterion [Schwarz 1978]. • X-Means [Pelleg and Moore, 2000] used BIC for estimating the best K in a given range of values.
Bayesian Information Criterion (BIC) (2/3) • In this work, we suggest using BIC as the splitting criterion of a cluster, in order to decide whether a cluster should split or not. • It measures the improvement of the cluster structure between a cluster and its two children clusters. • We compute the BIC score of a cluster and of its two children clusters.
Bayesian Information Criterion (BIC) (3/3) • If the BIC score of the produced children clusters is less than the BIC score of their parent cluster, we do not accept the split. • We keep the parent cluster as it is. • Otherwise, we accept the split and the algorithm proceeds similarly at lower levels.
Example • The BIC score of the parent cluster is less than the BIC score of the generated cluster structure, so we accept the bisection. • Parent cluster: BIC(K=1) = 1980. • Two resulting clusters: BIC(K=2) = 2245.
Computing BIC • The BIC score of a data collection is defined as (Kass and Wasserman, 1995): $BIC(M_j) = \hat{l}_j(D) - \frac{p_j}{2}\log R$, where $\hat{l}_j(D)$ is the log-likelihood of the data set D, $p_j = M \cdot K + 1$ is the number of independent parameters, and R is the number of points.
Log-likelihood • Given a cluster of points that follows a Gaussian distribution N(μ, σ²), the log-likelihood is the (log) probability that a neighborhood of data points follows this distribution. • The log-likelihood of the data can be considered a measure of the cohesiveness of a cluster. • It estimates how close to the centroid the points of the cluster are.
Parameters pj • Sometimes, due to the complexity of the data (many dimensions or many data points), the data may follow other distributions. • We therefore penalize the log-likelihood by a function of the number of independent parameters: $\frac{p_j}{2}\log R$.
Notation • μj: coordinates of the j-th centroid • μ(i): centroid nearest to the i-th data point • D: input set of data points • Dj: set of data points that have μj as their closest centroid • R = |D| and Rj = |Dj| • M: the number of dimensions • Mj: family of alternative models (different models correspond to different clustering solutions) • BIC scores the models and chooses the best among the K models
Computing BIC (1/3) • To compute the log-likelihood of the data we need the parameters of the Gaussian for the data. • Maximum likelihood estimate (MLE) of the variance (under the spherical Gaussian assumption, following the X-Means formulation): $\hat{\sigma}^2 = \frac{1}{R - K}\sum_i \lVert x_i - \mu_{(i)} \rVert^2$
Computing BIC (2/3) • Probability of point $x_i$: a Gaussian with the estimated $\hat{\sigma}^2$ and mean the nearest cluster centroid to $x_i$: $\hat{P}(x_i) = \frac{R_{(i)}}{R} \cdot \frac{1}{\sqrt{2\pi}\,\hat{\sigma}^{M}} \exp\!\left(-\frac{1}{2\hat{\sigma}^2}\lVert x_i - \mu_{(i)} \rVert^2\right)$ • Log-likelihood of the data: $l(D) = \log \prod_i \hat{P}(x_i) = \sum_i \log \hat{P}(x_i)$
Computing BIC (3/3) • Focusing on the set $D_n$ of points which belong to centroid $n$, the log-likelihood becomes: $l(D_n) = -\frac{R_n}{2}\log(2\pi) - \frac{R_n M}{2}\log\hat{\sigma}^2 - \frac{R_n - K}{2} + R_n \log R_n - R_n \log R$
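Putting the formulas of the last three slides together, a minimal sketch of the BIC computation for one clustering solution (our code, following the X-Means-style formulation above; it assumes the pooled spherical-Gaussian variance estimate and $p_j = M \cdot K + 1$ free parameters as defined earlier):

```python
import numpy as np

def bic_score(X, labels, centroids):
    """BIC = log-likelihood of X under the model - (p / 2) * log R."""
    R, M = X.shape
    K = len(centroids)
    # Pooled spherical-Gaussian MLE of the variance.
    sq = ((X - centroids[labels]) ** 2).sum()
    sigma2 = max(sq / max(R - K, 1), 1e-12)   # guard tiny/degenerate clusters
    ll = 0.0
    for n in range(K):
        Rn = int(np.sum(labels == n))
        if Rn == 0:
            continue
        # Per-cluster log-likelihood l(Dn) from the slide above.
        ll += (-Rn / 2 * np.log(2 * np.pi)
               - Rn * M / 2 * np.log(sigma2)
               - (Rn - K) / 2
               + Rn * np.log(Rn) - Rn * np.log(R))
    p = M * K + 1                             # centroid coordinates + one variance
    return ll - p / 2 * np.log(R)
```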
Proposed Method: BIC-Means (1/2) • BIC-Means: Bisecting InCremental K-Means clustering incorporating BIC as the stopping criterion. • BIC-Means performs a splitting test at each leaf cluster to prevent it from over-splitting. • BIC-Means does not terminate at singleton clusters. • BIC-Means terminates when there are no separable clusters according to BIC.
Proposed Method: BIC-Means (2/2) • Combines the strengths of partitional and hierarchical clustering methods: • Hierarchical clustering solution. • Low complexity (O(N·K)). • Good clustering quality. • Produces meaningful clusters at the leaves.
BIC-Means Algorithm • Input: S = (d1, d2, …, dn), all data in one cluster. • Output: a hierarchy of clusters. • 1. Place all documents in one cluster C. • 2. Apply Incremental K-Means to split C into C1, C2. • 3. Compute BIC for C and for (C1, C2): • If BIC(C) < BIC(C1, C2), put C1, C2 in the queue. • Otherwise do not split C. • 4. Repeat steps 2 and 3 until there are no separable leaf clusters in the queue according to BIC.
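A sketch of the whole loop (our illustration: `split` is a two-way splitter as in the earlier bisecting sketch and `bic_score` is the sketch above; both are hypothetical helpers, not the thesis code):

```python
from collections import deque
import numpy as np

def bic_means(X, split):
    """Bisect leaf clusters, accepting a split only if it improves BIC."""
    queue = deque([np.arange(len(X))])     # leaf clusters awaiting the split test
    final_leaves = []
    while queue:
        leaf = queue.popleft()
        if len(leaf) < 2:
            final_leaves.append(leaf)      # nothing left to bisect
            continue
        left, right = split(X, leaf)       # Incremental K-Means with K = 2
        # BIC of the parent cluster (a single centroid) ...
        parent = bic_score(X[leaf], np.zeros(len(leaf), dtype=int),
                           X[leaf].mean(axis=0, keepdims=True))
        # ... versus BIC of the two children clusters.
        labels = np.isin(leaf, right).astype(int)
        children = bic_score(X[leaf], labels,
                             np.vstack([X[left].mean(axis=0),
                                        X[right].mean(axis=0)]))
        if parent < children:              # the split improves the BIC score
            queue += [left, right]
        else:
            final_leaves.append(leaf)      # keep the parent cluster as it is
    return final_leaves
```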
Evaluation • Evaluation of document clustering algorithms. • Two data sets: OHSUMED (233,445 Medline documents) and Reuters-21578 (21,578 documents). • Application of clustering to information retrieval: • Evaluation of several cluster-based retrieval strategies. • Comparison with retrieval by exhaustive search on OHSUMED.
F-Measure • How well the clusters approximate the data classes. • The F-measure for cluster C and class T is defined as $F_{TC} = \frac{2\,P_{TC}\,R_{TC}}{P_{TC} + R_{TC}}$, where $P_{TC} = \frac{|T \cap C|}{|C|}$ (precision) and $R_{TC} = \frac{|T \cap C|}{|T|}$ (recall). • The F-measure of a class T is the maximum value it achieves over all clusters C: $F_T = \max_C F_{TC}$. • The F-measure of the clustering solution is the mean of $F_T$ over all classes.
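A small sketch of this evaluation measure (our naming; it assumes `clusters` and `classes` are lists of sets of document ids):

```python
def f_measure(clusters, classes):
    """Mean over all classes of the best F score any cluster achieves."""
    total = 0.0
    for T in classes:
        best = 0.0
        for C in clusters:
            overlap = len(T & C)
            if overlap == 0:
                continue
            precision = overlap / len(C)   # fraction of the cluster in the class
            recall = overlap / len(T)      # fraction of the class in the cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += best
    return total / len(classes)
```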
Comparison of Clustering Algorithms
Evaluation of Incremental K-Means
MeSH Representation of Documents • We use MeSH terms for describing medical documents (OHSUMED). • Each document is represented by a vector of MeSH terms (multi-word terms instead of single-word terms). • This leads to a more compact representation (each vector contains fewer terms, about 20). • A sequential approach is used to extract MeSH terms from OHSUMED documents.
Bisecting Incremental K-Means – Clustering Quality
Speed of Clustering
Evaluation of BIC-Means
Speed of Clustering
Comments • BIC-Means is much faster than Bisecting Incremental K-Means: • It is not an exhaustive algorithm. • It achieves approximately the same F-measure as the exhaustive bisecting approach. • It is better suited for clustering large document collections.
Application of Clustering to Information Retrieval • We demonstrate that it is possible to reduce the size of the search (and therefore the retrieval response time) on large data sets (OHSUMED). • BIC-Means is applied on the entire OHSUMED collection. • Each document is represented by MeSH terms. • We chose 61 queries of the original OHSUMED query set developed by Hersh et al. • OHSUMED documents have been judged for relevance to these queries.
Query – Document Similarity • [Figure: two vectors d1, d2 with angle θ between them.] • Similarity is defined as the cosine of the angle between the document and query vectors: $\mathrm{sim}(d, q) = \frac{d \cdot q}{\lVert d \rVert\, \lVert q \rVert}$
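In code the similarity is a one-liner (sketch; assumes dense NumPy vectors):

```python
import numpy as np

def cosine_sim(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    return float(d @ q / (np.linalg.norm(d) * np.linalg.norm(q)))
```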
Information Retrieval Methods • Method 1: Search the M clusters closest to the query. • Compute the similarity between each cluster centroid and the query. • Method 2: Search the M clusters closest to the query. • Each cluster is represented by the 20 most frequent terms of its centroid. • Method 3: Search the M clusters whose centroids contain the terms of the query.
Method 1: Search the M clusters closest to the query (compute the similarity between each cluster centroid and the query).
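A sketch of Method 1 (our illustration: `cosine_sim` is the helper above, `centroids` and `clusters` would come from the clustering step, `docs` holds the document vectors; the two-stage ranking follows the slide's description):

```python
import numpy as np

def cluster_based_search(query, centroids, clusters, docs, m=10, top=20):
    """Search only the documents of the m clusters closest to the query."""
    # Rank clusters by centroid-query similarity and keep the top m.
    order = np.argsort([-cosine_sim(c, query) for c in centroids])[:m]
    candidates = [i for j in order for i in clusters[j]]   # docs of selected clusters
    # Score only the candidate documents instead of the whole collection.
    candidates.sort(key=lambda i: -cosine_sim(docs[i], query))
    return candidates[:top]
```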