This article provides an introduction to document clustering, including the clustering process, document representation, clustering algorithms, and evaluation measures. It also discusses the applications and complexities of document clustering.
Special Topics in Text Mining
Manuel Montes y Gómez
http://ccc.inaoep.mx/~mmontesg/
mmontesg@inaoep.mx
University of Alabama at Birmingham, Spring 2011
Agenda
• The problem of document clustering
• Some applications
• The clustering process
  • Document representation
  • Clustering algorithms
• Evaluation of clustering
  • Internal measures
  • External measures
Special Topics on Information Retrieval
Clustering
• Refers to the task of partitioning data into groups, or clusters, of similar objects.
• In document clustering, the objects are documents.
  • We can also cluster paragraphs or words.
• The idea is that each cluster consists of objects that are similar to one another and dissimilar to the objects of other clusters.
How difficult is this task? Why?
Why do we want to do document clustering?
Complexities of the task – subjectivity
Taken from: Henry Lin, Artificial Intelligence Course Slides
Complexities of the task – similarity
• In the end, we want clusters of similar objects.
• The most important decision is how to represent the objects (documents).
• Feature selection has to be driven by the application.
• Because we want to divide documents by topic, words and word n-grams are the most common features.
Applications of document clustering
• As a visualization/analysis technique
  • Discover the main topics in large document collections
  • Visualize search results
• As a preprocessing step
  • Improve the search process (cluster hypothesis)
  • Facilitate the labeling process (for supervised classification)
  • Help eliminate redundancies in document summarization
Document clustering process
• Three main steps:
  • Feature extraction: words or word n-grams; TF-IDF or Boolean values
  • Similarity evaluation: for example, cosine similarity
  • Clustering construction: divide documents into groups according to the chosen representation and similarity measure; several algorithms are available
Pipeline: documents → feature extraction → document-feature matrix → similarity evaluation → document-document similarity matrix → clustering construction → clusters
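The first two steps of the process can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: it assumes whitespace tokenization, raw term frequency, and an IDF of log(N/df), and the function names (`tfidf_vectors`, `cosine`, `similarity_matrix`) and the sample documents are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (TF * log(N/df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a vector with no weighted terms is similar to nothing
    return dot / (norm_u * norm_v)

def similarity_matrix(docs):
    """Document-document similarity matrix built from the document-feature matrix."""
    vectors = tfidf_vectors(docs)
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Hypothetical toy collection: two documents about clustering, one unrelated.
docs = [
    "clustering groups similar documents",
    "document clustering groups documents by topic",
    "stock markets fell sharply today",
]
sim = similarity_matrix(docs)
```

The resulting matrix is what the clustering construction step consumes; any of the algorithms discussed later can be run on it or on the underlying vectors.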
Different types of clusterings
• Partitional vs. hierarchical
  • Partitional: division of a set of objects into non-overlapping subsets
  • Hierarchical: nested clusters organized as a tree
• Exclusive vs. overlapping vs. fuzzy
  • Exclusive: each object belongs to a single cluster
  • Overlapping: objects may belong to more than one cluster
  • Fuzzy: each object belongs to every cluster with a membership weight between 0 and 1
Clustering algorithms
• A clustering algorithm groups the input data according to a set of predefined criteria. Its goal is to maximize the intra-cluster similarity and minimize the inter-cluster similarity.
• In document clustering:
  • K-means (partitional clustering)
  • Agglomerative hierarchical clustering
  • Methods based on graphs (star algorithm) and on neural networks (SOM) have also been used
Partitional clustering
• These methods generate a single partition of a given set of N documents into K clusters.
• Each document is placed in exactly one cluster.
• The user has to supply the value of K.
• The idea is to find, among all possible assignments of N documents to K clusters, the one with the lowest quadratic error.
How to do this?
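The quadratic error mentioned above is usually written as a sum of squared distances to the cluster centers. The notation below is an assumption made for illustration (the slide does not define symbols): C_k is the k-th cluster, d a document vector, and \mu_k the centroid of C_k.

```latex
E = \sum_{k=1}^{K} \sum_{d \in C_k} \lVert d - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{d \in C_k} d
```

Trying every possible assignment is infeasible for realistic N, which is why iterative heuristics are used instead of exhaustive search.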
K-means algorithm
1. Decide on a value for k.
2. Initialize the k cluster centers by randomly selecting k documents. (Alternative ideas?)
3. Decide the cluster memberships of the N documents by assigning each one to its nearest cluster center.
4. Re-estimate the k cluster centers, assuming the memberships found above are correct. (How to compute the new centers?)
5. If none of the N documents changed membership in the last iteration, exit. Otherwise, go to step 3. (A different stop condition?)
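The steps above can be sketched as follows. This is a minimal version under stated assumptions: documents are dense numeric vectors, distance is squared Euclidean, and new centers are the mean of their members. The `max_iter` cap, the `seed` parameter, and the sample points are additions for the example, not part of the slide.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: initialize the centers with k randomly selected points.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 5: stop when no point changed membership.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: re-estimate each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = mean(members)
    return assignment, centers

# Hypothetical toy data: two well-separated groups of three points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(points, k=2)
```

For documents, one would typically run the same loop over TF-IDF vectors; with length-normalized vectors, Euclidean distance and cosine similarity rank neighbors the same way.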
Some comments on K-means
• Advantages:
  • Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n.
  • Easily parallelized
• Disadvantages:
  • The number of clusters must be specified in advance
  • Sensitive to initial conditions
  • It also has problems when clusters differ in size or density, or have non-globular shapes
K-means limitations (1): different densities
Taken from slides of Professor Xindong Wu, Department of Computer Science, University of Vermont.
K-means limitations (2): non-globular shapes
Taken from slides of Professor Xindong Wu, Department of Computer Science, University of Vermont.
Hierarchical clustering
• Creates a hierarchical decomposition of the set of objects using some criterion.
• Produces a nested set of clusters
  • Clusters within clusters!
• Can be visualized using a dendrogram
  • From it, it is possible to determine the “correct” number of clusters
• Two main kinds of algorithms: agglomerative and divisive
Example of agglomerative clustering
• We know how to measure the distance between two documents, but defining the distance between an object and a cluster, or between two clusters, is not obvious.
How to do that?
Distance/similarity between clusters
• Single linkage (nearest neighbor): the distance between two clusters is the distance between their two closest objects (nearest neighbors), one from each cluster.
• Complete linkage (furthest neighbor): the distance between two clusters is the greatest distance between any two objects, one from each cluster (the "furthest neighbors").
• Group average linkage: the distance between two clusters is the average distance over all pairs of objects, one from each cluster.
• Centroid-based linkage: the distance between two clusters is the distance between their centroids/medoids.
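The four linkage criteria, and a naive agglomerative loop that uses them, can be sketched as below. This is an illustration under assumptions: objects are dense vectors, the base distance is Euclidean, and the function names and sample clusters are made up for the example. The loop recomputes all pairwise cluster distances at every merge, so it is meant for reading, not for large collections.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, dist=euclidean):
    """Distance between the two closest objects of different clusters."""
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, dist=euclidean):
    """Distance between the two furthest objects of different clusters."""
    return max(dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2, dist=euclidean):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def centroid_linkage(c1, c2, dist=euclidean):
    """Distance between the cluster centroids."""
    return dist(centroid(c1), centroid(c2))

def agglomerative(points, n_clusters, linkage=single_linkage):
    """Start with one cluster per point and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Hypothetical toy clusters: two vertical pairs, three units apart.
a = [[0.0, 0.0], [0.0, 2.0]]
b = [[3.0, 0.0], [3.0, 2.0]]
merged = agglomerative(a + b, 2)
```

Stopping at a fixed `n_clusters` is a simplification; recording the merge order and distances instead would give the information needed to draw a dendrogram.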
Comments on hierarchical clustering
• Advantages:
  • No need to specify the number of clusters in advance
  • The hierarchical structure facilitates visualization and browsing tasks
• Disadvantages:
  • Not very efficient: at least quadratic time complexity in the number of documents
  • As with any heuristic search algorithm, local optima are a problem (the same holds for partitional methods!)
Clustering evaluation
• There are well-accepted evaluation measures and procedures for text classification.
  • Training and test data; cross-validation
  • Recall, precision, F-measure, accuracy, etc.
• Clustering evaluation is not so simple, because it is a subjective task.
Ideas for an evaluation procedure? Which information is necessary? What to measure?
Issues on clustering evaluation
• Determine the clustering tendency of a set of data (is there a non-random structure in the data?)
• Determine the correct number of clusters
• Evaluate how well a clustering output fits the data without external information
• Compare the results of a clustering algorithm to externally known results
• Compare two clusterings to determine which is better
Ideas?
Evaluation measures
• Unsupervised (internal)
  • Measure the goodness of a clustering structure without reference to external information
  • Cluster cohesion and separation
• Supervised (external)
  • Measure the extent to which the generated clustering structure matches some external structure
  • Entropy, F-measure, Jaccard coefficient, etc.
Unsupervised measures
• In general, the overall validity of a clustering is computed as a weighted sum of the validity of its individual clusters.
• The validity function can be defined as cohesion, separation, or a combination of both.
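The weighted sum can be written out as follows. The symbols are assumptions for illustration: C_i is the i-th of K clusters and w_i its weight (for example, a function of cluster size). The proximity-based definitions of cohesion and separation shown here are one common choice, not necessarily the one used in the original slide.

```latex
\text{overall validity} = \sum_{i=1}^{K} w_i \,\text{validity}(C_i)

\text{cohesion}(C_i) = \sum_{x \in C_i} \sum_{y \in C_i} \text{proximity}(x, y),
\qquad
\text{separation}(C_i, C_j) = \sum_{x \in C_i} \sum_{y \in C_j} \text{proximity}(x, y)
```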
Silhouette coefficient
• Combines cohesion and separation values
• Is defined for individual instances; the value for a group is the average over its instances, and the value for a clustering is the average over its groups
• Values range between -1 and 1
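A minimal sketch of the silhouette computation, assuming the usual definition s = (b - a) / max(a, b), where a is the average distance from an instance to the other members of its own cluster and b is the lowest average distance to the members of any other cluster. Euclidean distance, the score of 0 for singleton clusters, and the sample data are assumptions for the example; at least two clusters are required.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_values(points, labels, dist=euclidean):
    """Silhouette coefficient of every instance (requires at least two clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    values = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        # Cohesion: the sum includes dist(p, p) = 0, hence the division by len - 1.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Separation: closest other cluster by average distance.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

def clustering_silhouette(points, labels, dist=euclidean):
    """Average the instance values per cluster, then average the cluster values."""
    per_cluster = {}
    for label, s in zip(labels, silhouette_values(points, labels, dist)):
        per_cluster.setdefault(label, []).append(s)
    cluster_means = [sum(v) / len(v) for v in per_cluster.values()]
    return sum(cluster_means) / len(cluster_means)

# Hypothetical toy data: the first labeling matches the geometry, the second does not.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
good = clustering_silhouette(points, [0, 0, 1, 1])
bad = clustering_silhouette(points, [0, 1, 0, 1])
```

Values near 1 indicate instances that sit well inside their cluster; negative values indicate instances that are, on average, closer to another cluster.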
Visualizing the similarity matrix
• Procedure: order the rows and columns of the similarity matrix by cluster, and plot it in gray tones
  • Black = 1; white = 0
• For a good clustering, the similarity matrix should be roughly block-diagonal
How to obtain a number from it?
Supervised measures
• It is necessary to have external information
  • A manual solution: class labels for the documents
• Two main kinds of measures:
  • Similarity-oriented: measure the extent to which two documents that are in the same class are in the same cluster, and vice versa.
  • Classification-oriented: measure the extent to which a cluster contains documents of a single class.
Similarity measures
• Measures based on the premise that two documents in the same cluster must belong to the same class, and vice versa.
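One way to turn this premise into numbers is to count document pairs by whether they share a class and whether they share a cluster. The sketch below assumes the Rand statistic and the Jaccard coefficient as the pair-based measures (the Jaccard coefficient is named earlier in the deck; the Rand statistic and the names f11, f10, f01, f00 are additions for illustration).

```python
from itertools import combinations

def pair_counts(classes, clusters):
    """Count document pairs by class agreement and cluster agreement.

    f11: same class, same cluster      f10: same class, different cluster
    f01: different class, same cluster f00: different class, different cluster
    """
    f11 = f10 = f01 = f00 = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            f11 += 1
        elif same_class:
            f10 += 1
        elif same_cluster:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00

def rand_statistic(classes, clusters):
    """Fraction of pairs on which the classes and the clusters agree."""
    f11, f10, f01, f00 = pair_counts(classes, clusters)
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def jaccard_coefficient(classes, clusters):
    """Like the Rand statistic, but ignoring pairs that are separated in both."""
    f11, f10, f01, _ = pair_counts(classes, clusters)
    denom = f11 + f10 + f01
    return f11 / denom if denom else 0.0  # undefined case: 0.0 by convention here

# Hypothetical toy labeling: five documents, two classes, two clusters.
classes = ["a", "a", "a", "b", "b"]
clusters = [0, 0, 1, 1, 1]
counts = pair_counts(classes, clusters)
```

Both measures equal 1 when the clustering reproduces the class structure exactly, regardless of how the cluster identifiers are named.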
Classification oriented measures (1)
• Entropy: measures the degree to which each cluster consists of a single class.
  • It is defined from the probability that a member of cluster i belongs to class j.
  • The entropy of cluster i sums over the L classes (L is the number of classes).
  • The total entropy is a weighted average of the individual entropies; the weights correspond to cluster sizes.
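The quantities described above can be written as follows. Apart from L, the symbols are assumptions for illustration: m_{ij} is the number of documents of class j in cluster i, m_i the size of cluster i, m the total number of documents, and K the number of clusters. The base-2 logarithm is also an assumed convention, with 0 \log 0 taken as 0.

```latex
p_{ij} = \frac{m_{ij}}{m_i},
\qquad
e_i = -\sum_{j=1}^{L} p_{ij} \log_2 p_{ij},
\qquad
e = \sum_{i=1}^{K} \frac{m_i}{m}\, e_i
```

A cluster that contains documents of a single class has entropy 0, so lower total entropy indicates a better match with the class labels.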
Classification oriented measures (2)
• Precision: the fraction of a cluster i that consists of objects of a specified class j.
• Recall: the extent to which a cluster i contains all objects of a specified class j.
• F-measure: a combination of both measures; indicates the extent to which a cluster i contains only objects of class j and all the objects of that class.
• The global F-measure is a weighted average of all per-class F-measures.
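These measures can be sketched as below. Two details are assumptions, since the slide does not spell them out: the F-measure is taken as the harmonic mean of precision and recall, and the per-class F-measure is taken as the best F over all clusters, weighted by class size in the global score. The function names and sample labels are made up for the example.

```python
from collections import Counter

def f_measures(classes, clusters):
    """F-measure for every (cluster i, class j) pair with at least one shared document."""
    m_ij = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    scores = {}
    for (i, j), count in m_ij.items():
        precision = count / cluster_sizes[i]  # fraction of cluster i that is class j
        recall = count / class_sizes[j]       # fraction of class j found in cluster i
        scores[(i, j)] = 2 * precision * recall / (precision + recall)
    return scores

def global_f_measure(classes, clusters):
    """Class-size-weighted average of the best F-measure obtained for each class."""
    scores = f_measures(classes, clusters)
    class_sizes = Counter(classes)
    total = len(classes)
    return sum(
        size / total * max(f for (i, j), f in scores.items() if j == cls)
        for cls, size in class_sizes.items()
    )

# Hypothetical toy labeling: five documents, two classes, two clusters.
classes = ["a", "a", "a", "b", "b"]
clusters = [0, 0, 1, 1, 1]
scores = f_measures(classes, clusters)
overall = global_f_measure(classes, clusters)
```

Pairs with no shared documents are simply absent from `scores`; their F-measure would be 0 and can never be the maximum for a class.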