
Special Topics in Text Mining



Presentation Transcript


  1. Special Topics in Text Mining Manuel Montes y Gómez http://ccc.inaoep.mx/~mmontesg/ mmontesg@inaoep.mx University of Alabama at Birmingham, Spring 2011

  2. Introduction to document clustering

  3. Agenda • The problem of document clustering • Some applications • The clustering process • Document representation • Clustering algorithms • Evaluation of clustering • Internal measures • External measures

  4. Clustering • Refers to the task of partitioning data into groups, or clusters, of similar objects. • In document clustering, documents = objects • We can also cluster paragraphs or words • The idea is that clusters consist of objects that are similar to one another and dissimilar to the objects of other groups. How difficult is this task? Why? Why do we want to do document clustering?

  5. Complexities of the task – subjectivity [Figure omitted; taken from: Henry Lin, Artificial Intelligence Course Slides]

  6. Complexities of the task – similarity • In the end we want clusters of similar objects • The most important thing is how to represent the objects (documents) • Feature selection has to be driven by the application • Because we want to divide documents by topic, words and word n-grams are the most common features.

  7. Applications of document clustering • As a visualization/analysis technique • Discover the main topics in large document collections • Visualize search results • As a preprocessing step • Improve the search process – cluster hypothesis • Facilitate the labeling process (for supervised classification) • Help eliminate redundancy in document summarization

  8. Document clustering process • Three main steps: • Feature extraction: words or word n-grams; TF-IDF or Boolean values • Similarity evaluation: cosine similarity • Cluster construction: divide documents into groups according to the chosen representation and similarity measure; several algorithms exist [Pipeline diagram: Documents → Feature extraction → Document-feature matrix → Similarity evaluation → Document-document similarity matrix → Cluster construction → Clusters]
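A minimal sketch of the three-step pipeline using scikit-learn; the sample documents and the choice of k = 2 are invented for illustration:

    # Sketch of the pipeline: feature extraction -> similarity -> clustering.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.cluster import KMeans

    docs = [
        "stock markets fell sharply on inflation fears",
        "central bank raises interest rates again",
        "local team wins the championship final",
        "injured striker to miss the next match",
    ]

    # 1. Feature extraction: document-feature matrix with TF-IDF weights
    X = TfidfVectorizer().fit_transform(docs)

    # 2. Similarity evaluation: document-document cosine-similarity matrix
    sim = cosine_similarity(X)

    # 3. Cluster construction: here k-means, with k chosen by hand
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # e.g. [0 0 1 1]: finance vs. sports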

  9. Different types of clusterings • Partitional vs. hierarchical • Partitional: division of a set of objects into non-overlapping subsets • Hierarchical: nested clusters organized as a tree • Exclusive vs. overlapping vs. fuzzy • Exclusive: objects belong to one single cluster • Overlapping: objects may belong to more than one cluster • Fuzzy: objects belong to every cluster with a membership weight between 0 and 1.

  10. Clustering algorithms • A clustering algorithm groups the input data according to a set of predefined criteria. Its goal is to maximize the intra-cluster similarity and minimize the inter-cluster similarity. • In document clustering: • K-means (partitional clustering) • Agglomerative hierarchical clustering • Methods based on graphs (the star algorithm) and on neural networks (SOM) have also been used

  11. Partitional clustering • Methods of this kind generate a single partition of a given set of N documents into K clusters • Each document is placed in exactly one cluster. • The user has to input the value of K. • The idea is to find, among all possible arrangements of the N documents into K clusters, the one with the lowest quadratic error How to do this?
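One common way to write that quadratic-error objective (the slide's own formula is not reproduced here; this is the usual formulation, with mu_k the centroid of cluster C_k):

    E = \sum_{k=1}^{K} \sum_{d \in C_k} \lVert d - \mu_k \rVert^2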

  12. K-means algorithm • 1. Decide on a value for k • 2. Initialize the k cluster centers by randomly selecting k documents Alternative ideas? • 3. Decide the cluster memberships of the N documents by assigning them to the nearest cluster center. • 4. Re-estimate the k cluster centers, assuming the memberships found above are correct. How to compute the new centers? • 5. If none of the N documents changed membership in the last iteration, exit; otherwise go to step 3. A different stop condition?
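A sketch of the algorithm in NumPy, following the five steps above (random initialization and Euclidean distance are assumptions here; this is illustrative, not production code):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        # X is an (n_docs, n_features) float array, e.g. dense TF-IDF vectors.
        rng = np.random.default_rng(seed)
        # Step 2: initialize the k centers by randomly selecting k documents
        centers = X[rng.choice(len(X), size=k, replace=False)]
        labels = None
        for _ in range(max_iter):
            # Step 3: assign each document to its nearest cluster center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            # Step 5: exit if no document changed membership
            if labels is not None and np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # Step 4: re-estimate each center as the mean of its members
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return labels, centers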

  13. Some comments on K-means • Advantages: • Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n • Easily parallelized • Disadvantages: • Need to specify the number of clusters in advance • Sensitive to initial conditions • In addition, it has problems when clusters have different sizes or densities, or non-globular shapes

  14. K-means limitations (1) – different densities [Figure omitted; taken from slides of Professor Xindong Wu, Department of Computer Science, University of Vermont]

  15. K-means limitations (2) – non-globular shapes [Figure omitted; taken from slides of Professor Xindong Wu, Department of Computer Science, University of Vermont]

  16. Hierarchical clustering • Create a hierarchical decomposition of the set of objects using some criterion. • These methods produce a nested data set • Clusters within clusters! • Can be visualized using a dendrogram • From it, it is possible to determine the “correct” number of clusters • Two main kinds of algorithms: agglomerative and divisive

  17. Example of agglomerative clustering • We know how to measure the distance between two documents, but defining the distance between an object and a cluster, or between two clusters, is not obvious. How to do that?

  18. Distance/similarity between clusters • Single linkage (nearest neighbor): the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in the different clusters. • Complete linkage (furthest neighbor): the distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the “furthest neighbors”). • Group average linkage: the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. • Centroid-based linkage: the distance between two clusters is determined by the distance between their centroids/medoids.
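These criteria are implemented, for instance, in SciPy's hierarchy module; a short sketch on toy two-dimensional points (standing in for document vectors):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 5.2]])

    for method in ["single", "complete", "average", "centroid"]:
        Z = linkage(X, method=method)                    # merge history (dendrogram data)
        labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
        print(method, labels)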

  19. Comments on hierarchical clustering • Advantages: • No need to specify the number of clusters in advance • The hierarchical structure facilitates visualization and browsing tasks. • Disadvantages: • Not very efficient: quadratic time complexity. • Like any heuristic search algorithm, they can get stuck in local optima (the same holds for partitional methods!).

  20. Clustering evaluation • There are well-accepted evaluation measures and procedures for text classification. • Training and test data; cross-validation • Recall, precision, F-measure, accuracy, etc. • Clustering evaluation is not so simple, because it is a subjective task Ideas for an evaluation procedure? What information is necessary? What should be measured?

  21. Issues in clustering evaluation • Determine the clustering tendency of a set of data (does a non-random structure exist in the data?) • Determine the correct number of clusters • Evaluate how well a clustering output fits the data without external information • Compare the results of a clustering algorithm to externally known results • Compare two clusterings to determine which is better. Ideas?

  22. Evaluation measures • Unsupervised (internal) • Measure the goodness of a clustering structure without reference to external information • Cluster cohesion and separation • Supervised (external) • Measure the extent to which the generated clustering structure matches some external structure. • Entropy, F-measure, Jaccard coefficient, etc.

  23. Unsupervised measures • In general, overall cluster validity is computed as a weighted sum of the validity of individual clusters • where the validity function can be defined as cohesion, separation, or a combination of the two
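The slide's formula is not reproduced here; in the usual formulation (e.g., Tan, Steinbach & Kumar), for K clusters C_1, ..., C_K with weights w_i it reads:

    \text{overall validity} = \sum_{i=1}^{K} w_i \, \text{validity}(C_i)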

  24. Silhouette coefficient • Combines cohesion and separation values • Defined per instance; averaging over the instances of a group gives a value for that group, and averaging over groups gives a value for the whole clustering. • Values between -1 and 1.
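In the standard definition, for an instance i, a(i) is the average distance to the other members of its own cluster (cohesion) and b(i) is the smallest average distance to the members of any other cluster (separation):

    s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

In practice, scikit-learn's sklearn.metrics.silhouette_score(X, labels) returns the clustering-level average directly.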

  25. Visualizing the similarity matrix • Procedure: order the similarity matrix by clusters and plot it in gray tones • Black = 1; white = 0. • For a good clustering, the similarity matrix should be roughly block-diagonal How to obtain a single number from it?
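A sketch of the procedure with matplotlib, assuming sim and labels come from the pipeline sketch earlier:

    import numpy as np
    import matplotlib.pyplot as plt

    order = np.argsort(labels)                 # group documents by cluster label
    plt.imshow(sim[np.ix_(order, order)],      # reordered similarity matrix
               cmap="gray_r", vmin=0, vmax=1)  # black = 1, white = 0
    plt.title("Similarity matrix ordered by cluster")
    plt.show()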

  26. Supervised measures • It is necessary to have external information • A manual solution: class labels for the documents • Two main kinds of measures: • Similarity-oriented: measure the extent to which two documents that are in the same class are in the same cluster, and vice versa. • Classification-oriented: measure the extent to which a cluster contains documents of a single class.

  27. Similarity-oriented measures • These measures rest on the premise that two documents in the same cluster should belong to the same class, and vice versa.
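The classical measures of this family count pairs of documents: with f_11 the number of pairs in the same class and the same cluster, f_00 pairs in different classes and different clusters, f_10 pairs in different classes but the same cluster, and f_01 pairs in the same class but different clusters, the Rand statistic and the Jaccard coefficient are

    \text{Rand} = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}},
    \qquad
    \text{Jaccard} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}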

  28. Classification-oriented measures (1) • Entropy: measures the degree to which each cluster consists of a single class. With m_i the size of cluster i, m_{ij} the number of its members that belong to class j, and L the number of classes: the probability that a member of cluster i belongs to class j is p_{ij} = m_{ij} / m_i, and the entropy of cluster i is e_i = -\sum_{j=1}^{L} p_{ij} \log_2 p_{ij}. The total entropy is a weighted average of the individual entropies, with weights corresponding to cluster sizes: e = \sum_{i=1}^{K} (m_i / m) \, e_i, where m is the total number of documents and K the number of clusters.

  29. Classification-oriented measures (2) • Precision: the fraction of a cluster i that consists of objects of a specified class j. • Recall: the extent to which a cluster i contains all objects of a specified class j. • F-measure: combination of both measures; indicates the extent to which a cluster i contains only objects of a class j and all the objects of that class. • The global F-measure is a weighted average of the per-class F-measures.
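In the notation of the previous slide (with m_j the total number of documents of class j), these read:

    \text{precision}(i, j) = \frac{m_{ij}}{m_i}, \qquad
    \text{recall}(i, j) = \frac{m_{ij}}{m_j}, \qquad
    F(i, j) = \frac{2 \cdot \text{precision}(i, j) \cdot \text{recall}(i, j)}{\text{precision}(i, j) + \text{recall}(i, j)}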
