1 / 17

A Comprehensive Comparison Study of Document Clustering for a Biomedical Digital Library MEDLINE

A Comprehensive Comparison Study of Document Clustering for a Biomedical Digital Library MEDLINE. Advisor : Dr. Hsu Graduate : Jing Wei Lin Authors :lllhoi Yoo and Xiaohua Hu. 2006.JCDL.10. Outline. Motivation Objective Use MeSH Ontology on Vector Space Model

apryce
Download Presentation

A Comprehensive Comparison Study of Document Clustering for a Biomedical Digital Library MEDLINE

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Comprehensive Comparison Study of Document Clustering for a Biomedical Digital Library MEDLINE Advisor : Dr. Hsu Graduate : Jing Wei Lin Authors :lllhoi Yoo and Xiaohua Hu 2006.JCDL.10

  2. Outline • Motivation • Objective • Use MeSH Ontology on Vector Space Model • Classification of Document Clustering Approaches • Experiment Results • Conclusion

  3. Motivation • Document clustering has been used for better document retrieval, document browsing, and text mining in digital library.

  4. Objective • The goal of this paper is to perform a comprehensive comparison study of various document clustering approaches in terms of the efficiency, the effectiveness, and the scalability.

  5. Use MeSH Ontology on Vector Space Model • Problem : • Each document d is represented as a high dimensional vector of words/terms frequencies. • Solution :For example, “Neoplasms” as a descriptor has the following entry terms {“Cancer”, “Cancers”, “Neoplasm”, “Tumors”}.

  6. Classification of Document Clustering Approaches • Hierarchical clustering algorithms • Advantage: they generate a document hierarchy so that users can drill up and drill down for specific topics of interest. • Disadvantage: because of their cubic time complexity, they are very much limited for very large documents. • Partitional clustering algorithms • Advantage: their time complexity is lower, for example, K-means is O(k*T*n). • Disadvantage: clustering results are heavily sensitive to the initial centroids because the centroids are randomly selected.

  7. CL CL w w CR CR …. …. CL w CR

  8. Experiment Results • How much does the selection method of Bisecting K-means for clusters to be bisected affect clustering results?

  9. Experiment Results • Does Bisecting K-means outperform K-means?

  10. Experiment Results • Does STC outperform hierarchical orpartitionalapproaches? STC v.s. Partitional STC v.s. Hierarchical

  11. Experiment Results • Do partitional clustering algorithmsoutperform hierarchical clusteringalgorithms?

  12. Experiment Results • Which of clustering algorithms is the most scalable or the least scalable?

  13. Experiment Results • How much does a domain ontology MeSH improve document clustering?

  14. Experiment Results

  15. Experiment Results • How the clustering evaluation metrics are related to one another? • The smaller MI and Entropy, the better clustering quality while the bigger F-measure and purity, the better clustering quality.

  16. Conclusion • The cluster selection methods of Bisecting K-means sufficiently affect clustering quality. • Bisecting Kmeans generally outperforms K-means if the cluster selection method Type A of Bisecting K-means is used. • STC provides better clustering solutions thanhierarchical algorithms but is worse than partitional clustering approaches.

  17. Conclusion • Partitional clustering approaches are significantly superior to hierarchical approaches in terms of clustering evaluation metrics and the running times. • Bisecting K-means is normally superior to other clustering methods. • A domain ontology MeSH improves document clustering for MEDLINE articles.

More Related