
Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Frizo Janssens, Wolfgang Glänzel, and Bart De Moor. Presented by Cindy Burklow. CS 685: Special Topics in Data Mining, Professor Dr. Jinze Liu, University of Kentucky, April 17th, 2008.


Presentation Transcript


  1. Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis • Frizo Janssens, Wolfgang Glänzel, and Bart De Moor • Presented by Cindy Burklow • CS 685: Special Topics in Data Mining • Professor Dr. Jinze Liu • University of Kentucky • April 17th, 2008

  2. Outline • Introduction • Motivation • Related Work • Proposed Models • Proposed Algorithms • Results: Hybrid & Dynamic Clustering • Discussion of Pros and Cons • Questions • References

  3. Introduction • Bioinformatics … • Computer Science • Information Technology • Solves problems in Biomedicine • Goal of Paper: Investigate • Cognitive structure • Dynamics of bioinformatics core • Sub-disciplines • ISI Web of Science & MEDLINE • Retrieval of core literature in bioinformatics

  4. MeSH = Medical Subject Headings Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 360, 368, KDD '07. ACM, San Jose, CA, August 2007.

  5. Motivation • Bioinformatics field … • Dynamic • Evolving discipline • Fast growth rate • Monitor current trends • Predict future direction • Decision Making • Grants • Business Ventures • Research Opportunities

  6. Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.

  7. Related Work • Web mining • Bibliometrics • Text mining & citation analysis • Mapping of knowledge • Charting science & technology fields • Textual & graph-based approaches • Different perceptions of similarity between documents or groups of documents

  8. Related Work: Establishing the Data Set • Patra & Mishra – Bibliometric Study • MeSH term based • Liberal delineation strategy with maximal recall • Broader interpretation of bioinformatics • Less restricted search strategy • Broader coverage of underlying database • 14,563 journal papers

  9. Related Work • Hybrid Clustering • He – Unsupervised spectral clustering of web pages • Wang & Kitsuregawa – Contents-link coupled clustering algorithm of web pages • Dynamic hybrid clustering • Mei & Zhai – Temporal Text Mining • Kullback-Leibler divergence for coherent themes & Hidden Markov Models • Griffiths & Steyvers – Latent Dirichlet Allocation with hot topics in PNAS abstracts

  10. Models – Data Set: Bibliometric Retrieval Strategy • Novel subject delineation strategy • Retrieve core literature • Combines textual components with bibliometric, citation-based techniques • Web of Science Edition of Thomson Scientific • 7401 bioinformatics-related papers • 1981 to 2004 • Titles, abstracts, author keywords, and MeSH terms

  11. Models – Text Analysis • All text was indexed with the Jakarta Lucene platform • Encoded in the Vector Space Model using the TF-IDF weighting scheme • Text-based similarities • Cosine of the angle between the vector representations of two papers • Stop words excluded during indexing • Porter stemmer • Applied to all remaining terms from titles and abstracts • Bigrams • Candidate list of MeSH descriptors, author keywords, and noun phrases • Latent Semantic Indexing (LSI) – 10 terms
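
The TF-IDF and cosine steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Lucene pipeline the paper used: the tokenized, already-stemmed toy documents, the function names `tfidf_vectors` and `cosine`, and the plain `tf * log(N / df)` weighting are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical stemmed documents (stop words already removed).
docs = [
    ["protein", "structur", "predict"],
    ["protein", "fold", "predict"],
    ["microarray", "gene", "express"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # papers sharing two terms
sim_02 = cosine(vecs[0], vecs[2])  # papers sharing no terms
```

Papers that share terms get a positive similarity, while papers with no terms in common get zero. The text-based distance fed to clustering could then be taken as one minus the cosine, which is again an assumption here.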

  12. Models – Citation Analysis • Citation Graphs • Link-based algorithms • HITS • PageRank • Representative Publications • [Diagram: Quantify Similarities. Labels: Documents; Text-based; Citation-based; Boolean Input Vectors; Cosine; Bibliographic coupling (BC); Co-citation; Combine] Image Reference: Google Logo from http://www.google.com
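
One reading of the diagram labels on this slide is that each document is represented as a Boolean vector over cited references, and that two documents are compared by the cosine of those vectors (bibliographic coupling). The sketch below follows that reading; the paper identifiers, the `references` data, and the helper names are hypothetical.

```python
import math

def boolean_cosine(a, b):
    """Cosine between two Boolean vectors, each given as the set of its non-zero positions."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical reference lists: paper id -> set of cited reference ids.
references = {
    "p1": {"r1", "r2", "r3", "r4"},
    "p2": {"r2", "r3", "r4", "r5"},
    "p3": {"r6"},
}

# Bibliographic coupling: two papers are similar when they cite the same references.
bc_12 = boolean_cosine(references["p1"], references["p2"])
bc_13 = boolean_cosine(references["p1"], references["p3"])

# Co-citation: two references are similar when the same papers cite them.
cited_by = {}
for paper, refs in references.items():
    for ref in refs:
        cited_by.setdefault(ref, set()).add(paper)
cocit_r2_r3 = boolean_cosine(cited_by["r2"], cited_by["r3"])
```

Here `p1` and `p2` share three of their four references, so their coupling strength is 0.75, while `p1` and `p3` share none.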

  13. Models – Clustering • Agglomerative Hierarchical Clustering Algorithm with Ward’s Method • Hard Clustering Algorithm: • Every publication is assigned to exactly 1 cluster. Image Reference: Clustering Analysis - http://en.wikipedia.org/wiki/Data_clustering
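
To make the idea of agglomerative clustering with Ward's criterion concrete, here is a naive pure-Python sketch. It assumes points with explicit coordinates and uses the centroid form of the Ward merge cost, whereas the paper clusters on document distances; the function name `ward_clustering` and the toy points are illustrative only, and the cubic running time makes it suitable for small examples only.

```python
def ward_clustering(points, n_clusters):
    """Greedy agglomerative clustering; returns exactly one label per point (hard clustering)."""
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Ward's criterion: increase in within-cluster sum of squares caused by merging a and b.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [None] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels

# Two well-separated toy groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ward_clustering(points, 2)
```

Every point ends up in exactly one cluster, which is the hard-clustering property stated on the slide.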

  14. Models – Clustering: Optimal Number of Clusters • Strategy: combine distance-based & stability-based methods • Silhouette curves: mean text- and citation-based • Dendrogram observation • Stability diagram Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, 365, KDD '07. ACM, San Jose, CA, August 2007.
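
A silhouette value compares how close a document is to its own cluster with how close it is to the nearest other cluster. The sketch below computes the mean silhouette from a precomputed distance matrix; the function name `mean_silhouette` and the one-dimensional toy data are assumptions. It needs at least two clusters.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette value for a square distance matrix and one label per item."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the item's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

# Toy data: four points on a line, distance = absolute difference.
xs = [0, 1, 10, 11]
dist = [[abs(p - q) for q in xs] for p in xs]
good = mean_silhouette(dist, [0, 0, 1, 1])  # natural grouping
bad = mean_silhouette(dist, [0, 1, 0, 1])   # mixed grouping
```

Evaluating this for each candidate number of clusters gives a silhouette curve. Following the slide, one could compute it once on text-based distances and once on citation-based distances and average the two curves.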

  15. Proposed Algorithm – Hybrid Clustering • Cluster Input: Distances • Combining text mining and bibliometrics • Integrate text & citation information early in the mapping process, before applying the clustering algorithm • Weighted linear combination • Fisher’s inverse chi-square method Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 362, 363, KDD '07. ACM, San Jose, CA, August 2007.
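
The two integration options named on this slide can be illustrated as follows. This is a sketch under stated assumptions: the weight `alpha`, the function names, and the input values are hypothetical, and the p-values are taken as given because the slides do not say how they are derived from the distances. For two p-values, Fisher's statistic follows a chi-square distribution with 4 degrees of freedom, whose upper tail has the closed form used below.

```python
import math

def weighted_linear(d_text, d_cite, alpha=0.5):
    """Weighted linear combination of a text-based and a citation-based distance."""
    return alpha * d_text + (1.0 - alpha) * d_cite

def fisher_combine(p_text, p_cite):
    """Fisher's inverse chi-square method for two p-values in (0, 1]."""
    # Test statistic: X = -2 * (ln p1 + ln p2), chi-square with 4 degrees of freedom.
    x = -2.0 * (math.log(p_text) + math.log(p_cite))
    # Upper-tail probability of chi-square with 4 df: exp(-x/2) * (1 + x/2).
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

d_hybrid = weighted_linear(0.2, 0.6, alpha=0.5)
p_hybrid = fisher_combine(0.05, 0.05)
```

The linear combination needs a weight to be chosen, while Fisher's method combines the two sources on a common probability scale, which avoids having to match the scales of text-based and citation-based distances by hand.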

  16. Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 363, KDD '07. ACM, San Jose, CA, August 2007.

  17. Proposed Algorithm – Dynamic Hybrid Clustering • Goal: Match & track clusters through time • Process: • Separate hybrid clustering for each period • Determine optimal number of clusters • Dendrogram • Silhouette curve • Ben-Hur stability plot • Construct complete graph • All cluster centroids from each period as nodes • Edge weights as mutual cosine similarities in LSS • Form Cluster Chains • Keep edge weights > threshold, T1 • Allow qualifying clusters to join > threshold, T2
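
The chain-forming step can be sketched as thresholding the centroid graph and collecting connected components. This is a simplified illustration, not the paper's procedure: it only applies the first threshold (T1), it assumes that links are made between clusters of different periods, and the period labels, centroid vectors, and function names are hypothetical. The second step, in which qualifying clusters join a chain above T2, is omitted.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def cluster_chains(centroids, t1):
    """centroids: dict mapping (period, cluster_id) -> centroid vector."""
    nodes = list(centroids)
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(nodes, 2):
        # Assumption: only clusters from different periods are linked.
        if a[0] != b[0] and cosine(centroids[a], centroids[b]) > t1:
            parent[find(a)] = find(b)

    chains = {}
    for n in nodes:
        chains.setdefault(find(n), []).append(n)
    # A chain needs at least two clusters.
    return [sorted(members) for members in chains.values() if len(members) > 1]

# Hypothetical periods and centroids.
centroids = {
    ("1981-1990", 0): [1.0, 0.0, 0.0],
    ("1981-1990", 1): [0.0, 1.0, 0.0],
    ("1991-2000", 0): [0.9, 0.1, 0.0],
    ("1991-2000", 1): [0.0, 0.0, 1.0],
    ("2001-2004", 0): [1.0, 0.05, 0.0],
}
chains = cluster_chains(centroids, t1=0.8)
```

In this toy example, cluster 0 is tracked through all three periods, while the remaining clusters have no sufficiently similar counterpart and stay unchained.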

  18. Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.

  19. Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.

  20. Result – Hybrid Clustering: Silhouette Curve Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.

  21. Result – Hybrid Clustering: Silhouette Curve Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.

  22. Result – Hybrid Clustering: Stability Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.

  23. Result – Hybrid Clustering: Dendrogram Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.

  24. Result – Hybrid Clustering: Cluster Characterization (cluster labels and paper counts, in the order extracted from the slide figure) • Microarray analysis: 1147 • Protein structure prediction; Phylogeny & Evolution: 1167; 749 • Genome sequencing & assembly; Molecular DBs & annotation platforms: 640; 1091 • Systems biology & molecular networks: 694 • Gene / promoter / motif prediction; Multiple sequence alignment: 995; 713 • RNA structure prediction: 205

  25. Result – Dynamic Clustering: Histogram Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.

  26. Result – Dynamic Clustering: Cluster Chains Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.

  27. Yearly Publication Output among Cluster Chains Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.

  28. Dynamic Term Network Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.

  29. Pros & Cons • Pros • Offers fresh perspective on clustering • Integrates various techniques • Provides insight into bioinformatics • Cons • Challenge of selecting the optimal number of clusters still exists • There are many steps required to implement their approach

  30. Questions

  31. References • Janssens, F., Glänzel, W., and De Moor, B. 2007. Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD international Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12 - 15, 2007). KDD '07. ACM, New York, NY, 360-369. DOI= http://doi.acm.org/10.1145/1281192.1281233 • ISI Web of Science Image: http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tab=WOS&product=WOS&last_prod=WOS&SID=3DamC8GFDKmpBLhFOIM&search_mode=GeneralSearch • PubMed Image: http://www.ncbi.nlm.nih.gov/pubmed/ • The Apache Jakarta Project: http://lucene.apache.org/java/1_4_3/ • Fisher’s Method: http://en.wikipedia.org/wiki/Fisher%27s_method • “Data Mining - Concepts and techniques” by Han and Kamber, Morgan Kaufmann, 2006. (ISBN:1-55860-901-6)
