Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis. Frizo Janssens, Wolfgang Glänzel, and Bart De Moor. Presented by Cindy Burklow. CS 685: Special Topics in Data Mining, Professor Dr. Jinze Liu, University of Kentucky, April 17th, 2008.
Outline • Introduction • Motivation • Related Work • Proposed Models • Proposed Algorithms • Results: Hybrid & Dynamic Clustering • Discussion of Pros and Cons • Questions • References
Introduction • Bioinformatics … • Computer Science • Information Technology • Solves problems in Biomedicine • Goal of Paper: Investigate • Cognitive structure • Dynamics of bioinformatics core • Sub-disciplines • ISI Web of Science & MEDLINE • Retrieval of core literature in bioinformatics
MeSH = Medical Subject Headings Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 360, 368, KDD '07. ACM, San Jose, CA, August 2007.
Motivation • Bioinformatics field … • Dynamic • Evolving discipline • Fast growth rate • Monitor current trends • Predict future direction • Decision Making • Grants • Business Ventures • Research Opportunities
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.
Related Work • Web mining • Bibliometrics • Text mining & citation analysis • Mapping of knowledge • Charting science & technology fields • Textual & graph-based approaches • Different perceptions of similarity between documents or groups of documents
Related Work – Establishing the Data Set • Patra & Mishra – Bibliometric study • MeSH term based • Liberal delineation strategy with maximal recall • Broader interpretation of bioinformatics • Less restricted search strategy • Broader coverage of underlying database • 14,563 journal papers
Related Work • Hybrid clustering • He – Unsupervised spectral clustering of web pages • Wang & Kitsuregawa – Contents-link coupled clustering algorithm for web pages • Dynamic hybrid clustering • Mei & Zhai – Temporal text mining • Kullback-Leibler divergence for coherent themes & Hidden Markov Models • Griffiths & Steyvers – Latent Dirichlet Allocation with hot topics in PNAS abstracts
Models – Data Set: Bibliometric Retrieval Strategy • Novel subject delineation strategy • Retrieve core literature • Combines textual components with bibliometric, citation-based techniques • Web of Science Edition of Thomson Scientific • 7401 bioinformatics-related papers • 1981 to 2004 • Titles, abstracts, author keywords, and MeSH terms
Models – Text Analysis • All text was indexed with the Jakarta Lucene platform • Encoded in the Vector Space Model using the TF-IDF weighting scheme • Text-based similarities • Cosine of the angle between the vector representations of two papers • Stop words removed during indexing • Porter stemmer • Applied to all remaining terms from titles and abstracts • Bigrams • Candidate list of MeSH descriptors, author keywords, and noun phrases • Latent Semantic Indexing (LSI) – 10 terms
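The TF-IDF encoding and cosine similarity named on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the Lucene-based pipeline the authors used: the function names `tfidf_vectors` and `cosine`, the toy stemmed documents, and the plain tf * log(N/df) weighting variant are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting variant: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical, already stemmed and stop-word-filtered documents.
docs = [
    ["protein", "structur", "predict"],
    ["protein", "fold", "predict"],
    ["microarray", "gene", "express"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # two protein papers share terms
sim_02 = cosine(vecs[0], vecs[2])  # no shared terms
```

A text-based distance for clustering can then be taken as 1 minus the cosine similarity.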
Models – Citation Analysis • Citation graphs • Link-based algorithms • HITS • PageRank • Representative publications • Quantify similarities • Text-based • Citation-based • Documents as Boolean input vectors • Cosine • Bibliographic coupling (BC) • Co-citation • Combine Image Reference: Google Logo from http://www.google.com
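For Boolean input vectors, the cosine reduces to a set computation, which makes the bibliographic coupling similarity easy to illustrate. The sketch below is an assumption-laden example: `coupling_cosine` and the toy reference lists are hypothetical names and data, not taken from the paper.

```python
import math

def coupling_cosine(refs_a, refs_b):
    """Cosine between two Boolean reference vectors.

    For 0/1 vectors the dot product is the number of shared references and
    each norm is the square root of the reference count.
    """
    if not refs_a or not refs_b:
        return 0.0
    shared = len(refs_a & refs_b)
    return shared / math.sqrt(len(refs_a) * len(refs_b))

# Hypothetical reference lists: paper -> set of cited works.
cited = {
    "paper_a": {"r1", "r2", "r3", "r4"},
    "paper_b": {"r3", "r4", "r5", "r6"},
    "paper_c": {"r7"},
}
bc_ab = coupling_cosine(cited["paper_a"], cited["paper_b"])  # 2 shared of 4 and 4
bc_ac = coupling_cosine(cited["paper_a"], cited["paper_c"])  # nothing shared
```

The same function gives a co-citation similarity if each set holds the papers that cite a document rather than the works it cites.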
Models – Clustering • Agglomerative Hierarchical Clustering Algorithm with Ward’s Method • Hard Clustering Algorithm: • Every publication is assigned to exactly 1 cluster. Image Reference: Clustering Analysis - http://en.wikipedia.org/wiki/Data_clustering
Models – Clustering: Optimal Number of Clusters • Strategy: combine distance-based & stability-based methods • Silhouette curves: mean text-based and citation-based • Dendrogram observation • Stability diagram Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, 365, KDD '07. ACM, San Jose, CA, August 2007.
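A silhouette curve plots the mean silhouette value against the number of clusters. The standard silhouette definition can be sketched as follows; the helper names, the singleton convention (value 0), and the one-dimensional toy data are assumptions for illustration, and the distance matrix could equally hold text-based or citation-based distances.

```python
def silhouette_values(dist, labels):
    """Silhouette value per item, given a symmetric distance matrix and hard labels."""
    members = {}
    for idx, lab in enumerate(labels):
        members.setdefault(lab, []).append(idx)
    if len(members) < 2:
        raise ValueError("silhouette needs at least two clusters")
    values = []
    for i in range(len(labels)):
        own = [j for j in members[labels[i]] if j != i]
        if not own:
            values.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the item's own cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(dist[i][j] for j in other) / len(other)
            for lab, other in members.items() if lab != labels[i]
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

def mean_silhouette(dist, labels):
    values = silhouette_values(dist, labels)
    return sum(values) / len(values)

# Toy data: four points on a line, two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
good = mean_silhouette(dist, [0, 0, 1, 1])  # matches the natural grouping
bad = mean_silhouette(dist, [0, 1, 0, 1])   # splits each natural group
```

Computing `mean_silhouette` for each candidate cut of the dendrogram gives one point of the curve; a higher mean suggests a better-separated partition.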
Proposed Algorithm – Hybrid Clustering • Cluster input: distances • Combining text mining and bibliometrics • Integrate text & citation information early in the mapping process, before applying the clustering algorithm • Weighted linear combination • Fisher’s inverse chi-square method Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 362, 363, KDD '07. ACM, San Jose, CA, August 2007.
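Both integration options on this slide can be illustrated for a single pair of papers. The sketch below is not the authors' implementation: the function names, the weight, the toy null samples, and the add-one smoothing are assumptions. It assumes each distance is first turned into a p-value against a sample of distances from randomized data, and then applies Fisher's method, whose statistic -2(ln p1 + ln p2) follows a chi-square distribution with 4 degrees of freedom when two p-values are combined.

```python
import math
from bisect import bisect_right

def weighted_combination(d_text, d_cite, weight=0.5):
    """Weighted linear combination of a text-based and a citation-based distance."""
    return weight * d_text + (1.0 - weight) * d_cite

def empirical_p(distance, null_sorted):
    """Share of null distances <= the observed one (small distance -> small p)."""
    rank = bisect_right(null_sorted, distance)
    # Add-one smoothing (an assumption) keeps p strictly positive so log(p) exists.
    return (rank + 1) / (len(null_sorted) + 1)

def fisher_combine(p_text, p_cite):
    """Fisher's inverse chi-square method for two p-values."""
    stat = -2.0 * (math.log(p_text) + math.log(p_cite))
    # Chi-square survival function with 4 degrees of freedom, in closed form.
    return math.exp(-stat / 2.0) * (1.0 + stat / 2.0)

# Hypothetical null samples of distances from randomized data.
null_text = sorted([0.70, 0.80, 0.85, 0.90, 0.95])
null_cite = sorted([0.90, 0.92, 0.95, 0.97, 0.99])

linear = weighted_combination(0.2, 0.6, weight=0.25)
p_text = empirical_p(0.75, null_text)
p_cite = empirical_p(0.91, null_cite)
hybrid = fisher_combine(p_text, p_cite)  # combined p-value used as the integrated distance
```

One reason to prefer the p-value route is that text-based and citation-based distances tend to have very different distributions, so a fixed linear weight may let one source dominate.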
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 363, KDD '07. ACM, San Jose, CA, August 2007.
Proposed Algorithm – Dynamic Hybrid Clustering • Goal: Match & track clusters through time • Process: • Separate hybrid clustering for each period • Determine optimal number of clusters • Dendrogram • Silhouette curve • Ben-Hur stability plot • Construct complete graph • All cluster centroids from each period as nodes • Edge weights as mutual cosine similarities in LSS • Form cluster chains • Keep edges with weight > threshold T1 • Allow qualifying clusters to join if weight > threshold T2
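The chain-forming step can be sketched as thresholding the centroid graph and reading off connected components. This is a simplified illustration under stated assumptions: the names `cluster_chains` and `t1`, the toy centroid vectors, and the restriction to edges between different periods are all hypothetical, and the second threshold T2 is not modeled here.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_chains(centroids, t1):
    """Group (period, cluster) nodes into chains via edges with cosine > t1."""
    parent = {node: node for node in centroids}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in combinations(centroids, 2):
        # Assumption: only clusters from different periods are linked.
        if a[0] != b[0] and cosine(centroids[a], centroids[b]) > t1:
            parent[find(a)] = find(b)

    chains = {}
    for node in centroids:
        chains.setdefault(find(node), []).append(node)
    return sorted(sorted(chain) for chain in chains.values())

# Hypothetical centroids keyed by (period, cluster id).
centroids = {
    ("P1", "A"): [1.0, 0.0, 0.1],
    ("P1", "B"): [0.0, 1.0, 0.0],
    ("P2", "A"): [0.9, 0.1, 0.0],
    ("P2", "B"): [0.1, 0.9, 0.2],
    ("P3", "A"): [0.0, 0.1, 1.0],
}
chains = cluster_chains(centroids, t1=0.8)
```

In this toy case the two P1 clusters each link to their P2 counterpart, while the P3 cluster stays a chain of its own, which would correspond to a newly emerging topic.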
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering: Silhouette Curve Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering: Silhouette Curve Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering: Stability Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering: Dendrogram Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering: Cluster Characterization (cluster and number of papers) • Microarray analysis: 1147 • Protein structure prediction: 1167 • Phylogeny & Evolution: 749 • Genome sequencing & assembly: 640 • Molecular DBs & annotation platforms: 1091 • Systems biology & molecular networks: 694 • Gene / promoter / motif prediction: 995 • Multiple sequence alignment: 713 • RNA structure prediction: 205
Results – Dynamic Clustering: Histogram Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Dynamic Clustering: Cluster Chains Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
Yearly Publication Output among Cluster Chains Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.
Dynamic Term Network Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.
Pros & Cons • Pros • Offers fresh perspective on clustering • Integrates various techniques • Provides insight into bioinformatics • Cons • Challenge of selecting the optimal number of clusters still exists • There are many steps required to implement their approach
References • Janssens, F., Glänzel, W., and De Moor, B. 2007. Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12-15, 2007). KDD '07. ACM, New York, NY, 360-369. DOI: http://doi.acm.org/10.1145/1281192.1281233 • ISI Web of Science Image: http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tab=WOS&product=WOS&last_prod=WOS&SID=3DamC8GFDKmpBLhFOIM&search_mode=GeneralSearch • PubMed Image: http://www.ncbi.nlm.nih.gov/pubmed/ • The Apache Jakarta Project: http://lucene.apache.org/java/1_4_3/ • Fisher’s Method: http://en.wikipedia.org/wiki/Fisher%27s_method • Han, J. and Kamber, M. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann. ISBN 1-55860-901-6.