Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis Frizo Janssens, Wolfgang Glänzel, Bart De Moor frizo.janssens@esat.kuleuven.be Katholieke Universiteit Leuven – ESAT/SCD – Steunpunt O&O Indicatoren
Overview of the presentation • Introduction • General context & objectives • Clustering • Text mining framework • Bibliometrics, citation analysis • Hybrid (integrated) clustering • Linear combination • Fisher’s inverse chi-square method • Dynamic hybrid mapping of bioinformatics • Conclusions • Further research
General context • Mapping of scientific and technological fields by using clustering algorithms and techniques from bibliometrics and text mining • Complementary views on the document set → different perceptions of similarity • Textual information: number of words in common • Citation networks, bibliometric properties • Goal: • Integrate text mining & bibliometrics (hybrid approach) • Better clustering and classification performance • Mapping the cognitive structure and dynamics of bioinformatics
Agglomerative hierarchical clustering • Example: 20 ‘objects’ (Person 1 … Person 20; 10 women, 10 men) described by features: hair color, length, interest in football • Some features may have more discriminative power than others (?) • Distance matrix between the objects (e.g. Euclidean) • Agglomerative hierarchical clustering with a ‘linkage’ criterion yields a binary tree (hypothetical dendrogram), cut here into 2 clusters • [Figure: panels (a) to (c) showing the objects in feature space, the distance matrix for P1 … P20, and the dendrogram]
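The clustering example on this slide (objects with a few features, a Euclidean distance matrix, repeated merging until 2 clusters remain) can be sketched as below. This is a minimal standard-library sketch: the feature values, the function names, and the choice of average linkage are illustrative assumptions, not taken from the slides.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(points):
    """Full symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    return [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]

def agglomerative(dist, n_clusters):
    """Average-linkage agglomerative clustering on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(dist))]  # start: every object is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average linkage: mean distance over all cross-cluster pairs.
            link = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            link /= len(clusters[a]) * len(clusters[b])
            if best is None or link < best[0]:
                best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the two closest clusters
        del clusters[b]                          # b > a, so index a stays valid
    return [sorted(c) for c in clusters]

# Hypothetical objects: (length in arbitrary units, interest in football on a 0-1 scale)
people = [(5, 0.9), (7, 0.8), (4, 1.0), (30, 0.2), (35, 0.1), (28, 0.3)]
clusters = agglomerative(distance_matrix(people), n_clusters=2)
print(clusters)
```

Recording the linkage value at each merge would give the heights needed to draw the dendrogram; the sketch only returns the final partition.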
Indexing in Vector Space Model • Digital documents (Doc 1, Doc 2, Doc 3, …, Doc n; example shown: ‘Towards Mapping Library and Information Science’) → text extraction (.txt) • Neglect structure, stop word removal, stemming, phrase detection, … • ‘Bags of words’ remain • ‘Indexing’, weighting (e.g., TF-IDF) • Term-by-document matrix A (rows: vocabulary terms, columns: documents):
          Doc 1   Doc 2   Doc 3   ...   Doc n
Term 1    0.4     0.2     0       ...   0
Term 2    0.1     0.55    0       ...   0
Term 3    0.25    0       0.12    ...   0
Term 4    0       0.16    0.24    ...   0.03
...       ...     ...     ...     ...   ...
Term m    0       0.21    0       ...   0.42
• Similarity between documents = cosine of angle between vectors • [Figure: Doc 1 and Doc 2 drawn as vectors in term space]
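A minimal sketch of the indexing step, assuming documents are already tokenized: it builds TF-IDF weighted document vectors and compares them by the cosine of their angle. The particular TF-IDF variant (relative term frequency times log(n/df)) and all names are assumptions; stop word removal, stemming, and phrase detection are left out.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))      # document frequency per term
    idf = {t: math.log(n / df[t]) for t in vocab}      # assumed IDF variant
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] / len(d) * idf[t] for t in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical toy corpus
docs = [
    "protein structure prediction".split(),
    "rna structure prediction".split(),
    "microarray gene expression".split(),
]
vocab, vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])   # documents sharing two terms
sim_02 = cosine(vecs[0], vecs[2])   # documents sharing no terms
```

A text-based distance can then be taken as 1 minus the cosine similarity.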
Bibliometrics and network analysis • Bibliographic coupling • [Figure: bibliographic coupling between documents x and y]
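Two documents are bibliographically coupled when their reference lists overlap. The sketch below counts shared references and normalizes the count with a cosine-style measure; the normalization, the function name, and the reference IDs are assumptions for illustration, since the slide does not state which coupling measure was used.

```python
import math

def coupling_similarity(refs_x, refs_y):
    """Shared references of two documents, normalized by the geometric mean of list sizes."""
    if not refs_x or not refs_y:
        return 0.0
    shared = len(refs_x & refs_y)
    return shared / math.sqrt(len(refs_x) * len(refs_y))

# Hypothetical reference lists (sets of cited-document IDs)
cited = {
    "x": {"r1", "r2", "r3", "r4"},
    "y": {"r3", "r4", "r5"},
    "z": {"r6"},
}
bc_xy = coupling_similarity(cited["x"], cited["y"])  # two shared references
bc_xz = coupling_similarity(cited["x"], cited["z"])  # no shared references
```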
Hybrid (integrated) clustering • Integrate complementary information • Textual content • Citations • Other bibliometric indicators • Intermediate integration • Pairwise distances calculated in separate spaces • Incorporated before clustering
Hybrid clustering: intermediate integration • Text-based distance matrix $D_{text}$ (documents × documents): text-based distances • Distance matrix based on bibliometrics $D_{bibl}$ (documents × documents): distances based on co-citation or bibliographic coupling • Integrated distance matrix $D_i$ (documents × documents): integrated distances, using • Weighted linear combination • Fisher’s inverse chi-square method • Hierarchical clustering on the integrated distances • Internal validation: number of clusters? • Dendrogram • Silhouette curves • Silhouette plot • Stability diagram
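One way to compare candidate partitions during internal validation is the mean silhouette width computed directly on a distance matrix. The sketch below is a plain-Python illustration with a hypothetical 4-document matrix; it assumes at least two clusters and is not the validation code behind the slides.

```python
def silhouette(dist, labels):
    """Mean silhouette width for a precomputed distance matrix (needs >= 2 clusters)."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in groups.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(labels)

# Hypothetical integrated distances between 4 documents (two tight pairs)
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
good = silhouette(dist, [0, 0, 1, 1])  # partition that follows the pairs
bad = silhouette(dist, [0, 1, 0, 1])   # partition that splits the pairs
```

Evaluating the score for several cuts of the dendrogram gives the kind of curve from which a number of clusters can be read off.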
Weighted linear combination (linco) • $D_i = \alpha \cdot D_{text} + (1-\alpha) \cdot D_{bibl}$ • Attractive, easy, and scalable • However, neglects differences in distributional characteristics! • [Figure: histograms of mutual distances (<1) based on text (left) and bibliographic coupling, BC (right)] • Unequal or unfair contribution of data sources • Implicitly favoring text over bibliometric information or vice versa
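The weighted linear combination is an element-wise blend of the two distance matrices. A minimal sketch with hypothetical matrices follows; the values are chosen so that the bibliometric distances crowd near 1, which illustrates why the two sources can contribute unequally.

```python
def linear_combination(d_text, d_bibl, alpha):
    """Element-wise alpha * D_text + (1 - alpha) * D_bibl for equally shaped matrices."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * t + (1.0 - alpha) * b for t, b in zip(row_t, row_b)]
        for row_t, row_b in zip(d_text, d_bibl)
    ]

# Hypothetical distances between 3 documents
d_text = [[0.0, 0.6, 0.8], [0.6, 0.0, 0.7], [0.8, 0.7, 0.0]]
d_bibl = [[0.0, 0.95, 1.0], [0.95, 0.0, 0.99], [1.0, 0.99, 0.0]]
d_int = linear_combination(d_text, d_bibl, alpha=0.5)
```

With these numbers the spread of the integrated distances comes almost entirely from the text matrix, even though both sources carry the same nominal weight.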
Fisher’s inverse chi-square method • ‘Omnibus statistic’ from statistical meta-analysis • Combine p-values from multiple sources • Freed from distributional differences • Avoids overcompensation of either data source
Fisher’s inverse chi-square method • ‘Real’ text data (terms × documents) → distance matrix $D_t$ (documents × documents) • ‘Real’ citation data (citations × documents) → distance matrix $D_{bc}$ (documents × documents) • Randomize both data sets; compute the cumulative distribution function (cdf, cumulative share vs. distance on [0, 1]) of the distances in the randomized text data and in the randomized citation data • A real text distance y is converted to p-value $p_1$ and a real citation distance z to p-value $p_2$ by reading off the corresponding randomized cdf • Fisher’s omnibus: $p_i = -2 \cdot \log\left(p_1^{\lambda} \cdot p_2^{1-\lambda}\right)$ • Integrated p-values → integrated matrix $D_i$ (documents × documents) • [Figure: workflow diagram with example matrices before and after randomization]
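A minimal sketch of the two steps shown on this slide: turn a real distance into a p-value via the empirical cdf of distances from randomized data, then combine the text and citation p-values with the weighted omnibus statistic. The add-one smoothing (to avoid log(0)), the function names, and all numbers are assumptions; how the statistic is mapped onto the final integrated matrix is not covered here.

```python
import math
from bisect import bisect_right

def empirical_p_value(d, random_dists_sorted):
    """Share of randomized distances <= d (empirical cdf), smoothed so p is never 0."""
    count = bisect_right(random_dists_sorted, d)
    return (count + 1) / (len(random_dists_sorted) + 1)

def fisher_omnibus(p1, p2, lam=0.5):
    """Weighted omnibus statistic -2 * log(p1**lam * p2**(1 - lam))."""
    return -2.0 * (lam * math.log(p1) + (1.0 - lam) * math.log(p2))

# Hypothetical distances observed in randomized text and citation data
rand_text = sorted([0.70, 0.75, 0.80, 0.85, 0.90, 0.92, 0.95, 0.97, 0.99])
rand_bc = sorted([0.96, 0.97, 0.98, 0.99, 0.99, 1.0, 1.0, 1.0, 1.0])

# A document pair that is unusually close in both views
p1 = empirical_p_value(0.72, rand_text)   # real text distance y
p2 = empirical_p_value(0.965, rand_bc)    # real citation distance z
stat_close = fisher_omnibus(p1, p2)

# A document pair that looks like a random pair in both views
stat_far = fisher_omnibus(
    empirical_p_value(0.98, rand_text),
    empirical_p_value(1.0, rand_bc),
)
```

Because each source is first mapped onto its own p-value scale, the raw ranges of the two distance distributions no longer determine how much each source contributes.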
Fisher’s inverse chi-square method • Histogram of pairwise document distances for text and BC • Histogram of p-values for real data w.r.t. randomized datasets
Conclusions from previous research • Text-only >> cited references • SVD greatly improves results, especially for text (LSI) • Best performance: integration! • Fisher's inverse chi-square • Significantly better than text-only, link-only, & concatenation • No significant difference from linco when SVD is applied • Generic: can incorporate distances with highly dissimilar distributions • Weighted linco: good option if LSI is used • F. Janssens, V. Tran Quoc, W. Glänzel, and B. De Moor. Integration of textual content and link information for accurate clustering of science fields. In Proceedings of the I International Conference on Multidisciplinary Information Sciences & Technologies (InSciT2006). Current Research in Information Sciences and Technologies, volume I, pages 615–619, Mérida, Spain, October 2006.
Dynamic hybrid mapping of bioinformatics Total: 7401
Dendrogram • 1. RNA structure prediction • 2. Protein structure prediction • 3. Systems biology & molecular networks • 4. Phylogeny & evolution • 5. Genome sequencing & assembly • 6. Gene/promoter/motif prediction • 7. Molecular DBs & annotation platforms • 8. Multiple sequence alignment • 9. Microarray analysis
Conclusions • Main contributions • Hybrid clustering (of bioinformatics) • Clustering and classification significantly improved • Generic: other application domains • Further Research • Fuzzy clustering • Semi-supervised clustering and active learning • Spectral clustering • Other matrix decompositions (e.g., NMF) • Multilinear (tensor) algebra • Mapping the world’s total yearly publication output • Detect emerging and converging clusters & hot topics • Science-technology interaction