Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Frizo Janssens, Wolfgang Glänzel, Bart De Moor
frizo.janssens@esat.kuleuven.be
Katholieke Universiteit Leuven – ESAT/SCD – Steunpunt O&O Indicatoren

Overview of the presentation
• Introduction
  • General context & objectives
  • Clustering
  • Text mining framework
  • Bibliometrics, citation analysis
• Hybrid (integrated) clustering
  • Linear combination
  • Fisher’s inverse chi-square method
• Dynamic hybrid mapping of bioinformatics
• Conclusions
• Further research

General context
• Mapping of scientific and technological fields by using clustering algorithms and techniques from bibliometrics and text mining
• Complementary views on a document set → different perceptions of similarity
  • Textual information: number of words in common
  • Citation networks, bibliometric properties
• Goal:
  • Integrate text mining & bibliometrics (hybrid approach)
  • Better clustering and classification performance
  • Mapping the cognitive structure and dynamics of bioinformatics

Agglomerative hierarchical clustering
[Figure: 20 persons (10 women, 10 men) as ‘objects’, described by the features hair color, length, and interest in football. (a) Choice of features: which have more discriminative power? (b) Distance matrix (e.g. Euclidean) between Person 1 … Person 20. (c) Agglomerative hierarchical clustering with a ‘linkage’ rule builds a binary tree (hypothetical dendrogram over P1 … P20), cut into 2 clusters.]
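The procedure in the figure (pairwise Euclidean distances, then repeated merging of the two closest clusters) can be sketched in a few lines. This is only an illustration: the slide does not say which linkage rule is used, so average linkage is an assumption, and the toy data (length in cm, interest in football on a 0-10 scale) are invented.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(points):
    """Full symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    return [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]

def agglomerative(dist, n_clusters):
    """Average-linkage agglomerative clustering on a precomputed distance matrix.

    Starts with one cluster per object and merges the two closest clusters
    until only n_clusters remain (i.e. the dendrogram is cut at that level).
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage (assumed): mean distance over all cross pairs.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical toy data: (length in cm, interest in football from 0 to 10).
persons = [(165, 1), (168, 2), (170, 0), (182, 9), (185, 8), (179, 10)]
groups = agglomerative(distance_matrix(persons), n_clusters=2)
print(groups)
```

In practice the features would be rescaled first, since a feature with a large numeric range (here length) otherwise dominates the Euclidean distance.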

Indexing in Vector Space Model
• Digital documents (Doc 1, Doc 2, Doc 3, …, Doc n) → text extraction (.txt)
• Neglect structure, stop word removal, stemming, phrase detection, …
• ‘Bags of words’ remain
• ‘Indexing’, weighting (e.g., TF-IDF) → term-by-document matrix A (rows: the vocabulary)

          Doc 1   Doc 2   Doc 3   ...   Doc n
Term 1    0.4     0.2     0       ...   0
Term 2    0.1     0.55    0       ...   0
Term 3    0.25    0       0.12    ...   0
Term 4    0       0.16    0.24    ...   0.03
...       ...     ...     ...     ...   ...
Term m    0       0.21    0       ...   0.42

• Similarity between documents = cosine of the angle between their vectors
[Figure: an example source document (“Towards Mapping Library and Information Science”) and two document vectors, Doc 1 and Doc 2, drawn against term axes]
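As a rough illustration of the indexing step, the sketch below builds TF-IDF vectors from bags of words and compares documents by the cosine of the angle between their vectors. It is a simplification under stated assumptions: stop word removal, stemming, and phrase detection are left out, the weighting is one common TF-IDF variant (relative term frequency times log(n/df)) rather than necessarily the one used in the presentation, and the three toy documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one TF-IDF vector per document.

    Each vector is a column of the term-by-document matrix A.
    """
    bags = [Counter(doc.lower().split()) for doc in docs]  # bags of words
    vocab = sorted(set(term for bag in bags for term in bag))
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = {term: sum(1 for bag in bags if term in bag) for term in vocab}
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append([(bag[t] / total) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy documents.
docs = [
    "protein structure prediction",
    "protein structure alignment",
    "microarray gene expression",
]
vocab, vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # shares 'protein' and 'structure'
sim_02 = cosine_similarity(vecs[0], vecs[2])  # no terms in common
print(round(sim_01, 3), sim_02)
```

A text-based distance can then be taken as one minus the cosine similarity, which is what a distance-based clustering would consume.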

Bibliometrics and network analysis
• Bibliographic coupling
[Figure: bibliographic coupling illustrated with two documents, x and y]

Hybrid (integrated) clustering
• Integrate complementary information
  • Textual content
  • Citations
  • Other bibliometric indicators
• Intermediate integration
  • Pairwise distances calculated in separate spaces
  • Incorporated before clustering

Hybrid clustering: intermediate integration
• Text-based distance matrix Dtext (documents × documents)
• Distance matrix based on bibliometrics Dbibl (documents × documents): co-citation or bibliographic coupling
• Integrated distance matrix Di (documents × documents), obtained using
  • Weighted linear combination
  • Fisher’s inverse chi-square method
• Hierarchical clustering on Di
• Internal validation: number of clusters?
  • Dendrogram
  • Silhouette curves (text-based distances, distances based on co-citation or bibliographic coupling, integrated distances)
  • Silhouette plot
  • Stability diagram
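To make the silhouette-based validation concrete, here is a minimal sketch of the mean silhouette value for a given cluster assignment, computed directly from a precomputed distance matrix (text-based, bibliometric, or integrated). The function name and the 4-document matrix are illustrative assumptions; a silhouette curve would be obtained by evaluating this value for a range of cluster numbers.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette value of a labeling, given a precomputed distance matrix.

    For object i: a = mean distance to its own cluster, b = smallest mean
    distance to any other cluster, s(i) = (b - a) / max(a, b).
    Requires at least two clusters.
    """
    members = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = members[lab]
        if len(own) == 1:
            continue  # by convention a singleton contributes a silhouette of 0
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in other) / len(other)
            for other_lab, other in members.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(labels)

# Hypothetical distance matrix for 4 documents: {0, 1} and {2, 3} are close pairs.
D = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
good = mean_silhouette(D, [0, 0, 1, 1])  # matches the structure in D
bad = mean_silhouette(D, [0, 1, 0, 1])   # splits both close pairs
print(good, bad)
```

A partition that respects the structure in the distances scores close to 1, while one that cuts across it scores below 0, which is why the maximum of the curve is a candidate for the number of clusters.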

Weighted linear combination (linco)
• Di = α · Dtext + (1-α) · Dbibl
• Attractive, easy, and scalable
• However, neglects differences in distributional characteristics!
  • Unequal or unfair contribution of data sources
  • Implicitly favoring text over bibliometric information or vice versa
[Figure: histograms of mutual distances (<1) based on text (left) and BC (right)]
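The weighted linear combination amounts to an element-wise blend of the two distance matrices. The sketch below assumes both matrices are square, cover the same documents in the same order, and are given as nested lists; the function name and the numbers are illustrative only.

```python
def linear_combination(d_text, d_bibl, alpha):
    """Integrated distances Di = alpha * Dtext + (1 - alpha) * Dbibl, element-wise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n = len(d_text)
    return [
        [alpha * d_text[i][j] + (1.0 - alpha) * d_bibl[i][j] for j in range(n)]
        for i in range(n)
    ]

# Hypothetical 2-document example.
d_text = [[0.0, 0.6], [0.6, 0.0]]
d_bibl = [[0.0, 1.0], [1.0, 0.0]]
d_int = linear_combination(d_text, d_bibl, alpha=0.5)
print(d_int[0][1])
```

The drawback named on the slide is visible here: if one source yields distances that cluster near 1 and the other spreads them more widely, the same α weights very different amounts of information, so one source can dominate the result regardless of the nominal weight.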

Fisher’s inverse chi-square method
• ‘Omnibus statistic’ from statistical meta-analysis
• Combine p-values from multiple sources
• Freed from distributional differences
• Avoids overcompensation of either data source

Fisher’s inverse chi-square method
[Figure: the ‘real’ text data (terms × documents) yield the distance matrix Dt, and the ‘real’ citation data (citations × documents) yield Dbc. Each data set is also randomized; the cumulative distribution (cdf, cumulative share versus distance) of the distances in the randomized data converts every real distance into a p-value (p1 for text, p2 for citations). The two p-value matrices are integrated with Fisher’s omnibus statistic, pi = -2 · log(p1^λ · p2^(1-λ)), giving the integrated matrix Di.]
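The two steps of the method, converting distances to p-values against randomized data and combining the p-values, can be sketched as follows. Several details are assumptions rather than taken from the slides: the distances from the randomized data are passed in as a plain list (how the data are randomized is not shown here), the empirical p-value uses add-one smoothing so that the logarithm is always defined, and the mapping from the combined statistic to the final integrated matrix Di is left out because the transcript does not spell it out.

```python
import bisect
import math

def empirical_p_values(real_dists, null_dists):
    """p-value of each real distance = share of randomized distances <= it.

    This reads the value off the empirical cdf of the randomized data.
    Add-one smoothing (an assumption) keeps every p-value strictly positive.
    """
    ordered = sorted(null_dists)
    m = len(ordered)
    return [(bisect.bisect_right(ordered, d) + 1) / (m + 1) for d in real_dists]

def fisher_combine(p1, p2, lam=0.5):
    """Weighted omnibus statistic -2 * log(p1**lam * p2**(1 - lam)).

    Written as a weighted sum of logs, which is algebraically identical
    and avoids underflow for very small p-values.
    """
    return -2.0 * (lam * math.log(p1) + (1.0 - lam) * math.log(p2))

# Hypothetical distances for three document pairs and for randomized data.
text_real, text_null = [0.1, 0.5, 0.9], [0.2, 0.4, 0.6, 0.8]
bc_real, bc_null = [0.7, 0.95, 1.0], [0.9, 0.96, 0.98, 0.99]

p_text = empirical_p_values(text_real, text_null)
p_bc = empirical_p_values(bc_real, bc_null)
stats = [fisher_combine(p1, p2, lam=0.5) for p1, p2 in zip(p_text, p_bc)]
print(p_text, p_bc, stats)
```

Because both sources are first mapped onto the same p-value scale, the differently shaped distance distributions of text and bibliographic coupling no longer decide which source dominates. A pair that is unusually close in both sources gets small p-values and therefore a large statistic.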

Fisher’s inverse chi-square method
• Histogram of pairwise document distances for text and BC
• Histogram of p-values for real data w.r.t. randomized datasets

Conclusions from previous research
• Text-only >> cited references
• SVD greatly ameliorates results, especially for text (LSI)
• Best performance: integration!
  • Fisher's inverse chi-square
    • Significantly > text-only, link-only, & concatenation
    • No significant difference with lincos when SVD is used
    • Generic, can incorporate distances with highly dissimilar distributions
  • Weighted linco: good option if LSI is used
• F. Janssens, V. Tran Quoc, W. Glänzel, and B. De Moor. Integration of textual content and link information for accurate clustering of science fields. In Proceedings of the I International Conference on Multidisciplinary Information Sciences & Technologies (InSciT2006). Current Research in Information Sciences and Technologies, volume I, pages 615–619, Mérida, Spain, October 2006.

Dynamic hybrid mapping of bioinformatics Total: 7401

Number of clusters and LSI factors

Number of clusters: stability diagram

Number of clusters: link-based Silhouette values

Dendrogram
1. RNA structure prediction
2. Protein structure prediction
3. Systems biology & molecular networks
4. Phylogeny & evolution
5. Genome sequencing & assembly
6. Gene/promoter/motif prediction
7. Molecular DBs & annotation platforms
8. Multiple sequence alignment
9. Microarray analysis

Dynamics

Dynamic term networks

Conclusions
• Main contributions
  • Hybrid clustering (of bioinformatics)
  • Clustering and classification significantly improved
  • Generic: other application domains
• Further research
  • Fuzzy clustering
  • Semi-supervised clustering and active learning
  • Spectral clustering
  • Other matrix decompositions (e.g., NMF)
  • Multilinear (tensor) algebra
  • Mapping the world’s total yearly publication output
  • Detect emerging and converging clusters & hot topics
  • Science-technology interaction
