Navigating Biomedical Funding Sources Using Text Mining and Machine Learning

Where Do You Go for Biomedical Funding? Yi Liu, Ahmet Altay

Background • Problem • In biomedical research there are many sources of federal funding. • How do you choose the right funding institution for a given research idea? • Data • Biomedical grant summaries from 20 institutions, covering 1972 to 2009

Pre-Processing • Cleaned the text of mark-up, meta words, and duplicates • Removed institutions with fewer than 5000 grants • Bag-of-words approach with a pre-determined dictionary • Removed 319 stop words from the text • Used stemming (Porter) to further collapse the text • Dictionary size of 83485 with 120636 distinct spellings • Used mgrep to annotate our data with dictionary words
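The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for the slides: the stop-word list, the example dictionary, and the function names `tokenize` and `bag_of_words` are all hypothetical, mgrep is not used, and stemming is left as a pluggable `stem` argument (identity by default) where a Porter stemmer could be supplied.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; the slides mention 319 stop words.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "for", "is", "on", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, dictionary, stem=lambda word: word):
    """Count dictionary terms in `text` after stop-word removal and stemming.

    `stem` defaults to the identity function (an assumption for this sketch);
    a Porter stemmer could be passed in instead.
    """
    counts = Counter()
    for token in tokenize(text):
        if token in STOP_WORDS:
            continue
        term = stem(token)
        if term in dictionary:
            counts[term] += 1
    return counts


# Hypothetical dictionary and abstract, for illustration only.
dictionary = {"protein", "kinase", "tumor", "cell"}
abstract = "The role of protein kinase signaling in tumor cell growth and cell death"
result = bag_of_words(abstract, dictionary)
print(result)
```

Words outside the dictionary (here "role", "signaling", "growth", "death") are simply ignored, which is what restricts the representation to a pre-determined vocabulary.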

Histogram for Stems per Abstract

Processing • Generate a TFIDF matrix given the dictionary and abstracts • TFIDF matrix is huge (83435 by 561769) • Reduce the TFIDF matrix for computational efficiency • Remove dictionary terms with zero counts and abstracts with no dictionary terms • Use SVD to represent the data in a smaller subspace of the original matrix • Singular values decrease quickly. We used the first 100 eigenvectors without losing much precision.
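As a rough illustration of the processing steps above, the sketch below builds a TFIDF matrix from a small terms-by-abstracts count matrix, drops empty rows and columns, and projects the abstracts onto the leading singular vectors. It assumes NumPy and a dense matrix, which would not be practical at the sizes quoted on the slide. The TFIDF weighting shown (term frequency times log inverse document frequency) is one common variant and an assumption here, since the slides do not state which one was used. The names `tfidf` and `project` are hypothetical.

```python
import numpy as np


def tfidf(counts):
    """Turn a terms x abstracts count matrix into a TFIDF matrix.

    The weighting (tf * log(N / df)) is an assumed, common variant.
    """
    counts = np.asarray(counts, dtype=float)
    # Drop terms that never occur and abstracts with no dictionary terms.
    counts = counts[counts.sum(axis=1) > 0]
    counts = counts[:, counts.sum(axis=0) > 0]
    n_docs = counts.shape[1]
    tf = counts / counts.sum(axis=0, keepdims=True)  # per-abstract term frequency
    df = (counts > 0).sum(axis=1)                    # abstracts containing each term
    idf = np.log(n_docs / df)
    return tf * idf[:, None]


def project(matrix, k):
    """Return abstract coordinates in the top-k singular subspace, plus all singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    coords = (s[:k, None] * vt[:k]).T  # one row per abstract, k columns
    return coords, s


# Hypothetical toy counts: 4 terms x 4 abstracts, with one empty term and one empty abstract.
counts = [
    [2, 0, 1, 0],
    [0, 3, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
X = tfidf(counts)
docs_k, singular_values = project(X, k=2)
print(X.shape, docs_k.shape)
print(singular_values)
```

Because the left singular vectors are orthonormal, distances between abstracts in the reduced coordinates approximate distances in the full TFIDF space, with the error governed by the discarded singular values. For a matrix as large as the one described, a sparse truncated SVD routine would be the more realistic choice.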

Distribution of Singular Values

Effect of Using Eigen Sub-space • Tested performance on a smaller data set (400). • Performance of the raw TFIDF matrix is similar to that of the eigen sub-space.

Evaluation • For a given test abstract we used kNN search to find the 100 closest abstracts. • Used a custom scoring algorithm to pick the grantor that best represents the 100 nearest neighbors found. • Tested the entire data set using leave-one-out cross-validation
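The evaluation loop above can be sketched as follows. The custom scoring algorithm is not reproduced in these slides, so a plain majority vote among the nearest neighbors is used here as a stand-in; that substitution, the Euclidean distance, the toy points, and the names `knn_predict` and `leave_one_out_accuracy` are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(query, points, labels, k, skip=None):
    """Predict a grantor for `query` from its k nearest points.

    Majority voting is a stand-in; the slides' custom scoring rule is not given.
    `skip` is the index of a point to leave out (used for leave-one-out testing).
    """
    dists = [
        (math.dist(query, point), i)
        for i, point in enumerate(points)
        if i != skip
    ]
    dists.sort()
    votes = Counter(labels[i] for _, i in dists[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(points, labels, k):
    """Classify every point using all the others and return the fraction correct."""
    hits = sum(
        knn_predict(point, points, labels, k, skip=i) == labels[i]
        for i, point in enumerate(points)
    )
    return hits / len(points)


# Hypothetical 2-D coordinates standing in for abstracts in the reduced space.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]  # placeholder grantor names

prediction = knn_predict((4, 4), points, labels, k=3)
accuracy = leave_one_out_accuracy(points, labels, k=2)
print(prediction, accuracy)
```

A brute-force scan like this costs one distance computation per stored abstract for every query, so at the scale of the full data set an indexed or approximate nearest-neighbor search would likely be needed.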

Results (1)

Results (2)
