Find the best-matching funding institution for a biomedical research project by applying text mining and machine learning to grant summaries from 20 institutions between 1972 and 2009. Pre-processing involves cleaning up texts, removing duplicates, and using a bag-of-words approach with a pre-determined dictionary. The TFIDF matrix is reduced for computational efficiency, and SVD is used to work in a smaller eigen sub-space without losing much precision. Performance is evaluated using kNN search and a custom scoring algorithm that picks a grantor from the 100 closest abstracts.
Where Do You Go for Biomedical Funding? Yi Liu, Ahmet Altay
Background • Problem • In biomedical research there are many sources of federal funding. • How do you choose the right funding institution for a given research idea? • Data • Biomedical grant summaries from 20 institutions, covering 1972 to 2009
Pre-Processing • Cleaned up texts by removing mark-up, meta words, and duplicates • Removed institutions with fewer than 5000 grants • Bag-of-words approach with a pre-determined dictionary • Removed 319 stop words from text • Used stemming (Porter) to further collapse text • Dictionary size of 83485 with 120636 distinct spellings • Used mgrep to annotate our data with dictionary words
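The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the dictionary, and the `crude_stem` helper are small hypothetical stand-ins for the 319 stop words, the 83k-term dictionary, and the Porter stemmer mentioned on the slide, and a plain set lookup takes the place of mgrep annotation.

```python
import re
from collections import Counter

# Hypothetical stand-ins, not the lists used in the original work.
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "for", "is", "are"}
DICTIONARY = {"protein", "bind", "cell", "tumor", "gene"}


def crude_stem(word):
    """Strip a few common suffixes. A toy substitute for Porter stemming."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def deduplicate(abstracts):
    """Drop repeats that match after case and whitespace normalization."""
    seen, unique = set(), []
    for text in abstracts:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def bag_of_words(text):
    """Count dictionary terms in one abstract after stop-word removal and stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    # Set lookup stands in for dictionary annotation with mgrep.
    return Counter(s for s in stems if s in DICTIONARY)


abstracts = [
    "Binding of proteins in tumor cells",
    "binding of  proteins in tumor cells",  # duplicate up to case/spacing
    "Genes and the cell cycle",
]
unique = deduplicate(abstracts)
counts = [bag_of_words(a) for a in unique]
```

Each entry of `counts` is one column of the term-count data that the TFIDF step would consume. A real pipeline would swap in a proper Porter stemmer and the full dictionary.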
Processing • Generate a TFIDF matrix given the dictionary and abstracts • TFIDF matrix is huge (83435 by 561769) • Reduce TFIDF matrix for computational efficiency • Remove dictionary words with zero counts and abstracts with no dictionary words • Use SVD to represent the data in a smaller sub-space of the original matrix • Singular values decrease quickly. We used the first 100 eigenvectors without losing much precision.
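As a rough sketch of the processing steps above, the following builds a TFIDF matrix from a toy count matrix, drops empty rows and columns, and projects the abstracts onto the leading singular vectors with NumPy. The toy counts, the particular TFIDF weighting (relative term frequency times log inverse document frequency), and `k = 2` are assumptions for illustration; the slides used 100 dimensions and do not state which TFIDF variant was applied.

```python
import numpy as np

# Toy term-by-abstract count matrix (rows: dictionary terms, columns: abstracts).
counts = np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 0, 2, 0],
    [0, 0, 0, 0],  # a dictionary term that never occurs
    [0, 1, 0, 2],
], dtype=float)

# Remove terms with zero counts and abstracts with no dictionary terms.
counts = counts[counts.sum(axis=1) > 0]
counts = counts[:, counts.sum(axis=0) > 0]

# One common TFIDF variant (an assumption, not necessarily the authors' choice).
n_docs = counts.shape[1]
tf = counts / counts.sum(axis=0, keepdims=True)
idf = np.log(n_docs / (counts > 0).sum(axis=1))
tfidf = tf * idf[:, None]

# Truncated SVD: keep the k leading singular triplets.
k = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
doc_coords = (s[:k, None] * Vt[:k]).T  # one k-dimensional row per abstract

# Relative Frobenius error of the rank-k approximation.
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
rel_err = np.linalg.norm(tfidf - approx) / np.linalg.norm(tfidf)
```

When the singular values in `s` fall off quickly, `rel_err` stays small for modest `k`, which is the property the slide relies on. At the matrix size quoted on the slide a dense SVD like this would not be practical, so a sparse, truncated solver would be needed instead.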
Effect of Using Eigen Sub-space • Tested performance on a smaller data set (400). • Performance with the raw TFIDF matrix is similar to performance in the eigen sub-space.
Evaluation • For a given test abstract, we used kNN search to find the 100 closest abstracts. • Used a custom scoring algorithm to pick the grantor that best represents the 100 nearest neighbors found. • Tested the entire data set using leave-one-out cross-validation
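The evaluation loop can be illustrated with a minimal pure-Python sketch. The slides do not give the custom scoring algorithm or the distance measure, so this example substitutes cosine similarity and a similarity-weighted vote; both are assumptions. The vectors, the grantor labels, and `k = 3` are hypothetical toy values (the slides used 100 neighbors).

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_grantor(query, vectors, labels, k, skip=None):
    """Find the k most similar abstracts and return the grantor with the
    largest summed similarity (an assumed stand-in for the custom score)."""
    sims = [(cosine(query, v), labels[i])
            for i, v in enumerate(vectors) if i != skip]
    sims.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for sim, label in sims[:k]:
        votes[label] += sim
    return max(votes, key=votes.get)


def leave_one_out_accuracy(vectors, labels, k):
    """Hold out each abstract in turn and predict its grantor from the rest."""
    hits = sum(
        predict_grantor(v, vectors, labels, k, skip=i) == labels[i]
        for i, v in enumerate(vectors)
    )
    return hits / len(vectors)


# Hypothetical abstracts in a 2-dimensional sub-space with made-up grantors.
vectors = [
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
    [0.1, 1.0], [0.2, 0.9], [0.0, 0.8],
]
labels = ["grantor_A"] * 3 + ["grantor_B"] * 3

accuracy = leave_one_out_accuracy(vectors, labels, k=3)
suggestion = predict_grantor([0.7, 0.1], vectors, labels, k=3)
```

Here `suggestion` is the grantor proposed for a new abstract, and `accuracy` is the fraction of held-out abstracts whose true grantor was recovered. A brute-force scan like this is quadratic in the number of abstracts, so a data set of the size described would call for an indexed or approximate nearest-neighbor search.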