
Navigating Biomedical Funding Sources Using Text Mining and Machine Learning

Discover the best funding institution for your biomedical research project by applying text mining and machine learning to grant summaries from 20 institutions between 1972 and 2009. Pre-processing cleans up the texts, removes duplicates, and applies a bag-of-words approach with a pre-determined dictionary. The talk examines reducing the TFIDF matrix for computational efficiency and using SVD to work in a smaller eigen sub-space without losing precision, then evaluates performance with a kNN search and a custom scoring algorithm that picks the best grantor from the 100 closest abstracts. Enhance your grant search strategy with data-driven insights.


Presentation Transcript


  1. Where Do You Go for Biomedical Funding? Yi Liu, Ahmet Altay

  2. Background • Problem • In biomedical research there are many sources of federal funding. • How do you choose the right funding institution for a given research idea? • Data • Biomedical grant summaries from 20 institutions covering the period 1972 to 2009

  3. Pre-Processing • Clean up texts: remove mark-up, meta words, and duplicates • Remove institutions with fewer than 5,000 grant records • Bag-of-words approach with a pre-determined dictionary • Removed 319 stop words from the text • Used Porter stemming to further collapse the text • Dictionary size of 83,485 with 120,636 distinct spellings • Use mgrep to annotate our data with dictionary words
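
The slides show no code, but a minimal Python sketch of this pre-processing step might look as follows. NLTK's English stop word list and Porter stemmer stand in for the talk's 319-word list and mgrep-based annotation, and the sample sentence is purely illustrative.

```python
import re

from nltk.corpus import stopwords   # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))  # stand-in for the talk's 319-word list

def preprocess(abstract: str) -> list[str]:
    """Clean one grant abstract into a bag of stems."""
    text = re.sub(r"<[^>]+>", " ", abstract.lower())  # strip mark-up
    tokens = re.findall(r"[a-z]+", text)              # keep alphabetic tokens only
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess("Novel <b>protein</b> folding studies in the cell membrane"))
# -> ['novel', 'protein', 'fold', 'studi', 'cell', 'membran']
```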

  4. Histogram for Stems per Abstract

  5. Processing • Generate a TFIDF matrix from the dictionary and abstracts • The TFIDF matrix is huge (83,435 by 561,769) • Reduce the TFIDF matrix for computational efficiency • Remove dictionary terms and abstracts with all-zero counts • Use SVD to represent the data in a smaller sub-space of the original matrix • Singular values decrease quickly; we used the first 100 eigenvectors without losing much precision
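
A sketch of this step using scikit-learn, under the assumption that standard TFIDF weighting is meant: TfidfVectorizer builds the matrix (with abstracts as rows, the transpose of the slide's dimensions) and TruncatedSVD, which operates directly on sparse input, keeps the first k singular vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def reduce_abstracts(abstracts, k=100):
    """Build a TFIDF matrix from cleaned abstracts and project it onto
    the first k singular vectors (the eigen sub-space of the talk)."""
    vectorizer = TfidfVectorizer()               # rows = abstracts, columns = stems
    tfidf = vectorizer.fit_transform(abstracts)  # sparse; the slides use the transpose
    svd = TruncatedSVD(n_components=k)           # works directly on sparse input
    reduced = svd.fit_transform(tfidf)           # dense (n_abstracts, k) matrix
    return reduced, svd

# reduced, svd = reduce_abstracts(cleaned_abstracts)  # k=100 as in the talk
```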

  6. Distribution of Singular Values

  7. Effect of Using Eigen Sub-space • Tested performance on a smaller data set (400) • Performance of the raw TFIDF matrix is similar to that of the eigen sub-space
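
The slides do not say how this comparison was run; one hypothetical way to check that the sub-space preserves neighborhoods is to measure how many nearest neighbors the two representations share.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbor_overlap(raw_tfidf, reduced, k=10):
    """Mean fraction of k nearest neighbors shared between the raw TFIDF
    space and the SVD sub-space; values near 1.0 mean little is lost."""
    idx_raw = NearestNeighbors(n_neighbors=k).fit(raw_tfidf).kneighbors(raw_tfidf)[1]
    idx_red = NearestNeighbors(n_neighbors=k).fit(reduced).kneighbors(reduced)[1]
    return float(np.mean([len(set(a) & set(b)) / k
                          for a, b in zip(idx_raw, idx_red)]))
```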

  8. Evaluation • For a given test abstract, we used a kNN search to find the 100 closest abstracts • Used a custom scoring algorithm to pick the grantor that best represents the 100 nearest neighbors found • Tested the entire data set using leave-one-out cross-validation
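
The scoring formula itself is not in the transcript (it presumably appeared on the slide), so the inverse-distance weighted vote below is an assumption, not the authors' method; it is one plausible way to pick a grantor from the 100 nearest neighbors.

```python
from collections import defaultdict
from sklearn.neighbors import NearestNeighbors

def predict_grantor(reduced, grantors, query_idx, k=100):
    """Predict a grantor for one held-out abstract from its k nearest
    neighbors, using an assumed inverse-distance weighted vote."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(reduced)  # fit once outside in practice
    dist, idx = nn.kneighbors(reduced[query_idx:query_idx + 1])
    scores = defaultdict(float)
    for d, i in zip(dist[0], idx[0]):
        if i == query_idx:                       # leave-one-out: skip the query itself
            continue
        scores[grantors[i]] += 1.0 / (d + 1e-9)  # assumed weighting, not from the talk
    return max(scores, key=scores.get)

# Leave-one-out accuracy over the whole corpus:
# correct = sum(predict_grantor(X, y, i) == y[i] for i in range(len(y)))
# accuracy = correct / len(y)
```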

  9. Results (1)

  10. Results (2)
