
Literature-Based Knowledge Discovery using Natural Language Processing

Literature-Based Knowledge Discovery using Natural Language Processing. Dimitar Hristovski,1 PhD, Carol Friedman,2 PhD, Thomas C Rindflesch,3 PhD, Borut Peterlin,4 MD PhD. 1 Institute of Biomedical Informatics, Medical Faculty, University of Ljubljana, Slovenia



  1. Literature-Based Knowledge Discovery using Natural Language Processing • Dimitar Hristovski,1 PhD, Carol Friedman,2 PhD, Thomas C Rindflesch,3 PhD, Borut Peterlin,4 MD PhD • 1 Institute of Biomedical Informatics, Medical Faculty, University of Ljubljana, Slovenia • 2 Department of Biomedical Informatics, Columbia University, New York • 3 National Library of Medicine, Bethesda, Maryland • 4 Division of Medical Genetics, UMC, Slajmerjeva 3, Ljubljana, Slovenia • e-mail: dimitar.hristovski@mf.uni-lj.si

  2. Part 1: Co-occurrence based LBD

  3. Motivation • Overspecialization • Information overload • Large databases • Need and opportunity for computer supported knowledge discovery

  4. Literature-based Discovery (LBD) • A method for automatically generating hypotheses (discoveries) from literature • Hypotheses have the form: Concept1 –Relation– Concept2 • Example: Fish oil –Treats– Raynaud’s disease

  5. Background • Swanson’s LBD paradigm: • Concept X (Disease), e.g. Raynaud’s • Concepts Y (Pathological or Cell Function, …), e.g. Blood viscosity • Concepts Z (Drugs, …), e.g. Fish oil • New relation between X and Z? e.g. Treats

  6. Biomedical Discovery Support System (BITOLA) • Goal: • discover potentially new relations (knowledge) between biomedical concepts • to be used as research idea generator and/or as • an alternative way to search Medline • System user (researcher or intermediary): • interactively guides the discovery process • evaluates the proposed relations

  7. Extending and Enhancing Literature Based Discovery • Goal: • Make literature based discovery more suitable for disease candidate gene discovery • Decrease the number of candidate relations • Method: • Integrate background knowledge: • Chromosomal location of diseases and genes • Gene expression location • Disease manifestation location

  8. System Overview [diagram components] • Databases (Medline, LocusLink, HUGO, OMIM, …) • Knowledge Extraction • Knowledge Base • Concepts • Association Rules • Background Knowledge (Chromosomal Locations, …) • Discovery Algorithm • User Interface

  9. Terminology Problems during Knowledge Extraction • Gene names • Gene symbols • MeSH and genetic diseases

  10. Detected Gene Symbols by Frequency (symbol|frequency) • type|666548 • II|552584 • III|201776 • component|179643 • CT|175973 • AT|151337 • ATP|147357 • IV|123429 • CD4|99657 • p53|89357 • MR|88682 • SD|85889 • GH|84797 • LPS|68982 • 59|67272 • E2|64616 • 82|63521 • AMP|61862 • TNF|59343 • RA|58818 • CD8|57324 • O2|56847 • ACTH|54933 • CO2|53171 • PKC|51057 • EGF|50483 • T3|49632 • MS|46813 • A2|44896 • ER|43212 • upstream|41820 • PRL|41599

  11. Gene Symbol Disambiguation • Find MEDLINE docs in which we can expect to find gene symbols • Example of a false positive: • Ethics in a twist: "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390 • breast basic conserved 1 (BBC1) gene vs. the BBC1 television station featuring the new drama series Life Support

  12. Binary Association Rules • X → Y (confidence, support) • If X Then Y (confidence, support) • Confidence = % of docs containing Y within the X docs • Support = number (or %) of docs containing both X and Y • The nature of the relation between X and Y is not known. • Examples: • Multiple Sclerosis → Optic Neuritis (2.02, 117) • Multiple Sclerosis → Interferon-beta (5.17, 300)
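
The two measures defined on this slide can be computed directly from sets of document ids. The following is a minimal sketch, not BITOLA's actual code: the function name `association_rule`, the `docs_by_concept` index, and the toy document ids are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def association_rule(
    docs_by_concept: Dict[str, Set[int]], x: str, y: str
) -> Tuple[float, int]:
    """Return (confidence in %, support as a document count) for X -> Y."""
    x_docs = docs_by_concept.get(x, set())
    y_docs = docs_by_concept.get(y, set())
    # Support: number of documents that contain both X and Y.
    support = len(x_docs & y_docs)
    # Confidence: share of the X documents that also contain Y.
    confidence = 100.0 * support / len(x_docs) if x_docs else 0.0
    return confidence, support


# Toy index (made-up document ids, not real Medline counts).
index = {
    "Multiple Sclerosis": set(range(1, 201)),  # documents 1..200
    "Optic Neuritis": set(range(191, 221)),    # documents 191..220
}

confidence, support = association_rule(index, "Multiple Sclerosis", "Optic Neuritis")
print(f"Multiple Sclerosis -> Optic Neuritis ({confidence:.2f}, {support})")
```

Note that confidence is asymmetric: swapping X and Y changes the denominator, so X → Y and Y → X generally have different confidence but the same support.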

  13. Discovery Algorithm • Concept X (Disease) → Concepts Y (Pathological or Cell Function, …) → Concepts Z (Genes) • Candidate Gene? (proposed relation between X and Z) • Match: Chromosomal Region of X against Chromosomal Location of Z • Match: Manifestation Location of X against Expression Location of Z

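  14. Ranking Concepts Z • [Diagram: starting concept X linked to intermediate concepts Y1, Y2, Y3, …, Yi, …, Yj, which are in turn linked to candidate concepts Z1, Z2, Z3, …, Zk, …, Zn]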
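
The X → Y → Z traversal can be sketched over a set of association rules. The slides do not say how BITOLA scores the Z concepts, so the ranking used here (number of distinct Y concepts linking X to Z) is an assumption, as are the function name `rank_candidates` and the placeholder concepts `Y2` and `Z2`.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def rank_candidates(
    rules: Set[Tuple[str, str]], x: str
) -> List[Tuple[str, List[str]]]:
    """Rank concepts Z reachable from X through intermediate concepts Y."""
    # Step 1: concepts Y directly related to the starting concept X.
    ys = {b for (a, b) in rules if a == x}
    # Step 2: concepts Z related to some Y. Skipping members of ys drops
    # every Z that already has a known direct relation with X.
    linking: Dict[str, Set[str]] = defaultdict(set)
    for a, b in rules:
        if a in ys and b != x and b not in ys:
            linking[b].add(a)
    # Step 3 (assumed scoring): more linking Y concepts ranks higher,
    # ties are broken alphabetically.
    ranked = sorted(linking.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(z, sorted(via)) for z, via in ranked]


# Toy rules: "Y2" and "Z2" are placeholders, not concepts from the slides.
rules = {
    ("Raynaud's", "Blood viscosity"),
    ("Raynaud's", "Y2"),
    ("Blood viscosity", "Fish oil"),
    ("Y2", "Fish oil"),
    ("Blood viscosity", "Z2"),
}

for z, via in rank_candidates(rules, "Raynaud's"):
    print(z, "via", via)
```

A real system would more plausibly weight each link by its support or confidence; counting linking concepts is only the simplest choice that yields an ordering.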

  15. Problem Size • Full Medline analyzed (approx. 15,000,000 records) • 87,000,000 association rules between 180,000 biomedical concepts

  16. Bilateral Perisylvian Polymicrogyria - BPP (OMIM: 300388) • Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution • Clinical characteristics: • Mental retardation • Epilepsy • Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process) • X-linked dominant inheritance

  17. 237 genes in Xq28 → relation between semantic types Cell Movement and Gene or gene products → 18 gene candidates → Sublocalisation in the Xq28 → 15 gene candidates → Tissue specific expression → 2 gene candidates: L1CAM and FLNA
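
The location-based steps of this funnel can be sketched as two successive filters. This covers only the chromosomal and tissue matches, not the semantic-type step, and everything in it is illustrative: the `Gene` record, the prefix match on cytogenetic band strings, the tissue labels, and the placeholder genes `GENE_A` and `GENE_B` are assumptions, not data from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Gene:
    symbol: str
    band: str                      # cytogenetic location, e.g. "Xq28"
    tissues: Set[str] = field(default_factory=set)  # expression locations


def filter_candidates(
    genes: List[Gene], disease_region: str, manifestation_tissue: str
) -> List[str]:
    """Keep genes located in the disease region and expressed where it manifests."""
    # Chromosomal match: a simple prefix test, which assumes the region
    # string includes the chromosome arm (e.g. "Xq28", not just "X").
    in_region = [g for g in genes if g.band.startswith(disease_region)]
    # Expression match: the gene must be expressed in the affected tissue.
    return [g.symbol for g in in_region if manifestation_tissue in g.tissues]


# Illustrative records only; GENE_A and GENE_B are made-up placeholders.
genes = [
    Gene("L1CAM", "Xq28", {"brain"}),
    Gene("FLNA", "Xq28", {"brain"}),
    Gene("GENE_A", "Xq27.3", {"brain"}),  # outside the region
    Gene("GENE_B", "Xq28", {"liver"}),    # not expressed in the tissue
]

print(filter_candidates(genes, "Xq28", "brain"))
```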

  18. User Interface “cgi-bin” version

  19. Automatically search for supporting Medline Citations

  20. Part 1: Summary and Conclusions • Discovery support system (BITOLA) presented • The system can be used as: • Research idea generator, or • Alternative method of searching Medline • Genetic knowledge about the chromosomal locations of diseases and genes included to make BITOLA more suitable for disease candidate gene discovery

  21. System Availability • URL: www.mf.uni-lj.si/bitola/

  22. Part 2: Exploring Semantic Relations for LBD

  23. Current LBD Systems • Co-occurrence based • Concepts • Title/Abstract Words/Phrases • MeSH • UMLS • Genes ... • UMLS Semantic types used for filtering • Semantic relations between concepts NOT used

  24. Drawbacks of Current LBD • Not all co-occurrences represent a relation • Users have to read many Medline citations when reviewing candidate relations • Many spurious (false-positive) relations and hypotheses produced • No explanation of proposed hypotheses

  25. Enhancing the LBD paradigm • Use semantic relations obtained from two NLP systems (BioMedLee and SemRep) to augment a co-occurrence based LBD system (BITOLA)

  26. Methods

  27. Discovery Patterns • Discovery pattern: Set of conditions to be satisfied for the generation of new hypotheses • Conditions are combinations of semantic relations between concepts • Maybe_Treats pattern in this research – has two forms: • Maybe_Treats1 • Maybe_Treats2

  28. Maybe_Treats Discovery Pattern • Maybe_Treats1: Disease X –Change1– Substance Y1 (or Body meas., Body funct.); Drug Z1 (or substance) –Opposite_Change1– Substance Y1; therefore Z1 Maybe_Treats1 X • Maybe_Treats2: Disease X –Change2– Substance Y2 (or Body meas., Body funct.); Disease X2 –Same Change2– Substance Y2; Drug Z2 (or substance) –Treats– Disease X2; therefore Z2 Maybe_Treats2 X

  29. Maybe_Treats1 and Maybe_Treats2 • Goal: Propose potentially new treatments • Can work in concert: • Propose different treatments (complementary) • Propose same treatments using different discovery reasoning (reinforcing)

  30. Multiple Usages of Maybe_Treats • Given Disease X as input: • find new treatments Z • Given Drug Z as input: • find diseases X that can be treated • Given Disease X and Drug Z as input: • test whether Z can be used to treat X

  31. Semantic Relations Used • Associated_with_change and Treats used to extract known facts from the literature • Then Maybe_Treats1 and Maybe_Treats2 predict new treatments based on the known extracted facts

  32. Associated_with_change • One concept associated with a change in another concept, for example: • Associated_with(Raynaud’s, Blood viscosity, increase): • “Local increase of blood viscosity during cold-induced Raynaud's phenomenon.” • “Increased viscosity might be a causal factor in secondary forms of Raynaud's disease, …” • BioMedLee (Friedman et al) used to extract Associated_with_change

  33. Treats • Used to extract drugs known to treat a disease • Major purpose in our approach: • Eliminate drugs already known to be used to treat a disease • Find existing treatments for similar diseases • TREATS(Amantadine,Huntington): • “…treatment of Huntington’s disease with amantadine…” • Treats extracted by SemRep (Rindflesch et al)

  34. Results

  35. Huntington Disease • Inherited neurodegenerative disorder • All 5511 Huntington citations (Jan. 2006) processed with BioMedLee and SemRep • 35 interesting concepts associated with change selected, and the corresponding citations (250,000) processed

  36. Insulin for Huntington Disease • Assoc_with(Huntington,Insulin,decrease): • “Huntington's disease transgenic mice develop an age-dependent reduction of insulin mRNA expression and diminished expression of key regulators of insulin gene transcription, …” • Insulin also decreased in diabetes mellitus • Therapies used to regulate insulin in diabetes might be used for Huntington

  37. Capsaicin for Huntington • Assoc_with(Huntington,Substance P,decrease): • “In Huntington's disease brains decreased Substance P staining was found in …” • Assoc_with(Capsaicin,Substance P,increase): • “Capsaicin also attenuated the increase in Substance P content in sciatic nerve, …” • Capsaicin maybe treats Huntington because Substance P is decreased in Huntington and Capsaicin increases Substance P.

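  38. Huntington Results - Summary • Maybe_Treats1: Huntington (Disease X) –Decrease– Substance P (Substance Y1); Capsaicin (Drug Z1) –Increase– Substance P; therefore Capsaicin Maybe_Treats1 Huntington • Maybe_Treats2: Huntington (Disease X) –Decrease– Insulin (Substance Y2); Diabetes M. (Disease X2) –Decrease– Insulin; Insulin regulation ther. (Z2) –Treats– Diabetes M.; therefore Insulin regulation ther. Maybe_Treats2 Huntington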
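
Both discovery patterns can be expressed as small joins over the extracted relations. The sketch below uses only the Huntington facts reported on the preceding slides; the tuple layout `(subject, object, change)`, the function names, and the hand-made `DISEASES` and `DRUGS` sets are assumptions for illustration and do not reflect the actual BioMedLee or SemRep output formats.

```python
from typing import List, Set, Tuple

Assoc = Tuple[str, str, str]  # (subject, changed concept, direction)
Treats = Tuple[str, str]      # (drug, disease)

OPPOSITE = {"increase": "decrease", "decrease": "increase"}


def maybe_treats1(x: str, assoc: List[Assoc], treats: List[Treats],
                  drugs: Set[str]) -> List[Tuple[str, str]]:
    """Drug Z1 changes Y1 in the opposite direction to disease X."""
    known = {drug for drug, disease in treats if disease == x}
    hypotheses = []
    for subj, y, change in assoc:
        if subj != x:
            continue
        for z, y_z, change_z in assoc:
            if (z in drugs and y_z == y
                    and change_z == OPPOSITE[change] and z not in known):
                hypotheses.append((z, y))
    return hypotheses


def maybe_treats2(x: str, assoc: List[Assoc], treats: List[Treats],
                  diseases: Set[str]) -> List[Tuple[str, str, str]]:
    """Drug Z2 treats a disease X2 that shows the same change in Y2 as X."""
    known = {drug for drug, disease in treats if disease == x}
    hypotheses = []
    for subj, y, change in assoc:
        if subj != x:
            continue
        for x2, y2, change2 in assoc:
            if x2 in diseases and x2 != x and y2 == y and change2 == change:
                for z2, treated in treats:
                    if treated == x2 and z2 not in known:
                        hypotheses.append((z2, y, x2))
    return hypotheses


# Facts taken from the slides.
ASSOC = [
    ("Huntington", "Insulin", "decrease"),
    ("Huntington", "Substance P", "decrease"),
    ("Diabetes mellitus", "Insulin", "decrease"),
    ("Capsaicin", "Substance P", "increase"),
]
TREATS = [
    ("Amantadine", "Huntington"),
    ("Insulin regulation therapy", "Diabetes mellitus"),
]
DISEASES = {"Huntington", "Diabetes mellitus"}
DRUGS = {"Capsaicin"}

print(maybe_treats1("Huntington", ASSOC, TREATS, DRUGS))
print(maybe_treats2("Huntington", ASSOC, TREATS, DISEASES))
```

The `known` set plays the role the slides assign to Treats: drugs already used for the disease, such as amantadine for Huntington, are never proposed as new hypotheses.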

  39. Example: Parkinson disease as starting concept. Shown below are some related concepts that change in association with Parkinson

  40. Potential Treatments for Parkinson (e.g. gabapentin)

  41. Showing Supporting Sentences with highlighted concepts and relations

  42. Gabapentin for Parkinson • Assoc_with(Parkinson,gamma-aminobutyric acid(GABA),decrease): • “…studies indicate that patients with Parkinson's disease have decreased basal ganglia gamma-aminobutyric acid function…” • Assoc_with(GABA,Gabapentin,increase): • “Gabapentin, probably through the activation of glutamic acid decarboxylase, leads to the increase in synaptic GABA.” • Explanation: Gabapentin maybe treats Parkinson because GABA is decreased in Parkinson and Gabapentin increases GABA.

  43. Part 2: Conclusions • A new method to improve LBD presented • Based on discovery patterns and semantic relations extracted by BioMedLee and SemRep, coupled with BITOLA LBD • Easier for the user to evaluate a smaller number of hypotheses • Two potentially new therapeutic approaches for Huntington proposed and one for Parkinson • Raynaud’s–Fish oil discovery replicated

  44. The future of Literature-based Discovery • Development of specific discovery patterns based on semantic relations and further integrated with co-occurrence-based LBD

  45. Link, References and some propaganda • http://www.mf.uni-lj.si/bitola • Hristovski D, Peterlin B, Mitchell JA, Humphrey SM. Using literature-based discovery to identify disease candidate genes. Int J Med Inform. 2005;74(2–4):289–298. Selected for the Yearbook of Medical Informatics 2006. • Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Exploiting semantic relations for literature-based discovery. In: Proc AMIA 2006 Symp; 2006. p. 349–53. • Ahlers C, Hristovski D, Kilicoglu H, Rindflesch TC. Using the Literature-Based Discovery Paradigm to Investigate Drug Mechanisms. In: Proc AMIA 2007 Symp; 2007. p. 6–10. Distinguished Paper Award, AMIA 2007. • Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Literature-Based Knowledge Discovery using Natural Language Processing. To appear as a chapter in the first LBD book in 2008.
