Literature-Based Knowledge Discovery using Natural Language Processing
Dimitar Hristovski,1 PhD, Carol Friedman,2 PhD, Thomas C Rindflesch,3 PhD, Borut Peterlin,4 MD PhD
1 Institute of Biomedical Informatics, Medical Faculty, University of Ljubljana, Slovenia
2 Department of Biomedical Informatics, Columbia University, New York
3 National Library of Medicine, Bethesda, Maryland
4 Division of Medical Genetics, UMC, Slajmerjeva 3, Ljubljana, Slovenia
e-mail: dimitar.hristovski@mf.uni-lj.si
Motivation • Overspecialization • Information overload • Large databases • Need and opportunity for computer supported knowledge discovery
Literature-based Discovery (LBD) • A method for automatically generating hypotheses (discoveries) from literature • Hypotheses have the form: Concept1 –Relation– Concept2 • Example: Fish oil –Treats– Raynaud’s disease
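Background • Swanson’s LBD paradigm (diagram): • Concept X (Disease), e.g. Raynaud’s • Concepts Y (Pathological or Cell Function, …), e.g. Blood viscosity • Concepts Z (Drugs, …), e.g. Fish oil • X is linked to Y and Y is linked to Z in the literature; is there a new relation between X and Z, e.g. Treats?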
Biomedical Discovery Support System (BITOLA) • Goal: • discover potentially new relations (knowledge) between biomedical concepts • to be used as a research idea generator and/or as an alternative way to search Medline • System user (researcher or intermediary): • interactively guides the discovery process • evaluates the proposed relations
Extending and Enhancing Literature Based Discovery • Goal: • Make literature based discovery more suitable for disease candidate gene discovery • Decrease the number of candidate relations • Method: • Integrate background knowledge: • Chromosomal location of diseases and genes • Gene expression location • Disease manifestation location
System Overview (diagram components) • Databases (Medline, LocusLink, HUGO, OMIM, …) • Knowledge Extraction • Knowledge Base • Concepts • Association Rules • Background Knowledge (Chromosomal Locations, …) • Discovery Algorithm • User Interface
Terminology Problems during Knowledge Extraction • Gene names • Gene symbols • MeSH and genetic diseases
Detected Gene Symbols by Frequency (symbol|count) • type|666548 • II|552584 • III|201776 • component|179643 • CT|175973 • AT|151337 • ATP|147357 • IV|123429 • CD4|99657 • p53|89357 • MR|88682 • SD|85889 • GH|84797 • LPS|68982 • 59|67272 • E2|64616 • 82|63521 • AMP|61862 • TNF|59343 • RA|58818 • CD8|57324 • O2|56847 • ACTH|54933 • CO2|53171 • PKC|51057 • EGF|50483 • T3|49632 • MS|46813 • A2|44896 • ER|43212 • upstream|41820 • PRL|41599
Gene Symbol Disambiguation • Find MEDLINE docs in which we can expect to find gene symbols • Example of false positive: • Ethics in a twist: "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390 • breast basic conserved 1 (BBC1) gene vs. the BBC1 television station featuring the new drama series Life Support
Binary Association Rules • X → Y (confidence, support) • If X Then Y (confidence, support) • Confidence = % of docs containing Y within the X docs • Support = number (or %) of docs containing both X and Y • The relation between X and Y is not known. • Examples: • Multiple Sclerosis → Optic Neuritis (2.02, 117) • Multiple Sclerosis → Interferon-beta (5.17, 300)
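The two measures defined on this slide can be computed directly from per-document concept sets. The following is a minimal sketch, not BITOLA's actual implementation: the function name `rule_metrics`, the in-memory list of sets, and the toy corpus are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple


def rule_metrics(docs: Iterable[Set[str]], x: str, y: str) -> Tuple[float, int]:
    """Return (confidence in percent, support as a document count) for the rule X -> Y."""
    docs_with_x = 0
    docs_with_both = 0
    for concepts in docs:
        if x in concepts:
            docs_with_x += 1
            if y in concepts:
                docs_with_both += 1
    if docs_with_x == 0:
        # X never occurs, so the rule is undefined; report zeros.
        return 0.0, 0
    confidence = 100.0 * docs_with_both / docs_with_x
    return confidence, docs_with_both


# Toy corpus (invented for illustration): one set of concepts per document.
toy_docs = [
    {"Multiple Sclerosis", "Optic Neuritis"},
    {"Multiple Sclerosis", "Interferon-beta"},
    {"Multiple Sclerosis"},
    {"Multiple Sclerosis", "Interferon-beta", "Optic Neuritis"},
    {"Optic Neuritis"},
]

confidence, support = rule_metrics(toy_docs, "Multiple Sclerosis", "Optic Neuritis")
print(f"Multiple Sclerosis -> Optic Neuritis ({confidence:.2f}, {support})")
```

Note that confidence is directional (X → Y and Y → X generally differ), while support is symmetric because it only counts documents containing both concepts.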
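Discovery Algorithm (diagram) • Concept X (Disease) is linked to Concepts Y (Pathological or Cell Function, …), which are linked to Concepts Z (Genes) • Match: Chromosomal Region of X with Chromosomal Location of Z • Match: Manifestation Location of X with Expression Location of Z • Proposed relation between X and Z: Candidate Gene?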
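Ranking Concepts Z (diagram) • Starting concept X is linked to intermediate concepts Y1, Y2, Y3, …, Yi, …, Yj • The intermediate concepts are linked to candidate concepts Z1, Z2, Z3, …, Zk, …, Zn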
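The X → Y → Z traversal with a chromosomal region match can be sketched as follows. This is only an illustration under stated assumptions: the slides do not give the ranking formula, so the score used here (number of linking Y concepts, then summed minimum support) is hypothetical, as are the function name `rank_candidates`, the nested-dictionary rule store, the prefix test used for the region match, and every concept name and number in the toy data.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

# rules[a][b] = support of the association rule a -> b (hypothetical in-memory form).
Rules = Dict[str, Dict[str, int]]


def rank_candidates(
    rules: Rules,
    x: str,
    gene_location: Dict[str, str],
    disease_region: Optional[str] = None,
) -> List[Tuple[str, int, int]]:
    """Return (gene Z, number of linking Y concepts, score), best candidates first."""
    linking: Dict[str, Set[str]] = defaultdict(set)
    score: Dict[str, int] = defaultdict(int)
    directly_related = set(rules.get(x, {}))
    for y, xy_support in rules.get(x, {}).items():
        for z, yz_support in rules.get(y, {}).items():
            if z == x or z in directly_related:
                continue  # already co-occurs with X, so not a new relation
            if z not in gene_location:
                continue  # only genes are candidate Z concepts here
            if disease_region is not None and not gene_location[z].startswith(disease_region):
                continue  # assumed region match: simple prefix comparison
            linking[z].add(y)
            score[z] += min(xy_support, yz_support)  # assumed scoring, not from the slides
    ranked = [(z, len(linking[z]), score[z]) for z in linking]
    return sorted(ranked, key=lambda t: (-t[1], -t[2], t[0]))


# Toy data with invented names and supports.
toy_rules: Rules = {
    "DiseaseX": {"FunctionY1": 10, "FunctionY2": 4, "GeneZ3": 2},
    "FunctionY1": {"GeneZ1": 7, "GeneZ2": 3, "GeneZ3": 5},
    "FunctionY2": {"GeneZ1": 6, "GeneZ4": 9},
}
toy_locations = {"GeneZ1": "Xq28", "GeneZ2": "Xq28", "GeneZ3": "Xq28", "GeneZ4": "1p36"}

for gene, n_links, total in rank_candidates(toy_rules, "DiseaseX", toy_locations, "Xq28"):
    print(gene, n_links, total)
```

Passing `disease_region=None` disables the location filter, which makes it easy to see how much the background knowledge shrinks the candidate list.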
Problem Size • Full Medline analyzed (approx. 15,000,000 records) • 87,000,000 association rules between 180,000 biomedical concepts
Bilateral Perisylvian Polymicrogyria - BPP (OMIM: 300388) • Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution • Clinical characteristics: • Mental retardation • Epilepsy • Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process) • X-linked dominant inheritance
237 genes in Xq28 → filter: relation between semantic types Cell Movement and Gene or Gene Product → 18 gene candidates → filter: sublocalisation in Xq28 → 15 gene candidates → filter: tissue-specific expression → 2 gene candidates: L1CAM and FLNA
Part 1: Summary and Conclusions • Discovery support system (BITOLA) presented • The system can be used as: • Research idea generator, or • Alternative method of searching Medline • Genetic knowledge about the chromosomal locations of diseases and genes included to make BITOLA more suitable for disease candidate gene discovery
System Availability • URL: www.mf.uni-lj.si/bitola/
Current LBD Systems • Co-occurrence based • Concepts • Title/Abstract Words/Phrases • MeSH • UMLS • Genes ... • UMLS Semantic types used for filtering • Semantic relations between concepts NOT used
Drawbacks of Current LBD • Not all co-occurrences represent a relation • Users have to read many Medline citations when reviewing candidate relations • Many spurious (false-positive) relations and hypotheses produced • No explanation of proposed hypotheses
Enhancing the LBD paradigm • Use semantic relations obtained from two NLP systems (BioMedLee and SemRep) to augment a co-occurrence-based LBD system (BITOLA)
Discovery Patterns • Discovery pattern: Set of conditions to be satisfied for the generation of new hypotheses • Conditions are combinations of semantic relations between concepts • Maybe_Treats pattern in this research – has two forms: • Maybe_Treats1 • Maybe_Treats2
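Maybe_Treats Discovery Pattern (diagram) • Maybe_Treats1: Disease X is associated with Change1 in Substance Y1 (or Body meas., Body funct.); Drug Z1 (or substance) is associated with Opposite_Change1 in Y1; therefore Z1 Maybe_Treats1 X • Maybe_Treats2: Disease X is associated with Change2 in Substance Y2 (or Body meas., Body funct.); Disease X2 is associated with the same Change2 in Y2; Drug Z2 (or substance) Treats X2; therefore Z2 Maybe_Treats2 X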
Maybe_Treats1 and Maybe_Treats2 • Goal: Propose potentially new treatments • Can work in concert: • Propose different treatments (complementary) • Propose same treatments using different discovery reasoning (reinforcing)
Multiple Usages of Maybe_Treats • Given Disease X as input: • find new treatments Z • Given Drug Z as input: • find diseases X that can be treated • Given Disease X and Drug Z as input: • test whether Z can be used to treat X
Semantic Relations Used • Associated_with_change and Treats used to extract known facts from the literature • Then Maybe_Treats1 and Maybe_Treats2 predict new treatments based on the known extracted facts
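The two-step process (extract known facts, then apply the patterns) can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the `Change` tuple, the function names, and the split into separate disease and drug fact lists are assumptions, and the toy facts simply restate relations reported on the Huntington slides of this deck.

```python
from typing import List, NamedTuple, Set, Tuple


class Change(NamedTuple):
    agent: str      # disease or drug
    target: str     # substance, body measurement, or body function
    direction: str  # "increase" or "decrease"


OPPOSITE = {"increase": "decrease", "decrease": "increase"}
Treats = Set[Tuple[str, str]]  # (drug, disease) pairs


def maybe_treats1(disease: str, disease_changes: List[Change],
                  drug_changes: List[Change], treats: Treats) -> List[Tuple[str, str]]:
    """Drugs that change a substance in the direction opposite to the disease."""
    known = {drug for drug, d in treats if d == disease}
    hypotheses = []
    for dc in disease_changes:
        if dc.agent != disease:
            continue
        for zc in drug_changes:
            if zc.agent in known:
                continue  # already a known treatment, so not a new hypothesis
            if zc.target == dc.target and zc.direction == OPPOSITE.get(dc.direction):
                hypotheses.append((zc.agent, dc.target))
    return hypotheses


def maybe_treats2(disease: str, disease_changes: List[Change],
                  treats: Treats) -> List[Tuple[str, str, str]]:
    """Treatments of another disease that shows the same change in the same substance."""
    known = {drug for drug, d in treats if d == disease}
    hypotheses = []
    for dc in disease_changes:
        if dc.agent != disease:
            continue
        for other in disease_changes:
            if other.agent == disease or other.target != dc.target:
                continue
            if other.direction != dc.direction:
                continue
            for drug, treated in sorted(treats):
                if treated == other.agent and drug not in known:
                    hypotheses.append((drug, other.agent, dc.target))
    return hypotheses


# Facts taken from the Huntington slides of this deck.
disease_facts = [
    Change("Huntington", "Substance P", "decrease"),
    Change("Huntington", "Insulin", "decrease"),
    Change("Diabetes mellitus", "Insulin", "decrease"),
]
drug_facts = [Change("Capsaicin", "Substance P", "increase")]
treats_facts: Treats = {
    ("Amantadine", "Huntington"),
    ("Insulin regulation therapy", "Diabetes mellitus"),
}

print(maybe_treats1("Huntington", disease_facts, drug_facts, treats_facts))
print(maybe_treats2("Huntington", disease_facts, treats_facts))
```

Each hypothesis carries the linking substance (and, for Maybe_Treats2, the similar disease), which is the information needed to explain a proposal to the user.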
Associated_with_change • One concept associated with a change in another concept, for example: • Associated_with(Raynaud’s, Blood viscosity, increase): • “Local increase of blood viscosity during cold-induced Raynaud's phenomenon.” • “Increased viscosity might be a causal factor in secondary forms of Raynaud's disease, …” • BioMedLee (Friedman et al) used to extract Associated_with_change
Treats • Used to extract drugs known to treat a disease • Major purpose in our approach: • Eliminate drugs already known to be used to treat a disease • Find existing treatments for similar diseases • TREATS(Amantadine,Huntington): • “…treatment of Huntington’s disease with amantadine…” • Treats extracted by SemRep (Rindflesch et al)
Huntington Disease • Inherited neurodegenerative disorder • All 5511 Huntington citations (Jan. 2006) processed with BioMedLee and SemRep • 35 interesting concepts associated with change selected and the corresponding citations (250,000) processed
Insulin for Huntington Disease • Assoc_with(Huntington,Insulin,decrease): • “Huntington's disease transgenic mice develop an age-dependent reduction of insulin mRNA expression and diminished expression of key regulators of insulin gene transcription, …” • Insulin also decreased in diabetes mellitus • Therapies used to regulate insulin in diabetes might be used for Huntington
Capsaicin for Huntington • Assoc_with(Huntington,Substance P,decrease): • “In Huntington's disease brains decreased Substance P staining was found in …” • Assoc_with(Capsaicin,Substance P,increase): • “Capsaicin also attenuated the increase in Substance P content in sciatic nerve, …” • Capsaicin maybe treats Huntington because Substance P is decreased in Huntington and Capsaicin increases Substance P.
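Huntington Results - Summary (diagram) • Maybe_Treats1: Huntington (Disease X) is associated with a Decrease in Substance P (Substance Y1); Capsaicin (Drug Z1) is associated with an Increase in Substance P; therefore Capsaicin Maybe_Treats1 Huntington • Maybe_Treats2: Huntington (Disease X) is associated with a Decrease in Insulin (Substance Y2); Diabetes M (Disease X2) is associated with a Decrease in Insulin; Insulin regulation ther. (Z2) Treats Diabetes M; therefore Insulin regulation ther. Maybe_Treats2 Huntington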
Example: Parkinson disease as the starting concept. Shown below are some related concepts that change in association with Parkinson
Showing Supporting Sentences with highlighted concepts and relations
Gabapentin for Parkinson • Assoc_with(Parkinson,gamma-aminobutyric acid (GABA),decrease): • “…studies indicate that patients with Parkinson's disease have decreased basal ganglia gamma-aminobutyric acid function…” • Assoc_with(Gabapentin,GABA,increase): • “Gabapentin, probably through the activation of glutamic acid decarboxylase, leads to the increase in synaptic GABA.” • Explanation: Gabapentin maybe treats Parkinson because GABA is decreased in Parkinson and Gabapentin increases GABA.
Part 2: Conclusions • A new method to improve LBD presented • Based on discovery patterns and semantic relations extracted by BioMedLee and SemRep, coupled with BITOLA LBD • Easier for the user to evaluate a smaller number of hypotheses • Two potentially new therapeutic approaches for Huntington proposed and one for Parkinson • Raynaud’s / Fish oil discovery replicated
The future of Literature-based Discovery • Development of specific discovery patterns based on semantic relations and further integrated with co-occurrence-based LBD
Link, References and some propaganda • http://www.mf.uni-lj.si/bitola • Hristovski D, Peterlin B, Mitchell JA and Humphrey SM. Using literature-based discovery to identify disease candidate genes. Int. J. Med. Inform. 2005. Vol. 74(2–4), pp. 289–298. Selected for Yearbook of Medical Informatics 2006 • Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Exploiting semantic relations for literature-based discovery. In Proc AMIA 2006 Symp; 2006. p. 349-53. • Ahlers C, Hristovski D, Kilicoglu H, Rindflesch TC. Using the Literature-Based Discovery Paradigm to Investigate Drug Mechanisms. In Proc AMIA 2007 Symp; 2007. p. 6-10. “Distinguished Paper Award AMIA2007” • Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Literature-Based Knowledge Discovery using Natural Language Processing. To appear as a chapter in the first LBD book in 2008