Shaillay Dogra, Ramesh Hariharan and Kalyanasundaram Subramanian Strand Life Sciences Pvt. Ltd

Combining Natural Language Processing with Substructure Search for efficient mining of Scientific literature

Presentation Transcript


  1. Combining Natural Language Processing with Substructure Search for efficient mining of Scientific literature
     Shaillay Dogra, Ramesh Hariharan and Kalyanasundaram Subramanian
     Strand Life Sciences Pvt. Ltd

  2. Background
     • During the lead design/optimization phase, only the interaction between the lead and its target is investigated.
     • Interactions with other targets that could be undesirable are not studied.
     • Undesirable interactions usually become apparent only at later stages of the discovery process, in vivo.

  3. Solutions that Currently Exist
     • Run experimental assays to determine undesirable interactions
       • A prospective panel of “side-effect” related assays, e.g. for kinases
       • The need for the assay may arise from side-effects observed in animal studies or from known liabilities of the target class
       • Synthesis and assay costs of these experiments are considerable
       • What has not been checked may be missed
     • Run a search engine like “QueryChem” with ‘structure and keyword’
       • Need to predefine the keyword; results are limited by what you define
       • Display of results is not intuitive or user-friendly
       • Further refinement or exploration of these results is unwieldy

  4. Structures are first searched against public databases
     • ‘Text’ names of the ‘hits’ so obtained are then combined with user-defined keywords and used again to search for information on the internet.
     Justin Klekota, Frederick P. Roth and Stuart L. Schreiber, Bioinformatics 2006, 22(13):1670-1673
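The second step of this approach, pairing hit names with user-defined keywords, can be sketched as below. This is a minimal illustration only: `build_queries` and the quoted query format are assumptions made for the example, not QueryChem's actual implementation or query syntax. The example input reuses the valproate / hERG binding pair named on the next slide.

```python
def build_queries(hit_names, keywords):
    """Pair every hit compound name with every user-defined keyword.

    Hypothetical helper: the exact-phrase quoting is an assumed convention,
    not the query format of any particular search engine.
    """
    return [f'"{name}" "{keyword}"' for name in hit_names for keyword in keywords]


if __name__ == "__main__":
    for query in build_queries(["valproate"], ["hERG binding"]):
        print(query)
```

The sketch also makes the limitation visible: only keywords supplied up front ever reach the search engine.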

  5. Results of valproate & hERG binding

  6. Issues with this approach
     • Only looks for co-occurrences of the compound and the keyword
       • Hence, potentially misses a lot of interactions
     • The result of a search is a (long) text list
       • not easy to examine
       • no real analysis is possible
     • What could be an alternate approach?
       • Cover as many biological interactions as are currently available in the literature
       • Show results in a user-friendly and intuitive manner
       • Allow further refinement of the search and exploration in a dynamic manner

  7. The Workflow
     • ‘Draw’ the structure of a ‘query’ compound
     • Run a similarity or sub-structure search against ‘target’ compounds in an ‘interactions’ database
       • define ‘hit’ compounds ‘similar’ to the ‘query’ compound
       • check the interactions of these ‘hit’ compounds
     • A network (or networks) of interactions for the given compound is obtained
     • Networks can be analyzed, which provides a means of understanding the potential liabilities of the scaffold under consideration
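The workflow can be sketched end to end on toy data. Everything below is an assumption made for illustration: compounds are represented as sets of made-up structural feature labels, similarity is the Tanimoto coefficient on those sets (a common choice, not confirmed by the slides), and the compound names, features, and interactions are placeholders rather than real data.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy 'interactions' database; all names and features are invented placeholders.
TOY_DB = {
    "cmpd_1": {"features": {"f1", "f2", "f3", "f4"},
               "interactions": [("cmpd_1", "binds", "protein_X")]},
    "cmpd_2": {"features": {"f1", "f2", "f5"},
               "interactions": [("cmpd_2", "inhibits", "protein_Y")]},
    "cmpd_3": {"features": {"f7", "f8"},
               "interactions": [("cmpd_3", "binds", "protein_Z")]},
}


def find_hits(query_features, db, threshold):
    """Return names of database compounds at least `threshold` similar to the query."""
    return sorted(name for name, record in db.items()
                  if tanimoto(query_features, record["features"]) >= threshold)


def collect_interactions(hits, db):
    """Gather the stored interactions of every hit compound."""
    return [triple for name in hits for triple in db[name]["interactions"]]


if __name__ == "__main__":
    query = {"f1", "f2", "f3"}  # stands in for the drawn query structure
    hits = find_hits(query, TOY_DB, threshold=0.5)
    print(hits)
    print(collect_interactions(hits, TOY_DB))
```

Raising the threshold narrows the hit list; a real system would compute fingerprints from the drawn structure with a cheminformatics toolkit rather than use hand-written feature sets.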

  8. Basic Assumptions
     • Similarity principle
       • Similar compounds will most likely have similar biological interactions
     • The presence of a pre-mapped interactions database that remains current with the latest literature
     • The presence of small molecules within the database, along with their structures, which allows sub-structure and similarity searching

  9. Interaction Database Creation
     • Database created using NLP
     • Protein, gene and small molecule interactions captured
     • Example sentence: "TLR-2 expression on monocytes was enhanced by macrophage colony-stimulating factor (M-CSF) and interleukin-10 (IL-10), but was reduced by transforming growth factor beta1."

  10. NLP Schemata
      • Phases: Entity Recognition Phase, then Information Extraction Phase
      • Input sentence: Glucose-6-phosphatase was found to play a role in the regulation of insulin.
      • Tagged sentence: A was found to play a role in the regulation of B.
      • Interactions: Glucose-6-phosphatase, insulin, regulation
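The two phases on this slide can be illustrated with a small sketch: entity names are first replaced by placeholders, and a schema is then matched against the tagged sentence. The entity list, the single regular-expression schema, and the function names are assumptions made for this example; the actual system is not described in enough detail on the slides to reproduce.

```python
import re

# Hypothetical entity dictionary covering only the slide's example.
ENTITIES = ["Glucose-6-phosphatase", "insulin"]

# One hand-written schema matching the slide's tagged sentence.
SCHEMA = re.compile(
    r"^(?P<a>[A-Z]) was found to play a role in the (?P<rel>\w+) of (?P<b>[A-Z])\.$"
)


def tag_entities(sentence, entities):
    """Entity recognition phase: replace known names with A, B, ... in order of appearance."""
    found = sorted((sentence.find(name), name) for name in entities if name in sentence)
    mapping = {}
    tagged = sentence
    for index, (_, name) in enumerate(found):
        label = chr(ord("A") + index)
        mapping[label] = name
        tagged = tagged.replace(name, label, 1)
    return tagged, mapping


def extract_interaction(sentence, entities=ENTITIES):
    """Information extraction phase: match the schema and map placeholders back to names."""
    tagged, mapping = tag_entities(sentence, entities)
    match = SCHEMA.match(tagged)
    if match is None:
        return None
    return (mapping[match.group("a")], match.group("rel"), mapping[match.group("b")])


if __name__ == "__main__":
    text = "Glucose-6-phosphatase was found to play a role in the regulation of insulin."
    print(tag_entities(text, ENTITIES)[0])
    print(extract_interaction(text))
```

Tagging before matching is what lets one schema cover any pair of entities, since the pattern only ever sees the placeholders.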

  11. Entity Recognition
      • Create dictionaries of protein names, small molecules etc.
      • Identify alternative names/synonyms/symbols
      • Resolve ambiguities
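A dictionary-based recognizer with synonym normalization might look like the sketch below. The synonym table is a tiny hand-made assumption built from names in the deck's example sentence, and the canonical symbols are chosen for illustration; ambiguity resolution, the third step on the slide, is deliberately left out.

```python
import re

# Hypothetical synonym dictionary: lowercase surface form -> canonical symbol.
SYNONYMS = {
    "macrophage colony-stimulating factor": "M-CSF",
    "m-csf": "M-CSF",
    "interleukin-10": "IL-10",
    "il-10": "IL-10",
    "tlr-2": "TLR-2",
}

# Longest names first so a multi-word name wins over any shorter overlap.
# The lookarounds stop matches that start or end inside a longer token.
_PATTERN = re.compile(
    r"(?<![\w-])("
    + "|".join(re.escape(name) for name in sorted(SYNONYMS, key=len, reverse=True))
    + r")(?![\w-])",
    re.IGNORECASE,
)


def recognize(text):
    """Return canonical symbols of recognized entities, in order of first mention."""
    seen = []
    for match in _PATTERN.finditer(text):
        symbol = SYNONYMS[match.group(1).lower()]
        if symbol not in seen:
            seen.append(symbol)
    return seen


if __name__ == "__main__":
    print(recognize("TLR-2 expression was enhanced by macrophage "
                    "colony-stimulating factor (M-CSF) and interleukin-10 (IL-10)."))
```

Mapping every surface form to one symbol is what allows interactions reported under different names to land on the same node of the database.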

  12. Information Extraction
      • First understand sentence structure
        • Syntax Analysis
      • Understand meaning
        • Semantic Analysis
      • Final interaction extraction
        • Inferencing
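As a rough illustration of these three steps, the sketch below processes the TLR-2 example sentence from the database-creation slide. It is a toy under stated assumptions: clause splitting stands in for syntax analysis, a two-entry verb table stands in for semantic analysis, and carrying the subject over to the second clause stands in for inferencing. None of this is the authors' actual method.

```python
import re

SENTENCE = ("TLR-2 expression on monocytes was enhanced by macrophage "
            "colony-stimulating factor (M-CSF) and interleukin-10 (IL-10), "
            "but was reduced by transforming growth factor beta1.")

# Assumed verb-to-effect table (semantic step); covers only this example.
EFFECTS = {"enhanced": "positive", "reduced": "negative"}

CLAUSE = re.compile(r"(?:(?P<subj>.+?)\s+)?was\s+(?P<verb>\w+)\s+by\s+(?P<agents>.+)")


def extract(sentence):
    """Return (agent, effect, target) triples from a passive 'was VERB by' sentence."""
    results = []
    subject = None
    # Syntax step: split the coordinated clauses joined by ', but'.
    for clause in re.split(r",\s*but\s+", sentence.rstrip(".")):
        match = CLAUSE.match(clause)
        if match is None:
            continue
        # Inference step: a clause without its own subject reuses the previous one.
        subject = match.group("subj") or subject
        effect = EFFECTS.get(match.group("verb"))
        if effect is None or subject is None:
            continue
        for agent in re.split(r"\s+and\s+", match.group("agents")):
            results.append((agent.strip(), effect, subject))
    return results


if __name__ == "__main__":
    for triple in extract(SENTENCE):
        print(triple)
```

The second clause has no explicit subject, so without the carry-over step the reducing effect of transforming growth factor beta1 would be lost.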

  13. Mammal Interaction Database Mammal [human, mouse, rat]

  14. Step 1 – Draw (sub)structure

  15. Step 2 - Perform similarity (or) substructure search

  16. Step 3 - Build network with hits
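One simple way to picture this step is to fold the interactions of the hit compounds into an adjacency structure. The sketch below assumes interactions are stored as (source, relation, target) triples; the triples shown are invented placeholders, and `build_network` is a hypothetical helper, not part of any named product.

```python
def build_network(triples):
    """Build a directed adjacency map: node -> list of (relation, target)."""
    network = {}
    for source, relation, target in triples:
        network.setdefault(source, []).append((relation, target))
        network.setdefault(target, [])  # make sure leaf nodes appear too
    return network


# Placeholder interactions for two hypothetical hit compounds.
HIT_INTERACTIONS = [
    ("hit_1", "binds", "protein_X"),
    ("hit_2", "inhibits", "protein_X"),
    ("protein_X", "regulates", "protein_Y"),
]

if __name__ == "__main__":
    for node, edges in build_network(HIT_INTERACTIONS).items():
        print(node, edges)
```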

  17. Example of Interaction Network

  18. Analyze Network
      • Relevance interactions: binding, transcription, post-translational, small molecules, metabolism or transport regulation etc.
      • Interaction networks: shortest path network, network regulators, network targets etc.
      • Advanced analysis: relevance list, custom relevance interactions, custom interaction network etc.
      • Enrichment analysis: GO group enrichment, similar pathways etc.
      • Numerical data analysis: if present
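Of the analyses listed, the shortest path network is the easiest to sketch. The example below uses a plain breadth-first search over an unweighted toy graph; the node names are placeholders, and the slides do not say which algorithm the actual software uses.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back through the predecessor links to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Invented toy network linking a query compound to a biological process.
TOY_GRAPH = {
    "compound_Q": ["protein_A", "protein_B"],
    "protein_A": ["compound_Q", "protein_C"],
    "protein_B": ["compound_Q"],
    "protein_C": ["protein_A", "process_P"],
    "process_P": ["protein_C"],
}

if __name__ == "__main__":
    print(shortest_path(TOY_GRAPH, "compound_Q", "process_P"))
```

A path from a compound to a process such as a liver-related pathway is the kind of result that would flag a potential liability for closer inspection.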

  19. Case Study: Potential hepatotoxic side-effects of lead molecules

  20. 1 – Draw Structure and perform search

  21. 2 - Gather Hits

  22. 3 - Generate Network: Cholestatic Role of Chlorpromazine

  23. 4 - Analyze Biological Processes: Chlorpromazine

  24. Case Study 2

  25. Hits matching
      • Multiple matches found
        • including amiodarone
      • Network created with a focus on the liver
      • Analysis performed on the results

  26. Interactions of amiodarone in the liver

  27. Steatosis Network

  28. Table view of interactions and proteins: Amiodarone

  29. Conclusions
      • Combining structure-based searches with an interaction database allows the in silico assessment of the potential liabilities of a lead molecule
      • We have performed text-mining using Natural Language Processing (NLP). The approach uses both syntactic and semantic analysis of sentences along with inferencing.
      • We have applied NLP to PubMed abstracts to create a database of interactions containing proteins, small molecules and genes
      • We can perform similarity and sub-structure searches against this database to generate a network based on hits
      • We have demonstrated this approach in two cases to show scaffold liabilities for hepatotoxicity

  30. Acknowledgements
      • Pathway Architect™ Team
      • Sarchitect™ Team
      • Vaijayanti Gupta
      • R. Nalini

  31. Thank You
