
Exploring Chemical Space with Computers—Challenges and Opportunities

Exploring Chemical Space with Computers—Challenges and Opportunities. Pierre Baldi UCI. Chemical Informatics. Historical perspective: physics, chemistry and biology Understanding chemical space Small molecules (systems biology, chemical synthesis, drug design, nanotechnology).


Presentation Transcript


  1. Exploring Chemical Space with Computers—Challenges and Opportunities Pierre Baldi UCI

  2. Chemical Informatics • Historical perspective: physics, chemistry and biology • Understanding chemical space • Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)

  3. Chemical Space

  4. Chemical Space

  5. Chemical Informatics • Historical perspective: physics, chemistry and biology • Understanding chemical space • Small molecules (systems biology, chemical synthesis, drug design, nanotechnology) • Predict physical, chemical, biological properties (classification/regression) • Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.

  6. Methods • Spectrum: • Schrödinger Equation • Molecular Dynamics • Machine Learning (e.g. SS prediction)

  7. Chemical Informatics • Informatics must be able to deal with variable-size structured data • Graphical Models • (Recursive) Neural Networks • ILP • GA • SGs • Kernels

  8. Two Essential Ingredients • Data • Similarity Measures Bioinformatics analogy and differences: • Data (GenBank, Swissprot, PDB) • Similarity (BLAST)

  9. Data • Mutag (Mutagenicity) • 200 compounds (125/63), mutagenicity in Salmonella • PTC (Predictive Toxicity Challenge) • A few hundred compounds, carcinogenicity (FM, MM, FR, MR) • NCI (Anti-cancer activity) • 70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines • Alkanes (Boiling points) • All 150 non-cyclic alkanes (C_nH_{2n+2}) with n < 11 and their boiling points ([-164, 174] °C) • Benzodiazepines (QSAR) • 79 1,4-benzodiazepin-2-ones, affinity towards GABA_A • ChemDB • 7M compounds

  10. Similarity • Rapid Searches of Large Databases • Predictive Methods (Kernel Methods) • Why it is not hopeless

  11. Organic Chemicals Similarity • Rapid Search of Large Databases • Protein Receptor (Docking) • Small Molecule/Ligand (Similarity) • Predictive Methods (Kernel Methods) • Why it is not hopeless

  12. Linear Classifiers

  13. Classification • Learning to Classify • Limited number of training examples (molecules, patients, sequences, etc.) • Learning algorithm (how to build the classifier?) • Generalization: should correctly classify test data. • Formalization • X is the input space • Y (e.g. toxic/non toxic, or {1,-1}) is the target class • f: X→Y is the classifier.

  14. Classification • Fundamental Point: • f is entirely determined by the dot products ⟨x_i, x_j⟩ measuring the similarity between pairs of data points
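The point on slide 14 can be illustrated with a perceptron written in dual form. This is a minimal sketch, not the method used in the talk: the function names, the toy data, and the choice of the perceptron rule are all assumptions made for illustration. The training points are touched only through `similarity(x_i, x_j)`, so any other similarity function could be passed in place of the plain dot product.

```python
def dot(u, v):
    """Plain dot product <u, v>, used here as the similarity measure."""
    return sum(a * b for a, b in zip(u, v))


def score(x, xs, ys, alpha, similarity=dot):
    """Decision value: sum_i alpha_i * y_i * similarity(x_i, x)."""
    return sum(a * yi * similarity(xi, x) for a, xi, yi in zip(alpha, xs, ys))


def train_dual_perceptron(xs, ys, similarity=dot, epochs=20):
    """Learn one coefficient per training example (its mistake count).

    No weight vector is ever formed: the data enter only via similarity().
    """
    alpha = [0] * len(xs)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * score(x, xs, ys, alpha, similarity) <= 0:
                alpha[i] += 1  # misclassified (or undecided): strengthen this example
                mistakes += 1
        if mistakes == 0:  # every training point is now classified correctly
            break
    return alpha


def predict(x, xs, ys, alpha, similarity=dot):
    """Classify x as +1 or -1 using only similarities to the training points."""
    return 1 if score(x, xs, ys, alpha, similarity) > 0 else -1


# Hypothetical toy data: two linearly separable classes labelled +1 / -1.
xs = [(2.0, 1.0), (1.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
ys = [1, 1, -1, -1]
alpha = train_dual_perceptron(xs, ys)
print([predict(x, xs, ys, alpha) for x in xs])
```

Because the classifier is written this way, replacing `dot` with a kernel is a one-argument change.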

  15. Non Linear Classification (Kernel Methods) • We can transform a nonlinear problem into a linear one using a kernel.

  16. Non Linear Classification (Kernel Methods) • We can transform a nonlinear problem into a linear one using a kernel K. • Fundamental property: the linear decision surface depends on K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩. • All we need is the Gram similarity matrix K. K defines the local metric of the embedding space.
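A small numerical check of the property on slide 16, under assumptions: the homogeneous quadratic kernel on 2D points is chosen here only because its feature map φ can be written out explicitly; it does not appear in the slides. The sketch confirms that the kernel value equals the dot product in the embedding space and builds the Gram matrix that a kernel method would consume.

```python
import math


def phi(x):
    """Explicit feature map for the quadratic kernel on 2D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def quadratic_kernel(x, z):
    """K(x, z) = <x, z>^2, computed without ever building phi(x)."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def gram_matrix(points, kernel):
    """Matrix of all pairwise kernel values K(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


# Hypothetical sample points.
points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]
K = gram_matrix(points, quadratic_kernel)

# The same matrix obtained the long way, through the embedding.
K_explicit = gram_matrix(
    points, lambda x, z: sum(a * b for a, b in zip(phi(x), phi(z)))
)
```

The two matrices agree up to floating-point rounding, which is why a kernel method never needs φ itself.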

  17. Similarity: Data Representations NC(O)C(=O)O

  18. Molecular Representations • 1D: SMILES strings • 2D: Graph of bonds • 2.5D: Surfaces • 3D: Atomic coordinates • 4D: Temporal evolution

  19. 1D SMILES Kernel [figure: CCCCCCc1ccc(cc1O)O compared with CCCCCc1ccc(cc1)CO; total: 15]
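One common way to build a kernel on 1D strings is a spectrum kernel, which counts substrings of length k shared by two strings. The sketch below is an assumption about the general idea, not a reconstruction of the slide's computation: the substring length `k`, the character-level tokenization (which splits two-letter atoms such as Cl), and the function names are all illustrative, and the code is not claimed to reproduce the total shown on the slide.

```python
from collections import Counter


def spectrum(smiles, k=3):
    """Count every length-k substring of a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Dot product of the two substring-count vectors."""
    cs, ct = spectrum(s, k), spectrum(t, k)
    return sum(cs[sub] * ct[sub] for sub in cs.keys() & ct.keys())


# The two SMILES strings shown on slide 19.
a = "CCCCCCc1ccc(cc1O)O"
b = "CCCCCc1ccc(cc1)CO"
print(spectrum_kernel(a, b), spectrum_kernel(a, a), spectrum_kernel(b, b))
```

Since the value is a dot product of count vectors, the resulting Gram matrix is symmetric and positive semidefinite, which is what a kernel classifier requires.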

  20. 2D Molecule Graph Kernel • For chemical compounds • atom/node labels: A = {C,N,O,H, … } • bond/edge labels: B = {s, d, t, ar, … } • Count labeled paths • Fingerprints (CsNsCdO)
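The path-counting idea on slide 20 can be sketched as follows. This is an illustrative sketch under assumptions: the graph encoding (a list of atom labels plus a dictionary of labelled bonds), the depth limit `max_bonds`, and the function names are hypothetical, and hydrogens are left implicit. The first molecule is the NC(O)C(=O)O example from slide 17; the second, glycine (NCC(=O)O), is added here only for comparison. Tanimoto and MinMax are the two similarity measures named in the results table.

```python
from collections import Counter


def labeled_paths(atoms, bonds, max_bonds=3):
    """Count labelled simple paths (e.g. 'NsCsCdO') starting from every atom."""
    neighbors = {i: [] for i in range(len(atoms))}
    for (i, j), label in bonds.items():
        neighbors[i].append((j, label))
        neighbors[j].append((i, label))
    counts = Counter()

    def walk(node, visited, text, depth):
        counts[text] += 1
        if depth == max_bonds:
            return
        for nxt, label in neighbors[node]:
            if nxt not in visited:  # simple paths only: never revisit an atom
                walk(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        walk(start, {start}, atoms[start], 0)
    return counts


def tanimoto(p, q):
    """Shared distinct paths divided by all distinct paths (presence only)."""
    a, b = set(p), set(q)
    return len(a & b) / len(a | b)


def minmax(p, q):
    """Like Tanimoto, but taking path counts into account."""
    keys = set(p) | set(q)
    return sum(min(p[k], q[k]) for k in keys) / sum(max(p[k], q[k]) for k in keys)


# NC(O)C(=O)O from slide 17; s = single bond, d = double bond.
mol_a = (["N", "C", "O", "C", "O", "O"],
         {(0, 1): "s", (1, 2): "s", (1, 3): "s", (3, 4): "d", (3, 5): "s"})
# Glycine, NCC(=O)O, chosen here purely as a comparison molecule.
mol_b = (["N", "C", "C", "O", "O"],
         {(0, 1): "s", (1, 2): "s", (2, 3): "d", (2, 4): "s"})

paths_a = labeled_paths(*mol_a)
paths_b = labeled_paths(*mol_b)
print(tanimoto(paths_a, paths_b), minmax(paths_a, paths_b))
```

Paths are counted once per direction in this sketch, so both `CsN` and `NsC` appear. A fixed-length fingerprint would additionally hash each path string into a bit vector, which is not shown here.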

  21. Similarity Measures

  22. 3D Coordinate Kernel [figure: interatomic distances 2.8 Å, 2.0 Å, 4.2 Å, 1.4 Å, 3.4 Å]
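The slide shows interatomic distances, and the results table lists a "3D Histogram" kernel, but the transcript does not say how that kernel is defined. One plausible reading, offered strictly as an assumption, is to bin all pairwise atomic distances into a histogram and compare histograms with a dot product. The bin width, the cutoff, the coordinates, and the function names below are hypothetical.

```python
import math
from itertools import combinations


def distance_histogram(coords, bin_width=0.5, max_dist=5.0):
    """Histogram of pairwise distances (in angstroms) below max_dist."""
    n_bins = math.ceil(max_dist / bin_width)
    hist = [0] * n_bins
    for p, q in combinations(coords, 2):
        d = math.dist(p, q)
        if d < max_dist:
            hist[min(int(d / bin_width), n_bins - 1)] += 1
    return hist


def histogram_kernel(h1, h2):
    """Dot product of two distance histograms."""
    return sum(a * b for a, b in zip(h1, h2))


# Hypothetical three-atom geometry (coordinates in angstroms).
coords = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (1.4, 2.1, 0.0)]
# The same geometry translated; distances, and so the histogram, are unchanged.
shifted = [(x + 10.0, y - 3.0, z + 2.0) for x, y, z in coords]

h = distance_histogram(coords)
h_shifted = distance_histogram(shifted)
```

Working from distances rather than raw coordinates makes the representation independent of how the molecule is positioned or rotated, which is one reason a distance-based summary is a natural choice for 3D data.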

  23. Example of Results

  24. Results

  25. Results

  26. Results

  27. Example of Results

  28. Summary • Derived a variety of kernels for small molecules • State-of-the-art performance on several benchmark datasets • 2D kernels slightly better than 1D and 3D kernels • Many possible extensions: 2.5D kernels, isomers, etc. • Need for larger data sets and new models of cooperation in the chemistry community • Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules, information retrieval from literature, docking, prediction of reaction rates, matching table of all proteins against all known compounds, origin of life) • Chemistry version of the Turing test

  29. ChemDB • 7M compounds (3.5M unique) • Commercially available • PostgreSQL/Oracle • Annotation (Experimental, Computational) • Searchable • Web interface • Similarity, in silico reactions

  30. Acknowledgements • Pharmacology • Daniele Piomelli • Chemistry • G. Weiss • J. S. Nowick • R. Chamberlin • Informatics • Liva Ralaivola • J. Chen • S. J. Swamidass • Yimeng Dou • Peter Phung • Jocelyne Bruand • Funding • NIH • NSF • IGB

  31. New Questions • Predict drug-like molecules? toxicity? • New Strategies • How can we search efficiently? Intelligently? • New data structures and algorithms • Optimizing old structures • How can we understand this much data? • Cluster and visualize millions of data points • Define commercially accessible space. • Are there other useful things we can do with this? • Discover new polymers, etc. • Wonder about the origin of life. • Combinatorially combine all known chemicals.

  32. Acknowledgements • Jocelyne Bruand • Peter Phung • Liva Ralaivola • S. Joshua Swamidass • Yimeng Dou • NIH/NSF/IGB • Questions?

  33. Docking • Query: binding site of protein • Scoring function & efficient minimizer • Database of potential drugs: 6 million small molecules …

  34. Some Targets • P53 (Luecke) • ACCD5 (Tsai) • IMPDH, PPAR, etc. (Luecke) • HIV Integrase (Robinson)

  35. P53

  36. Drug Rescue of P53 Mutants

  37. Docking → ChemDB • ~6 million commercially available compounds • Searchable, annotated, downloadable. • Other Databases: • Cambridge Structural Database • ChemBank • PubChem

  38. Chemical Toxicity Prediction by Kernel Methods • Jonathan Chen • S. Joshua Swamidass • The Baldi Lab

  39. Data Flow [figure: toxicity state list (ID / Toxic?: 1 No, 2 No, 3 Yes, 4 Yes), kernel, Gram matrix, linear classifier, predictions]

  40. Results

  41. Example of Results

      Kernel/Method                   Mutag   MM     FM     MR     FR
      Kashima (2003)                  89.1    61.0   61.0   62.8   66.7
      Kashima (2003)                  85.1    64.3   63.4   58.4   66.1
      1D SMILES spec.                 84.0    66.1   61.3   57.3   66.1
      1D SMILES spec+                 85.6    66.4   63.0   57.6   67.0
      2D Tanimoto                     87.8    66.4   64.2   63.7   66.7
      2D MinMax                       86.2    64.0   64.5   64.5   66.4
      2D Tanimoto, l = 1024, b = 1    87.2    66.1   62.4   65.7   66.9
      2D Hybrid l = 1024, b = 1       87.2    65.2   61.9   64.2   65.8
      2D Tanimoto, l = 512, b = 1     84.6    66.4   59.9   59.9   66.1
      2D Hybrid l = 512, b = 1        86.7    65.2   61.0   60.7   64.7
      2D Tanimoto, l = 1024 + MI      84.6    63.1   63.0   61.9   66.7
      2D Hybrid l = 1024 + MI         84.6    62.8   63.7   61.9   65.5
      2D Tanimoto, l = 512 + MI       85.6    60.1   61.0   61.3   62.4
      2D Hybrid l = 512 + MI          86.2    63.7   62.7   62.2   64.4
      3D Histogram                    81.9    59.8   61.0   60.8   64.4

  42. Chemical Informatics • Historical perspective: physics, chemistry and biology • Understanding chemical space • Small molecules (systems biology, chemical synthesis, drug design, nanotechnology) • Catalog • Predict physical, chemical, biological properties • Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.

  43. Datasets

  44. Small Molecules as Undirected Labeled Graphs of Bonds • atom/node labels: A = {C,N,O,H, … } • bond/edge labels: B = {s, d, t, ar, … }

  45. Chemical Informatics • Historical perspective: physics, chemistry and biology • Understanding chemical space • Small molecules (systems biology, chemical synthesis, drug design, nanotechnology) • Bioinformatics analogy: • Catalog (GenBank) • Search (BLAST) • Predict physical, chemical, biological properties • Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.

  46. Chemical Informatics • Historical perspective: physics, chemistry and biology • Understanding chemical space • Small molecules (systems biology, chemical synthesis, drug design, nanotechnology) • Bioinformatics analogy: • Catalog (GenBank) • Search (BLAST) • Predict physical, chemical, biological properties • Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.
