A Fast Tour of Predictive Cheminformatics
Curt Breneman
Perspectives in Chemistry
October 19, 2011

The Problem Domain…
RECCR Cheminformatics Center
Data Management Core
• Curation and Standardization ***
• Data Collection
• Database Management
• Model Implementation
Descriptor Core
• HTS Descriptors ***
• Structure-based design
• Ligand-based design
• Molecular Similarity
• Descriptor benchmarks
Modeling Core
• Alternate Model Fusion ***
• Task-targeted modeling
• Multi-objective Learning
• Applicability Domains
• Model benchmarks
Software Engineering and Dissemination
• Algorithm Implementation ***
• Computing
• Visualization
• Database Development
• User Interface
• Support and Documentation

The Evolution of Informatics
• ~4000 BC: Experiment (experiments, observations, records)
• ~1700 AD: Theory (theory, formalism, publication)
• ~1950+: Computation (first steps)
• ~1970+: Simulation
• ~1990+: Cheminformatics & Data Mining

The Data Mining Process
DATA → INFORMATION → KNOWLEDGE → UNDERSTANDING → WISDOM

Data Mining a Data-Rich Environment
• Experiment
• No Prior Hypothesis

Intersection of Chemistry and Biology
• 6627 small molecules
• 151 diverse assays

Mapping Chemistry to Biology
• Alignment-free Molecular Property Descriptors
• Multi-Latent Analysis Modeling Tools
• Multi-Objective Learning
• Non-linear Model Building and Validation Methods
[Figure: molecule sets by pH: pH 4 (Mol 1 to Mol 14), pH 5 (Mol 1 to Mol 17), pH 6 (Mol 1 to Mol 17), pH 7 (Mol 1 to Mol 15), pH 8 (Mol 1 to Mol 14)]

Predictive Cheminformatics Workflow
[Diagram labels: DATASET (NECCR, PubChem MLSCN, MLI, PDB, corporate partners); MOLECULAR STRUCTURE; MOLECULAR DESCRIPTORS; MOLECULAR ENVIRONMENT; MODEL; ACTIVITY (Synthesis, Assay); PREDICTED ACTIVITY]

Knowledge Discovery and Data Fusion
[Diagram labels: Domain expert molecular understanding; FUSED DATA; Database #1; Database #n]

Activity Modeling
Structural Descriptors (Physicochemical Descriptors, Topological Descriptors, Geometrical Descriptors) + Bioactivity =
Molecular Structures → Descriptors → Model → Activity

Representing Molecular Structure
AAACCTCATAGGAAGCATACCAGGAATTACATCA…

Descriptor Types
• Structural Descriptors
• Physicochemical Descriptors
• Topological Descriptors
• Geometrical Descriptors
• Constitutional Descriptors
• Electrostatic Descriptors
• Quantum-chemical Descriptors
• Thermodynamic Descriptors

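Constitutional descriptors are the simplest family named on this slide: they depend only on which atoms a molecule contains. The sketch below is an illustration, not RECCR software. The function names and the formula-string input are assumptions made for this example, and the parser only handles simple formulas without parentheses or charges.

```python
import re
from collections import Counter

# Standard average atomic masses (g/mol) for a few common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no parentheses)."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def constitutional_descriptors(formula):
    """Return a few composition-only descriptors (hypothetical names)."""
    counts = parse_formula(formula)
    heavy = sum(n for element, n in counts.items() if element != "H")
    weight = sum(ATOMIC_MASS[element] * n for element, n in counts.items())
    return {
        "n_atoms": sum(counts.values()),
        "n_heavy_atoms": heavy,
        "mol_weight": round(weight, 3),
    }


ethanol = constitutional_descriptors("C2H6O")
print(ethanol)
```

The other descriptor families on the slide (topological, geometrical, electrostatic, quantum-chemical) need connectivity, 3D coordinates, or electronic structure, so they cannot be derived from a formula alone.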
Descriptor Selection
• What features of a molecule are related to my activity?
• What descriptors can capture that information?

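One simple way to approach the first question is a univariate filter: rank each descriptor by how strongly it correlates with the measured activity. The sketch below uses absolute Pearson correlation. This is a common baseline and not necessarily the selection method used in the talk. The descriptor names and all numbers are made-up toy values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_descriptors(table, activity):
    """Sort descriptor names by |correlation| with activity, strongest first."""
    scores = {name: abs(pearson(values, activity)) for name, values in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy values for illustration only (not real measurements).
descriptors = {
    "descriptor_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # tracks activity closely
    "descriptor_b": [2, 2, 2, 2, 2],            # constant, carries no information
    "descriptor_c": [3.0, 1.0, 4.0, 1.0, 5.0],  # weakly related
}
activity = [2.1, 3.9, 6.2, 8.0, 9.8]

ranking = rank_descriptors(descriptors, activity)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A filter like this ignores interactions between descriptors and non-linear relationships, which is one reason non-linear and multivariate methods are also used.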
Surface Property Distribution Histograms
Molecular surface property distributions can be represented as RECON/TAE histogram bin descriptors.

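The histogram idea can be sketched generically: sample a property at points on the molecular surface, bin the values over a fixed range, and use the bin fractions as a fixed-length descriptor vector. The code below is a minimal illustration under that assumption. It does not reproduce the actual RECON/TAE properties, bin ranges, or bin counts, none of which are given on the slide, and the sample values are invented.

```python
def histogram_descriptors(values, lo, hi, n_bins):
    """Fraction of surface points whose property value falls in each bin.

    Values outside [lo, hi) are clamped into the first or last bin so the
    descriptor vector always has n_bins entries that sum to 1.
    """
    if not values:
        raise ValueError("need at least one surface sample")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in values:
        index = int((value - lo) / width)
        index = min(max(index, 0), n_bins - 1)
        counts[index] += 1
    return [count / len(values) for count in counts]


# Hypothetical surface property samples (arbitrary units).
surface_samples = [-0.9, -0.4, -0.1, 0.0, 0.2, 0.3, 0.7, 0.95]
bins = histogram_descriptors(surface_samples, lo=-1.0, hi=1.0, n_bins=4)
print(bins)
```

Because the bin edges are fixed in advance, molecules of different sizes map to vectors of the same length and no structural alignment is required.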
PROLICSS: Protein-Ligand Complementary Surface Scoring
[Figure: PROLICSS surface for 6CPA with EP and MLP histograms]

PMF expansion-based hydration patterns
• Developing an efficient alternative to full simulations by means of a potentials-of-mean-force (PMF) expansion
• Employing a library of lower-order correlation functions derived from explicit simulations to predict the average equilibrium density and the orientation profile of water in the space surrounding biomolecules or ligands
[Figure: Water density values in the space surrounding an alpha-helix (left) and a protein X (right), predicted using the PMF expansion (cyan) and obtained from exact simulation (magenta)]

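As a rough illustration of how a library of low-order correlation functions can replace a full simulation, the sketch below truncates the expansion at the pair level. There the water density at a point is approximated as the bulk density times the product of solute-site/water pair correlation functions. The slide does not state the truncation order, so the pair-level cut is an assumption, and `toy_pair_correlation` is an invented stand-in for functions tabulated from explicit simulations. Orientation profiles are not modeled here.

```python
from math import dist, exp

RHO_BULK = 0.0334  # approximate bulk water number density, molecules per cubic angstrom


def toy_pair_correlation(r):
    """Made-up g(r): excluded core below 2.5 angstrom, one peak near 3.0, then 1."""
    if r < 2.5:
        return 0.0
    return 1.0 + 0.8 * exp(-((r - 3.0) ** 2) / 0.1)


def water_density(point, solute_sites, g=toy_pair_correlation):
    """Pair-level estimate: rho(r) ~ rho_bulk * product over sites of g(|r - r_i|)."""
    density = RHO_BULK
    for site in solute_sites:
        density *= g(dist(point, site))
    return density


# Two hypothetical solute sites, 10 angstrom apart.
sites = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
near_peak = water_density((3.0, 0.0, 0.0), sites)
inside_core = water_density((1.0, 0.0, 0.0), sites)
far_away = water_density((50.0, 0.0, 0.0), sites)
print(near_peak, inside_core, far_away)
```

Taking the negative logarithm of the product shows why this is a PMF expansion: the total potential of mean force is approximated as a sum of pair terms, each equal to -kT ln g. Higher-order (for example, triplet) terms would appear as additional correction factors.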
Predictive Cheminformatics
Learning from the past to predict the future…

Challenges in Predictive Modeling
• “First there are the known knowns”
  • These are the things that we know we know
• “Then there are the known unknowns”
  • These are the things that we now know we do not know
• “Finally there are also the unknown unknowns”
  • These are the things that we do not yet know we do not know
• “And each day brings us a few more unknown unknowns”
• Donald Rumsfeld, 2003

Prediction Pitfalls…
• “Who wants to hear actors talk?” – H.M. Warner, 1927
• “Forget it – no civil war picture ever made a nickel” – MGM executive, in 1937, advising against production of “Gone with the Wind”
• “I think there might be a market for maybe five computers” – Thomas Watson, IBM, 1943
• “Computers in the future may weigh no more than 1.5 tons” – Popular Mechanics, 1949
• “There is no reason anyone would want a computer in their home” – Ken Olsen, founder of Digital Equipment Corporation, 1977

Machine Learning Methods and Statistical Modeling
“If your experiment needs statistics, you ought to have done a better experiment” - Ernest Rutherford
“But what if you haven’t done the experiment yet?” - Curt Breneman

Model Building and Validation
• Y-scrambling model validation!
[Diagram labels: DATASET; Training set; Test set; Bootstrap sample k; Training; Validation; Learning Model; Tuning / Prediction; Predictive Model; Prediction]

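Y-scrambling checks whether a model has learned a real structure-activity relationship or is only fitting chance correlations. The activity values are randomly permuted, the model is refit, and the scrambled fit quality is compared with the real one. The sketch below uses a one-descriptor least-squares line as a stand-in for whatever learner is being validated. The data are toy values, and the function names are assumptions for this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def y_scrambling(xs, ys, n_trials=20, seed=0):
    """Refit on randomly permuted activities; return the R^2 of each trial."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        slope, intercept = fit_line(xs, shuffled)
        scores.append(r_squared(xs, shuffled, slope, intercept))
    return scores


# Toy descriptor/activity pairs for illustration only.
xs = [float(i) for i in range(1, 11)]
ys = [3.1, 4.8, 7.2, 9.1, 10.9, 13.2, 14.8, 17.1, 19.0, 20.9]

real_r2 = r_squared(xs, ys, *fit_line(xs, ys))
scrambled = y_scrambling(xs, ys)
print(f"real R^2 = {real_r2:.3f}")
print(f"mean scrambled R^2 = {sum(scrambled) / len(scrambled):.3f}")
```

If the scrambled models score nearly as well as the real one, the apparent predictive power is suspect. In practice the same comparison would be made on cross-validated or held-out test-set statistics and not only on the training fit shown here.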
Predictive QSAR Workflow example
• 48 anticonvulsants *
• ~760 kNN QSAR models
• Acceptance criteria
• 10 best models
• Mining DBs using probes: ca. 255,000 chemicals in DBs
• Similarity cutoff
• 4334 hits
• Predictions with 10 QSAR models using applicability domain
• 50 consensus (common) hits
• 22 compounds submitted to chemists
• 9 compounds selected based on synthetic considerations
• NIH testing
• 7 compounds active
* Shen, M., et al. J. Med. Chem., 2002, 45, 2811-2823; Shen, M., et al. J. Med. Chem., 2004, 47, 2356-2364.

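Three ideas from this workflow can be illustrated with a small sketch: kNN prediction, an applicability domain that refuses to predict for compounds too far from the training set, and consensus across several models. This is not the published procedure from the cited papers. The fixed distance cutoff, the use of descriptor-column subsets to form different models, and all data values are assumptions for this example.

```python
from math import dist


def knn_predict(query, train_x, train_y, k, cutoff):
    """Mean activity of the k nearest training compounds.

    Returns None when the mean distance to those neighbours exceeds the
    cutoff, i.e. the query lies outside the applicability domain.
    """
    neighbours = sorted((dist(query, x), y) for x, y in zip(train_x, train_y))[:k]
    if sum(d for d, _ in neighbours) / k > cutoff:
        return None
    return sum(y for _, y in neighbours) / k


def consensus_hits(candidates, models, threshold):
    """Keep candidates that every model predicts active and in-domain."""
    hits = []
    for name, descriptors in candidates.items():
        predictions = [
            knn_predict([descriptors[i] for i in m["columns"]],
                        m["x"], m["y"], m["k"], m["cutoff"])
            for m in models
        ]
        if all(p is not None and p >= threshold for p in predictions):
            hits.append(name)
    return hits


# Toy training set: three descriptors per compound, activity in [0, 1].
full_x = [[0.0, 0.0, 1.0], [0.1, 0.2, 0.9], [1.0, 1.0, 0.0], [0.9, 1.1, 0.1]]
full_y = [1.0, 0.9, 0.1, 0.0]


def make_model(columns, k=2, cutoff=0.5):
    """One kNN model per descriptor subset (hypothetical helper)."""
    return {"columns": columns, "k": k, "cutoff": cutoff, "y": full_y,
            "x": [[row[i] for i in columns] for row in full_x]}


models = [make_model([0, 1]), make_model([1, 2]), make_model([0, 2])]
candidates = {
    "cand_A": [0.05, 0.1, 0.95],  # close to the active training compounds
    "cand_B": [0.95, 1.0, 0.05],  # close to the inactive training compounds
    "cand_C": [5.0, 5.0, 5.0],    # far from everything: out of domain
}
hits = consensus_hits(candidates, models, threshold=0.5)
print(hits)
```

Requiring agreement from every model and rejecting out-of-domain compounds trades recall for reliability, which fits a workflow where only a handful of compounds can be synthesized and tested.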
RECCR Interactive Applications
Data Preparation, Descriptor Generation and Modeling

Software developed at RECCR
Mfold (Mike Zuker)
• RNA, DNA secondary structure prediction
Analyze (Mark Embrechts)
• Fast KPLS test set mode with low memory footprint
RECON
• Transferable Atom Equivalent descriptors
RECON for MOE
• Drop-in interactive for MOE 2007
PROTEIN RECON for protein characterization
• Property moment descriptors
COLIBRI (with Alex Tropsha)
• Binding site/ligand scoring using Universal Descriptor Space
DIXEL
• DNA Characterization and bioinformatics
PEST
• Compatible with Gaussian or Jaguar
Software employing TAE descriptors

The RECCR Community
http://reccr.chem.rpi.edu

ACKNOWLEDGMENTS
• Current and Former members of the DDASSL group
• Breneman Research Group (RPI Chemistry)
  • N. Sukumar
  • M. Sundling
  • Min Li
  • Long Han
  • Jed Zaretski
  • Theresa Hepburn
  • Mike Krein
  • Steve Mulick
  • Shiina Akasaka
  • Hongmei Zhang
  • C. Whitehead (Pfizer Global Research)
  • L. Shen (BNPI)
  • L. Lockwood (Syracuse Research Corporation)
  • M. Song (Synta Pharmaceuticals)
  • D. Zhuang (Simulations Plus)
  • W. Katt (Yale University chemistry graduate program)
  • Q. Luo (J & J)
• Embrechts Research Group (RPI DSES)
• Tropsha Research Group (UNC Chapel Hill)
• Bennett Research Group (RPI Mathematics)
  • Jinbo Bi
• Collaborators:
  • Lawrence Research Group (NYS Wadsworth Labs)
    • Inna Vitol
  • Cramer Research Group (RPI Chemical Engineering)
• Funding
  • NIH (GM047372-07)
  • NIH (1P20HG003899-01)
  • NSF (BES-0214183, BES-0079436, IIS-9979860)
  • GE Corporate R&D Center
  • Millennium Pharmaceuticals
  • Concurrent Pharmaceuticals
  • Pfizer Pharmaceuticals
  • ICAGEN Pharmaceuticals
  • Eastman Kodak Company
  • Chemical Computing Group (CCG)