Chemical Informatics: Advances in Predictive Methods and Machine Learning Techniques

6. Machine Learning and Other Predictive Methods

Chemical Space 2

Predictive Methods • Predict physical, chemical, and biological properties • For example: 3D structure, NMR and mass spectra, boiling point, melting point, solubility (log P), toxicity, reaction rates, binding affinities, QSAR, … • Dock PDB to PubChem 3

Methods • Spectrum of methods: • Schrödinger Equation • Molecular Dynamics • Machine Learning (e.g. SS prediction) 4

Chemical Informatics • Informatics must be able to deal with variable-size structured data • Graphical Models • (Recursive) Neural Networks • ILP • GA • SGs • Kernels 5

Neural Networks • Feedforward applied to fingerprints (1D) • Recursive applied to bond graph (2D) • Directed Acyclic Graph • State vectors • Weight sharing 6

Chemo/Bio Informatics Two Key Ingredients 1. Data 2. Similarity Measures Bioinformatics analogy and differences: • Data (GenBank, Swissprot, PDB) • Similarity (BLAST) 7

Organic Chemicals • Fundamental Importance of Similarity Measures • Rapid Search of Large Databases • Protein Receptor (Docking) • Small Molecule/Ligand (Similarity) • Predictive Methods (Kernel Methods) 8

Classification • Learning to Classify • Limited number of training examples (molecules, patients, sequences, etc.) • Learning algorithm (how to build the classifier?) • Generalization: should correctly classify test data. • Formalization • X is the input space • Y (e.g. toxic/non toxic, or {1,-1}) is the target class • f: X→Y is the classifier. 9

Linear Classifiers 10

Classification • Fundamental Point: f is entirely determined by the dot products <xi, xj>, which measure similarity between pairs of data points 11

Nonlinear Classification (Kernel Methods) • We can transform a nonlinear problem into a linear one using a kernel. 12

Nonlinear Classification (Kernel Methods) • We can transform a nonlinear problem into a linear one using a kernel K. • Fundamental property: the linear decision surface depends on K(xi, xj) = <φ(xi), φ(xj)>. • All we need is the Gram similarity matrix K. K defines the local metric of the embedding space. 13
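The claim that the classifier needs only the Gram matrix can be illustrated with a small kernel perceptron. This is a minimal sketch, not the learning algorithm used in the slides (which is not specified here); the polynomial kernel, the XOR toy data, and all function names are assumptions chosen for illustration.

```python
def poly_kernel(x, z, degree=2):
    """Polynomial kernel K(x, z) = (1 + <x, z>)^degree (illustrative choice)."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree


def gram_matrix(points, kernel):
    """Gram similarity matrix K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]


def train_kernel_perceptron(gram, labels, epochs=1000):
    """Dual perceptron: training reads only the Gram matrix and the labels."""
    n = len(labels)
    alpha = [0.0] * n  # alpha[i] counts the mistakes made on example i
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * labels[j] * gram[j][i] for j in range(n))
            if labels[i] * score <= 0:
                alpha[i] += 1.0
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return alpha


def predict(alpha, labels, points, kernel, x):
    """Classify x using kernel evaluations against the training points only."""
    score = sum(a * y * kernel(p, x) for a, y, p in zip(alpha, labels, points))
    return 1 if score > 0 else -1


# XOR is not linearly separable in the input space, but it becomes
# separable in the feature space induced by the degree-2 kernel.
xor_points = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [-1, 1, 1, -1]
K = gram_matrix(xor_points, poly_kernel)
alpha = train_kernel_perceptron(K, xor_labels)
predictions = [predict(alpha, xor_labels, xor_points, poly_kernel, x)
               for x in xor_points]
```

Replacing `poly_kernel` with any other valid kernel, including a molecular similarity measure, changes the embedding space without changing the training code.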

Finding a Good Kernel • Given: Two molecules. • Task: Systematically compute relevant similarity while being storage/time efficient. • Motivation: Enable efficient application of search and kernel algorithms. 14

Similarity: Data Representations NC(O)C(=O)O 15

1D SMILES Kernel • Example pair: CCCCCCc1ccc(cc1O)O and CCCCCc1ccc(cc1)CO • Total: 15 16
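A 1D kernel on SMILES strings can be sketched as a substring (spectrum) kernel. The exact feature definition behind the slide's figure is not recoverable from the transcript, so this sketch assumes all character substrings up to a fixed length, counted with multiplicity; the function names and the `max_len` parameter are hypothetical.

```python
from collections import Counter


def substring_counts(smiles, max_len=3):
    """Count every character substring of length 1..max_len in a SMILES string."""
    counts = Counter()
    for length in range(1, max_len + 1):
        for start in range(len(smiles) - length + 1):
            counts[smiles[start:start + length]] += 1
    return counts


def spectrum_kernel(s, t, max_len=3):
    """Dot product of the substring count vectors of two SMILES strings."""
    cs = substring_counts(s, max_len)
    ct = substring_counts(t, max_len)
    return sum(cs[key] * ct[key] for key in cs if key in ct)


mol_a = "CCCCCCc1ccc(cc1O)O"
mol_b = "CCCCCc1ccc(cc1)CO"
similarity = spectrum_kernel(mol_a, mol_b)
```

This version works on raw characters, so two-letter atoms such as Cl are split into two symbols; a more careful variant would tokenize the SMILES string first.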

2D Molecule Graph Kernel • For chemical compounds • atom/node labels: A = {C,N,O,H, … } • bond/edge labels: B = {s, d, t, ar, … } • Count labeled paths • Fingerprints (CsNsCdO) 17
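Counting labeled paths can be sketched as a depth-first enumeration over the bond graph. This is a minimal sketch under stated assumptions: paths are simple (no repeated atoms), each direction is counted separately, and the input format (`atoms`, `bonds`) and the `max_bonds` limit are hypothetical choices, not the slide's exact definition.

```python
from collections import Counter


def labeled_paths(atoms, bonds, max_bonds=3):
    """Count labeled simple paths, written like 'CsNsCdO'.

    atoms: list of atom labels, e.g. ["C", "N", "C", "O"]
    bonds: list of (i, j, bond_label) with labels such as "s", "d", "t", "ar"
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, label in bonds:
        neighbors[i].append((j, label))
        neighbors[j].append((i, label))

    counts = Counter()

    def extend(node, visited, text, depth):
        counts[text] += 1  # record the path that ends at this atom
        if depth == max_bonds:
            return
        for nxt, label in neighbors[node]:
            if nxt not in visited:
                extend(nxt, visited | {nxt}, text + label + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        extend(start, {start}, atoms[start], 0)
    return counts


def path_kernel(counts_a, counts_b):
    """Dot product of two path-count vectors."""
    return sum(counts_a[p] * counts_b[p] for p in counts_a if p in counts_b)


# The fragment C-N-C=O from the slide's fingerprint example.
fragment_atoms = ["C", "N", "C", "O"]
fragment_bonds = [(0, 1, "s"), (1, 2, "s"), (2, 3, "d")]
fragment_counts = labeled_paths(fragment_atoms, fragment_bonds)
```

A binary fingerprint follows by keeping only which paths occur (the keys of `fragment_counts`), typically hashed into a fixed-length bit vector.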

Similarity for Binary Fingerprints • Tally features of fingerprints A and B: • Unique to A (a), unique to B (b) • In common (c) • Similarity Formula • Tanimoto = c/(a+b+c) • Tversky(α,β) = c/(α·a + β·b + c) 18
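The two formulas can be written directly in terms of the tallies a, b, and c. This is a minimal sketch that represents each fingerprint as a Python set of "on" bit positions; the function names and the value returned for two empty fingerprints are assumptions.

```python
def tally(fp_a, fp_b):
    """Return (a, b, c): bits unique to A, bits unique to B, bits in common."""
    return len(fp_a - fp_b), len(fp_b - fp_a), len(fp_a & fp_b)


def tanimoto(fp_a, fp_b):
    """Tanimoto = c / (a + b + c)."""
    a, b, c = tally(fp_a, fp_b)
    total = a + b + c
    return c / total if total else 0.0  # assumed convention for empty inputs


def tversky(fp_a, fp_b, alpha, beta):
    """Tversky(alpha, beta) = c / (alpha*a + beta*b + c)."""
    a, b, c = tally(fp_a, fp_b)
    denom = alpha * a + beta * b + c
    return c / denom if denom else 0.0  # assumed convention for empty inputs


fp_a = {1, 2, 3, 5}  # a = 2 unique bits (1 and 5)
fp_b = {2, 3, 4}     # b = 1 unique bit (4); c = 2 shared bits (2 and 3)
t = tanimoto(fp_a, fp_b)               # 2 / (2 + 1 + 2) = 0.4
d = tversky(fp_a, fp_b, 0.5, 0.5)      # 2 / (1 + 0.5 + 2), the Dice coefficient
```

Tversky with α = β = 1 reduces to Tanimoto, and unequal weights make the measure asymmetric, which is useful for substructure-style queries.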

Similarity Measures 19

3D Coordinate Kernel • [Figure: molecule annotated with interatomic distances of 2.8 Å, 2.0 Å, 4.2 Å, 1.4 Å, and 3.4 Å] 20

Datasets • Mutag: 230 chemicals. Mutagenicity in Salmonella. 125 positive/63 negative. Leave-one-out cross validation. • PTC: Several hundred chemicals. Toxicity / carcinogenicity in male and female mice and rats. Leave-one-out cross validation. • NCI: Several thousand chemicals. Growth inhibition in 60 tumor cell lines. Close to 50/50. 20 random 80/20 cross-validated splits. 21
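The leave-one-out protocol used for Mutag and PTC can be sketched as follows. This is an illustration only: it uses a 1-nearest-neighbor rule with Tanimoto similarity as a stand-in for the kernel classifiers evaluated in the slides, and the toy fingerprints and labels are invented for the example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints stored as sets of bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def leave_one_out_accuracy(examples, labels, similarity):
    """Hold out each example in turn and predict it from all the others."""
    correct = 0
    for i in range(len(examples)):
        others = [j for j in range(len(examples)) if j != i]
        # Stand-in classifier: copy the label of the most similar other example.
        nearest = max(others, key=lambda j: similarity(examples[i], examples[j]))
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(examples)


# Hypothetical toy data: two "active" and two "inactive" fingerprints.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}]
toy_labels = [1, 1, -1, -1]
accuracy = leave_one_out_accuracy(toy_fps, toy_labels, tanimoto)
```

With n molecules this trains and tests n times, which is affordable for datasets of a few hundred compounds; the larger NCI sets use repeated random 80/20 splits instead.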

Examples of Results: Mutag and PTC 22

Results 23

Example of Results (NCI) 24

Example of Results: NCI Accuracy/ROC 25

Comparison of Kernels (NCI) 26

Regression: Aqueous Solubility • 30 folds cross-validation • Delaney Dataset: 1440 Examples 27

XLogP • 40 folds cross-validation • Dataset size: 1991 • Reference: S. J. Swamidass, J. Chen, P. Phung, J. Bruand, L. Ralaivola, and P. Baldi. Kernels for Small Molecules and the Prediction of Mutagenicity, Toxicity, and Anti-Cancer Activity. Proceedings of the 2005 Conference on Intelligent Systems for Molecular Biology, ISMB 05. Bioinformatics, 21, Supplement 1, i359-368, (2005). 28

Additional Representations • Example: NC(CO)C(=O)O • 1D: SMILES string • 2D: Atomic connection table • 2.5D: 2D surface in 3D space • 3D: XYZ coordinates of labeled points • Multiple conformers: • 3.5D: Bag of conformers as 2D surfaces in 3D space • 4D: Bag of conformers as XYZ coordinates of labeled points 29

2.5D Surface Kernel • Build a graph G (V = atoms) which approximates the surface (convex hull). • Use spectral graph kernels on G. 30

2.5D Surface Kernel • Compute the regular/Delaunay tessellation (tetrahedralization) of the convex hull of the atoms in the molecule. • Use the alpha-shape algorithm to detect surface triangles at the relevant scale (keep interior and regular edges, remove singular edges, r on the order of water + carbon radius). • This yields a triangulated graph that approximates the surface (average degree 6). • Use a spectral kernel with paths (l=3,4) on the triangulated surface graph. 31

Alpha Shape • The shape formed by a set of points. • Closely related to the solvent accessible surface. • Calculated in O(n*log(n)) using CGAL http://www.cgal.org/Manual/doc_html/cgal_manual/Alpha_shapes_3/Chapter_main.html 32

The Conformer Problem • Atoms connected by proximity • Different conformers have different graphs and features. 33
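The conformer problem can be made concrete with a proximity graph built from 3D coordinates. In this minimal sketch the two four-atom "conformers", their coordinates, and the 2.0 distance cutoff are all invented for illustration and are not chemically realistic; `proximity_graph` is a hypothetical helper, not the surface construction used in the slides.

```python
import math
from itertools import combinations


def proximity_graph(coords, cutoff):
    """Connect every pair of atoms whose distance is at most `cutoff`."""
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if math.dist(coords[i], coords[j]) <= cutoff
    }


# Two made-up conformers of the same four-atom chain.
extended = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
folded = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 0.0)]

edges_extended = proximity_graph(extended, cutoff=2.0)
edges_folded = proximity_graph(folded, cutoff=2.0)

# Edges present in only one of the two conformers.
changed_edges = edges_extended ^ edges_folded
```

Folding brings atoms 0 and 3 within the cutoff, so the same molecule yields two different graphs and therefore different path features, which motivates treating a molecule as a bag of conformers.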

2.5D + Conformers = 3.5D Molecule A Molecule B 34

Molecular Representations and Kernels • 1D: SMILES strings • 2D: Graph of bonds • 2.5D: Surfaces • 3D: Atomic coordinates (Pharmacophores, Epitopes) • 3.5D: Conformers • 4D: Temporal evolution • 4D: Isomers 35

Summary • ChemDB and other resources • Variety of kernels for small molecules • State-of-the-art performance on several benchmark datasets • For now, 2D kernels slightly better than 1D and 3D kernels • Many possible extensions: 2.5D, 3D, 3.5D, 4D kernels • Need for larger data sets and new models of cooperation in the chemistry community • Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules/reactions, retrosynthesis, prediction of reaction rates, information retrieval from literature, docking, matching table of all proteins against all known compounds, origin of life, etc.) 36
