Explore the cutting-edge predictive methods in chemical informatics like 3D structure prediction, neural networks, and kernel methods. Learn about similarity measures, data representations, and classification algorithms for organic chemicals. Discover how machine learning and other advanced techniques can revolutionize research in chemical space.
Predictive Methods • Predict physical, chemical, and biological properties • For example: 3D structure, NMR and mass spectra, boiling point, melting point, solubility (log P), toxicity, reaction rates, binding affinities, QSAR, … • Dock PDB to PubChem
Methods • Spectrum of methods: • Schrödinger Equation • Molecular Dynamics • Machine Learning (e.g. SS prediction)
Chemical Informatics • Informatics must be able to deal with variable-size structured data • Graphical Models • (Recursive) Neural Networks • ILP • GA • SGs • Kernels
Neural Networks • Feedforward applied to fingerprints (1D) • Recursive applied to bond graph (2D) • Directed Acyclic Graph • State vectors • Weight sharing
Chemo/Bio Informatics • Two Key Ingredients: 1. Data 2. Similarity Measures • Bioinformatics analogy and differences: • Data (GenBank, Swissprot, PDB) • Similarity (BLAST)
Organic Chemicals: Fundamental Importance of Similarity Measures • Rapid Search of Large Databases • Protein/Receptor (Docking) • Small Molecule/Ligand (Similarity) • Predictive Methods (Kernel Methods)
Classification • Learning to Classify • Limited number of training examples (molecules, patients, sequences, etc.) • Learning algorithm (how to build the classifier?) • Generalization: should correctly classify test data. • Formalization • X is the input space • Y (e.g. toxic/non-toxic, or {1,-1}) is the target class • f: X→Y is the classifier.
Classification • Fundamental Point: f is entirely determined by the dot products <xi, xj> measuring similarity between pairs of data points
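One way to make this concrete, assuming a linear classifier written in dual form (for example a support vector machine or a perceptron; the slide does not name the algorithm). The coefficients α_i and the bias b are assumed symbols for the learned parameters, and y_i is the class label of training point x_i:

```latex
f(x) = \operatorname{sign}\Big( \sum_{i=1}^{n} \alpha_i \, y_i \, \langle x_i, x \rangle + b \Big)
```

In this form the training data enter the decision only through dot products, which is what lets a kernel be substituted for the dot product later.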
Nonlinear Classification (Kernel Methods) • We can transform a nonlinear problem into a linear one using a kernel.
Nonlinear Classification (Kernel Methods) • We can transform a nonlinear problem into a linear one using a kernel K. • Fundamental property: the linear decision surface depends on K(xi, xj) = <φ(xi), φ(xj)>. • All we need is the Gram similarity matrix K. K defines the local metric of the embedding space.
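A minimal sketch of the property K(xi, xj) = <φ(xi), φ(xj)>, using a degree-2 polynomial kernel on 2D points as an illustrative stand-in (the slides do not specify this kernel; the points and all function names are hypothetical). It builds the Gram matrix twice, once through the kernel and once through the explicit feature map φ, and the two should agree:

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for K(x, y) = <x, y>^2 in two dimensions."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel, computed without forming phi."""
    return dot(x, y) ** 2

def gram_matrix(points, kernel):
    """Matrix of pairwise kernel values K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in points] for xi in points]

# Hypothetical data points, chosen only for illustration.
points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]

K = gram_matrix(points, poly2_kernel)
K_explicit = gram_matrix(points, lambda x, y: dot(phi(x), phi(y)))

for row in K:
    print(row)
```

A learning algorithm that only consumes K never needs φ itself, which is why a similarity measure on molecules can be used in place of explicit coordinates.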
Finding a Good Kernel • Given: Two molecules. • Task: Systematically compute relevant similarity while being storage/time efficient. • Motivation: Enable efficient application of search and kernel algorithms.
Similarity: Data Representations • Example SMILES string: NC(O)C(=O)O
1D SMILES Kernel • CCCCCCc1ccc(cc1O)O • CCCCCc1ccc(cc1)CO • Total: 15
2D Molecule Graph Kernel • For chemical compounds • atom/node labels: A = {C,N,O,H, … } • bond/edge labels: B = {s, d, t, ar, … } • Count labeled paths • Fingerprints (CsNsCdO)
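The path-counting idea can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: paths are simple (no atom repeated), are counted once per direction, and are capped at a hypothetical `max_bonds` length; the two tiny molecules (hydrogens omitted) and all function names are made up for the example.

```python
from collections import Counter

def labeled_paths(atoms, bonds, max_bonds):
    """Count labeled simple paths containing at most max_bonds bonds.

    atoms: list of atom labels, e.g. ["C", "N", "C", "O"]
    bonds: list of (i, j, bond_label) with bond_label such as "s" or "d"
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for i, j, label in bonds:
        adjacency[i].append((j, label))
        adjacency[j].append((i, label))

    counts = Counter()

    def walk(node, visited, path_label, depth):
        counts[path_label] += 1
        if depth == max_bonds:
            return
        for neighbor, bond_label in adjacency[node]:
            if neighbor not in visited:
                walk(neighbor, visited | {neighbor},
                     path_label + bond_label + atoms[neighbor], depth + 1)

    for start in range(len(atoms)):
        walk(start, {start}, atoms[start], 0)
    return counts

def path_kernel(counts_1, counts_2):
    """Dot product of two path-count vectors (one possible kernel choice)."""
    return sum(counts_1[p] * counts_2[p] for p in counts_1.keys() & counts_2.keys())

# Heavy-atom skeletons C-N-C=O and C-N-C-O, for illustration only.
atoms = ["C", "N", "C", "O"]
paths_a = labeled_paths(atoms, [(0, 1, "s"), (1, 2, "s"), (2, 3, "d")], max_bonds=3)
paths_b = labeled_paths(atoms, [(0, 1, "s"), (1, 2, "s"), (2, 3, "s")], max_bonds=3)

print(paths_a["CsNsCdO"], path_kernel(paths_a, paths_b))
```

The set of path labels that occur in a molecule can also be hashed into a fixed-length binary fingerprint, which is the representation the similarity formulas for binary fingerprints operate on.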
Similarity for Binary Fingerprints • Tally features of molecules A and B: • Unique to A (a), unique to B (b) • In common (c) • Similarity Formula • Tanimoto = c/(a+b+c) • Tversky(α,β) = c/(α*a + β*b + c)
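The two formulas can be sketched directly, assuming each fingerprint is stored as a Python set of "on" bit positions (an assumed representation; the example fingerprints and function names are hypothetical). Returning 0.0 for two empty fingerprints is also an assumed convention.

```python
def tally(fp_a, fp_b):
    """Return (a, b, c): bits unique to A, unique to B, and in common."""
    a = len(fp_a - fp_b)
    b = len(fp_b - fp_a)
    c = len(fp_a & fp_b)
    return a, b, c

def tversky(fp_a, fp_b, alpha, beta):
    """Tversky(alpha, beta) = c / (alpha*a + beta*b + c)."""
    a, b, c = tally(fp_a, fp_b)
    denominator = alpha * a + beta * b + c
    # Assumed convention: similarity is 0.0 when the denominator vanishes.
    return c / denominator if denominator else 0.0

def tanimoto(fp_a, fp_b):
    """Tanimoto = c / (a + b + c), i.e. Tversky with alpha = beta = 1."""
    return tversky(fp_a, fp_b, 1.0, 1.0)

# Hypothetical fingerprints given as sets of "on" bit positions.
fp_a = {1, 4, 7, 9}
fp_b = {4, 7, 12}

print(tanimoto(fp_a, fp_b))           # a=2, b=1, c=2
print(tversky(fp_a, fp_b, 1.0, 0.0))  # asymmetric: penalizes only bits unique to A
```

Setting α and β unequal makes the measure asymmetric, which can be useful when asking whether one molecule is a substructure-like subset of another.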
3D Coordinate Kernel • [Figure: molecule annotated with interatomic distances 2.8 Å, 2.0 Å, 4.2 Å, 1.4 Å, 3.4 Å]
Datasets • Mutag: 230 chemicals. Mutagenicity in Salmonella. 125 positive/63 negative. Leave-one-out cross-validation. • PTC: Several hundred chemicals. Toxicity/carcinogenicity in male and female mice and rats. Leave-one-out cross-validation. • NCI: Several thousand chemicals. Growth inhibition in 60 tumor cell lines. Close to 50/50. 20 random 80/20 cross-validated splits.
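The two evaluation protocols named above can be sketched as index generators. This is a generic illustration, not the authors' evaluation code; the function names, the seed, and the toy dataset sizes are assumptions.

```python
import random

def leave_one_out(n):
    """Yield (train_indices, test_indices) with each example held out once."""
    for i in range(n):
        train = [j for j in range(n) if j != i]
        yield train, [i]

def random_splits(n, n_splits=20, train_fraction=0.8, seed=0):
    """Yield n_splits independent random train/test partitions of range(n)."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n))
    for _ in range(n_splits):
        indices = list(range(n))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]

# Toy sizes for illustration; the real datasets are far larger.
loo = list(leave_one_out(5))
splits = list(random_splits(10))

print(len(loo), len(splits))
```

Leave-one-out suits the smaller datasets because every example is used for testing exactly once, while repeated random splits keep the cost manageable on datasets with thousands of compounds.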
Results
Example of Results: NCI Accuracy/ROC
Regression: Aqueous Solubility • 30-fold cross-validation • Delaney Dataset: 1440 Examples
XLogP • 40-fold cross-validation • Dataset size: 1991 • Reference: S. J. Swamidass, J. Chen, P. Phung, J. Bruand, L. Ralaivola, and P. Baldi. Kernels for Small Molecules and the Prediction of Mutagenicity, Toxicity, and Anti-Cancer Activity. Proceedings of the 2005 Conference on Intelligent Systems for Molecular Biology, ISMB 05. Bioinformatics, 21, Supplement 1, i359-368, (2005).
Additional Representations • Example molecule: NC(CO)C(=O)O • 1D: SMILES string • 2D: Atomic connection table • 3D: XYZ coordinates of labeled points • 2.5D: 2D surface in 3D space • Multiple conformers: • 4D: Bag of conformers as XYZ coordinates of labeled points • 3.5D: Bag of conformers as 2D surfaces in 3D space
2.5D Surface Kernel • Build a graph G (V = atoms) which approximates the surface (convex hull). • Use spectral graph kernels on G.
2.5D Surface Kernel • Compute the regular/Delaunay tessellation (tetrahedrization) of the convex hull of the atoms in the molecule. • Use the alpha-shape algorithm to detect surface triangles at the relevant scale (keep interior and regular edges, remove singular edges, r on the order of water + carbon radius). • This yields a triangulated graph that approximates the surface (average degree 6). • Use a spectral kernel with paths (l=3,4) on the triangulated surface graph.
Alpha Shape • The shape formed by a set of points. • Closely related to the solvent-accessible surface. • Calculated in O(n*log(n)) using CGAL http://www.cgal.org/Manual/doc_html/cgal_manual/Alpha_shapes_3/Chapter_main.html
The Conformer Problem • Atoms connected by proximity • Different conformers have different graphs and features.
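A small sketch of why this happens, using a plain distance cutoff as a simplified stand-in for the alpha-shape surface construction (the cutoff value, the coordinates, and the function name are hypothetical). The same four atoms in an extended and a folded arrangement yield different proximity graphs:

```python
import math

def proximity_graph(coords, cutoff):
    """Connect every pair of atoms whose distance is at most cutoff."""
    edges = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.add((i, j))
    return edges

# Two made-up conformers of the same four-atom chain (coordinates in angstroms).
extended = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
folded = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 0.0)]

edges_extended = proximity_graph(extended, cutoff=2.0)
edges_folded = proximity_graph(folded, cutoff=2.0)

# Edges present in one conformer's graph but not the other's.
print(sorted(edges_folded - edges_extended))
```

Because any graph kernel computed on these graphs inherits the difference, a single conformer gives an incomplete picture of the molecule, which motivates treating a molecule as a bag of conformers.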
2.5D + Conformers = 3.5D • [Figure: Molecule A, Molecule B]
Molecular Representations and Kernels • 1D: SMILES strings • 2D: Graph of bonds • 2D: Surfaces • 2.5D: Conformers • 3D: Atomic coordinates (Pharmacophores, Epitopes) • 3.5D: Conformers • 4D: Temporal evolution • 4D: Isomers
Summary • ChemDB and other resources • Variety of kernels for small molecules • State-of-the-art performance on several benchmark datasets • For now, 2D kernels slightly better than 1D and 3D kernels • Many possible extensions: 2.5D, 3D, 3.5D, 4D kernels • Need for larger data sets and new models of cooperation in the chemistry community • Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules/reactions, retrosynthesis, prediction of reaction rates, information retrieval from literature, docking, matching table of all proteins against all known compounds, origin of life, etc.)