Use of Machine Learning in Chemoinformatics
Irene Kouskoumvekaki, Associate Professor
December 12th, 2012
Biological Sequence Analysis course
Major Aspects of Chemoinformatics
• Databases: Development of databases for storage and retrieval of small molecule structures and their properties.
• Machine learning: Training of Decision Trees, Neural Networks, Self-Organizing Maps, etc. on molecular data.
• Predictions: Molecular properties relevant to drugs, virtual screening of chemical libraries, systems chemical biology networks…
Clustering: Self-Organizing Maps
Distinguishing molecules of different biological activities and finding a new lead structure
[Figure: Machine Learning (QSAR, Virtual Screening, Clustering, Classification) linking Molecular Structures, Molecular Descriptors, and Properties]
Different descriptor types
• Simple feature counts (such as number of rotatable bonds or molecular weight)
• Fragmental descriptors, which indicate the presence or absence (or count) of groups of atoms and substructures
• Physicochemical properties (density, solubility, van der Waals volume)
• Topological indices (size, branching, overall shape)
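As a rough illustration of the simplest descriptor type, the sketch below derives atom counts and molecular weight from a plain molecular formula. The helper names (`parse_formula`, `simple_descriptors`) and the small mass table are assumptions made for this example, not part of the lecture; a real workflow would normally use a chemoinformatics toolkit that works on full structures.

```python
import re

# Approximate standard atomic masses (g/mol) for a few common elements.
# This table is deliberately small; extend it as needed.
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "F": 18.998, "S": 32.06, "Cl": 35.45,
}

def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C2H6O'.

    Assumes no brackets, charges, or isotopes (hypothetical helper).
    """
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def simple_descriptors(formula):
    """Compute a few simple count-style descriptors from a formula."""
    counts = parse_formula(formula)
    weight = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    heavy_atoms = sum(n for el, n in counts.items() if el != "H")
    return {
        "atom_counts": counts,
        "heavy_atoms": heavy_atoms,
        "molecular_weight": round(weight, 3),
    }

ethanol = simple_descriptors("C2H6O")
print(ethanol)
```

Descriptors such as rotatable bond counts or topological indices need the full connectivity of the molecule, which a formula alone does not carry.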
Quantitative Structure-Activity Relationships (QSAR)
In QSAR models, structural parameters (descriptors) are fitted to experimental data for biological activity (or another given property, P).
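A minimal sketch of the fitting idea, assuming the simplest possible model: one descriptor and a straight line, P ≈ slope · descriptor + intercept, fitted by ordinary least squares. The descriptor values and measured properties below are invented for illustration, and practical QSAR models typically use many descriptors and more flexible learners.

```python
def fit_linear_qsar(descriptor, activity):
    """Ordinary least squares fit of activity = slope * descriptor + intercept."""
    n = len(descriptor)
    mean_x = sum(descriptor) / n
    mean_y = sum(activity) / n
    sxx = sum((x - mean_x) ** 2 for x in descriptor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(descriptor, activity))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training set: one descriptor value and one measured
# property P per molecule (numbers are made up for this sketch).
descriptor_values = [1.0, 2.0, 3.0, 4.0]
measured_p = [5.1, 5.9, 7.1, 7.9]

slope, intercept = fit_linear_qsar(descriptor_values, measured_p)

def predict(descriptor_value):
    """Predict P for a new molecule from its descriptor value."""
    return slope * descriptor_value + intercept

print(f"P = {slope:.2f} * descriptor + {intercept:.2f}")
print(f"Predicted P for descriptor 5.0: {predict(5.0):.2f}")
```

The fitted line can then be used to estimate P for molecules that have not been measured, which is the practical purpose of a QSAR model.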
Virtual screening
• Computational techniques for the rapid assessment of large libraries of chemical structures, used to guide the selection of likely drug candidates.
Similarity Search
• Similar Property Principle: Molecules having similar structures and properties are expected to exhibit similar biological activity.
• Thus, molecules that are located close together in chemical space are often considered to be functionally related.
Fingerprint-based Similarity Search
• A widely used similarity search tool
• Consists of descriptors encoded as bit strings
• Bit strings of the query and the database molecules are compared using a similarity metric such as the Tanimoto coefficient
• MACCS fingerprints: 166 structural keys that answer questions of the type:
  • Is there a ring of size 4?
  • Is at least one F, Br, Cl, or I present?
  where the answer is either TRUE (1) or FALSE (0)
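The structural-key idea can be sketched with a toy two-bit fingerprint built from the two example questions above. This is not the real MACCS key set, and the molecule records (pre-computed ring sizes and element lists) are hypothetical stand-ins for what a toolkit would derive from an actual structure.

```python
HALOGENS = {"F", "Cl", "Br", "I"}

# Two toy structural keys modelled on the example questions.
# Each key is a (label, test function) pair; a real key set such as
# MACCS has 166 keys and works on the full molecular graph.
STRUCTURAL_KEYS = [
    ("ring of size 4", lambda mol: 4 in mol["ring_sizes"]),
    ("F, Br, Cl, or I present", lambda mol: bool(HALOGENS & set(mol["elements"]))),
]

def fingerprint(mol):
    """Encode a molecule as a bit string, one bit per structural key."""
    return "".join("1" if test(mol) else "0" for _, test in STRUCTURAL_KEYS)

# Hypothetical, pre-analysed molecule records for illustration only.
chlorocyclobutane = {"ring_sizes": [4], "elements": ["C", "H", "Cl"]}
benzene = {"ring_sizes": [6], "elements": ["C", "H"]}

print(fingerprint(chlorocyclobutane))  # both keys are TRUE
print(fingerprint(benzene))            # both keys are FALSE
```

Because every molecule is reduced to a fixed-length bit string, any two molecules can be compared bit by bit regardless of their size.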
Tanimoto Similarity
[Formula and worked example from the slide image are not preserved; the example result is given as 90% similarity]
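For bit-string fingerprints, the Tanimoto coefficient is commonly written as T = c / (a + b - c), where a and b are the numbers of bits set in each fingerprint and c is the number of bits set in both. The sketch below computes it and uses it to rank a small library against a query; the fingerprints and molecule names are made up and do not reproduce the slide's own example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length bit strings of '0'/'1'."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = fp_a.count("1")  # bits set in the first fingerprint
    b = fp_b.count("1")  # bits set in the second fingerprint
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")  # shared bits
    union = a + b - c
    # Convention chosen here: two all-zero fingerprints score 0.0.
    return c / union if union else 0.0

# Hypothetical query and library fingerprints (6 bits each).
query = "110110"
library = {
    "mol_1": "110100",
    "mol_2": "001001",
    "mol_3": "110110",
}

# Rank library molecules from most to least similar to the query.
ranked = sorted(library, key=lambda name: tanimoto(query, library[name]), reverse=True)

for name in ranked:
    print(name, round(tanimoto(query, library[name]), 2))
```

Ranking a library this way and keeping the top hits is one simple form of similarity-based virtual screening.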
Molecular editors and viewers
• Marvin: http://www.chemaxon.com/products/marvin/
• Jmol: http://jmol.sourceforge.net/
Format conversion http://cactus.nci.nih.gov/translate/