Chemoinformatics Theory

2. Outline
- Chemoinformatics: What is it?
- Molecular descriptors and chemical spaces
- Chemical spaces and molecular similarity
- Molecular similarity, dissimilarity, diversity
- Modification and simplification of chemical spaces
- Compound classification and selection
- Similarity searching
- Machine learning methods
- Library design
- Quantitative Structure-Activity Relationship Analysis (QSAR)
- Virtual screening and compound filtering

3. Chemoinformatics: What is it?
Chemoinformatics is the use of computer and informational techniques applied to a range of problems in the field of chemistry. These in silico techniques are used in pharmaceutical companies in the process of drug discovery.

6. Molecular descriptors and chemical spaces
Chemical reference spaces are the spaces into which molecular data sets are projected and in which analysis or design is carried out. The definition of a chemical space depends critically on the computational descriptors used to represent molecular structure and physical or chemical properties.

9. Chemical spaces and molecular similarity
Similar Property Principle: molecules having similar structures and properties should also exhibit similar activity (often, but not always, true). Thus, molecules that are located close together in a chemical reference space are often considered to be functionally related.

11. Molecular similarity, dissimilarity, and diversity
Diversity analysis: select different compounds from a given population so that a given chemical space is evenly populated with candidate molecules, for example by selecting only compounds that are at least a predefined minimum distance away from all others.
Dissimilarity is the inverse of molecular similarity. Dissimilarity analysis has played a major role in the pharmaceutical industry.
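The minimum-distance idea on this slide can be sketched in a few lines of Python. This is only an illustration: the Euclidean metric, the function names, and the two-descriptor example values are assumptions, not something the slides specify.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_min_distance(compounds, min_dist):
    """Keep a compound only if it lies at least min_dist away
    from every compound that has already been kept."""
    selected = []
    for c in compounds:
        if all(euclidean(c, s) >= min_dist for s in selected):
            selected.append(c)
    return selected

# Hypothetical 2D descriptor vectors (illustrative values only).
compounds = [(0.0, 0.0), (0.1, 0.1), (1.0, 0.0), (1.0, 1.0), (0.9, 0.95)]
diverse = select_min_distance(compounds, min_dist=0.5)
print(diverse)
```

The result depends on the order in which compounds are visited, which is a known property of this kind of greedy selection.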

12. Molecular similarity, dissimilarity, and diversity
Dissimilarity algorithm: select a subset of k maximally dissimilar compounds. Because this is a combinatorial problem, it is a non-trivial challenge.
Another dissimilarity algorithm:
1) Decide on a desired size, n, of the final subset.
2) Select a seed compound and place it in the subset.
3) Calculate the dissimilarity between each of the other compounds and those in the subset.
4) Choose the next compound as the one most dissimilar to those in the subset.
5) If fewer than n compounds are in the subset, repeat steps 3 and 4 until n is reached.
Complexity varies as the square of n.
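The greedy procedure above might look as follows in Python. The slide does not say how "most dissimilar to those in the subset" is measured; this sketch assumes the MaxMin criterion (largest distance to the nearest already selected compound) with Euclidean distance, and the data are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance, used here as the dissimilarity measure (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def maxmin_select(compounds, n, seed_index=0):
    """Greedy dissimilarity selection; returns the indices of the chosen compounds."""
    selected = [seed_index]
    # min_dist[i]: distance from compound i to its nearest selected compound
    min_dist = [distance(c, compounds[seed_index]) for c in compounds]
    while len(selected) < min(n, len(compounds)):
        candidates = [i for i in range(len(compounds)) if i not in selected]
        # the next compound is the one farthest from everything selected so far
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, c in enumerate(compounds):
            min_dist[i] = min(min_dist[i], distance(c, compounds[nxt]))
    return selected

# Hypothetical descriptor vectors (illustrative values only).
compounds = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (4.8, 4.9), (0.0, 5.0)]
picked = maxmin_select(compounds, n=3)
print(picked)
```

Caching the nearest-selected distance avoids recomputing all pairwise dissimilarities at every step.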

13. Modification and Simplification of Chemical Spaces
High-dimensional chemical spaces are often too complex for carrying out meaningful analyses. Why?
1) Major areas of a high-dimensional chemical space might not be populated and remain "empty".
2) Correlation effects between selected descriptors can dramatically distort the reference space.
Therefore:
1) Design low-dimensional reference spaces.
2) Simplify high-dimensional spaces.
3) Reduce their dimensionality.

14. Modification and Simplification of Chemical Spaces (cont'd.)
- Auto scaling or variance scaling. Why? A descriptor with a large value range will dominate those with smaller ranges.
- Dimension reduction
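As a minimal sketch of the scaling step, each descriptor column can be shifted to zero mean and divided by its standard deviation. The use of the population standard deviation and the example values are assumptions made for illustration.

```python
import statistics

def autoscale(column):
    """Scale one descriptor column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)  # population standard deviation (an assumption)
    if sd == 0:
        # a constant descriptor carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(v - mean) / sd for v in column]

# Hypothetical descriptor values: molecular weight spans a far larger range than logP.
mol_weight = [180.0, 250.0, 320.0, 410.0]
logp = [1.2, 2.5, 3.1, 4.0]

scaled_mw = autoscale(mol_weight)
scaled_logp = autoscale(logp)
print(scaled_mw)
print(scaled_logp)
```

After scaling, both descriptors contribute on a comparable scale to any distance calculation.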

15. Modification and Simplification of Chemical Spaces (cont'd.): Dimension reduction
Assumption: high-dimensional descriptor spaces have at least some intrinsic redundancy.
Two approaches:
1) Identify the descriptors that are most important for representing the original data set and the relationships they form between objects, for a lower-dimensional representation. Example: multidimensional scaling (Agrafiotis et al., 2001).
2) Generate new descriptors for lower-dimensional spaces by combining important contributors from the original ones. Example: Principal Component Analysis (PCA).
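To illustrate the second approach, the sketch below extracts only the first principal component of a small descriptor table using power iteration on the covariance matrix. Real analyses would normally rely on a numerical library; the function name, the power-iteration shortcut, and the data are assumptions chosen to keep the example self-contained.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # sample covariance matrix of the descriptors
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # starting vector for power iteration
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # the new descriptor is the projection of each compound onto v
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical, perfectly correlated descriptors: one new axis captures everything.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(rows)
print(direction)
print(scores)
```

Here two redundant descriptors collapse into a single new descriptor, which is the kind of intrinsic redundancy the slide assumes.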

16. Modification and Simplification of Chemical Spaces (cont'd.): Simplification
Simplification of n-dimensional descriptor spaces. Example: binary descriptor transformation, in which values above the mean become 1 and values below the mean become 0.
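A binary transformation of this kind is short enough to sketch directly. How to treat a value exactly equal to the mean is not stated on the slide; mapping it to 0 is an assumption, as are the example values.

```python
def binarize_by_mean(column):
    """Map each descriptor value to 1 if it is above the column mean, else 0."""
    mean = sum(column) / len(column)
    return [1 if v > mean else 0 for v in column]

# Hypothetical molecular weights; the mean is 290.0.
mol_weight = [180.0, 250.0, 320.0, 410.0]
bits = binarize_by_mean(mol_weight)
print(bits)
```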

17. Compound Classification and Selection: Cluster Analysis
The aim is to divide a group into clusters such that objects within a cluster are similar, while objects in different clusters are dissimilar. Many algorithms exist for doing this; hierarchical methods seem to perform better than non-hierarchical ones. Cluster analysis is sometimes called a "distance-based" approach to compound selection, because distance is measured between pairs of compounds.

19. Compound Classification and Selection: Hierarchical Clustering
The composition of each cluster depends on the one from which it was derived.
Agglomerative methods start at the bottom and merge similar clusters (bottom-up). In Ward's method, clusters are formed to minimize the variance (i.e., the sum of the squared deviations from the mean). Other agglomerative methods include the centroid method and the median method.
Divisive hierarchical clustering starts with all compounds in a single cluster and partitions the data (top-down).
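A naive agglomerative procedure with Ward's criterion could be sketched as below. At each step it merges the pair of clusters whose fusion gives the smallest increase in the within-cluster sum of squares. The function names and data are hypothetical, and the exhaustive pair search is only suitable for small examples.

```python
def centroid(points):
    """Mean position of a cluster."""
    d = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(d)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def ward_clustering(points, n_clusters):
    """Bottom-up merging until n_clusters clusters remain."""
    clusters = [[p] for p in points]  # start with every compound on its own
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 2D descriptor vectors.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = ward_clustering(points, n_clusters=3)
print(clusters)
```

Recording the sequence of merges instead of stopping at a fixed cluster count would give the full hierarchy.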

20. Compound Classification and Selection: Non-Hierarchical Clustering
Organize compounds into an initially defined number of independent clusters. Methods:
- Nearest neighbor: Jarvis-Patrick clustering
- Relocation: K-means
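As an example of a relocation method, here is a minimal K-means sketch. The starting centroids are passed in explicitly so the run is reproducible; that choice, the squared Euclidean distance, and the data are assumptions for illustration.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain K-means: assign each point to the nearest centroid, then move centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            clusters[nearest].append(p)
        # relocate each centroid to the mean of its members (keep it if empty)
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[k]
            for k, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2D descriptor vectors and starting centroids.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (5.0, 5.0)])
print(centroids)
print(clusters)
```

The number of clusters is fixed in advance by the number of starting centroids, which matches the "initially defined number" on the slide.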

21. Compound Classification and Selection: Partitioning
Rather than comparing molecular positions, establish a coordinate or reference system in chemical space. Compounds that populate the same partition are considered to be similar.
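One simple way to picture partitioning is a regular grid over the descriptor axes, with compounds grouped by the cell they fall into. The equal-width binning, the function name, and the compound data below are assumptions made for this sketch.

```python
from collections import defaultdict

def partition(compounds, bin_width):
    """Group compound names by the grid cell their descriptor vector falls into."""
    cells = defaultdict(list)
    for name, descriptors in compounds.items():
        # the cell index along each axis is the descriptor value divided by the bin width
        cell = tuple(int(v // bin_width) for v in descriptors)
        cells[cell].append(name)
    return dict(cells)

# Hypothetical compounds with two scaled descriptors each.
compounds = {
    "A": (0.1, 0.2),
    "B": (0.3, 0.4),
    "C": (1.2, 0.1),
    "D": (1.4, 0.3),
}
cells = partition(compounds, bin_width=1.0)
print(cells)
```

No pairwise distances are calculated here, which is what distinguishes partitioning from the clustering methods.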

23. Compound Classification and Selection: Statistical Partitioning
Recursive partitioning is the most popular statistical partitioning method. It is a decision tree method that divides data sets along decision trees formed by sequences of molecular descriptors. For example, the compounds could be divided according to molecular weight.
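A toy version of recursive partitioning is sketched below. The slides do not state a splitting criterion; Gini impurity is used here as one common choice, and the descriptors (molecular weight, logP), activity labels, and function names are all hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 activity labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return (descriptor index, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            low = [l for r, l in zip(rows, labels) if r[j] <= t]
            high = [l for r, l in zip(rows, labels) if r[j] > t]
            if not low or not high:
                continue
            score = (len(low) * gini(low) + len(high) * gini(high)) / len(rows)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split the data set until nodes are pure or max_depth is reached."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return {"active_fraction": sum(labels) / len(labels), "n": len(labels)}
    j, t = split
    low = [(r, l) for r, l in zip(rows, labels) if r[j] <= t]
    high = [(r, l) for r, l in zip(rows, labels) if r[j] > t]
    return {
        "descriptor": j,
        "threshold": t,
        "low": build_tree([r for r, _ in low], [l for _, l in low], depth + 1, max_depth),
        "high": build_tree([r for r, _ in high], [l for _, l in high], depth + 1, max_depth),
    }

# Hypothetical data: (molecular weight, logP) and activity labels (1 = active).
rows = [(250, 1.0), (300, 2.0), (320, 1.5), (450, 3.0), (500, 4.0), (520, 3.5)]
labels = [1, 1, 1, 0, 0, 0]
tree = build_tree(rows, labels)
print(tree)
```

In this made-up data set a single split on molecular weight already separates actives from inactives.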

24. Compound Classification and Selection: Statistical Partitioning
Statistical partitioning methods such as recursive partitioning are also very attractive tools for the analysis of high-throughput screening (HTS) data sets.

26. Similarity Searching: Structural queries and graphs
Contemporary substructure search methods are mostly based on dictionaries of predefined molecular fragments. Queries can be transformed into a machine-readable format such as Simplified Molecular Input Line Entry System (SMILES) code. SMILES encodes 2D representations of molecules as linear strings of alphanumeric characters.

28. Similarity Searching: Structural queries and graphs
Subgraph isomorphism: common substructures can also be determined by systematic mapping of corresponding node positions in graphs. However, this is computationally expensive.
Reduced graph: nodes do not represent atoms but features such as functionally important groups or whole ring systems. Reduced graphs are therefore more suitable for node matching procedures and similarity searching.

30. Similarity Searching: Pharmacophore
A pharmacophore is a molecular framework that carries the essential features responsible for a drug's biological activity, that is, the spatial arrangement of atoms or groups responsible for that activity. Pharmacophores are often used as 3D queries for database searching.

31. Similarity Searching: Fingerprints
Fingerprints are widely used similarity search tools. They consist of various descriptors that are encoded as bit strings. The bit strings of the query and database compounds are compared using a similarity metric such as the Tanimoto coefficient.
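For bit strings, the Tanimoto coefficient is c / (a + b - c), where a and b are the numbers of bits set in each fingerprint and c is the number of bits set in both. A minimal sketch follows; the 8-bit fingerprints are made up, and returning 0.0 when both fingerprints are empty is an assumed convention.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two equal-length bit strings such as '1011'."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    a = sum(x == "1" for x in bits_a)  # bits set in the first fingerprint
    b = sum(y == "1" for y in bits_b)  # bits set in the second fingerprint
    c = sum(x == "1" and y == "1" for x, y in zip(bits_a, bits_b))  # bits set in both
    denom = a + b - c
    return c / denom if denom else 0.0

# Hypothetical 8-bit fingerprints (real fingerprints are far longer).
query = "11010010"
database_compound = "10010110"
similarity = tanimoto(query, database_compound)
print(similarity)
```

Identical fingerprints score 1.0 and fingerprints with no bits in common score 0.0.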

32. Machine Learning Methods
Machine learning plays an important role in chemoinformatics. For example, it is usually difficult to predict which types of descriptors are most suitable for a given search or classification problem. Therefore, machine learning techniques are often used to facilitate descriptor selection. They are also applied to generate complex predictive models by iterative processing of molecular learning sets. Methods include:
- Genetic algorithms
- Neural networks
- Self-organizing maps (SOM)

33. Machine Learning Methods: Genetic algorithms
Different parameters and model solutions to a given problem are encoded in a chromosome and subjected to iterative random variation, thus generating a population. The solutions provided by these chromosomes are evaluated by a fitness function that assigns high scores to desired results. Chromosomes yielding the best intermediate solutions are subjected to mutation and crossover operations that correspond to random genetic mutations and gene recombination events. The resulting modified chromosomes represent the next generation, and the process is continued until the obtained results meet a satisfactory convergence criterion.
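The loop described above can be sketched with a toy problem in which each chromosome is a bit mask over ten descriptors. Everything specific here is an assumption for illustration: the TARGET mask and fitness function stand in for a real model evaluation, a fixed number of generations replaces a convergence criterion, and the best chromosome is carried over unchanged (elitism).

```python
import random

def evolve(fitness, n_genes, pop_size=20, generations=40, mutation_rate=0.05, seed=0):
    """Minimal genetic algorithm over bit-string chromosomes."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # selection: keep the better half
        children = [parents[0][:]]             # elitism: best chromosome survives as is
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)    # single-point crossover
            child = mother[:cut] + father[cut:]
            # mutation: flip each gene with a small probability
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness), history

# Hypothetical "ideal" descriptor mask; a real fitness function would score a model.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

def fitness(chromosome):
    """Number of positions where the chromosome agrees with TARGET."""
    return sum(1 for g, t in zip(chromosome, TARGET) if g == t)

best, history = evolve(fitness, n_genes=len(TARGET))
print(best, fitness(best))
```

Because of elitism, the best fitness recorded per generation never decreases.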

34. Library Design
- Diverse library
- Focused library

35. Quantitative Structure-Activity Relationship Analysis (QSAR)
Goal: evaluation of the molecular features that determine biological activity, and prediction of compound potency as a function of structural modification.
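In its simplest form, a QSAR model is a regression of measured potency on a molecular descriptor. The sketch below fits a one-descriptor least-squares line; the choice of descriptor, the potency values, and the function name are hypothetical, and practical QSAR models typically combine many descriptors.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training series: one descriptor value and a potency per compound.
descriptor = [1.0, 2.0, 3.0, 4.0]
potency = [5.1, 5.9, 7.1, 7.9]

slope, intercept = fit_line(descriptor, potency)
# Predict the potency of a structurally modified analogue with descriptor value 5.0.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```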

36. Virtual Screening and Compound Filtering
Virtual screening (VS) is the process of screening large databases on the computer for molecules having desired properties and biological activity. A major application of VS techniques is the identification of novel active molecules in large compound databases. A series of known active compounds is added as search templates to a source database, and compounds identified as similar to these templates on the basis of VS calculations are then selected as candidate molecules for experimental evaluation.
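A template-based screen of this kind can be sketched as a similarity ranking. In this illustration, fingerprints are represented as sets of "on" bit positions, similarity is the Tanimoto coefficient, and a compound is scored by its best match to any template; the compound names, fingerprints, and threshold are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def screen(database, templates, threshold):
    """Return (name, score) pairs for compounds similar to any template, best first."""
    hits = []
    for name, fp in database.items():
        score = max(tanimoto(fp, t) for t in templates)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical fingerprints of known actives (templates) and database compounds.
templates = [frozenset({1, 4, 7, 9}), frozenset({2, 4, 7, 8})]
database = {
    "cpd-1": frozenset({1, 4, 7, 9, 12}),
    "cpd-2": frozenset({3, 5, 6}),
    "cpd-3": frozenset({2, 4, 7}),
}
hits = screen(database, templates, threshold=0.7)
print(hits)
```

The returned hits would be the candidates passed on for experimental evaluation.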

38. Virtual Screening and Compound Filtering: Filter Functions
Filter functions are very popular tools for VS. They attempt to identify compounds with desired properties and discard others. They have been implemented for the analysis of diverse molecular properties, including chemical reactivity, toxicity, drug-like character, and absorption, distribution, metabolism, and excretion (ADME) parameters. Examples: aqueous solubility, passive absorption, blood-brain barrier penetration, metabolic stability, and oral availability.
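The slides do not name a specific filter. As one familiar example of a drug-likeness filter, the sketch below applies Lipinski's rule of five (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) and tolerates one violation. The dictionary keys and compound values are hypothetical.

```python
def rule_of_five_violations(compound):
    """Count how many of the four rule-of-five criteria a compound violates."""
    return sum([
        compound["mol_weight"] > 500,
        compound["logp"] > 5,
        compound["h_donors"] > 5,
        compound["h_acceptors"] > 10,
    ])

def passes_filter(compound, max_violations=1):
    """Keep a compound if it violates no more than max_violations criteria."""
    return rule_of_five_violations(compound) <= max_violations

# Hypothetical precomputed descriptor values for two compounds.
library = {
    "small": {"mol_weight": 320.4, "logp": 2.1, "h_donors": 2, "h_acceptors": 5},
    "large": {"mol_weight": 720.9, "logp": 6.3, "h_donors": 6, "h_acceptors": 12},
}
kept = [name for name, c in library.items() if passes_filter(c)]
print(kept)
```

Filters for other properties listed on the slide would follow the same pattern with different descriptors and thresholds.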

40. Thank You
