
Introduction on QSAR and modelling of physico-chemical and biological properties

Alessandra Roncaglioni – IRFMN, aroncaglioni@marionegri.it. Problems and approaches in computational chemistry. Outline: History; QSAR/QSPR steps; (Descriptors); Activity data; Modelling approaches.

Presentation Transcript


  1. Introduction on QSAR and modelling of physico-chemical and biological properties. Alessandra Roncaglioni – IRFMN, aroncaglioni@marionegri.it. Problems and approaches in computational chemistry

  2. Outline • History • QSAR/QSPR steps • (Descriptors) • Activity data • Modelling approaches • Validation (OECD principles) • QSPR (phys-chem properties) • QSAR (biological activities) • Example (Demetra)

  3. QSAR postulates • The molecular structure is responsible for all the activities • Similar compounds have similar biological and physico-chemical properties (Meyer 1899) • Hansch analysis ('60s) • Free-Wilson approach ('60s) H. Kubinyi. From Narcosis to Hyperspace: The History of QSAR. Quant. Struct.-Act. Relat., 21 (2002) 348-356.

  4. Hansch analysis • Applied to congeneric series • $\log(1/C) = a\pi + b\sigma + cE_s + \text{const.}$, where $C$ = effect concentration, $\pi$ = hydrophobic term derived from the octanol-water partition coefficient, $\sigma$ = Hammett substituent constant (electronic), $E_s$ = Taft's substituent constant (steric) • Linear free energy-related approach • McFarland principle

  5. Free-Wilson analysis • $\log(1/C) = \sum_i a_i + \mu$, where $C$ = effect concentration, $a_i$ = contribution per group, $\mu$ = activity of the reference compound
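
The additive form of the Free-Wilson equation can be illustrated with a minimal sketch. The group labels, contribution values and reference activity below are hypothetical numbers chosen for illustration only; they do not come from the presentation.

```python
def free_wilson_activity(substituents, contributions, mu):
    """Predict log(1/C) as mu plus the sum of the group contributions a_i."""
    return mu + sum(contributions[group] for group in substituents)

# Hypothetical group contributions a_i (illustrative values only)
contributions = {"R1=Cl": 0.40, "R1=CH3": 0.15, "R2=OH": -0.25, "R2=NO2": 0.60}
mu = 3.0  # hypothetical activity of the reference compound

# Compound carrying Cl at R1 and OH at R2
predicted = free_wilson_activity(["R1=Cl", "R2=OH"], contributions, mu)
```

In practice the contributions are not known in advance: they are estimated by regression over a congeneric series in which each group occurs in several compounds.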

  6. The old QSAR paradigm • Compounds in the series must be closely related • Same mode of action • Basic biological activity • Small number of "intuitive" properties • Linear relation

  7. The old QSAR paradigm Factors limiting the old paradigm: • Software availability • Calculation of molecular properties • Limited computing power • Costs of hardware and software

  8. The new QSAR paradigm • Heterogeneous compound sets • Mixed modes of action • Complex biological endpoints • Large number of properties • Non-linear modelling

  9. The new QSAR paradigm Factors enabling the new paradigm: • Increased computing power • QM calculations • Thousands of descriptors • Cost drop for hardware and software (freeware)

  10. Outline • History • QSAR/QSPR steps • (Descriptors) • Activity data • Modelling approaches • Validation (OECD principles) • QSPR (phys-chem properties) • QSAR (biological activities) • Example (Demetra)

  11. QSAR flowchart

  12. [Figure: descriptor matrix D(n,m), with compounds (1, …, n) as rows and 2D/3D descriptors (1, …, m) as columns, next to the activity vector A for the same compounds; the model is A = f(D(n,m))]

  13. QSAR/QSPR defined by Y data • Quantitative Structure-Property Relationship: physico-chemical or biochemical properties • Boiling point • Partition coefficients (LogP) • Receptor binding • Quantitative Structure-Activity Relationship: interaction with the biota • Toxicity • Metabolism

  14. Activity data • Garbage in, garbage out • Quality and quantity of data • Suitable for the purpose? • Intrinsic variability of Y data (particularly for QSAR): examples later on • Chemical domain covered by experimental data • As much data as you can, especially if using complex models

  15. Data need • Data are one of the pillars of the models • The goal is to extract knowledge from these data • If they are too noisy, it is not possible to extract this knowledge • Enough training data are needed [Figure: number of compounds versus quality/accuracy: large number of compounds, keep data variability low]

  16. Modelling steps • Data pre-processing • Scaling X block and transformation of Y block • Variable selection • Application of algorithms to search for the relationship

  17. Data pre-processing (I) • Scaling variables • making sure that each descriptor has an equal chance of contributing to the overall analysis • E.g.: autoscaling, range scaling • Y transformation
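
The two scaling schemes named on this slide can be sketched as follows. This is a minimal illustration using only the Python standard library; the descriptor columns `logp` and `mw` are made-up values, and the choice of the sample standard deviation for autoscaling is an assumption.

```python
from statistics import mean, stdev

def autoscale(column):
    """Centre a descriptor to zero mean and scale it to unit standard deviation."""
    m, s = mean(column), stdev(column)  # fails on constant columns (s == 0)
    return [(x - m) / s for x in column]

def range_scale(column):
    """Map a descriptor linearly onto the interval [0, 1]."""
    lo, hi = min(column), max(column)   # fails on constant columns (hi == lo)
    return [(x - lo) / (hi - lo) for x in column]

# Two hypothetical descriptors on very different numeric scales
logp = [1.0, 2.0, 3.0, 4.0, 5.0]
mw = [100.0, 200.0, 300.0, 400.0, 500.0]

logp_auto = autoscale(logp)
mw_auto = autoscale(mw)
logp_range = range_scale(logp)
```

After autoscaling, the two columns become numerically identical, which is the sense in which each descriptor gets an equal chance of contributing to the analysis.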

  18. Data pre-processing (II) • Variable pruning • Detecting constant variables • Detecting quasi-constant variables • This distinguishes between informative and non-informative variables • Detecting correlated variables • Variables can be grouped into correlation groups, and within each group the variable most correlated with the response is retained • Variables with missing values
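
A minimal sketch of the pruning steps listed above, under stated assumptions: the thresholds `max_same_fraction` and `max_corr` are arbitrary illustrative values, the greedy grouping is only one possible way of forming correlation groups, and missing values are not handled.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two non-constant lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def prune(descriptors, y, max_same_fraction=0.8, max_corr=0.95):
    """Return the names of the descriptors kept after pruning."""
    # Step 1: drop constant and quasi-constant descriptors
    informative = {}
    for name, col in descriptors.items():
        most_common = max(col.count(v) for v in set(col))
        if most_common / len(col) < max_same_fraction:
            informative[name] = col
    # Step 2: visit descriptors from most to least correlated with y and
    # keep one only if it is not highly correlated with one already kept
    ranked = sorted(informative,
                    key=lambda n: abs(pearson(informative[n], y)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(informative[name], informative[k])) < max_corr
               for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: six compounds, five descriptors
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
descriptors = {
    "const": [1.0] * 6,
    "quasi": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    "d1": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "d1_scaled": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "d2": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}
kept = prune(descriptors, y)
```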

  19. Variable selection • Reducing dimensions, facilitating data visualization and interpretation • Likely improving prediction performance • Hypothesis driven or statistically driven • Wrappers: use the chosen prediction method to score subsets of features according to their predictive power • Filters: a preprocessing step, independent of the choice of the predictor

  20. Variable selection techniques • Principal component analysis (PCA) • Clustering • Self-organizing maps (SOM) • Stepwise procedures • Forward selection: features are progressively incorporated into larger and larger subsets • Backward elimination: starts with the set of all features and progressively eliminates the least promising ones • Genetic algorithms • Variable importance/sensitivity

  21. Principal component analysis • Keep only those components that carry the largest variance • PCs are orthogonal to each other • Loadings plot

  22. Cluster analysis • Process of putting objects into classes, based on similarity • Descriptors in the same cluster assume similar values for the molecules of the dataset • Many different methods and algorithms • different clustering methods will result in different clusters, with different relationships between them • different algorithms can be used to implement the same method (some may be more efficient than others)

  23. Hierarchical and non-hierarchical • A basic distinction is between clustering methods that organise clusters hierarchically, and those that do not

  24. Hierarchical agglomerative • The hierarchy is built from the bottom upwards • Several different methods and algorithms • Basic Lance-Williams algorithm (common to all methods) starts with table of similarities between all pairs of items • at each step the most similar pair of molecules (or previously-formed clusters) are merged together • until everything is in one big cluster • methods differ in how they determine the similarity between clusters
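
The bottom-up merging loop described on this slide can be sketched as below. This is a deliberately naive illustration: it recomputes cluster distances from the raw points at every step instead of using the Lance-Williams update formula, it works with Euclidean distance rather than a similarity table, and the points are made up. The `linkage` argument is where the methods differ (`min` gives single linkage, `max` complete linkage).

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerative(points, n_clusters=1, linkage=min):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one item per cluster
    merges = []                                   # record of the hierarchy
    while len(clusters) > n_clusters:
        def cluster_distance(pair):
            a, b = pair
            return linkage(euclidean(points[i], points[j])
                           for i in a for j in b)
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        clusters.remove(a)
        clusters.remove(b)
        merged = sorted(a + b)
        clusters.append(merged)
        merges.append(merged)
    return clusters, merges

# Hypothetical 2D descriptor values for five molecules
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerative(points, n_clusters=3)
```

Running the loop down to `n_clusters=1` reproduces the "one big cluster" end point, and `merges` then lists the full hierarchy in the order it was built.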

  25. Hierarchical divisive • The hierarchy is built from the top downwards • At each step a cluster is chosen to divide, until each cluster has only one member • Various ways of choosing next cluster to divide • one with most members • one with least similar pair of members • etc. • Various ways of dividing it

  26. Non-hierarchical methods • Usually faster than hierarchical • e.g.: Nearest neighbour methods • the best known example is the Jarvis-Patrick method • identify top k (e.g. 20) nearest neighbours for each object • two objects join the same cluster if they have at least kmin of their top k nearest neighbours in common • tends to produce a few large heterogeneous clusters and a lot of singletons (single-member clusters)
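
A minimal sketch of the Jarvis-Patrick rule as stated on this slide. The optional `require_mutual` flag is an addition not mentioned in the slide: it adds the condition, used in the classic formulation, that the two objects must also appear in each other's neighbour lists. The one-dimensional points are made up.

```python
def jarvis_patrick(points, k, k_min, require_mutual=False):
    """Cluster points that share at least k_min of their k nearest neighbours."""
    n = len(points)

    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    # Top-k nearest neighbours of every object
    neighbours = [
        set(sorted((j for j in range(n) if j != i),
                   key=lambda j, i=i: dist(i, j))[:k])
        for i in range(n)
    ]

    # Union-find structure to merge objects into clusters
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbours[i] & neighbours[j])
            mutual = i in neighbours[j] and j in neighbours[i]
            if shared >= k_min and (mutual or not require_mutual):
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Two tight groups and one distant outlier (hypothetical 1D descriptor)
points = [(0,), (1,), (2,), (3,), (50,), (51,), (52,), (53,), (200,)]
loose = jarvis_patrick(points, k=3, k_min=2)
strict = jarvis_patrick(points, k=3, k_min=2, require_mutual=True)
```

With the rule exactly as on the slide, the outlier is absorbed into the second group because it shares that group's neighbours; with the mutual-neighbour condition it stays a singleton.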

  27. Self-organizing maps • A SOM is an unsupervised NN condensing the input space into a low-dimensional representation

  28. Genetic algorithms • Based on the Darwinian evolutionary theory • individuals in a population of models are crossed over, mutated, then iteratively evaluated against a fitness function which gives a statistical evaluation of the model's performance [Figure: flowchart. Initial population of bit strings (e.g. 10010111011011) → evaluation of individuals → fitness reached? If yes: end. If no: individual selection → cross-over → mutations → back to evaluation]
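
The loop in the flowchart can be sketched for descriptor selection, where each individual is a bit string marking which descriptors are used. Everything specific here is an assumption made for illustration: the population size, mutation rate, truncation selection, one-point cross-over, elitism, and above all the toy fitness function, which stands in for a real statistical evaluation such as a cross-validated model score.

```python
import random

def genetic_search(fitness, n_bits, pop_size=20, generations=30,
                   p_mut=0.05, seed=0):
    """Evolve bit strings; return the best one and the best fitness per generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)      # evaluation of individuals
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]           # individual selection
        children = [pop[0][:]]                   # elitism: keep the current best
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)       # one-point cross-over
            child = mum[:cut] + dad[cut:]
            child = [1 - b if rng.random() < p_mut else b
                     for b in child]             # mutations
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    return best, history + [fitness(best)]

# Toy fitness: agreement with a hypothetical "truly relevant" descriptor mask
relevant = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]

def toy_fitness(mask):
    return sum(1 for m, r in zip(mask, relevant) if m == r)

best, history = genetic_search(toy_fitness, len(relevant))
```

Because the best individual is always copied into the next generation, the recorded best fitness can never decrease from one generation to the next.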

  29. Modelling approaches • SAR: categorical Y → classification • Quantitative SAR: continuous Y → regression

  30. Modelling techniques • Multiple Linear Regression • PLS • … • Neural Networks • Classification trees • Discriminant analysis • Fuzzy classification • …

  31. Multiple Regression • Linear relationship between Y and several $X_i$ descriptors: $Y = aX_1 + bX_2 + \dots + cX_n + \text{const.}$ • Minimize error by least squares • May include polynomial terms
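
A least-squares fit of the equation above can be sketched with the normal equations, written in plain Python so that no numerical library is assumed. The data are synthetic and generated from known coefficients, so the fit should recover them; in real use a numerical library would be the more robust choice.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_mlr(X, y):
    """Return [coef_1, ..., coef_n, const] minimising the squared error."""
    rows = [list(row) + [1.0] for row in X]          # extra column for const.
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

# Synthetic data generated from Y = 2*X1 - 1*X2 + 0.5
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 2)]
y = [2 * x1 - x2 + 0.5 for x1, x2 in X]
coefficients = fit_mlr(X, y)
```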

  32. Partial Least Squares • PLS, similarly to PCA, uses orthogonal components built from linearly correlated variables, but chosen to be more closely related to the Y response [Figure: projection onto scores t1 and t2]

  33. Neural networks • Inspired by biological neural networks • NNs are a set of connected nonlinear elements that transform an input I into an output O: O = f(I)

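  34. The problem of overfitting [Figure: the same data fitted with a straight line, $y = 0.979x + 0.344$ ($R^2 = 0.956$), and with a fourth-degree polynomial, $y = -0.062x^4 + 1.293x^3 - 9.472x^2 + 29.24x - 27.37$ ($R^2 = 0.999$)]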

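  35. Solution: validation [Figure: performance plotted against model complexity, with one curve for training prediction and one for validation prediction; the best model is marked on the validation curve]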

  36. Validation criteria • Internal validation (robustness): cross-validation (LOO, LSO), bootstrap, Y scrambling • External validation (prediction ability): test set representative of training set, Tropsha criteria • Applicability domain

  37. Cross validation Leave One Out • All the data but one compound are used for fitting • Predict the excluded sample • Repeat for all samples • Calculate Q² (or R²cv), similarly to R², on the basis of these predictions • Problem: too optimistic if there are many samples Leave Many Out • Use larger groups to obtain a more realistic outcome
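
The leave-one-out procedure above can be sketched as follows. For brevity the model is a one-descriptor straight line and the data are made up; Q² is computed here as 1 − PRESS/SS_tot with the overall mean of Y in the denominator, which is one common convention and an assumption of this sketch.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for one descriptor."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def r2_fit(x, y):
    """R2 of the model fitted on all data."""
    slope, intercept = fit_line(x, y)
    my = mean(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

def q2_loo(x, y):
    """Q2 from leave-one-out predictions."""
    press = 0.0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]          # all the data but one compound
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    my = mean(y)
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - press / ss_tot

# Hypothetical descriptor and activity values
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 8.1]
r2 = r2_fit(x, y)
q2 = q2_loo(x, y)
```

For a least-squares model, each left-out residual is at least as large as the corresponding fitted residual, so Q² computed this way never exceeds R².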

  38. Bootstrapping • Bootstrapping simulates what happens when the data set of n objects is randomly resampled • K groups of n objects each are generated by random sampling, so that some objects are repeated and others are left out • The model obtained on each group is used to predict the values for the excluded samples • From each bootstrap sample the statistical parameter of interest is calculated • The estimate of accuracy is obtained as the average of all calculated statistics
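
A minimal sketch of the bootstrap loop described above. The model (a one-descriptor straight line), the statistic (RMSE on the excluded samples), the number of rounds and the data are all assumptions chosen for illustration.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for one descriptor."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def bootstrap_rmse(x, y, n_rounds=200, seed=0):
    """Average RMSE on the samples excluded from each bootstrap group."""
    rng = random.Random(seed)
    n = len(x)
    scores = []
    for _ in range(n_rounds):
        drawn = [rng.randrange(n) for _ in range(n)]  # n objects, with repeats
        drawn_set = set(drawn)
        left_out = [i for i in range(n) if i not in drawn_set]
        # Skip degenerate groups: nothing excluded, or no spread in x to fit
        if not left_out or len({x[i] for i in drawn}) < 2:
            continue
        slope, intercept = fit_line([x[i] for i in drawn],
                                    [y[i] for i in drawn])
        sq_err = [(y[i] - (slope * x[i] + intercept)) ** 2 for i in left_out]
        scores.append(mean(sq_err) ** 0.5)
    return mean(scores)

# Hypothetical descriptor and activity values
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 8.1]
estimate = bootstrap_rmse(x, y)
```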

  39. Y-scrambling • Randomly permute the Y responses several times while the X variables are kept in the same order
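
Y-scrambling can be sketched as below: the model is refitted on each permuted response, and the resulting statistics are compared with the one obtained on the real data. The one-descriptor linear model, the number of rounds and the data are assumptions made for illustration.

```python
import random
from statistics import mean

def r2_line(x, y):
    """R2 of a one-descriptor least-squares line (squared Pearson correlation)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def y_scrambling(x, y, n_rounds=100, seed=0):
    """Return R2 on the real data and the R2 values on permuted responses."""
    rng = random.Random(seed)
    scrambled = []
    for _ in range(n_rounds):
        y_perm = y[:]            # X order is untouched, only Y is permuted
        rng.shuffle(y_perm)
        scrambled.append(r2_line(x, y_perm))
    return r2_line(x, y), scrambled

# Hypothetical descriptor and activity values
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 8.1]
real_r2, scrambled_r2 = y_scrambling(x, y)
```

A model that captures a real relationship should score far better on the true responses than on the scrambled ones; similar scores would suggest chance correlation.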

  40. Tropsha criteria* a) $Q^2 > 0.5$; b) $R^2 > 0.6$; c) $(R^2 - R_0^2)/R^2 < 0.1$ and $0.85 < k < 1.15$, or $(R^2 - {R'_0}^2)/R^2 < 0.1$ and $0.85 < k' < 1.15$ ($k$ = slope of the regression line; $R_0^2$ = $R^2$ related to $y = kx$); d) if (c) is not fulfilled, then $|R_0^2 - {R'_0}^2| < 0.3$. * A. Golbraikh, M. Shen, Z. Xiao, Y.D. Xiao, K.-H. Lee, A. Tropsha, Rational selection of training and test sets for the development of validated QSAR models, JCAMD, 17 (2003) 241-253.
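
The criteria above can be turned into a simple check. This sketch rests on assumptions that the slide does not spell out: $R^2$ is taken as the squared correlation between observed and predicted test-set values, $R_0^2$ as the coefficient of determination of the through-origin regression of observed on predicted, the primed quantities as the same with the two roles swapped, and $Q^2$ is supplied by the caller from cross-validation. The cited paper should be consulted for the exact definitions.

```python
from statistics import mean

def squared_correlation(a, b):
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab ** 2 / (saa * sbb)

def through_origin(target, source):
    """Slope k and R0^2 for the regression target = k * source."""
    k = sum(t * s for t, s in zip(target, source)) / sum(s * s for s in source)
    mt = mean(target)
    ss_res = sum((t - k * s) ** 2 for t, s in zip(target, source))
    ss_tot = sum((t - mt) ** 2 for t in target)
    return k, 1 - ss_res / ss_tot

def tropsha_check(observed, predicted, q2):
    """True if criteria a, b and (c or d) from the slide are met."""
    r2 = squared_correlation(observed, predicted)
    k, r02 = through_origin(observed, predicted)
    k_prime, r02_prime = through_origin(predicted, observed)
    crit_c = (((r2 - r02) / r2 < 0.1 and 0.85 < k < 1.15)
              or ((r2 - r02_prime) / r2 < 0.1 and 0.85 < k_prime < 1.15))
    crit_d = abs(r02 - r02_prime) < 0.3
    return q2 > 0.5 and r2 > 0.6 and (crit_c or crit_d)

# Hypothetical test-set values
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
good_predictions = [1.1, 1.9, 3.2, 3.8, 5.1]
poor_predictions = [3.0, 3.1, 2.9, 3.0, 3.1]

accepted = tropsha_check(observed, good_predictions, q2=0.8)
rejected = tropsha_check(observed, poor_predictions, q2=0.8)
```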

  41. Applicability domain The applicability domain of a (Q)SAR model is the response and chemical structure space in which the model makes predictions with a given reliability.* * Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. ATLA, 33:1-19, 2005.

  42. Applicability domain [Figure: training data plotted in the descriptor space]

  43. Applicability domain [Figure: new compounds plotted together with the training data] Can you see the intruders? Similarity!

  44. AD assessment Similarity measures: • Response range (span of activity data) • Chemometric treatment of the descriptor space • Fragment-based approaches

  45. Chemometric Methods • Descriptor range-based

  46. Chemometric Methods • Descriptor range-based • Geometric methods

  47. Chemometric Methods • Descriptor range-based • Geometric methods • Distance-based

  48. Chemometric Methods • Descriptor range-based • Geometric methods • Distance-based • Probability density distribution
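
Two of the approaches listed on these slides, descriptor range-based and distance-based, can be sketched as follows. The distance variant uses the Euclidean distance to the training centroid with the largest training distance as cut-off; both choices are assumptions made for illustration, as are the data. In practice the descriptors would normally be scaled first.

```python
from statistics import mean

def in_descriptor_range(training, compound):
    """Range-based AD: every descriptor lies within the training min and max."""
    return all(min(col) <= value <= max(col)
               for col, value in zip(zip(*training), compound))

def within_centroid_distance(training, compound):
    """Distance-based AD: no farther from the centroid than any training compound."""
    centroid = [mean(col) for col in zip(*training)]

    def dist(p):
        return sum((a - c) ** 2 for a, c in zip(p, centroid)) ** 0.5

    threshold = max(dist(p) for p in training)
    return dist(compound) <= threshold

# Hypothetical training set with two descriptors per compound
training = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (2.0, 9.0), (1.5, 10.5)]

central = (2.0, 10.5)    # close to the bulk of the training data
corner = (3.0, 12.0)     # inside the descriptor ranges, but in an empty corner
far_away = (8.0, 30.0)   # outside the ranges altogether
```

The corner compound shows why the methods can disagree: the range-based check accepts it, while the distance-based check flags it as lying outside the region actually covered by training data.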

  49. AMBIT software http://ambit.acad.bg/main.php

  50. AD assessment Similarity measures: • Response range (span of activity data) • Chemometric treatment of the descriptor space • Fragment-based approaches
