Introduction on QSAR and modelling of physico-chemical and biological properties Alessandra Roncaglioni – IRFMN aroncaglioni@marionegri.it Problems and approaches in computational chemistry
Outline • History • QSAR/QSPR steps • (Descriptors) • Activity data • Modelling approaches • Validation (OECD principles) • QSPR (Phys-chem properties) • QSAR (Biological activities) • Example (Demetra)
QSAR postulates • The molecular structure is responsible for all the activities • Similar compounds have similar biological and chemico-physical properties (Meyer 1899) • Hansch analysis (‘70s) • Free Wilson approach (‘70s) H. Kubinyi. From Narcosis to Hyperspace: The History of QSAR. Quant. Struct.-Act. Relat., 21 (2002) 348-356.
Hansch analysis • Applied to congeneric series log 1/C = a·π + b·σ + c·Es + const. where C = effect concentration π = hydrophobic substituent constant (from octanol–water partition coefficients) σ = Hammett substituent constant (electronic) Es = Taft's steric substituent constant • Linear free-energy-related approach • McFarland principle
Free-Wilson analysis log 1/C = Σ aᵢ + μ where C = effect concentration aᵢ = activity contribution of substituent group i μ = activity of the reference compound
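The additive Free-Wilson model above can be sketched in a few lines; the group contribution values and substituent names below are purely hypothetical, for illustration only.

```python
# Free-Wilson sketch: log(1/C) = sum of group contributions a_i + mu,
# where mu is the activity of the reference compound.
# These contribution values are invented for illustration.
contributions = {"Cl_para": 0.71, "OCH3_meta": -0.12, "H": 0.0}

def free_wilson_log_inv_c(groups, mu):
    """Predicted log(1/C) for a compound carrying the given substituent groups."""
    return mu + sum(contributions[g] for g in groups)

# A compound with a para-Cl and a meta-OCH3 substituent:
print(free_wilson_log_inv_c(["Cl_para", "OCH3_meta"], mu=3.5))  # 3.5 + 0.71 - 0.12 = 4.09
```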
The old QSAR paradigm • Compounds in the series must be closely related • Same mode of action • Basic biological activity • Small number of “intuitive” properties • Linear relation
The old QSAR paradigm Factors limiting the old paradigm: • Software availability • Calculation of molecular properties • Limited computing power • Costs of hardware and software
The new QSAR paradigm • Heterogeneous compound sets • Mixed modes of action • Complex biological endpoints • Large number of properties • Non-linear modelling
The new QSAR paradigm Factors enabling the new paradigm: • Increased computing power • QM calculations • Thousands of descriptors • Falling hardware and software costs (freeware)
Outline • History • QSAR/QSPR steps • (Descriptors) • Activity data • Modelling approaches • Validation (OECD principles) • QSPR (Phys-chem properties) • QSAR (Biological activities) • Example (Demetra)
From the 2D/3D structures, each of the n compounds is characterised by m descriptors (1, …, m), giving a descriptor matrix D(n,m); the activity vector A is then modelled as A = f(D(n,m)).
QSAR/QSPR defined by Y data • Quantitative Structure-Property Relationship: physico-chemical or biochemical properties • Boiling point • Partition coefficients (LogP) • Receptor binding • Quantitative Structure-Activity Relationship: interaction with the biota • Toxicity • Metabolism
Activity data • Garbage in, garbage out • Quality and quantity of data • Suitable for the purpose? • Intrinsic variability of Y data (particularly for QSAR): examples later on • Chemical domain covered by the experimental data • As much data as you can get, especially if using complex models
Data need (trade-off between number of compounds and quality/accuracy) • Data are one of the pillars of the models • The goal is to extract knowledge from these data • If they are too noisy it is not possible to extract this knowledge • Enough training data: a large number of compounds, keeping data variability low
Modelling steps • Data pre-processing • Scaling X block and transformation of Y block • Variable selection • Application of algorithms to search for the relationship
Data pre-processing (I) • Scaling variables • making sure that each descriptor has an equal chance of contributing to the overall analysis • E.g.: autoscaling, range scaling • Y transformation
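The two scalings mentioned above can be sketched with numpy (the toy descriptor block is invented):

```python
import numpy as np

# Toy descriptor block: 3 compounds x 2 descriptors on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Autoscaling (z-score): each column gets mean 0 and unit variance,
# so every descriptor has an equal chance of contributing.
X_auto = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Range scaling: each column is mapped onto [0, 1].
X_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```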
Data pre-processing (II) • Variable pruning • Detecting constant variables • Detecting quasi-constant variables • Pruning distinguishes informative from non-informative variables • Detecting correlated variables • Variables can be grouped into correlation groups, retaining from each group the variable most correlated with the response • Variables with missing values
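The pruning steps above can be sketched as follows; the variance and correlation thresholds are arbitrary illustrative choices:

```python
import numpy as np

def prune_descriptors(X, y, var_tol=1e-8, corr_cut=0.95):
    """Drop (quasi-)constant columns, then for each highly correlated
    pair keep the column better correlated with the response y.
    Returns the indices of the retained columns."""
    cols = [j for j in range(X.shape[1]) if np.std(X[:, j]) > var_tol]
    kept = []
    for j in cols:
        redundant = False
        for k in kept:
            if abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > corr_cut:
                # Keep whichever of j, k correlates better with the response.
                if abs(np.corrcoef(X[:, j], y)[0, 1]) > abs(np.corrcoef(X[:, k], y)[0, 1]):
                    kept[kept.index(k)] = j
                redundant = True
                break
        if not redundant:
            kept.append(j)
    return kept
```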
Variable selection • Reducing dimensions, facilitating data visualization and interpretation • Likely improving prediction performance • Hypothesis driven or statistically driven Wrappers: use the chosen prediction method to score subsets of features according to their predictive power; Filters: a preprocessing step, independent of the choice of the predictor.
Variable selection techniques • Principal component analysis (PCA) • Clustering • Self-organizing maps (SOM) • Stepwise procedures • Forward selection: features are progressively incorporated into larger and larger subsets; • Backward elimination: start with the set of all features and progressively eliminate the least promising ones. • Genetic algorithms • Variable importance/sensitivity
Principal component analysis • Keep only the components that capture the largest variance • PCs are orthogonal to each other • Loadings plot
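A compact PCA sketch via SVD of the mean-centred matrix (pure numpy; the loadings plot mentioned above is a scatter of the rows of the returned `loadings`):

```python
import numpy as np

def pca(X, n_comp=2):
    """PCA via SVD of the mean-centred data.
    Returns scores (projections of the compounds), loadings (descriptor
    weights per PC) and the fraction of variance explained per component."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_comp] * s[:n_comp]
    loadings = Vt[:n_comp].T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, loadings, explained[:n_comp]
```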
Cluster analysis • Process of putting objects into classes, based on similarity • Descriptors in the same cluster assume similar values for the molecules of the dataset • Many different methods and algorithms • different clustering methods will result in different clusters, with different relationships between them • different algorithms can be used to implement the same method (some may be more efficient than others)
Hierarchical and non-hierarchical • A basic distinction is between clustering methods that organise clusters hierarchically, and those that do not
Hierarchical agglomerative • The hierarchy is built from the bottom upwards • Several different methods and algorithms • The basic Lance-Williams algorithm (common to all methods) starts from the table of similarities between all pairs of items • at each step the most similar pair of molecules (or previously-formed clusters) is merged • until everything is in one big cluster • methods differ in how they determine the similarity between clusters
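The bottom-up merging described above can be sketched naively with single linkage (cluster similarity = smallest inter-point distance); a real implementation would use the Lance-Williams update instead of recomputing distances from scratch:

```python
import numpy as np

def single_linkage(points):
    """Naive agglomerative clustering, single linkage: repeatedly merge the
    two closest clusters until one big cluster remains.
    Returns the merge sequence as (cluster_a, cluster_b, distance) tuples."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    # Single linkage: distance between the closest members.
                    d = min(np.linalg.norm(points[i] - points[j])
                            for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
        d, a, b = best
        clusters[a] += clusters.pop(b)   # merge b into a
        merges.append((a, b, d))
    return merges
```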
Hierarchical divisive • The hierarchy is built from the top downwards • At each step a cluster is chosen to divide, until each cluster has only one member • Various ways of choosing next cluster to divide • one with most members • one with least similar pair of members • etc. • Various ways of dividing it
Non-hierarchical methods • Usually faster than hierarchical • e.g.: nearest-neighbour methods • the best-known example is the Jarvis-Patrick method • identify the top k (e.g. 20) nearest neighbours of each object • two objects join the same cluster if they have at least kmin of their top k nearest neighbours in common • tends to produce a few large heterogeneous clusters and a lot of singletons (single-member clusters)
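The Jarvis-Patrick idea can be sketched as below; note this simplified variant only checks shared neighbours, while the original method additionally requires the two items to appear in each other's neighbour lists:

```python
import numpy as np

def jarvis_patrick(X, k=3, kmin=2):
    """Simplified Jarvis-Patrick sketch: two objects fall in the same cluster
    when they share at least kmin of their k nearest neighbours.
    Returns one cluster label per object (union-find roots)."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # pairwise distances
    nn = [set(np.argsort(d[i])[1:k + 1]) for i in range(n)]  # skip self at [0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if len(nn[i] & nn[j]) >= kmin:
                parent[find(i)] = find(j)   # merge the two clusters
    return [find(i) for i in range(n)]
```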
Self-organizing maps • A SOM is an unsupervised neural network condensing the input space into a low-dimensional representation
Genetic algorithms • Based on the Darwinian evolutionary theory • Individuals in a population of models are crossed over, mutated, then iteratively evaluated against a fitness function giving a statistical evaluation of each model's performance • Flow: initial population → evaluation of individuals → fitness reached? if yes, end; if no, individual selection, cross-over, mutations, and repeat
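The loop above can be sketched as a tiny GA where bit-strings encode descriptor subsets; the operators and parameters below are generic illustrations, not those of any specific QSAR package:

```python
import random

def ga_select(fitness, n_bits, pop_size=20, gens=30, pmut=0.05, seed=0):
    """Minimal GA sketch: bit-strings encode which descriptors are selected.
    Selection among the better individuals, one-point crossover, bit-flip
    mutation, and elitism (the two best survive unchanged)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(gens):
        scored = sorted(pop, key=fitness, reverse=True)
        nxt = scored[:2]                       # elitism: keep the two best
        while len(nxt) < pop_size:
            a, b = rng.sample(scored[:10], 2)  # select among the better half
            cut = rng.randrange(1, n_bits)     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < pmut) for bit in child]  # mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)
```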
Modelling approaches • SAR: categorical Y → classification • Quantitative SAR: continuous Y → regression
Modelling techniques • Multiple Linear Regression • PLS • … • Neural Networks • Classification trees • Discriminant analysis • Fuzzy classification • …
Multiple Regression • Linear relationship between Y and several Xi descriptors Y = aX1 + bX2 + … + cXn + const. • Minimize error by least squares • May include polynomial terms
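The least-squares fit can be done directly with numpy; the two "descriptors" and the noise-free toy response below are invented:

```python
import numpy as np

# Toy data: 4 compounds x 2 hypothetical descriptors.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5   # noise-free response

A = np.column_stack([X, np.ones(len(X))])  # add intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # ≈ [ 2.0, -1.0, 0.5 ]
```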
Partial Least Squares PLS, similarly to PCA, projects the linearly correlated variables onto orthogonal components (scores t1 & t2 projection), chosen to be more closely related to the Y response
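A bare-bones NIPALS-style PLS1 sketch for a single response; a library implementation would also return loadings and regression coefficients. The "scores t1 & t2 projection" above corresponds to plotting the two columns of the returned matrix:

```python
import numpy as np

def pls1(X, y, n_comp=2):
    """Bare-bones PLS1 (NIPALS-style): each component maximises covariance
    with y; X and y are deflated after every extracted component.
    Returns the score matrix with columns t1, t2, ..."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = []
    for _ in range(n_comp):
        w = Xc.T @ yc                 # weight vector ~ cov(X, y)
        w /= np.linalg.norm(w)
        t = Xc @ w                    # score vector (projection)
        p = Xc.T @ t / (t @ t)        # X loadings
        q = yc @ t / (t @ t)          # y loading
        Xc = Xc - np.outer(t, p)      # deflate X
        yc = yc - q * t               # deflate y
        scores.append(t)
    return np.array(scores).T
```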
Neural networks • Inspired by biological neurons • NNs are a set of connected non-linear elements transforming an input I into an output O = f(I)
The problem of overfitting (figure: the same data fitted by a straight line, y = 0.979x + 0.344 with R² = 0.956, and by a 4th-degree polynomial, y = −0.062x⁴ + 1.293x³ − 9.472x² + 29.24x − 27.37 with R² = 0.999)
Solution: validation (figure: performance vs. model complexity — training prediction performance keeps improving with complexity, while validation prediction performance peaks at the best model and then worsens)
Validation criteria Internal validation - robustness • Cross-validation (LOO, LSO) • Bootstrap • Y scrambling External validation - prediction ability • Test set representative of the training set • Tropsha criteria Applicability domain
Cross validation Leave One Out • All the data but one compound are used for fitting • Predict the excluded sample • Repeat for all samples • Calculate Q² (or R²cv) from these predictions, analogously to R² Problem: too optimistic if there are many samples Leave Many Out • Exclude larger groups to obtain a more realistic outcome
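The LOO procedure can be sketched for a simple least-squares model (each compound is excluded in turn, the model is refitted, and Q² is computed from the held-out predictions):

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out Q^2 for an ordinary least-squares model with intercept."""
    preds = np.empty_like(y)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i                      # exclude compound i
        A = np.column_stack([X[mask], np.ones(mask.sum())])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        preds[i] = np.append(X[i], 1.0) @ coef             # predict excluded sample
    press = np.sum((y - preds) ** 2)
    return 1 - press / np.sum((y - y.mean()) ** 2)
```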
Bootstrapping • Bootstrapping simulates what happens when the data set of n objects is randomly resampled with replacement • K n-dimensional groups are generated by randomly repeating some objects • The model obtained on each set is used to predict the values of the excluded samples • From each bootstrap sample the statistical parameter of interest is calculated • The accuracy estimate is the average of all calculated statistics
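The resampling scheme can be sketched for an arbitrary statistic (the mean is used here just as a stand-in for a model performance parameter):

```python
import numpy as np

def bootstrap_stat(y, stat=np.mean, k=500, seed=0):
    """Bootstrap sketch: draw k samples of size n with replacement,
    recompute the statistic on each, and return its mean and spread."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = [stat(y[rng.integers(0, n, n)]) for _ in range(k)]
    return np.mean(stats), np.std(stats)
```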
Y-scrambling • Randomly permute the Y responses several times while keeping the X variables in the same order
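A Y-scrambling sketch: refit the same model on permuted responses and compare; a real relationship should give an R² far above the scrambled average:

```python
import numpy as np

def y_scrambling_r2(X, y, n_perm=200, seed=0):
    """Return (R^2 on the true y, mean R^2 over n_perm scrambled y's)
    for a least-squares model with intercept."""
    rng = np.random.default_rng(seed)

    def r2(yv):
        A = np.column_stack([X, np.ones(len(yv))])
        coef, *_ = np.linalg.lstsq(A, yv, rcond=None)
        resid = yv - A @ coef
        return 1 - np.sum(resid ** 2) / np.sum((yv - yv.mean()) ** 2)

    scrambled = [r2(rng.permutation(y)) for _ in range(n_perm)]
    return r2(y), np.mean(scrambled)
```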
Tropsha criteria* a) Q² > 0.5; b) R² > 0.6; c) (R² − R²₀)/R² < 0.1 and 0.85 < k < 1.15, or (R² − R′²₀)/R² < 0.1 and 0.85 < k′ < 1.15 (k = slope of the regression line through the origin; R²₀ = R² of the regression y = kx) d) if (c) is not fulfilled, then |R²₀ − R′²₀| < 0.3 * A. Golbraikh, M. Shen, Z. Xiao, Y.D. Xiao, K.-H. Lee, A. Tropsha, Rational selection of training and test sets for the development of validated QSAR models, JCAMD, 17 (2003) 241-253.
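A sketch of the external-validation check; for brevity it evaluates criterion (b) and one branch of criterion (c), not the full set with the swapped-axes quantities:

```python
import numpy as np

def tropsha_ok(y_obs, y_pred):
    """Partial Golbraikh-Tropsha check on an external test set:
    R^2 > 0.6, (R^2 - R^2_0)/R^2 < 0.1 and 0.85 <= k <= 1.15,
    where k is the slope of the regression through the origin."""
    r = np.corrcoef(y_obs, y_pred)[0, 1]
    r2 = r ** 2
    k = np.sum(y_obs * y_pred) / np.sum(y_pred ** 2)       # slope of y = k*x
    r2_0 = 1 - np.sum((y_obs - k * y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
    return bool(r2 > 0.6 and (r2 - r2_0) / r2 < 0.1 and 0.85 <= k <= 1.15)
```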
Applicability domain The applicability domain of a (Q)SAR model is the response and chemical structure space in which the model makes predictions with a given reliability.* * Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. ATLA, 33:1-19, 2005.
Applicability domain (figure: training data in the descriptor space, with new compounds projected onto them) Can you see the intruders? Similarity!
AD assessment Similarity measures: • Response range (span of activity data) • Chemometric treatment of the descriptor space • Fragment-based approaches
Chemometric Methods • Descriptor range-based • Geometric methods • Distance-based • Probability density distribution
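One simple distance-based criterion among the chemometric methods listed above can be sketched as follows; the threshold factor of 1.5 is an arbitrary illustrative choice:

```python
import numpy as np

def in_domain(X_train, x_new, factor=1.5):
    """Distance-based AD sketch: a query compound is inside the domain when
    its mean distance to the training compounds does not exceed `factor`
    times the average pairwise distance within the training set."""
    d_train = np.linalg.norm(X_train[:, None] - X_train[None, :], axis=-1)
    threshold = factor * d_train[np.triu_indices(len(X_train), k=1)].mean()
    d_new = np.linalg.norm(X_train - x_new, axis=1).mean()
    return bool(d_new <= threshold)
```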
AMBIT software http://ambit.acad.bg/main.php