CZ3253: Computer Aided Drug Design • Drug Design Methods I: QSAR • Prof. Chen Yu Zong • Tel: 6874-6877 • Email: csccyz@nus.edu.sg • http://xin.cz3.nus.edu.sg • Room 07-24, level 7, SOC1, National University of Singapore
Terminology • SAR (Structure-Activity Relationships) • Circa 19th century? • QSAR (Quantitative Structure-Activity Relationships) • Specific to some biological/pharmaceutical function of a molecule (Absorption, Distribution, Metabolism, Excretion) • Brown and Fraser (1868-9) • ‘constitution’ related to biological response • LogP • QSPR (Quantitative Structure-Property Relationships) • Relate structure to any physical-chemical property of a molecule
Statistical Models • Simple • Mean, median and variation • Regression • Advanced • Validation methods • Principal components, covariance • Multiple Regression • QSAR, QSPR
Modern QSAR • Hansch et al. (1963) • Activity depends on ‘travel through the body’: partitioning between varied solvents • C (minimum dosage required) • π (hydrophobicity) • σ (electronic) • Es (steric)
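The slide's equation did not survive extraction. One commonly quoted form of the Hansch relationship combines the four quantities listed above, with a parabolic term in hydrophobicity. It is given here as an assumption, not as the lecturer's exact expression; the coefficients k_1 to k_5 are hypothetical names for the constants fitted by regression.

```latex
\log\frac{1}{C} = -k_1\,\pi^{2} + k_2\,\pi + k_3\,\sigma + k_4\,E_s + k_5
```

Because C is the minimum dose required, a larger log(1/C) means a more potent compound.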
Choosing Descriptors • Buffon’s Problem • Needle Length? • Needle Color? • Needle Composition? • Needle Sheen? • Needle Orientation?
Choosing Descriptors • Constitutional • MW, number of atoms of each element • Topological • Connectivity, Wiener index (sums of bond distances) • 2D Fingerprints (bit-strings) • 3D topographical indices, pharmacophore keys • Electrostatic • Polarity, polarizability, partial charges • Geometrical Descriptors • Length, width, molecular volume
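As an illustration of a topological descriptor, the Wiener index can be sketched as the sum of shortest-path bond counts over all pairs of heavy atoms. The adjacency dictionaries and atom numbering below are hypothetical, and hydrogens are assumed to be suppressed.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all unordered atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives the bond distance from `start` to every atom.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end

# Hydrogen-suppressed carbon skeletons (hypothetical atom numbering).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(wiener_index(n_butane), wiener_index(isobutane))
```

The branched isomer gets the smaller value, which is why the index is often read as a measure of molecular compactness.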
Choosing Descriptors • Chemical • Hydrophobicity (LogP) • HOMO and LUMO energies • Vibrational frequencies • Bond orders • Total energy • ΔG, ΔS, ΔH
Statistical Methods • 1-D analysis • Large dimension sets require decomposition techniques • Multiple Regression • PCA • PLS • Connecting a descriptor with a structural element so as to interpolate and extrapolate data
Simple Error Analysis (1-D) • Given N data points • Mean • Variance • Regression
Simple Error Analysis (1-D) • Given N data points • Regression
Simple Error Analysis (1-D) • Given N data points • (Poor) 0 < R² < 1 (Good)
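The formulas on these three slides were images and are missing from the transcript. The sketch below shows the standard textbook versions for N data points. It assumes the sample variance uses N - 1 in the denominator (the slides may have used N), and the data values are made up.

```python
from statistics import mean

def simple_fit(x, y):
    """Mean, sample variance, least-squares line and R^2 for paired data."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "mean_x": x_bar,
        "var_x": sxx / (n - 1),        # assumption: N - 1 denominator
        "slope": slope,
        "intercept": intercept,
        "r2": 1 - ss_res / ss_tot,     # near 0 is poor, near 1 is good
    }

# Made-up descriptor values (x) and activities (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_fit(x, y)
print(fit)
```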
Correlation vs. Dependence? • Correlation • Two or more variables/descriptors may correlate to the same property of a system • Dependence • When the correlation can be shown to be due to a change in one variable being caused by a change in the other • Example: an elephant's head and legs • Correlation exists between size of head and legs • The size of one does not depend on the size of the other
Quantitative Structure Activity/Property Relationships (QSAR, QSPR) • Discern relationships between multiple variables (descriptors) • Identify connections between structural traits (type of subunits, bond angles, local components) and descriptor values (e.g. activity, LogP, % denatured)
Pre-Qualifications • Size • Minimum of FIVE samples per descriptor • Verification • Variance • Scaling • Correlations
QSAR/QSPR Pre-Qualifications • Variance • Coefficient of Variation
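A minimal sketch of the variance check, assuming the coefficient of variation is taken as the sample standard deviation divided by the mean. The descriptor values and the 0.05 cut-off are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean (dimensionless)."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined for a zero mean")
    return stdev(values) / abs(m)

# Made-up descriptor columns for five molecules.
descriptors = {
    "logP": [1.2, 2.5, 3.1, 0.8, 1.9],
    "n_rings": [1, 1, 1, 1, 1],       # constant, so it carries no information
}
cv = {name: coefficient_of_variation(col) for name, col in descriptors.items()}
low_variance = [name for name, c in cv.items() if c < 0.05]  # hypothetical cut-off
print(cv, low_variance)
```

A descriptor that barely varies across the data set cannot explain differences in activity, so it is a candidate for removal before regression.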
QSAR/QSPR Pre-Qualifications • Scaling • Standardizing or normalizing descriptors to ensure they have equal weight (in terms of magnitude) in subsequent analysis
QSAR/QSPR Pre-Qualifications • Scaling • Unit Variance (Auto Scaling) • Ensures equal statistical weights (initially) • Mean Centering
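Auto scaling can be sketched as mean centering followed by division by the standard deviation, so that every descriptor ends up with zero mean and unit variance. The sample standard deviation is an assumption here, and the descriptor values are made up.

```python
from statistics import mean, stdev

def mean_center(column):
    """Subtract the column mean so the descriptor is centred on zero."""
    m = mean(column)
    return [v - m for v in column]

def autoscale(column):
    """Mean-centre, then divide by the sample standard deviation (unit variance)."""
    s = stdev(column)
    return [v / s for v in mean_center(column)]

# Made-up descriptors on very different scales.
mol_weight = [78.1, 92.1, 106.2, 128.2]
log_p = [2.1, 2.7, 3.2, 3.3]

scaled_mw = autoscale(mol_weight)
scaled_logp = autoscale(log_p)
print(scaled_mw, scaled_logp)
```

After scaling, molecular weight no longer dominates LogP simply because its raw numbers are larger.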
QSAR/QSPR Pre-Qualifications • Correlations • Remove correlated descriptors • Keep correlated descriptors so as to reduce data set size • Apply math operation to remove correlation (PCR)
QSAR/QSPR Pre-Qualifications • Correlations
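One way to screen for correlated descriptors is to compute the Pearson correlation matrix and flag pairs above a threshold. This is a sketch assuming NumPy is available; the descriptor table and the 0.9 threshold are hypothetical.

```python
import numpy as np

names = ["MW", "volume", "dipole"]
# Rows are molecules, columns are descriptors (made-up values).
X = np.array([
    [78.0,  89.0, 0.0],
    [92.0, 106.0, 0.4],
    [106.0, 123.0, 0.1],
    [128.0, 148.0, 0.0],
    [154.0, 180.0, 0.3],
])

R = np.corrcoef(X, rowvar=False)   # descriptor-by-descriptor correlation matrix
threshold = 0.9                    # hypothetical cut-off
correlated = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(R[i, j]) > threshold
]
print(correlated)
```

For each flagged pair, one descriptor can be dropped, or the set can be handed to PCA so that the correlation is removed by construction.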
QSAR/QSPR Scheme • Goal • Predict what happens next (extrapolate)! • Predict what happens between data points (interpolate)!
QSAR/QSPR Scheme • Types of Variable • Continuous • Concentration, occupied volume, partition coefficient, hydrophobicity • Discrete • Structural (1: Methyl group substituted, 0: no methyl group substitution)
QSAR/QSPR Principal Components Analysis • Reduces dimensionality of descriptors • Principal components are a set of vectors representing the variance in the original data
Principal components: reducing the dimensionality of a dataset. Clearly there is a relationship between x and y: a high correlation. We can define a new variable z = x + y such that we can express most of the variation in the data as the new variable z. This new variable is a principal component. In general, $p_i = \sum_{j=1}^{v} c_{i,j}\, x_j$, where $p_i$ is the ith principal component and $c_{i,j}$ is the coefficient of the variable $x_j$. There are v such variables. [Figure: scatter plot of y against x]
QSAR/QSPR-Principal Components Analysis • Geometric Analogy (3-D to 2-D PCA) [Figure: object in x, y, z axes]
Principal components. PCA is the transformation of a set of correlated variables to a set of orthogonal, uncorrelated variables called principal components. These new variables are linear combinations of the original variables, in decreasing order of importance. [Figure: the data matrix decomposed into scores (a measure of the variation between samples), eigenvalues, and loadings (a measure of the variation between variables)]
QSAR/QSPR Principal Components Analysis • Formulate matrix • Diagonalize matrix • Eigenvectors are the principal components • These principal components (new descriptors) are a linear combination of the original descriptors • Eigenvalues represent variance • Largest accounts for greatest % of data variance • Next corresponds to second greatest and so on
QSAR/QSPR-Principal Components Analysis • Formulate matrix (several types) from the N x P data matrix • N is number of molecules • P is number of descriptors • Correlation or covariance matrix between descriptors (P x P) • Variance-covariance matrix between molecules (N x N) • Diagonalize (rotate) matrix
QSAR/QSPR-Principal Components Analysis • Eigenvectors (Loadings) • Represents contribution from each original descriptor to PC (new descriptor) • # columns = # of descriptors • # rows = # of descriptors OR # of molecules • Eigenvalues • Indicate which PC most important (representative of original descriptors) • Benzene has 2 non-zero and 1 zero eigenvalue (planar)
QSAR/QSPR-Principal Components Analysis • Scores • Graphing each object/molecule in space of 2 or more PCs • # rows = # of objects/molecules • # columns = # of descriptors OR # of molecules • For benzene, the scores correspond to a graph in the PC1 (x’) and PC2 (y’) system
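The benzene remark can be checked numerically. The sketch below assumes NumPy, treats the six carbon positions as the data set, and uses an approximate ring radius of 1.39 Å and an arbitrary 30° tilt; both numbers are assumptions for illustration.

```python
import numpy as np

r = 1.39                                   # approximate ring radius in angstroms (assumed)
angles = np.arange(6) * np.pi / 3
flat = np.column_stack([r * np.cos(angles), r * np.sin(angles), np.zeros(6)])

# Tilt the ring so that it is not aligned with the original x, y, z axes.
tilt = np.radians(30)
rot = np.array([
    [1.0, 0.0, 0.0],
    [0.0, np.cos(tilt), -np.sin(tilt)],
    [0.0, np.sin(tilt), np.cos(tilt)],
])
coords = flat @ rot.T

centered = coords - coords.mean(axis=0)
cov = centered.T @ centered / (len(coords) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)     # returned in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

scores = centered @ eigvecs[:, :2]         # atoms plotted in the PC1/PC2 plane
print(np.round(eigvals, 6))
```

Two eigenvalues are non-zero and the third vanishes because all atoms lie in one plane, so the two-column score matrix loses no information.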
Principal components. The PCs each maximise the variance in the data in orthogonal directions and are ordered by size. Usually only a few components are needed to explain more than 90% of the variance in the data; otherwise the properties are not relevant. The first step is to calculate the variance-covariance matrix from the data. [Figure: data plotted on x and y axes with the PC1 and PC2 directions]
Principal components. If there are s observations, each of which contains v values, the data can be represented by a matrix D with s rows and v columns, so that the variance-covariance matrix $Z = D^{T}D$ is v x v. The eigenvectors of Z are the principal components. Z is a square symmetric matrix, so the eigenvectors are orthogonal. Usually the matrix is diagonalised to obtain the eigenvectors (the weightings for the properties) and eigenvalues (the explained variance). [Figure: data plotted on x and y axes with the PC1 and PC2 directions]
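The procedure described above can be sketched with NumPy. The sketch assumes observations are stored as rows and properties as columns, mean-centres the data first, and divides by s - 1 (a common normalisation that the slide does not specify). The data values are made up.

```python
import numpy as np

# Rows are observations (molecules), columns are properties; values are made up.
D = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2],
              [4.0, 3.8], [5.0, 5.1], [6.0, 6.0]])

centered = D - D.mean(axis=0)               # mean-centre each property
Z = centered.T @ centered / (len(D) - 1)    # variance-covariance matrix (v x v)

eigvals, eigvecs = np.linalg.eigh(Z)        # eigh because Z is symmetric
order = np.argsort(eigvals)[::-1]           # largest variance first
eigvals, loadings = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()         # fraction of variance per PC
scores = centered @ loadings                # observations in PC space
print(np.round(explained, 3))
print(np.round(loadings, 3))
```

With two strongly correlated properties, the first PC weights both almost equally, which matches the earlier z = x + y picture up to a scale factor.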
Principal components. The output looks like this (columns are the principal components; rows p1 to p5 are the properties):

                          PC1   PC2   PC3   PC4   PC5
eigenvalue (% variance)    80    10     5     3     2
p1                        .2    .3    .4    .1    .1
p2                        .01   .02   .3    .4    .5
p3                        .02   .03   .1    .2    .4
p4                        .03   .4    .4    .04   .3
p5                        .3    .5    .5    .05   .3

Multiply the property values of a molecule by these weightings to obtain its value on each PC. Can do regression on the PCs, e.g. V = 0.3PC1(0.1) + 0.2PC2(0.1) + 0.4(0.2), so we've reduced a five-property problem to a two-property problem.
QSAR on SYBYL (Tripos Inc.) • 10D → 3D
QSAR on SYBYL (Tripos Inc.) • Eigenvalues: explanation of variance in data
QSAR on SYBYL (Tripos Inc.) • Each point corresponds to a column (# points = # descriptors) in the original data • Proximity indicates correlation
QSAR on SYBYL (Tripos Inc.) • Each point corresponds to a row of the original data (i.e. # points = # molecules), or a graph of molecules in PC space • Proximity ∝ similarity [Figure: He, H2O and naphthalene plotted in PC space, with one PC acting as a molecular-size axis running from small to big]
QSAR on SYBYL (Tripos Inc.) Outlier
QSAR/QSPR-Regression Types • Principal Component Analysis
Non-Linear Mappings • Calculate “distance” between points in N-dimensional descriptor/parameter space • Euclidean • City-block distances • Randomly assign compounds in set to points in a 2-D or 3-D space • Minimize the difference between the two sets of distances (optimal N-D → 2-D plot)
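The steps above can be sketched as follows, assuming NumPy. The error function used here is a plain sum of squared distance differences, and the minimiser is a simple gradient descent that only accepts improving steps. Both are illustrative choices, not necessarily the method used in the lecture, and the descriptor values are made up.

```python
import numpy as np

def pairwise(points, metric="euclidean"):
    """All-pairs distance matrix, Euclidean or city-block."""
    diff = points[:, None, :] - points[None, :, :]
    if metric == "cityblock":
        return np.abs(diff).sum(axis=-1)
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(target, layout):
    """Sum of squared differences between N-D and 2-D distances."""
    iu = np.triu_indices(len(layout), k=1)
    return float(((target[iu] - pairwise(layout)[iu]) ** 2).sum())

def nonlinear_map(X, n_iter=500, step=0.05, seed=0, metric="euclidean"):
    rng = np.random.default_rng(seed)
    target = pairwise(X, metric)
    Y = rng.normal(size=(len(X), 2))          # random initial 2-D placement
    best = stress(target, Y)
    for _ in range(n_iter):
        d = pairwise(Y)
        np.fill_diagonal(d, 1.0)              # avoid dividing by zero on the diagonal
        coef = (d - target) / d
        np.fill_diagonal(coef, 0.0)
        grad = 2 * (coef[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
        trial = Y - step * grad
        trial_stress = stress(target, trial)
        if trial_stress < best:               # keep only improving moves
            Y, best = trial, trial_stress
        else:
            step *= 0.5                       # otherwise shrink the step
    return Y, best

# Six compounds described by four made-up descriptors.
X = np.array([[0.1, 1.2, 3.0, 0.5], [0.2, 1.0, 2.8, 0.4], [2.0, 0.1, 0.3, 1.9],
              [2.1, 0.2, 0.5, 2.0], [1.0, 1.0, 1.0, 1.0], [0.9, 1.1, 1.2, 0.8]])
layout, final_stress = nonlinear_map(X)
print(round(final_stress, 4))
```

The result depends on the random starting layout, which is the disadvantage usually raised against this method; starting from PCA scores instead of random points is a common remedy.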
Non-Linear Mappings • Advantages • Non-linear • No assumptions! • Chance groupings unlikely (2D group likely an N-D group) • Disadvantages • Dependence on initial guess (Use PCA scores to improve)
QSAR/QSPR-Regression Types • Multiple Regression (MR) • PCR • PLS
QSAR/QSPR-Regression Types • Linear Regression • Minimize difference between calculated and observed values (residuals) • Multiple Regression
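Multiple regression minimises the sum of squared residuals over several descriptors at once. The sketch below assumes NumPy and uses a synthetic, noise-free activity built from two made-up descriptors, so the fitted coefficients should recover the generating ones.

```python
import numpy as np

# Two made-up descriptors for six molecules.
x1 = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
x2 = np.array([1.0, 0.2, 0.8, 0.1, 0.9, 0.3])
y = 1.0 + 2.0 * x1 - 0.5 * x2            # synthetic activity, no noise

# Design matrix with a column of ones for the intercept.
A = np.column_stack([np.ones_like(x1), x1, x2])
coef, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)

y_calc = A @ coef
print(np.round(coef, 6))                  # intercept, then one coefficient per descriptor
print(float(((y - y_calc) ** 2).sum()))   # sum of squared residuals
```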
QSAR/QSPR-Regression Types • Principal Component Regression • Regression but with Principal Components substituted for original descriptors/variables
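A minimal PCR sketch, assuming NumPy: the descriptors are mean-centred, the leading principal components are extracted, and the activity is regressed on their scores. The function names `pcr_fit` and `pcr_predict` and the data are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the scores of the first n_components principal components."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order]                  # descriptors -> PCs
    scores = Xc @ loadings
    A = np.column_stack([np.ones(len(X)), scores])
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)
    return x_mean, loadings, gamma

def pcr_predict(model, X):
    x_mean, loadings, gamma = model
    return gamma[0] + (X - x_mean) @ loadings @ gamma[1:]

# Made-up descriptor table (six molecules, three descriptors) and activities.
X = np.array([[0.5, 1.0, 3.0], [1.0, 0.2, 2.5], [1.5, 0.8, 2.9],
              [2.0, 0.1, 1.0], [2.5, 0.9, 1.7], [3.0, 0.3, 0.4]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * X[:, 2]

model_2pc = pcr_fit(X, y, n_components=2)
model_full = pcr_fit(X, y, n_components=3)
print(np.round(pcr_predict(model_2pc, X) - y, 3))   # error from dropping one PC
```

Keeping every component reproduces ordinary multiple regression; the benefit of PCR comes from discarding the low-variance components.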
QSAR/QSPR-Regression Types • Partial Least Squares • Cross-validation determines number of descriptors/components to use • Derive equation • Use bootstrapping and t-test to test coefficients in QSAR regression
QSAR/QSPR-Regression Types • Partial Least Squares (a.k.a. Projection to Latent Structures) • Regression of a Regression • Provides insight into variation in x’s (the b_{i,j}’s, as in PCA) AND y’s (the a_i’s) • The t_i’s are orthogonal • M = number of variables/descriptors OR number of observations/molecules, whichever is smaller
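The equations on this slide are missing from the transcript. As a stand-in, here is a sketch of a NIPALS-style PLS for a single response, assuming NumPy. The function name `pls1_fit`, the variable names and the data are hypothetical and do not follow the slide's a_i and b_{i,j} notation exactly.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """NIPALS-style PLS for one response; returns centring terms, coefficients, scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q, T = [], [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)          # direction of maximum covariance with y
        t = Xk @ w                      # latent variable (score vector)
        tt = t @ t
        p = Xk.T @ t / tt               # x-loadings
        c = yk @ t / tt                 # y-loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - c * t                 # deflate y
        W.append(w); P.append(p); q.append(c); T.append(t)
    W, P, q, T = np.array(W).T, np.array(P).T, np.array(q), np.array(T).T
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients on the centred descriptors
    return x_mean, y_mean, coef, T

# Made-up descriptor table (six molecules, three descriptors) and activities.
X = np.array([[0.5, 1.0, 3.0], [1.0, 0.2, 2.5], [1.5, 0.8, 2.9],
              [2.0, 0.1, 1.0], [2.5, 0.9, 1.7], [3.0, 0.3, 0.4]])
y = np.array([3.2, 2.9, 4.1, 4.9, 5.8, 6.7])

x_mean, y_mean, coef, T = pls1_fit(X, y, n_components=2)
y_calc = y_mean + (X - x_mean) @ coef
print(np.round(T.T @ T, 6))             # off-diagonal zeros: the scores are orthogonal
```

Unlike PCR, each component here is chosen using y as well as X, which is the coupling of component generation and QSAR generation that distinguishes PLS.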
QSAR/QSPR-Regression Types • PLS is NOT MR or PCR in practice • PLS is MR w/cross-validation • PLS is faster • It couples the target representation (QSAR generation) and component generation, while PCA and PCR are separate • PLS is well suited to multivariate problems
QSAR/QSPR Post-Qualifications • Confidence in Regression • TSS: Total Sum of Squares • ESS: Explained Sum of Squares • RSS: Residual Sum of Squares
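The three sums of squares can be computed directly from observed and calculated values. The sketch below assumes NumPy and an ordinary least-squares line with an intercept, for which TSS = ESS + RSS and R² = ESS/TSS. The data are made up.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])       # made-up activities

slope, intercept = np.polyfit(x, y, 1)         # ordinary least-squares line
y_hat = slope * x + intercept

tss = float(((y - y.mean()) ** 2).sum())       # total sum of squares
ess = float(((y_hat - y.mean()) ** 2).sum())   # explained sum of squares
rss = float(((y - y_hat) ** 2).sum())          # residual sum of squares
r_squared = ess / tss

print(round(tss, 4), round(ess, 4), round(rss, 4), round(r_squared, 4))
```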