Molecular Modeling: Statistical Analysis of Complex Data C372 Dr. Kelsey Forsythe
Terminology • SAR (Structure-Activity Relationships) • Circa 19th century? • QSPR (Quantitative Structure-Property Relationships) • Relate structure to any physical-chemical property of a molecule • QSAR (Quantitative Structure-Activity Relationships) • Specific to some biological/pharmaceutical function of a molecule (ADME: Absorption, Distribution, Metabolism, Excretion) • Crum-Brown and Fraser (1868-9) • ‘constitution’ related to biological response • LogP
Statistical Models • Simple • Mean, median and variation • Regression • Advanced • Validation methods • Principal components, co-variance • Multiple Regression QSAR,QSPR
Modern QSAR • Hansch et al. (1963) • Activity → ‘travel through body’ → partitioning between varied solvents • C (minimum dosage required) • π (hydrophobicity) • σ (electronic) • Es (steric)
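The four quantities on this slide are usually tied together in a linear free-energy relationship. The form below is a sketch of the commonly quoted Hansch-type equation, not necessarily the exact expression shown in the lecture. The coefficients k1 to k4 are placeholder names for values obtained by regression, and some formulations add a term in π² to allow for an optimum hydrophobicity.

```latex
\log\frac{1}{C} = k_1\,\pi + k_2\,\sigma + k_3\,E_s + k_4
```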
Choosing Descriptors • Buffon’s Problem • Needle Length? • Needle Color? • Needle Composition? • Needle Sheen? • Needle Orientation?
Choosing Descriptors • Constitutional • MW, Natoms • Topological • Connectivity, Wiener index • Electrostatic • Polarity, polarizability, partial charges • Geometrical Descriptors • Length, width, molecular volume • Quantum Chemical • HOMO and LUMO energies • Vibrational frequencies • Bond orders • Total energy
Choosing Descriptors • Constitutional • MW, Natoms of element • Topological • Connectivity, Wiener index (sums of bond distances) • 2D Fingerprints (bit-strings) • 3D topographical indices, pharmacophore keys • Electrostatic • Polarity, polarizability, partial charges • Geometrical Descriptors • Length, width, molecular volume
Choosing Descriptors • Chemical • Hydrophobicity (LogP) • HOMO and LUMO energies • Vibrational frequencies • Bond orders • Total energy • ΔG, ΔS, ΔH
Statistical Methods • 1-D analysis • Large dimension sets require decomposition techniques • Multiple Regression • PCA • PLS • Connecting a descriptor with a structural element so as to interpolate and extrapolate data
Simple Error Analysis (1-D) • Given N data points • Mean • Variance • Regression
Simple Error Analysis (1-D) • Given N data points • Regression
Simple Error Analysis (1-D) • Given N data points • (Poor) 0 < R² < 1 (Good)
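The formulas on these three slides did not survive extraction. As a minimal sketch, the textbook definitions of the mean, sample variance, least-squares line and R² can be written with NumPy as follows. The data values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical data: x is a single descriptor, y an observed property.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
mean_x = x.sum() / n
mean_y = y.sum() / n

# Sample variance (divides by N - 1).
var_y = ((y - mean_y) ** 2).sum() / (n - 1)

# Least-squares line y = slope * x + intercept.
slope = ((x - mean_x) * (y - mean_y)).sum() / ((x - mean_x) ** 2).sum()
intercept = mean_y - slope * mean_x

# R^2 compares residual scatter with total scatter about the mean.
y_calc = slope * x + intercept
r2 = 1.0 - ((y - y_calc) ** 2).sum() / ((y - mean_y) ** 2).sum()

print(mean_y, var_y, slope, intercept, r2)
```

An R² near 1 means the line reproduces most of the variation in y; a value near 0 means it explains almost none of it.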
Correlation vs. Dependence? • Correlation • Two or more variables/descriptors may correlate with the same property of a system • Dependence • When the correlation can be shown to arise from one variable changing in response to a change in another • Ex. Elephant’s head and legs • Correlation exists between size of head and legs • The size of one does not depend on the size of the other
Quantitative Structure Activity/Property Relationships (QSAR, QSPR) • Discern relationships between multiple variables (descriptors) • Identify connections between structural traits (type of substituents, bond angles, substituent location) and descriptor values (e.g. activity, LogP, % denaturation)
Pre-Qualifications • Size • Minimum of FIVE samples per descriptor • Verification • Variance • Scaling • Correlations
QSAR/QSPR Pre-Qualifications • Variance • Coefficient of Variation
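The slide names the coefficient of variation without its formula. A minimal sketch, assuming the usual definition (sample standard deviation divided by the mean), is shown below. The function name and example values are illustrative only.

```python
import numpy as np

def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean (assumed definition)."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / abs(values.mean())

# Hypothetical descriptor columns: one that varies, one that is constant.
varying = [1.0, 2.0, 3.0]
constant = [4.0, 4.0, 4.0]

cv_varying = coefficient_of_variation(varying)
cv_constant = coefficient_of_variation(constant)
print(cv_varying, cv_constant)
```

A descriptor whose coefficient of variation is close to zero barely changes across the data set and is unlikely to help distinguish molecules.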
QSAR/QSPR Pre-Qualifications • Scaling • Standardizing or normalizing descriptors to ensure they have equal weight (in terms of magnitude) in subsequent analysis
QSAR/QSPR Pre-Qualifications • Scaling • Unit Variance (Auto Scaling) • Ensures equal statistical weights (initially) • Mean Centering
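Mean centering and unit-variance (auto) scaling can be sketched in a few lines. The function name and the example matrix are assumptions made for illustration; rows are molecules and columns are descriptors.

```python
import numpy as np

def autoscale(X):
    """Mean-center each descriptor column, then scale it to unit variance.

    A constant column has zero standard deviation and would divide by zero,
    so such descriptors should be removed first.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)             # mean centering
    return centered / X.std(axis=0, ddof=1)   # unit variance

# Hypothetical descriptors with very different magnitudes (e.g. MW and LogP).
X_raw = np.array([[78.1, 2.1],
                  [92.1, 2.7],
                  [106.2, 3.2],
                  [128.2, 3.3]])
X_scaled = autoscale(X_raw)
print(X_scaled)
```

After scaling, every column has mean 0 and variance 1, so no descriptor dominates simply because of its units.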
QSAR/QSPR Pre-Qualifications • Correlations • Remove correlated descriptors • Keep correlated descriptors so as to reduce data set size • Apply math operation to remove correlation (PCR)
QSAR/QSPR Pre-Qualifications • Correlations
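One way to act on the "remove correlated descriptors" option is to scan the descriptor correlation matrix and keep a column only if it is not strongly correlated with one already kept. This is a sketch under stated assumptions: the 0.9 threshold, the function name and the data are all illustrative choices, not values from the lecture.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Greedily keep descriptors whose |r| with every kept descriptor is below threshold."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)   # descriptor-by-descriptor correlations
    keep = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

# Hypothetical data: Natoms rises in lockstep with MW, LogP does not.
names = ["MW", "Natoms", "LogP"]
X_desc = np.array([[78.0, 12.0, 2.1],
                   [92.0, 15.0, 0.5],
                   [106.0, 18.0, 3.0],
                   [120.0, 21.0, 1.2],
                   [134.0, 24.0, 2.4]])
X_kept, names_kept = drop_correlated(X_desc, names)
print(names_kept)
```

The greedy scan depends on column order: the first descriptor of a correlated group is the one that survives.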
QSAR/QSPR Scheme • Goal • Predict what happens next (extrapolate)! • Predict what happens between data points (interpolate)!
QSAR/QSPR Scheme • Types of Variable • Continuous • Concentration, occupied volume, partition coefficient, hydrophobicity • Discrete • Structural (1-meta substituted, 0-no meta substitution)
QSAR/QSPR-Principal Components Analysis • Reduces dimensionality of descriptors • Principal components are a set of vectors representing the variance in the original data
QSAR/QSPR-Principal Components Analysis • Geometric Analogy (3-D to 2-D PCA) [Figure: x, y, z axes]
QSAR/QSPR-Principal Components Analysis • Formulate matrix • Diagonalize matrix • Eigenvectors are the principal components • These principal components (new descriptors) are a linear combination of the original descriptors • Eigenvalues represent variance • Largest accounts for greatest % of data variance • Next corresponds to second greatest and so on
QSAR/QSPR-Principal Components Analysis • Formulate matrix (Several types) • Correlation or covariance (N x P) • N is number of molecules • P is number of descriptors • Variance-Covariance matrix (N x N) • Diagonalize (Rotate) matrix
QSAR/QSPR-Principal Components Analysis • Eigenvectors (Loadings) • Represents contribution from each original descriptor to PC (new descriptor) • # columns = # of descriptors • # rows = # of descriptors OR # of molecules • Eigenvalues • Indicate which PC most important (representative of original descriptors) • Benzene has 2 non-zero and 1 zero eigenvalue (planar)
QSAR/QSPR-Principal Components Analysis • Scores • Graphing each object/molecule in space of 2 or more PCs • # rows = # of objects/molecules • # columns = # of descriptors OR # of molecules • For benzene, corresponds to a graph in the PC1 (x’), PC2 (y’) system
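The benzene remark can be checked numerically. The sketch below treats the six carbon atoms as the objects and their x, y, z coordinates as the descriptors, builds the covariance matrix, diagonalizes it, and computes loadings and scores. The ring radius of 1.39 and the 30 degree tilt are illustrative assumptions, chosen so the ring does not lie in a coordinate plane.

```python
import numpy as np

# Six ring atoms on a regular hexagon, tilted out of the xy-plane.
radius = 1.39
angles = np.deg2rad(np.arange(0, 360, 60))
flat = np.column_stack([radius * np.cos(angles),
                        radius * np.sin(angles),
                        np.zeros(6)])
tilt = np.deg2rad(30.0)
rot_x = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(tilt), -np.sin(tilt)],
                  [0.0, np.sin(tilt), np.cos(tilt)]])
coords = flat @ rot_x.T                      # 6 atoms x 3 coordinates

# Formulate the matrix: covariance of the mean-centered coordinates.
centered = coords - coords.mean(axis=0)
cov = np.cov(centered, rowvar=False)         # 3 x 3

# Diagonalize: eigenvectors are the loadings, eigenvalues the variances.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]        # largest variance first
eigenvalues = eigenvalues[order]
loadings = eigenvectors[:, order]

# Scores: each atom expressed in the principal-component axes.
scores = centered @ loadings
explained = eigenvalues / eigenvalues.sum()
print(np.round(eigenvalues, 6), np.round(explained, 3))
```

Because the ring is planar, the third eigenvalue is zero (to rounding) and the third score column vanishes, so plotting the first two score columns reproduces the hexagon in the PC1, PC2 plane.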
SYBYL (Tripos Inc.) • 10D → 3D
SYBYL (Tripos Inc.) • Eigenvalues → Explanation of variance in data
SYBYL (Tripos Inc.) • Each point corresponds to a column (# points = # descriptors) in the original data • Proximity → correlation
SYBYL (Tripos Inc.) • Each point corresponds to a row of the original data (i.e. # points = # molecules), or a graph of molecules in PC space • Proximity → similarity [Figure: molecules in PC space along a molecular-size axis; labels: He, H2O, naphthalene; annotation "Small acting Big"]
SYBYL (Tripos Inc.) Outlier
QSAR/QSPR-Regression Types • Principal Component Analysis
Non-Linear Mappings • Calculate “distance” between points in N-d descriptor/parameter space • Euclidean • City-block distances • Randomly assign compounds in set to points on a 2-D or 3-D space • Minimize difference between the N-d and 2-D distances (optimal N-d → 2-D plot)
Non-Linear Mappings • Advantages • Non-linear • No assumptions! • Chance groupings unlikely (2D group likely an N-D group) • Disadvantages • Dependence on initial guess (Use PCA scores to improve)
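The mapping procedure can be sketched as a small optimization: compute all pairwise distances in the N-d descriptor space, start from random 2-D positions, and move the points to reduce the squared mismatch between the two sets of distances. This is a simplified illustration using Euclidean distances, an unweighted stress function and plain gradient descent with step halving; it is not the specific algorithm or weighting used in any particular package, and all names and data are hypothetical.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance between every pair of rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(Y, D):
    """Sum over pairs of (2-D distance - N-d distance)^2, each pair counted once."""
    return ((pairwise_distances(Y) - D) ** 2).sum() / 2.0

def nonlinear_map(X, n_iter=500, step=0.01, seed=0):
    rng = np.random.default_rng(seed)
    D = pairwise_distances(np.asarray(X, dtype=float))
    Y = rng.normal(size=(len(D), 2))          # random initial 2-D guess
    start = current = stress(Y, D)
    for _ in range(n_iter):
        d = pairwise_distances(Y)
        np.fill_diagonal(d, 1.0)              # avoid 0/0 on the diagonal
        weights = (d - D) / d
        np.fill_diagonal(weights, 0.0)
        diff = Y[:, None, :] - Y[None, :, :]
        grad = 2.0 * (weights[:, :, None] * diff).sum(axis=1)
        trial = Y - step * grad
        trial_stress = stress(trial, D)
        if trial_stress < current:            # accept only improving moves
            Y, current = trial, trial_stress
        else:
            step *= 0.5                       # otherwise shrink the step
    return Y, start, current

# Hypothetical data set: 10 compounds described by 5 descriptors.
X_demo = np.random.default_rng(1).normal(size=(10, 5))
Y2d, stress_start, stress_end = nonlinear_map(X_demo)
print(stress_start, stress_end)
```

The result depends on the random starting positions, which is the disadvantage noted on the slide; seeding the 2-D coordinates with the first two PCA score columns is one way to reduce that dependence.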
QSAR/QSPR-Regression Types • Multiple Regression • PCR • PLS
QSAR/QSPR-Regression Types • Linear Regression • Minimize difference between calculated and observed values (residuals) • Multiple Regression
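Multiple regression minimizes the same sum of squared residuals, but with several descriptors at once. A minimal sketch using NumPy's least-squares solver follows. The data are synthetic and noise-free, with one continuous descriptor and one 0/1 indicator, so the known coefficients are recovered exactly.

```python
import numpy as np

# Hypothetical descriptors: column 0 continuous, column 1 a 0/1 indicator.
X_mr = np.array([[0.5, 1.0],
                 [1.0, 0.0],
                 [1.5, 1.0],
                 [2.0, 0.0],
                 [2.5, 1.0],
                 [3.0, 0.0]])
# Synthetic activity generated from known coefficients (no noise).
y_mr = 0.3 + 1.2 * X_mr[:, 0] - 0.7 * X_mr[:, 1]

# Add a column of ones so the first fitted coefficient is the intercept.
A = np.column_stack([np.ones(len(X_mr)), X_mr])
coef, *_ = np.linalg.lstsq(A, y_mr, rcond=None)

residuals = y_mr - A @ coef   # observed minus calculated
print(coef, (residuals ** 2).sum())
```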
QSAR/QSPR-Regression Types • Principal Component Regression • Regression but with Principal Components substituted for original descriptors/variables
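Principal component regression can be sketched as two steps: project the mean-centered descriptors onto the leading principal components, then run ordinary regression on those scores. The function names and data below are hypothetical, and the components are obtained from a singular value decomposition, which gives the same directions as diagonalizing the covariance matrix.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T
    scores = Xc @ loadings                    # new descriptors (PCs)
    A = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return x_mean, loadings, coef

def pcr_predict(model, X):
    x_mean, loadings, coef = model
    scores = (np.asarray(X, dtype=float) - x_mean) @ loadings
    return coef[0] + scores @ coef[1:]

# Hypothetical data: 20 molecules, 3 descriptors.
rng = np.random.default_rng(0)
X_pcr = rng.normal(size=(20, 3))
y_pcr = X_pcr @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

model_full = pcr_fit(X_pcr, y_pcr, n_components=3)
model_two = pcr_fit(X_pcr, y_pcr, n_components=2)
pred_full = pcr_predict(model_full, X_pcr)
pred_two = pcr_predict(model_two, X_pcr)
```

Keeping every component reproduces ordinary multiple regression; dropping the low-variance components trades some fit for a smaller, uncorrelated descriptor set.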
QSAR/QSPR-Regression Types • Partial Least Squares • Cross-validation determines number of descriptors/components to use • Derive equation • Use bootstrapping and t-test to test coefficients in QSAR regression
QSAR/QSPR-Regression Types • Partial Least Squares (a.k.a. Projection to Latent Structures) • Regression of a Regression • Provides insight into variation in x’s (b_i,j’s as in PCA) AND y’s (a_i’s) • The t_i’s are orthogonal • M = (# of variables/descriptors OR # of observations/molecules, whichever is smaller)
QSAR/QSPR-Regression Types • PLS is NOT MR or PCR in practice • PLS is MR w/cross-validation • PLS Faster • couples the target representation (QSAR generation) and component generation while PCA and PCR are separate • PLS well applied to multi-variate problems
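The coupling of component generation and regression can be illustrated with a single-response PLS fit written in the NIPALS style. This is a sketch of one textbook variant, not the implementation used by SYBYL or any other package; the function names and data are hypothetical. Each latent variable t is built from the direction in descriptor space that covaries most with the remaining y, and both X and y are deflated before the next component is extracted.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean             # residual X and y
    W, P, T, q = [], [], [], []
    for _ in range(n_components):
        w = E.T @ f                           # weights: covariance of X with y
        w /= np.linalg.norm(w)
        t = E @ w                             # latent variable (scores)
        tt = t @ t
        p = E.T @ t / tt                      # X loadings
        q_a = f @ t / tt                      # y loading
        E = E - np.outer(t, p)                # deflate X
        f = f - q_a * t                       # deflate y
        W.append(w); P.append(p); T.append(t); q.append(q_a)
    W, P, T = np.array(W).T, np.array(P).T, np.array(T).T
    beta = W @ np.linalg.solve(P.T @ W, np.array(q))
    return x_mean, y_mean, beta, T

def pls1_predict(model, X):
    x_mean, y_mean, beta, _ = model
    return y_mean + (np.asarray(X, dtype=float) - x_mean) @ beta

# Hypothetical data: 20 molecules, 3 descriptors.
rng = np.random.default_rng(1)
X_pls = rng.normal(size=(20, 3))
y_pls = X_pls @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

pls_model = pls1_fit(X_pls, y_pls, n_components=3)
T_scores = pls_model[3]
pred_pls = pls1_predict(pls_model, X_pls)
```

In practice the number of components would be chosen by cross-validation rather than fixed at the maximum, as the preceding slides indicate.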
QSAR/QSPR Post-Qualifications • Confidence in Regression • TSS-Total Sum of Squares • ESS-Explained Sum of Squares • RSS-Residual Sum of Squares
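The three sums of squares can be computed directly from the observed and calculated values. The sketch below assumes a least-squares line with an intercept, for which TSS = ESS + RSS and R² = ESS/TSS. The data are made up for illustration.

```python
import numpy as np

# Hypothetical descriptor x and observed property y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])

slope, intercept = np.polyfit(x, y, 1)
y_calc = slope * x + intercept

tss = ((y - y.mean()) ** 2).sum()        # total: observed vs. mean
ess = ((y_calc - y.mean()) ** 2).sum()   # explained: calculated vs. mean
rss = ((y - y_calc) ** 2).sum()          # residual: observed vs. calculated

r2 = ess / tss
print(tss, ess, rss, r2)
```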
QSAR/QSPR Post-Qualifications • Confidence in Prediction (Predictive Error Sum of Squares)
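The predictive error sum of squares (PRESS) is commonly obtained by leave-one-out cross-validation: each point is predicted from a model fitted without it. The sketch below assumes that scheme and a simple linear model. The cross-validated statistic q2 = 1 - PRESS/TSS is a conventional companion quantity that is not named on the slide, and the data are hypothetical.

```python
import numpy as np

def press_loo(x, y):
    """Leave-one-out PRESS for a straight-line fit."""
    press = 0.0
    for i in range(len(x)):
        mask = np.arange(len(x)) != i                 # leave point i out
        slope, intercept = np.polyfit(x[mask], y[mask], 1)
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    return press

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])

press = press_loo(x, y)
tss = ((y - y.mean()) ** 2).sum()
q2 = 1.0 - press / tss                                # assumed definition

# Ordinary residual sum of squares, for comparison.
slope, intercept = np.polyfit(x, y, 1)
rss = ((y - (slope * x + intercept)) ** 2).sum()
print(press, rss, q2)
```

PRESS is larger than RSS because each prediction is made for a point the model has not seen, so it gives a more conservative picture of predictive ability.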
QSAR/QSPR Post-Qualification • Bias? • Bootstrapping • Choosing best model? • Cross Validation
QSAR/QSPR Post-Qualification • Bootstrapping • ASSUME calculated data is experimental/observed data • Randomly choose N data (allowing multiple picks of the same data) • Regenerate parameters/regression • Repeat M times • Average over M bootstraps • Compare (calculate residual) • If close to zero then no bias • If large then bias exists • M is typically 50-100
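The bootstrap loop above can be sketched as follows. This version resamples (x, y) pairs with replacement, which is one common reading of the procedure and is an assumption here; the data, random seed and choice of M = 100 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy straight line.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

# Parameters from the full data set: [slope, intercept].
original = np.polyfit(x, y, 1)

M = 100                                   # number of bootstrap samples
boot = np.empty((M, 2))
for m in range(M):
    # Choose N points at random, allowing repeated picks.
    idx = rng.integers(0, x.size, size=x.size)
    boot[m] = np.polyfit(x[idx], y[idx], 1)   # regenerate the regression

# Average over the M bootstraps and compare with the original parameters.
bias = boot.mean(axis=0) - original
print(original, boot.mean(axis=0), bias)
```

A bias near zero for each parameter suggests the fit is not being driven by a few influential points; the spread of the bootstrap values also gives a rough uncertainty for each coefficient.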