Molecular Modeling: Statistical Analysis of Complex Data

Molecular Modeling: Statistical Analysis of Complex Data C372 Dr. Kelsey Forsythe

Terminology • SAR (Structure-Activity Relationships) • Circa 19th century? • QSPR (Quantitative Structure-Property Relationships) • Relate structure to any physical-chemical property of a molecule • QSAR (Quantitative Structure-Activity Relationships) • Specific to some biological/pharmaceutical function of a molecule (Absorption, Distribution/Digestion, Metabolism, Excretion) • Crum-Brown and Fraser (1868-9) • ‘constitution’ related to biological response • LogP

Statistical Models • Simple • Mean, median and variation • Regression • Advanced • Validation methods • Principal components, covariance • Multiple Regression (QSAR, QSPR)

Modern QSAR • Hansch et al. (1963) • Activity → ‘travel through body’ → partitioning between varied solvents • C (minimum dosage required) • π (hydrophobicity) • σ (electronic) • Es (steric)
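
The four quantities on this slide are usually tied together in a linear free-energy relation. The form below is a sketch of the commonly quoted linear version, with π the hydrophobic, σ the electronic and E_s the steric term; the coefficient names k_1 to k_4 are notation assumed here (they are obtained by regression) and do not appear on the slide.

```latex
\log\frac{1}{C} = k_1\,\pi + k_2\,\sigma + k_3\,E_s + k_4
```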

Choosing Descriptors • Buffon’s Problem • Needle Length? • Needle Color? • Needle Composition? • Needle Sheen? • Needle Orientation?

Choosing Descriptors • Constitutional • MW, number of atoms • Topological • Connectivity, Wiener index • Electrostatic • Polarity, polarizability, partial charges • Geometrical Descriptors • Length, width, molecular volume • Quantum Chemical • HOMO and LUMO energies • Vibrational frequencies • Bond orders • Total energy

Choosing Descriptors • Constitutional • MW, number of atoms of each element • Topological • Connectivity, Wiener index (sums of bond distances) • 2D Fingerprints (bit-strings) • 3D topographical indices, pharmacophore keys • Electrostatic • Polarity, polarizability, partial charges • Geometrical Descriptors • Length, width, molecular volume
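
As an illustration of a topological descriptor, the Wiener index can be computed as the sum of shortest-path bond counts over all pairs of heavy atoms. The sketch below assumes hydrogen-suppressed graphs stored as adjacency dictionaries; the atom labels and the function name `wiener_index` are hypothetical.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all unordered atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives bond-count distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # each pair was counted from both ends


# Hydrogen-suppressed carbon skeletons (hypothetical atom labels 0..3).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers have the same molecular weight but different indices, which is the kind of structural distinction a constitutional descriptor alone cannot make.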

Choosing Descriptors • Chemical • Hydrophobicity (LogP) • HOMO and LUMO energies • Vibrational frequencies • Bond orders • Total energy • ΔG, ΔS, ΔH

Statistical Methods • 1-D analysis • Large dimension sets require decomposition techniques • Multiple Regression • PCA • PLS • Connecting a descriptor with a structural element so as to interpolate and extrapolate data

Simple Error Analysis (1-D) • Given N data points • Mean • Variance • Regression
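
A minimal sketch of the mean and variance for N data points. The data values are hypothetical, and the sample (N - 1) form of the variance is an assumption, since the slide does not say which denominator is intended.

```python
import statistics

# Hypothetical measurements, e.g. replicate logP values for one compound.
data = [2.1, 2.4, 1.9, 2.3, 2.3]

n = len(data)
mean = sum(data) / n
# Sample variance: squared deviations from the mean, divided by N - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
std_dev = variance ** 0.5

# Cross-check the hand-written formulas against the standard library.
assert abs(mean - statistics.mean(data)) < 1e-12
assert abs(variance - statistics.variance(data)) < 1e-12
print(mean, variance, std_dev)
```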

Simple Error Analysis (1-D) • Given N data points • Regression
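
For a straight-line fit, the least-squares slope and intercept have a closed form. The notation y = m x + b and the bar for the mean are assumptions made here; the slide's own equation did not survive.

```latex
m = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
```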

Simple Error Analysis (1-D) • Given N data points • (Poor) 0 < R² < 1 (Good)
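
A short sketch of a one-descriptor least-squares fit and its R², computed as one minus the ratio of residual to total sum of squares. The data and the function names `fit_line` and `r_squared` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = m*x + b."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    y_mean = sum(y) / len(y)
    rss = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - rss / tss


# Hypothetical descriptor (x) and activity (y) values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
r2 = r_squared(x, y, slope, intercept)
print(slope, intercept, r2)
```

An R² near 1 means the line reproduces the fitted data well; it says nothing yet about how well the model predicts new compounds.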

Correlation vs. Dependence? • Correlation • Two or more variables/descriptors may correlate to the same property of a system • Dependence • When the correlation can be shown to arise because a change in one variable causes a change in the other • Ex. Elephant's head and legs • Correlation exists between size of head and legs • The size of one does not depend on the size of the other

Quantitative Structure Activity/Property Relationships (QSAR, QSPR) • Discern relationships between multiple variables (descriptors) • Identify connections between structural traits (type of substituents, bond angles, substituent location) and descriptor values (e.g. activity, LogP, % denaturation)

Pre-Qualifications • Size • Minimum of FIVE samples per descriptor • Verification • Variance • Scaling • Correlations

QSAR/QSPR Pre-Qualifications • Variance • Coefficient of Variation
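
The coefficient of variation is the standard deviation divided by the mean, so it lets descriptors on different scales be compared for how much they actually vary across the data set. A minimal sketch, with hypothetical descriptor columns:

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (dimensionless)."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical descriptor columns for five molecules.
descriptors = {
    "mol_weight": [78.1, 92.1, 106.2, 120.2, 134.2],
    "n_rings": [1, 1, 1, 1, 1],  # constant: carries no information
}

for name, column in descriptors.items():
    print(name, round(coefficient_of_variation(column), 3))
```

A descriptor whose coefficient of variation is zero (or very small) cannot help distinguish the molecules and is a candidate for removal.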

QSAR/QSPR Pre-Qualifications • Scaling • Standardizing or normalizing descriptors to ensure they have equal weight (in terms of magnitude) in subsequent analysis

QSAR/QSPR Pre-Qualifications • Scaling • Unit Variance (Auto Scaling) • Ensures equal statistical weights (initially) • Mean Centering
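
Auto scaling combines the two operations named here: subtract the column mean, then divide by the column standard deviation. A minimal sketch with hypothetical descriptor values, assuming the sample standard deviation is used:

```python
import statistics


def autoscale(column):
    """Mean-center a descriptor column and scale it to unit variance."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)
    return [(value - mean) / std for value in column]


# Hypothetical descriptors on very different numeric scales.
mol_weight = [78.1, 92.1, 106.2, 120.2, 134.2]
log_p = [2.13, 2.73, 3.15, 3.66, 4.10]

scaled_mw = autoscale(mol_weight)
scaled_logp = autoscale(log_p)
print(scaled_mw)
print(scaled_logp)
```

After scaling, both columns have mean 0 and variance 1, so neither dominates a later PCA or regression simply because of its units.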

QSAR/QSPR Pre-Qualifications • Correlations • Remove correlated descriptors • Keep correlated descriptors so as to reduce data set size • Apply a mathematical operation to remove correlation (PCR)

QSAR/QSPR Pre-Qualifications • Correlations
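
One way to screen for correlated descriptors is to compute the Pearson correlation coefficient for every pair and flag those above a cutoff. The sketch below uses hypothetical descriptor values, and the 0.95 threshold is an assumption chosen for illustration, not a value from the slides.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical descriptor columns for five molecules.
descriptors = {
    "mol_weight": [78.1, 92.1, 106.2, 120.2, 134.2],
    "n_carbons": [6, 7, 8, 9, 10],
    "dipole": [0.0, 0.4, 0.6, 0.1, 0.3],
}

threshold = 0.95  # assumed cutoff for "highly correlated"
names = list(descriptors)
correlated_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson_r(descriptors[a], descriptors[b])) > threshold
]
print(correlated_pairs)
```

Here molecular weight and carbon count carry nearly the same information, so one of them could be dropped or the pair could be combined.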

QSAR/QSPR Scheme • Goal • Predict what happens next (extrapolate)! • Predict what happens between data points (interpolate)!

QSAR/QSPR Scheme • Types of Variables • Continuous • Concentration, occupied volume, partition coefficient, hydrophobicity • Discrete • Structural (1 = meta substituted, 0 = no meta substitution)

QSAR/QSPR-Principal Components Analysis • Reduces dimensionality of descriptors • Principal components are a set of vectors representing the variance in the original data

QSAR/QSPR-Principal Components Analysis • Geometric Analogy (3-D to 2-D PCA) • [Figure: x, y, z axes]

QSAR/QSPR-Principal Components Analysis • Formulate matrix • Diagonalize matrix • Eigenvectors are the principal components • These principal components (new descriptors) are a linear combination of the original descriptors • Eigenvalues represent variance • Largest accounts for greatest % of data variance • Next corresponds to second greatest and so on

QSAR/QSPR-Principal Components Analysis • Formulate matrix (Several types) • Correlation or covariance (N x P) • N is number of molecules • P is number of descriptors • Variance-Covariance matrix (N x N) • Diagonalize (Rotate) matrix

QSAR/QSPR-Principal Components Analysis • Eigenvectors (Loadings) • Represent the contribution from each original descriptor to a PC (new descriptor) • # columns = # of descriptors • # rows = # of descriptors OR # of molecules • Eigenvalues • Indicate which PC is most important (most representative of the original descriptors) • Benzene has 2 non-zero and 1 zero eigenvalue (planar)

QSAR/QSPR-Principal Components Analysis • Scores • Graphing each object/molecule in the space of 2 or more PCs • # rows = # of objects/molecules • # columns = # of descriptors OR # of molecules • For benzene, this corresponds to a graph in the PC1 (x’), PC2 (y’) coordinate system
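
The PCA steps (formulate the matrix, diagonalize it, read off loadings, eigenvalues and scores) can be sketched for the simplest case of two descriptors, where the covariance matrix is 2 x 2 and has closed-form eigenvalues. This assumes the covariance matrix is taken between descriptors; the data and the function name `pca_2d` are hypothetical.

```python
import math


def pca_2d(x, y):
    """PCA for two descriptors via the 2 x 2 covariance matrix."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    xc = [v - x_mean for v in x]  # mean-centered columns
    yc = [v - y_mean for v in y]
    # Covariance matrix [[a, b], [b, c]].
    a = sum(v * v for v in xc) / (n - 1)
    c = sum(v * v for v in yc) / (n - 1)
    b = sum(u * v for u, v in zip(xc, yc)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2 x 2 matrix, largest first.
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    eigenvalues = [half_trace + radius, half_trace - radius]
    # Loadings: unit eigenvector for PC1; PC2 is perpendicular to it.
    angle = 0.5 * math.atan2(2 * b, a - c)
    pc1 = (math.cos(angle), math.sin(angle))
    pc2 = (-math.sin(angle), math.cos(angle))
    # Scores: each molecule projected onto the new axes.
    scores = [(u * pc1[0] + v * pc1[1], u * pc2[0] + v * pc2[1])
              for u, v in zip(xc, yc)]
    return eigenvalues, (pc1, pc2), scores


# Two hypothetical, strongly correlated descriptors for five molecules.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

eigenvalues, loadings, scores = pca_2d(x, y)
explained = [ev / sum(eigenvalues) for ev in eigenvalues]
print(eigenvalues, explained)
```

Because the two descriptors are almost collinear, nearly all of the variance falls on PC1, which is the two-descriptor analogue of planar benzene having one zero eigenvalue. For more descriptors a numerical eigensolver would replace the closed form.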

QSAR-PCA • SYBYL (Tripos Inc.)

SYBYL (Tripos Inc.) 10D3D

SYBYL (Tripos Inc.) • Eigenvalues → Explanation of variance in data

SYBYL (Tripos Inc.) • Each point corresponds to a column (# points = # descriptors) in the original data • Proximity → correlation

SYBYL (Tripos Inc.) • Each point corresponds to a row of the original data (i.e. # points = # molecules), or a graph of molecules in PC space • Proximity → similarity • [Figure: He, H2O and naphthalene placed along a molecular-size axis; annotation "Small acting Big"]

SYBYL (Tripos Inc.) • Outlier

QSAR/QSPR-Regression Types • Principal Component Analysis

Non-Linear Mappings • Calculate “distance” between points in N-d descriptor/parameter space • Euclidean • City-block distances • Randomly assign compounds in set to points on a 2-D or 3-D space • Minimize the difference between the N-d distances and the 2-D distances (optimal N-d → 2-D plot)
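
The two distance measures and the quantity being minimized can be sketched as follows. The error function shown is a plain sum of squared distance differences, which is an assumption; the slide does not specify the exact form, and published non-linear mapping schemes typically add a weighting. All point coordinates are hypothetical.

```python
def euclidean(p, q):
    """Straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def city_block(p, q):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def mapping_error(high_d, low_d):
    """Sum over pairs of (N-d distance - low-d distance) squared."""
    error = 0.0
    for i in range(len(high_d)):
        for j in range(i + 1, len(high_d)):
            diff = euclidean(high_d[i], high_d[j]) - euclidean(low_d[i], low_d[j])
            error += diff ** 2
    return error


# Three hypothetical molecules in a 3-descriptor space.
high_d = [[0.0, 1.0, 2.0], [3.0, 5.0, 2.0], [0.0, 1.0, 4.0]]
# A trial 2-D placement; an optimizer would adjust these coordinates.
low_d = [[0.0, 0.0], [5.0, 0.0], [0.0, 2.0]]

print(euclidean(high_d[0], high_d[1]))   # 5.0
print(city_block(high_d[0], high_d[1]))  # 7.0
print(mapping_error(high_d, low_d))
```

Three points always fit in a plane, so this trial placement reproduces every distance and the error is zero; with more molecules the error is generally non-zero and is reduced iteratively from the starting guess.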

Non-Linear Mappings • Advantages • Non-linear • No assumptions! • Chance groupings unlikely (2D group likely an N-D group) • Disadvantages • Dependence on initial guess (Use PCA scores to improve)

QSAR/QSPR-Regression Types • Multiple Regression • PCR • PLS

QSAR/QSPR-Regression Types • Linear Regression • Minimize difference between calculated and observed values (residuals) • Multiple Regression
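
Multiple regression extends the same residual minimization to several descriptors at once. The sketch below solves the least-squares normal equations with a small Gaussian elimination routine so that it needs no external libraries; the helper names `solve` and `multiple_regression` and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def multiple_regression(rows, y):
    """Least-squares coefficients [b0, b1, ...] for y = b0 + sum(bj * xj)."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Toy data generated from y = 1 + 2*x1 - 0.5*x2 (two hypothetical descriptors).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [2.0, 4.5, 5.0, 7.5, 8.0, 10.5]

coefficients = multiple_regression(rows, y)
print(coefficients)
```

The normal equations become ill-conditioned when descriptors are strongly correlated, which is one motivation for the PCR and PLS variants.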

QSAR/QSPR-Regression Types • Principal Component Regression • Regression but with Principal Components substituted for original descriptors/variables

QSAR/QSPR-Regression Types • Partial Least Squares • Cross-validation determines number of descriptors/components to use • Derive equation • Use bootstrapping and t-test to test coefficients in QSAR regression

QSAR/QSPR-Regression Types • Partial Least Squares (a.k.a. Projection to Latent Structures) • Regression of a Regression • Provides insight into variation in x’s (b_{i,j}’s, as in PCA) AND y’s (a_i’s) • The t_i’s are orthogonal • M = (# of variables/descriptors OR # of observations/molecules, whichever is smaller)
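
The symbols on this slide are consistent with the usual two-level way of writing PLS, in which the activity is regressed on latent variables that are themselves linear combinations of the descriptors. The equations below are a reconstruction under that assumption (the slide's own equations did not survive), with P taken as the number of original descriptors.

```latex
y = \sum_{i=1}^{M} a_i\, t_i,
\qquad
t_i = \sum_{j=1}^{P} b_{i,j}\, x_j
```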

QSAR/QSPR-Regression Types • PLS is NOT MR or PCR in practice • PLS is MR with cross-validation • PLS is faster • It couples the target representation (QSAR generation) and component generation, while PCA and PCR keep them separate • PLS is well suited to multivariate problems

QSAR/QSPR Post-Qualifications • Confidence in Regression • TSS-Total Sum of Squares • ESS-Explained Sum of Squares • RSS-Residual Sum of Squares
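
In the standard definitions, with observed values y_i, fitted values ŷ_i and mean ȳ (notation assumed here), the three sums of squares and their link to R² are as follows. The decomposition TSS = ESS + RSS holds for a least-squares fit that includes an intercept.

```latex
\mathrm{TSS} = \sum_{i=1}^{N} (y_i - \bar{y})^2, \qquad
\mathrm{ESS} = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2, \qquad
\mathrm{RSS} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2

\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}, \qquad
R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
```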

QSAR/QSPR Post-Qualifications • Confidence in Prediction (Predictive Error Sum of Squares)
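
The predictive error sum of squares (PRESS) is commonly obtained by leave-one-out cross-validation: each point is left out in turn, the model is refit, and the left-out point is predicted. The sketch below assumes a straight-line model and the common cross-validated measure q² = 1 - PRESS/TSS; the data and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = m*x + b."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def press(x, y):
    """Leave-one-out predictive error sum of squares for a line model."""
    total = 0.0
    for i in range(len(x)):
        # Refit without point i, then predict point i.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (slope * x[i] + intercept)) ** 2
    return total


# Hypothetical descriptor (x) and activity (y) values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

y_mean = sum(y) / len(y)
tss = sum((yi - y_mean) ** 2 for yi in y)
press_value = press(x, y)
q2 = 1.0 - press_value / tss
print(press_value, q2)
```

PRESS is never smaller than the residual sum of squares of the full fit, so q² is a more conservative figure than R².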

QSAR/QSPR Post-Qualification • Bias? • Bootstrapping • Choosing best model? • Cross Validation

QSAR/QSPR Post-Qualification • Bootstrapping • ASSUME calculated data is experimental/observed data • Randomly choose N data points (allowing multiple picks of the same point) • Regenerate parameters/regression • Repeat M times • Average over M bootstraps • Compare with the original parameters (calculate residual) • If close to zero then no bias • If large then bias exists • M is typically 50-100
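
The bootstrap steps above can be sketched for a straight-line model. This is one reading of the procedure: (x, y) pairs are resampled with replacement, the slope is refit each time, and the average bootstrap slope is compared with the original. The data, the function names and the fixed random seed are assumptions made for a reproducible illustration.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = m*x + b."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def bootstrap_slope_bias(x, y, n_boot=100, seed=0):
    """Return (original slope, mean bootstrap slope, their difference)."""
    rng = random.Random(seed)
    n = len(x)
    slope, _ = fit_line(x, y)
    boot_slopes = []
    while len(boot_slopes) < n_boot:
        # Draw N indices with replacement; the same point may repeat.
        picks = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in picks]
        ys = [y[i] for i in picks]
        if len(set(xs)) < 2:
            continue  # a line cannot be fit to a single distinct x value
        boot_slopes.append(fit_line(xs, ys)[0])
    mean_slope = sum(boot_slopes) / n_boot
    return slope, mean_slope, mean_slope - slope


# Hypothetical descriptor (x) and activity (y) values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, mean_slope, bias = bootstrap_slope_bias(x, y, n_boot=100)
print(slope, mean_slope, bias)
```

A difference close to zero suggests the fitted slope is not biased by any particular data point; the same comparison can be made for the intercept or for each coefficient of a multiple regression.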