Molecular Modeling: Statistical Analysis of Complex Data

Molecular Modeling: Statistical Analysis of Complex Data. C372, Dr. Kelsey Forsythe. Terminology: SAR (Structure-Activity Relationships), circa 19th century? QSPR (Quantitative Structure Property Relationships): relate structure to any physical-chemical property of a molecule.

Presentation Transcript


  1. Molecular Modeling: Statistical Analysis of Complex Data C372 Dr. Kelsey Forsythe

  2. Terminology • SAR (Structure-Activity Relationships) • Circa 19th century? • QSPR (Quantitative Structure Property Relationships) • Relate structure to any physical-chemical property of molecule • QSAR (Quantitative Structure Activity Relationships) • Specific to some biological/pharmaceutical function of molecule (Absorption, Distribution/Digestion, Metabolism, Excretion) • Crum Brown and Fraser (1868-9) • ‘constitution’ related to biological response • LogP

  3. Statistical Models • Simple • Mean, median and variation • Regression • Advanced • Validation methods • Principal components, co-variance • Multiple Regression QSAR,QSPR

  4. Modern QSAR • Hansch et al. (1963) • Activity ∝ ‘travel through body’ ∝ partitioning between varied solvents • C (minimum dosage required) • π (hydrophobicity) • σ (electronic) • Es (steric)

  5. Choosing Descriptors • Buffon’s Problem • Needle Length? • Needle Color? • Needle Composition? • Needle Sheen? • Needle Orientation?

  6. Choosing Descriptors • Constitutional • MW, N atoms • Topological • Connectivity, Wiener index • Electrostatic • Polarity, polarizability, partial charges • Geometrical Descriptors • Length, width, Molecular volume • Quantum Chemical • HOMO and LUMO energies • Vibrational frequencies • Bond orders • Energy total

  7. Choosing Descriptors • Constitutional • MW, N atoms of element • Topological • Connectivity, Wiener index (sums of bond distances) • 2D Fingerprints (bit-strings) • 3D topographical indices, pharmacophore keys • Electrostatic • Polarity, polarizability, partial charges • Geometrical Descriptors • Length, width, Molecular volume

  8. Choosing Descriptors • Chemical • Hydrophobicity (LogP) • HOMO and LUMO energies • Vibrational frequencies • Bond orders • Energy total • ΔG, ΔS, ΔH

  9. Statistical Methods • 1-D analysis • Large dimension sets require decomposition techniques • Multiple Regression • PCA • PLS • Connecting a descriptor with a structural element so as to interpolate and extrapolate data

  10. Simple Error Analysis(1-D) • Given N data points • Mean • Variance • Regression

  11. Simple Error Analysis(1-D) • Given N data points • Regression

  12. Simple Error Analysis (1-D) • Given N data points • (Poor) 0 < R² < 1 (Good)
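
The formulas on slides 10-12 did not survive in the transcript. As a minimal sketch, assuming the standard textbook definitions (sample variance with N-1 in the denominator, ordinary least-squares line, R² as one minus the ratio of residual to total sum of squares), the quantities can be computed as follows. The data values are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

# Hypothetical data: x is a single descriptor, y is the measured property.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Mean and sample variance of y
mean_y = y.sum() / n
var_y = ((y - mean_y) ** 2).sum() / (n - 1)

# Least-squares line y = slope * x + intercept
mean_x = x.mean()
slope = ((x - mean_x) * (y - mean_y)).sum() / ((x - mean_x) ** 2).sum()
intercept = mean_y - slope * mean_x

# R^2 compares the residual scatter with the total scatter about the mean
y_calc = slope * x + intercept
rss = ((y - y_calc) ** 2).sum()
tss = ((y - mean_y) ** 2).sum()
r2 = 1.0 - rss / tss

print(f"mean={mean_y:.3f} variance={var_y:.3f} slope={slope:.3f} R2={r2:.4f}")
```

An R² near 1 means the line accounts for most of the variation in y; an R² near 0 means it explains almost none.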

  13. Correlation vs. Dependence? • Correlation • Two or more variables/descriptors may correlate with the same property of a system • Dependence • When the correlation can be shown to arise from one variable changing in response to a change in another • Ex. Elephant’s head and legs • Correlation exists between size of head and legs • The size of one does not depend on the size of the other

  14. Quantitative Structure Activity/Property Relationships (QSAR, QSPR) • Discern relationships between multiple variables (descriptors) • Identify connections between structural traits (type of substituents, bond angles, substituent locale) and descriptor values (e.g. activity, LogP, % denaturation)

  15. Pre-Qualifications • Size • Minimum of FIVE samples per descriptor • Verification • Variance • Scaling • Correlations

  16. QSAR/QSPR Pre-Qualifications • Variance • Coefficient of Variation
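
The slide's formula is missing from the transcript. Assuming the usual definition, the coefficient of variation of a descriptor is its sample standard deviation divided by its mean, with $x_i$ the descriptor value for molecule $i$ and $N$ the number of molecules:

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}, \qquad
\mathrm{CV} = \frac{s}{\bar{x}}
```

A descriptor whose CV is close to zero barely changes across the data set and therefore carries little information for a regression.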

  17. QSAR/QSPR Pre-Qualifications • Scaling • Standardizing or normalizing descriptors to ensure they have equal weight (in terms of magnitude) in subsequent analysis

  18. QSAR/QSPR Pre-Qualifications • Scaling • Unit Variance (Auto Scaling) • Ensures equal statistical weights (initially) • Mean Centering
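
The two scaling steps can be sketched with NumPy. The descriptor matrix below is hypothetical (rows are molecules, columns are descriptors on very different numerical scales), and the choice of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

# Hypothetical descriptor matrix: 5 molecules x 3 descriptors
X = np.array([[ 78.1,  2.13, 0.0],
              [ 92.1,  2.73, 0.4],
              [128.2,  3.30, 0.0],
              [ 94.1,  1.46, 1.2],
              [ 46.1, -0.31, 1.7]])

col_mean = X.mean(axis=0)
col_std = X.std(axis=0, ddof=1)   # sample standard deviation per descriptor

X_centered = X - col_mean         # mean centering: each column now averages to 0
X_auto = X_centered / col_std     # autoscaling: each column now has unit variance

print(X_auto.round(3))
```

After autoscaling, no descriptor dominates a later analysis simply because it is expressed in larger numbers.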

  19. QSAR/QSPR Pre-Qualifications • Correlations • Remove correlated descriptors • Keep correlated descriptors so as to reduce data set size • Apply math operation to remove correlation (PCR)

  20. QSAR/QSPR Pre-Qualifications • Correlations
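
One way to act on the "remove correlated descriptors" option is a greedy filter on the correlation matrix. This is a sketch only: the function name `drop_correlated`, the 0.9 threshold, and the synthetic data are all assumptions made for illustration, not part of the original slides.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Return indices of columns kept after greedily dropping any descriptor
    whose |r| with an already-kept descriptor exceeds the threshold."""
    r = np.corrcoef(X, rowvar=False)   # P x P matrix of Pearson correlations
    keep = []
    for j in range(X.shape[1]):
        if all(abs(r[j, k]) <= threshold for k in keep):
            keep.append(j)
    return keep

# Synthetic example: column 1 is almost a rescaled copy of column 0
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
X = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=50), b])

kept = drop_correlated(X)
print("columns kept:", kept)
```

The filter keeps the first descriptor of each correlated group, so the column order affects which one survives.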

  21. QSAR/QSPR Scheme • Goal • Predict what happens next (extrapolate)! • Predict what happens between data points (interpolate)!

  22. QSAR/QSPR Scheme • Types of Variable • Continuous • Concentration, occupied volume, partition coefficient, hydrophobicity • Discrete • Structural (1-meta substituted, 0-no meta substitution)

  23. QSAR/QSPR-Principal Components Analysis • Reduces dimensionality of descriptors • Principal components are a set of vectors representing the variance in the original data

  24. QSAR/QSPR-Principal Components Analysis • Geometric Analogy (3-D to 2-D PCA) • [Figure: x, y, z axes]

  25. QSAR/QSPR-Principal Components Analysis • Formulate matrix • Diagonalize matrix • Eigenvectors are the principal components • These principal components (new descriptors) are a linear combination of the original descriptors • Eigenvalues represent variance • Largest accounts for greatest % of data variance • Next corresponds to second greatest and so on

  26. QSAR/QSPR-Principal Components Analysis • Formulate matrix (Several types) • Correlation or covariance (N x P) • N is number of molecules • P is number of descriptors • Variance-Covariance matrix (N x N) • Diagonalize (Rotate) matrix

  27. QSAR/QSPR-Principal Components Analysis • Eigenvectors (Loadings) • Represents contribution from each original descriptor to PC (new descriptor) • # columns = # of descriptors • # rows = # of descriptors OR # of molecules • Eigenvalues • Indicate which PC most important (representative of original descriptors) • Benzene has 2 non-zero and 1 zero eigenvalue (planar)

  28. QSAR/QSPR-Principal Components Analysis • Scores • Graphing each object/molecule in space of 2 or more PCs • # rows = # of objects/molecules • # columns = # of descriptors OR # of molecules • For benzene, corresponds to graph in PC1 (x’) and PC2 (y’) system
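
The formulate/diagonalize/project sequence of slides 25-28 can be sketched with NumPy. This is an illustration under stated assumptions: the matrix diagonalized here is the P x P variance-covariance matrix of the mean-centered columns, and the "benzene" input is an idealized planar hexagon of radius 1.39 with the three Cartesian coordinates playing the role of descriptors. The function name `pca` is hypothetical.

```python
import numpy as np

def pca(X):
    """PCA by diagonalizing the covariance matrix of the columns of X."""
    Xc = X - X.mean(axis=0)                  # mean-center each column
    cov = np.cov(Xc, rowvar=False)           # P x P variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # largest variance first
    eigvals, loadings = eigvals[order], eigvecs[:, order]
    scores = Xc @ loadings                   # each row expressed in PC space
    return eigvals, loadings, scores

# Idealized benzene ring: six points on a hexagon, tilted 30 degrees about
# the x axis so that no Cartesian coordinate is trivially zero.
angles = np.deg2rad(np.arange(6) * 60.0)
ring = np.column_stack([1.39 * np.cos(angles), 1.39 * np.sin(angles), np.zeros(6)])
c, s = np.cos(np.pi / 6), np.sin(np.pi / 6)
tilt = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
coords = ring @ tilt.T

eigvals, loadings, scores = pca(coords)
explained = eigvals / eigvals.sum()          # fraction of variance per PC
print("eigenvalues:", eigvals.round(6))
print("explained variance:", explained.round(6))
```

Because the ring is planar, the third eigenvalue is zero to within rounding, and the first two score columns reproduce the ring in the PC1/PC2 plane.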

  29. QSAR-PCA SYBYL (Tripos Inc.)

  30. SYBYL (Tripos Inc.)

  31. SYBYL (Tripos Inc.) 10D → 3D

  32. SYBYL (Tripos Inc.) • Eigenvalues ∝ Explanation of variance in data

  33. SYBYL (Tripos Inc.) • Each point corresponds to a column (# points = # descriptors) in original data • Proximity ∝ correlation

  34. SYBYL (Tripos Inc.) • Each point corresponds to a row of original data (i.e. # points = # molecules), or graph of molecules in PC space • Proximity ∝ similarity • [Plot labels: He, H2O, Naphthalene; axis: Molecular Size; annotation: Small acting Big]

  35. SYBYL (Tripos Inc.) Outlier

  36. SYBYL (Tripos Inc.)

  37. QSAR/QSPR-Regression Types • Principal Component Analysis

  38. QSAR/QSPR-Regression Types • Principal Component Analysis

  39. Non-Linear Mappings • Calculate “distance” between points in N-d descriptor/parameter space • Euclidean • City-block distances • Randomly assign compounds in set to points on a 2-D or 3-D space • Minimize Difference (Optimal N-d → 2D plot)

  40. Non-Linear Mappings • Advantages • Non-linear • No assumptions! • Chance groupings unlikely (2D group likely an N-D group) • Disadvantages • Dependence on initial guess (Use PCA scores to improve)
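
The procedure on slides 39-40 can be sketched as a small gradient-descent routine. The slides do not specify the objective, so this sketch assumes the simplest one, the sum of squared differences between N-d and 2-D Euclidean distances; the function names, step size, and step-halving rule are likewise illustrative choices.

```python
import numpy as np

def pairwise(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(D, Y):
    """Sum over pairs of (N-d distance - low-d distance)^2."""
    d = pairwise(Y)
    iu = np.triu_indices_from(D, k=1)
    return ((D[iu] - d[iu]) ** 2).sum()

def nonlinear_map(X, n_iter=500, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    D = pairwise(X)                           # distances in N-d descriptor space
    Y = rng.normal(size=(X.shape[0], 2))      # random initial 2-D placement
    s = stress(D, Y)
    history = [s]
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(d, 1.0)              # avoid 0/0; diff is 0 on the diagonal
        d = np.maximum(d, 1e-12)
        grad = 2.0 * (((d - D) / d)[:, :, None] * diff).sum(axis=1)
        Y_new = Y - lr * grad
        s_new = stress(D, Y_new)
        if s_new < s:                         # accept only improving steps
            Y, s = Y_new, s_new
        else:
            lr *= 0.5                         # otherwise shrink the step
        history.append(s)
    return Y, history

# Hypothetical data: 15 compounds described by 5 descriptors
X = np.random.default_rng(1).normal(size=(15, 5))
Y, history = nonlinear_map(X)
print(f"stress: {history[0]:.2f} -> {history[-1]:.2f}")
```

Replacing the random initial `Y` with the first two PCA score columns is the improvement suggested on slide 40 for reducing the dependence on the initial guess.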

  41. QSAR/QSPR-Regression Types • Multiple Regression • PCR • PLS

  42. QSAR/QSPR-Regression Types • Linear Regression • Minimize difference between calculated and observed values (residuals) • Multiple Regression
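
A minimal multiple-regression sketch using `numpy.linalg.lstsq`, which finds the coefficients that minimize the sum of squared residuals. The data are synthetic, generated from known coefficients so the fit can be checked; 20 molecules for 3 descriptors respects the five-samples-per-descriptor guideline from slide 15.

```python
import numpy as np

# Synthetic data: 20 molecules, 3 descriptors, known coefficients plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y = 0.7 + X @ true_coef + 0.05 * rng.normal(size=20)

# Add a column of ones so the model includes an intercept
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - A @ coef                      # observed minus calculated
print("intercept and coefficients:", coef.round(3))
print("residual sum of squares:", (residuals ** 2).sum().round(4))
```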

  43. QSAR/QSPR-Regression Types • Principal Component Regression • Regression but with Principal Components substituted for original descriptors/variables
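
Principal component regression can be sketched by computing the principal components from a singular value decomposition of the centered descriptors, regressing the property on the leading scores, and mapping the result back to the original descriptors. The function name `pcr_fit` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal-component scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # SVD of the centered data: rows of Vt are the principal axes
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                       # P x k loadings
    T = Xc @ V                                    # N x k scores (mutually orthogonal)
    gamma = (T.T @ yc) / (s[:n_components] ** 2)  # one coefficient per component
    beta = V @ gamma                              # coefficients on original descriptors
    intercept = y_mean - x_mean @ beta
    return beta, intercept

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = 0.7 + X @ np.array([1.5, -2.0, 0.5]) + 0.05 * rng.normal(size=20)

beta_full, b0_full = pcr_fit(X, y, n_components=3)   # all components kept
beta_2, b0_2 = pcr_fit(X, y, n_components=2)         # reduced model
print("all components:", beta_full.round(3), "two components:", beta_2.round(3))
```

With every component retained, the result coincides with ordinary multiple regression; dropping components trades some fit for fewer, uncorrelated variables.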

  44. QSAR/QSPR-Regression Types • Partial Least Squares • Cross-validation determines number of descriptors/components to use • Derive equation • Use bootstrapping and t-test to test coefficients in QSAR regression

  45. QSAR/QSPR-Regression Types • Partial Least Squares (a.k.a. Projection to Latent Structures) • Regression of a Regression • Provides insight into variation in x’s (b_ij’s, as in PCA) AND y’s (a_i’s) • The t_i’s are orthogonal • M = (# of variables/descriptors OR # of observations/molecules, whichever is smaller)

  46. QSAR/QSPR-Regression Types • PLS is NOT MR or PCR in practice • PLS is MR w/cross-validation • PLS is faster • It couples the target representation (QSAR generation) and component generation, while PCA and PCR are separate • PLS is well suited to multivariate problems
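
For a single response, PLS can be sketched with the NIPALS-style PLS1 algorithm below. The slides do not say which algorithm SYBYL uses, so this is only one common formulation; the function name `pls1` and the synthetic data are assumptions. Each component's weight vector is built from the covariance of the descriptors with the property, which is how component generation and the regression target are coupled.

```python
import numpy as np

def pls1(X, y, n_components):
    """PLS regression for a single response (NIPALS-style PLS1)."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    n, p = Xk.shape
    W = np.zeros((p, n_components))       # weights
    P = np.zeros((p, n_components))       # X loadings
    T = np.zeros((n, n_components))       # scores (the t_i vectors)
    q = np.zeros(n_components)            # y loadings
    for a in range(n_components):
        w = Xk.T @ yk                     # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        P[:, a] = Xk.T @ t / tt
        q[a] = yk @ t / tt
        Xk = Xk - np.outer(t, P[:, a])    # deflate X
        yk = yk - q[a] * t                # deflate y
        W[:, a], T[:, a] = w, t
    beta = W @ np.linalg.solve(P.T @ W, q)   # coefficients on centered descriptors
    return beta, T

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = 0.7 + X @ np.array([1.5, -2.0, 0.5]) + 0.05 * rng.normal(size=20)

beta, T = pls1(X, y, n_components=3)
gram = T.T @ T                            # off-diagonal entries should vanish
print("coefficients:", beta.round(3))
```

The off-diagonal entries of `gram` are zero to within rounding, which is the orthogonality of the t_i's noted on slide 45. In practice the number of components would be chosen by cross-validation rather than fixed at 3.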

  47. QSAR/QSPR Post-Qualifications • Confidence in Regression • TSS-Total Sum of Squares • ESS-Explained Sum of Squares • RSS-Residual Sum of Squares
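
The slide's formulas are missing from the transcript. Assuming the standard definitions, with $y_i$ the observed value, $\hat{y}_i$ the value calculated from the regression, and $\bar{y}$ the mean of the observed values, the three sums of squares are:

```latex
\mathrm{TSS} = \sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2, \qquad
\mathrm{ESS} = \sum_{i=1}^{N} \left(\hat{y}_i - \bar{y}\right)^2, \qquad
\mathrm{RSS} = \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2

\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}
\quad\Longrightarrow\quad
R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
```

The decomposition TSS = ESS + RSS holds for a least-squares fit that includes an intercept, and it gives the R² scale from slide 12, where values near 0 are poor and values near 1 are good.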

  48. QSAR/QSPR Post-Qualifications • Confidence in Prediction (Predictive Error Sum of Squares)
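
A common way to compute the predictive error sum of squares (PRESS) is leave-one-out cross-validation: each molecule is predicted by a model fitted without it. The sketch below assumes that form and a multiple-regression model; the function name `press_loo`, the cross-validated statistic called `q2` here, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def press_loo(X, y):
    """Leave-one-out PRESS for a least-squares model with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i                     # leave molecule i out
        coef, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2           # error predicting molecule i
    return press

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=20)

# Ordinary fit on all data, for comparison
A = np.column_stack([np.ones(20), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
rss = ((y - A @ coef) ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()
r2 = 1.0 - rss / tss

press = press_loo(X, y)
q2 = 1.0 - press / tss                               # cross-validated analogue of R^2
print(f"R2={r2:.4f}  q2={q2:.4f}  RSS={rss:.4f}  PRESS={press:.4f}")
```

PRESS is never smaller than RSS for a least-squares fit, so the cross-validated statistic is at most R²; a large gap between the two suggests overfitting.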

  49. QSAR/QSPR Post-Qualification • Bias? • Bootstrapping • Choosing best model? • Cross Validation

  50. QSAR/QSPR Post-Qualification • Bootstrapping • ASSUME calculated data is experimental/observed data • Randomly choose N data (allowing multiple picks of the same data) • Regenerate parameters/regression • Repeat M times • Average over M bootstraps • Compare (calculate residual) • If close to zero then no bias • If large then bias exists • M is typically 50-100
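
The resample, refit, average, and compare loop can be sketched as follows. This sketch resamples (descriptor, property) pairs with replacement and refits a multiple regression each time, which is one common variant and may differ in detail from the procedure the slide had in mind. The function names and synthetic data are assumptions; `M=100` follows the slide's typical range.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def bootstrap_bias(X, y, M=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    coef_full = fit(X, y)                      # parameters from the full data set
    boot = np.empty((M, len(coef_full)))
    for m in range(M):
        idx = rng.integers(0, n, size=n)       # N picks, repeats allowed
        boot[m] = fit(X[idx], y[idx])          # regenerate the regression
    bias = boot.mean(axis=0) - coef_full       # average over M bootstraps, compare
    return coef_full, boot, bias

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = 0.7 + X @ np.array([1.5, -2.0, 0.5]) + 0.05 * rng.normal(size=20)

coef_full, boot, bias = bootstrap_bias(X, y, M=100)
print("full-data coefficients:", coef_full.round(3))
print("bootstrap bias estimate:", bias.round(4))
```

Bias estimates close to zero indicate that the fitted parameters are not being driven by a few individual data points.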
