
DEFINING MULTIVARIATE CALIBRATION MODEL COMPLEXITY FOR MODEL SELECTION AND COMPARISON



Presentation Transcript


  1. DEFINING MULTIVARIATE CALIBRATION MODEL COMPLEXITY FOR MODEL SELECTION AND COMPARISON • John Kalivas, Department of Chemistry, Idaho State University, Pocatello, Idaho

  2. MULTIVARIATE CALIBRATION MODEL • y = Xb + e • y (m × 1) • quantitative information of prediction property for m samples • X (m × p) • respective values for p predictor variables (wavelengths for spectral data) • b (p × 1) • unknown regression coefficients • e (m × 1) • errors with mean zero and covariance σ²I

  3. REGRESSION VECTOR SOLUTION • MLR solution, b̂ = (XᵀX)⁻¹Xᵀy, requires m ≥ p (variable selection) and nearly orthogonal X • Biased regression methods require selection of meta-parameter(s)

  4. BIASED MODELING METHODS • PLS • PCR • Ridge regression (RR) • Generalized RR • Cyclic subspace regression • Continuum regression • Ridge PCR and PLS • Generalized ridge PCR and PLS • Etc.

  5. GENERIC EXPRESSION • where k = rank(X) ≤ min(m, p)
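A common way to write a generic expression of this kind uses the singular value decomposition of X. The notation below is an assumption, not taken from the slide: σ_i are the singular values of X, u_i and v_i the corresponding left and right singular vectors, and f_i the filter values discussed on the following slides.

```latex
\hat{\mathbf{b}} \;=\; \sum_{i=1}^{k} f_i \,\frac{\mathbf{u}_i^{T}\mathbf{y}}{\sigma_i}\,\mathbf{v}_i ,
\qquad k = \operatorname{rank}(\mathbf{X}) \le \min(m, p)
```

With every f_i = 1 this reduces to the minimum-norm least-squares solution; the biased methods differ only in how they choose the f_i.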

  6. PCR, RR, AND PLS FILTER VALUES • PCR: fi = 1 for retained basis vectors and fi = 0 for deleted basis vectors • RR: 0 ≤ fi ≤ 1 depending on ridge value • PLS: 0 ≤ fi < ∞ depending on PLS factor model
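The PCR and RR filter values above can be tried out numerically. The following is a minimal sketch, assuming the filter-factor form b = Σ f_i (u_iᵀy / σ_i) v_i from the SVD of X and the ridge convention in which the ridge value λ is added to the diagonal of XᵀX. The function names (`filtered_solution`, `pcr_filters`, `rr_filters`) and the random data are hypothetical, chosen only for illustration.

```python
import numpy as np

def filtered_solution(U, s, Vt, y, f):
    """Regression vector b = sum_i f_i * (u_i^T y / s_i) * v_i."""
    return Vt.T @ (f * (U.T @ y) / s)

def pcr_filters(k, d):
    """PCR: f_i = 1 for the d retained basis vectors, 0 for the deleted ones."""
    f = np.zeros(k)
    f[:d] = 1.0
    return f

def rr_filters(s, lam):
    """RR: f_i = s_i^2 / (s_i^2 + lam), always between 0 and 1 for lam >= 0."""
    return s**2 / (s**2 + lam)

# Illustrative random data (not from the talk): m = 20 samples, p = 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Ridge regression through filter values versus the closed-form solution.
b_rr = filtered_solution(U, s, Vt, y, rr_filters(s, 2.0))
b_direct = np.linalg.solve(X.T @ X + 2.0 * np.eye(5), X.T @ y)

# PCR keeping all k = 5 basis vectors is ordinary least squares.
b_pcr_full = filtered_solution(U, s, Vt, y, pcr_filters(5, 5))

# PCR keeping only the first 2 basis vectors.
b_pcr_2 = filtered_solution(U, s, Vt, y, pcr_filters(5, 2))
```

Writing every method as a set of filter values on the same basis vectors is what makes a common complexity measure possible later in the talk.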

  7. RR AND PLS FILTER VALUES • RR • PLS • θj (j = 1, ..., d) are the eigenvalues of XᵀX restricted to the Krylov subspace
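For reference, commonly cited forms of these filter values are sketched below. The notation is an assumption: σ_i are the singular values of X, λ is the ridge value added to the diagonal of XᵀX, d is the number of PLS factors, and θ_j are the d eigenvalues of XᵀX restricted to the Krylov subspace (the Ritz values).

```latex
\text{RR:}\quad f_i = \frac{\sigma_i^{2}}{\sigma_i^{2} + \lambda}
\qquad\qquad
\text{PLS:}\quad f_i^{(d)} = 1 - \prod_{j=1}^{d}\left(1 - \frac{\sigma_i^{2}}{\theta_j}\right)
```

The PLS form makes it clear why its filter values are not bounded by 1: a factor (1 − σ_i²/θ_j) can be negative when σ_i² > θ_j.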

  8. A CALIBRATION ASSESSMENT PROBLEM • s² (σ²) is estimated by MSEC • Need degrees of freedom or fitting degrees of freedom (df) • df = p for the particular MLR model • requires m > p
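The dependence on df can be made explicit. A usual form of the estimate, assuming no separate intercept or mean-centering term is counted in df, is:

```latex
s^{2} = \mathrm{MSEC} = \frac{\sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^{2}}{m - \mathrm{df}}
```

For MLR with df = p the denominator is m − p, which is why m > p is required; for biased methods the question is what to put in place of df.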

  9. MORE df PROBLEMS • df = d, the number of factors (basis vectors) for PCR and PLS • but models can be represented in any basis set • the same model in different basis sets requires different number of basis vectors • RR and others are not factor based and/or use multiple meta-parameters

  10. ANOTHER CALIBRATION ASSESSMENT PROBLEM • Useful to plot results for different modeling methods on one plot • Example: a plot of RMSEV against number of factors (basis vectors) is possible for PCR and PLS • RR cannot be included in plot • still have improper comparison of PCR and PLS as factors are in different basis sets

  11. NECESSITY • Effective rank (ER) for inter-model comparison of b̂ from y = Xb, where b̂ is from factor-based methods such as PCR or PLS, non-factor-based methods such as RR, and/or methods based on multiple meta-parameters • smaller ER, more parsimonious model?

  12. SOLUTIONS • Develop ER in a common basis set using information on how the basis vectors are used • Develop ER that is basis set independent

  13. COMMON BASIS SET • Use filter values (fi) in eigenvector basis set V • fER, first expression: Gilliam, et al., Inverse Problems, 6 (1990) 725 • fER, second expression: Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998)
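As an illustration of turning filter values into a single complexity number, the sketch below uses one widely used summary, the sum of the filter values, which equals the trace of the matrix mapping y to the fitted values. Whether this matches either of the two cited expressions is an assumption; the function name `effective_rank_sum` and the singular values are hypothetical.

```python
import numpy as np

def effective_rank_sum(f):
    """Sum of filter values: one common filter-based complexity measure."""
    return float(np.sum(f))

# Hypothetical singular values of X, for illustration only.
s = np.array([10.0, 3.0, 1.0, 0.1])

# PCR retaining the first two basis vectors: filters are 1, 1, 0, 0.
f_pcr = np.array([1.0, 1.0, 0.0, 0.0])

# RR with ridge value 1.0 (added to the diagonal of X^T X).
f_rr = s**2 / (s**2 + 1.0)

er_pcr = effective_rank_sum(f_pcr)
er_rr = effective_rank_sum(f_rr)
```

For PCR the measure simply recovers the number of retained basis vectors, while for RR it gives a non-integer value that can be placed on the same axis.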

  14. BASIS SET INDEPENDENT • hi = change in the fitted value ŷi depending on the change in the observed value yi • the larger the hi, the more ŷi will change if yi changes (fluctuations around the expected value due to random noise) • Add normally distributed noise δ to y, N times • Obtain ŷ vectors from models with perturbed y • Calculate hi for the ith sample (sensitivity of a fitted value to perturbation in the respective observed value) as the slope of the regression of the perturbed fitted values ŷi on the perturbations δi • Ye, Journal American Statistical Association, 93 (1998) 120
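The perturbation steps above can be sketched as a small Monte Carlo routine. This is a minimal illustration under stated assumptions, not the procedure used in the talk: ridge regression stands in for the model, the noise scale `tau` and repetition count `n_rep` are arbitrary, and all names (`ridge_fit_predict`, `perturbation_df`) and the data are hypothetical. Summing the hi gives a degrees-of-freedom-like number in the spirit of Ye's generalized degrees of freedom.

```python
import numpy as np

def ridge_fit_predict(X, y, lam):
    """Fit ridge regression on (X, y) and return the fitted values."""
    b = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return X @ b

def perturbation_df(X, y, fit_predict, n_rep=500, tau=0.5, seed=0):
    """Estimate h_i as the slope of fitted value i against perturbation i."""
    rng = np.random.default_rng(seed)
    m = len(y)
    # N perturbation vectors delta, one row per repetition.
    deltas = rng.normal(scale=tau, size=(n_rep, m))
    # Fitted values from the model refit on each perturbed y.
    fitted = np.array([fit_predict(X, y + d) for d in deltas])
    # Per-sample least-squares slope of fitted_i on delta_i.
    dc = deltas - deltas.mean(axis=0)
    fc = fitted - fitted.mean(axis=0)
    h = (dc * fc).sum(axis=0) / (dc**2).sum(axis=0)
    return h, float(h.sum())

# Illustrative random data: m = 30 samples, p = 5 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
lam = 5.0

h, gdf = perturbation_df(X, y, lambda A, t: ridge_fit_predict(A, t, lam))
```

Because the routine only needs a function that maps (X, y) to fitted values, the same code applies to PCR, PLS, or any other method, which is what makes the measure basis set independent. For a linear method such as ridge regression the estimate should be close to the trace of the hat matrix.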

  15. BASIS SET INDEPENDENT • Van der Voet, Journal Chemometrics, 13 (1999) 111 • VDVER is based on error estimates which contain error

  16. BASIS SET INDEPENDENT • Know b̂ = Vβ = Tδ = Iγ for eigenvector basis set V, PLS basis set T, and std. basis set I with β, δ, and γ being respective weight vectors for a model in that basis set • Eigenvector basis set: • PLS basis set: • Std. basis set:

  17. DATA SETS • CARBONIC ANHYDRASE (CA) INHIBITORS: CA contributes to production of eye humor, excess secretion of which causes permanent damage and diseases (macular edema and open-angle glaucoma). • 142 compounds assayed for inhibition of CA isozymes CA I, CA II, & CA IV. Inhibition values Log(Ki) modeled with 63 (full) & 8 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Chem. Inf. Comput. Sci., 42 (2002) 94) • DIHYDROFOLATE REDUCTASE (DHFR) INHIBITORS: inhibition of DHFR is important in combating diseases from the pathogens Pneumocystis carinii (pc) and Toxoplasma gondii (tg) in unhealthy immune systems • 334, 320, & 340 compounds assayed for inhibition of (pc) DHFR, (tg) DHFR, & mammalian standard rlDHFR. Log of 50% inhibition concentration values (IC50) modeled with 84, 83, & 84 (full) & 10 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Mol. Graphics Modeling, 21 (2002) 391)

  18. CA IV (FULL): PLS & PCR RMSEC (df = d) AGAINST d

  19. PCR (df = d = fER), PLS (df = d), & PLS (df = fER)

  20. PCR & PLS (df = fER) AGAINST fER

  21. PCR, PLS, & RR (df = fER) • axis label: fER

  22. CA IV (FULL): PLS & PCR RMSEV AGAINST d

  23. PLS & PCR AGAINST fER

  24. PLS, PCR, & RR AGAINST fER

  25. BIAS/VARIANCE CONSIDERATION • Plot of prediction error against model complexity, with bias and variance curves

  26. GENERAL TIKHONOV REGULARIZATION • λ is meta-parameter that must be optimized • L is a matrix of values, usually a derivative operator • Tikhonov, Soviet Math. Dokl., 4 (1963) 1035 • L can be the spectral error covariance matrix for removal of undesired spectral variation (wavelength selection) • Kalivas, Anal. Chim. Acta, 505 (2004) 9
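A minimal numerical sketch of general Tikhonov regularization follows. It assumes the common formulation in which b minimizes ‖Xb − y‖² + λ²‖Lb‖² (some texts use λ rather than λ²; the convention here is an assumption) and solves it as one stacked least-squares problem. The function name `tikhonov`, the first-derivative operator, and the data are hypothetical.

```python
import numpy as np

def tikhonov(X, y, L, lam):
    """Minimize ||X b - y||^2 + lam^2 ||L b||^2 via a stacked least-squares problem."""
    A = np.vstack([X, lam * L])
    rhs = np.concatenate([y, np.zeros(L.shape[0])])
    b, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return b

# Illustrative random data: m = 25 samples, p = 6 variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(25, 6))
y = rng.normal(size=25)
lam = 1.5

# L = I: ordinary ridge regression with ridge value lam^2.
b_identity = tikhonov(X, y, np.eye(6), lam)

# L = first-derivative (difference) operator, shape (p - 1, p).
L_diff = np.diff(np.eye(6), axis=0)
b_smooth = tikhonov(X, y, L_diff, lam)
```

Stacking avoids forming XᵀX explicitly, which tends to be numerically gentler than solving the normal equations when X is ill-conditioned.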

  27. STANDARDIZED TIKHONOV REGULARIZATION • Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998)

  28. STANDARDIZED TIKHONOV REGULARIZATION • Simple case: L is square and invertible • L = I • RR

  29. HARMONIOUS (PARETO) PLOT • For graphical characterization of Tikhonov regularization, plot a variance indicator against a bias criterion to reduce the chance of overfitting or underfitting • Curve will have an L-shape (L-curve) • Ideal model at corner with the proper bias/variance trade-off (harmonious model) • PCR and PLS: best number of factors • RR: best ridge value • etc. • Intra- and inter-model comparison • Lawson, et al., Solving Least-Squares Problems. Prentice-Hall, (1974)
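The points of such a curve can be computed directly for ridge regression. The sketch below makes several assumptions not stated on the slide: RMSEC (without a df correction) is the bias criterion, the regression vector norm ‖b̂‖ is the variance indicator, and the corner is located by a simple heuristic (closest point to the origin after scaling both axes to [0, 1]). The names `ridge_curve` and `closest_to_origin` and the data are hypothetical.

```python
import numpy as np

def ridge_curve(X, y, lams):
    """Return one (RMSEC, ||b||) point per ridge value."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    uty = U.T @ y
    pts = []
    for lam in lams:
        # Ridge solution in the singular vector basis.
        b = Vt.T @ (s * uty / (s**2 + lam))
        rmsec = np.sqrt(np.mean((y - X @ b) ** 2))
        pts.append((rmsec, np.linalg.norm(b)))
    return np.array(pts)

def closest_to_origin(pts):
    """Heuristic corner: nearest point to the origin after min-max scaling."""
    span = pts.max(axis=0) - pts.min(axis=0)
    scaled = (pts - pts.min(axis=0)) / span
    return int(np.argmin(np.hypot(scaled[:, 0], scaled[:, 1])))

# Illustrative random data: m = 40 samples, p = 10 variables.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=40)

lams = np.logspace(-3, 3, 25)
pts = ridge_curve(X, y, lams)
corner = closest_to_origin(pts)
best_lam = lams[corner]
```

As the ridge value grows, RMSEC rises while ‖b̂‖ falls, which is what produces the L-shape; the same two quantities can be computed for PCR and PLS at each number of factors and overlaid on one plot.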

  30. EXAMPLE PLOT • regions labeled: overfitting, best model, underfitting

  31. VARIANCE EXPRESSIONS • Faber, et al., Journal of Chemometrics, 11 (1997) 181 • Lorber, et al., Journal of Chemometrics, 2 (1988) 93

  32. EXPERIMENTAL APPROACH • Intra- and inter-model comparison of RR, PLS, and PCR with QSAR data for the most harmonious and parsimonious models • LOOCV tends to overfit • Use mean values from LMOCV • Data sets randomly split 300 times with v validation and m − v calibration samples where v ≈ 0.6m • Shao, J., J. Am. Statist. Assoc., 88 (1993) 486-494
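The leave-multiple-out scheme described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: ridge regression is the example model, RMSEV is computed as the root mean squared validation error per split, and the names (`lmocv_rmsev`, `ridge_fit`, `linear_predict`) and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam=0.1):
    """Return ridge regression coefficients."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def linear_predict(b, X):
    """Predict with a coefficient vector b."""
    return X @ b

def lmocv_rmsev(X, y, fit, predict, n_splits=300, val_frac=0.6, seed=0):
    """Mean RMSEV over random splits with v = val_frac * m validation samples."""
    rng = np.random.default_rng(seed)
    m = len(y)
    v = int(round(val_frac * m))
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(m)
        val, cal = perm[:v], perm[v:]   # v validation, m - v calibration
        model = fit(X[cal], y[cal])
        resid = y[val] - predict(model, X[val])
        scores.append(np.sqrt(np.mean(resid**2)))
    return float(np.mean(scores))

# Synthetic data for illustration: m = 50 samples, p = 4 variables.
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.5, 1.5]) + 0.1 * rng.normal(size=50)

mean_rmsev = lmocv_rmsev(X, y, ridge_fit, linear_predict)
```

Holding out the larger share of the samples for validation makes each calibration set small, which penalizes overly complex models more strongly than leave-one-out does.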

  33. CA IV HARMONIOUS RR(750), PLS(6), AND PCR(8) PLOTS FOR 63 DESCRIPTORS • ridge value range: 45 - 7050 • axis labels: RMSEC, RMSEV

  34. GENERAL APPROACH TO OPTIMIZATION OF PARETO CURVE • for basis set V and weights β, or any basis set with respective weights • Use an optimization algorithm (simplex, simulated annealing, etc.) adjusting weight values in β while minimizing the distance to target values of variance and bias measures • Models converge to RR models

  35. CA IV MODEL VALUES FOR 63 DESCRIPTORS

  36. CA IV HARMONIOUS RR(6), PLS(4), AND PCR(5) PLOTS FOR 8 DESCRIPTORS • ridge value range: 0.2 - 126 • axis labels: RMSEC, RMSEV

  37. CA IV MODEL VALUES FOR 8 DESCRIPTORS

  38. CA IV MODEL VALUES

  39. CA IV HARMONY/PARSIMONY PLOTS: PLS(6) AND PCR(8) FOR 63 DESCRIPTORS • axis labels: fER, RMSEV

  40. CA IV HARMONY/PARSIMONY PLOTS: PLS(6) 63 DESCRIPTORS AND PLS(4) 8 DESCRIPTORS • axis labels: RMSEV, fER

  41. CA I MODEL VALUES

  42. tgDHFR MODEL VALUES USING 10 DESCRIPTORS

  43. pcDHFR MODEL VALUES USING 10 DESCRIPTORS

  44. SUMMARY • ER necessary for fair intra- and inter-model comparison • RMSEC and RMSEV plot overlays are possible for different modeling methods • Harmonious plots allow proper determination of meta-parameters and validation • Fair intra- and inter-model comparisons are possible (plot overlays are possible) • In optimal model region of harmonious curve, differences in models are small • ER assesses the true nature of variable selection for improved parsimony • Harmony/parsimony compromise

  45. FUTURE WORK • Use ER with multiple variance and bias indicators for better characterization of the harmony/parsimony tradeoff for intra- and inter-model comparison with full and/or variable subsets • Include variable selection in the modeling process

  46. Include L = second derivative operator in Tikhonov regularization • a form of RR with smoothing • smooth spectral noise and temperature influences • Use standardization approach with PCR and PLS • PCR: • PLS:

  47. ACKNOWLEDGEMENTS • Forrest Stout and Heather Seipel • Peter Jurs and Brian Mattioni provided QSAR data sets • National Science Foundation

  48. STANDARDIZATION PROCESS • For L with rank(L) = s < p, obtain a QR factorization of Lᵀ • Form and perform a QR factorization of XKo • Compute standardized data • Perform back-transformation
