DEFINING MULTIVARIATE CALIBRATION MODEL COMPLEXITY FOR MODEL SELECTION AND COMPARISON • John Kalivas • Department of Chemistry, Idaho State University, Pocatello, Idaho
MULTIVARIATE CALIBRATION MODEL • y = Xb + e • y (m × 1) • quantitative information of prediction property for m samples • X (m × p) • respective values for p predictor variables (wavelengths for spectral data) • b (p × 1) • unknown regression coefficients • e (m × 1) • errors with mean zero and covariance σ²I
REGRESSION VECTOR SOLUTION • MLR solution b̂ = (XᵀX)⁻¹Xᵀy requires m ≥ p (variable selection) and nearly orthogonal X • Biased regression methods require selection of meta-parameter(s)
BIASED MODELING METHODS • PLS • PCR • Ridge regression (RR) • Generalized RR • Cyclic subspace regression • Continuum regression • Ridge PCR and PLS • Generalized ridge PCR and PLS • Etc.
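GENERIC EXPRESSION • $\hat{b} = \sum_{i=1}^{k} f_i \frac{u_i^T y}{\sigma_i} v_i$ • where k = rank(X) ≤ min(m, p), u_i, σ_i, and v_i come from the singular value decomposition X = UΣVᵀ, and f_i are the filter values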
PCR, RR, AND PLS FILTER VALUES • PCR: fi = 1 for retained basis vectors and fi = 0 for deleted basis vectors • RR: 0 ≤ fi ≤ 1 depending on ridge value • PLS: 0 ≤ fi < ∞ depending on PLS factor model
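The filter values above can be tried out numerically. The following is a minimal sketch, not the author's implementation: it assumes a full-rank X, the filtered SVD form of the regression vector, and a ridge value `lam` that is added directly to the diagonal of XᵀX (so the RR filter is s²/(s² + lam)). The function names are hypothetical.

```python
import numpy as np

def filtered_solution(X, y, filters):
    """b = sum_i f_i * (u_i^T y / s_i) * v_i, using the SVD X = U diag(s) V^T.

    Assumes X has full column rank so that no singular value is zero.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    coeffs = filters(s) * (U.T @ y) / s
    return Vt.T @ coeffs

def pcr_filters(d):
    """PCR: f_i = 1 for the d retained basis vectors, 0 for the deleted ones."""
    def f(s):
        out = np.zeros_like(s)
        out[:d] = 1.0
        return out
    return f

def rr_filters(lam):
    """RR: 0 <= f_i <= 1, assuming lam is added to the diagonal of X^T X."""
    return lambda s: s**2 / (s**2 + lam)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
lam = 2.0

# Ridge regression through filter values and through the normal equations
b_rr = filtered_solution(X, y, rr_filters(lam))
b_rr_direct = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# PCR that retains every basis vector reduces to the MLR solution
b_pcr_full = filtered_solution(X, y, pcr_filters(5))
b_mlr = np.linalg.lstsq(X, y, rcond=None)[0]
```

Writing all three methods through one `filtered_solution` routine makes the point of the slide concrete: the methods differ only in how they weight the same eigenvector basis.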
RR AND PLS FILTER VALUES • RR • PLS • θ_j (j = 1, …, d) are the eigenvalues of XᵀX restricted to the Krylov subspace
A CALIBRATION ASSESSMENT PROBLEM • σ² is estimated by s² = MSEC • Needs degrees of freedom, or fitting degrees of freedom (df) • df = p for the particular MLR model • requires m > p
MORE df PROBLEMS • df = d, the number of factors (basis vectors) for PCR and PLS • but models can be represented in any basis set • the same model in different basis sets requires a different number of basis vectors • RR and others are not factor based and/or use multiple meta-parameters
ANOTHER CALIBRATION ASSESSMENT PROBLEM • Useful to plot results for different modeling methods on one plot • Example: a plot of RMSEV against number of factors (basis vectors) is possible for PCR and PLS • RR cannot be included in the plot • PCR and PLS are still compared improperly, as their factors are in different basis sets
NECESSITY • Effective rank (ER) for inter-model comparison of b̂ from ŷ = Xb̂, where b̂ is from factor-based methods such as PCR or PLS, non-factor-based methods such as RR, and/or methods based on multiple meta-parameters • smaller ER, more parsimonious model?
SOLUTIONS • Develop ER in a common basis set using information on how the basis vectors are used • Develop ER that is basis set independent
COMMON BASIS SET • Use filter values (f_i) in eigenvector basis set V • fER as defined by Gilliam, et al., Inverse Problems, 6 (1990) 725 • fER as defined by Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998)
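As a hedged illustration of a filter-value-based effective rank, the sketch below assumes the sum-of-filter-values form, ER = Σ f_i. For ridge regression this sum equals the trace of the hat matrix, which is a standard identity. Whether this is the exact fER expression used on the slide is an assumption, as is the convention that the ridge value `lam` is added to the diagonal of XᵀX.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))
s = np.linalg.svd(X, compute_uv=False)   # singular values of X

# Ridge regression: fractional filter values give a non-integer effective rank
lam = 5.0
f_rr = s**2 / (s**2 + lam)
er_rr = f_rr.sum()

# Cross-check against the trace of the ridge hat matrix H, where y_hat = H y
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(6), X.T)
er_rr_trace = np.trace(H)

# PCR with 3 retained basis vectors: filter values are 1 or 0, so ER = 3
f_pcr = (np.arange(6) < 3).astype(float)
er_pcr = f_pcr.sum()
```

Under this assumed definition a PCR model recovers its number of factors, while an RR model lands between 0 and rank(X), so both can share one axis.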
BASIS SET INDEPENDENT • h_i = change in the fitted value ŷ_i depending on the change in the observed value y_i • the larger the h_i, the more ŷ_i will change if y_i changes (fluctuations around the expected value due to random noise) • Add normally distributed noise δ to y N times • Obtain ŷ vectors from models with perturbed y • Calculate h_i for the ith sample (sensitivity of a fitted value to perturbation in the respective observed value) as the regression slope of ŷ_i against δ_i over the N perturbations • Ye, Journal of the American Statistical Association, 93 (1998) 120
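The perturbation procedure above can be sketched as a short Monte Carlo routine. This is an illustrative sketch under stated assumptions: the noise standard deviation, the number of repetitions, and the use of ridge regression as the model being probed are all choices made here, and `monte_carlo_h` and `ridge_fit_predict` are hypothetical names. Summing the h_i gives the generalized degrees of freedom in Ye's sense; for a linear method the result should approach the trace of the hat matrix.

```python
import numpy as np

def monte_carlo_h(fit_predict, X, y, n_rep=1000, noise_sd=0.1, seed=0):
    """Estimate h_i as the slope of fitted value i against perturbation i."""
    rng = np.random.default_rng(seed)
    deltas = rng.normal(scale=noise_sd, size=(n_rep, len(y)))
    # One refit per perturbed y; rows are repetitions, columns are samples
    fits = np.array([fit_predict(X, y + d) for d in deltas])
    dc = deltas - deltas.mean(axis=0)
    fc = fits - fits.mean(axis=0)
    # Least-squares slope for each sample (column) separately
    return (dc * fc).sum(axis=0) / (dc**2).sum(axis=0)

def ridge_fit_predict(lam):
    """Return a function that fits ridge regression and returns fitted values."""
    def fit_predict(X, y):
        b = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
        return X @ b
    return fit_predict

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 5))
y = X @ rng.normal(size=5) + 0.2 * rng.normal(size=25)
lam = 3.0

h = monte_carlo_h(ridge_fit_predict(lam), X, y)
er_mc = h.sum()                           # Monte Carlo effective rank

# Exact value for this linear method: trace of the hat matrix
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T)
er_exact = np.trace(H)
```

Because the routine only calls `fit_predict`, it never needs to know which basis set or meta-parameters the model uses, which is what makes the measure basis set independent.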
BASIS SET INDEPENDENT • Van der Voet, Journal of Chemometrics, 13 (1999) 111 • VDV ER is based on error estimates, which themselves contain error
BASIS SET INDEPENDENT • Know b̂ = Vβ = Tδ = Iγ for eigenvector basis set V, PLS basis set T, and standard basis set I, with β, δ, and γ being the respective weight vectors for a model in that basis set • Eigenvector basis set: • PLS basis set: • Std. basis set:
DATA SETS • CARBONIC ANHYDRASE (CA) INHIBITORS: CA contributes to the production of eye humor, and excess secretion causes permanent damage and disease (macular edema and open-angle glaucoma). • 142 compounds assayed for inhibition of CA isozymes CA I, CA II, and CA IV. Inhibition values log(Ki) modeled with 63 (full) and 8 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Chem. Inf. Comput. Sci., 42 (2002) 94) • DIHYDROFOLATE REDUCTASE (DHFR) INHIBITORS: inhibition of DHFR is important in combating diseases from the pathogens Pneumocystis carinii (pc) and Toxoplasma gondii (tg) in unhealthy immune systems • 334, 320, and 340 compounds assayed for inhibition of (pc) DHFR, (tg) DHFR, and mammalian standard rlDHFR. Log of 50% inhibition concentration (IC50) values modeled with 84, 83, and 84 (full) and 10 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Mol. Graphics Modeling, 21 (2002) 391)
BIAS/VARIANCE CONSIDERATION • [Figure: prediction error against model complexity, with bias and variance curves]
GENERAL TIKHONOV REGULARIZATION • λ is a meta-parameter that must be optimized • L is a matrix of values, usually a derivative operator • Tikhonov, Soviet Math. Dokl., 4 (1963) 1035 • L can be the spectral error covariance matrix for removal of undesired spectral variation (wavelength selection) • Kalivas, Anal. Chim. Acta, 505 (2004) 9
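A minimal sketch of a general Tikhonov solve follows. It assumes the common 2-norm form min ‖Xb − y‖² + λ²‖Lb‖², which may differ from the exact criterion on the slide (for example in the power of λ), and solves it by stacking X and λL into one least-squares problem. The function name `tikhonov` and the example L are hypothetical.

```python
import numpy as np

def tikhonov(X, y, L, lam):
    """Minimize ||X b - y||^2 + lam^2 ||L b||^2 via an augmented least-squares system."""
    A = np.vstack([X, lam * L])
    rhs = np.concatenate([y, np.zeros(L.shape[0])])
    return np.linalg.lstsq(A, rhs, rcond=None)[0]

rng = np.random.default_rng(4)
m, p = 30, 6
X = rng.normal(size=(m, p))
y = rng.normal(size=m)
lam = 0.7

# An arbitrary illustrative L: first-difference operator, shape (p - 1, p)
L = np.diff(np.eye(p), axis=0)
b_general = tikhonov(X, y, L, lam)

# Same answer from the normal equations (X^T X + lam^2 L^T L) b = X^T y
b_normal = np.linalg.solve(X.T @ X + lam**2 * (L.T @ L), X.T @ y)

# With L = I the problem reduces to ridge regression
b_identity = tikhonov(X, y, np.eye(p), lam)
b_ridge = np.linalg.solve(X.T @ X + lam**2 * np.eye(p), X.T @ y)
```

The augmented form avoids forming XᵀX explicitly, which is usually the better-conditioned route when X is nearly collinear.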
STANDARDIZED TIKHONOV REGULARIZATION • Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998)
STANDARDIZED TIKHONOV REGULARIZATION • Simple case: L is square and invertible • L = I • RR
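For the simple case of a square, invertible L, the standardization can be sketched as a change of variables: with b̄ = Lb, the problem becomes ridge regression on X̄ = XL⁻¹, and b is recovered as L⁻¹b̄. The sketch below assumes the penalty form min ‖Xb − y‖² + λ²‖Lb‖² and uses an arbitrary triangular L purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
m, p = 30, 5
X = rng.normal(size=(m, p))
y = rng.normal(size=m)
lam = 1.3

# Illustrative invertible L: upper triangular with a strictly positive diagonal
L = np.triu(rng.uniform(0.5, 1.5, size=(p, p)))

# Standardize: X_bar = X L^{-1}, computed by solving L^T Z = X^T and transposing
X_bar = np.linalg.solve(L.T, X.T).T

# Ridge regression (L = I) in the standardized variables
b_bar = np.linalg.solve(X_bar.T @ X_bar + lam**2 * np.eye(p), X_bar.T @ y)

# Back-transformation to the original variables: b = L^{-1} b_bar
b = np.linalg.solve(L, b_bar)

# Reference: direct solution of the general-form normal equations
b_direct = np.linalg.solve(X.T @ X + lam**2 * (L.T @ L), X.T @ y)
```

Once the data are standardized, any method that works in standard form can be applied to X̄, which is the motivation for extending the idea to PCR and PLS.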
HARMONIOUS (PARETO) PLOT • For graphical characterization of Tikhonov regularization, plot a variance indicator against a bias criterion to reduce the chance of overfitting or underfitting • Curve will have an L-shape (L-curve) • Ideal model at the corner with the proper bias/variance trade-off (harmonious model) • PCR and PLS: best number of factors • RR: best ridge value • etc. • Intra- and inter-model comparison • Lawson, et al., Solving Least-Squares Problems, Prentice-Hall, (1974)
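The points of a harmonious curve for RR can be generated as in the sketch below. The choices here are assumptions for illustration: ‖b̂‖ serves as the variance indicator, RMSEC (computed with a plain mean, no df correction) as the bias criterion, and the corner is picked by a simple heuristic, the point closest to the origin after scaling both axes to [0, 1]. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
m, p = 40, 10
X = rng.normal(size=(m, p))
y = X @ rng.normal(size=p) + 0.3 * rng.normal(size=m)

ridge_values = np.logspace(-3, 3, 25)
norms, rmsec = [], []
for lam in ridge_values:
    b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    norms.append(np.linalg.norm(b))                   # variance indicator
    rmsec.append(np.sqrt(np.mean((y - X @ b) ** 2)))  # bias criterion
norms = np.array(norms)
rmsec = np.array(rmsec)

def scale(a):
    """Min-max scale to [0, 1] so both axes contribute comparably."""
    return (a - a.min()) / (a.max() - a.min())

# Heuristic corner: smallest distance to the origin of the scaled plot
dist = np.hypot(scale(rmsec), scale(norms))
best = int(np.argmin(dist))
best_ridge_value = ridge_values[best]
```

Plotting `norms` against `rmsec` (for example with matplotlib) gives the L-shaped curve; points for PCR and PLS models can be overlaid on the same axes because neither axis depends on the number of factors.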
EXAMPLE PLOT • [Figure: L-shaped curve with regions labeled overfitting, best model, and underfitting]
VARIANCE EXPRESSIONS • Faber, et al., Journal of Chemometrics, 11 (1997) 181 • Lorber, et al., Journal of Chemometrics, 2 (1988) 93
EXPERIMENTAL APPROACH • Intra- and inter-model comparison of RR, PLS, and PCR with QSAR data for the most harmonious and parsimonious models • LOOCV tends to overfit • Use mean values from LMOCV • Data sets randomly split 300 times with v validation and m − v calibration samples, where v ≈ 0.6m • Shao, J., J. Am. Statist. Assoc., 88 (1993) 486-494
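The leave-multiple-out cross-validation (LMOCV) scheme above can be sketched as follows. The split count and validation fraction follow the slide; everything else is an assumption made for illustration, including the function names, the ridge model without mean centering, and the synthetic data.

```python
import numpy as np

def lmocv_rmsev(fit, predict, X, y, n_splits=300, frac_val=0.6, seed=0):
    """Mean RMSEV over random splits with v ~ frac_val * m validation samples."""
    rng = np.random.default_rng(seed)
    m = len(y)
    v = int(round(frac_val * m))
    rmsev = []
    for _ in range(n_splits):
        perm = rng.permutation(m)
        val, cal = perm[:v], perm[v:]       # v validation, m - v calibration
        model = fit(X[cal], y[cal])
        resid = y[val] - predict(model, X[val])
        rmsev.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(rmsev))

def make_ridge_fit(lam):
    """Ridge fit returning the regression vector (no centering, for brevity)."""
    def fit(X, y):
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return fit

def linear_predict(b, X):
    return X @ b

rng = np.random.default_rng(6)
m, p = 40, 4
X = rng.normal(size=(m, p))
b_true = rng.normal(size=p)
y_noisy = X @ b_true + 0.5 * rng.normal(size=m)
y_clean = X @ b_true

rmsev_noisy = lmocv_rmsev(make_ridge_fit(1.0), linear_predict, X, y_noisy)
rmsev_clean = lmocv_rmsev(make_ridge_fit(1e-10), linear_predict, X, y_clean)
```

Passing `fit` and `predict` as functions keeps the splitter independent of the modeling method, so the same 300 splits (same seed) can be reused for RR, PLS, and PCR.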
CA IV HARMONIOUS RR(750), PLS(6), AND PCR(8) PLOTS FOR 63 DESCRIPTORS • ridge value range: 45 to 7050 • [Plots with axes labeled RMSEC and RMSEV]
GENERAL APPROACH TO OPTIMIZATION OF PARETO CURVE • b̂ = Vβ for basis set V and weights β, or any basis set with its respective weights • Use an optimization algorithm (simplex, simulated annealing, etc.) adjusting weight values in β while minimizing the distance to target values of variance and bias measures • Models converge to RR models
CA IV HARMONIOUS RR(6), PLS(4), AND PCR(5) PLOTS FOR 8 DESCRIPTORS • ridge value range: 0.2 to 126 • [Plots with axes labeled RMSEC and RMSEV]
CA IV HARMONY/PARSIMONY PLOTS: PLS(6) AND PCR(8) FOR 63 DESCRIPTORS • [Two plots with axes labeled fER and RMSEV]
CA IV HARMONY/PARSIMONY PLOTS: PLS(6) 63 DESCRIPTORS AND PLS(4) 8 DESCRIPTORS • [Plot with axes labeled RMSEV and fER]
SUMMARY • ER necessary for fair intra- and inter-model comparison • RMSEC and RMSEV plot overlays are possible for different modeling methods • Harmonious plots allow proper determination of meta-parameters and validation • Fair intra- and inter-model comparisons are possible (plot overlays are possible) • In optimal model region of harmonious curve, differences in models are small • ER assesses the true nature of variable selection for improved parsimony • Harmony/parsimony compromise
FUTURE WORK • Use ER with multiple variance and bias indicators for better characterization of the harmony/parsimony tradeoff for intra- and inter-model comparison with full and/or variable subsets • Include variable selection in the modeling process
Include L = second derivative operator in Tikhonov regularization • a form of RR with smoothing • smooth spectral noise and temperature influences • Use standardization approach with PCR and PLS • PCR: • PLS:
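A second derivative operator for L can be sketched as a second-difference matrix with rows (1, −2, 1). The example below is illustrative only: it assumes unit spacing between adjacent variables, the penalty form min ‖Xb − y‖² + λ²‖Lb‖², and synthetic data. It checks that a larger λ gives a smoother regression vector as measured by ‖Lb‖. The function names are hypothetical.

```python
import numpy as np

def second_derivative_operator(p):
    """(p - 2) x p second-difference matrix; each row is (..., 1, -2, 1, ...)."""
    return np.diff(np.eye(p), n=2, axis=0)

def tikhonov(X, y, L, lam):
    """Minimize ||X b - y||^2 + lam^2 ||L b||^2 via an augmented least-squares system."""
    A = np.vstack([X, lam * L])
    rhs = np.concatenate([y, np.zeros(L.shape[0])])
    return np.linalg.lstsq(A, rhs, rcond=None)[0]

rng = np.random.default_rng(7)
m, p = 30, 12
X = rng.normal(size=(m, p))
y = rng.normal(size=m)

L = second_derivative_operator(p)

# A linear trend has zero second difference, so it is not penalized
trend_penalty = np.linalg.norm(L @ np.arange(p, dtype=float))

b_rough = tikhonov(X, y, L, 1e-3)    # weak smoothing
b_smooth = tikhonov(X, y, L, 1e2)    # strong smoothing
roughness_rough = np.linalg.norm(L @ b_rough)
roughness_smooth = np.linalg.norm(L @ b_smooth)
```

Because L has a two-dimensional null space (constants and linear trends), this penalty shrinks curvature in the regression vector rather than its overall size, which is what distinguishes it from plain RR.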
ACKNOWLEDGEMENTS • Forrest Stout and Heather Seipel • Peter Jurs and Brian Mattioni provided QSAR data sets • National Science Foundation
STANDARDIZATION PROCESS • For L with rank(L) = s < p, obtain a QR factorization of Lᵀ • Form XK_o and perform a QR factorization of XK_o • Compute standardized data • Perform back-transformation