DEFINING MULTIVARIATE CALIBRATION MODEL COMPLEXITY FOR MODEL SELECTION AND COMPARISON

John Kalivas • Department of Chemistry • Idaho State University • Pocatello, Idaho

MULTIVARIATE CALIBRATION MODEL • y = Xb + e • y (m × 1) • quantitative information of the prediction property for m samples • X (m × p) • respective values for p predictor variables (wavelengths for spectral data) • b (p × 1) • unknown regression coefficients • e (m × 1) • errors with mean zero and covariance σ²I

REGRESSION VECTOR SOLUTION • MLR solution: requires m ≥ p (variable selection) and nearly orthogonal X • Biased regression methods require selection of meta-parameter(s)

BIASED MODELING METHODS • PLS • PCR • Ridge regression (RR) • Generalized RR • Cyclic subspace regression • Continuum regression • Ridge PCR and PLS • Generalized ridge PCR and PLS • Etc.

GENERIC EXPRESSION • where k = rank(X) ≤ min(m, p)
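For reference, a generic filter-value expression of the kind this slide refers to can be written from the singular value decomposition of X. This is a sketch under assumptions: the symbols u_i, v_i, and σ_i (left singular vectors, right singular vectors, and singular values of X) are introduced here for illustration, and σ_i is unrelated to the error variance σ² of the calibration model.

```latex
X = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T},
\qquad
\hat{b} = \sum_{i=1}^{k} f_i \, \frac{u_i^{T} y}{\sigma_i} \, v_i ,
\qquad
k = \operatorname{rank}(X) \le \min(m, p)
```

With every f_i = 1 this reduces to the minimum-norm least-squares solution; the biased methods differ only in how they set the f_i.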

PCR, RR, AND PLS FILTER VALUES • PCR: fi = 1 for retained basis vectors and fi = 0 for deleted basis vectors • RR: 0 ≤ fi ≤ 1 depending on ridge value • PLS: 0 ≤ fi < ∞ depending on PLS factor model
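The PCR and RR cases can be sketched numerically. The block below is a minimal illustration, not the author's code: the function names are hypothetical, PCR is assumed to retain the leading d eigenvectors (other subsets are possible), and the ridge value is assumed to enter as (XᵀX + ridge·I)⁻¹Xᵀy.

```python
import numpy as np

def filtered_regression_vector(X, y, f):
    """Build b_hat = sum_i f_i * (u_i^T y / s_i) * v_i from the SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = int(np.sum(s > s[0] * 1e-12))      # numerical rank of X
    U, s, Vt = U[:, :k], s[:k], Vt[:k, :]
    return Vt.T @ (f[:k] * (U.T @ y) / s)

def pcr_filter(s, d):
    """PCR (assumed top-down): f_i = 1 for the first d basis vectors, else 0."""
    f = np.zeros_like(s)
    f[:d] = 1.0
    return f

def rr_filter(s, ridge):
    """RR: 0 <= f_i <= 1, shrinking toward 0 as the ridge value grows."""
    return s**2 / (s**2 + ridge)

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = rng.normal(size=20)
s = np.linalg.svd(X, compute_uv=False)

b_pcr = filtered_regression_vector(X, y, pcr_filter(s, d=3))
b_rr = filtered_regression_vector(X, y, rr_filter(s, ridge=2.0))

# Direct ridge solution for comparison with the filter-value form
b_rr_direct = np.linalg.solve(X.T @ X + 2.0 * np.eye(6), X.T @ y)
```

The filter-value form and the direct ridge solution agree to numerical precision, which is what allows different methods to be described in one common basis set.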

RR AND PLS FILTER VALUES • RR • PLS • dθj are the eigenvalues of XᵀX restricted to the Krylov subspace
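The standard forms of these filter values, as usually given in the regularization literature, are sketched below. The notation is an assumption: σ_i are the singular values of X, λ is the ridge value (some authors write λ² instead), d is the number of PLS factors, and θ_j are the eigenvalues (Ritz values) of XᵀX restricted to the d-dimensional Krylov subspace generated by XᵀX and Xᵀy.

```latex
\text{RR:}\quad f_i = \frac{\sigma_i^{2}}{\sigma_i^{2} + \lambda},
\qquad
\text{PLS:}\quad f_i = 1 - \prod_{j=1}^{d} \left( 1 - \frac{\sigma_i^{2}}{\theta_j} \right)
```

The RR expression always lies between 0 and 1, while the PLS product can push f_i above 1, consistent with the ranges listed for the two methods.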

A CALIBRATION ASSESSMENT PROBLEM • s² (σ²) is estimated by MSEC • Need degrees of freedom or fitting degrees of freedom (df) • df = p for the particular MLR model • requires m > p
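A small sketch of why df matters, assuming the usual form MSEC = Σ(yᵢ − ŷᵢ)² / (m − df). The function name is hypothetical, and some treatments subtract one more degree of freedom for mean centering, which is not done here.

```python
import numpy as np

def msec(y, y_fit, df):
    """Mean squared error of calibration with m - df in the denominator."""
    y = np.asarray(y, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    m = y.size
    if m <= df:
        # The estimate is undefined unless m > df (m > p for MLR)
        raise ValueError("MSEC requires m > df")
    return float(np.sum((y - y_fit) ** 2) / (m - df))

# Toy numbers: residual sum of squares is 0.5, m = 4, df = 2
value = msec([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0], df=2)
```

The same residuals give a different variance estimate for each df, so an ill-defined df makes MSEC values incomparable across models.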

MORE df PROBLEMS • df = d, the number of factors (basis vectors) for PCR and PLS • but models can be represented in any basis set • the same model in different basis sets requires a different number of basis vectors • RR and others are not factor based and/or use multiple meta-parameters

ANOTHER CALIBRATION ASSESSMENT PROBLEM • Useful to plot results for different modeling methods on one plot • Example: a plot of RMSEV against number of factors (basis vectors) is possible for PCR and PLS • RR cannot be included in plot • still have improper comparison of PCR and PLS as factors are in different basis sets

NECESSITY • Effective rank (ER) for inter-model comparison of b̂ from ŷ = Xb̂, where b̂ is from factor-based methods such as PCR or PLS, non-factor-based methods such as RR, and/or methods based on multiple meta-parameters • smaller ER, more parsimonious model?

SOLUTIONS • Develop ER in a common basis set using information on how the basis vectors are used • Develop ER that is basis set independent

COMMON BASIS SET • Use filter values ( fi ) in eigenvector basis set V • f ER = • Gilliam, et al., Inverse Problems, 6 (1990) 725 • f ER = • Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998)
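A minimal sketch of a filter-based ER, assuming ER is taken as the sum of the filter values, which equals the trace of the matrix that maps y to the fitted values. This is one common choice and may not match either cited definition exactly; the function name and the numbers are illustrative only.

```python
import numpy as np

def effective_rank(f):
    """Sum of filter values in the eigenvector basis set V (assumed definition)."""
    return float(np.sum(f))

# Hypothetical singular values of X
s = np.array([10.0, 5.0, 1.0, 0.1])

# PCR keeping two basis vectors: ER equals the number of factors
f_pcr = np.array([1.0, 1.0, 0.0, 0.0])
er_pcr = effective_rank(f_pcr)

# RR with ridge value 1.0: ER is a non-integer between 0 and rank(X)
f_rr = s**2 / (s**2 + 1.0)
er_rr = effective_rank(f_rr)
```

Under this definition PCR and RR models land on the same axis, so an RR model can be placed between, say, the two-factor and three-factor PCR models.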

BASIS SET INDEPENDENT • hi = change in the fitted value ŷi depending on the change in the observed value yi • the larger the hi, the more ŷi will change if yi changes (fluctuations around the expected value due to random noise) • Add normally distributed noise δ to y N times • Obtain ŷ vectors from models with perturbed y • Calculate hi for the ith sample (sensitivity of a fitted value to perturbation in the respective observed value) as the regression slope of ŷi on δi • Ye, Journal American Statistical Association, 93 (1998) 120
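The perturbation procedure above can be sketched as follows. This is an illustration under assumptions, not the implementation used in the talk: the function names, noise level, and number of perturbations are hypothetical, RR stands in for the model being assessed, and the ER is taken as the sum of the hi values.

```python
import numpy as np

def ridge_fit(X, y, ridge):
    """Fitted values from ridge regression (no mean centering, for brevity)."""
    p = X.shape[1]
    b = np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ y)
    return X @ b

def perturbation_er(fit, X, y, n_rep=2000, noise_sd=0.1, seed=0):
    """Estimate h_i as the slope of fitted value i against the noise added to y_i."""
    rng = np.random.default_rng(seed)
    deltas = rng.normal(scale=noise_sd, size=(n_rep, y.size))
    fits = np.array([fit(X, y + d) for d in deltas])   # shape (n_rep, m)
    dc = deltas - deltas.mean(axis=0)                  # centre so slope has an intercept
    fc = fits - fits.mean(axis=0)
    h = np.sum(dc * fc, axis=0) / np.sum(dc**2, axis=0)
    return h, float(h.sum())

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.2, size=30)

h, er = perturbation_er(lambda A, t: ridge_fit(A, t, 5.0), X, y)

# For a linear method the exact value is the trace of the hat matrix
s = np.linalg.svd(X, compute_uv=False)
er_exact = float(np.sum(s**2 / (s**2 + 5.0)))
```

Because the procedure only needs fitted values, it applies unchanged to PLS or any other method for which no closed-form filter values are at hand.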

BASIS SET INDEPENDENT • Van der Voet, Journal Chemometrics, 13 (1999) 111 • VDVER is based on error estimates which contain error

BASIS SET INDEPENDENT • Know for eigenvector basis set V, PLS basis set T, and std. basis set I with β, δ, and γ being respective weight vectors for a model in that basis set • Eigenvector basis set: • PLS basis set: • Std. basis set:

DATA SETS • CARBONIC ANHYDRASE (CA) INHIBITORS: CA contributes to production of eye humor, which with excess secretion causes permanent damage and diseases (macular edema and open-angle glaucoma) • 142 compounds assayed for inhibition of CA isozymes CA I, CA II, & CA IV. Inhibition values Log(Ki) modeled with 63 (full) & 8 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Chem. Inf. Comput. Sci., 42 (2002) 94) • DIHYDROFOLATE REDUCTASE (DHFR) INHIBITORS: inhibition of DHFR is important in combating diseases from the pathogens Pneumocystis carinii (pc) and Toxoplasma gondii (tg) in unhealthy immune systems • 334, 320, & 340 compounds assayed for inhibition of (pc) DHFR, (tg) DHFR, & mammalian standard rlDHFR. Log of 50% inhibition concentration values (IC50) modeled with 84, 83, & 84 (full) & 10 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Mol. Graphics Modeling, 21 (2002) 391)

CA IV (FULL): PLS & PCR RMSEC (df = d) AGAINST d

PCR (df = d = fER), PLS (df = d), & PLS (df = fER)

PCR & PLS (df = fER) AGAINST fER

PCR, PLS, & RR (df = fER) AGAINST fER

CA IV (FULL): PLS & PCR RMSEV AGAINST d

PLS & PCR AGAINST fER

PLS, PCR, & RR AGAINST fER

BIAS/VARIANCE CONSIDERATION • Plot of prediction error against model complexity, with bias and variance curves

GENERAL TIKHONOV REGULARIZATION • λ is the meta-parameter that must be optimized • L is a matrix of values, usually a derivative operator • Tikhonov, Soviet Math. Dokl., 4 (1963) 1035 • L can be the spectral error covariance matrix for removal of undesired spectral variation (wavelength selection) • Kalivas, Anal. Chim. Acta, 505 (2004) 9
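A minimal sketch of general-form Tikhonov regularization, assuming the objective min ‖Xb − y‖² + λ²‖Lb‖² (whether λ or λ² multiplies the penalty is a convention that may differ from the slide). The function name is hypothetical and L is taken to be a second-derivative operator for illustration.

```python
import numpy as np

def tikhonov(X, y, L, lam):
    """Solve min ||Xb - y||^2 + lam^2 ||Lb||^2 as one stacked least-squares problem."""
    A = np.vstack([X, lam * L])
    rhs = np.concatenate([y, np.zeros(L.shape[0])])
    b, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return b

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
m, p = 15, 8
X = rng.normal(size=(m, p))
y = rng.normal(size=m)

# Second-derivative operator: each row holds the stencil [1, -2, 1]
L2 = np.diff(np.eye(p), n=2, axis=0)      # shape (p - 2, p)
lam = 3.0
b_smooth = tikhonov(X, y, L2, lam)

# With L = I the same routine gives ordinary ridge regression
b_ridge = tikhonov(X, y, np.eye(p), lam)
```

The stacked formulation avoids forming XᵀX explicitly, which is generally better conditioned than solving the normal equations.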

STANDARDIZED TIKHONOV REGULARIZATION • Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998)

STANDARDIZED TIKHONOV REGULARIZATION • Simple case: L is square and invertible • L = I • RR
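For the simple case of a square, invertible L, standardization can be sketched as a change of variables: with b̄ = Lb the problem becomes ordinary RR on X̄ = XL⁻¹, followed by the back-transformation b = L⁻¹b̄. The function name and the particular L below are assumptions for illustration, and the penalty is assumed to be λ²‖Lb‖².

```python
import numpy as np

def standardized_tikhonov(X, y, L, lam):
    """Square invertible L: transform to standard form, apply RR, transform back."""
    p = X.shape[1]
    X_bar = np.linalg.solve(L.T, X.T).T                  # X @ inv(L) without forming inv(L)
    b_bar = np.linalg.solve(X_bar.T @ X_bar + lam**2 * np.eye(p), X_bar.T @ y)
    return np.linalg.solve(L, b_bar)                     # back-transformation b = inv(L) @ b_bar

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
m, p = 15, 8
X = rng.normal(size=(m, p))
y = rng.normal(size=m)

# A square first-difference operator (unit lower bidiagonal, hence invertible)
L = np.eye(p) - np.diag(np.ones(p - 1), k=-1)
lam = 2.0
b_std = standardized_tikhonov(X, y, L, lam)

# Direct general-form solution for comparison
b_general = np.linalg.solve(X.T @ X + lam**2 * (L.T @ L), X.T @ y)
```

Once the problem is in standard form, any RR tool (including the RR filter values) applies to X̄ unchanged; setting L = I makes X̄ = X and recovers plain RR.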

HARMONIOUS (PARETO) PLOT • For graphical characterization of Tikhonov regularization, plot a variance indicator against a bias criterion to reduce the chance of overfitting or underfitting • Curve will have an L-shape (L-curve) • Ideal model at corner with the proper bias/variance trade-off (harmonious model) • PCR and PLS: best number of factors • RR: best ridge value • etc. • Intra- and inter-model comparison • Lawson, et al., Solving Least-Squares Problems, Prentice-Hall, (1974)
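The points of such a curve can be sketched for RR as below. The choices here are assumptions: the norm of the regression vector serves as the variance indicator, a plain root-mean-square calibration residual (no df correction) serves as the bias criterion, and the function name is hypothetical.

```python
import numpy as np

def ridge_l_curve(X, y, ridge_values):
    """Return one (bias criterion, variance indicator) pair per ridge value."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    uty = U.T @ y
    points = []
    for ridge in ridge_values:
        f = s**2 / (s**2 + ridge)                # RR filter values
        b = Vt.T @ (f * uty / s)
        rmsec = np.sqrt(np.mean((y - X @ b) ** 2))
        points.append((rmsec, np.linalg.norm(b)))
    return np.array(points)

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=25)

ridge_values = np.logspace(-3, 3, 25)
curve = ridge_l_curve(X, y, ridge_values)        # column 0: RMSEC, column 1: ||b||
```

Plotting column 1 against column 0 traces the curve: small ridge values sit at the high-variance end (overfitting), large ridge values at the high-bias end (underfitting), and the harmonious model lies near the corner between them.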

EXAMPLE PLOT • L-curve with regions labeled overfitting, best model, and underfitting

VARIANCE EXPRESSIONS • Faber, et al., Journal of Chemometrics, 11 (1997) 181 • Lorber, et al., Journal of Chemometrics, 2 (1988) 93

EXPERIMENTAL APPROACH • Intra- and inter-model comparison of RR, PLS, and PCR with QSAR data for the most harmonious and parsimonious models • LOOCV tends to overfit • Use mean values from LMOCV • Data sets randomly split 300 times with v validation and m − v calibration samples, where v ≈ 0.6m • Shao, J., J. Am. Statist. Assoc., 88 (1993) 486-494
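The splitting scheme can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, RR stands in for whichever method is being validated, no mean centering or scaling is applied, and the data are synthetic rather than the QSAR sets.

```python
import numpy as np

def lmocv_mean_rmsev(X, y, fit, n_splits=300, val_fraction=0.6, seed=0):
    """Mean RMSEV over random splits with v = round(val_fraction * m) validation samples."""
    rng = np.random.default_rng(seed)
    m = y.size
    v = int(round(val_fraction * m))
    rmsev = np.empty(n_splits)
    for t in range(n_splits):
        perm = rng.permutation(m)
        val, cal = perm[:v], perm[v:]            # v validation, m - v calibration
        b = fit(X[cal], y[cal])
        rmsev[t] = np.sqrt(np.mean((y[val] - X[val] @ b) ** 2))
    return float(rmsev.mean())

def make_ridge(ridge):
    """Return a fit function for RR with a fixed ridge value."""
    def fit(X, y):
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ y)
    return fit

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + rng.normal(scale=0.1, size=50)

mean_rmsev = lmocv_mean_rmsev(X, y, make_ridge(1e-3))
```

Repeating the call over a grid of ridge values (or numbers of factors) gives the mean RMSEV values that can then be plotted against ER or a variance indicator.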

CA IV HARMONIOUS RR(750), PLS(6), AND PCR(8) PLOTS FOR 63 DESCRIPTORS • ridge value range: 45 - 7050 • curves shown for RMSEC and RMSEV

GENERAL APPROACH TO OPTIMIZATION OF PARETO CURVE • b̂ = Vβ for basis set V and weights β, or any basis set with respective weights • Use an optimization algorithm (simplex, simulated annealing, etc.) adjusting weight values in β while minimizing the distance to target values of variance and bias measures • Models converge to RR models

CA IV MODEL VALUES FOR 63 DESCRIPTORS

CA IV HARMONIOUS RR(6), PLS(4), AND PCR(5) PLOTS FOR 8 DESCRIPTORS • ridge value range: 0.2 - 126 • curves shown for RMSEC and RMSEV

CA IV MODEL VALUES FOR 8 DESCRIPTORS

CA IV MODEL VALUES

CA IV HARMONY/PARSIMONY PLOTS: PLS(6) AND PCR(8) FOR 63 DESCRIPTORS • axes: fER and RMSEV

CA IV HARMONY/PARSIMONY PLOTS: PLS(6) 63 DESCRIPTORS AND PLS(4) 8 DESCRIPTORS • axes: fER and RMSEV

CA I MODEL VALUES

tgDHFR MODEL VALUES USING 10 DESCRIPTORS

pcDHFR MODEL VALUES USING 10 DESCRIPTORS

SUMMARY • ER necessary for fair intra- and inter-model comparison • RMSEC and RMSEV plot overlays are possible for different modeling methods • Harmonious plots allow proper determination of meta-parameters and validation • Fair intra- and inter-model comparisons are possible (plot overlays are possible) • In optimal model region of harmonious curve, differences in models are small • ER assesses the true nature of variable selection for improved parsimony • Harmony/parsimony compromise

FUTURE WORK • Use ER with multiple variance and bias indicators for better characterization of the harmony/parsimony tradeoff for intra- and inter-model comparison with full and/or variable subsets • Include variable selection in the modeling process

Include L = second derivative operator in Tikhonov regularization • a form of RR with smoothing • smooth spectral noise and temperature influences • Use standardization approach with PCR and PLS • PCR: • PLS:

ACKNOWLEDGEMENTS • Forrest Stout and Heather Seipel • Peter Jurs and Brian Mattioni provided QSAR data sets • National Science Foundation

STANDARDIZATION PROCESS • For L with rank(L) = s < p, obtain a QR factorization of Lᵀ • Form XKo and perform a QR factorization of XKo • Compute standardized data • Perform back-transformation
