Subset Selection Problem
Oxana Rodionova & Alexey Pomerantsev
Semenov Institute of Chemical Physics, Russian Chemometric Society, Moscow
Outline
• Introduction: what is a representative subset?
• Training set and test set
• Influential subset selection: boundary subset, Kennard-Stone subset, comparison of models
• Conclusions
What is a representative subset?
[Figure: the full data (X, Y) and candidate subsets (X_I, Y_I), (X_II, Y_II), (X_III, Y_III), each used to build the model Y(X)]
Influential Subset
• Training set X (n×m), Y (n×k) → Model I (A factors), RMSEP_1
• Influential subset X (l×m), Y (l×k), l < n → Model II (A factors), RMSEP_2
• Question: is the quality of prediction of Model II comparable to that of Model I?
Training and Test Sets
The entire data set (K objects) is split into a training set (N objects) and a test set (K−N objects).
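The split above can be sketched as a random partition of the object indices. This is only a minimal sketch: the deck does not say how its 99/40 split was made, and the random seed here is an illustrative assumption.

```python
import numpy as np

def split_train_test(n_objects, n_train, seed=0):
    """Randomly partition object indices into a training set and a test set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_objects)   # shuffle all object indices once
    return perm[:n_train], perm[n_train:]

# e.g. the wheat data: 139 objects split into 99 training + 40 test objects
train_idx, test_idx = split_train_test(139, 99)
```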
Statistical Tests
• Similar position in space: Hotelling T²-test
• Cloud orientation and dispersion around the means: generalization of Bartlett's test
D. Jouan-Rimbaud, D.L. Massart, C.A. Saby, C. Puel, Characterisation of the representativity of selected sets of samples in multivariate calibration and pattern recognition, Analytica Chimica Acta 350 (1997) 149-161
Influential Subset: Boundary Samples
Whole Wheat Samples (Data Description)
• X: NIR spectra of whole wheat (118 wavelengths)
• Y: moisture content
• Entire set: N = 139 objects (data pre-processed)
• Training set: 99 objects; test set: 40 objects
• PLS model, 4 PCs; SIC modeling, b_sic = 1.5
Boundary Samples
Training set (n×m, n = 99) → Model 1
Boundary subset: l = 19 objects; 'redundant subset': n − l = 80 objects
Boundary Subset
Training set (n = 99) → Model 1; boundary subset (l = 19) → Model 2 (4 PLS components, b_sic = 1.5). Both models are evaluated on the test set.
SIC Prediction
[Figure: SIC prediction of the test set by Model 1 (training set) and Model 2 (boundary subset)]
Quality of Prediction (PLS models)
Model 1 (training set, predicting the test set):
  RMSEC = 0.303, RMSEP = 0.337
  Mean(cal. leverage) = 0.051, Max(cal. leverage) = 0.25
Model 2 (boundary set, predicting the test set):
  RMSEC = 0.461, RMSEP = 0.357
  Mean(cal. leverage) = 0.26, Max(cal. leverage) = 0.45
Kennard-Stone Method
Aim: select samples that are uniformly distributed over the predictors' space.
Objects are chosen sequentially in X (or T, score) space: d_jr, j = 1, ..., k, is the squared Euclidean distance from candidate object r to each of the k objects already in the subset; at each step the candidate whose minimal distance to the subset is largest is added.
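The sequential max-min selection described above can be sketched as follows. This is a generic Kennard-Stone implementation, not the authors' code; the seeding with the two most distant objects is the usual convention.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Select n_select rows of X that are uniformly spread over
    predictor space (Kennard-Stone algorithm)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # pairwise squared Euclidean distances d_jr between all objects
    d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # start from the two most distant objects
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # for each candidate, distance to its nearest already-selected object
        dmin = d[np.ix_(remaining, selected)].min(axis=1)
        # add the candidate farthest from the selected set (max-min rule)
        pick = remaining[int(np.argmax(dmin))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# e.g. picking a 19-object K-S subset from a training-set matrix X_train:
#   ks_idx = kennard_stone(X_train, 19)   # X_train is hypothetical here
```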
Kennard-Stone Subset
Training set (n = 99) → Model 1 (4 PLS components); boundary subset (l = 19) → Model 2; K-S subset (l = 19) → Model 3.
Boundary Subset & K-S Subset (PLS models)
Model 2 (boundary set, predicting the test set):
  RMSEC = 0.461, RMSEP = 0.357
  Mean(cal. leverage) = 0.26, Max(cal. leverage) = 0.45
Model 3 (K-S set, predicting the test set):
  RMSEC = 0.229, RMSEP = 0.368
  Mean(cal. leverage) = 0.26, Max(cal. leverage) = 0.73
'Redundant Samples'
Training set (N = 99) → Model 1 (PLS, 4 components, b_sic = 1.5)
Boundary set (L = 19) → Model 2; redundant set RS_2: N − L = 80 objects
Kennard-Stone set (L = 19) → Model 3; redundant set RS_3: N − L = 80 objects
Test set: N1 = 40 objects
Prediction of Redundant Sets
Model 2 (boundary set) predicting RS_2: RMSEP = 0.267
Model 3 (K-S set) predicting RS_3: RMSEP = 0.338
Model Comparison
The entire data set (139 objects) was split randomly, 10 times, into a training set (99 objects) and a test set (40 objects), and the results were averaged.
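The repeated-split comparison above can be sketched as follows. An ordinary least-squares fit stands in for the PLS/SIC models of the deck, so only the resampling-and-averaging logic is illustrative; the synthetic data and helper names are assumptions.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def repeated_split_rmsep(X, y, n_train, n_repeats=10, seed=0):
    """Average RMSEP over repeated random training/test splits.
    A plain least-squares fit stands in for the PLS model here."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        perm = rng.permutation(len(y))
        tr, te = perm[:n_train], perm[n_train:]
        coef, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        scores.append(rmsep(y[te], X[te] @ coef))
    return float(np.mean(scores))
```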
Questions
• How should prediction ability be evaluated?
• How can representativity be verified?
Conclusions
• A model constructed from the Boundary Subset predicts all other samples with accuracy no worse than the calibration error evaluated on the whole data set.
• The Boundary Subset is significantly smaller than the whole training set.