Subset Selection Problem

Oxana Rodionova & Alexey Pomerantsev
Semenov Institute of Chemical Physics
Russian Chemometric Society
Moscow

Outline
• Introduction. What is a representative subset?
• Training set and test set
• Influential subset selection: boundary subset, Kennard-Stone subset, models' comparison
• Conclusions

What is a representative subset?
[Diagram: data X, Y and the model Y(X); subsets labeled X_I, X_II, Y_II, X_III, Y_III]

Influential Subset
Training set X (n×m), Y (n×k): Model I (A factors), RMSEP1
Influential subset X (l×m), Y (l×k), l<n: Model II (A factors), RMSEP2
Quality of prediction: Model II ~ Model I?
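The comparison of RMSEP1 and RMSEP2 can be illustrated with a minimal sketch. It assumes the usual definition of RMSEP as the root of the mean squared difference between reference and predicted values on a test set; the function name `rmsep` and all numbers below are made up for illustration and are not taken from the slides.

```python
import math

def rmsep(y_true, y_pred):
    """Root mean square error of prediction over a test set."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical reference values and predictions (illustrative numbers only).
y_test = [10.0, 12.0, 14.0, 16.0]
pred_model_1 = [10.5, 11.5, 14.5, 15.5]  # model built on the full training set
pred_model_2 = [11.0, 11.0, 15.0, 15.0]  # model built on the influential subset

rmsep_1 = rmsep(y_test, pred_model_1)
rmsep_2 = rmsep(y_test, pred_model_2)
print(f"RMSEP1={rmsep_1:.3f}  RMSEP2={rmsep_2:.3f}")
```

If RMSEP2 stays close to RMSEP1, the subset model predicts about as well as the full model with the same number of factors.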

Training and Test Sets
Entire data set: K objects
Training set: N objects
Test set: K-N objects

Statistical Tests
Generalization of Bartlett’s test + Hotelling T²-test
Properties compared:
• Similar position in space
• Clouds orientation
• Dispersion around their means
D. Jouan-Rimbaud, D.L. Massart, C.A. Saby, C. Puel, Characterisation of the representativity of selected sets of samples in multivariate calibration and pattern recognition, Analytica Chimica Acta 350 (1997) 149-161

Influential Subset: Boundary Samples
[Figure: RPV]

Whole Wheat Samples (Data description)
X: NIR spectra of whole wheat (118 wavelengths)
Y: moisture content
Entire set: N=139, data pre-processed
Training set = 99 objects
Test set = 40 objects
PLS model, 4 PCs
SIC modeling, bsic=1.5

Boundary samples
Training set (n×m), n=99: Model 1
Boundary subset: l=19
‘Redundant subset’: n-l=80

Boundary Subset
Training set, n=99: Model 1
Boundary subset, l=19: Model 2
4 PLS components, bsic=1.5
Test set

SIC prediction
[Figure panels: Model 1 (Training set), Test set; Model 2 (Boundary subset), Test set]

Quality of prediction (PLS models)?
Model 1 (Training set), Test set: RMSEC=0.303, RMSEP=0.337, Mean(Cal. Leverage)=0.051, Maximum(Cal. Leverage)=0.25
Model 2 (Boundary set), Test set: RMSEC=0.461, RMSEP=0.357, Mean(Cal. Leverage)=0.26, Maximum(Cal. Leverage)=0.45
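The mean leverage figures can be checked with a short sketch. It assumes a definition the slides do not state: calibration leverage h_i = 1/n + sum over components of t_ia² / (t_a't_a), with centered, mutually orthogonal score vectors. Under that assumption the mean leverage is (A+1)/n, which appears consistent with the reported 0.051 (n=99) and 0.26 (l=19) for A=4. The function names and the toy score matrix are hypothetical.

```python
def leverages(scores):
    """Leverage of each object from a score matrix (rows = objects).

    Assumes the score columns are centered and mutually orthogonal.
    """
    n = len(scores)
    n_comp = len(scores[0])
    # Sum of squares of each score column, t_a' t_a.
    col_ss = [sum(row[a] ** 2 for row in scores) for a in range(n_comp)]
    return [
        1.0 / n + sum(row[a] ** 2 / col_ss[a] for a in range(n_comp))
        for row in scores
    ]

def expected_mean_leverage(n_objects, n_components):
    """Mean leverage implied by the assumed definition: (A + 1) / n."""
    return (n_components + 1) / n_objects

# Toy example: 4 objects, 2 orthogonal centered score columns (made-up values).
toy_scores = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
toy_h = leverages(toy_scores)
toy_mean = sum(toy_h) / len(toy_h)

mean_full = expected_mean_leverage(99, 4)    # training set, 4 components
mean_subset = expected_mean_leverage(19, 4)  # 19-object subset, 4 components
print(round(mean_full, 3), round(mean_subset, 2))
```

A smaller calibration set necessarily raises the mean leverage for a fixed number of components, so the maximum leverage is the more informative figure when comparing subsets of equal size.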

Kennard-Stone Method
Aim: select samples that are uniformly distributed over the predictors’ space.
Objects are chosen sequentially in X or T space.
d_jr, j=1,...,k, is the squared Euclidean distance from candidate object r to the k objects in the subset.
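The sequential selection can be sketched as follows. This is a minimal version of the commonly described Kennard-Stone procedure, not necessarily the exact variant used by the authors: it assumes the subset starts from the two objects farthest apart, and that each next object is the candidate r whose smallest squared distance d_jr to the already selected objects is largest. The function names and the toy data are illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two objects (rows of X or T)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kennard_stone(points, n_select):
    """Return the indices of n_select objects chosen sequentially."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of objects")
    # Start with the pair of objects that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: squared_distance(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    # min_dist[r] = smallest d_jr over the objects j already in the subset.
    min_dist = [
        min(squared_distance(p, points[first]), squared_distance(p, points[second]))
        for p in points
    ]
    while len(selected) < n_select:
        candidates = [r for r in range(n) if r not in selected]
        # Pick the candidate that is farthest from its nearest selected object.
        best = max(candidates, key=lambda r: min_dist[r])
        selected.append(best)
        for r in range(n):
            min_dist[r] = min(min_dist[r], squared_distance(points[r], points[best]))
    return selected

# Toy one-dimensional data (made-up values).
toy_points = [[0.0], [1.0], [4.0], [5.0], [10.0]]
subset = kennard_stone(toy_points, 3)
print(subset)
```

Running the selection on PLS scores instead of raw spectra corresponds to choosing objects in T space rather than X space.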

Kennard-Stone Subset
Training set, n=99: Model 1
Boundary subset: Model 2
K-S subset, l=19: Model 3
4 PLS components

Boundary Subset & K-S Subset (SIC prediction)

Boundary Subset & K-S Subset (PLS models)
Model 2 (Boundary set), Test set: RMSEC=0.461, RMSEP=0.357, Mean(Cal. Leverage)=0.26, Maximum(Cal. Leverage)=0.45
Model 3 (K-S set), Test set: RMSEC=0.229, RMSEP=0.368, Mean(Cal. Leverage)=0.26, Maximum(Cal. Leverage)=0.73

‘Redundant samples’
Training set, N=99: Model 1 (4 PLS components, bsic=1.5)
Boundary set, L=19: Model 2; Redundant Set (RS_2), N-L=80
Kennard-Stone set, L=19: Model 3; Redundant Set (RS_3), N-L=80
Test set, N1=40

Prediction of Redundant Sets
Model 2 (Boundary set), RS_2: RMSEP=0.267
Model 3 (K-S set), RS_3: RMSEP=0.338

Model comparison (on average)
Entire data set: 139 objects
Random split, repeated 10 times: training set 99 objects, test set 40 objects
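The repeated random splitting can be sketched as below. The splitting scheme (139 objects, 99 for training, 10 repeats) follows the slide; the `evaluate` callback is a hypothetical placeholder for whatever is computed per split (for example, the RMSEP of a model built on the training objects), and the seed is an arbitrary choice for reproducibility.

```python
import random

def random_split(n_objects, n_train, rng):
    """Randomly split object indices into a training part and a test part."""
    indices = list(range(n_objects))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

def average_over_splits(n_objects, n_train, n_repeats, evaluate, seed=0):
    """Average the value returned by evaluate(train, test) over random splits."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        train, test = random_split(n_objects, n_train, rng)
        results.append(evaluate(train, test))
    return sum(results) / len(results)

# Placeholder evaluation: a real run would fit a model on `train` and
# return its prediction error on `test`. Here it only reports the test size.
mean_test_size = average_over_splits(
    139, 99, 10, lambda train, test: float(len(test))
)
print(mean_test_size)
```

Averaging over several splits reduces the dependence of the comparison on one particular choice of training and test objects.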

Questions
• Prediction ability: how to evaluate it?
• Representativity: how to verify it?

Conclusions
• The model constructed with the help of the Boundary Subset can predict all other samples with an accuracy that is not worse than the calibration error evaluated on the whole data set.
• The Boundary Subset is indeed significantly smaller than the whole Training Set.
