Subset Selection Problem

Presentation Transcript


  1. Subset Selection Problem. Oxana Rodionova & Alexey Pomerantsev, Semenov Institute of Chemical Physics, Russian Chemometric Society, Moscow

  2. Outline • Introduction: what is a representative subset? • Training set and test set • Influential subset selection: boundary subset, Kennard-Stone subset, models' comparison • Conclusions

  3. What is a representative subset? (Diagram: the full data X, Y and a model Y(X), with candidate subsets X_I, X_II/Y_II, X_III/Y_III.)

  4. Influential Subset. Training set X (n×m), Y (n×k) → Model I (A factors), RMSEP1. Influential subset X (l×m), Y (l×k), with l < n → Model II (A factors), RMSEP2. Question: is the quality of prediction of Model II comparable to that of Model I (Model 2 ~ Model 1)?
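
A minimal sketch of this comparison in Python, assuming scikit-learn's PLSRegression and hypothetical arrays X_train, Y_train, X_test, Y_test plus an index array subset_idx for the influential subset (none of these names come from the slides):

```python
# Sketch: compare the prediction quality of a model built on the full training
# set with a model built on an influential subset (hypothetical data/indices).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def rmsep(model, X, y):
    """Root mean square error of prediction on an independent test set."""
    return np.sqrt(np.mean((np.ravel(y) - np.ravel(model.predict(X))) ** 2))

A = 4  # number of PLS factors, as in the wheat example below

# Model I: fit on all n training objects
model_1 = PLSRegression(n_components=A).fit(X_train, Y_train)

# Model II: fit only on the l < n objects of the influential subset
model_2 = PLSRegression(n_components=A).fit(X_train[subset_idx], Y_train[subset_idx])

print("RMSEP1 =", rmsep(model_1, X_test, Y_test))
print("RMSEP2 =", rmsep(model_2, X_test, Y_test))
```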

  5. Training and Test Sets. The entire data set (K objects) is split into a training set (N objects) and a test set (K - N objects).

  6. Statistical Tests of representativity: the Hotelling T²-test (similar position of the two sets in space) plus a generalization of Bartlett's test (similar orientation of the data clouds and dispersion around their means). D. Jouan-Rimbaud, D.L. Massart, C.A. Saby, C. Puel, Characterisation of the representativity of selected sets of samples in multivariate calibration and pattern recognition, Analytica Chimica Acta 350 (1997) 149-161.
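
A rough illustration of the "similar position in space" check: a generic two-sample Hotelling T² test (not necessarily the exact procedure of the cited paper), assuming hypothetical score matrices X1 and X2 for the subset and the remaining objects:

```python
# Sketch: two-sample Hotelling T^2 test for equality of multivariate means.
import numpy as np
from scipy.stats import f

def hotelling_t2(X1, X2):
    n1, p = X1.shape
    n2, _ = X2.shape
    mean_diff = X1.mean(axis=0) - X2.mean(axis=0)
    # Pooled covariance of the two groups
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
                (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * mean_diff @ np.linalg.solve(S_pooled, mean_diff)
    # Convert T^2 to an F statistic and get a p-value
    f_stat = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * t2
    p_value = 1 - f.cdf(f_stat, p, n1 + n2 - p - 1)
    return t2, p_value
```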

  7. RPV: Influential Subset → Boundary Samples

  8. Whole Wheat Samples (data description). X: NIR spectra of whole wheat (118 wavelengths); Y: moisture content. Entire set N = 139, data pre-processed. Training set = 99 objects, test set = 40 objects. PLS model, 4 PCs; SIC modeling, bsic = 1.5.
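
A minimal setup sketch matching this layout, assuming hypothetical arrays spectra (139 × 118) and moisture (length 139); the preprocessing used on the slides is not specified, so only scikit-learn's default centering and scaling inside PLSRegression is applied here:

```python
# Sketch: reproduce the data layout of the wheat example
# (139 NIR spectra x 118 wavelengths, moisture content as response).
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

# spectra: (139, 118) array, moisture: (139,) vector -- hypothetical inputs
X_train, X_test, y_train, y_test = train_test_split(
    spectra, moisture, train_size=99, test_size=40, random_state=0)

pls = PLSRegression(n_components=4)  # 4 PCs, as on the slide
pls.fit(X_train, y_train)
```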

  9. Boundary samples. Training set (n×m), n = 99 → Model 1. The training set splits into a boundary subset (l = 19) and a 'redundant subset' (n - l = 80).

  10. Boundary Subset. Training set (n = 99) → Model 1; boundary subset (l = 19) → Model 2. Both models: 4 PLS components, bsic = 1.5; both are applied to the test set.

  11. SIC prediction of the test set: Model 1 (training set) vs. Model 2 (boundary subset). (Prediction plots.)

  12. Quality of prediction (PLS models)? Model 1 (training set), test set: RMSEC = 0.303, RMSEP = 0.337, mean calibration leverage = 0.051, maximum calibration leverage = 0.25. Model 2 (boundary set), test set: RMSEC = 0.461, RMSEP = 0.357, mean calibration leverage = 0.26, maximum calibration leverage = 0.45.
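
These statistics can be reproduced roughly as follows (a sketch, assuming a fitted scikit-learn PLSRegression object pls and the hypothetical arrays from the sketches above; leverages are computed from the X-scores with the usual formula h_i = 1/n + t_i' (T'T)^-1 t_i):

```python
# Sketch: RMSEC, RMSEP and calibration leverages for a fitted PLS model.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.ravel(y_true) - np.ravel(y_pred)) ** 2))

rmsec = rmse(y_train, pls.predict(X_train))   # error of calibration
rmsep = rmse(y_test, pls.predict(X_test))     # error of prediction

T = pls.x_scores_                             # calibration scores, shape (n, A)
H = T @ np.linalg.inv(T.T @ T) @ T.T          # hat matrix in score space
leverage = 1 / len(T) + np.diag(H)            # leverage of each calibration object

print(rmsec, rmsep, leverage.mean(), leverage.max())
```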

  13. Kennard-Stone Method. Aim: select samples that are uniformly distributed over the predictors' space. Objects are chosen sequentially in X or T space; d_jr, j = 1, ..., k, is the squared Euclidean distance from candidate object r to the k objects already in the subset, and the candidate whose smallest d_jr is largest is added next.
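
A minimal sketch of that selection rule (the standard max-min form of Kennard-Stone, applied here in X space; X and n_select are hypothetical inputs):

```python
# Sketch of the Kennard-Stone selection rule (max-min criterion) in X space.
import numpy as np
from scipy.spatial.distance import cdist

def kennard_stone(X, n_select):
    d = cdist(X, X, metric="sqeuclidean")      # squared Euclidean distances d_jr
    # Start from the two mutually most distant objects
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [i, j]
    while len(selected) < n_select:
        remaining = [r for r in range(len(X)) if r not in selected]
        # Smallest distance of each candidate to the current subset ...
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        # ... and the candidate that maximizes it is added next
        selected.append(remaining[int(np.argmax(min_d))])
    return np.array(selected)

# Example: pick 19 objects from the 99-object training set, as on the slides
# subset_idx = kennard_stone(X_train, 19)
```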

  14. Kennard-Stone Subset. Training set (n = 99) → Model 1; boundary subset → Model 2; K-S subset (l = 19) → Model 3. All models: 4 PLS components.

  15. Boundary Subset & K-S Subset (SIC prediction)

  16. Boundary Subset & K-S Subset (PLS models). Model 2 (boundary set), test set: RMSEC = 0.461, RMSEP = 0.357, mean calibration leverage = 0.26, maximum calibration leverage = 0.45. Model 3 (K-S set), test set: RMSEC = 0.229, RMSEP = 0.368, mean calibration leverage = 0.26, maximum calibration leverage = 0.73.

  17. 'Redundant samples'. Training set (N = 99) → Model 1 (PLS components = 4, bsic = 1.5). Boundary set (L = 19) → Model 2, leaving redundant set RS_2 (N - L = 80); Kennard-Stone set (L = 19) → Model 3, leaving redundant set RS_3 (N - L = 80). Test set: N1 = 40.

  18. Prediction of Redundant Sets. Model 2 (boundary set) applied to RS_2: RMSEP = 0.267; Model 3 (K-S set) applied to RS_3: RMSEP = 0.338.

  19. Model comparison. The entire data set (139 objects) was split randomly into a training set (99 objects) and a test set (40 objects) 10 times, and the results were averaged.
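
A sketch of such an averaging procedure, again using the hypothetical spectra and moisture arrays (the random splits here are illustrative, not the ones used by the authors):

```python
# Sketch: repeat the 99/40 random split 10 times and average the test RMSEP.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rmseps = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        spectra, moisture, train_size=99, test_size=40, random_state=seed)
    model = PLSRegression(n_components=4).fit(X_tr, y_tr)
    resid = np.ravel(y_te) - np.ravel(model.predict(X_te))
    rmseps.append(np.sqrt(np.mean(resid ** 2)))

print("average RMSEP over 10 random splits:", np.mean(rmseps))
```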

  20. Questions • Prediction ability: how can it be evaluated? • Representativity: how can it be verified? Conclusions • The model constructed with the help of the Boundary Subset predicts all other samples with an accuracy no worse than the calibration error evaluated on the whole data set. • The Boundary Subset is indeed significantly smaller than the whole Training Set.
