4th Iranian Chemometrics Workshop (ICW), Zanjan, 2004
The Problem of Factor Selection in PCA-Based Calibration Methods
By: Bahram Hemmateenejad, Medicinal & Natural Products Chemistry Research Center, Shiraz University of Medical Science
Multivariate Calibration
A regression equation relates measurements on m samples to k different variables:
$y = X b$
• $y$ ($m \times 1$): dependent (predicted) variable
• $X$ ($m \times k$): independent (predictor) variables
• $b$ ($k \times 1$): regression coefficients
Multicomponent Analysis
• y: concentration of the analyte
• X: analytical signals recorded at k different channels, e.g. absorbance at different wavelengths
QSAR/QSPR Studies
• y: chemical property or biological activity
• X: molecular descriptors that represent structural features of the molecules numerically
Problems associated with MLR
• Collinearity between the independent variables (X)
• The number of independent variables (k) should be much lower than the number of samples (m)
A reduced number of variables must therefore be used
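Both problems can be seen numerically in the matrix $X^T X$ that MLR has to invert. The following is a minimal sketch on synthetic data; the variable names and the noise level are illustrative assumptions, not values from the workshop.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 30  # number of samples

x1 = rng.normal(size=m)
x2 = x1 + 1e-4 * rng.normal(size=m)  # nearly a copy of x1 (collinear)
x3 = rng.normal(size=m)

X_collinear = np.column_stack([x1, x2, x3])
X_independent = np.column_stack([x1, x3])

# Condition number of X^T X: a large value means its inverse is numerically unstable
cond_collinear = np.linalg.cond(X_collinear.T @ X_collinear)
cond_independent = np.linalg.cond(X_independent.T @ X_independent)

# With more variables than samples (k > m), X^T X is rank deficient
X_wide = rng.normal(size=(5, 8))  # m = 5 samples, k = 8 variables
rank_wide = np.linalg.matrix_rank(X_wide.T @ X_wide)

print(f"collinear: {cond_collinear:.3g}, independent: {cond_independent:.3g}")
print(f"rank of the 8 x 8 matrix X^T X built from 5 samples: {rank_wide}")
```

The collinear design gives a condition number many orders of magnitude larger than the independent one, and the wide design cannot be inverted at all because its rank is below k.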
Feature selection
• The variables are selected based on their generalization ability, using selection methods such as stepwise variable selection, genetic algorithms, simulated annealing, ...
Feature extraction
• The variables are transformed into new coordinate axes of lower dimension
• Principal Component Analysis (PCA) or Factor Analysis (FA)
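PCA or FA or PFA
$X = T P$, with $X$ ($m \times k$), $T$ ($m \times k$), $P$ ($k \times k$)
• Scores: $T = [t_1 \; t_2 \; t_3 \; \dots \; t_k]$
• Loadings: $P^T = [p_1^T \; p_2^T \; p_3^T \; \dots \; p_k^T]$
• Eigenvalues: $\lambda = [\lambda_1 \; \lambda_2 \; \lambda_3 \; \dots \; \lambda_k]$, with $\lambda_1 > \lambda_2 > \lambda_3 > \dots > \lambda_k$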
• Each vector of T or P is called an eigenvector, principal component (PC), or factor
• $\lambda_i$ shows the amount of variance in the X matrix that is explained by the corresponding eigenvector ($t_i$ or $p_i$)
• Only a reduced set of PCs is needed to reproduce the original data matrix without losing significant information:
$X$ ($m \times k$) $\approx$ $T$ ($m \times f$) $P$ ($f \times k$)
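The decomposition and its truncation to f factors can be sketched with a singular value decomposition. This is an illustration under assumptions: the data are synthetic, no mean-centering or scaling is applied (the slides do not state a preprocessing), and `pca_decompose` is a hypothetical helper name.

```python
import numpy as np

def pca_decompose(X):
    """Return scores T (columns t_i), loadings P (rows p_i) and the
    eigenvalues of X^T X in decreasing order, so that X = T @ P."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U * s         # scores
    P = Vt            # loadings, orthonormal rows
    eigvals = s ** 2  # eigenvalues of X^T X
    return T, P, eigvals

rng = np.random.default_rng(1)
m, k, f = 20, 8, 3
# Synthetic rank-3 data plus a small amount of noise
X = rng.normal(size=(m, f)) @ rng.normal(size=(f, k))
X += 0.01 * rng.normal(size=(m, k))

T, P, eigvals = pca_decompose(X)
X_hat = T[:, :f] @ P[:f, :]                     # (m x f)(f x k) reconstruction
explained = eigvals[:f].sum() / eigvals.sum()   # variance captured by f factors
print(f"fraction of variance explained by {f} factors: {explained:.6f}")
```

Keeping all k factors reproduces X exactly, while the first f factors already carry almost all of the variance in this example.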
• f is the number of significant factors
• f is the rank of the original data matrix
• f describes the complexity of the X matrix
• Ideally, f is the number of nonzero eigenvalues
• f can be determined by the theory of FA: scree plot, indicator function, imbedded error, real error, ...
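Two of the criteria named above, the real error (RE) and the indicator function (IND), can be computed directly from the eigenvalues. The sketch below uses the forms usually attributed to Malinowski, $RE(n) = \sqrt{\sum_{j=n+1}^{k} \lambda_j / (m (k - n))}$ and $IND(n) = RE(n) / (k - n)^2$, which the slides do not spell out. It assumes $m \ge k$ and eigenvalues of $X^T X$ without centering; the data and the function name are illustrative.

```python
import numpy as np

def real_error_and_ind(eigvals, m, k):
    """RE(n) and IND(n) for n = 1..k-1, assuming m >= k and that eigvals
    holds the k eigenvalues of X^T X in decreasing order."""
    n = np.arange(1, k)
    # Sum of the discarded eigenvalues lambda_{n+1} .. lambda_k
    residual = np.array([eigvals[j:].sum() for j in n])
    re = np.sqrt(residual / (m * (k - n)))
    ind = re / (k - n) ** 2
    return n, re, ind

rng = np.random.default_rng(2)
m, k, f_true = 20, 8, 3
X = rng.normal(size=(m, f_true)) @ rng.normal(size=(f_true, k))
X += 0.01 * rng.normal(size=(m, k))  # noise standard deviation 0.01

eigvals = np.linalg.svd(X, compute_uv=False) ** 2
n, re, ind = real_error_and_ind(eigvals, m, k)
f_est = int(n[np.argmin(ind)])  # IND is expected to reach its minimum at n = f
print("estimated number of factors:", f_est)
```

In this kind of test data RE drops sharply once all significant factors are included and then levels off near the noise level, which is the same information a scree plot shows graphically.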
PCA-Based regression method
MLR (Classical Least Squares)
• $y = X b$
• $b = (X^T X)^{-1} X^T y$
• $y_{new} = x_{new} b$
Principal Component Regression (PCR)
• $X = T P$
• $y = T b$
• $b = (T^T T)^{-1} T^T y$
• $t_{new} = x_{new} P^T$
• $y_{new} = t_{new} b$
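The PCR equations translate almost line by line into code. The following is a minimal sketch rather than a reference implementation: it keeps the first f factors, takes the loadings as rows of P so that a new sample is projected with $x_{new} P^T$, applies no mean-centering, and uses simulated three-component mixture data. `pcr_fit` and `pcr_predict` are hypothetical names.

```python
import numpy as np

def pcr_fit(X, y, f):
    """Fit PCR with the first f factors.
    Returns the loadings P (f x k) and the regression coefficients b (f,)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:f, :]                          # rows are the loading vectors p_i
    T = X @ P.T                            # scores, m x f
    b = np.linalg.solve(T.T @ T, T.T @ y)  # b = (T^T T)^-1 T^T y
    return P, b

def pcr_predict(x_new, P, b):
    t_new = x_new @ P.T                    # project new sample(s) onto the loadings
    return t_new @ b                       # y_new = t_new b

rng = np.random.default_rng(3)
m, k, f = 25, 10, 3
C = rng.uniform(0.1, 1.0, size=(m, f))    # simulated concentrations of 3 components
S = rng.uniform(0.0, 1.0, size=(f, k))    # simulated pure-component spectra
X = C @ S + 0.001 * rng.normal(size=(m, k))
y = C[:, 0]                               # analyte of interest

P, b = pcr_fit(X[:20], y[:20], f)         # calibrate on the first 20 samples
y_pred = pcr_predict(X[20:], P, b)        # predict the remaining 5
rmse = np.sqrt(np.mean((y_pred - y[20:]) ** 2))
print(f"prediction RMSE: {rmse:.4f}")
```

Because the scores are orthogonal, $T^T T$ is diagonal and well conditioned, which is what removes the collinearity problem of MLR.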
Significance of factor selection
Some Questions
• How many PCs must be used in PCR?
• Which PCs should be considered in PCR modeling?
• Is the magnitude of an eigenvalue necessarily a measure of its significance for the calibration?
Top-down eigenvalue ranking (ER)
• Factors are entered into the model one after another, in order of decreasing eigenvalue
• Each time a new factor is entered, the regression model is built and its performance is validated by an established procedure such as cross-validation
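The ER procedure can be sketched as a loop over the number of factors, scored here by the leave-one-out prediction error sum of squares (PRESS). This is an illustration under assumptions: the data are synthetic, and for brevity the scores are computed once from the full X instead of being recomputed inside every cross-validation split, which a stricter validation would do. Function names are hypothetical.

```python
import numpy as np

def loo_press(T, y):
    """Leave-one-out PRESS for the least-squares model y ~ T b."""
    press = 0.0
    idx = np.arange(len(y))
    for i in idx:
        keep = idx != i
        b, *_ = np.linalg.lstsq(T[keep], y[keep], rcond=None)
        press += float((y[i] - T[i] @ b) ** 2)
    return press

def eigenvalue_ranking(X, y, max_f):
    """Enter factors by decreasing eigenvalue; return the best f and all PRESS values."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    T = U * s  # score columns are already sorted by decreasing eigenvalue
    press = [loo_press(T[:, :f], y) for f in range(1, max_f + 1)]
    best_f = int(np.argmin(press)) + 1
    return best_f, press

rng = np.random.default_rng(4)
m, k = 25, 10
C = rng.uniform(0.1, 1.0, size=(m, 3))  # three simulated components
S = rng.uniform(0.0, 1.0, size=(3, k))
X = C @ S + 0.001 * rng.normal(size=(m, k))
y = C[:, 0] + 0.001 * rng.normal(size=m)

best_f, press = eigenvalue_ranking(X, y, max_f=8)
print("number of factors with the lowest PRESS:", best_f)
```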
Top-down correlation ranking (CR)
• First, the correlation between each factor and the dependent variable (concentration, y) is determined
• Then the factors are entered into the model consecutively, in order of decreasing correlation
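A short sketch of the ranking step follows. It assumes that factors are ordered by the absolute value of their Pearson correlation with y, since the sign of a score vector is arbitrary; the slides only say "decreasing correlation". The data are constructed so that y depends on the factor with the smallest significant eigenvalue, which is the case where CR and ER disagree.

```python
import numpy as np

def correlation_ranking(T, y):
    """Order the factors by decreasing |correlation| between each score vector and y."""
    r = np.array([np.corrcoef(T[:, i], y)[0, 1] for i in range(T.shape[1])])
    order = np.argsort(-np.abs(r))
    return order, r

rng = np.random.default_rng(5)
m, k = 40, 6
Q, _ = np.linalg.qr(rng.normal(size=(m, 3)))  # three orthonormal latent factors
S, _ = np.linalg.qr(rng.normal(size=(k, 3)))  # orthonormal loading directions
# The factors contribute to X with very different magnitudes: 10, 3 and 1
X = (Q * np.array([10.0, 3.0, 1.0])) @ S.T + 0.01 * rng.normal(size=(m, k))
y = Q[:, 2] + 0.01 * rng.normal(size=m)       # y follows the weakest factor

U, s, _ = np.linalg.svd(X, full_matrices=False)
T = U * s                                      # scores in ER order
order, r = correlation_ranking(T, y)
print("CR order of the factors (0-based):", order)
```

ER would enter the third factor only after the two larger ones, while CR places it first. The factors are then added in the CR order and validated exactly as in the ER procedure.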
Other factor selection methods
• Stepwise selection procedure
• Search algorithms
• Simulated annealing
• Genetic algorithm
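As one example of these alternatives, a forward stepwise selection of factors can be sketched as a greedy search. This is a simplified illustration, not the procedure used in the cited studies: it only adds factors (no removal step), uses leave-one-out PRESS as the criterion, and stops when no remaining factor lowers it. The names and data are hypothetical.

```python
import numpy as np

def loo_press(T, y):
    """Leave-one-out PRESS for the least-squares model y ~ T b."""
    press = 0.0
    idx = np.arange(len(y))
    for i in idx:
        keep = idx != i
        b, *_ = np.linalg.lstsq(T[keep], y[keep], rcond=None)
        press += float((y[i] - T[i] @ b) ** 2)
    return press

def forward_select(T, y):
    """Greedily add the factor that lowers PRESS the most; stop when none helps."""
    n_pc = T.shape[1]
    selected, best_press = [], np.inf
    while len(selected) < n_pc:
        candidates = [j for j in range(n_pc) if j not in selected]
        trial = {j: loo_press(T[:, selected + [j]], y) for j in candidates}
        j_best = min(trial, key=trial.get)
        if trial[j_best] >= best_press:
            break  # no candidate improves the cross-validated error
        selected.append(j_best)
        best_press = trial[j_best]
    return selected, best_press

rng = np.random.default_rng(6)
m, k = 40, 6
Q, _ = np.linalg.qr(rng.normal(size=(m, 3)))
S, _ = np.linalg.qr(rng.normal(size=(k, 3)))
X = (Q * np.array([10.0, 3.0, 1.0])) @ S.T + 0.01 * rng.normal(size=(m, k))
y = Q[:, 2] + 0.01 * rng.normal(size=m)  # y follows the weakest factor

U, s, _ = np.linalg.svd(X, full_matrices=False)
selected, best_press = forward_select(U * s, y)
print("selected factors (0-based):", selected)
```

Simulated annealing and genetic algorithms search the same space of factor subsets but can escape the local optima that a greedy search may get stuck in.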
Some references
• Xie YL, Kalivas JH. Evaluation of principal component selection methods to form a global prediction model by principal component regression. Anal. Chim. Acta 1997; 348: 19-27.
• Sutter JM, Kalivas JH. Which principal components to utilize for principal component regression. J. Chemometrics 1992; 6: 217-225.
• Sun J. A correlation principal component regression analysis of NIR data. J. Chemometrics 1995; 9: 21-29.
• Depczynski U, Frost VJ, Molt K. Genetic algorithms applied to the selection of factors in principal component regression. Anal. Chim. Acta 2000; 420: 217-227.
• Barros AS, Rutledge DN. Genetic algorithm applied to the selection of principal components. Chemometrics Intell. Lab. Syst. 1998; 40: 65-81.
• Verdu-Andres J, Massart DL. Comparison of prediction- and correlation-based methods to select the best subset of principal components for principal component regression and detect outlying objects. Appl. Spect. 1998; 52: 1425-1434.
• Xie YL, Kalivas JH. Local prediction models by principal component regression. Anal. Chim. Acta 1997; 348: 29-38.
• Ferre L. Selection of components in principal component analysis: a comparison of methods. Comput. Stat. Data Anal. 1995; 19: 669-682.
A QSPR example
• Quantitative Structure-Electrochemistry Relationship Study of Some Organic Compounds
• Dependent variable: half-wave reduction potential ($E_{1/2}$) of 69 compounds
• Independent variables: 1150 theoretical molecular descriptors calculated by DRAGON software
Principal Component-Artificial Neural Network (PC-ANN)
• ANN is a nonlinear, non-parametric modeling method
• Feature selection is more important for ANN
• Feature selection-based ANN modeling is a complex procedure
• Orthogonalizing the variables before introducing them to the network substantially decreases the computational time and increases the overall performance of the ANN
• PC-ANN is a feature extraction-based algorithm
PC-GA-ANN Algorithm
• Genetic algorithm applied to the selection of factors in PC-ANN modeling
• The set of PCs selected by GA could model the structure-antagonist activity of the calcium channel blockers better than the ER procedure
• B. Hemmateenejad, M. Akhond, R. Miri, M. Shamsipur, J. Chem. Inf. Comput. Sci. 43 (2003) 1328.
• How are the factors ranked based on their correlation coefficient in PC-ANN?
CR-PC-ANN Algorithm
• Correlation ranking procedure for factor selection in PC-ANN modeling
• The nonlinear relationship between each PC and the dependent variable (y) was modeled by a separate ANN model
• The subset of PCs selected by CR was nearly the same as the subset selected by GA, so the two factor selection procedures gave similar results
• B. Hemmateenejad, Chemometrics and Intelligent Laboratory Systems, 2004, Accepted.
• Application of ab initio theory to QSAR study of the 1,4-dihydropyridine-based calcium channel blockers using GA-MLR and PC-GA-ANN procedures, B. Hemmateenejad, M.A. Safarpour, R. Miri, F. Taghavi, Journal of Computational Chemistry 25 (2004) 1495.
• Highly Correlating Distance-Connectivity-Based Topological Indices. 2: Prediction of 15 Properties of a Large Set of Alkanes Using a Stepwise Factor Selection-Based PCR Analysis, M. Shamsipur, R. Ghavami, B. Hemmateenejad, H. Sharghi, QSAR & Combinatorial Science, 2004, Accepted.
• Quantitative Structure-Electrochemistry Relationship Study of some Organic Compounds using PCR and PC-ANN, B. Hemmateenejad, M. Shamsipur, Internet Electronic Journal of Molecular Design 3 (2004) 316.
• Toward an Optimal Procedure for PC-ANN Model Building: Prediction of the Carcinogenic Activity of a Large Set of Drugs, B. Hemmateenejad, M.A. Safarpour, R. Miri, N. Nesari, Journal of Chemical Information and Computer Sciences, Revised.
• Optimal QSAR analysis of the carcinogenic activity of drugs by correlation ranking and genetic algorithm-based PCR, B. Hemmateenejad, Journal of Chemometrics, Submitted.
Future Works
• Selection of latent variables in PLS
• Application of other selection algorithms, such as the successive projections algorithm
• Comparison of the importance of factor selection in multicomponent analysis and in QSAR/QSPR studies
• Application of factor selection-based ANN modeling in multicomponent analysis
• Validation of the different factor selection algorithms by new criteria