
Journal Club Journal of Chemometrics May 2010

Journal Club, Journal of Chemometrics, May 2010. August 23, 2010. An efficient nonlinear programming strategy for PCA models with incomplete data sets. Rodrigo López-Negrete de la Fuente, Salvador García-Muñoz and Lorenz T. Biegler. J. Chemometrics 2010; 24: 301–311.



Presentation Transcript


  1. Journal Club: Journal of Chemometrics, May 2010. August 23, 2010

  2. An efficient nonlinear programming strategy for PCA models with incomplete data sets. Rodrigo López-Negrete de la Fuente, Salvador García-Muñoz and Lorenz T. Biegler. J. Chemometrics 2010; 24: 301–311

  3. Questions addressed: • How can the parameters of PCA models be obtained from incomplete data sets using a nonlinear programming strategy? • How is the nonlinear programming approach better suited when there are large amounts of missing values?

  4. Methods • PCA with full data-set: • Given the PCA model of the data matrix X, where T: scores (projections), P: loadings, R_X: residuals. Problem 1 (maximization): Solution: the first loading vector is the eigenvector associated with the largest eigenvalue of the covariance matrix of X. The largest eigenvalues (shared by X'X and XX') correspond to the directions of largest variance. Problem 2 (minimization): Solution: has the same form as the solution of the maximization problem.
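
The equations on this slide did not survive the transcript. A sketch of the standard formulation that matches the symbols named above, assuming X is mean-centered and p is the first loading vector; the notation in the paper may differ:

```latex
% PCA model (X mean-centered; notation assumed)
X = T P^{\mathsf T} + R_X

% Problem 1: direction of maximum variance
\max_{p}\; p^{\mathsf T} X^{\mathsf T} X\, p \quad \text{s.t.}\quad p^{\mathsf T} p = 1

% Problem 2: minimum residual for one component
\min_{t,\,p}\; \lVert X - t\, p^{\mathsf T} \rVert_F^2 \quad \text{s.t.}\quad p^{\mathsf T} p = 1

% Stationarity of the Lagrangian of Problem 1
X^{\mathsf T} X\, p = \lambda\, p, \qquad p^{\mathsf T} X^{\mathsf T} X\, p = \lambda
```

The last line shows why the maximizer is the eigenvector with the largest eigenvalue: at any stationary point the objective equals λ.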

  5. Methods • PCA with full data-set: • Using SVD: the two problems have the same solution.
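
A small numerical check of this statement, as a sketch on random data (the data, seed and variable names are illustrative assumptions, not from the slides): the leading eigenvector of X'X and the first right singular vector of X agree up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X = X - X.mean(axis=0)                 # mean-center the columns

# Route 1: eigenvector of X'X with the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
p_eig = eigvecs[:, -1]

# Route 2: first right singular vector of X
U, s, Vt = np.linalg.svd(X, full_matrices=False)
p_svd = Vt[0]

# The two loadings agree up to an arbitrary sign
agree = np.isclose(abs(p_eig @ p_svd), 1.0)
# The largest eigenvalue of X'X is the squared largest singular value
largest_eig_matches = np.isclose(eigvals[-1], s[0] ** 2)
print(agree, largest_eig_matches)
```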

  6. Methods • PCA with full data-set: • Principal Components via NIPALS algorithm:
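
The algorithm listing on this slide was not captured. A minimal sketch of the textbook NIPALS iteration for complete data, assuming mean-centering, a relative convergence test on the score vector, and deflation after each component; the function name, tolerance and starting vector are illustrative choices, not taken from the paper:

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """Textbook NIPALS for complete data (sketch). Returns scores T, loadings P."""
    X = np.asarray(X, dtype=float)
    R = X - X.mean(axis=0)                      # residual matrix, starts as centered X
    n, m = R.shape
    T = np.zeros((n, n_components))
    P = np.zeros((m, n_components))
    for a in range(n_components):
        # start from the column with the largest variance
        t = R[:, np.argmax(R.var(axis=0))].copy()
        for _ in range(max_iter):
            p = R.T @ t / (t @ t)               # regress columns of R on t
            p /= np.linalg.norm(p)              # normalize the loading
            t_new = R @ p                       # regress rows of R on p
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a] = t
        P[:, a] = p
        R = R - np.outer(t, p)                  # deflate before the next component
    return T, P
```

With complete data the loadings returned by this loop are orthonormal and match the SVD loadings up to sign.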

  7. Methods PCA with incomplete data-set: Taking the gradient with respect to t and p
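
The slide's equations are missing from the transcript. A sketch of the usual one-component derivation, assuming a 0/1 mask m_{ij} (1 where x_{ij} is observed) as illustrative notation that does not appear in the slides:

```latex
% Least-squares objective over the observed entries only
f(t, p) = \sum_{i}\sum_{j} m_{ij}\,\bigl(x_{ij} - t_i\, p_j\bigr)^2

% Gradient with respect to each score t_i, set to zero
\frac{\partial f}{\partial t_i} = -2 \sum_{j} m_{ij}\, p_j \bigl(x_{ij} - t_i\, p_j\bigr) = 0
\;\;\Longrightarrow\;\;
t_i = \frac{\sum_{j} m_{ij}\, x_{ij}\, p_j}{\sum_{j} m_{ij}\, p_j^{2}}

% Gradient with respect to each loading p_j, set to zero
\frac{\partial f}{\partial p_j} = -2 \sum_{i} m_{ij}\, t_i \bigl(x_{ij} - t_i\, p_j\bigr) = 0
\;\;\Longrightarrow\;\;
p_j = \frac{\sum_{i} m_{ij}\, x_{ij}\, t_i}{\sum_{i} m_{ij}\, t_i^{2}}
```

When every m_{ij} = 1 these reduce to the complete-data NIPALS updates.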

  8. Methods • PCA with incomplete data-set: • Principal Components via modified NIPALS algorithm:
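
The modified algorithm listing was not captured either. A hedged sketch of the common missing-data variant of NIPALS, in which each regression sums only over observed entries; it assumes missing values are coded as NaN and every column has at least one observed value, and the names and tolerances are illustrative:

```python
import numpy as np

def nipals_missing(X, n_components, tol=1e-9, max_iter=1000):
    """NIPALS with missing entries skipped in each regression (sketch)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    M = observed.astype(float)                       # 1 where observed, 0 where missing
    col_means = np.nanmean(X, axis=0)                # means over observed values only
    R = np.where(observed, X - col_means, 0.0)       # centered, missing zeroed out
    n, m = R.shape
    T = np.zeros((n, n_components))
    P = np.zeros((m, n_components))
    for a in range(n_components):
        t = R[:, np.argmax((R ** 2).sum(axis=0))].copy()
        for _ in range(max_iter):
            # p_j = sum_i m_ij x_ij t_i / sum_i m_ij t_i^2
            p = (R.T @ t) / np.maximum(M.T @ (t ** 2), 1e-12)
            p /= np.linalg.norm(p)
            # t_i = sum_j m_ij x_ij p_j / sum_j m_ij p_j^2
            t_new = (R @ p) / np.maximum(M @ (p ** 2), 1e-12)
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a] = t
        P[:, a] = p
        R = np.where(observed, R - np.outer(t, p), 0.0)  # deflate observed entries
    return T, P
```

Nothing in this loop enforces orthogonality between components once entries are missing; each loading is only normalized to unit length.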

  9. Methods PCA with incomplete data-set: - X is the data matrix in which the missing elements have been zeroed out. - Constraint (20b) forces the loadings to be orthonormal. - Constraint (20c) makes the score vectors orthogonal to each other. - Constraint (20d) forces the scores to have zero mean. - If there are no missing values, problem (20) reduces to problem (4) (the minimization problem) for the first a principal components. Let Y_{i,j} = X_{i,j} + Z_{i,j}, where X_{i,j} are the data values (set to zero for the missing elements) and Z_{i,j} are the imputed values, which must be zero for the nonmissing elements.
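
Problem (20) itself is not reproduced in the transcript. As a sketch only, the quantities described above can be evaluated for a candidate T and P as follows. The function name and the mask argument M are hypothetical, and the imputed values Z are assumed to be the model prediction on the missing entries, which is the value that minimizes the residual over Z for fixed T and P:

```python
import numpy as np

def pca_nlp_terms(X0, M, T, P):
    """Objective and constraint residuals for a candidate (T, P) (sketch).

    X0: data with missing elements zeroed out; M: 1 where observed, 0 where missing;
    T: n x a scores; P: m x a loadings.
    """
    fit = T @ P.T
    Z = (1.0 - M) * fit                  # imputed values, zero for nonmissing elements
    Y = X0 + Z                           # completed data matrix
    objective = np.sum((Y - fit) ** 2)   # only observed entries contribute

    a = P.shape[1]
    loadings_orthonormal = P.T @ P - np.eye(a)         # zero when P'P = I
    TtT = T.T @ T
    scores_orthogonal = TtT - np.diag(np.diag(TtT))    # zero when T'T is diagonal
    scores_zero_mean = T.mean(axis=0)                  # zero when scores are centered
    return objective, loadings_orthonormal, scores_orthogonal, scores_zero_mean
```

For complete, mean-centered data the truncated SVD solution makes all three constraint residuals vanish, which is consistent with the statement that the problem reduces to the minimization problem when nothing is missing.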

  10. Methods PCA with incomplete data-set: Because the constrained problem is solved directly, the scores and loadings obtained with the NLP are orthogonal, as the PCA model requires; this is not true for the modified NIPALS.

  11. RESULTS Numerical simulations were done by generating a data set with 1000 rows and 100 columns from a known four-dimensional latent space with added random Gaussian noise. Values were then removed to generate data sets with missing value percentages ranging from 1 to 70%.
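
A sketch of how such a test set could be generated. The slide gives only the dimensions, the latent dimension and the range of missing percentages; the noise level, the seed, the standard-normal scores and loadings, and removal uniformly at random are all assumptions made here for illustration:

```python
import numpy as np

def simulate_incomplete(n_rows=1000, n_cols=100, n_latent=4,
                        noise_sd=0.1, missing_fraction=0.3, seed=0):
    """Low-rank data plus Gaussian noise, with entries removed at random (sketch)."""
    rng = np.random.default_rng(seed)
    T_true = rng.normal(size=(n_rows, n_latent))     # latent scores (assumed N(0, 1))
    P_true = rng.normal(size=(n_cols, n_latent))     # latent loadings (assumed N(0, 1))
    X_full = T_true @ P_true.T + noise_sd * rng.normal(size=(n_rows, n_cols))

    # Mark a fixed number of entries as missing, chosen uniformly at random
    n_missing = int(round(missing_fraction * X_full.size))
    idx = rng.choice(X_full.size, size=n_missing, replace=False)
    missing = np.zeros(X_full.size, dtype=bool)
    missing[idx] = True
    X = np.where(missing.reshape(X_full.shape), np.nan, X_full)
    return X_full, X

# One data set per missing percentage in the range quoted on the slide
datasets = {pct: simulate_incomplete(missing_fraction=pct / 100)[1]
            for pct in (1, 10, 30, 50, 70)}
```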

  12. RESULTS

  13. RESULTS Industrial Example: - Data from 76 common pharmaceutical materials were made available (Pfizer Inc.) and the data span over 10 years of testing. Due to the reasons outlined above, approximately 61% of the data were missing. - For this example, three principal components were used in all models due to: a sudden drop in the eigenvalue for the fourth component (from 13 to 2) for the NIPALS model and the very low percent of the total variance for the fourth component (1.4%) in the NLP.

  14. Conclusion • The NLP solutions take less time and fewer iterations than the current state-of-the-art algorithms, while still satisfying the constraints of the PCA model. • The current platform allows the potential inclusion of a large number of observations that would otherwise be excluded from the model-building exercise, still yielding a robust model with desirable properties. • In the presence of large amounts of missing data, this method reduces the computational time (and number of iterations) required to calculate the model parameters.
