
Non-Negative Matrix Factorization for Statistical Analysis. Stan Young, Paul Fogel, Doug Hawkins. NISS. MPC Vienna, 11 July 2007.



Presentation Transcript


  1. Non-Negative Matrix Factorization for Statistical Analysis. Stan Young, Paul Fogel, Doug Hawkins. NISS. MPC Vienna, 11 July 2007

  2. Outline • Introduction • Robust singular value decomposition • Non-negative matrix factorization • Inference with NMF

  3. Data Blocks (Zoo) • 3-Way • PCA • Linear Regression • Canonical Correlation • PLS • “U” design • Multi-Block

  4. Multiple Blocks • 2-way tables of data are ubiquitous: PCA. • One response and a table of predictors is common: linear regression. • Multiple 2-way tables are becoming important: gene expression, proteomics, metabolomics.

  5. Examples of Multiple Blocks • Factor analysis of multiple data matrices: Horst 1961, 1965; Kettenring, J. 1971. • Pittman, Sacks, Young. 2001. 3-Way Analysis. See also www.niss.org/PowerArray • Martens. 2004. “U” Analysis.

  6. Motivating Problem: Permute the rows and columns to find patterns. Problems: • Large: 10s to 100s of rows and 1000s of columns. • Missing data. • Outliers.

  7. Matrix Factorization Methods • Principal component analysis. • Singular value decomposition. • Non-negative matrix factorization. • Independent component analysis. • Inference using NMF. • Area of active research.

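  8. Understanding an SVD algorithm helps. $X = \lambda \cdot \mathrm{LHE}' \cdot \mathrm{RHE} + E$, where LHE and RHE are the left-hand and right-hand eigenvectors. Each step is an ordinary regression, $y = bx + e$. • Guess at LHE. • Regress each column of X on LHE. • Each element of RHE is the corresponding regression coefficient. • Switch LHE and RHE, iterate (alternating LS regression). • For robustness, use a robust regression method such as least trimmed squares.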
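The alternating regression on this slide can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' code: it uses ordinary least squares where the slide recommends a robust method (least trimmed squares), it assumes a complete matrix with no missing cells, and the function name `rank_one_als` and the demo matrix are hypothetical.

```python
import numpy as np

def rank_one_als(X, n_iter=500, tol=1e-12, seed=0):
    """Leading singular triple of X via alternating least-squares regressions."""
    rng = np.random.default_rng(seed)
    lhe = rng.normal(size=X.shape[0])      # guess at LHE
    lhe /= np.linalg.norm(lhe)
    lam = 0.0
    for _ in range(n_iter):
        # Regress each column of X on LHE (no intercept); the slopes form RHE.
        rhe = X.T @ lhe / (lhe @ lhe)
        rhe /= np.linalg.norm(rhe)
        # Switch roles: regress each row of X on RHE to update LHE.
        new_lhe = X @ rhe / (rhe @ rhe)
        lam = np.linalg.norm(new_lhe)      # scale factor (leading singular value)
        new_lhe /= lam
        converged = np.linalg.norm(new_lhe - lhe) < tol
        lhe = new_lhe
        if converged:
            break
    return lam, lhe, rhe

# Hypothetical demo data: a strong rank-one pattern plus small noise.
demo_rng = np.random.default_rng(1)
X_demo = 5.0 * np.outer(demo_rng.random(6), demo_rng.random(4)) \
    + 0.1 * demo_rng.random((6, 4))
lam, lhe, rhe = rank_one_als(X_demo)
```

To make the sketch robust in the sense of the slide, each least-squares slope would be replaced by a robust regression estimate; that substitution is not shown here.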

  9. California Versus All Challengers, The 1999 Cabernet Challenge • 47 wines judged by 32 wine experts • No data for 1 wine/expert • One missing data point • Results are ranks of wine by each judge

  10. Original Data The missing cell is colored yellow.

  11. Plot of Eigenvalues The plot suggests one or two components.

  12. Component 1 • Judges are divided into the following groups: 1-3, 4-7, 8-11, 12-26, 27-32 • Wines are divided into the following groups: 1-4, 5-17, 18-27, 28-41, 42-46

  13. 1st EV System

  14. Comments: Wine Dataset • Most judges were consistent. • Three judges are at odds with the rest. • The wines divided into 6 classes; six wines group very well. • There is an apparent interaction of wines and judges. • One eigensystem captures most of the variance.

  15. Key Matrix Factorization Papers • Good (1969) Technometrics – SVD. • Liu et al. (2003) PNAS – rSVD. • Lee and Seung (1999) Nature – NMF. • Kim and Tidor (2003) Genome Research. • Brunet et al. (2004) PNAS – Microarray. • Fogel et al. (2007) Bioinformatics.

  16. Contention: NMF finds “parts” • SVD: RH EV elements come from a composite. (They come from regression.) • NMF commits one vector to each mechanism. (True??) • “For such databases there is a generative model in terms of ‘parts’ and NMF correctly identifies the ‘parts’.”

  17. An NMF Algorithm. [Figure: matrix A, with axes labeled Genes or Compounds and Samples, factored as $A = WH + E$.] Green are the “spectra”. Red are the “weights”. Optimize so that $\sum_{i,j} (a_{ij} - (WH)_{ij})^2$ is minimized. Start with random elements in red and green.
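One way to minimize this squared error from a random start is the multiplicative update rule of Lee and Seung, sketched below in Python with NumPy. The slide does not say which update rule the authors use, so the choice of rule is an assumption, as are the function names `nmf_multiplicative` and `sq_error`, the small constant `eps` that guards against division by zero, and the demo matrix.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=500, eps=1e-9, seed=0):
    """Factor non-negative A (n x m) as W (n x k) times H (k x m)."""
    rng = np.random.default_rng(seed)
    # Start with random non-negative elements in both factors.
    W = rng.random((A.shape[0], k)) + eps
    H = rng.random((k, A.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

def sq_error(A, W, H):
    """Sum over all cells of (a_ij - (WH)_ij)^2."""
    return float(np.sum((A - W @ H) ** 2))

# Hypothetical demo data: an exactly rank-2 non-negative matrix.
demo_rng = np.random.default_rng(2)
A_demo = demo_rng.random((8, 2)) @ demo_rng.random((2, 6))
W_fit, H_fit = nmf_multiplicative(A_demo, k=2)
```

Because the start is random, different seeds can give different factors; comparing `sq_error` across several seeds is a simple check on stability.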

  18. Scotch Whisky: Original matrix = Prototypical flavor patterns × Weights. Wishart: Whisky Classified

  19. Examples: Lagavulin & Laphroaig

  20. How Many Components? • Profile likelihood • Scree plot • Determinant
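Of the three criteria on this slide, the scree plot is the simplest to compute. The sketch below, in Python with NumPy, returns the share of the total sum of squares carried by each component; these are the values a scree plot displays. It assumes an uncentered SVD, and the function name `scree_shares` and the demo matrix are hypothetical. The profile likelihood and determinant criteria are not shown.

```python
import numpy as np

def scree_shares(X):
    """Share of the total sum of squares carried by each SVD component."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values, largest first
    return s ** 2 / np.sum(s ** 2)

# Hypothetical demo data: one dominant component plus small noise.
demo_rng = np.random.default_rng(3)
X_demo = np.outer(demo_rng.random(10), demo_rng.random(5)) \
    + 0.01 * demo_rng.random((10, 5))
shares = scree_shares(X_demo)
```

A sharp drop after the first few shares suggests that number of components, which is how the wine eigenvalue plot earlier in the deck is read.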

  21. Golub, T.R. et al. (1999) • Group AML: acute myeloid leukemia • Group ALL: acute lymphoblastic leukemia • Subgroup ALL-T: T cell subtypes • Subgroup ALL-B: B cell subtypes

  22. Gene and Sample Clustering NMF clusters samples correctly. Additional subgroup of ALL-B. Brunet et al. (2004). PNAS 101, 4164–4169

  23. ALL-B1 and ALL-B2 Genes. [Figure: annotated gene groups for Cluster 1, ALL-B1 (33 genes) and Cluster 3, ALL-B2 (169 genes). Labels in extraction order: Immune Response, 10 genes (p = 0.00019); MHC class II, 5 genes; Proteasome, 7 genes (p = 0.00054); MHC class I & II, 6 genes (p = 0.00018); Immune Response, 28 genes (p = 0.00047); RNA Processing, 11 genes (p = 0.00260); DNA Repair and Replication, 11 genes (p = 0.01519); Cell Growth and Proliferation, 61 genes; Cell Cycle, 12 genes; Transcription, 16 genes.] Upregulation in ALL-B2 genes: • Higher rate of transcription and replication processes. • More: proliferative nature compared with ALL-B1; proteasomal activity; energy production.

  24. Inference Strategy • Non-negative matrix factorization is used to group genes. [Unsupervised training.] • The testing alpha is allocated over these groups/vectors. • Within each group, genes are tested sequentially; there is no multiple testing adjustment!
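The strategy on this slide can be sketched in plain Python. The slide does not say how alpha is divided or when sequential testing stops, so two assumptions are made here and marked in the comments: alpha is split equally across the groups, and testing within a group stops at the first non-significant gene. The function name `sequential_tests` and the p-values are hypothetical illustrations, not results from the talk.

```python
def sequential_tests(p_values_by_group, alpha=0.05):
    """Count genes declared significant in each group, testing in order."""
    # Assumption: the overall alpha is split equally across groups/vectors.
    alpha_each = alpha / len(p_values_by_group)
    rejected = {}
    for group, p_values in p_values_by_group.items():
        n_reject = 0
        for p in p_values:          # genes are already in their test order
            if p > alpha_each:
                break               # assumption: stop at first non-significant gene
            n_reject += 1           # tested at alpha_each, no further adjustment
        rejected[group] = n_reject
    return rejected

# Hypothetical p-values, listed in the order the genes would be tested.
demo_p = {
    "vector_1": [0.001, 0.01, 0.2, 0.003],
    "vector_2": [0.03, 0.001],
}
result = sequential_tests(demo_p)
```

With two groups each group is tested at 0.025, so the first group stops at its third gene and the second group stops immediately.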

  25. Inference NMF Algorithm. [Figure: matrices X, W, H and response Y.] 1. Compute NMF. 2. Order Y by elements of W. 3. Compute runs test on Y. 4. Remove most important col of X. 5. Repeat steps 1 to 3 (maintain order of H). 6. Stop when runs test not significant. Fogel et al. (2007) Bioinformatics
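Steps 2 and 3 above can be sketched in Python with NumPy using the Wald-Wolfowitz runs test with a normal approximation. The slide does not say which form of the runs test is used or how Y is coded, so this assumes a two-sided test on a binary 0/1 response; the function name `runs_test` and the demo weights and labels are hypothetical.

```python
import math
import numpy as np

def runs_test(labels):
    """Wald-Wolfowitz runs test on a 0/1 sequence (normal approximation)."""
    labels = np.asarray(labels)
    n1 = int(np.sum(labels == 1))
    n0 = int(np.sum(labels == 0))
    n = n1 + n0
    if n1 == 0 or n0 == 0:
        raise ValueError("both classes must be present")
    # A new run starts wherever adjacent labels differ.
    runs = 1 + int(np.sum(labels[1:] != labels[:-1]))
    mean = 2.0 * n1 * n0 / n + 1.0
    var = 2.0 * n1 * n0 * (2.0 * n1 * n0 - n) / (n ** 2 * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided
    return runs, z, p_value

# Hypothetical demo: one column of W and a binary response Y.
w_demo = np.arange(20, dtype=float)
y_demo = (w_demo >= 10).astype(int)
order = np.argsort(w_demo)            # step 2: order Y by elements of W
runs, z, p_value = runs_test(y_demo[order])
```

Few runs, as in this demo, mean the ordering by W separates the two classes, which is the signal the loop in steps 4 to 6 keeps testing for.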

  26. Simulation

  27. Simulation • Genes 1-5: up-regulated by T1 • Genes 6-10: up-regulated by T2 • Genes 11-20: up-regulated by T1 and T2 • NB: Genes within a mechanism are expected to be correlated.
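A data set with this gene layout can be generated in Python with NumPy as follows. Only the assignment of genes to T1 and T2 comes from the slide; the number of samples, the baseline level, the effect size, the noise level, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_genes = 30, 20          # assumption: 30 samples

# Assumption: each treatment is either on (1) or off (0) in each sample.
T = rng.integers(0, 2, size=(n_samples, 2)).astype(float)

# Gene layout from the slide (0-based indices for genes 1-20).
H_true = np.zeros((2, n_genes))
H_true[0, 0:5] = 1.0                 # genes 1-5: up-regulated by T1
H_true[1, 5:10] = 1.0                # genes 6-10: up-regulated by T2
H_true[:, 10:20] = 1.0               # genes 11-20: up-regulated by T1 and T2

# Assumptions: baseline expression 1.0, unit effect size, small uniform noise.
A_sim = 1.0 + T @ H_true + 0.1 * rng.random((n_samples, n_genes))
```

Genes that share a mechanism share a column of `T`, so they come out correlated, as the slide's note expects; `A_sim` is non-negative and can be passed to an NMF routine.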

  28. Increased Power

  29. General Comments SVD is the basis for most linear statistical systems. Non-negative matrix factorization will become increasingly important. Data sets are getting much bigger. We are seeing complex, multi-block data sets. We need good software to expand data analysis.

  30. irMF Summary • NMF is an attractive alternative to SVD. • Mechanisms appear to be captured in separate vectors. • Genes can be tested sequentially within a right vector. • Many statistical problems are open for research.

  31. More Information • NMF program and papers at www.niss.org/irMF • Stan Young: young@niss.org • Paul Fogel: paul.fogel@wanadoo.fr

  32. More Information NMF Code and papers at www.niss.org/irMF Analysis of “L” design: www.niss.org/PowerArray NMF roundtable luncheon at JSM2007. See also: www.niss.org/PowerMV http://eccr.stat.ncsu.edu/ young@niss.org
