Non-Negative Matrix Factorization for Statistical Analysis Stan Young Paul Fogel, Doug Hawkins NISS MPC Vienna 11July 2007. Outline. Introduction Robust singular value decomposition Non-negative matrix factorization 4. Inference with NMF. Data Blocks (Zoo). 3-Way. PCA. Linear Regression.
Non-Negative Matrix Factorization for Statistical Analysis. Stan Young, Paul Fogel, Doug Hawkins. NISS. MPC Vienna, 11 July 2007
Outline 1. Introduction 2. Robust singular value decomposition 3. Non-negative matrix factorization 4. Inference with NMF
Data Blocks (Zoo): 3-Way, PCA, Linear Regression, Canonical Correlation, PLS, “U” design, Multi-Block
Multiple Blocks. Two-way tables of data are ubiquitous and are typically analyzed with PCA. One response with a table of predictors is common and is handled by linear regression. Multiple two-way tables are becoming important: gene expression, proteomics, metabolomics.
Examples of Multiple Blocks. Factor analysis of multiple data matrices: Horst 1961, 1965; Kettenring, J. 1971. 3-Way Analysis: Pittman, Sacks, Young. 2001. See also www.niss.org/PowerArray. “U” Analysis: Martens. 2004.
Motivating Problem Permute the rows and columns to find patterns. • Problems: • Large, 10s to 100s of rows and 1000s of columns. • Missing data. • Outliers.
Matrix Factorization Methods • Principal component analysis. • Singular value decomposition. • Non-negative matrix factorization. • Independent component analysis. • Inference using NMF. • Area of active research.
Understanding an SVD algorithm helps. $X = \lambda \cdot \mathrm{LHE}' \cdot \mathrm{RHE} + E$, which has the same form as the regression $y = bx + e$. • Guess at LHE. • Regress each column of X (the y of the regression) on LHE (the x). • The matching element of RHE is the regression coefficient. • Switch LHE and RHE, iterate. This is alternating LS regression. • Use a robust regression method, e.g. least trimmed squares.
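The alternating regression steps on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' program: it uses ordinary least squares in both half-steps, whereas the slide recommends a robust fit such as least trimmed squares. The function name `rank_one_als`, the arguments `n_iter`, `tol`, and `seed`, and the vector names `u` (for LHE) and `v` (for RHE) are all assumptions made for the example.

```python
import numpy as np

def rank_one_als(X, n_iter=200, tol=1e-12, seed=0):
    """Rank-one fit X ~ lam * outer(u, v) by alternating least squares.

    Hypothetical helper: plain least squares stands in for the robust
    regression (least trimmed squares) suggested on the slide.
    """
    rng = np.random.default_rng(seed)
    n_rows, _ = X.shape
    u = rng.standard_normal(n_rows)          # initial guess at LHE
    scale = 0.0
    for _ in range(n_iter):
        # Regress every column of X on u (no intercept); the slope
        # u'x_j / u'u becomes element j of v (RHE).
        v = X.T @ u / (u @ u)
        # Switch roles: regress every row of X on v to update u.
        u = X @ v / (v @ v)
        new_scale = np.linalg.norm(u) * np.linalg.norm(v)
        if abs(new_scale - scale) <= tol * max(new_scale, 1.0):
            break
        scale = new_scale
    lam = np.linalg.norm(u) * np.linalg.norm(v)
    return lam, u / np.linalg.norm(u), v / np.linalg.norm(v)
```

With non-robust regression this is the power method, so `lam` should approach the leading singular value of `X`. Replacing the two slope computations with a robust estimator is what would make the fit resistant to outliers, and a missing cell can be handled by leaving it out of the regressions that involve it.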
California Versus All Challengers, The 1999 Cabernet Challenge • 47 wines judged by 32 wine experts • No data for 1 wine/expert • One missing data point • Results are ranks of wine by each judge
Original Data The missing cell is colored yellow.
Plot of Eigenvalues The plot suggests one or two components.
Component 1. Judges are divided into the following groups: 1-3, 4-7, 8-11, 12-26, 27-32. Wines are divided into the following groups: 1-4, 5-17, 18-27, 28-41, 42-46.
Comments Wine Dataset Most judges were consistent. Three judges are at odds with the rest. The wines divided into 6 classes; six wines group very well. There is an apparent interaction of wines and judges. One eigen system captures most of the variance.
Key Matrix Factorization Papers • Good (1969) Technometrics – SVD. • Liu et al. (2003) PNAS – rSVD. • Lee and Seung (1999) Nature – NMF. • Kim and Tidor (2003) Genome Research. • Brunet et al. (2004) PNAS – Micro array. • Fogel et al. (2007) Bioinformatics.
Contention: NMF finds “parts”. The elements of an SVD right-hand eigenvector come from a composite. (They come from regression.) NMF commits one vector to each mechanism. (True??) “For such databases there is a generative model in terms of ‘parts’ and NMF correctly identifies the ‘parts’.”
An NMF Algorithm. $A = WH + E$, with the axes of A labeled “Genes or Compounds” and “Samples”. Green are the “spectra”. Red are the “weights”. Optimize so that $\sum_{i,j} (a_{ij} - (WH)_{ij})^2$ is minimized. Start with random elements in red and green.
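One common way to minimize this squared error under non-negativity is the multiplicative update rule of Lee and Seung. The sketch below assumes that rule; the slide does not say which optimizer the authors' program uses. The function name `nmf_multiplicative` and the parameters `k`, `n_iter`, `seed`, and `eps` are illustrative only.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=500, seed=0, eps=1e-12):
    """Factor a non-negative matrix A (n x m) as W (n x k) times H (k x m).

    Hypothetical sketch using Lee-Seung multiplicative updates for the
    squared-error objective; eps guards against division by zero.
    """
    A = np.asarray(A, dtype=float)
    if np.any(A < 0):
        raise ValueError("A must be non-negative")
    rng = np.random.default_rng(seed)
    n, m = A.shape
    # Start with random non-negative elements, as on the slide.
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H
        # stay non-negative throughout.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because the start is random and the objective is not convex, different seeds can give different factorizations; running several starts and keeping the best fit is a common precaution.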
Scotch Whisky. Original matrix = prototypical flavor patterns × weights. Wishart: Whisky Classified
How Many Components? Profile likelihood. Scree plot. Determinant.
Golub, T.R. et al. (1999) • Group AML: acute myeloid leukemia • Group ALL: acute lymphoblastic leukemia • Subgroup ALL-T: T cell subtypes • Subgroup ALL-B: B cell subtypes
Gene and Sample Clustering NMF clusters samples correctly. Additional subgroup of ALL-B. Brunet et al. (2004). PNAS 101, 4164–4169
ALL-B1 and ALL-B2 Genes. Figure labels, in the order they appear: Immune Response, 10 genes (p = 0.00019); MHC class II, 5 genes; Cluster 1 ALL-B1 (33 genes); Proteasome, 7 genes (p = 0.00054); MHC class I & II, 6 genes (p = 0.00018); Immune Response, 28 genes (p = 0.00047); RNA Processing, 11 genes (p = 0.00260); Cluster 3 ALL-B2 (169 genes); DNA Repair and Replication, 11 genes (p = 0.01519); Cell Growth and Proliferation, 61 genes; Cell Cycle, 12 genes; Transcription, 16 genes. Upregulation in ALL-B2 genes: higher rate of transcription and replication processes. More: proliferative nature compared with ALL-B1; proteasomal activity; energy production.
Inference Strategy. Non-negative matrix factorization is used to group genes. [Unsupervised training.] The testing alpha is allocated over these groups/vectors. Within each group, genes are tested sequentially; there is no multiple testing adjustment!
Inference NMF Algorithm (matrices in the figure: H, Y, X, W) 1. Compute NMF. 2. Order Y by elements of W. 3. Compute runs test on Y. 4. Remove most important col of X. 5. Repeat steps 1 to 3 (maintain order of H). 6. Stop when runs test not significant. Fogel et al. (2007) Bioinformatics
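Steps 2 and 3 above can be illustrated with a Wald-Wolfowitz runs test. The sketch assumes that Y is a two-level label (for example, two treatment classes) and that the ordering comes from one vector of NMF weights; neither detail is stated on the slide. It uses the large-sample normal approximation, and the names `runs_test` and `runs_test_after_ordering` are hypothetical.

```python
import math
import numpy as np

def runs_test(labels):
    """Wald-Wolfowitz runs test on a two-level sequence.

    Returns (number of runs, z statistic, two-sided p-value) using the
    normal approximation, which is only reasonable for moderate n.
    """
    labels = np.asarray(labels).astype(bool)
    n1 = int(labels.sum())
    n2 = int(labels.size - n1)
    if n1 == 0 or n2 == 0:
        raise ValueError("both label levels must be present")
    # A new run starts wherever neighbouring labels differ.
    runs = 1 + int(np.count_nonzero(labels[1:] != labels[:-1]))
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n ** 2 * (n - 1))
    if var <= 0:
        raise ValueError("sequence too short for the normal approximation")
    z = (runs - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return runs, z, p

def runs_test_after_ordering(y, weights):
    """Order y by an NMF weight vector (step 2), then run the test (step 3)."""
    order = np.argsort(weights)
    return runs_test(np.asarray(y)[order])
```

If the weight vector separates the two classes well, the ordered labels form very few runs and the p-value is small; a weight vector unrelated to Y gives a run count near its expectation.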
Simulation Genes 1-5: up-regulated by T1 Genes 6-10: up-regulated by T2 Genes 11-20: up-regulated by T1 and T2 NB: Genes within a mechanism are expected to be correlated.
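A data set with this structure can be generated along the following lines. The slide gives only the gene-to-treatment assignment, so the number of samples, the baseline level, the effect size, the noise level, and the function name `simulate_expression` are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_expression(n_samples=40, baseline=5.0, effect=2.0,
                        noise_sd=0.5, seed=0):
    """Simulate a 20-gene by n_samples non-negative expression matrix.

    Hypothetical settings: each sample receives T1 and/or T2 at random,
    and each gene responds additively to the treatments it is tied to.
    """
    rng = np.random.default_rng(seed)
    t1 = rng.integers(0, 2, n_samples)       # 1 if sample received T1
    t2 = rng.integers(0, 2, n_samples)       # 1 if sample received T2
    loadings = np.zeros((20, 2))
    loadings[0:5, 0] = 1.0                   # genes 1-5: T1 only
    loadings[5:10, 1] = 1.0                  # genes 6-10: T2 only
    loadings[10:20, :] = 1.0                 # genes 11-20: T1 and T2
    treatments = np.vstack([t1, t2]).astype(float)
    noise = noise_sd * rng.standard_normal((20, n_samples))
    X = baseline + effect * (loadings @ treatments) + noise
    # Clip at zero so the matrix is valid input for NMF.
    return np.clip(X, 0.0, None), t1, t2
```

Genes that share a treatment share a column of `loadings`, which is what makes genes within a mechanism correlated across samples.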
General Comments SVD is the basis for most linear statistical systems. Non-negative matrix factorization will become increasingly important. Data sets are getting much bigger. We are seeing complex, multi-block data sets. We need good software to expand data analysis.
irMF Summary • NMF is an attractive alternative to SVD. • Mechanisms appear to be captured in separate vectors. • Genes can be tested sequentially within a right vector. • Many statistical problems are open for research.
More Information NMF program and papers at www.niss.org/irMF Stan Young : young@niss.org Paul Fogel : paul.fogel@wanadoo.fr
More Information NMF Code and papers at www.niss.org/irMF Analysis of “L” design: www.niss.org/PowerArray NMF roundtable luncheon at JSM2007. See also: www.niss.org/PowerMV http://eccr.stat.ncsu.edu/ young@niss.org