Non-Negative Matrix Factorization for Statistical Analysis Stan Young, Paul Fogel, Doug Hawkins NISS MPC Vienna, 11 July 2007

Outline • Introduction • Robust singular value decomposition • Non-negative matrix factorization • Inference with NMF

Data Blocks (Zoo) 3-Way PCA Linear Regression Canonical Correlation PLS “U” design Multi-Block

Multiple Blocks 2-way tables of data are ubiquitous (PCA). One response and a table of predictors is common (linear regression). Multiple 2-way tables are becoming important: gene expression, proteomics, metabolomics.

Examples of Multiple Blocks …….. Factor analysis of multiple data matrices. Horst 1961, 1965, Kettenring, J. 1971. Pittman, Sacks, Young. 2001. 3-Way Analysis. See also www.niss.org/PowerArray Martens. 2004. “U” Analysis

Motivating Problem Permute the rows and columns to find patterns. • Problems: • Large, 10s to 100s of rows and 1000s of columns. • Missing data. • Outliers.

Matrix Factorization Methods • Principal component analysis. • Singular value decomposition. • Non-negative matrix factorization. • Independent component analysis. • Inference using NMF. • Area of active research.

Understanding an SVD algorithm helps X = λ · LHE′ · RHE + E, which has the same form as the regression y = bx + e. • Guess at LHE. • Regress each column of X on LHE. • Each element of RHE is the corresponding regression coefficient. • Switch LHE and RHE, iterate: alternating LS regression. • Use a robust regression method: least trimmed squares.
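The alternating regression steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' program: it uses ordinary least squares without an intercept rather than least trimmed squares, it assumes a complete matrix with no missing cells, and the function name `rank1_als` and its parameters are hypothetical.

```python
import numpy as np

def rank1_als(X, n_iter=500, tol=1e-12, seed=0):
    """Rank-1 fit X ~ lam * outer(u, v) by alternating least squares.

    u plays the role of LHE and v the role of RHE on the slide.
    Assumption: plain least squares, no robust step, no missing data.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(X.shape[0])   # guess at LHE
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        # Regress each column of X on u (no intercept); the slopes form v.
        v = X.T @ u / (u @ u)
        # Switch roles: regress each row of X on v; the slopes form a new u.
        u_new = X @ v / (v @ v)
        u_new /= np.linalg.norm(u_new)
        converged = np.linalg.norm(u_new - u) < tol
        u = u_new
        if converged:
            break
    v = X.T @ u                # slopes for the final unit-length u
    lam = np.linalg.norm(v)    # scale factor, the leading singular value
    return lam, u, v / lam
```

For a complete matrix this iteration should agree with the leading singular triplet from `np.linalg.svd`, up to sign. Swapping the two least squares steps for a robust regression is what would turn it into a robust SVD.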

California Versus All Challengers, The 1999 Cabernet Challenge • 47 wines judged by 32 wine experts • No data for 1 wine/expert • One missing data point • Results are ranks of wine by each judge

Original Data The missing cell is colored yellow.

Plot of Eigenvalues The plot suggests one or two components.

Component 1 Judges are divided into the following groups: 1-3, 4-7, 8-11, 12-26, 27-32. Wines are divided into the following groups: 1-4, 5-17, 18-27, 28-41, 42-46.

1st EV System

Comments on the Wine Dataset Most judges were consistent. Three judges are at odds with the rest. The wines divide into 6 classes; six wines group very well. There is an apparent interaction of wines and judges. One eigensystem captures most of the variance.

Key Matrix Factorization Papers • Good (1969) Technometrics – SVD. • Liu et al. (2003) PNAS – rSVD. • Lee and Seung (1999) Nature – NMF. • Kim and Tidor (2003) Genome Research. • Brunet et al. (2004) PNAS – Microarray. • Fogel et al. (2007) Bioinformatics.

Contention: NMF finds “parts”. The elements of an SVD right-hand eigenvector (RH EV) come from a composite. (They come from regression.) NMF commits one vector to each mechanism. (True??) “For such databases there is a generative model in terms of ‘parts’ and NMF correctly identifies the ‘parts’.”

An NMF Algorithm A = WH + E (figure axes: genes or compounds; samples). Green are the “spectra”. Red are the “weights”. Optimize so that Σ (a_ij – (WH)_ij)² is minimized. Start with random elements in red and green.
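One common way to minimize this squared error from a random start is the multiplicative update rule of Lee and Seung. The slide does not say which optimizer the authors use, so the sketch below is an assumption, and the function name `nmf_multiplicative` and its parameters are hypothetical.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=500, seed=0, eps=1e-12):
    """Factor a non-negative matrix A (n x p) as W (n x k) times H (k x p).

    Assumption: Lee-Seung multiplicative updates for the squared error
    sum_ij (a_ij - (WH)_ij)^2; the slide does not name the optimizer.
    """
    rng = np.random.default_rng(seed)
    n, p = A.shape
    W = rng.random((n, k)) + eps   # random non-negative start
    H = rng.random((k, p)) + eps   # random non-negative start
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H
        # stay non-negative; eps guards against division by zero.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

The result depends on the random start, so in practice several seeds are usually compared. The number of components `k` has to be chosen separately.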

Scotch Whisky Original matrix = prototypical flavor patterns × weights. Wishart: Whisky Classified

Examples: Lagavulin & Laphroaig

How Many Components? • Profile likelihood • Scree plot • Determinant

Golub, T.R. et al. (1999) • Group AML: acute myeloid leukemia • Group ALL: acute lymphoblastic leukemia • Subgroup ALL-T: T cell subtypes • Subgroup ALL-B: B cell subtypes

Gene and Sample Clustering NMF clusters samples correctly. Additional subgroup of ALL-B. Brunet et al. (2004). PNAS 101, 4164–4169

ALL-B1 and ALL-B2 Genes Figure labels: Immune Response, 10 genes (P = 0.00019); MHC class II, 5 genes; Cluster 1 ALL-B1 (33 genes); Proteasome, 7 genes (P = 0.00054); MHC class I & II, 6 genes (P = 0.00018); Immune Response, 28 genes (P = 0.00047); RNA Processing, 11 genes (P = 0.00260); Cluster 3 ALL-B2 (169 genes); DNA Repair and Replication, 11 genes (P = 0.01519); Cell Growth and Proliferation, 61 genes; Cell Cycle, 12 genes; Transcription, 16 genes. Upregulation in ALL-B2 genes • Higher rate of transcription and replication processes • More: proliferative nature compared with ALL-B1, proteasomal activity, energy production.

Inference Strategy Non-negative matrix factorization is used to group genes (unsupervised training). The testing alpha is allocated over these groups/vectors. Within each group, genes are tested sequentially; there is no multiple testing adjustment!

Inference NMF Algorithm (figure labels: H, Y, X, W) 1. Compute NMF. 2. Order Y by elements of W. 3. Compute runs test on Y. 4. Remove most important column of X. 5. Repeat steps 1 to 3 (maintain order of H). 6. Stop when runs test is not significant. Fogel et al. (2007) Bioinformatics
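Steps 2 and 3 can be sketched as below. The slide does not define the runs test or the form of Y, so this assumes that Y is a binary label per sample, that the test is the Wald-Wolfowitz runs test with a normal approximation, and that one column of W is used for ordering. The names `runs_test` and `order_and_test` are hypothetical.

```python
import math
import numpy as np

def runs_test(labels):
    """Wald-Wolfowitz runs test on a binary sequence (normal approximation).

    Returns (number of runs, z statistic, two-sided p-value).
    """
    x = np.asarray(labels).astype(bool)
    n1 = int(x.sum())
    n2 = int(x.size - n1)
    n = n1 + n2
    if n1 == 0 or n2 == 0 or n < 3:
        raise ValueError("need both label values and at least 3 observations")
    # A new run starts wherever neighbouring labels differ.
    runs = 1 + int(np.count_nonzero(x[1:] != x[:-1]))
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return runs, z, p

def order_and_test(y, w):
    """Order the labels y by the weights w (one column of W), then test."""
    order = np.argsort(w)
    return runs_test(np.asarray(y)[order])
```

A strongly negative z means fewer runs than expected by chance, which is the pattern to look for when the ordering by W separates the two groups. A one-sided p-value may be the more natural choice in that setting.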

Simulation

Simulation Genes 1-5: up-regulated by T1 Genes 6-10: up-regulated by T2 Genes 11-20: up-regulated by T1 and T2 NB: Genes within a mechanism are expected to be correlated.
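A data set with this gene layout could be generated along the following lines. Everything beyond the three gene blocks is an assumption made for illustration: the three sample groups, the group sizes, the gamma baseline, the two-fold effect, and the reading of "T1 and T2" as "responds to either treatment". The name `simulate_expression` is hypothetical, and these are not the settings of the authors' simulation.

```python
import numpy as np

def simulate_expression(n_per_group=10, effect=2.0, seed=0):
    """Simulate a non-negative samples x genes matrix with 20 genes.

    Assumed design: control, T1 and T2 groups of equal size.
    Genes 1-5 respond to T1, genes 6-10 to T2, genes 11-20 to either.
    """
    rng = np.random.default_rng(seed)
    groups = np.repeat(["control", "T1", "T2"], n_per_group)
    # Non-negative baseline expression (gamma noise is an assumption).
    X = rng.gamma(shape=5.0, scale=1.0, size=(groups.size, 20))
    t1 = groups == "T1"
    t2 = groups == "T2"
    X[t1, 0:5] *= effect          # genes 1-5: up-regulated by T1
    X[t2, 5:10] *= effect         # genes 6-10: up-regulated by T2
    X[t1 | t2, 10:20] *= effect   # genes 11-20: up-regulated by T1 and T2
    return X, groups
```

Genes that share a treatment effect move together across samples, which gives the within-mechanism correlation the slide mentions.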

Increased Power

General Comments SVD is the basis for most linear statistical systems. Non-negative matrix factorization will become increasingly important. Data sets are getting much bigger. We are seeing complex, multi-block data sets. We need good software to expand data analysis.

irMF Summary • NMF is an attractive alternative to SVD. • Mechanisms appear to be captured in separate vectors. • Genes can be tested sequentially within a right vector. • Many statistical problems are open for research.

More Information NMF program and papers at www.niss.org/irMF Stan Young : young@niss.org Paul Fogel : paul.fogel@wanadoo.fr

More Information NMF Code and papers at www.niss.org/irMF Analysis of “L” design: www.niss.org/PowerArray NMF roundtable luncheon at JSM2007. See also: www.niss.org/PowerMV http://eccr.stat.ncsu.edu/ young@niss.org
