Pre-processing HCS Data Using Non-negative Matrix Factorization. S. Stanley Young, National Institute of Statistical Sciences. MBSW, Muncie, 19 May 2009.
Contention: PCA fails for mixtures. NMF separates mixtures.
Key Idea: Y1 + Y2 = Y; NMF factors the mixture Y as WH.
Outline • Basics of HCS • Non-negative matrix factorization • The experiment/simulation • NMF versus PCA • Analysis of experiment • Literature
Basic Experimental Setup • Multiple cells within a well. • Treat the wells. • Image each well. • Image analysis yields a vector for each cell. • Summarize the well. • Analyze the well summaries.
Typical Images: Image analysis produces a vector of 5-50 numbers for each cell within each well. The cells are likely a mixture of responsive and non-responsive cells, along with artifacts of various sorts.
Typical Data • 5 vars/cell, 2000 wells/day, 2500 cells/well • 36 vars/well, 7,000 wells, 80-400 cells/well • 40 vars/well, 6,547 wells, 500 cells/well Data sets can be enormous, 7GB=>3MB.
Major Problem • Cells within wells are sub-samples. • We need a good well summary. • Idea: 1. Cluster the cells (within or across wells). 2. Summary: proportions of each cell type; average vectors for each type. 3. Analysis of proportions and vectors.
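The well summary described above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the cells have already been assigned integer cluster labels by some clustering method, and the function name `summarize_well` and the toy numbers are hypothetical.

```python
import numpy as np

def summarize_well(cell_vectors, labels, n_types):
    """Summarize one well as (proportions, mean vector per cell type).

    cell_vectors: (n_cells, n_vars) array from image analysis.
    labels: (n_cells,) integer cluster labels in [0, n_types) (assumed given).
    """
    cell_vectors = np.asarray(cell_vectors, dtype=float)
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=n_types)
    proportions = counts / counts.sum()
    # Cell types that are absent from this well get NaN mean vectors.
    means = np.full((n_types, cell_vectors.shape[1]), np.nan)
    for t in range(n_types):
        if counts[t] > 0:
            means[t] = cell_vectors[labels == t].mean(axis=0)
    return proportions, means

# Toy well: four cells, two variables, three possible cell types.
cells = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 0])
props, means = summarize_well(cells, labels, n_types=3)
```

The proportions and the per-type mean vectors then serve as the well-level variables for the downstream analysis.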
Matrix Factorization Methods • Principal component analysis. • Singular value decomposition. • Non-negative matrix factorization. • Independent component analysis. • NMF is an area of active research.
NMF Algorithm: Y = WH + E, where Y is the cells x vars data matrix. Green (H) are the “spectra”. Red (W) are the “weights”. Optimize so that $\sum_{ij} (y_{ij} - (WH)_{ij})^2$ is minimized. Start with random elements in red and green.
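Optimization Criteria: Minimize either the squared error $\sum_{ij} (y_{ij} - (WH)_{ij})^2$ or the divergence $\sum_{ij} \left[ y_{ij} \log \left( y_{ij} / (WH)_{ij} \right) - y_{ij} + (WH)_{ij} \right]$, where $y_{ij}$ are the elements of the data matrix Y.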
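One common way to minimize the squared-error criterion from a random start is the multiplicative update rule of Lee and Seung. The sketch below assumes that rule; the slides do not say which optimizer irMF or Array Studio actually use. The function name, iteration count, and small constant `eps` are illustrative choices.

```python
import numpy as np

def nmf_multiplicative(Y, k, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative Y (n x p) as W (n x k) @ H (k x p).

    Uses Lee-Seung multiplicative updates for the squared-error criterion.
    Returns W, H and the squared error after each iteration.
    """
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    W = rng.random((n, k)) + eps  # random non-negative start ("weights")
    H = rng.random((k, p)) + eps  # random non-negative start ("spectra")
    errors = []
    for _ in range(n_iter):
        # Element-wise updates keep W and H non-negative.
        H *= (W.T @ Y) / (W.T @ W @ H + eps)
        W *= (Y @ H.T) / (W @ H @ H.T + eps)
        errors.append(float(np.sum((Y - W @ H) ** 2)))
    return W, H, errors

# Toy data: an exact rank-3 non-negative matrix.
rng = np.random.default_rng(1)
Y = rng.random((50, 3)) @ rng.random((3, 8))
W, H, errors = nmf_multiplicative(Y, k=3)
```

The recorded error should not increase from one iteration to the next. Different random starts can converge to different local solutions.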
NMF Clustering • NMF Clusters the rows and columns. • Row clustering is fuzzy. • The variables in the column clusters define nature of each cluster. • The column factors are often sparse.
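A small sketch of how the factors can be read as clusters. The matrices below are made-up toy values, and normalizing each row of W to sum to one is only one possible way to express fuzzy membership. It assumes the rows of H are on comparable scales, because W and H are only determined up to a rescaling.

```python
import numpy as np

# Hypothetical factors: 3 samples, 2 components, 3 variables.
W = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
H = np.array([[1.0, 0.8, 0.0],
              [0.0, 0.1, 2.0]])

# Fuzzy row clustering: each sample's share of every component.
memberships = W / W.sum(axis=1, keepdims=True)

# Hard labels, if needed: the component with the largest share.
# Ties (third sample here) go to the first component.
row_labels = memberships.argmax(axis=1)

# Column clustering: each variable goes to the component loading it most.
var_groups = H.argmax(axis=0)
```

The third sample is split evenly between the two components, which is the sense in which the row clustering is fuzzy. The zeros in H show how sparse column factors make each component easy to name.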
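Analysis Strategy (1): Y = WH + E, with Y a samples x vars matrix. [Figure: factorization diagram; labels: X, Treatments, W, Junk.]
Analysis Strategy (2): Y = WH + E, with Y a samples x vars matrix. [Figure: factorization diagram; label: Trt 1 vs. Trt 2.]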
Contention: NMF finds “parts”. SVD right-hand eigenvector (RH EV) elements come from a composite. (They come from regression.) NMF commits one vector to each mechanism. (True??) “For such databases there is a generative model in terms of ‘parts’ and NMF correctly identifies the ‘parts’.”
Simulated Data Set • Create Y, an n x p (1000 x 10) matrix. • Multiply random W (n x k) and H (k x p) matrices. • H is 40% sparse. • Y = WH, with a small amount of Gaussian noise (5% of yij) added. • We sample rows from Y to test NMF and PCA.
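The simulation recipe above might be sketched like this. Several details are assumptions, because the slide does not give them: k = 5 is borrowed from the later "How many components?" slide, the entries of W and H are drawn uniformly, "5% of yij" is read as a noise standard deviation proportional to each element, negative values are clipped to zero, and the 200-row sample size is arbitrary.

```python
import numpy as np

def simulate(n=1000, p=10, k=5, sparsity=0.4, noise_frac=0.05, seed=0):
    """Build a noisy non-negative Y = WH with a sparse H (illustrative only)."""
    rng = np.random.default_rng(seed)
    W = rng.random((n, k))
    H = rng.random((k, p))
    # Zero out exactly `sparsity` of the entries of H.
    n_zero = int(round(sparsity * k * p))
    zero_idx = rng.choice(k * p, size=n_zero, replace=False)
    H.flat[zero_idx] = 0.0
    Y_clean = W @ H
    # Gaussian noise with standard deviation noise_frac * y_ij (assumed reading).
    noise = rng.normal(size=Y_clean.shape) * noise_frac * Y_clean
    Y = np.clip(Y_clean + noise, 0.0, None)  # keep the data non-negative
    return Y, W, H

Y, W, H = simulate()

# Sample rows of Y, e.g. to compare NMF and PCA on a subset.
rng = np.random.default_rng(1)
rows = rng.choice(Y.shape[0], size=200, replace=False)
Y_sample = Y[rows]
```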
How many components? [Plot annotations: large drop; 5 components.]
Linearity Test [Plot annotation: exceeds UCL.]
Variables are clustered [Plot: cross correlation.]
NMF Summary • NMF honors the non-negative nature of the data. • Variables are grouped. • Samples are clustered. • The clustering is “fuzzy”. • Sparseness makes interpretation easier.
PCA Eigenvectors: Comments • EV1: all positive elements. • EV2 is a “contrast”. • EV3 is X01 vs X02. Junk!
PCA Summary • 2 or 3 components. • 1st component is general sum. • 2nd component is a contrast. • Variables do not group cleanly.
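The "general sum" and "contrast" pattern can be reproduced on toy data. The sketch below assumes an uncentered SVD of a strictly positive matrix; the slides do not say whether the PCA was centered, and the data here are random, not the author's. In this setting the Perron-Frobenius theorem guarantees that the leading right singular vector has entries of a single sign, and orthogonality then forces the second vector to mix signs.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((200, 3))
H = rng.random((3, 6)) + 0.05   # strictly positive "spectra"
Y = W @ H                        # strictly positive mixture data

# Uncentered SVD; rows of Vt are the right singular vectors.
_, s, Vt = np.linalg.svd(Y, full_matrices=False)
v1, v2 = Vt[0], Vt[1]

# First vector: all entries share one sign (a general sum).
v1_same_sign = bool(np.all(v1 > 0) or np.all(v1 < 0))
# Second vector: orthogonal to the first, so it mixes signs (a contrast).
v2_mixed_sign = bool(np.any(v2 > 0) and np.any(v2 < 0))
```

Neither vector matches a row of H, which illustrates why the variables do not group cleanly under this decomposition.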
General Comments SVD is the basis for most linear statistical methods. PCA is terrible for mixtures. Where NMF can replace SVD, it will become increasingly important. NMF can be extended to complex, multi-block data sets. We need good software to make NMF accessible.
Matrix Factorization References • Good (1969) Technometrics – SVD. • Liu et al. (2003) PNAS – rSVD. • Lee and Seung (1999) Nature – NMF. • Brunet et al. (2004) PNAS – Microarray. • Fogel et al. (2007) Bioinformatics – Microarray.
HCS References Kümmel A, Gabriel D, Parker CN, Bender A. (2008) Computational methods to support high-content screening: from compound selection and data analysis to postulating target hypotheses. Expert Opin. Drug Discovery 4,1-9. Low J, Huang S, et al. (2008) High-content imaging characterization of cell cycle therapeutics through in vitro and in vivo subpopulation analysis. Mol Cancer Ther 7, 2455-2463. Young DW, Bender A, et al. (2008) Integrating high-content screening and ligand-target prediction to identify mechanism of action. Nature Chemical Biology 4, 59-68. Dürr O, Duval D, et al. (2007) Robust hit identification by quality assurance and multivariate data analysis of a high-content, cell-based assay. Journal of Biomolecular Screening 12, 1042-1049.
NMF Software • irMF: inferential, robust Matrix Factorization (JMP script), http://www.niss.org/irMF/ • Array Studio: software package that provides state-of-the-art statistics and visualization for the analysis of high-dimensional quantification data (e.g. microarray or TaqMan data). OmicSoft Corporation, www.omicsoft.com • BioNMF – free
Future Work: Multi-block [Figure: blocks Y, X1, X2, X3.] Find sets of co-varying variables. Relate sets of variables to outcomes. Find mutual support.
Co-Workers Stan Young, young@niss.org stan.young@omicsoft.com Paul Fogel, paul_fogel@hotmail.com George Luta, gl77@georgetown.edu Joe Maisog, bravas02@gmail.com
Useful Information Array Studio, www.omicsoft.com irMF, www.niss.org/irMF Google (BioNMF)
Software Architecture: Array Studio “L” Data Structure (~600k lines of code, ~200 users at GSK). [Figure labels: X Design; Y Intensity; A Annotation; User GUI; Script; Vis/Stat Modules.]
Array Studio User Interface [Screenshot labels: Search box; Views; View Controller; Project Explorer; Memory indicator; Web details; Details window.]