Stochastic Block Models of Mixed Membership. Edo Airoldi (1,2), Dave Blei (2), Steve Fienberg (1), Eric Xing (1). (1) Carnegie Mellon University, (2) Princeton University. SAMSI, High Dimensional Inference and Random Matrices, September 17th, 2006.
The Scientific Problem • Protein-protein interactions in yeast • Different studies test protein interactions with different technologies (and different precision) [Figure panels: Interaction graphs; Expression graphs]
The Data: Interaction Graphs • M proteins in a graph (nodes) • M^2 observations on pairs of proteins • Edges are random quantities, Y[n,m] • Interactions are not independent • Interacting proteins form a protein complex • T graphs on the same set of proteins • Partial annotations for each protein, X[n] • Example: M = 871 nodes, M^2 ≈ 750K entries
The Scientific Problems • What are stable protein complexes? • They perform many cellular processes • A protein may be a member of several complexes • How many complexes are there? • How do stable protein complexes interact? • Test hypotheses (inform new analyses) • Learn complex-to-complex interaction patterns
More Network Data • Disease spread • Electronic circuit • Food web • Internet • Social network
An Abstraction of the Data • A collection of unipartite graphs: $G_{1:T} = (Y_{1:T}, N)$ • Integer, real, or multivariate edge weights: $Y_t = \{ Y_t[nm] : n, m \in N \}$ • Node-specific (multivariate) attributes: $X_{1:T} = \{ X_t[n] : n \in N \}$ • Partially observable $Y_{1:T}$ and $X_{1:T}$
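One way to hold this abstraction in code is sketched below. It is only an illustration: the class name `GraphCollection`, the use of `None` for unobserved entries, and the method `observed_fraction` are assumptions, not part of the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Edge = Optional[float]  # None marks an unobserved edge weight (assumed convention)

@dataclass
class GraphCollection:
    """T graphs on one shared node set, with partially observed edges and attributes."""
    nodes: List[str]                  # the node set N
    Y: List[List[List[Edge]]]         # Y[t][n][m]: edge weight in graph t
    X: List[Dict[str, List[float]]]   # X[t][node]: attributes; absent key = unobserved

    def observed_fraction(self, t: int) -> float:
        """Share of entries of graph t that were actually observed."""
        cells = [v for row in self.Y[t] for v in row]
        return sum(v is not None for v in cells) / len(cells)

# Toy example: one graph on two proteins, diagonal entries unobserved.
data = GraphCollection(
    nodes=["A", "B"],
    Y=[[[None, 1.0], [0.0, None]]],
    X=[{"A": [1.0, 0.0]}],            # protein B has no annotation
)
fraction = data.observed_fraction(0)  # 2 observed cells out of 4
```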
The Challenge • Given the data abstraction and the goals of the analysis • Can we posit a rich class of models that is instrumental for thinking about the scientific problems we face? • Is that class also amenable to theoretical analysis?
Modeling Ideas • Hierarchical Bayes • Latent variables encode semantic elements • Assume structure on observable-latent elements • Combination of two classes of models: 1. Models of mixed membership 2. Network models (block models) • Result: stochastic block models of mixed membership
Graphical Model Representation [Figure: graphical model with two parts, Stochastic Blocks and Mixed Membership]
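A Hierarchical Likelihood • Mixed membership vectors (latent*): $\theta_i$, $\theta_j$ • Group-to-group patterns (latent*): $\eta$, e.g. $\eta_{23} = 0.9$ • Interactions (observed*): $y_{ij}$, e.g. $y_{ij} = 1$ • $\Pr(y_{ij} = 1 \mid \theta_i, \theta_j, \eta) = \theta_i^{\top} \eta \, \theta_j$ [Figure: nodes i and j with memberships over groups 1, 2, 3]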
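A minimal Python sketch of the likelihood on this slide, read as Pr(y_ij = 1 | theta_i, theta_j, eta) = theta_i^T eta theta_j. The symbol names `theta` and `eta`, the function name `edge_probability`, and every number except eta_23 = 0.9 are assumptions made for illustration.

```python
def edge_probability(theta_i, theta_j, eta):
    """Bilinear form theta_i^T eta theta_j for membership vectors on the simplex."""
    K = len(eta)
    return sum(theta_i[g] * eta[g][h] * theta_j[h]
               for g in range(K) for h in range(K))

# Hypothetical 3-group interaction matrix; only eta[1][2] = 0.9 comes from the slide.
eta = [[0.8, 0.1, 0.0],
       [0.1, 0.7, 0.9],
       [0.0, 0.2, 0.6]]

pure_2 = [0.0, 1.0, 0.0]    # node fully in group 2
pure_3 = [0.0, 0.0, 1.0]    # node fully in group 3
mixed = [0.5, 0.5, 0.0]     # node split between groups 1 and 2

p_pure = edge_probability(pure_2, pure_3, eta)    # picks out eta_23 = 0.9
p_mixed = edge_probability(mixed, pure_3, eta)    # 0.5 * 0.0 + 0.5 * 0.9 = 0.45
```

With pure memberships the formula reduces to a single entry of eta, as in an ordinary block model; mixed memberships average over entries.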
More Modeling Issues • Technical :: Sparsity • Introduce a parameter that modulates the relative importance of ones and zeros (binary edges) in the cost function that drives the clustering • Biological :: Ribosomes & Distress • Some protein complexes act like hubs because they are involved, e.g., in protein production or cell recovery (Y2H technology is invasive)
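One way such a sparsity parameter can enter the cost function is sketched below. It assumes a weight `rho` in [0, 1) that scales the block-model edge probability to (1 - rho) * p. The name `rho`, the function `edge_log_lik`, and this exact form are assumptions for illustration; the slide does not specify them.

```python
import math

def edge_log_lik(y, p_block, rho):
    """Bernoulli log-likelihood of one binary edge with a sparsity weight rho.

    p_block is the block-model probability of an edge; rho shrinks it, so an
    observed zero is partly attributed to sparsity rather than to the blocks.
    """
    p = (1.0 - rho) * p_block
    return math.log(p) if y == 1 else math.log(1.0 - p)

# A missing edge between two groups that usually interact (p_block = 0.9):
cost_zero_no_sparsity = edge_log_lik(0, 0.9, rho=0.0)   # log(0.10)
cost_zero_sparse = edge_log_lik(0, 0.9, rho=0.8)        # log(0.82), a much milder penalty
```

Under this assumed form, raising `rho` makes zeros less influential and ones more costly, which is the kind of trade-off the slide describes.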
Large Scale Computation • Masses of data • 750K observations in a small problem (M = 871) • 2.5M observations with M = 1578 • 3M expressions for 6K genes/proteins in yeast • Variational inference [Jordan et al., 2001] • Naïve implementation does not work • We develop a novel “nested” variational algorithm
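The observation counts quoted on this slide follow from M^2 ordered pairs of proteins, which a short check confirms. The helper name `pair_count` is an assumption for illustration.

```python
def pair_count(m):
    """Number of ordered pairs of m nodes, self-pairs included (M^2)."""
    return m * m

for m in (871, 1578):
    print(f"M = {m}: M^2 = {pair_count(m):,}")
# M = 871: M^2 = 758,641      (about 750K)
# M = 1578: M^2 = 2,490,084   (about 2.5M)
```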
Example: A Scientific Question • Do PPI contain information about functions? [Figure: raw data feed the model, which yields an approximate posterior on membership vectors; these are set against functional annotations; example protein YLD014W]
Interactions in Yeast (MIPS) • Do PPI contain information about functions? [Figure: example protein YLD014W; axis ticks 1, 2, 3, ..., 15]
Results: Identifiability • In this example we map latent groups to known functional categories [Figure panels: Known Annotations; Unknown Annotations]
Results: Mixed Membership • The estimated membership vectors support the mixed membership assumption [Figure label: Mixed membership]
General Bayesian Formulation • Assumptions for unipartite graphs • Population: existence of K sub-populations • Latent variable: mixed membership vectors, $\theta[n] \sim D$ • Subject: exchangeable edges given blocks and memberships, $Y[nm] \sim f(\,\cdot \mid \theta[n]^{\top} \eta \, \theta[m])$ • Sampling scheme: the graphs are IID • Additional data, e.g., attributes, annotations • Integrated model formulation (descriptive/predictive)
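The generative assumptions above can be simulated with a short sketch. It makes concrete choices the slide leaves open: D is taken to be a Dirichlet distribution, f a Bernoulli with success probability theta[n]^T eta theta[m], and the names `sample_dirichlet`, `sample_graph`, `alpha`, `theta`, and `eta` are all assumptions for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_graph(n_nodes, alpha, eta, seed=0):
    """Sample membership vectors, then binary edges given memberships and blocks."""
    rng = random.Random(seed)
    K = len(eta)
    theta = [sample_dirichlet(alpha, rng) for _ in range(n_nodes)]
    Y = [[0] * n_nodes for _ in range(n_nodes)]
    for n in range(n_nodes):
        for m in range(n_nodes):
            if n == m:
                continue  # no self-interactions in this sketch
            p = sum(theta[n][g] * eta[g][h] * theta[m][h]
                    for g in range(K) for h in range(K))
            Y[n][m] = 1 if rng.random() < p else 0
    return theta, Y

# Two groups that interact mostly within themselves (hypothetical values).
theta, Y = sample_graph(5, alpha=[0.5, 0.5], eta=[[0.9, 0.05], [0.05, 0.9]], seed=1)
```

Calling `sample_graph` T times with fresh seeds would mimic the IID sampling scheme for a collection of graphs.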
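Variational Algorithms
• Nested algorithm:
  • init ($\gamma_i \ \forall i$)
  • while (log-lik not converged) loop $\forall ij$:
    • init $\phi_{ij}$
    • while (log-lik not converged): update $\phi_{ij}$
    • partially update ($\gamma_i$, $\gamma_j$)
• Naïve algorithm:
  • init ($\gamma_i \ \forall i$, $\phi_{ij} \ \forall ij$)
  • while (log-lik not converged): update ($\phi_{ij} \ \forall ij$), then update ($\gamma_i \ \forall i$)
• We trade space for time but …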
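The two update schedules can be contrasted in a small Python sketch. It is not the authors' implementation, and the following choices are assumptions: standard mean-field coordinate updates for a Dirichlet/Bernoulli block model, a fixed interaction matrix `eta` with entries strictly between 0 and 1, a symmetric prior `alpha`, and a fixed number of sweeps in place of a log-likelihood convergence test. For simple bookkeeping the nested version below also stores every phi, so it illustrates the schedule only, not the memory behaviour.

```python
import math
import random

def digamma(x):
    """Digamma for x > 0: shift upward by recurrence, then use the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))

def softmax(logits):
    top = max(logits)
    w = [math.exp(v - top) for v in logits]
    total = sum(w)
    return [v / total for v in w]

def expected_log_theta(gamma_n):
    """E[log theta] under a Dirichlet(gamma_n) variational factor."""
    shift = digamma(sum(gamma_n))
    return [digamma(g) - shift for g in gamma_n]

def update_phi(y, gamma_i, gamma_j, eta, start, steps):
    """Coordinate ascent on the sender/receiver indicators of one pair (i, j)."""
    K = len(eta)
    e_i, e_j = expected_log_theta(gamma_i), expected_log_theta(gamma_j)
    ll = [[math.log(eta[g][h] if y == 1 else 1.0 - eta[g][h]) for h in range(K)]
          for g in range(K)]
    send, recv = start
    for _ in range(steps):
        send = softmax([e_i[g] + sum(recv[h] * ll[g][h] for h in range(K)) for g in range(K)])
        recv = softmax([e_j[h] + sum(send[g] * ll[g][h] for g in range(K)) for h in range(K)])
    return send, recv

def gamma_from_phi(phi, N, K, alpha):
    """gamma_i = alpha + sum of all indicators in which node i takes part."""
    gamma = [[alpha] * K for _ in range(N)]
    for (i, j), (send, recv) in phi.items():
        for k in range(K):
            gamma[i][k] += send[k]
            gamma[j][k] += recv[k]
    return gamma

def initialise(N, K, alpha, seed):
    """Random indicators (to break label symmetry) and the matching gamma."""
    rng = random.Random(seed)
    pairs = [(i, j) for i in range(N) for j in range(N) if i != j]
    phi = {pair: tuple(softmax([rng.random() for _ in range(K)]) for _ in range(2))
           for pair in pairs}
    return pairs, phi, gamma_from_phi(phi, N, K, alpha)

def naive_vi(Y, eta, alpha=0.1, sweeps=10, seed=0):
    N, K = len(Y), len(eta)
    pairs, phi, gamma = initialise(N, K, alpha, seed)
    for _ in range(sweeps):
        for (i, j) in pairs:          # update every phi_ij with gamma held fixed
            phi[(i, j)] = update_phi(Y[i][j], gamma[i], gamma[j], eta, phi[(i, j)], 1)
        gamma = gamma_from_phi(phi, N, K, alpha)   # then update every gamma_i
    return gamma

def nested_vi(Y, eta, alpha=0.1, sweeps=10, inner=5, seed=0):
    N, K = len(Y), len(eta)
    pairs, phi, gamma = initialise(N, K, alpha, seed)
    uniform = ([1.0 / K] * K, [1.0 / K] * K)
    for _ in range(sweeps):
        for (i, j) in pairs:
            old_send, old_recv = phi[(i, j)]
            # inner loop: re-initialise phi_ij and iterate it several times
            send, recv = update_phi(Y[i][j], gamma[i], gamma[j], eta, uniform, inner)
            for k in range(K):        # partial update: only gamma_i and gamma_j change
                gamma[i][k] += send[k] - old_send[k]
                gamma[j][k] += recv[k] - old_recv[k]
            phi[(i, j)] = (send, recv)
    return gamma
```

Both functions return the variational Dirichlet parameters; normalising each row gives an estimated membership vector. In the nested schedule each pair sees the freshest gamma values, whereas the naïve schedule refreshes gamma only once per full sweep.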
Variational Algorithms for MMSB • On a single machine* we empirically observed, for the nested algorithm relative to the naïve one: faster convergence (which offsets the extra computation) and more stable paths to convergence [Figure: panels comparing Nested and Naïve]
Take Home Points • Bayesian formulation is integral to the biology • A novel class of models that combines mixed membership (MM) for soft clustering with network models for dependent data • Latent aspect patterns correlate with, and help predict, functional processes in the cell • Current implementation allows for fast inference on large matrices through variational approximation • Considerable opportunity to improve upon both computation and efficiency of the approximation
References • Data & problems: Gavin et al. (2002) Nature; Ho et al. (2002) Nature; Mewes et al. (2004) Nucleic Acids Research; Krogan et al. (2006) Nature • Mixed membership models: Pritchard et al. (2000); Erosheva (2002); Rosenberg et al. (2002); Blei et al. (2003); Xing et al. (2003ab); Erosheva et al. (2004); Airoldi et al. (2005); Blei & Lafferty (2006); Xing et al. (2006) • Stochastic network models: Wasserman et al. (1980, 1994, 1996); Fienberg et al. (1985); Frank & Strauss (1986); Nowicki & Snijders (2001); Hoff et al. (2002); Airoldi et al. (2006) • More material on the Web at: http://www.cs.cmu.edu/~eairoldi/ • ICML Workshop on “Statistical Network Analysis: Models, Issues and New Directions” on June 29 at Carnegie Mellon, Pittsburgh PA: http://nlg.cs.cmu.edu/