Mixture Models on Graphs
Guido Sanguinetti, Department of Computer Science, University of Sheffield
Joint work with Josselin Noirel and Phillip Wright, Chemical and Process Engineering, Sheffield
Basic question
• Given high-throughput measurements comparing two conditions, identify groups of over-, under- and normal expression.
• Biological quantities are linked in complex networks of interactions.
• Can we incorporate the network structure into our classifiers/clustering algorithms?
Traditional approach
• Use various statistical hypothesis-testing tools (t-statistics, p-values, etc.).
• A more Bayesian(ish) option: model the data as a mixture model.
• The key assumption is that the data are i.i.d. (see graphical model).
• Many variations on the theme (cyberT, PPLR, ...).
[Graphical model: a plate over the N observations, with latent class c generating observation y.]
Network-based approach
• In practice we expect the network structure to play a role: if many of your neighbours are overexpressed, you are more likely to be overexpressed.
• The graphical model is different.
• This allows us to identify subnetworks with coherent expression patterns.
[Graphical model: class variables C1-C6 coupled along network edges (X13, X23, X24, X35, X36), each generating an observation y1-y6.]
Prior model
• The graphical model suggests dependencies between the latent (class) variables.
• We will encode these in conditional priors on the mixture coefficients.
• Specifically, a Potts-type conditional prior
  $p(c_j = k \mid c_{\mathcal{N}(j)}) \propto \exp\Big(\gamma \sum_{i \in \mathcal{N}(j)} w_{ij}\,\delta(c_i, k)\Big)$
  where $\mathcal{N}(j)$ denotes the set of indices of nodes that are connected to the j-th node.
• Other possibilities have recently appeared (CRFs, spectral decompositions).
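For concreteness, a minimal Python sketch of a conditional prior of this kind; the names (W for the weighted adjacency matrix, gamma for the coupling strength, three classes) are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def conditional_prior(j, classes, W, gamma=1.0, n_classes=3):
    """p(c_j = k | classes of j's neighbours) under a Potts-type prior.

    classes : integer label per node (0 = under, 1 = no change, 2 = over)
    W       : symmetric weight matrix; W[i, j] > 0 iff nodes i and j are connected
    gamma   : coupling strength (hypothetical hyper-parameter)
    """
    neighbours = np.flatnonzero(W[j])            # the set N(j)
    scores = np.zeros(n_classes)
    for i in neighbours:                         # weighted vote of the neighbours
        scores[classes[i]] += gamma * W[j, i]
    p = np.exp(scores - scores.max())            # numerically stable softmax
    return p / p.sum()
```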
Class-conditional model
• We restrict ourselves to modelling log-expression ratios.
• Three classes: overexpressed, underexpressed and no change.
• We model the no-change class with a Gaussian, $p(y_j \mid c_j = 0) = \mathcal{N}(y_j;\, 0, \sigma^2)$.
• The other two classes have longer tails and are modelled with exponential distributions, $p(y_j \mid c_j = +1) = \lambda_+ e^{-\lambda_+ y_j}$ for $y_j > 0$, and similarly for underexpressed ($\lambda_- e^{\lambda_- y_j}$ for $y_j < 0$).
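A matching sketch of the three class-conditional densities; sigma2, lam_plus and lam_minus are illustrative parameter names.

```python
import numpy as np

def class_likelihoods(y, sigma2=0.25, lam_plus=1.0, lam_minus=1.0):
    """Return [p(y | under), p(y | no change), p(y | over)] for one log-ratio y."""
    no_change = np.exp(-y**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    over  = lam_plus  * np.exp(-lam_plus * y) if y > 0 else 0.0   # long right tail
    under = lam_minus * np.exp( lam_minus * y) if y < 0 else 0.0  # long left tail
    return np.array([under, no_change, over])
```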
Parameters and hyper-parameters
• The Gaussian variance is set by the user.
• The exponential parameters are given an improper prior, $p(\lambda_\pm) \propto 1/\lambda_\pm$.
• This is equivalent to making no assumption on the $\lambda$'s.
Conditional posteriors
• Conditional posteriors can be obtained analytically for both class membership and the exponential parameters:
  $p(c_j = k \mid y_j, c_{\mathcal{N}(j)}) \propto p(y_j \mid c_j = k)\, p(c_j = k \mid c_{\mathcal{N}(j)})$
  $p(\lambda_k \mid \mathbf{y}, \mathbf{c}) = \mathrm{Gamma}\Big(\lambda_k;\; N_k,\, \sum_{i \in I_k} |y_i|\Big)$
  where $N_k$ is the number of elements in class $k$ and $I_k$ is the set of indices corresponding to class $k$.
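As a sketch, the Gamma conditional posterior for an exponential rate can be sampled directly; note that numpy parameterises the Gamma by shape and scale = 1/rate, and the empty-class guard is my own pragmatic addition, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rate(y, classes, k):
    """Draw lambda_k from its conditional posterior Gamma(N_k, sum_{i in I_k} |y_i|)."""
    y_k = np.abs(y[classes == k])                # observations in class k (the set I_k)
    if len(y_k) == 0:                            # guard for an empty class
        return rng.gamma(shape=1.0, scale=1.0)
    return rng.gamma(shape=len(y_k), scale=1.0 / y_k.sum())
```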
Gibbs sampling
• The conditional posteriors are easy to sample from.
• A Gibbs sampling scheme can be devised easily; see the sketch below.
• Gibbs sampling is a particular form of the Metropolis-Hastings Markov chain Monte Carlo scheme in which the proposal distribution is the conditional posterior.
• As a consequence, no rejections are needed.
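Putting the pieces together, a minimal sketch of one Gibbs sweep over the class-membership variables, assuming the helper functions sketched after the earlier slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(y, classes, W, sigma2, lam_plus, lam_minus, gamma=1.0):
    """One in-place Gibbs sweep; every draw is from the exact conditional
    posterior, so no proposal is ever rejected."""
    for j in range(len(y)):
        prior = conditional_prior(j, classes, W, gamma)             # prior-model sketch
        lik = class_likelihoods(y[j], sigma2, lam_plus, lam_minus)  # class-conditional sketch
        post = prior * lik
        classes[j] = rng.choice(3, p=post / post.sum())
    return classes
```

After each sweep, the rates lam_plus and lam_minus would be redrawn with sample_rate from the previous sketch.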
Monitoring convergence
• Not an expert (I'd like to hear from one!)
• Standard textbook technique: run parallel chains and monitor mixing (e.g. Gelman, Carlin, Stern and Rubin).
• Burn-in period.
• Thinning.
• The results shown used a burn-in of 1000 iterations and a thinning factor of five.
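For reference, a short sketch of the potential scale reduction factor (the Gelman-Rubin diagnostic) computed from parallel chains of a scalar trace:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction R-hat; chains has shape (m chains, n draws)."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    var_hat = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_hat / W)                 # values near 1 indicate good mixing
```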
Synthetic results
• Generated a random scale-free network using the Barabási-Albert algorithm (see the sketch below).
• The network has 100 nodes and an average connectivity of ~2.
• Isolated nodes are removed.
• Classes are generated from the conditional priors by running a Markov chain to remove the initial bias.
• Data are generated from the class-conditional model.
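A sketch of this synthetic set-up using networkx; parameter values beyond the 100 nodes and mean degree ~2 stated on the slide (seed, number of burn-in sweeps, noise scales) are my assumptions.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

# Barabasi-Albert scale-free graph: 100 nodes, 1 edge per new node -> mean degree ~2
G = nx.barabasi_albert_graph(n=100, m=1, seed=0)
G.remove_nodes_from(list(nx.isolates(G)))       # drop isolated nodes
W = nx.to_numpy_array(G)

# run a Markov chain over the conditional priors to remove the initial bias
classes = rng.integers(0, 3, size=W.shape[0])
for _ in range(100):                            # 100 burn-in sweeps (my choice)
    for j in range(len(classes)):
        classes[j] = rng.choice(3, p=conditional_prior(j, classes, W))

# generate data from the class-conditional model (noise scales are illustrative)
n = len(classes)
y = np.where(classes == 1, rng.normal(0.0, 0.5, n),
    np.where(classes == 2, rng.exponential(1.0, n), -rng.exponential(1.0, n)))
```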
Synthetic results
[Figure. Left: MMG (blue) vs an ordinary mixture model. Each point is a different random network, with ten random data assignments.]
Real data (preliminary)
• E. coli reaction to oxygen exposure (Partridge et al., 2007).
• The network structure is given by the transcriptional regulation network.
• The network weights are given by regulatory strengths inferred using a state-space model.
• There is a large overlap among the classes.
• The biological significance is still to be investigated.
Future directions
• Use on metabolic data (the original motivation).
• Temporal structures?
• Directed graphs?
• Any more questions?