THE JOHNS HOPKINS UNIVERSITY
550.635 TOPICS IN BIOINFORMATICS

ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context
Adam A. Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla-Favera and Andrea Califano

Reverse engineering of regulatory networks in human B cells
Katia Basso, Adam Margolin, Gustavo Stolovitzky, Ulf Klein, Riccardo Dalla-Favera and Andrea Califano

http://www.cis.jhu.edu/~sanchez/ARACNE_635_sanchez.ppt
Instructor: Dr. Donald Geman
Student: Francisco Sánchez Vega
Baltimore, 28 November 2006
Outline
• A very short introduction to MRF
• The ARACNE approach
  • Theoretical framework
  • Technical implementation
• Experiments and results
  • Synthetic data
  • Human B cells (paper by Basso et al.)
• Discussion
A very short introduction to MRFs
What is a Markov Random Field?
• Intuitive idea: start from the concept of a Markov chain
  Xt0, Xt1, …, X(t-1), Xt, X(t+1), …
  P(Xt | Xt0, Xt1, …, X(t-1)) = P(Xt | X(t-1))
"The world is a huge Markov chain" (Ronald A. Howard)
A very short introduction to MRFs
What is a Markov Random Field?
• A Markov random field is a model of the joint probability distribution over a set X of random variables.
• It generalizes Markov chains to a process indexed by the sites of a graph and satisfying a Markov property with respect to a neighborhood system.
A very short introduction to MRFs
MRF: a constructive definition
• Let X = {X1, …, XN} be the set of variables (the stochastic process) whose JPD we want to model.
• Start by considering a set of sites or vertices V = {V1, …, VN}.
• Define a neighborhood system Γ = {Γv, v ∈ V} where
  (1) Γv ⊂ V
  (2) v ∉ Γv
  (3) v ∈ Γu ⟺ u ∈ Γv
A very short introduction to MRFs
MRF: a constructive definition
• Example: an N × N square lattice
  V = {(i, j) : 1 ≤ i, j ≤ N}
  Γ(i,j) = {(k, l) ∈ V : 1 ≤ (i−k)² + (l−j)² ≤ c}
[Figure: neighborhoods of a lattice site for c = 1, c = 2 and c = 8]
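A minimal sketch of how such a neighborhood system can be enumerated for a given c (purely illustrative; the function name is made up):

```python
# Hypothetical sketch (not from the paper): the neighborhood of a site (i, j)
# on an N x N lattice for the order parameter c used on this slide.

def neighborhood(i, j, N, c):
    """Return the sites (k, l) with 1 <= (i-k)^2 + (j-l)^2 <= c."""
    return [(k, l)
            for k in range(1, N + 1)
            for l in range(1, N + 1)
            if 1 <= (i - k) ** 2 + (j - l) ** 2 <= c]

# c = 1 gives the 4 nearest neighbors, c = 2 adds the diagonals (8 neighbors).
print(len(neighborhood(5, 5, 10, 1)))  # 4
print(len(neighborhood(5, 5, 10, 2)))  # 8
```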
A very short introduction to MRFs
MRF: a constructive definition
• At this point, we can build an undirected graph G = (V, E):
  • Each vertex v ∈ V is associated with one of the random variables in X.
  • The set of edges E is given by the chosen neighborhood system.
[Figure: the lattice graphs induced by c = 1 and c = 2]
A very short introduction to MRFs
MRF: a constructive definition
• Clique: an induced complete subgraph of G.
[Figure: examples of cliques for c = 1 and c = 2]
A very short introduction to MRFs
MRF: a constructive definition
• Imagine that each variable Xv in X = {X1, …, XN} can take one of a finite number Lv of values.
• Define the configuration space S as the set of all possible joint assignments x = (x1, …, xN).
• We say that X is a MRF on (V, Γ) if
  (a) P(X = x) > 0, for all x ∈ S
  (b) P(Xv = xv | Xu = xu, u ≠ v) = P(Xv = xv | Xu = xu, u ∈ Γv), for all v ∈ V
A very short introduction to MRFs
MRF: a constructive definition
[Figure: the lattice MRF and its neighborhood structure for c = 1 and c = 2]
A very short introduction to MRFs
• Let C be the set of cliques of G. Let us now associate a specific function, that we will call a potential, φc : S → R, to each clique c ∈ C, where φc depends only on the variables in c.
• We say that a probability distribution p ∈ PS is a Gibbs distribution wrt (V, Γ) if it is of the form
  p(x) = (1/Z) exp( − Σ_{c∈C} φc(x) )
  where Z is the partition function:
  Z = Σ_{x∈S} exp( − Σ_{c∈C} φc(x) )
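A tiny illustrative example of such a Gibbs distribution, assuming a made-up three-variable chain with simple pairwise potentials (not from the slides):

```python
# Illustrative sketch: a Gibbs distribution over a tiny binary MRF.
# The graph, potentials and variable names below are made up for illustration.
import itertools
import math

# Three binary variables on a chain X1 - X2 - X3; the cliques are the two edges.
cliques = [(0, 1), (1, 2)]

def potential(clique, x):
    """A simple pairwise potential: neighboring sites prefer equal values."""
    i, j = clique
    return 0.0 if x[i] == x[j] else 1.0

def energy(x):
    return sum(potential(c, x) for c in cliques)

# Partition function Z = sum over all configurations of exp(-energy)
configs = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(-energy(x)) for x in configs)

def p(x):
    return math.exp(-energy(x)) / Z

print(sum(p(x) for x in configs))   # ≈ 1.0: a valid probability distribution
print(p((0, 0, 0)), p((0, 1, 0)))   # aligned configurations are more likely
```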
A very short introduction to MRFs
Correspondence Theorem
X is a MRF with respect to (V, Γ) if and only if p(x) = P(X = x) is a Gibbs distribution with respect to (V, Γ) (i.e. every given Markov Random Field has an associated Gibbs distribution and every given Gibbs distribution has an associated Markov Random Field).
Outline
• A very short introduction to MRF
• The ARACNE approach
  • Theoretical framework
  • Technical implementation
• Experiments and results
  • Synthetic data
  • Human B cells (paper by Basso et al.)
• Discussion
Theoretical framework
The name of the game: from microarray data to a graphical model.
Theoretical framework
• We need a strategy to deal with uncertainty under small samples: maximum entropy models.
• Philosophical basis:
  • Model all that is known and assume nothing about that which is unknown.
  • Given a certain dataset, choose a model which is consistent with it, but otherwise as "uniform" as possible.
Theoretical framework
MaxEnt: toy example (adapted from A. Berger)
• Paul, Mary, Jane, Bryan and Emily are five grad students who work in the same research lab.
• Let us try to model the probability distribution of the discrete random variable X ≡ "first person to arrive at the lab in the morning":
  p(X=Paul) + p(X=Mary) + p(X=Jane) + p(X=Bryan) + p(X=Emily) = 1
• If we do not know anything about them, the most "uniform" model is the one that maximizes the entropy H(p):
  p(X=Paul) = p(X=Mary) = p(X=Jane) = p(X=Bryan) = p(X=Emily) = 1/5
Theoretical framework
MaxEnt: toy example (adapted from A. Berger)
• Imagine that we know that Mary or Jane is the first one to arrive 30% of the time. In that case, we know:
  p(X=Paul) + p(X=Mary) + p(X=Jane) + p(X=Bryan) + p(X=Emily) = 1
  p(X=Mary) + p(X=Jane) = 3/10
• Again, we want to choose the most "uniform" model, i.e. maximize H(p(X)), although this time we need to respect the new constraint:
  p(X=Mary) = p(X=Jane) = (3/10)(1/2) = 3/20
  p(X=Paul) = p(X=Bryan) = p(X=Emily) = (7/10)(1/3) = 7/30
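As a sanity check, the same toy problem can be solved numerically. A minimal sketch (the solver choice and variable names are illustrative, not part of the original slides):

```python
# Sketch: solving the toy MaxEnt problem numerically and recovering the
# closed-form values on this slide (7/30 and 3/20).
import numpy as np
from scipy.optimize import minimize

names = ["Paul", "Mary", "Jane", "Bryan", "Emily"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},     # normalization
    {"type": "eq", "fun": lambda p: p[1] + p[2] - 0.3},   # Mary or Jane first 30% of the time
]
bounds = [(0.0, 1.0)] * 5
p0 = np.full(5, 0.2)

res = minimize(neg_entropy, p0, bounds=bounds, constraints=constraints)
print(dict(zip(names, np.round(res.x, 4))))
# ≈ {'Paul': 0.2333, 'Mary': 0.15, 'Jane': 0.15, 'Bryan': 0.2333, 'Emily': 0.2333}
```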
Theoretical framework
MaxEnt extensions ~ constrained optimization problem
• S: countable set.
• PS: set of probability measures on S, PS = {p(x) : x ∈ S}.
• C: set of K constraints for the problem.
• PC: subset of PS that satisfies all the constraints in C.
• Let {φ1, …, φK} be a set of K functions S → R [feature functions]. We choose these functions so that each given constraint can be expressed as
  Σ_{x∈S} p(x) φi(x) = αi
• If x are samples from a r.v. X, this says E[φi(X)] = αi.
Theoretical framework
MaxEnt extensions ~ constrained optimization problem
• Ex. Let x be a sample from X ≡ "person to arrive first in the lab". In this case, S = {Paul, Mary, Jane, Bryan, Emily}.
• C = {c1}, with c1: p(X=Mary) + p(X=Jane) = 3/10.
• We can model c1 as follows. Define
  φ1(x) = 1 if x = Mary or x = Jane, 0 otherwise
  and α1 = 3/10, so that the constraint reads E[φ1(X)] = α1.
Theoretical framework
MaxEnt extensions ~ constrained optimization problem
• PC: subset of PS that satisfies all the constraints in C:
  PC = {p ∈ PS : Σ_{x∈S} p(x) φi(x) = αi, i = 1, …, K}
• Find p*(x) = arg max_{p ∈ PC} H(p(x))
• Of course, since p(x) is a probability measure, we also require
  Σ_{x∈S} p(x) = 1 and p(x) ≥ 0.
Theoretical framework
• Use Lagrange multipliers: introduce one multiplier λi per constraint and μ for the normalization, and maximize
  L(p, λ, μ) = H(p) + Σ_{i=1..K} λi ( Σ_{x∈S} p(x) φi(x) − αi ) + μ ( Σ_{x∈S} p(x) − 1 )
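Setting the derivative of this Lagrangian with respect to each p(x) to zero (a standard MaxEnt step, spelled out here for completeness) gives the exponential form stated on the next slide:

```latex
\frac{\partial \mathcal{L}}{\partial p(x)}
  = -\log p(x) - 1 + \sum_{i=1}^{K}\lambda_i\,\varphi_i(x) + \mu = 0
\quad\Longrightarrow\quad
\log p(x) = \sum_{i=1}^{K}\lambda_i\,\varphi_i(x) + (\mu - 1).
```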
Theoretical framework
• This leads to a solution of the form:
  p(x) = (1/Z) exp( Σ_{i=1..K} λi φi(x) )
• But this is a Gibbs distribution, and therefore, by the correspondence theorem, we can think in terms of the underlying Markov Random Field!
  • We can profit from previous knowledge of MRFs.
  • We have found the graphical model paradigm that we were looking for!
Technical implementation
The name of the game: from microarray data to a graphical model. We now move from the theoretical framework to the technical implementation.
Approximation of the interaction structure
• Consider the discrete, binary case. First order constraints: the N single-gene marginals P(gi), i.e. N constraints.
[Figure: a network of N nodes X1, …, XN]
Approximation of the interaction structure
• Consider the discrete, binary case. Second order constraints: the pairwise marginals P(gi, gj), i.e. N(N−1)/2 constraints.
Approximation of the interaction structure
• Consider the discrete, binary case. For jth order: on the order of (N choose j) constraints (see the small computation below).
• The higher the order, the more accurate our approximation will be…
  … provided we observe enough data!
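To get a feel for how quickly the number of constraints grows with the order, here is a small illustrative computation (N = 100 is an arbitrary choice, not a value from the paper):

```python
# Sketch: number of marginal constraints at each order for N genes.
# The value of N is purely illustrative of the combinatorial growth.
from math import comb

N = 100
for j in (1, 2, 3):
    print(f"order {j}: {comb(N, j)} marginals to estimate")
# order 1: 100
# order 2: 4950
# order 3: 161700
```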
Approximation of the interaction structure
[Figures from "The Elements of Statistical Learning" (Hastie and Tibshirani) and from Dr. Munos' lectures, Master MVA]
"…for M → ∞ (where M is sample set size) the complete form of the JPD is restored. In fact, M > 100 is generally sufficient to estimate 2-way marginals in genomics problems, while P(gi, gj, gk) requires about an order of magnitude more samples…"
Approximation of the interaction structure
• Therefore, the model they adopt is of the form:
  P(g1, …, gN) = (1/Z) exp( − Σi φi(gi) − Σ_{i<j} φij(gi, gj) )
• All gene pairs for which φij = 0 are declared mutually non-interacting:
  • Some of these pairs are statistically independent, i.e. P(gi, gj) ≈ P(gi)P(gj).
  • Some are pairs of genes that do not interact directly but are statistically dependent due to their interaction via other genes, i.e. P(gi, gj) ≠ P(gi)P(gj), but φij = 0.
  How can we extract this information?
Approximation of the interaction structure
• Therefore, the model they adopt is of the form:
  P(g1, …, gN) = (1/Z) exp( − Σi φi(gi) − Σ_{i<j} φij(gi, gj) )
• Three situations must be distinguished:
  • Pairs of genes for which the MI is a rightful indicator of dependency: P(gi, gj) = P(gi)P(gj) ⟺ φij = 0.
  • Pairs of genes that interact only through a third gene.
  • Pairs of genes whose direct interaction is balanced out by a third gene.
Mutual Information
• The mutual information between two random variables is a measure of the amount of information that one random variable contains about another.
• It is defined as the relative entropy (or Kullback–Leibler distance) between the joint distribution and the product of the marginals:
  I(X; Y) = Σ_{x,y} p(x, y) log [ p(x, y) / ( p(x) p(y) ) ]
• Alternative formulations:
  I(X; Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y)
Mutual Information estimation
Another toy example
Joint distribution P(X, Y):
  P(X=0, Y=0) = 2/10   P(X=0, Y=1) = 4/10
  P(X=1, Y=0) = 2/10   P(X=1, Y=1) = 2/10
Marginals:
  P(X=0) = 6/10; P(X=1) = 4/10
  P(Y=0) = 4/10; P(Y=1) = 6/10
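Plugging this toy table into the definition of mutual information (a small sketch, not part of the original slides):

```python
# Sketch: the mutual information of the toy joint distribution above,
# computed directly from I(X;Y) = sum p(x,y) log[ p(x,y) / (p(x)p(y)) ].
import numpy as np

pxy = np.array([[0.2, 0.4],    # P(X=0,Y=0), P(X=0,Y=1)
                [0.2, 0.2]])   # P(X=1,Y=0), P(X=1,Y=1)
px = pxy.sum(axis=1)           # [0.6, 0.4]
py = pxy.sum(axis=0)           # [0.4, 0.6]

I = sum(pxy[i, j] * np.log2(pxy[i, j] / (px[i] * py[j]))
        for i in range(2) for j in range(2))
print(I)  # ≈ 0.020 bits: X and Y are weakly dependent
```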
Mutual Information estimation
Another toy example
The Gaussian kernel estimator: each density is estimated with Gaussian kernels of width h centred on the data points,
  f(x) = (1/M) Σ_{i=1..M} (1 / (h√(2π))) exp( −(x − xi)² / (2h²) )
and the MI is estimated by averaging the log-ratio over the samples,
  I(X; Y) ≈ (1/M) Σ_{i=1..M} log [ f(xi, yi) / ( f(xi) f(yi) ) ]
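A minimal sketch of such a Gaussian-kernel MI estimator (bandwidth, sample size and data below are illustrative; this is not the actual ARACNE implementation):

```python
# Sketch of a Gaussian-kernel mutual information estimator, in the spirit of
# the estimator described above. Bandwidth h and the test data are illustrative.
import numpy as np

def gaussian_kernel_mi(x, y, h=0.25):
    """Estimate I(X;Y) in nats from paired samples using Gaussian kernel densities."""
    dx = (x[:, None] - x[None, :]) / h
    dy = (y[:, None] - y[None, :]) / h
    kx = np.exp(-0.5 * dx ** 2)
    ky = np.exp(-0.5 * dy ** 2)
    fx = kx.mean(axis=1) / (h * np.sqrt(2 * np.pi))        # f_hat(x_i)
    fy = ky.mean(axis=1) / (h * np.sqrt(2 * np.pi))        # f_hat(y_i)
    fxy = (kx * ky).mean(axis=1) / (2 * np.pi * h ** 2)    # f_hat(x_i, y_i)
    return np.mean(np.log(fxy / (fx * fy)))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)                   # a dependent pair
z = rng.normal(size=1000)                       # independent of x
print(gaussian_kernel_mi(x, y))                 # noticeably larger
print(gaussian_kernel_mi(x, z))                 # much smaller
```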
Mutual Information estimation
Another toy example
[Figure: the joint P(X, Y) and the kernel estimates f(x, y), f(x) and f(y)]
Mutual Information estimation
Another toy example
[Figure: effect of the kernel width on the estimate, comparing the reference h = 1 with h' = 4h and h'' = h/2]
Mutual Information estimation
• Every pair of variables is copula-transformed: (X1, X2) → [FX1(X1), FX2(X2)], where FX(x) = P(X ≤ x).
• By the Probability Integral Transform theorem, we know that FX(X) ~ U[0, 1].
• Thus, the resulting variables have range between 0 and 1, and the marginals are uniform.
• The transformation is one-to-one, so the MI is unaffected.
• It reduces the influence of arbitrary transformations from microarray data preprocessing.
• No need for position-dependent kernel widths (h).
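A sketch of the copula (rank) transform using the empirical CDF (illustrative, not the paper's code):

```python
# Sketch of the copula (rank) transform described above: each variable is
# mapped through its empirical CDF, giving a uniform marginal on (0, 1].
import numpy as np

def copula_transform(x):
    """Empirical CDF transform: returns ranks scaled to (0, 1]."""
    ranks = np.argsort(np.argsort(x)) + 1
    return ranks / len(x)

rng = np.random.default_rng(1)
x = rng.lognormal(size=1000)     # heavily skewed, expression-like values
u = copula_transform(x)
print(u.min(), u.max())          # uniform-looking marginal on (0, 1]
# The transform is monotone (one-to-one), so mutual information is preserved.
```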
Mutual Information estimation
• The choice of h is critical for the accuracy of the MI estimate, but much less so for the estimation of MI ranks.
Technical implementation • At this point, we have a procedure to construct an undirected graph from our data. • In an “ideal world” (infinite samples, assumptions hold), we would be done. • Unfortunately, things are not so simple !
Technical implementation
• Problems: finite random samples, and possible higher-order interactions.
• Use hypothesis testing. For each pair of genes:
  • H0: the two genes are mutually independent (no edge)
  • HA: the two genes interact with each other (edge)
• We reject the null hypothesis H0 (i.e. we "draw an edge") when the MI between the two genes is large enough. We need to choose a statistical threshold I0.
Choice of the statistical threshold
• Chosen approach: random permutation analysis.
• Randomly shuffle the gene expression values and labels.
• The resulting variables are supposed to be mutually independent, but their estimated MI need not be equal to zero.
• Each threshold I0 can then be assigned a p-value (which measures the probability of getting an MI value higher than or equal to I0 just "by chance").
• Thus, by fixing a p-value, we obtain the desired I0, as sketched below.
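A sketch of this permutation procedure; mi_estimate stands for any MI estimator (for instance the Gaussian-kernel sketch above) and is a placeholder, not a function from the paper:

```python
# Sketch of the permutation analysis described above: shuffle the expression
# values to destroy any dependence, recompute the MI many times, and read off
# the threshold I0 corresponding to a chosen p-value.
import numpy as np

def permutation_threshold(x, y, mi_estimate, p_value=0.05, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    null_mi = np.array([mi_estimate(rng.permutation(x), y) for _ in range(n_perm)])
    # I0 is the (1 - p_value) quantile of the null MI distribution
    return np.quantile(null_mi, 1.0 - p_value)

# Example use (hypothetical): I0 = permutation_threshold(g1, g2, gaussian_kernel_mi)
```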
Choice of the statistical threshold
• Chosen approach: random permutation analysis.
[Figure: histogram of MI values (number of pairs vs. MI value) for the permuted data, with the threshold I0 marked in the tail]
Choice of the statistical threshold
• Chosen approach: random permutation analysis.
• Parameter fitting: use a known fact from large deviation theory, namely that for independent variables the probability of observing an MI value above I0 decays exponentially in I0 (with a rate that grows with the sample size). The p-value curve estimated from the permutations can therefore be fit and extrapolated to very small p-values.
[Figure: the fitted p-value curves]
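One way the fitting step could look, assuming (as argued above) that log P(I > I0) is approximately linear in I0; this is an illustrative sketch, not the authors' code:

```python
# Sketch: fit log P(I > I0) as a linear function of I0 from permutation data,
# then invert the fit to extrapolate to p-values too small to sample directly.
# null_mi is assumed to come from a permutation analysis like the one above.
import numpy as np

def fit_exponential_tail(null_mi, thresholds):
    """Return slope a and intercept b of log P(I > I0) ~ a*I0 + b."""
    null_mi = np.asarray(null_mi)
    thresholds = np.asarray(thresholds)
    tail_p = np.array([(null_mi > t).mean() for t in thresholds])
    mask = tail_p > 0                     # keep thresholds with nonzero empirical tail
    a, b = np.polyfit(thresholds[mask], np.log(tail_p[mask]), 1)
    return a, b

def threshold_for_pvalue(a, b, p):
    """Invert the fit: the I0 whose extrapolated p-value equals p."""
    return (np.log(p) - b) / a
```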
Technical implementation
• If the real distribution contains a genuine three-way interaction (φijk ≠ 0) while the pairwise terms are zero (φij = φik = φjk = 0), ARACNE will get it wrong, because of our assumption that only pairwise interactions are present.
• The extension of the algorithm to higher-order interactions is a possible subject of future research.
[Figure: real network vs. ARACNE's output]
Technical implementation
• If the real distribution is such that two genes interact directly (φij ≠ 0) and yet Iij = 0, ARACNE will not identify the edge between gi and gj.
• However, this situation is considered "biologically unrealistic".
[Figure: real network vs. ARACNE's output]
Technical implementation
• If the real distribution is such that two genes do not interact directly (φij = 0) and yet Iij ≠ 0 (because of an indirect interaction), then, as it is, ARACNE will put an edge between gi and gj.
• The algorithm can be improved using the Data Processing Inequality.
[Figure: real network vs. ARACNE's output]
The Data Processing Inequality
• The DPI is a well known theorem within the Information Theory community.
  Let X, Y, Z be three random variables that form a Markov chain X–Y–Z. Then I(X;Y) ≥ I(X;Z).
  Proof. By the chain rule, I(X; Y,Z) = I(X;Z) + I(X;Y|Z) = I(X;Y) + I(X;Z|Y).
  X and Z are conditionally independent given Y, so I(X;Z|Y) = 0.
  Thus I(X;Y) = I(X;Z) + I(X;Y|Z) ≥ I(X;Z), since I(X;Y|Z) ≥ 0.
  Similarly, we can prove I(Y;Z) ≥ I(X;Z), and therefore:
  I(X;Z) ≤ min[ I(X;Y), I(Y;Z) ]
The Data Processing Inequality
• A consequence of the DPI is that no transformation of Y, however clever, can increase the information that Y contains about X (consider the Markov chain of the form X–Y–[Z = g(Y)]).
• From an intuitive point of view:
[Figure: a chain of three people (Mike, Francisco, Jean-Paul) passing information along, illustrating that information can only degrade along the chain]
The Data Processing Inequality
• Going back to our case study: suppose the estimated MIs in a gene triplet are
  I(gi; gj) = 0.1, I(gj; gk) = 0.2, I(gi; gk) = 0.3.
• Since I(gi; gj) is the smallest of the three, the DPI suggests that the gi–gj dependence is better explained through gk. So we assume that φij = 0, even though P(gi, gj) ≠ P(gi)P(gj). (See the pruning sketch below.)
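A sketch of the resulting DPI pruning step applied to an MI matrix; the 3×3 matrix reproduces the toy values above, and the tolerance parameter eps is illustrative:

```python
# Sketch of DPI pruning: for every triple (i, j, k), remove the edge (i, j)
# when both indirect edges carry more mutual information, as in the example above.
import numpy as np

def dpi_prune(mi, eps=0.0):
    """Return a boolean adjacency matrix after applying the Data Processing Inequality."""
    n = mi.shape[0]
    keep = mi > 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if k in (i, j):
                    continue
                # Drop (i, j) if it is the weakest edge of the triangle (up to tolerance eps)
                if mi[i, j] < min(mi[i, k], mi[j, k]) * (1 - eps):
                    keep[i, j] = keep[j, i] = False
    return keep

mi = np.array([[0.0, 0.1, 0.3],   # g_i
               [0.1, 0.0, 0.2],   # g_j
               [0.3, 0.2, 0.0]])  # g_k
print(dpi_prune(mi))
# The weakest edge (MI = 0.1, between g_i and g_j) is the only one removed.
```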
The Data Processing Inequality
• But what if the underlying network is truly a three-gene loop? …Then ARACNE will break the loop at its weakest edge! (This is a known flaw of the algorithm.)
• Philosophy: "An interaction is retained iff there exist no alternate paths which are a better explanation for the information exchange between two genes."
• Claim: in practice, looking at the TP vs. FP tradeoff, it pays to simplify.
The Data Processing Inequality
• Theorem 1. If MIs can be estimated with no errors, then ARACNE reconstructs the underlying interaction network exactly, provided this network is a tree and has only pairwise interactions.
• Theorem 2. The Chow–Liu (CL) maximum mutual information tree is a subnetwork of the network reconstructed by ARACNE.
• Theorem 3. Let πik be the set of nodes forming the shortest path in the network between nodes i and k. Then, if MIs can be estimated without errors, ARACNE reconstructs an interaction network without false positive edges, provided: (a) the network consists only of pairwise interactions, and (b) for each j ∈ πik, Iij ≥ Iik. Further, ARACNE does not produce any false negatives, and the network reconstruction is exact iff (c) for each directly connected pair (ij) and for any other node k, we have Iij ≥ min(Ijk, Iik).