Module Networks Discovering Regulatory Modules and their Condition Specific Regulators from Gene Expression Data Cohen Jony
Outline • The Problem • Regulators • Module Networks • Learning Module Networks • Results • Conclusion
Intro: The Problem • Inferring regulatory networks from gene expression data.
Regulators example This is an example of a regulatory module.
Known solution: Bayesian Networks The problem: Too many variables and too little data cause statistical noise to lead to spurious dependencies, resulting in models that significantly overfit the data.
Module Networks • We assume that we are given a domain of random variables X = {X1, …, Xn}. • We use Val(Xi) to denote the domain of values of the variable Xi. • A module set C is a set of formal variables M1, …, MK; all the variables assigned to a module share the same CPD. • Note that all the variables in a module must have the same domain of values!
Module Networks • A module network template T = (S, θ) for C defines, for each module Mj in C: 1) a set of parents PaMj ⊆ X; 2) a conditional probability template (CPT) P(Mj | PaMj), which specifies a distribution over Val(Mj) for each assignment in Val(PaMj). • We use S to denote the dependency structure encoded by {PaMj : Mj in C} and θ to denote the parameters required for the CPTs {P(Mj | PaMj) : Mj in C}.
Module Networks • A module assignment function for C is a function A : X → {1, …, K} such that A(Xi) = j only if Val(Xi) = Val(Mj). • A module network is defined by both the module network template and the assignment function.
Example • In our example, we have three modules M1, M2, and M3. • PaM1 = Ø, PaM2 = {MSFT}, and PaM3 = {AMAT, INTL}. • In our example, we have that A(MSFT) = 1, A(MOT) = 2, A(INTL) = 2, and so on.
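The definitions above can be sketched as plain data structures. This is an illustrative layout, not code from the paper; only the module names, parent sets, and assignments come from the slides' example.

```python
# Minimal sketch of a module network: a template (modules with shared
# parents) plus an assignment function A mapping variables to modules.
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    parents: list  # PaMj: parent variables shared by every gene in the module
    # In a full model, the CPT P(Mj | PaMj) would be attached here.

@dataclass
class ModuleNetwork:
    modules: dict = field(default_factory=dict)     # module name -> Module
    assignment: dict = field(default_factory=dict)  # A: variable -> module name

# The slides' example: three modules over stock variables.
net = ModuleNetwork()
net.modules = {
    "M1": Module("M1", parents=[]),
    "M2": Module("M2", parents=["MSFT"]),
    "M3": Module("M3", parents=["AMAT", "INTL"]),
}
net.assignment = {"MSFT": "M1", "MOT": "M2", "INTL": "M2"}
```

Because all variables in a module share one CPT, the template stores one parent set and one distribution per module rather than per variable.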
Learning Module Networks • The iterative learning procedure searches for the model with the highest score using an Expectation Maximization (EM) style algorithm. • An important property of the EM algorithm is that each iteration is guaranteed to improve the score, until convergence to a local maximum. • Each iteration of the algorithm consists of two steps: the M-step and the E-step.
Learning Module Networks cont. M-step • In the M-step, the procedure is given a partition of the genes into modules and learns the best regulation program (regression tree) for each module. • The regulation program is learned via a combinatorial search over the space of trees. • The tree is grown from the root to its leaves: at any given node, the query that best partitions the gene expression into two distinct distributions is chosen, until no beneficial split exists.
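The greedy split selection in the M-step can be sketched as follows. This is a toy stand-in: each candidate query is assumed to be of the form "regulator expression above a threshold", and splits are scored by the Gaussian log-likelihood of the two resulting halves rather than the full Bayesian score the procedure actually uses.

```python
# Toy sketch of choosing, at one tree node, the split that best
# partitions the gene expression into two distinct distributions.
import math

def gaussian_loglik(values):
    """Log-likelihood of values under their own fitted Gaussian."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n + 1e-6  # smoothed variance
    return sum(-0.5 * (math.log(2 * math.pi * var) + (v - mu) ** 2 / var)
               for v in values)

def best_split(expr, regulator, threshold_candidates):
    """Return (threshold, score) of the best query on this regulator.

    expr: expression values of the module's genes in each experiment.
    regulator: the candidate regulator's expression in the same experiments.
    """
    best = None
    for t in threshold_candidates:
        left = [e for e, r in zip(expr, regulator) if r <= t]
        right = [e for e, r in zip(expr, regulator) if r > t]
        if not left or not right:
            continue  # a split must produce two non-empty groups
        score = gaussian_loglik(left) + gaussian_loglik(right)
        if best is None or score > best[1]:
            best = (t, score)
    return best
```

In the real procedure this search is repeated recursively at each node until no split improves the score.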
Learning Module Networks cont. E-step • In the E-step, given the inferred regulation programs, we determine the module whose associated regulation program best predicts each gene's behavior. • We evaluate the probability of a gene's measured expression values in the dataset under each regulation program, obtaining an overall probability that this gene's expression profile was generated by that program. • We then select the module whose program gives the gene's expression profile the highest probability, and re-assign the gene to this module. • We take care not to assign a regulator gene to a module in which it is also a regulatory input.
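The reassignment in the E-step can be sketched as below. The interface is an assumption for illustration: each module's learned regulation program is abstracted into a function that returns the log-probability of an expression profile.

```python
# Toy sketch of the E-step: re-assign a gene to the module whose
# regulation program gives its expression profile the highest probability.
import math

def reassign(gene_profile, module_logliks, forbidden=()):
    """module_logliks: dict module name -> function(profile) -> log P(profile).

    forbidden: modules in which this gene is itself a regulatory input;
    per the slides, such assignments are disallowed.
    """
    best_module, best_ll = None, -math.inf
    for name, loglik in module_logliks.items():
        if name in forbidden:
            continue
        ll = loglik(gene_profile)
        if ll > best_ll:
            best_module, best_ll = name, ll
    return best_module
```

Iterating the M-step and E-step until no gene changes module gives the local maximum the slides describe.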
Bayesian score • When the priors satisfy the assumptions below, the Bayesian score decomposes into a sum of local module scores: score(S, A : D) = Σj scoreMj(PaMj, Xj : D) • where scoreMj(U, X : D) = log ∫ Lj(U, X, ӨMj : D) P(ӨMj | Sj = U) dӨMj + log ρj(U) + log αj(X)
Bayesian score cont. • Lj(U, X, ӨMj : D) is the likelihood function. • P(ӨMj | Sj = U) is the parameter prior. • Sj = U denotes that we chose a structure in which U is the set of parents of module Mj. • Aj = X denotes that the assignment A assigns exactly the variables X to module Mj.
Assumptions • Let P(A), P(S | A), P(Ө | S, A) be the assignment, structure, and parameter priors. • P(Ө | S, A) satisfies parameter independence if P(Ө | S, A) = Πj P(ӨMj | S, A). • P(Ө | S, A) satisfies parameter modularity if P(ӨMj | S1, A) = P(ӨMj | S2, A) for all structures S1 and S2 in which Mj has the same parents.
Assumptions • P(Ө, S | A) satisfies assignment independence if P(Ө | S, A) = P(Ө | S) and P(S | A) = P(S). • P(S) satisfies structure modularity if P(S) = Πj ρj(Sj), where Sj denotes the choice of parents for module Mj, and ρj is a distribution over the possible parent sets for module Mj. • P(A) satisfies assignment modularity if P(A) ∝ Πj αj(Aj), where Aj is the choice of variables assigned to module Mj, and {αj : j = 1, …, K} is a family of functions from 2^X to the positive reals.
Assumptions - Explanations • Parameter independence, parameter modularity, and structure modularity are the natural analogues of standard assumptions in Bayesian network learning. • Parameter independence implies that P(Ө | S, A) is a product of terms that parallels the decomposition of the likelihood, with one prior term per local likelihood term Lj. • Parameter modularity states that the prior for the parameters of a module Mj depends only on the choice of parents for Mj and not on other aspects of the structure. • Structure modularity implies that the prior over the structure S is a product of terms, one per module.
Assumptions - Explanations • The next two assumptions are new to module networks. • Assignment independence makes the priors on the parents and parameters of a module independent of the exact set of variables assigned to the module. • Assignment modularity implies that the prior on A is proportional to a product of local terms, one corresponding to each module. • Thus, the reassignment of one variable from a module Mi to another module Mj does not change our preferences on the assignment of variables in modules other than Mi and Mj.
Experiments • The network learning procedure was evaluated on synthetic data, gene expression data, and stock market data. • The data consisted solely of continuous values. Since all of the variables have the same domain, the definition of the module set reduces simply to a specification of the total number of modules. • Beam search was used as the search algorithm, with a lookahead of three splits to evaluate each operator. • As a comparison, Bayesian networks were learned with precisely the same structure learning algorithm, simply treating each variable as its own module.
Synthetic data • The synthetic data was generated by a known module network. • The generating model had 10 modules and a total of 35 variables that were a parent of some module. From the generating module network, 500 variables were selected, including the 35 parents. • This procedure was run for training sets of various sizes ranging from 25 instances to 500 instances, each repeated 10 times for different training sets.
Synthetic data - results • Generalization to unseen test data was evaluated by measuring the likelihood ascribed by the learned model to 4500 unseen instances. • As expected, models learned with larger training sets do better; but, when run using the correct number of 10 modules, the gain of increasing the number of data instances beyond 100 samples is small. • Models learned with a larger number of modules had a wider spread of assignments of variables to modules and consequently achieved poorer performance.
Synthetic data – results cont. • For all training set sizes, except 25, the model with 10 modules performs the best. Log-likelihood per instance assigned to held-out data.
Synthetic data – results cont. Fraction of variables assigned to the largest 10 modules. • Models learned using 100, 200, or 500 instances and up to 50 modules assigned 80% of the variables to 10 modules.
Synthetic data – results cont. Average percentage of correct parent-child relationships recovered. • The total number of parent-child relationships in the generating model was 2250. • The procedure recovers 74% of the true relationships when learning from a dataset of size 500 instances.
Synthetic data – results cont. • As the variables begin fragmenting over a large number of modules, the learned structure contains many spurious relationships. • Thus in domains with a modular structure, statistical noise is likely to prevent overly detailed learned models such as Bayesian networks from extracting the commonality between different variables with a shared behavior.
Gene Expression Data • Expression data measuring the response of yeast to different stress conditions was used. • The data consists of 6157 genes and 173 experiments. • 2355 genes that varied significantly in the data were selected, and a module network was learned over these genes. • A Bayesian network was also learned over this data set.
Candidate regulators • A set of 466 candidate regulators was compiled from SGD and YPD. • The set includes both transcription factors and signaling proteins that may have transcriptional impact. • Genes described as similar to such regulators were also included. • Global regulators, whose regulation is not specific to a small set of genes or processes, were excluded.
Gene Expression results • The figure demonstrates that module networks generalize much better than Bayesian networks to unseen data for almost all choices of the number of modules.
Biological validity • Biological validity of the learned module network with 50 modules was tested. • The enriched annotations reflect the key biological processes expected in our dataset. • For example, the “protein folding” module contains 10 genes, 7 of which are annotated as protein folding genes. In the whole data set, there are only 26 genes with this annotation. Thus, the p-value of this annotation, that is, the probability of choosing 7 or more genes in this category by choosing 10 random genes, is less than 10^-12. • 42 modules, out of 50, had at least one significantly enriched annotation with a p-value less than 0.005.
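The enrichment p-value described above is a hypergeometric tail probability, and can be reproduced with the standard library using the slide's numbers (26 annotated genes out of 2355, a module of 10 genes with 7 hits). This is a sketch of the calculation, not the authors' code.

```python
# P(X >= hits) for X ~ Hypergeometric(total, annotated, module_size):
# the chance of drawing at least `hits` annotated genes when picking
# `module_size` genes at random from `total`.
from math import comb

def enrichment_pvalue(total, annotated, module_size, hits):
    denom = comb(total, module_size)
    return sum(comb(annotated, k) * comb(total - annotated, module_size - k)
               for k in range(hits, min(annotated, module_size) + 1)) / denom

# The "protein folding" module from the slides.
p = enrichment_pvalue(total=2355, annotated=26, module_size=10, hits=7)
```

The result is a vanishingly small probability, consistent with the slide's claim that the annotation is significantly enriched.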
Biological validity cont. • The enrichment of both the HAP4 motif and STRE, recognized by Hap4 and Msn4, respectively, supports their inclusion in the module's regulation program. • Lines represent 500 bp of genomic sequence located upstream of the start codon of each of the genes; colored boxes represent the presence of cis-regulatory motifs located in these regions.
Stock Market Data • NASDAQ stock prices for 2143 companies, covering 273 trading days. • stock → variable, instance → trading day. • The value of a variable is the log of the ratio between that day's and the previous day's closing stock price. • As potential controllers, the 250 of the 2143 stocks with the largest average trading volume across the dataset were selected.
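The variable definition above amounts to computing daily log returns; a minimal sketch:

```python
# Log return for each trading day: log(close_today / close_yesterday).
import math

def log_returns(closes):
    """closes: list of daily closing prices, oldest first."""
    return [math.log(today / yesterday)
            for yesterday, today in zip(closes, closes[1:])]
```

Log ratios make gains and losses symmetric around zero, which suits the continuous-valued, same-domain setup the experiments require.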
Stock Market Data • Cross validation is used to evaluate the generalization ability of different models. • Module networks perform significantly better than Bayesian networks in this domain.
Stock Market Data Module networks compared with AutoClass • Significant enrichment was found for 21 annotations, covering a wide variety of sectors. • In 20 of the 21 cases, the enrichment was far more significant in the modules learned using module networks than in the clusters learned by AutoClass.
Conclusions • The results show that learned module networks have much higher generalization performance than a Bayesian network learned from the same data. • Parameter sharing between variables in the same module allows each parameter to be estimated from a much larger sample; this lets us learn dependencies that would be considered too weak based on the statistics of single variables alone (these are well-known advantages of parameter sharing). • An interesting aspect of the method is that it determines automatically which variables have shared parameters.
Conclusions • The assumption of shared structure significantly restricts the space of possible dependency structures, allowing us to learn more robust models than those learned in a classical Bayesian network setting. • In a module network, a spurious correlation would have to arise between a possible parent and a large number of other variables before the algorithm would introduce the dependency.
Literature • Reference: Eran Segal, Michal Shapira, Aviv Regev, Dana Pe'er, David Botstein, Daphne Koller, and Nir Friedman. Discovering Regulatory Modules and their Condition Specific Regulators from Gene Expression Data. • Bibliography: P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. AutoClass: a Bayesian classification system. In ML '88, 1988.