Dependency networks Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576 sroy@biostat.wisc.edu Nov 26th, 2013
Goals for today • Introduction to Dependency networks • GENIE3: A network inference algorithm for learning a dependency network from gene expression data • Comparison of various network inference algorithms
What you should know • What are dependency networks? • How do they differ from Bayesian networks? • Learning a dependency network from expression data • Evaluation of various network inference methods
Graphical models for representing regulatory networks • Bayesian networks • Dependency networks [Figure: an example with regulators Sho1 (X1) and Msb2 (X2) and target Ste20 (Y3); the structure component shows edges X1→Y3 and X2→Y3, and the function component shows Y3 = f(X1, X2)] • Random variables encode expression levels • Edges correspond to some form of statistical dependency
Dependency network • A type of probabilistic graphical model • As in Bayesian networks, has • A graph component • A probability component • Unlike Bayesian networks • Can have cyclic dependencies Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. Heckerman, Chickering, Meek, Rounthwaite, Kadie, 2000
Notation • Xi: ith random variable • X = {X1, …, Xp}: set of p random variables • xik: an assignment of Xi in the kth sample • x-ik: set of assignments to all variables other than Xi in the kth sample
Dependency networks • Function: fj can be of different types • Learning requires estimation of each of the fj functions • In all cases learning tries to minimize the error of predicting Xj from its neighborhood [Figure: a set of candidate regulators feeding into fj, which outputs Xj]
Different representations of the fj function • If X is continuous • fj can be a linear function • fj can be a regression tree • fj can be a random forest • An ensemble of trees • If X is discrete • fj can be a conditional probability table • fj can be a conditional probability tree
Linear regression [Figure: scatter plot of output Y against input X with a fitted line, whose slope and intercept are the regression parameters] Linear regression assumes that the output (Y) is a linear function of the input (X): Y = intercept + slope · X
Estimating the regression coefficients • Assume we have N training samples • We want to minimize the sum of squared errors between the true and predicted values of the output Y
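A minimal sketch (not from the slides) of this least-squares fit for simple linear regression: the slope and intercept that minimize the sum of squared errors have a closed form in terms of sample means. Function and variable names are illustrative.

```python
# Closed-form least-squares fit of y = intercept + slope * x,
# minimizing the sum of squared errors over N training samples.
def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept, slope

# On noiseless data from the line y = 1 + 2x, the fit recovers the line exactly.
intercept, slope = fit_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```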
An example random forest for predicting gene expression [Figure: an ensemble of regression trees mapping an input expression profile to an output prediction; one highlighted path for a selected set of genes tests, e.g., Sox6 > 0.5]
Considerations for learning regression trees • Assessing the purity of samples under a leaf node • Minimize prediction error • Minimize entropy • How to determine when to stop building a tree? • Minimum number of data points at each leaf node • Depth of the tree • Purity of the data points under any leaf node
Algorithm for learning a regression tree • Input: output variable Xj, input variables X-j • Initialize tree to a single node with all samples under the node • Estimate • mc: the mean of all samples under the node • S: sum of squared errors • Repeat until no more nodes to split • Search over all input variables and split values and compute S for possible splits • Pick the variable and split value that gives the largest reduction in error
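The split-search step above can be sketched as follows: score every (variable, threshold) pair by the summed squared error S of the two resulting children and keep the split with the largest reduction relative to the parent. Names are illustrative, and the stopping criteria from the previous slide are omitted.

```python
def sse(values):
    """Sum of squared errors of a set of output values around their mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(samples, output):
    """samples: list of input vectors; output: list of target values.
    Returns (error_reduction, variable_index, threshold) for the best split."""
    parent = sse(output)
    best = None
    n_vars = len(samples[0])
    for i in range(n_vars):
        for threshold in sorted({s[i] for s in samples}):
            left = [y for s, y in zip(samples, output) if s[i] <= threshold]
            right = [y for s, y in zip(samples, output) if s[i] > threshold]
            if not left or not right:
                continue  # split puts all samples on one side; skip it
            reduction = parent - (sse(left) + sse(right))
            if best is None or reduction > best[0]:
                best = (reduction, i, threshold)
    return best
```

With outputs [1, 1, 5, 5] driven entirely by the first input variable, the search picks variable 0 and removes all of the parent's error.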
GENIE3: GEne Network Inference with Ensemble of trees • Solves a set of regression problems • One per random variable • Models non-linear dependencies • Outputs a directed, cyclic graph with a confidence for each edge • Focuses on generating a ranking over edges rather than a graph structure and parameters Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. Van Anh Huynh-Thu, Alexandre Irrthum, Louis Wehenkel, Pierre Geurts, PLoS ONE 2010
GENIE3 algorithm sketch • For each gene j, generate input/output pairs • LSj = {(x-jk, xjk), k = 1..N} • Use a feature selection technique on LSj, such as tree building, to compute wij for all genes i ≠ j • wij quantifies the confidence of the edge from Xi to Xj • Generate a global ranking of regulators based on the wij
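A sketch of this outer loop: one prediction problem per gene, a score wij for every candidate regulator, then a global ranking of directed edges. As a stand-in for the tree-based importance GENIE3 actually uses, this sketch scores regulators by absolute Pearson correlation; `expr` and all names are illustrative.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb) if va and vb else 0.0

def rank_edges(expr):
    """expr: N samples, each a list of p expression values.
    Returns directed edges (i, j) sorted by confidence w_ij, highest first."""
    n_genes = len(expr[0])
    cols = [[row[g] for row in expr] for g in range(n_genes)]
    weights = {}
    for j in range(n_genes):          # one prediction problem per gene j
        for i in range(n_genes):      # score every candidate regulator i
            if i != j:
                weights[(i, j)] = abs(pearson(cols[i], cols[j]))
    return sorted(weights, key=weights.get, reverse=True)
```

On data where gene 1 tracks gene 0 exactly, the edges between genes 0 and 1 rank at the top of the global list.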
GENIE3 algorithm sketch Figure from Huynh-Thu et al.
Feature selection in GENIE3 • Random forest to represent fj • Learning the random forest • Generate M = 1000 bootstrap samples • At each node to be split, search for the best split among K randomly selected variables • K was set to p−1 or (p−1)^(1/2)
Computing the importance weight of each predictor • Feature importance is computed at each test node • Remember there can be multiple test nodes per regulator • For a test node, importance is the reduction in variance achieved by splitting on that node: I = #S·Var(S) − #St·Var(St) − #Sf·Var(Sf) • S: set of data samples that reach the test node; St, Sf: subsets for which the split test is true/false • #S: size of the set S • Var(S): variance of the output variable in set S
Computing the importance of a predictor • For a single tree, the overall importance of a variable is the sum over all test nodes in the tree where that variable is used to split • For an ensemble, the importance is averaged over all trees
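A minimal sketch of the variance-reduction importance at a single test node, as described above: I = #S·Var(S) − #St·Var(St) − #Sf·Var(Sf), where S holds the output values reaching the node and St, Sf the subsets sent to each child. Var(·) is taken as the population variance here; names are illustrative.

```python
def variance(vals):
    """Population variance of a list of output values."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

def node_importance(s, s_true, s_false):
    """Variance reduction achieved by splitting the samples in s
    into the subsets s_true and s_false."""
    return (len(s) * variance(s)
            - len(s_true) * variance(s_true)
            - len(s_false) * variance(s_false))
```

A split that separates [1, 1, 5, 5] into two pure children removes all of the node's variance, so its importance equals #S·Var(S) = 4 · 4 = 16.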
Computational complexity of GENIE3 • Complexity per variable • O(T·K·N log N) • T is the number of trees • K is the number of random attributes selected per split • N is the learning sample size
Evaluation of network inference methods • Assume we know what the “right” network is • One can use precision-recall (PR) curves to evaluate the predicted network • The area under the PR curve (AUPR) quantifies performance
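A sketch of this evaluation, assuming the predicted network is given as a ranked list of edges and the "right" network as a set of true edges: walk down the ranking, record (recall, precision) at each cutoff, and integrate with the trapezoid rule. Treating precision at recall 0 as 1.0 is a convention adopted here; all names are illustrative.

```python
def pr_curve(ranked_edges, true_edges):
    """(recall, precision) at each cutoff of the ranked edge list."""
    tp = 0
    points = []
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in true_edges:
            tp += 1
        points.append((tp / len(true_edges), tp / k))
    return points

def aupr(points):
    """Area under the PR curve by the trapezoid rule,
    starting from (recall=0, precision=1) by convention."""
    area = 0.0
    prev_r, prev_p = 0.0, 1.0
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area
```

A perfect ranking of all true edges gives AUPR 1.0; mistakes near the top of the ranking cost the most area.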
DREAM: Dialogue for Reverse Engineering Assessments and Methods • Community effort to assess regulatory network inference • DREAM 5 challenge • Previous challenges: 2006, 2007, 2008, 2009, 2010 Marbach et al. 2012, Nature Methods
Where do different methods rank? [Figure: performance of inference methods compared against a random baseline and a community (ensemble) prediction; Marbach et al., 2010]
Summary of network inference methods • Probabilistic graphical models provide a natural representation of networks • A lot of network inference is done using gene expression data • Many algorithms exist; we have seen three • Bayesian networks • Sparse candidates • Module networks • Dependency networks • GENIE3 • Algorithms can be grouped into per-gene and per-module methods