Inferring gene regulatory networks from multiple microarray datasets (Wang 2006) Tiffany Ko ELE571 Spring 2009
Outline • Introduction • Gene Regulatory Networks • DNA Microarrays • Objectives • Methods • Approach: SVD • GNR Algorithm • Confidence Evaluation • Results • Simulated Data • Experimental Data • Discussion • Limitations • Conclusions
Gene Regulatory Networks http://upload.wikimedia.org/wikipedia/commons/0/07/Gene.png http://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Gene2-plain.svg/708px-Gene2-plain.svg.png
Gene Regulatory Networks http://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Gene_Regulatory_Network_2.jpg/800px-Gene_Regulatory_Network_2.jpg
Gene Regulatory Networks http://www.pnas.org/content/104/31/12890/F2.large.jpg
DNA Microarrays • Y-direction: genes • X-direction: data points • M x N matrix S • M genes, N experiments • Expression level (color intensity) reflects how much complementary DNA from the sample has bound to each gene's probes. • High number of genes, low number of samples/data points.
Objectives • Purpose • Construct a novel method of gene network reconstruction (GNR) which is able to process multiple microarray datasets from different experiments and infer the most consistent gene network (GN) while taking the sparsity of connections into consideration. • Motivation • Multiple datasets: addresses data scarcity and the “dimensionality problem” • Improve inferred gene network reliability • Derive gene networks with more biologically plausible sparsity
Approach • Express Gene Networks (GN) as differential equations. • Derive a solution for a single time-course dataset using singular value decomposition (SVD). • Find the most consistent network structure with respect to all datasets. • Optimal solution has minimal connections (edges).
Approach • Express Gene Networks (GN) as differential equations. • Gene regulation dynamics are typically nonlinear; however, linear equations capture the main features of the network.
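The slide's own equation did not survive extraction. As a hedged sketch, a linear gene network model of this kind is commonly written as below; the symbols are assumptions for illustration: x(t) is the vector of expression levels of the n genes, J is the n x n connectivity matrix, and b(t) collects external stimuli.

```latex
\frac{dx(t)}{dt} = J\,x(t) + b(t), \qquad x(t) \in \mathbb{R}^{n},\quad J \in \mathbb{R}^{n \times n}
```

Under this reading, a positive J_ij means gene j activates gene i, a negative J_ij means repression, and J_ij = 0 means no edge from gene j to gene i.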
Approach • Derive a solution for a single time-course dataset using singular value decomposition (SVD). SVD: • Nonzero elements of e_k are listed last, s.t. e_1 = … = e_l = 0 and e_(l+1), …, e_n ≠ 0. • Allows for a particular solution with the smallest L2 norm for the connectivity matrix, Ĵ.
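A minimal NumPy sketch of the SVD step, under stated assumptions: the model is taken as X' = J X with no external input term, the data are random placeholders, and all variable names are hypothetical. It builds the minimum-norm particular solution and then shows that adding any matrix Y times the null-space directions gives another exact solution, which is why a single dataset leaves the network under-determined.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_times = 5, 3                       # fewer time points than genes

X = rng.normal(size=(n_genes, n_times))       # hypothetical expression data
J_true = rng.normal(size=(n_genes, n_genes))  # hypothetical "true" network
X_dot = J_true @ X                            # derivatives under dx/dt = J x

# SVD of X^T. NumPy sorts singular values in descending order, so the
# unconstrained (zero) directions are the trailing rows of Vt.
U, e, Vt = np.linalg.svd(X.T, full_matrices=True)
rank = int(np.sum(e > 1e-10))
V_row, V_null = Vt[:rank], Vt[rank:]

# Particular solution with the smallest L2 (Frobenius) norm.
J_hat = X_dot @ U[:, :rank] @ np.diag(1.0 / e[:rank]) @ V_row

# Any Y gives another exact solution, because V_null @ X = 0.
Y = rng.normal(size=(n_genes, V_null.shape[0]))
J_other = J_hat + Y @ V_null

print(np.allclose(J_hat @ X, X_dot), np.allclose(J_other @ X, X_dot))
print(np.linalg.norm(J_hat) <= np.linalg.norm(J_other))
```

Note that the slide lists the zero singular values first, while NumPy lists them last; the two conventions describe the same decomposition.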
Approach • Find the most consistent network structure with respect to all datasets. • N microarray datasets exist for one organism; each corresponds to its own general solution, Jk. • Jk is already normalized in time due to the definition of X’. • LP problem posed:
Approach • Find the most consistent network structure with respect to all datasets. • Matching Term • Matches the most consistent solution J with each dataset k’s solution • Weighted by dataset reliability • Sparsity Term • Forces sparsity by minimizing the L1 norm • Relative importance of the two terms is balanced by a trade-off parameter
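The two terms described above can be sketched as a single objective function. The exact formula was lost from the slide, so the form below is an assumption: a reliability-weighted L1 distance between the consensus network J and each per-dataset solution, plus a trade-off parameter (called `lam` here, a hypothetical name) times the L1 norm of J.

```python
def gnr_objective(J, J_list, weights, lam):
    """Weighted L1 matching term plus lam times the L1 sparsity term."""
    matching = 0.0
    for J_k, w_k in zip(J_list, weights):
        for row, row_k in zip(J, J_k):
            matching += w_k * sum(abs(a - b) for a, b in zip(row, row_k))
    sparsity = sum(abs(a) for row in J for a in row)
    return matching + lam * sparsity


# Two hypothetical per-dataset solutions and a candidate consensus network.
J1 = [[1.0, 0.0], [0.5, -1.0]]
J2 = [[1.0, 0.2], [0.0, -1.0]]
J = [[1.0, 0.0], [0.0, -1.0]]
print(gnr_objective(J, [J1, J2], weights=[0.5, 0.5], lam=0.1))
```

Because both terms are sums of absolute values, minimizing an objective of this shape can be posed as a linear program, which is consistent with the "LP problem" wording on the slide.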
GNR Algorithm • When J is fixed, the problem can be divided into N independent subproblems, one per dataset. • Through iteration, J is then updated based on the results for Y. • STEP 0: Initialize; set iteration index q = 1. • STEP 1: Fix J(q-1); solve for each dataset's Y. • STEP 2: Fix the resulting per-dataset solutions; solve for J(q). • STEP 3: Check for convergence; else return to STEP 1.
Algorithm: Step 0 • Initialize: • Using SVD, solve for the particular solution Ĵ of each dataset. • Set initial values: • Ensure given parameters are positive. • q = iteration index; set q = 1.
Algorithm: Step 1 • Update each dataset's solution: • At iteration q, with J(q-1) fixed, solve the LP for each dataset:
Algorithm: Step 2 & 3 • STEP 2: Having solved for each dataset's Y, fix all of the per-dataset solutions Jk and solve for J(q): • STEP 3: Check for convergence. • Is the convergence criterion met? • Yes: Terminate computation. • No: Return to STEP 1.
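A hedged sketch of the Step 2 update. It assumes the objective is a reliability-weighted L1 matching term plus a trade-off parameter times the L1 norm of J. Under that assumption, each entry of J can be solved independently once the per-dataset solutions are fixed, and the minimizer is a weighted median of the per-dataset values together with 0 (weighted by the trade-off parameter). This closed form is swapped in here for illustration; the slides describe solving an LP. The names `weighted_median`, `update_consensus`, and `lam` are hypothetical.

```python
def weighted_median(values, weights):
    """Return a minimizer of sum_i weights[i] * |x - values[i]|."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


def update_consensus(J_list, weights, lam):
    """Entrywise minimizer of sum_k w_k*|J - J_k| + lam*|J| for fixed J_k."""
    n_rows, n_cols = len(J_list[0]), len(J_list[0][0])
    return [
        [
            weighted_median(
                [J_k[i][j] for J_k in J_list] + [0.0],  # 0.0 encodes sparsity
                list(weights) + [lam],
            )
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


# Three hypothetical per-dataset solutions with equal reliability.
J1 = [[0.9, 0.0], [0.4, -1.0]]
J2 = [[1.1, 0.3], [0.0, -1.2]]
J3 = [[1.0, 0.0], [0.0, -0.8]]
J = update_consensus([J1, J2, J3], weights=[1.0, 1.0, 1.0], lam=0.5)
print(J)
```

Entries supported by only one dataset are pulled to zero, while entries on which the datasets agree keep a value close to the consensus; raising `lam` pushes more entries to zero.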
Confidence Evaluation • Given the optimal solution J, we can compute for each element Jij: • Variance • Deviation • Overall average deviation
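The formulas for these quantities did not survive on the slide. The sketch below shows one plausible reading and should be treated as an assumption: the variance of each element is the reliability-weighted mean squared difference between the per-dataset solutions and the optimal solution, the deviation is its square root, and the overall figure is the mean deviation across all elements. The function name `confidence` is hypothetical.

```python
import math


def confidence(J, J_list, weights):
    """Per-element variance and deviation around J, plus the mean deviation."""
    total_w = sum(weights)
    n_rows, n_cols = len(J), len(J[0])
    variance = [
        [
            sum(w * (J_k[i][j] - J[i][j]) ** 2 for J_k, w in zip(J_list, weights))
            / total_w
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]
    deviation = [[math.sqrt(v) for v in row] for row in variance]
    average = sum(sum(row) for row in deviation) / (n_rows * n_cols)
    return variance, deviation, average


# Hypothetical 1 x 2 example: the datasets disagree on the first element only.
J_opt = [[1.0, 0.0]]
per_dataset = [[[0.0, 0.0]], [[2.0, 0.0]]]
variance, deviation, average = confidence(J_opt, per_dataset, weights=[1.0, 1.0])
print(variance, deviation, average)
```

A small deviation for an element means the datasets agree on that edge, so it can be reported with higher confidence.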
Results [Figure: true network compared with networks inferred from 1, 2, and 3 datasets]
Simulated Data • Constructed a small simulated network with five genes and a time-dependent noise function. • Randomly chose 3 initial conditions. • Produced 3 datasets with 4, 4, and 3 time points, respectively.
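A rough sketch of how datasets of this shape could be generated. Everything specific here is an assumption for illustration: the connectivity matrix is a made-up five-gene network (not the one used in the paper), the dynamics are integrated with plain Euler steps, and the noise is a crude additive Gaussian term.

```python
import random


def simulate(J, x0, n_points, dt=0.1, noise_sd=0.0, rng=None):
    """Euler steps of dx/dt = J x + noise, returning n_points states."""
    rng = rng or random.Random(0)
    x = list(x0)
    series = [x]
    for _ in range(n_points - 1):
        n = len(x)
        dx = [sum(J[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + dt * (dx[i] + rng.gauss(0.0, noise_sd)) for i in range(n)]
        series.append(x)
    return series


# Hypothetical sparse five-gene network with self-degradation on the diagonal.
J_sim = [
    [-1.0, 0.0, 0.0, 0.5, 0.0],
    [0.8, -1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, -1.0, 0.0, 0.0],
    [0.0, 0.0, -0.7, -1.0, 0.0],
    [0.0, 0.0, 0.0, 0.9, -1.0],
]

rng = random.Random(42)
datasets = []
for n_points in (4, 4, 3):                    # time points per dataset
    x0 = [rng.uniform(0.0, 1.0) for _ in range(5)]  # random initial condition
    datasets.append(simulate(J_sim, x0, n_points, noise_sd=0.05, rng=rng))

print([len(d) for d in datasets])
```

Each dataset alone has fewer time points than genes, so no single dataset determines the 5 x 5 connectivity matrix; this is the under-determined setting the method targets.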
Simulated Data • Assessed network recovery ability (Yeung 2002 criterion) • Assessed accuracy of GNR
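The error formulas were lost from this slide. As a hedged sketch, recovery errors of this kind are often computed as below; the definitions are assumptions: E0 counts entries whose error exceeds a tolerance `delta` (a topology-level error), E1 is the summed absolute error, and E2 is the Euclidean error (both interaction-strength errors). The function name and the default tolerance are hypothetical.

```python
import math


def recovery_errors(J_true, J_inferred, delta=0.1):
    """Return (E0, E1, E2) between a true and an inferred connectivity matrix."""
    diffs = [
        abs(a - b)
        for row_t, row_i in zip(J_true, J_inferred)
        for a, b in zip(row_t, row_i)
    ]
    e0 = sum(1 for d in diffs if d > delta)   # number of badly recovered entries
    e1 = sum(diffs)                           # L1 error
    e2 = math.sqrt(sum(d * d for d in diffs)) # L2 error
    return e0, e1, e2


# Hypothetical example: one spurious edge of strength 0.5.
J_true = [[1.0, 0.0], [0.0, -1.0]]
J_inferred = [[1.0, 0.5], [0.0, -1.0]]
print(recovery_errors(J_true, J_inferred))
```

Reporting a count-based error alongside magnitude-based errors makes it possible to see when a sparser solution gets the topology right while losing some accuracy in interaction strengths.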
Simulated Data • No sparsity or noise factor (both parameters = 0) • Variable: number of datasets • [Figure panels: True Network; 1 dataset; 2 datasets; 3 datasets]
Simulated Data • Gaussian noise distribution • Variables: number of datasets, sparsity parameter • [Figure panels: True Network; 1 dataset (parameter = 0); 2 datasets (parameter = 0); 3 datasets (parameter = 0); 3 datasets (parameter = 0.3)]
Simulated Data • Adding datasets improves the accuracy of network reconstruction. • GNR must balance topology reconstruction accuracy against interaction strength accuracy. • The trade-off parameter controls the balance between E0 and E1 (or E2). • Adding datasets improves the confidence of network reconstruction.
Simulated Data • GNR is able to accurately infer the GN solution to a highly under-determined problem given datasets with few time points and differing initial conditions. • Network topology may still be correctly inferred in the presence of high noise by including a sparsity constraint at the expense of interaction strength accuracy. • Larger simulated network structures were tested with similarly effective results.
Experimental Data • Heat-Shock Response Data for Yeast • 10 transcription factors • 4 microarray datasets (Stanford Microarray Database) with 7, 5, 5, and 4 time points • Correctly inferred 4 edges with documented, known regulation, and 1 edge with documented potential regulation.
Experimental Data • Cell-cycle Data for Yeast • 140 differentially expressed genes • 4 datasets with differing experimental conditions • Constructed a sub-GN involving several genes with proven function within cell wall organization. (Figure: circles of the same color indicate the same biological function.)
Experimental Data • Stress Response Data for Arabidopsis • Root experiments: 226 genes; Shoot experiments: 246 genes • 9 datasets with 6+ time points for each of root and shoot (www.arabidopsis.org)
Limitations • Assumes the regulatory network remains stationary regardless of differing environmental conditions. • Requires high-resolution, high-quality time-course datasets. • Noise in gene expression data, intrinsic to microarray technologies, is a major source of error. • Hidden regulatory factors may lead to implicit description errors. • Inferred GN models predict, indiscriminately, both direct and indirect regulations due to hidden variables. • Model edges correspond to the net effect. • A predicted regulatory relationship does not inherently correspond to regulation by a transcription factor.
Conclusions • Created a novel method to derive GN substructure using multiple microarray datasets instead of aligning multiple inferred networks. • Model can capture regulatory mechanisms at the protein and metabolite levels which cannot be physically measured. • Capable of deriving a more global structure with dense connections, in addition to more local substructures with sparse connections, by modifying the trade-off parameter. • Model is used most effectively in tandem with other information sources. • FUTURE WORK: Extend GNR to identify conserved network patterns or motifs from datasets of different species.
The End Thank you for listening!