Reconstructing Biological Networks
Brian Haynes
Biological Networks
• Multiple layers
  • Metabolic
  • Protein-protein
  • Protein-DNA
• Other considerations
  • Environment
  • Tissue
[Figure panels: Metabolic, Protein-protein, Protein-DNA]
An analogy
• Reads: ATTACGTGTGC, TGCCGATT, GTGCTGCCA
• Read assembly → Genomic sequence: TGCCGATTACGTGTGCTGCCA
• Pathway inference → Pathway
Data Integration
• Genome
  • Promoter sequence
  • Peptide homology
• Transcriptome
  • Microarrays
  • RNA-seq
• Proteome
  • 2D gel / MALDI-MS
  • Tandem mass-spec
• Metabolome
  • LC / CE MS
[Figure axis: High throughput → Low throughput]
Model Considerations
• Interpretability
• Data requirements
  • Steady state
  • Time course
• Complexity
• Descriptive or predictive
Correlation-based models
• Measure expression over many conditions and genetic perturbations
• Find correlations between genes that persist across samples
• Filter indirect interactions
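Calculating Expression Correlation
• Pearson correlation: $r_{xy} = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$
• Mutual information: $I(X;Y) = \sum_{x,y} p(x,y)\,\log \dfrac{p(x,y)}{p(x)\,p(y)}$
[Figure: plot of y against x; low correlation, high mutual information]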
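The contrast on this slide (low correlation, high mutual information) can be illustrated with a small sketch. The data below (y = x² on five symmetric points) and the function names are illustrative assumptions, not taken from the slides. The mutual information estimate here only works for discrete values; real expression data would need binning or a kernel estimator first.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def mutual_information(xs, ys):
    """Mutual information in bits, from empirical frequencies of discrete values."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical toy data: y depends on x completely, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
r = pearson(xs, ys)              # 0.0: no linear relationship
mi = mutual_information(xs, ys)  # about 1.52 bits: y is fully determined by x
```

Because y is a deterministic function of x, the mutual information equals the entropy of y, while the symmetric shape cancels the covariance exactly.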
Filtering Indirect Interactions
• DPI (data processing inequality)
  • Used in the ARACNE algorithm
  • Filters weak alternative paths between nodes
  • In the strict application, the graph becomes a tree
[Figure: genes g1, g2, g3 connected by edges a, b, c]
(Margolin, et al, Bioinformatics, 2006)
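A minimal sketch of DPI-style pruning, assuming the usual reading of the rule: in every fully connected triple of genes, the edge with the lowest mutual information is treated as indirect and removed. The function name, the `tolerance` parameter, and the example scores are hypothetical; this is not the ARACNE implementation itself.

```python
from itertools import combinations

def dpi_filter(mi, tolerance=0.0):
    """Drop the weakest edge of every fully connected triple.

    mi maps frozenset({gene_a, gene_b}) -> mutual information.
    All triples are judged against the original scores, then the
    marked edges are removed together.
    """
    nodes = sorted({gene for edge in mi for gene in edge})
    to_remove = set()
    for a, b, c in combinations(nodes, 3):
        triple = [frozenset(pair) for pair in ((a, b), (a, c), (b, c))]
        if not all(edge in mi for edge in triple):
            continue  # DPI only applies when all three edges exist
        weakest = min(triple, key=lambda edge: mi[edge])
        others = [mi[edge] for edge in triple if edge != weakest]
        # Strict inequality: ties are left untouched.
        if mi[weakest] < min(others) * (1.0 - tolerance):
            to_remove.add(weakest)
    return {edge: score for edge, score in mi.items() if edge not in to_remove}

# Hypothetical scores for the three genes in the slide's figure.
scores = {
    frozenset({"g1", "g2"}): 0.8,
    frozenset({"g2", "g3"}): 0.7,
    frozenset({"g1", "g3"}): 0.2,
}
filtered = dpi_filter(scores)  # the g1-g3 edge is removed
```

With `tolerance` at 0 every closed triangle with a unique weakest edge is broken, which is why the strict application leaves a tree-like graph; a positive tolerance keeps edges whose scores are nearly tied.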
Filtering Indirect Interactions
• Background correction
  • CLR (context likelihood of relatedness)
  • Each interaction is scored as a Z-score
(Faith, et al, PLoS Biol, 2007)
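The Z-score formula did not survive on this slide. The sketch below assumes a commonly described form of CLR: each gene's scores are standardized against that gene's own background of scores, negative values are clipped to zero, and the two per-gene Z-scores are combined as the square root of the sum of squares. The use of the population standard deviation and the example matrix are assumptions for illustration.

```python
import math

def clr_scores(mi):
    """Background-corrected scores from a symmetric matrix of pairwise MI.

    mi is a list of lists; the diagonal is ignored.
    """
    n = len(mi)
    mean, std = [], []
    for i in range(n):
        row = [mi[i][j] for j in range(n) if j != i]
        m = sum(row) / len(row)
        var = sum((v - m) ** 2 for v in row) / len(row)  # population variance (assumption)
        mean.append(m)
        std.append(math.sqrt(var))

    def z(i, j):
        # Z-score of the pair (i, j) against gene i's background, clipped at 0.
        if std[i] == 0.0:
            return 0.0
        return max(0.0, (mi[i][j] - mean[i]) / std[i])

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                scores[i][j] = math.sqrt(z(i, j) ** 2 + z(j, i) ** 2)
    return scores

# Hypothetical MI matrix: genes 0 and 1 stand out against a flat background.
mi_matrix = [
    [0.0, 4.0, 1.0, 1.0],
    [4.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
clr = clr_scores(mi_matrix)
```

Only the pair (0, 1) scores above zero here, since every other entry sits at or below its gene's background mean.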
Filtering Indirect Interactions
• Background correction
  • KNN filtering (used in the Symmetric-N algorithm)
  • Filtered structures are scale free given a sufficiently large K
• Algorithm:
  • Rank correlations for each node
  • If an edge shared by two nodes is in both nodes' top K, keep the edge
[Figure: example with K = 2]
(Chen, et al, BMC Bioinformatics, 2008)
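The two algorithm steps above can be sketched as follows. Ranking by absolute correlation is an assumption (the slide only says "rank correlations"), and the function name and the four-gene example are hypothetical.

```python
def mutual_top_k(corr, k):
    """Keep an edge only if each endpoint ranks the other in its top k.

    corr maps node -> {neighbor: correlation} and is assumed symmetric.
    """
    top = {}
    for node, neighbors in corr.items():
        # Rank neighbors by absolute correlation, strongest first (assumption).
        ranked = sorted(neighbors, key=lambda m: abs(neighbors[m]), reverse=True)
        top[node] = set(ranked[:k])
    return {
        frozenset((a, b))
        for a in top
        for b in top[a]
        if a in top.get(b, ())
    }

# Hypothetical pairwise correlations among four genes.
pairs = {
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("a", "d"): 0.1,
    ("b", "c"): 0.2, ("b", "d"): 0.7, ("c", "d"): 0.3,
}
corr = {}
for (u, v), r in pairs.items():
    corr.setdefault(u, {})[v] = r
    corr.setdefault(v, {})[u] = r

kept = mutual_top_k(corr, k=2)  # keeps a-b, a-c, b-d, c-d; drops a-d and b-c
```

The symmetry requirement is what removes a-d and b-c: each of those edges is outside the top two of both of its endpoints.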
Assessing performance
• Evaluated in E. coli on 445 microarrays, with RegulonDB as the gold standard
• Precision: TP / (TP + FP)
• Recall: TP / (TP + FN)
(Faith, et al, PLoS Biol, 2007)
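As a small worked example of the two formulas, predicted edges can be compared against a gold-standard edge set. The edge lists below are made up for illustration and are not RegulonDB data.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a predicted edge set against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # predicted edges that are in the gold standard
    fp = len(predicted - gold)   # predicted edges that are not
    fn = len(gold - predicted)   # gold-standard edges that were missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    return precision, recall

# Hypothetical regulator-target edges.
predicted_edges = {("a", "b"), ("a", "c"), ("b", "d")}
gold_edges = {("a", "b"), ("b", "d"), ("c", "d"), ("d", "e")}
precision, recall = precision_recall(predicted_edges, gold_edges)
# TP = 2, FP = 1, FN = 2, so precision = 2/3 and recall = 1/2
```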
Correlation-based models
• Strengths
  • Fast to compute: under 1 minute (polynomial time)
  • Low complexity
• Weaknesses
  • Undirected structure (no causal information)
  • Descriptive, not predictive
  • Difficult to validate / interpret model
  • Can only test at the structural level
Probabilistic Graphical Models
• Graph
  • Vertices represent variables
  • Edges encode conditional dependence
• Given C, E is independent of its indirect ancestors A and B: Pr(E | C, A, B) = Pr(E | C)
• Joint distribution: Pr(A, B, C, D, E) = P(A) P(B) P(C | A, B) P(D) P(E | C, D)
[Figure: directed graph over nodes A, B, C, D, E with edges A→C, B→C, C→E, D→E]
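The factorization and the independence claim on this slide can be checked numerically. The sketch below builds the joint distribution from the five factors using made-up probabilities for binary variables (all numbers are hypothetical) and confirms that Pr(E | C, A, B) matches Pr(E | C).

```python
from itertools import product

# Hypothetical parameters: probability that each binary variable is True.
P_A, P_B, P_D = 0.3, 0.6, 0.2
P_C = {(True, True): 0.9, (True, False): 0.5, (False, True): 0.4, (False, False): 0.1}
P_E = {(True, True): 0.8, (True, False): 0.6, (False, True): 0.3, (False, False): 0.05}
NAMES = ("a", "b", "c", "d", "e")

def bern(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """Pr(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D) P(E|C,D)."""
    return (bern(P_A, a) * bern(P_B, b) * bern(P_C[(a, b)], c)
            * bern(P_D, d) * bern(P_E[(c, d)], e))

def prob(**fixed):
    """Marginal probability of a partial assignment, by brute-force summation."""
    total = 0.0
    for values in product([True, False], repeat=5):
        assignment = dict(zip(NAMES, values))
        if all(assignment[name] == v for name, v in fixed.items()):
            total += joint(**assignment)
    return total

def prob_e_given(**given):
    """Pr(E = True | given)."""
    return prob(e=True, **given) / prob(**given)

with_ancestors = prob_e_given(c=True, a=True, b=False)
without_ancestors = prob_e_given(c=True)
```

Both conditionals reduce to the sum over D of P(E | C, D) P(D), since D is independent of A, B and C in this graph.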
Bayesian Networks
• Directed graphical models
• Acyclic
• Conditional probability table: P(C | A, B)
[Figure: network over nodes A, B, C, D, E with the conditional probability table for C]
Dynamic Bayesian Networks
• Intuitively, an unrolling of the static BN through time
• Parameterization is time invariant
[Figure: nodes B, C, E unrolled across time slices T0 to T5]
Learning Bayesian Networks
• Given observations and possibly prior knowledge, learn the structure and parameters of the BN
• Goal: maximize the posterior P(M | D)
[Figure: network over nodes A, B, C, D, E, annotated with parameter learning and structure learning]
Structure Learning
• MCMC sampling
  • Propose a network structure
  • Find the optimal parameters for P(x | a,b,c)
  • Accept or reject according to the Metropolis-Hastings criterion
[Figure: P(x | a,b,c) with example values 0.9, 0.1, 0.4]
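The propose / score / accept loop above can be sketched generically. Everything here is a toy stand-in: the candidate edges are restricted to a fixed ordering so every structure stays acyclic, the proposal toggles one edge (a symmetric move), and `log_score` is a hypothetical placeholder for the log posterior obtained after fitting parameters, not a real Bayesian network score.

```python
import math
import random
from collections import Counter

def mh_accept(log_score_new, log_score_old, rng):
    """Metropolis-Hastings acceptance for a symmetric proposal."""
    if log_score_new >= log_score_old:
        return True
    return rng.random() < math.exp(log_score_new - log_score_old)

def sample_structures(initial, propose, log_score, steps, rng):
    """Run the chain and return every visited structure, including the start."""
    current, current_score = initial, log_score(initial)
    visited = [current]
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = log_score(candidate)
        if mh_accept(candidate_score, current_score, rng):
            current, current_score = candidate, candidate_score
        visited.append(current)
    return visited

# Toy setup (all hypothetical): edges respect the order A < B < C, so no cycles.
CANDIDATE_EDGES = [("A", "B"), ("A", "C"), ("B", "C")]
TARGET = frozenset({("A", "B"), ("B", "C")})

def propose(structure, rng):
    """Toggle one randomly chosen candidate edge."""
    return structure ^ {rng.choice(CANDIDATE_EDGES)}

def log_score(structure):
    """Stand-in score: penalize each edge that differs from TARGET."""
    return -5.0 * len(structure ^ TARGET)

rng = random.Random(0)
chain = sample_structures(frozenset(), propose, log_score, steps=2000, rng=rng)
most_visited = Counter(chain).most_common(1)[0][0]
```

In a real application the score would come from the data likelihood under the fitted conditional distributions plus a structure prior, and the proposal would need an explicit acyclicity check.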
Probabilistic Graphical Models
• Strengths
  • Inherently handle noise
  • Low complexity (under certain CDFs)
  • Can support time dependencies (DBN)
  • Dealing with hidden variables
• Weaknesses
  • With DBNs, time invariance doesn't support all time course data (i.e., data with unevenly spaced samples)
  • Computationally intensive to learn the optimal structure (NP-complete)
  • Continuous-time models are underdeveloped
Differential Equation Models
• Attempt to reconstruct the dynamical system that produced the gene expression data
• Reduce dimensionality of the data
• Approximate dynamics
  • Modeled using ordinary differential equations
  • Restrict model complexity
• Example system: The Inferelator
Dimensionality Reduction
• Regulators (genes and environment)
  • Limited to transcription factors
  • Factors with correlated profiles are merged
• Genes
  • Clustered based on putative coregulation
  • Used cMonkey to form biclusters across genes and conditions [Bonneau, 2006]
    • Correlated expression
    • Shared regulatory sequence motifs
(Bonneau, et al, Genome Biology, 2006)
Model Details
• Expression of y (a gene or bicluster mean) is influenced by the expression of N regulators: X = (x1, x2, …, xN)
(Bonneau, et al, Genome Biology, 2006)
Model Details
• Choice of squashing function
(Bonneau, et al, Genome Biology, 2006)
Model Details
• Choice of Z
(Bonneau, et al, Genome Biology, 2006)
Model Details
• Steady state
• Time course
(Bonneau, et al, Genome Biology, 2006)
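The equations for these slides did not survive. As a rough illustration only, the sketch below assumes a model of the general form τ·dy/dt = −y + g(β·Z), where g is a squashing function and Z is a vector derived from the regulators. The clipped-linear choice of g, the parameter values, and all function names are assumptions, not the author's definitions. Setting dy/dt = 0 gives the steady-state case; a forward-Euler update over the measured time points gives the time-course case and naturally accepts uneven sampling intervals.

```python
def squash(u, lo=0.0, hi=1.0):
    """Clipped-linear squashing function (an assumed choice of g)."""
    return max(lo, min(hi, u))

def drive(beta, z):
    """g(beta . z): squashed weighted sum of the regulator terms."""
    return squash(sum(b * zj for b, zj in zip(beta, z)))

def steady_state(beta, z):
    """With dy/dt = 0 the assumed model reduces to y = g(beta . z)."""
    return drive(beta, z)

def time_course(y0, beta, z, tau, times):
    """Forward-Euler integration of tau * dy/dt = -y + g(beta . z).

    `times` may be unevenly spaced; z is held constant for simplicity.
    Each step should satisfy dt <= tau to keep the update stable.
    """
    ys = [y0]
    for t_prev, t_next in zip(times, times[1:]):
        dt = t_next - t_prev
        y = ys[-1]
        ys.append(y + (dt / tau) * (drive(beta, z) - y))
    return ys

# Hypothetical weights, regulator levels, and uneven sampling times.
beta = [0.5, -0.2]
z = [1.0, 0.5]
target = steady_state(beta, z)  # 0.5 - 0.1 = 0.4
trajectory = time_course(0.0, beta, z, tau=2.0, times=[0.0, 0.5, 1.0, 2.0, 3.0, 5.0])
```

The trajectory rises monotonically toward the steady-state value, which is how the same parameters can be fitted to either kind of experiment.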
Model Learning with LASSO
• LASSO, a.k.a. L1 shrinkage
• Fit is optimized subject to (s.t.) an L1 constraint on the coefficients
(Bonneau, et al, Genome Biology, 2006)
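A minimal sketch of L1 shrinkage, using coordinate descent on the penalized objective (1/2n)·‖y − Xβ‖² + λ·‖β‖₁. This penalized form is a stand-in for the constrained ("subject to") form on the slide; the two are equivalent for matching values of the constraint bound and λ. The function names, λ, and the toy data are assumptions, and this is not the Inferelator's own fitting code.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam, returning exactly 0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso(X, y, lam, sweeps=200):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1.

    X is a list of rows; every column must contain a non-zero entry.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho, norm = 0.0, 0.0
            for i in range(n):
                # Prediction from all predictors except j.
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta

# Hypothetical data: y depends strongly on predictor 0 and barely on predictor 1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.05, -1.95, 1.95, -2.05]  # 2.0 * x0 + 0.05 * x1
coefficients = lasso(X, y, lam=0.1)  # approximately [1.9, 0.0]
```

The weak predictor is set exactly to zero while the strong one is shrunk by λ, which is the sparsity behavior that makes the fitted regulatory model small enough to interpret.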
Results
• Measured NRC-1 under 24 novel conditions and predicted the expression response
[Figure panels: Training Data; Test Data (24 novel conditions)]
(Bonneau, et al, Genome Biology, 2006)
Differential Equation Models
• Strengths
  • Predictive and hypothesis generating
  • Biological interpretation
  • Supports time course and steady state
  • Allows for uneven sampling of time course
• Weaknesses
  • Arbitrary choice of the regulatory functions G and Z
  • Computationally intensive
  • Non-linearity causes problems in numerical optimization
  • Handling hidden variables
Final Thoughts
• Unifying the probabilistic and dynamical systems approaches
• Handling genes by type
• Moving from descriptive to predictive
• Sorting hypotheses by information gain