Acknowledgements

Bioinformatics • Dealing with expression data • Kristel Van Steen, PhD, ScD (kristel.vansteen@ulg.ac.be) • Université de Liege - Institut Montefiore • 2008-2009

Acknowledgements • Material based on: • Slides from Patrik D’haeseleer, Shoudan Liang and Roland Somogyi (genetic network inference) • Slides from Steve Horvath and Jun Dong (co-expression networks) • Slides from Sargur Srihari (bagging and boosting)

Class Outline • Genetic networks • A primer to co-expression network analysis • Bagging and boosting (as promised …) • Consensus microarray data analysis • Theory • Application

Genetic networks

Outline • Introduction • A conceptual approach to complex network dynamics • Inference of regulation through clustering of gene expression data • Modeling methodologies • Gene network inference: reverse engineering

Genes encode proteins, some of which in turn regulate other genes • Goal: determine the structure of this intricate network of genetic regulatory interactions

Traditional approach: local • Examining and collecting data on a single gene, a single protein or a single reaction at a time • In contrast: functional genomics

Functional Genomics • Specifically, functional genomics refers to the development and application of global experimental approaches to assess gene function by making use of the information and reagents provided by structural genomics. • High-throughput, large-scale experimental methodologies combined with statistical and computational analysis of the results.

Functional Genomics (Cont.) • We need to define the mapping from sequence space to functional space.

Intermediate representation • Focus at the level of single cells • A biological system can be considered to be a state machine, where the change in internal state of the system depends on both its current internal state and any external inputs.

The goal • Observe the state of a cell and how it changes under different circumstances, and from this derive a model of how these state changes are generated • The state of a cell • All those variables determining its behavior

Example • A simple, 6-node regulatory network

Outline • Introduction • A conceptual approach to complex network dynamics • Inference of regulation through clustering of gene expression data • Modeling methodologies • Gene network inference: reverse engineering • Conclusions and Outlook

The global gene expression pattern is the result of the collective behavior of individual regulatory pathways • Gene function depends on its cellular context; thus understanding the network as a whole is essential.

Boolean Networks • Each gene is considered as a binary variable, either ON or OFF, regulated by other genes through logical or Boolean functions. • Even with this simplification, the network behavior is already extremely rich.

Boolean Networks (Cont.) • Cell differentiation corresponds to transitions from one global gene expression pattern to another.
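A minimal sketch of the Boolean-network idea, assuming a hypothetical 3-gene network whose update rules are invented here for illustration (they are not taken from the slides). Each global state is updated synchronously, and the stable patterns the network settles into (its attractors) play the role of distinct global expression patterns.

```python
from itertools import product

def step(state):
    """One synchronous update of a hypothetical 3-gene Boolean network."""
    a, b, c = state
    return (
        int(not b),        # A is ON unless B represses it
        int(not a),        # B is ON unless A represses it
        int(a and not b),  # C needs A present and B absent
    )

def find_attractor(state):
    """Iterate until a state repeats and return the states on the cycle."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen[seen.index(state):]

# Group all 2^3 starting states by the attractor they end up in.
basins = {}
for start in product([0, 1], repeat=3):
    cycle = frozenset(find_attractor(start))
    basins.setdefault(cycle, []).append(start)

for cycle, starts in basins.items():
    print(sorted(cycle), "is reached from", starts)
```

Even this toy network has two fixed points and one two-state cycle, which illustrates how rich the behavior can be under the binary simplification.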

Scoring methods • Whether there has been a significant change at any one condition • Whether there has been a significant aggregate change over all conditions • Whether the fluctuation pattern shows high diversity according to Shannon entropy
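The entropy-based score can be sketched as follows. This is an illustrative assumption about how the score might be computed: expression log-ratios are discretized into down/unchanged/up with a hypothetical threshold, and the Shannon entropy of the resulting symbols measures how diverse the fluctuation pattern is.

```python
from collections import Counter
from math import log2

def discretize(profile, threshold=1.0):
    """Map each value to -1 (down), 0 (unchanged) or +1 (up).

    The threshold is a hypothetical cut-off chosen for illustration.
    """
    return [-1 if v <= -threshold else 1 if v >= threshold else 0
            for v in profile]

def shannon_entropy(symbols):
    """Shannon entropy (in bits) of a list of discrete symbols."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())

flat = [0.1, -0.2, 0.0, 0.3]               # never crosses the threshold
varied = [1.5, -1.2, 0.1, 2.0, -1.8, 0.0]  # visits all three levels equally

print(shannon_entropy(discretize(flat)))
print(shannon_entropy(discretize(varied)))
```

A gene that never leaves the "unchanged" level scores zero, while one that visits all three levels equally often reaches the maximum of log2(3) bits.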

Guilt By Association • Select a gene • Determine its nearest neighbors in expression space within a certain user-defined distance cut-off
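The two steps above can be sketched as a simple neighbor search. The gene names, profiles and cut-off are made up for illustration, and Euclidean distance is assumed as the distance in expression space.

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def neighbors(profiles, query, cutoff):
    """Genes within `cutoff` of the query gene, nearest first."""
    ref = profiles[query]
    hits = [(euclidean(ref, prof), gene)
            for gene, prof in profiles.items() if gene != query]
    return [gene for dist, gene in sorted(hits) if dist <= cutoff]

# Hypothetical expression profiles over three samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [3.0, 1.0, 0.0],
    "geneD": [1.0, 2.5, 3.0],
}

print(neighbors(profiles, "geneA", cutoff=1.0))
```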

Clustering • extract groups of genes that are tightly co-expressed over a range of different experiments.

Caution • Different clustering methods can have very different results • It’s not yet clear which clustering methods are most useful for gene expression analysis.

Definition: Gene Expression Profile • An expression profile $e_j$ over an ordered list of N samples (k = 1 to N) for a particular gene j is a vector of scaled expression values $v_{jk}$ • The expression profile is: • $e_j = (v_{j1}, v_{j2}, v_{j3}, \ldots, v_{jN})$

Definition: Gene Expression Profile (Cont.) • A difference between two genes p and q may be estimated as an N-dimensional metric “distance” between $e_p$ and $e_q$. • Euclidean distance: $d_{pq} = \lVert e_p - e_q \rVert = \sqrt{\sum_{k=1}^{N} (v_{pk} - v_{qk})^2}$

Clustering algorithms • Non-hierarchical methods • Cluster N objects into K groups in an iterative process until certain goodness criteria are optimized • E.g. K-means
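A minimal sketch of the K-means iteration, assuming squared Euclidean distance as the goodness criterion and caller-supplied starting centroids (real analyses typically use random restarts). The data points are invented for illustration.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until nothing changes."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: sqdist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members)
                                for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```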

Clustering algorithms • Hierarchical methods • Return a hierarchy of nested clusters, where each cluster typically consists of the union of two or more smaller clusters. • Agglomerative methods • Start with single-object clusters and recursively merge them into larger clusters • Divisive methods • Start with the cluster containing all objects and recursively divide it into smaller clusters
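The agglomerative strategy can be sketched in a few lines. Average linkage and Euclidean distance are assumptions made for this example (the slides do not fix a linkage rule), and the one-dimensional data are invented.

```python
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are tuples of point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        dist, a, b = min(
            (average_linkage(a, b, points), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, dist))
    return merges

points = [[0.0], [0.1], [1.0], [5.0], [5.2]]
merges = agglomerate(points)
for a, b, dist in merges:
    print(a, "+", b, "at distance", round(dist, 3))
```

The merge history is the nested hierarchy: cutting it at a chosen distance yields a flat partition.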

Other applications of co-expression clusters • Extraction of regulatory motifs • Genes in the same expression cluster share biological functions • Inference of functional annotation • Functions of unknown genes may be hypothesized from genes with known function within the same cluster • As a molecular signature in distinguishing cell or tissue types • mRNA expression

Which clustering method to use? • There is no single best criterion for obtaining a partition because no precise and workable definition of ‘cluster’ exists. • Clusters can be of any arbitrary shapes and sizes in a multidimensional pattern space.

Challenge in cluster analysis • A gene could be a member of several clusters, each reflecting a particular aspect of its function and control • Solutions • clustering methods that partition genes into non-exclusive clusters • Several clustering methods could be used simultaneously

Level of biochemical detail • Abstract: Boolean networks • Concrete: full biochemical interaction models with stochastic kinetics, as in Arkin et al. (1998)

Forward and inverse modeling • Forward modeling approach • Inverse modeling, or reverse engineering • Given an amount of data, what can we deduce about the unknown underlying regulatory network? • Requires the use of a parametric model, the parameters of which are then fit to the real-world data.

Goal of network inference • Construct a coarse-scale model of the network of regulatory interactions between the genes • It’s possible to reverse engineer a network from its activity profiles
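As a toy illustration of reverse engineering from activity profiles, the sketch below generates state transitions from a hidden, hypothetical 3-gene Boolean network and then searches for the smallest set of input genes that explains each gene's next state. This consistency search is an assumption chosen for simplicity, not necessarily the inference method the lecture has in mind.

```python
from itertools import combinations, product

def true_step(state):
    """Hidden, hypothetical network used only to generate the data."""
    a, b, c = state
    return (int(not b), int(not a), int(a and not b))

def consistent(inputs, target, transitions):
    """True if the target's next value is a function of the given inputs."""
    table = {}
    for before, after in transitions:
        key = tuple(before[i] for i in inputs)
        if table.setdefault(key, after[target]) != after[target]:
            return False  # same input values led to different outputs
    return True

def infer_inputs(target, transitions, n_genes):
    """Smallest input set that is consistent with all observed transitions."""
    for size in range(n_genes + 1):
        for inputs in combinations(range(n_genes), size):
            if consistent(inputs, target, transitions):
                return inputs
    return None

# Observe the successor of every possible state (the ideal, data-rich case).
transitions = [(s, true_step(s)) for s in product([0, 1], repeat=3)]
wiring = {gene: infer_inputs(gene, transitions, 3) for gene in range(3)}
print(wiring)
```

With all eight transitions observed, the true wiring is recovered. With fewer observations there are fewer constraints, so the search can settle on input sets that are too small.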

Data requirements • To infer how a gene is regulated, we need to observe its expression under many different combinations of expression levels of its regulatory inputs • Use data from different sources • Deal with different data types

Estimates for network models • A sparse network model of N genes, where each gene is only affected by K other genes on average • That is, a sparsely connected, directed graph with N nodes and NK edges

Co-expression network analysis

Outline • Network and network concepts • Approximately factorizable networks • Gene Co-expression Network • Eigengene Factorizability, Eigengene Conformity • Eigengene-based network concepts • What can we learn from the geometric interpretation?

Network = Adjacency Matrix • A network can be represented by an adjacency matrix, A=[aij], that encodes whether/how a pair of nodes is connected. • A is a symmetric matrix with entries in [0,1] • For unweighted networks, entries are 1 or 0 depending on whether or not 2 nodes are adjacent (connected) • For weighted networks, the adjacency matrix reports the connection strength between node pairs • Our convention: diagonal elements of A are all 1.
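A sketch of how such an adjacency matrix might be built from expression profiles. The construction is an assumption for illustration: absolute Pearson correlation is either raised to a power `beta` (weighted network) or compared with a hard threshold `tau` (unweighted network). Both parameter names and the example profiles are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def adjacency(profiles, beta=1, tau=None):
    """Symmetric adjacency matrix with entries in [0, 1] and unit diagonal."""
    n = len(profiles)
    A = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Clamp to guard against rounding slightly above 1.
            s = min(1.0, abs(pearson(profiles[i], profiles[j])))
            if tau is None:
                a = s ** beta                  # weighted: connection strength
            else:
                a = 1.0 if s >= tau else 0.0   # unweighted: adjacent or not
            A[i][j] = A[j][i] = a
    return A

# Hypothetical profiles for four genes over four samples.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [1, 3, 2, 4]]
W = adjacency(profiles, beta=6)   # weighted network
U = adjacency(profiles, tau=0.9)  # unweighted network
```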

Motivational example I: Pair-wise relationships between genes across different mouse tissues and genders • Challenge: Develop simple descriptive measures that describe the patterns. • Solution: The following network concepts are useful: density, centralization, clustering coefficient, heterogeneity

Motivational example (continued) • Challenge: Find a simple measure for describing the relationship between gene significance and connectivity • Solution: network concept called hub gene significance

Background • Network concepts are also known as network statistics or network indices • Examples: connectivity (degree), clustering coefficient, topological overlap, etc. • Network concepts underlie network language and systems biological modeling. • Dozens of potentially useful network concepts are known from graph theory.

Review of some fundamental network concepts which are defined for all networks (not just co-expression networks)

Connectivity • Node connectivity = row sum of the adjacency matrix (excluding the diagonal) • For unweighted networks = number of direct neighbors • For weighted networks = sum of connection strengths to other nodes

Density • Density = mean off-diagonal adjacency • Closely related to mean connectivity: for n nodes, density = mean connectivity / (n - 1)
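Connectivity and density follow directly from the adjacency matrix. The sketch below assumes the convention stated earlier (diagonal entries equal to 1, excluded from the sums) and uses a small ring network as an invented example.

```python
def connectivity(A):
    """Row sums of the adjacency matrix, skipping the diagonal."""
    return [sum(a for j, a in enumerate(row) if j != i)
            for i, row in enumerate(A)]

def density(A):
    """Mean off-diagonal adjacency."""
    n = len(A)
    return sum(connectivity(A)) / (n * (n - 1))

# Unweighted ring of four nodes: 0-1-2-3-0.
ring = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
]

k = connectivity(ring)
print(k, density(ring))
```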

Centralization • = 1 if the network has a star topology • = 0 if all nodes have the same connectivity • Figure examples: centralization = 0 because all nodes have the same connectivity of 2; centralization = 1 because the network has a star topology
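The slide gives only the two boundary cases. One formula that reproduces both, assumed here, is centralization = n/(n-2) * (max(k)/(n-1) - density), which is equivalent to the degree centralization used in graph theory. The sketch checks it on a star and on a ring.

```python
def connectivity(A):
    """Row sums of the adjacency matrix, skipping the diagonal."""
    return [sum(a for j, a in enumerate(row) if j != i)
            for i, row in enumerate(A)]

def centralization(A):
    """Assumed formula: n/(n-2) * (max(k)/(n-1) - density)."""
    n = len(A)
    k = connectivity(A)
    dens = sum(k) / (n * (n - 1))
    return n / (n - 2) * (max(k) / (n - 1) - dens)

def star(n):
    """Node 0 is connected to every other node; no other edges."""
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(1, n):
        A[0][j] = A[j][0] = 1.0
    return A

def ring(n):
    """Each node is connected to its two neighbors on a cycle."""
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        A[i][(i + 1) % n] = A[(i + 1) % n][i] = 1.0
    return A

print(centralization(star(5)), centralization(ring(5)))
```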

Heterogeneity • Heterogeneity: coefficient of variation of the connectivity • Highly heterogeneous networks exhibit hubs
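Heterogeneity can be computed from the connectivity vector alone. The sketch assumes the population standard deviation in the coefficient of variation (the slide does not say which variance estimator is meant) and uses two hand-written connectivity vectors for a 5-node ring and a 5-node star.

```python
from statistics import mean, pstdev

def heterogeneity(k):
    """Coefficient of variation of the connectivity vector k."""
    return pstdev(k) / mean(k)

ring_k = [2, 2, 2, 2, 2]  # every node has two neighbors: no hubs
star_k = [4, 1, 1, 1, 1]  # one hub connected to four leaves

print(heterogeneity(ring_k), heterogeneity(star_k))
```

The ring scores 0, while the star, dominated by a single hub, scores 0.75.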

Clustering Coefficient • Measures the cliquishness of a particular node • « A node is cliquish if its neighbors know each other » • This generalizes directly to weighted networks (Zhang and Horvath 2005) • Figure examples: clustering coefficient of the white node = 0; clustering coefficient = 1
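A sketch of a clustering coefficient that works for both unweighted and weighted adjacency matrices. The formula below is the weighted form as I understand it from Zhang and Horvath (2005); returning 0 for nodes with fewer than two neighbors is a convention assumed here.

```python
def clustering_coefficient(A, i):
    """Weighted clustering coefficient of node i.

    Numerator: sum over ordered pairs (l, m) of a_il * a_lm * a_mi.
    Denominator: (sum_l a_il)^2 - sum_l a_il^2.
    """
    others = [j for j in range(len(A)) if j != i]
    num = sum(A[i][l] * A[l][m] * A[m][i]
              for l in others for m in others if m != l)
    s1 = sum(A[i][l] for l in others)
    s2 = sum(A[i][l] ** 2 for l in others)
    denom = s1 ** 2 - s2
    return num / denom if denom > 0 else 0.0  # assumed convention

# Star: the neighbors of the center (node 0) do not know each other.
star = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
# Triangle: every pair of neighbors is connected.
triangle = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
]

print(clustering_coefficient(star, 0), clustering_coefficient(triangle, 0))
```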

The topological overlap dissimilarity is used as input of hierarchical clustering • Generalized in Zhang and Horvath (2005) to the case of weighted networks • Generalized in Li and Horvath (2006) to multiple nodes • Generalized in Yip and Horvath (2007) to higher order interactions
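A sketch of the topological overlap measure and the dissimilarity derived from it. The formula is assumed to be the standard pairwise form, TOM_ij = (sum_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij) with u ranging over nodes other than i and j; the 4-node example network is invented.

```python
def topological_overlap(A):
    """Pairwise topological overlap matrix (diagonal set to 1)."""
    n = len(A)
    k = [sum(A[i][u] for u in range(n) if u != i) for i in range(n)]
    T = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Connection strength routed through shared neighbors.
            shared = sum(A[i][u] * A[u][j]
                         for u in range(n) if u not in (i, j))
            T[i][j] = T[j][i] = (
                (shared + A[i][j]) / (min(k[i], k[j]) + 1 - A[i][j])
            )
    return T

def tom_dissimilarity(A):
    """1 - TOM, suitable as a distance for hierarchical clustering."""
    return [[1 - t for t in row] for row in topological_overlap(A)]

# Edges: 0-1, 0-2, 1-2, 2-3.
A = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
]
T = topological_overlap(A)
D = tom_dissimilarity(A)
print(T[0][1], T[0][3])
```

Nodes 0 and 3 are not directly connected, yet they share neighbor 2, so their overlap is 0.5 rather than 0.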

Network Significance • Defined as average gene significance • We often refer to the network significance of a module network as module significance.

Hub Gene Significance = slope of the regression line of gene significance on connectivity (intercept = 0)
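Both significance measures can be sketched in a few lines. Scaling connectivity by its maximum before the regression is an assumption made here, and the gene significance values are invented so that the expected slope is easy to verify by hand.

```python
def network_significance(gs):
    """Average gene significance over the nodes of the network."""
    return sum(gs) / len(gs)

def hub_gene_significance(gs, k):
    """Least-squares slope through the origin of GS on scaled connectivity."""
    k_max = max(k)
    K = [v / k_max for v in k]  # assumed scaling: K = k / max(k)
    return sum(g * x for g, x in zip(gs, K)) / sum(x * x for x in K)

k = [4, 2, 1, 1]            # connectivities (hypothetical)
gs = [0.8, 0.4, 0.2, 0.2]   # gene significance, built as 0.8 * k / max(k)

print(network_significance(gs), hub_gene_significance(gs, k))
```

Because the invented significances are exactly proportional to scaled connectivity, the fitted slope equals the proportionality constant 0.8.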
