
ALGORITHMIC & ANALYTICAL METHODS FOR FUNCTIONAL CHARACTERIZATION OF MOLECULAR INTERACTION NETWORKS Mehmet Koyutürk PURDUE UNIVERSITY DEPARTMENT OF COMPUTER SCIENCE Joint work with Jayesh Pandey, Wojciech Szpankowski, and Ananth Grama

OUTLINE Biological motivation Gene regulation, molecular annotation, pathway annotation Formal framework Functional attribute networks: Multigraph model Algorithmic challenges Statistical interpretability, non-monotonicity Statistical model Conditioning on building blocks to emphasize modularity Resulting tool NARADA, algorithms, implementation, results

GENE REGULATION Gene expression is the process of synthesizing a functional protein coded by the corresponding gene Genes (& their products) regulate (promote / suppress) the extent of each other’s expression Any step of gene expression can be modulated Transcription, translation, post-transcriptional modification, RNA transport, mRNA degradation… [Figure: negative ligand-independent transcriptional regulation at chromatin level]

GENE REGULATORY NETWORKS Abstraction: organization of regulatory interactions in the cell Genes are nodes, regulatory interactions are directed edges Boolean network model: Edges are signed, indicating up-regulation (promotion) and down-regulation (suppression) [Figure: flowering time regulatory network in Arabidopsis; legend: gene, up-regulation, down-regulation]

MOLECULAR ANNOTATION Similar systems that involve different molecules (genes, proteins) in different species Functional annotation of genes provides a unified understanding of the underlying principles Gene Ontology: A library of molecular annotation Molecular function: What is the role of a gene? Biological process: In which processes is a gene involved? Cellular component: Where is a gene’s product localized? We refer to each annotation class as a functional attribute

FROM MOLECULES TO SYSTEMS Networks are species-specific Annotation is at the molecular level Map networks from gene space to function space Can generate a library of annotated (sub-) networks Network of Gene Ontology terms based on significance of pairwise interactions in S. cerevisiae Synthetic Gene Array (SGA) network (Tong et al., Science, 2004)

INDIRECT REGULATION Assessment of pairwise interactions is simple, but not adequate [Figure: two example gene networks over genes g1 to g6]

FUNCTIONAL ATTRIBUTE NETWORK Multigraph model A gene is associated with multiple functional attributes A functional attribute is associated with multiple genes Functional attributes are represented by nodes Genes are represented by ports, reflecting context [Figure: a gene network over genes g1 to g6 and the corresponding functional attribute network]
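The multigraph model can be illustrated with a minimal sketch. It assumes the gene network is given as a list of directed gene pairs and the annotation as a dictionary from genes to sets of functional attributes; the function name `build_attribute_multigraph`, the toy genes, and the attribute labels A, B, C are hypothetical and not taken from the slides. Each gene-level edge becomes one multi-edge per pair of attributes, and the gene pair is kept as the "ports" of that multi-edge.

```python
from collections import defaultdict


def build_attribute_multigraph(gene_edges, annotation):
    """Map gene-level edges to attribute-level multi-edges.

    gene_edges: iterable of (regulator, target) gene pairs.
    annotation: dict mapping a gene to the set of its functional attributes.
    Returns a dict mapping (attr_a, attr_b) to the list of gene pairs
    (ports) through which that attribute-level edge occurs.
    """
    multiedges = defaultdict(list)
    for u, v in gene_edges:
        for a in annotation.get(u, ()):
            for b in annotation.get(v, ()):
                multiedges[(a, b)].append((u, v))
    return dict(multiedges)


# Toy data (hypothetical): g4 carries two attributes, so its outgoing
# edge contributes to two different attribute-level multi-edges.
gene_edges = [("g1", "g3"), ("g2", "g3"), ("g3", "g5"), ("g4", "g6")]
annotation = {
    "g1": {"A"}, "g2": {"A"}, "g3": {"B"},
    "g4": {"A", "B"}, "g5": {"C"}, "g6": {"C"},
}
network = build_attribute_multigraph(gene_edges, annotation)
for edge, ports in sorted(network.items()):
    print(edge, ports)
```

Keeping the ports on every multi-edge is what lets a pathway of attributes be traced back to the specific genes that realize it.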

FREQUENCY OF A MULTIPATH A pathway of functional attributes occurs in various contexts in the gene network [Figure: a multipath in the functional attribute network and its occurrences in the gene network; the frequency of the multipath is the number of such occurrences]
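A minimal sketch of how the frequency of a multipath might be counted, under the assumption that an occurrence of the attribute sequence T1, T2, …, Tk is a directed walk g1 → g2 → … → gk in the gene network in which each gi is annotated with Ti. The function name `multipath_frequency` and the toy data are hypothetical. The count is accumulated one attribute at a time, so genes may repeat if the network contains cycles.

```python
def multipath_frequency(gene_edges, annotation, attributes):
    """Count gene-level walks whose genes carry the given attribute sequence."""
    successors = {}
    for u, v in gene_edges:
        successors.setdefault(u, []).append(v)

    # counts[g] = number of partial occurrences that end at gene g
    counts = {g: 1 for g, attrs in annotation.items() if attributes[0] in attrs}
    for attr in attributes[1:]:
        extended = {}
        for g, c in counts.items():
            for h in successors.get(g, ()):
                if attr in annotation.get(h, ()):
                    extended[h] = extended.get(h, 0) + c
        counts = extended
    return sum(counts.values())


# Toy data (hypothetical genes and attributes).
gene_edges = [("g1", "g3"), ("g2", "g3"), ("g3", "g5"), ("g4", "g6")]
annotation = {
    "g1": {"A"}, "g2": {"A"}, "g3": {"B"},
    "g4": {"A", "B"}, "g5": {"C"}, "g6": {"C"},
}
freq_abc = multipath_frequency(gene_edges, annotation, ["A", "B", "C"])
freq_bc = multipath_frequency(gene_edges, annotation, ["B", "C"])
print(freq_abc, freq_bc)
```

In this toy network the attribute pair B → C occurs through two different gene pairs, but only one of them (through g3) extends to an occurrence of A → B → C, which is the kind of context dependence the ports are meant to capture.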

SIGNIFICANCE OF A PATHWAY We want to identify multipaths with unusual frequency These might correspond to modular pathways Frequency alone is not a good measure of statistical significance The distribution of functional attributes among genes is not uniform The degree distribution in the gene network is highly skewed Pathways that contain common functional attributes have high frequency, but they are not necessarily interesting

STATISTICAL INTERPRETABILITY We are interested in identifying statistically over-represented patterns Null hypothesis: the pattern is sparse Additional positive observation => more significance Additional negative observation => less significance [Figure: patterns A, B, and B’, with P(B’) > P(A) and P(B) < P(A)]

MONOTONICITY Frequency is a monotonic measure If a pathway is frequent, then all of its sub-paths are frequent Algorithmic advantage: enumerate all frequent patterns in a bottom-up fashion Commonly exploited in traditional data mining applications Statistically interpretable measures are not monotonic! Statistical significance fluctuates in the search space Existing data mining algorithms do not apply Significance of pathways is non-monotonic in two dimensions: GO hierarchy & path space

GO HIERARCHY Functional attributes are organized in a hierarchical manner “regulation of steroid biosynthetic process” is a “regulation of steroid metabolic process” and is part of “steroid biosynthetic process” Statistically interpretable measures are not monotonic with respect to GO hierarchy A pattern corresponding to child may be more significant or less significant than that corresponding to its parent Common example: Identification of significantly enriched GO terms in a set of genes (Ontologizer, VAMPIRE)
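To make the hierarchy concrete, here is a small sketch of how annotations are commonly propagated up the GO DAG (a gene annotated with a term is also counted under every ancestor of that term). It uses the three terms named above; the gene labels and the function name `propagate_annotations` are hypothetical, and both "is a" and "part of" links are treated as parent links, which is an assumption of this sketch.

```python
def propagate_annotations(direct, parents):
    """Propagate gene annotations from each term to all of its ancestors.

    direct:  dict mapping a term to the set of genes annotated directly.
    parents: dict mapping a term to the list of its parent terms.
    """
    def ancestors(term, seen):
        for p in parents.get(term, ()):
            if p not in seen:
                seen.add(p)
                ancestors(p, seen)
        return seen

    full = {term: set(genes) for term, genes in direct.items()}
    for term, genes in direct.items():
        for anc in ancestors(term, set()):
            full.setdefault(anc, set()).update(genes)
    return full


child = "regulation of steroid biosynthetic process"
parents = {
    child: [
        "regulation of steroid metabolic process",  # "is a" link
        "steroid biosynthetic process",             # "part of" link
    ]
}
# Hypothetical direct annotations.
direct = {
    child: {"g1", "g2"},
    "regulation of steroid metabolic process": {"g3"},
    "steroid biosynthetic process": {"g4"},
}
full = propagate_annotations(direct, parents)
for term, genes in full.items():
    print(term, sorted(genes))
```

Because a parent term always covers at least the genes of its children, patterns defined on a parent are at least as frequent as those on a child, while their significance can move in either direction.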

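MONOTONICITY W.R.T. GO [Figure: a GO DAG whose terms are annotated with the gene sets {g1, g2, g3}, {g1, g2, g4}, and {g1, g2}, next to a gene network over g1 to g5; the p-values of the three depicted patterns satisfy P( ) < P( ) < P( )]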

PATHWAY LENGTH Open problems How can we effectively search in the pathway space, where significance fluctuates? How can we find optimal resolution in functional attribute space? [Figure: pathways of different lengths whose p-values compare in both directions, P( ) > P( ) and P( ) < P( )]

STATISTICAL MODEL: INSIGHT Emphasize modularity of pathways Condition on frequency of building blocks Evaluate the significance of the coupling of building blocks [Figure: a gene network over g1 to g7 with the frequencies φ of example pathways and their building blocks (the values 4, 2, and 5 appear), concluding P( ) < P( )]

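STATISTICAL MODEL: FORMULATION We denote each frequency random variable by $\Phi$ and its realization by $\phi$ Significance of pathway $\pi_{123}$: $p_{123} = P(\Phi_{123} \geq \phi_{123} \mid \Phi_{12} = \phi_{12}, \Phi_{23} = \phi_{23}, \Phi_1 = \phi_1, \Phi_2 = \phi_2, \Phi_3 = \phi_3)$ [Figure: pathway $\pi_{123}$ with the frequencies $\Phi_1$, $\Phi_2$, $\Phi_3$ of its attributes, $\Phi_{12}$, $\Phi_{23}$ of its building blocks, and $\Phi_{123}$ of the whole pathway]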

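SIGNIFICANCE OF A PATHWAY Assume that regulatory interactions are independent There are $\phi_{12}\phi_{23}$ possible pairs of $\pi_{12}$ and $\pi_{23}$ edges The probability that a pair of $\pi_{12}$ and $\pi_{23}$ edges goes through the same gene (which corresponds to an occurrence of $\pi_{123}$) is $1/\phi_2$ The probability that at least $\phi_{123}$ of these pairs go through the same gene can be bounded by $p_{123} \leq \exp(\phi_{12}\phi_{23} H_q(t))$, where $q = 1/\phi_2$ and $t = \phi_{123}/(\phi_{12}\phi_{23})$ $H_q(t) = t \log(q/t) + (1-t) \log((1-q)/(1-t))$ is the divergence term (the negative of the Kullback-Leibler divergence between Bernoulli distributions with parameters $t$ and $q$)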
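The bound above is straightforward to evaluate numerically. The following is a minimal sketch, not the NARADA implementation; the function name `pathway_pvalue_bound` and the example frequencies are hypothetical. It assumes natural logarithms and returns the trivial bound 1 whenever the observed fraction t does not exceed q, since the exponential bound is only informative for t > q.

```python
import math


def pathway_pvalue_bound(phi12, phi23, phi2, phi123):
    """Upper bound exp(phi12 * phi23 * H_q(t)) on the p-value of pi_123.

    q = 1 / phi2 is the chance that a (pi_12, pi_23) edge pair shares a gene,
    t = phi123 / (phi12 * phi23) is the observed fraction of such pairs.
    """
    n = phi12 * phi23
    if n == 0 or phi2 <= 1:
        return 1.0  # no pairs, or every pair trivially shares the single gene
    q = 1.0 / phi2
    t = phi123 / n
    if t <= q:
        return 1.0  # at or below expectation: only the trivial bound applies
    if t >= 1.0:
        return q ** n  # limit of the bound as t approaches 1
    h = t * math.log(q / t) + (1.0 - t) * math.log((1.0 - q) / (1.0 - t))
    return math.exp(n * h)


# Hypothetical frequencies: 4 and 5 occurrences of the building blocks,
# 10 genes carrying the middle attribute, 6 occurrences of the full pathway.
p = pathway_pvalue_bound(phi12=4, phi23=5, phi2=10, phi123=6)
print(f"bound on p_123: {p:.4f}")
```

Because H_q(t) is negative for t > q and decreases as t grows, each additional occurrence of the full pathway tightens the bound, in line with the interpretability requirement stated earlier in the talk.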

BASELINE MODEL A single regulatory interaction is the shortest pathway Arbitrary degree distribution: The number of edges leaving and entering each functional attribute is specified Edges are assumed to be independent The frequency of a regulatory interaction is a hypergeometric random variable Can derive a similar bound for the p-value of a single regulatory interaction

ALGORITHMIC ISSUES Significance is not monotonic Need to enumerate all pathways? Strongly significant pathways A pathway is strongly significant if all of its building blocks and their coupling are significant (defined recursively) Allows pruning the search space effectively Shortcutting common functional attributes Transcription factors, DNA binding genes, etc. are responsible for mediating regulation Shortcut these terms, consider regulatory effect of different processes on each other directly
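The recursive definition of strong significance suggests a bottom-up search, sketched below under stated assumptions. The building blocks of a pathway with k attributes are taken to be its prefix and suffix with k-1 attributes, the set of significant single interactions is assumed to be given, and the coupling test is passed in as a callback. The names `strongly_significant_pathways` and `coupling_is_significant`, and the toy predicate, are hypothetical.

```python
def strongly_significant_pathways(pairs, coupling_is_significant, max_len):
    """Enumerate strongly significant pathways with up to max_len attributes.

    pairs: iterable of (a, b) attribute pairs whose single regulatory
           interaction is already deemed significant.
    coupling_is_significant: callable taking a candidate pathway (a tuple
           of attributes) and returning True if the coupling of its two
           building blocks is significant.
    """
    level = {tuple(p) for p in pairs}
    found = set(level)
    for _ in range(max_len - 2):
        extended = set()
        for prefix in level:
            for suffix in level:
                # Both building blocks must already be strongly significant
                # and must overlap in all but their outermost attributes.
                if prefix[1:] == suffix[:-1]:
                    candidate = prefix + suffix[-1:]
                    if coupling_is_significant(candidate):
                        extended.add(candidate)
        if not extended:
            break
        level = extended
        found |= extended
    return found


# Toy example: the coupling B -> C -> D is declared insignificant, so
# A -> B -> C -> D is never even tested.
pairs = [("A", "B"), ("B", "C"), ("C", "D")]
result = strongly_significant_pathways(
    pairs, lambda path: path != ("B", "C", "D"), max_len=4
)
print(sorted(result))
```

The pruning comes from the join step: a candidate is generated only when both of its building blocks survived the previous level, so the search never needs to enumerate all pathways even though significance itself is not monotonic.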

NARADA http://www.cs.purdue.edu/homes/jpandey/narada/ A software tool for identification of significant pathways Queries Given functional attribute T, find all significant pathways that originate at T Given functional attribute T, find all significant pathways that terminate at T Given a sequence of functional attributes T1, T2, …, Tk, find all occurrences of the corresponding pathway Identified pathways are displayed as a tree User can explore back and forth between the gene network and the functional attribute network

RESULTS E. coli transcription network obtained from RegulonDB 3159 regulatory interactions between 1364 genes Using Gene Ontology, 881 of these genes are mapped to 318 processes

MOLYBDATE ION TRANSPORT Significant regulatory pathways that originate at molybdate ion transport Their occurrences in the gene network

WHAT IS SIGNIFICANT? Molybdate ion transport regulates various processes directly Mo-molybdopterin cofactor biosynthesis, oligopeptide transport, cytochrome complex assembly It regulates various other processes indirectly Through DNA-dependent regulation of transcription, two-component signal transduction system, nitrate assimilation Regulation of these mediator processes is not significant on its own! NARADA captures modularity of indirect regulation!

CONCLUSION Mapping gene regulatory networks to functional attribute space demonstrates great potential Abstract, unified understanding of regulatory systems Algorithmically, a wide range of new challenges Bounding interpretable statistical measures Handling resolution in functional attribute space Generalizing the definition of a pathway Discovering new information Projecting identified “canonical” patterns on other networks to discover new regulatory relationships

ACKNOWLEDGMENTS Shankar Subramaniam Wojciech Szpankowski Yohan Kim Jayesh Pandey Ananth Grama
