Computational detection of cis-regulatory modules

Stein Aerts, Peter Van Loo, Gert Thijs, Yves Moreau and Bart De Moor, Katholieke Universiteit Leuven, Belgium. Presented by Chulyun Kim

Contents • Introduction • Methods • Methodology overview • Score functions • ModuleSearch algorithm • Results • Conclusions

Motivation • The transcriptional regulation of a metazoan gene depends on the cooperative action of multiple transcription factors • These factors bind to cis-regulatory modules (CRMs) located in the neighborhood of the gene • By integrating multiple signals, CRMs confer an organism-specific spatial and temporal rate of transcription

Related Works • Yuh et al., 1998: Working with combinations of factors makes it possible to integrate multiple inputs, and this further provides cross-coupling of signal transduction and gene regulatory pathways • Bray et al., 2003: AVID, an alignment algorithm designed to identify functional noncoding segments • Aerts et al., 2003: delineation of putative regions containing CRMs in large intergenic sequences • Thijs et al., 2002: detecting DNA motifs by their statistical over-representation in a set of sequences • Aerts et al., 2003: detecting over-represented hits of known TFBSs • Recently, exploiting colocalization to find true binding sites in a particular gene has yielded valuable hypotheses regarding transcriptional regulation

Problem • To find the best combination of transcription factor binding sites (TFBSs) that occurs several times across multiple coregulated human genes, specifically within regions that are syntenic with the respective mouse orthologous genes

Methodology Overview

Data • Human-mouse orthologous pairs • 10 kb of sequence upstream of the coding sequence of the human and mouse gene from Ensembl release 9 • 18,778 pairs with successful selection

Alignment and Parsing • Alignment • Each 10 kb pair was aligned with AVID • Parsing • The alignment output was parsed using VISTA • Select regions with at least 75% identity in windows of 100 bp • 33,282 regions in total • Syntenic FASTA database
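
The window-based selection can be illustrated with a small sketch. It assumes the pairwise alignment is available as two gapped strings of equal length, counts a gap column as a mismatch, and merges overlapping qualifying windows; the function name `conserved_windows` and these conventions are illustrative assumptions, not the actual VISTA procedure.

```python
def conserved_windows(human, mouse, window=100, min_identity=0.75):
    """Return merged (start, end) alignment-column intervals, end exclusive,
    covered by at least one window whose identity meets the threshold."""
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    n = len(human)
    if n < window:
        return []
    # match[i] is 1 when column i holds the same base in both rows;
    # a gap ('-') never counts as a match (assumption).
    match = [1 if a == b and a != "-" else 0
             for a, b in zip(human.upper(), mouse.upper())]
    regions = []
    count = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # slide the window one column to the right
            count += match[start + window - 1] - match[start - 1]
        if count / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1][1] = end        # extend the previous region
            else:
                regions.append([start, end])
    return [tuple(r) for r in regions]
```

Coordinates here are alignment columns; mapping them back to human sequence positions would require skipping gap columns.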

Background Model and MotifScanner • Background Model • A 3rd-order Markov model is calculated from the Syntenic FASTA database • Used for scoring and for generating artificial datasets • MotifScanner • All syntenic regions are scanned to predict transcription factor binding sites (TFBSs) • TRANSFAC: frequency matrices, for example:
PO   A   C   G   T
01  12   4   3   1   A
02   3   2  11   4   G
03  11   2   4   3   A
…
• All occurrences are stored in GFF format in the Syntenic GFF database • GFF (Gene-Finding Format or General Feature Format): a protocol for the transfer of feature information • Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame>
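
As a rough illustration of the background model, the sketch below estimates a third-order Markov model from a list of sequences by counting each base after its three preceding bases. The pseudocount, the uniform fallback for unseen contexts, and the names `train_markov` and `transition_prob` are assumptions made for this example; the estimation details used by the authors are not given on the slide.

```python
from collections import defaultdict

def train_markov(seqs, order=3, pseudocount=1.0, alphabet="ACGT"):
    """Estimate P(base | previous `order` bases) from the given sequences."""
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0.0))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            ctx, base = seq[i - order:i], seq[i]
            # skip positions containing ambiguous symbols such as N
            if base in alphabet and all(c in alphabet for c in ctx):
                counts[ctx][base] += 1
    model = {}
    for ctx, row in counts.items():
        total = sum(row.values()) + pseudocount * len(alphabet)
        model[ctx] = {b: (row[b] + pseudocount) / total for b in alphabet}
    return model

def transition_prob(model, ctx, base, alphabet="ACGT"):
    """Look up P(base | ctx); unseen contexts fall back to uniform (assumption)."""
    return model.get(ctx, {}).get(base, 1.0 / len(alphabet))
```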

Coregulated Genes • Sets of coexpressed genes • From SOURCE database for cyclin B2 • Dataset of gene expression during the cell cycle in a human cancer cell line • 44 genes might share a common cis-regulatory element • Of these, 34 had an Ensembl identifier • Among them, 13 genes have at least one syntenic region with the respective mouse gene • 32 regions in total

Scoring single TFBSs • Combining a position-specific frequency matrix Θ (PSFM) and a higher-order background model Bm • How likely it is that the segment is generated by the motif model with respect to the background • x is a segment [b1, b2, … , bw] • bj is the nucleotide found at position j in x • Θ(bj, j) is the probability of finding bj at position j according to the PSFM • P(bj | s, Bm) is the probability of finding bj in the sequence according to the background model
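
A minimal sketch of such a score, assuming it takes the usual log-likelihood-ratio form, i.e. the sum over j of log(Θ(bj, j) / P(bj | s, Bm)). The function name `tfbs_score`, the PSFM layout (one dict of base probabilities per position) and the `bg_prob(context, base)` callable are hypothetical, and the PSFM is assumed to contain no zero entries (for example after adding pseudocounts).

```python
import math

def tfbs_score(seq, start, psfm, bg_prob, order=3):
    """Log-likelihood ratio of the segment seq[start:start+w] under the motif
    model versus the background model (assumed form of the score)."""
    score = 0.0
    for j in range(len(psfm)):
        pos = start + j
        base = seq[pos]
        # the background conditions on up to `order` preceding bases
        ctx = seq[max(0, pos - order):pos]
        score += math.log(psfm[j][base] / bg_prob(ctx, base))
    return score
```

A positive score means the segment is better explained by the motif model than by the background; a negative score means the opposite.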

Matrix similarity • Redundancy of motif models • There can be multiple matrices describing the same TF • There can be distinct TFs with similar PSFMs • Kullback-Leibler distance between two motif models • Θ1(j,b) is the probability of finding base b at position j in Motif 1 • w is the length of the motif • A is the set of all possible alignments for an allowed shift • The motif models can be grouped into classes depending on a threshold on this average distance
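
One plausible reading of this distance is sketched below: a symmetrised Kullback-Leibler divergence per column, averaged over the overlapping columns, and minimised over the allowed shifts. The symmetrisation, the averaging over the overlap only, and the name `motif_distance` are assumptions; the exact formula may differ. Columns are dicts of strictly positive base probabilities.

```python
import math

def kl_column(p, q):
    """Kullback-Leibler divergence between two columns of base probabilities."""
    return sum(p[b] * math.log(p[b] / q[b]) for b in "ACGT")

def motif_distance(m1, m2, max_shift=2):
    """Smallest average symmetrised KL distance over all allowed shifts."""
    best = math.inf
    for shift in range(-max_shift, max_shift + 1):
        # pair column j of m1 with column j + shift of m2 where both exist
        pairs = [(m1[j], m2[j + shift]) for j in range(len(m1))
                 if 0 <= j + shift < len(m2)]
        if not pairs:
            continue
        d = sum(0.5 * (kl_column(p, q) + kl_column(q, p))
                for p, q in pairs) / len(pairs)
        best = min(best, d)
    return best
```

Two matrices would then be placed in the same class when `motif_distance` falls below a chosen threshold. Keeping `max_shift` small avoids alignments that overlap in only one or two columns.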

Module Score Function • A binding site is to a motif model (a frequency matrix) as a CRM is to a CRM model • CRMs: clusters of actual binding sites on a sequence • CRM models: sets of motif models • The score of a CRM model m on a set of sequences s=(s1,…,sn)

The score of a CRM model m on a sequence s • m is a collection of motif models Θ1, …, Θl • t is a set of matching binding sites • For each motif model Θi, an index counts over the occurring TFBSs of Θi in sequence s • If the number of occurrences is q, this index can take any value in 0, … , q • Each element of t is the kth instance of Θi on sequence s, scored with the single-TFBS score • b(t) is a boolean function expressing whether the given combination of TFBSs is valid • Checks for overlap between different TFBSs • Checks that the sites lie within the specified window length (distance constraint) • p(t) is the penalization function of CRMs • The number of occurring sites divided by the number of motif models l • The score does not take the motif order into account
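
The ingredients above can be combined in a brute-force sketch. It assumes that each motif model contributes at most one site, that overlapping sites make a combination invalid, that the penalization multiplies the summed site scores, and that the sequence score is the best valid combination. These combination rules and the name `module_score` are assumptions for illustration only.

```python
from itertools import product

def module_score(hits, window=100):
    """hits: one list per motif model of (start, end, score) sites found on a
    single sequence, with end exclusive. Returns the best penalized score."""
    n_models = len(hits)
    best = 0.0
    # each motif model contributes one of its sites, or None when absent
    for combo in product(*[[None] + list(h) for h in hits]):
        sites = sorted(s for s in combo if s is not None)
        if not sites:
            continue
        overlap = any(a[1] > b[0] for a, b in zip(sites, sites[1:]))
        span = max(s[1] for s in sites) - sites[0][0]
        if overlap or span > window:
            continue                      # b(t) is false
        penalty = len(sites) / n_models   # p(t)
        best = max(best, penalty * sum(s[2] for s in sites))
    return best
```

The enumeration is exponential in the number of motif models, which is acceptable for a handful of models but is one reason a smarter search over CRM models is needed.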

ModuleSearch • Since the order of sites is not considered, CRM models can be sorted in alphabetical order • nΘ, the number of sites a module should contain, is given • Search for the best CRM model on a set of coregulated genes • Typical Best-First / Branch-and-bound search • Starting from the empty model, expand incomplete models by adding a motif model from a different class until there are no incomplete models whose overestimate heuristic score is greater than the score of the current best complete model • The model having the best heuristic score is expanded first
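
The search loop can be sketched generically as follows. Motif models are represented by integer indices kept in increasing order (so each combination is generated once), `classes[i]` gives the class of motif i, and the caller supplies `score` for complete models and an admissible `upper_bound` for incomplete ones. All of these names are hypothetical, and the actual scoring and heuristic are replaced here by arbitrary callables.

```python
import heapq

def best_first_search(n_motifs, classes, n_theta, score, upper_bound):
    """Best-first branch-and-bound over CRM models of exactly n_theta motifs."""
    best_model, best_score = None, float("-inf")
    heap = [(-upper_bound(()), ())]       # max-heap via negated keys
    while heap:
        neg_key, model = heapq.heappop(heap)
        if -neg_key <= best_score:
            break                         # nothing left can beat the incumbent
        if len(model) == n_theta:
            best_model, best_score = model, -neg_key   # key is the true score
            continue
        start = model[-1] + 1 if model else 0
        used = {classes[i] for i in model}
        for i in range(start, n_motifs):
            if classes[i] in used:
                continue                  # motifs must come from different classes
            child = model + (i,)
            key = score(child) if len(child) == n_theta else upper_bound(child)
            if key > best_score:
                heapq.heappush(heap, (-key, child))
    return best_model, best_score

# Toy usage with additive scores (illustrative values only).
values = [3.0, 5.0, 4.0, 1.0]
toy_classes = [0, 0, 1, 2]

def toy_score(model):
    return sum(values[i] for i in model)

def toy_bound(model, n_theta=2):
    # optimistic: add the largest remaining values, ignoring class constraints
    start = model[-1] + 1 if model else 0
    rest = sorted(values[start:], reverse=True)[:n_theta - len(model)]
    return toy_score(model) + sum(rest)
```

Because `upper_bound` never underestimates, the first time the best open model cannot exceed the incumbent, the incumbent is optimal and the loop stops.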

Heuristic Score • The score function of m without penalization • An overestimate heuristic value of the rise in score from CRM model m to the best child CRM model • [Θi] is a CRM model containing one matrix Θi • t = ( )  (Θl+1, …, Θe) • A boolean function expressing whether the motif models added to m all belong to different classes

Semi-Artificial Sequences • Artificial sequences were generated by sampling symbols from the background model
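
Sampling from the background model can be sketched as below, assuming the model is a dict that maps each three-base context to a dict of base probabilities. The uniform draw for the first bases, the uniform fallback for unseen contexts, and the name `sample_sequence` are assumptions for this example; `order` must be at least 1.

```python
import random

def sample_sequence(model, length, order=3, alphabet="ACGT", rng=None):
    """Generate one artificial sequence from a Markov background model."""
    rng = rng or random.Random()
    # the first `order` bases have no full context: draw them uniformly
    seq = [rng.choice(alphabet) for _ in range(min(order, length))]
    while len(seq) < length:
        ctx = "".join(seq[-order:])
        probs = model.get(ctx)
        # unseen context: weights=None makes the draw uniform
        weights = [probs[b] for b in alphabet] if probs else None
        seq.append(rng.choices(alphabet, weights=weights)[0])
    return "".join(seq)
```

Passing a seeded `random.Random` instance as `rng` makes the generated dataset reproducible.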

Detecting Modules in Microarray Clusters • Selected gene cluster around cyclin B2 • The best module model in the cluster selected by ModuleSearcher • window=100 bp and nΘ=4 • [NFY, STAF, TCF4, CEBPA]

Conclusions • Scoring functions for modules in syntenic regions and an algorithm to find the best-scoring module were proposed • The authors tested the proposed algorithm on artificial data and showed that it could find the hidden modules with high sensitivity • They predicted a module in a set of coexpressed genes and validated the prediction using the same approach
