Computational detection of cis-regulatory modules Stein Aerts, Peter Van Loo, Ger Thijs, Yves Moreau and Bart De Moor Katholieke Universiteit Leuven, Belgium Slides by Chulyun Kim Presented by Saurabh Sinha
Contents • Introduction • Methods • Methodology overview • Score functions • ModuleSearch algorithm • Results • Conclusions
Motivation • The transcriptional regulation of a metazoan gene depends on the cooperative action of multiple transcription factors • These factors bind to cis-regulatory modules (CRMs) located in the neighborhood of the gene • By integrating multiple signals, CRMs confer an organism-specific spatial and temporal rate of transcription
Related Work • Yuh et al., 1998: Working with combinations of factors makes it possible to integrate multiple inputs, and this further provides cross-coupling of signal transduction and gene regulatory pathways • Bray et al., 2003: AVID, an alignment algorithm designed to identify functional noncoding segments • Aerts et al., 2003: delineation of putative regions containing CRMs in large intergenic sequences • Thijs et al., 2002: detecting DNA motifs by their statistical over-representation in a set of sequences • Aerts et al., 2003: detecting over-represented hits of known TFBSs • Recently, exploiting colocalization to find true binding sites in a particular gene has yielded valuable hypotheses regarding transcriptional regulation
Problem • To find the best combination of transcription factor binding sites (TFBSs) that occurs several times across multiple coregulated human genes • Specifically, within regions that are syntenic with the respective mouse orthologous genes
Data • Human-mouse orthologous pairs • 10 kb of sequence upstream of the coding sequence of the human and mouse gene, from Ensembl release 9 • 18,778 pairs were successfully selected
Alignment and Parsing • Alignment • Each 10 kb pair was aligned with AVID • Parsing • The alignment output was parsed using VISTA • Regions with at least 75% identity in windows of 100 bp were selected • 33,282 regions in total • Stored as the syntenic FASTA database
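The windowed identity criterion on this slide can be illustrated with a small sketch. This is not the VISTA implementation: the function name `conserved_windows`, the treatment of gap columns as mismatches, and the merging of overlapping windows into one region are all assumptions made for illustration.

```python
def conserved_windows(aln_a, aln_b, window=100, min_identity=0.75):
    """Return merged (start, end) alignment-column ranges (end exclusive)
    covered by at least one window with identity >= min_identity."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aln_a)
    # 1 for an identical non-gap column; gap columns count as mismatches (assumption).
    match = [int(a == b and a != "-") for a, b in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    if n < window:
        return regions
    count = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            count += match[start + window - 1] - match[start - 1]
        if count >= min_identity * window:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions
```

Because a region is the union of qualifying windows, it can include a few mismatched columns at its edges; a stricter trimming step would be a separate design choice.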
Background Model and MotifScanner • Background Model • A 3rd-order Markov model is calculated from the syntenic FASTA database • Used for scoring and for generating artificial datasets • MotifScanner • All syntenic regions are scanned to predict transcription factor binding sites (TFBSs) • TRANSFAC: frequency matrices • All occurrences are stored in GFF format in the syntenic GFF database
GFF (Gene-Finding Format or General Feature Format): a protocol for the transfer of feature information. Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame>
TRANSFAC frequency matrix (excerpt):
PO  A   C   G   T
01  12  4   3   1   A
02  3   2   11  4   G
03  11  2   4   3   A
…
Coregulated Genes • Sets of coexpressed genes • From the SOURCE database, for cyclin B2 • Dataset of gene expression during the cell cycle in a human cancer cell line • 44 genes might share a common cis-regulatory element • Of these, 34 had an Ensembl identifier • Among them, 13 genes have at least one syntenic region with the respective mouse gene • 32 regions in total
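To make the background model concrete, here is a minimal sketch of estimating a 3rd-order Markov model from a set of sequences. The function name `train_markov`, the pseudocount, and the decision to skip positions containing non-ACGT symbols are assumptions for illustration, not details taken from the slides.

```python
from collections import defaultdict

def train_markov(sequences, order=3, pseudocount=1.0, alphabet="ACGT"):
    """Estimate P(base | previous `order` bases) from the given sequences."""
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0.0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip positions with ambiguous symbols such as N (assumption).
            if base in alphabet and all(c in alphabet for c in context):
                counts[context][base] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values()) + pseudocount * len(alphabet)
        model[context] = {b: (row[b] + pseudocount) / total for b in alphabet}
    return model
```

Contexts never observed in the training data are absent from the returned dictionary, so a caller needs a fallback (for example a uniform distribution) when looking them up.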
Scoring single TFBSs • Combining a position-specific frequency matrix (PSFM) Θ and a higher-order background model Bm • How likely it is that the segment is generated by the motif model with respect to the background • x is a segment [b1, b2, … , bw] • bj is the nucleotide found at position j in x • Θ(bj, j) is the probability of finding bj at position j according to the PSFM • P(bj | s, Bm) is the probability of finding bj in the sequence according to the background model
Matrix similarity • Redundancy of motif models • There can be multiple matrices describing the same TF • There can be distinct TFs with similar PSFMs • Kullback-Leibler distance between two motif models • Θ1(j,b) is the probability of finding base b at position j in Motif 1 • w is the length of the motif • A is the set of all possible alignments for an allowed shift • The motif models can be grouped into classes depending on a threshold on this average distance
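The formula itself is not reproduced in the slide text. One reading consistent with the definitions above is a log-likelihood ratio, summing log(Θ(bj, j) / P(bj | s, Bm)) over the positions of the segment. The sketch below implements that reading; the function name `tfbs_score`, the `bg_prob` callback, and the `upstream` argument are hypothetical names introduced here.

```python
import math

def tfbs_score(segment, psfm, bg_prob, upstream=""):
    """Log-likelihood ratio of `segment` under a PSFM versus a background model.

    psfm: list of dicts, one per motif position, mapping base -> probability
          (assumed strictly positive, e.g. after adding pseudocounts).
    bg_prob: function (base, history) -> background probability of `base`
             given the preceding sequence `history`.
    """
    if len(segment) != len(psfm):
        raise ValueError("segment length must equal motif width")
    score = 0.0
    history = upstream
    for j, base in enumerate(segment):
        score += math.log(psfm[j][base] / bg_prob(base, history))
        history += base  # the background model conditions on preceding bases
    return score
```

A positive score means the segment is more likely under the motif model than under the background; a higher-order background model would use only the last few characters of `history`.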
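The exact distance formula is not included in the slide text, so the following is only a sketch of one plausible form: a symmetrized Kullback-Leibler divergence averaged over the aligned columns, minimized over the allowed shifts. The symmetrization, the use of the minimum over alignments, and the names `kl` and `motif_distance` are assumptions.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two base distributions (q assumed > 0)."""
    return sum(p[b] * math.log(p[b] / q[b]) for b in "ACGT" if p[b] > 0)

def motif_distance(m1, m2, max_shift=1):
    """Smallest average symmetric KL distance over alignments within max_shift."""
    best = math.inf
    for shift in range(-max_shift, max_shift + 1):
        # Pair up the columns that overlap under this shift.
        pairs = [(m1[j], m2[j + shift]) for j in range(len(m1))
                 if 0 <= j + shift < len(m2)]
        if not pairs:
            continue
        d = sum(0.5 * (kl(p, q) + kl(q, p)) for p, q in pairs) / len(pairs)
        best = min(best, d)
    return best
```

Two matrices whose distance falls below a chosen threshold would be placed in the same class. Note that a large shift leaves few overlapping columns, which can make unrelated motifs look similar, so a minimum overlap would be a sensible extra constraint.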
Module Score Function • Analogy: a binding site is to a motif model (a frequency matrix) as a CRM is to a CRM model • CRMs: clusters of actual binding sites on a sequence • CRM models: sets of motif models • The score of a CRM model m on a set of sequences s=(s1,…,sn)
The score of a CRM model m on a sequence s • m is a collection of motif models Θ1, …, Θl • t is a set of matching binding sites • represents a count over the occurring TFBSs of model Θi in sequence s • If the number of the occurrences is q, can take any value in 0, … , q • is the kth instance of Θi on sequence s • is the score of a single TFBS • b(t) is a boolean function expressing whether the given combination of TFBSs is valid or not, based on: • Overlap between different TFBSs • The distance constraint: all sites must lie within the specified window length • p(t) is the penalization function for CRMs • The number of occurring sites divided by the number of motif models l • The score does not take the motif order into account
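The scoring formula is not present in the slide text. The sketch below shows one way the listed ingredients could fit together: choose at most one site per motif model, reject combinations that overlap or exceed the window, and scale the summed site scores by the fraction of motif models that are present. Taking the maximum over combinations, multiplying by the penalty, and the name `module_score` are assumptions, and the exhaustive enumeration is only practical for small inputs.

```python
from itertools import product

def module_score(hits_per_motif, window=100):
    """Best penalized score of a valid site combination on one sequence.

    hits_per_motif: one list per motif model, each holding (start, end, score)
                    tuples with `end` exclusive.
    """
    n_models = len(hits_per_motif)
    best = 0.0
    # None stands for "this motif model has no site in the combination".
    options = [[None] + list(hits) for hits in hits_per_motif]
    for combo in product(*options):
        sites = sorted(s for s in combo if s is not None)
        if not sites:
            continue
        # Validity check: no overlapping sites, all sites within the window.
        overlap = any(a[1] > b[0] for a, b in zip(sites, sites[1:]))
        span = max(s[1] for s in sites) - sites[0][0]
        if overlap or span > window:
            continue
        penalty = len(sites) / n_models  # fraction of motif models present
        best = max(best, penalty * sum(s[2] for s in sites))
    return best
```

Since the sites are sorted by position before checking, the result does not depend on the order in which the motif models are listed, matching the remark that motif order is ignored.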
ModuleSearch • Since the order of sites is not considered, the motif models in a CRM model can be kept in alphabetical order • nΘ, the number of sites a module should contain, is given • Search for the best CRM model on a set of coregulated genes • Typical best-first / branch-and-bound search • Starting from the empty model, expand incomplete models by adding a motif model from a different class, until there are no incomplete models whose overestimate heuristic score is greater than the score of the current best complete model • The model with the best heuristic score is expanded first
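The search strategy described above can be sketched as a generic best-first branch-and-bound. To keep the example self-contained, the real module score is replaced by a toy additive score (each motif model contributes a fixed, non-negative number), and the overestimate simply ignores the class constraint. The function name `module_search` and the input layout are hypothetical.

```python
import heapq

def module_search(motifs, n_sites):
    """motifs: dict name -> (class_id, score). Returns (best_score, names)."""
    names = sorted(motifs)  # order is irrelevant, so expand alphabetically

    def bound(model, last):
        # Current score plus the largest remaining scores, ignoring classes:
        # never lower than what any completion of `model` can reach.
        current = sum(motifs[m][1] for m in model)
        rest = sorted((motifs[m][1] for m in names[last + 1:]), reverse=True)
        return current + sum(rest[:n_sites - len(model)])

    best_score, best_model = float("-inf"), None
    heap = [(-bound((), -1), (), -1)]  # max-heap via negated bounds
    while heap:
        neg_bound, model, last = heapq.heappop(heap)
        if -neg_bound <= best_score:
            break  # no incomplete model can beat the current best
        used = {motifs[m][0] for m in model}
        for i in range(last + 1, len(names)):
            name = names[i]
            if motifs[name][0] in used:
                continue  # only add a motif model from a different class
            child = model + (name,)
            if len(child) == n_sites:
                score = sum(motifs[m][1] for m in child)
                if score > best_score:
                    best_score, best_model = score, child
            else:
                b = bound(child, i)
                if b > best_score:
                    heapq.heappush(heap, (-b, child, i))
    return best_score, best_model
```

Only extending a model with names that sort after its last member avoids visiting the same set of motif models twice, which is what the alphabetical ordering buys.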
Heuristic Score • is the score function without penalization of m • is an overestimate heuristic value of the rise in score from CRM model m to the best child CRM model • [Θi] is a CRM model containing one matrix Θi • t = ( ) (Θl+1, …, Θe) • is a boolean function expressing whether the motif models, when added to m, all belong to different classes
Semi-Artificial Sequences • Artificial sequences were generated by sampling symbols from the background model
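Sampling from a Markov background model can be sketched as follows. The model layout (a dictionary from context string to base probabilities), the uniform fallback for unseen contexts, the uniformly random first bases, and the name `sample_sequence` are assumptions made for this illustration.

```python
import random

def sample_sequence(model, length, order=3, alphabet="ACGT", seed=None):
    """Sample a sequence where each base depends on the previous `order` bases.

    model: dict mapping a context string of length `order` to a dict
           base -> probability.
    """
    rng = random.Random(seed)
    # The first `order` bases have no full context; draw them uniformly.
    seq = [rng.choice(alphabet) for _ in range(min(order, length))]
    while len(seq) < length:
        context = "".join(seq[len(seq) - order:])
        probs = model.get(context)
        # Fall back to a uniform distribution for contexts not in the model.
        weights = [probs[b] for b in alphabet] if probs else [1.0] * len(alphabet)
        seq.append(rng.choices(alphabet, weights=weights)[0])
    return "".join(seq)
```

Passing a fixed `seed` makes the generated dataset reproducible, which helps when comparing detection sensitivity across runs.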
Detecting Modules in Microarray Clusters • Selected gene cluster around cyclin B2 • The best module model in the cluster selected by ModuleSearcher • window=100 bp and nΘ=4 • [NFY, STAF, TCF4, CEBPA]
Conclusions • Scoring functions for modules in syntenic regions and an algorithm to find the best-scoring module were proposed • The authors tested the algorithm on artificial data and showed that it could find the hidden modules with high sensitivity • They predicted a module in a set of coexpressed genes and validated the prediction using the same approach