Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. N. Rajewsky, M. Vergassola, U. Gaul and E. Siggia. Presented by Bin Tan.
Cis-regulatory modules (CRM) • In higher eukaryotes, many genes show complex spatiotemporal expression patterns. • The transcriptional regulation apparatus is largely organized in the form of separable cis-regulatory modules. • A module integrates inputs from several transcription factors and regulates the expression of another gene, forming a regulatory network.
Structural features of modules • Hundreds of nucleotides in length • Contain binding sites for as many as 4-5 different transcription factors • Possibly multiple binding sites for the same transcription factor • Certain combinations of binding sites: correlations between different transcription factors
Why computational methods? • Purely experimental methods such as “promoter bashing” are tedious. • It is easier to screen a modest list of candidates suggested by a computational method.
About this paper • Uses data on body patterning of the early Drosophila embryo • Makes statistically significant predictions of regulatory modules using three different levels of prior information: • Binding sites (motifs) • Several related modules • Only genome
Three levels of prior information: 1. Binding sites (motifs) 2. Several related modules 3. Only genome
The Ahab algorithm • Uses known binding sites (motifs) information • Scans the genome in windows • Scores each window according to how well the sequence can be stochastically generated from the motifs • Outputs windows with high ranks
Ahab features (As compared to Mobydick) • Uses positional weight matrices as the motif model • Introduces a local background to remove influence from local variations in sequence composition • Allows binding sites to overlap • Allows weak binding sites to contribute to the score • No parameter tuning (other than the window size)
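Algorithm details • Background model: k-th order Markov chain (each nucleotide depends only on the preceding k nucleotides) • Shown for order 2, where the subscript k indexes sequence position and $f(\cdot)$ is the frequency of a string: $p_b(s_k \mid s_{k-2} s_{k-1}) = \frac{f(s_{k-2} s_{k-1} s_k)}{f(s_{k-2} s_{k-1})}$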
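The background model can be illustrated with a minimal sketch that estimates the conditional probabilities from k-mer counts. The function name `markov_background` and the toy sequence are hypothetical, and no pseudocounts are applied, so unseen contexts are simply absent from the result.

```python
from collections import Counter

def markov_background(seq, order=2):
    """Estimate p_b(next | context) as f(context + next) / f(context).

    Contexts are counted only where a following nucleotide exists, so the
    conditional probabilities for each context sum to 1.
    """
    context_counts = Counter()
    kmer_counts = Counter()
    for i in range(len(seq) - order):
        context = seq[i:i + order]
        kmer_counts[context + seq[i + order]] += 1
        context_counts[context] += 1
    return {kmer: n / context_counts[kmer[:-1]] for kmer, n in kmer_counts.items()}

# Toy sequence (hypothetical data, for illustration only).
background = markov_background("ACGACGACGT", order=2)
for kmer in sorted(background):
    print(kmer[:-1], "->", kmer[-1], round(background[kmer], 3))
```

In a real scan the counts would presumably come from a region around each window, since the slides describe the background as local.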
Algorithm details (cont.) • Sequence $S = s_1 s_2 \ldots$ • Weight matrices $w_1, w_2, \ldots$ for motifs • Background $w_b$ • Probabilistic generation of S: • Choose $w_k$, $k = 1, 2, \ldots, b$ (a motif or the background), with probability $p_k$ • Sample a sequence $s$ according to $w_k$ and append it to S • Repeat until S reaches a certain length • Probability of a sampled sequence: $p(s \mid w_k) = \prod_i w_k^i(s_i)$ for a motif, $p(s \mid w_b) = \prod_i p_b(s_i \mid s_{i-2} s_{i-1})$ for the background
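Under this generative process, the probability of a window is a sum over every way of splitting it into motif occurrences and background nucleotides. A minimal sketch of that sum as a forward recursion follows. It makes two simplifying assumptions that are not from the slides: the background is zeroth-order (one independent nucleotide per background step), and the names `segmentation_likelihood`, `UNIFORM` and `AC_MOTIF` are hypothetical.

```python
def segmentation_likelihood(seq, pwms, probs, bg):
    """Probability of seq summed over all segmentations.

    pwms  : list of weight matrices, each a list of {nucleotide: prob} columns
    probs : mixing probabilities, one per matrix, background last
    bg    : zeroth-order background {nucleotide: prob} (simplifying assumption)
    """
    n = len(seq)
    z = [0.0] * (n + 1)   # z[i] = probability of generating seq[:i]
    z[0] = 1.0
    for i in range(1, n + 1):
        # Case 1: the last segment is a single background nucleotide.
        total = probs[-1] * bg[seq[i - 1]] * z[i - 1]
        # Case 2: the last segment is an occurrence of motif k ending at i.
        for pwm, p_k in zip(pwms, probs):
            length = len(pwm)
            if i >= length:
                emit = 1.0
                for column, base in zip(pwm, seq[i - length:i]):
                    emit *= column[base]
                total += p_k * emit * z[i - length]
        z[i] = total
    return z[n]

UNIFORM = {base: 0.25 for base in "ACGT"}
# A deterministic two-column matrix that only ever emits "AC" (toy example).
AC_MOTIF = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
likelihood = segmentation_likelihood("AC", [AC_MOTIF], [0.2, 0.8], UNIFORM)
print(likelihood)
```

For the toy input there are two segmentations: one motif occurrence (0.2) or two background nucleotides (0.2 × 0.2), for a total of 0.24 up to floating-point rounding. Long windows would need log-space arithmetic or rescaling to avoid underflow.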
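Algorithm details (cont.) • Unknown parameters θ: $p_1, p_2, \ldots, p_b$ • Maximize $\log p(S \mid \theta) = \log \sum_T p(S, T \mid \theta)$, where T ranges over the hidden segmentations of S into motifs and background • Conjugate gradient descent or EM algorithm: $Q(\theta \mid \theta^t) = \sum_T p(T \mid S, \theta^t) \log p(S, T \mid \theta)$, $\theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t)$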
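One way to read the EM option: the E-step computes the expected number of times each weight matrix (or the background) is used, and the M-step renormalizes those counts into new probabilities. The sketch below is an illustration under stated assumptions, not the authors' implementation: the background is zeroth-order, there is no underflow protection, and `em_step`, `UNIFORM` and `AC_MOTIF` are hypothetical names.

```python
def em_step(seq, pwms, probs, bg):
    """One EM update of the mixing probabilities (background is probs[-1]).

    Returns (new_probs, likelihood of seq under the input probs).
    """
    n = len(seq)
    lengths = [len(w) for w in pwms] + [1]   # a background step emits 1 base

    def emit(k, start):
        # Probability that component k emits the bases beginning at `start`.
        if k == len(pwms):
            return bg[seq[start]]
        p = 1.0
        for j, column in enumerate(pwms[k]):
            p *= column[seq[start + j]]
        return p

    # Forward: f[i] = probability of generating seq[:i].
    f = [1.0] + [0.0] * n
    for i in range(1, n + 1):
        f[i] = sum(probs[k] * emit(k, i - L) * f[i - L]
                   for k, L in enumerate(lengths) if i >= L)
    # Backward: b[i] = probability of generating seq[i:].
    b = [0.0] * n + [1.0]
    for i in range(n - 1, -1, -1):
        b[i] = sum(probs[k] * emit(k, i) * b[i + L]
                   for k, L in enumerate(lengths) if i + L <= n)
    # E-step: expected number of uses of each component.
    counts = [0.0] * len(lengths)
    for k, L in enumerate(lengths):
        for i in range(n - L + 1):
            counts[k] += f[i] * probs[k] * emit(k, i) * b[i + L] / f[n]
    # M-step: maximizing Q under sum(p) = 1 gives p_k proportional to counts.
    total = sum(counts)
    return [c / total for c in counts], f[n]

UNIFORM = {base: 0.25 for base in "ACGT"}
AC_MOTIF = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
probs = [0.2, 0.8]
history = []
for _ in range(5):
    probs, likelihood = em_step("ACGTACAC", [AC_MOTIF], probs, UNIFORM)
    history.append(likelihood)
print(probs, history)
```

Each iteration should leave the likelihood unchanged or higher, which is a convenient sanity check when experimenting with the update.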
Experiment setup • Input: weight matrices for 8 transcription factors constructed from 11 modules • Window size: 500 bp • 27 modules known to receive maternal/gap gene input
Results • 146 highly significant modules found • For 27 known modules: • 11+6 recovered • 3 when filtering for at least 3 different factors • 3 because they contain only other factors • 4 ranked very low (>700) • For 15 novel predictions: • one of the adjacent genes is patterned in the blastoderm
Estimation of positive rate • Scramble the columns in the weight matrices: half as many predictions -> 50% false positive rate • (6+15)*3/(146-11) -> 50% positive rate
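The second estimate is plain arithmetic and can be checked directly. The variable names are hypothetical, and the factor of 3 is copied from the slide, which does not state where it comes from.

```python
# Numbers as they appear on the slide.
supported = 6 + 15        # the two counts summed on the slide
scale = 3                 # factor used on the slide; its origin is not stated there
candidates = 146 - 11     # the 146 predictions minus 11, as written on the slide

positive_rate = supported * scale / candidates
print(f"{positive_rate:.2f}")  # prints 0.47
```

The result, 63/135, is about 0.47, which the slide rounds to 50%.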
Experiment variations • Remove the least specific matrix (tailless) from the input: • 75% of the predictions made without tailless are also present in the list of 146 • Vary the window size to 700 bp: • 58% of the list of 146 are also among the top 200 of the 700 bp set • Interesting new predictions
Three levels of prior information: 1. Binding sites (motifs) 2. Several related modules 3. Only genome
Motivation • For most transcription factors, binding site information is not known • Modules obtained by experimental methods (e.g. promoter bashing) are more common
The method • Uses standard motif finders* to recover weight matrices from input modules • Feeds the motifs to Ahab to find similarly regulated genes
The method (cont.) • Gibbs sampler algorithm: • Lawrence et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. (Presented by Xin He) • Customizations: • Search for only one binding site at a time. • Mask only the central 1-2 bases of each motif before iterating. • -> Results are more reproducible between runs. • -> Motifs are allowed to overlap.
Experiment results Testing on modules with known binding site information: • Gibbs sampling predicts that 30-50% of the sequence is covered by motifs • Gibbs motifs have higher specificity • Recovers half of the known motifs • Predicts several interesting new motifs
Experiment results (cont.) • Input: 3 modules receiving inputs from 6 transcription factors • 6 highly significant weight matrices found • Kr, Kni, (Hb+Cad) + 3 new • Ahab finds 63 highly significant modules • 4 overlap with the input modules • 13 contiguous to genes patterned in the blastoderm • Comparable positive rates
Three levels of prior information: 1. Binding sites (motifs) 2. Several related modules 3. Only genome
The Argos algorithm • Uses only the genome data (unsupervised) • Motivation: Is the redundancy of binding sites inside modules strong enough, on its own, to predict modules? • The first successful attempt to do this for a metazoan genome
The Argos algorithm • To determine whether a motif is locally overrepresented: score its frequency in the sequence against its expected frequency (according to the genome-wide background). • Enumerate all possible motifs of length 8. • Compute their frequency in the genome (background counts), allowing 2 mutations
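Counting a motif while allowing mutations can be sketched as a Hamming-distance scan. This is a naive illustration: the function name `count_with_mismatches` and the toy sequence are hypothetical, only the forward strand is scanned, and a genome-scale run over all 8-mers would need a far more efficient scheme.

```python
def count_with_mismatches(seq, motif, max_mismatches=2):
    """Count positions where seq matches motif with at most max_mismatches."""
    length = len(motif)
    count = 0
    for i in range(len(seq) - length + 1):
        mismatches = sum(a != b for a, b in zip(seq[i:i + length], motif))
        if mismatches <= max_mismatches:
            count += 1
    return count

# Toy sequence (hypothetical data) with one exact copy of the motif,
# one copy with 1 mismatch and one copy with 2 mismatches.
toy = "ACGTACGTAAGTACGA"
exact = count_with_mismatches(toy, "ACGTACGT", max_mismatches=0)
relaxed = count_with_mismatches(toy, "ACGTACGT", max_mismatches=2)
print(exact, relaxed)  # prints 1 3
```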
The Argos algorithm (cont.) • Move a sliding window S over the genome • Compute a motif’s local count c in S • Compute the motif’s expected count from background • Rank the motifs by their Poisson scores • The motifs are often related to each other: • Greedily select the top motif and eliminate related ones (under shifts and up to 4 mutations) • Repeat until 5 motifs have been produced • Use the sum of the selected motifs’ scores as the score for S
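The window-scoring steps above can be sketched as follows. Several choices here are assumptions rather than details from the slides: the Poisson score is taken to be the negative log of the upper tail probability P(X >= c), "related" is tested with a hypothetical shift range of ±2 where unaligned positions count as mismatches, and all function names and example counts are made up for illustration.

```python
import math

def poisson_score(count, expected):
    """-log P(X >= count) for X ~ Poisson(expected), with expected > 0."""
    if count == 0:
        return 0.0
    tail, j = 0.0, count
    while True:
        term = math.exp(-expected + j * math.log(expected) - math.lgamma(j + 1))
        tail += term
        if j > expected and term <= 1e-16 * tail:
            break
        j += 1
    return math.inf if tail == 0.0 else -math.log(tail)

def related(a, b, max_shift=2, max_mutations=4):
    """True if a and b agree up to max_mutations under some small shift."""
    for shift in range(-max_shift, max_shift + 1):
        mismatches = abs(shift)  # unaligned positions count as mismatches (assumption)
        for i in range(len(a)):
            j = i - shift
            if 0 <= j < len(b) and a[i] != b[j]:
                mismatches += 1
        if mismatches <= max_mutations:
            return True
    return False

def window_score(local_counts, expected_counts, n_select=5):
    """Greedily keep the top unrelated motifs and sum their scores."""
    ranked = sorted(
        ((poisson_score(c, expected_counts[m]), m) for m, c in local_counts.items()),
        reverse=True,
    )
    chosen = []
    for score, motif in ranked:
        if all(not related(motif, kept) for _, kept in chosen):
            chosen.append((score, motif))
        if len(chosen) == n_select:
            break
    return sum(s for s, _ in chosen), [m for _, m in chosen]

# Hypothetical local counts in one window and background expectations.
local = {"ACGTACGT": 6, "ACGTACGA": 5, "TTTTGGGG": 4}
expected = {motif: 0.5 for motif in local}
total_score, selected = window_score(local, expected)
print(selected, round(total_score, 2))
```

In this toy window the second motif differs from the top one by a single base, so it is eliminated as related and only two motifs contribute to the window score.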
Experiment results • For a certain set of modules, Argos recovers half of them -> 50% false negative rate • For several genes with 15 known modules, Argos recovers 7 when looking over 10 kbp upstream of translation start • Genome-wide, roughly one module per gene