N. Rajewsky, M. Vergassola, U. Gaul and E. Siggia Presented by Bin Tan

Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo N. Rajewsky, M. Vergassola, U. Gaul and E. Siggia Presented by Bin Tan

Cis-regulatory modules (CRMs) • In higher eukaryotes, many genes show complex spatiotemporal expression patterns. • The transcriptional regulation apparatus is largely organized in the form of separable cis-regulatory modules. • A module integrates inputs from several transcription factors and regulates another gene’s expression, forming a regulatory network.

Structural features of modules • Hundreds of nucleotides in length • Contain binding sites for as many as 4-5 different transcription factors • Possibly multiple binding sites for the same transcription factor • Certain combinations of binding sites: correlations between different transcription factors

Why computational methods? • Purely experimental methods such as “promoter bashing” are tedious. • It is easier to screen a modest list of candidates suggested by a computational method.

About this paper • Uses data on body patterning of the early Drosophila embryo • Makes statistically significant predictions of regulatory modules using three different levels of prior information: • Binding sites (motifs) • Several related modules • Only genome

Three levels of prior information: 1. Binding sites (motifs) 2. Several related modules 3. Only genome

The Ahab algorithm • Uses known binding sites (motifs) information • Scans the genome in windows • Scores each window according to how well the sequence can be stochastically generated from the motifs • Outputs windows with high ranks

Ahab features (as compared to Mobydick) • Uses position weight matrices as the motif model • Introduces a local background to remove the influence of local variations in sequence composition • Allows binding sites to overlap • Allows weak binding sites to contribute to the score • No parameter tuning (other than the window size)

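Algorithm details • Background model: k-th order Markov chain (each nucleotide is only dependent on the preceding k nucleotides) • Shown for order 2, with the subscript k indexing the position in the sequence: $p_b(s_k \mid s_{k-2} s_{k-1}) = \frac{f(s_{k-2} s_{k-1} s_k)}{f(s_{k-2} s_{k-1})}$, where $f(\cdot)$ is the frequency of the given nucleotide string in the background sequence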
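
The background model can be illustrated with a minimal sketch that estimates the conditional probabilities from substring counts. The function name `markov_background` and the toy sequence are assumptions made for illustration, and no pseudocounts are used, so unseen contexts are simply absent from the result.

```python
from collections import Counter

def markov_background(seq, order=2):
    """Estimate p_b(x | preceding `order` bases) from counts in seq.

    Returns a dict keyed by context + base, e.g. "CGT" -> p_b(T | CG).
    """
    ctx_counts = Counter()   # f(context)
    full_counts = Counter()  # f(context followed by the next base)
    for i in range(order, len(seq)):
        ctx = seq[i - order:i]
        ctx_counts[ctx] += 1
        full_counts[ctx + seq[i]] += 1
    # Ratio of the two counts, as in the formula for the background model.
    return {kmer: n / ctx_counts[kmer[:-1]] for kmer, n in full_counts.items()}

# Toy example (hypothetical sequence): "CG" is followed by T twice and by A once.
probs = markov_background("ACGTACGTACGA", order=2)
```

In this toy sequence `probs["CGT"]` is 2/3 and `probs["CGA"]` is 1/3, so the probabilities for each context sum to one.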

Algorithm details (cont.) • Sequence $S = s_1 s_2 \ldots$ • Weight matrices $w_1, w_2, \ldots$ for motifs • Background $w_b$ • Probabilistic generation of S: • Choose a motif or the background $w_k$, $k = 1, 2, \ldots, b$, with probability $p_k$ • Sample a sequence according to $w_k$ and append it to S • Repeat until S reaches a certain length • Probability of a sampled segment s: for a motif, $p(s \mid w_k) = \prod_i w_k(i, s_i)$; for the background, $p(s \mid w_b) = \prod_i p_b(s_i \mid s_{i-2} s_{i-1})$
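
Because the same sequence can be generated by many different choices of motifs and background, its probability is a sum over all such parses. The sketch below computes that sum with a forward recursion. It is an illustration under stated assumptions, not the authors' implementation: the name `parse_weight` and its arguments are hypothetical, the background is 0th-order for brevity, and the "repeat until a certain length" stopping rule is not normalized for.

```python
import math

def parse_weight(seq, motifs, p_motif, p_bg, bg_freq):
    """Total weight of all parses of seq into motif sites and background bases.

    motifs:  list of weight matrices, each a list of {base: prob} per position
    p_motif: list of probabilities p_k, one per motif
    p_bg:    probability of choosing the background (emits one base)
    bg_freq: {base: prob}, a 0th-order background (simplifying assumption)
    """
    n = len(seq)
    z = [0.0] * (n + 1)
    z[0] = 1.0  # the empty prefix has exactly one (empty) parse
    for i in range(1, n + 1):
        # Case 1: the last segment is a single background base.
        total = z[i - 1] * p_bg * bg_freq[seq[i - 1]]
        # Case 2: the last segment is a site of motif k ending at position i.
        for pwm, pk in zip(motifs, p_motif):
            length = len(pwm)
            if i >= length:
                site = seq[i - length:i]
                emit = math.prod(col[b] for col, b in zip(pwm, site))
                total += z[i - length] * pk * emit
        z[i] = total
    return z[n]

# Toy example: one deterministic motif "AC" and a uniform background.
pwm_ac = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
          {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}]
uniform = {b: 0.25 for b in "ACGT"}
w_site = parse_weight("AC", [pwm_ac], [0.5], 0.5, uniform)  # 0.5 + 0.125**2
w_none = parse_weight("GG", [pwm_ac], [0.5], 0.5, uniform)  # 0.125**2
```

Sites that start at different positions and overlap each other enter the sum through different parses, which is one way to read "allows binding sites to overlap".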

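Algorithm details (cont.) • Unknown parameters θ: $p_1, p_2, \ldots, p_b$ • Maximize the likelihood of S, summed over all parses T of S into motifs and background: $\log p(S \mid \theta) = \log \sum_T p(S, T \mid \theta)$ • Conjugate gradient descent or EM algorithm; the EM iteration is $Q(\theta \mid \theta^t) = \sum_T p(T \mid S, \theta^t) \log p(S, T \mid \theta)$, $\theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t)$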
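
For the EM option, one iteration can be sketched as a forward-backward pass that computes the expected number of times each motif (or the background) is used, followed by a renormalization. This is a minimal sketch under assumptions: `em_step` and its arguments are hypothetical names, the background is 0th-order, and the stopping rule of the generative process is ignored.

```python
import math

def em_step(seq, motifs, p, bg_freq):
    """One EM update of the mixing probabilities.

    motifs:  list of weight matrices, each a list of {base: prob} per position
    p:       list of K+1 probabilities; p[k] for motif k, p[K] for background
    bg_freq: {base: prob}, a 0th-order background (simplifying assumption)
    """
    n, K = len(seq), len(motifs)
    # Every candidate segment: (start, end, component index, weight).
    segs = []
    for i in range(n):
        segs.append((i, i + 1, K, p[K] * bg_freq[seq[i]]))
        for k, pwm in enumerate(motifs):
            j = i + len(pwm)
            if j <= n:
                emit = math.prod(col[b] for col, b in zip(pwm, seq[i:j]))
                segs.append((i, j, k, p[k] * emit))
    # Forward sums: fwd[i] is the total weight of all parses of seq[:i].
    fwd = [1.0] + [0.0] * n
    for start, end, _, w in sorted(segs, key=lambda s: s[1]):
        fwd[end] += fwd[start] * w
    # Backward sums: bwd[i] is the total weight of all parses of seq[i:].
    bwd = [0.0] * n + [1.0]
    for start, end, _, w in sorted(segs, key=lambda s: s[0], reverse=True):
        bwd[start] += w * bwd[end]
    # E-step: expected usage count of each component under the current p.
    counts = [0.0] * (K + 1)
    for start, end, k, w in segs:
        counts[k] += fwd[start] * w * bwd[end] / fwd[n]
    # M-step: the new probabilities are the normalized expected counts.
    total = sum(counts)
    return [c / total for c in counts]

# Toy example: one deterministic motif "AC", uniform background.
pwm_ac = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
          {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}]
uniform = {b: 0.25 for b in "ACGT"}
new_p = em_step("ACACGT", [pwm_ac], [0.5, 0.5], uniform)
```

Calling `em_step` repeatedly, feeding `new_p` back in as `p`, gives the iteration $\theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t)$ for this simplified model.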

Experiment setup • Input: weight matrices for 8 transcription factors constructed from 11 modules • Window size: 500 bp • 27 modules known to receive maternal/gap gene input

Results • 146 highly significant modules found • For 27 known modules: • 11+6 recovered • 3 when filtering for at least 3 different factors • 3 because they contain only other factors • 4 ranked very low (>700) • For 15 novel predictions: • one of the adjacent genes is patterned in the blastoderm

Estimation of positive rate • Scramble the columns in the weight matrices: this yields half as many predictions -> 50% false positive rate • (6+15)*3/(146-11) = 63/135 ≈ 47% -> roughly 50% positive rate

Experiment variations • Remove the least specific matrix (tailless) from input: • 75% of the predictions without using tailless are also present in the list of 146 • Vary window size to 700 bp: • 58% in the list of 146 are also among the top 200 of the 700 bp set • Interesting new predictions

Motivation • For most transcription factors, binding site information is rarely known • Modules obtained by experimental methods (e.g. promoter bashing) are more common

The method • Uses standard motif finders* to recover weight matrices from input modules • Feeds the motifs to Ahab to find similarly regulated genes

The method (cont.) • Gibbs sampler algorithm: • Lawrence et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. (Presented by Xin He) • Customizations: • Search for only one binding site at a time. • Mask only the central 1-2 bases of each motif before iterating. • -> Results are more reproducible between runs. • -> Motifs are allowed to overlap.

Experiment results • Testing on modules with known binding site information: • Gibbs sampling predicts that 30-50% of the sequence is covered by motifs • Gibbs motifs have higher specificity • Recovers half of the known motifs • Predicts several interesting new motifs

Experiment results (cont.) • Input: 3 modules receiving inputs from 6 transcription factors • 6 highly significant weight matrices found • Kr, Kni, (Hb+Cad) + 3 new • Ahab finds 63 highly significant modules • 4 overlap with the input modules • 13 contiguous to genes patterned in the blastoderm • Comparable positive rates

The Argos algorithm • Only uses the genome data (Unsupervised) • Motivation: Is the redundancy of binding sites inside modules strong enough to predict modules alone? • The first successful attempt to do this for a metazoan genome

The Argos algorithm • To determine whether a motif is locally overrepresented: Score its frequency in the sequence against its expected frequency (according to genome wide background). • Enumerate all possible motifs of length 8. • Compute their frequency in the genome (background counts), allowing 2 mutations
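
Counting a motif while allowing mutations can be sketched as a Hamming-distance scan. This is a naive illustration, not the method used in the paper: `count_with_mismatches` is a hypothetical name, only one strand is scanned, and a genome-scale version would need a faster enumeration.

```python
def count_with_mismatches(seq, motif, max_mismatches=2):
    """Count positions where motif matches seq with at most max_mismatches."""
    length = len(motif)
    count = 0
    for i in range(len(seq) - length + 1):
        # Hamming distance between the motif and the window starting at i.
        mismatches = sum(a != b for a, b in zip(seq[i:i + length], motif))
        if mismatches <= max_mismatches:
            count += 1
    return count

# Toy example: the third window "AAAAAAAC" only counts when mismatches are allowed.
exact = count_with_mismatches("AAAAAAAAAC", "AAAAAAAA", max_mismatches=0)
fuzzy = count_with_mismatches("AAAAAAAAAC", "AAAAAAAA", max_mismatches=2)
```

The same function can serve for both the genome-wide background count and the local count in a window, which is what the score compares.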

The Argos algorithm (cont.) • Move a sliding window S over the genome • Compute a motif’s local count c in S • Compute the motif’s expected count from background • Rank the motifs by their Poisson scores • The motifs are often related to each other: • Greedily select the top motif and eliminate related ones (under shifts and up to 4 mutations) • Repeat until 5 motifs have been produced • Use the sum of the selected motifs’ scores as the score for S
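
The scoring and greedy selection steps can be sketched as follows. Several details are assumptions made for illustration: the Poisson score is taken to be $-\log P(X \ge c)$ for $X \sim \mathrm{Poisson}(\lambda)$ with $\lambda$ the expected count, and `related` uses a hypothetical relatedness rule (each shifted position costs one mutation, shifts of at most 2). The slides do not specify either detail.

```python
import math

def poisson_score(c, lam):
    """-log P(X >= c) for X ~ Poisson(lam); larger means more overrepresented."""
    cdf_below = sum(math.exp(-lam) * lam ** j / math.factorial(j)
                    for j in range(c))
    # Clamp to avoid log(0); this direct sum loses precision for extreme tails.
    tail = max(1.0 - cdf_below, 1e-300)
    return -math.log(tail)

def related(a, b, max_mut=4, max_shift=2):
    """Hypothetical rule: related if some small shift leaves <= max_mut mutations."""
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            x, y = a[s:], b[:len(b) - s]
        else:
            x, y = a[:len(a) + s], b[-s:]
        mutations = abs(s) + sum(p != q for p, q in zip(x, y))
        if mutations <= max_mut:
            return True
    return False

def window_score(scored_motifs, n_select=5):
    """Greedily pick up to n_select unrelated top motifs and sum their scores.

    scored_motifs: {motif: Poisson score} for one window.
    """
    remaining = sorted(scored_motifs, key=scored_motifs.get, reverse=True)
    chosen = []
    while remaining and len(chosen) < n_select:
        top = remaining[0]
        chosen.append(top)
        # Drop the selected motif and everything related to it.
        remaining = [m for m in remaining[1:] if not related(top, m)]
    return chosen, sum(scored_motifs[m] for m in chosen)

# Toy example with made-up scores: "AAAAAAAC" is eliminated as a relative
# of the top motif "AAAAAAAA".
toy_scores = {"AAAAAAAA": 10.0, "AAAAAAAC": 9.0,
              "CCCCCCCC": 5.0, "GGGGGGGG": 3.0}
chosen, total = window_score(toy_scores)
```

Eliminating related motifs before summing keeps one overrepresented binding site from being counted several times through its near-identical variants.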

Experiment results • For a certain set of modules, Argos recovers half of them -> 50% false negative rate • For several genes with 15 known modules, Argos recovers 7 when looking over 10 kbp upstream of translation start • Genome wide, roughly one module per gene
