Using a Beagle to sniff for Bacterial Promoters
Stefan R. Maetschke, Michael Towsey and James M. Hogan
Queensland University of Technology
An Agenda
• Bacterial Promoters
• The domain and the motifs
• Earlier approaches, including ours
• Why dumber is better
• Not quite, but flexibility before sophistication
• Exploiting new features as they are identified
• Results
Upstream from a Bacterial Gene
[Figure labels: RNA polymerase, σ, transcription, gene, promoter, GSS, TSS]
• Search for ‘conserved’ -10 and -35 hexamers
• Except they’re not really conserved
• Plagued by massive false positive rates
• But this is the Reader’s Digest version
Previous Work: σ70
• Mainly in the E. coli system
• PWMs – simple, but poor discrimination
• Good performance if compound structure used
• (Collado-Vides et al.: state of the art pre-2006)
• HMMs – less successful than in eukaryotes
• TDNNs – boosted by GSS offset distribution
• SVMs – spectrum kernel ensemble
• (Gordon et al. (us): state of the art, but at a price)
Beagle
• Principled and rapid inclusion of motifs as they are discovered or hypothesised
• Prior to the Gordon et al. paper, a TP:FP ratio of 1:300 was considered good
• But this was based solely on -10 and -35 motifs
• A model description language and parser
• Less sophisticated than it sounds, but sufficient
• Iterative refinement of the model
Upstream from a Bacterial Gene
• Core enzyme: ααββ′ω
• Specific sigma controls binding at -10, -35 elements
• But binding probability varies enormously
• Compensate when hexamers are weak
[Figure labels: α, β, β′, α, ω, σ (domains σ1 to σ4); -35 element TTGACA; -10 element TATAAT; TSS; GSS (ATG)]
“It has long been known that domains 2 and 4 … bind to the strongly conserved -10 and -35 boxes”. Except when they don’t because they aren’t…
Upstream from a Bacterial Gene
• Simple extended -10: TG
• Discovered in B. subtilis, found in 20% of promoters in E. coli
• -16 hypothesised to be important in E. coli, TRTG or T(AG)TG consensus
[Figure labels: σ70; α, β, β′, α, ω, σ (domains σ1 to σ4); -35 element TTGACA; TRTG; extended -10 element TATAAT; TSS; GSS (ATG)]
But even the alpha units aren’t what they seem…
Upstream from a Bacterial Gene
• αCTDs are carboxy terminal domains, binding to UP elements
• AT-rich region, proximal element more important
[Figure labels: αNTD1, αNTD2, αCTD1, αCTD2, β, β′, ω, σ (domains σ1 to σ4); distal UP element AWWWWWTTTTT; proximal UP element AAAAAARNR; -35 element TTGACA; -16 TRTG; extended -10 element TGTATAAT; TSS; GSS (ATG)]
The Data
• E. coli and B. subtilis
• Confirmed TSS locations within 250 bp of the nearest gene start
• No overlapping reading frames
• N = 492 (E. coli), 205 (B. subtilis)
• 250 bp USRs available
Beagle algorithm
• Define a consensus promoter
• e.g. <TTGACA (15, 21) TATAAT (4, 13) TSS>
• Ordered pairs specify gap ranges
• Parse the description and define PWMs and weighted gaps
• Initially trivial
• Refine using the confirmed TSS locations
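The description format on this slide can be illustrated with a small parser. This is a minimal sketch only: the slides do not give Beagle's actual grammar, so the `Element` class, the `parse_model` function and the regular expression are hypothetical names chosen for illustration, and the sketch assumes every motif is followed by exactly one `(min, max)` gap range and that the description ends at the `TSS` anchor.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    motif: str    # consensus string, e.g. "TTGACA"
    gap_min: int  # smallest allowed gap to the next element (or to the TSS)
    gap_max: int  # largest allowed gap to the next element (or to the TSS)


# A motif in capital letters followed by a "(min, max)" gap range.
_ELEMENT = re.compile(r"([A-Z]+)\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_model(description):
    """Parse e.g. '<TTGACA (15, 21) TATAAT (4, 13) TSS>' into a list of Elements."""
    body = description.strip()
    if not (body.startswith("<") and body.endswith(">")):
        raise ValueError("description must be wrapped in < >")
    body = body[1:-1].strip()
    if not body.endswith("TSS"):
        raise ValueError("description must end at the TSS anchor")
    body = body[:-3]  # drop the trailing 'TSS'
    elements = [Element(m, int(lo), int(hi)) for m, lo, hi in _ELEMENT.findall(body)]
    leftover = _ELEMENT.sub("", body).strip()
    if leftover or not elements:
        raise ValueError("could not parse: " + repr(leftover or description))
    for e in elements:
        if e.gap_min > e.gap_max:
            raise ValueError("gap range is reversed for " + e.motif)
    return elements


if __name__ == "__main__":
    for element in parse_model("<TTGACA (15, 21) TATAAT (4, 13) TSS>"):
        print(element)
```

Under this reading, the first pair is the spacer between the -35 and -10 hexamers and the second is the distance from the -10 hexamer to the TSS.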
Beagle algorithm
• For each USR in the training set:
• Anchor the pattern to the known TSS location
• Determine the best match based on the current model
• Find the MLE of the model parameters based on the best matches from the training data
• Test the refined definition on unseen data
• 10 repeats × 10-fold cross-validation
• Essentially TSS prediction
• Iterate until improvement ceases
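The refinement loop can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the starting PWM simply favours the consensus base with an assumed probability of 0.7, all gaps inside a range are treated as equally likely (the slides mention weighted gaps), a pseudocount is added to the frequency estimates, sequences are assumed to contain only A, C, G and T, and a fixed number of rounds stands in for "iterate until improvement ceases".

```python
import math
from itertools import product

BASES = "ACGT"


def initial_pwm(consensus, p_match=0.7):
    """Trivial starting PWM: each column favours the consensus base (assumed value)."""
    p_other = (1.0 - p_match) / 3
    return [{b: (p_match if b == c else p_other) for b in BASES} for c in consensus]


def log_score(pwm, site):
    """Log-probability of a site under a PWM (columns are independent)."""
    return sum(math.log(col[base]) for col, base in zip(pwm, site))


def best_match(seq, tss, pwms, gap_ranges):
    """Anchor the pattern at index tss and try every allowed gap combination.

    Returns (score, element start indices, gaps), or None if nothing fits.
    """
    best = None
    for gaps in product(*[range(lo, hi + 1) for lo, hi in gap_ranges]):
        end, starts, score = tss, [], 0.0
        # Place elements right to left, starting from the TSS anchor.
        for pwm, gap in zip(reversed(pwms), reversed(gaps)):
            start = end - gap - len(pwm)
            if start < 0:
                break  # pattern would run off the upstream end
            score += log_score(pwm, seq[start:start + len(pwm)])
            starts.append(start)
            end = start
        else:
            if best is None or score > best[0]:
                best = (score, starts[::-1], gaps)
    return best


def reestimate(sites, pseudocount=1.0):
    """Column-wise frequency estimate (maximum likelihood plus a pseudocount)."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm


def refine(training, consensus, gap_ranges, rounds=3):
    """training: list of (sequence, tss_index); returns one refined PWM per motif."""
    pwms = [initial_pwm(c) for c in consensus]
    for _ in range(rounds):
        sites = [[] for _ in pwms]
        for seq, tss in training:
            match = best_match(seq, tss, pwms, gap_ranges)
            if match is None:
                continue
            score, starts, gaps = match
            for k, start in enumerate(starts):
                sites[k].append(seq[start:start + len(pwms[k])])
        pwms = [reestimate(s) if s else p for s, p in zip(sites, pwms)]
    return pwms
```

For cross-validation, `refine` would be run on the training folds and `best_match` scored on the held-out sequences; that step is omitted here.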
TSS recognition (% accuracy)
Guess which promoter boxes are more strongly conserved…
Including UP elements
• NNW15NN
• AT rich region
• NNAAAWWTWTTNNAAANNN
• Estrem et al. 1998
• NNAAAWWTWTTN – A6RNR
• Gourse et al. 2000
• distal – proximal motif
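The consensus strings on these slides use the standard IUPAC nucleotide codes (R = A or G, W = A or T, N = any base). A minimal sketch of scanning a sequence for such a consensus is shown below. The function names are hypothetical, the example sequence is made up, and exact consensus matching is used only for illustration; it is not how Beagle scores motifs, which per the slides is done with PWMs.

```python
import re

# Standard IUPAC nucleotide codes, restricted to those needed for DNA motifs.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC consensus such as 'TRTG' into a regex string."""
    return "".join(IUPAC[ch] for ch in motif.upper())


def find_motif(motif, seq):
    """Return the start index of every (possibly overlapping) match in seq."""
    # A lookahead matches without consuming, so overlapping hits are reported.
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    example = "CCTATGTTGTGAA"  # made-up sequence, not from the data set
    print(find_motif("TRTG", example))
```

An expression such as A6RNR would need to be expanded to AAAAAARNR before being passed to `find_motif`.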
Comparing E. coli and B. subtilis promoters
[Figure panels: B. subtilis -35 element; B. subtilis -10 element; E. coli -35 element; E. coli -10 element]
E. coli has 7 known sigmas; B. subtilis 18…
Motifs ‘in the Gap’
• Extended -10 element
• Consensus TGTATAAT
• Strongly implicated in B. subtilis
• Hypothesised as significant in 20% of E. coli promoters
• Extended -16 element
• Consensus TRTG
[Figure label: σ70]
The Complete Picture
[Figure labels: αNTD; β/β′; αCTD I; αCTD II (×3); σ70; -35; -10; positions -72, -62, -52, -40.5; UP element; AT rich; variable location]
TSS recognition (% accuracy) E. coli 43.3% 48.3% B. subtilis 61.2% 71.2% +AT rich 64.8% 62.6% +TRTG +AT rich 47.3% 41.6% +TG
Conclusions
• Beagle provides a simple bridge between experiment and computational discovery
• Is the extended -16 motif really important in E. coli?
• (Well, not in any general sense)
• Fast, robust and flexible
• Extensions
• Combination of model organisms
• Comparative genomics & regulation