McPromoter – an ancient tool to predict transcription start sites

Uwe Ohler, uwe.ohler@duke.edu
Institute for Genome Sciences and Policy, Duke University (BDGP/Univ Erlangen)

An extremely simplified view of eukaryotic transcription
• Specific information about functional context of genes: proximal promoter/enhancers
• Binding sites of specific transcription factors confer activation at the right developmental stage or tissue
• General information: the core promoter
• Region around the transcription start site (TSS) where RNA polymerase II (pol-II) interacts with general transcription factors
• Potentially far away from the translation start site

Probabilistic modeling of promoters
• Goal: find TSS / proximal promoters ab initio
• Alternative to cDNA alignments
• Independent of and in addition to gene prediction
• Probabilistic modeling allows us to deal with uncertainty
• Models for classes of related sequences
• Models represent our knowledge about sequences in the form of parameters
• Parameters are automatically estimated from a representative set of sequences
• The model gives the probability that a sequence belongs to a class, here: promoter or non-promoter (coding, non-coding)
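
As a minimal sketch of the last three points, the following estimates one model per class from example sequences and applies Bayes' rule to get class probabilities for a new sequence. The zero-order nucleotide-frequency models, the toy training sequences, the uniform priors, and all function names are illustrative assumptions; they are not the models used in McPromoter.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def estimate_frequencies(sequences, pseudocount=1.0):
    """Estimate nucleotide frequencies from training sequences (toy zero-order model)."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in ALPHABET)
    total = sum(counts[ch] for ch in ALPHABET) + pseudocount * len(ALPHABET)
    return {ch: (counts[ch] + pseudocount) / total for ch in ALPHABET}

def log_likelihood(seq, freqs):
    """Log-probability of seq under an independent-nucleotide model."""
    return sum(math.log(freqs[ch]) for ch in seq.upper() if ch in ALPHABET)

def class_posteriors(seq, models, priors):
    """Bayes' rule over classes, normalized in log space for numerical stability."""
    joint = {c: math.log(priors[c]) + log_likelihood(seq, models[c]) for c in models}
    top = max(joint.values())
    norm = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {c: math.exp(v - norm) for c, v in joint.items()}

# Hypothetical toy training data, one small set per class.
training = {
    "promoter": ["TATAAAAGGC", "TATATAAGCA"],
    "coding": ["ATGGCCGGCG", "GCCGCGGGCC"],
    "non-coding": ["ACGTACGTAC", "TGCATGCATG"],
}
models = {c: estimate_frequencies(seqs) for c, seqs in training.items()}
priors = {c: 1.0 / len(models) for c in models}  # assumed uniform priors
posteriors = class_posteriors("TATAAATA", models, priors)
```

Swapping `estimate_frequencies` and `log_likelihood` for a richer sequence model leaves the classification step unchanged, which is the main reason to keep the class models and the decision rule separate.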

McPromoter system structure

Non-promoter classes: Stationary Markov chains
• Markov chain as tree
• Every node corresponds to a context
• Contains probability distribution
• Typical order: 6 (4,096 overall parameters)
• Probability of a sequence
• Approximation: Restrict context to the last N symbols (N-th order chain)
• Variations on Markov chains
• Variable Order: Leaves on different levels
• Interpolated: Combination of parameter values from different levels
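
The N-th order approximation can be written as P(s) ≈ ∏ P(s_i | s_{i-N} … s_{i-1}), with one probability distribution per context; over DNA there are 4^N contexts, so 4^6 = 4,096 for order 6. The sketch below trains and scores a plain fixed-order chain. The pseudocount smoothing, the uniform fallback for unseen contexts, the choice to condition on the first N symbols rather than score them, and the function names are assumptions made for illustration; the variable-order and interpolated variants are not implemented here.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov_chain(sequences, order, pseudocount=1.0):
    """Estimate P(symbol | previous `order` symbols) from uppercase A/C/G/T sequences."""
    counts = defaultdict(lambda: {ch: 0.0 for ch in ALPHABET})
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0
    model = {}
    for context, symbol_counts in counts.items():
        total = sum(symbol_counts.values()) + pseudocount * len(ALPHABET)
        model[context] = {
            ch: (symbol_counts[ch] + pseudocount) / total for ch in ALPHABET
        }
    return model

def log_probability(seq, model, order):
    """Log-probability of seq, conditioned on its first `order` symbols."""
    uniform = math.log(1.0 / len(ALPHABET))  # fallback for contexts never seen in training
    logp = 0.0
    for i in range(order, len(seq)):
        dist = model.get(seq[i - order:i])
        logp += math.log(dist[seq[i]]) if dist else uniform
    return logp

# Toy first-order example: the training data strongly alternates A and C.
chain = train_markov_chain(["ACACACACAC"], order=1)
alternating = log_probability("ACAC", chain, order=1)
non_alternating = log_probability("AACC", chain, order=1)
```

In this toy case `alternating` is larger than `non_alternating`, because the transitions A→C and C→A dominate the training counts.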

Promoter model
• Simple approach: Markov chain model
• Better: Take structure into account
• Generalized hidden Markov model
• Each state contains a submodel for a specific promoter part, including an explicit length distribution
• Interpolated Markov chains as submodels
Ohler et al., Bioinformatics 1999, PSB 2000
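
To make the generalized HMM idea concrete, here is a small sketch that finds the best segmentation of a sequence into a fixed left-to-right series of states, where each state contributes a length probability and a submodel score for its segment. The state names, length distributions, and nucleotide frequencies are invented for illustration, and simple zero-order submodels stand in for the interpolated Markov chains named on the slide.

```python
import math

NEG_INF = float("-inf")

def segment_log_prob(segment, freqs):
    """Toy submodel: independent nucleotides (stand-in for an interpolated Markov chain)."""
    return sum(math.log(freqs[ch]) for ch in segment)

def best_segmentation(seq, states):
    """Viterbi-style search over segment boundaries.

    states: list of (name, length_probs, freqs) visited once each, in order.
    Returns (best log score, [(name, start, end), ...]) covering the whole sequence.
    """
    n, k_max = len(seq), len(states)
    # best[k][j]: best log score for the first k states covering seq[:j]
    best = [[NEG_INF] * (n + 1) for _ in range(k_max + 1)]
    back = [[None] * (n + 1) for _ in range(k_max + 1)]
    best[0][0] = 0.0
    for k, (_, length_probs, freqs) in enumerate(states, start=1):
        for j in range(n + 1):
            for length, p in length_probs.items():
                i = j - length
                if i < 0 or p <= 0.0 or best[k - 1][i] == NEG_INF:
                    continue
                score = (best[k - 1][i] + math.log(p)
                         + segment_log_prob(seq[i:j], freqs))
                if score > best[k][j]:
                    best[k][j], back[k][j] = score, i
    if best[k_max][n] == NEG_INF:
        return NEG_INF, []
    segments, j = [], n
    for k in range(k_max, 0, -1):  # trace back the chosen boundaries
        i = back[k][j]
        segments.append((states[k - 1][0], i, j))
        j = i
    return best[k_max][n], segments[::-1]

uniform = {ch: 0.25 for ch in "ACGT"}
at_rich = {"A": 0.45, "T": 0.45, "C": 0.05, "G": 0.05}
# Hypothetical three-part promoter layout with explicit length distributions.
states = [
    ("upstream", {2: 0.5, 3: 0.5}, uniform),
    ("TATA box", {4: 1.0}, at_rich),
    ("downstream", {2: 0.5, 3: 0.5}, uniform),
]
score, segments = best_segmentation("GCGTATAGC", states)
```

Here the AT-rich state is placed on `TATA` (positions 3 to 7), since the alternative placement on `GTAT` scores lower under the same length probabilities.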

Example: stat6 promoter http://genes.mit.edu/McPromoter.html

Evaluation of ENCODE regions
• Similar problem to alternative splicing: alternative transcription start sites
• Traditionally, the window to count false positives has been very large (e.g., -2,000/+2,000), and close predictions within a large window are merged
• Evaluate on a per gene basis, i.e. count a true positive if it hits at least one of the annotated TSSs
• Second problem: False negatives
• After GASP, counting only those predictions internal to the annotated transcripts is the de facto standard
• 435 genes / 1,022 different TSSs
• Another problem: Circularity? (use of Eponine)
Reese et al., Genome Res (2000)
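
The evaluation scheme described above can be sketched as follows: merge nearby predictions, count a prediction as a true positive if it falls within the window around any annotated TSS, count it as a false positive if it otherwise lies inside an annotated transcript, and leave everything else as unknown. The data layout, the rule of keeping the first position of each merged cluster, and the fact that strand is ignored are simplifying assumptions, not details given in the slides.

```python
def merge_predictions(positions, max_gap):
    """Merge predictions lying within max_gap of a cluster's first position.

    Keeping the first position of each cluster is an assumption.
    """
    merged = []
    for pos in sorted(positions):
        if not merged or pos - merged[-1] > max_gap:
            merged.append(pos)
    return merged

def evaluate(predictions, genes, upstream, downstream):
    """Classify predictions as TP, FP (inside a transcript), or unknown.

    genes: list of dicts with 'name', 'tss' (list of positions), 'start', 'end'.
    """
    tp = fp = unknown = 0
    genes_hit = set()
    for pos in predictions:
        hits = {
            g["name"] for g in genes
            if any(t - upstream <= pos <= t + downstream for t in g["tss"])
        }
        if hits:
            tp += 1
            genes_hit |= hits
        elif any(g["start"] <= pos <= g["end"] for g in genes):
            fp += 1
        else:
            unknown += 1
    return {"tp": tp, "fp": fp, "unknown": unknown, "genes_hit": len(genes_hit)}

# Hypothetical annotation: one gene with two alternative TSSs, one with a single TSS.
genes = [
    {"name": "g1", "tss": [1000, 5000], "start": 1000, "end": 20000},
    {"name": "g2", "tss": [50000], "start": 50000, "end": 60000},
]
raw_predictions = [900, 1100, 10000, 30000, 50100]
merged = merge_predictions(raw_predictions, max_gap=2000)
result = evaluate(merged, genes, upstream=2000, downstream=2000)
```

Reporting both `tp` and `genes_hit` keeps the per-prediction and per-gene views apart, which matters when a gene has several alternative start sites.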

Results in the ENCODE region
• Standard parameters, NO repeat masking, merging predictions within 2,000 nt: 695 predictions
• Positive region -2,000/+2,000: 204 TP / 197 genes (sn 47%); 77 FP (sp 73%); 414 unknown
• More stringent: -500/+500: 169 TP (sn 39%); 101 FP (sp 63%)
• Does it make sense to move towards a more detailed evaluation?
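
The reported percentages are consistent with sensitivity computed as TP divided by the 435 annotated genes and specificity computed as TP / (TP + FP). That reading is an assumption, since the slides do not state the formulas; the short check below only shows that it reproduces the numbers.

```python
ANNOTATED_GENES = 435  # gene count given for the ENCODE regions

def sensitivity(tp, total=ANNOTATED_GENES):
    """Assumed definition: true positives relative to annotated genes, in percent."""
    return 100.0 * tp / total

def specificity(tp, fp):
    """Assumed definition: TP / (TP + FP), in percent."""
    return 100.0 * tp / (tp + fp)

# Window -2,000/+2,000: 204 TP, 77 FP
wide = (round(sensitivity(204)), round(specificity(204, 77)))
# Window -500/+500: 169 TP, 101 FP
narrow = (round(sensitivity(169)), round(specificity(169, 101)))

# TP, FP, and unknown predictions in the wide window add up to all predictions.
total_predictions = 204 + 77 + 414
```

Under these definitions `wide` evaluates to `(47, 73)` and `narrow` to `(39, 63)`, matching the slide, and `total_predictions` equals the 695 merged predictions.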

Thanks to...
• Berkeley Drosophila Genome Project: Gerry Rubin, Martin Reese, Suzi Lewis
• Erlangen, Institute for Computer Science: Heinrich Niemann, Stefan Harbeck, Georg Stemmer
