"McPromoter is a probabilistic modeling tool to predict transcription start sites (TSS) and proximal promoters. It provides specific information about the functional context of genes, binding sites of transcription factors, and general information about the core promoter region. The tool is an alternative to cDNA alignments and allows for dealing with uncertainty in sequence classification. The system structure includes non-promoter classes, stationary Markov chains, and a promoter model utilizing generalized hidden Markov models."
McPromoter – an ancient tool to predict transcription start sites • Uwe Ohler • uwe.ohler@duke.edu • Institute for Genome Sciences and Policy, Duke University (BDGP/Univ Erlangen)
An extremely simplified view of eukaryotic transcription • Specific information about functional context of genes: proximal promoter/enhancers • Binding sites of specific transcription factors confer activation at the right developmental stage or tissue • General information: the core promoter • Region around the transcription start site (TSS) where RNA polymerase II (pol-II) interacts with general transcription factors • Potentially far away from the translation start site
Probabilistic modeling of promoters • Goal: find TSS / proximal promoters ab initio • Alternative to cDNA alignments • Independent of and in addition to gene prediction • Probabilistic modeling allows us to deal with uncertainty • Models for classes of related sequences • Models represent our knowledge about sequences in the form of parameters • Parameters are automatically estimated using a representative set of sequences • The model gives the probability that a sequence belongs to a class, here: promoter or non-promoter (coding, non-coding)
Non-promoter classes: Stationary Markov chains • Markov chain as tree • Every node corresponds to a context • Contains probability distribution • Typical order: 6 (4,096 overall parameters) • Probability of a sequence • Approximation: Restrict context to the last N symbols (N-th order chain) • Variations on Markov chains • Variable Order: Leaves on different levels • Interpolated: Combination of parameter values from different levels
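The N-th order approximation can be illustrated with a minimal sketch. It is not McPromoter's code: the function names, the add-one pseudocount smoothing, and the choice to skip the first N symbols are assumptions made for illustration. Each context (the last N symbols) holds a distribution over the next nucleotide, and the log-probability of a sequence is the sum of the conditional log-probabilities.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov_chain(sequences, order, pseudocount=1.0):
    """Estimate P(symbol | previous `order` symbols) from training sequences.

    Pseudocount smoothing is an assumption of this sketch, not taken from the slides.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, symbol):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(symbol, 0.0) + pseudocount) / total

    return prob

def log_probability(seq, prob, order):
    """Log-probability of seq under the chain (the first `order` symbols are skipped)."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

def log_odds(seq, class_prob, background_prob, order):
    """Positive values favour the class model over the background model."""
    return (log_probability(seq, class_prob, order)
            - log_probability(seq, background_prob, order))
```

With one chain trained per class (for example coding and non-coding), `log_odds` compares how well two class models explain the same window of sequence.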
Promoter model • Simple approach: Markov chain model • Better: Take structure into account • Generalized hidden Markov model • Each state contains a submodel for a specific promoter part, including an explicit length distribution • Interpolated Markov chains as submodels Ohler et al., Bioinformatics 1999, PSB 2000
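A generalized HMM with explicit length distributions can be sketched as a dynamic program over segment boundaries. This is a simplified illustration, not the published model: the states are assumed to be visited strictly left to right, each submodel is a zero-order nucleotide distribution rather than an interpolated Markov chain, and the dictionary layout (`"length"`, `"emit"`) is hypothetical.

```python
import math

def emit_log_prob(segment, emit):
    """Log-probability of a segment under a zero-order submodel (a simplification)."""
    return sum(math.log(emit[c]) for c in segment)

def best_segmentation(seq, states):
    """Viterbi-style search for the best split of seq into one segment per state.

    states: list of dicts with "length" (segment length -> probability)
            and "emit" (nucleotide -> probability); hypothetical layout.
    Returns (best log-score, list of segment end positions).
    """
    n, k = len(seq), len(states)
    neg_inf = float("-inf")
    best = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    back = [[None] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for s, state in enumerate(states, start=1):
        for end in range(n + 1):
            for length, p in state["length"].items():
                start = end - length
                if start < 0 or p <= 0.0 or best[s - 1][start] == neg_inf:
                    continue
                score = (best[s - 1][start] + math.log(p)
                         + emit_log_prob(seq[start:end], state["emit"]))
                if score > best[s][end]:
                    best[s][end] = score
                    back[s][end] = start
    if best[k][n] == neg_inf:
        return neg_inf, []
    # Trace the boundaries back from the end of the sequence.
    bounds, end = [], n
    for s in range(k, 0, -1):
        bounds.append(end)
        end = back[s][end]
    return best[k][n], bounds[::-1]
```

The explicit length term is what distinguishes this from a standard HMM, where segment lengths are implicitly geometric.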
Example: stat6 promoter http://genes.mit.edu/McPromoter.html
Evaluation of ENCODE regions • Similar problem to alternative splicing: alternative transcription start sites • Traditionally, the window to count false positives has been very large (e.g., -2,000/+2,000), and close predictions within a large window are merged • Evaluate on a per gene basis, i.e. count a true positive if it hits at least one of the annotated TSSs • Second problem: False negatives • After GASP, counting only those predictions internal to the annotated transcripts is the de facto standard • 435 genes / 1,022 different TSSs • Another problem: Circularity? (use of Eponine) Reese et al., Genome Res (2000)
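The evaluation scheme can be sketched as follows. The details are assumptions for illustration, not the procedure actually used: predictions are merged by single linkage and the highest-scoring one in each cluster is kept, strand is ignored, and a prediction that misses every window is counted as a false positive only if it lies inside an annotated transcript, otherwise as unknown. The annotation layout (`"tss"`, `"span"`) is hypothetical.

```python
def merge_predictions(predictions, max_gap=2000):
    """Merge (position, score) predictions closer than max_gap to their neighbour.

    Keeping the best-scoring prediction per cluster is an assumption of this sketch.
    """
    merged = []
    last_pos = None
    for pos, score in sorted(predictions):
        if merged and pos - last_pos <= max_gap:
            if score > merged[-1][1]:
                merged[-1] = (pos, score)
        else:
            merged.append((pos, score))
        last_pos = pos
    return merged

def evaluate_per_gene(positions, genes, upstream=2000, downstream=2000):
    """Classify predicted positions against annotated genes.

    genes: dict name -> {"tss": [positions], "span": (start, end)} (hypothetical layout).
    Returns (true positives, set of genes hit, false positives, unknown).
    """
    tp = fp = unknown = 0
    genes_hit = set()
    for pos in positions:
        hits = {name for name, g in genes.items()
                if any(t - upstream <= pos <= t + downstream for t in g["tss"])}
        if hits:
            tp += 1
            genes_hit |= hits
        elif any(g["span"][0] <= pos <= g["span"][1] for g in genes.values()):
            fp += 1  # inside an annotated transcript but far from any annotated TSS
        else:
            unknown += 1  # outside all annotation: cannot be judged
    return tp, genes_hit, fp, unknown
```

Because a gene counts as hit when any one of its annotated TSSs is matched, the number of true-positive predictions can exceed the number of genes hit.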
Results in the ENCODE region • Standard parameters, NO repeat masking, merging predictions within 2,000 nt: 695 predictions • Positive region -2,000/+2,000: 204 TP / 197 genes (sn 47%); 77 FP (sp 73%); 414 unknown • More stringent, -500/+500: 169 TP (sn 39%); 101 FP (sp 63%) • Does it make sense to move towards a more detailed evaluation?
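The reported percentages can be checked with a short calculation. The formulas are inferred, not stated on the slides: sensitivity is taken as TP divided by the 435 annotated genes, and specificity as TP / (TP + FP), the gene-finding convention that corresponds to what is elsewhere called precision. These choices are assumptions that happen to reproduce the rounded figures.

```python
N_GENES = 435  # annotated genes in the evaluated regions, from the slides

def sensitivity(tp, n_genes=N_GENES):
    """Fraction of annotated genes recovered (assumed definition)."""
    return tp / n_genes

def specificity(tp, fp):
    """TP / (TP + FP), i.e. precision in current terminology (assumed definition)."""
    return tp / (tp + fp)

# (window, TP, FP) as reported on the slide
results = [("-2,000/+2,000", 204, 77), ("-500/+500", 169, 101)]

for window, tp, fp in results:
    print(f"{window}: sn {sensitivity(tp):.1%}, sp {specificity(tp, fp):.1%}")
```

The three categories for the wide window also add up to the total number of predictions: 204 + 77 + 414 = 695.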
Thanks to... • Berkeley Drosophila Genome Project: Gerry Rubin, Martin Reese, Suzi Lewis • Erlangen – Institute for Computer Science: Heinrich Niemann, Stefan Harbeck, Georg Stemmer