Comparative Analysis of Promoter Sequences: The discovery of the Pribnow-box and some follow-up discoveries

Comparative Analysis of Promoter Sequences: The discovery of the Pribnow-box and some follow-up discoveries. Philipp Bucher In Silico Analysis of Proteins Celebrating the 20th Anniversary of Swiss-Prot Fortaleza – Brazil, Aug 3 2006.

Presentation Transcript


  1. Comparative Analysis of Promoter Sequences: The discovery of the Pribnow-box and some follow-up discoveries. Philipp Bucher. In Silico Analysis of Proteins, Celebrating the 20th Anniversary of Swiss-Prot, Fortaleza – Brazil, Aug 3 2006

  2. Why a talk on promoters at a protein meeting? Aren’t promoters DNA sequences? No, promoters are not DNA sequences. Any general representation of promoters, or algorithm to predict promoters, does not relate to intrinsic properties of DNA. In fact, a profile or hidden Markov model representing promoter sequences constitutes a description of the DNA-binding surfaces of a protein in terms of base pair preferences. Not surprisingly therefore, the first consensus sequence for an E. coli promoter element was derived from seven sequences originating from six different species, including a eukaryotic virus.

  3. Early comparative analysis of E.coli promoter sequences FIG. 4. Comparison of promoter sequences (see text). b, Homologous sequence probably engaged by RNA polymerase; i, mRNA initiation point (underlined). Hyphens have been omitted. SV40, simian virus 40; w.t., wild type. Among the promoter sequences, there is a homologous, 7-base sequence lying to the left of the initiation points. I feel that the DNA sequence 5' T-A-T-Pu-A-T-G 3' 3' A-T-A-Py-T-A-C 5' is implicated in the formation of a tight binary complex with RNA polymerase. Text and Figures from: Pribnow (1975) Proc. Nat. Acad. Sci. USA 72, 784-788.
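Pribnow's 7-base sequence T-A-T-Pu-A-T-G can be written as the IUPAC consensus TATRATG, where R stands for a purine (A or G). The following is a minimal sketch of how such a consensus can be matched against a DNA string. The function names and the test sequence are illustrative assumptions, not taken from the paper or the talk.

```python
import re

# Subset of the IUPAC nucleotide codes, enough for this example.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "N": "[ACGT]",  # any base
}

def consensus_to_pattern(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[symbol] for symbol in consensus)

def find_consensus(seq, consensus):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    pattern = consensus_to_pattern(consensus)
    # A lookahead is used so that overlapping matches are also reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", seq.upper())]

# Hypothetical sequence containing TATGATG and TATAATG.
hits = find_consensus("GGTATGATGCCTATAATGTT", "TATRATG")
```

A consensus match is all-or-nothing; the weight matrices discussed later in the talk replace it with a graded score.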

  4. E. coli promoters: Chapter 2 A second sequence motif, located about 35 bp upstream of the initiation site (the -35 element), was discovered based on a larger promoter sequence collection.

  5. E.coli promoters: Chapter 3 The figure below illustrates the concept of functional homology between two promoter sequences. In particular, these footprint results confirm that the -35 and -10 elements are correctly assigned even though the spacing between the two elements is different (Siebenlist et al. 1980, Cell 20, 269-281).

  6. The program TargSearch implements an early sequence profile method using position-specific residue weights and scores for alternative spacer lengths.
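The idea of combining position-specific residue weights with scores for alternative spacer lengths can be sketched as follows. This is not the TargSearch implementation: the match/mismatch weights and the spacer penalties are invented for illustration, and only the textbook consensus sequences TTGACA (-35) and TATAAT (-10) with a preferred 17 bp spacer are taken as given. The input is assumed to contain only A, C, G and T.

```python
def make_matrix(consensus, match=1.0, mismatch=-1.0):
    """Build a toy weight matrix (one dict per position) from a consensus."""
    return [{b: (match if b == c else mismatch) for b in "ACGT"}
            for c in consensus]

MINUS35 = make_matrix("TTGACA")   # hypothetical weights
MINUS10 = make_matrix("TATAAT")   # hypothetical weights
# Hypothetical scores for alternative spacer lengths (bp between the elements).
SPACER_SCORE = {15: -2.0, 16: -1.0, 17: 0.0, 18: -1.0, 19: -2.0}

def score_window(seq, start, matrix):
    """Sum the position-specific weights of the window starting at `start`."""
    return sum(col[seq[start + i]] for i, col in enumerate(matrix))

def best_promoter(seq):
    """Return (score, start of -35 element, spacer length) of the best hit,
    or None if the sequence is too short for any element pair."""
    best = None
    for p35 in range(len(seq) - len(MINUS35) + 1):
        s35 = score_window(seq, p35, MINUS35)
        for spacer, s_spacer in SPACER_SCORE.items():
            p10 = p35 + len(MINUS35) + spacer
            if p10 + len(MINUS10) > len(seq):
                continue  # -10 window would run past the end
            total = s35 + s_spacer + score_window(seq, p10, MINUS10)
            if best is None or total > best[0]:
                best = (total, p35, spacer)
    return best

# Hypothetical test sequence with a perfect 17 bp spacer.
example = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GG"
result = best_promoter(example)
```

Because the spacer length is scored rather than fixed, two promoters with different spacing can both be aligned to the same pair of elements, which is the point made by the footprinting comparison on the previous slide.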

  7. Prediction of the rate constant for open complex formation with TargSearch scores

  8. Early work on E. coli promoters: Important contributions to computational biology • Representation of functional molecular sequence motifs by IUPAC consensus sequences and weight matrices. • A definition of functional homology and an experimental criterion for correct alignment of DNA sequence motifs. • Prediction algorithms using profile or HMM-like target descriptions. • The idea that quantitative promoter prediction scores can, and perhaps should, be viewed as predictors of a protein property: the selectivity of RNA polymerase for a particular DNA ligand sequence.

  9. Eukaryotic promoters: Differences with regard to E. coli promoters and other biological facts • Eukaryotic polymerases do not have intrinsic affinity to specific promoter sequences. • Eukaryotic promoters are recognized by a variety of transcription factors, each recognizing a specific target motif. • The binding sites of proteins which direct RNA polymerase to the promoter may be located at larger and more variable distances from the initiation sites. Moreover, these sites may occur in either orientation, or even downstream of the start site. • Tissue and developmental stage-specificity. • Epigenetic silencing mediated by chromatin condensation or DNA methylation.

  10. EPD Essentials Promoter definition: An experimentally mapped transcription initiation site. Important assumption: A capped 5’end of a eukaryotic mRNA is generated by transcriptional initiation, not endonucleolytic cleavage. Primary data: (i) RNA sequencing, nuclease protection, primer extension data published in journal articles, (ii) 5’ESTs from cDNA clones obtained with the oligo-capping method (only recently). Purpose: (i) Comparative analysis of promoter elements, (ii) training and test set for promoter prediction algorithms, (iii) resource for experimental researchers.

  11. Signal Search Analysis Essentials • History: Signal Search Analysis is an ancient method developed by myself in the early eighties in Max Birnstiel’s lab in Zurich (first published in 1984). • Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences. • Recent event: Adaptation of software to new environment, SSA web server, application to promoters and translational start sites. • Note the difference: SSA programs serve to characterize motifs that occur at constrained distances from sites, not motifs that are over-represented within sequence sets. There are hundreds of programs that address the latter problem, but only very few that serve the same purpose as the SSA programs!

  12. Locally Over-represented Sequence Motifs

  13. TATA-box Signal Occurrence Profile for Human Promoters

  14. Definition of a Locally Over-represented Sequence Motif The definition of a locally over-represented sequence motif has three components: • A weight matrix or consensus sequence defining the motif • A cut-off value • A preferred region of occurrence with respect to a functional site, e.g. a transcription initiation site. The weight matrix or consensus sequence allows one to compute a match score for any subsequence of a promoter that has the same length as the matrix. The cut-off value determines which subsequences constitute a motif match. The preferred region is the third criterion necessary to decide whether a given promoter contains a given locally over-represented sequence motif or not. The difference in occurrence frequency inside and outside of the preferred region can be used as an objective function to optimize the three components of a locally over-represented sequence motif listed above.
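The three components and the objective function described above can be sketched in a few lines. This is an illustrative reading of the definition, not the SSA software: the toy matrix, the cut-off, the region boundaries and all function names are assumptions. Sequences are assumed to be aligned so that the functional site sits at the same index `site_index` in each, and the region is assumed to leave at least one window position on either side.

```python
def make_matrix(consensus, match=1.0, mismatch=-1.0):
    """Toy weight matrix: one dict of base weights per motif position."""
    return [{b: (match if b == c else mismatch) for b in "ACGT"}
            for c in consensus]

def window_score(seq, start, matrix):
    """Match score of the subsequence of matrix length starting at `start`."""
    return sum(col[seq[start + j]] for j, col in enumerate(matrix))

def occurrence_difference(seqs, site_index, matrix, cutoff, region):
    """Match frequency inside the preferred region minus frequency outside.

    `region` is an inclusive (first, last) pair of positions relative to the
    functional site, referring to the first base of a motif match.
    """
    first, last = region
    inside_hits = inside_sites = outside_hits = outside_sites = 0
    for seq in seqs:
        for start in range(len(seq) - len(matrix) + 1):
            is_match = window_score(seq, start, matrix) >= cutoff
            if first <= start - site_index <= last:
                inside_sites += 1
                inside_hits += is_match
            else:
                outside_sites += 1
                outside_hits += is_match
    return inside_hits / inside_sites - outside_hits / outside_sites

# Two hypothetical 16 bp "promoters" with the site at index 10.
toy_seqs = ["GGGGTATAGGGGGGGG", "GGGGGGGGGGGGGGGG"]
objective = occurrence_difference(
    toy_seqs, site_index=10, matrix=make_matrix("TATA"),
    cutoff=4.0, region=(-8, -4))
```

An optimization procedure would vary the matrix, the cut-off and the region boundaries and keep the changes that increase this value.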

  15. An algorithm to optimize a locally over-represented sequence motif

  16. A weight matrix definition for the TATA-box motif See also: Bucher 1990, J. Mol. Biol. 212, 563-578.

  17. Promoter prediction Benchmark results from Fickett & Hatzigeorgiou 1997, Genome Res. 7, 861-878. Note: The false/random discovery rates (about 1 in 1 kb) are about 2 orders of magnitude too high if one assumes one promoter per 100 kb for the human genome (perhaps an underestimation). At this unacceptably high false discovery rate the sensitivity barely exceeds 50% for most of the programs.
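The arithmetic behind the note can be made explicit. The sketch below uses only the round numbers quoted on the slide (one false prediction per 1 kb, one true promoter per 100 kb, 50% sensitivity); the function name is an assumption.

```python
def expected_precision(false_per_kb, promoters_per_kb, sensitivity):
    """Fraction of predictions expected to be true promoters."""
    true_per_kb = promoters_per_kb * sensitivity
    return true_per_kb / (true_per_kb + false_per_kb)

false_per_kb = 1.0            # about 1 false prediction per 1 kb
promoters_per_kb = 1 / 100    # assumed: 1 promoter per 100 kb
sensitivity = 0.5             # barely exceeds 50% for most programs

# Ratio of false predictions to real promoters: two orders of magnitude.
excess = false_per_kb / promoters_per_kb
precision = expected_precision(false_per_kb, promoters_per_kb, sensitivity)
```

Under these assumptions roughly one prediction in 200 would be a real promoter.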

  18. Why is eukaryotic promoter prediction so hard ? Technical reasons: • Too few promoters mapped experimentally • Low quality of experimental data resulting in inexact or wrong transcription initiation site mapping Biological reasons: • Transcription initiation appears to be often a fuzzy process. The initiation sites pertaining to one promoter may be scattered over 50 bp or more. • There may be many useless promoters giving rise to rapidly degraded non-functional transcripts. • There may be too many promoter classes recognized by different combinations of transcription factors. • Tissue and developmental stage specificity. Most promoters are in fact silent in most tissues. Promoter prediction is partly a tissue-specific problem.

  19. Progress may come from new technologies Introduction of high throughput technologies for cDNA (mRNA) 5’end sequencing. Recent papers: Oligo-capping technique: Suzuki et al. (2001) Identification and Characterization of the Potential Promoter Regions of 1031 kinds of human genes. Genome Res. 11:677-684. CAGE: Carninci et al. (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. doi:10.1038/ng1789. Close to one million 5’ tags of human transcripts have been analyzed with these techniques. Processing of cDNA 5’tags has tripled the number of promoter entries in EPD in less than two years. We have coined the term “in silico primer extension” to designate the process of TSS mapping with cDNA 5’tag data.

  20. In silico primer Extension - Essentials Purpose: • to map transcription start sites to a genome, • to study the regulation of alternative promoter usage Experimental procedures: • full-length cDNA synthesis (e.g. oligo-capping method) • Generation of 5’tags (EST sequencing, 5’SAGE, CAGE) Computational procedures: • mapping of 5’ tags to the genome, • identification of clusters in mRNA 5’end profiles
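The second computational step, finding clusters in an mRNA 5’end profile, can be illustrated with a very simple gap-based grouping of mapped tag positions. This is only a sketch of the idea and not the clustering method used in the pipeline described in this talk; the `max_gap` and `min_tags` parameters and the example coordinates are invented.

```python
from collections import Counter

def cluster_tag_starts(tag_starts, max_gap=10, min_tags=3):
    """Group genomic 5' end positions into candidate TSS clusters.

    Positions closer than `max_gap` to their left neighbour join the same
    cluster. Clusters with fewer than `min_tags` tags are discarded.
    Returns (first, last, tag count, most frequent position) per cluster.
    """
    groups, current = [], []
    for pos in sorted(tag_starts):
        if current and pos - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(pos)
    if current:
        groups.append(current)

    clusters = []
    for group in groups:
        if len(group) >= min_tags:
            peak = Counter(group).most_common(1)[0][0]
            clusters.append((group[0], group[-1], len(group), peak))
    return clusters

# Hypothetical mapped 5' tag start coordinates on one chromosome strand.
tags = [100, 101, 101, 103, 150, 151, 151, 151, 300]
clusters = cluster_tag_starts(tags)
```

The singleton at position 300 is dropped, which mirrors the intuition that an isolated 5’ tag is weak evidence for a promoter.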

  21. Promoter region defined by transcription start sites (TSS) [Figure labels: genomic DNA; promoter; TSS; conventional primer extension experiment with gene-specific primer; cDNAs]

  22. In Silico (Digital) versus in Vitro (Analog) Primer Extension [Figure; sequence shown: ccgagtcccctcacccctttccttcccacAGGTCCCTGGCCAAAGATTTATTTCTCTTGACAACCA]

  23. Our in Silico Primer Extension Pipeline [Flow diagram labels: GenBank/EMBL 5’ EST entries of selected libraries; UniGene entry; RefSeq entry; BLAST; trace files; cDNA 5’tag (50 nt); genome sequence (2 kb); profile-based multiple sequence alignment method; zero to several promoter entries; 1-D clustering by MADAP; mRNA 5’end profile]

  24. Definition of Promoter Sites and Classes from cDNA 5’end Profiles with the Program MADAP [Figure: number of 5’ends of NEDO transcripts versus genomic position; scale bars 10 bp and 45 bp; two regions marked R: 84047148-84047231 and 84046905-84046987]

  25. In silico PE versus conventional techniques [Figure: number of 5’ends of DBTSS transcripts versus genomic position; scale bar 100 bp] Characterization of three optional promoters in the 5' region of the human aldolase A gene. Maire P. et al. (1987) J. Mol. Biol. 197, 425-438

  26. Is in silico primer extension really accurate and reliable enough for promoter analysis ?

  27. Comparative Evaluation of Human Promoter Sets Compiled by Different Methods Questions addressed: • What is the overlap and agreement in transcription start site definitions between the four data sets? • Is any of the data sets contaminated by a substantial number of non-promoter sequences? • Which method defines the transcription start site most accurately? • Is any of the four promoter compilations biased with regard to promoter subclasses?

  28. Comparative Evaluation of Human Promoter Sets Compiled by Different Methods Goal of the project: to compare four different promoter (transcription start site) compilations: • EPD: manually compiled promoter compilation based primarily on nuclease protection and primer extension experiments published in the biological journal literature. • PRESTA: Automatically compiled promoter collection relying on author-submitted sequence feature annotations in EMBL sequence entries and confirmatory evidence from public EST sequences. • DBTSS (NEDO): Transcription start sites inferred from 5’end sequences of full-length enriched cDNA libraries obtained with the oligo-capping method. • MGC: Transcription start sites inferred from 5’end sequences of full-length enriched cDNA libraries from the Mammalian Gene Collection (MGC) program.

  29. Promoter Elements and Sequence Properties used for the Evaluation of Different Promoter Sets Locally over-represented sequence motifs: • TATA-box: site selector element, occurs around position –27, estimated frequency in human promoters: 64%. • Initiator: site selector element, presumably occurs exactly at initiation site, estimated frequency in human promoters: 50%. • CCAAT-box: upstream promoter element, occurs in a large upstream region with peak frequency at –80, estimated frequency in human promoters: 23%. • GC-box: upstream promoter element, occurs in a large upstream region with peak frequency at –50, estimated frequency in human promoters: 52%. Other known sequence features: • CpG islands: regions of 200-1000 bp with a ratio of CpGobs / CpGexp > 0.6 and a C+G content > 50%, occurs around transcription initiation sites, estimated frequency based on promoters in EPD: 39%.
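The CpG island criteria quoted above can be checked for a single sequence window as sketched below. The observed/expected ratio is computed with the commonly used formula (number of CpG dinucleotides × length) / (number of C × number of G); that the slide uses exactly this formula is an assumption, as are the function names. A real island finder would also slide the window along the genome and merge qualifying windows, which is not shown here.

```python
def cpg_stats(seq):
    """Return (C+G content, CpG observed/expected ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c_count = seq.count("C")
    g_count = seq.count("G")
    observed = seq.count("CG")            # CpG dinucleotides
    expected = c_count * g_count / n      # expectation from base composition
    ratio = observed / expected if expected > 0 else 0.0
    return (c_count + g_count) / n, ratio

def meets_cpg_island_criteria(seq, min_len=200, max_len=1000,
                              min_ratio=0.6, min_gc=0.5):
    """Apply the length, obs/exp and C+G thresholds quoted on the slide."""
    gc_content, ratio = cpg_stats(seq)
    return (min_len <= len(seq) <= max_len
            and ratio > min_ratio and gc_content > min_gc)

# Artificial extremes, for illustration only.
cpg_rich = "CG" * 100   # 200 bp, every dinucleotide step is CpG or GpC
at_only = "AT" * 100    # 200 bp without any C or G
```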

  30. TATA-box Profiles for Four Different Promoter Sets

  31. Initiator Profiles for Four Different Promoter Sets

  32. CCAAT-box Profiles for Four Different Promoter Sets

  33. GC-box Profiles for Four Different Promoter Sets

  34. In silico analysis of larger promoter sequence sets. The previous results have shown that in silico primer extension is accurate, perhaps even more accurate than conventional methods. However: Was data set size really the bottleneck in promoter analysis? Have we already gained new insights into promoter structure from analyzing larger promoter sets defined by in silico primer extension? A recent study of about 2000 Drosophila promoters may give a preliminary answer to this question.

  35. The best conserved and most abundant Drosophila core promoter elements as found by Uwe Ohler and coworkers

  36. In particular, the most significant and undoubtedly most frequent, most conserved, and thus probably most important Drosophila promoter element corresponds to the following motif: [motif shown as a figure on the slide] 30 years of very intensive and expensive wet lab molecular biology research has not uncovered that motif!!!

  37. Back to Proteins: What is the protein that binds to the most important promoter element of Drosophila? Guesses from the audience may be sent to: Philipp.Bucher@isb-sib.ch
