
CSCE555 Bioinformatics

CSCE555 Bioinformatics. Lecture 11 Promoter Prediction Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555. HAPPY CHINESE NEW YEAR. University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu.


Presentation Transcript


  1. CSCE555 Bioinformatics Lecture 11 Promoter Prediction Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 HAPPY CHINESE NEW YEAR University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu.

  2. Outline • Introduction to DNA Motif • Motif Representations (Recap) • Motif database search • Algorithms for motif discovery

  3. Search Space • Motif width = W • Number of sequences = N • Sequence length = L • Size of search space = (L - W + 1)^N • Example: L = 100, W = 15, N = 10 → size ≈ 10^19
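
The search-space estimate on this slide can be checked with a few lines of Python. This is a minimal sketch; the function and parameter names are illustrative and do not come from the lecture.

```python
def search_space_size(seq_length: int, motif_width: int, num_sequences: int) -> int:
    """Number of ways to pick one motif start position in each sequence."""
    # Each sequence of length L offers L - W + 1 possible start positions.
    starts_per_sequence = seq_length - motif_width + 1
    # One independent choice per sequence gives (L - W + 1) ** N combinations.
    return starts_per_sequence ** num_sequences


size = search_space_size(seq_length=100, motif_width=15, num_sequences=10)
print(f"{size:.2e}")  # on the order of 10^19, as stated on the slide
```

Even this small example is far too large to enumerate exhaustively, which is the motivation for stochastic search methods such as Gibbs sampling.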

  4. Worked Example • N = 4 • p_i = ¼ • c_ki = (values not preserved in this transcript) • Score = 1.99 - 0.50 + 0.20 + 0.60 = 2.29

  5. Gibbs Sampling Search • Suppose the search space is a 2D rectangle. (Typically, more than 2 dimensions!) • Start at a random point X. • Randomly pick a dimension. • Look at all points along this dimension. • Move to one of them randomly, with probability proportional to its score π. • Repeat.

  6. Gibbs Sampling for Motif Search • Choose a random starting state. • Randomly pick a sequence. • Look at all motif positions in this sequence. • Pick one randomly, with probability proportional to exp(score). • Repeat.
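
The Gibbs sampling steps for motif search can be sketched in Python as follows. This is a hedged illustration, not the lecture's implementation: the log-odds scoring with pseudocounts, the uniform background of 0.25, the practice of holding out the chosen sequence when building the matrix, and all function names are assumptions added here. Sequences are assumed to contain only A, C, G and T.

```python
import math
import random

BASES = "ACGT"


def build_log_odds(sites, width, pseudocount=1.0, background=0.25):
    """Log2-odds matrix (one dict per motif column) built from aligned sites."""
    matrix = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return matrix


def score(matrix, window):
    """Sum of per-column log-odds for one candidate motif window."""
    return sum(col[base] for col, base in zip(matrix, window))


def gibbs_motif_search(seqs, width, iterations=1000, seed=0):
    rng = random.Random(seed)
    # Choose a random starting state: one motif start per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        # Randomly pick a sequence; build the matrix from all the others.
        k = rng.randrange(len(seqs))
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(seqs, starts)) if j != k]
        matrix = build_log_odds(others, width)
        # Look at all motif positions in the chosen sequence.
        candidates = list(range(len(seqs[k]) - width + 1))
        # Scores are base-2 log-odds, so 2 ** score plays the role of exp(score).
        weights = [2 ** score(matrix, seqs[k][p:p + width]) for p in candidates]
        # Pick one position randomly, proportional to its weight.
        starts[k] = rng.choices(candidates, weights=weights)[0]
    return starts
```

Because the sampler is stochastic, it is usually run several times from different seeds, and the highest-scoring alignment is kept.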

  7. Does it Work in Practice? • Only successful cases get published! • Seems more successful in microbes (bacteria & yeast) than in animals. • The search algorithm seems to work quite well; the problem is the scoring scheme: real motifs often don't score higher than what you would find in random sequences by chance. I.e., the needle looks like hay. • Attempts to deal with this: • Assume the motif is an inverted palindrome (they often are). • Only analyze sequence regions that are conserved in another species (e.g. human vs. mouse). • As usual, repetitive sequences cause problems. • More powerful algorithm: MEME

  8. Go to our MEME server: • http://molgen.biol.rug.nl/meme/website/meme.html • Fill in your email address and a description of the sequences • Open the FASTA-formatted file you just saved with Genome2d (click “Browse”) • Select the number of motifs, the number of sites and the optimum width of the motif • Click “Search given strand only” • Click “Start search”

  9. Something like this will appear in your email. The results are quite self-explanatory.

  10. Promoter Prediction • What are promoters? • Three strategies for promoter prediction • Signal based • Comparative genomics/phylogenetic footprinting • Expression-profile-based de novo motif discovery algorithms

  11. What is a Promoter? Region of gene that binds RNA polymerase and transcription factors to initiate transcription

  12. Promoters: What signals are there? Simple ones in prokaryotes

  13. Prokaryotic promoters • RNA polymerase complex recognizes promoter sequences located very close to & on 5’ side (“upstream”) of initiation site • RNA polymerase complex binds directly to these, with no requirement for “transcription factors” • Prokaryotic promoter sequences are highly conserved • -10 region • -35 region
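
As a rough illustration of a signal-based search for the -35 and -10 regions, the sketch below looks for exact matches to the textbook E. coli sigma-70 consensus hexamers (TTGACA and TATAAT). The hexamers, the 15-19 bp spacer range and the function name are assumptions added here, not taken from the slides; real promoters usually deviate from the consensus, so an exact-match search like this one will miss most of them.

```python
import re

# Assumed consensus: -35 box TTGACA, spacer of 15-19 bases, -10 box TATAAT.
# The lookahead lets overlapping candidates be reported as well.
PROMOTER_RE = re.compile(r"(?=(TTGACA[ACGT]{15,19}TATAAT))")


def find_consensus_promoters(seq):
    """Return (start, matched_text) for each exact consensus match."""
    seq = seq.upper()
    return [(m.start(1), m.group(1)) for m in PROMOTER_RE.finditer(seq)]


example = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
print(find_consensus_promoters(example))
```

Allowing mismatches, or replacing the two hexamers with weight matrices, is the usual next step.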

  14. What signals are there? Complex ones in eukaryotes

  15. Eukaryotic genes are transcribed by 3 different RNA polymerases • They recognize different types of promoters & enhancers:

  16. Eukaryotic promoters & enhancers • Promoters located “relatively” close to initiation site (but can be located within gene, rather than upstream!) • Enhancers also required for regulated transcription (these control expression in specific cell types, developmental stages, in response to environment) • RNA polymerase complexes do not specifically recognize promoter sequences directly • Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes

  17. Eukaryotic transcription factors • Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription • TFs contain characteristic “DNA binding motifs” http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039 • TFs recognize specific short DNA sequence motifs (“transcription factor binding sites”) • Several databases for these, e.g. TRANSFAC http://www.generegulation.com/cgibin/pub/databases/transfac

  18. Zinc finger-containing transcription factors • Common in eukaryotic proteins • Estimated 1% of mammalian genes encode zinc-finger proteins • In C. elegans, there are 500! • Can be used as highly specific DNA binding modules • Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy

  19. Predicting Promoters • Overview of strategies • What sequence signals can be used? • What other types of information can be used? • Algorithms • Promoter prediction software • 3 major types • many, many programs

  20. Promoter prediction: Eukaryotes vs prokaryotes • Promoter prediction is easier in microbial genomes • Why? • Highly conserved • Simpler gene structures • More sequenced genomes! (for comparative approaches) • Methods? • Previously, again mostly HMM-based • Now: • similarity-based • comparative methods (because so many genomes are available) • de novo motif discovery

  21. Predicting promoters: Steps & Strategies • Closely related to gene prediction • Obtain genomic sequence • Use sequence-similarity based comparison (BLAST, MSA) to find related genes • But: "regulatory" regions are much less well-conserved than coding regions • Locate ORFs • Identify TSS (if possible!) • Use promoter prediction programs (e.g. FirstEF) • Analyze motifs, etc. in sequence (TRANSFAC)

  22. Automated promoter prediction strategies • Pattern-driven algorithms • Sequence-similarity based algorithms • Combined "evidence-based" • BEST RESULTS? Combined, sequential

  23. 1: Promoter Prediction: Pattern-driven algorithms • Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO) • Tend to produce huge numbers of false positives (FPs) • Why? • Binding sites (BS) for specific TFs are often variable • Binding sites are short (typically 5-15 bp) • Interactions between TFs (& other proteins) influence affinity & specificity of TF binding • One binding site is often recognized by multiple TFs • Biology is complex: promoters are often specific to organism/cell/stage/environmental condition
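
A pattern-driven scan can be sketched as building a log-odds weight matrix from a collection of known sites and sliding it along a sequence. The example sites below are invented for illustration only (they are not TRANSFAC or PROMO entries), and the pseudocount, uniform background and threshold are assumptions.

```python
import math

BASES = "ACGT"


def log_odds_matrix(sites, pseudocount=0.5, background=0.25):
    """Build a log2-odds weight matrix from equal-length binding sites."""
    width = len(sites[0])
    matrix = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return matrix


def scan(seq, matrix, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        s = sum(col[base] for col, base in zip(matrix, window))
        if s >= threshold:
            hits.append((start, window, round(s, 2)))
    return hits


# Hypothetical TATA-like sites, made up for this example.
example_sites = ["TATAAA", "TATATA", "TATAAG", "TTTAAA"]
pwm = log_odds_matrix(example_sites)
hits = scan("GGCTATAAAGGCGTATATACC", pwm, threshold=5.0)
for hit in hits:
    print(hit)
```

Lowering the threshold quickly increases the number of hits, which is one way to see why short, variable binding sites generate so many false positives.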

  24. Solutions to problem of too many FP predictions? • Take sequence context/biology into account • Eukaryotes: clusters of TFBSs are common • Prokaryotes: knowledge of σ factors helps • Probability of "real" binding site increases if annotated transcription start site (TSS) nearby • But: What about enhancers? (no TSS nearby!) & Only a small fraction of TSSs have been experimentally mapped • CpG islands upstream of the promoter and around the TSS • TATA box, CCAAT box • Content information: hexamer frequency
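
One of the context signals listed here, CpG islands, can be detected with a simple sliding window. The sketch below uses the widely cited Gardiner-Garden and Frommer style thresholds (GC content above 50% and observed/expected CpG ratio above 0.6 in a 200 bp window); these thresholds and the function names are assumptions added for illustration, not values from the lecture.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def cpg_island_windows(seq, window=200, step=1, min_gc=0.5, min_obs_exp=0.6):
    """Start positions of windows that pass both CpG-island criteria."""
    starts = []
    for start in range(0, len(seq) - window + 1, step):
        gc_content, obs_exp = cpg_stats(seq[start:start + window])
        if gc_content > min_gc and obs_exp > min_obs_exp:
            starts.append(start)
    return starts


# Synthetic example: an AT-rich flank, a CpG-dense core, another AT-rich flank.
toy = "AT" * 100 + "CG" * 100 + "AT" * 100
print(cpg_island_windows(toy, window=200, step=100))
```

In practice, overlapping windows that pass are merged into a single island and a minimum island length is enforced.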

  25. Why can't we rely on consensus sequences? • The Inr (Initiator) consensus sequence will appear once every 512 bp in random sequences • For the TATA box, once every 120 bp • Short sequence patterns can appear by chance with high likelihood (false positives)
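
The chance-occurrence argument can be made concrete with a small calculation. Assuming independent, equally frequent bases on a single strand, a consensus written in IUPAC codes is expected to match about once every 1/p positions, where p is the product of the per-position match probabilities. The example patterns below are assumptions chosen for illustration; the slide's figures of 512 bp and 120 bp depend on the exact consensus or matrix threshold used, which the slide does not give.

```python
# Standard IUPAC nucleotide codes and the bases each one matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expected_spacing(consensus):
    """Expected number of bp between chance matches in uniform random DNA."""
    p = 1.0
    for symbol in consensus.upper():
        p *= len(IUPAC[symbol]) / 4
    return 1 / p


# Illustrative patterns only (assumed forms of the Inr and TATA consensus).
print(expected_spacing("YYANWYY"))  # degenerate 7-mer
print(expected_spacing("TATAAA"))   # exact 6-mer
```

A degenerate 7-mer matches far more often by chance than an exact 6-mer, so a genome of millions of bases contains thousands of spurious matches to either.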

  26. 2: Promoter Prediction: Phylogenetic Footprinting • Assumption: common functionality can be deduced from sequence conservation • Comparative promoter prediction ("phylogenetic footprinting"): rVista, ConSite, PromH, FootPrinter • For comparative (phylogenetic) methods • Must choose appropriate species • Different genomes evolve at different rates • Classical alignment methods have trouble with translocations and inversions in the order of functional elements • If the entire region is highly conserved, comparison is useless • Not enough data (Prokaryotes >>> Eukaryotes) • Biology is complex: many (most?) regulatory elements are not conserved across species!
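
The core idea of phylogenetic footprinting, finding windows that are unusually conserved between two aligned orthologous regions, can be sketched as a sliding-window identity calculation. This is a simplified illustration and not how rVista, ConSite, PromH or FootPrinter are implemented; the window size, identity cutoff and function name are assumptions.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Return (start, identity) for alignment windows at or above the cutoff.

    aln_a and aln_b are two rows of a pairwise alignment (same length,
    '-' for gaps). Gap columns never count as matches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    for start in range(len(aln_a) - window + 1):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Toy alignment: a conserved 10-column block followed by a diverged tail.
human = "ACGTACGTAA" + "T" * 10
mouse = "ACGTACGTAA" + "G" * 10
print(conserved_windows(human, mouse, window=10, min_identity=0.9))
```

If the whole region scores above the cutoff, as happens with very closely related species, the method has no contrast to work with, which is the point made on the slide about background conservation.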

  27. 3: Promoter Prediction: Co-expression based algorithms • Problems: • Need sets of co-regulated genes • Genes experimentally determined to be co-regulated (using microarrays??) • Careful: How do we determine co-regulation? • Alignments of co-regulated genes should highlight elements involved in regulation • Algorithms: MEME, AlignACE, PhyloCon

  28. Examples of promoter prediction/characterization software • MATCH, MatInspector • TRANSFAC • MEME & MAST • BLAST, etc. • Others? • FirstEF • Dragon Promoter Finder (these are links in PPTs); also see Dragon Genome Explorer (has specialized promoter software for GC-rich DNA, finding CpG islands, etc.) • JASPAR

  29. TRANSFAC matrix entry: for TATA box • Fields: • Accession & ID • Brief description • TFs associated with this entry • Weight matrix • Number of sites used to build (How many here?) • Other info

  30. Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)

  31. Check out optional review & try associated tutorial: Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287 http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html Check this out: http://www.phylofoot.org/NRG_testcases/ D Dobbs ISU - BCB 444/544X: Promoter Prediction (really!)

  32. Annotated lists of promoter databases & promoter prediction software • URLs from Mount Chp 9, available online Table 9.12 http://www.bioinformaticsonline.org/links/ch_09_t_2.html • Table in Wasserman & Sandelin Nat Rev Genet article http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm • URLs for Baxevanis & Ouellette, Chp 5: http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links More lists: • http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter • http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104 • http://www3.oup.co.uk/nar/database/subcat/1/4/

  33. Summary • Promoter & gene regulation • 3 types of methods for promoter prediction • Many programs have sensitivity and specificity less than 0.5 • Integrative algorithms are more promising
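
The accuracy figures mentioned in the summary can be computed from counts of true and false predictions. In the gene and promoter prediction literature, "specificity" is often reported as TP / (TP + FP), i.e. the fraction of predictions that are correct; whether this slide uses that definition is an assumption here, and the counts in the example are invented.

```python
def prediction_metrics(tp, fp, fn):
    """Return (sensitivity, fraction of predictions that are correct)."""
    # Sensitivity: fraction of real promoters that were predicted.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    # Assumed "specificity" in the TP / (TP + FP) sense used in this field.
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv


# Hypothetical counts: 40 correct predictions, 60 false ones, 60 missed promoters.
print(prediction_metrics(tp=40, fp=60, fn=60))
```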

  34. Acknowledgement • Zhiping Weng (Boston Uni.)
