
BCB 444/544



Presentation Transcript


  1. BCB 444/544 Lecture 28 Gene Prediction - finish it Promoter Prediction #28_Oct29 BCB 444/544 F07 ISU Dobbs #28- Promoter Prediction

  2. Required Reading (before lecture) Mon Oct 29 - Lecture 28 Promoter & Regulatory Element Prediction • Chp 9 - pp 113 - 126 Wed Oct 31 - Lecture 29 Phylogenetics Basics • Chp 10 - pp 127 - 141 Thurs Nov 1 - Lab 9 Gene & Regulatory Element Prediction Fri Nov 2 - Lecture 30 Phylogenetic Tree Construction Methods & Programs • Chp 11 - pp 142 - 169

  3. Assignments & Announcements Mon Oct 29 - HW#5 - will be posted today HW#5 = Hands-on exercises with phylogenetics and tree-building software Due: Mon Nov 5 (not Fri Nov 1 as previously posted)

  4. BCB 544 "Team" Projects Last week of classes will be devoted to Projects • Written reports due: Mon Dec 3 (no class that day) • Oral presentations (20-30 min) will be: Wed-Fri Dec 5, 6, 7 • 1 or 2 teams will present during each class period • See Guidelines for Projects posted online

  5. BCB 544 Only: New Homework Assignment 544 Extra#2 Due: √ PART 1 - ASAP; PART 2 - meeting prior to 5 PM Fri Nov 2 • Part 1 - Brief outline of Project, email to Drena & Michael; after response/approval, then: • Part 2 - More detailed outline of project: read a few papers and summarize status of problem; schedule meeting with Drena & Michael to discuss ideas

  6. Seminars this Week BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html • Nov 1 Thurs - BBMB Seminar 4:10 in 1414 MBB • Todd Yeates, UCLA • TBA - something cool about structure and evolution? • Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI • Bob Jernigan, BBMB, ISU • Control of Protein Motions by Structure

  7. Chp 8 - Gene Prediction SECTION III: GENE AND PROMOTER PREDICTION Xiong: Chp 8 Gene Prediction • Categories of Gene Prediction Programs • Gene Prediction in Prokaryotes • Gene Prediction in Eukaryotes

  8. Computational Gene Prediction: Approaches • Ab initio methods • Search by signal: find DNA sequences involved in gene expression • Search by content: Test statistical properties distinguishing coding from non-coding DNA • Similarity-based methods • Database search: exploit similarity to proteins, ESTs, cDNAs • Comparative genomics: exploit aligned genomes • Do other organisms have similar sequence? • Hybrid methods - best

  9. Computational Gene Prediction: Algorithms • Neural Networks (NNs) (more on these later…) e.g., GRAIL • Linear discriminant analysis (LDA) (see text) e.g., FGENES, MZEF • Markov Models (MMs) & Hidden Markov Models (HMMs) e.g., GeneSeqer - uses MMs; GENSCAN - uses 5th order HMMs (see text); HMMgene - uses conditional maximum likelihood (see text)

  10. Signals Search Approach: Build models (PSSMs, profiles, HMMs, …) and search against DNA. Detected instances provide evidence for genes
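The signal-search idea on slide 10 can be sketched with the simplest of the listed models, a PSSM. This is a minimal illustration, not the method of any particular program: the aligned sites, the pseudocount, and the uniform background of 0.25 are all assumptions made up for the example.

```python
import math

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from equal-length aligned sites (assumed input)."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm

def scan(sequence, pssm):
    """Score every window of the sequence; return (start, score) pairs."""
    width = len(pssm)
    return [
        (start, sum(pssm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

# Hypothetical aligned signal instances, for illustration only.
SITES = ["TATAAT", "TATGAT", "TACAAT", "TATATT"]
PSSM = build_pssm(SITES)
hits = scan("GGCGTATAATGCCG", PSSM)
best_position, best_score = max(hits, key=lambda hit: hit[1])
```

In practice a score threshold would decide which windows count as detected instances; choosing that threshold is a separate calibration step not shown here.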

  11. Content Search Observation: Encoding a protein affects statistical properties of DNA sequence: • Nucleotide/amino acid distribution • GC content (CpG islands, exon/intron) • Uneven usage of synonymous codons (codon bias) • Hexamer frequency - most discriminative of these for identifying coding potential Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions
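Two of the coding statistics on slide 11, GC content and hexamer frequency, are easy to compute directly. The sketch below only shows how the raw statistics might be obtained; the function names are hypothetical, and turning them into a coding/non-coding decision requires trained reference frequencies that are not given in the lecture.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def kmer_frequencies(seq, k=6):
    """Relative frequency of each overlapping k-mer (k=6 gives hexamers)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

# Toy input, for illustration only.
triplet_freqs = kmer_frequencies("ATGATGATG", k=3)
```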

  12. Human Codon Usage [Table: human codon usage; not preserved in transcript]

  13. Predicting Genes Based on Codon Usage Differences Algorithm: Process sliding window • Use codon frequencies to compute probability of coding versus non-coding • Plot log-likelihood ratio [equation not preserved in transcript] [Figure: coding profile of the ß-globin gene, with exons marked]
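The sliding-window algorithm on slide 13 can be sketched as follows. The slide's own equation was not preserved, so the score used here, the sum over in-frame codons of log(P_coding / P_noncoding), is an assumed standard form. The training sequence, the uniform non-coding model, and the window size are toy values chosen for illustration.

```python
import math
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_frequencies(training_seq, pseudocount=1.0):
    """Codon frequencies from an in-frame training sequence, with pseudocounts."""
    counts = {codon: pseudocount for codon in CODONS}
    for i in range(0, len(training_seq) - 2, 3):
        counts[training_seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def coding_log_odds(window, coding_freq, noncoding_freq):
    """Log-likelihood ratio of coding vs non-coding for one window."""
    return sum(
        math.log(coding_freq[window[i:i + 3]] / noncoding_freq[window[i:i + 3]])
        for i in range(0, len(window) - 2, 3)
    )

def coding_profile(seq, coding_freq, noncoding_freq, window=9, step=3):
    """Slide a window along the sequence; return (start, score) pairs to plot."""
    return [
        (start, coding_log_odds(seq[start:start + window], coding_freq, noncoding_freq))
        for start in range(0, len(seq) - window + 1, step)
    ]

# Toy models: a tiny hypothetical "coding" training set and a uniform background.
CODING = codon_frequencies("ATGGCCAAGGCCGAGGCC")
NONCODING = {codon: 1 / 64 for codon in CODONS}
profile = coding_profile("GCCGCCGCCTTTTTTTTT", CODING, NONCODING)
```

Positive scores mark windows whose codons resemble the coding model. A real profile would use much longer windows and would be computed in all reading frames.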

  14. Similarity-Based Methods: Database Search [Figure: double-stranded DNA, ATTGCGTAGGGCGCT / TAACGCATCCCGCGA] In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.) Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.). Problems: • Will not find “new” or RNA genes (non-coding genes) • Limits of similarity are hard to define • Small exons might be overlooked
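The six-frame translation step on slide 14 can be sketched in a few lines. This assumes the standard genetic code and an input containing only A, C, G, T; it shows the translation alone, not the database search that BLASTX-style tools perform afterwards.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames

# The top-strand sequence shown on the slide.
frames = six_frames("ATTGCGTAGGGCGCT")
```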

  15. Similarity-Based Methods: Comparative Genomics
      human GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
      mouse C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
      Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene Advantages: • May find uncharacterized or RNA genes Problems: • Finding suitable evolutionary distance • Finding limits of high similarity (functional regions)

  16. Human-Mouse Homology • Comparison of 1196 orthologous genes • Sequence identity between genes in human vs mouse • Exons: 84.6% • Protein: 85.4% • Introns: 35% • 5’ UTRs: 67% • 3’ UTRs: 69%

  17. Thanks to Volker Brendel, ISU for the following Figs & Slides Slightly modified from: BBSI Genome Informatics Module http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB V Brendel vbrendel@iastate.edu Brendel et al (2004) Bioinformatics 20: 1157

  18. Spliced Alignment Algorithm GeneSeqer - Brendel et al. - ISU http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi Brendel et al (2004) Bioinformatics 20: 1157 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157 • Perform pairwise alignment with large gaps in one sequence (due to introns) • Align genomic DNA with cDNA, ESTs, protein sequences • Score semi-conserved sequences at splice junctions • Using Bayesian probability model & 1st order MM • Score coding constraints in translated exons • Using Bayesian model [Figure: intron bounded by GT (donor) and AG (acceptor) splice sites] Brendel 2005
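To make the role of the GT/AG splice sites on slide 18 concrete, here is a deliberately simplified stand-in for spliced alignment. It enumerates candidate GT...AG introns and keeps those whose removal reproduces a cDNA exactly. It is not GeneSeqer's algorithm: there is no Bayesian splice-site scoring, no Markov model, no tolerance for mismatches, and only a single intron is considered. The function names, toy sequences, and minimum intron length are hypothetical.

```python
def candidate_introns(genomic, min_length=6):
    """All (start, end) spans that begin with GT and end with AG."""
    donors = [i for i in range(len(genomic) - 1) if genomic[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(genomic) - 1) if genomic[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_length]

def splice(genomic, intron):
    """Remove one intron, given as a half-open (start, end) span."""
    start, end = intron
    return genomic[:start] + genomic[end:]

def explain_cdna(genomic, cdna, min_length=6):
    """Candidate introns whose removal yields exactly the cDNA."""
    return [
        intron
        for intron in candidate_introns(genomic, min_length)
        if splice(genomic, intron) == cdna
    ]

# Toy example: two exons separated by one GT...AG intron.
GENOMIC = "ATGGCC" + "GTAAGTCCTTAG" + "GATTAA"
CDNA = "ATGGCCGATTAA"
supported = explain_cdna(GENOMIC, CDNA)
```

The toy genomic sequence contains two GT dinucleotides, so more than one candidate intron exists; the cDNA evidence is what selects between them.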

  19. Splice Site Detection Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES • Information content I_i: [equation not preserved in transcript] • Extent of splice signal window: [equation not preserved in transcript] • i: ith position in sequence • Ī: avg information content over all positions >20 nt from splice site • σ(Ī): avg sample standard deviation of Ī Brendel 2005
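The per-position information content on slide 19 can be sketched as below. The slide's equation did not survive extraction, so the formula used here, I_i = 2 + Σ_B f_iB log2 f_iB over the four bases B, is the usual Shannon-based definition and is an assumption about what the slide showed. The aligned donor-site sequences are made up for illustration.

```python
import math
from collections import Counter

def information_content(aligned_sites):
    """Information (bits) per position: 2 + sum over bases of f * log2(f)."""
    n = len(aligned_sites)
    profile = []
    for i in range(len(aligned_sites[0])):
        counts = Counter(site[i] for site in aligned_sites)
        # Bases with zero count contribute nothing (0 * log 0 is taken as 0).
        entropy_term = sum((c / n) * math.log2(c / n) for c in counts.values())
        profile.append(2.0 + entropy_term)
    return profile

# Hypothetical aligned donor sites, for illustration only.
SITES = ["GTAAGT", "GTGAGT", "GTAAGA", "GTCAGC"]
info = information_content(SITES)
```

Invariant positions such as the GT dinucleotide reach the maximum of 2 bits, and a position with uniform base usage scores 0. Comparing each I_i against the background average Ī defined on the slide is what delimits the signal window.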

  20. Information Content vs Position [Figure: information content profiles for Human T2_GT and Human T2_AG] Which sequences are exons & which are introns? How can you tell? Brendel 2005

  21. PG PG (1-PG)(1-PD(n+1)) en en+1 (1-PG)PD(n+1) PA(n)PG (1-PG)PD(n+1) in in+1 1-PA(n) Markov Model for Spliced Alignment Brendel 2005 BCB 444/544 F07 ISU Dobbs #28- Promoter Prediction

  22. Evaluation of Splice Site Prediction TP = positive instance correctly predicted as positive FP = negative instance incorrectly predicted as positive TN = negative instance correctly predicted as negative FN = positive instance incorrectly predicted as negative Fig 5.11 Baxevanis & Ouellette 2005

  23. Evaluation of Predictions (Do not memorize this!)
                        Actual True     Actual False
      Predicted True    TP              FP              PP = TP + FP
      Predicted False   FN              TN              PN = FN + TN
                        AP = TP + FN    AN = FP + TN
      • Specificity: • Sensitivity: • Misclassification rates: • Normalized specificity: [equations not preserved in transcript] [Figure labels: Predicted Positives, True Positives, False Positives, Coverage, Recall]

  24. Evaluation of Predictions - in English [Table: Actual True/False vs Predicted True/False, with cells TP, FP, FN, TN and margins PP = TP+FP, PN = FN+TN, AP = TP+FN, AN = FP+TN] • Sensitivity (= coverage = recall): In English? Sensitivity is the fraction of all positive instances having a true positive prediction, i.e., TP/AP. IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be achieved trivially by labeling all test cases positive! • Specificity: In English? Specificity is the fraction of all predicted positives that are, in fact, true positives, i.e., TP/PP. IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")
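The two definitions on slide 24 translate directly into code. The sketch below follows the lecture's convention, in which "specificity" means TP/(TP+FP), i.e. what is elsewhere called positive predictive value. The labels are toy data, and the example reproduces the slide's warning that predicting everything positive gives 100% sensitivity.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN from parallel lists of booleans."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fp, fn, tn

def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive: TP / AP."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Lecture's definition: fraction of predicted positives that are true: TP / PP."""
    return tp / (tp + fp)

# Toy data: 3 real splice sites among 10 candidates, everything predicted positive.
actual = [True] * 3 + [False] * 7
predict_all = [True] * 10
tp, fp, fn, tn = confusion_counts(actual, predict_all)
sn = sensitivity(tp, fn)
sp = specificity(tp, fp)
```

Here sensitivity is perfect while specificity is only 0.3, which is why the two measures are reported together.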

  25. Best Measures for Comparison? (Do not memorize this!) • ROC curves (Receiver Operating Characteristic (?!!)) • http://en.wikipedia.org/wiki/Roc_curve • Correlation Coefficient • Matthews correlation coefficient (MCC) • MCC = 1 for a perfect prediction • 0 for a completely random assignment • -1 for a "perfectly incorrect" prediction In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting fraction of true positives (TPR = true positive rate) vs fraction of false positives (FPR = false positive rate)
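The Matthews correlation coefficient named on slide 25 is not written out there. A minimal sketch using its standard definition, MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), is given below. Returning 0 when the denominator is zero is a common convention and an assumption here, not something stated in the lecture.

```python
import math

def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

perfect = matthews_corrcoef(tp=5, fp=0, fn=0, tn=5)
inverted = matthews_corrcoef(tp=0, fp=5, fn=5, tn=0)
uninformative = matthews_corrcoef(tp=3, fp=7, fn=0, tn=0)
```

Unlike sensitivity alone, MCC does not reward the "label everything positive" strategy: that case scores 0 under the convention above.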

  26. GeneSeqer: Input http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi Brendel 2005

  27. GeneSeqer: Output Brendel 2005

  28. GeneSeqer: Gene Evidence Summary Brendel 2005

  29. Gene Prediction - Problems & Status? Common errors? • False positive intergenic regions: • 2 annotated genes actually correspond to a single gene • False negative intergenic region: • One annotated gene structure actually contains 2 genes • False negative gene prediction: • Missing gene (no annotation) • Other: • Partially incorrect gene annotation • Missing annotation of alternative transcripts Current status? • For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries • Limitation? Training data: predictions are organism specific • Combined ab initio/homology-based predictions: Improved accuracy • Limitation? Availability of identifiable sequence homologs in databases

  30. Recommended Gene Prediction Software • Ab initio • GENSCAN: http://genes.mit.edu/GENSCAN.html • GeneMark.hmm: http://exon.gatech.edu/GeneMark/ • others: GRAIL, FGENES, MZEF, HMMgene • Similarity-based • BLAST, GenomeScan, EST2Genome, Twinscan • Combined: • GeneSeqer, http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi • ROSETTA • Consensus: because results depend on organisms & specific task, always use more than one program! • Two servers that report consensus predictions • GeneComber • DIGIT

  31. Other Gene Prediction Resources: at ISU http://www.bioinformatics.iastate.edu/bioinformatics2go/

  32. Other Gene Prediction Resources: GaTech, MIT, Stanford, etc. Lists of Gene Prediction Software http://www.bioinformaticsonline.org/links/ch_09_t_1.html http://cmgm.stanford.edu/classes/genefind/ Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!) Chapter 4 Finding Genes • 4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations • 4.2 Using MZEF To Find Internal Coding Exons • 4.3 Using GENEID to Identify Genes • 4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes • 4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm • 4.6 Eukaryotic Gene Prediction Using GeneMark.hmm • 4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome • 4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences • 4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation • 4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences

  33. Chp 9 - Promoter & Regulatory Element Prediction SECTION III: GENE AND PROMOTER PREDICTION Xiong: Chp 9 Promoter & Regulatory Element Prediction • Promoter & Regulatory Elements in Prokaryotes • Promoter & Regulatory Elements in Eukaryotes • Prediction Algorithms

  34. Eukaryotes vs Prokaryotes: Genomes Eukaryotic genomes • Are packaged in chromatin & sequestered in a nucleus • Are larger and have multiple linear chromosomes • Contain mostly non-protein coding DNA (98-99%) Prokaryotic genomes • DNA is associated with a nucleoid, but no nucleus • Much smaller, usually single, circular chromosome • Contain mostly protein encoding DNA

  35. Eukaryotes vs Prokaryotes: Gene Structure

  36. Eukaryotes vs Prokaryotes: Genes Eukaryotic genes • Are larger and more complex than in prokaryotes • Contain introns that are “spliced” out to generate mature mRNAs* • Often undergo alternative splicing, giving rise to multiple RNAs* • Are transcribed by 3 different RNA polymerases (instead of 1, as in prokaryotes) * In biology, statements such as this include an implicit “usually” or “often”

  37. Eukaryotes vs Prokaryotes: Levels of Gene Regulation Primary level of control? • Prokaryotes: Transcription initiation • Eukaryotes: Transcription is also very important, but • Expression is regulated at multiple levels, many of which are post-transcriptional: • RNA processing, transport, stability • Translation initiation • Protein processing, transport, stability • Post-translational modification (PTM) • Subcellular localization Recent important discoveries: small regulatory RNAs (miRNA, siRNA) are abundant and play very important roles in controlling gene expression in eukaryotes, often at post-transcriptional levels

  38. Eukaryotes vs Prokaryotes: Regulatory Elements • Prokaryotes: • Promoters & operators (for operons) - cis-acting DNA signals • Activators & repressors - trans-acting proteins • (we won't discuss these…) • Eukaryotes: • Promoters & enhancers (for single genes) - cis-acting • Transcription factors - trans-acting • Important difference? • What the RNA polymerase actually binds

  39. Prokaryotic Promoters • RNA polymerase complex recognizes promoter sequences located very close to and on 5’ side (“upstream”) of transcription initiation site • Prokaryotic RNA polymerase complex binds directly to promoter, by virtue of its sigma subunit - no requirement for “transcription factors” binding first • Prokaryotic promoter sequences are highly conserved: • -10 region • -35 region
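Because the -10 and -35 regions on slide 39 are highly conserved, a naive promoter scan can simply look for both boxes at a suitable spacing. The sketch below is illustrative only. The consensus hexamers TTGACA (-35) and TATAAT (-10) and the 16-18 bp spacer are the commonly cited values for E. coli sigma-70 promoters and are not given on the slide; the mismatch tolerance and the test sequence are arbitrary assumptions.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(seq, box35="TTGACA", box10="TATAAT",
                   spacers=(16, 17, 18), max_mismatches=1):
    """Return (pos35, spacer, pos10) for each -35/-10 pair within tolerance."""
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        if mismatches(seq[i:i + len(box35)], box35) > max_mismatches:
            continue
        for spacer in spacers:
            j = i + len(box35) + spacer
            candidate = seq[j:j + len(box10)]
            if len(candidate) == len(box10) and mismatches(candidate, box10) <= max_mismatches:
                hits.append((i, spacer, j))
    return hits

# Toy sequence with one planted promoter, for illustration only.
SEQ = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGG"
promoter_hits = find_promoters(SEQ)
```

Real promoter predictors replace exact consensus matching with weight matrices or HMMs, since few natural promoters match both consensus boxes perfectly.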

  40. Eukaryotic Promoters • Eukaryotic RNA polymerase complexes do not bind directly to promoter sequences • Transcription factors must bind first and serve as landmarks recognized by RNA polymerase complexes • Eukaryotic promoter sequences are less highly conserved, but many promoters (for RNA polymerase II) contain: • -30 region "TATA" box • -100 region "CCAAT" box

  41. Eukaryotic Promoters vs Enhancers Both promoters & enhancers are binding sites for transcription factors (TFs) • Promoters • essential for initiation of transcription • located “relatively” close to start site (usually <200 bp upstream, but can be located within gene, rather than upstream!) • Enhancers • needed for regulated transcription (differential expression in specific cell types, developmental stages, in response to environment, etc.) • can be very far from start site (sometimes > 100 kb)

  42. Eukaryotic genes are transcribed by 3 different RNA polymerases (Location of promoter regions, TFBSs & TFs differ, too) [Figure: promoters for genes encoding rRNA (RNA polymerase I), mRNA (RNA polymerase II), and tRNA & 5S RNA (RNA polymerase III)] Brown Fig 9.18 BIOS Scientific Publishers Ltd, 1999

  43. Prokaryotic Genes & Operons • Genes with related functions are often clustered within operons (e.g., lac operon) • Operons = genes with related functions that are transcribed and regulated as a single unit; one promoter controls expression of several proteins • mRNAs produced from operons are “polycistronic” - a single mRNA encodes several proteins; i.e., there are multiple ORFs, each with its own AUG (START) & STOP codons, linked within one mRNA molecule
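The polycistronic layout described on slide 43, several ORFs each with its own START and STOP on one transcript, can be illustrated with a small ORF finder. This sketch works on the DNA alphabet (ATG rather than AUG), scans the forward strand only, and ignores ribosome-binding sites. The minimum ORF length and the toy "operon" sequence are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) spans of ATG...STOP ORFs in all 3 forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count coding codons, excluding the stop codon itself.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

# Toy transcript carrying two short ORFs separated by a spacer.
TRANSCRIPT = "CC" + "ATGAAATAA" + "GGAGG" + "ATGCCCGGGTGA" + "CC"
orfs = find_orfs(TRANSCRIPT)
```

Note that the two ORFs lie in different reading frames, which is common in operons because each ORF is translated independently.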

  44. Promoter of lac operon in E. coli (Transcribed by prokaryotic RNA polymerase) Brown Fig 9.17 BIOS Scientific Publishers Ltd, 1999

  45. Eukaryotic genes • Genes with related functions are occasionally, but not usually, clustered; instead, they share common regulatory regions (promoters, enhancers, etc.) • Chromatin structure must also be “active” for transcription to occur

  46. Eukaryotic genes have large & complex regulatory regions • Cis-acting regulatory elements include: • Promoters, enhancers, silencers • Trans-acting regulatory factors include: • Transcription factors (TFs), chromatin remodeling complexes, small RNAs Brown Fig 9.17 BIOS Scientific Publishers Ltd, 1999

  47. Eukaryotic Promoters: DNA sequences required for initiation, usually <200 bp from start site • Eukaryotic RNA polymerases bind by recognizing a complex of TFs bound at promoter • First, TFs must bind short motifs (TFBSs) within promoters; then RNA polymerase can bind and initiate transcription of RNA [Figure labels: ~250 bp, Pre-mRNA]

  48. Eukaryotic promoters & enhancer regions often contain many different TFBS motifs Fig 9.13 Mount 2004

  49. Simplified View of Promoters in Eukaryotes Fig 5.12 Baxevanis & Ouellette 2005

  50. Eukaryotic Activators vs Repressors • Regions far from the promoter can act as "enhancers" or "repressors" of transcription by serving as binding sites for activator or repressor proteins (TFs) • Activator proteins (TFs) bind to enhancers & interact with RNAP to stimulate transcription • Repressors block the action of activators [Figure: RNAP bound at promoter of a gene; enhancer located 100 - 50,000 bp away; enhancer proteins interact with RNAP; repressor prevents binding of activator]
