10/24/05 Promoter Prediction, RNA Structure & Function Prediction D Dobbs ISU - BCB 444/544X: Promoter Prediction
Announcements Seminar (Mon Oct 24) (several additional seminars listed in email sent to class) 12:10 PM IG Faculty Seminar in 101 Ind Ed II "Laser capture microdissection-facilitated transcriptional profiling of abscission zones in Arabidopsis" Coralie Lashbrook, EEOB http://www.bb.iastate.edu/%7Emarit/GEN691.html Mark your calendars: 1:10 PM Nov 14 Baker Seminar in Howe Hall Auditorium "Discovering transcription factor binding sites" Douglas Brutlag, Dept of Biochemistry & Medicine, Stanford University School of Medicine
Announcements • 544 Semester Projects • Thanks to all who sent already! • Others: Information needed today! • ddobbs@iastate.edu • Briefly describe: • Your background & current grad research • Is there a problem related to your research you would like to learn more about & develop as project for this course? • or • What would your ‘dream’ project be?
Announcements • Exam 2 - this Friday • Posted Online: Exam 2 Study Guide • 544 Reading Assignment (2 papers) • Office Hours: David Mon 1-2 PM in 209 Atanasoff; Drena Tues 10-11 AM in 106 MBB; Michael - none this week • Thurs No Lab - Extra Office Hrs instead: David 1-3 PM in 209 Atanasoff; Drena 1-3 PM in 106 MBB
Announcements • Updated PPTs & PDFs for Gene Prediction lectures (covered on Exam 2) will be posted today (changes are minor) • Is everyone on BCB 444/544 mailing list? Auditors?
Promoter Prediction & RNA Structure/Function Prediction • Mon: Quite a few more words re: Gene prediction; Promoter prediction • Wed: RNA structure & function; RNA structure prediction; 2° & 3° structure prediction; miRNA & target prediction • Thurs: No Lab • Fri: Exam 2
Reading Assignment - previous • Mount Bioinformatics • Chp 9 Gene Prediction & Regulation • pp 361-401 • Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html • * Brown Genomes 2 (NCBI textbooks online) • Sect 9 Overview: Assembly of Transcription Initiation Complex • http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.chapter.7002 • Sect 9.1-9.3 DNA binding proteins, Transcription initiation • http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.section.7016 • *NOTEs: Don’t worry about the details!! • See Study Guide for Exam 2 re: Sections covered
Optional - but very helpful reading: (that's a hint!) • Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709 http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html • Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287 http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html Check this out: http://www.phylofoot.org/NRG_testcases/
Reading Assignment (for Wed) • Mount Bioinformatics • Chp 8 Prediction of RNA Secondary Structure • pp. 327-355 • Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html • Cates (Online) RNA Secondary Structure Prediction Module • http://cnx.rice.edu/content/m11065/latest/
Review last lecture: Gene Prediction (formerly Gene Prediction - 3) • Overview of steps & strategies • Algorithms • Gene prediction software
Predicting Genes - Basic steps: • Obtain genomic DNA sequence • Translate in all 6 reading frames • Compare with protein sequence database • Also perform database similarity search with EST & cDNA databases, if available • Use gene prediction programs to locate genes • Analyze gene regulatory sequences • Note: Several important details missing above: 1. Mask to "remove" repetitive elements (ALUs, etc.) 2. Perform database search on translated DNA (BlastX, TFasta) 3. Use several programs to predict genes (GENSCAN, GeneMark.hmm) 4. Translate putative ORFs and search for functional motifs (Blocks, Motifs, etc.) & regulatory sequences
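The "translate in all 6 reading frames" step can be illustrated with a minimal Python sketch. It assumes the standard genetic code (NCBI translation table 1) and plain A/C/G/T input; the function names are hypothetical and are not taken from any of the tools named in this lecture.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

For example, `six_frame_translation("ATGGCCTAA")["+1"]` gives the frame +1 translation of that 9-nt toy sequence. Each of the six protein strings could then be compared against a protein database, which is what BlastX-style searches do internally.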
Gene prediction flowchart Fig 5.15 Baxevanis & Ouellette 2005
Overview of gene prediction strategies • What sequence signals can be used? • Transcription: TF binding sites, promoter, initiation site, terminator • Processing signals: splice donor/acceptors, polyA signal • Translation: start (AUG = Met) & stop (UGA, UAA, UAG) • ORFs, codon usage • What other types of information can be used? • cDNAs & ESTs (pairwise alignment) • homology (sequence comparison, BLAST)
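As a rough illustration of how translation signals define ORFs, the sketch below scans the three forward frames of a DNA string for an ATG followed in-frame by one of the three standard stop codons (TAA, TAG, TGA in DNA). The function name, the `min_codons` threshold, and the choice to scan only the forward strand are simplifying assumptions; real gene finders also use codon usage and splice signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start/end are 0-based, end-exclusive, and include the stop codon.
    min_codons counts the start and stop codons too.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs
```

Running the same function on the reverse complement of the sequence would cover the other three frames.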
Examples of gene prediction software • Similarity-based or Comparative • BLAST • SGP2 (extension of GeneID) • Ab initio = “from the beginning” • GeneID - (used in lab last week) • GENSCAN - (used in lab last week) • GeneMark.hmm - (should try this!) • Combined “evidence-based” • GeneSeqer (Brendel et al., ISU) BEST? GENSCAN, GeneMark.hmm, GeneSeqer but depends on organism & specific task
Annotated lists of gene prediction software • URLs from Mount Chp 9, available online Table 9.1 http://www.bioinformaticsonline.org/links/ch_09_t_1.html • from Pevsner Chps 14 & 16 http://www.bioinfbook.org/chapt14.htm - prokaryotic http://www.bioinfbook.org/chapt16.htm - eukaryotic • Table in Zhang Nat Rev Genet article: http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html • Another list: Kozar, Stanford http://cmgm.stanford.edu/classes/genefind/ • Performance Evaluation? Guigó, Barcelona (& sites above) http://www1.imim.es/courses/SeqAnalysis/GeneIdentification/Evaluation.html
Gene prediction: Eukaryotes vs prokaryotes • Gene prediction is easier in microbial genomes • Methods? Previously, mostly HMM-based. Now: similarity-based methods, because so many genomes are available • see Mount Fig 9.7 (E. coli gene) • Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available, e.g., GeneMark.hmm; TIGR Comprehensive Microbial Resource (CMR); NCBI Microbial Genomes
UCSC Browser view of 1000 kb region (Human URO-D gene) Fig 5.10 Baxevanis & Ouellette 2005
GeneSeqer - Brendel et al. Spliced Alignment Algorithm http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi [Diagram: intron bounded by splice sites, GT (donor) and AG (acceptor)] • Perform pairwise alignment with large gaps in one sequence (due to introns) • Align genomic DNA with cDNA, ESTs, protein sequences • Score semi-conserved sequences at splice junctions • Using a Bayesian model • Score coding constraints in translated exons • Using a Bayesian model Brendel et al (2004) Bioinformatics 20: 1157 Brendel 2005
Brendel - Spliced Alignment I: Compare with cDNA or EST probes [Diagram: genomic DNA (start codon, stop codon) aligned with mRNA (Cap, 5’-UTR, start codon, stop codon, 3’-UTR, Poly(A))] Brendel 2005
Brendel - Spliced Alignment II: Compare with protein probes [Diagram: genomic DNA (start codon, stop codon) aligned with protein] Brendel 2005
Splice Site Detection • Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES • Information Content I_i: • Extent of Splice Signal Window: • i: ith position in sequence • Ī: avg information content over all positions >20 nt from splice site • σ_Ī: avg sample standard deviation of Ī Brendel 2005
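The slide's own formula for I_i did not survive extraction. A minimal sketch of per-position information content, assuming the standard Shannon-based definition for DNA, I_i = 2 + Σ_B f_iB log2 f_iB (with f_iB the frequency of base B at position i), is shown below. Whether this is exactly the form used on the slide is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def information_content(aligned_sites):
    """Return I_i (in bits) for each column of equal-length aligned sequences.

    Assumes I_i = 2 - H_i, where H_i is the Shannon entropy of the base
    frequencies in column i (2 bits is the maximum for a 4-letter alphabet).
    """
    length = len(aligned_sites[0])
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned_sites)
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
        profile.append(2.0 - entropy)
    return profile
```

A fully conserved column (such as the G of a GT donor site) scores 2 bits, while a column with all four bases equally frequent scores 0 bits. Comparing I_i near the splice site with the background average Ī is what lets one ask how far the splice signal window extends.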
Information content vs position [Plots: Human T2_GT, Human T2_AG] Which sequences are exons & which are introns? How can you tell? Brendel et al (2004) Bioinformatics 20: 1157 Brendel 2005
Bayesian Splice Site Prediction Let $S = s_{-l}\, s_{-l+1}\, s_{-l+2} \ldots s_{-1}\,\mathrm{GT}\, s_1\, s_2\, s_3 \ldots s_r$ where H indexes the hypotheses of GT or AG at - True site in reading phase 1, 2, or 0 - False within-exon site in reading phase 1, 2, or 0 - False within-intron site Brendel et al (2004) Bioinformatics 20: 1157 Brendel 2005
Bayes Factor as Decision Criterion 2-class model: H_0: H=T 7-class model: Brendel et al (2004) Bioinformatics 20: 1157 Brendel 2005
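The decision formulas on this slide were images and are not recoverable here. As a hedged illustration of the 2-class idea only, the sketch below computes a Bayes factor P(S | true site) / P(S | false site) for a sequence window and converts it into a posterior probability with Bayes' theorem. It assumes independent positions with made-up per-position base probabilities; the published GeneSeqer models are more elaborate, and every name and number here is hypothetical.

```python
import math


def log_likelihood(window, position_probs):
    """Sum of log P(base at position i) under a position-specific model."""
    return sum(math.log(position_probs[i][base])
               for i, base in enumerate(window))


def bayes_factor(window, true_model, false_model):
    """P(window | true site) / P(window | false site)."""
    return math.exp(log_likelihood(window, true_model)
                    - log_likelihood(window, false_model))


def posterior_true(window, true_model, false_model, prior_true):
    """Posterior P(true site | window) from the Bayes factor and a prior."""
    odds = (bayes_factor(window, true_model, false_model)
            * prior_true / (1.0 - prior_true))
    return odds / (1.0 + odds)


# Toy two-position models (hypothetical numbers, for illustration only).
TRUE_MODEL = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
              {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
FALSE_MODEL = [{b: 0.25 for b in "ACGT"}] * 2
```

With these toy models, `bayes_factor("AG", TRUE_MODEL, FALSE_MODEL)` is 0.49 / 0.0625 = 7.84, so the window favors the true-site hypothesis; a threshold on the Bayes factor then serves as the decision criterion.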
Markov Model for Spliced Alignment [State diagram: exon states e_n, e_{n+1} and intron states i_n, i_{n+1}; transition probabilities shown: P_G, (1-P_G)(1-P_D(n+1)), (1-P_G)P_D(n+1), P_A(n)P_G, 1-P_A(n)] Brendel 2005
Evaluation of Splice Site Prediction Actual True False • TP • FP PP=TP+FP True Predicted • FN • TN False PN=FN+TN AP=TP+FN AN=FP+TN = Coverage • Sensitivity: • Specificity: • Misclassification rates: • Normalized specificity: Brendel 2005 D Dobbs ISU - BCB 444/544X: Promoter Prediction
Performance? [Plots of Sn: Human GT site, Human AG site, A. thaliana GT site, A. thaliana AG site] • Note: these are not ROC curves (plots of Sn vs (1-Sp)) • But plots such as these (& ROCs) are much better than using a "single number" to compare different methods • Both types of plots illustrate the trade-off: Sn vs Sp Brendel 2005
Evaluation of Splice Site Prediction What do measures really mean? Fig 5.11 Baxevanis & Ouellette 2005
Careful: different definitions for "Specificity"
                   Actual True    Actual False
Predicted True     TP             FP              PP = TP+FP
Predicted False    FN             TN              PN = FN+TN
                   AP = TP+FN     AN = FP+TN
Brendel definitions • Sensitivity: (= Coverage) • Specificity:
cf. Guigó definitions • Sn: Sensitivity = TP/(TP+FN) • Sp: Specificity = TN/(TN+FP) • AC: Approximate Coefficient = 0.5 x ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1
Other measures? Predictive Values, Correlation Coefficient
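The measures listed under the Guigó definitions can be computed directly from the four confusion-matrix counts. The sketch below follows the formulas exactly as written on the slide (Sn = TP/(TP+FN), Sp = TN/(TN+FP), and AC as given); the function name is hypothetical, and no guard against zero denominators is included.

```python
def guigo_measures(tp, fp, fn, tn):
    """Return (Sn, Sp, AC) from confusion-matrix counts, per the slide."""
    sn = tp / (tp + fn)            # sensitivity
    sp = tn / (tn + fp)            # specificity as defined on the slide
    ac = 0.5 * (tp / (tp + fn)     # approximate coefficient
                + tp / (tp + fp)
                + tn / (tn + fp)
                + tn / (tn + fn)) - 1
    return sn, sp, ac
```

For example, with TP = 8, FP = 2, FN = 2, TN = 8, each of the four ratios is 0.8, so Sn = 0.8, Sp = 0.8 and AC = 0.5 x 3.2 - 1 = 0.6. A perfect prediction gives AC = 1.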
Best measures for comparing different methods? • ROC curves (Receiver Operating Characteristic?!!) • http://www.anaesthetist.com/mnm/stats/roc/ • "The Magnificent ROC" - has fun applets & quotes: • "There is no statistical test, however intuitive and simple, which will not be abused by medical researchers" • Correlation Coefficient • Matthews correlation coefficient (MCC) • MCC = 1 for a perfect prediction • 0 for a completely random assignment • -1 for a "perfectly incorrect" prediction Do not memorize this!
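For reference, the Matthews correlation coefficient has the standard closed form MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below implements that formula; returning 0.0 when the denominator is zero is a common convention adopted here as an assumption, and the function name is hypothetical.

```python
import math


def matthews_cc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when any row or column of the table is empty
    return (tp * tn - fp * fn) / denominator
```

The three reference points on the slide can be checked directly: counts of (10, 0, 0, 10) give 1, (0, 10, 10, 0) give -1, and a balanced random-looking table such as (5, 5, 5, 5) gives 0.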
Performance of GeneSeqer vs other methods? • Comparison with ab initio gene prediction (e.g., GENSCAN) • Depends on: • Availability of ESTs • Availability of protein homologs Other Performance Evaluations? Guigó http://www1.imim.es/courses/SeqAnalysis/GeneIdentification/Evaluation.html Brendel 2005
GeneSeqer vs GENSCAN (Exon prediction) [Plot: Exon (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100) for GeneSeqer, NAP, GENSCAN] GENSCAN - Burge, MIT Brendel 2005
GeneSeqer vs GENSCAN (Intron prediction) [Plot: Intron (Sn + Sp) / 2 (y-axis, 0.00 to 1.00) vs target protein alignment score (x-axis, 0 to 100) for GeneSeqer, NAP, GENSCAN] GENSCAN - Burge, MIT Brendel 2005
Other Resources • Current Protocols in Bioinformatics • http://www.4ulr.com/products/currentprotocols/bioinformatics.html • Finding Genes • 4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations • 4.2 Using MZEF To Find Internal Coding Exons • 4.3 Using GENEID to Identify Genes • 4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes • 4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm • 4.6 Eukaryotic Gene Prediction Using GeneMark.hmm • 4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome • 4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences • 4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation • 4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences
New Today: Promoter Prediction • A few more words about Gene prediction • Predicting regulatory regions (focus on promoters) • Brief review promoters & enhancers • Predicting in eukaryotes vs prokaryotes • Introduction to RNA • Structure & function
Predicting Promoters • What signals are there? • Algorithms • Promoter prediction software
What signals are there? Simple ones in prokaryotes Brown Fig 9.17 BIOS Scientific Publishers Ltd, 1999
Prokaryotic promoters • RNA polymerase complex recognizes promoter sequences located very close to & on 5’ side (“upstream”) of initiation site • RNA polymerase complex binds directly to these, with no requirement for “transcription factors” • Prokaryotic promoter sequences are highly conserved • -10 region • -35 region
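Because the -35 and -10 regions are highly conserved, a very simple scan can already flag candidate prokaryotic promoters. The sketch below assumes the textbook E. coli sigma-70 consensus hexamers (TTGACA at -35, TATAAT at -10) separated by a 15-19 bp spacer; these specifics are not stated on the slide, and the function names, mismatch threshold, and spacer range are illustrative assumptions rather than the method of any particular tool.

```python
MINUS_35 = "TTGACA"   # assumed E. coli sigma-70 -35 consensus
MINUS_10 = "TATAAT"   # assumed E. coli sigma-70 -10 consensus


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoter_candidates(dna, max_mismatches=1, spacer_range=(15, 19)):
    """Return (pos_35, pos_10, spacer) for each consensus-like hexamer pair.

    Positions are 0-based starts of the -35 and -10 hexamers on the
    forward strand.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 5):
        if mismatches(dna[i:i + 6], MINUS_35) > max_mismatches:
            continue
        for spacer in range(spacer_range[0], spacer_range[1] + 1):
            j = i + 6 + spacer
            box = dna[j:j + 6]
            if len(box) == 6 and mismatches(box, MINUS_10) <= max_mismatches:
                hits.append((i, j, spacer))
    return hits
```

On a toy sequence built as `"GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GGGG"`, the scan reports one candidate with the -35 box at position 2, the -10 box at position 25, and a 17 bp spacer. Real promoters deviate from the consensus, which is why weight-matrix or HMM scoring is generally preferred over exact matching.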
What signals are there? Complex ones in eukaryotes! Fig 9.13 Mount 2004
Simpler view of complex promoters in eukaryotes: Fig 5.12 Baxevanis & Ouellette 2005
Eukaryotic genes are transcribed by 3 different RNA polymerases, which recognize different types of promoters & enhancers: Brown Fig 9.18 BIOS Scientific Publishers Ltd, 1999
Eukaryotic promoters & enhancers • Promoters located “relatively” close to initiation site (but can be located within gene, rather than upstream!) • Enhancers also required for regulated transcription (these control expression in specific cell types, developmental stages, in response to environment) • RNA polymerase complexes do not specifically recognize promoter sequences directly • Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes
Eukaryotic transcription factors • Transcription factors (TFs) are DNA binding proteins that also interact with RNA polymerase complex to activate or repress transcription • TFs contain characteristic “DNA binding motifs” http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039 • TFs recognize specific short DNA sequence motifs, “transcription factor binding sites” • Several databases for these, e.g. TRANSFAC http://www.generegulation.com/cgibin/pub/databases/transfac
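Short binding-site motifs of this kind are commonly represented as position weight matrices (PWMs) and scanned along a sequence. The sketch below builds a log-odds matrix from per-position base counts and scores every window; the counts in the usage example are made up, the uniform 0.25 background and pseudocount of 1 are assumptions, and this is not the TRANSFAC file format or API. Input is assumed to contain only A, C, G, T.

```python
import math


def build_log_odds(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log2-odds weight matrix."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total)
                            / background)
            for base in "ACGT"
        })
    return matrix


def scan(dna, matrix):
    """Return (position, score) for every window of the matrix width."""
    dna = dna.upper()
    width = len(matrix)
    return [(i, sum(col[base] for col, base in zip(matrix, dna[i:i + width])))
            for i in range(len(dna) - width + 1)]


def best_site(dna, matrix):
    """Return the (position, score) pair with the highest score."""
    return max(scan(dna, matrix), key=lambda hit: hit[1])
```

For instance, with hypothetical counts `[{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]` (a toy TATA-like motif), `best_site("GGCTATAGGC", build_log_odds(counts))` picks the window starting at position 3. In practice a score threshold is needed, because short motifs produce many chance matches in genomic sequence.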
Zinc finger-containing transcription factors • Common in eukaryotic proteins • Estimated 1% of mammalian genes encode zinc-finger proteins • In C. elegans, there are 500! • Can be used as highly specific DNA binding modules • Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy Brown Fig 9.12 BIOS Scientific Publishers Ltd, 1999
Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS) Fig 5.14 Baxevanis & Ouellette 2005
Reading Assignment (for Wed) • Mount Bioinformatics • Chp 8 Prediction of RNA Secondary Structure • pp. 327-355 • Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html • Cates (Online) RNA Secondary Structure Prediction Module • http://cnx.rice.edu/content/m11065/latest/