Srr-1 from Streptococcus
i/v: nonpolar • s: serine (polar uncharged) • n/s/t: polar uncharged • s: serine (polar uncharged) • e: glutamic acid (neg. charge) • s: serine (polar uncharged)
Streptococcal Srr proteins • S, signal sequence • N, non-repeat region • RI, small repeat region I • RII, large repeat region II • A, cell wall sorting signal • (X)S, di-peptide repeat motif
Gene prediction sequence
Prokaryotic gene • “Small” genomes, high gene density • Haemophilus influenzae genome is 85% genic • Operons • One transcript, many genes • No introns • One gene, one protein • Open reading frames • One ORF per gene • ORFs begin with a start codon and end with a stop codon
Eukaryotic gene • Much lower gene density • Transcripts undergo several post-transcriptional modifications • 5’ cap • Poly-A tail • Splicing
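The "ORFs begin with start, end with stop codon" rule can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: it assumes ATG as the only start codon, the standard stop codons TAA/TAG/TGA, and scans only the forward strand. The function name `find_orfs` and the `min_codons` parameter are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative starts such as GTG/TTG are ignored


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) half-open intervals of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon,
    with the stop codon included in the interval.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATAGGG"
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```

A real prokaryotic annotation would also scan the reverse complement and apply a much larger minimum length, since short ORFs occur by chance.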
Goal of Genomics • To understand the function of every gene in an organism • 1. Sequence the genome • 2. Characterize each gene • Some are already known • Many are similar to known genes • 40% are unknown (no homolog characterized)
Collating the evidence (diagram) • DNA databases (EMBL/GenBank/DDBJ): mRNA (cDNA), genomic (finished, draft), dbEST (ESTs) • Protein databases (Swall): TrEMBL (automatic translation of CDS from DNA databases), Swiss-Prot (curated data) • Domain databases (InterPro, CDD): Pfam, ProDom, SMART, PRINTS, PROSITE, TIGRfam • Reference sequences (RefSeq): NM_00001 (mRNA), XM_00001 (predicted mRNA) • Genome browsers (Ensembl, UCSC, NCBI) • LocusLink/Gene • Ancillary databases: PubMed, UniGene, OMIM, homology maps, human mutation db • Other diagram labels: genome assembly, gene prediction, gene/locus, gene/protein info, exon/intron structure, supporting evidence
Genome Browsers • Ensembl: www.ensembl.org • EBI and Sanger collaboration • Gene build, predicts novel genes • UCSC: genome.ucsc.edu • University of California, Santa Cruz • Annotates other gene builds • NCBI: www.ncbi.nlm.nih.gov/mapview/ • NCBI map viewer • Gene build, predicts novel genes
Predicting genes • Open reading frames (ORFs) • Frequency of stop codons • Simple algorithm, easy to interpret • Composition bias • Coding vs. noncoding • Sequence signals • Enhancers, promoters, start codons, intron/exon boundaries, stop codons, poly-A addition signals…
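The stop-codon idea can be made concrete. In a random sequence with equal base frequencies (an assumption for illustration only), 3 of the 64 codons are stops, so a stop is expected roughly every 21 codons; a long stretch without one hints at a coding frame. The sketch below counts the fraction of stop codons in each of the three forward reading frames. The function name `stop_fraction_by_frame` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Expected stop-codon fraction under the uniform-base assumption: 3 of 64 codons.
RANDOM_STOP_FRACTION = 3 / 64


def stop_fraction_by_frame(seq):
    """Return the fraction of codons that are stops in frames 0, 1 and 2."""
    seq = seq.upper()
    fractions = []
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        stops = sum(codon in STOP_CODONS for codon in codons)
        fractions.append(stops / len(codons) if codons else 0.0)
    return fractions


if __name__ == "__main__":
    print(stop_fraction_by_frame("ATGGCTTAAGCT"))
    print("expected codons between random stops:", 1 / RANDOM_STOP_FRACTION)
```

Over a long window, a frame whose stop fraction falls well below `RANDOM_STOP_FRACTION` is a candidate coding frame; real genomes have biased base composition, so the baseline should be estimated from the sequence itself.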
Predicted genes are of 4 types • Known genes (highest quality) • as catalogued by the reference sequence project • Ensembl known genes (red genes) • NCBI known genes • Novel genes (1) (high quality) • based on similarity to known genes, or cDNAs • these need not have 100% matching supporting evidence • Ensembl novel genes (black) • NCBI Loc genes
Predicted genes are of 4 types (continued) • Novel genes (2) (high quality) • based on the presence of ESTs • resource of alternative splicing • EST genes in Ensembl (purple) • Database of transcribed sequences (DOTs) • Assembly • Ab initio gene prediction (questionable) • Single organism: Genscan • Comparative information: Twinscan • Pseudogenes: match a known gene but with a disrupted ORF (a minefield!)
Gene prediction programs • Ab initio gene prediction • First ones predicted single exons, e.g. GRAIL (Uberbacher, ‘91) or MZEF (Zhang, ‘97) • Later ones predict entire genes, e.g. Genscan (Burge ‘97) and Fgenesh (Solovyev, ‘95) • Predict individual exons based on codon usage and sequence signals (start, stop, splice sites), followed by assembly of putative exons into genes • Genscan predicts 90% of coding nucleotides, and 70% of coding exons (Guigo, ‘00) • Gene prediction methods alone cannot accurately identify every gene in a genome
Twinscan • Gene structure prediction model • Extends the probability model of GENSCAN • Exploits homology between two related genomes • Notable improvement on GENSCAN
Prediction of URO-D structure using GRAIL and an external EST database
Prediction of URO-D structure using GENEWISE and different species as targets
Region of URO-D gene from the UCSC genome browser. Note RepeatMasker output
Supporting evidence • mRNA → reverse transcription → cDNA • Expressed Sequence Tag (EST) • Full-length cDNA sequence
Measuring accuracy • Sn = Sensitivity = TP/(TP+FN) • How many exons were found out of total present? • Sp = Specificity = TP/(TP+FP) • How many predicted exons were correct out of total exons predicted?
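As a worked example of the two measures at the exon level, the sketch below treats each exon as a (start, end) pair and counts a prediction as a true positive only when both boundaries match exactly. That exact-match rule, the function name `exon_accuracy`, and the coordinates are assumptions for illustration. Note that Sp as defined here, TP/(TP+FP), is the quantity other fields call precision.

```python
def exon_accuracy(predicted, actual):
    """Return (Sn, Sp) for exon predictions given as (start, end) pairs.

    Sn = TP / (TP + FN): share of real exons that were found.
    Sp = TP / (TP + FP): share of predicted exons that are real.
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # predicted and real, both boundaries exact
    fn = len(actual - predicted)   # real exons that were missed
    fp = len(predicted - actual)   # predictions with no exact real counterpart
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


if __name__ == "__main__":
    # Made-up coordinates: 4 real exons, 3 predictions, 2 of them exact.
    actual = [(100, 200), (300, 400), (500, 600), (700, 800)]
    predicted = [(100, 200), (300, 400), (550, 600)]
    sn, sp = exon_accuracy(predicted, actual)
    print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

With these made-up exons, TP = 2, FN = 2 and FP = 1, so Sn = 2/4 and Sp = 2/3. The same formulas apply at the nucleotide level by counting coding bases instead of whole exons.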
Why the errors? • First exons tend to be short, so there is less information to use • Parameters for one organism may not be useful for another organism • Quality degrades with phylogenetic distance • EST libraries contaminated with genomic sequences • Pseudogenes: test rate of synonymous substitutions (stops are more rare)
Other sources of gene prediction • ORF detectors • NCBI: http://www.ncbi.nih.gov/gorf/gorf.html *** • Promoter predictors • CSHL: http://rulai.cshl.org/software/index1.htm • BDGP: fruitfly.org/seq_tools/promoter.html • ICG: TATA-Box predictor • PolyA signal predictors • CSHL: argon.cshl.org/tabaska/polyadq_form.html • Splice site predictors • BDGP: http://www.fruitfly.org/seq_tools/splice.html • Start-/stop-codon identifiers • DNALC: Translator/ORF-Finder • BCM: Searchlauncher