
Srr-1 from Streptococcus


anson

Presentation Transcript


  1. Srr-1 from Streptococcus

  2. Srr-1 from Streptococcus

  3. i/v: nonpolar • s: serine (polar uncharged) • n/s/t: polar uncharged • s: serine (polar uncharged) • e: glutamic acid (neg. charge) • s: serine (polar uncharged)

  4. Streptococcal Srr proteins • S, signal sequence • N, non-repeat region • RI, small repeat region I • RII, large repeat region II • A, cell wall sorting signal • (X)S, di-peptide repeat motif

  5. Gene prediction sequence

  6. Prokaryotic gene • “Small” genomes, high gene density • Haemophilus influenzae genome 85% genic • Operons • One transcript, many genes • No introns • One gene, one protein • Open reading frames • One ORF per gene • ORFs begin with start, end with stop codon
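
The "ORFs begin with start, end with stop codon" rule can be sketched as a simple scan. This is a minimal illustration, not a real gene finder: it assumes ATG as the only start codon, the standard stop codons TAA/TAG/TGA, and the forward strand only. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
START_CODON = "ATG"                  # assumption: alternative starts (GTG, TTG) ignored


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each forward-strand ORF.

    `start` is the index of the first base of the start codon, `end` is
    exclusive and includes the stop codon. `min_codons` counts both the
    start and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame one codon at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open an ORF at the first ATG seen
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next start
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

A fuller version would also scan the reverse complement, giving six reading frames in total.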

  7. Eukaryotic Gene • Much lower gene density • Transcripts undergo several post-transcriptional modifications • 5’ cap • Poly(A) tail • Splicing

  8. Goal of Genomics • To understand the function of every gene in an organism • 1. Sequence the genome • 2. Characterize each gene • Some are already known • Many are similar to known genes • 40% are unknown (no homolog characterized)

  9. Collating the evidence (diagram) • DNA databases (EMBL/GenBank/DDBJ): mRNA (cDNA), genomic (finished, draft), dbEST (ESTs) • Protein databases (Swall): TrEMBL (automatic translation of CDS from DNA databases), Swiss-Prot (curated data) • Domain databases (InterPro, CDD): Pfam, ProDom, SMART, PRINTS, PROSITE, TIGRFAMs • Reference sequences (RefSeq): NM_00001 (mRNA), XM_00001 (predicted mRNA) • Ancillary databases: PubMed, UniGene, OMIM, homology maps, human mutation db • Genome browsers (Ensembl, UCSC, NCBI) • LocusLink/Gene • Other diagram labels: genome assembly, gene/locus, gene prediction, gene/protein info, exon/intron structure, supporting evidence

  10. Genome Browsers • Ensembl: www.ensembl.org • EBI and Sanger collaboration • Gene build, predicts novel genes • UCSC: genome.ucsc.edu • University of California, Santa Cruz • Annotates other gene builds • NCBI: www.ncbi.nlm.nih.gov/mapview/ • NCBI map viewer • Gene build, predicts novel genes

  11. Predicting genes • Open reading frames (ORFs): frequency of stop codons; simple algorithm, easy to interpret • Composition bias: coding vs. noncoding • Sequence signals: enhancers, promoters, start codons, intron/exon boundaries, stop codons, poly-A addition signals…
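
The first two cues on this slide can be sketched in a few lines. The idea is that a frame with few stop codons and a long stop-free run is more likely to be coding, and that overall base composition can differ between coding and noncoding sequence. The names `stop_codon_profile` and `gc_fraction` are hypothetical, and GC fraction is only one possible composition measure, chosen here as an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def stop_codon_profile(seq):
    """Map each forward frame (0, 1, 2) to
    (number of stop codons, longest stop-free run in codons)."""
    seq = seq.upper()
    profile = {}
    for frame in range(3):
        stops = 0
        run = 0
        longest = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                stops += 1
                run = 0  # a stop codon ends the current open stretch
            else:
                run += 1
                longest = max(longest, run)
        profile[frame] = (stops, longest)
    return profile


def gc_fraction(seq):
    """Fraction of G or C bases; a crude stand-in for composition bias."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in ("G", "C") for base in seq) / len(seq)


if __name__ == "__main__":
    print(stop_codon_profile("ATGAAATAGCCC"))
    print(gc_fraction("ATGC"))
```

In random sequence, 3 of the 64 codons are stops, so a stop is expected roughly every 21 codons; stop-free runs much longer than that are what make this signal easy to interpret.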

  12. Predicted genes are of 4 types • Known genes (highest quality) • as catalogued by the reference sequence project • Ensembl known genes (red genes) • NCBI known genes • Novel genes (1) (high quality) • based on similarity to known genes, or cDNAs • these need not have 100% matching supporting evidence • Ensembl novel genes (black) • NCBI Loc genes

  13. Predicted genes are of 4 types • Novel genes (2) (high quality) • based on the presence of ESTs • resource of alternative splicing • EST genes in Ensembl (purple) • Database of transcribed sequences (DOTs) • Assembly • Ab initio gene prediction (questionable) • Single organism: Genscan • Comparative information: Twinscan • Pseudogenes - match a known gene but with a disrupted ORF - a minefield!

  14. Gene prediction programs • Ab initio gene prediction • First ones predicted single exons, e.g. GRAIL (Uberbacher, ‘91) or MZEF (Zhang, ‘97) • Later ones predict entire genes, e.g. Genscan (Burge ‘97) and Fgenesh (Solovyev, ‘95) • Predict individual exons based on codon usage and sequence signals (start, stop, splice sites), followed by assembly of putative exons into genes • Genscan predicts 90% of coding nucleotides, and 70% of coding exons (Guigo, ‘00) • Gene prediction methods alone cannot accurately identify every gene in a genome

  15. Twinscan • Gene structure prediction model • Extends probability model of GENSCAN • Exploits homology between two related genomes • Notable improvement on GENSCAN

  16. Output from Artemis

  17. Bias in nucleotide frequency

  18. Prediction of URO-D structure using different programs

  19. Prediction of URO-D structure using GRAIL and an external EST database

  20. Prediction of URO-D structure using GENEWISE and different species as targets

  21. Region of URO-D gene from the UCSC genome browser. Note RepeatMasker output

  22. Supporting evidence (diagram) • mRNA • reverse transcription • cDNA • Expressed Sequence Tag (EST) • full-length cDNA sequence

  23. Measuring accuracy • Sn = Sensitivity = TP/(TP+FN) • How many exons were found out of total present? • Sp = Specificity = TP/(TP+FP) • How many predicted exons were correct out of total exons predicted?
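
The two measures above can be computed at the exon level as follows. This is a sketch under one stated assumption: a predicted exon counts as a true positive only if both of its boundaries match an annotated exon exactly. The function name `exon_accuracy` and the coordinate tuples are hypothetical. Note that Sp as defined on this slide, TP/(TP+FP), is the quantity other fields call precision.

```python
def exon_accuracy(predicted, actual):
    """Return (Sn, Sp) for exons given as (start, end) tuples.

    Assumption: an exon is correct only when both boundaries match exactly.
    """
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # predicted and really present
    fn = len(actual - predicted)   # present but missed
    fp = len(predicted - actual)   # predicted but not present
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    return sn, sp


if __name__ == "__main__":
    actual = [(100, 200), (300, 400), (500, 600), (700, 800)]
    predicted = [(100, 200), (300, 400), (450, 480)]
    sn, sp = exon_accuracy(predicted, actual)
    print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

With these made-up coordinates, two of the four annotated exons are found (Sn = 0.5) and two of the three predictions are correct (Sp = 2/3).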

  24. Twinscan

  25. Why the errors? • First exons tend to be short, so there is less information to use • Parameters for one organism may not be useful for another organism • Quality degrades with phylogenetic distance • EST libraries contaminated with genomic sequences • Pseudogenes - test rate of synonymous substitutions (stops are more rare)

  26. Other sources of gene prediction • ORF detectors • NCBI: http://www.ncbi.nih.gov/gorf/gorf.html *** • Promoter predictors • CSHL: http://rulai.cshl.org/software/index1.htm • BDGP: fruitfly.org/seq_tools/promoter.html • ICG: TATA-Box predictor • PolyA signal predictors • CSHL: argon.cshl.org/tabaska/polyadq_form.html • Splice site predictors • BDGP: http://www.fruitfly.org/seq_tools/splice.html • Start-/stop-codon identifiers • DNALC: Translator/ORF-Finder • BCM: Searchlauncher
