
Gene Prediction

Gene Prediction: Preliminary Results. Computational Genomics, February 20, 2012. Ab initio gene prediction using Glimmer3, RAST, Prodigal and GeneMarkS. Prodigal has low complexity (no hidden Markov model, no interpolated Markov model) and is based on dynamic programming.

zuri

Presentation Transcript


  1. Gene Prediction: Preliminary Results. Computational Genomics, February 20, 2012

  2. Ab initio Gene Prediction Using Glimmer3, RAST, Prodigal and GeneMarkS

  3. Prodigal • Low complexity (no hidden Markov model, no interpolated Markov model). • Based on dynamic programming. • Remains accurate in high-GC-content genomes. • Tends to predict longer genes rather than more genes.

  4. Prodigal Protocol

  5. Prodigal Options

  6. Build Training File

  7. Running Prodigal

  8. Screenshot of Results

  9. GeneMarkS: gene prediction in prokaryotic genomes with unsupervised model parameter estimation

  10. Web-based version

  11. Command line version
      • Syntax: runGeneMarkS <input_file> <output_folder>
      • The output folder contains three types of files:
        • .out file: contains the default output
        • .faa file: contains the amino acid sequences of the corresponding ORFs in FASTA format
        • .fnn file: contains the nucleotide sequences of the corresponding ORFs in FASTA format
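The FASTA outputs described above can be inspected with a few lines of Python. This is a minimal sketch, not part of the original protocol: the `read_fasta` helper and the two example records are hypothetical, and it assumes only that the .faa and .fnn files are plain multi-line FASTA.

```python
import io
from typing import Dict, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Read FASTA records into a dict mapping header text to sequence."""
    records: Dict[str, str] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]          # header without the leading '>'
            records[name] = ""
        elif name is not None:
            records[name] += line    # sequences may span several lines
    return records


# Hypothetical stand-in for a .faa file; real headers will differ.
example = io.StringIO(">gene_1\nMKT\nLLV\n>gene_2\nMSS\n")
orfs = read_fasta(example)
mean_length = sum(len(seq) for seq in orfs.values()) / len(orfs)
print(len(orfs), "ORFs, mean length", mean_length)
```

For a real run, replace the `io.StringIO` object with `open(path)` on the .faa or .fnn file in the output folder.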

  12. Screenshot of the .out file • Strand: + is the normal strand, - is the reverse strand. • Left end: begin position. • Right end: end position.

  13. Screenshot of the .faa file

  14. Screenshot of the .fnn file

  15. Glimmer3 • A system for finding genes in microbial DNA. • Works by creating a variable-length Markov model from a training set of genes. • Uses the model to identify all genes in a DNA sequence.
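To make the idea of scoring sequence with a Markov model trained on known genes concrete, here is a deliberately simplified sketch. It uses a fixed-order model with pseudocounts, not Glimmer3's variable-length interpolated model, and the function names, the order `k = 2`, and the toy training sequence are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_markov(seqs, k=2, pseudocount=1.0):
    """Count (context -> next base) transitions and return a probability function."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount  # 4 = size of the DNA alphabet
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, prob, k=2):
    """Log-likelihood ratio of seq under the trained model vs. uniform base usage."""
    return sum(math.log(prob(seq[i - k:i], seq[i]) / 0.25) for i in range(k, len(seq)))


# Toy training set; a real run would use the training genes.
model = train_markov(["ATGATGATGATG"])
gene_like = log_odds("ATGATG", model)
background = log_odds("CCCCCC", model)
print(gene_like > background)
```

A candidate ORF whose score is well above that of background sequence looks more like the training genes; Glimmer3 refines this by interpolating over contexts of different lengths.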

  16. Running Glimmer3
      • Two-step process:
        1. A probability model of coding sequences, called an interpolated context model, must be built from a set of training sequences. The training sequences can come from:
           1. genes identified by homology, or known genes
           2. long, non-overlapping ORFs
           3. genes from a highly similar species
        2. The program is run to analyze the sequences and make gene predictions.
      • Best results require the largest possible training set of genes.

  17. Glimmer3 programs • long-orfs: uses an amino-acid distribution model to filter the set of ORFs. • extract: builds the training set from long, non-overlapping ORFs. • build-icm: builds the interpolated context model from the training sequences. • glimmer3: analyzes sequences and makes predictions.
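The four programs above are normally chained in this order. The sketch below only assembles the command lines; it is not the group's actual protocol. The flags follow the from-scratch script distributed with Glimmer3 as best recalled, so treat them as assumptions and check them against the installed version. The `Step` type, the helper names, and the file tag are hypothetical.

```python
import subprocess
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    argv: List[str]
    stdin: Optional[str] = None   # file redirected to standard input, if any
    stdout: Optional[str] = None  # file receiving standard output, if any


def glimmer3_steps(genome: str, tag: str) -> List[Step]:
    """Build the long-orfs -> extract -> build-icm -> glimmer3 chain (flags assumed)."""
    return [
        Step(["long-orfs", "-n", "-t", "1.15", genome, f"{tag}.longorfs"]),
        Step(["extract", "-t", genome, f"{tag}.longorfs"], stdout=f"{tag}.train"),
        Step(["build-icm", "-r", f"{tag}.icm"], stdin=f"{tag}.train"),
        Step(["glimmer3", "-o50", "-g110", "-t30", genome, f"{tag}.icm", tag]),
    ]


def run_steps(steps: List[Step]) -> None:
    """Execute each step in order, stopping at the first failure."""
    for step in steps:
        fin = open(step.stdin) if step.stdin else None
        fout = open(step.stdout, "w") if step.stdout else None
        try:
            subprocess.run(step.argv, stdin=fin, stdout=fout, check=True)
        finally:
            for fh in (fin, fout):
                if fh is not None:
                    fh.close()


steps = glimmer3_steps("M19107.fasta", "M19107")
for step in steps:
    print(" ".join(step.argv))
```

Calling `run_steps(steps)` would execute the chain, provided the Glimmer3 binaries are on the PATH.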

  18. Interpolated Context Model

  19. RAST • RAST (Rapid Annotation using Subsystem Technology) is a system for annotating bacterial and archaeal genomes. • Pipeline: tRNAscan-SE, Glimmer2, and comparison against prokaryotic genes that are universal across species.

  20. Number of Genes Predicted

  21. Gene Length of Predicted Genes

  22. Homology-based Gene Prediction using BLAT

  23. Homology-based Gene Prediction using BLAT (workflow diagram) • Query: 1709 protein-coding genes from Haemophilus influenzae. • Targets: Haemophilus haemolyticus genomes M19107.fasta, M19501.fasta, M21127.fasta, M21621.fasta, M21639.fasta, M21709.fasta (numbers shown on the slide: 99, 17, 29, 24, 49, 31). • BLAT (UCSC) produces the predicted genes in Output.pslx. • Query coverage (%) frequency graphs are used to define the cutoff.

  24. Cut-off Frequency Query-Coverage %
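A query-coverage cutoff of this kind can be computed directly from BLAT's PSL/PSLX columns. The sketch below is one plausible definition, (qEnd - qStart) / qSize, which may not be the exact formula used for these slides. The two alignment rows are invented for illustration, and the 90% threshold is the one quoted in the gene calling protocol.

```python
def query_coverage(psl_line: str) -> float:
    """Percent of the query spanned by the alignment, from a tab-separated PSL row."""
    fields = psl_line.rstrip("\n").split("\t")
    # PSL columns (0-based): 9 = qName, 10 = qSize, 11 = qStart, 12 = qEnd
    q_size, q_start, q_end = int(fields[10]), int(fields[11]), int(fields[12])
    return 100.0 * (q_end - q_start) / q_size


def make_row(q_name: str, q_size: int, q_start: int, q_end: int) -> str:
    """Build a hypothetical PSL row; only the query columns matter here."""
    fields = ["0"] * 21
    fields[8] = "+"
    fields[9], fields[10] = q_name, str(q_size)
    fields[11], fields[12] = str(q_start), str(q_end)
    fields[13] = "contig_1"
    return "\t".join(fields)


rows = [make_row("geneA", 1000, 20, 980), make_row("geneB", 1000, 0, 500)]
CUTOFF = 90.0  # percent query coverage
kept = [r.split("\t")[9] for r in rows if query_coverage(r) >= CUTOFF]
print(kept)
```

A histogram of `query_coverage` over all rows gives the kind of frequency graph from which a cutoff can be chosen.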

  25. Homology-based Gene Prediction using BLAT Results

  26. Gene Calling Protocol (workflow diagram) • Number of predicted genes (≥ 90% query coverage), as listed on the slide: 787, 1063, 901, 970, 930, 1515. • Genomes: M19107, M19501, M21709*, M21127, M21621, M21639. • Gene scoring system (presence/absence): ≥ 4/5, = 3/5 (marked "?"), ≤ 2/5. • Multiple alignment (MUSCLE); consensus sequence; final set of homology-based predicted genes.
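The presence/absence scoring can be sketched as a three-way classification on the number of genomes in which a gene is found. The slide gives only the thresholds (≥ 4/5, = 3/5, ≤ 2/5), so the category labels, the gene names, and the genome labels `g1` to `g5` below are hypothetical.

```python
from typing import Dict, Set


def score_gene(present_in: int) -> str:
    """Classify a gene by how many of the five genomes contain it (labels assumed)."""
    if present_in >= 4:
        return "high"        # at least 4/5
    if present_in == 3:
        return "uncertain"   # 3/5, the case marked "?" on the slide
    return "low"             # at most 2/5


# Hypothetical presence table: gene -> genomes with a hit above the coverage cutoff.
presence: Dict[str, Set[str]] = {
    "geneA": {"g1", "g2", "g3", "g4", "g5"},
    "geneB": {"g1", "g3", "g5"},
    "geneC": {"g2"},
}
scores = {gene: score_gene(len(genomes)) for gene, genomes in presence.items()}
print(scores)
```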

  27. RNA Prediction

  28. tRNAscan-SE • First-pass filters identify "candidate" tRNA regions of the sequence: tRNAscan and EufindtRNA. • Further analysis confirms the initial tRNA prediction: Cove.

  29. Parameters passed
      tRNAscan-SE -B <inputfile> -o <outputfile1> -f <outputfile2> -m <outputfile3>
      • -B: search for bacterial tRNAs. This option selects the bacterial covariance model for tRNA analysis and loosens the search parameters for EufindtRNA to improve detection of bacterial tRNAs.
      • -o <file>: save final results in <file>.
      • -f <file>: save results and tRNA secondary structures to <file>.
      • -m <file>: save a statistics summary for the run. The file contains the run options selected as well as statistics on the number of tRNAs detected at each phase of the search, search speed, and other statistics.

  30. Output using the "-o" parameter; output using the "-f" parameter

  31. Results: output using the "-m" parameter

  32. Isotype and Anticodon Count (M19107)
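An isotype and anticodon count like this one can be tallied from the tabular file written with -o. The sketch below assumes the classic whitespace-separated layout (sequence name, tRNA number, begin, end, type, anticodon, intron begin, intron end, Cove score, preceded by header lines). That layout should be checked against the actual output, and the three data rows are invented.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_trnas(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Tally tRNA isotypes and anticodons from tRNAscan-SE tabular output."""
    isotypes: Counter = Counter()
    anticodons: Counter = Counter()
    for line in lines:
        fields = line.split()
        # Data rows have at least 9 columns and a numeric tRNA index in column 2;
        # this also skips the header and separator lines.
        if len(fields) < 9 or not fields[1].isdigit():
            continue
        isotypes[fields[4]] += 1
        anticodons[f"{fields[4]}-{fields[5]}"] += 1
    return isotypes, anticodons


# Hypothetical output; real files come from the -o option.
example = """Sequence\ttRNA\tBounds\ttRNA\tAnti\tIntron Bounds\tCove
Name\ttRNA #\tBegin\tEnd\tType\tCodon\tBegin\tEnd\tScore
--------\t------\t-----\t-----\t----\t-----\t-----\t----\t------
contig_1\t1\t100\t175\tAla\tTGC\t0\t0\t80.12
contig_1\t2\t300\t376\tIle\tGAT\t0\t0\t85.40
contig_7\t1\t950\t875\tAla\tTGC\t0\t0\t79.05
""".splitlines()

isotypes, anticodons = count_trnas(example)
print(isotypes.most_common())
```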

  33. RNAmmer

  34. Working • RNAmmer uses two levels of hidden Markov models. • The spotter model is constructed from highly conserved loci within a structural alignment of known rRNA sequences. • Once the spotter model detects the approximate position of a gene, the flanking regions are extracted and passed to the full model, which matches the entire gene. • The two-level approach avoids running the full model across the entire genome sequence, which allows faster predictions.

  35. Command line options
      • rnammer -S (species) -m (molecules) -xml (xml file) -gff (gff file) -h (hmm report file) -f (fasta file)
      • -S: specifies the species to use. In our case, it is bacterial.
      • -m: molecules to search for (i.e. large subunit or small subunit).

  36. Results
      ##gff-version2
      ##source-version RNAmmer-1.2
      ##date 2012-02-19
      ##Type DNA
      # seqname  source       feature  start   end     score   +/-  frame  attribute
      # --------------------------------------------------------------------------
      84         RNAmmer-1.2  rRNA     28110   31006   3556.4  +    .      23s_rRNA
      84         RNAmmer-1.2  rRNA     31127   31241   82.9    +    .      5s_rRNA
      1          RNAmmer-1.2  rRNA     116969  117083  82.9    -    .      5s_rRNA
      60         RNAmmer-1.2  rRNA     338     452     82.9    +    .      5s_rRNA
      29         RNAmmer-1.2  rRNA     198     312     82.9    +    .      5s_rRNA
      84         RNAmmer-1.2  rRNA     25977   27507   1872.9  +    .      16s_rRNA
      # --------------------------------------------------------------------------
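The GFF rows above are easy to summarize programmatically. This minimal sketch parses the six rRNA rows exactly as they appear on the slide and reports the count and length of each rRNA type. The `parse_gff` helper is hypothetical, and it splits on whitespace because the slide text is space-separated (real GFF files are tab-separated).

```python
from collections import Counter

GFF_ROWS = """84 RNAmmer-1.2 rRNA 28110 31006 3556.4 + . 23s_rRNA
84 RNAmmer-1.2 rRNA 31127 31241 82.9 + . 5s_rRNA
1 RNAmmer-1.2 rRNA 116969 117083 82.9 - . 5s_rRNA
60 RNAmmer-1.2 rRNA 338 452 82.9 + . 5s_rRNA
29 RNAmmer-1.2 rRNA 198 312 82.9 + . 5s_rRNA
84 RNAmmer-1.2 rRNA 25977 27507 1872.9 + . 16s_rRNA"""


def parse_gff(text: str):
    """Yield one dict per feature row, skipping comment and blank lines."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        seqname, source, feature, start, end, score, strand, frame, attr = line.split()
        yield {
            "seqname": seqname,
            "start": int(start),
            "end": int(end),
            "score": float(score),
            "strand": strand,
            "attribute": attr,
            "length": int(end) - int(start) + 1,  # GFF coordinates are inclusive
        }


features = list(parse_gff(GFF_ROWS))
counts = Counter(f["attribute"] for f in features)
lengths = {f["attribute"]: f["length"] for f in features}
print(counts, lengths)
```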

  37. sRNA Prediction

  38. Rfam Database Homology Search • Rfam is a collection of RNA families: non-coding RNA genes, structured cis-regulatory elements, and self-splicing RNAs. • The search uses WU-BLAST and keeps hits with E-value < 1e-5.

  39. Rfam Preliminary Results
      The output format is: <rfam acc> <rfam id> <seq id> <seq start> <seq end> <strand> <score>
      Results:
      84 Rfam similarity 25970 27512 1477.28 + . evalue=2.08e-50;gc-content=52;id=SSU_rRNA_bacteria.1;model_end=1518;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_bacteria
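The attribute string of the hit above can be split into key/value pairs and checked against the E-value threshold used in the Rfam search. This is a minimal sketch; the `parse_attributes` helper is hypothetical and the attribute text is copied from the result line.

```python
from typing import Dict

ATTRIBUTES = (
    "evalue=2.08e-50;gc-content=52;id=SSU_rRNA_bacteria.1;"
    "model_end=1518;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_bacteria"
)


def parse_attributes(text: str) -> Dict[str, str]:
    """Split a 'key=value;key=value' attribute string into a dict."""
    pairs = (item.split("=", 1) for item in text.split(";") if item)
    return {key: value for key, value in pairs}


E_VALUE_CUTOFF = 1e-5  # threshold stated for the Rfam search

attrs = parse_attributes(ATTRIBUTES)
passes = float(attrs["evalue"]) < E_VALUE_CUTOFF
model_span = int(attrs["model_end"]) - int(attrs["model_start"]) + 1
print(attrs["rfam-acc"], attrs["rfam-id"], passes, model_span)
```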

  40. Things to be done • Get GenePRIMP to work: the installation is giving us problems and the web server takes a long time to process. • Gather the further information required to run other RNA prediction software. • Compare specific RNA prediction software with the Rfam predictions.

  41. Leading Biocomputational Tools • eQRNA (Rivas and Eddy 2001) • RNAz (Washietl et al. 2005; Gruber et al. 2010) • sRNAPredict3/SIPHT (Livny et al. 2006, 2008) • NAPP (Marchais et al. 2009) • All four approaches use comparative genomics. • Source: Lu, X., H. Goodrich-Blair, et al. (2011). "Assessing computational tools for the discovery of small RNA genes in bacteria." RNA 17(9): 1635-1647.

  42. sRNApredict3 Pipeline
