This document provides an in-depth explanation of the (semi)-automated training process of Geneid, a protein-coding gene prediction tool, using the example of Melon. It covers the steps involved in training, optimization, and evaluation, along with the advantages and disadvantages of manual training. The document also discusses the potential future improvements and the possibility of making Geneid into a user-friendly software package.
(SEMI)-Automated TRAINING OF GENEID: Melon EXAMPLE
Francisco Câmara Ferreira
Group meeting, June 2011
Geneid:
• Geneid is a protein-coding gene prediction tool that can be optimized for prediction in different species
• Geneid follows a hierarchical structure: signal -> exon -> gene
• Exon score: score of exon-defining signals + protein-coding potential
• Dynamic programming algorithm: maximize score of assembled exons -> assembled gene
• Training:
  - compute a model for splice sites (PWMs/Markov)
  - derive a model for coding DNA
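The "maximize score of assembled exons" step can be illustrated with a minimal dynamic programming sketch. It is not geneid's actual assembly algorithm: it only picks the highest-scoring set of non-overlapping candidate exons and ignores frame compatibility, gene-model rules and distance limits. The `Exon` tuple layout and the function name `best_chain` are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical exon representation: (start, end, score), inclusive coordinates.
Exon = Tuple[int, int, float]


def best_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the best total score and the non-overlapping exons that reach it."""
    exons = sorted(exons, key=lambda e: e[1])      # sort by end coordinate
    ends = [e[1] for e in exons]
    n = len(exons)
    best = [0.0] * (n + 1)    # best[i]: best score using only the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is part of best[i]
    prev = [0] * (n + 1)      # prev[i]: number of earlier exons compatible with exon i

    for i, (start, _end, score) in enumerate(exons, 1):
        # Count earlier exons that end strictly before this exon starts.
        p = bisect_right(ends, start - 1, 0, i - 1)
        prev[i] = p
        if best[p] + score > best[i - 1]:
            best[i] = best[p] + score
            take[i] = True
        else:
            best[i] = best[i - 1]

    # Walk back through the table to recover the chosen exons.
    chain = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(exons[i - 1])
            i = prev[i]
        else:
            i -= 1
    return best[n], chain[::-1]


candidates = [(1, 100, 5.0), (50, 150, 8.0), (160, 250, 4.0), (120, 200, 3.0)]
total, chain = best_chain(candidates)
print(total, chain)
```

In this toy input the exon at 50-150 beats the overlapping exon at 1-100, and is then combined with the exon at 160-250.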
Parameter file development: • Currently geneid (v.1.4) has 46 parameter files. http://genome.crg.eu/software/geneid/index.html
Training geneid “manually”
Disadvantages:
- required running 38 awk/Perl/C programs (often one after the other)
- easier to make mistakes at different steps
- could take 2-3 days incl. evaluation
Advantages:
- more “control” over training and optimization
- could potentially generate a better matrix
GeneidTRAINer1_0.pl (a Perl-language integration tool)
• Twenty-three awk scripts
  - Scripts to derive coding potential (e.g. MarkovMatrices.awk)
  - Scripts to compute PWMs or Markov models of splice sites (e.g. logratio_kmatrix.awk; Getkmatrix.awk)
• Five C-language programs, among them Geneid, SSgff and Evaluation
GeneidTRAINer1_0.pl command line options:
-species (C. melo)
-gff (gff2 format)
-fasta (multifasta annots)
-sout (stats file name)
-branch (meme branch profile: y/n)
-reduced (“reduced” training: y/n)
{-path (location of programs/scripts)}
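For readers more used to Python than Perl, the option set above could be mirrored roughly as follows. This is a hypothetical analogue, not part of GeneidTRAINer1_0.pl, which parses its options with Getopt::Long. The program name `geneid_trainer_sketch`, the choice of required options and the accepted yes/no spellings are assumptions. Both `-fasta` and `-fastas` are accepted because the slides use both spellings.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical Python mirror of the trainer's option list."""
    yes_no = ["yes", "no", "y", "n"]
    p = argparse.ArgumentParser(prog="geneid_trainer_sketch")
    p.add_argument("-species", required=True, help="species name, e.g. C.melo")
    p.add_argument("-gff", required=True, help="annotations in gff2 format")
    # The slides show both spellings, so both map to the same destination.
    p.add_argument("-fastas", "-fasta", dest="fastas", required=True,
                   help="multifasta file with the annotated sequences")
    p.add_argument("-sout", required=True, help="name of the stats file")
    p.add_argument("-branch", choices=yes_no, default="no",
                   help="use a meme branch profile")
    p.add_argument("-reduced", choices=yes_no, default="no",
                   help="run the reduced training")
    p.add_argument("-path", default=None,
                   help="location of programs/scripts (optional)")
    return p


args = build_parser().parse_args(
    "-species C.melo -gff c.melo.gff -fastas cmelo.fa "
    "-sout stats.txt -branch no -reduced no".split()
)
print(args.species, args.fastas, args.reduced)
```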
Modules used by GeneidTRAINer1_0.pl: use strict; use Getopt::Long; use File::Path; use Data::Dumper; use Geneid::Param; use Geneid::Isocore;
Full/partial training
First-time training for a species, command line:
geneidTRAINer1_0.pl -species C.melo -gff c.melo.gff -fastas cmelo.fa -sout stats.txt -branch no -reduced no
Subsequent training for a species, command line:
geneidTRAINer1_0.pl -species C.melo -gff c.melo.gff -fastas cmelo.fa -sout stats.txt -branch no -reduced yes
• “REDUCED” training excludes the following:
  - Setting aside sequences for testing (if >500 genes)
  - Selecting whether to perform 10x cross-validation (if >500 genes)
  - Extracting introns and CDS
  - Extracting splice sites and start codons
  - Extracting (400nt) flanked gene models and building an “artificial contig” by concatenating gene models (training/test)
  - Extracting random “background sequences”
user interactivity/program flow (I)
(“REDUCED” training would exclude all steps shown in this slide)
start: Branch=no, Reduced=no
• >500 gene models?
  - no: go to (3)
  - yes: Set aside 20% for test set?
    - no: go to (3)
    - yes: 10X cross-validation?
      - yes: go to (1)
      - no: go to (2)
(1) Extract CDS/introns/sites; error checking (Eval: 10x cross-val + test set)
(2) Extract CDS/introns/sites; error checking (Eval: test set)
(3) Extract CDS/introns/sites; error checking (Eval: 10x cross-val)
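The choice between the three evaluation modes, the 20% hold-out and the 10 folds can be sketched as below. This is a reading of the flowchart, not the trainer's code: the function names, the random shuffling, the fixed seed and the interleaved fold assignment are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def plan_evaluation(n_genes: int, want_test_set: bool, want_cross_val: bool) -> str:
    """Pick the evaluation mode; a test set is only offered above 500 gene models."""
    if n_genes > 500 and want_test_set:
        if want_cross_val:
            return "10x cross-val + test set"   # outcome (1)
        return "test set"                       # outcome (2)
    return "10x cross-val"                      # outcome (3)


def split_genes(gene_ids: Sequence[str], test_fraction: float = 0.2,
                seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the gene ids and set aside a fraction of them for testing."""
    shuffled = list(gene_ids)
    random.Random(seed).shuffle(shuffled)       # fixed seed for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training, test)


def k_folds(gene_ids: Sequence[str], k: int = 10) -> List[List[str]]:
    """Deal the ids into k folds of near-equal size."""
    return [list(gene_ids[i::k]) for i in range(k)]


ids = [f"gene{i}" for i in range(1000)]          # hypothetical gene identifiers
train, test = split_genes(ids)
print(plan_evaluation(len(ids), True, True), len(train), len(test))
print([len(fold) for fold in k_folds(train)])
```

In each cross-validation round one fold would be held out for evaluation and the other nine used for training.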
user interactivity/program flow (II)
• Display recommended 1) donor, 2) acceptor, 3) start, 4) (branch) profile
• Modify profile?
  - yes: Compute PWM or Markov on new profile
  - no: Compute PWM or Markov on suggested profile
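A log-ratio PWM of the kind this step computes can be sketched as follows. The exact formulas in the awk scripts are not shown in the slides, so the pseudocount of 1, the natural logarithm and the function names are assumptions. The toy sequences are invented for illustration and are not melon data.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def position_frequencies(seqs: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-position base frequencies for equal-length, aligned sequences."""
    length = len(seqs[0])
    table = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}   # pseudocount avoids log(0)
        for s in seqs:
            counts[s[pos]] += 1
        total = sum(counts.values())
        table.append({b: counts[b] / total for b in BASES})
    return table


def log_ratio_pwm(sites: List[str], background: List[str]) -> List[Dict[str, float]]:
    """Log ratio of real-site frequencies over background frequencies."""
    fg = position_frequencies(sites)
    bg = position_frequencies(background)
    return [{b: math.log(f[b] / g[b]) for b in BASES} for f, g in zip(fg, bg)]


def score_site(pwm: List[Dict[str, float]], seq: str) -> float:
    """Sum the per-position log ratios for one candidate site."""
    return sum(col[base] for col, base in zip(pwm, seq))


donors = ["GTAAGT", "GTGAGT", "GTAAGA"]                 # invented donor-like sites
background = ["ACGTAC", "TTGCAA", "CGATCG", "GACTGT"]   # invented background
pwm = log_ratio_pwm(donors, background)
print(score_site(pwm, "GTAAGT") > score_site(pwm, "CCCCCC"))
```

A Markov model of a site would condition each position on the preceding base(s) instead of treating positions independently.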
user interactivity/program flow (III)
• > 400,000 coding / 100,000 non-coding bases?
  - yes: Derive coding model of order 5
  - no: Derive coding model of order 4
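The order selection can be written as a small rule, together with the transition counts a Markov model of that order needs. The sketch assumes that both thresholds must be exceeded for order 5, which the slide does not state explicitly. It also ignores codon position, which a real coding model would likely need to track. Function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def choose_markov_order(coding_bases: int, noncoding_bases: int) -> int:
    """Order 5 only with enough training data, otherwise order 4."""
    # Assumption: BOTH thresholds from the slide must be exceeded.
    if coding_bases > 400_000 and noncoding_bases > 100_000:
        return 5
    return 4


def count_transitions(seqs: Iterable[str], order: int) -> Counter:
    """Count (context, next base) pairs, where the context has `order` bases."""
    counts: Counter = Counter()
    for seq in seqs:
        for i in range(len(seq) - order):
            counts[(seq[i:i + order], seq[i + order])] += 1
    return counts


print(choose_markov_order(500_000, 150_000))   # enough data for order 5
print(choose_markov_order(300_000, 150_000))   # too few coding bases
print(count_transitions(["ACGTAC"], 2))
```

A higher order has many more contexts to estimate (4^5 instead of 4^4), which is why it is reserved for larger training sets.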
user interactivity/program flow (IV)
• Display recommended 1) intron range (min/max) / 2) inter-genic distance range
• Modify gene model?
  - yes: Modify gene model with new range
  - no: Use default intron/intergenic range
user interactivity/program flow (V)
• Optimize/evaluate on “artificial” contigs?
  - yes:
    - Display default eWF/oWF range/step (optimization)
    - Modify optimization range?
      - yes: Optimize parameter file using new eWF/oWF range
      - no: Optimize parameter file using default eWF/oWF range
    - EVALUATE on test set-contig, or on training set-contig if <500 gene models (maybe biased)
  - no:
    - Display default eWF/oWF range/step (optimization)
    - Modify optimization range?
      - yes: Optimize parameter file using new eWF/oWF range
      - no: Optimize parameter file using default eWF/oWF range
    - EVALUATE: 1) test set-single seqs + 10x cross-validation; 2) test set-single seqs; 3) 10x cross-validation (<500 sequences)
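Optimizing over an eWF/oWF range with a given step amounts to a grid search. The sketch below assumes an `evaluate` callback that would run geneid with the two values and return a single accuracy figure to maximize. That callback, the toy scoring function and the numeric ranges are all hypothetical; the slides do not give the default ranges or the accuracy measure.

```python
from itertools import product
from typing import Callable, List, Tuple

Range = Tuple[float, float, float]   # (start, stop, step), stop included


def frange(start: float, stop: float, step: float) -> List[float]:
    """Inclusive float range, rounded to suppress accumulated float error."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


def grid_search(evaluate: Callable[[float, float], float],
                ewf_range: Range, owf_range: Range) -> Tuple[float, float, float]:
    """Try every (eWF, oWF) pair and return (best accuracy, eWF, oWF)."""
    best = None
    for ewf, owf in product(frange(*ewf_range), frange(*owf_range)):
        accuracy = evaluate(ewf, owf)
        if best is None or accuracy > best[0]:
            best = (accuracy, ewf, owf)
    return best


def toy_accuracy(ewf: float, owf: float) -> float:
    """Stand-in for a real geneid run plus evaluation; peaks at (-3.0, 0.4)."""
    return -((ewf + 3.0) ** 2 + (owf - 0.4) ** 2)


# Hypothetical ranges, chosen only so that the toy optimum lies on the grid.
print(grid_search(toy_accuracy, (-4.5, -1.5, 0.5), (0.2, 0.6, 0.05)))
```

Each grid point costs one full prediction and evaluation run, so the range and step chosen by the user directly set the running time.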
user interactivity/program flow (VI)
• Plot annotations + geneid predictions using gff2ps?
  - yes: Plot predictions + END PROGRAM
  - no: END PROGRAM
Things (still) to do:
• Convert awk scripts/programs to Perl (Python?) to get a “cleaner”, easier-to-use software tool
• Write better usage instructions
• Make it into a “package” (including geneid and other essential programs' source code?) that can be easily installed by users interested in training geneid without needing much knowledge of the training process itself
• Perhaps try to publish in a technical journal?