Geneid: training on S. lycopersicum
Francisco Câmara Ferreira
Genome Bioinformatics Research Lab, Center for Genomic Regulation
Tomato Annotation, Ghent, October 2006
Geneid
• Geneid follows a hierarchical structure: signal -> exon -> gene
• Exon score: score of exon-defining signals + protein-coding potential (log-likelihood ratios)
• Dynamic programming algorithm: maximize the score of assembled exons -> assembled gene
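As a rough illustration of the assembly step, the sketch below chains non-overlapping candidate exons so that the summed score is maximal. It is a simplified stand-in and not geneid's actual algorithm: it ignores reading frame, exon-type compatibility and intron length limits, and the `Exon` tuple, the `assemble` function and the example scores are made up for illustration.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # hypothetical score: signals + coding potential


def assemble(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return (total score, chain) for the best set of non-overlapping exons."""
    exons = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best: List[Tuple[float, List[Exon]]] = [(0.0, [])]
    for i, exon in enumerate(exons):
        # k = number of earlier exons that end before this exon starts.
        k = bisect_right(ends, exon.start - 1, 0, i)
        take_score = best[k][0] + exon.score
        if take_score > best[i][0]:
            best.append((take_score, best[k][1] + [exon]))
        else:
            best.append(best[i])  # skipping this exon is at least as good
    return best[-1]


# Made-up candidate exons, for illustration only.
candidates = [
    Exon(1, 100, 2.0),
    Exon(50, 200, 5.0),
    Exon(150, 300, 2.5),
    Exon(250, 400, 1.0),
]
total, chain = assemble(candidates)
print(total, [(e.start, e.end) for e in chain])
```

Sorting by end coordinate lets each exon look up the best chain that finishes before it starts, so the whole assembly takes O(n log n) time for n candidate exons.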
Training geneid
GAGGTAAAC TCCGTAAGT CAGGTTGGA ACAGTCAGT TAGGTCATT TAGGTACTG ATGGTAACT CAGGTATAC TGTGTGAGT AAGGTAAGT
ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC
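To make the idea of a signal log-likelihood ratio concrete, the sketch below builds a position weight matrix from the ten 9-mers on this slide, all of which carry GT at positions 4-5. It is only an illustration: the uniform 0.25 background, the pseudocount of 1 and the function names are assumptions, and a real training run would use the full set of sites and a background estimated from the genome.

```python
import math
from typing import Dict, List

# The ten 9-mers shown on the slide.
DONORS = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def log_odds_matrix(sites: List[str], pseudocount: float = 1.0,
                    background: float = 0.25) -> List[Dict[str, float]]:
    """Per-position log2(observed frequency / background frequency)."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: math.log2(counts[base] / total / background)
                       for base in "ACGT"})
    return matrix


def score(site: str, matrix: List[Dict[str, float]]) -> float:
    """Sum the per-position log-odds of a candidate site."""
    return sum(column[base] for base, column in zip(site, matrix))


pwm = log_odds_matrix(DONORS)
print(score("AAGGTAAGT", pwm), score("AAGCCAAGT", pwm))
```

A candidate containing the conserved GT scores higher than the same sequence with GT replaced, which is the behavior a signal model is meant to capture.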
Optimization of geneid
• eWF: exon weight parameter
  • cutoff on the scores of predicted exons
• oWF: oligo weight parameter
  • ratio of information between signals and coding statistics
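One simple way to tune two such parameters is an exhaustive grid search, sketched below. Everything here is hypothetical: `grid_search` is not part of geneid, the candidate values are arbitrary, and `toy_accuracy` is a placeholder for a real step that would run the predictor with the given eWF and oWF and measure accuracy on the evaluation set.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def grid_search(evaluate: Callable[[float, float], float],
                ewf_values: Iterable[float],
                owf_values: Iterable[float]) -> Tuple[float, float, float]:
    """Return (best accuracy, eWF, oWF) over all value combinations."""
    best = None
    for ewf, owf in product(ewf_values, owf_values):
        accuracy = evaluate(ewf, owf)
        if best is None or accuracy > best[0]:
            best = (accuracy, ewf, owf)
    if best is None:
        raise ValueError("no parameter values supplied")
    return best


def toy_accuracy(ewf: float, owf: float) -> float:
    """Placeholder objective with an arbitrary peak at eWF=-3.0, oWF=0.4."""
    return -((ewf + 3.0) ** 2) - (owf - 0.4) ** 2


result = grid_search(toy_accuracy,
                     ewf_values=[-4.0, -3.5, -3.0, -2.5],
                     owf_values=[0.2, 0.3, 0.4, 0.5])
print(result)
```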
Training set for tomato
• Used 399 of 428 non-redundant annotated genes (102 BACs); the remaining 29:
  • 14 in-frame stops
  • 5 one-nucleotide CDS
  • 10 redundant/overlapping
• Used 1,760 donor sites, 1,783 acceptors and 391 start codons
  • Non-standard: 29 donors, 6 acceptors and 8 start codons

Evaluation set for tomato
• Used 362 of the 399 genes used in training
  • Excluded 37 genes containing non-canonical starts, donors or acceptors
• Determined prediction accuracy on this set:
  • sensitivity & specificity at nucleotide, exon & gene level
• Employed a "leave-one-out" jackknife training/evaluation method to reduce bias in the accuracy results
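The evaluation described above can be sketched at the nucleotide level as follows. This assumes the convention common in gene-prediction benchmarks, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP); the slides do not spell out the formulas, and the function names and coordinates are illustrative.

```python
from typing import Iterator, List, Sequence, Set, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive exon coordinates


def coding_positions(exons: Sequence[Interval]) -> Set[int]:
    """Expand exon intervals into the set of covered positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def nucleotide_accuracy(annotated: Sequence[Interval],
                        predicted: Sequence[Interval]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the nucleotide level."""
    real = coding_positions(annotated)
    pred = coding_positions(predicted)
    true_pos = len(real & pred)
    sensitivity = true_pos / len(real) if real else 0.0
    specificity = true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity


def leave_one_out(genes: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training genes, held-out gene) for each gene in turn."""
    for i, held_out in enumerate(genes):
        yield genes[:i] + genes[i + 1:], held_out


# Made-up coordinates, for illustration only.
sn, sp = nucleotide_accuracy(annotated=[(101, 200), (301, 400)],
                             predicted=[(151, 200), (301, 450)])
splits = list(leave_one_out(["gene1", "gene2", "gene3"]))
print(sn, sp, splits[0])
```

In a leave-one-out run, the parameters would be re-estimated on each training split and the prediction scored only on the held-out gene, so no gene is evaluated with parameters it helped to train.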
Statistics of training set
• intron length: 515 nt (29-6,972 nt)
• exon length: 162 nt (2-1,888 nt)
• CDS length: 970 nt (67-2,940 nt)
• exons/gene: 5.5
• avg. gene size: 4,951 nt
• GC (coding): 43%
• GC (intron): 33%
• # of exons: 2,188
• # of single-exon genes: 64
• # of coding bases: 386,811
• # of non-coding nt: 922,178
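A quick arithmetic check relates the totals to the per-gene figures, assuming the totals refer to the 399 training genes (the slide does not say so explicitly). The variable names are illustrative.

```python
# Totals taken from the slide; the gene count is an assumption (see lead-in).
n_genes = 399
n_exons = 2188
coding_bases = 386_811
noncoding_bases = 922_178

exons_per_gene = n_exons / n_genes        # about 5.48, reported as 5.5
mean_cds_length = coding_bases / n_genes  # about 969.5, close to the reported 970 nt
coding_fraction = coding_bases / (coding_bases + noncoding_bases)  # about 0.30

print(f"exons/gene:      {exons_per_gene:.2f}")
print(f"mean CDS length: {mean_cds_length:.1f} nt")
print(f"coding fraction: {coding_fraction:.3f}")
```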
Statistics of training set: GC distributions
Accuracy of new parameter file
Prediction/evaluation on full BACs:
• BACs are not fully annotated
• Extract genes flanked by 400 nt from the annotations
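Cutting a gene out of a BAC with 400 nt of flanking sequence could look like the sketch below. The function name, the 1-based inclusive coordinates and the clipping at the sequence ends are assumptions; the slides do not describe how the extraction was implemented.

```python
from typing import Tuple


def extract_with_flanks(sequence: str, gene_start: int, gene_end: int,
                        flank: int = 400) -> Tuple[str, int]:
    """Return (subsequence, offset) for a gene plus flanking sequence.

    Coordinates are 1-based and inclusive. The flanks are clipped at the
    sequence ends. `offset` is the number of bases removed on the left, so
    an annotated coordinate x maps to x - offset in the extracted fragment.
    """
    left = max(1, gene_start - flank)
    right = min(len(sequence), gene_end + flank)
    return sequence[left - 1:right], left - 1


# Synthetic BAC: the "gene" is the run of C at positions 1001-1500.
bac = "A" * 1000 + "C" * 500 + "G" * 1000
fragment, offset = extract_with_flanks(bac, 1001, 1500)
edge_fragment, edge_offset = extract_with_flanks(bac, 101, 200)
print(len(fragment), offset, len(edge_fragment), edge_offset)
```

Keeping the offset makes it possible to shift the annotated exon coordinates into the extracted fragment before comparing them with the predictions.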
Accuracy of new parameter file
Prediction/evaluation on genes
Geneid predictions using new parameter file
• 102 BACs with annotations
• 443 fully sequenced BACs (bacs.v72.seq)
• Masked versions of the BACs above
  • used the TIGR tomato/Arabidopsis known repeat sequence library (TIGR_SolAth_repeat.fa, obtained from SGN)
Accuracy of new parameter file
gff2ps plot of one prediction
Geneid predictions on chr 9 BACs using tomato parameter file
Nucleotide ratios around splice sites/start codon in tomato (panels: Donor, Acceptor, Start)
Chromosome 9 has a total of 142 markers
But...
• Most markers are in heterochromatin
• Most of them did not match any BAC
• Gap of 46 cM
Construction of a training set for the gene prediction program geneid
• Waiting for a more complete set (Shibata's).
• A parameter file constructed from 100 sequences from different Solanaceae species (50% tomato).
Geneid predictions on 6 chr 9 BACs using sol parameter file
Tomato Sequencing, Madison, July 2006
Geneid vs GeneSeqer
• Geneid predicted 22 genes in the 114,526 bp BAC C09HBa0109D11.1
• Most of the predictions are supported by EST results, as shown by GeneSeqer
• GeneSeqer is another gene identification tool, based on the "spliced alignment" of ESTs to the genomic sequence contained in the BAC
Geneid with a parameter file obtained from Solanaceae, applied to 6 BACs from chr 9
Distribution of GC content in introns and exons of the Solanaceae sequences used to train geneid
• To be improved when a large set of full-length (FL) sequences from tomato is available (Shibata's)
European Commission, EU-SOL
Vicky Fernandez, Sheila Zuniga, Angela Perez, Francisco Camara, Roderic Guigó, Miguel A Botella, Antonio Granell