
Geneid: training on S. lycopersicum

Geneid: training on S. lycopersicum. Francisco Câmara Ferreira, Genome Bioinformatics Research Lab, Center for Genomic Regulation. Tomato Annotation, Ghent, October 2006. Geneid follows a hierarchical structure: signal -> exon -> gene.


Presentation Transcript


  1. Geneid: training on S. lycopersicum. Francisco Câmara Ferreira, Genome Bioinformatics Research Lab, Center for Genomic Regulation. Tomato Annotation, Ghent, October 2006

  2. Geneid
     • Geneid follows a hierarchical structure: signal -> exon -> gene
     • Exon score: score of exon-defining signals + protein-coding potential (log-likelihood ratios)
     • Dynamic programming algorithm: maximize the score of assembled exons -> assembled gene
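The assembly step can be illustrated with a minimal sketch of exon chaining by dynamic programming. This is not geneid's actual implementation: it assumes candidate exons are given as hypothetical `(start, end, score)` tuples and only enforces that chosen exons do not overlap, ignoring reading frame, strand, and gene-model rules that a real gene finder must respect.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical exon representation: (start, end, score), inclusive coordinates.
Exon = Tuple[int, int, float]


def best_exon_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the maximum total score and the non-overlapping exons achieving it."""
    exons = sorted(exons, key=lambda e: e[1])  # order by end coordinate
    ends = [e[1] for e in exons]
    n = len(exons)
    best = [0.0] * (n + 1)   # best[i]: best score using only the first i exons
    take = [False] * (n + 1)  # whether exon i is part of best[i]
    prev = [0] * (n + 1)      # index to continue the traceback from

    for i, (start, _end, score) in enumerate(exons, 1):
        # j = number of earlier exons that end strictly before this one starts
        j = bisect_right(ends, start - 1, 0, i - 1)
        with_exon = best[j] + score
        if with_exon > best[i - 1]:
            best[i], take[i], prev[i] = with_exon, True, j
        else:
            best[i] = best[i - 1]

    # Trace back the chosen exons.
    chain: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(exons[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain


candidates = [(1, 100, 5.0), (50, 150, 8.0), (160, 300, 4.0), (120, 200, 3.0)]
total, chain = best_exon_chain(candidates)
print(total, chain)
```

Sorting by end coordinate lets each exon look up its best compatible predecessor with a binary search, so the chain is found in O(n log n) time.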

  3. Training geneid
     Example 9-nt signal sequences (each has GT at positions 4-5, as at donor splice sites):
     GAGGTAAAC TCCGTAAGT CAGGTTGGA ACAGTCAGT TAGGTCATT TAGGTACTG ATGGTAACT CAGGTATAC TGTGTGAGT AAGGTAAGT
     Example sequence:
     ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
     GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
     GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
     GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
     ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
     GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC
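As an illustration of how aligned signal sequences can be turned into a scoring model, the sketch below builds a log-odds position weight matrix from the ten 9-mers on this slide. Several things here are assumptions rather than geneid's documented procedure: the uniform background frequencies, the pseudocount of 1, and the function names `build_pwm` and `score_site`, which are hypothetical.

```python
import math
from typing import Dict, List, Optional

# The ten 9-nt sequences shown on the slide.
DONORS = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def build_pwm(sites: List[str],
              background: Optional[Dict[str, float]] = None,
              pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build a log-likelihood-ratio matrix: one {base: log-odds} dict per position."""
    if background is None:
        # Assumption: uniform background; a real model would estimate it from data.
        background = {b: 0.25 for b in "ACGT"}
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log((counts[b] / total) / background[b]) for b in "ACGT"})
    return pwm


def score_site(pwm: List[Dict[str, float]], site: str) -> float:
    """Sum the per-position log-odds for a candidate site of the same length."""
    return sum(column[base] for column, base in zip(pwm, site))


pwm = build_pwm(DONORS)
print(score_site(pwm, "AAGGTAAGT"))  # candidate with GT at positions 4-5
print(score_site(pwm, "AAGCCAAGT"))  # same candidate with the GT replaced by CC
```

Because every training sequence has GT at positions 4-5, a candidate lacking it receives a much lower score than one that has it.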

  4. Optimization of geneid
     • eWF: exon weight parameter
       • cutoff on the scores of predicted exons
     • oWF: oligo weight parameter
       • ratio of information between signals and coding statistics
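One common way to tune two such parameters is an exhaustive grid search, sketched below. The slides do not say how the optimization was actually carried out, so this is only an assumption about the general approach. The `evaluate` callback is hypothetical (in practice it would run predictions with the given eWF/oWF values and return an accuracy measure), and `toy_accuracy` is a made-up stand-in so the example runs.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


def grid_search(evaluate: Callable[[float, float], float],
                ewf_values: Iterable[float],
                owf_values: Iterable[float]) -> Tuple[Optional[Tuple[float, float]], float]:
    """Try every (eWF, oWF) pair and return the best pair with its score."""
    best_score = float("-inf")
    best_params = None
    for ewf, owf in product(ewf_values, owf_values):
        score = evaluate(ewf, owf)
        if score > best_score:
            best_score, best_params = score, (ewf, owf)
    return best_params, best_score


def toy_accuracy(ewf: float, owf: float) -> float:
    """Made-up accuracy surface with a single peak; NOT real geneid behaviour."""
    return 1.0 - (ewf + 3.0) ** 2 / 100 - (owf - 0.4) ** 2


params, score = grid_search(toy_accuracy, [-5.0, -4.0, -3.0, -2.0], [0.2, 0.3, 0.4, 0.5])
print(params, score)
```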

  5. Training set for tomato
     • Used 399 of 428 non-redundant annotated genes (102 BACs); 29 excluded:
       • 14 in-frame stops
       • 5 one-nucleotide CDS
       • 10 redundant/overlapping
     • Used 1,760 donor sites, 1,783 acceptors and 391 start codons
       • Non-standard: 29 donors, 6 acceptors and 8 start codons
     Evaluation set for tomato
     • Used 362 of the 399 genes used in training
       • Excluded 37 genes containing non-canonical starts, donors or acceptors
     • Determined prediction accuracy on this set:
       • sensitivity & specificity at nucleotide, exon & gene level
     Employed the "leave-one-out" jackknife training/evaluation method to reduce bias in accuracy results.
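The accuracy measures named above can be sketched as follows, using the convention usual in gene-prediction evaluation: sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP). The function names and the representation of exons as `(start, end)` tuples with inclusive coordinates are assumptions made for this example, and the exon-level measure counts only exons whose two boundaries match exactly.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates


def coding_positions(exons: List[Interval]) -> Set[int]:
    """Expand exon intervals into the set of covered nucleotide positions."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(annotated: List[Interval],
                        predicted: List[Interval]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the nucleotide level."""
    real = coding_positions(annotated)
    pred = coding_positions(predicted)
    true_pos = len(real & pred)
    sensitivity = true_pos / len(real) if real else 0.0
    specificity = true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity


def exon_accuracy(annotated: List[Interval],
                  predicted: List[Interval]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) counting only exactly matching exons."""
    real, pred = set(annotated), set(predicted)
    exact = len(real & pred)
    sensitivity = exact / len(real) if real else 0.0
    specificity = exact / len(pred) if pred else 0.0
    return sensitivity, specificity


annotated = [(1, 100), (201, 300)]
predicted = [(1, 100), (211, 320)]
print(nucleotide_accuracy(annotated, predicted))
print(exon_accuracy(annotated, predicted))
```

In a leave-one-out scheme, these measures would be computed for each gene using parameters trained on all the other genes, then combined across genes.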

  6. Statistics of training set
     • intron length: 515 nt (29-6,972 nt)
     • exon length: 162 nt (2-1,888 nt)
     • CDS length: 970 nt (67-2,940 nt)
     • exons/gene: 5.5
     • avg. gene size: 4,951 nt
     • GC (coding): 43%
     • GC (intron): 33%
     • # of exons: 2,188
     • # of single-exon genes: 64
     • # of coding bases: 386,811
     • # of non-coding nt: 922,178
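A GC percentage like the coding and intron figures above can be computed with a few lines of code. This is a minimal sketch, not the script used for the slide: the function name `gc_content` is hypothetical, and it assumes ambiguous bases such as N should be left out of the denominator.

```python
from typing import Iterable


def gc_content(sequences: Iterable[str]) -> float:
    """Return the pooled G+C fraction over all sequences, ignoring non-ACGT symbols."""
    gc = 0
    total = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return gc / total if total else 0.0


# Toy input; real use would pass all exon (or all intron) sequences of the training set.
print(gc_content(["ATGC", "GGCC"]))
```

Pooling the counts across sequences, rather than averaging per-sequence fractions, weights each nucleotide equally regardless of which gene it comes from.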

  7. Statistics of training set: GC distributions

  8. Accuracy of new parameter file. Prediction/evaluation on full BACs: • BACs not fully annotated • Extract 400 nt-flanked genes from annotations
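Extracting a gene together with 400 nt of flanking sequence can be sketched as below. The function name `extract_flanked` and the 1-based inclusive coordinate convention are assumptions made for this example. The flank is clipped at the ends of the BAC sequence.

```python
from typing import Tuple


def extract_flanked(seq: str, gene_start: int, gene_end: int,
                    flank: int = 400) -> Tuple[str, int, int]:
    """Return the gene plus flanks and the 1-based inclusive region actually extracted."""
    start = max(1, gene_start - flank)
    end = min(len(seq), gene_end + flank)
    # Convert the 1-based inclusive region to a Python slice.
    return seq[start - 1:end], start, end


bac = "A" * 1000  # toy sequence standing in for a BAC
region, start, end = extract_flanked(bac, 500, 600)
print(len(region), start, end)
```

Returning the clipped start and end makes it possible to shift the annotated exon coordinates into the coordinate system of the extracted fragment.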

  9. Accuracy of new parameter file: prediction/evaluation on genes

  10. Geneid predictions using new parameter file • 102 BACs with annotations • 443 fully sequenced BACs (bacs.v72.seq) • Masked versions of the BACs above • Used the TIGR tomato/Arabidopsis known repeat sequences library (TIGR_SolAth_repeat.fa, obtained from SGN)

  11. Accuracy of new parameter file: gff2ps plot of one prediction

  12. Geneid predictions on chr 9 BACs using tomato parameter file

  13. Nucleotide ratios around splice sites and the start codon in tomato (figure panels: donor, acceptor, start)

  14. Chromosome 9 has a total of 142 markers. But... • Most markers are in heterochromatin • Most of them did not match any BAC • Gap of 46 cM

  15. Construction of a training set for the gene prediction program geneid • Waiting for a more complete set (Shibata's) • A parameter file constructed from 100 sequences from different Solanaceae species (50% tomato)

  16. Geneid predictions on 6 chr 9 BACs using sol parameter file. Tomato Sequencing, Madison, July 2006

  17. Geneid vs GeneSeqer • Geneid predicted 22 genes in the 114,526 bp BAC C09HBa0109D11.1 • Most of the predictions are supported by EST results, as shown by GeneSeqer • GeneSeqer is another gene identification tool, based on the "spliced alignment" of ESTs to the genomic sequence contained in the BAC

  18. Geneid with a parameter file obtained from Solanaceae applied to 6 BACs from chr 9

  19. Distribution of GC content in introns and exons of the Solanaceae sequences used to train geneid • To be improved when a large set of full-length (FL) sequences from tomato is available (Shibata's)

  20. European Commission EU-SOL. Vicky Fernandez, Sheila Zuniga, Angela Perez, Francisco Camara, Roderic Guigó, Miguel A Botella, Antonio Granell
