
Unraveling Tomato Genome with Bioinformatic Framework

This article discusses the use of a bioinformatic framework to reveal the secrets of the tomato genome. It examines data management, annotation, training/test gene sets, and highlights the importance of sharing efforts in genome annotation.


Presentation Transcript


  1. A Bioinformatic Framework to Unravel the Secrets of the Tomato Genome. 15 January 2006, PAG XIV, San Diego. Rémy Bruggmann, MIPS/IBI, GSF

  2. Outline • Introduction • Data management • Annotation • Training/Test gene set • Summary

  3. MIPS' look at the Green Side of Life: genome projects and database activities • Arabidopsis thaliana • Arabidopsis lyrata * • Capsella rubella * • Maize • Rice • Medicago • Lotus • Solanum lycopersicum

  4. MIPS' look at the Green Side of Life: genome projects and database activities • Need to streamline and unify databases as well as analytical schemas and operating routines • Strong synergism and very robust • Risk of losing flexibility and "custom-tailored" attractiveness • Awareness that not every genome and every community "is just the same"

  5. From Center Centric Strategies to distributed Approaches. Typically, genome projects undergo particular phases: • Sequenced BACs are annotated • Gene models are published to the community • This potentially generates competition rather than collaboration among groups

  6. From Center Centric Strategies to distributed Approaches. Consequences can be: • Underlying analytical procedures are not always tested, trained and evaluated • More or less pronounced differences exist between groups, leading to differing, contradictory and conflicting data

  7. Aim of all groups: an "information-enriched, high-quality genome backbone to address genome-scale biological questions"

  8. From Center Centric Strategies to distributed Approaches. An example ... • International Medicago Genome Annotation Group • Consists of groups participating in the annotation/bioinformatics programs of either the International or the European Medicago Genome Initiative • Agreement on common annotation standards, data exchange formats and naming conventions • Aims to produce and provide a unified high-quality Medicago data set

  9. From Center Centric Strategies to distributed Approaches Advantages of sharing efforts in genome annotation within a common annotation pipeline

  10. From Center Centric Strategies to distributed Approaches • Prevents: (i) duplicated effort; (ii) conflicts resulting from different annotation "standards" • Ensures high-quality annotation standards • Ensures common (gene) naming, leading to a common dataset • Integrates and profits from the knowledge and expertise of the individual groups

  11. Data management: All data should be organized in a genome database

  12. Wishlist for a modern genome db • Complete • Comprehensive • Up-to-date • Integrated • User interface • Application interface • State-of-the-art automatic analysis • Adaptable • Cross-genome comparison …low cost, low manpower...

  13. PlantsDB Philosophy • Plants Genome Resource: provides and integrates sequence data from European plant sequencing consortia along with publicly available data from the international initiative • PlantsDB communicates bioinformatic analysis data (visualization, genetic elements, structural data, ontologies, domains, ...; BLAST, browse and search, ...; comparative analysis) • Integration: provides a distributed network to integrate and retrieve data from heterogeneous resources using BioMOBY (connection to other plant DBs, PlaNET)

  14. Preliminary Annotation Pipeline Towards a preliminary annotation

  15. Repeat Detection [Diagram labels: Repeat Ontology; RepeatMasker; Masked sequences; Repeat annotation; Gene prediction; GAME XML]
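
The labels on slide 15 suggest that repeats are masked before gene prediction runs. A minimal sketch of what hard masking does to a sequence is given here for illustration only: RepeatMasker produces the masked sequence itself, and the function names `hard_mask` and `repeat_fraction` as well as the 0-based, half-open interval convention are assumptions, not part of the pipeline described in the talk.

```python
def hard_mask(sequence: str, repeats: list[tuple[int, int]]) -> str:
    """Replace every base inside a repeat interval with 'N'.

    Intervals are assumed to be 0-based and half-open: (start, end).
    """
    masked = list(sequence)
    for start, end in repeats:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        start = max(0, start)
        end = min(len(masked), end)
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


def repeat_fraction(masked: str) -> float:
    """Fraction of positions that are 'N' (includes any N already in the input)."""
    return masked.count("N") / len(masked) if masked else 0.0
```

For example, `hard_mask("ACGTACGTAC", [(2, 5)])` returns `"ACNNNCGTAC"`, and `repeat_fraction` of that result is 0.3.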

  16. [Diagram labels: EST DB; EST assemblies; Protein DB (e.g. SwissProt); External databases; PlantsDB; GAME XML; Gene prediction; Gene prediction programs: GenomeThreader, FGenesH++/ProtMap, GeneMarkHMM; Document of computational results; Manual annotation in Apollo Genome Viewer; Web access (GBrowse)]

  17. First Results

  18. RepeatMasker • 5.8 Mb analysed (48 BACs) • ~6.7% repetitive elements (<0.2% to 23% per BAC) • ~1 min/100 kb; whole genome (euchromatic part): ~2 days • State: December 2005 [Chart: repeat content (%) per BAC]
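
The runtime figures on slide 18 can be cross-checked with simple arithmetic. The sketch below assumes that runtime scales linearly with sequence length, that 1 Mb equals 1000 kb, and that the "~2 days" estimate refers to the same throughput; the function names are hypothetical.

```python
MIN_PER_100_KB = 1.0  # approximate throughput reported on slide 18


def runtime_minutes(size_kb: float) -> float:
    """Estimated masking time for a sequence of the given size in kb."""
    return size_kb / 100.0 * MIN_PER_100_KB


def size_mb_for_runtime(minutes: float) -> float:
    """Sequence size in Mb that can be processed in the given time."""
    return minutes / MIN_PER_100_KB * 100.0 / 1000.0
```

Under these assumptions, `runtime_minutes(5800)` gives 58.0 minutes for the 5.8 Mb analysed, and `size_mb_for_runtime(2 * 24 * 60)` gives 288.0, so the two-day estimate corresponds to roughly 290 Mb of sequence at that rate.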

  19. Preliminary Results Comparison of different gene finders

  20. ab initio predictions [Figure; track labels: EST/TC, GeneMark, FGeneSH]

  21. ab initio predictions

  22. ab initio predictions • FGeneSH++ and GeneMarkHMM currently often generate incomplete or wrong gene models • No matrices trained for tomato are available • Consequently: tomato-trained matrices will increase prediction quality dramatically • Needed: a collection of annotated high-quality genes as a training/test set for EuGene, FGeneSH, GeneMarkHMM, ...
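
One common way to quantify "incomplete or wrong gene models" is to compare predicted exons with a trusted reference at the exon level. The sketch below is not the evaluation used in the talk; it assumes exons are represented as (start, end) coordinate pairs on the same strand and counts only exact matches. The name `exon_accuracy` is hypothetical.

```python
Exon = tuple[int, int]  # (start, end) on one strand of one BAC


def exon_accuracy(predicted: set[Exon], reference: set[Exon]) -> tuple[float, float]:
    """Return (sensitivity, precision) for exactly matching exons.

    sensitivity = correctly predicted exons / reference exons
    precision   = correctly predicted exons / predicted exons
    (the gene-finding literature often calls the second value "specificity")
    """
    exact = predicted & reference
    sensitivity = len(exact) / len(reference) if reference else 0.0
    precision = len(exact) / len(predicted) if predicted else 0.0
    return sensitivity, precision
```

With a reference of three exons and a prediction that reproduces one of them exactly and shifts the boundary of another, the function reports a sensitivity of 1/3 and a precision of 1/2, which makes the effect of a single wrong splice site visible.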

  23. Training/Test Gene Set: How can we get a training/test set? • Map available tomato cDNA/ESTs to the BACs (use only high-confidence matches) • Link experimental data to the gene models • Use this gene set for ab initio gene finder training
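
The steps on slide 23 can be sketched as a filter followed by a split. The slide does not say what counts as a high-confidence match, so the coverage and identity thresholds below are placeholders, and the `SplicedAlignment` record, its fields and both function names are hypothetical rather than GenomeThreader output structures.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SplicedAlignment:
    gene_id: str
    coverage: float  # fraction of the cDNA/EST that is aligned, 0..1
    identity: float  # fraction of identical aligned bases, 0..1


def select_high_confidence(alignments, min_coverage=0.98, min_identity=0.98):
    """Keep only alignments that pass both (assumed) thresholds."""
    return [
        a for a in alignments
        if a.coverage >= min_coverage and a.identity >= min_identity
    ]


def split_training_test(genes, test_fraction=0.2, seed=0):
    """Deterministically shuffle the genes and split them into (training, test)."""
    shuffled = sorted(genes, key=lambda a: a.gene_id)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Keeping the test genes out of training is what allows the trained gene finder to be evaluated on models it has not seen; the fixed seed makes the split reproducible between runs.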

  24. GenomeThreader: used for EST/cDNA mapping • Similarity-based approach: ESTs/proteins are used to predict gene structure via optimal spliced alignments • Offers many options (full user control) • Incremental updates (avoid a lot of duplicated computation) • Improves on GeneSeqer

  25. GenomeThreader - calculations (single CPU, euchromatic part)

  26. Example [Figure; labels: Tobacco, Potato, Microtom, Tomato]

  27. Examples - UK

  28. Example

  29. Number of high-quality genes • Number of genes: 164 (covered completely by cDNA/ESTs; only very good alignments considered) • ~3.4 genes/BAC (range: 0 to 9 genes/BAC) • These genes can be used to train gene finders [Chart: number of genes per BAC]

  30. Gene Finder: Which program can be trained for tomato? One possibility is EuGene (VIB Gent) • Performed well, e.g. for Arabidopsis and Medicago • Available as soon as the test/training gene set is large enough

  31. EuGene - overview [Diagram. Plugins: statistical contents (DNA Markov, AA Markov); splice sites (NetGene2, GeneSplicer, SpliceMachine, SplicePredictor); start sites (SpliceMachine, NetStart, ATRPred); similarities (EST similarities, protein similarities, FL cDNA); repeats; exon conservation. Workflow: plugin training (TRAINING), optimizing the plugin combination (OPTIM), test (TEST); each step needs one dataset]

  32. EuGene • First-round training: 500 high-quality tomato genes; statistical models of Arabidopsis codon usage and splice sites will be used • Second-round training: 2000 high-quality tomato genes; build a tomato-only version of EuGene • Approx. 150 BACs are needed for first-round training

  33. Current state of sequenced BACs. Total number of BACs: • Unfinished: 71 • Finished: 87 • Available: 52
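
The BAC requirement quoted in the talk follows from the gene density on slide 29 and the training targets on slide 32. The sketch below reproduces that arithmetic, assuming that future BACs yield high-quality genes at the same ~3.4 genes/BAC as the BACs analysed so far; the function and variable names are hypothetical.

```python
import math

GENES_PER_BAC = 3.4   # slide 29: ~3.4 high-quality genes per BAC
AVAILABLE_BACS = 52   # slide 33: BACs currently available


def bacs_needed(target_genes: int, genes_per_bac: float = GENES_PER_BAC) -> int:
    """Smallest number of BACs expected to yield the target number of genes."""
    return math.ceil(target_genes / genes_per_bac)


first_round = bacs_needed(500)              # 148
second_round = bacs_needed(2000)            # 589
shortfall = first_round - AVAILABLE_BACS    # 96
```

The results agree with the rounded figures in the talk: about 150 BACs for first-round training and roughly another 100 beyond the 52 already available. A tomato-only second round would need close to 600 BACs at the same density.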

  34. Summary • ab initio gene finders are not yet calibrated for tomato • A test/training gene set is needed to calibrate the gene finders • We need another 100 BACs to get enough genes for a first round of EuGene training • GenomeThreader produces good alignments only with ESTs from SOL species (tomato, potato, tobacco) • More repeats will be detected (and will be included in the RepeatMasker library)

  35. Acknowledgments • Sequencing & Assembly (Chromosome 4), Sanger Institute: Christine Nicholson, Sean Humphray • MPIZ Köln: Heiko Schoof • EuGene, VIB Gent: Stephane Rombauts • GenomeThreader, University of Hamburg: Gordon Gremme, Stefan Kurtz, Volker Brendel • Automated annotation, MIPS: Heidrun Gundlach, Georg Haberer, Manuel Spannagl, Klaus F.X. Mayer • Manual Annotation/Curation/Web-site (Chromosome 4), Imperial College: Daniel Buchan, James Abbot, Sarah Butcher, Gerard Bishop

