Unraveling Tomato Genome with Bioinformatic Framework

A Bioinformatic Framework to Unravel the Secrets of the Tomato Genome 15 January 2006, PAG XIV, San Diego Rémy Bruggmann, MIPS/IBI, GSF

Outline • Introduction • Data management • Annotation • Training/Test gene set • Summary

MIPS' look at the Green Side of Life – genome projects and database activities – • Arabidopsis thaliana • Arabidopsis lyrata * • Capsella rubella * • Maize • Rice • Medicago • Lotus • Solanum lycopersicum

MIPS' look at the Green Side of Life – genome projects and database activities – • Need to streamline and unify databases as well as analytical schemas and operating routines • Strong synergism and very robust • Risk of losing flexibility and “custom-tailored” attractiveness • Awareness that not every genome and every community “is just the same”

From Center Centric Strategies to distributed Approaches Typically, genome projects undergo particular phases: • Sequenced BACs are annotated • Gene models are published to the community • Potentially generates competition rather than collaboration among groups

From Center Centric Strategies to distributed Approaches Consequences can be: • Underlying analytical procedures are not always tested, trained and evaluated • More or less pronounced differences exist between groups → differing, contradicting and conflicting data

Aim of all groups: “information-enriched, high-quality genome backbone to address genome-scale biological questions”

From Center Centric Strategies to distributed Approaches An example ... • International Medicago Genome Annotation Group • Consists of groups participating in the annotation/bioinformatics programs of either the International or the European Medicago Genome Initiative • Agreement on common annotation standards, data exchange formats and naming conventions • Aims to produce and provide a unified, high-quality Medicago data set

From Center Centric Strategies to distributed Approaches Advantages of sharing efforts in genome annotation within a common annotation pipeline

From Center Centric Strategies to distributed Approaches • Prevents: (i) duplicated efforts (ii) conflicts resulting from different annotation “standards” • Ensures high-quality annotation standards • Ensures common (gene) naming → common dataset • Integrates and profits from the knowledge and expertise of the individual groups

Data management All data should be organized in a genome database

Wishlist for a modern genome db • Complete • Comprehensive • Up-to-date • Integrated • User interface • Application interface • State-of-the-art automatic analysis • Adaptable • Cross-genome comparison …low cost, low manpower...

PlantsDB Philosophy • Plants Genome Resource: provides and integrates sequence data from European plant sequencing consortia along with publicly available data from the international initiative • PlantsDB communicates bioinformatic analysis data (visualization, genetic elements, structural data, ontologies, domains...; BLAST, browse and search, … comparative analysis) • Integration: provides a distributed network to integrate and retrieve data from heterogeneous resources using BioMOBY (connection to other plant DBs, PlaNET)

Preliminary Annotation Pipeline Towards a preliminary annotation

Repeat Detection [Pipeline diagram] • Repeat Ontology • RepeatMasker • Masked sequences • Repeat annotation • Gene prediction • GAMEXML
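The masking step that feeds gene prediction can be illustrated with a minimal sketch. It assumes repeat hits are already available as 0-based, half-open (start, end) intervals; that representation and the function names are hypothetical and are not RepeatMasker's own output format or interface.

```python
def mask_repeats(sequence, repeats, mask_char="N"):
    """Replace every base inside a repeat interval with mask_char.

    repeats: iterable of (start, end) pairs, 0-based, end exclusive
    (an assumed representation for this sketch).
    """
    masked = list(sequence)
    for start, end in repeats:
        # Clamp intervals to the sequence so out-of-range hits are harmless.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    return "".join(masked)


def masked_fraction(masked_sequence, mask_char="N"):
    """Fraction of positions that were masked (0.0 for an empty sequence)."""
    if not masked_sequence:
        return 0.0
    return masked_sequence.count(mask_char) / len(masked_sequence)


toy_bac = "ACGTACGTAC"
toy_masked = mask_repeats(toy_bac, [(2, 5)])
print(toy_masked, masked_fraction(toy_masked))
```

Hard masking with "N" is only one option; lower-casing the repeat bases (soft masking) keeps the sequence available to tools that can use it.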

Gene Prediction [Pipeline diagram] • Data sources: EST DB, EST assemblies, protein DB (e.g. SwissProt), PlantsDB, external databases • GAMEXML • Gene prediction programs: GenomeThreader, FGenesH++/ProtMap, GeneMarkHMM • Document of computational results • Manual annotation in Apollo Genome Viewer • Web access: GBrowse

First Results

RepeatMasker • 5.8 Mb analysed (48 BACs) • ~6.7% repetitive elements (<0.2% to 23% per BAC) • ~1 min/100 kb; whole genome (euchromatic part): ~2 days [Figure: repeat content [%] per BAC] State: December 2005
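The runtime figures on this slide can be checked with a short calculation. This is only a sketch that assumes the quoted rate of about 1 minute per 100 kb scales linearly on a single CPU; the function names are hypothetical.

```python
MINUTES_PER_100_KB = 1.0  # rate quoted on the slide (approximate)


def repeatmasker_minutes(length_bp, minutes_per_100kb=MINUTES_PER_100_KB):
    """Estimated run time in minutes, assuming linear scaling with length."""
    return length_bp / 100_000 * minutes_per_100kb


def bases_in_minutes(minutes, minutes_per_100kb=MINUTES_PER_100_KB):
    """Number of bases that can be processed in the given time at that rate."""
    return minutes / minutes_per_100kb * 100_000


# The 5.8 Mb analysed so far would take roughly one hour.
print(repeatmasker_minutes(5_800_000))

# Two days at the same rate corresponds to roughly 288 Mb of sequence.
print(bases_in_minutes(2 * 24 * 60) / 1e6)
```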

Preliminary Results Comparison of different gene finders

ab initio predictions [Figure: tracks labelled EST/TC, GeneMark and FGeneSH]

ab initio predictions • FGeneSH++ and GeneMarkHMM currently often generate incomplete or wrong gene models • No matrices trained for tomato are available • → Tomato matrices will increase prediction quality dramatically • → Collect annotated, high-quality genes as a training/test set for EuGene, FGeneSH, GeneMarkHMM, ...

Training/Test Gene Set How can we get a training/test set? • Map available tomato cDNA/ESTs to the BACs (use only high-confidence matches) • Link experimental data to the gene models • Use this gene set for ab initio gene finder training
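The first step, keeping only high-confidence cDNA/EST matches, could look like the following minimal sketch. The `Alignment` record, its fields and both thresholds are hypothetical illustrations; they are not GenomeThreader output fields or cut-offs stated in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    est_id: str
    bac_id: str
    coverage: float  # fraction of the EST/cDNA that is aligned, 0..1 (assumed field)
    identity: float  # fraction of identical aligned bases, 0..1 (assumed field)


def high_confidence(alignments, min_coverage=0.95, min_identity=0.97):
    """Keep alignments that pass both (hypothetical) thresholds."""
    return [
        a for a in alignments
        if a.coverage >= min_coverage and a.identity >= min_identity
    ]


def group_by_bac(alignments):
    """Collect the retained alignments per BAC for later gene-model linking."""
    grouped = defaultdict(list)
    for a in alignments:
        grouped[a.bac_id].append(a)
    return dict(grouped)


example = [
    Alignment("est1", "bacA", 0.99, 0.99),  # passes both thresholds
    Alignment("est2", "bacA", 0.60, 0.99),  # coverage too low
    Alignment("est3", "bacB", 0.98, 0.90),  # identity too low
]
kept = high_confidence(example)
print(group_by_bac(kept))
```

Strict thresholds shrink the gene set but reduce the risk of training gene finders on wrong models.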

GenomeThreader GenomeThreader used for EST/cDNA mapping: • Similarity-based approach: ESTs/proteins are used to predict gene structure via optimal spliced alignments • Offers many options (full user control) • Incremental updates (avoid a lot of duplicated computation) • Improved GeneSeqer

GenomeThreader - calculations (single CPU, euchromatic part)

Example [Figure: tracks labelled Tobacco, Potato, Microtom and Tomato]

Examples - UK

Example

Number of high quality genes • Number of genes: 164 (covered completely by cDNA/ESTs; only very good alignments considered) • ~3.4 genes/BAC (range: 0-9 genes/BAC) • These genes can be used to train gene finders [Figure: number of genes per BAC]

Gene Finder Which program can be trained for tomato? • One possibility is EuGene (VIB Gent), which performed well e.g. for Arabidopsis and Medicago • Available as soon as the test/training gene set is large enough

EuGene - overview [Diagram] Plugins: • Statistical contents: DNA Markov, AA Markov • Splice sites: NetGene2, GeneSplicer, SpliceMachine, SplicePredictor • Start sites: SpliceMachine, NetStart, ATRPred • Similarities: EST similarities, protein similarities, FL cDNA, repeats, exon conservation Steps: plugin training, optimize plugin combination, test; each step needs one dataset (TRAINING, OPTIM, TEST)
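Because training, optimization and testing each need their own dataset, the curated gene set has to be divided into three disjoint parts. A minimal sketch of such a split follows; the 60/20/20 proportions, the seed and the function name are assumptions for illustration and are not part of EuGene.

```python
import random


def split_gene_set(gene_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split gene IDs into disjoint TRAINING, OPTIM and TEST sets.

    fractions: (training, optim, test) proportions, hypothetical defaults;
    the test set receives whatever remains after the first two parts.
    """
    ids = sorted(set(gene_ids))        # deduplicate and fix the starting order
    random.Random(seed).shuffle(ids)   # reproducible shuffle
    n_train = int(len(ids) * fractions[0])
    n_optim = int(len(ids) * fractions[1])
    return {
        "training": ids[:n_train],
        "optim": ids[n_train:n_train + n_optim],
        "test": ids[n_train + n_optim:],
    }


parts = split_gene_set([f"gene{i}" for i in range(164)])
print({name: len(ids) for name, ids in parts.items()})
```

Keeping the three sets disjoint prevents the test result from being inflated by genes the plugins have already seen.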

EuGene • First-round training: - 500 high-quality tomato genes - Arabidopsis statistical models for codon usage and splice sites will be used • Second-round training: - 2000 high-quality tomato genes - Build a tomato-only version of EuGene • Approx. 150 BACs needed for first-round training

Current state of sequenced BACs Total number of BACs: - unfinished: 71 - finished: 87 - available: 52
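The BAC requirement can be reproduced from the figures in this talk. The sketch below assumes that the 164 high-quality genes come from the 48 BACs analysed so far (consistent with ~3.4 genes/BAC) and that the gene yield per BAC stays constant; the function name is hypothetical.

```python
import math

GENES_FOUND = 164      # high-quality genes covered completely by cDNA/ESTs
BACS_ANALYSED = 48     # assumption: the 48 BACs of the RepeatMasker analysis
BACS_AVAILABLE = 52    # BACs currently available

genes_per_bac = GENES_FOUND / BACS_ANALYSED  # about 3.4


def bacs_needed(target_genes, yield_per_bac=genes_per_bac):
    """BACs required to reach target_genes at a constant yield per BAC."""
    return math.ceil(target_genes / yield_per_bac)


first_round = bacs_needed(500)     # close to the "approx. 150 BACs" quoted
second_round = bacs_needed(2000)
still_missing = first_round - BACS_AVAILABLE  # on the order of 100 more BACs

print(round(genes_per_bac, 1), first_round, second_round, still_missing)
```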

Summary • ab initio gene finders are not yet calibrated to tomato • A test/training gene set is needed to calibrate the gene finders • We need another 100 BACs to get enough genes for a first-round training of EuGene • GenomeThreader produces good alignments only with ESTs from SOL species (Tomato, Potato, Tobacco) • More repeats will be detected (and included in the RepeatMasker library)

Acknowledgments • Sequencing & Assembly (Chromosome 4), Sanger Institute: Christine Nicholson, Sean Humphray • MPIZ Köln: Heiko Schoof • EuGene, VIB Gent: Stephane Rombauts • GenomeThreader, University of Hamburg: Gordon Gremme, Stefan Kurtz, Volker Brendel • Automated annotation, MIPS: Heidrun Gundlach, Georg Haberer, Manuel Spannagl, Klaus F.X. Mayer • Manual Annotation/Curation/Web-site (Chromosome 4), Imperial College: Daniel Buchan, James Abbot, Sarah Butcher, Gerard Bishop
