Complete Arabidopsis Transcriptome Micro Array Design and Data Basing of Genome-wide Gene Specific Tags
The partners in CATMA
France Pierre Hilson* (coord.), Vincent Colot Pierre Rouzé Belgium Paul Van Hummelen Germany Wilfried Nietfeld United Kingdom Jim Beynon Mark Crowe / Martin Trick Switzerland Philippe Reymond Netherlands Peter Weisbeek Spain Javier Paz-Ares Rishikesh Bhalerao Sweden
CATMA is a consortium of 10 research groups from 8 European countries, formed in October 2000.
Gene Specific Tags for Micro Arrays
[Diagram labels: gene specific data; functional studies]
GOAL Construction of a Gene Specific Tag (GST) collection representing most Arabidopsis genes
STEPS
• Homogeneous structural re-annotation of the whole genome sequence using EUGENE
• Search for the location of a Gene Specific Tag in each gene and design of primers for PCR amplification using SPADS
• Build the CATMA database and enter the GST, primer and gene data into it
CATMA flow chart
Genome sequence
-> EUGENE (T. Schiex): structural annotation
-> Gene models
-> SPADS (V. Thareau): design of GST primers
-> GST & primer sequences
-> PCR from BACs/genome
-> Gene Sequence Tags
-> Spotting
-> Micro Arrays
CATMA Database (M. Crowe)
WHY?
At the beginning of the project (October 2000), the AGI annotation was nearly complete, but it suffered from major drawbacks:
• annotation methodology differed from one AGI consortium to another
• the annotation was produced over several years, so the earliest and the latest annotations differ in quality
• gene models were often wrong
HOW?
Based on a validation of existing gene prediction tools (Pavy et al., 1999), we had a view of the efficiency of each tool for each gene feature and for gene modelling as a whole.
A “parasitic” software (EUGENE) was developed (Schiex et al., 2001); it integrates the various available sources of information to produce gene models for the whole Arabidopsis genome.
VALIDATION
The data set: AraSet
566 kb of Arabidopsis genome sequence in 74 contigs of documented genes, each manually checked for consistency
  57 contigs of 2 genes -> 114 genes
  14 contigs of 3 genes -> 42 genes
  3 contigs of 4 genes -> 12 genes
Total: 168 genes (1028 exons, 860 introns) and 94 intergenic sequences
Sensitivity and specificity
Sensitivity (Se): true predictions / actual cases. How often is the software correct?
Specificity (Sp): true predictions / total predictions. How many false predictions are given?
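The two ratios above can be sketched in a few lines of Python. This is a minimal illustration, not code from the CATMA project: the function names are made up, and the prediction counts are hypothetical (only the 1028 real exons come from AraSet). Note that Sp as defined on this slide is the share of predictions that are correct, which some fields call precision.

```python
def sensitivity(true_predictions: int, actual_cases: int) -> float:
    """Se = true predictions / actual cases."""
    return true_predictions / actual_cases


def specificity(true_predictions: int, total_predictions: int) -> float:
    """Sp = true predictions / total predictions (as defined on the slide)."""
    return true_predictions / total_predictions


# AraSet contains 1028 real exons (from the slides).
actual_exons = 1028
# Hypothetical output of some predictor, for illustration only.
predicted_exons = 1000
correct_exons = 800

se = sensitivity(correct_exons, actual_exons)
sp = specificity(correct_exons, predicted_exons)
print(f"Se = {se:.1%}, Sp = {sp:.1%}")
```

A predictor that calls very few exons can reach a high Sp with a low Se, which is why the evaluations report both figures side by side.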
1999 evaluation of exon prediction
Real exons: 1028
Pavy et al. (1999) Bioinformatics 15:887-899
1999 evaluation of gene prediction
* Correct gene model = every exon exactly predicted
Pavy et al. (1999) Bioinformatics 15:887-899
EUGENE, a Directed Acyclic Graph Algorithm
Schiex et al. (2001) LNCS, 2066:111-125
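As a generic illustration of why a directed acyclic graph suits this kind of problem, the sketch below finds the highest-scoring path through a small DAG by dynamic programming. It is not EUGENE's actual graph or scoring scheme: the node numbering, the edge scores and the function name best_path are all assumptions made for the example.

```python
from collections import defaultdict


def best_path(edges, source, sink):
    """Return (score, path) of the highest-scoring path from source to sink.

    edges: list of (u, v, score) tuples. Nodes are integers numbered in
    topological order, so every edge satisfies u < v (assumption of this sketch).
    Raises KeyError if the sink cannot be reached from the source.
    """
    outgoing = defaultdict(list)
    nodes = set()
    for u, v, score in edges:
        outgoing[u].append((v, score))
        nodes.update((u, v))

    best = {source: 0}   # best score found so far for each reachable node
    back = {}            # predecessor of each node on its best path
    for u in sorted(nodes):          # topological order
        if u not in best:
            continue                 # not reachable from the source
        for v, score in outgoing[u]:
            candidate = best[u] + score
            if v not in best or candidate > best[v]:
                best[v] = candidate
                back[v] = u

    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    return best[sink], path[::-1]


# Toy graph with made-up integer scores.
toy_edges = [(0, 1, 10), (0, 2, 5), (1, 2, 1), (1, 3, 2), (2, 3, 15)]
score, path = best_path(toy_edges, source=0, sink=3)
print(score, path)
```

Because each node is visited once in topological order, the running time grows linearly with the number of edges, which is what makes a globally optimal path tractable on long sequences.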
EUGENE features
Integrates different sources of information, e.g. in the current Arabidopsis v2 version:
  built in: IMM for exon/intron/UTR/intergenic
  plug in: NetGene2, SplicePredictor, NetStart
  filters: RepeatMasker
  homology: BLAST search in protein & DNA DB; Sim4 search in EST/cDNA collections
  borders: function to use 5’ & 3’ EST data
Globally optimized to maximize gene prediction accuracy on a set of annotated sequences
2002 evaluation of gene prediction
Software       Sensitivity (%)   Specificity (%)   Author
GenScan        17                19                C. Burge
GeneMark.hmm   41                37                M. Borodovsky
GlimmerA       30                19                S. Salzberg
FgenesH-GC     57                55                V. Solovyev
Eugene         67                56                T. Schiex
Eugene+        76                68
THE Arabidopsis GENOME
AGI: 26514 genes
EUGENE: 29787 genes (+12.3%)
Example: chromosome I
[Figure: number of genes along chromosome I (position in Mb), AGI vs EuGène]
AGI = 6494 genes
EUGENE = 7489 genes
Aubourg, Samson, Brunaud & Lecharny URGV INRA/CNRS Evry, poster
EuGene vs AGI
AGI and EUGENE genes with exactly the same predicted structure: 10352
AGI and EUGENE genes with the same start and stop codons but internal differences: 3565
[Diagrams: paired AGI and EuGene gene models with start and stop positions marked]
EuGene vs AGI comparison using FLAGdb++: V. Brunaud, F. Samson, A. Lecharny, S. Aubourg -> poster
EuGene vs AGI
AGI genes without cognate EuGene gene: 2255
EuGene genes without cognate AGI gene: 3379
[Diagrams: AGI and EuGene gene models with start and stop positions marked]
EuGene vs AGI
AGI genes with at least 2 overlapping/inserted EuGene genes (Split of EuGene): 2191
EuGene genes with at least 2 overlapping/inserted AGI genes (Split of AGI): 409
[Diagrams: AGI and EuGene gene models with start and stop positions marked]
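The comparison categories used on these EuGene vs AGI slides can be approximated with a short sketch. The slides do not give the exact criteria applied in FLAGdb++, so the representation (a gene model as a sorted list of (start, end) exon coordinates on one strand) and the functions compare_models and is_split are assumptions for illustration only.

```python
def span(model):
    """Genomic extent of a gene model given as a sorted list of (start, end) exons."""
    return model[0][0], model[-1][1]


def overlaps(a, b):
    """True if the extents of two gene models share at least one position."""
    a_start, a_end = span(a)
    b_start, b_end = span(b)
    return a_start <= b_end and b_start <= a_end


def compare_models(a, b):
    """Classify a pair of gene models from two annotations."""
    if a == b:
        return "identical structure"
    if span(a) == span(b):
        return "same start and stop, internal differences"
    if overlaps(a, b):
        return "overlapping, different boundaries"
    return "no overlap"


def is_split(model, other_annotation):
    """True if at least 2 models of the other annotation overlap this one."""
    return sum(overlaps(model, other) for other in other_annotation) >= 2


# Made-up coordinates, for illustration only.
agi_gene = [(100, 200), (300, 400)]
print(compare_models(agi_gene, [(100, 200), (300, 400)]))
print(compare_models(agi_gene, [(100, 220), (300, 400)]))
print(is_split([(100, 1000)], [[(100, 400)], [(600, 1000)]]))
```

A gene with no overlapping model in the other annotation corresponds to the "without cognate gene" counts, and is_split corresponds to the split counts.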
The CATMA gene set
29787 EUGENE predicted genes
29555
2334 documented/manually checked
1484 AGI genes not detected by EUGENE
32 non-coding RNAs (P. Green)
201 controls
31272 complete CATMA non-redundant gene set
RATIONALE
The GSTs are designed to be specific for a single gene, even if it is a member of a gene family.
1) Find a specific sequence for each gene
[Diagram: genes A and B share a region of homology; the A probe has to be designed outside this region]
2) Amplify a region => 2 primers, 1 probe per gene
SPADS: Specific Primers & Amplicon Design Software (Vincent Thareau)
GST location in the transcript:
* The GST is entirely inside an exon or overlaps an intron (then 50% of the GST sequence is in exons)
* GSTs are searched in the 3’->5’ direction, to take into account the bias towards partial mRNAs missing 5’ sequences
GST size: 150 to 500 bp
Specificity:
* GST specificity: checked with BLASTn against the whole genome
* Primer specificity: checked with BLASTn against the PCR template
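The 3’->5’ search with a 150 to 500 bp size window can be sketched as follows. This is not the SPADS implementation, which the slides do not describe in detail: the function find_gst, the 50 bp step, the preference for the longest window and the is_specific callback (a stand-in for the BLASTn check against the genome) are all assumptions.

```python
from typing import Callable, Optional, Tuple

MIN_GST = 150  # minimum GST size in bp (from the slide)
MAX_GST = 500  # maximum GST size in bp (from the slide)


def find_gst(
    transcript: str,
    is_specific: Callable[[str], bool],
    step: int = 50,  # hypothetical scanning step
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first specific window, scanning from the 3' end.

    Windows ending closest to the 3' end are tried first; for each end
    position, longer windows are tried before shorter ones.
    """
    for end in range(len(transcript), MIN_GST - 1, -step):
        for size in range(MAX_GST, MIN_GST - 1, -step):
            start = end - size
            if start < 0:
                continue  # window would run past the 5' end
            if is_specific(transcript[start:end]):
                return start, end
    return None  # no GST could be designed


# Toy transcript: "N" marks a region shared with a paralogue (made-up data).
toy_transcript = "A" * 400 + "N" * 300 + "A" * 300
print(find_gst(toy_transcript, lambda seq: "N" not in seq))
```

In the toy example the returned window covers the 300 bp downstream of the shared region, which mirrors the rationale of designing the probe outside the region of homology.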
CATMA v.1
H class: 47% (similarity with the closest paralogue below 40%)
M class: 18% (similarity with the closest paralogue below 70%)
No GST: 35%
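The two class thresholds translate into a simple rule, sketched below. The slide does not say exactly how similarity with the closest paralogue is measured, nor that a similarity of 70% or more is the only reason a gene ends up without a GST, so the function gst_class and its None return value are assumptions.

```python
from typing import Optional


def gst_class(similarity_pct: float) -> Optional[str]:
    """Assign a GST to the H or M class from its similarity (%) to the closest paralogue."""
    if similarity_pct < 40:
        return "H"
    if similarity_pct < 70:
        return "M"
    return None  # neither class under the thresholds on the slide


for value in (25.0, 55.0, 85.0):  # made-up similarity values
    print(value, gst_class(value))
```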
CATMA v.1 features
21120 Gene Specific Tags
Almost two-thirds of the GSTs are located in the 3’-most part of the transcript
97.4% of the GSTs are entirely in exons
2.6% of the GSTs overlap introns
The CATMA database
Mark Crowe, John Innes Centre
http://www.catma.org
QUERIES
Preset Queries
BLAST
Advanced SQL Queries
CATMA v.2
Since January 2001, new genome data have become available, especially full-length cDNAs and 5’/3’ border ESTs (CERES, RIKEN), and the current annotation has improved (TIGR, MIPS).
A second round of annotation is ongoing, using a new version of EUGENE that can exploit 5’/3’ ESTs.
CATMA v.2
New GSTs will be designed with SPADS for new genes and wherever the CATMA v.1 GSTs are no longer supported by the EUGENE re-annotation.
For predicted genes for which no GST could be designed, SPADS will be re-run after adding a 150 bp tail after the STOP codon (virtual 3’UTR).
Objective: >= 25000 GSTs
CAGE micro-array analysis
Coordination: Martin Kuiper
Goal: to allow comparison of micro-array transcript profiling experiments
AGRIKOLA
Coordinator: Ian Small
Arabidopsis Genomic RNA Interference Knock-Out Line Analysis
GOAL: lines silenced specifically for most Arabidopsis genes
GST cloning
GST-based hpRNA vectors
Comprehensive silenced line collection
Acknowledgements
Carine Serizet, Ghent (GénoPlante/VIB)
Vincent Thareau, Ghent (GénoPlante)
Mark Crowe, JIC Norwich
Sébastien Aubourg, VIB Ghent / URGV Evry
Thomas Schiex, Sylvain Foissac, INRA Toulouse
Patrice Déhais / Eric Bonnet, VIB Gent
Stephane Rombauts, VIB Gent
Pierre Rouzé, INRA Ghent
Pierre Hilson, URGV Evry / VIB Ghent
Funding: GénoPlante, URGV, VIB