Design and Data Basing of Genome-wide Gene Specific Tags

Complete Arabidopsis Transcriptome Micro Array Design and Data Basing of Genome-wide Gene Specific Tags

the partners in CATMA
France: Pierre Hilson* (coord.), Vincent Colot, Pierre Rouzé
Belgium: Paul Van Hummelen
Germany: Wilfried Nietfeld
United Kingdom: Jim Beynon, Mark Crowe / Martin Trick
Switzerland: Philippe Reymond
Netherlands: Peter Weisbeek
Spain: Javier Paz-Ares
Sweden: Rishikesh Bhalerao
CATMA is a consortium of 10 research groups from 8 European countries, formed in October 2000

functional studies gene specific data Gene Specific Tags for Micro Arrays

GOAL Construction of a Gene Specific Tag (GST) collection representing most Arabidopsis genes

STEPS
• Homogeneous structural re-annotation of the whole genome sequence using EUGENE
• Search for the Gene Specific Tag location in each gene and design of primers for PCR amplification using SPADS
• Build the CATMA database and enter the GST, primer and gene data into it

CATMA flow chart
Genome sequence -> EUGENE (T. Schiex): Structural Annotation -> Gene models -> SPADS (V. Thareau): Design of GST primers -> GST & primer sequences -> PCR from BACs/genome -> Gene Sequence Tags -> Spotting -> Micro Arrays
CATMA Database (M. Crowe)

STRUCTURAL ANNOTATION

WHY? At the beginning of the project (October 2000), the AGI annotation was nearly complete, but it suffered from major drawbacks:
• annotation methodology differed from one AGI consortium to another
• the annotations were produced over several years, so the earliest and the latest differ in quality
• gene models were often wrong

HOW? Based on a validation of existing gene prediction tools (Pavy et al., 1999), we had a view of the efficiency of each of them, for each gene feature and for gene modeling as a whole. A “parasitic” software (EUGENE) was developed (Schiex et al., 2001), which integrates the various sources of information available to produce gene models for the whole Arabidopsis genome.

VALIDATION
the data set: AraSet
566 kb of Arabidopsis genome sequence containing 74 gene contigs of documented genes, each manually checked for consistency
57 contigs of 2 genes -> 114
14 contigs of 3 genes -> 42
3 contigs of 4 genes -> 12
168 genes (1028 exons, 860 introns)
94 intergenic sequences

sensitivity and specificity
sensitivity (Se): true predictions / actual cases - how often is the software correct?
specificity (Sp): true predictions / total predictions - how many false predictions are given?
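The two ratios defined on this slide can be written as a short Python sketch. The function names and the example counts (a tool predicting 1000 exons, 800 of them exactly right) are hypothetical and only illustrate the arithmetic. The figure of 1028 real exons is the AraSet exon count from the validation slide. Note that Sp as defined here is what is often called precision elsewhere.

```python
def sensitivity(true_predictions, actual_cases):
    """Se = true predictions / actual cases."""
    if actual_cases <= 0:
        raise ValueError("actual_cases must be positive")
    return true_predictions / actual_cases


def specificity(true_predictions, total_predictions):
    """Sp = true predictions / total predictions (as defined on the slide)."""
    if total_predictions <= 0:
        raise ValueError("total_predictions must be positive")
    return true_predictions / total_predictions


# Hypothetical tool: 1000 exons predicted, 800 of them exactly correct,
# evaluated against the 1028 real exons of AraSet.
se = sensitivity(800, 1028)
sp = specificity(800, 1000)
print(f"Se = {se:.3f}, Sp = {sp:.3f}")
```

A tool can score high on one measure and low on the other, which is why the evaluation slides report both.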

1999 evaluation of exon prediction Real exons : 1028 Pavy et al. (1999) Bioinformatics 15:887-899

1999 evaluation of gene prediction
* Correct gene model = every exon exactly predicted
Pavy et al. (1999) Bioinformatics 15:887-899

EUGENE, a Directed Acyclic Graph Algorithm Schiex et al. (2001) LNCS, 2066:111-125

EUGENE features
Integrate different sources of information, e.g. in the current Arabidopsis v2 version:
• built in: IMM for exon/intron/UTR/intergenic
• plug in: NetGene2, SplicePredictor, NetStart
• filters: RepeatMasker
• homology: BLAST search in protein & DNA DB, Sim4 search in EST/cDNA collections
• borders: function to use 5’ & 3’ EST data
Globally optimized to maximize gene prediction accuracy on a set of annotated sequences
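To give an idea of how a directed acyclic graph can turn combined evidence into one gene model, here is a toy Python sketch. It is not EUGENE's code: the three states, the allowed transitions, the switch penalty and the per-position scores are all assumptions made for illustration. Each node of the graph is a (position, state) pair, and the best-scoring path through the graph is found by dynamic programming.

```python
# Toy illustration only: states, transitions and scores are hypothetical
# and do not reproduce EUGENE's actual model.
STATES = ("intergenic", "exon", "intron")

# Assumed transitions allowed between two consecutive positions.
ALLOWED = {
    "intergenic": ("intergenic", "exon"),
    "exon": ("exon", "intron", "intergenic"),
    "intron": ("intron", "exon"),
}
SWITCH_PENALTY = 2.0  # assumed cost paid whenever the state changes


def best_path(scores):
    """Return (total_score, state_path) of the best path through the DAG.

    scores: one dict per sequence position, mapping each state to a
    combined evidence score (higher is better).
    """
    # Best path ending in each state at the first position.
    best = {s: (scores[0][s], [s]) for s in STATES}
    for pos_scores in scores[1:]:
        new_best = {}
        for state in STATES:
            candidates = []
            for prev, (total, path) in best.items():
                if state in ALLOWED[prev]:
                    penalty = 0.0 if prev == state else SWITCH_PENALTY
                    candidates.append(
                        (total - penalty + pos_scores[state], path + [state])
                    )
            new_best[state] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])


# Hypothetical evidence for a 6-position sequence.
example = [
    {"intergenic": 5, "exon": 0, "intron": 0},
    {"intergenic": 0, "exon": 5, "intron": 0},
    {"intergenic": 0, "exon": 5, "intron": 0},
    {"intergenic": 0, "exon": 0, "intron": 6},
    {"intergenic": 0, "exon": 5, "intron": 0},
    {"intergenic": 5, "exon": 0, "intron": 0},
]
score, path = best_path(example)
print(score, path)
```

Because every edge goes from one position to the next, the graph has no cycles and the best path is found in a single left-to-right pass.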

a typical EUGENE output graph

2002 evaluation of gene prediction

Program       Sensitivity %   Specificity %   (author)
GenScan       17              19              C. Burge
GenMark.hmm   41              37              M. Borodovsky
GlimmerA      30              19              S. Salzberg
FgenesH-GC    57              55              V. Solovyev
Eugene        67              56              T. Schiex
Eugene+       76              68

THE Arabidopsis GENOME
AGI: 26514 genes
EUGENE: 29787 genes (+12.3 %)

Example: chromosome I
[Figure: number of genes (0 to 350) along chromosome I (0 to 30 Mb), AGI vs EuGène]
AGI = 6494 genes EUGENE = 7489 genes
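The relative increase in gene count quoted for the whole genome can be checked with a few lines of Python, and the same calculation applied to the chromosome I counts. The function name is hypothetical. The chromosome I percentage is derived here from the two counts on the slide and is not itself stated in the presentation.

```python
def percent_increase(old, new):
    """Relative increase of new over old, in percent."""
    return 100.0 * (new - old) / old


# Gene counts taken from the slides.
whole_genome = percent_increase(26514, 29787)   # AGI vs EUGENE, whole genome
chromosome_1 = percent_increase(6494, 7489)     # AGI vs EUGENE, chromosome I

print(f"whole genome: +{whole_genome:.1f} %")
print(f"chromosome I: +{chromosome_1:.1f} %")
```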

Aubourg, Samson, Brunaud & Lecharny URGV INRA/CNRS Evry, poster

EuGene vs AGI using FLAGdb++ (V. Brunaud, F. Samson, A. Lecharny, S. Aubourg -> poster)
AGI and EUGENE genes with exactly the same predicted structure: 10352
AGI and EUGENE genes with the same start and stop codon but internal differences: 3565

EuGene vs AGI
AGI genes without cognate EuGene gene: 2255
EuGene genes without cognate AGI gene: 3379

EuGene vs AGI
AGI genes with at least 2 overlapping/inserted EUGENE genes (Split of EuGene): 2191
EuGene genes with at least 2 overlapping/inserted AGI genes (Split of AGI): 409
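The comparison categories on these slides can be illustrated with a small Python sketch. This is not the FLAGdb++ procedure: the function names, the representation of a gene model as a sorted list of (start, end) coding exon coordinates, and the overlap test are assumptions chosen to mirror the categories counted above.

```python
def span(exons):
    """(start, end) of a gene model given as sorted (start, end) exons."""
    return (exons[0][0], exons[-1][1])


def overlaps(a, b):
    """True when two (start, end) spans share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def compare_models(agi_exons, eugene_exons):
    """Classify one AGI model against one EUGENE model (hypothetical labels)."""
    if agi_exons == eugene_exons:
        return "same structure"
    if span(agi_exons) == span(eugene_exons):
        return "same start and stop, internal differences"
    if overlaps(span(agi_exons), span(eugene_exons)):
        return "overlapping, different boundaries"
    return "no overlap"


def is_split(gene_span, other_spans):
    """True when at least 2 genes of the other annotation overlap this gene."""
    return sum(overlaps(gene_span, o) for o in other_spans) >= 2


# Hypothetical coordinates.
agi = [(100, 200), (300, 400)]
eug = [(100, 180), (300, 400)]
print(compare_models(agi, eug))
print(is_split((100, 400), [(100, 200), (300, 400)]))
```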

the CATMA gene set
29787 EUGENE predicted genes
29555
2334 documented/manually checked
1484 AGI genes, not detected by EUGENE
32 Non-coding RNAs (P. Green)
201 Controls
31272 Complete CATMA non redundant gene set

GST design

RATIONALE
The GSTs are designed to be specific for a single gene, even if it is a member of a gene family.
1) Find a specific sequence for each gene: where genes A & B share a region of homology, the A probe has to be designed outside this region
2) Amplify a region => 2 primers, 1 probe / gene

SPADS: Specific Primers & Amplicon Design Software (Vincent THAREAU)
GST location in the transcript:
* The GST is entirely inside an exon or overlaps an intron (then at least 50% of the GST sequence is in exons)
* GSTs are searched in the 3’->5’ direction, to take into account the bias towards partial mRNAs missing 5’ sequences
GST size: 150 to 500 bp
Specificity:
* GST specificity: checked with BLASTn against the whole genome
* Primer specificity: checked with BLASTn against the PCR template
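The search direction and size limits above can be sketched in Python as follows. This is not the SPADS algorithm: the function name, the input format (a transcript length plus a list of regions to avoid, for instance taken from BLASTn hits) and the greedy choice of the 3’-most clean stretch are assumptions for illustration. Coordinates are 0-based and half-open, with position 0 at the 5’ end.

```python
MIN_GST, MAX_GST = 150, 500  # GST size range given on the slide, in bp


def pick_gst_region(transcript_length, homologous_regions):
    """Return (start, end) of the 3'-most candidate GST region, or None.

    homologous_regions: hypothetical list of (start, end) intervals that
    are similar to other genes and must be avoided.
    """
    masked = [False] * transcript_length
    for start, end in homologous_regions:
        for i in range(max(0, start), min(transcript_length, end)):
            masked[i] = True

    end = transcript_length
    while end > 0:                      # walk from the 3' end towards the 5' end
        if masked[end - 1]:
            end -= 1
            continue
        start = end
        while start > 0 and not masked[start - 1]:
            start -= 1                  # extend the clean stretch towards 5'
        if end - start >= MIN_GST:
            # Keep at most MAX_GST bp, anchored at the 3' side of the stretch.
            return (max(start, end - MAX_GST), end)
        end = start                     # stretch too short, keep searching
    return None


print(pick_gst_region(1000, [(700, 900)]))
```

In this example the last 100 bp are too short to hold a GST, so the search skips the masked region and returns a 500 bp window ending at position 700.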

CATMA v.1
H class: 47%
M class: 18%
No GST: 35%
H class: similarity with the closest paralogue below 40%
M class: similarity with the closest paralogue below 70%
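The two class thresholds can be expressed as a small Python function. This is a sketch under stated assumptions: the function name is hypothetical, the M class is read as "40% or more but below 70%", and a GST at 70% or above is treated as unclassified. The slide does not say how similarity is computed or whether genes end up without a GST for other reasons as well.

```python
H_THRESHOLD = 40.0  # percent similarity with the closest paralogue
M_THRESHOLD = 70.0


def gst_class(similarity_percent):
    """Return "H", "M", or None for a candidate GST (assumed interpretation)."""
    if similarity_percent < H_THRESHOLD:
        return "H"
    if similarity_percent < M_THRESHOLD:
        return "M"
    return None


for value in (25.0, 55.0, 85.0):
    print(value, gst_class(value))
```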

CATMA v.1 features
21120 Gene Specific Tags
almost two thirds of the GSTs are located in the 3’-most part of the transcript
97.4 % of the GSTs are entirely in exons
2.6 % of the GSTs overlap introns

the CATMA database (Mark Crowe, John Innes Centre) http://www.catma.org

QUERIES: Preset Queries, BLAST, Advanced SQL Queries

CATMA v.2
Since January 2001, new genome data have become available, especially full-length cDNAs and 5’/3’ border ESTs (CERES, RIKEN), and the current annotation has improved (TIGR, MIPS).
A second round of annotation is ongoing, using a new version of EUGENE that can exploit 5’/3’ ESTs.

CATMA v.2
New GSTs will be designed with SPADS where the CATMA v.1 GSTs are no longer supported by the EUGENE re-annotation, as well as for new genes.
SPADS will be re-run on predicted genes for which no GST could be designed, after adding a 150 bp tail after the STOP codon (virtual 3’UTR).
Objective: >= 25000 GSTs

THE SONS OF CATMA

CAGE micro-array analysis Coordination : Martin Kuiper Goal : to allow comparison of micro-array transcript profiling experiments

AGRIKOLA: Arabidopsis Genomic RNA Interference Knock-Out Line Analysis
Coordinator: Ian Small
GOAL: Lines silenced specifically for most Arabidopsis genes
GST cloning -> GST-based hpRNA vectors -> comprehensive silenced line collection

Acknowledgements
Carine Serizet, Ghent (GénoPlante/VIB)
Vincent Thareau, Ghent (GénoPlante)
Mark Crowe, JIC Norwich
Sébastien Aubourg, VIB Ghent / URGV Evry
Thomas Schiex, Sylvain Foissac, INRA Toulouse
Patrice Déhais / Eric Bonnet, VIB Gent
Stephane Rombauts, VIB Gent
Pierre Rouzé, INRA Ghent
Pierre Hilson, URGV Evry / VIB Ghent
Funding: GénoPlante, URGV, VIB
