
Design and Data Basing of Genome-wide Gene Specific Tags

Complete Arabidopsis Transcriptome Micro Array. Design and Data Basing of Genome-wide Gene Specific Tags. France. Pierre Hilson* (coord.), Vincent Colot. Pierre Rouzé. Belgium. Paul Van Hummelen. Germany. Wilfried Nietfeld. United Kingdom. Jim Beynon.



  1. Complete Arabidopsis Transcriptome Micro Array (CATMA): Design and Data Basing of Genome-wide Gene Specific Tags

  2. The partners in CATMA: France Pierre Hilson* (coord.), Vincent Colot Pierre Rouzé Belgium Paul Van Hummelen Germany Wilfried Nietfeld United Kingdom Jim Beynon Mark Crowe / Martin Trick Switzerland Philippe Reymond Netherlands Peter Weisbeek Spain Javier Paz-Ares Rishikesh Bhalerao Sweden. CATMA is a consortium of 10 research groups from 8 European countries, built in October 2000.

  3. functional studies gene specific data Gene Specific Tags for Micro Arrays

  4. GOAL: construction of a Gene Specific Tag (GST) collection representing most Arabidopsis genes

  5. STEPS: • Homogeneous structural re-annotation of the whole genome sequence using EUGENE • Search for the location of a Gene Specific Tag in each gene and design of primers for PCR amplification using SPADS • Construction of the CATMA database and entry of the GST, primer and gene data into it.

  6. CATMA flow chart: Genome sequence -> EUGENE (T. Schiex): structural annotation -> Gene models -> SPADS (V. Thareau): design of GST primers -> GST & primer sequences -> PCR from BACs/genome -> Gene Sequence Tags -> Spotting -> Micro Arrays. CATMA Database (M. Crowe).

  7. STRUCTURAL ANNOTATION

  8. WHY? At the beginning of the project (October 2000), the AGI annotation was nearly complete, but it suffered from major drawbacks: the annotation methodology differed from one AGI consortium to another; the annotations were produced over several years, so the earliest and the latest differ in quality; gene models were often wrong.

  9. HOW? Validation of the existing gene prediction tools (Pavy et al., 1999) showed how efficient each of them is for each gene feature and for gene modeling as a whole. A “parasitic” software (EUGENE) was developed (Schiex et al., 2001), which integrates the various sources of information available to produce gene models for the whole Arabidopsis genome.

  10. VALIDATION. The data set, AraSet: 566 kb of Arabidopsis genome sequence containing 74 gene contigs of documented genes, each manually checked for consistency: 57 contigs of 2 genes -> 114; 14 contigs of 3 genes -> 42; 3 contigs of 4 genes -> 12; in total 168 genes (1028 exons, 860 introns) and 94 intergenic sequences.
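
The AraSet totals can be re-derived from the contig counts on the slide. The sketch below does so in Python. The variable names are ours, and two assumptions are made that the slide does not state: an intergenic sequence is the gap between two adjacent genes of a contig, and a gene with k exons has k - 1 introns.

```python
# AraSet composition as given on the slide: {genes per contig: number of contigs}
contigs = {2: 57, 3: 14, 4: 3}

n_contigs = sum(contigs.values())
n_genes = sum(genes * count for genes, count in contigs.items())

# Assumption: one intergenic sequence between each pair of adjacent genes in a contig
n_intergenic = sum((genes - 1) * count for genes, count in contigs.items())

# Assumption: every gene has one intron fewer than it has exons
n_exons = 1028
n_introns = n_exons - n_genes

print(n_contigs, n_genes, n_intergenic, n_introns)
```

Under these assumptions the script reproduces the slide's figures: 74 contigs, 168 genes, 94 intergenic sequences and 860 introns.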

  11. Sensitivity and specificity. Sensitivity (Se): true predictions / actual cases (how often is the software correct?). Specificity (Sp): true predictions / total predictions (how many false predictions are given?).
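
The two ratios can be sketched in a few lines of Python. The function name and the exon coordinates are hypothetical, and a prediction is counted as true only if it matches an actual exon exactly. Note that Sp as defined on the slide is the quantity often called precision elsewhere.

```python
def sensitivity_specificity(actual, predicted):
    """Return (Se, Sp) as defined on the slide.

    Se = true predictions / actual cases
    Sp = true predictions / total predictions
    """
    actual, predicted = set(actual), set(predicted)
    true_predictions = len(actual & predicted)
    se = true_predictions / len(actual) if actual else 0.0
    sp = true_predictions / len(predicted) if predicted else 0.0
    return se, sp


# Hypothetical exons as (start, end) coordinates, for illustration only
actual_exons = {(100, 200), (300, 450), (600, 700), (900, 1000)}
predicted_exons = {(100, 200), (300, 450), (610, 700)}

se, sp = sensitivity_specificity(actual_exons, predicted_exons)
print(f"Se = {se:.2f}, Sp = {sp:.2f}")
```

In this toy case two of the four actual exons are found (Se = 2/4) and two of the three predictions are right (Sp = 2/3).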

  12. 1999 evaluation of exon prediction. Real exons: 1028. Pavy et al. (1999) Bioinformatics 15:887-899

  13. 1999 evaluation of gene prediction. * Correct gene model = every exon exactly predicted. Pavy et al. (1999) Bioinformatics 15:887-899

  14. EUGENE, a Directed Acyclic Graph Algorithm. Schiex et al. (2001) LNCS, 2066:111-125
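
To illustrate the general idea of a directed acyclic graph algorithm, the sketch below finds the best-scoring path through a small DAG by dynamic programming. It is a generic textbook technique, not EUGENE's actual graph, states or scoring. The node names and scores are invented for the example.

```python
def best_path(nodes, edges, source, sink):
    """Highest-scoring path in a DAG.

    nodes: all nodes, listed in topological order
    edges: dict mapping a node to a list of (next_node, score) pairs
    """
    best = {node: float("-inf") for node in nodes}
    back = {}
    best[source] = 0.0
    for node in nodes:  # topological order: best[node] is final when reached
        if best[node] == float("-inf"):
            continue  # not reachable from the source
        for nxt, score in edges.get(node, []):
            if best[node] + score > best[nxt]:
                best[nxt] = best[node] + score
                back[nxt] = node
    if best[sink] == float("-inf"):
        raise ValueError("sink is not reachable from source")
    path = [sink]
    while path[-1] != source:  # follow back-pointers to recover the path
        path.append(back[path[-1]])
    return best[sink], path[::-1]


# Toy graph with invented scores
nodes = ["start", "exon1", "intergenic", "intron", "exon2", "end"]
edges = {
    "start": [("exon1", 2.0), ("intergenic", 0.5)],
    "exon1": [("intron", 1.0), ("end", 0.5)],
    "intron": [("exon2", 1.5)],
    "exon2": [("end", 1.0)],
    "intergenic": [("end", 1.0)],
}
score, path = best_path(nodes, edges, "start", "end")
print(score, path)
```

Because every node is visited once in topological order, the running time is linear in the number of nodes and edges.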

  15. EUGENE features. Integrates different sources of information, e.g. in the current Arabidopsis v2 version: built in: IMM for exon/intron/UTR/intergenic; plug in: NetGene2, SplicePredictor, NetStart; filters: RepeatMasker; homology: BLAST search in protein & DNA DB, Sim4 search in EST/cDNA collections; borders: function to use 5’ & 3’ EST data. Globally optimized to maximize gene prediction accuracy on a set of annotated sequences.

  16. a typical EUGENE output graph

  17. 2002 evaluation of gene prediction

      Program       Sensitivity (%)   Specificity (%)   Author
      GenScan       17                19                C. Burge
      GeneMark.hmm  41                37                M. Borodovsky
      GlimmerA      30                19                S. Salzberg
      FgenesH-GC    57                55                V. Solovyev
      Eugene        67                56                T. Schiex
      Eugene+       76                68

  18. THE Arabidopsis GENOME. AGI: 26514 genes. EUGENE: 29787 genes (+12.3 %).

  19. Example: chromosome I. [Figure: number of genes (0 to 350) along chromosome I (0 to 30 Mb), AGI vs EuGène.] AGI = 6494 genes; EUGENE = 7489 genes.
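
The relative increase in gene number can be checked from the counts on the two slides above. A minimal sketch follows. The helper name is ours, and the chromosome I percentage is derived here, not quoted from the slides.

```python
def percent_increase(reference, new):
    """Relative increase of `new` over `reference`, in percent."""
    return 100.0 * (new - reference) / reference


# Gene counts from the slides: AGI annotation vs EUGENE re-annotation
whole_genome = percent_increase(26514, 29787)
chromosome_1 = percent_increase(6494, 7489)

print(f"whole genome: +{whole_genome:.1f} %")
print(f"chromosome I: +{chromosome_1:.1f} %")
```

The whole-genome value rounds to the 12.3 % shown on the slide. Chromosome I alone comes out somewhat higher, at about 15.3 %.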

  20. Aubourg, Samson, Brunaud & Lecharny URGV INRA/CNRS Evry, poster

  21. EuGene vs AGI using FLAGdb++ (V. Brunaud, F. Samson, A. Lecharny, S. Aubourg -> poster). AGI and EUGENE genes with exactly the same predicted structure: 10352. AGI and EUGENE genes with the same start and stop codon but internal differences: 3565.
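
The two categories on this slide can be expressed as a comparison of exon lists. The sketch below is a simplified illustration, not the FLAGdb++ procedure. It assumes plus-strand genes described by coding exons only, so that the first exon starts at the start codon and the last exon ends at the stop codon. The coordinates are invented.

```python
def compare_models(exons_a, exons_b):
    """Classify two gene models given as lists of (start, end) coding exons."""
    a, b = sorted(exons_a), sorted(exons_b)
    if a == b:
        return "identical"
    # Assumption: first exon start = start codon, last exon end = stop codon
    if a and b and a[0][0] == b[0][0] and a[-1][1] == b[-1][1]:
        return "same start/stop, internal differences"
    return "different"


# Hypothetical models for one locus
agi = [(100, 200), (300, 400), (500, 600)]
eugene_same = [(100, 200), (300, 400), (500, 600)]
eugene_internal = [(100, 200), (320, 400), (500, 600)]  # shifted splice site
eugene_other = [(150, 200), (300, 400), (500, 600)]     # different start

for model in (eugene_same, eugene_internal, eugene_other):
    print(compare_models(agi, model))
```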

  22. EuGene vs AGI. AGI genes without cognate EuGene gene: 2255. EuGene genes without cognate AGI gene: 3379.

  23. EuGene vs AGI. AGI genes with at least 2 overlapping/inserted EUGENE genes (Split of EuGene): 2191. EuGene genes with at least 2 overlapping/inserted AGI genes (Split of AGI): 409.

  24. The CATMA gene set: 29787 EUGENE predicted genes; 29555; 2334 documented/manually checked; 1484 AGI genes not detected by EUGENE; 32 non-coding RNAs (P. Green); 201 controls; 31272: complete CATMA non-redundant gene set.

  25. GST design

  26. RATIONALE: the GSTs are designed in order to be specific for a single gene, even if it is a member of a gene family. 1) Find a specific sequence for each gene: where genes A & B share a region of homology, the A probe has to be designed outside this region. 2) Amplify a region => 2 primers, 1 probe per gene.

  27. SPADS: Specific Primers & Amplicon Design Software (Vincent THAREAU). GST location in the transcript: * The GST is entirely inside an exon or overlaps an intron (in which case at least 50% of the GST sequence is in exons) * GSTs are searched in the 3’->5’ direction, to take into account the bias towards partial mRNAs missing 5’ sequences. GST size: 150 to 500 bp. Specificity: * GST specificity: checked with BLASTn against the whole genome * Primer specificity: checked with BLASTn against the PCR template.
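
The search order described on this slide can be sketched as a window scan starting at the 3’ end. This is not the SPADS implementation. The function name, the preference for longer tags and the `is_specific` callback (a stand-in for the BLASTn check against the genome) are assumptions made for illustration, and the exon/intron rule is left out.

```python
def find_gst(transcript, is_specific, min_len=150, max_len=500):
    """Return (start, end) of the first specific window, scanning 3' -> 5'.

    is_specific: hypothetical callback standing in for the BLASTn check.
    Assumption: for a given 3' boundary, longer tags are tried first.
    """
    n = len(transcript)
    for end in range(n, min_len - 1, -1):  # 3' boundary moves toward the 5' end
        for length in range(max_len, min_len - 1, -1):
            start = end - length
            if start < 0:
                continue  # window would extend past the 5' end
            if is_specific(transcript[start:end]):
                return start, end
    return None


# Toy transcript: 'N' marks a region shared with a paralogue (invented)
transcript = "A" * 300 + "N" * 20 + "C" * 200
gst = find_gst(transcript, lambda seq: "N" not in seq)
print(gst)
```

With this toy input the tag is placed in the 200 bp stretch at the 3’ end, downstream of the shared region. A transcript shorter than 150 bp yields `None`.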

  28. CATMA v.1: H class: 47%; M class: 18%; no GST: 35%. H class: similarity with the closest paralogue below 40%. M class: similarity with the closest paralogue below 70%.
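
The two thresholds can be read as a simple classification rule. The sketch below assumes that the M class covers similarities from 40% up to (but not including) 70%, and that a tag at 70% or more is rejected. The slide states only the upper bounds, so both points are our interpretation.

```python
def gst_class(similarity_percent):
    """Classify a candidate GST by similarity to its closest paralogue.

    Returns "H", "M", or None (no class; assumed to mean the tag is rejected).
    """
    if similarity_percent < 40:
        return "H"
    if similarity_percent < 70:  # assumption: M covers 40% <= similarity < 70%
        return "M"
    return None


for similarity in (25, 40, 69.9, 85):
    print(similarity, gst_class(similarity))
```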

  29. CATMA v.1 features: 21120 Gene Specific Tags; almost two thirds of the GSTs are located in the 3’-most part of the transcript; 97.4 % of the GSTs are entirely in exons; 2.6 % of the GSTs overlap introns.

  30. The CATMA database: http://www.catma.org (Mark Crowe, John Innes Centre)

  31. QUERIES Preset Queries BLAST Advanced SQL Queries

  32. CATMA v.2. Since January 2001, new genome data have become available, especially full-length cDNAs and 5’/3’ border ESTs (CERES, RIKEN), and the current annotation has improved (TIGR, MIPS). A second run of annotation is ongoing, using a new version of EUGENE that can exploit 5’/3’ ESTs.

  33. CATMA v.2. New GSTs will be designed with SPADS where the CATMA v.1 GSTs are no longer supported by the EUGENE re-annotation, as well as for new genes. SPADS will be re-run on predicted genes for which no GST could be designed, after adding a 150 bp tail after the STOP codon (virtual 3’UTR). Objective: >= 25000 GSTs.
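
Adding the virtual 3’UTR amounts to extending each gene model by 150 bp past its STOP codon, in the direction given by the strand. A minimal sketch follows, assuming 1-based inclusive coordinates and clipping at the chromosome ends. The function name and these conventions are ours, not taken from SPADS.

```python
def add_virtual_utr(start, end, strand, chrom_length, tail=150):
    """Extend a gene model by `tail` bp past its STOP codon.

    Coordinates are assumed 1-based and inclusive, with start <= end.
    """
    if strand == "+":
        return start, min(end + tail, chrom_length)  # STOP codon at the high end
    if strand == "-":
        return max(start - tail, 1), end             # STOP codon at the low end
    raise ValueError("strand must be '+' or '-'")


# Hypothetical gene coordinates on a 1 Mb chromosome
print(add_virtual_utr(1000, 2000, "+", 1_000_000))
print(add_virtual_utr(1000, 2000, "-", 1_000_000))
```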

  34. THE SONS OF CATMA

  35. CAGE: micro-array analysis. Coordination: Martin Kuiper. Goal: to allow comparison of micro-array transcript profiling experiments.

  36. AGRIKOLA: Arabidopsis Genomic RNA Interference Knock-Out Line Analysis. Coordinator: Ian Small. GOAL: lines silenced specifically for most Arabidopsis genes. GST cloning; GST-based hpRNA vectors; comprehensive silenced line collection.

  37. Acknowledgements: Carine Serizet, Ghent (GénoPlante/VIB); Vincent Thareau, Ghent (GénoPlante); Mark Crowe, JIC Norwich; Sébastien Aubourg, VIB Ghent / URGV Evry; Thomas Schiex, Sylvain Foissac, INRA Toulouse; Patrice Déhais / Eric Bonnet, VIB Gent; Stephane Rombauts, VIB Gent; Pierre Rouzé, INRA Ghent; Pierre Hilson, URGV Evry / VIB Ghent. Funding: GénoPlante, URGV, VIB.
