Making best use of TAIR tools and datasets Philippe Lamesch Donghui Li The Arabidopsis Information Resource www.arabidopsis.org contact us: curator@arabidopsis.org
TAIR: The Arabidopsis Information Resource • collect, curate and distribute information on Arabidopsis • information freely available from arabidopsis.org
Outline • Gene structure – Philippe Lamesch • Gene function – Donghui Li • Metabolic pathway – Donghui Li • New tools – Philippe Lamesch
Slides available from TAIR www.arabidopsis.org
TAIR is used worldwide Visits per month (source: Google Analytics)
What we do: (2) manual literature curation • Controlled vocabulary annotations Gene Ontology (GO) http://www.geneontology.org/ Plant Ontology (PO) http://www.plantontology.org/ • Gene name, symbol • Allele, phenotype • Summary statement composition
What we do: (3) metabolic pathway curation AraCyc A metabolic pathway database for Arabidopsis thaliana that contains information about both predicted and experimentally determined pathways, reactions, compounds, genes and enzymes. PlantCyc and PMN (Plant Metabolic Network)
What we do: (4) work with ABRC to distribute research material
Part I: The Arabidopsis genome annotation • A new approach for improving the Arabidopsis genome annotation • Where to find gene structure related data at TAIR • The Arabidopsis gene structure confidence ranking
Arabidopsis genome annotation • Arabidopsis genome sequenced almost 10 years ago • High-quality sequence with few gaps • TIGR did the initial genome annotation • TAIR took over responsibility in 2005 • Current TAIR9 stats: 27,379 protein-coding genes; 4,827 pseudogenes or transposable elements; 1,312 ncRNAs
Genome annotation at TAIR • Add novel genes • Update exon/intron structures of existing genes • Delete mispredicted genes • Merge and split genes • Change gene types • Add splice variants • Annotate 'atypical' gene classes: short protein-coding genes, transposable element genes, pseudogenes, uORFs (genes within the UTR of other genes)
Arabidopsis gene structure annotation: a new approach • TAIR6-TAIR9: use ESTs and cDNAs and an assembly tool called PASA to improve gene structures • TAIR10: use new experimental data and new prediction tools to further improve gene structure predictions
Genome annotation TAIR6-TAIR9: using PASA and ESTs/cDNAs • ESTs/cDNAs from NCBI are clustered into transcripts • Clustered transcripts yield a resulting gene model • Comparison with the previous gene model identifies novel genes, new splice variants and gene structure updates
Manual annotation at TAIR: Apollo • Evidence tracks shown in the screenshot: short MS peptides, radish sequence alignments, Eugene prediction, dicot sequence alignments, monocot sequence alignments, Aceview gene predictions, cDNAs, ESTs • Result shown: 2 gene isoforms
TAIR10: using proteomics and RNA-seq data to improve genome annotation • 4-step process: • Mapping RNA-seq reads and peptides • Assembly/gene build • Manual review • Integration (genome release/GBrowse)
Mapping and Assembly • Mapping: RNA-seq sequences (TopHat (C. Trapnell), SuperSplat (T.C. Mockler)); peptides (6-frame translation, spliced exon graph) • Assembly approaches: • Augustus (M. Stanke): uses spliced RNA-seq reads and peptides; aim: identify additional splice variants, update existing genes • TAU (T.C. Mockler): uses spliced RNA-seq reads; aim: identify additional splice variants • Cufflinks (C. Trapnell): uses spliced and unspliced RNA-seq data; aim: identify novel genes
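The 6-frame translation used for peptide mapping can be sketched as follows. This is a minimal illustration, not the TAIR pipeline: the function names are made up for this example, only the standard genetic code is used, and peptides that span a splice junction (the case the spliced exon graph addresses) are not handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate a DNA sequence in frames +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


def find_peptide(seq, peptide):
    """Hypothetical helper: list the frames whose translation contains the peptide."""
    return [name for name, protein in six_frame_translation(seq).items()
            if peptide in protein]
```

For example, `find_peptide("ATGGCCAAGTAA", "MAK")` reports the forward frame `+1`, because the first three codons of that sequence encode M, A and K.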
Augustus • RNA-seq datasets (Mockler Lab, Ecker Lab) mapped with TopHat and SuperSplat • 200 million aligned RNA-seq reads • 203,000 clustered spliced RNA-seq junctions • 145,000 RNA-seq junctions based on >1 read
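The step from clustered junctions to junctions supported by more than one read can be illustrated with a short sketch. It assumes each spliced read has already been reduced to a junction key of the form (chromosome, strand, intron start, intron end); that representation and the function names are assumptions made for this example, not the format used by TopHat or SuperSplat.

```python
from collections import Counter


def cluster_junctions(spliced_reads):
    """Count how many spliced reads support each distinct junction.

    Each item is a hypothetical junction key:
    (chromosome, strand, intron_start, intron_end).
    """
    return Counter(spliced_reads)


def supported_junctions(junction_counts, min_reads=2):
    """Keep junctions supported by at least min_reads reads (i.e. >1 read by default)."""
    return {junction: n for junction, n in junction_counts.items() if n >= min_reads}


# Illustrative data only, not real read alignments.
reads = [
    ("Chr1", "+", 1200, 1350),
    ("Chr1", "+", 1200, 1350),
    ("Chr1", "+", 2000, 2100),
    ("Chr2", "-", 500, 760),
    ("Chr2", "-", 500, 760),
    ("Chr2", "-", 500, 760),
]
counts = cluster_junctions(reads)
kept = supported_junctions(counts)
```

In this toy input three distinct junctions are clustered, and the one seen in a single read is dropped.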
Augustus • Input: 145,000 RNA-seq junctions based on >1 read; 260,000 peptides (Baerenfaller et al., Castellana et al.); ESTs and cDNAs; AGI models • Augustus gene prediction • 11% of RNA-seq junctions incorporated into Augustus models • 64% of peptide sequences incorporated into Augustus models • Predicted Augustus models: 5,461 distinct models, 1,596 novel models
Categorisation/Review • Issues flagged in the example: incorrect junction in TAIR model; unsupported exon • Tracks shown: TAIR confidence rank, TAIR model, Augustus model (correction), TAU models (splice variants, NMD targets), RNA-seq junctions (colour reflects matching model), peptides
Augustus/TAU/Cufflinks • Augustus: incorporates 64% of the peptides and 11% of the RNA-seq junctions not contained in TAIR; 5,461 potential updated genes; 1,596 potential novel genes • TAU: 30,083 junctions distinct to Augustus or TAIR models; 10,902 junctions incorporated into 10,491 TAU models • Cufflinks: 367 novel assemblies above the 100 bp and >15 FPKM filter • Note: a TE filter was applied to the Augustus and Cufflinks models
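The Cufflinks length and expression filter mentioned above can be sketched as a simple predicate. The `Assembly` record, its field names and the strict ">" comparison on both thresholds are assumptions for illustration; the slides do not say whether the 100 bp cut-off is inclusive, and the TE filter is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    """Hypothetical record for one assembled transcript."""
    name: str
    length_bp: int
    fpkm: float


def passes_filter(assembly, min_length_bp=100, min_fpkm=15.0):
    """True if the assembly is longer than min_length_bp and above min_fpkm."""
    return assembly.length_bp > min_length_bp and assembly.fpkm > min_fpkm


def filter_assemblies(assemblies, min_length_bp=100, min_fpkm=15.0):
    """Return only the assemblies that pass both thresholds."""
    return [a for a in assemblies if passes_filter(a, min_length_bp, min_fpkm)]


# Illustrative values only.
candidates = [
    Assembly("asm_1", length_bp=850, fpkm=42.0),   # passes both thresholds
    Assembly("asm_2", length_bp=90, fpkm=60.0),    # too short
    Assembly("asm_3", length_bp=400, fpkm=3.5),    # expression too low
]
kept_assemblies = filter_assemblies(candidates)
```

Keeping the thresholds as parameters makes it easy to check how sensitive the number of retained assemblies is to either cut-off.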
Preliminary Results • Augustus/TAU/Cufflinks predicted models are classified into categories: • Novel genes: 21 • Updated genes: 812 • Splice variants: 2,134 • B-list: 1,586 • Rejects: 2,318
Where can you find gene structure data on TAIR? • ON GENE MODEL PAGE • Graphic of exon-intron structure • Coordinates of each exon • ON GBROWSE • Graphic display of structure and overlapping evidence data • ON FTP SITE • GFF files with exact structures of each gene model • Files with gene confidence ranking information
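As an illustration of working with the GFF files from the FTP site, the sketch below collects exon coordinates per gene model from GFF3-formatted lines. It assumes the standard nine tab-separated GFF3 columns and that exon rows point to their gene model through a `Parent` attribute; the sample rows and identifiers are invented for the example and are not taken from a TAIR release.

```python
from collections import defaultdict


def parse_attributes(field):
    """Parse the GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def exons_by_model(gff_lines):
    """Map each gene model ID to a sorted list of (start, end) exon coordinates."""
    exons = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue  # only well-formed exon rows are of interest here
        start, end = int(cols[3]), int(cols[4])
        # An exon shared by several models lists them comma-separated in Parent.
        for parent in parse_attributes(cols[8]).get("Parent", "").split(","):
            if parent:
                exons[parent].append((start, end))
    return {model: sorted(coords) for model, coords in exons.items()}


# Invented example rows (hypothetical IDs and coordinates).
sample = [
    "##gff-version 3",
    "Chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=GENE1",
    "Chr1\tsrc\texon\t500\t900\t.\t+\t.\tParent=GENE1.1",
    "Chr1\tsrc\texon\t100\t300\t.\t+\t.\tParent=GENE1.1,GENE1.2",
]
structures = exons_by_model(sample)
```

To read a downloaded file instead of the inline sample, the open file handle can be passed directly, since the function only iterates over lines.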
GBrowse layout: header, main browser window, track menu
Gene Confidence Rank • Attributes confidence scores to all exons and gene models based on different types of experimental and computational evidence
Gene Confidence Rank scale: from full support to no support
New Tools at TAIR • N-Browse (in collaboration with the Kris Gunsalus Lab, NYU) • GBrowse • Synteny viewer