The Sequence Read Archive at EBI Guy Cochrane, EMBL-EBI
European Nucleotide Archive 2 16.08.2014 European Nucleotide Archive

ENA Mechanisms
• Sequence similarity search
• Term search
• Download
• Browse
• Pipe into analysis tools
• APIs
• Direct presentation
• Local data capture
• Data exchange
• Ensembl
  - genebuild
  - variation
  - regulatory build
• UniProt
• ArrayExpress
• 1k Genomes DCC
• Infrastructure service
• Brokered submissions

SRA service
• Establish global repository for next gen. platform data
  - submission services
  - through extension of data exchange collaborations with partners at NCBI and DDBJ
• Provide route for data dissemination
  - as ongoing infrastructure to support large-scale studies
  - as a complement to publications
  - relieve data generators of large hardware requirements
• Provide data access to users
  - for re/meta-analysis of existing data
  - to enable serendipitous discoveries

Next gen. brings broader applications
• de novo assembly
• re-sequencing
• gene expression
• gene discovery
• epigenomics
• community genomics & transcriptomics
• others

Next gen. is different
• Read length
• Data volume per run
• Metadata:data volume ratio
• Read substructure
• Complexity of metadata

A sustainable data model for SRA
• Study, sample and experimental information
• Publication and author information
• Machine configuration
• Access to run datasets
• Access to selected reads within runs
• Intensity
• Noise data
• Sequence
• Quality

A sustainable data model for SRA
• SRA XML schema
• format-specific toolkit
• specialist binary formats (SRF and SRA)

Status
• Infrastructure
  - Metadata schema initiated by NCBI, now under co-development with EBI
  - Adoption of community data format, SRF
  - Migration to NCBI's SRA data format
  - Common accession namespace established with NCBI
• Data capture
  - Large sequencing centres (e.g. Sanger, BGI)
  - Small-scale submissions (Sanger Pathogen Sequencing Unit, Illumina UK, etc.)
• Data and metadata exchange
  - NCBI-collected data and metadata mirrored at EBI
• Data presentation
  - All metadata available as XML
  - All data available via FTP and Aspera
  - Beta browser launched in early October

SRA contents
• Nucleotides (terabases)
• 985 studies
• 1,253 organisms
• 5,329 samples
• 27,662 runs
• 9,296 experiments

SRA by platform

SRA by study type

Aspera technology
• fasp protocol significantly faster than FTP
• Intelligent adaptive rate control mechanism
• Secure
  - On-the-fly data encryption
  - Integrity verification
• Client download: http://www.asperasoft.com/downloads/connect-win.html
• Command line client and web browser plug-in

Submissions
• Manual
  - XML examples and information about supported data formats made available
  - Data and XML metadata files uploaded with FTP or Aspera into drop box
  - Notification e-mail to datasubs@ebi.ac.uk to initiate processing
  - Metadata files validated and cross-checked against data files
  - Accessions returned by e-mail
• Automated
  - Data files are uploaded into drop box
  - RESTful service used to submit data files and XML metadata files
  - Metadata file validation and accessioning are synchronous
  - Data file validation is asynchronous
http://www.ebi.ac.uk/embl/Documentation/ENA-Reads.html
datasubs@ebi.ac.uk

SRA as infrastructure
• ArrayExpress: sequence-based transcriptomics data
• Bioinvestigation Index: sequence-based multi-omics data
• European Genome-Phenome Archive (EGA): sequence-based, ethically protected data

Retrieval
• Data
  - FTP: ftp.era.ebi.ac.uk
  - Aspera: fasp.era.ebi.ac.uk
• Metadata
  - FTP: ftp.era-xml.ebi.ac.uk
• Browser (in beta)
  - http://www.ebi.ac.uk/ena/data/view/<SRA object accession>&display=xml
  - http://www.ebi.ac.uk/ena/data/view/<SRA object accession>&display=html
• Search by accession/description text
  - EB-eye search tool on all EBI pages
  - http://www.ebi.ac.uk/ebisearch/advancedsearch.ebi
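
Note: the two browser addresses on the Retrieval slide differ only in the display value, so they can be built from an accession with a small helper. The following is a minimal sketch, assuming the URL pattern exactly as printed on the slide (it may no longer resolve). The function name build_view_url and the accession SRR000001 are illustrative placeholders, not part of any ENA client library.

```python
# Sketch only: builds the browser URLs in the pattern shown on the slide.
# BASE_URL is copied from the slide; build_view_url is a hypothetical helper.
BASE_URL = "http://www.ebi.ac.uk/ena/data/view/"

# The slide lists exactly these two display modes.
DISPLAY_MODES = ("xml", "html")


def build_view_url(accession: str, display: str = "html") -> str:
    """Return the browser URL for one SRA object accession."""
    accession = accession.strip()
    if not accession:
        raise ValueError("accession must not be empty")
    if display not in DISPLAY_MODES:
        raise ValueError(f"display must be one of {DISPLAY_MODES}")
    # The slide joins the accession and the display parameter with '&'.
    return f"{BASE_URL}{accession}&display={display}"


# Placeholder accession, used here for illustration only.
xml_url = build_view_url("SRR000001", "xml")
html_url = build_view_url("SRR000001")
```

Keeping the allowed display values in one tuple means an unsupported mode fails early instead of producing a URL the browser would reject.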

EB-eye search

Summary of hits

Submission view

Study view

Sample view

Experiment view

Run view

Data files
• Currently provided in Sequence Read Format (SRF) and SRA toolkit formats
• Derived fastq files available
• BAM format to be added in due course
• Data files can hold intensity data, base calls and qualities
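
Note: the derived fastq files carry the base calls and qualities mentioned on this slide. The sketch below shows one way to read them in Python, assuming the common four-line record layout (header, sequence, separator, quality) with no line wrapping. The names FastqRecord and read_fastq are illustrative, the example reads are made up, and quality strings are left undecoded because the quality encoding is not stated in the slides.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading '@'
    sequence: str  # base calls
    quality: str   # raw quality string, left undecoded


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up two-read example, for illustration only.
example = io.StringIO("@read1\nGATT\n+\nIIII\n@read2\nAGATC\n+\nIIIII\n")
records = list(read_fastq(example))
```

Reading lazily with a generator keeps memory use flat, which matters given the per-run data volumes discussed earlier in the deck.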

Data file manipulation
• SRF
  - Io_lib of the Staden package, http://sourceforge.net/projects/staden
  - Solid2srf provided by ABI
  - Functionalities include conversion (native <-> SRF <-> fastq), indexing, summary generation
• SRA toolkit
  - Software development kit, SRA SDK, http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software
  - Functionalities include format conversion, column extraction, selection, etc.

Futures: sequence similarity search
[Figure: a collection of short sequence reads, e.g. GATT, AGAT, GATCCGATGAG]

Futures: leveraging community standardisation efforts
• Coherent communities exist that develop standards around what information to collect and how to represent it
• Systematic incorporation into data capture and presentation tools
• Validation against minimal standards and stamp of approval

Futures: data reduction strategies
• Disk space is finite!
• Intensity series have limited value for reuse
• Both future sample availability and application are factors
• Second base useful for polymorphism studies
• Proposal that minimal archived data includes sequence and quality

Futures: the mapped read issue
• User defines coordinates on a reference; reads are returned that relate to this part of the reference:
  - Give me all reads that map to a given gene in a digital gene expression assay
  - Give me all reads that provide support for a given polymorphism
  - Give me all reads that provide support for a given splice model
• Calculation
  - Up to date, but computationally heavy, with reference tracking issues
• Capture
  - Consistent with literature, but submission is not straightforward
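
Note: the query described on this slide (coordinates in, overlapping reads out) amounts to an interval-overlap test once mappings exist. The sketch below is a simple linear scan, assuming 1-based inclusive coordinates. MappedRead, reads_in_region and the example mappings are hypothetical, and the slides do not describe how the archive would implement this.

```python
from typing import Iterable, List, NamedTuple


class MappedRead(NamedTuple):
    name: str
    reference: str  # name of the reference sequence the read maps to
    start: int      # 1-based, inclusive (an assumption of this sketch)
    end: int        # 1-based, inclusive


def reads_in_region(
    reads: Iterable[MappedRead], reference: str, start: int, end: int
) -> List[MappedRead]:
    """Return the reads whose mapping overlaps reference:start-end."""
    if start > end:
        raise ValueError("start must not exceed end")
    # Two closed intervals overlap when each begins before the other ends.
    return [
        read
        for read in reads
        if read.reference == reference and read.start <= end and read.end >= start
    ]


# Hypothetical mappings, for illustration only.
reads = [
    MappedRead("r1", "chr1", 100, 135),
    MappedRead("r2", "chr1", 130, 165),
    MappedRead("r3", "chr1", 400, 435),
    MappedRead("r4", "chr2", 100, 135),
]
hits = reads_in_region(reads, "chr1", 120, 140)
```

A linear scan is only workable for small examples. At archive scale some form of coordinate index would be needed, which is part of the "computationally heavy" cost noted under Calculation.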

People and funding
• Data submissions and management: Sheila Plaister, Bob Vaughan, Ruth Akhtar, Petra ten Hoopen, Christopher Hunter, Richard Gibson
• Database programmers: Ying Chang, Iain Cleland, Mikyung Jang, Rasko Leinonen, Quan Lin, Lawrence Bower, Siamak Sobhany, Gemma Hoad, Rajesh Radhakrishnan, Fehmi Demiralp, Vadim Kalunin, Neil Goodgame, Nadeem Faruque
• Database development and coordination: Bob Vaughan, Nadeem Faruque, Rasko Leinonen, Guy Cochrane
• Sequencing data and tools (Sanger): Steven Leonard, James Bonfield
• Sequence search tools: Guy Slater, Ewan Birney
• Data exchange collaborators: NCBI, DDBJ
• EBI external services team
• Funding: European Molecular Biology Laboratory and Wellcome Trust

ENA points of access
• http://www.ebi.ac.uk/embl/Documentation/ENA-Reads.html
• http://www.ebi.ac.uk/ena/data/view/<SRA object accession>