A Common Platform for 'Omics Data BioInvestigation Index Infrastructure and related reporting standards. Susanna-Assunta Sansone, PhD (Project Coordinator) sansone@ebi.ac.uk www.ebi.ac.uk/net-project. Literature and ontologies CitExplore, GO. Nucleotide sequence EMBL-Bank. Genomes
A Common Platform for 'Omics Data: BioInvestigation Index Infrastructure and related reporting standards • Susanna-Assunta Sansone, PhD (Project Coordinator) • sansone@ebi.ac.uk • www.ebi.ac.uk/net-project
The EBI’s databases • To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress • Literature and ontologies: CitExplore, GO • Nucleotide sequence: EMBL-Bank • Genomes: Ensembl, Integr8 • Proteomes: UniProt, PRIDE • Gene expression: ArrayExpress • Protein structure: MSD • Protein families, motifs and domains: InterPro • Chemical entities: ChEBI • Protein interactions: IntAct • Pathways: Reactome • Systems: BioModels • These EBI databases collaborate and exchange data with other systems worldwide
Standards development – international collaborations • Nucleotide sequence: www.insdc.org • Genome annotation: www.geneontology.org • Genomics Standards Consortium (GSC): gensc.org • Microarray and Gene Expression Data (MGED): www.mged.org • Protein sequence: www.uniprot.org • Protein structure: www.wwpdb.org • HUPO Proteomics Standards Initiative (PSI): psidev.sf.net • Systems modelling standards: www.sbml.org • Cheminformatics: www.ebi.ac.uk/chebi • Pathways: www.reactome.org, www.biopax.org • Metabolomics Standards Initiative (MSI): www.metabolomicssociety.org
Many standardization initiatives • Wide variety of players, e.g. • Standards Developing Organizations (SDOs) • Grass-root movements • International projects • Working groups • Heterogeneous focus, beyond reporting requirements, e.g. • Broader understanding of the use of ’omics data (e.g. FDA, EPA) • Agreed worldwide recommendations (e.g. NAS, OECD) • Measurements and methods validation (e.g. NIST, ECVAM) • Multiple stakeholders • Academics, industries, governmental and regulatory bodies • Manufacturers, software vendors, journal editors and funders
Many initiatives but two main driving forces • Driving force behind SDOs • Regulatory bodies • Data review • Data submission models • - Transport file, minimal burden on sponsors • Pharma/tox metadata-centric vision • Driving force behind grass-root standards initiatives • Research communities and database developers • Reporting standards for deposition and exchange • Data deposition tools • - Mandatory fields, use of public terminologies • ’Omics data-centric vision
Reporting standards • [Figure labels, METADATA: Design, Source and Characteristics, Treatments, Sample Preparation, Instrumental Analysis (MS, NMR, etc.), Computational Analysis; DATA FILES: Native Data Files, Data Files] • Need to understand the resulting data in context • We need to be able to describe the laboratory workflow (metadata) • The challenges we need to overcome • Large in volume: lots of data types and metadata! • Lots of free text descriptions: hard to mine, subject to mistakes! • Babel of terminologies: lack of definitions, hard to map! • Heterogeneous file formats: software lock-in!
Reporting standards: types, developers and users • CONTENT • Minimal/core information to be reported • SYNTAX • Format used for communication (XML, TABULAR) • SEMANTICS • Terminology used for description (LIST, THESAURI, TAXONOMY, ONTOLOGY) • [Figure labels: Domain Experts; Funding agencies’ data policies]
Can ‘standards’ deal with complex studies? • Research is multidisciplinary and multi-technology • Biological, biomedical, environmental and other studies measure a variety of endpoints and employ one or more technologies, both conventional and high-throughput • Example of such a complex study • Study of the effect, on a number of subjects, of a compound that induces liver damage, by characterizing/measuring • - the metabolic profile (endpoint) of urine by MS (technology) • - protein expression (endpoint) in liver by MS (technology) • - gene expression (endpoint) in liver by DNA microarray (technology) • and conducting conventional histopathological analysis
Long development process, requiring synergies! • Standardization activities operate in a single domain • But research is multidisciplinary and multi-technology • Standards should stand alone but also function together • Build them in a modular way, maximizing interactions • Share common modules, where applicable • Interoperability is crucial • Helps the user base to comply with standards, or to require compliance • - Data producers and database managers, journal reviewers etc. • Optimizes development of tools, saving time and costs • - Manufacturers and vendors covering multiple technologies • Leverage the commonality through increased synergy • Extensive community liaison is required • Little or no funding is available
Some synergistic activities have started... • CONTENT • MIBBI (mibbi.sf.net) and EQUATOR (www.equator-network.org) • SEMANTICS • OBO Foundry (obofoundry.org) • SYNTAX • FuGE, XML-based (fuge.sf.net), and ISA-TAB, tabular (isatab.sf.net) • [Figure labels: Domain Experts; XML, TABULAR; LIST, THESAURI, TAXONOMY, ONTOLOGY]
http://mibbi.sf.net MIBBI Project – Standard Content • New international collaboration (2007) • Communities developing ‘minimum information’ checklists • Define the ‘core essentials’ to be reported • Describe, not prescribe! • Phase 1: Portal as a ‘one-stop shop’ (ONGOING) • For researchers, journal editors and reviewers, and funders • To discover whether there are checklists for a particular domain • To raise awareness of the scope and progress of extant efforts • To facilitate investigation of overlaps and gaps between checklists
http://mibbi.sf.net MIBBI Project – Standard Content • Phase 2: Foundry for integration • To refactor the checklists • Create integrable checklist modules • A suite of self-consistent, clearly bounded and orthogonal modules, delineated by biology and technology
http://mibbi.sf.net MIBBI Project – Link with EQUATOR • Minimal information guidelines to report health research, e.g. • CONSORT Statement (randomised controlled trials) • QUOROM, recently renamed PRISMA (systematic reviews of randomised trials) • STARD (diagnostic accuracy studies) • STROBE (observational studies) • REMARK (tumour marker prognostic studies)
http://obofoundry.org OBO Foundry – Standard Semantics • Prospective standard (since 2006) • To guarantee interoperability of the ontologies from the start • - As opposed to post hoc mappings • To ensure division of labour and avoid duplication • - Where common terms exist across domains • To overcome differing degrees of formal rigour • - Different degrees of completeness, variable quality, different update policies • Framework of rules governing best practices • To counteract the current practice of ad hoc creation of ontologies • To create a complete suite of orthogonal and interoperable ontologies • - Common architecture, versioning, documentation, OWL/OBO format, etc.
http://obofoundry.org • More than 80 ontologies fall under OBO • Some are being constructed ab initio, some are being reviewed • Developers have agreed to accept the set of rules
http://obofoundry.org OBO Foundry – Standard Semantics • Framework of rules governing best practices, including • Open • Common format language (OBO, OWL) • Orthogonal (avoid domain overlap) • Common architecture (Relation Ontology) • Update • Unique identifier space in OBO • Versioning (backward compatibility) • Documentation • (Credit to B. Smith) • Example • Body weight • - PATO:weight that inheres_in CARO:whole_organism • Dead cell • - CELL TYPE root node: cell has_quality PATO:dead
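The two examples on this slide compose a term from one ontology with a term from another through a relation. A minimal sketch of that idea follows, assuming a simple in-memory representation: the `Term` and `Composite` classes are hypothetical names invented for illustration, and human-readable labels stand in for real term identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A term from one ontology (hypothetical helper class)."""
    ontology: str  # ontology name as written on the slide, e.g. "PATO"
    label: str     # readable label standing in for a real term identifier

    def render(self) -> str:
        return f"{self.ontology}:{self.label}"


@dataclass(frozen=True)
class Composite:
    """Two terms joined by a relation such as inheres_in or has_quality."""
    subject: Term
    relation: str
    target: Term

    def render(self) -> str:
        return f"{self.subject.render()} {self.relation} {self.target.render()}"


# Body weight: a quality (PATO) that inheres in an anatomical entity (CARO).
body_weight = Composite(Term("PATO", "weight"), "inheres_in",
                        Term("CARO", "whole_organism"))

# Dead cell: a cell (CELL TYPE ontology) that has a quality (PATO).
dead_cell = Composite(Term("CELL TYPE", "cell"), "has_quality",
                      Term("PATO", "dead"))

print(body_weight.render())
print(dead_cell.render())
```

Because each ontology stays orthogonal, neither needs its own "body weight" or "dead cell" term; the combination is built from parts that already exist.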
ISA-TAB and BioInvestigation Index – Rationale • Store multi-omics studies produced by collaborators • CarcinoGenomics and NuGO (European Consortia) • - Network of 40+ research institutes – toxicogenomics and nutrigenomics • - Committed to placing data in the public domain, fulfilling grant requirements • Natural Environment Research Council (NERC) • - UK funding agency – environmental research programmes • - NERC Environmental Bioinformatics Centre (NEBC) supports omics-based programmes • - Committed to placing data in the public domain, fulfilling the NEBC data policy • Collect and present multi-omics studies uniformly • Leverage existing public repositories at EBI • - Develop a unified submission and query interface • Ensure compliance with standards, where these exist
Current situation @ EBI • NO common representation of complex studies • Independent databases, different metadata representation, format, diverse terminologies etc. • SUBMISSION • - MGED standards: MIAMExpress, MAGE-TAB, MAGE-ML • - HUPO-PSI standards: Proteome Harvest, PSI-XML(s) • STORAGE (existing production systems) • - ArrayExpress: transcriptomics data files + required experimental descriptors • - PRIDE: proteomics data files + required experimental descriptors • RETRIEVAL
Common format for submissions – ISA-TAB • Investigation/Study/Assay TABular • Format created to address EBI internal ‘issues’ • Builds on the existing MAGE-TAB paradigm • - Has additional features that make it a more general framework • - Shares the same motivation for using tab-delimited text files • - Easily created programmatically or with spreadsheet software, e.g. Excel • Development of ISA-TAB has been opened up and shared with collaborators • - Proposal was discussed, evaluated and finalized at the 1st and 2nd ISA-TAB workshops, at EBI in Dec 07 and Jun 08 • http://www.ebi.ac.uk/net-project/projects.html#workshop
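To illustrate the point that tab-delimited files are easy to create programmatically, here is a minimal sketch using only the Python standard library. The column headers are modelled on the MAGE-TAB/ISA-TAB tabular style but are chosen for illustration, the sample rows are made up, and `write_table` and `read_table` are hypothetical helper names; none of this is the normative ISA-TAB layout.

```python
import csv
import io

# Illustrative headers and made-up rows (assumptions, not the ISA-TAB specification).
HEADER = ["Source Name", "Characteristics[organism]", "Protocol REF", "Sample Name"]
ROWS = [
    ["subject1", "Rattus norvegicus", "liver sampling", "subject1.liver"],
    ["subject1", "Rattus norvegicus", "urine collection", "subject1.urine"],
]


def write_table(header, rows) -> str:
    """Serialize a header and rows as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def read_table(text):
    """Parse tab-delimited text back into one dict per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


table_text = write_table(HEADER, ROWS)
records = read_table(table_text)
print(table_text)
```

The same text opens directly in spreadsheet software, which is the other route the slide mentions.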
ISA-TAB contributors Grass-root initiatives developing de facto standards in biological, biomedical, environmental and omics domains
ISA-TAB workshops’ outcome and work to date • General consensus around this cross-platform format • - To tackle the immediate need to deal with multi-omics studies • Two use cases for the ISA-TAB format • As a submission/export (i.e. exchange) format • - Suitable also for researchers with little or no bioinformatics support • - Easily created programmatically or with spreadsheet software • - Implementers decide how to regulate its use, e.g. by enforcing minimum requirements or use of ontologies • - However, ISA-TAB files with all fields left empty are syntactically valid, as are those where all fields are filled with free-text values rather than controlled vocabulary or ontology terms • As a user-friendly presentation layer for (FuGE-based) XML formats • - Library of XSLT scripts in progress
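Since the format itself accepts empty fields and free text, any stricter rule is an implementer's policy layered on top. The sketch below shows one way such a policy might look. The choice of required columns, the ontology-annotated column, and the `PREFIX:identifier` pattern are all assumptions made for illustration, and `check_rows` is a hypothetical function, not part of any ISA-TAB tool.

```python
import csv
import io
import re

# Hypothetical implementer policy; ISA-TAB itself does not mandate any of this.
REQUIRED_COLUMNS = ["Source Name", "Sample Name"]
ONTOLOGY_COLUMNS = ["Characteristics[organism]"]
TERM_PATTERN = re.compile(r"^[A-Za-z]+:\S+$")  # e.g. "PREFIX:identifier"


def check_rows(text):
    """Return (row number, column, problem) for each policy violation."""
    problems = []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for number, row in enumerate(reader, start=1):
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                problems.append((number, column, "empty required field"))
        for column in ONTOLOGY_COLUMNS:
            value = (row.get(column) or "").strip()
            if value and not TERM_PATTERN.match(value):
                problems.append((number, column, "free text, not an ontology term"))
    return problems


# Made-up sample: row 1 passes the policy, row 2 does not.
sample = (
    "Source Name\tCharacteristics[organism]\tSample Name\n"
    "subject1\tNCBITaxon:10116\tsubject1.liver\n"
    "subject2\trat\t\n"
)
problems = check_rows(sample)
for problem in problems:
    print(problem)
```

Both rows are syntactically valid tab-delimited text; only the policy distinguishes them, which is the separation the workshop outcome describes.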
More information on ISA-TAB, with examples and summary slides of the 2nd workshop: http://isatab.sourceforge.net • The ‘release candidate 1, ISA-TAB v1’ will be released around mid-September
BioInvestigation Index Infrastructure • Will provide such common representation of complex studies (first prototype Fall 08) funded by: www.ebi.ac.uk/net-project • SUBMISSION • - ISAcreator, submission tool: create and edit the metadata in ISA-TAB format, using ontologies • - Add data files to form an ISArchive • - Submit to BioInvIndex • STORAGE • - ISArchive parsed by import layer and dispatched to relevant databases • - Bio-Investigation Index: experimental descriptors, sample-data relationship + other types of assays • - MeDa: metabolomics data files • - ArrayExpress (existing production system): transcriptomics data files + required experimental descriptors • - PRIDE (existing production system): proteomics data files + required experimental descriptors • RETRIEVAL • - Browse, query and export interface
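The import layer on this slide parses an ISArchive and dispatches its contents to the relevant databases. A minimal sketch of that routing step follows, assuming each assay carries a type label and a file name. The routing table is read off the diagram; the type labels, file names and the `dispatch` function are hypothetical and do not describe the actual BioInvIndex implementation.

```python
from collections import defaultdict

# Routing read off the slide's diagram; the keys are hypothetical type labels.
ROUTES = {
    "transcriptomics": "ArrayExpress",
    "proteomics": "PRIDE",
    "metabolomics": "MeDa",
}
# Other types of assays stay with the index itself, as on the diagram.
DEFAULT_TARGET = "Bio-Investigation Index"


def dispatch(assays):
    """Group assay data files by the database that should receive them."""
    batches = defaultdict(list)
    for assay in assays:
        target = ROUTES.get(assay["type"], DEFAULT_TARGET)
        batches[target].append(assay["file"])
    return dict(batches)


# Made-up archive contents for a multi-omics study.
assays = [
    {"type": "transcriptomics", "file": "liver_array.txt"},
    {"type": "proteomics", "file": "liver_ms.xml"},
    {"type": "metabolomics", "file": "urine_ms.xml"},
    {"type": "histopathology", "file": "liver_histology.txt"},
]
batches = dispatch(assays)
for target, files in batches.items():
    print(target, files)
```

Keeping the routing in a single table means a new assay type only needs a new entry, while anything unrecognised falls back to the index.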
BioInvestigation Index Infrastructure – Reporting standards • ISA-TAB, MIBBI, OBO Foundry ontologies • [Diagram repeated: ISAcreator submission tool, ISArchive parsed by import layer, storage in Bio-Investigation Index, MeDa, ArrayExpress and PRIDE, browse, query and export interface]
BioInvestigation Index Infrastructure – Exchange and interoperability • ISA-TAB format; FDA’s NCTR • [Diagram repeated: submission to BioInvIndex, ISArchive parsed by import layer, storage in Bio-Investigation Index, MeDa, ArrayExpress and PRIDE, browse, query and export interface]
Acknowledgements • www.ebi.ac.uk/net-project • Coordination: Susanna-Assunta Sansone • Technical Coordination: Philippe Rocca-Serra • Ontology: Daniel Schober (Post-Doc) • MIBBI: Chris Taylor (NEBC Bioinformatician) • BioInvIndex: Nataliya Sklyar (Software Engineer) • BioInvIndex: Marco Brandizi (Software Engineer) • ISAcreator: Eamonn Maguire (Software Engineer) • BioInvIndex development funds • Standards and ontology workshops funds