Elixir WP2 Data Resources

The stuff of ELIXIR
The data resources work package is at the heart of the project, dealing with the very stuff for which ELIXIR was conceived.

The data
• Genes and genomes
• Transcripts and protein sequences
• Patterns of gene expression
• Three-dimensional molecular structures
• Interactions, pathways and processes
• Metabolites and drugs
• The choreography of molecular activity in the cell

Benefits
• Medicine and health
• Personal care
• Agriculture
• Food science
• Brewing and fermentation
• Forestry
• Fishery
• Environment

Research Domain
S. Palcy & A. de Daruvar, Université Bordeaux 2, France
• Biotechnology
• Systems Biology
• Computational biology
• Pharmacology
• Pharmacy
• Toxicology
• Biomedical informatics
• Biostatistics
• Chemoinformatics
• Consumer goods company, Safety and Environment Assurance Centre
• Pharmacognosy
• Statistics
• Physics
• Wastewater treatment

Databases – expected use

The status quo – user survey

Total European effort
• 200 databases, 700 people, 100 institutions
• Total investment to date: €308 million
• Annual cost: €35 million
• About 1/3 of responders report NO costs
• 60% of polled databases didn’t respond
• It is almost certain that the non-responders are smaller on average
• 60 million web hits per month
• The EBI reports €22 million per year directly on databases
• Certainly EBI’s reporting is more complete than other sites

Comparative cost
The cost of the science supported by the infrastructure is possibly two orders of magnitude higher than that of the proposed ELIXIR infrastructure.

UK 2005 ~ €3.8 Billion

User survey

531 databases surveyed: 208 responded, 323 did not
(chart: responders and non-responders; dead = no update since 2005)

Databases by country

Security of the databases

Subject matter - keywords assigned

Specialised Molecular Data Resources
Galperin (2005 NAR)
• In 2007 more than 900 databases
• ~30% in Europe
• Most use core resources as reference data

User survey

• 200 databases, 100 institutions
• EBI: 27 databases
• Most institutions: 1 database
(chart: cumulative databases against number of institutions)

Costs per database
(charts: cumulative costs to date and cumulative annual costs, in €K, against number of databases)

Gigabytes per database
(chart: gigabytes against number of databases)

Modalities
A further 20% intend to offer Web Services

Specialist/General

Data downloadable in their entirety
• Yes, 113
• Technical limits, 67
• Usage restrictions, 28
• 32 charge commercial users
• 48 restrict reuse
• 23 report confidentiality constraints

Sources of funding
(chart; categories: National, Institutional, European, Some non-European, Some commercial, No formal funding)

Cumulative citations per database
Top 30 (in no particular order): CATH, Dali, DSSP, Ensembl, GeneCards, GO, IMGT/HLA, InterPro, MIPS PlantsDB, Pfam, PRINTS, PROSITE, SMR, UniProt, UniProtKB/Swiss-Prot, ArrayExpress, CAZy, CYGD, GOA-UniProt, HGMD, HSSP, MEROPS, miRBase, PDBREPORT, Rfam, SUPERFAMILY, GOLD, Reactome, STRING, BRENDA
(chart: cumulative citations against number of databases)

Cumulative monthly hits
(chart: cumulative hits against number of databases)

Unique users
(chart: number of databases against number of users)

Survey: miscellaneous points
• Mirrors are the exception
• Substantial curation in half of the databases (30% do none)
• 42% consider themselves unique; 58% have comparable partners
• Only a fifth (of the 58%) exchange data
• 70% collect usage data
• Mostly directed at bioinformaticians and biologists/bench scientists
• 30-45% of databases admit being incomplete or out of date
• Users are not normally asked to register (<5%)
• Preferred usage metric = web hits
• 106 institutions also host tools

Scope and nature of database provision: some reflections from the committee

ELIXIR scope
• The committee endorsed ELIXIR’s biomolecular focus
• Cautioned against over-expansion of that focus
• For example, connect to medical data rather than expanding the scope of ELIXIR
• Shared ontologies will be crucial to this and should receive appropriate attention within ELIXIR
• Biologically active small molecules are in scope
• It is important to have public domain chemical resources available

Core and non-core distinction
• Core databases are complete collections of universal scientific value
• Core databases require that the providing institution takes on long-term responsibility
• Investigator-led databases: scope and persistence reflect the interests of the research group; not core; funded from appropriate research funds
• Examples of core databases: UniProt, EMBL-Bank, MSD, Ensembl and ArrayExpress
• Non-core databases can be candidates for core
• Mechanism to move databases in and out of core, and to create or discontinue database projects

Non-core databases
• Support databases are built to support the operation of the core databases or to be used in conjunction with them to increase their value. For example, they may provide controlled vocabularies for a range of core databases (say, organism names).
• Investigator-led databases are typically the product of research groups (though they may well be served to external users). Their content reflects the research interests of their provider (e.g. documenting catalytic sites).
• Specialist databases handle data whose structure cannot easily be represented in the more general databases (say, immunoglobulins).
• Derivative/summarising databases combine and organise data from a range of other databases, such as a non-redundant set of coding sequences.

Consent and confidentiality
• Where data can be traced to individuals, access restrictions associated with confidentiality, consent and ethics must be applied
• This must not be confused with protectionism

Data release and publication
• Data discussed in a publication should be made available before or at the time of publication
• In some domains an early publication might be a far from complete analysis of a complex data set
• Even where this is the case, the normal rules should apply for biomolecular data
• These norms apply to “conventional”, “hypothesis-driven” research
• Projects whose funding is justified by the creation of shared data collections should make their data available as soon as they are useful

Global perspective
ELIXIR: European focus, with global perspective
Typically core resources:
• are embedded in global collaborations
• have global data exchange agreements

Data structure
(diagram: EBI and nodes; core and non-core data)

We cannot do everything: we will have to review existing databases and consider proposals for new databases.

New and proposed data resources
Demand
• Scientific case
• User demand
• Funding agency demand
• Data generator demand
• Journal demand
Standardisation and connectivity
Appropriateness to ELIXIR
• In scope
• Freely available
• Community support
• Arrangements with global peers
Strategic need, e.g. as a European player
• Can it be left to another provider?
• Can ELIXIR be globally competitive?
• Is persistence assured?

Scale and cost effectiveness
Scale and cost
• Database size and complexity
• Cost
• Staff requirement
• Data flow rates
Usage
• Volume of usage
• Number of users
• Citations

Related domains

(diagram: core biomolecular resources and related domains)
• Core biomolecular resources
• Specialist biomolecular data resource examples: BRENDA, IMGT, Pasteur DBs
• Model organism resource examples: SGD, Flybase, MGD
• Large resources in related disciplines: medical data resources, biodiversity data resources, chemical data resources
• Eumorphia/Phenotypes, Mutants, Mouse Atlas

Medical data resources Core biomolecular resources

Principles & Recommendations
• Data sharing is the norm in the biomolecular domain
• ELIXIR should espouse the strongest possible public domain principles (Question ?)
• Data supported by the infrastructure should be downloadable in their entirety and subject to no restrictions on use and reuse (Question ?)
• Insistence on acknowledgement is acceptable
• Prohibiting the distribution of a modified data collection in a form which could be confused with the original is acceptable
• Service organisations should exist in a research context
• Collaboration and avoidance of duplication in core data archives are essential
• Creative competition on services is desirable
• Databases produced by research with primary responsibilities only to their research group are not ELIXIR resources and should be identified as such (hobby databases)
• ELIXIR data resources must connect with their global context
• Standardisation and interoperability are crucial
