ELIXIR WP2 Data Resources
The stuff of ELIXIR
The data resources workpackage is at the heart of the project, dealing with the very stuff for which ELIXIR was conceived.
The data
• genes and genomes
• transcripts and protein sequences
• patterns of gene expression
• three-dimensional molecular structures
• interactions, pathways and processes
• metabolites and drugs
The choreography of molecular activity in the cell
Benefits
• Medicine and health
• Personal care
• Agriculture
• Food science
• Brewing and fermentation
• Forestry
• Fishery
• Environment
Research Domain
S. Palcy & A. de Daruvar, Université Bordeaux 2, France
• Biotechnology
• Systems Biology
• Computational biology
• Pharmacology
• Pharmacy
• Toxicology
• Biomedical informatics
• Biostatistics
• Chemoinformatics
• Consumer goods company, Safety and Environment Assurance Centre
• Pharmacognosy
• Statistics
• Physics
• Wastewater treatment
Total European effort
• 200 databases, 700 people, 100 institutions
• Total investment to date: €308 million
• Annual cost: €35 million
• About 1/3 of responders report NO costs
• 60% of polled databases didn’t respond
• It is almost certain that the non-responders are smaller on average
• 60 million web hits per month
• The EBI reports €22 million per year directly on databases
• Certainly EBI’s reporting is more complete than other sites
Comparative cost
The cost of the science supported by the infrastructure is possibly two orders of magnitude higher than that of the proposed ELIXIR infrastructure.
UK 2005 ~ €3.8 Billion
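The "two orders of magnitude" comparison can be checked roughly. The sketch below rests on two assumptions that the slides do not state explicitly: that the €3.8 billion UK figure is the 2005 cost of the supported science, and that the survey's €35 million annual database cost is a fair stand-in for the cost of the infrastructure.

```python
import math

# Assumption: the UK 2005 figure is the annual cost of the supported science.
uk_science_cost_eur = 3.8e9
# Assumption: the survey's annual cost of European databases is used as a
# proxy for the cost of the infrastructure.
annual_database_cost_eur = 35e6

ratio = uk_science_cost_eur / annual_database_cost_eur
orders_of_magnitude = math.log10(ratio)

print(f"ratio: {ratio:.0f}x, about {orders_of_magnitude:.1f} orders of magnitude")
```

Under these assumptions the ratio is a little over 100, which is consistent with the slide's "possibly two orders of magnitude".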
531 databases surveyed: 208 responded, 323 did not
[Charts: responders and non-responders; "dead" = no update since 2005]
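The response counts can be turned into rates with simple arithmetic. This is a minimal sketch using only the three counts on this slide; the variable names are illustrative.

```python
# Counts taken from the survey slide.
surveyed = 531
responded = 208
not_responded = 323

# Sanity check: the two groups should account for every surveyed database.
assert responded + not_responded == surveyed

response_rate = responded / surveyed
non_response_rate = not_responded / surveyed

print(f"responded: {response_rate:.1%}, did not respond: {non_response_rate:.1%}")
```

The non-response rate of about 61% matches the "60% of polled databases didn’t respond" figure quoted for the total European effort.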
Specialised Molecular Data Resources
Galperin (2005 NAR)
• In 2007 more than 900 databases
• ~30% in Europe
• Most use core resources as reference data
200 databases, 100 institutions
• EBI: 27 databases
• Most institutions: 1 database
[Chart: cumulative databases against number of institutions]
Costs per database
[Chart: cumulative costs to date and cumulative annual costs (cumulative €K) against number of databases]
Gigabytes per database
[Chart: gigabytes against number of databases]
Modalities
A further 20% intend to offer Web Services
Data downloadable in their entirety
• Yes: 113
• Technical limits: 67
• Usage restrictions: 28
32 charge commercial users
48 restrict reuse
23 report confidentiality constraints
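The downloadability counts can be expressed as shares of the responders. This sketch assumes the three chart labels on this slide (113 "Yes", 67 "Technical limits", 28 "Usage restrictions") are the answers to "Data downloadable in their entirety"; that reading is supported by the fact that they sum to the 208 responders, but it is an interpretation of the chart.

```python
# Assumed reading of the chart labels: answers to
# "Is the data downloadable in its entirety?"
downloadable = {
    "yes": 113,
    "technical limits": 67,
    "usage restrictions": 28,
}

total = sum(downloadable.values())  # 208, the number of survey responders

# Share of responders giving each answer, as a percentage.
shares = {answer: 100 * count / total for answer, count in downloadable.items()}

for answer, pct in shares.items():
    print(f"{answer}: {pct:.1f}%")
```

On this reading, a little over half of the responding databases can be downloaded in full without limits.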
Sources of funding
[Chart: funding by source: national, institutional, European, some non-European, some commercial, no formal funding]
Cumulative citations per database
Top 30 (in no particular order): CATH, Dali, DSSP, Ensembl, GeneCards, GO, IMGT/HLA, InterPro, MIPS PlantsDB, Pfam, PRINTS, PROSITE, SMR, UniProt, UniProtKB/Swiss-Prot, ArrayExpress, CAZy, CYGD, GOA-UniProt, HGMD, HSSP, MEROPS, miRBase, PDBREPORT, Rfam, SUPERFAMILY, GOLD, Reactome, STRING, BRENDA
[Chart: cumulative citations against number of databases]
Cumulative monthly hits
[Chart: cumulative hits against number of databases]
Unique users
[Chart: number of databases against number of users]
Survey: miscellaneous points
• Mirrors are the exception
• Substantial curation in half of the databases (30% do none)
• 42% consider themselves unique; 58% have comparable partners
• Only a fifth (of the 58%) exchange data
• 70% collect usage data
• Mostly directed at bioinformaticians and biologists/bench scientists
• 30-45% of databases admit being incomplete or out of date
• Users are not normally asked to register (<5%)
• Preferred usage metric = web hits
• 106 institutions also host tools
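The nested percentage ("a fifth of 58%") is easier to read as a share of all responding databases. A minimal sketch follows, assuming the percentages refer to the 208 survey responders; the slide does not state the base explicitly.

```python
# Figures from the slide.
share_with_partners = 0.58   # databases with comparable partners
share_exchanging = 0.20      # "a fifth" of those exchange data

# Assumption: percentages are relative to the 208 survey responders.
responders = 208

overall_share = share_with_partners * share_exchanging
approx_databases = round(overall_share * responders)

print(f"{overall_share:.1%} of responders, roughly {approx_databases} databases")
```

So only about one responding database in nine exchanges data with a comparable partner.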
Scope and nature of database provision: some reflections from the committee
ELIXIR scope
• The committee endorsed ELIXIR’s biomolecular focus
• Cautioned against over-expansion of that focus
• For example, connect to medical data, rather than expanding the scope of ELIXIR
• Shared ontologies will be crucial to this, and should receive appropriate attention within ELIXIR
• Biologically active small molecules are in scope
• It is important to have public domain chemical resources available
Core and non-core distinction
• Core databases are complete collections of universal scientific value
• Core databases require that the providing institution takes on long-term responsibility
• Investigator-led databases: scope and persistence reflect the interests of the research group; not core; funded from appropriate research funds
• Examples of core databases: UniProt, EMBL-Bank, MSD, Ensembl and ArrayExpress
• Non-core databases can be candidates for core
• Mechanism to move databases in and out of core, and to create or discontinue database projects
Non-core databases
• Support databases are built to support the operation of the core databases or to be used in conjunction with them to increase their value. For example, they may provide controlled vocabularies for a range of core databases (say, organism names).
• Investigator-led databases are typically the product of research groups (though they may well be served to external users). Their content reflects the research interests of their provider (e.g., documenting catalytic sites).
• Specialist databases handle data whose structure cannot easily be represented in the more general databases (say, immunoglobulins).
• Derivative/summarising databases combine and organise data from a range of other databases, such as a non-redundant set of coding sequences.
Consent and confidentiality
• Where data can be traced to individuals, access restrictions associated with confidentiality, consent and ethics must be applied
• This must not be confused with protectionism
Data release and publication
• Data discussed in a publication should be made available before or at the time of publication
• In some domains early publication might actually be a far from complete analysis of a complex data set
• Even where this is the case, the normal rules should apply for biomolecular data
• These norms apply to “conventional”, “hypothesis-driven” research
• Projects whose funding is justified by the creation of shared data collections should make their data available as soon as they are useful
Global perspective
ELIXIR: European focus, with global perspective
Typically core resources:
• are embedded in global collaborations
• have global data exchange agreements
Data structure
[Diagram: EBI and nodes, holding core and non-core data]
We cannot do everything: we will have to review existing databases and consider proposals for new databases.
New and proposed data resources
• Demand
• Scientific case
• User demand
• Funding agency demand
• Data generator demand
• Journal demand
• Standardisation and connectivity
• Appropriateness to ELIXIR
• In scope
• Freely available
• Community support
• Arrangements with global peers
• Strategic need, e.g., as a European player
• Can it be left to another provider?
• Can ELIXIR be globally competitive?
• Is persistence assured?
Scale and cost effectiveness
• Scale and cost
• Database size and complexity
• Cost
• Staff requirement
• Data flow rates
• Usage
• Volume of usage
• Number of users
• Citations
[Diagram: core biomolecular resources and the resources around them]
• Core biomolecular resources
• Large resources in related disciplines: medical data resources, biodiversity data resources, chemical data resources
• Specialist biomolecular data resource examples: BRENDA, IMGT, Pasteur DBs
• Model organism resource examples: SGD, Flybase, MGD, Eumorphia/Phenotypes, Mutants, Mouse Atlas
Medical data resources Core biomolecular resources
Principles & Recommendations
• Data sharing is the norm in the biomolecular domain
• ELIXIR should espouse the strongest possible public domain principles (Question?)
• Data supported by the infrastructure should be downloadable in their entirety and subject to no restrictions in use and reuse (Question?)
• Insistence on acknowledgement is acceptable
• Prohibiting the distribution of a modified data collection in a form which could be confused with the original is acceptable
• Service organisations should exist in a research context
• Collaboration and avoidance of duplication in core data archives are essential
• Creative competition on services is desirable
• Databases produced by research with primary responsibilities only to their research group are not ELIXIR and should be identified as such (hobby databases)
• ELIXIR data resources must connect with their global context
• Standardisation and interoperability are crucial