A vision involving raw data archiving via local archives as a supplement to the existing processed data archives (PDB, CSD, ICDD etc.)
John R. Helliwell, Brian McMahon, Tom Terwilliger
john.helliwell@manchester.ac.uk bm@iucr.org terwilliger@lanl.gov
Options
• Do nothing to ensure raw data archiving.
• Do what we can, e.g. via raw data archiving at centralised facilities, along with universities' own data archives, both as supplements to the processed data archiving at the CSD, PDB etc.; or, at the very least, via links from personal web pages.
• Seek a blue skies solution in which all raw data are compulsorily archived at centralised repositories.
During the last year detailed options were sketched out: Firstly
• At the Launch Meeting of the Diffraction Data Deposition Working Group (DDD WG) in Madrid it was suggested that a pilot project could be established involving digital object identifier (DOI) registration of a test group of data sets; this would be led by a synchrotron radiation (SR) facility that keeps a raw data archive in any case.
• This was enthusiastically supported, and Diamond Light Source (DLS) agreed to take it forward with 100 MX data sets.
• In parallel, JRH continued to investigate the local university reprint repository option, which (at U. Manchester) accepts data for ‘small data sets’; this revealed that U. Manchester was in any case setting up a data archive for its researchers, so as to satisfy funding bodies' requirements of their grant holders (launch expected September 2012).
• The local University Data Archive would be the vehicle for locally measured diffractometer data sets, and perhaps also for those from SR and neutron facilities that made it into publications by academics at that university.
During the last year detailed options were sketched out: Secondly
• JRH also wrote a draft proposal exploring the possibility of Acta Crystallographica Section E: Structure Reports Online hosting the raw data (the set of diffraction data images) for each structure.
• Preliminary analysis, in discussions with IUCr Journals in Chester, identified network bandwidth as the major bottleneck (Chester has 2 x 2 Mbps; there were also concerns about bandwidth limits on international pipes, especially to individual laboratories).
• Building costs would also be incurred in upgrading a server room for higher-capacity storage, although preliminary estimates suggested that the per-article storage overhead could be sustainable within the journal's open-access charging model.
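To see why bandwidth was identified as the bottleneck, the arithmetic can be sketched as follows. This is only an illustration: the 2 x 2 Mbps figure is the one quoted for Chester, but the data set sizes are hypothetical placeholders (the talk gives no sizes), and protocol overhead and competing traffic are ignored.

```python
def transfer_time_hours(dataset_gigabytes: float, link_mbps: float) -> float:
    """Hours needed to move a data set over a link, ignoring protocol overhead."""
    megabits = dataset_gigabytes * 8 * 1000  # 1 GB = 8000 megabits (decimal units)
    seconds = megabits / link_mbps
    return seconds / 3600


# Capacity quoted for Chester: two links of 2 Mbps each, assumed fully available.
CHESTER_MBPS = 2 * 2

# Hypothetical raw data set sizes in GB (assumptions, not figures from the talk).
for size_gb in (0.5, 1.0, 5.0):
    hours = transfer_time_hours(size_gb, CHESTER_MBPS)
    print(f"{size_gb} GB over {CHESTER_MBPS} Mbps: {hours:.2f} h")
```

Under these assumptions a single 1 GB set of images would occupy the whole link for 2000 seconds, roughly 33 minutes, before any reader had downloaded it.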
JRH and L K-B write an article with links to raw data sets
• Tanley, S. W. M., Schreurs, A. M. M., Helliwell, J. R. and Kroon-Batenburg, L. M. J. (2012). Experience with exchange and archiving of raw data: comparison of data from two diffractometers and four software packages on a series of lysozyme crystals. J. Appl. Cryst. Submitted.
• Explores comparative metadata associated with different instruments, emphasising the benefit of standard ontologies (e.g. imgCIF).
• Demonstrates the scientific usefulness of detailed data reanalysis.
New reports appear from learned bodies
• In addition to the report of ICSU’s Strategic Committee on Data:
• The Royal Society (June 2012) enthusiastically endorses the importance of access to data; its Committee defines data, in its view, as:
• and states: For example, the annual cost of managing the world’s data on protein structures in the worldwide Protein Data Bank is less than 1% of the cost of generating that data.
• Their data definitions unfortunately seem to miss the distinction between processed data and raw data.
Is a Blue Skies option still out of the question?
• One or more centralised global repositories might take on the raw data archiving?
• The PDB has given a careful and detailed analysis at this Workshop.
Is the option of localised repositories (near to where data are measured) secure yet?
• CSynR has started a survey of SR facilities (8 have reported so far), which suggests that this is a promising option; but each SR facility emphasised that it is not to be regarded as an archive. Neither of the following could be guaranteed:
  • instantaneous delivery of data
  • provision of data sets certified to be 100% free of data corruption
• The university data archive experience, even at those most advanced in their planning (e.g. University of Manchester), is yet to be seen in practice, e.g. with respect to the two issues mentioned in the first point above.
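Even where a repository cannot certify its holdings as free of corruption, a depositor or reader can at least detect corruption by recording checksums when the images are written. The following is a minimal sketch of that general technique, not something proposed in the talk; the function names, file names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large diffraction images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(image_dir: Path) -> dict[str, str]:
    """Record one checksum per image file at the time of measurement."""
    return {p.name: sha256_of_file(p) for p in sorted(image_dir.iterdir()) if p.is_file()}


def find_corrupted(image_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return names of files that are missing or no longer match the manifest."""
    bad = []
    for name, expected in manifest.items():
        path = image_dir / name
        if not path.is_file() or sha256_of_file(path) != expected:
            bad.append(name)
    return bad


# Demonstration with two tiny stand-in "images" (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "image_0001.img").write_bytes(b"frame one")
    (folder / "image_0002.img").write_bytes(b"frame two")
    manifest = build_manifest(folder)
    clean_check = find_corrupted(folder, manifest)    # nothing changed yet
    (folder / "image_0002.img").write_bytes(b"frame twp")  # simulate corruption
    damaged_check = find_corrupted(folder, manifest)

print(clean_check, damaged_check)
```

A manifest of this kind could travel with the data set, so that a check remains possible wherever the images are eventually stored.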
Possibilities for SR facility temporary repositories
• Most synchrotron facilities already maintain simple archives of users’ data.
• Perhaps 99% access and availability is plenty (and better than nothing).
• A simple approach:
  • Save raw data at the SR facility, tagged with identifier(s). Optimally, tag the metadata also. (Perhaps one DOI per data set, generated at this time, provided to the user and stored in the image headers.)
  • Processing programs keep track of identifiers so that processed data are linked to raw data.
  • On PDB deposition, the DOI is deposited. On publication, it is listed.
  • The PDB notifies the SR facility, and the flagged data are copied to a long-term storage location.
  • Perhaps some day the PDB pulls these data in.
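The simple approach above can be sketched as a small data model. Everything here is hypothetical: the class and function names are invented for illustration, the identifier prefix "10.0000/example" is a placeholder rather than a registered DOI prefix, and no real facility or PDB interface is being described.

```python
from dataclasses import dataclass, field


@dataclass
class RawDataset:
    identifier: str          # identifier issued when the data are saved
    beamline: str
    image_files: list[str]
    long_term: bool = False  # set once a deposition refers to this data set


@dataclass
class FacilityArchive:
    prefix: str = "10.0000/example"  # placeholder prefix, an assumption
    datasets: dict[str, RawDataset] = field(default_factory=dict)

    def register(self, beamline: str, image_files: list[str]) -> RawDataset:
        """Save raw data at the facility, tagged with a new identifier."""
        identifier = f"{self.prefix}.{len(self.datasets) + 1:06d}"
        dataset = RawDataset(identifier, beamline, list(image_files))
        self.datasets[identifier] = dataset
        return dataset

    def flag_for_long_term(self, identifier: str) -> None:
        """Stand-in for copying flagged data to long-term storage."""
        self.datasets[identifier].long_term = True


@dataclass
class ProcessedData:
    description: str
    raw_identifiers: list[str]  # link from processed data back to raw data


def process(raw: RawDataset, description: str) -> ProcessedData:
    """A processing program carries the raw identifier along with its output."""
    return ProcessedData(description, [raw.identifier])


def deposit(processed: ProcessedData, archive: FacilityArchive) -> list[str]:
    """On deposition, notify the facility and return the identifiers to list."""
    for identifier in processed.raw_identifiers:
        archive.flag_for_long_term(identifier)
    return processed.raw_identifiers


archive = FacilityArchive()
raw = archive.register("hypothetical-beamline", ["image_0001.img", "image_0002.img"])
merged = process(raw, "merged intensities")
listed = deposit(merged, archive)
print(listed, archive.datasets[raw.identifier].long_term)
```

The point of the design is that the identifier is issued once, at measurement time, and every later step only passes it along, so the link from publication back to the raw images never has to be reconstructed after the fact.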
Might we still need additional ‘fallback’ positions?
• Corresponding authors set up web links to the data sets that underpin their publications.
• These may or may not be DOI linked: such a requirement would be difficult to enforce, although journals could ‘strongly recommend’ it.
• How would such a method for data archiving and reader access be kept up to date, e.g. in the event of an author retiring (or after their death)?
Conclusions
• There is enthusiasm and encouragement, in many areas of science besides our own, to archive more than derived or processed data.
• The crystallographic community prides itself on making its processed data accompany its publications; indeed, this has been obligatory for the last 10 years or so.
• We have three practical options in the near future to extend these principles to our raw data:
  • via the local Data Archive
  • via synchrotron data storage
  • or via the corresponding author setting up, on a personal website, a link to the data sets underpinning publications.
So, we suggest a proposal:
• We suggest that we adopt the above three practical options to make feasible a recommendation to the IUCr Executive Committee that:
  • authors should provide a permanent and prominent link from an article to the raw data sets underpinning a journal publication,
  • with a view to making this a formal requirement on authors at such time as the community has adopted raw data deposition as a routine procedure.