Afternoon session: The archival problem and infrastructure for solutions. Prof John R Helliwell, john.helliwell@manchester.ac.uk. Interactive Publications and the Record of Science, ICSTI Winter Workshop, Paris, Monday, February 8, 2010. JRH research, publications background.
Afternoon session: The archival problem and infrastructure for solutions
Prof John R Helliwell
john.helliwell@manchester.ac.uk
Interactive Publications and the Record of Science
ICSTI Winter Workshop, Paris, Monday, February 8, 2010
JRH research, publications background
• Professor of Structural Chemistry
• DSc Physics
• Approx. 200 research papers; 5 books (2 as monographs)
• Editor-in-Chief of journals published by IUCr, 1996-2005 (Acta Crystallographica, Journal of Applied Crystallography, Journal of Synchrotron Radiation)
• IUCr Representative to ICSTI
What needs to be in place for interactive content to be available in the future?
• Emulation of legacy software environments?
• How to package, identify and interlink the independent components of a complex article?
• Can we handle distributed articles?
• Can we identify and retrieve slices through large archived data sets?
• How to work with changing data sets?
• What is worth keeping anyway?
The importance of data for publication
• Interactive figures depend on data
• Semantic value is added to data, or forms additional (meta)data
• Fundamental principle of research publication: the work is reproducible
  • exact experimental conditions are given
  • data are preserved/accessible
  • in a recent case of animal clones, 'samples' also had to be made available upon request
• Increasing requirement to archive primary data
Data and publication in crystallography
• A reasonable state of affairs ...
  • molecular models archived by journals (CIFs: interactive figures)
  • reduced diffraction data preserved by databases or some journals (data validation; retracted papers)
• ... but with room for improvement
  • molecular dynamics for the crystalline state difficult to interpret; whole diffraction images preferable for archiving
  • scientific fraud in structural biology/chemistry: archiving of diffraction images provides better security against such frauds
  • but diffraction images from crystal diffraction experiments are uncompressed and file sizes are large; thus limited appetite (and resources) to preserve them
Crystals, diffraction spots and smears, molecules and dynamics
Some archive technical details
• Protein Data Bank: 60,000 macromolecular structures
  • 80% derived from crystal structure analysis
  • archive doubling in size every 2 to 3 years
  • coordinate file for a typical protein ~0.25 MB; derived from core diffraction data of ~1 MB; extracted from ~1 GB of diffraction image data
  • data sets need to be archived in quintuplicate (EBI Director to JRH, Jan 12, 2010)
  • thus 60,000 x 1 GB x 5 = 300 terabytes of primary data for PDB currently
  • cost estimate for PDB to be the sole primary archive provider ca. GBP 200,000 per annum; PDB unable to take on this responsibility
• Currently the researcher agrees to hold project diffraction images for at least 5 years and release them upon request; no archiving commitment from research sponsor
• Solution in distributed or federated archives (experimental facilities / laboratories / data repositories)?
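The storage estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, 1 TB is taken as 1000 GB, and the projection assumes (this is not stated on the slide) that the primary-data volume grows at the same doubling rate as the archive itself, with ~1 GB of images per structure held constant.

```python
# Figures quoted on the slide.
N_STRUCTURES = 60_000          # macromolecular structures in the PDB
IMAGES_GB_PER_STRUCTURE = 1.0  # ~1 GB of diffraction images per structure
REPLICAS = 5                   # data sets archived in quintuplicate


def primary_archive_tb(n_structures, gb_per_structure, replicas):
    """Total primary-data storage in terabytes (assumes 1 TB = 1000 GB)."""
    return n_structures * gb_per_structure * replicas / 1000


def projected_tb(current_tb, years, doubling_time_years):
    """Projected storage after `years`, assuming steady exponential growth.

    Assumption (not from the slide): primary data grows with the same
    doubling time as the archive, with per-structure size unchanged.
    """
    return current_tb * 2 ** (years / doubling_time_years)


if __name__ == "__main__":
    current = primary_archive_tb(N_STRUCTURES, IMAGES_GB_PER_STRUCTURE, REPLICAS)
    print(f"Current primary data: {current:.0f} TB")
    # The slide gives a doubling time of 2 to 3 years; show both bounds.
    for doubling in (2.0, 3.0):
        future = projected_tb(current, years=5, doubling_time_years=doubling)
        print(f"After 5 years (doubling every {doubling:.0f} y): {future:.0f} TB")
```

With the slide's inputs, the first function returns the 300 TB figure quoted above; the projection is only as good as the constant-size, constant-growth assumptions behind it.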