
Research data archiving and publication in a well-defined physical science discipline

This forum focuses on the archiving and publication of research data in the field of physical sciences. It explores the role of publishers in data management and promotes the use of the crystallographic data file format (CIF).


Presentation Transcript


  1. DCC Research Data Management Forum 8: Research data management - engaging with the publishers, Southampton, 29-30 March 2012
     Research data archiving and publication in a well-defined physical science discipline
     Brian McMahon, International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, UK
     bm@iucr.org

  2. International Union of Crystallography
     • International Scientific Union
     • Publishes 8 research journals:
       • Acta Crystallographica Section A: Foundations of Crystallography
       • Acta Crystallographica Section B: Structural Science
       • Acta Crystallographica Section C: Crystal Structure Communications
       • Acta Crystallographica Section D: Biological Crystallography
       • Acta Crystallographica Section E: Structure Reports Online
       • Acta Crystallographica Section F: Structural Biology and Crystallization Communications
       • Journal of Applied Crystallography
       • Journal of Synchrotron Radiation
     • Publishes major reference work International Tables for Crystallography (8 volumes)
     • Promotes standard crystallographic data file format (CIF)

  3. Crystallographic (X-ray diffraction) experiment

  4. Data relevant to publication
     • Data can mean any or all of:
       • raw measurements from an experiment
       • processed numerical observations
       • derived structural information
       • variable parameters in the experimental set-up or numerical modelling and interpretation
       • bibliographic and linking information
     • We make no fundamental distinction between data and metadata: metadata are data that are of secondary interest to the current focus of attention.

  5. CIF example: simple tag, value structure

         #.............................................................................
         data_manuscript
         _manuscript_summary
         ;
         This is some dummy text to show how a multiple data-block STAR file works!
         ;

         data_crystal_structure
         _chemical_formula       'C13 H12 O5'
         _chemical_name
         ;
         3-(2,5-dihydro-4-hydroxy-5-oxo-3-phenyl-2-furyl)propionic acid
         ;
         _publication_title
         ;
         Structure of WF-3681, 3-(2,5-Dihydro-4-hydroxy-5-oxo-3-phenyl-2-furyl)propionic Acid.
         ;
         _cell_a                 18.757(8)
         _cell_b                 7.282(2)
         _cell_c                 17.511(8)
         _cell_alpha             90
         _cell_beta              91.20(3)
         _cell_gamma             90
         _cell_volume            2391(3)
         _symmetry_space_group   '-C 2yc'
         loop_
         _symmetry_pos_in_xyz
         'x,y,z'  '-x,-y,-z'  '-x,y,1/2-z'
         'x,-y,1/2+z'  '1/2+x,1/2+y,z'  '1/2-x,1/2-y,-z'
         '1/2-x,1/2+y,1/2-z'  '1/2+x,1/2-y,1/2+z'

     Supports a variety of data types: string, extended text, numerical, experimental numerical written as value(s.u.), and comments.
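The value(s.u.) notation on this slide packs a measurement and its standard uncertainty into one string, with the bracketed digits applying to the last decimal places of the value. The following is a minimal sketch of how such a string could be split; the function name `parse_value_su` is an illustrative assumption, and the pattern deliberately ignores exponent notation, which a full CIF reader would also need to handle.

```python
import re
from decimal import Decimal

# Plain decimal number with an optional bracketed s.u., e.g. 18.757(8).
# Assumption: no exponent notation, which real CIF numbers may use.
_SU_PATTERN = re.compile(r"^([+-]?\d*\.?\d+)(?:\((\d+)\))?$")


def parse_value_su(text):
    """Split a CIF numeric string into (value, standard uncertainty)."""
    match = _SU_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a plain CIF number: {text!r}")
    number, su_digits = match.groups()
    value = Decimal(number)
    if su_digits is None:
        return value, None  # exact or unrefined value, no s.u. given
    # The s.u. digits share the decimal exponent of the value:
    # 18.757(8) has exponent -3, so the s.u. is 8 x 10^-3.
    exponent = value.as_tuple().exponent
    return value, Decimal(su_digits).scaleb(exponent)


if __name__ == "__main__":
    for item in ("18.757(8)", "91.20(3)", "2391(3)", "90"):
        print(item, "->", parse_value_su(item))
```

`Decimal` is used rather than `float` so that trailing zeros such as those in `91.20(3)` keep their meaning as significant figures.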

  6. CIF as a vehicle for article submission

         _publ_section_title
         ;\
         Diaquahexa-\m~2~-dichloroacetato-\m~3~-oxido-\
         tetrahydrofurandiiron(III)manganese(II)diiron(III)
         ;
         loop_
         _publ_author_name
         _publ_author_address
         'Sadeghi, Omid'
         ;
         Department of Chemistry
         General Campus
         Shahid Beheshti University
         Tehran 1983963113
         Iran
         ;
         'Ng, Seik Weng'
         ;
         Department of Chemistry
         University of Malaya
         50603 Kuala Lumpur
         Malaysia
         ;
         _publ_section_abstract
         ;
         In the oxido-centered title compound,
         [Fe~2~Mn(C~2~HCl~2~O~2~)~6~O(C~4~H~8~O)(H~2~O)~2~], the central O atom is
         linked to three metal atoms, which are themselves each linked to four
         dichloroacetate anions and are in a triangular configuration. Two of the
         metal atoms are each coordinated by a water molecule, whereas the third is
         coordinated by a tetrahydrofuran molecule. In the crystal, adjacent
         molecules are linked by O---H...O and O---H...Cl hydrogen bonds across
         centers of inversion, generating a hydrogen-bonded chain along the
         <i>c</i> axis. The Mn^II^ centers are disordered with respect to the
         Fe^III^, and the same metal site is occupied by 1/3Mn + 2/3Fe.
         ;

         data_I
         _audit_creation_method            SHELXL-97
         _chemical_name_systematic
         ;\
         Diaquahexa-\m~2~-dichloroacetato-\m~3~-oxido-\
         tetrahydrofurandiiron(III)manganese(II)
         ;
         _chemical_formula_iupac           '[Fe2 Mn (C2 H Cl2 O2)6 O (C4 H8 O) (H2 O)2]'
         _chemical_formula_sum             'C16 H18 Cl12 Fe2 Mn O16'
         _chemical_formula_weight          1058.34
         _symmetry_cell_setting            triclinic
         _symmetry_space_group_name_H-M    'P -1'
         _symmetry_space_group_name_Hall   '-P 1'
         loop_
         _symmetry_equiv_pos_as_xyz
         'x, y, z'
         '-x, -y, -z'
         _cell_length_a                    9.380(1)
         _cell_length_b                    13.316(1)
         _cell_length_c                    15.432(1)
         _cell_angle_alpha                 90.131(1)
         _cell_angle_beta                  100.067(1)
         _cell_angle_gamma                 97.677(1)
         _cell_volume                      1880.1(2)
         _cell_formula_units_Z             2
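The title and abstract on this slide carry plain-text typesetting codes: `~...~` marks a subscript, `^...^` a superscript, and `\m` a Greek mu. A rough sketch of how a publication pipeline might turn those three codes into HTML follows. The function name `cif_markup_to_html` is hypothetical, only these three codes are handled, and this is not a description of the IUCr's actual production software.

```python
import re


def cif_markup_to_html(text):
    """Convert a small subset of CIF text markup to HTML.

    Handles only ~subscript~, ^superscript^ and \\m (Greek mu); other
    CIF codes are passed through unchanged.
    """
    text = re.sub(r"~([^~]*)~", r"<sub>\1</sub>", text)
    text = re.sub(r"\^([^^]*)\^", r"<sup>\1</sup>", text)
    return text.replace("\\m", "\u03bc")


if __name__ == "__main__":
    print(cif_markup_to_html("Diaquahexa-\\m~2~-dichloroacetato-\\m~3~-oxido"))
    print(cif_markup_to_html("The Mn^II^ centers are disordered"))
```

Keeping the markup as plain ASCII in the CIF is what lets the same file serve both as a data record and as manuscript source.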

  7. Information exchange/archive standard: Crystallographic Information Framework
     • File format (CIF)
     • Tool chain: parsers, libraries, editors, database loaders
       • Fortran 77, C, C++, Python, Perl
     • Interchangeable with other formats (PDBML, CML)
     • Information model schemas (CIF dictionaries)
     • Schema language (Dictionary Definition Language, DDL)
     • Domain ontologies (coreCIF, mmCIF, pdCIF, imgCIF)
     • Integrated approach
       • Does not differentiate between 'data', 'metadata', 'publication content' etc.

     International Tables for Crystallography, Volume G: Definition and Exchange of Crystallographic Data
     • Sydney Hall and Brian McMahon (Editors)
     • ISBN: 978-0-470-68910-3
     • Hardcover. 606 pages. First edition August 2005
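To give a feel for what the parsers in such a tool chain do, here is a minimal sketch of a CIF reader covering data blocks, tag/value pairs, quoted strings, semicolon text fields and `loop_` lists. The names `tokenize` and `parse_cif` are illustrative assumptions, not the API of any existing CIF library, and the sketch skips parts of the syntax (save frames, embedded quotes, line folding) that production parsers handle.

```python
import re

# Quoted strings first, then any run of non-blank characters.
_TOKEN = re.compile(r"""'[^']*'|"[^"]*"|\S+""")


def tokenize(cif_text):
    """Yield (kind, text); kind is 'value' for quoted/text fields, else 'word'."""
    lines = cif_text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # Text field: runs until the next line that starts with ';'.
            body = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            yield ("value", "\n".join(body).strip())
        else:
            for match in _TOKEN.finditer(line):
                token = match.group()
                if token.startswith("#"):
                    break  # comment runs to end of line
                if token[0] in "'\"":
                    yield ("value", token[1:-1])
                else:
                    yield ("word", token)
        i += 1  # next line (or step past the closing ';')


def _starts_item(token):
    """True for an unquoted data name, data_ block header or loop_ keyword."""
    kind, text = token
    low = text.lower()
    return kind == "word" and (
        text.startswith("_") or low.startswith("data_") or low == "loop_")


def parse_cif(cif_text):
    """Return {block_name: {tag: value, or list of values for looped tags}}."""
    tokens = list(tokenize(cif_text))
    blocks, current, pos = {}, None, 0
    while pos < len(tokens):
        kind, text = tokens[pos]
        if kind == "word" and text.lower().startswith("data_"):
            current = blocks.setdefault(text[5:], {})
            pos += 1
        elif current is None:
            raise ValueError(f"content before first data block: {text!r}")
        elif kind == "word" and text.lower() == "loop_":
            pos += 1
            tags = []
            while (pos < len(tokens) and tokens[pos][0] == "word"
                   and tokens[pos][1].startswith("_")):
                tags.append(tokens[pos][1])
                pos += 1
            values = []
            while pos < len(tokens) and not _starts_item(tokens[pos]):
                values.append(tokens[pos][1])
                pos += 1
            # Values are stored row by row; slice out one column per tag.
            for column, tag in enumerate(tags):
                current[tag] = values[column::len(tags)]
        elif kind == "word" and text.startswith("_"):
            if pos + 1 >= len(tokens):
                raise ValueError(f"missing value for {text}")
            current[text] = tokens[pos + 1][1]
            pos += 2
        else:
            raise ValueError(f"unexpected token: {text!r}")
    return blocks


# Sample adapted from the slide 5 example.
SAMPLE = """\
data_crystal_structure
_chemical_name
;
3-(2,5-dihydro-4-hydroxy-5-oxo-3-phenyl-2-furyl)propionic acid
;
_cell_a 18.757(8)
_cell_beta 91.20(3)
loop_
_symmetry_pos_in_xyz
'x,y,z' '-x,-y,-z' '-x,y,1/2-z'
'x,-y,1/2+z'
"""

if __name__ == "__main__":
    for tag, value in parse_cif(SAMPLE)["crystal_structure"].items():
        print(tag, "=", value)
```

Values are left as strings on purpose: interpreting `18.757(8)` as a number with an uncertainty is a job for the dictionary layer, which knows the type of each data name.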

  8. A word about standards
     Standards are great: everyone should have one!
     A screenshot from an editorial workstation ca. 1992, demonstrating that standards have great practical use even before they are universally adopted. The file STARIN is in CIF format; numeric data have been keyed into it by hand. From it can be derived input files ('card decks') for different contemporary analysis software, and their various output files can be mined to generate a coherent validation report. Nowadays, with CIF a universal standard in the field, validation is run by a single service (checkCIF).

  9. Why publish data?
     Some reasons:
     • To enhance the reproducibility of a scientific experiment
     • To verify or support the validity of deductions from an experiment
     • To safeguard against error
     • To safeguard against fraud
     • To allow other scholars to conduct further research based on experiments already conducted
     • To allow reanalysis at a later date, especially to extract 'new' science as new techniques are developed
     • To provide example materials for teaching and learning
     • To provide long-term preservation of experimental results and future access to them
     • To permit systematic collection for comparative studies

  10. IUCr journal policy (0): Bibliographic and linking 'metadata'
      • Standard propagation through publishing industry channels:
        • Abstracting and indexing services
        • Bibliographic databases
        • Digital Object Identifiers (DOI), registered through CrossRef
        • OpenURL discovery services
        • RSS feeds
        • Indexing by Google Scholar, Microsoft etc.
      • JISC #jiscopenbib
        • Collaboration with Open Knowledge Foundation, Unilever Cambridge Centre for Molecular Informatics; support from Public Library of Science, Oxford University ('Open Citation' project)
        • decouple bibliographic metadata from IP issues
        • pragmatic metadata harvesting from diverse schemas

  11. IUCr journal policy (1): Derived data
      • For crystal/molecular structures with small unit cells (inorganic, metal-organic, organic)
        • Atomic coordinates, anisotropic displacement parameters, molecular geometry and intermolecular contacts
        • Experimental parameters, unit-cell dimensions, space group information
        • Reference and modulated structure subsystems for aperiodic composite structures
        must be supplied in CIF format as an integral part of article submission and are freely available for download
      • For biological macromolecular structures
        • Atomic coordinates, anisotropic or isotropic displacement parameters, space group information, secondary structure and information about biological functionality
        must be deposited with the Protein Data Bank before or in concert with article publication; the article will link to the PDB deposition using the PDB reference code.
        • Relevant experimental parameters, unit-cell dimensions
        are required as an integral part of article submission and are published within the article

  12. IUCr journal policy (2): Processed experimental data
      • For crystal/molecular structures with small unit cell (inorganic, metal-organic, organic)
        • Structure factors
        • Rietveld profiles
        must be supplied in CIF format as an integral part of article submission and are freely available for download. SHELXL instruction files are also required for validation
      • For biological macromolecular structures
        • Structure factors
        must be deposited with the Protein Data Bank before or in concert with article publication; the article will link to the PDB deposition using the PDB reference code.

  13. IUCr journal policy (3): Primary experimental data
      For crystal/molecular structures with small unit cell and for biological macromolecular structures: IUCr journals have no current policy regarding publication of diffraction images or similar raw data entities. However, IUCr Commissions are interested in the possibility of establishing community practices for the orderly retention and referencing of such data sets, and the IUCr would like to see such data sets become part of the routine record of scientific research in the future.
      • Typical size of raw data set (collection of diffraction images) ~ few Gb
      • Not large enough to warrant dedicated data centres
      • Too large for existing database operations (CCDC, PDB)
      • Retention by individual scientists ~ 1 year
      • Possibility of retention by laboratories/experimental facilities
      • Distributed archive
      • Requires identification/linking protocols to publications

  14. Extending the policy: Diffraction Data Deposition Working Group
      • Convened at the IUCr Congress, Madrid, August 2011
      • Core working group: 8 members, 5 consultants from various crystallographic disciplines; national facility representatives; curated databases; data archiving repositories; ICSTI/CODATA representatives
      • Community consultation group: ~ 50 additional members representing IUCr Commissions, experimental facilities, journal editors, practising crystallographers
      • Open interaction with the wider community through a public discussion forum http://forums.iucr.org/viewforum.php?f=7

  15. Validation of published data
      • For crystal/molecular structures with small unit cell (inorganic, metal-organic, organic)
        • All structural models are processed upon article submission by the IUCr checkCIF suite and the resultant report scrutinised during the peer review process
        • Structure factor files are now analysed with checkCIF and a report also generated for review
        • checkCIF reports are published as supplementary documents for every Acta Cryst. E article
        • SHELXL refinement instruction files are required
      • For biological macromolecular structures
        • Structural models are annotated and curated by PDB staff during deposition; authors are consulted and given opportunities for revision if errors or anomalies are found
        • Structure factor files are checked at the PDB against the deposited structure
        • Data validation reports must be supplied to IUCr journals
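One simple kind of internal consistency test that machine-readable data make possible is recomputing a derived quantity from the values it depends on. The sketch below recalculates the unit-cell volume from the cell lengths and angles using the standard triclinic formula and compares it with the reported `_cell_volume` values from the two CIF examples earlier in the talk. It is offered only as an illustration of the idea; the function names and the three-s.u. tolerance are assumptions, and nothing here describes how checkCIF itself is implemented.

```python
import math


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume from edge lengths and angles (angles in degrees).

    V = abc * sqrt(1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
                   + 2 cos(alpha) cos(beta) cos(gamma))
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(
        1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


def volume_is_consistent(reported, su, computed, n_su=3.0):
    """True if the reported volume lies within n_su s.u. of the computed one.

    The factor of 3 is an arbitrary illustrative tolerance.
    """
    return abs(reported - computed) <= n_su * su


if __name__ == "__main__":
    # Triclinic cell from the article-submission example: 1880.1(2) reported.
    triclinic = cell_volume(9.380, 13.316, 15.432, 90.131, 100.067, 97.677)
    print(round(triclinic, 1), volume_is_consistent(1880.1, 0.2, triclinic))

    # Monoclinic cell from the tag/value example: 2391(3) reported.
    monoclinic = cell_volume(18.757, 7.282, 17.511, 90, 91.20, 90)
    print(round(monoclinic, 1), volume_is_consistent(2391, 3, monoclinic))
```

Both example cells pass this check, which is the sort of result a reviewer would expect to see summarised in a validation report.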

  16. Data visualization
      Three-dimensional structural models can be visualized interactively in IUCr journals in several ways:
      • From journal contents lists
        • '3d view' provides Jmol applet (all CIFs)
        • '3d view' allows browsers to launch helper applications such as Mercury, Rasmol (multi-structure CIFs)
      • Within journal articles
        • Enhanced Jmol figures created by authors
        • Links to PDB visual representations

  17. Data publication: beyond journals
      eCrystals – Southampton
      Archive for crystal structures generated by the Southampton Chemical Crystallography Group and the EPSRC UK National Crystallography Service
      http://ecrystals.chem.soton.ac.uk/

  18. The future
      Not all crystallography journals accept structural data for deposit, and IUCr journals do not (at present) accept primary data sets. Ensuring that data are retained in the permanent record of science will require:
      • increased willingness by journals to deposit structural data and structure factors
      • increased willingness of structural databases to store and curate structural data, structure factors and perhaps primary data
      • increased collaboration between journals and structural databases that archive data sets
      • assignment of persistent identifiers to unpublished data held in
        • domain-specific repositories (e.g. eCrystals at Southampton)
        • institutional repositories
        • synchrotron and other experimental facilities
        • image data stores (e.g. Atlas, Rutherford Appleton Laboratories)
      • community codes of practice for storing, managing, archiving and accessing research data sets
      • standard 'compound document' descriptions to link data and publications
