Ten Habits of Highly Effective Data
Anita de Waard, VP Research Data Collaborations
a.dewaard@elsevier.com
http://researchdata.elsevier.com/
The Maslow Hierarchy for humans: (the slide shows Maslow's classic pyramid of human needs)

A Maslow Hierarchy for Data:
9. Usable (allow tools to run on it)
8. Citable (able to point & track citations)
7. Trusted (validated/checked by reviewers)
6. Reproducible (others can redo experiments)
5. Discoverable (can be indexed by a system)
4. Comprehensible (others can understand data & processes)
3. Accessible (can be accessed by others)
2. Archived (long-term & format-independent)
1. Preserved (existing in some form)
1. Preserve: Data Rescue Challenge
• With IEDA/Lamont: an award for successful data rescue attempts, presented at AGU 2013
• 23 submissions of data that was digitized, preserved, and made available
• Winner: NIMBUS Data Rescue:
• Recovery, reprocessing and digitization of the infrared and visible observations, along with their navigation and formatting
• Over 4,000 7-track tapes of global infrared satellite data were read and reprocessed
• Nearly 200,000 visible-light images were scanned, rectified and navigated
• All the resulting data was converted to HDF5 (NetCDF) format and freely distributed to users from NASA and NSIDC servers (see the conversion sketch below)
• This data was then used to calculate monthly sea ice extents for both the Arctic and the Antarctic
• Conclusion: we (collectively) need to do more of this! How can we fund it?
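A minimal sketch of the kind of format conversion the NIMBUS rescue describes: writing a rescued brightness-temperature grid to a NetCDF-4 (HDF5-based) file with the netCDF4 Python library. The file name, variable name, grid size and attributes are illustrative assumptions, not the project's actual layout.

import numpy as np
from netCDF4 import Dataset  # pip install netCDF4

# Hypothetical rescued grid: one reprocessed infrared image on a 1-degree lat/lon grid.
rescued = np.random.rand(180, 360).astype("f4")

with Dataset("nimbus_ir_rescued.nc", "w", format="NETCDF4") as nc:  # NETCDF4 = HDF5 container
    nc.title = "Rescued NIMBUS infrared observations (illustrative)"
    nc.createDimension("lat", 180)
    nc.createDimension("lon", 360)
    var = nc.createVariable("brightness_temperature", "f4", ("lat", "lon"), zlib=True)
    var.units = "K"
    var[:] = rescued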
2. Archive: Olive Project
• CMU CS & Library; funded by a grant from the IMLS; Elsevier is a partner
• Goal: preservation of executable content, nowadays a large part of intellectual output, and very fragile
• Identified a series of software packages and prepared VMs to preserve them
• Does it work? Yes: see video (1:24)
3. Access: Urban Legend
Part 1: Metadata acquisition
• Step through the experimental process in a series of dropdown menus in a simple web UI
• Can be tailored to the workflow of an individual researcher
• Connected to shared ontologies through a lookup table, managed centrally in the lab (see the sketch below)
• Connects to the data input console (Igor Pro)
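A minimal sketch of the centrally managed lookup table idea: lab-local terms mapped to shared ontology identifiers, with unmapped terms flagged for curation. The specific terms and the OBI identifier are illustrative assumptions; NCBITaxon:10090 (Mus musculus) and UBERON:0000956 (cerebral cortex) are real identifiers.

# Lab-local vocabulary mapped to shared ontology identifiers.
ONTOLOGY_LOOKUP = {
    "mouse": "NCBITaxon:10090",    # Mus musculus
    "cortex": "UBERON:0000956",    # cerebral cortex
    "patch clamp": "OBI:0002176",  # illustrative OBI term, for the shape of the idea
}

def annotate(term: str) -> str:
    """Return the shared ontology ID for a lab-local term, or flag it for curation."""
    return ONTOLOGY_LOOKUP.get(term.lower(), f"UNMAPPED:{term}")

print(annotate("Mouse"))        # NCBITaxon:10090
print(annotate("two-photon"))   # UNMAPPED:two-photon -> candidate for the central table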
4. Comprehend: Urban Legend
Part 2: Data Dashboard
• Access, select and manipulate data: calculate properties, sort and plot (a sketch follows below)
• Final goal: interactive figures linked to data
• Plan to expand to more labs and other data
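A minimal pandas/matplotlib sketch of the select-derive-sort-plot loop such a dashboard automates. The measurements, column names and derived property are invented for illustration.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-cell measurements standing in for the lab's real data.
df = pd.DataFrame({
    "cell_id": ["c1", "c2", "c3", "c4"],
    "voltage_mV": [-65.0, -70.2, -58.4, -62.1],
    "current_pA": [120.0, 95.5, 180.3, 140.8],
})

# Calculate a derived property, then sort on it.
df["resistance_MOhm"] = df["voltage_mV"].abs() / df["current_pA"] * 1000
df = df.sort_values("resistance_MOhm")

# Plot: a first step toward figures that stay linked to their underlying table.
df.plot.bar(x="cell_id", y="resistance_MOhm", legend=False)
plt.ylabel("Input resistance (MOhm)")
plt.tight_layout()
plt.savefig("dashboard_sketch.png")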
5. Discover: Data Discovery Index
• NIH is interested in creating a DDI consortium
• Three places where data is deposited:
• Curated sources for a single data type (e.g. Protein Data Bank, VentDB, Hubble Space Data)
• Non- or semi-curated sources for different data types (e.g. DataDryad, Dataverse, Figshare)
• Tables in papers
• Ways to find this data:
• Cross-domain query tools, e.g. NIF, DataONE, etc.
• Search for papers -> link to data
• But how do we find data in papers?
• Proposal: build prototypes across all of these data sources; this needs NLP, models of data patterns, and what else? (a toy pattern-matching sketch follows below)
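A toy sketch of the pattern-matching side of finding data in papers: regular expressions over article text that flag likely dataset references. The DOI and PDB accession formats are real; the sample sentence and DOI string are invented, and real prototypes would need NLP well beyond this.

import re

# Invented sentence standing in for a paper's methods section.
text = ("Structures were taken from PDB entry 1ABC, and raw counts are "
        "archived at https://doi.org/10.5061/dryad.example123.")

DOI = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")                        # generic DOI shape
PDB = re.compile(r"\bPDB (?:entry |ID )?([0-9][A-Za-z0-9]{3})\b")  # 4-character PDB codes

for doi in DOI.findall(text):
    print("candidate dataset DOI:", doi.rstrip("."))
for code in PDB.findall(text):
    print("candidate PDB accession:", code)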
6. Reproduce: Resource Identifier Initiative
A Force11 Working Group to add resource identifiers to articles that are:
• 1) Machine readable;
• 2) Free to generate and access;
• 3) Consistent across publishers and journals.
• Authors publishing in participating journals will be asked to provide RRIDs for their resources; these are added to the keyword field (a small extraction sketch follows below)
• RRIDs will be drawn from:
• The Antibody Registry
• Model Organism Databases
• The NIF Resource Registry
• So far Springer, Wiley, Biomednet and Elsevier journals have signed up, with 11 journals and more to come
• Goal: wide community adoption!
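A minimal sketch of what "machine readable" buys you: pulling RRIDs out of an article's keyword field with a regular expression. The RRID:<Registry>_<number> shape matches real identifiers (antibodies as RRID:AB_..., tools as RRID:SCR_...), but the sample keyword string and the specific numbers here are invented.

import re

# Invented keyword field from a participating journal's article.
keywords = "electrophysiology; RRID:AB_2138153; mouse cortex; RRID:SCR_002798"

RRID = re.compile(r"RRID:([A-Z]+_\d+)")
for rrid in RRID.findall(keywords):
    # A resolver such as https://scicrunch.org/resolver/ can dereference these.
    print("resource identifier: RRID:" + rrid)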
7. Trust: Moonrocks
• How can we scale up data curation? A pilot project with IEDA: a lunar geochemistry database, to leapfrog and improve curation time
• 1-year pilot, funded by Elsevier
• If spreadsheet columns/headers map to an RDB schema, we can scale up the curation process and move from tables to curated databases (see the sketch below)
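A minimal sketch of the header-to-schema idea: normalize an author spreadsheet's column names against a curated mapping, then load the rows into a relational table. The headers, mapping, values and table are illustrative assumptions, not IEDA's actual schema.

import sqlite3
import pandas as pd

# Hypothetical author spreadsheet with idiosyncratic headers.
sheet = pd.DataFrame({
    "Sample #": ["A-101", "A-102"],
    "TiO2 (wt%)": [7.9, 9.2],
    "Lat.": [0.67, 0.68],
})

# Curated mapping from observed headers to the database schema.
HEADER_MAP = {"Sample #": "sample_id", "TiO2 (wt%)": "tio2_wt_pct", "Lat.": "latitude"}
curated = sheet.rename(columns=HEADER_MAP)

with sqlite3.connect("lunar_geochem.db") as db:
    curated.to_sql("measurements", db, if_exists="append", index=False)
    print(db.execute("SELECT * FROM measurements").fetchall())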
8. Cite: Force11 Data Citation Principles
Another Force11 Working Group: it defined 8 principles and is now seeking endorsement and working on implementation.
• Importance: Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
• Credit and attribution: Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.
• Evidence: Where a specific claim rests upon data, the corresponding data citation should be provided.
• Unique identification: A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
• Access: Data citations should facilitate access to the data themselves and to such associated metadata, documentation, and other materials as are necessary for both humans and machines to make informed use of the referenced data.
• Persistence: Metadata describing the data, and unique identifiers, should persist, even beyond the lifespan of the data they describe.
• Versioning and granularity: Data citations should facilitate identification of and access to different versions and/or subsets of data. Citations should include sufficient detail to verifiably link the citing work to the portion and version of data cited.
• Interoperability and flexibility: Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities.
A machine-actionability sketch follows below.
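A minimal sketch of what "machine actionable" identification means in practice: resolving a dataset DOI to structured citation metadata via doi.org content negotiation, a real, publisher-independent mechanism. The DOI string below is invented, so substitute a registered one before running.

import requests

doi = "10.5061/dryad.example123"  # invented; replace with a registered dataset DOI

resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
resp.raise_for_status()
record = resp.json()
print(record.get("title"), "|", record.get("DOI"))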
9. Use: Executable Papers
• Result of a challenge to come up with cyberinfrastructure components to enable executable papers
• Pilot in Computer Science journals:
• See all the code in the paper
• Save it, export it
• Change it and rerun it on the data set
10. Let's allow our data to be happy!
9. Usable (allow tools to run on it)
8. Citable (able to point & track citations)
7. Trusted (validated/checked by reviewers)
6. Reproducible (others can redo experiments)
5. Discoverable (can be indexed by a system)
4. Comprehensible (others can understand data & processes)
3. Accessible (can be accessed by others)
2. Archived (long-term & format-independent)
1. Preserved (existing in some form)
Alongside the hierarchy, the slide maps the metadata and pipeline stages that feed each layer:
• Validation metadata: reproduction, curation; selection, citation, usage, metrics
• Record metadata: DOI, date, author, institute, etc.
• Experimental metadata: objects, procedures, properties
• Analyze (mathematical/computational processes and analytics) -> Processed Data
• Execute (direct settings on equipment, circumstances of measurement) -> Raw Data
• Prepare (reagents, species/specimen/cell type, preparation details) -> Entity IDs
Minimize your metadata footprint!
• Recycle:
• Design upstream metadata with downstream processes in mind
• Useful exercise: 'buy a tag', where the users/systems that will store/query/cite the data say what they need to do their job
• Learn from genetics: one datum can play several different roles!
• Reuse:
• 'The good thing about standards is that there are so many to choose from'
• Haendel et al., looking at 54 (!!) data standards, found that many have been used only once or by one group
• Employ a common element set plus modular additions over a whole new schema (see the sketch below)
• Reduce:
• Every tag needs to be added and read by someone or something: this adds cost and waste
• Consider the 'return on investment' per metadata item
• Tim Berners-Lee: what if "http://" were just "h/"?
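A minimal sketch of "common element set plus modular additions": a core record every dataset carries (field names loosely echo Dublin Core), with domain-specific modules bolted on rather than a whole new schema per community. All field and module names here are illustrative assumptions.

# Core element set shared by every record (names loosely echo Dublin Core).
core = {
    "identifier": "doi:10.5061/example",  # illustrative identifier
    "creator": "A. Researcher",
    "date": "2014-03-01",
    "format": "text/csv",
}

# Modular, domain-specific addition instead of a brand-new schema.
geochem_module = {"analyte": "TiO2", "unit": "wt%"}

record = {**core, "extensions": {"geochem": geochem_module}}
print(record)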