Curation: making data suitable for re-use Chris Rusbridge Presentation at FIBS Seminar
Contents • Science and digital curation • What to do with your data: frontiers of practice • Repository frontiers
Digital Curation Centre Mission “The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”
[Images: 2MASS (infrared) and SDSS (visual)] Slide from Rajendra Bose
New discovery… • National Virtual Observatory • Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”
Curation • Data increasingly important as evidence • Key part of the scholarly record • Experimental verifiability (the basis of science) • Allows additional interpretations • Unrepeatable observations & experiments (particularly environmental in broadest sense) • Legal, compliance & transactions • Cultural resources
What kinds of data? • Observations • e.g. UARS (Upper Atmosphere) Level 0: telemetry • UARS Level 1: measured physical parameters (post calibration?) • Derived data • UARS Level 2: calculated geophysical? profiles • UARS Level 3: gridded, interpolated? • Combined data • Crafted data • e.g. annotated gene/protein databases • Descriptive (meta)data
What to do with it? • Keep as part of experiment • Deposit in institutional or discipline repository • Possible time-limited embargoes • Cite it • “Publish” in support of articles
Internet Archaeology: publication with data (sadly, a preservation nightmare!)
What are the reusability issues? • Data not neutral to hypothesis • Hard to know the risks & pitfalls of a particular dataset • Data not self-describing: hard to find appropriate data • Hard to “understand” data once found • Hard to use data once understood
What to do about it? • Build curation/reusability into your workflow • Curation begins before creation • What’s easy at first becomes (impossibly) hard later • Describe your data (metadata) • Keep experimental parameters (technical, who, what, when, where etc) • Keep data descriptions (schemas, “representation information”, etc) • Keep data! • Use standard/agreed formats for data • Make ownership & restrictions clear • Explain how to cite your data
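The "describe your data" advice above can be sketched as a metadata record written alongside the data at creation time. This is a minimal illustration only: the function `describe_dataset`, its field names, and the sample values are hypothetical and do not follow any particular metadata standard.

```python
import hashlib
import json
from datetime import datetime, timezone


def describe_dataset(data, *, creator, instrument, location,
                     data_format, licence, citation):
    """Build a metadata record to keep next to the raw data (hypothetical fields)."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),  # fixity: detects later corruption
        "size_bytes": len(data),
        "creator": creator,                           # who
        "instrument": instrument,                     # technical parameters
        "location": location,                         # where
        "recorded_at": datetime.now(timezone.utc).isoformat(),  # when
        "format": data_format,                        # standard/agreed format
        "licence": licence,                           # ownership and restrictions
        "how_to_cite": citation,                      # how to cite the data
    }


# Illustrative values only; none of these come from the presentation.
raw = b"temp_c,depth_m\n4.1,10\n"
record = describe_dataset(
    raw,
    creator="A. Researcher",
    instrument="example thermistor",
    location="example site",
    data_format="text/csv",
    licence="example licence",
    citation="A. Researcher (year), Example temperature profile",
)
sidecar_json = json.dumps(record, indent=2, sort_keys=True)
```

Writing `sidecar_json` to a file next to the data at capture time is the cheap step; reconstructing the same facts years later is the (impossibly) hard one.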
Data resource stages • Curated data is created… • Observations? Fixed! • Or Acquired… • Data brought/bought from outside • Ingest • Development • Derived, refined, combined, processed data • Potentially many stages
Context • Data meaningless without context • Linkage • Metadata of many kinds • Workflow! • Provenance • Authenticity • Computational lineage
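Computational lineage can be sketched by having every derived dataset record its parents and the process that produced it. The `Dataset` class, the `lineage` function, and the level names below are assumptions made for illustration, not a description of any real provenance system.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    process: str = "observed"          # how this dataset was produced
    parents: tuple[Dataset, ...] = ()  # datasets it was derived from


def lineage(ds: Dataset) -> list[str]:
    """List processing steps from the original observations up to ds."""
    steps: list[str] = []
    for parent in ds.parents:
        steps.extend(lineage(parent))  # oldest ancestors first
    steps.append(f"{ds.name} <- {ds.process}")
    return steps


# Hypothetical chain: raw telemetry, then calibrated values, then derived profiles.
level0 = Dataset("level0")
level1 = Dataset("level1", "calibration", (level0,))
level2 = Dataset("level2", "profile calculation", (level1,))

trace = lineage(level2)
```

Here `trace` walks back from the derived product to the fixed observations, which is the context a later user needs in order to judge the data.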
[Diagram: NASA research group 3, University research group 1, University research group 2, local decision-making body] Slide from Rajendra Bose
Access and re-use • Ethics and rights control access • Weak in expressing this long-term • Collaboration tools • Annotation, discussion, review • Re-use leading to change and development • “Publication” • Not just in “print” • Underlying data should be “published”, too • Citation…
Citation needs… • An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al.) • Not important for original observations • Don’t mess with those data • Less important for incremental datasets • Later stuff should not invalidate earlier • Very important for revisable datasets • E.g. Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change • E.g. Mapping… OS maps represent a huge database that changes on a daily basis
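One naive way to make past states of a revisable dataset citable is to freeze a snapshot at each revision and cite its identifier. The sketch below is an assumption for illustration: it stores full copies, which is simple but not efficient, and it is not the approach of Buneman et al. The class `VersionedDataset` and the sample records are hypothetical.

```python
import hashlib
import json


class VersionedDataset:
    """Keeps every committed state so that a citation can name one exactly."""

    def __init__(self, name):
        self.name = name
        self._snapshots = {}  # version id -> frozen JSON text

    def commit(self, records):
        """Freeze the current state and return its version id."""
        text = json.dumps(records, sort_keys=True)
        version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._snapshots.setdefault(version, text)
        return version

    def cite(self, version):
        """Return a citation string that pins one archived state."""
        if version not in self._snapshots:
            raise KeyError(version)
        return f"{self.name}, version {version}"

    def retrieve(self, version):
        """Return the dataset exactly as it was when cited."""
        return json.loads(self._snapshots[version])


# Hypothetical annotation that a curator later revises.
db = VersionedDataset("Example annotation database")
v1 = db.commit({"geneA": "kinase"})
v2 = db.commit({"geneA": "phosphatase"})
old_state = db.retrieve(v1)
```

A paper that cited `v1` can still be checked against `old_state` after the curators have moved on to `v2`.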
Curation: Individual • “Small science 2-3 times more data than Big science”, but much more at risk • PhD student? RA? PI? Administrator? IT support? • Data potentially on local hard drives, or at best shared network drives • May be inadequately protected • Liable to policy-led deletion on resignation • Individual “knows” too much • Documentation/metadata unlikely to be adequate • Future: gone!
Department: eCrystals • Partnership with Institutional Repository • Specialist department archive (& national service) • Workflow recording of lab parameters (R4L) • Public & private elements • Trying to build eCrystals federation (eBank 3) • Future: likely to continue
Institution: Cambridge Chemistry • 175,000 small molecule structures in CML • Alongside Archaeology, Manuscripts, Learning Materials, etc • No library curation skills; dependent on research group enthusiast • Collection isolated from other Chemistry • (Only 5 UK institutional repositories claim to hold data) • Future: assured…
Community: LOCKSS? • Self-selected group of collectors: closest to genuine open activity (despite Alliance)? • Traditionally libraries collecting eJournals • Model respects IPR • No domain expertise; rely on origins • Data limitations… • Future: potentially very persistent (low cost, high reliability, attack resistance, distributed)
Discipline: Atmospheric Science • Strong believer in need for domain scientists as curators • Significant participant in “community proxy” agenda-setting activities • Internationally fragmented resources • Future: mostly dependent on grant funding (but strong commitment)
Discipline: Pharmacology • International Scientific Union • Attempting to build credit for data contributions • Future: extremely limited funding
Discipline: Bio/Health • UK PubMedCentral! • (you heard about this earlier)
Issues: Nature article 23 June 05 • Databases in Peril • 51 out of 89 biological databases contacted reported they were struggling financially • 7 have closed • Several being updated in owner’s spare time • (Notes that not all deserve long term support) • [Nucleic Acids Research reports 858 databases in 2006!] • Major issue: money
Publisher: Crystallography • Publisher and Scientific Union • Created key domain crystallographic standard (CIF) • Strong motivator for deposit of structure data • Consistent quality checks • DOIs used for structure data • Future: publishing business model Slide from IUCr
National bodies: British Library • Serious and robust approach • Legal deposit powers & responsibilities as driver • Oriented primarily towards “cultural heritage” (broadly interpreted) • Little data, no science domain experience • Future: strong future commitment
National bodies: TNA/NDAD • Specialist archive for government datasets • Understand government regulations, dynamics & requirements • Subject generalists; disconnected from associated science • Technology specialists (understand databases) • Future: likely to pass eventually to The National Archives
3rd parties: Portico • Specific area: eJournals • Depends on publisher agreements • No data or domain science expertise • Future: commitment from Mellon + publishers + subscriptions, good funding mix
3rd Parties: Iron Mountain? • Records management IS a curation problem • Organisations like this very likely to branch out • No domain science expertise • Future: business case, viability, stock market…
Institutions & the network • Institutions have fundamental sustainability • Disciplines have domain knowledge advantage but sustainability is an issue • Can we get the best of both?