Looking to the longer term: some perspectives on data curation and preservation
Dr Liz Lyon, DCC Associate Director (Outreach); Director, UKOLN, University of Bath, UK
This work is licensed under a Creative Commons Licence: Attribution-ShareAlike 2.0
About UKOLN
• “A centre of expertise in digital information management”
• Funding: Joint Information Systems Committee (JISC) + Museums, Libraries & Archives Council (MLA)
• Portfolio of R&D projects: Delos, DRIVER, Grand Challenge
• 29+ staff based at the University of Bath
• Informs the library, information, education and cultural heritage communities
• Policy and advocacy at national level; innovative Web-based systems & services; R&D; e-journal Ariadne; workshops and conferences
• http://www.ukoln.ac.uk/
Acknowledgement: Alex Ball, Grand Challenge Project
UK Digital Curation Centre
• Funded by JISC & EPSRC
• Development activities
• Research agenda
• Delivering services
• Outreach Programme
• http://www.dcc.ac.uk/
Overview
• Data curation and digital preservation issues
• Draw on research and scholarship perspectives
• Data / information flows and the “business process”
• UK Digital Curation Centre activities
“maintaining and adding value to a trusted body of digital information for current and future use”
Reference datasets as infrastructure? Data-centric 2020 vision
(Very simple) Product Research Cycle & Data Curation
• Formulate ideas / hypothesis, test, experiment, observe, design: data creation, collection & capture
• Data management, storage & validation: description, deposit, self-archiving, preservation, certification
• Adding value: data linking, annotation, visualisation, simulation
• Scholarly communications & business transactions: data disclosure, publication, citation, discovery, re-use
• (New) knowledge extraction: data mining, modelling, analysis, synthesis
• Data processing (shown between the stages of the cycle)
• e-Infrastructure, open ?? access, collaboration
RepoMMan: Repository Metadata and Management (Hull) using WS-BPEL
• Are your engineering workflows identified and described?
Workflow: e-Scientist desktop? (Slide: Carole Goble)
“JISC Vision”: a global landscape of federated repositories
Diagram: repositories (heterogeneous metadata formats, content formats, identifiers, packaging standards) → fusion layer (“repository federator”) → portals (homogeneous metadata formats, content formats, identifiers, packaging standards)
• e-Framework and Information Environment context
• Define common + domain-specific + repository “services”
• Interoperability based on open standards, software tools
• Multi-disciplinary, cross-sectoral
• National, institutional
• Different platforms
• Many format types: data, eprints, images, geospatial
From Andy Powell: http://www.ukoln.ac.uk/distributed-systems/jisc-ie/arch/presentations/jiie-jcs-2005/
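The fusion layer in the diagram can be illustrated with a minimal sketch, assuming each repository exposes records under its own field names. The repository names, field names and the `CommonRecord` shape are hypothetical and are not taken from the JISC architecture itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonRecord:
    """Homogeneous record shape handed to portals (hypothetical)."""
    identifier: str
    title: str
    content_format: str


# Hypothetical mapping from the common field names to each repository's own.
FIELD_MAPS = {
    "eprints_repo": {"identifier": "eprintid", "title": "title", "content_format": "mime"},
    "data_repo": {"identifier": "dataset_id", "title": "name", "content_format": "format"},
}


def normalise(source: str, raw: dict) -> CommonRecord:
    """Translate one heterogeneous record into the common shape."""
    mapping = FIELD_MAPS[source]
    return CommonRecord(**{common: str(raw[local]) for common, local in mapping.items()})


def federate(batches: dict) -> list:
    """Merge records from several repositories into one homogeneous list."""
    return [normalise(source, raw) for source, records in batches.items() for raw in records]
```

A real federator would also have to reconcile identifiers and packaging formats, which this sketch leaves out.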
PerX: Pilot Engineering Repository Xsearch http://www.engineering.ac.uk/
Interoperability? STEP (ISO 10303)
Repositories and OAIS Reference Model
• An OAIS is “an archive consisting of an organisation of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community”
• A Designated Community is “an identified group of potential consumers who should be able to understand a particular set of information”
Assuring permanence: digital preservation
• Trusted Digital Repository Audit Checklist for Certification (draft), Research Libraries Group-NARA Task Force, 2005. Defined criteria:
  • Organisation
  • Functions, processes & procedures
  • Designated community & usability
  • Technologies & technical infrastructure
• Revised Checklist based on feedback and pilot audits (KB, BADC)
• Self-certification: DINI-Zertifikat requirements & recommendations:
  • Server policy / guidelines
  • Author support
  • Legal issues
  • Authenticity and integrity
  • Cataloguing
  • Access statistics
  • Long-term sustainability
• Has your repository / PLM been audited?
Interdisciplinary discovery
• Validation, publication & discovery of data models & schema
• Harmonisation and normalisation of metadata and semantics
• Packaging standards: METS, MPEG-21 DIDL
• Formal high-level and domain ontologies
• ePrints DC Application Profile http://www.ukoln.ac.uk/repositories/digirep/index/Eprints_Application_Profile
• eBank Application Profile (crystallography data) http://www.ukoln.ac.uk/projects/ebank-uk/schemas/
• What data models and metadata schema are in place?
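As a small illustration of a shared metadata schema, the sketch below serialises a record using the simple Dublin Core element namespace. It is not the ePrints or eBank application profile: the `record` wrapper element and the field values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 simple Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dc_record(fields: dict) -> str:
    """Serialise {element name: [values]} as simple Dublin Core XML."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # hypothetical wrapper element
    for name, values in fields.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = dc_record({
    "title": ["Crystal structure dataset"],   # hypothetical values
    "creator": ["A. Researcher", "B. Researcher"],
})
```

An application profile goes further than this by constraining which elements are required and which vocabularies their values come from.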
Persistent identifiers for data citation
• How will they be used? We need use cases: depositor, author, service provider, researcher, publisher?
• Schemes: DOI, Handle, ARK, PURL
• Global identification: express as http URIs
• Data citation (human and machine-actionable)
• “Publication & citation of scientific primary data” project: National Library for Science & Technology (TIB), University of Hanover, Germany. STD-DOI Project: DOI registry for datasets http://www.std-doi.de
• Is there a data citation policy?
• What persistent identifiers have been assigned to your data?
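Expressing identifiers as http URIs can be sketched as a prefix-to-resolver lookup. The resolver hosts below (doi.org, hdl.handle.net, n2t.net) and the use of https are assumptions not stated on the slide, and the ARK and Handle values in the usage lines are made-up examples.

```python
# Assumed public resolvers for each scheme (not named in the slides).
RESOLVERS = {
    "doi": "https://doi.org/",
    "hdl": "https://hdl.handle.net/",
    "ark": "https://n2t.net/ark:",
}


def to_http_uri(identifier: str) -> str:
    """Return an actionable URI for a DOI, Handle, ARK or PURL."""
    if identifier.startswith(("http://", "https://")):
        return identifier  # PURLs are already http URIs
    scheme, sep, value = identifier.partition(":")
    if not sep or scheme.lower() not in RESOLVERS:
        raise ValueError(f"unrecognised identifier scheme: {identifier!r}")
    return RESOLVERS[scheme.lower()] + value


doi_uri = to_http_uri("doi:10.1039/b502828k")   # DOI cited on the eBank slide
ark_uri = to_http_uri("ark:/12345/example")     # hypothetical ARK
hdl_uri = to_http_uri("hdl:1234/5678")          # hypothetical Handle
```

The sketch does not percent-encode unusual characters in the identifier, which a production resolver link would need to do.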
Discovering data: eBank Project
• Domain identifier: International Chemical Identifier (InChI) code
• Google a molecule using its InChI
• Slide from Simon Coles
Coles, S.J., Day, N.E., Murray-Rust, P., Rzepa, H.S., Zhang, Y., Org. Biomol. Chem., 2005, (10), 1832-1834. DOI: 10.1039/b502828k
Domain identifiers for engineering?
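Part of what makes the identifier searchable is that it is a plain text string built from "/"-separated layers. The sketch below splits one into its layers. It is a simplification that assumes each layer prefix occurs once and that a formula layer is present, and the ethanol string is an illustrative example, not one from the eBank slides.

```python
def parse_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len(prefix):].split("/")
    parsed = {"version": version, "formula": formula}
    for layer in layers:
        if layer:
            # First character names the layer, e.g. c = connectivity, h = hydrogens.
            parsed[layer[0]] = layer[1:]
    return parsed


ethanol = parse_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

Because the whole identifier is text, it can be pasted into a general-purpose search engine, which is the point of the “Google a molecule” bullet.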
Format migration challenges? CAD Program Compatibility Chart http://www.okino.com/conv/filefrmt_cad.htm
Development: Representation Information Registry Repository
• “DCC Approach to Digital Curation” based on OAIS
• Representation Information Registry Repository
• Prototype demonstrator: based on 2 key concepts to facilitate sharing of the curation effort
  • Curation Persistent Identifier (CPID)
  • Descriptive “label” (structural, semantic, other metadata)
• Development of machine-to-machine (M2M) tools and interfaces for creating, using and re-using representation information
• http://dev.dcc.ac.uk Wiki and email list
• EU CASPAR Integrated Project http://www.casparpreserves.info/pages/1/index.htm
• Task Force on the Permanent Access to the Records of Science http://tfpa.kb.nl/
Registry API
• Allows applications to talk to many different registry implementations, e.g. GDFR, PRONOM, UDDI
• GUI access, also via Web browser: http://registry.dcc.ac.uk
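The idea of one API in front of several registry implementations can be sketched as a common interface with interchangeable back ends. This is not the DCC Registry API: the class names, the `lookup` method and the identifiers are hypothetical, and the in-memory class merely stands in for real registries such as GDFR or PRONOM.

```python
from abc import ABC, abstractmethod
from typing import Optional


class FormatRegistry(ABC):
    """Common interface an application codes against (hypothetical)."""

    @abstractmethod
    def lookup(self, identifier: str) -> Optional[dict]:
        """Return the registry entry for an identifier, or None."""


class InMemoryRegistry(FormatRegistry):
    """Stand-in for a real registry implementation."""

    def __init__(self, name: str, entries: dict):
        self.name = name
        self._entries = entries

    def lookup(self, identifier: str) -> Optional[dict]:
        entry = self._entries.get(identifier)
        return None if entry is None else {"registry": self.name, **entry}


def first_match(registries: list, identifier: str) -> Optional[dict]:
    """Ask each registry in turn and return the first entry found."""
    for registry in registries:
        found = registry.lookup(identifier)
        if found is not None:
            return found
    return None


registry_a = InMemoryRegistry("registry-a", {"example:format-a": {"label": "Format A"}})
registry_b = InMemoryRegistry("registry-b", {"example:format-b": {"label": "Format B"}})
result = first_match([registry_a, registry_b], "example:format-b")
```

The application only ever sees `FormatRegistry`, so a new registry can be added by writing one more adapter class.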
Adding value through annotation
• Research at the University of Edinburgh
• Scientific databases: annotation scoping report
• New annotation model + prototype MONDRIAN
• Intuitive visual interface iMONDRIAN
• Annotate sets of values
• Support for querying annotations
Knowledge extraction
• Mining (data, text, structures)
• Modelling (economic, climate, mathematical, biological…)
• Analysis (statistical, lexical, gene…)
NaCTeM http://www.nactem.ac.uk/ Emerging tools: TerMine, GENIA, Cafetiere
OTMI: Open Text Mining Interface (Nature, 23 March 2006)
Supporting the community: Services
• HELPDESK@dcc.ac.uk: legal, technical guidance
• Curation Manual (45 chapters planned):
  • Metadata (umbrella)
  • Open Source
  • Archival metadata
  • Preservation metadata
  • Selection & appraisal
  • Curating emails
• Briefing Papers:
  • Curating emails
  • Digital repositories
  • Geospatial data
  • Data protection
  • eScience data
• Case studies
Supporting the community: Outreach & Services
• Workshops:
  • Geospatial data, NeSC, 27 October
  • OAIS 5 year Review, October
  • Audit & Certification Forum, October
  • Records Management, Liverpool, 30 November
  • Curation & Preservation Training, December
  • 2007: Preservation of journals (tbc)
  • 2007: Legal environment (tbc)
  • 2007: Preparing for audit (tbc)
• Information Days: British Library, Liverpool, UCL
• 2nd International DCC Conference, 21-22 November, Glasgow
  • Keynotes: Hans F. Hoffmann (CERN), Clifford Lynch (CNI)
DCC Phase 2: 2007-2010
• Working more closely with data centres, e-Science Programmes and Research Councils
• SCARP Project: disciplinary approach
• JISC Digital Repository Programme collaboration
• RepInfo Registry service migration
• Define self-assessment procedures and tools
• Collaborate with CASPAR, DPE and PLANETS (EU-funded Digital Preservation Projects)
• Workshop Programme, International Conference 2007
Thank you. Questions? e.lyon@ukoln.ac.uk
Join the DCC Associates Network at www.dcc.ac.uk