Digital Curation 101 “Taster” Joy Davidson, Associate Director, DCC: british.editor@erpanet.org Sarah Higgins, Standards Advisor, DCC: Sarah.Higgins@ed.ac.uk Funded by: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
DC 101 aims and objectives Data management and curation are becoming increasingly integral to successful research or digitisation bids. Using the context of beginning a new research bid, this short course aims to introduce participants to the DCC Curation Lifecycle Model as a means of contextualising the range and nature of roles and activities required to maintain access to data over time. While the DCC Curation Lifecycle Model is sequential, it is flexible and allows users to start at any point in the model, or start to address issues which have had lower priority, depending on their current needs. Ultimately, tools and approaches will evolve over time, but if participants understand the bigger picture they will be in a better position to make critical decisions that best reflect their individual needs. The course will introduce participants to some of the tools and approaches and provide them with pointers to further information and support. The course is aimed at researchers, content creators and those who support them. We hope that participants leave the course equipped to explain why data curation is important and what roles they have to play in the process.
What is curation? Data have importance as the evidential base for scholarly conclusions, and for the validation of those conclusions, a basic tenet of which is reproducibility. Curation is the active management and appraisal of data over the lifecycle of scholarly and scientific interest; it is the key to reproducibility and reuse. This adds value through the provision of context and linkage: placing emphasis on 'publishing' data in ways that ease reuse, with implications for metadata and interoperability. Data curation is part of good research and content management practice.
Why Curate? Curation brings immediate and longer-term benefits: • Access to reliable, working data – both for the creator and users • Compliance with funding body and research council mandates on data sharing, management and access • Independent validation of research findings • Reliable lab and field electronic notebooks through trustworthy capture • Large amounts of data can be developed and analysed across different locations by maintaining consistency in working practices and interpretations • Relationship management between different versions of dynamic or evolving datasets is easier • Facilitated linkage with related research and between primary, secondary and tertiary data • Knowledge and data originating from short-term research projects do not become obsolete or inaccessible when funding expires • Innovative combining of data sets is possible, e.g. combined historic biodiversity data and GIS data can be used to investigate trends in ecosystem development.
Lifecycle approach to curation • digital materials are fragile and susceptible to change from technological advances from creation onwards • activities (or lack of) at each lifecycle stage influence ability to manage and preserve materials in subsequent stages • reliable re-use of digital materials is only possible if materials are curated in such a way that their authenticity and integrity are retained • requires significant input and buy-in from the range of stakeholders – creators, curators, IT staff, management • helps maximise initial investment made in creating or gathering data • supports verification of provenance • facilitates continuity of service From: Pennock, Maureen, Digital Curation: A Life-Cycle Approach to Managing and Preserving Usable Digital Information (2007)
The DCC Curation Lifecycle Model • Provides a graphical, high-level overview of the stages required for successful curation and preservation of data. • It can be used to plan activities within an organisation or consortium to ensure all necessary stages are undertaken, each in the correct sequence. • Full Lifecycle Actions • Sequential Actions • Occasional Actions • http://www.dcc.ac.uk/lifecycle-model/
Researchers and content creators tend to focus on: • conceptualise • create or receive • ingest • store • access, use and reuse • data • description • community watch and participation
Researchers and content creators tend to focus less on: • appraise and select • dispose • preservation action • transform • representation information • preservation planning • curate and preserve • migrate • reappraise
Conceptualise “Conceive and plan the creation of data, including capture method and storage options.” • Researchers: • define a research question • begin to design the experiment • seek funding • conceive and plan the creation of data • consider capture methods and storage options • identify research collaborators • identify potential subjects • Roles: researcher, funding bodies, publishers, IT department, ethics panel • Plan with digital curation in mind! • Decisions made at the Conceptualise stage impact on every other stage of the lifecycle.
Specific issues to consider for the Conceptualise stage: • Research design and workflows – what do you want to do? • What storage needs do you anticipate? Does your institution have the capacity for this? Will you keep raw or derived data or both? • Will you make use of any existing data? Will you need to obtain rights to use it? • Do you want your data to interoperate with other datasets? If so, how will you ensure that this is possible? • What are the funder’s requirements regarding curation and preservation? Will they pay for curation activity? • Will the research involve any legal restrictions on the use and access to the data? • Are there any data protection issues that will require data cleaning before the data can be accessed and used? • Do you require ethical approval from your institution or funder? Will this have any impact on the data’s potential use and reuse? • Do you need to calibrate data capture devices? Will this need to occur at multiple sites? • Will the data be released under Creative Commons or Science Commons licenses? • Are there likely to be any embargoes on data publication?
Create or Receive • “Create data including administrative, descriptive, structural and technical metadata. Preservation metadata may also be added at the time of creation”. OR “Receive data, in accordance with documented collecting policies, from data creators, other archives, repositories or data centres, and if required assign appropriate metadata.” • Roles: researchers, information specialists, technical support • Ensure data are curation ready! • Be careful - data may be irreplaceable • Capture context for long-term reuse and comprehensibility. • Clearly identify IPR at an early stage. This can become murky later in the process.
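The four metadata types named in the Create stage can be made concrete with a small record structure. The following is a minimal sketch only: the class name `DatasetRecord`, its field names and the sample values are all hypothetical and do not follow any particular metadata standard. It simply shows how grouping the fields makes gaps (here, an unrecorded rights holder) easy to spot at creation time.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical record grouping the metadata types named above."""
    # Descriptive metadata: what the data is about
    title: str
    creator: str
    description: str
    # Administrative metadata: rights and ownership (IPR)
    rights_holder: str
    licence: str
    # Technical metadata: how the data is encoded
    file_format: str
    size_bytes: int
    # Structural metadata: which files make up the dataset
    files: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of fields that were left as empty strings."""
        return [name for name, value in asdict(self).items() if value == ""]


# Illustrative values only; the rights holder is deliberately left blank.
record = DatasetRecord(
    title="Soil moisture survey",
    creator="Example Researcher",
    description="Hourly soil moisture readings from two field sites.",
    rights_holder="",
    licence="CC BY-NC-SA",
    file_format="text/csv",
    size_bytes=52480,
    files=["site_a.csv", "site_b.csv"],
)
print(record.missing_fields())
```

In practice a project would map these fields onto whichever content and structure standards its community has agreed on.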
Specific issues to consider for the Create or Receive stage: • What do you want people to be able to do with the data you are generating? • What do you not want people to be able to do with the data? • Are there any variations between data capture tools located at different sites? How will you ensure that these are recorded/addressed? Consistency of testing and data acquisition is crucial. • Will you be adhering to any content, syntax, and structure standards? Are these easily available for use by everyone on the project team? • Who will have rights over any collaboratively generated data (e.g., databases)? • Who will record contextual metadata and how? • What level of data quality do you need to achieve? How will you ensure this level is achieved across all partners? • Will you make use of any ontologies to facilitate data integration? • Will you make use of any data collection policies? • How will you handle file naming and version control? • Do you have access to training and support for any/all of the above?
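One way to approach the file naming and version control question is to agree a single pattern and generate names from it rather than typing them by hand. The sketch below assumes a hypothetical convention of `project_site_YYYYMMDD_vNN.ext`; the pattern, the function names and the sample values are illustrative, not a DCC recommendation.

```python
import re
from datetime import date

# Hypothetical convention: project_site_YYYYMMDD_vNN.ext, all lower case.
PATTERN = re.compile(r"^[a-z0-9-]+_[a-z0-9-]+_\d{8}_v\d{2}\.[a-z0-9]+$")


def slug(text: str) -> str:
    """Lower-case the text and replace runs of other characters with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(project: str, site: str, captured: date,
                   version: int, ext: str) -> str:
    """Build a file name that follows the assumed project convention."""
    extension = ext.lower().lstrip(".")
    return f"{slug(project)}_{slug(site)}_{captured:%Y%m%d}_v{version:02d}.{extension}"


def follows_convention(name: str) -> bool:
    """Check whether an existing file name matches the convention."""
    return PATTERN.fullmatch(name) is not None


name = build_filename("Soil Survey", "Site A", date(2009, 3, 14), 2, "CSV")
print(name)                                          # soil-survey_site-a_20090314_v02.csv
print(follows_convention(name))                      # True
print(follows_convention("final data (new).csv"))    # False
```

Putting the date and a zero-padded version number in the name keeps files sorted in a predictable order across all partner sites.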
Ingest and Store “Transfer data to an archive, repository, data centre or other custodian. Adhere to documented guidance, policies or legal requirements. Store the data in a secure manner adhering to relevant standards.” Data is transferred to a curation environment such as an institutional repository or a subject-based repository. Roles: information specialists, repository managers, researchers Prepare data for long-term storage, access and continuity! Storage may be a dedicated data repository or a folder on a shared drive, but must be considered, secure and adhere to relevant standards.
Specific issues to consider for the Ingest and Store stages: • Does the data have sufficient metadata? If more is required, who will be responsible for providing it? • Will the data require additional cleaning before it can be ingested into the repository? • Will frequent access to the data be required? If so, this could affect the storage choices. • What level of responsibility does the repository indicate it will take on with regard to stewardship? • Does the repository accept your data formats? If not, will there be any normalisation processes that may occur with the deposit of non-preferred formats? • Does the repository outsource any of its activity? Could this have an impact on your data? • Does the repository have sufficient resources and policies in place? • Once ingest is complete, is there a formal acknowledgement that the transfer of custody has occurred?
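The question of whether a repository accepts your data formats can be checked before deposit with a simple triage of file extensions. This is a rough sketch under stated assumptions: the `PREFERRED` set is a made-up example, not any real repository's list, and checking by extension is only a first pass that says nothing about whether a file is valid.

```python
from pathlib import PurePath

# Hypothetical list of preferred formats; substitute the repository's own.
PREFERRED = {".csv", ".txt", ".xml", ".tif", ".pdf"}


def triage_for_deposit(filenames, preferred=PREFERRED):
    """Split file names into those in a preferred format and those to review."""
    accepted, needs_review = [], []
    for name in filenames:
        suffix = PurePath(name).suffix.lower()
        if suffix in preferred:
            accepted.append(name)
        else:
            # Non-preferred or missing extension: ask the repository whether
            # the file will be normalised on deposit.
            needs_review.append(name)
    return accepted, needs_review


accepted, needs_review = triage_for_deposit(
    ["readings.csv", "notes.DOC", "scan.TIF", "README"]
)
print(accepted)       # ['readings.csv', 'scan.TIF']
print(needs_review)   # ['notes.DOC', 'README']
```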
Access, Use and Reuse “Ensure that data is accessible to both designated users and reusers, on a day-to-day basis. This may be in the form of publicly available published information. Robust access controls and authentication procedures may be applicable.” Roles: repository managers, researchers Ensure access and continuity!
Specific issues to consider for the Access, Use and Reuse stage: • Are the intended users of the data able to access it and make use of it? i.e., are they able to use the data in the way that you originally intended them to use it? What about non-intended users? • Are there any restrictions on access and reuse? Ensure that these are communicated to the repository staff. • Researchers should work with repository managers to develop suitable access policies and terms for use of the data. • If you are planning on making your data freely accessible for reuse, have you supplied enough context to enable its reliable reuse? • Are there adequate finding aids to help locate and retrieve your data within the repository? • Is the data practically interoperable with other datasets? Does it need to be?
Appraise and Select • “Evaluate data and select for long-term curation and preservation. Adhere to documented guidance, policies or legal requirements.” • Researchers and content creators, along with information specialists, use quality checks to identify and evaluate data for long-term curation: • must be legal, appropriate, and valuable • may include data objects, metadata, and contextual information. Roles: researchers, information specialists, funding bodies • Develop robust policies! The ‘keep everything’ approach quickly becomes unviable. As the volume of curated data increases, efficient search and retrieval becomes more difficult.
Specific issues to consider for the Appraise and Select stage: • Does the data meet the data quality metrics identified by both the researchers and the archive? Who will be responsible for the final decision? Can errors in the data remain undetected at this stage, and cause problems at later stages? • Has enough contextual information been collected to make an informed decision about which data to keep? • What is the minimum you need to keep for your data findings and publications to be supported over time? • Are there any data that you, by law, are not allowed to keep? How will it be destroyed and what evidence will you be able to provide to support this if necessary? • Do you have any schedule for re-appraisal over time? • Do you have access to expertise in your project staff or at your institution to assist with selection and appraisal? • Your initial bid is a good place to start as you’ll have clearly indicated what outputs you planned to produce. • Does your selection and appraisal fit in with your funding body requirements? What do they expect you to keep and where does it need to be kept?
Preservation action “Undertake actions to ensure long-term preservation and retention of the authoritative nature of data. Preservation actions should ensure that data remains authentic, reliable and usable while maintaining its integrity. Actions include data cleaning, validation, assigning preservation metadata, assigning representation information and ensuring acceptable data structures or file formats.” Roles: information specialists, preservation practitioners, repository managers Community Watch activities can be very helpful at this stage to identify imminent risks to data.
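One common way to check that data "remains authentic" and keeps its integrity is a fixity check: record a checksum for every file, then recompute and compare later. The sketch below uses SHA-256 from Python's standard library. The function names and the in-memory manifest are assumptions for illustration; a repository would store the manifest separately from the data it describes.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file under folder (by relative path) to its checksum."""
    root = Path(folder)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(folder, manifest):
    """List files whose checksum no longer matches, or which are missing."""
    current = build_manifest(folder)
    return sorted(name for name in manifest if current.get(name) != manifest[name])


# Demonstration on a temporary folder with a made-up data file.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "readings.csv"
    sample.write_text("site,value\nA,1\n")
    manifest = build_manifest(folder)
    unchanged = changed_files(folder, manifest)   # nothing altered yet
    sample.write_text("site,value\nA,2\n")        # simulate an alteration
    altered = changed_files(folder, manifest)

print(unchanged)   # []
print(altered)     # ['readings.csv']
```

A mismatch does not say what changed or why, only that the file is no longer the one that was recorded, which is the prompt to investigate.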
Specific issues to consider for the Preservation Action stage: • Does the repository participate in community watch and ongoing preservation planning activity? • Does the repository manager know what the significant properties of your data are? If not, some preservation actions can alter the significant properties. • Are any preservation actions undertaken transparent and documented? • Does the repository have legal rights to undertake preservation actions at all? • Does the researcher require notification of any preservation actions that may affect the intended use of the data? If so, have mechanisms been set in place to facilitate this? • If certain actions are recommended, are they suitable for your data? If not, are repository staff aware of any restrictions?
Transform • “Create new data from the original, for example: by migration into a different format; or by creating a subset, by selection or query, to create newly derived results, perhaps for publication.” • New data may be generated from the original: • by format migration; • through integration with other data; • by new analyses and techniques applied within or across disciplines • Roles: researchers • New uses for curated data! Derivative data, new visualisations or enhancements feed back into the Conceptualise and Create stages of the lifecycle which then starts anew.
Specific issues to consider for the Transform stage: • Metadata aggregation to join up with other datasets: this integration of data drives new curation requirements. • Image normalisation and automated analysis create a variety of new contextual and provenance information. • If transformations or derivatives are produced (e.g. noise reduction), they must be accompanied by appropriate metadata. • Use community standards for recording provenance to safeguard against fast-changing techniques. • Does the community have sufficient support in transformation actions? • Is more value gained from producing new data or from transforming old data in new ways?
More information on all these stages is in the workshop packs! • info@dcc.ac.uk
Tools and resources to help with the DCC Curation Lifecycle stages: Conceptualise DCC Policy Pages: http://www.dcc.ac.uk/resource/curation-policies Check our handy table as a starting point to make sure you are aware of any curation related requirements for your particular funding body. If your funding body is not in our table, please get in touch with us so that we can add their policy details. DCC Helpdesk: http://www.dcc.ac.uk/helpdesk If you need further assistance at this stage, please don’t hesitate to drop us a line via our helpdesk and we’ll make every effort to support your curation activity.
Tools and resources to help with the DCC Curation Lifecycle stages: • Create or Receive • DCC DIFFUSE You might wish to consult the DCC DIFFUSE database for standards frameworks related to your area of research. We strongly encourage the contribution of standards frameworks for specific domains from our user community to help ensure that this is a community-driven resource. http://www.dcc.ac.uk/diffuse/ DCC Technology and Standards Watch papers: http://www.dcc.ac.uk/resource AHDS advice on creating digital resources: http://www.ahds.ac.uk/creating/index.htm/
Tools and resources to help with the DCC Curation Lifecycle stages: Ingest and Store AHDS recommended stable formats for different types of data: http://www.ahds.ac.uk/depositing/deposit-formats.htm Access, Use, Re-use DCC Resource Centre: http://www.dcc.ac.uk/resource DCC Helpdesk: http://www.dcc.ac.uk/helpdesk DCC Legal Blog: http://dccblawg.blogspot.com/ DCC Briefing Papers (particularly Data Protection): http://www.dcc.ac.uk/resource/briefing-papers/
Tools and resources to help with the DCC Curation Lifecycle stages: • Appraise and Select • Data Audit Framework tool: http://www.data-audit.eu/ • DCC Briefing Paper and Curation Manual chapter on Appraisal and Selection: http://www.dcc.ac.uk/resource/briefing-papers/; http://www.dcc.ac.uk/resource/curation-manual/chapters/ • US Geological Survey selection and appraisal toolkit
Tools and resources to help with the DCC Curation Lifecycle stages: Preservation Action Dr. Manfred Thaller’s Fileshooter tool: Good for assessing file format robustness using your own success metrics. http://github.com/mcarden/shotgun/blob/39761fdd190faa47e9be09901782cda6d9f4f687/shotGun.h PLANETS Testbed and Methodology: http://www.planets-project.eu/ DCC Curation Manual: http://www.dcc.ac.uk/resource/curation-manual/chapters/ Transform DCC Briefing Papers (particularly Interoperability): http://www.dcc.ac.uk/resource/briefing-papers/
CHECKLISTS: Conceptualise • Get into the habit of equating data curation with good research. • Know what your funding body expects you to do with your data and for how long. Assess your ability to be able to meet these expectations (i.e., do you need additional funding or staff?) • Determine intellectual property rights from the outset and ensure they are documented. • Identify any anticipated publication requirements (embargoes, restrictions on publishing over multiple sites) • Identify and document specific roles and responsibilities as early as possible.
CHECKLISTS: Access and Reuse Know what you want users to be able to do with your data and for how long. Pin down and communicate the significant properties of your data. Ensure that any restrictions on access and use are communicated and respected. Ensure that you provide enough context to ensure that your data can be located and used – either by the originally designated user community or new users over time. Ensure you clearly articulate any citation requirements and usage statistics that you require at the point of ingest so that repository managers know how your data should be cited if it is reused.
CHECKLISTS: Preservation Action • Know what you want people to be able to do with your data – this will impact many aspects (formats selected for long term storage, compression, etc…) • Pin down the significant properties of your data and communicate them – make sure that the people carrying out preservation actions know what they are. This might be through metadata or other means. • Don’t be afraid to be critical when reviewing ‘best practice’ and recommended approaches. They might work for the specific scenario for which they were created but not for you. Do you know the criteria used to rate things like ‘preferred’ formats? • Document preservation actions so that people know what has been done to the data over time. • Once you’ve gone through the exercise of producing a sound data management plan, you’ll be able to reuse many aspects of it – so each project data management plan will not need the same level of effort to complete.
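The checklist item on documenting preservation actions can be illustrated with a very small append-only log. This is a sketch under assumptions: the entry fields (`item`, `action`, `agent`, `outcome`, `timestamp`) and the sample values are hypothetical and do not follow any preservation metadata standard.

```python
from datetime import datetime, timezone


def record_action(log, item, action, agent, outcome, when=None):
    """Append one preservation action to the log and return the new entry."""
    entry = {
        "item": item,          # file or dataset the action was applied to
        "action": action,      # e.g. "format migration", "data cleaning"
        "agent": agent,        # person or role who carried it out
        "outcome": outcome,    # short free-text result
        "timestamp": (when or datetime.now(timezone.utc)).isoformat(),
    }
    log.append(entry)
    return entry


def history_for(log, item):
    """Return every recorded action for one item, oldest first."""
    return [entry for entry in log if entry["item"] == item]


# Illustrative entries only.
log = []
record_action(log, "scan_001.tif", "format migration", "repository staff",
              "success", when=datetime(2009, 1, 5, tzinfo=timezone.utc))
record_action(log, "readings.csv", "data cleaning", "researcher", "success")
record_action(log, "scan_001.tif", "validation", "repository staff", "success")

for entry in history_for(log, "scan_001.tif"):
    print(entry["timestamp"], entry["action"], entry["outcome"])
```

Keeping the log append-only means people can see what has been done to the data over time without relying on anyone's memory.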