This presentation explores attitudes and aspirations surrounding scientific repositories, covering curation and preservation issues, research data, metadata, ownership and support, and the challenges of handling large datasets.
Attitudes and aspirations in a diverse world
The Project StORe perspective on scientific repositories
Graham Pryor – 22nd November 2006
Digital Data Curation in Practice - 2nd International Digital Curation Conference, Glasgow 21-22 November 2006
StORe Guide
What’s in StORe?
• Curation and preservation issues
• Attitudes and aspirations
• Research data
• Repositories
• Metadata
• Ownership and support
• Too huge to handle?
2nd International Digital Curation Conference, Glasgow
Digital Data Curation
Definition: the actions needed to maintain digital data and other digital materials over their entire life-cycle and indefinitely for current and future generations of users. These actions include not only the processes of digital archiving and preservation but also all of the processes that are essential to good data creation and management, as well as the capacity to add value to data to generate new sources of information and knowledge.
What’s in StORe? – Aims 1
• Attach new value to the intellectual products of academic research by providing two-way links between source and output repositories
What’s in StORe? – Aims 2
• Surveys to identify workflows and norms, problems and desirable enhancements to source/output repositories
• A generic technical specification for functional enhancements to source and output repositories
• Pilot middleware that demonstrates a bi-directional link
• Independent evaluations of the pilot middleware and recommendations for future development as a generic platform for linking repositories
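The bi-directional link named in the aims can be pictured with a small sketch. This is only an illustration of the idea, not the StORe middleware: the `LinkRegistry` class, its method names, and the identifiers `dataset:example-001` and `paper:example-001` are all hypothetical. It assumes each repository item has a stable identifier and that a single registry records the link in both directions at once.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RepositoryRecord:
    """An item held in a source (data) or output (publication) repository."""
    identifier: str
    kind: str  # "source" or "output"
    links: set[str] = field(default_factory=set)


class LinkRegistry:
    """Hypothetical registry that keeps source/output links two-way."""

    def __init__(self) -> None:
        self._records: dict[str, RepositoryRecord] = {}

    def register(self, identifier: str, kind: str) -> RepositoryRecord:
        if kind not in ("source", "output"):
            raise ValueError(f"unknown repository kind: {kind}")
        record = RepositoryRecord(identifier, kind)
        self._records[identifier] = record
        return record

    def link(self, source_id: str, output_id: str) -> None:
        """Record the link on both items so it can be followed either way."""
        source = self._records[source_id]
        output = self._records[output_id]
        if source.kind != "source" or output.kind != "output":
            raise ValueError("link expects a source item and an output item")
        source.links.add(output_id)
        output.links.add(source_id)

    def linked(self, identifier: str) -> list[str]:
        """Return the identifiers linked to the given item, sorted."""
        return sorted(self._records[identifier].links)


# Illustrative identifiers only (assumptions, not real repository entries).
registry = LinkRegistry()
registry.register("dataset:example-001", "source")
registry.register("paper:example-001", "output")
registry.link("dataset:example-001", "paper:example-001")

print(registry.linked("dataset:example-001"))  # ['paper:example-001']
print(registry.linked("paper:example-001"))    # ['dataset:example-001']
```

Writing both directions in one `link` call is the design choice that matters here: a reader starting from the publication can reach the data, and a reader starting from the data can reach the publication.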
What’s in StORe? – Survey
• Dual deposit of data and publications already an accepted concept
• International strategies for data deposit and data preservation
• Genuine desire to contribute to the wealth of knowledge
• Awareness of the critical need to assign and maintain appropriate metadata
What’s in StORe? – Survey
• Cultural and organisational barriers to deposit of research data in repositories
• Inherent culture of self-sufficiency in the generation and organisation of data
• Limited inclination towards voluntary deposit in open access source repositories
• Institutional output repositories not on the agenda of most researchers
Research Data
Features of source data
• Often large and complex
• Can be impenetrable without local tools
• May seem ambiguous to project outsiders
• Are frequently held on standalone equipment
• Commonly comprise several data formats
From the StORe survey
• Physics: raw data sets as large as petabytes (10^15 bytes) may be generated or analysed using software written within the project
• Biosciences: need to describe how data were produced, the laboratory conditions and methodology
• 70% of bioscience source data are not networked
• Chemists: data stored in numerous sub-folders (spectra, images, etc.) describing one process
Research Data
Chemistry data sets: links between complex clusters
“..it would have to be everything associated with that compound. There is no point having an NMR without a picture of what it is. Then it’s useful to have a synthesis scenario and say oh that could fit with that but I want proof and then that really is a paper. You know you can waste a lot of time trying to follow what people have done before that isn’t properly published and never have worth. It’s not always, but is it worth the risk of wasting too much of your time?”
Research Data
Physics data types
[Chart: physics data types. Labels: Telemetry; Video; Topographical data; Remote sensing; Geophysical data; Synthetic data; Other; Raw data; Photographs; Statistical data; Drawings, Plots; Images; Databases; Text-based files; Spectra; Derived data; Instrument data. Values not recoverable.]
Research Data
Archaeology file types
[Chart: archaeology file types, axis scale 0 to 30. Categories: Other; CAD/GIS; Plain text (.txt); Rich text files (.rtf); Statistical software; Tables/catalogues; Spreadsheets (Excel/.xls); Image files (.jpg, .tif, .bmp, .gif); Portable document format (.pdf); Database files (Access, MySQL); Word processed files (Word/.doc); Extensible mark-up language (XML); Hypertext mark-up language (HTML). Values not recoverable.]
Repositories
• Source repository development is discipline-led
• Large number of established services
 - we suggested: Archaeology Data Service, Brookhaven National Laboratories, CERN, GenBank, National Crystallography Service, NERC Data Centres, Protein Structures Database, SuperCOSMOS, UK Data Archive, UniProt
 - to which were added 99 others
• Some international strategies/mandates/dual deposit
 - Astronomy (Virtual Observatory)
 - Biosciences (sequence data)
 - Chemistry (Crystallographic Data Centre)
Repositories
• Low awareness of repositories
• Low volume of repository use
• 65% of the chemists surveyed had not used a repository and were not familiar with the idea of open access repositories
• Many social scientists did not associate repositories with their research agenda
• Repositories are only one of many potential data sources/archives used by researchers
Repositories
• Low awareness of repositories
• Low volume of repository use
• Low rate of source data deposit
• Output repositories
 - prefer publisher over institutional
 - prefer Google-type searching
Metadata
• All disciplines: an awareness of the importance of appropriate metadata
• Improvements to source repositories? Better metadata ranked highest
• Metadata assignment considered challenging: intellectually and in the demands on one’s time
Yet…
• Evidence of lack of standard structures
• Metadata assignment often almost an afterthought
• One third of StORe respondents believed no metadata were being assigned
Metadata assignment
[Chart: who assigns metadata, by discipline (Archaeology, Astronomy, Biosciences), axis scale 0 to 35. Categories: Not known; Library staff; Individual researchers; Research support staff; No formal metadata used; Research team (collective); Repository admin./automatic. Values not recoverable.]
Metadata
• Where researchers are familiar with metadata they possess an in-depth knowledge of its use, applications and functions
• The assignment of metadata automatically (or by a process that relieves the depositor of doing it) is preferred
• Quote from theoretical chemistry interview: “Well, there’s lots of different types of metadata. There is metadata for discovery, there is metadata for semantics, there is metadata for intellectual property and so on and so forth. They are all important. If I find some piece of information and it’s not on open access then I can’t use it. If I find some piece of metadata and it’s in a language that my machine does not understand and there is no metadata, then it is uninterpretable, I cannot use it. If I am particularly concerned about the quality of data I need provenance metadata. So there are different needs for different people...”
Metadata
• Need for improved and universal standards acknowledged
• A clear link identified between the condition of metadata used and the level of support from information specialists
• Recognition of the need for different metadata for different phases of research lifecycle (raw, processed, published data and beyond) and to assist cross-discipline interpretation
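As a rough illustration of the points above, the sketch below models a metadata record that carries the kinds of metadata named in the theoretical chemistry interview (discovery, semantics, intellectual property, provenance) together with a lifecycle phase (raw, processed, published). It is an assumption-laden sketch, not a StORe specification: the `DatasetMetadata` class, the `FACETS` tuple, the field names and the example values are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    """Research lifecycle phases named on the slide."""
    RAW = "raw"
    PROCESSED = "processed"
    PUBLISHED = "published"


# Kinds of metadata mentioned in the interview quote (names are our own).
FACETS = ("discovery", "semantics", "intellectual_property", "provenance")


@dataclass
class DatasetMetadata:
    """Hypothetical record grouping metadata by the purpose it serves."""
    dataset_id: str
    phase: Phase
    facets: dict[str, dict[str, str]] = field(default_factory=dict)

    def missing_facets(self) -> list[str]:
        """List the facets that have no entries yet, in FACETS order."""
        return [name for name in FACETS if not self.facets.get(name)]


# Example values are invented for illustration.
record = DatasetMetadata(
    dataset_id="example-spectrum-01",
    phase=Phase.RAW,
    facets={
        "discovery": {"title": "Example NMR spectrum"},
        "provenance": {"instrument": "hypothetical spectrometer"},
    },
)

print(record.missing_facets())  # ['semantics', 'intellectual_property']
```

A check like `missing_facets` is one way a repository might relieve depositors of some effort: it reports what is still needed without requiring every facet at the raw-data phase.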
Ownership & Support
• Working culture: self-reliance and a constant pressure to deliver
• Qualified enthusiasm for deposit in source repositories: producer or consumer
• Anxiety over predatory access and IPR
• Storage methods: protectionism?
• Specialist support: less a case of being unavailable than of not being sought
Too Huge to Handle?
• The [Chemistry] interviewees commented that there should be a wider organisational/institutional requirement that supports and manages the repositories, whether source, output or institutional.
• “…sustainability depends on a business model. And it’s a major problem that confronts everybody at the moment in aggregating data, whether it be raw data, processed data, metadata, primary publications, abstracts, things like that”
Too Huge to Handle?
• Embedding of data management expertise within domains
• Expensive?
• Interventionist?
• Too large and too difficult?
• “Poor investment decisions can have major implications on how much information can be preserved, and how effectively” – Chris Rusbridge, http://www.ariadne.ac.uk/issue46/
Too Huge to Handle?
• MRC £1 million data sharing and preservation initiative
 - http://www.mrc.ac.uk/strategy-data_sharing.htm
 - initial focus on 4 to 6 unique datasets of long term value
 - engage community support; longer term business plan
• Virtual Observatory: “Exploit information management and curation experience in the university libraries and build on long-term institutional commitments to preservation” – Bob Hanisch, http://www.arl.org/sparc/meetings/ala06/HanischPPT.pdf
END
http://jiscstore.jot.com/SurveyPhase