Re-Use of Scientific Data Collections
Peter Wittenburg
The Language Archive - Max Planck Institute
CLARIN European Research Infrastructure
Nijmegen, The Netherlands
Background
• digital repository building for about 15 years
  • experimental and observational data about languages and mind
  • storing cultural heritage (endangered languages and cultures)
• current state of the repository
  • about 50 terabytes of data (audio, video, texts, EEG, fMRI, etc.)
  • about 1 million digital objects
  • all described and organized with IMDI metadata
  • all associated with PIDs
  • most objects checked for format correctness
• open deposit/archiving service for the community
• regular quality assessment (Data Seal of Approval)
LAT Technology
• REPLIX data replication; with DEISA testing iRODS
• 6 full copies and worldwide distribution of sub-collections
• "complete" set of software components (creation -> utilization)
CLARIN Research Infrastructure
• a network of strong centers as the backbone of the infrastructure
• the centers are building a trust federation among themselves
• implementing a 3-tier structure for data sustainability
Nature of Research Collections
• research collections are dynamic: they change continuously
  (transformations, extensions, modifications/versions, relations, etc.)
• collections are created with a certain research purpose in mind
• but users re-combine objects in unpredicted ways
  (virtual collections crossing institutional boundaries)
• collections include a variety of resource types
• increasingly complex external relationships
  (raw data -> derived data -> annotations -> extractions ...)
• access patterns can hardly be predicted
• long-term access and interpretability are an issue
  (cultural and scientific memory)
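
The derivation chain named on this slide (raw data -> derived data -> annotations -> extractions) can be pictured as a directed graph. The following is only a minimal sketch of that idea, not the archive's actual data model: the object names are hypothetical placeholders (a real collection would use PIDs), and `build_graph` and `descendants` are illustrative helpers.

```python
from collections import defaultdict

# Hypothetical object names; a real collection would refer to objects by PID.
DERIVED_FROM = [
    ("recording.wav", "transcript.xml"),   # raw data -> derived data
    ("transcript.xml", "annotation.eaf"),  # derived data -> annotation
    ("annotation.eaf", "wordlist.csv"),    # annotation -> extraction
    ("recording.wav", "pitch.csv"),        # raw data -> a second derived object
]


def build_graph(edges):
    """Map each object to the list of objects derived directly from it."""
    graph = defaultdict(list)
    for source, target in edges:
        graph[source].append(target)
    return graph


def descendants(graph, start):
    """Return, sorted, every object derived directly or indirectly from start."""
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return sorted(seen)


graph = build_graph(DERIVED_FROM)
affected = descendants(graph, "recording.wav")
print(affected)
```

A traversal like this answers a typical curation question: if a raw recording is replaced by a new version, which derived objects may need to be checked.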
Organization of Collections
• high-quality metadata are crucial for management and access
• high-quality metadata allow different trees to be generated from any useful criteria
• depositors and managers need one canonical tree
  (for management operations and rights definitions)
• users want to create their own virtual organizations
• users want several types of organizations and sub-collections
• users want to express various types of relations -> graphs
• metadata are crucial for long-term interpretation, therefore they are represented in a schema-based format
• various visualizations are required
  (catalogue browsing, searching, faceted search, GIS)
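
The point that good metadata let different trees be generated from the same objects can be sketched as follows. This is an illustration under assumptions, not the IMDI tooling: the records and the field names (`language`, `type`, `year`) are made up, and `build_tree` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up records; the field names are illustrative, not IMDI elements.
RECORDS = [
    {"id": "obj1", "language": "Dutch", "type": "audio", "year": 2005},
    {"id": "obj2", "language": "Dutch", "type": "video", "year": 2007},
    {"id": "obj3", "language": "English", "type": "audio", "year": 2005},
]


def build_tree(records, criteria):
    """Nest record ids under the values of each criterion, in the given order."""
    if not criteria:
        return [record["id"] for record in records]
    first, rest = criteria[0], criteria[1:]
    groups = defaultdict(list)
    for record in records:
        groups[record[first]].append(record)
    return {value: build_tree(members, rest) for value, members in groups.items()}


# The same objects, organized in two different ways.
by_language = build_tree(RECORDS, ["language", "type"])
by_year = build_tree(RECORDS, ["year"])
print(by_language)
print(by_year)
```

The canonical tree needed by depositors and managers would be one fixed choice of criteria, while each user view is just another call with a different criteria list.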
Building of Collections
• train your own "repository" experts or collaborate with existing ones
• define the required descriptive categories
  • re-use existing ones and register new ones in open registries (ISO 12620 - ISOcat; www.isocat.org)
  • map new ones to existing ones where possible
• make use of a flexible component framework in which components/profiles refer to registered concepts via PIDs
• re-use or create important vocabularies
• check the quality of all objects and curate as soon as possible
• allow metadata harvesting via OAI-PMH
• provide efficient metadata organization/description tools
  (design of ARBIL after 10 years of experience)
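
Metadata harvesting via OAI-PMH comes down to HTTP requests with a `verb` parameter and XML responses. The sketch below shows how a harvester might build a `ListRecords` request and read titles from a response. The endpoint URL is a hypothetical placeholder, and the sample response is hand-written and shortened (a real response also carries elements such as `responseDate`, `request` and `datestamp`), so treat it as an illustration rather than a complete client.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL; set_spec optionally limits the harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hand-written, shortened response used here instead of a network call.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample recording</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def titles(xml_text):
    """Return (identifier, title) pairs for every record in a response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.findall("./oai:ListRecords/oai:record", NS):
        identifier = record.findtext("./oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


print(list_records_url(BASE_URL))
print(titles(SAMPLE_RESPONSE))
```

A production harvester would also have to follow `resumptionToken` elements to page through large result sets, which this sketch leaves out.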
Access to Existing Collections
• cater for different visualizations of metadata
  (GIS, faceted browser, simple/complex search, etc.; see VLO: www.clarin.eu/vlo)
• create "community portals", but note that costs are high
  (nice web pages with embedded metadata queries)
• cater for machine usage, i.e. support APIs
• provide Dublin Core semantics for the occasional user
• why not social tagging? But keep it separate
• support virtual communities
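
Offering Dublin Core semantics to the occasional user usually means projecting a richer description onto the small Dublin Core element set. The sketch below illustrates that projection under assumptions: the source field names and the mapping table are hypothetical and are not taken from IMDI or from any CLARIN profile; only the target names (`title`, `creator`, `date`, `language`, `format`) are Dublin Core elements.

```python
# Hypothetical mapping from rich metadata fields to Dublin Core elements.
TO_DUBLIN_CORE = {
    "session_name": "title",
    "researcher": "creator",
    "recording_date": "date",
    "content_language": "language",
    "media_format": "format",
}


def to_dublin_core(record):
    """Project a rich record onto Dublin Core; unmapped fields are dropped."""
    dc = {}
    for field, value in record.items():
        element = TO_DUBLIN_CORE.get(field)
        if element is not None:
            # Dublin Core elements are repeatable, so values are kept in lists.
            dc.setdefault(element, []).append(value)
    return dc


rich = {
    "session_name": "Interview 12",
    "researcher": "A. Researcher",
    "content_language": "nl",
    "elicitation_method": "picture naming",  # no Dublin Core counterpart here
}
simple = to_dublin_core(rich)
print(simple)
```

The projection is lossy by design: the detailed record stays available for research use, while the simplified view serves casual discovery.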
Level of Descriptions for new Comm.
• recommendation: create "atomic objects" to foster re-purposing
• the level of description depends on the purpose
• if the data are to be used for research, high-quality metadata are required
• the choice of terminology is important
• what to do with legacy data? Get funds for curation
Pillars for long-term availability
• in space research (ESA), 40% of accesses are to old data
• in the humanities the share is of course much higher
• archiving = research access requirements
• quality of organization and coherence of collections
  • reduces maintenance and conversion costs
  • makes usage easy
• of course do some PR and make the collections attractive
• be present in portals (Virtual Language Observatory)
• of course provide easy access to content
• need to speak about "maintaining cultural/scientific memory"
• need to speak about grand challenges and simulation
  • both are data driven
• how can we maintain stable life?
Sustainability of Collections
• technical aspect
  • implement a 3-tier structure
    (user communities - community centers - data centers)
  • prevent data copying (operations on data, trust, etc.)
  • organize preservation cost-effectively
    (MPI: 4 * 50 TB at data centers < 5,000 €; costs decrease over time; economy of scale; green computing)
  • fund acquisition, curation and access separately, since each can be expensive
  • immediate curation as a basic principle
  • late curation is more expensive (see Beagrie)
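
The MPI figure on this slide (4 copies of 50 TB at data centers for under 5000 €) implies an upper bound on the cost per stored terabyte. The arithmetic is worked out below as a small sketch; the slide does not state the billing period, so the result is an upper bound per terabyte for whatever period the 5000 € covers.

```python
# Figures taken from the slide; the period the cost refers to is not stated.
copies = 4
size_tb = 50
total_cost_eur = 5000  # upper bound ("< 5000 EUR")

stored_tb = copies * size_tb                # total volume held at the data centers
cost_per_tb_eur = total_cost_eur / stored_tb  # upper bound per stored terabyte

print(f"{stored_tb} TB stored, under {cost_per_tb_eur:.0f} EUR per TB")
```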
Sustainability of Collections
• responsibility aspect
  • funding and research organizations are responsible
  • need to reserve some percentage for continuous access
  • need a process for taking decisions
    (experimental data are often obsolete after N years)
  • need quality assessment procedures