
Re-Use of Scientific Data Collections

Re-Use of Scientific Data Collections. Peter Wittenburg, The Language Archive - Max Planck Institute, CLARIN European Research Infrastructure, Nijmegen, The Netherlands.


Presentation Transcript


  1. Re-Use of Scientific Data Collections
     Peter Wittenburg
     The Language Archive - Max Planck Institute
     CLARIN European Research Infrastructure
     Nijmegen, The Netherlands

  2. Background
     • digital repository building for about 15 years (experimental and observational data about languages and mind; storing cultural heritage: endangered languages/cultures)
     • current state of the repository
        • about 50 terabytes of data (audio, video, texts, EEG, fMRI, etc.)
        • about 1 million digital objects
        • all described and organized with IMDI metadata
        • all associated with PIDs
        • most objects checked for format correctness
     • open deposit/archiving service for the community
     • regular quality assessment (Data Seal of Approval)

  3. LAT Technology
     • REPLIX data replication with DEISA testing iRODS
     • 6 full copies and worldwide distribution of sub-collections
     • "complete" set of software components (creation -> utilization)

  4. CLARIN Research Infrastructure
     • a network of strong centers as the backbone of the infrastructure
     • the centers are building a trust federation among themselves
     • implementing a 3-tier structure for data sustainability

  5. Nature of Research Collections
     • research collections are dynamic: they change continuously (transformations, extensions, modifications/versions, relations, etc.)
     • collections are created with a certain research purpose in mind
     • but users re-combine objects in unpredicted ways (virtual collections crossing institutional boundaries)
     • collections include a variety of resource types
     • increasingly complex external relationships (raw data -> derived data -> annotations -> extractions ...)
     • access patterns can hardly be predicted
     • long-term access and interpretability are an issue (cultural and scientific memory)
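
The derivation chain on this slide (raw data -> derived data -> annotations -> extractions) can be recorded as explicit "derived from" links between objects. The following is a minimal sketch under stated assumptions: the object identifiers and the `DERIVED_FROM` table are hypothetical, and a real archive would presumably store PIDs and keep these relations in its metadata rather than in a Python dictionary.

```python
# Hypothetical "derived from" relations: child object -> parent object.
# The identifiers are invented for illustration; they are not real PIDs.
DERIVED_FROM = {
    "extraction-1": "annotation-1",
    "annotation-1": "derived-video-1",
    "derived-video-1": "raw-video-1",
}


def provenance_chain(obj_id, derived_from):
    """Follow 'derived from' links from obj_id back to the raw object."""
    chain = [obj_id]
    seen = {obj_id}
    while chain[-1] in derived_from:
        parent = derived_from[chain[-1]]
        if parent in seen:
            # Guard against inconsistent metadata that loops back on itself.
            raise ValueError(f"cycle in provenance at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


if __name__ == "__main__":
    print(" -> ".join(provenance_chain("extraction-1", DERIVED_FROM)))
```

Keeping the links explicit means an extraction can still be traced to its raw source after the collection has been re-combined or extended.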

  6. Organization of Collections
     • high-quality metadata are crucial for management and access
     • high-quality metadata allow different trees to be generated from useful criteria
     • depositors and managers need one canonical tree (for management operations and rights definitions)
     • users want to create their own virtual organizations
     • users want several types of organizations and sub-collections
     • users want to express various types of relations -> graphs
     • metadata are crucial for long-term interpretation
     • therefore they must be represented in a schema-based format
     • various visualizations are required (catalogue browsing, searching, faceted search, GIS)
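
As an illustration of generating different trees from the same metadata, the sketch below groups flat metadata records into a nested tree according to an ordered list of criteria. It is an assumption-laden example, not the IMDI or LAT implementation: the record fields (`language`, `type`, `year`), the identifiers, and the `_objects` leaf key are all hypothetical.

```python
def build_tree(records, criteria):
    """Group records into a nested dict, one level per criterion.

    Leaves hold object identifiers under the (hypothetical) key "_objects".
    """
    tree = {}
    for record in records:
        node = tree
        for key in criteria:
            # Records lacking a field are collected under "unknown".
            node = node.setdefault(record.get(key, "unknown"), {})
        node.setdefault("_objects", []).append(record["id"])
    return tree


# Invented sample metadata; a real repository would read these from its catalogue.
RECORDS = [
    {"id": "obj-1", "language": "lang-A", "type": "audio", "year": "2009"},
    {"id": "obj-2", "language": "lang-A", "type": "video", "year": "2010"},
    {"id": "obj-3", "language": "lang-B", "type": "audio", "year": "2009"},
]

# The same records, organized two different ways.
by_language = build_tree(RECORDS, ["language", "type"])
by_year = build_tree(RECORDS, ["year"])

if __name__ == "__main__":
    print(by_language)
    print(by_year)
```

The canonical tree used by managers and each user's virtual organization can then be different views over the same records, so no object has to be copied to appear in more than one of them.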

  7. Building of Collections
     • build up "repository" experts or collaborate with them
     • define the required descriptive categories
        • re-use existing ones and register new ones in open registries (ISO 12620 - ISOcat; www.isocat.org)
        • map new ones to existing ones where possible
     • make use of a flexible component framework in which components/profiles refer to registered concepts via PIDs
     • re-use or create important vocabularies
     • check the quality of all objects and curate as soon as possible
     • allow metadata harvesting via OAI-PMH
     • provide efficient metadata organization/description tools (ARBIL was designed after 10 years of experience)
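
To make the OAI-PMH bullet concrete, the sketch below builds a `ListRecords` request for Dublin Core (`oai_dc`) metadata and extracts identifiers and titles from a response. The verb, the `metadataPrefix` parameter, and the XML namespaces come from the OAI-PMH 2.0 protocol; the endpoint `https://example.org/oai` and the sample record are invented, and paging with resumption tokens is left out for brevity.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_xml):
    """Return (identifier, dc:title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


# Invented sample response; a harvester would fetch this from the repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:obj-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample recording</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(extract_titles(SAMPLE))
```

A production harvester would also need to handle `resumptionToken` elements and protocol error responses, which this sketch ignores.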

  8. Access to existing Collections
     • cater for different visualizations of metadata (GIS, faceted browser, simple/complex search, etc.; see VLO: www.clarin.eu/vlo)
     • create "community portals", although costs are high (nice web pages with embedded metadata queries)
     • cater for machine usage, i.e. support APIs
     • provide DublinCore semantics for the occasional user
     • why not social tagging, but keep it separate
     • support virtual communities

  9. Level of Descriptions for new Comm.
     • recommend creating "atomic objects" to foster re-purposing
     • the description level depends on the purpose
     • if the data are to be used for research, high-quality metadata are required
     • the choice of terminology is important
     • what to do with legacy data? Get funds for curation

  10. Pillars for long-term availability
     • in space research (ESA), 40% of accesses are to old data
     • in the humanities much more, of course
     • archiving = research access requirements
     • quality of organization and coherence of collections
        • reduces maintenance and conversion costs
        • makes usage easy
     • of course do some PR and make the collections attractive
     • be present in portals (Virtual Language Observatory)
     • of course provide easy access to content
     • need to speak about "maintaining cultural/scientific memory"
     • need to speak about grand challenges and simulation
        • both are data driven
     • how can we maintain stable life?

  11. Sustainability of Collections: technical aspect
     • implement a 3-tier structure (user communities - community centers - data centers)
     • prevent data copying (operations on data, trust, etc.)
     • organize preservation cost-effectively (MPI: 4 * 50 TB at data centers < 5.000 €, costs decrease over time, economy of scale, green computing)
     • separate acquisition, curation and access funding, since they can be expensive
     • immediate curation as a basic principle
     • late curation is more expensive (see Beagrie)
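
The MPI preservation figure on this slide implies an upper bound on the cost per terabyte. The short calculation below works it out, assuming that "5.000 €" uses the European thousands separator (five thousand euros) and that "4 * 50 TB" means four copies of 50 TB each. The slide does not state the period the cost covers, so none is assumed here.

```python
# Figures from the slide; their interpretation is an assumption (see lead-in).
copies = 4
size_per_copy_tb = 50
cost_upper_bound_eur = 5000  # slide says "< 5.000 €"

total_tb = copies * size_per_copy_tb
cost_per_tb_upper_bound = cost_upper_bound_eur / total_tb

if __name__ == "__main__":
    print(f"total stored: {total_tb} TB")
    print(f"cost per TB: under {cost_per_tb_upper_bound:.0f} EUR")
```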

  12. Sustainability of Collections: responsibility aspect
     • funding and research organizations are responsible
     • need to reserve some percentage of funding for continuous access
     • need a process for taking decisions (experimental data are often obsolete after N years)
     • need quality assessment procedures
