Sharing Resources in CLARIN-NL. Jan Odijk, Arjan van Hessen. LRTS Workshop, IJCNLP, Chiang Mai, Thailand, 12 Nov 2011.
Overview • Context • Documentation • Visibility • Referability • Accessibility • Long Term Preservation • Interoperability • Conclusions
Context • CLARIN-NL • National project in the Netherlands • 2009-2015 • Budget: 9.01 m euro • Funding by NWO (National Roadmap Large Scale Infrastructures) • Coordinated by Utrecht University • 24 partners (universities, royal academy institutes, independent institutes, libraries, etc.)
Context • Dutch National contribution to the Europe-wide CLARIN infrastructure • Prepared by CLARIN preparatory project (2008-2011) • Also coordinated by Utrecht University • From Dec 2011 to be coordinated by the CLARIN-ERIC • ERIC: a legal entity at the European level specifically for research infrastructures
CLARIN infrastructure (NL) • A technical research infrastructure in which a humanities researcher who works with language-related resources • Can find all data relevant for the research • Can find all tools relevant for the research • Can apply the tools to the data without any technical background or ad-hoc adaptations • Can store data resulting from the research • Can store tools resulting from the research • All of this via one portal
CLARIN infrastructure (NL) • This requires systematic sharing of resources (= data, tools, web services, …) • Systematic sharing requires • Documentation • Visibility • Referability • Accessibility • Long Term Preservation • Interoperability of resources
CLARIN-NL subprojects • Resource curation projects • Curate an existing resource • Demonstrator projects • Curate an existing tool and supply a demonstration scenario • Number of subprojects: 21 (12-14 in 2012) • Data Curation Service • Offers the service of curating existing data • Where curation includes • Documentation, Visibility, Referability, Accessibility, Long Term Preservation, Interoperability
CLARIN-NL Centres • CLARIN infrastructure is virtual and distributed • CLARIN-Centres work together to implement the infrastructure • Each stores and makes available a part of the resources • Some also provide computational facilities • Centres must meet a list of requirements and be certified by CLARIN • Candidate CLARIN Centres in NL • Institute for Dutch Lexicology (INL) • Max Planck Institute for Psycholinguistics (MPI) • Meertens Institute (MI) • Huygens ING Institute (HI) • Data Archiving and Networked Services (DANS)
Infrastructure Implementation • Implementation of basic infrastructure functionality • Setting up authentication and authorization systems • Several registries (e.g. ISOCAT, RELCAT, Metadata Registry) • Various other infrastructure services • Search Facilities • In resource descriptions ('metadata') • Centralized after metadata harvesting • In the data themselves • Via federated search • Using web services in workflow systems • Cooperation with Flanders • Based on work done in the STEVIN programme • (as a severe test for interoperability)
Documentation • Is always necessary, so hardly any additional effort • Partly in natural language • Partly formalized • Described under a particular formally identifiable attribute • With an explicit type for the value of the attribute • Possibly with further restrictions on the values (patterns, finite lists of values, constraints, etc.) • Represented formally and unambiguously • Any piece of documentation that can be formalized must be formalized, and must be put in the resource description (metadata of the resource)
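To make the idea of formalized documentation concrete, here is a minimal Python sketch of an attribute that has a formal name, an explicit value type, and optional restrictions (a pattern or a finite list of values). The class `AttributeSpec`, the attribute names, and the restrictions shown are illustrative assumptions, not actual CLARIN or CMDI definitions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttributeSpec:
    """Hypothetical description of one formalized metadata attribute."""
    name: str                             # formally identifiable attribute
    value_type: type                      # explicit type for the value
    allowed: Optional[frozenset] = None   # restriction: finite list of values
    pattern: Optional[str] = None         # restriction: pattern (strings only)

    def validate(self, value) -> bool:
        """Return True if the value satisfies the type and all restrictions."""
        if not isinstance(value, self.value_type):
            return False
        if self.allowed is not None and value not in self.allowed:
            return False
        if self.pattern is not None and isinstance(value, str):
            if re.fullmatch(self.pattern, value) is None:
                return False
        return True


# Illustrative attribute definitions (assumed, not taken from a real profile).
SPECS = [
    AttributeSpec("language", str, pattern=r"[a-z]{3}"),
    AttributeSpec("modality", str, allowed=frozenset({"spoken", "written"})),
    AttributeSpec("creationYear", int),
]


def invalid_attributes(description: dict, specs: list) -> list:
    """Names of attributes that are missing or violate their specification."""
    return [
        spec.name
        for spec in specs
        if spec.name not in description
        or not spec.validate(description[spec.name])
    ]
```

A resource description that passes such checks can be processed by software without a human having to interpret free text, which is the reason for formalizing whatever can be formalized.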
Documentation • Resource Descriptions • Component-based MetaData Infrastructure (CMDI) • One can define resource profiles as collections of components (which can themselves contain components) • Many generally usable components are available • Resource profiles for most common resources are available • Component-based flexibility • Danger of flexibility: diversity, no interoperability • Controlled by semantic interoperability (see below) • Not yet available but needed: profile(s) for tools • Supported by tools • Component and profile editors • Component and profile registries • Metadata editor
Visibility • Each resource and its resource description must be stored at a CLARIN-centre • CLARIN-centres make resource descriptions available for metadata harvesting (using OAI-PMH) • Once harvested, the metadata become available in the CLARIN resource catalogue • Browsing via the Virtual Language Observatory (VLO) using faceted browsing • Search via a search interface (under development) • In the metadata and in the data • String search and structured search • Results can, if desired, be collected in a Virtual Collection
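The harvesting step can be sketched as follows. OAI-PMH requests are plain HTTP requests with a `verb` parameter, and `ListRecords` with a `metadataPrefix` is part of the protocol. The endpoint URL, the `cmdi` prefix value, and the sample response below are assumptions made for illustration; no network access is performed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def record_identifiers(response_xml):
    """Extract the identifier of every record in a ListRecords response."""
    root = ET.fromstring(response_xml)
    path = ".//oai:record/oai:header/oai:identifier"
    return [element.text for element in root.findall(path, OAI_NS)]


# Hand-written sample response (hypothetical centre and identifiers).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:centre.example.org:corpus-1</identifier>
        <datestamp>2011-11-12</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:centre.example.org:lexicon-7</identifier>
        <datestamp>2011-10-01</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Hypothetical endpoint; a real harvester would fetch this URL over HTTP.
REQUEST_URL = list_records_url("https://centre.example.org/oai", "cmdi")
HARVESTED_IDS = record_identifiers(SAMPLE_RESPONSE)
```

A central catalogue would repeat this for every centre and index the harvested descriptions, which is what makes browsing and metadata search possible from one place.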
Referability • Referring by name or title is not sufficient • All the problems that natural language poses for communication: • Not always unique (ambiguity) • Language-specific: Corpus Gesproken Nederlands • Variants in other languages: Spoken Dutch Corpus • Limited knowledge of the foreign language leads to variants: Corpus Spoken Dutch, Dutch Spoken Corpus • Long, too redundant • Abbreviations/acronyms: CGN • Invites errors • Spoken Dutch Cropus, Spken Dutch Corpus • URLs • Still too long/redundant (unless one uses shortened URLs) • Unstable, volatile • Persistent Identifiers (PIDs) are needed
Referability • PIDs • Each CLARIN-Centre • Must assign a PID to each resource (and/or to subresources) • Must keep the PID resolution registry up to date • PID systems • Handle (preferred) • URN • Perhaps others (e.g. DOI)
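Why a PID stays valid when a URL changes can be shown with a small in-memory sketch of a resolution registry. The class `PidRegistry`, the handle `12345/corpus-1`, and the example URLs are hypothetical; the only real-world detail used is that handles can be resolved through the public proxy at `https://hdl.handle.net/`.

```python
class PidRegistry:
    """Toy PID resolution registry: maps a persistent ID to a current URL."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, url):
        """Assign a new PID; refuses to reuse an existing one."""
        if pid in self._locations:
            raise ValueError(f"PID already assigned: {pid}")
        self._locations[pid] = url

    def update(self, pid, url):
        """Record that the resource moved; the PID itself never changes."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = url

    def resolve(self, pid):
        """Return the current location of the resource."""
        return self._locations[pid]


def handle_proxy_url(handle):
    """URL under which the public Handle proxy resolves a handle."""
    return f"https://hdl.handle.net/{handle}"


registry = PidRegistry()
registry.register("12345/corpus-1", "https://old.example.org/data/corpus-1")
# The centre reorganizes its storage and keeps the registry up to date.
registry.update("12345/corpus-1", "https://new.example.org/archive/corpus-1")
```

Citations keep using the PID, so they continue to work after the move as long as the centre updates the registry.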
Accessibility • CLARIN infrastructure • Accessible at any time and from any place • IPR • CLARIN-NL promotes maximal open access of resources • is working on plans to implement policies and functionality to properly handle IPR and ethical restrictions • Researchers’ Mindset • Many researchers in the humanities are hesitant or even unwilling to share their resources with others • How to resolve this? With a carrot and a stick • CLARIN must accommodate reasonable wishes • CLARIN must prove benefits for researchers who put their resources there • Funding agencies must oblige researchers to do so (partially already so)
Long Term Preservation • Necessary to make sure the resources can be shared with future researchers (who may include the producer!) • Each CLARIN-Centre is obliged to ensure long term preservation • Usually outsourced to specialized centres • MI outsources to DANS • MPI outsources to an internal Max Planck Gesellschaft organisation
Interoperability • Interoperability of resources is the ability of resources to seamlessly work together • No manual ad-hoc adaptations • Adaptations occur automatically 'behind the scenes' • Need for interoperability is high • Humanities researchers do not have the required technical background • Interoperability • Syntactic interoperability and semantic interoperability • Each subproject must try to achieve interoperability • Report any problems and make suggestions for adaptations • So that the resources are adapted to the infrastructure (in some cases) and vice versa (in other cases) • Not easy, but the only way to get further is to actually try this and learn from it
Syntactic Interoperability • The formats of data are selected from a limited set of (de facto) standards or best practices supported by CLARIN • Software tools and applications take input and yield output in these formats
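The two conditions above can be checked mechanically. The sketch below assumes each tool declares one input and one output format; the format names in `SUPPORTED_FORMATS`, the `Tool` class, and the example tools are placeholders, not an actual list of CLARIN-supported formats.

```python
from dataclasses import dataclass

# Placeholder names standing in for the limited set of supported formats.
SUPPORTED_FORMATS = frozenset({"format-a", "format-b", "format-c"})


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool declaration with a single input and output format."""
    name: str
    input_format: str
    output_format: str


def chain_problems(tools):
    """List every reason why the tools cannot be chained as given."""
    problems = []
    for tool in tools:
        for fmt in (tool.input_format, tool.output_format):
            if fmt not in SUPPORTED_FORMATS:
                problems.append(f"{tool.name}: unsupported format {fmt}")
    # Each tool must accept what the previous tool produces.
    for producer, consumer in zip(tools, tools[1:]):
        if producer.output_format != consumer.input_format:
            problems.append(
                f"{producer.name} -> {consumer.name}: "
                f"{producer.output_format} != {consumer.input_format}"
            )
    return problems


tokenizer = Tool("tokenizer", "format-a", "format-b")
tagger = Tool("tagger", "format-b", "format-c")
legacy = Tool("legacy-parser", "home-grown", "format-c")
```

An empty result means the chain needs no manual conversion step, which is the situation syntactic interoperability aims for.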
Semantic Interoperability • Focus on the semantics of Data Categories (DCs) • A privileged data category registry (DCR) is set up containing DCs, each with: • a unique persistent identifier (PID) • its semantics • a definition • examples • lexicalizations in various languages • Each resource-specific DC is mapped to a DC from the privileged DCR • Every researcher can use his/her own DCs • Different DCs from different resources can be interpreted as identical in meaning, via the DC of the privileged DCR • In CLARIN-NL multiple (complementary) privileged DCRs are allowed; the primary one is ISOCAT
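The mapping mechanism can be sketched in a few lines of Python. Each resource keeps its own labels and maps them to the PID of a DC in the privileged registry; two labels count as identical in meaning when they map to the same PID. The labels and the PIDs of the form `DC-0001` are invented placeholders, not real ISOCAT identifiers.

```python
# Resource-specific DC label -> PID of the DC in the privileged DCR.
# All labels and PIDs are hypothetical placeholders.
RESOURCE_A = {"N": "DC-0001", "V": "DC-0002"}
RESOURCE_B = {"noun": "DC-0001", "verb": "DC-0002", "adj": "DC-0003"}


def same_meaning(dc_a, map_a, dc_b, map_b):
    """True if both resource-specific DCs map to the same privileged DC."""
    pid_a = map_a.get(dc_a)
    pid_b = map_b.get(dc_b)
    return pid_a is not None and pid_a == pid_b


def translate(dc, source_map, target_map):
    """Find the target resource's label for a source DC, if one exists."""
    pid = source_map.get(dc)
    if pid is None:
        return None
    for label, target_pid in target_map.items():
        if target_pid == pid:
            return label
    return None
```

With these mappings a query for `noun` in one resource can be rewritten as a query for `N` in the other, while each researcher keeps using his or her own DCs.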
Semantic Interoperability • Achieving semantic interoperability is very hard • Many DCs are almost identical (principled/pragmatic/arbitrary reasons) • Some DCs in ISOCAT are not defined clearly • There are many similar DCs in ISOCAT • Relevant DCs are not easy to find in ISOCAT • Three actions taken • Held several workshops to discuss problems • Appointed a coordinator to deal with problems • Decided to implement RELCAT registry to specify relations between DCs
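How a relation registry could help with near-identical DCs is sketched below. The slide only states that RELCAT will specify relations between DCs; the relation names `sameAs` and `almostSameAs`, the triple representation, and the placeholder PIDs are assumptions made for this illustration.

```python
from collections import defaultdict, deque

# (source DC, relation, target DC); all values are hypothetical.
RELATIONS = [
    ("DC-0001", "sameAs", "DC-0101"),
    ("DC-0101", "sameAs", "DC-0201"),
    ("DC-0003", "almostSameAs", "DC-0103"),
]


def equivalents(dc, relations, accepted=frozenset({"sameAs"})):
    """All DCs reachable from dc through accepted relations, including dc.

    Accepted relations are treated as symmetric and transitive.
    """
    graph = defaultdict(set)
    for source, relation, target in relations:
        if relation in accepted:
            graph[source].add(target)
            graph[target].add(source)
    seen = {dc}
    queue = deque([dc])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen
```

Making `accepted` a parameter lets each application decide how strict to be: a search tool might also accept `almostSameAs`, while a tool that merges annotations might not.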
Conclusions • CLARIN-NL requires systematic sharing of resources • Therefore requires researchers to work on • Documentation • Visibility • Referability • Accessibility • Long Term Preservation • Interoperability of resources • For certain aspects this is relatively easy but it must be done • For other aspects this is very hard but it must be done so that we can learn • The approach described here may be a model for other countries working on the CLARIN infrastructure • It may be a model for other resource sharing facilities (e.g. META-SHARE)