LRT Repositories/Archives – State and Future

   Peter Wittenburg, MPI for Psycholinguistics, CLARIN Research Infrastructure

   State in LRT Domain
   • everywhere the same:
   • extreme increase in the amount and complexity of primary/secondary research data
   • only very little is visible via “reasonable” portals
   • the state of the resources is in general bad
   • UNESCO: 80 % of the recordings about languages and cultures are highly endangered
   • encoding, structure and terminology are not well described/defined
   • only little is stored in suitable repository/archive systems
   • only few institutes have a proper repository/archive
   • only very few offer deposit services

   Types of Resources in LRT Domain
   • large heterogeneity of resource types
   • semi-structured texts (newspapers, books, etc.)
   • transcriptions
   • annotated media recordings (sound, video)
   • (annotated) time series data: eye tracking, motion tracking, data glove, fMRI, etc.
   • lexica (with multimedia extensions)
   • grammar descriptions
   • tree databases (syntax descriptions)
   • concept registries, relation registries, ontologies
   • metadata descriptions
   • schemas, component schemas
   • etc.
   • referenced objects (resources, collections, fragments)

   Some exceptions
   • such as the DOBES/MPI archive and others from DELAMAN
   • few “traditional” centers such as ELDA/LDC/INL/OTA/BAS ...
   • only few have a clear metadata policy

   Repository “Grid”
   • existing repository “grid” (joint MD, PIDs, distributed AAI, data exchange)
   • planned extensions in 2008
   • additional metadata harvesting from “centers” (OAI-PMH, XML) -> OLAC
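
   The harvesting step mentioned above can be illustrated with a minimal sketch of an OAI-PMH `ListRecords` exchange. It only builds the request URL and parses a canned response, so no network access is needed. The base URL `https://example.org/oai`, the record identifiers, and the sample response are assumptions made for illustration; `olac` is used as the metadata prefix on the assumption that the center exposes OLAC records under that name.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="olac", resumption_token=None):
    """Build a ListRecords request URL for a (hypothetical) center endpoint."""
    if resumption_token:
        # When continuing a harvest, resumptionToken is the only argument
        # sent besides the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (list of record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Made-up response, trimmed to the parts the parser reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:rec-1</identifier></header></record>
    <record><header><identifier>oai:example.org:rec-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

   A real harvester would fetch each URL, call `parse_list_records` on the response, and repeat with the returned token until none is left.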

   Professional Repository and Archiving
   • at MPI about 33 terabytes
   • more than 250,000 resources
   • 60 million annotations
   • long-term preservation strategy
   • synchronized regional archives are essential
   [Map labels: University London; AIATSIS Canberra; 2 Computer Centers in Munich (one from MPG); University Lund; IIAP Iquitos; 2 Copies; MPI Nijmegen; MPI Leipzig; CONICET BA; MdI Rio; 2 Computer Centers in Göttingen (one from MPG); Belem, Tbilisi, Timor, Bangkok, Windhoek, Katmandu, Birmingham, Berlin, Halle, ...; University Kiel]

   What is CLARIN? – very short
   • create an integrated and interoperable landscape of LRT
   • all based on strong centers with repository/archive strategy, variety of services and strong national support
   • i.e. extend what has been started on a small scale
   • centers will form a “kind of federation”
   • shared metadata is one of the key pillars

   CLARIN Initiative
   • CLARIN is an ESFRI Roadmap initiative in SSH
   • 90 member institutes from 31 EU countries
   • EC funded RI with 32 partners from 22 EU countries
   • 25 national commitment statements, i.e. many members are directly involved due to national funding schemes
   • preparatory phase 3 years
   • in some countries already a long-term roadmap concept
   • much interest from non-EU countries (Australia, Japan, Korea, US, South Africa, Brazil, Russia, Argentina, Peru, China)
   • have to show what we can achieve

   What else?
   • LRT has been quite an active community for many years
   • close collaboration with ISO TC37/SC4 (in addition to standards from W3C, OASIS, TEI, etc.)
   • standardized concept registry in progress, including multilingual terminology
   • various generic data models in progress (example: Lexical Markup Framework)
   • standard for unique and persistent identifiers (Handle) in progress
   • new standard for language IDs in progress
   • new framework for flexible metadata in progress
   • etc.

   What else?
   • thus: large overlap with DRIVER
   • CLARIN is focusing on research data, which is different
   • CLARIN is discipline oriented
   • DRIVER: started from the library domain; is discipline crossing
   [Diagram labels: ?, Grids, GEANT]

   Expectations wrt DRIVER
   • CLARIN is very much interested in a collaboration and in seeing where we can benefit from each other
   • heard a lot about OAI-PMH, but this is the simplest aspect in cross-disciplinary approaches ...
   • understand the semantic mapping problems when creating an integrated metadata domain
   • STITCH: CH sector in the NL; RoR: in Max Planck Society
   • figure out which vocabulary is offered to which users
   • is metadata used for research questions or for “accidental” discovery?
   • many other open questions
   • CLARIN would be ready for a test
   • many metadata descriptions ready to be harvested
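
   The semantic mapping problem named above can be made concrete with a small crosswalk sketch. Two metadata schemas use different field names for the same concept, and some fields have no shared counterpart at all. The schema names (`schema_a`, `schema_b`), field names, and shared concepts below are entirely hypothetical and do not reflect any actual CLARIN or DRIVER vocabulary.

```python
# Hypothetical crosswalk: (schema, field name) -> shared concept name.
CROSSWALK = {
    ("schema_a", "Title"): "title",
    ("schema_a", "SubjectLanguage"): "language",
    ("schema_b", "name"): "title",
    ("schema_b", "lang"): "language",
}


def to_shared(schema, record):
    """Map one record into the shared domain.

    Returns (mapped fields, list of field names with no known mapping).
    """
    mapped, unmapped = {}, []
    for field, value in record.items():
        concept = CROSSWALK.get((schema, field))
        if concept is None:
            # No agreed concept yet: this is where the mapping work lies.
            unmapped.append(field)
        else:
            mapped[concept] = value
    return mapped, unmapped


if __name__ == "__main__":
    record = {"name": "Sample recording", "lang": "nld", "genre": "story"}
    print(to_shared("schema_b", record))
```

   The list of unmapped fields is the interesting part: it shows what would be lost, or would need a new shared concept, when records from different disciplines are merged into one metadata domain.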

   Thanks for your attention.
