Standards and Tools: DOBES and CLARIN Views - résumé after about 8 years - Peter Wittenburg, André Moreira The Language Archive - Max Planck Institute CLARIN European Research Infrastructure
Content • CLARIN vs. DOBES - differences? • Tools vs. Standards - differences? • Overall Comparison • TLA Team - Landscape and Strategy • Technology - Mainstream influences • Conclusions
DOBES vs. CLARIN • DOBES is about the documentation of endangered languages (as are many other comparable initiatives) • documentation teams are under time pressure • thus efficiency is required (transcription: 1:35, translation: 1:25) • this can be facilitated by good tools • documentation certainly is for this generation of researchers, speech communities, students, the public, etc. (the primary focus of DOBES and its teams) • documentation is also for future generations • documents are part of our cultural heritage • languages encode knowledge about nature and culture • historical material helps us find our identity • therefore DOBES has a short-term and a long-term challenge
DOBES vs. CLARIN • CLARIN is about an interoperable + persistent infrastructure for LRT • the landscape is fragmented and nothing fits together • thus researchers working on data can't be efficient (knowledge workers spend 40% of their time on finding resources, making things compatible, etc.) • this can be facilitated by good standards and agreements • the infrastructure certainly is for this and the next generation of researchers, students, "citizen scientists", etc. • it enables "better" research if it is "data-driven" • the infrastructure is also for future generations • ensuring access to our research records • lots of data is highly endangered! • comparing "old" data with "new" data • therefore CLARIN has a short-term and a long-term challenge
DOBES vs. CLARIN: interoperability • DOBES • a community of documenting field linguists • is interoperability an issue? well, I still don't know • interoperable with whom? • cross-corpus work based on data is still to come • of course there are some practical barriers (language) • CLARIN • an infrastructure covering "all" language resources & tools • (named entity recognition is relevant for everyone) • is interoperability an issue? YES - it is the focus • otherwise there are always barriers to tackling relevant questions • otherwise data-driven research is too expensive • there seems to be a clear difference in primary objectives here
Tools vs. Standards • who dares to doubt that • tools determine our "productivity" • tools influence the attractiveness of solutions • people are used to their tools - who wants to learn new stuff? • tools need to be egocentrically built • development is expensive (UI) • fast development cycles are necessary • SW management is very expensive and eats up person power • ~80% of all software developments fail • a lot of the SW developed will die quickly since there is not enough money to maintain it • tools have a short lifecycle of on average about 10 years (figure: tool functionality over time)
Tools vs. Standards • who dares to doubt that • standards live almost forever • their de facto lifetime is comparatively high • standards are in general not attractive for users, except for some XML "fans" • standards should be hidden and only experts need to read all the documents • standards building has some form of altruism (if big industry is not involved) • it costs a lot of time and effort (ISO TC37/SC4 started in 2002 at LREC) • risk of being quickly outdated • will a standard be accepted? • implementing standards in tools can be expensive (moving target, complexity of the standard, etc.)
All together • for CLARIN there is no separation - a symbiosis between short-term tool support and long-term interoperability facilitation • for DOBES there seems to be a difference
Landscape for TLA Team • being the archivist and providing access to stored material in DOBES (+MPI) • being at the core of CLARIN/EUDAT infrastructure development • a few major questions: • how can we preserve bit streams and interpretability over a long period? • how can we give access to heterogeneous resources and also support resource creation and manipulation/enrichment? • we have about 71 lexica (and many different annotation types): 61 in the archive, 10 active in LEXUS • created by different tools, using different structures, using different categories (lexical attributes) - see the sketch below • how can we build "generic" tools and frameworks that can cope with heterogeneity - we cannot build/maintain SW that is too specifically targeted • how can we build SW in a scenario where there are so many smart developers out there?
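To illustrate the heterogeneity problem, here is a minimal, purely hypothetical sketch in Python: lexica produced by different tools use different attribute names for the same information, and a mapping table can rename them to shared data categories. The Toolbox markers (\lx, \ps, \ge) are the usual MDF markers; the tool name "OtherTool", the category labels and the example entries are invented for illustration.

    # Hypothetical sketch: rename tool-specific lexical attributes to shared categories.
    CATEGORY_MAP = {
        "Toolbox":   {"\\lx": "lemma", "\\ps": "partOfSpeech", "\\ge": "gloss"},
        "OtherTool": {"headword": "lemma", "pos": "partOfSpeech", "translation": "gloss"},
    }

    def normalise(entry, source_tool):
        """Map tool-specific attribute names onto shared data categories."""
        mapping = CATEGORY_MAP[source_tool]
        return {mapping.get(key, key): value for key, value in entry.items()}

    # Two differently structured entries end up with the same category names.
    print(normalise({"\\lx": "kawa", "\\ps": "n", "\\ge": "water"}, "Toolbox"))
    print(normalise({"headword": "kawa", "pos": "n", "translation": "water"}, "OtherTool"))

In a real setting such a mapping would of course point to registered data categories (e.g. ISOcat entries) rather than plain strings.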
Strategy for TLA Team • Rule 1: have a coherent archive of 34/75 TB • i.e. convert "everything" to stable formats with explicit syntax/encoding and check quality • otherwise long-term curation and access become too expensive • the costs for late curation and manual migration are extreme • Rule 2: base tool development on open and "generic" formats • EAF for annotations turned out to be flexible enough over 10 years (see the reading sketch below) • LMF is a flexible model for lexicon structures • the "LEGO" approach frightens some people, but the flexibility is not even sufficient for field linguists • still no agreement on an exchange format - a disaster • ISOcat for registering semantics (is it generic enough?) • Rule 3: provide converters and interfaces for major tools/formats • Toolbox, CLAN, Transcriber, PRAAT, other XML • a time-consuming effort (a cyclic flow is almost impossible)
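As a concrete illustration of why an open, explicit format such as EAF helps: a minimal Python sketch that lists the time-aligned annotations of an ELAN EAF file. It assumes a simplified EAF file containing only TIME_ORDER and ALIGNABLE_ANNOTATION elements (real files may also contain REF_ANNOTATIONs, controlled vocabularies, etc.), and the file name example.eaf is just a placeholder.

    import xml.etree.ElementTree as ET

    def read_eaf(path):
        """Yield (tier, begin_ms, end_ms, value) for time-aligned annotations."""
        root = ET.parse(path).getroot()
        # map time-slot ids to their millisecond values
        slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE", "0"))
                 for ts in root.findall("TIME_ORDER/TIME_SLOT")}
        for tier in root.findall("TIER"):
            for ann in tier.findall("ANNOTATION/ALIGNABLE_ANNOTATION"):
                yield (tier.get("TIER_ID"),
                       slots.get(ann.get("TIME_SLOT_REF1")),
                       slots.get(ann.get("TIME_SLOT_REF2")),
                       ann.findtext("ANNOTATION_VALUE", default=""))

    if __name__ == "__main__":
        for tier_id, begin, end, value in read_eaf("example.eaf"):
            print(f"{tier_id}\t{begin}-{end} ms\t{value}")

Because the syntax is explicit and stable, such a reader keeps working across tool versions - which is exactly the point of Rule 2.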
Is our Strategy Successful? • very difficult to answer - what are the criteria? • the strategy allows us to be coherent with both DOBES and CLARIN • the strategy was broad enough to help establish TLA • although • LMF turned out to be very expensive for us • much time invested in participating in x meetings • little understanding from the NLP hardcore guys • we can't even claim to be 100% compliant, or can we? • some years of instability of the model, thus changes to the code, thus slowing down development • we had to invent our own interchange format for archiving purposes (RELISH ??) • modern lexica are complex objects with inclusions of other objects (images, a/v fragments, internal and archived resources, etc.) - see the sketch below • in the end an approach based on flexible standards will pay off, but it takes more time
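A rough sketch of what such a "complex object" lexical entry can look like as a data structure. The class and field names below are illustrative only (they loosely follow the LMF idea of an entry with senses and linked resources, not the normative LMF element names), and the handle reference is a made-up example.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class MediaLink:
        kind: str              # e.g. "image" or "audio-fragment"
        reference: str         # link to an internal or archived resource
        begin_ms: int = 0      # fragment boundaries for a/v material
        end_ms: int = 0

    @dataclass
    class Sense:
        gloss: str
        examples: List[str] = field(default_factory=list)
        media: List[MediaLink] = field(default_factory=list)

    @dataclass
    class LexicalEntry:
        lemma: str
        part_of_speech: str
        senses: List[Sense] = field(default_factory=list)

    entry = LexicalEntry(
        lemma="kawa", part_of_speech="noun",
        senses=[Sense(gloss="water",
                      media=[MediaLink("audio-fragment",
                                       "hdl:1839/00-0000-0000-EXAMPLE",
                                       begin_ms=1200, end_ms=2400)])])

Any exchange format for archiving has to be able to carry such nested objects and references, which is one reason why agreeing on one is hard.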
Technology (IT) Issues • technology innovation is moving ahead with the web as the driving force • designs and tools need to be web-ready • visibility from everywhere • access from everywhere • collaboration support • annotation (incl. relation drawing) support • (there are so many knowledgeable people around) • web technology is subject to a high innovation rate • frequent re-design of components • what is the stable core that keeps costs low and makes code maintenance feasible?
Conclusions • research communities are naturally more interested in tools • research infrastructure work needs to find a balance between short- and long-term aspects • however, we need to store data following general IT principles: explicit syntax, declared semantics, open formats • we need to build better tools that support standards and/or convince companies to adopt standards • but tool building based on standards can be more expensive and time consuming • RELISH is very good for comparing TEI, LMF and LIFT • RELISH is very good for comparing ISOcat and GOLD • we need a strategy for TLA to support one (or two) exchange formats, and one needs to be based on a standard (data will go into the archive)