CLARIN: Common Language Resources and Technology Infrastructure for the Humanities and Social Sciences
  Kimmo Koskenniemi (University of Helsinki)
  Steven Krauwer (Utrecht University)
  Tamas Varadi (Hungarian Academy of Sciences)
  Peter Wittenburg (Max Planck Institute, Nijmegen)
  Martin Wynne (Oxford Text Archive)
  and many other contributors from the CLARIN community
  LREC2008
The problem
  Much of the data in digital archives is language based
    Only known to insiders
    Archives mostly unconnected
    Every archive has its own standards for storage and access
    Normally only simple retrieval of files (text, audio or video documents)
  Social sciences and humanities researchers are not language or speech technologists
    They are often not aware of the potential benefits of using language and speech technology
    Available tools are hard to use for non-specialists
The CLARIN Mission
  What:
    Create a European infrastructure that makes language resources and technology (LRT) available to scholars of all disciplines, especially social sciences and humanities (SSH)
  How:
    Unite existing digital archives into a federation of archives with unified web access
    Provide language and speech technology tools as web services operating on language data in archives
Why a European infrastructure?
  too much fragmentation
  lack of coordination across countries
  lack of visibility
  lack of interoperability
  lack of sustainability
  expertise exists but not in all countries
  language independent tools can be shared
  language dependent tools can often be ported
  most countries not able to bear the cost
Why now?
  Exponential growth of digital data
  Increasing maturity of language and speech technology:
    high speed & large volumes
    new research questions
  Growing interest at EU level in
    digital research infrastructures
    support for humanities and social sciences
  resulting in a Special EU Call for Proposals for research infrastructure initiatives in 2006/2007
Who we are and where we come from
  CLARIN is an EU infrastructure project with 4.1 Meuro funding for a 3-year preparatory phase
  Additional funding expected from national governments (at this moment at least 5 Meuro)
  The CLARIN consortium now has 32 partners from 22 EU and associated countries (and more on the waiting list)
  The CLARIN community has 112 member organisations in 32 countries (mostly from NLP)
  CLARIN is based on 4 earlier initiatives with many participants: LangWeb, EARL, TELRI and (later) DAM-LR
Overall plan for CLARIN
  Preparatory phase: 2008-2010
    Put everything in place
  Construction phase: 2011-2015
    Build and populate with tools and resources
  Exploitation phase: 2016-…
  Overall cost until 2020: 200 Meuro
Planning and design (1)
  The technical dimension:
    specifications, prototype, validation
    based on existing archives, resources and services
    strong focus on standards and interoperability
  The language dimension:
    all languages equally important
    surveys of what exists and what is missing
    standards, taxonomies, ontologies
    integration of tools
    validation for all participating languages
Planning and design (2)
  The user dimension:
    Investigating current practice
    Establishing user needs
    Usage scenarios
    Pilot projects and demonstrators
    Expertise centers, awareness actions
Planning and design (3)
  Legal and IPR issues
    aim at open source, but IPR for existing and future non-open resources must be accommodated
    federation of archives requires authentication, authorization and trust between archives
    aim at a limited number of template license agreements for the most common cases
    respect national legislation
    address ethical issues
Planning and design (4)
  Agree on e.g.:
    Who is going to pay for the construction and exploitation of the infrastructure
    How will it be managed
    How will it be coordinated with national policies
  To be agreed upon by the funding agencies (not the researchers)
What CLARIN is NOT about
  building the infrastructure: we are just preparing it
  creating new resources: at this stage we want to use what is there and adapt it if necessary
  creating applications: except maybe some demonstrators
  focusing on the big languages: we find all languages equally important
  strengthening European industry: our target audience is SSH researchers, but we don’t want to exclude anyone
Concluding remarks
  The CLARIN infrastructure does not yet exist. We haven’t even started constructing it: we have only just started planning the construction
  Why do we feel confident that we can succeed? Three key factors:
    Moral and financial support from the EC
    Moral support from (now 26) national governments (financial commitments expected)
    Massive support from the European LRT community
  More info: www.clarin.eu & Newsletter