CLARIN: Common Language Resources and Technology Infrastructure for the Social Sciences and Humanities
Steven Krauwer, Utrecht institute of Linguistics UiL-OTS (NL)
INFuture, Zagreb, Nov 7 2007

Overview
- Problem & Mission
- Some why-questions
- Approach
- How we work and who we are
- Why this talk
- Summing up
INFuture 2007, Zagreb

The problem
- Much of the data in digital archives is language based
- Many archives are known only to local insiders and are mostly unconnected
- Every archive has its own standards for storage and access, normally offering only simple retrieval of files (text, audio or video documents)
- Social sciences and humanities researchers are often not aware of the potential benefits of using language and speech technology tools, and these tools are hard for non-specialists to use

The CLARIN Mission
- What:
  - Create an infrastructure that makes language resources and technology (LRT) available to scholars of all disciplines, especially social sciences and humanities (SSH)
- How:
  - Unite existing digital archives into a federation of connected archives with unified web access
  - Provide language and speech technology tools as web services operating on language data in archives

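The "How" of the mission can be pictured with a small Python sketch. It is an illustration only, not CLARIN's design: the names Record, Archive, Federation and count_tokens are hypothetical, the archives are in-memory dictionaries rather than real web services, and the "tool" is a trivial token counter standing in for a language technology service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Record:
    """Unified record handed back to the scholar, whatever the source archive."""
    archive: str
    identifier: str
    text: str


class Archive:
    """Hypothetical adapter hiding one archive's local storage conventions."""

    def __init__(self, name: str, documents: Dict[str, str]):
        self.name = name
        self._documents = documents

    def search(self, term: str) -> List[Record]:
        # Case-insensitive substring match; a real archive would expose
        # its own query interface behind this method.
        needle = term.lower()
        return [
            Record(self.name, doc_id, text)
            for doc_id, text in self._documents.items()
            if needle in text.lower()
        ]


class Federation:
    """Single access point that forwards one query to every member archive."""

    def __init__(self, archives: Iterable[Archive]):
        self.archives = list(archives)

    def search(self, term: str) -> List[Record]:
        results: List[Record] = []
        for archive in self.archives:
            results.extend(archive.search(term))
        return results

    def apply_tool(self, term: str, tool: Callable[[str], int]) -> List[int]:
        # Run a language tool on every matching document, mimicking
        # "tools as services operating on language data in archives".
        return [tool(record.text) for record in self.search(term)]


def count_tokens(text: str) -> int:
    """Stand-in for a language technology tool: whitespace token count."""
    return len(text.split())
```

In this sketch a scholar queries the Federation once and never needs to know which archive holds a document or how it is stored locally; adding a new archive means writing one more adapter.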
Why a European infrastructure?
- too much fragmentation
- lack of coordination
- lack of visibility
- lack of interoperability
- lack of sustainability
- expertise exists but not in all countries
- language independent tools can be shared
- language dependent tools can often be ported
- most countries not able to bear the cost

Why now?
- Exponential growth of digital data
- Maturity of language and speech technology:
  - allows for high speed processing
  - allows for large volumes
  - allows for new research questions
- Growing interest at EU level in research infrastructures (RI) for the ERA
- ESFRI RI Roadmap published in 2006:
  - includes 34 proposals for RIs
  - all of them will get EC funding for a 1-3 year preparatory phase

Overall plan for CLARIN
- Preparatory phase 2008 – 2010:
  - Put everything in place to get started for real
  - Build prototype
  - Budget in preparatory phase:
    - 4.1 M€ from EC
    - ??? M€ from participating countries
- Construction phase 2011 – 2015:
  - Build and populate with tools and resources
- Exploitation phase 2016 - ….
  - CLARIN in full service
- Overall budget 2008 - 2020: ca 200 M€

4-dimensional approach for the prep phase
- The technical dimension
- The language dimension
- The user dimension
- The governance and legal dimension

Technical
- Technical specification of the infrastructure
- Construction of a prototype
- Validation on a rich variety of languages (>20), resources and services
- Based on existing resources and tools (i.e. not a digitization or tools creation project)
- Strong focus on interoperability standards
- Conversion of existing resources
- Encapsulation of existing tools

Strong sustainable centers

Languages
- Intention to cover all languages spoken or studied in participating countries
- Representational and descriptive standards should be adequate and validated for all languages
- Same minimal coverage of basic resources and tools for all languages is to be defined (and implemented if additional funds are available)

Language activities
- Survey of resources and tools, including:
  - encoding and annotation
  - data quality indicators
- Agreeing on taxonomies and ontologies
- Agreeing on common standards
- Focus on:
  - integration of tools
  - interoperability
  - usage scenarios
- If possible, creation of missing essential resources
- Validating specifications and prototype

User
- Users are SSH scholars
  - Do WE know what they need?
  - Do THEY know what they need?
- Actions:
  - analyze past and ongoing SSH projects
  - user consultation
  - launch typical example projects to show potential
  - create expertise centers
  - awareness actions

Governance, funding and legal issues
- Agree on e.g.:
  - Who is going to pay for the construction and exploitation of the infrastructure
  - How the costs will be shared
  - How it will be managed
  - How it will be coordinated with national policies
- Actions:
  - Analyse best practice in funding and management of transnational projects
  - Prepare agreement between (now) 22 countries about long term joint funding of CLARIN
  - Set up IPR framework

How we work
- Most tasks executed in Working Groups
- WGs consist of project partners & other experts (CLARIN is open for contributions by others!)
- Some WGs do work (e.g. build prototype), others create consensus
- Participation by others essential as e.g. standards cannot be imposed by a small group
- Unfortunately no funding available for WG participation by others – only influence!

Who we are
- The CLARIN consortium has 32 partners from 22 EU and associated countries, including Croatia (FFZG)
- The CLARIN community has 92 members in 32 countries (Nov 07)
- Leading partners are:
  - Utrecht University (Steven Krauwer, coordinator)
  - Max Planck Institute Nijmegen (Peter Wittenburg)
  - Hungarian Academy of Sciences (Tamas Varadi)

National vs EC funding
- EC funds managed by consortium, will pay for:
  - generic tasks (e.g. research, prototyping, coordination, dissemination)
  - participation by a single national coordination point in every country (in HR: FFZG Zagreb)
- National funds to be managed nationally, will pay for:
  - participation by other sites in the country
  - taking care of own language and priorities (standards & validation, adaptation of tools & resources)
  - carrying out example humanities projects
  - (hopefully) participating in Working Groups

Why this talk?
- Invitation to join CLARIN:
  - We need user involvement
  - We need archives willing to join the federation
  - We need experts for our centers of expertise
  - We need example humanities projects for the preparatory phase

Summing up (1)
- CLARIN is about to embark on its 3 year Preparatory Phase project aimed at designing and building an LRT infrastructure for the SSH
- It can only work with support from the whole SSH community, both inside and outside the EU
- Please join us if you feel you can and want to contribute. We don’t pay you but don’t charge you either – it’s free!
- Contact: http://www.clarin.eu, steven.krauwer@let.uu.nl or your national contact point

Summing up (2)
- One day any SSH scholar should be able to ask without any difficulty:
  - “List all uses of enthusiasm in 19th century English novels written by women”
  - “Find all video clips of Tony Blair on BBC in 2007”
  - “Summarize Le Monde of October 7th 2007 – in Croatian”