CLARIN: The common language resources and technology infrastructure Steven Krauwer CLARIN Coordinator Utrecht institute of Linguistics UiL-OTS (NL)
Overview
• Problem & Mission
• Some why-questions
• Some who-questions
• Overall plan
• What CLARIN is NOT about
• How we work
• Funding
• Structure
• Where we stand
• Some dreams
• To conclude
CLARIN - Barcelona 06-02-2009
The problem
• Much of the data in digital archives is language based
• Its existence is often known only to insiders
• Archives are mostly unconnected, even at the national level
• Every archive has its own standards for storage and access
• Normally only simple retrieval of files (text, audio or video documents) is possible
• Other tools exist but are hard to use for non-specialists
• Social sciences and humanities researchers are not language or speech technologists
• They are often not aware of the potential benefits of using language and speech technology
The CLARIN Mission
What:
• Create an infrastructure that makes language resources and technology (LRT) available to scholars of all disciplines, especially social sciences and humanities (SSH)
How:
• Unite existing digital archives into a European federation of archives with unified web access
• Provide existing language and speech technology tools as web services operating on language data in archives
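The "federation of archives with unified web access" idea can be illustrated with a small sketch. It is not CLARIN's actual architecture. The names `Archive`, `InMemoryArchive` and `Federation`, the archive names and the records are all hypothetical, and a real federation would call remote web services rather than in-memory lists.

```python
from typing import Protocol


class Archive(Protocol):
    """Hypothetical common interface that every federated archive exposes."""
    name: str

    def search(self, query: str) -> list[str]: ...


class InMemoryArchive:
    """Stand-in for a real archive; records are plain description strings."""

    def __init__(self, name: str, records: list[str]) -> None:
        self.name = name
        self.records = records

    def search(self, query: str) -> list[str]:
        # Case-insensitive substring match on the record descriptions.
        q = query.lower()
        return [r for r in self.records if q in r.lower()]


class Federation:
    """Single entry point that forwards one query to all member archives."""

    def __init__(self, archives: list[Archive]) -> None:
        self.archives = list(archives)

    def search(self, query: str) -> list[tuple[str, str]]:
        # Collect (archive name, record) pairs in archive order.
        results = []
        for archive in self.archives:
            for record in archive.search(query):
                results.append((archive.name, record))
        return results


# Invented example data, for illustration only.
federation = Federation([
    InMemoryArchive("archive-a", ["Dialect recordings, audio", "Parliament debates, text"]),
    InMemoryArchive("archive-b", ["Newspaper corpus, text", "Dialect atlas, text"]),
])
hits = federation.search("dialect")
```

The point of the design is that the user sees one `search` call, while each archive keeps its own storage behind the shared interface.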
Towards strong and persistent centers
• need to add a persistent infrastructure layer on top of the existing landscape, which is formed by accidental and temporary collaborations
• should be easily accessible for everyone
• should offer high availability (always on-line) so that people can rely on it
• there will be different types of centers, depending on the service
• need strong national support for many years
Why a European infrastructure?
• too much fragmentation
• lack of coordination across countries
• lack of visibility
• lack of interoperability
• lack of sustainability
• expertise exists but not in all countries
• language independent tools can be shared
• language dependent tools can often be ported
• most countries not able to bear the cost
Why now?
• Exponential growth of digital data
• Increasing maturity of language and speech technology:
  • high speed
  • large volumes
  • new research questions
• Growing interest at EU level in Research Infrastructures (RI), also for soft sciences
• RI Roadmap published in 2006 by ESFRI includes 35 accepted proposals for RIs
• CLARIN is one of them and has EC funding for a 1-3 year preparatory phase
Who we are and where we come from
• The CLARIN consortium now has 32 partners from 22 EU and associated countries (and more on the waiting list)
• The CLARIN community has 148 members in 32 countries (Feb 2009)
• CLARIN is based on 4 earlier broad European initiatives with many participants: LangWeb, EARL, TELRI and (later) DAM-LR
Who else do we need?
Both our membership and our consortium are quite unbalanced:
• Written language technology over-represented
• Speech & multimodality under-represented
• Humanities other than linguistics under-represented
• Social sciences under-represented
• Some countries and languages (national and regional) still missing
There is no money to extend the consortium, but we have to fill these gaps to ensure balanced coverage
Overall plan for CLARIN
• Preparatory phase (2008-2010): Put everything in place
• Construction phase (2011-2015): Build and populate with tools and resources
• Exploitation phase (2016-….): CLARIN in full service
Budget
• Prep phase:
  • 4.1 M€ from EC
  • ??? from countries (process still ongoing)
• Estimated budget until 2020: ca 200 M€
  • mostly from national and regional funding agencies
  • max 20% from EC (not yet formally decided)
4-dimensional approach in the preparatory phase
First 3 years dedicated to the design:
• The technical dimension
• The language dimension
• The user dimension
• The governance and legal dimension
Technical
• Technical specification of the infrastructure
• Construction of a prototype
• Validation on a rich variety of languages (>20), resources and services
• Federation of existing archives
• Based on existing resources, tools
• Strong focus on interoperability standards
• Conversion of existing resources
• Encapsulation of existing tools
Languages
• Cover all languages spoken or studied in participating countries, including regional languages
• Representational and descriptive standards should be adequate and validated for all languages
• Same minimal coverage of basic resources and tools for all (living) languages
• BLARK (Basic Language Resource Kit) to be defined and implemented (funds from other sources needed)
Language technology activities
Activities during preparatory phase:
• survey of resources and tools, including:
  • encoding and annotation
  • data quality indicators
• developing taxonomies and ontologies
• agreeing on common standards
Focus on:
• integration of tools
• interoperability
• usage scenarios
• creating missing essential resources
• validating specifications and prototype
User
• Users are SSH scholars (including linguists, translation experts)
• Do WE know what they need?
• Do THEY know what they need?
Actions:
• analyze past and ongoing SSH projects
• user consultation
• launch typical example projects to show potential (see Call for Humanities Projects)
• expertise centers
• awareness actions
Legal and ethical
IPR and ethical issues:
• aim at open source, but IPR for existing and future non-open resources must be accommodated
• federation of archives requires authentication, authorization and trust between archives
• aim at a limited number of template license agreements for the most common cases
• respect national legislation
• address ethical issues
Governance and Funding
Agree on e.g.:
• Who is going to pay for the construction and exploitation of the infrastructure
• How it will be managed
• How it will be coordinated with national policies
Actions:
• Analyse best practice in funding and management of transnational projects
• Prepare agreement between (now) 22 countries about long term joint funding of CLARIN
What CLARIN is NOT (yet) about
• building the infrastructure: during this phase we are just preparing it
• creating new resources: at this stage we want to use what is there and adapt it if necessary
• creating new applications: except maybe some essential tools or demonstrators
• focusing on the big languages: we find all languages equally important
• strengthening European industry: our target audience is SSH researchers, but we don’t want to exclude anyone
How we work (1)
Work packages:
• WP1: Management and coordination
• WP2: Designing the infrastructure and building a prototype
• WP3: Humanities overview
• WP5: Language resources and technology overview
• WP6: Dissemination
• WP7: IPR and business models
• WP8: Construction and exploitation agreement
How we work (2)
[Diagram of the work packages and their numbered links: WP8 Org & Legal Framework; WP7 IPR, A&A, licensing; WP2 Infrastructure Prototype; WP3 Humanities Projects; WP5 LRT Exploration]
How we work (3)
• Most tasks executed in Working Groups (WGs)
• WGs consist of project partners & other experts (CLARIN is open!)
• Some WGs do work (e.g. build prototype), others collect data or create consensus
• Participation by others is essential, as e.g. standards cannot be imposed by a small group
• Unfortunately no EC funding is available for WG participation: the only reward is influence!
Funding & what to use it for
• From EC: 4.1 M€, used for generic, language independent tasks
• From countries: ??? M€, to be used for preparing CLARIN at the national or regional level in every country:
  • build and organize local national CLARIN communities
  • support for participation in working groups (e.g. travel)
  • validation tasks for own language(s)
  • creation or adaptation of essential resources
  • pilots and demonstrators & humanities projects
  • (co-)organisation of local or international events
  • preparing for future role (expertise centers, repositories)
Structure
• Executive Board, consisting of the 7 WP leaders plus a special representative to liaise with the humanities community (among others through the DARIAH sister project)
• Boards:
  • Scientific Board
  • Strategic Coordination Board
  • International Advisory Board
• Meetings (virtual or face to face):
  • Consortium meetings
  • Member meetings
  • Working group meetings
Where we stand
• We have just finished the 1st year (still 2 to go)
• Various working groups have been set up and are already active, but you can still join: http://www.clarin.eu/join-a-working-group
• We have regular workshops on various topics: see http://www.clarin.eu/all_events
• Public documents are published on http://www.clarin.eu/documents
• We have just launched a Call for Humanities Projects: http://www.clarin.eu/wp3/wp3-documents/call_final-version
Our dreams
An example:
• Ethnologists have a recording of a dance with singing, and a transcription; they want to search for certain textual patterns and then return to the corresponding recorded dance fragments
• For a 3-minute recording this is no problem, and 30 minutes might just be doable … but what about 3, 30 or 300 hours of video?
• To do this and to save time they would need to align media and transcriptions
• There are “aligner tools”
• But who is able to use them, and will they work on the transcription format?
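To make the ethnologists' example concrete, here is a minimal sketch of what becomes possible once media and transcription are aligned: a text search returns time spans in the recording. It assumes the alignment already exists. The `Segment` class, the `find_fragments` function and the sample transcript are hypothetical and do not represent any particular aligner tool or transcription format.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcription unit aligned to the recording (times in seconds)."""
    start: float
    end: float
    text: str


def find_fragments(segments, pattern):
    """Return (start, end, text) for every segment whose text matches pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(s.start, s.end, s.text) for s in segments if regex.search(s.text)]


# Invented aligned transcription of a dance recording, for illustration only.
transcript = [
    Segment(0.0, 4.5, "the dancers enter the circle"),
    Segment(4.5, 9.0, "they sing of the harvest moon"),
    Segment(9.0, 15.2, "the circle opens and the song ends"),
]

hits = find_fragments(transcript, r"\bcircle\b")
for start, end, text in hits:
    # Each hit tells a player where to jump in the audio or video file.
    print(f"{start:6.1f}s - {end:6.1f}s  {text}")
```

The hard part in practice is producing the aligned segments for hundreds of hours of material, which is exactly where the aligner tools and format conversion come in.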
… more dreams …
Another example:
• Historians want to access all material from physics, politics and sociology to understand the reasons for the maritime dominance of the Serene Republic of Venice
• to do this they need to search for concepts in all material, extract summaries, relate fragments, add and exchange comments etc
• they need to do this collaboratively
• currently this involves a huge amount of handwork to overcome institutional, linguistic (morphological normalization, translation) and semantic boundaries
• but who is able to carry out such work, and who can operate the tools?
… and more
One day any SSH scholar should be able to ask without any difficulty:
• “List all uses of enthusiasm in 19th century English novels written by women”
• “Find all video clips of Prince Charles talking about architecture in 2007”
• “Summarize the inaugural speech of Obama - in Catalan”
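The first query above combines a metadata filter (century, language, genre, author gender) with a keyword-in-context search. The sketch below shows that combination on a tiny in-memory corpus. The `Document` fields, the `concordance` function, the titles and the sentences are all invented for illustration, and a real service would run against archive metadata and full texts rather than Python lists.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    year: int
    language: str
    genre: str
    author_gender: str
    text: str


def concordance(docs, word, *, century, language, genre, author_gender, width=20):
    """List keyword-in-context lines for `word` in documents matching the metadata."""
    first_year = (century - 1) * 100 + 1  # 19th century starts in 1801
    last_year = century * 100             # and ends in 1900
    regex = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    lines = []
    for doc in docs:
        if not (first_year <= doc.year <= last_year):
            continue
        if (doc.language, doc.genre, doc.author_gender) != (language, genre, author_gender):
            continue
        for m in regex.finditer(doc.text):
            left = doc.text[max(0, m.start() - width):m.start()]
            right = doc.text[m.end():m.end() + width]
            lines.append((doc.title, f"{left}[{m.group(0)}]{right}"))
    return lines


# Invented corpus: placeholder titles and sentences, not real novels.
docs = [
    Document("Novel A", 1847, "en", "novel", "female",
             "She spoke with great enthusiasm about the journey."),
    Document("Novel B", 1851, "en", "novel", "male",
             "His enthusiasm was short-lived."),
    Document("Novel C", 1925, "en", "novel", "female",
             "No enthusiasm remained."),
]

hits = concordance(docs, "enthusiasm", century=19, language="en",
                   genre="novel", author_gender="female")
```

Only "Novel A" passes all the metadata filters here, which shows why consistent, shared metadata matters as much as the search tool itself.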
To conclude (1)
• CLARIN is a long term endeavour with lots of challenges of very different types
• For the medium and longer term I see the following main challenges (where we could really fail):
  • Agreeing on standards and actually using them
  • Persuading users to formulate requirements and to use the infrastructure
  • Making the CLARIN infrastructure resistant to technological developments
  • Securing long term funding
• In CLARIN there is room for all languages
• If it succeeds it will give a boost to SSH research
To conclude (2)
More information:
• CLARIN Website: http://www.clarin.eu
• CLARIN Office: clarin@clarin.eu
• CLARIN Newsletter (issue 4 just out): http://www.clarin.eu/newsletter
• CLARIN Members & how to join: http://www.clarin.eu/members
Thanks!