
CLARIN Common Language Resources and Technology Infrastructure

CLARIN Common Language Resources and Technology Infrastructure. Daan Broeder & Dieter van Uytvanck Max-Planck Institute for Psycholinguistics. TF-EMC2 Meeting, Dec 3 2008.


Presentation Transcript


  1. CLARIN: Common Language Resources and Technology Infrastructure. Daan Broeder & Dieter van Uytvanck, Max-Planck Institute for Psycholinguistics. TF-EMC2 Meeting, Dec 3 2008

  2. What is CLARIN
     The CLARIN project is a large-scale pan-European collaborative effort to coordinate language resources and technology and make them available and readily usable for Language & SSH (Social Sciences & Humanities) researchers.
     • Resources: lexica, text corpora, multi-media/multi-modal recordings, …
     • Technology: applications & (web-)services such as parsers, tokenizers, speech recognizers & segmenters, …

  3. The problem
     • Existence and location of resources are known only to insiders
     • Archives are mostly unconnected islands
     • Every archive has its own standards for storage and access
     • Resources normally need to be downloaded first before they can be processed
     • Social sciences and humanities researchers are not language or speech technologists
     • They are often not aware of the potential benefits of using language and speech technology
     • Available tools are hard to use for non-specialists

  4. CLARIN overview
     • CLARIN is an EU Infrastructure project with 4.2 million euros of funding for a 3-year preparatory phase
     • Additional funding from national governments (at this moment at least 7 million + 9 million euros)
     • The CLARIN consortium now has 32 partners from 26 EU countries
     • The CLARIN community has 146 member organizations in 32 countries (mostly NLP organizations)
     • CLARIN is based on earlier initiatives with many participants: LangWeb, EARL, TELRI, LIRICS and, more recently, DAM-LR

  5. CLARIN Organization

  6. Time plan
     • 2008 - 2010: Preparatory Phase. Limited set of federated centers (10+); showcases and demonstrators; WP8: investigate embedding in national funding schemes for the construction phase & maintenance
     • 2010 - 2020: Construction Phase. No important European funding; depends on national project commitments
     • 2020? - …: Maintenance Phase

  7. CLARIN “Holy Grail” User Scenario
     A researcher authenticates himself with his own organization and creates a “virtual” collection of resources from different repositories. He does this by browsing a catalogue, searching through metadata, or searching in resource content. He is then able to use a workflow specification tool to process this virtual collection, possibly with a mix of home-grown and remote service components. Resulting data can be added to the repositories of origin with proper access rights, and the “virtual” collection specification can be stored for future reference.
     For our domain this is very ambitious and challenging, but even a partial realization is worthwhile!

  8. DAM-LR EU project (2005-2007)
     Small EU project on archive integration with 4 partners from corpus/computational linguistics and endangered language documentation
     • Resource discovery: sharing a single metadata set for searching & browsing
     • Authentication & Authorization: single user identity, single sign-on using Shibboleth
     • Referencing and citing “archived resources” using a single persistent identifier system

  9. AAI & Federation issues
     Experiences from DAM-LR with respect to AAI:
     • The standard eduPerson attribute set is probably sufficient (but CCs …)
     • Shibboleth is nice when using web applications, but applications need access too!
     • Shibboleth is efficient when dealing with groups, e.g. staff, student, …, but our domain also has to deal with individuals -> store user IDs in authorization records
     • DAM-LR was a federation of both IdPs & SPs; CLARIN aims at a much larger potential user group whose home organizations do not want to run a CLARIN-specific IdP -> use the national identity federations (IDFs)
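The point about groups versus individuals can be illustrated with a minimal sketch of an authorization record that accepts either a group-style affiliation attribute or an individual user ID. The class name, field names, and example identifiers are hypothetical and are not taken from the DAM-LR or CLARIN software.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class AuthorizationRecord:
    """Hypothetical access rule for one archived resource."""
    resource: str
    # Group-style access, e.g. affiliation values such as "staff" or "student".
    allowed_affiliations: Set[str] = field(default_factory=set)
    # Individual access: user IDs stored directly in the record.
    allowed_user_ids: Set[str] = field(default_factory=set)


def is_authorized(record: AuthorizationRecord, user_id: str,
                  affiliations: Iterable[str]) -> bool:
    """Grant access if the user is listed individually or via a group."""
    if user_id in record.allowed_user_ids:
        return True
    return bool(record.allowed_affiliations & set(affiliations))


# Illustrative data only.
record = AuthorizationRecord(
    resource="corpus/session-001",
    allowed_affiliations={"staff"},
    allowed_user_ids={"alice@example.org"},
)
```

Here a student listed by ID gets access even though students as a group do not, which is the case the slide says group attributes alone cannot express.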

  10. CLARIN Federation Infrastructure I
     • CLARIN wants to be an LR&T “service federation”
     • simplified and unified rules for licensing and access
     • agreements with national identity federations
     • must make sure all necessary attributes are available
     • cater also for AA of non-web applications and web services
     • interaction with GRID AAI
     (Diagram labels: eJournal Service Providers, Trust Agreements, national Identity Federations, Trust Agreement, LRT Service Providers)

  11. Applications need Authentication too
     User scenario: copying resources from different repositories to the local machine
     • The application speaks only HTTP with basic authentication
     • It does not understand the form-based authentication employed by the Shibboleth IdP
     • The application is also not able to profit from the SSO across archives
     • Possible solution: use certificates for authentication, obtained via SLCS. But can the authentication handshake be mimicked by software?
     (Diagram labels: user, application (IMDI copier), IdP, Shib. apache, Shib. apache, archiveA, archiveB)
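The mismatch described on this slide can be sketched as a small check that a non-browser client might run on a response. This is only an illustration: the function name and the heuristics (a redirect or an HTML page where data was expected is treated as an interactive login) are assumptions, not behaviour documented for the IMDI copier or for any particular Shibboleth deployment.

```python
def classify_auth_response(status: int, headers: dict) -> str:
    """Classify an HTTP response from the point of view of a client
    that only understands HTTP basic authentication.

    Returns "basic-challenge", "interactive-login", "ok", or "error".
    """
    # Header names are case-insensitive, so normalise them first.
    headers = {k.lower(): v for k, v in headers.items()}

    challenge = headers.get("www-authenticate", "")
    if status == 401 and challenge.lower().startswith("basic"):
        # The one case such a client can handle: retry with credentials.
        return "basic-challenge"

    if status in (301, 302, 303, 307) and "location" in headers:
        # Assumption: a redirect on a protected resource leads to a login page.
        return "interactive-login"

    content_type = headers.get("content-type", "").lower()
    if status == 200 and content_type.startswith("text/html"):
        # Assumption: the client requested a non-HTML resource, so an HTML
        # page is most likely a login form rather than the data itself.
        return "interactive-login"

    return "ok" if 200 <= status < 300 else "error"
```

A client that receives "interactive-login" has no way to proceed on its own, which is why the slide turns to certificates as a possible alternative.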

  12. Searching through annotations
     Searching through the content of just one archive is no problem: there is just one SP that needs to check whether the user has access to the annotations.
     Parsers “normalize” the structural format.
     (Diagram labels: IdP, Auth DB, Search service, DB/SE, CHAT, Shoebox, EAF, MPI Archive)

  13. Searching through annotations
     Federated search scenario: the web portal app would like to act on behalf of the user and access the search services.
     Parsers “normalize” the structural format.
     (Diagram labels: IdP, Specialized web portal, Auth DB, Auth DB, Search service, Search service, DB/SE, DB/SE, CHAT, Shoebox, CHAT, EAF, MPI Archive, Archive B)
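A minimal sketch of the federated scenario, assuming each archive's search service filters hits against its own authorization database and the portal simply forwards the user's identity. All class, function, and data names are hypothetical, and how the portal proves that it acts on behalf of the user (the open question on the slide) is deliberately left out.

```python
class SearchService:
    """Hypothetical per-archive search service with its own Auth DB."""

    def __init__(self, name, annotations, auth_db):
        self.name = name
        self.annotations = annotations  # resource ID -> annotation text
        self.auth_db = auth_db          # resource ID -> set of allowed user IDs

    def search(self, user_id, term):
        """Return (archive, resource) hits that this user may access."""
        return [
            (self.name, resource)
            for resource, text in self.annotations.items()
            if term in text and user_id in self.auth_db.get(resource, set())
        ]


def federated_search(services, user_id, term):
    """Portal side: query every archive on behalf of the same user."""
    hits = []
    for service in services:
        hits.extend(service.search(user_id, term))
    return hits


# Illustrative data only.
mpi = SearchService(
    "MPI Archive",
    {"session-1": "the dog barks", "session-2": "a dog sleeps"},
    {"session-1": {"alice"}},
)
archive_b = SearchService(
    "Archive B",
    {"rec-9": "dog and cat"},
    {"rec-9": {"alice", "bob"}},
)
```

Each archive keeps the authorization decision local, so the portal only needs a way to convey the user's identity to the search services.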

  14. Licenses & Codes of Conduct 1
     • One SP requires a signed CC and takes care of this, but only for its own domain
     • This can break the SSO if the user is required to sign the same CC several times
     • CLARIN will harmonize the CCs and licenses to a limited number
     (Diagram labels: user, browser, IdP, SPa, CC DB, SPb, CC DB)

  15. Licenses & Codes of Conduct 2
     • Store the CC DB info in the user attributes at the IdP
     • But how does it get there?
     • Special app?
     • Not every IdP will/can run this
     (Diagram labels: user, browser, SPa, SPb, IdP, CC DB)

  16. Licenses & Codes of Conduct 3
     Create a special CC service. This is part of the SPF, independent of the IDFs.
     (Diagram labels: user, browser, SPa, SPb, CC service, CC DB, IdP)
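The idea of a shared CC service can be sketched as follows: signatures are recorded once, centrally, and every service provider asks the same service instead of keeping its own CC DB. This is an assumption-laden illustration; the class and method names and the CC identifier are hypothetical and do not describe an actual CLARIN component.

```python
class CodeOfConductService:
    """Hypothetical central record of who has signed which CC."""

    def __init__(self):
        self._signed = set()  # (user ID, CC ID) pairs

    def sign(self, user_id, cc_id):
        self._signed.add((user_id, cc_id))

    def has_signed(self, user_id, cc_id):
        return (user_id, cc_id) in self._signed


class ServiceProvider:
    """Hypothetical SP that delegates the CC check to the shared service."""

    def __init__(self, name, required_cc, cc_service):
        self.name = name
        self.required_cc = required_cc
        self.cc_service = cc_service

    def grant_access(self, user_id):
        return self.cc_service.has_signed(user_id, self.required_cc)


# Illustrative setup: two SPs requiring the same harmonized CC.
cc_service = CodeOfConductService()
sp_a = ServiceProvider("SPa", "clarin-cc-v1", cc_service)
sp_b = ServiceProvider("SPb", "clarin-cc-v1", cc_service)
```

Once a user signs the CC a single time, both SPa and SPb accept it, so single sign-on is not interrupted by repeated signing.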

  17. The End. Thank you for your attention. More info: www.clarin.eu
