The Basic Language Resources Kit (BLARK)
Steven Krauwer
Utrecht Institute of Linguistics UiL OTS / ELSNET
steven.krauwer@elsnet.org
Overview
• The BLARK Enterprise
• How to arrive at it
• The Dutch Language Union approach
• Refining the concept
• Defining a BLARK
• Main beneficiaries
• References
• Concluding remarks
The BLARK Enterprise
• Define the minimal set of language resources necessary to do any precompetitive R&D and professional education at all for a language (the Basic Language Resource Kit, or BLARK)
• Determine for each language which components are already available
• Make a priority plan to complete the BLARK for each language
• Ensure funding to get the work done
What are the components of a BLARK?
• Lexicons (monolingual, multilingual, …)
• Corpora (language, speech; annotated, unannotated; mono- and multilingual; mono- and multimodal; …)
• Tools (annotation, exploration, …)
• Modules (lemmatizers, parsers, speech recognizers, TTS, transcribers, translation, …)
• …
What makes the BLARK Enterprise special?
• The idea is to make a common, generic BLARK definition, in principle applicable to all languages
• The common definition will be based on the experience with different languages, and will prevent reinvention of wheels
• The common definition will ensure interoperability and interconnectivity (especially for multilingual or cross-lingual applications)
Other benefits
• Experience from other languages will help in making cost estimates
• Adoption of a BLARK common to all languages may help in persuading funders to support its creation
• Adoption of a common BLARK may facilitate porting of knowledge and expertise between languages
Words of caution
• A BLARK definition will evolve over time, as new applications, application environments and technologies emerge
• A BLARK definition should be seen as a template rather than a dictate, as different languages may have different specific requirements
• BLARK completion priorities may differ from language to language (on e.g. economic, social or political grounds)
How to define a BLARK and assign priorities
• Methodology proposed by the Dutch Language Union (DLU; Binnenpoorte et al., LREC 2002):
  • Identify a number of typical applications
  • Determine for each of them which technologies (modules) are needed to build them (-, +, ++, +++)
  • Identify for each module which resources it requires (-, +, ++, +++)
  • Assign the highest priority to the resources that support the most applications (see the sketch below)
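To make the priority step concrete, here is a minimal sketch of the DLU-style calculation, assuming numeric weights 0-3 for the -, +, ++, +++ ratings; the application, module and resource names are invented for illustration and do not come from the DLU study.

```python
# Hypothetical application -> module importance (0 = '-', 3 = '+++').
module_need = {
    "spell_checker":   {"tokeniser": 3, "morph_analyser": 2, "parser": 0},
    "dialogue_system": {"tokeniser": 2, "morph_analyser": 2, "parser": 3},
}

# Hypothetical module -> resource importance, on the same 0-3 scale.
resource_need = {
    "tokeniser":      {"lexicon": 2, "annotated_corpus": 1},
    "morph_analyser": {"lexicon": 3, "annotated_corpus": 2},
    "parser":         {"lexicon": 2, "annotated_corpus": 3, "treebank": 3},
}

# A resource's priority is its accumulated support across all applications:
# the more (and the more strongly) applications depend on it, the higher it ranks.
priority: dict[str, int] = {}
for app, modules in module_need.items():
    for module, m_weight in modules.items():
        for resource, r_weight in resource_need[module].items():
            priority[resource] = priority.get(resource, 0) + m_weight * r_weight

for resource, score in sorted(priority.items(), key=lambda kv: -kv[1]):
    print(f"{resource}: {score}")
```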
Proposed DLU priorities for NLP
• treebank
• robust parsers
• tokenisation and named entity recognition
• semantic annotations for the treebank
• translation equivalents
• evaluation benchmarks
Proposed DLU priorities for speech
• automatic speech recognition
• application-specific speech corpora
• multi-media speech corpora
• tools for transcription of speech data
• speech synthesis
• benchmarks for evaluation
Next steps by DLU
• Make a survey of what exists and to what extent it is available (0-9 availability score)
• Assign priorities (not just resources, but also an infrastructure for maintenance and distribution)
• Secure funding from the Dutch and Flemish governments for a national programme
• Issue calls for proposals for collaborative resource projects (first call closed 2 November 2004)
Refining the concept
• Items not really covered by the DLU teams:
  • definition vs specification
  • availability
  • quality
  • quantity
  • standards
  • support
• Addressed in the NEMLAR project
Definition / specification
• It is not enough to say 'a written language corpus'; what about:
  • size (types, tokens)
  • encoding
  • annotation
  • text types
  • representativeness
  • domains
• i.e. we need full specifications (see the sketch below)
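As an illustration of what a 'full spec' might cover, the record below gathers the dimensions listed above into a single structure. This is a hypothetical schema; the field names and example values are invented, not taken from NEMLAR.

```python
from dataclasses import dataclass

@dataclass
class CorpusSpec:
    name: str
    n_types: int            # size in types
    n_tokens: int           # size in tokens
    encoding: str           # e.g. "UTF-8"
    annotation: list[str]   # annotation layers, e.g. ["POS", "lemma"]
    text_types: list[str]   # e.g. ["news", "fiction"]
    representative: bool    # sampled to be representative of the language?
    domains: list[str]      # covered domains

# An invented example instance of such a full specification.
example = CorpusSpec(
    name="example written corpus",
    n_types=250_000,
    n_tokens=10_000_000,
    encoding="UTF-8",
    annotation=["POS"],
    text_types=["news"],
    representative=True,
    domains=["general"],
)
```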
Availability
• DLU: 0-9 scale, very impressionistic
• Our proposal: 3 dimensions
  • accessibility
  • cost
  • modifiability
• To each we assign a penalty score (the lower, the better)
Accessibility
• 3 classes, with associated penalties:
  • (3) existing, but only company-internal
  • (2) existing and freely usable for precompetitive research
  • (1) existing and freely usable for all R&D
Cost
• 4 cost categories:
  • (4) over 10,000 euro
  • (3) between 1,000 and 10,000 euro
  • (2) between 100 and 1,000 euro
  • (1) under 100 euro
Modifiability
• 3 categories:
  • (3) black box: you get the resource as it is, but you cannot change or even inspect its internals
  • (2) glass box: you cannot change it, but you can see what is inside
  • (1) open resource: freely manipulable
Comments on availability
• We can now express availability as a three-digit score (accessibility, cost, modifiability), which should be rather easy to assign objectively (see the sketch below)
• The lowest scores are the best
• If the accessibility score is 3, the other scores don't mean very much
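A minimal sketch of how such a three-digit score could be represented and validated; the class name, helper method and range checks are my own illustration of the scheme on the preceding slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Availability:
    accessibility: int  # 1 (free for all R&D) .. 3 (company-internal)
    cost: int           # 1 (< 100 euro) .. 4 (> 10,000 euro)
    modifiability: int  # 1 (open) .. 3 (black box)

    def __post_init__(self):
        if not (1 <= self.accessibility <= 3 and
                1 <= self.cost <= 4 and
                1 <= self.modifiability <= 3):
            raise ValueError("penalty out of range")

    def __str__(self) -> str:
        return f"{self.accessibility}{self.cost}{self.modifiability}"

# A freely usable, cheap, open resource gets the best possible score: "111".
print(Availability(accessibility=1, cost=1, modifiability=1))
```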
Quality
• We distinguish two types of quality: absolute (i.e. an inherent property of the resource) and relative (i.e. in relation to how you want to use it):
  • Absolute: standard-compliance and soundness
  • Relative: task-relevance and environment-relevance
Standard-compliance
• Criterion: to what extent is the resource based on a common standard (formal or de facto)?
• Possible values (penalty-based):
  • (3) no standard
  • (2) standard, but not fully compliant
  • (1) standard and fully compliant
Soundness
• Criterion: to what extent is the resource based on well-defined specifications?
• Values:
  • (3) no specifications provided
  • (2) specifications provided, but not fully compliant
  • (1) specifications provided, fully compliant
Task-relevance
• Criterion (relative): to what extent is the resource suited for a specific task X?
• Values (3 binary values):
  • contains all information needed for X (yes/no)
  • has the proper size for X (yes/no)
  • based on a relevant selection of items for X (yes/no)
Environment-relevance
• Criterion: to what extent is the resource interoperable with its environment (other resources)?
• Values (3 binary values):
  • information matches (yes/no)
  • size matches (yes/no)
  • selection matches (yes/no)
Comments on quality
• We can now express absolute quality objectively in terms of a pair of scores (standard-compliance, soundness); this score can be assigned by the provider
• And relative quality (for our own purposes) in terms of two triples of yes/no answers (task-relevance, environment-relevance); this score can only be assigned by the user (see the sketch below)
• Other attributes may be added, as long as they can be assigned objectively
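The sketch below renders the two kinds of quality scores as simple records: an absolute pair filled in by the provider and two yes/no triples filled in by the user. Class names, field names and example values are invented; only the scoring scheme itself comes from the preceding slides.

```python
from dataclasses import dataclass

@dataclass
class AbsoluteQuality:          # assigned by the provider
    standard_compliance: int    # 1 (fully compliant) .. 3 (no standard)
    soundness: int              # 1 (specs provided, compliant) .. 3 (none)

@dataclass
class RelativeQuality:          # assigned by the user, relative to a task X
    has_needed_information: bool
    has_proper_size: bool
    relevant_selection: bool

provider_view = AbsoluteQuality(standard_compliance=2, soundness=1)
task_fit = RelativeQuality(True, True, False)         # task-relevance
environment_fit = RelativeQuality(True, False, True)  # environment-relevance
```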
Quantity
• The DLU team did not try to formulate any quantitative requirements
• We have tried to do this in the context of the NEMLAR project; see below for our tentative figures
• Statistical approaches can swallow any amount of resources, and minimal figures are very hard to find
• Our figure-finding exercise has been very much example-driven
Standards
• Few formal standards exist so far (cf. Romary & Ide at the LREC 2004 workshop; Monachini et al., 2003)
• Evolving de facto standards include:
  • Bottom-up work by committees (TEI)
  • Top-down actions:
    • Projects aiming at standards (e.g. EAGLES, ISLE)
    • Example-setting R&D projects (e.g. WordNet, SpeechDat, MULTEXT)
• Our position: any standard is better than no standard at all
Defining a BLARK
• Work carried out in the context of the NEMLAR project (www.nemlar.org), aimed at Arabic resources
• The work described here is based on the project deliverables (see site), summarized in an article by Maegaard, Krauwer, Choukri and Damsgaard presented at the NEMLAR conference in Cairo (September 2004)
Approach adopted
• Same strategy as the Dutch Language Union (applications => modules => resources)
• But with different results, because of differences in the social/economic situation and in language structure
• Results follow, in terms of global definitions and tentative size indications (no specs provided at this stage, but the project is still ongoing)
• Feedback is welcome!
Written resources (1)
• Lexicon:
  • For all components: 40,000 stems with POS & morphology
  • For sentence boundary detection: list of conjunctions and other sentence starters/stoppers
  • For named entity recognition: 50,000 human proper names
  • For semantic analysis: the same 40,000 stems, with subcategorization and shallow lexical-semantic information; possibly a WordNet
Written resources (2)
• Bi-/multilingual lexicon:
  • Same size as monolingual
• Thesauri, ontologies, wordnets:
  • Thesaurus subtree with ca. 200-300 nodes for each domain
  • Ontologies and wordnets ideally the same size as the lexicon
Written resources (3)
• Corpora:
  • For term extraction: 100 million words, unannotated
  • For small applications: 0.5 million words, annotated
  • For statistical POS tagger: 1-3 million words (annotated)
  • Sentence boundary detection: 0.5-1.5 million (annotated)
  • Named entity recognition (statistics-based): 1.5 million (annotated)
  • Term extraction: 100 million (annotated)
  • Co-reference resolution: 1 million (annotated)
  • WSD: 2-3 million (annotated)
Written resources (4)
• Multilingual corpora:
  • For alignment: 0.5 million words (tagged)
• Multimodal corpora:
  • For OCR (printed): ??
  • For OCR (hand-written): ??
Spoken resources (1)
• Acoustic data:
  • For dictation: 50-100 speakers, 20 minutes each, fully transcribed, plus 10 speakers for testing
  • For telephony: 500 speakers uttering 50 different sentences (SpeechDat/OrienTel-based)
  • For embedded speech recognition: data similar to Speecon
  • For broadcast news transcription: 50-100 hours, well annotated, plus 1,000 hours of non-transcribed data; should come with 300 million words of non-annotated written text
Spoken resources (2)
• Acoustic data (cont'd):
  • For conversational speech: data similar to CallHome/CallFriend from the LDC
  • For speaker recognition: 500 speakers for training, 3 minutes each, transcribed, plus 100 speakers for testing
  • For language/dialect identification: data similar to CallFriend, or from Broadcast News (especially for variants of Arabic)
  • For speech synthesis: male and female speakers, 15 hours, using a read, phonetically balanced text
  • For formant synthesis: same as above, with hand-labelled formants
Spoken resources (3)
• Multimodal corpora:
  • For lip-movement reading: similar to M2VTS, with some 50 faces
• Written corpora for speech technologies:
  • General: 300 million words, unannotated, preferably broadcast news or other press and media sources
  • For phonetic lexicon and language models: 1-5 million words, annotated
  • For Arabic: vowelized and non-vowelized corpus
What next? (1)
• Check the definition and quantification for completeness and consistency, and correct where necessary
• Try to provide specifications for every single item
• Try to differentiate between general and Arabic-specific elements in definitions and specs
What next? (2)
• For each language:
  • Take the BLARK definition and specs
  • Adapt them to local conditions
  • Make a survey of what exists and what has to be made (a gap-analysis sketch follows below)
  • Find the funds and build the BLARK for your language
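A hypothetical sketch of the survey step: compare a (heavily simplified) BLARK definition against what the local survey found, and list the components still to be built. The component names and survey data are invented for illustration.

```python
# A toy BLARK definition: the components every language is assumed to need.
blark_definition = {"lexicon", "annotated_corpus", "treebank",
                    "pos_tagger", "speech_corpus"}

# Invented survey result: component -> True if it already exists in a
# usable form for this language.
survey = {
    "lexicon": True,
    "annotated_corpus": True,
    "treebank": False,
    "pos_tagger": False,
    "speech_corpus": True,
}

# The gap is everything the definition requires that the survey did not find.
missing = sorted(c for c in blark_definition if not survey.get(c, False))
print("Still to be built:", ", ".join(missing))
# -> Still to be built: pos_tagger, treebank
```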
Prescriptive / descriptive
• Prescriptive:
  • the BLARK definition tells you which ingredients you need
  • the specification tells you what they should look like
• Descriptive:
  • a BLARK instantiation comes with a description of its components
Main beneficiaries (1)
• Academic and industrial researchers: material to try out ideas and conduct pilot studies
• Industrial developers: only for generic activities, since specific applications require more user and domain orientation
• Educators: material for experimental work by students in labs
Main beneficiaries (2)
• Probably not the main languages in Europe (English, French, German), as they are pretty well covered anyway
• Mostly the languages that are not supported by a strong market (because of small size or a poor economy)
References
• Binnenpoorte et al., LREC 2002 (see also www.elsnet.org/dox/lrec2002-binnenpoorte.pdf)
• ELRA Newsletter, vol. 3, no. 2, 1998 (see also www.elsnet.org/blark.html)
• NEMLAR: see www.nemlar.org for:
  • the Arabic BLARK Report
  • the NEMLAR presentation at the Cairo conference
• Romary & Ide, LREC 2004 (see also www.elsnet.org/lrec2004-roadmap/Romary-Ide.ppt)
Concluding remarks
• The BLARK aims at providing a common definition of the notion 'minimal set of resources'
• It should help language communities come closer to a level playing field, in spite of market forces
• It should facilitate porting of expertise
• It is necessarily dynamic, as technologies evolve rapidly
Thanks!
Contact: steven.krauwer@elsnet.org