
The Basic Language Resources Kit (BLARK)

The Basic Language Resources Kit (BLARK). Steven Krauwer, Utrecht Institute of Linguistics UiL OTS / ELSNET. Overview: the BLARK Enterprise; how to arrive at it; the Dutch Language Union approach; refining the concept; defining a BLARK; main beneficiaries; references; concluding remarks.



  1. The Basic Language Resources Kit (BLARK) Steven Krauwer Utrecht Institute of Linguistics UiL OTS / ELSNET steven.krauwer@elsnet.org

  2. Overview • The BLARK Enterprise • How to arrive at it • The Dutch Language Union approach • Refining the concept • Defining a BLARK • Main beneficiaries • References • Concluding remarks

  3. The BLARK Enterprise • Define the minimal set of language resources that is necessary to do any precompetitive R&D and professional education at all for a language (the Basic Language Resources Kit or BLARK) • Determine for each language which components are already available • Make a priority plan to complete the BLARK for each language • Ensure funding to get the work done

  4. What are the components of a BLARK? • Lexicons (monolingual, multilingual, …) • Corpora (language, speech; annotated, unannotated; mono- and multilingual; mono- and multimodal; …) • Tools (annotation, exploration, …) • Modules (lemmatizers, parsers, speech recognizers, TTS, transcribers, translation, …) • …

  5. What makes the BLARK Enterprise special? • The idea is to make a common generic BLARK definition, in principle applicable to all languages • The common definition will be based on experience with different languages, and will prevent reinventing the wheel • The common definition will ensure interoperability and interconnectivity (especially for multilingual or cross-lingual applications)

  6. Other benefits • Experience from other languages will help in making cost estimates • Adoption of a BLARK common to all languages may help in persuading funders to support the creation of the BLARK • Adoption of a common BLARK may facilitate porting of knowledge and expertise between languages

  7. Words of caution • A BLARK definition will evolve over time, as new applications, application environments and technologies come up • A BLARK definition should be seen as a template rather than a dictate, as different languages may have different specific requirements • BLARK completion priorities may differ from language to language (e.g. on economic, social or political grounds)

  8. How to define a BLARK and assign priorities • Methodology proposed by the Dutch Language Union [DLU] (Binnenpoorte et al., LREC 2002): • Identify a number of typical applications • Determine for each of them which technologies (modules) are needed to build them, and how important each is (-, +, ++, +++) • Identify for each module which resources it requires, and how important each is (-, +, ++, +++) • Assign the highest priority to the resources that support the most applications
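
The applications => modules => resources steps on slide 8 can be sketched in Python as below. This is only an illustrative sketch: the function name `rank_resources`, the numeric mapping of the -/+/++/+++ ratings, the tie-breaking rule and all example applications, modules and ratings are assumptions made here, not taken from the DLU report.

```python
from collections import defaultdict

# Numeric reading of the slide's ratings; the 0..3 mapping is an assumption.
RATING = {"-": 0, "+": 1, "++": 2, "+++": 3}


def rank_resources(app_modules, module_resources):
    """Rank resources by the number of applications they support.

    app_modules:      {application: {module: rating}}
    module_resources: {module: {resource: rating}}
    A resource supports an application when some module links the two with a
    non-zero rating on both steps. Ties are broken by the summed product of
    the ratings (a hypothetical rule, not part of the published method).
    """
    supported = defaultdict(set)
    weight = defaultdict(int)
    for app, modules in app_modules.items():
        for module, module_rating in modules.items():
            for resource, resource_rating in module_resources.get(module, {}).items():
                w = RATING[module_rating] * RATING[resource_rating]
                if w > 0:
                    supported[resource].add(app)
                    weight[resource] += w
    return sorted(supported, key=lambda r: (-len(supported[r]), -weight[r], r))


# Hypothetical example data, for illustration only.
app_modules = {
    "dictation": {"speech recognizer": "+++", "tokeniser": "+"},
    "translation aid": {"parser": "++", "tokeniser": "++"},
    "information retrieval": {"tokeniser": "++", "parser": "+"},
}
module_resources = {
    "speech recognizer": {"speech corpus": "+++", "lexicon": "++"},
    "tokeniser": {"lexicon": "+", "text corpus": "++"},
    "parser": {"treebank": "+++", "lexicon": "++"},
}

ranking = rank_resources(app_modules, module_resources)
print(ranking)
```

With these made-up ratings the lexicon comes first, because all three applications depend on it through at least one module.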

  9. Proposed DLU priorities for NLP • treebank • robust parsers • tokenisation and named entity recognition • semantic annotations for the treebank • translation equivalents • evaluation benchmarks

  10. Proposed DLU priorities for speech • automatic speech recognition • application-specific speech corpora • multi-media speech corpora • tools for transcription of speech data • speech synthesis • benchmarks for evaluation

  11. Next steps by DLU • Make a survey of what exists and to what extent it is available (0-9 availability score) • Assign priorities (not just to resources but also to an infrastructure for maintenance and distribution) • Secure funding from the Dutch and Flemish governments for a national programme • Issue calls for proposals for collaborative resources projects (1st call closed Nov 2 2004)

  12. Refining the concept • Items not really covered by the DLU teams: • definition vs specification • availability • quality • quantity • standards • support • Addressed in the NEMLAR project

  13. Definition / specification • It is not enough to say ‘a written language corpus’; what about: • size (types, tokens) • encoding • annotation • text types • representativity • domains • i.e. we need full specs

  14. Availability • DLU: 0-9 scale, very impressionistic • Our proposal: 3 dimensions • accessibility • cost • modifiability • to each we assign a penalty score (0 is best)

  15. Accessibility • 3 classes, with associated penalties • (3) existing, but only company-internal • (2) existing and freely usable for precompetitive research • (1) existing and freely usable for all R&D

  16. Cost • 4 cost categories: • (4) price over 10 keuro • (3) price between 1 and 10 keuro • (2) price between 100 and 1000 euro • (1) less than 100 euro

  17. Modifiability • 3 categories • (3) black box: you get the resources as they are, and you cannot change or even inspect their internals • (2) glass box: you can’t change them, but you can see what is inside • (1) open resources: freely manipulable

  18. Comments on availability • we can now express availability as a 3-digit score (accessibility, cost, modifiability), which should be rather easy to assign objectively • the lowest scores are the best • if the accessibility score is 3, the other scores don’t mean very much
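
As a rough illustration of the 3-digit availability score from slides 14 to 18, the sketch below encodes the three penalty tables in Python. The function names, the dictionary keys and the handling of prices that fall exactly on a category boundary (100, 1000 and 10 000 euro) are assumptions; the slides do not say which side the boundaries belong to.

```python
# Penalty tables from slides 15 and 17; the key strings are labels chosen here.
ACCESSIBILITY = {"company-internal": 3, "precompetitive research": 2, "all R&D": 1}
MODIFIABILITY = {"black box": 3, "glass box": 2, "open": 1}


def cost_penalty(price_eur: float) -> int:
    """Map a price in euro to the cost category of slide 16.

    Boundary handling is an assumption: exactly 10 000 euro counts as
    category 3, exactly 1000 euro as category 3, exactly 100 euro as category 2.
    """
    if price_eur > 10_000:
        return 4
    if price_eur >= 1_000:
        return 3
    if price_eur >= 100:
        return 2
    return 1


def availability_score(access: str, price_eur: float, modifiability: str) -> str:
    """Return the (accessibility, cost, modifiability) digits; lower is better."""
    return f"{ACCESSIBILITY[access]}{cost_penalty(price_eur)}{MODIFIABILITY[modifiability]}"


# Hypothetical resources, for illustration only.
print(availability_score("all R&D", 250, "glass box"))
print(availability_score("company-internal", 20_000, "black box"))
```

Keeping the score as a string of three digits, rather than summing the penalties, preserves the point made on slide 18 that an accessibility penalty of 3 makes the other two digits largely irrelevant.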

  19. Quality • We distinguish two types of quality: absolute (i.e. an inherent property of the resource) and relative (i.e. in relation to how you want to use it): • Absolute: standard-compliance and soundness • Relative: task-relevance and environment-relevance

  20. Standard-compliance • criterion: to what extent is the resource based on a common standard (formal or de facto) • possible values (penalty based): • (3) no standard • (2) standard, but not fully compliant • (1) standard and fully compliant

  21. Soundness • criterion: to what extent is the resource based on well-defined specifications • values: • (3) no specifications provided • (2) specs provided, but not fully compliant • (1) specs provided, fully compliant

  22. Task-relevance • criterion (relative): to what extent is the resource suited for a specific task X • values (3 binary values): • contains all information needed for X (yes/no) • has the proper size for X (yes/no) • based on a relevant selection of items for X (yes/no)

  23. Environment-relevance • criterion: to what extent is the resource interoperable with its environment (other resources) • values (3 binary values): • information matches (yes/no) • size matches (yes/no) • selection matches (yes/no)

  24. Comments on quality • We can now express absolute quality objectively in terms of a pair of scores (standard-compliance, soundness); this score can be assigned by the provider • and relative quality (for our own purposes) in terms of two triples of yes/no answers (task-relevance, environment-relevance); this score can only be assigned by the user • other attributes may be added as long as they can be objectively assigned
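
One possible way to record the quality scores of slides 19 to 24 is sketched below. The class names and the `fully_suitable` helper are hypothetical conveniences introduced here; the slides define only the pair of penalty scores and the two triples of yes/no answers.

```python
from dataclasses import dataclass
from typing import NamedTuple


class Relevance(NamedTuple):
    """Three yes/no answers, as on slides 22 and 23."""
    information: bool
    size: bool
    selection: bool


@dataclass(frozen=True)
class AbsoluteQuality:
    """Pair of penalty scores assigned by the provider (1 is best, 3 is worst)."""
    standard_compliance: int
    soundness: int

    def __post_init__(self):
        for value in (self.standard_compliance, self.soundness):
            if value not in (1, 2, 3):
                raise ValueError("penalty scores must be 1, 2 or 3")


@dataclass(frozen=True)
class RelativeQuality:
    """Two triples of yes/no answers assigned by the user for one task."""
    task: Relevance
    environment: Relevance

    def fully_suitable(self) -> bool:
        # Hypothetical helper: true only when all six answers are "yes".
        return all(self.task) and all(self.environment)


# Hypothetical assessment of one resource, for illustration only.
abs_q = AbsoluteQuality(standard_compliance=1, soundness=2)
rel_q = RelativeQuality(
    task=Relevance(information=True, size=True, selection=False),
    environment=Relevance(information=True, size=True, selection=True),
)
print(abs_q, rel_q.fully_suitable())
```

Splitting the record into two types mirrors the slide's distinction between what the provider can state once and what each user has to judge per task.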

  25. Quantity • The DLU team did not try to formulate any quantitative requirements • We have tried to do this in the context of the NEMLAR project; see below for our tentative figures • Statistical approaches can swallow any amount of resources, and minimal figures are very hard to find • Our figure-finding exercise has been very much example-driven

  26. Standards • Very few formal standards exist, although there are some (cf. Romary & Ide at the LREC 2004 workshop; Monachini et al., 2003) • Evolving de facto standards include: • Bottom-up work by committees (TEI) • Top-down actions: • Projects aiming at standards (e.g. EAGLES, ISLE) • Example-setting R&D projects (e.g. WordNet, SpeechDat, MULTEXT) • Our position: any standard is better than no standard at all

  27. Defining a BLARK • Work carried out in the context of the NEMLAR project (www.nemlar.org), aimed at Arabic resources • Work described here is based on project deliverables (see site), summarized in an article by Maegaard, Krauwer, Choukri, Damsgaard presented at the NEMLAR conference in Cairo (Sep 2004)

  28. Approach adopted • Same strategy as the Dutch Language Union (applications => modules => resources) • But with different results because of differences in social/economic situation and in language structure • Results follow, in terms of global definitions and tentative size indications (no specs provided at this stage, but the project is still ongoing) • Feedback is welcome!

  29. Written resources (1) • Lexicon: • For all components: 40 000 stems with POS & morphology • For sentence boundary detection: list of conjunctions and other sentence starters/stoppers • For named entity recognition: 50 000 human proper names • For semantic analysis: same 40 000, with subcategorization, shallow lexical semantic info; possibly a WordNet

  30. Written resources (2) • Bi-/Multilingual lexicon • Same size as monolingual • Thesauri, ontologies, wordnets: • Thesaurus subtree with ca 200-300 nodes for each domain • Ontologies and wordnets ideally same size as lexicon

  31. Written resources (3) • Corpora: • For term extraction: 100 million words unannotated • For small applications: 0.5 million words annotated • For statistical POS tagger: 1-3 million (ann) • Sentence boundary: 0.5-1.5 million (ann) • Named entity (stat based): 1.5 million (ann) • Term extraction: 100 million (ann) • Co-reference resolution: 1 million (ann) • WSD: 2-3 million (ann)

  32. Written resources (4) • Multilingual corpora: • For alignment: 0.5 million (tagged) • Multimodal corpora: • For OCR (printed): ?? • For OCR (hand-written): ??

  33. Spoken resources (1) • Acoustic data: • For dictation: 50-100 speakers, 20 min each, fully transcribed, plus 10 speakers for testing • For telephony: 500 speakers uttering 50 different sentences (SpeechDat, OrienTel based) • For embedded speech recognition: data similar to Speecon • For broadcast news transcription: 50-100 hours well-annotated, plus 1000 hours of non-transcribed data; should come with 300 million words of non-annotated written text

  34. Spoken resources (2) • Acoustic data (cont’d): • For conversational speech: data similar to CallHome/CallFriend from LDC • For speaker recognition: 500 speakers for training, 3 minutes each, transcribed, plus 100 speakers for testing • For language/dialect identification: data similar to CallFriend, or from Broadcast News (esp. for variants of Arabic) • For speech synthesis: male and female speakers, 15 hours, using a read text, phonetically balanced • For formant synthesis: same as above, with hand-labelled formants

  35. Spoken resources (3) • Multimodal corpora: • For lip-movement reading: similar to M2VTS, with some 50 faces • Written corpora for speech technologies: • General: 300 million words unannotated, preferably broadcast news or other press and media sources • For phonetic lexicon and language models: 1-5 million words, annotated • For Arabic: vowelized and non-vowelized corpus

  36. What next? (1) • Check definition and quantification for completeness and consistency, and correct where needed • Try to provide specs for every single item • Try to differentiate between general and Arabic-specific items in definitions and specs

  37. What next? (2) • For each language: • Take the BLARK definition and specs • Adapt to local conditions • Make a survey of what exists and what has to be made • Find the funds and build the BLARK for your language

  38. Prescriptive / descriptive • Prescriptive: • the BLARK definition tells you which ingredients you need • the specification tells you what they should look like • Descriptive: • a BLARK instantiation comes with a description of its components

  39. Main beneficiaries (1) • academic and industrial researchers: material to try out ideas and conduct pilot studies • industrial developers: only for generic activities, since specific applications require more user and domain orientation • educators: material for experimental work by students in labs

  40. Main beneficiaries (2) • probably not the main languages in Europe (EN, FR, GE) as they are pretty well covered anyway • mostly the languages that are not supported by a strong market (because of small size or poor economy)

  41. References • Binnenpoorte et al. at LREC 2002 (see also www.elsnet.org/dox/lrec2002-binnenpoorte.pdf) • ELRA Newsletter vol 3, n 2, 1998 (see also www.elsnet.org/blark.html) • NEMLAR: see www.nemlar.org for • Arabic BLARK Report • NEMLAR presentation at Cairo conference • Romary & Ide at LREC 2004 (see also www.elsnet.org/lrec2004-roadmap/Romary-Ide.ppt)

  42. Concluding remarks • The BLARK aims at providing a common definition of the notion ‘minimal set of resources’ • It should help language communities to come closer to the idea of creating an equal playing field, in spite of market forces • It should facilitate porting of expertise • It is necessarily dynamic, as technologies evolve rapidly

  43. Thanks! Contact: steven.krauwer@elsnet.org
