Language Resources in the Balkan area
Elina Desipri, Maria Gavrilidou, Penny Labropoulou
Institute for Language and Speech Processing
Workshop on Balkan Language Resources and Tools, Thessaloniki, 21.11.2003
Goal of the survey
• to provide a map of the results of National Projects on LRs
• to identify
  • the existing resources,
  • the gaps that need to be filled in,
  • the R&D priorities
• to harmonize their descriptions, and finally,
• to lead to a common meta-data schema for their description.
Broader scope
• observed fact: most resources are produced by combinations of funding sources
• decision: to broaden the scope of the Survey to include activities funded by other bodies as well (European, National, internal [i.e. own funds], industrial funding)
Questionnaire
The survey was conducted by means of a questionnaire covering
• Written corpora
• Lexica
• Spoken corpora
• Multimodal corpora
• Tools
Design of the questionnaire
• Two basic units:
  • the description of the producing organization, and
  • the description of the resource itself.
Each resource description is divided into the following blocks:
• External information
  • Administrative data
  • Creation data
  • Distribution data
• Internal information
  • Resource data (refers to the resource as a whole)
  • Document data (refers to each unit of the resource)
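The two basic units and the block structure of a resource description can be pictured as nested records. The following is only a minimal sketch: the class and field names (`Organization`, `ExternalInfo`, `InternalInfo`, `ResourceDescription`, and the use of free-form dictionaries for each block) are assumptions made for illustration and are not taken from the actual questionnaire.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Organization:
    """First basic unit: the producing organization (fields are hypothetical)."""
    name: str
    country: str


@dataclass
class ExternalInfo:
    """External information: administrative, creation and distribution data."""
    administrative: Dict[str, str] = field(default_factory=dict)
    creation: Dict[str, str] = field(default_factory=dict)
    distribution: Dict[str, str] = field(default_factory=dict)


@dataclass
class InternalInfo:
    """Internal information: data on the resource as a whole and on each unit."""
    resource_data: Dict[str, str] = field(default_factory=dict)
    document_data: List[Dict[str, str]] = field(default_factory=list)


@dataclass
class ResourceDescription:
    """Second basic unit: the description of the resource itself."""
    producer: Organization
    external: ExternalInfo = field(default_factory=ExternalInfo)
    internal: InternalInfo = field(default_factory=InternalInfo)
```

Keeping the organization separate from the resource description mirrors the split into two basic units: one producer record can then be shared by several resource descriptions.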
Resource production cycle
The questionnaire covers the stages of resource
• production,
• validation,
• distribution.
Goal of the description schema
to cover
• the point of view of the LR producers: which elements provide the most accurate description of the resource? and
• the point of view of the prospective users: which elements constitute the most informative data, facilitating the formulation of queries to identify the resource that best meets their needs?
Focusing on Balkan LRs
• The questionnaire was sent to 134 organizations
• 13 of which are located in Balkan countries
• 4 from Greece, 1 from Croatia, 1 from Romania and 1 from Bulgaria
• Data for 66 distinct resources was collected.
Background
• Identification of LR actors based on
  • Previous surveys conducted by ELRA/ELDA, Euromap, ELSNET etc.
  • Web sites, e.g. HLTCentral, Content Village, Linguist and Corpora List
  • Projects focusing on Balkan LRs, e.g. TRACTOR/TELRI, MULTEXT-East
General comments – types and languages
• The majority of the LRs are monolingual
• Balkan organizations mainly provide LRs for their own languages
• multilingual resources also include major European languages
• Very few spoken resources have been identified, and no multimodal ones.
General comments – language coverage
• comparable percentages of domain-specific (47%) and general language (39%) LRs
• multilingual resources are mostly domain-specific, because:
  • multilingual domain-specific LRs are the product of application-specific projects
  • national funding usually yields general language monolingual resources
General remarks – funding
• most of the resources are funded by national authorities
  • which leads mostly to the production of general language LRs for national languages
• industrial funding is almost absent
Remarks on availability
• LRs are available predominantly for research
• however, there is an increasing tendency to make LRs available for commercial use, which should be reinforced through
  • more emphasis on user needs from the producers' side
  • more relaxed restrictions from the sources' side
• LRs are mostly distributed by the producers themselves
Identification of existing LRs
• Well-known fact: there exists a large number of LR producers
• However, it is difficult
  • to locate them (and to convince them to fill in questionnaires…) and
  • to examine whether a resource is suitable for a specific need
The ENABLER information site:
• provides a catalogue of LRs with the collected information,
• provides the users with a search facility for accessing the data collected, facilitating (hopefully) the process of LR identification,
• provides the LR producers with an updating facility for the constant enrichment and updating of this catalogue,
• disseminates the results of the survey and finally,
• promotes awareness of LRs.
ENABLER network service
• mechanism aiming to aid prospective users with the identification of LRs
  • NOT to obtain the resources themselves,
  • but to acquire information on the resource and the contact person details
• URL: http://www.ilsp.gr/enabler
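The idea of a catalogue that returns descriptive information and contact details, rather than the resources themselves, can be sketched as a simple filter over metadata records. This is purely an illustrative sketch and not the ENABLER implementation: `CatalogueEntry`, its fields, and the `search` function are hypothetical names, and the real service's query options are not described in these slides.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class CatalogueEntry:
    """Metadata about a resource; the resource itself is never stored here."""
    name: str
    resource_type: str           # e.g. "written corpus", "lexicon", "tool"
    languages: Tuple[str, ...]   # languages covered by the resource
    contact: str                 # contact person details


def search(
    catalogue: Iterable[CatalogueEntry],
    resource_type: Optional[str] = None,
    language: Optional[str] = None,
) -> List[CatalogueEntry]:
    """Return the entries matching every criterion that was supplied."""
    results = []
    for entry in catalogue:
        if resource_type is not None and entry.resource_type != resource_type:
            continue
        if language is not None and language not in entry.languages:
            continue
        results.append(entry)
    return results
```

A user looking for, say, a lexicon covering a given language would call `search(catalogue, resource_type="lexicon", language=...)` and then approach the listed contact to obtain the resource.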
INTERA project
• eContent project
• To build an integrated European Language resource area by connecting LR data centers
• To produce new multilingual resources
• Focusing on less widely spoken languages (obviously Balkan, eastern and southern European languages are included)
INTERA project multilingual resources
• Parallel corpora
• Terminological resources
• In the domains of
  • Health
  • Education
  • Law, and
  • Tourism
Co-operation sought
• Provision of LRs (level of processing)
• Production of parallel corpora
• Processing of corpora
• Validation of resources