
Workshop on Balkan Language Resources and Tools Thessaloniki, 21.11.2003


Presentation Transcript


  1. Language Resources in the Balkan area. Elina Desipri, Maria Gavrilidou, Penny Labropoulou, Institute for Language and Speech Processing. Workshop on Balkan Language Resources and Tools, Thessaloniki, 21.11.2003

  2. Goal of the survey
     • to provide a map of the results of National Projects on LRs
     • to identify
        • the existing resources,
        • the gaps that need to be filled in,
        • the R&D priorities
     • to harmonize their descriptions, and finally,
     • to lead to a common meta-data schema for their description.

  3. Broader scope
     • observed fact: most resources are produced by combinations of funding sources
     • decision: to broaden the scope of the Survey, in order to include activities also funded by other bodies (European, National, internal [i.e. own funds], industrial funding)

  4. Questionnaire
     The survey was conducted by means of a questionnaire covering
     • Written corpora
     • Lexica
     • Spoken corpora
     • Multimodal corpora
     • Tools

  5. Design of the questionnaire
     • Two basic units:
        • the description of the producing organization, and
        • the description of the resource itself.
     Each resource description is divided into the following blocks:
     • External information
        • Administrative data
        • Creation data
        • Distribution data
     • Internal information
        • Resource data (refers to the resource as a whole)
        • Document data (refers to each unit of the resource)
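The questionnaire design above can be read as a small data model. The following Python sketch is only one possible illustration: the class names, the `name` and `country` fields, and the free-form dictionaries inside each block are assumptions, since the slide names the blocks but not their individual fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Resource categories covered by the questionnaire (slide 4).
RESOURCE_TYPES = (
    "written corpus",
    "lexicon",
    "spoken corpus",
    "multimodal corpus",
    "tool",
)


@dataclass
class Organization:
    """First basic unit: the producing organization (field names are hypothetical)."""
    name: str
    country: str


@dataclass
class ExternalInformation:
    """External information; the slide names the blocks, not their fields."""
    administrative_data: Dict[str, str] = field(default_factory=dict)
    creation_data: Dict[str, str] = field(default_factory=dict)
    distribution_data: Dict[str, str] = field(default_factory=dict)


@dataclass
class InternalInformation:
    """Internal information about the resource and its units."""
    # Describes the resource as a whole.
    resource_data: Dict[str, str] = field(default_factory=dict)
    # One dictionary per unit (document) of the resource.
    document_data: List[Dict[str, str]] = field(default_factory=list)


@dataclass
class ResourceDescription:
    """Second basic unit: the description of the resource itself."""
    producer: Organization
    resource_type: str
    external: ExternalInformation = field(default_factory=ExternalInformation)
    internal: InternalInformation = field(default_factory=InternalInformation)

    def __post_init__(self) -> None:
        # Reject categories the questionnaire does not cover.
        if self.resource_type not in RESOURCE_TYPES:
            raise ValueError(f"unknown resource type: {self.resource_type!r}")


# Hypothetical example entry; the names and values are invented for illustration.
example = ResourceDescription(
    producer=Organization(name="Example Institute", country="Greece"),
    resource_type="lexicon",
)
example.internal.resource_data["languages"] = "Greek"
```

Keeping the external and internal blocks as separate objects mirrors the split on the slide and lets each block be filled in independently of the others.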

  6. Resource production cycle
     The questionnaire covers the stages of resource
     • production,
     • validation,
     • distribution.

  7. Goal of the description schema
     To cover
     • the point of view of the LR producers: which elements provide the most accurate description of the resource? and
     • the point of view of the prospective users: which elements constitute the most informative data that would facilitate the formulation of queries in order to identify the most appropriate resource which meets their needs?

  8. Focusing on Balkan LRs
     • The questionnaire was sent to 134 organizations
     • 13 of which are located in Balkan countries
     • 4 from Greece, 1 from Croatia, 1 from Romania and 1 from Bulgaria
     • Data for 66 distinct resources was collected.

  9. Background
     • Identification of LR actors based on
        • Previous surveys conducted by ELRA/ELDA, Euromap, ELSNET etc.
        • Web sites, e.g. HLTCentral, Content Village, Linguist and Corpora List etc.
        • Projects focusing on Balkan LRs, e.g. TRACTOR/TELRI, MULTEXT-East etc.

  10. General Comments – types and languages
     • The majority of the LRs are monolingual
     • Balkan organizations mainly provide LRs for their own languages
     • multilingual resources also include major European languages
     • Very few spoken resources have been identified and no multimodal ones.

  11. General comments – language coverage
     • comparable percentages of domain-specific (47%) and general-language (39%) LRs
     • multilingual resources are mostly domain-specific, because:
        • multilingual domain-specific LRs are the product of application-specific projects
        • national funding usually yields general-language monolingual resources

  12. General remarks - funding
     • most of the resources are funded by national authorities
        • which leads mostly to the production of general-language LRs for national languages
     • industrial funding is almost absent

  13. Remarks on availability
     • LRs are available predominantly for research
     • however, there is an increasing tendency to make LRs available for commercial use, which should be reinforced by
        • more emphasis on user needs on the producers’ side
        • more relaxed restrictions on the sources’ side
     • LRs are mostly distributed by the producers themselves

  14. Identification of existing LRs
     • Well-known fact: there exists a large number of LR producers
     • However, it is difficult to locate them (and to convince them to fill in questionnaires…) and to examine whether a resource is suitable for a specific need

  15. The ENABLER information site:
     • provides a catalogue of LRs with the collected information,
     • provides the users with a search facility for accessing the data collected, facilitating (hopefully) the process of LR identification,
     • provides the LR producers with an updating facility for the constant enrichment and updating of this catalogue,
     • disseminates the results of the survey and finally,
     • promotes awareness of LRs.

  16. ENABLER network service
     • mechanism aiming to aid prospective users with the identification of LRs
        • NOT to obtain the resources themselves,
        • but to acquire information on the resource and the contact person details
     • URL: http://www.ilsp.gr/enabler
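The kind of lookup described for the ENABLER catalogue can be sketched in a few lines of Python. This is not the actual service: the entry fields, the `search` function and the sample entries are all invented for illustration. The sketch only shows the stated idea that a query returns information about a resource and its contact person, not the resource itself.

```python
from typing import Dict, List, Optional

# Invented sample entries; they do not describe real resources from the survey.
CATALOGUE: List[Dict[str, str]] = [
    {"name": "Sample written corpus", "type": "written corpus",
     "language": "Greek", "contact": "contact-a@example.org"},
    {"name": "Sample lexicon", "type": "lexicon",
     "language": "Bulgarian", "contact": "contact-b@example.org"},
    {"name": "Sample parallel corpus", "type": "written corpus",
     "language": "Romanian", "contact": "contact-c@example.org"},
]


def search(
    catalogue: List[Dict[str, str]],
    resource_type: Optional[str] = None,
    language: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Return name and contact details of entries matching every given criterion."""
    hits = []
    for entry in catalogue:
        if resource_type is not None and entry["type"] != resource_type:
            continue
        if language is not None and entry["language"] != language:
            continue
        # Only descriptive information is returned, never the resource itself.
        hits.append({"name": entry["name"], "contact": entry["contact"]})
    return hits
```

Under these assumptions, `search(CATALOGUE, resource_type="written corpus")` would return the two sample corpora, and adding `language="Greek"` would narrow the result to the first one.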

  17. INTERA project
     • eContent project
     • To build an integrated European language resource area by connecting LR data centers
     • To produce new multilingual resources
     • Focusing on less widely spoken languages (obviously Balkan, eastern and southern European languages are included)

  18. INTERA project multilingual resources
     • Parallel corpora
     • Terminological resources
     • In the domains of
        • Health
        • Education
        • Law, and
        • Tourism

  19. Co-operation sought
     • Provision of LRs (level of processing)
     • Production of parallel corpora
     • Processing of corpora
     • Validation of resources
