
Introduction to the Monolingual and Domain-Specific Tasks of the Cross-Language Evaluation Forum 2003

Michael Kluck (Informationszentrum Sozialwissenschaften – IZ, Bonn/Berlin; Humboldt-University Berlin), kluck@bonn.iz-soz.de


Presentation Transcript


  1. Introduction to the Monolingual and Domain-Specific Tasks of the Cross-Language Evaluation Forum 2003
     Michael Kluck (Informationszentrum Sozialwissenschaften – IZ, Bonn/Berlin; Humboldt-University Berlin), kluck@bonn.iz-soz.de
     CLEF Workshop, ECDL 2003, Trondheim, 21.-22.08.2003

  2. Monolingual Task
     • Languages:
       • Dutch, Finnish, French, German, Italian, Spanish, Swedish
       • New: Russian (with a reduced topic set, because of the time span of the data)
       • Exclusion of English (widely used in TREC etc.; overflow of runs; open to newcomers only)
     • Aim:
       • Building a starting point for CLIR
       • Enlarging and balancing the pool
       • Use of recently introduced or new languages in the CLEF campaign

  3. Monolingual runs by 22 participants

  4. Domain-Specific Task
     • Amaryllis
       • Could not be continued because of a lack of funding in France
       • The attempt to obtain social science data from INIST failed
     • GIRT
       • New, bigger corpus GIRT4 in German, drawn from social science literature and current research information
       • Parallel corpus in English, although with a smaller amount of text than the German part

  5. Features of GIRT4
     • Bigger than GIRT3, now: 302,638 documents
       • 151,319 original German
       • 151,319 translated into English
     • Pseudo-parallel corpus:
       • Title, Controlled-Term, Classification-Text available in German and English for all documents
       • Abstract available for 96% of documents in German, but only for 15% in English -> reduced amount of text for the English part
       • Translated texts (Abstract) are sometimes the result of machine translation by SYSTRAN (EU)
     • Documents renumbered

  6. Field Availability in GIRT4
     • Equal distribution for the German and English parts:
       • Title: 1 per doc
       • On average:
         • Controlled-Terms: 10.15 per doc
         • Classification-Text: 2.02 per doc
     • Different distribution for the German and English parts:
       • On average:
         • Method-Term
           • DE: 2.35 per doc
           • EN: 1.93 per doc
         • Abstract
           • DE: 0.96 per doc
           • EN: 0.15 per doc
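
As a rough illustration of these figures, the per-document averages can be turned into approximate corpus-wide totals. The sketch below assumes 151,319 documents per language part (the figure from the "Features of GIRT4" slide). The names `DOCS_PER_LANGUAGE`, `AVERAGES` and `estimated_totals` are illustrative only, and the results are estimates derived from rounded averages, not official counts.

```python
# Approximate field totals for GIRT4, derived from per-document averages.
# Assumption: 151,319 documents in each language part (from the slides).
DOCS_PER_LANGUAGE = 151_319

# Average number of field occurrences per document, as listed on the slide.
AVERAGES = {
    "DE": {
        "Title": 1.0,
        "Controlled-Term": 10.15,
        "Classification-Text": 2.02,
        "Method-Term": 2.35,
        "Abstract": 0.96,
    },
    "EN": {
        "Title": 1.0,
        "Controlled-Term": 10.15,
        "Classification-Text": 2.02,
        "Method-Term": 1.93,
        "Abstract": 0.15,
    },
}


def estimated_totals(averages, docs=DOCS_PER_LANGUAGE):
    """Multiply each per-document average by the document count."""
    return {
        lang: {field: round(avg * docs) for field, avg in fields.items()}
        for lang, fields in averages.items()
    }


totals = estimated_totals(AVERAGES)
for lang, fields in totals.items():
    for field, count in fields.items():
        print(f"{lang} {field}: ~{count:,}")
```

The gap between the German and English abstract estimates shows why the English part offers much less running text to retrieval systems.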

  7. GIRT4 Tasks
     • Monolingual
       • DE topics -> DE data
       • EN topics -> EN data
     • Bilingual
       • EN or RU topics -> DE data
       • DE or RU topics -> EN data
     • Additional instruments
       • German-English thesaurus
       • German-Russian translation table (not fully up to date)
       • Concordance list of document numbers
       • Will be available by the end of August 2003

  8. Assessment of GIRT4
     • 17,031 docs (+65%)
     • Started with the German part
     • Then identified the identical English documents (if they had been indicated as relevant hits)
     • Continued with those hits in the English part that had been indicated as relevant (without having counterparts in the German part)
     • During assessment it turned out that the search results in the two language parts were not fully congruent
     • For a given topic, the hits in the English part were not identical to those in the German part (the assessors did not know which hit belonged to which run)
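
The assessment order described on this slide can be sketched as a small set computation. This is a minimal illustration under stated assumptions, not the procedure actually used by the assessors: the function `plan_assessment`, the mapping `de_to_en`, and the document identifiers in the example are all hypothetical.

```python
def plan_assessment(de_hits, en_hits, de_to_en):
    """Split the pooled hits for one topic into three assessment steps.

    de_hits, en_hits: sets of pooled document ids from each language part.
    de_to_en: dict mapping a German id to its English counterpart id
              (hypothetical; the real id scheme is not given in the slides).
    """
    # Step 1: assess all pooled German documents.
    step1 = sorted(de_hits)
    # English counterparts of the German hits.
    counterparts = {de_to_en[d] for d in de_hits if d in de_to_en}
    # Step 2: English hits identical to an already assessed German document.
    step2 = sorted(en_hits & counterparts)
    # Step 3: English hits with no counterpart among the German hits.
    step3 = sorted(en_hits - counterparts)
    return step1, step2, step3


# Toy example with made-up document ids.
de_hits = {"DE-001", "DE-002", "DE-003"}
en_hits = {"EN-002", "EN-003", "EN-007"}
de_to_en = {"DE-001": "EN-001", "DE-002": "EN-002", "DE-003": "EN-003"}

step1, step2, step3 = plan_assessment(de_hits, en_hits, de_to_en)
# The two pools are congruent only if every hit has a counterpart on the other side.
congruent = not step3 and len(step2) == len(step1)
print(step1, step2, step3, congruent)
```

In the toy data, one English hit has no German counterpart and one German hit has no English counterpart in the pool, which mirrors the lack of congruence reported above.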

  9. GIRT4 runs by 4 participants
