
Eurovoc does not yet exist for your language? The Hungarian experience.




  1. Eurovoc does not yet exist for your language? The Hungarian experience. Tamás Váradi varadi@nytud.hu

  2. Overview of the project • Objectives • Partners • Resources • Methods • Results • Conclusions EUROVOC Indexing Workshop

  3. Project objectives • Hungarian EUROVOC version • only a draft version planned at first • an authoritative full-scale system • Automatic indexing of documents • using the technology developed at JRC • prototype system for one domain

  4. Partners • Project consortium: • HAS RIL (coordinator) • MorphoLogic Kft. (partner) • Collaborators: • JRC, Ispra • Hungarian Parliament • Ministry of Justice

  5. Resources • NLP toolset (RIL) • Digital dictionaries, software technology (MorphoLogic) • Indexing technology (JRC Ispra) • Terminology database, translation, supervision expertise (Justice Ministry) • Coordination funding of Hungarian EUROVOC (Hungarian Parliament)

  6. EUROVOC translation • Done by the Translation Coordination Unit of the Ministry of Justice • Team coordinating the massive effort of preparing the Hungarian translation of the Acquis Communautaire • Maintaining an online Terminological Database

  7. Terminological Database

  8. Translation process • English, French, German & Spanish EUROVOC versions in XML files • Automatic lookup in the Terminological Database (ca. 20% coverage) • Notepad2 XML-aware editor used • micro-thesauri translated first, corresponding descriptors second • pool of experts consulted when needed
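The automatic lookup step can be pictured with a minimal sketch. Only the `LIBELLE_EN` element name comes from the slides; the `THESAURUS` and `RECORD` wrappers, the `TERM_DB` dictionary, and the function names are assumptions made for illustration, not the project's actual tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: only LIBELLE_EN is attested in the slides;
# the THESAURUS and RECORD wrapper elements are assumed.
SAMPLE = """<THESAURUS>
  <RECORD><LIBELLE_EN>suspension of payments</LIBELLE_EN></RECORD>
  <RECORD><LIBELLE_EN>sales promotion</LIBELLE_EN></RECORD>
  <RECORD><LIBELLE_EN>toxic substance</LIBELLE_EN></RECORD>
</THESAURUS>"""

# Hypothetical stand-in for the Terminological Database (English -> Hungarian).
TERM_DB = {"suspension of payments": "kifizetések felfüggesztése"}


def lookup_descriptors(xml_text, term_db):
    """Split English descriptors into those found in term_db and those missing."""
    root = ET.fromstring(xml_text)
    found, missing = {}, []
    for node in root.iter("LIBELLE_EN"):
        term = (node.text or "").strip()
        hit = term_db.get(term.lower())
        if hit is None:
            missing.append(term)  # left for the human translators
        else:
            found[term] = hit
    return found, missing


def coverage(found, missing):
    """Share of descriptors that the lookup could translate automatically."""
    total = len(found) + len(missing)
    return len(found) / total if total else 0.0
```

Descriptors that end up in `missing` are the ones that would go to the translators and the pool of experts.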

  9. Indexing strategies • Corpus: Hungarian translation of the Acquis Communautaire • Two approaches • To translate English associate terms (possible short-cut?) • To reconstruct the generation of associate terms by running the JRC technology on the Hungarian data

  10. Translation of associate terms • Hypothesis: • relation between English associate term and EUROVOC descriptor is language independent • hence Hungarian equivalent of English term will also serve as appropriate associate term in Hungarian texts

  11. Online dictionary lookup • MorphoLogic Online English-Hungarian dictionaries applied • 24.7% direct match • Example: <LIBELLE_EN>suspension of payments</LIBELLE_EN> <LIBELLE_DE>Zahlungseinstellung</LIBELLE_DE> <LIBELLE_FR>cessation de paiement</LIBELLE_FR> <LIBELLE_ES>suspensión de pagos</LIBELLE_ES> <LIBELLE_HU>kifizetések felfüggesztése</LIBELLE_HU>
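A direct match of this kind amounts to adding a `LIBELLE_HU` sibling to a record whose English label is found in the dictionary. The sketch below assumes a `RECORD` wrapper element and a plain Python dict in place of the MorphoLogic dictionaries; both are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_hungarian_label(record, dictionary):
    """Append a LIBELLE_HU element if the English label has a direct match.

    `record` is an Element holding LIBELLE_* children (wrapper name assumed);
    `dictionary` is a hypothetical English -> Hungarian mapping.
    Returns True when a label was added.
    """
    en = record.find("LIBELLE_EN")
    if en is None or en.text is None:
        return False
    hu_text = dictionary.get(en.text.strip().lower())
    if hu_text is None:
        return False  # no direct match; needs manual translation
    hu = ET.SubElement(record, "LIBELLE_HU")
    hu.text = hu_text
    return True


record = ET.fromstring(
    "<RECORD><LIBELLE_EN>suspension of payments</LIBELLE_EN></RECORD>"
)
matched = add_hungarian_label(
    record, {"suspension of payments": "kifizetések felfüggesztése"}
)
```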

  12. Manual check of automatic assignments • Equivalence cannot be judged on its own merits: the Hungarian equivalent must be the one occurring in the texts • the Hungarian terms must be looked up in the translation corpus as well • a parallel corpus aligned at least at the document level must be compiled
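One way to picture this check on a document-aligned corpus: count the aligned pairs whose English side contains the English term, and how many of those also contain the proposed Hungarian term. The function name and the toy data are assumptions; plain substring matching is used here for brevity, although Hungarian inflection means that matching on lemmatised text would be more reliable.

```python
def aligned_support(en_term, hu_term, doc_pairs):
    """Return (both, en_docs) over document-aligned (english, hungarian) pairs.

    en_docs: pairs whose English text contains en_term
    both:    those pairs whose Hungarian text also contains hu_term
    """
    en_docs = 0
    both = 0
    for en_text, hu_text in doc_pairs:
        if en_term.lower() in en_text.lower():
            en_docs += 1
            if hu_term.lower() in hu_text.lower():
                both += 1
    return both, en_docs


# Toy aligned pairs, invented purely for illustration.
pairs = [
    ("Rules on sales promotion apply.", "A reklám szabályai alkalmazandók."),
    ("Sales promotion is restricted.", "Az eladásösztönzés korlátozott."),
    ("Unrelated text.", "Nem kapcsolódó szöveg."),
]
support = aligned_support("sales promotion", "eladásösztönzés", pairs)
```

A low ratio of `both` to `en_docs` suggests that the translators used a different Hungarian term in the actual texts.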

  13. Manual check • Even frequency lists are useful: <LIBELLE_EN>sales promotion</LIBELLE_EN> <LIBELLE_DE>Absatzförderung</LIBELLE_DE> <LIBELLE_FR>promotion commerciale</LIBELLE_FR> <LIBELLE_ES>promoción comercial</LIBELLE_ES> <LIBELLE_HU>eladásösztönzés</LIBELLE_HU> • Reklám 149 • Promóció 60 • Eladásösztönzés 1

  14. Manual check • Even frequency lists are useful: <LIBELLE_EN>toxic substance</LIBELLE_EN> <LIBELLE_DE>Giftstoff</LIBELLE_DE> <LIBELLE_FR>substance toxique</LIBELLE_FR> <LIBELLE_ES>sustancia tóxica</LIBELLE_ES> <LIBELLE_HU>toxikus anyagok</LIBELLE_HU> <LIBELLE_HU>mérgező anyagok</LIBELLE_HU> • Equally frequent
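The frequency comparison on the two slides above can be sketched as follows. The function names and the toy token list are assumptions; candidates may be single words or multi-word phrases, and matching is done on lower-cased tokens.

```python
def count_phrase(tokens, phrase):
    """Count occurrences of a token sequence (list of words) in tokens."""
    n = len(phrase)
    if n == 0:
        return 0
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


def rank_candidates(tokens, candidates):
    """Rank candidate translations by corpus frequency, most frequent first."""
    lowered = [t.lower() for t in tokens]
    scored = [(c, count_phrase(lowered, c.lower().split())) for c in candidates]
    # sorted() is stable, so equally frequent candidates keep their input order
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpus, invented for illustration (not the real frequencies).
toy_tokens = "reklám promóció reklám eladásösztönzés reklám promóció".split()
ranking = rank_candidates(toy_tokens, ["eladásösztönzés", "promóció", "reklám"])
```

When the dictionary's suggestion ranks far below its competitors, as with eladásösztönzés on slide 13, the more frequent term is the better candidate for the Hungarian label.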

  15. Generation of Hungarian associate-lists • Tasks • Compile corpus of the Hungarian translation of the Acquis Communautaire • Tag and lemmatize words • Compile list of stop words • Run automatic indexing tools (JRC)

  16. Hungarian Acquis Communautaire corpus • 8308 files • <!ELEMENT document (title+, text, lemmatised, descriptors, description)> • HUN tokens 21,899,924 • EN tokens 20,394,088

  17. English stop-word list • English stop word list: 1720 items • function words • "EUspeak" • objective, arrangements, committee • Some strange multiword strings: necessary_to_comply_with_this_directive, forward_this_resolution_to_the_commission

  18. Hungarian stop-word list • translated English items • checked their occurrence in HU CELEX • generated unigram, bigram and trigram frequency lists from the HU CELEX corpus • checked the first 3000 items on each list and added them to the stop-word list if needed • double-checked infrequent items on the English translation list and replaced the translation with synonyms
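The n-gram frequency step can be sketched as below. This is an illustrative reconstruction, not the project's script: the function names are hypothetical, and joining words with underscores simply mirrors the multi-word strings shown for the English list.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Frequency of each n-gram, with words joined by underscores."""
    return Counter(
        "_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def review_candidates(tokens, stop_words, top=3000):
    """For n = 1, 2, 3, list the `top` most frequent n-grams not yet in stop_words."""
    candidates = {}
    for n in (1, 2, 3):
        ranked = ngram_counts(tokens, n).most_common(top)
        candidates[n] = [gram for gram, _ in ranked if gram not in stop_words]
    return candidates


# Toy token stream, invented for illustration.
toy = "a bizottság a tanács a bizottság".split()
to_review = review_candidates(toy, stop_words={"a"}, top=2)
```

The lists returned by `review_candidates` are what a human would then inspect and selectively add to the stop-word list.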

  19. Hungarian stop-word list • single-word entries: 1265 • multi-word entries: 752 • Total: 2017

  20. Automatic indexing run 1 • 7971 texts divided into 3 sets (total length of 65,702,474 chars): • 202 optimisation (evaluation set) • 179 final evaluation (test set) • 7590 the training set
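A three-way split with these set sizes could be produced as follows. The slides do not say how documents were assigned to the sets, so the seeded random shuffle and the function name are assumptions.

```python
import random


def split_corpus(doc_ids, n_optimisation=202, n_test=179, seed=0):
    """Split documents into (training, optimisation, test) lists.

    Random assignment with a fixed seed is an assumption; the source
    only gives the set sizes.
    """
    ids = list(doc_ids)
    if len(ids) < n_optimisation + n_test:
        raise ValueError("not enough documents for the requested set sizes")
    random.Random(seed).shuffle(ids)
    optimisation = ids[:n_optimisation]
    test = ids[n_optimisation:n_optimisation + n_test]
    training = ids[n_optimisation + n_test:]
    return training, optimisation, test


training, optimisation, test = split_corpus(range(7971))
```

With 7971 documents, the remainder after taking 202 and 179 is the 7590-document training set reported on the slide.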

  21. Precision/recall in terms of number of Eurovoc descriptors
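For readers unfamiliar with this kind of evaluation, the sketch below shows one common way to compute precision and recall as a function of how many top-ranked descriptors are kept per document. It is a generic illustration with hypothetical names and descriptor IDs, not the evaluation code behind the slide.

```python
def precision_recall_at_k(ranked, gold, k):
    """Precision and recall when keeping the k top-ranked descriptors.

    ranked: descriptors proposed by the indexer, best first
    gold:   set of manually assigned descriptors for the document
    """
    top = ranked[:k]
    hits = sum(1 for descriptor in top if descriptor in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical descriptor IDs for one document.
proposed = ["d1", "d2", "d3", "d4"]
manual = {"d1", "d3", "d5"}
p2, r2 = precision_recall_at_k(proposed, manual, 2)
p4, r4 = precision_recall_at_k(proposed, manual, 4)
```

Keeping more descriptors per document typically raises recall while precision falls or stays flat, which is the trade-off a precision/recall curve makes visible.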

  22. Evaluation in terms of rank

  23. Precision/Recall graph

  24. Conclusions • First run already yields results comparable to other languages • scope for fine-tuning/filtering process • interesting to compare results gained from the two approaches
