
The METIS Project

The METIS Project. Peter Dirix July 8, 2002 Centre for Computational Linguistics Katholieke Universiteit Leuven. The METIS Project. EU-sponsored project Statistical machine translation Partners: ILSP (Athens), KU Leuven - CCL Subcontractors: University of Antwerp, KUB (Tilburg).


Presentation Transcript


  1. The METIS Project • Peter Dirix • July 8, 2002 • Centre for Computational Linguistics, Katholieke Universiteit Leuven

  2. The METIS Project • EU-sponsored project • Statistical machine translation • Partners: ILSP (Athens), KU Leuven - CCL • Subcontractors: University of Antwerp, KUB (Tilburg)

  3. The METIS Project • MT = Holy Grail of computational linguistics • Since the 50s: word-by-word systems • Later: rule-based systems • But: these approaches reached a bottleneck long ago • Since the 80s: SMT, new techniques

  4. The METIS Project • Disadvantage of SMT: needs large bilingual corpora (bitexts, usually not available) • METIS: uses only large monolingual corpora (widely available) • Also needed: a bilingual lexicon and tag-mapping rules • Only minimal effort for a new language pair

  5. The METIS Project • First language pairs: Dutch-English and Modern Greek-English • In this internship: creation of lexical resources for Dutch and British English • This means: monolingual corpora for Dutch and English and a bilingual lexicon Dutch-English • Resources conform to PAROLE/EAGLES standards • Creation of tag-mapping rules

  6. Corpora: Dutch • No extensive written Dutch corpus • Take the parts of the Spoken Dutch Corpus (CGN) that consist of read-aloud written text • Add the written-text parts of the Eindhoven corpus (newspaper texts of the 60s and 70s) • Tilburg corpus (recent newspaper texts) is not available

  7. Corpora: Dutch • Together: - CGN: 1,580,000 words (out of 10 million) - Eindhoven: 600,000 words (out of 720,000) • CGN has CGN tag set, Eindhoven has WOTAN tagset

  8. Corpora: English • British National Corpus (BNC) is largest available text corpus for British English • About 100 million words • Tagged with CLAWS5 tagset • About 2 million words get enriched tagset (CLAWS6) • Very good tagging quality

  9. Bilingual Lexicon: The Search • Criteria: correctness, generality, availability, cost, number of words • Our choice: combination • Dutch EuroWordNet & Ergane

  10. Dutch EuroWordNet • Entry not given per word, but per synset (set of synonymous words) • About 45,000 synsets • Gives language-internal (semantic) relations, part of speech and equivalence link (translation) to American WordNet 1.5 • Fairly cheap (about 440 €)

  11. Ergane • Multilingual Internet dictionaries • Uses Esperanto as interlingua • Dutch-English pair was available on the net • Contains about 50,000 translations • Free

  12. Corpora • CGN: only punctuation needs to be reinserted • Eindhoven corpus: will be retagged with CGN tags and lemmatized • BNC: needs to be lemmatized • Tasks will be performed by Antwerp/Tilburg group

  13. Format of bilingual lexicon • An Excel format was agreed upon • But the lexicon is too big (Excel only allows 64K lines) • So: a text file with 3 fields per line (Dutch lemma, English translation, only one per line, and PoS) • Fields separated by tabs
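The three-field, tab-separated line format described on this slide can be sketched as follows. This is a minimal illustration in Python, not the project's own code; the names `LexiconEntry`, `format_entry` and `parse_entry` and the sample word pair are hypothetical, and the field order (Dutch lemma, English translation, PoS) is taken from the slide.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line: a Dutch lemma, a single English translation, a PoS label."""
    dutch_lemma: str
    english_translation: str
    pos: str


def format_entry(entry: LexiconEntry) -> str:
    """Serialize an entry as one tab-separated line (without trailing newline)."""
    return "\t".join(entry)


def parse_entry(line: str) -> LexiconEntry:
    """Parse one tab-separated line; reject lines that do not have exactly 3 fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    return LexiconEntry(*fields)


# Hypothetical sample entry, used only to show the round trip.
sample = LexiconEntry("huis", "house", "n")
line = format_entry(sample)
```

Because each line carries only one translation, a Dutch lemma with several translations simply occupies several lines, which keeps the file easy to sort and deduplicate.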

  14. Dutch EuroWordNet • Extract information from WordNet files, using Perl scripts • Two WN files needed: the Dutch WordNet (DWN) and the Interlingual Index (ILI) • DWN refers to ILI, using eq_synonym and eq_near_synonym links to translations • Information of both lists was combined, using Perl scripts • PoS is also extracted from DWN • File in text format of target dictionary, about 100,000 lines
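The combination step on this slide amounts to a join between Dutch synsets and Interlingual Index records. The sketch below shows that join in Python under strong assumptions: the slides do not show the actual EuroWordNet file format, so the in-memory dictionaries, the keys `literals`, `pos` and `eq_links`, the identifiers `ili-1` and `ili-2`, and the toy data are all hypothetical stand-ins. The original work used Perl scripts.

```python
# Equivalence link types named on the slide.
EQ_LINK_TYPES = ("eq_synonym", "eq_near_synonym")


def build_entries(dwn_synsets, ili_index, link_types=EQ_LINK_TYPES):
    """Join Dutch synsets with ILI records and return (Dutch, English, PoS) triples.

    dwn_synsets: list of dicts with hypothetical keys 'literals', 'pos', 'eq_links'.
    ili_index:   dict mapping a (hypothetical) ILI record id to its English words.
    """
    entries = []
    for synset in dwn_synsets:
        for relation, ili_id in synset["eq_links"]:
            if relation not in link_types:
                continue  # keep only the two equivalence relations
            for english in ili_index.get(ili_id, []):
                for dutch in synset["literals"]:
                    entries.append((dutch, english, synset["pos"]))
    return entries


# Toy data, invented purely for illustration.
toy_dwn = [
    {
        "literals": ["huis"],
        "pos": "n",
        "eq_links": [("eq_synonym", "ili-1"), ("some_other_link", "ili-2")],
    }
]
toy_ili = {"ili-1": ["house", "home"], "ili-2": ["building"]}

toy_entries = build_entries(toy_dwn, toy_ili)
```

Every Dutch word in a synset is paired with every English word in the linked record, which may help explain why roughly 45,000 synsets expand to about 100,000 lexicon lines.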

  15. Ergane • Contains information in this form: aanbesteding: 1. tender | 2. public tender | 3. tender | 4. tender | 5. tender<BR> • Contains HTML tags: removed by Perl script; same for colons, numbers and bars • Each translation put in a different entry • PoS is automatically assigned: n • File in text format of target dictionary, about 50,000 lines
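The clean-up described on this slide (strip HTML tags, colons, numbers and bars; one entry per translation; default PoS `n`) can be sketched as below. The original used a Perl script that is not shown; this Python version is an illustrative reconstruction, and the function name `parse_ergane_line` is hypothetical.

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")           # e.g. the trailing <BR>
LEADING_NUMBER = re.compile(r"^\s*\d+\.\s*")  # e.g. "1. " before a translation


def parse_ergane_line(line, default_pos="n"):
    """Split one Ergane line into (Dutch headword, English translation, PoS) triples."""
    text = HTML_TAG.sub("", line).strip()
    headword, colon, rest = text.partition(":")
    if not colon:
        return []  # no "headword: translations" structure found
    entries = []
    for part in rest.split("|"):
        translation = LEADING_NUMBER.sub("", part).strip()
        if translation:
            entries.append((headword.strip(), translation, default_pos))
    return entries


# The example line from the slide.
example = "aanbesteding: 1. tender | 2. public tender | 3. tender | 4. tender | 5. tender<BR>"
parsed = parse_ergane_line(example)
```

Duplicate translations such as the repeated "tender" are deliberately kept at this stage; they disappear later when the merged lexicon is sorted and deduplicated.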

  16. Compiling one lexicon • Two lexica were merged into one file • Unix command-line program sort was used to put the list into alphabetical order and to remove duplicate entries • File with about 117,000 lines • Typos were corrected manually • Wrong translations were deleted
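The merge step on this slide was done with the Unix `sort` program. A rough Python equivalent of merging, sorting and removing duplicate lines is sketched here; the function name `merge_lexica` and the sample lines are hypothetical, and Python's default string ordering (by code point) will not necessarily match the locale-dependent order produced by `sort`.

```python
def merge_lexica(*lexica):
    """Merge several iterables of lexicon lines into one sorted list without duplicates."""
    merged = set()
    for lexicon in lexica:
        for line in lexicon:
            line = line.rstrip("\n")
            if line:  # skip empty lines
                merged.add(line)
    return sorted(merged)


# Hypothetical sample lines in the tab-separated lexicon format.
from_wordnet = ["huis\thouse\tn", "boom\ttree\tn"]
from_ergane = ["boom\ttree\tn", "aap\tmonkey\tn"]

combined = merge_lexica(from_wordnet, from_ergane)
```

Only exact duplicate lines are removed this way, so entries that differ in PoS or spelling survive, which is consistent with the manual correction passes described on the following slides.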

  17. Compiling one lexicon • PoS was corrected manually, also the ones introduced in Ergane • Collocations were removed to separate file (PoS determined by use) • Difference in PoS between lexicon and CGN will be handled later in the project • Complete lexicon covers 115,756 lines

  18. Tag-mapping rules • CGN tags are assigned purely on a word basis • Lemmatization to base form • Tag = list of lexical and morpho-syntactic features • Always includes PoS

  19. Tag-mapping rules • BNC: CLAWS6 tagset is chosen • Tagset is also grammatically based, but includes some semantics (e.g. names of months, …) • A more general tag subsumes a less general one

  20. Tag-mapping rules • For each PoS category, map features and values from Dutch to English • E.g.: N(eigen,mv,*) → NP, NP2, NPD2, NPM2 • 74 rules were constructed, sometimes mapping to multiple tag categories in English • Not implemented yet, because the MATLAB environment was not ready
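A rule such as the one mapping `N(eigen,mv,*)` to `NP, NP2, NPD2, NPM2` can be read as a pattern over a CGN tag's feature list, with `*` as a wildcard. The sketch below is one possible interpretation in Python, not the project's implementation (which the slide says did not exist yet). It assumes that `*` matches exactly one feature position; the function names and the sample input tags are illustrative.

```python
import re

TAG_PATTERN = re.compile(r"^(\w+)\((.*)\)$")


def parse_tag(tag):
    """Split a tag like 'N(eigen,mv,basis)' into ('N', ['eigen', 'mv', 'basis'])."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        return tag.strip(), []  # tag without a feature list
    pos, features = match.groups()
    return pos, [f.strip() for f in features.split(",") if f.strip()]


def rule_matches(pattern, tag):
    """True if the tag has the same PoS and every feature equals the pattern's or meets '*'."""
    pattern_pos, pattern_features = parse_tag(pattern)
    tag_pos, tag_features = parse_tag(tag)
    if pattern_pos != tag_pos or len(pattern_features) != len(tag_features):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_features, tag_features))


# The single example rule given on the slide; the other 73 rules are not shown.
RULES = [("N(eigen,mv,*)", ["NP", "NP2", "NPD2", "NPM2"])]


def map_tag(tag, rules=RULES):
    """Return the English tags of the first matching rule, or an empty list."""
    for pattern, english_tags in rules:
        if rule_matches(pattern, tag):
            return english_tags
    return []


mapped = map_tag("N(eigen,mv,basis)")   # illustrative plural proper-noun tag
unmapped = map_tag("N(soort,mv,basis)")  # illustrative common-noun tag, no rule here
```

Returning a list of candidate English tags reflects the slide's remark that rules sometimes map to multiple tag categories; choosing among them is left open here.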

  21. Conclusion • Lexical resources and tag-mapping rules needed for METIS were constructed • Not easy to get appropriate resources • Problems in the future: * generality of tag-mapping rules * adjacency of collocations and separable verbs in Dutch
