The METIS Project

The METIS Project • Peter Dirix • July 8, 2002 • Centre for Computational Linguistics, Katholieke Universiteit Leuven

The METIS Project • EU-sponsored project • Statistical machine translation • Partners: ILSP (Athens), KU Leuven - CCL • Subcontractors: University of Antwerp, KUB (Tilburg)

The METIS Project • MT = Holy Grail of computational linguistics • Since the 1950s: word-by-word systems • Later: rule-based systems • But: these approaches hit a bottleneck long ago • Since the 1980s: statistical machine translation (SMT), new techniques

The METIS Project • Disadvantage of SMT: it needs large bilingual corpora (bitexts), which are usually not available • METIS: uses only large monolingual corpora (widely available) • In addition, a bilingual lexicon and tag-mapping rules are needed • Only minimal effort for a new language pair

The METIS Project • First language pairs: Dutch-English and Modern Greek-English • In this internship: creation of lexical resources for Dutch and British English • This means: monolingual corpora for Dutch and English, and a Dutch-English bilingual lexicon • Resources conform to PAROLE/EAGLES standards • Creation of tag-mapping rules

Corpora: Dutch • No extensive written Dutch corpus is available • Take the parts of the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) that consist of written text read aloud • Add the written-text parts of the Eindhoven corpus (newspaper texts from the 1960s and 1970s) • The Tilburg corpus (recent newspaper texts) is not available

Corpora: Dutch • Together (about 2,180,000 words): - CGN: 1,580,000 words (out of 10 million) - Eindhoven: 600,000 words (out of 720,000) • CGN uses the CGN tagset, Eindhoven uses the WOTAN tagset

Corpora: English • The British National Corpus (BNC) is the largest available text corpus for British English • About 100 million words • Tagged with the CLAWS5 tagset • About 2 million words carry an enriched tagset (CLAWS6) • Very good tagging quality

Bilingual Lexicon: The Search • Criteria: correctness, generality, availability, cost, number of words • Our choice: a combination of Dutch EuroWordNet and Ergane

Dutch EuroWordNet • Entries are given per synset (set of synonymous words), not per word • About 45,000 synsets • Gives language-internal (semantic) relations, part of speech, and an equivalence link (translation) to the American WordNet 1.5 • Fairly cheap (about €440)

Ergane • Multilingual Internet dictionaries • Uses Esperanto as interlingua • Dutch-English pair was available on the net • Contains about 50,000 translations • Free

Corpora • CGN: only punctuation needs to be reinserted • Eindhoven corpus: will be retagged with CGN tags and lemmatized • BNC: needs to be lemmatized • These tasks will be performed by the Antwerp/Tilburg group

Format of bilingual lexicon • An Excel format was agreed upon • But the lexicon is too big (Excel only allows 64K lines) • So: a text file with 3 fields per line (Dutch lemma, English translation, and PoS), with only one translation per line • Fields are separated by tabs
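
A minimal sketch of reading and writing one line of the three-field, tab-separated format described on this slide. It is written in Python purely for illustration; the names `LexiconEntry`, `parse_entry`, and `format_entry` are hypothetical and do not come from the project.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line: Dutch lemma, a single English translation, PoS."""
    dutch_lemma: str
    english: str
    pos: str


def parse_entry(line: str) -> LexiconEntry:
    """Split one tab-separated line into its three fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    return LexiconEntry(*fields)


def format_entry(entry: LexiconEntry) -> str:
    """Serialize an entry back to a tab-separated line (without newline)."""
    return "\t".join(entry)


if __name__ == "__main__":
    entry = parse_entry("aanbesteding\tpublic tender\tn\n")
    print(entry)
    print(format_entry(entry))
```

Keeping one translation per line means a Dutch lemma with several translations simply appears on several lines, which keeps later sorting and duplicate removal trivial.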

Dutch EuroWordNet • Information was extracted from the WordNet files using Perl scripts • Two WN files are needed: the Dutch WordNet (DWN) and the Interlingual Index (ILI) • DWN refers to ILI, using eq_synonym and eq_near_synonym links to translations • Information from both files was combined, using Perl scripts • PoS is also extracted from DWN • Result: a text file in the target dictionary format, about 100,000 lines
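
The combination of DWN and ILI described on this slide amounts to a join over the equivalence links. The sketch below illustrates that idea in Python rather than the Perl the project used. The in-memory structures (a list of synset dictionaries and an ILI dictionary keyed by record id) are assumptions made for the example and do not reflect the actual EuroWordNet file format; the toy data is invented.

```python
# Only these two equivalence relations are followed, as stated on the slide.
EQ_LINKS = {"eq_synonym", "eq_near_synonym"}


def build_entries(dwn_synsets, ili):
    """Join Dutch synsets with ILI records into (Dutch, English, PoS) triples.

    dwn_synsets: list of dicts with keys "lemmas", "pos", "eq_links"
                 (hypothetical layout; eq_links is a list of
                 (relation, ili_id) pairs).
    ili:         dict mapping an ili_id to a list of English words
                 (hypothetical layout).
    """
    entries = []
    for synset in dwn_synsets:
        for relation, ili_id in synset["eq_links"]:
            if relation not in EQ_LINKS:
                continue  # skip other equivalence relations
            for english in ili.get(ili_id, []):
                for dutch in synset["lemmas"]:
                    entries.append((dutch, english, synset["pos"]))
    return entries


if __name__ == "__main__":
    # Invented toy data, for illustration only.
    toy_dwn = [{
        "lemmas": ["fiets", "rijwiel"],
        "pos": "n",
        "eq_links": [("eq_synonym", "ili-1"), ("eq_has_hyperonym", "ili-2")],
    }]
    toy_ili = {"ili-1": ["bicycle", "bike"], "ili-2": ["vehicle"]}
    for entry in build_entries(toy_dwn, toy_ili):
        print("\t".join(entry))
```

Because every Dutch lemma in a synset is paired with every English word in the linked record, the number of output lines can be well above the number of synsets, which is consistent with about 45,000 synsets yielding about 100,000 lines.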

Ergane • Contains information in this form: aanbesteding: 1. tender | 2. public tender | 3. tender | 4. tender | 5. tender<BR> • HTML tags were removed by a Perl script, as were colons, numbers, and bars • Each translation was put in a separate entry • PoS is automatically assigned: n • Result: a text file in the target dictionary format, about 50,000 lines
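
The clean-up of an Ergane line can be sketched as follows. This is an illustrative Python version, not the original Perl script. The function name `parse_ergane_line` is hypothetical, and the parsing rules are inferred only from the single sample line on the slide.

```python
import re


def parse_ergane_line(line, default_pos="n"):
    """Turn one raw Ergane line into (Dutch, English, PoS) entries.

    Assumes the layout of the sample line on the slide:
    headword, colon, then numbered translations separated by bars,
    possibly followed by an HTML tag such as <BR>.
    """
    # Remove HTML tags such as <BR>.
    text = re.sub(r"<[^>]+>", "", line).strip()
    # The headword is everything before the first colon.
    headword, _, rest = text.partition(":")
    entries = []
    for chunk in rest.split("|"):
        # Drop the leading sense number, e.g. "1. ".
        translation = re.sub(r"^\s*\d+\.\s*", "", chunk).strip()
        if translation:
            entries.append((headword.strip(), translation, default_pos))
    return entries


if __name__ == "__main__":
    sample = ("aanbesteding: 1. tender | 2. public tender | 3. tender "
              "| 4. tender | 5. tender<BR>")
    for entry in parse_ergane_line(sample):
        print("\t".join(entry))
```

The sketch deliberately keeps repeated translations (the sample line yields "tender" four times); in the workflow described here, duplicates are removed later, when the merged lexicon is sorted.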

Compiling one lexicon • The two lexica were merged into one file • The Unix command-line program sort was used to put the list in alphabetical order and to remove duplicate entries • Result: a file with about 117,000 lines • Typos were corrected manually • Wrong translations were deleted
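
The merge step can be illustrated with a few lines of Python that mimic what sorting with duplicate removal does to the combined file. This is a sketch, not the command the project ran. The function name `merge_lexica` and the sample lines are invented for the example, and Python orders strings by code point, which may differ from the locale-dependent order of the Unix sort program.

```python
def merge_lexica(*lexica):
    """Merge lists of lexicon lines, drop exact duplicates, and sort them."""
    merged = set()
    for lexicon in lexica:
        # Ignore trailing newlines so identical entries compare equal.
        merged.update(line.rstrip("\n") for line in lexicon)
    return sorted(merged)


if __name__ == "__main__":
    # Invented sample lines in the three-field, tab-separated format.
    from_wordnet = ["fiets\tbicycle\tn", "aanbesteding\ttender\tn"]
    from_ergane = ["aanbesteding\ttender\tn", "aanbesteding\tpublic tender\tn"]
    for line in merge_lexica(from_wordnet, from_ergane):
        print(line)
```

Only lines that are identical in all three fields collapse into one, so the same word pair with two different PoS values survives as two entries and has to be resolved by hand.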

Compiling one lexicon • PoS tags were corrected manually, including those assigned automatically to the Ergane entries • Collocations were moved to a separate file (PoS determined by use) • Differences in PoS between the lexicon and CGN will be handled later in the project • The complete lexicon contains 115,756 lines

Tag-mapping rules • CGN tags are assigned purely on a word basis • Lemmatization to base form • Tag = list of lexical and morpho-syntactic features • Always includes the PoS

Tag-mapping rules • BNC: the CLAWS6 tagset was chosen • Also a grammatically based tagset, but it includes some semantics (e.g. names of months, …) • A more general tag subsumes a less general one

Tag-mapping rules • For each PoS category, features and values are mapped from Dutch to English • E.g.: N(eigen,mv,*) → NP, NP2, NPD2, NPM2 • 74 rules were constructed, some of them mapping to multiple tag categories in English • Not implemented yet, because the MATLAB environment was not yet ready
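
The slide notes that the rules were not yet implemented, so the following is only a sketch of how a rule such as the example above could be applied. It is written in Python rather than MATLAB. The matching semantics are an assumption: a rule matches when the PoS is identical, the number of features is the same, and each rule feature either equals the tag's feature or is the wildcard `*`. The names `parse_cgn_tag` and `map_tag` are hypothetical.

```python
import re

# One rule taken from the slide; the full set of 74 rules is not shown there.
RULES = [
    ("N(eigen,mv,*)", ["NP", "NP2", "NPD2", "NPM2"]),
]


def parse_cgn_tag(tag):
    """Split a tag such as 'N(eigen,mv,basis)' into PoS and feature list."""
    match = re.fullmatch(r"(\w+)\((.*)\)", tag)
    if match is None:
        raise ValueError(f"not a PoS(feature,...) tag: {tag!r}")
    pos, features = match.groups()
    return pos, (features.split(",") if features else [])


def map_tag(cgn_tag, rules=RULES):
    """Return the English candidate tags of the first matching rule."""
    pos, feats = parse_cgn_tag(cgn_tag)
    for pattern, targets in rules:
        rule_pos, rule_feats = parse_cgn_tag(pattern)
        if (rule_pos == pos
                and len(rule_feats) == len(feats)
                and all(r == "*" or r == f
                        for r, f in zip(rule_feats, feats))):
            return targets
    return []  # no rule applies


if __name__ == "__main__":
    print(map_tag("N(eigen,mv,basis)"))
    print(map_tag("N(soort,mv,basis)"))
```

Returning a list of candidate tags reflects the point that one Dutch tag can map to several English tag categories; choosing among the candidates is left to a later stage.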

Conclusion • The lexical resources and tag-mapping rules needed for METIS were constructed • It was not easy to obtain appropriate resources • Problems for the future: - generality of tag-mapping rules - adjacency of collocations and separable verbs in Dutch
