Human Judgements in Parallel Treebank Alignment. Martin Volk, Torsten Marek, Yvonne Samuelsson. University of Zurich and Stockholm University. volk@cl.uzh.ch
DE – EN Alignment
SMULTRON • Stockholm MULtilingual TReebank • 1000 sentences in 3 languages (DE-EN-SV) • 500 from Jostein Gaarder’s Sophie’s World (~7,500 tokens, 14 tokens/sentence) • 500 from economy texts (~11,000 tokens, 22 tokens/sentence): ABB Quarterly report; Rainforest Alliance: Banana Certification Program; SEB Annual report • Released: January 2008 • www.ling.su.se/dali/research/smultron/index.htm
English annotation • Follows the Penn Treebank guidelines • Slower annotation because of • insertion of traces • secondary edges • deeper trees
Sentence alignment • Word alignment • input for Statistical MT • Phrase alignment • linguistically motivated phrases • input for Example-based MT
Tools for Parallel Treebanks • creating and editing trees • from monolingual treebanks • PoS taggers, chunkers, editor, 'tree-enricher' • aligning phrases • use of word alignment tools • tree alignment editor: Stockholm TreeAligner • searching across languages • TIGER-Search for parallel treebanks: Stockholm TreeAligner
Guidelines for Alignment • Align words and phrases that represent the same meaning and could serve as translation units in an MT system. • Align as many words and phrases as possible. • Distinguish between exact and approximate alignments. • 1:n word / phrase alignments are allowed, but not m:n word / phrase alignments. • m:n sentence alignments are allowed.
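The 1:n versus m:n rule above can be checked mechanically. The following is a minimal Python sketch, not the format used by SMULTRON or the Stockholm TreeAligner: the `Link` record, the node ids, and the reading of "m:n" as "a link whose source and target are both shared with other links" are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    src: str   # node id in the German tree (hypothetical id scheme)
    tgt: str   # node id in the English tree (hypothetical id scheme)
    kind: str  # "exact" or "approximate"


def find_m_to_n(links):
    """Return links whose source AND target each take part in other links.

    Such a link can only occur in an m:n configuration; pure 1:n or n:1
    groups never contain one.
    """
    targets_of = defaultdict(set)
    sources_of = defaultdict(set)
    for link in links:
        targets_of[link.src].add(link.tgt)
        sources_of[link.tgt].add(link.src)
    return [
        link for link in links
        if len(targets_of[link.src]) > 1 and len(sources_of[link.tgt]) > 1
    ]


# A 1:n alignment: one German node linked to two English nodes (allowed).
one_to_n = [Link("de1", "en1", "exact"), Link("de1", "en2", "approximate")]
# Adding a second German node on en2 turns the group into 2:2 (not allowed).
m_to_n = one_to_n + [Link("de2", "en2", "exact")]
```

Here `find_m_to_n(one_to_n)` is empty, while `find_m_to_n(m_to_n)` flags the link between `de1` and `en2`, the one that ties the two groups together.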
Examples • Do not align: • die Verwunderung über das Leben • their astonishment at the world • Do align: • was für eine seltsame Welt • what an extraordinary world
Specific rules • a pronoun in one language shall never be aligned with a full noun in the other • names are aligned regardless of spelling, unless the name is changed (fiction) • ignore number/case but not voice
Exact vs. approximate alignment • best vs. "second-best" translation • an acronym in one language shall be aligned as approximate (fuzzy) with a spelled-out term in the other • PT – Power Technologies • difficult distinctions • einer der ersten Tage im Mai – early May
Related Research • Blinker project (Melamed) • Prague Czech-English Treebank • Example-based MT in Dublin • Linköping English-Swedish Treebank
Experiment • 12 students were asked to align 20 DE-EN tree pairs • 10 tree pairs from Sophie’s World • 10 tree pairs from economy texts • all were advanced CL students • they received • a short introduction • the written guidelines
Experiment: Results • The students created a huge variety in the number of alignments: • Sophie part: from 47 to 125 (ø = 94.3) • Econ part: from 62 to 259 (ø = 186.9) • The 3 students with the lowest numbers were non-native speakers of German. • 1 student had misunderstood the task.
Experiment: Results • The remaining 8 students had a high overlap with the gold standard (Recall): • Sophie part: from 48% to 81% (ø = 68.7%) • Econ part: from 66% to 89% (ø = 75.5%) • Precision • Sophie part: from 81% to 97% (ø = 89.1%) • Econ part: from 78% to 94% (ø = 88.2%)
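Recall and precision against a gold standard can be computed as in the sketch below. It assumes that each alignment is represented as a (German node, English node) pair and that the exact/approximate label is ignored when comparing; the slides do not say how the labels were treated, and the node ids are invented.

```python
def precision_recall(student, gold):
    """Compare a student's alignment links with the gold-standard links.

    Both arguments are iterables of (source_node, target_node) pairs.
    Precision: share of the student's links that are in the gold standard.
    Recall: share of the gold-standard links that the student found.
    """
    student, gold = set(student), set(gold)
    correct = len(student & gold)
    precision = correct / len(student) if student else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: the student draws few links, but all of them are right.
gold_links = {("de1", "en1"), ("de2", "en2"), ("de3", "en3"), ("de4", "en4")}
student_links = {("de1", "en1"), ("de2", "en2")}
precision, recall = precision_recall(student_links, gold_links)
```

In this toy case precision is 1.0 and recall is 0.5, the same direction as the pattern reported above, where precision was higher than recall.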
Discrepancies • students sometimes aligned a word (or some words) with a node • e.g. the word natürlich to the phrase of course • students sometimes aligned a German verb group with a single verb form in English • e.g. ist zurückzuführen vs. reflecting
Discrepancies based on different grammatical forms: • a definite singular NP in German with an indefinite plural NP in English • der Umsatz vs. revenues • a German genitive NP with a PP in English • der beiden Divisionen vs. of the two divisions
Missed by all students • alignment of a German word to an empty token in English • wenn sie die Hand ausstreckte vs. herself shaking hands
Conclusions • Our alignment guidelines are sufficient for a core of clear alignment decisions. • Needed: • Better alignment rules with concrete examples. • Better support tools (consistency checking). • The distinction between exact alignment and approximate alignment is very tricky.
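The slides do not say what consistency checking would involve. One possible reading, sketched here in Python under that assumption, is to flag phrase pairs that were labelled exact in one place and approximate in another. The (source text, target text, label) triples are a hypothetical representation, not a tool format.

```python
from collections import defaultdict


def inconsistent_pairs(links):
    """Return phrase pairs that occur with more than one alignment label.

    links: iterable of (source_text, target_text, kind) triples, where kind
    is "exact" or "approximate". Text is lower-cased before comparison.
    """
    kinds_seen = defaultdict(set)
    for src, tgt, kind in links:
        kinds_seen[(src.lower(), tgt.lower())].add(kind)
    return sorted(pair for pair, kinds in kinds_seen.items() if len(kinds) > 1)


# Invented annotations: the acronym pair is labelled two different ways.
annotated = [
    ("PT", "Power Technologies", "approximate"),
    ("PT", "Power Technologies", "exact"),
    ("der Umsatz", "revenues", "exact"),
]
flagged = inconsistent_pairs(annotated)
```

A check of this kind only reports candidates for review; the same phrase pair may legitimately deserve different labels in different contexts.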
Thank You for Your Attention! • Questions???
Applications of Parallel Treebanks For the Translator • corpus for translation studies • search tools needed For the Computational Linguist • input for Example-based Machine Translation • evaluation corpus for word, phrase or clause alignment • training corpus for transfer rules
Parallel Treebanking (workflow diagram) • DE sentence → ANNOTATE: PoS tagger (STTS), chunker (TIGER) → flat DE tree → deepening → DE tree • SV sentence → PoS tagger (SUC) → STTS conversion → ANNOTATE: chunker (SWE-TIGER) → flat SV tree → deepening + back conversion → SV tree • DE tree + SV tree → phrase alignment