Phrase alignment of Estonian-German parallel treebanks Heli Uibo and Krista Liin, University of Tartu Martin Volk, Stockholm University
Aim and motivation • Aim – the alignment of the phrases of two corpora that are each other's translations • Motivation: • Example-Based Machine Translation (EBMT) • Cross-language and translation studies
Existing resource – The Sofie Parallel Treebank • http://omilia.uio.no/sofie/ (password protected) • 9 European languages, including German and Estonian • initiated by the Nordic Treebank Network • chapters 1-2 of Jostein Gaarder’s novel “Sophie’s World” • sentences aligned • syntactic structure and functions annotated, but different annotation schemes used: • German – TIGER (http://www.ims.uni-stuttgart.de/projekte/TIGER/) • Estonian – VISL (http://beta.visl.sdu.dk)
Automatic alignment of Estonian-German NPs • This is the first automatic alignment of Estonian-X parallel corpora below the sentence level. • We started with the automatic alignment of NPs because • an important part of the sentence's meaning is represented by noun phrases; • NPs are the most frequent phrase type in these languages.
The NP alignment method 1. Find all noun phrases in the parallel sentences. Sofie legte dann immer einen dicken Stapel Post auf den Küchentisch, ehe sie auf ihr Zimmer ging, um ihre Aufgaben zu machen. Tavaliselt pani ta paksu pataka posti köögilauale, enne kui läks üles oma tuppa koolitöid tegema.
The NP alignment method 2. Find all correspondences between the noun phrases. Sofie legte dann immer einen dicken Stapel Post auf den Küchentisch, ehe sie auf ihr Zimmer ging, um ihre Aufgaben zu machen. Tavaliselt pani ta paksu pataka posti köögilauale, enne kui läks üles oma tuppa koolitöid tegema. 3. Remove overlapping correspondences.
The NP alignment method To accomplish steps 2 and 3 we used online dictionaries (ET-EN and DE-EN) and annotation information: 2. To set the correspondences between Estonian and German NPs • Translate all NP heads to English; • Find the intersections of translations; • If a pair of NPs is related by translation, then set a correspondence between them. 3. To remove overlapping correspondences • Use proper names as milestones; • Look at the locations of the NPs in the sentence.
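Steps 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary entries are toy values made up for illustration, NP heads are assumed to be available as lemmas in sentence order, and the overlap filter is a simple greedy heuristic based on NP position only (the proper-name milestones mentioned on the slide are not modelled). The names `DE_EN`, `ET_EN`, `candidate_pairs` and `remove_overlaps` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Toy pivot dictionaries keyed by lemma (hypothetical entries, illustration only).
DE_EN: Dict[str, Set[str]] = {
    "Stapel": {"pile", "stack"},
    "Küchentisch": {"kitchen table"},
    "Zimmer": {"room"},
    "Aufgabe": {"task", "homework"},
}
ET_EN: Dict[str, Set[str]] = {
    "patakas": {"pile", "bundle"},
    "köögilaud": {"kitchen table"},
    "tuba": {"room"},
    "koolitöö": {"homework", "schoolwork"},
}


def candidate_pairs(de_heads: List[str], et_heads: List[str]) -> List[Tuple[int, int]]:
    """Step 2: pair NPs whose head lemmas share at least one English translation."""
    pairs = []
    for i, de_head in enumerate(de_heads):
        for j, et_head in enumerate(et_heads):
            if DE_EN.get(de_head, set()) & ET_EN.get(et_head, set()):
                pairs.append((i, j))
    return pairs


def remove_overlaps(pairs: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Step 3 (assumed heuristic): keep each NP in at most one pair,
    preferring pairs whose positions in the two sentences are closest."""
    chosen, used_de, used_et = [], set(), set()
    for i, j in sorted(pairs, key=lambda p: (abs(p[0] - p[1]), p)):
        if i not in used_de and j not in used_et:
            chosen.append((i, j))
            used_de.add(i)
            used_et.add(j)
    return sorted(chosen)


# NP head lemmas in sentence order (assumed to come from the treebank annotation).
de_heads = ["Stapel", "Küchentisch", "Zimmer", "Aufgabe"]
et_heads = ["patakas", "köögilaud", "tuba", "koolitöö"]
alignment = remove_overlaps(candidate_pairs(de_heads, et_heads))
print(alignment)
```

Because the correspondence test is a set intersection over English translations, a missing or overly broad dictionary entry directly produces a missed or spurious NP pair.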
Results • 53 sentence pairs • 134 possible NP matches were found, out of which 75 matches were selected. • precision 84% • recall 53%
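The slide does not say how precision and recall were computed. Assuming the standard definitions against a manually created gold-standard alignment, they can be sketched as follows; the function name `precision_recall` and the example pairs are hypothetical and are not the treebank data.

```python
from typing import Iterable, Tuple

Pair = Tuple[int, int]


def precision_recall(proposed: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float]:
    """Precision = correct / proposed, recall = correct / gold (standard definitions)."""
    proposed_set, gold_set = set(proposed), set(gold)
    correct = len(proposed_set & gold_set)
    precision = correct / len(proposed_set) if proposed_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical NP pairs for one sentence pair (not the reported results).
proposed = [(0, 0), (1, 1), (2, 3)]
gold = [(0, 0), (1, 1), (2, 2), (3, 3)]
p, r = precision_recall(proposed, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```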
Sources of errors • Different tree structures (German – deeper) • Translation problems. We used English as an intermediary language to find German-Estonian word correspondences (there is no free German-Estonian electronic dictionary). • An NP in one language may correspond to a different phrase type or to a part of an NP in the other language. • A PP in German often corresponds to an NP in Estonian • A lot of grammatical information that is expressed by prepositions in German or English is expressed by grammatical cases in Estonian.
Alternative approach – statistical • An alternative to using bilingual electronic dictionaries is the use of statistical word alignment methods. • This method has been evaluated by Samuelsson (2004) for the phrase alignment of a German-Swedish parallel treebank. • We intend to test this method also for a German-Estonian treebank, although we are aware of the structural differences between German and Estonian which make automatic word alignment more difficult.
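To give a flavour of dictionary-free word alignment, the sketch below scores word pairs by sentence-level co-occurrence using the Dice coefficient. This is a deliberately simple stand-in chosen for illustration, not the method evaluated by Samuelsson (2004), which the slide does not describe. The function name `dice_scores` and the tiny tokenized bitext are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]


def dice_scores(bitext: List[SentencePair]) -> Dict[Tuple[str, str], float]:
    """Dice(s, t) = 2 * cooc(s, t) / (count(s) + count(t)), counted per sentence pair."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in bitext:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        # Each word pair is counted once per sentence pair in which both words occur.
        pair_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


# Hypothetical German-Estonian toy bitext (illustration only).
bitext = [
    (["das", "Zimmer"], ["tuba"]),
    (["das", "Haus"], ["maja"]),
    (["Zimmer"], ["tuba"]),
]
scores = dice_scores(bitext)
print(scores[("Zimmer", "tuba")], scores[("das", "tuba")])
```

Surface-form counts like these are sparse for a morphologically rich language such as Estonian, which is one reason the structural differences noted above make automatic word alignment harder.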
Treebank tools • There exist tools for monolingual treebanks: • editors, e.g. Annotate • treebank query tools (tgrep, TIGERSearch) • Special software tools for building and using parallel treebanks are needed. • We have developed an alignment viewer based on SVG (Scalable Vector Graphics). • Still to be implemented: • alignment editor (currently being developed at Stockholm University) • phrase alignment test tool
Alignment visualization • Index file in HTML • Tree overview
Conclusion and perspectives • Our first attempt to align the noun phrases in the Estonian-German parallel treebank led to satisfactory results. • The results could be improved if • different phrase types were taken into consideration; • a more exact dictionary were used; • Estonian syntactic trees were deepened, making their annotation depth more similar to that of the German trees.