Alinea: a language-independent tool for bi-text processing
Jean-Louis Duchet, Olivier Kraif
Ispra 2005, JRC Workshop
Introduction
Introduction | Application | Features | I/O formats | Language specificities
• Alinea is an alignment tool that uses language-independent techniques
• Alinea has obtained good results on closely related language pairs: EN, FR, ES, IT, …
-> Can it be used for languages that are further apart?
-> What kind of tuning is needed when dealing with a new language pair?
-> What kind of language-specific knowledge could be used to improve the results?
A corpus-based bilingual dictionary
• Corpus being scanned: Ismail Kadare's works, published in both languages in Paris (Ed. Fayard); other sources: IIRCA (International initiative for a reference corpus of Albanian)
• Indexing is used to retrieve word forms NOT yet recorded in dictionaries
• Concordancing is used to enlarge the phraseological content of the dictionary
• Aligned concordancing is used to match word senses in context across the two languages
Dictionary in the making: sample
Items not yet recorded
Why?
- variants
- foreign loanwords
- local colour terms
- compounds
Examples under the letter DH:
• dhimbje (var.)
• dhimbsur (var.)
• dhimbsuri (var.)
• dhjavolos (loanword)
• dhjetra
• dhrahmi
• dhëmbëjashtë (comp.)
Specific features of the language pair
• The Albanian "phonetic principle": Albanian spelling respells foreign words phonetically: shofer/chauffeur, konti/comte, including proper nouns: Nju-Jork/New York, Ballkan/Balkans
• The French graphemic preservation principle: French keeps the original spelling of Albanian names: Gjergj Balsha, Gjin Bue Shpata
French-Albanian stoplist
• Stoplist based on the most frequent words
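A frequency-based stoplist of this kind can be sketched in a few lines. This is only an illustration: the function name `build_stoplist`, the `\w+` tokenisation and the cut-off `size` are assumptions, not details given in the slides.

```python
import re
from collections import Counter

def build_stoplist(text, size=50):
    """Return the `size` most frequent word forms of `text` (hypothetical helper)."""
    # Naive tokenisation: lowercase the text, keep runs of word characters.
    words = re.findall(r"\w+", text.lower())
    return [word for word, _ in Counter(words).most_common(size)]

sample = "le comte et le roi et le pont"
print(build_stoplist(sample, size=2))
```

In practice one list would be built per language from a large monolingual sample, then reviewed by hand.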
Albanian alphabetical order
• A, B, C, Ç, D, DH, E, Ë, F, G, GJ, H, I, J, K, L, LL, M, N, NJ, O, P, Q, R, RR, S, SH, T, TH, U, V, X, XH, Y, Z, ZH
• 36 letters: 29 consonants and 7 vowels; the 9 digraphs and the 2 letters with diacritics each count as a separate graphemic unit
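Because digraphs count as single letters, a plain code-point sort puts "dhimbje" before "dua", whereas Albanian order puts every D word before every DH word. The sketch below builds a sort key from the alphabet listed above. It assumes that any two-character sequence matching a digraph really is that digraph, which can fail at morpheme boundaries; the name `albanian_key` is hypothetical.

```python
import unicodedata

ALPHABET = ["a", "b", "c", "ç", "d", "dh", "e", "ë", "f", "g", "gj", "h",
            "i", "j", "k", "l", "ll", "m", "n", "nj", "o", "p", "q", "r",
            "rr", "s", "sh", "t", "th", "u", "v", "x", "xh", "y", "z", "zh"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
DIGRAPHS = {letter for letter in ALPHABET if len(letter) == 2}

def albanian_key(word):
    """Map a word to a list of letter ranks, reading digraphs greedily."""
    w = unicodedata.normalize("NFC", word.lower())
    key, i = [], 0
    while i < len(w):
        pair = w[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet sort after all letters.
            key.append(RANK.get(w[i], len(ALPHABET)))
            i += 1
    return key

words = ["dhrahmi", "dua", "dhimbje", "djalë"]
print(sorted(words, key=albanian_key))
```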
Alinea features
Aligning in three steps
• Anchor point extraction
• Full sentence alignment
• Lexical correspondence extraction
Alinea features
• Step 1: Anchor point extraction
  • Relies on identical strings (transparent words, Fr. transfuges): numbers, proper nouns and other such strings
  • Implements a "safest clues first" heuristic within an iterative framework
  • Usually yields precision close to 100% and recall over 10%
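One way to picture anchor point extraction is shown below. It is a simplified, single-pass sketch, not Alinea's iterative procedure: the "safest clues" are taken to be numbers and capitalised strings that occur in exactly one sentence on each side, and a greedy filter keeps only anchors that preserve sentence order. The names `TOKEN` and `anchor_points` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed clue pattern: digit sequences and capitalised words.
TOKEN = re.compile(r"\d+|[A-Z]\w+")

def index_clues(sentences):
    """Map each clue string to the list of sentence indices containing it."""
    positions = defaultdict(list)
    for i, sentence in enumerate(sentences):
        for tok in set(TOKEN.findall(sentence)):
            positions[tok].append(i)
    return positions

def anchor_points(src_sents, tgt_sents):
    src, tgt = index_clues(src_sents), index_clues(tgt_sents)
    # Keep only strings seen in exactly one sentence on each side.
    candidates = sorted({(src[t][0], tgt[t][0])
                         for t in src
                         if t in tgt and len(src[t]) == 1 and len(tgt[t]) == 1})
    # Greedy monotonic filter: anchors must not cross.
    anchors = []
    for i, j in candidates:
        if not anchors or (i > anchors[-1][0] and j > anchors[-1][1]):
            anchors.append((i, j))
    return anchors

fr = ["Il est né en 1936 .", "Il vit à Paris .", "Stres arrive ."]
sq = ["Ai lindi më 1936 .", "Ai jeton në Paris .", "Stres vjen ."]
print(anchor_points(fr, sq))
```

Strings such as "Il" and "Ai" are ignored here because they recur or have no identical counterpart, which is what keeps precision high at the cost of recall.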
Alinea features
• After identical strings, cognate pairs can be used to supply further anchor points
FR: Il y avait plusieurs années qu ' on avait planté de tels écriteaux un peu partout , non seulement dans les possessions de notre seigneur , le comte Stres des Gjika , ou Stres Gjikondi , mais aussi plus loin , au - delà des frontières de l ' État d ' Arberie , dans les autres contrées des Balkans .
SQ: Ka shumë vite që kësi pllakash janë venë kudo dhe jo vetëm në viset e kryezotit tonë , kontit Stres të Gjikëve , ose Stres Gjikondit , siç e thërresin shkurt , por edhe më tutje , madje edhe përtej kufijve të shtetit të Arbrit , në pjesët e tjera të gadishullit .
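Cognate pairs such as Gjikondi/Gjikondit or Arberie/Arbrit in the example above can be detected with a string similarity measure. The sketch below uses the longest common subsequence divided by the length of the longer word; the measure, the threshold of 0.7 and the minimum length of 4 are illustrative assumptions, not Alinea's documented settings.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def cognateness(a, b):
    """Similarity in [0, 1]: LCS length over the longer word's length."""
    a, b = a.lower(), b.lower()
    return lcs_len(a, b) / max(len(a), len(b))

def are_cognates(a, b, threshold=0.7, min_len=4):
    # Short words are excluded because they match by chance too easily.
    return min(len(a), len(b)) >= min_len and cognateness(a, b) >= threshold

for pair in [("Gjikondi", "Gjikondit"), ("Arberie", "Arbrit"), ("Balkans", "gadishullit")]:
    print(pair, are_cognates(*pair))
```

With these settings comte/kontit is not detected, which illustrates why phonetic respelling in Albanian makes cognate matching harder for this language pair.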
Alinea features
• Step 2: Full alignment computation
  • Extracts a sequence of sentence groupings: (1-0) (0-1) (1-1) (1-2) (2-1) (1-3) (3-1) …
  • Uses a combination of clues:
    • sentence lengths (Gale & Church, 1992)
    • cognateness (Simard, 1992)
    • word-to-word correspondences (requires training on a large corpus)
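The search for a sequence of sentence groupings is typically done by dynamic programming. The sketch below uses sentence lengths only, with a simplified cost function and hand-picked penalties per grouping type; it is not the Gale & Church probability model nor Alinea's combined scoring, and all names and constants are assumptions.

```python
import math

# Grouping types (source count, target count) with assumed penalties.
PATTERNS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (1, 2): 1.5, (2, 1): 1.5}

def length_cost(ls, lt, ratio=1.0):
    """Penalty growing with the gap between expected and observed target length."""
    return abs(ls * ratio - lt) / math.sqrt(ls + lt + 1)

def align(src_lens, tgt_lens, ratio=1.0):
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in PATTERNS.items():
                if i + di > n or j + dj > m:
                    continue
                cost = best[i][j] + penalty + length_cost(
                    sum(src_lens[i:i + di]), sum(tgt_lens[j:j + dj]), ratio)
                if cost < best[i + di][j + dj]:
                    best[i + di][j + dj] = cost
                    back[i + di][j + dj] = (di, dj)
    # Trace back from the end to recover the groupings.
    groups, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        groups.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return groups[::-1]

# Character lengths of three source and four target sentences.
print(align([50, 100, 40], [52, 45, 58, 41]))
```

Anchor points from step 1 could be used to restrict the search to the region between consecutive anchors, which keeps the table small on long texts.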
Alinea features
• Step 3: Lexical correspondence extraction
  • Extracts word-to-word correspondences (except for words in the stoplist)
  • Requires a large amount of parallel text (>500 000 words) in order to compute reliable statistics
  • Takes into account a combination of clues:
    • word positions
    • cognateness
    • distributions across the training corpus
-> Has obtained over 90% precision and recall on a literary corpus (Kraif & Chen, Coling 2004)
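The distributional clue can be illustrated with a co-occurrence score over aligned sentence pairs. The sketch below uses the Dice coefficient as a stand-in; the scoring actually used by Alinea is described in Kraif & Chen (2004) and is not reproduced here. The function name `dice_scores` and the `min_cooc` filter are assumptions.

```python
from collections import Counter
from itertools import product

def dice_scores(bitext, stoplist=frozenset(), min_cooc=2):
    """Score word pairs by Dice coefficient over aligned sentence pairs."""
    src_freq, tgt_freq, cooc = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        s = set(src.lower().split()) - stoplist
        t = set(tgt.lower().split()) - stoplist
        src_freq.update(s)
        tgt_freq.update(t)
        cooc.update(product(s, t))
    # Drop pairs seen together too rarely to be reliable.
    return {(a, b): 2 * c / (src_freq[a] + tgt_freq[b])
            for (a, b), c in cooc.items() if c >= min_cooc}

bitext = [("le comte arrive", "konti vjen"),
          ("le comte parle", "konti flet"),
          ("le roi parle", "mbreti flet")]
print(dice_scores(bitext, stoplist=frozenset({"le"})))
```

On a toy corpus like this one the counts are far too small to be meaningful, which is why a large parallel corpus is required.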
3 steps
I. Anchor points
II. Full alignment
III. Lexical correspondences
Bi-text browsing and editing
Input / output format
• Input files
  • raw texts (ISO Latin-1, UTF-8)
  • cesAna texts with sentence segmentation
  • XML-tagged texts
  • cesAlign
• Output files
  • KWIC
  • aligned raw texts
  • cesAlign
  • HTML
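As an illustration of the cesAlign format, the sketch below reads sentence links from a small sample. The sample follows the common XCES convention of `link` elements whose `xtargets` attribute holds source and target sentence ids separated by a semicolon; the exact attributes and id scheme that Alinea writes are not given in the slides, so the sample document and the function `read_links` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical cesAlign fragment: one 1-1, one 2-1 and one 0-1 grouping.
SAMPLE = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="fr.xml" toDoc="sq.xml">
    <link xtargets="s1;s1"/>
    <link xtargets="s2 s3;s2"/>
    <link xtargets=";s3"/>
  </linkGrp>
</cesAlign>"""

def read_links(xml_text):
    """Return a list of (source ids, target ids) pairs."""
    root = ET.fromstring(xml_text)
    links = []
    for link in root.iter("link"):
        src, tgt = link.get("xtargets").split(";")
        links.append((src.split(), tgt.split()))
    return links

print(read_links(SAMPLE))
```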
Alinea features
• Bilingual concordancer
  • Implements queries using XML tags and regular expressions at token level
  • Example (using tagged corpora): to search for the verb être as an auxiliary followed by a past participle (French passé composé):
    <base=être,ctag=verb,msd=vaux><>?<ctag=paprt>
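Token-level queries of this kind can be modelled as a sequence of attribute constraints, some of them optional. The sketch below reads the example query as: an auxiliary être, then at most one arbitrary token, then a past participle. This reading of the empty tag followed by "?" is an assumption, and the matcher is an illustration rather than Alinea's query engine.

```python
def token_matches(token, constraints):
    """A token (dict of attributes) satisfies all attribute=value constraints."""
    return all(token.get(key) == value for key, value in constraints.items())

def find(tokens, pattern):
    """Return (start, end) spans matching a list of (constraints, optional) items."""
    def match_at(i, p):
        if p == len(pattern):
            return i
        constraints, optional = pattern[p]
        if i < len(tokens) and token_matches(tokens[i], constraints):
            end = match_at(i + 1, p + 1)
            if end is not None:
                return end
        # An optional item may also be skipped.
        return match_at(i, p + 1) if optional else None

    hits = []
    for start in range(len(tokens)):
        end = match_at(start, 0)
        if end is not None and end > start:
            hits.append((start, end))
    return hits

# Hypothetical encoding of the passé composé query.
QUERY = [({"base": "être", "ctag": "verb", "msd": "vaux"}, False),
         ({}, True),
         ({"ctag": "paprt"}, False)]

sentence = [{"form": "il", "ctag": "pron"},
            {"form": "est", "base": "être", "ctag": "verb", "msd": "vaux"},
            {"form": "souvent", "ctag": "adv"},
            {"form": "venu", "base": "venir", "ctag": "paprt"}]
print(find(sentence, QUERY))
```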
Language-specific knowledge
• Minimal tuning
  • language pair -> average sentence length ratio
• Language-specific knowledge is optional
  • stoplists to eliminate function words and false friends (faux-amis)
  • occurrence/co-occurrence statistics for lexical correspondence extraction
  • forthcoming: bilingual lexicon
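The one parameter that has to be tuned per language pair, the average sentence length ratio, could be estimated from a small hand-aligned sample. The sketch below measures lengths in characters and divides total target length by total source length; both choices, and the name `length_ratio`, are assumptions.

```python
def length_ratio(pairs):
    """Average target/source length ratio over aligned sentence pairs."""
    src_total = sum(len(src) for src, _ in pairs)
    tgt_total = sum(len(tgt) for _, tgt in pairs)
    return tgt_total / src_total

sample = [("le comte arrive", "konti vjen"),
          ("le roi parle", "mbreti flet")]
print(round(length_ratio(sample), 2))
```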
References about Alinea
Kraif O., Chen B. (2004) Combining clues for lexical level aligning using the Null hypothesis approach, in Proceedings of Coling 2004, Geneva, August 2004, pp. 1261-1264.
Kraif O. (2001) Exploitation des cognats dans les systèmes d'alignement bi-textuel : architecture et évaluation, TAL 42:3, ATALA, Paris, pp. 833-867.
Kraif O. (2001) Constitution et exploitation de bi-textes pour l'Aide à la traduction, PhD dissertation, dir. by Henri Zinglé, Université de Nice Sophia Antipolis, http://www.u-grenoble3.fr/kraif
Kraif O. (2000) Evaluation of statistical measures for automatic extraction of French-English bilingual lexicons, in Proceedings of Comlex 2000, Patras, Greece, 22-23 September 2000, pp. 134-144.
Alinea is distributed freely for research purposes. Please contact: kraif@u-grenoble3.fr