Explore how aligning multilingual text fragments can result in efficient text compression, with detailed algorithms, results from European parliament texts, and future research directions.
Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein
Outline • Multilingual text • Problem definition • Multilingual-text alignment • Compression of multilingual texts using alignment • Algorithm • Results • Future work
Multilingual text • Same contents in two or more (natural) languages • Legislative texts of the European Union in all EU languages Subject: Supplies of military equipment to Iraq Objet: Livraisons de matériel militaire à l’Irak
Problem definition • How can multilingual texts be compressed more efficiently relative to compression of each language separately? • Can semantic equivalence be exploited to reduce aggregate corpus size?
Multilingual-text alignment (1) • Mapping of equivalent text fragments to each other • Paragraph/sentence and word/phrase levels • Algorithms for both levels • Tokenization, lemmatization, shallow parsing • Alignment possibly partial
Linear alignment • Given two parallel fragments S and T, the linear alignment of a token t_j in T is the token s_i in S such that: [formula not preserved]
Offset from linear alignment • Signed distance between correct and linear alignments • Usually very small values (mostly [-10, 10])
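The defining formula for linear alignment is missing from the slide text, so the sketch below assumes one common reading: token j of T maps to the token of S at the same relative position, i = floor(j · |S| / |T|). That rule, the function names, and the tokenization of the example (with "l'Irak" as a single token) are assumptions for illustration, not taken from the slides. The offset is then the signed distance between the correct alignment and the linear one.

```python
def linear_alignment(j: int, len_s: int, len_t: int) -> int:
    """Index in S at the same relative position as token j of T.

    Assumed rule (the slide's own formula is not available):
    i = floor(j * |S| / |T|).
    """
    if len_s <= 0 or len_t <= 0:
        raise ValueError("fragments must be non-empty")
    if not 0 <= j < len_t:
        raise ValueError("j must index a token of T")
    return j * len_s // len_t


def offset_from_linear(correct_i: int, j: int, len_s: int, len_t: int) -> int:
    """Signed distance between the correct and the linear alignment."""
    return correct_i - linear_alignment(j, len_s, len_t)


# S: Supplies of military equipment to Iraq
# T: Livraisons de matériel militaire à l'Irak
S = ["Supplies", "of", "military", "equipment", "to", "Iraq"]
T = ["Livraisons", "de", "matériel", "militaire", "à", "l'Irak"]

# "matériel" (j=2) really aligns with "equipment" (i=3),
# "militaire" (j=3) with "military" (i=2).
off_materiel = offset_from_linear(3, 2, len(S), len(T))
off_militaire = offset_from_linear(2, 3, len(S), len(T))
```

Under this assumed rule the word-order swap between "military equipment" and "matériel militaire" yields offsets of +1 and -1, which matches the claim that offsets are usually very small.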
Compression of multilingual texts using alignment: Basic idea (1) • Compress by replacing words/phrases with pointers to their translations within the other text • Original text restored using bilingual dictionary • Store offsets relative to linear alignment • Small values → small number of distinct values → efficient encoding
Compression of multilingual texts using alignment: Basic idea (2) • Store number of words in pointed fragment • Might be a multi-word phrase • bilan → balance sheet • Single pointer may replace multi-word phrase • matériel militaire → pointer to military equipment • chemin de fer → railway
Basic scheme: Example (option 1) • Prefixes: 0 - word, 1 - pointer • 1(offset, length)
Basic scheme: Example (option 2) • matériel militaire → pointer to military equipment • Offset relative to first words
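The basic scheme can be sketched as a small encoder/decoder pair. This is only an illustration under stated assumptions: the linear alignment is taken to be floor(j · |S| / |T|), the decoder is assumed to know the number of tokens in T, the dictionary maps each source word or phrase to exactly one translation (no ambiguity, no morphology), and the alignment is supplied as hypothetical tuples `(j, t_len, i, s_len)` meaning "T[j:j+t_len] translates S[i:i+s_len]". Items follow the slides' prefixes: `(0, word)` for a literal word, `(1, offset, length)` for a pointer.

```python
def linear(j, len_s, len_t):
    # Assumed linear-alignment rule: same relative position in S.
    return j * len_s // len_t


def encode(S, T, alignments):
    """Replace aligned fragments of T with (1, offset, length) pointers."""
    starts = {j: (t_len, i, s_len) for (j, t_len, i, s_len) in alignments}
    items, j = [], 0
    while j < len(T):
        if j in starts:
            t_len, i, s_len = starts[j]
            # Offset is relative to the linear alignment of the first word.
            items.append((1, i - linear(j, len(S), len(T)), s_len))
            j += t_len
        else:
            items.append((0, T[j]))
            j += 1
    return items


def decode(S, items, len_t, dictionary):
    """Restore T from S, the item stream and a bilingual dictionary."""
    T = []
    for item in items:
        if item[0] == 0:
            T.append(item[1])
        else:
            _, offset, length = item
            i = linear(len(T), len(S), len_t) + offset
            T.extend(dictionary[" ".join(S[i:i + length])].split())
    return T


S = ["supplies", "of", "military", "equipment", "to", "iraq"]
T = ["livraisons", "de", "matériel", "militaire", "à", "l'irak"]

# Option 1: one pointer per word (toy dictionary, hypothetical entries).
dict_words = {"equipment": "matériel", "military": "militaire"}
items_opt1 = encode(S, T, [(2, 1, 3, 1), (3, 1, 2, 1)])

# Option 2: a single pointer for the whole phrase.
dict_phrases = {"military equipment": "matériel militaire"}
items_opt2 = encode(S, T, [(2, 2, 2, 2)])
```

With option 1 the two pointers carry offsets +1 and -1; with option 2 a single pointer with offset 0 and length 2 replaces both words, at the cost of needing the phrase in the dictionary.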
Complication: Words with multiple possible translations • Sometimes more than one possible translation per word • equipment → 1. équipement 2. matériel • Must encode correct translation within pointer • Store index of translation
Complication: Morphological variants (1) • Bilingual dictionary must use one morphological form (lemma) • go → aller stands for: {go, went, gone, going} → {aller, vais, vas, va, etc.}
Complication: Morphological variants (2) • Texts include inflected forms • More than one possible lemma (bound → {bind, bound}) ⇒ must indicate correct lemmas for S to enable dictionary lookup • Several variants per lemma ⇒ must indicate correct inflections of translation words to enable restoration of T
Complication: Morphological variants (3) • lower bound • borne inférieure • 1(1,1,0,2,0) 1(-1,1,0,4,1) • borne inférieure • 1(offset, length, lemma(s), translation, variant(s)) • Multiple values for multiple words
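A pointer of the form 1(offset, length, lemma(s), translation, variant(s)) could be resolved roughly as follows. The three dictionaries below are invented for illustration, with entries ordered only so that the index values on the slide (0, 2, 0 and 0, 4, 1) resolve to "borne inférieure"; the real dictionaries and their ordering are not given in the slides. The linear-alignment rule floor(j · |S| / |T|) is likewise an assumption.

```python
# Hypothetical dictionaries; contents and ordering are made up.
LEMMAS = {"bound": ["bound", "bind"], "lower": ["lower"]}
TRANSLATIONS = {
    "bound": ["limite", "lié", "borne"],
    "lower": ["bas", "baisser", "abaisser", "réduire", "inférieur"],
}
VARIANTS = {
    "borne": ["borne", "bornes"],
    "inférieur": ["inférieur", "inférieure", "inférieurs", "inférieures"],
}


def resolve_pointer(S, j, len_t, pointer):
    """Turn one pointer into the target words it stands for.

    pointer = (offset, length, lemma_indices, translation_index,
               variant_indices); j is the position in T being restored.
    """
    offset, length, lemma_idx, trans_idx, variant_idx = pointer
    i = j * len(S) // len_t + offset           # assumed linear alignment
    source_words = S[i:i + length]
    # Pick the correct lemma for each pointed source word.
    lemmas = [LEMMAS[w][k] for w, k in zip(source_words, lemma_idx)]
    # Look up the chosen translation of the (possibly multi-word) lemma.
    target_lemmas = TRANSLATIONS[" ".join(lemmas)][trans_idx].split()
    # Inflect each word of the translation.
    return [VARIANTS[t][k] for t, k in zip(target_lemmas, variant_idx)]


S = ["lower", "bound"]
pointers = [(1, 1, (0,), 2, (0,)), (-1, 1, (0,), 4, (1,))]

restored = []
for p in pointers:
    restored.extend(resolve_pointer(S, len(restored), 2, p))
```

The lemma and variant fields are tuples because a pointer to a multi-word phrase needs one value per word, as the last bullet of the slide notes.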
Optimizations • No encoding for single option • Relevant for all 3 dictionaries • Sort options by descending order of frequencies • Large number of small values → better encoding • Encode length as (length – 1) • length never 0
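The three optimizations are simple enough to sketch directly. The helper names and the frequency counts below are hypothetical; the point is only that frequent options receive small indices, that a field is dropped when the dictionary offers a single option, and that lengths are shifted down by one.

```python
from collections import Counter


def rank_options(options, counts):
    """Order options by descending frequency so common ones get small indices."""
    return sorted(options, key=lambda o: -counts[o])


def index_field(options, chosen):
    """Return the index to store, or nothing when there is only one option."""
    return [] if len(options) == 1 else [options.index(chosen)]


def length_field(length):
    """Lengths are never 0, so store length - 1."""
    if length < 1:
        raise ValueError("a pointer covers at least one word")
    return length - 1


# Made-up counts for the translations of "equipment".
counts = Counter({"matériel": 7, "équipement": 3})
ranked = rank_options(["équipement", "matériel"], counts)

fields_ambiguous = index_field(ranked, "matériel")
fields_single = index_field(["chemin de fer"], "chemin de fer")
stored_length = length_field(2)
```

Since the decoder consults the same dictionaries, it can tell from the entry itself whether an index field is present, so omitting it loses no information.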
Binary encoding (1) • Use 3 Huffman codes • H1: words + pointer prefix • H2: absolute values of offsets • sign bit follows, except for 0 • H3: lengths + indices
Binary encoding (2) • Words: H1(lemma) [H3(variant)] • Pointers: l = length, m = number of words in translation • H1(ptr_prefix) H2(offset) [sign_bit] H3(l – 1) [H3(lemma_0)] … [H3(lemma_(l–1))] [H3(translation)] [H3(variant_0)] … [H3(variant_(m–1))]
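As an illustration of the H2 component, the sketch below builds a Huffman code over the absolute values of a list of offsets and appends a sign bit to every non-zero offset. The offsets are invented sample data, and the sign convention (0 for positive, 1 for negative) is an assumption; the slides only state that a sign bit follows, except for 0.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(freqs):
    """Build a prefix-free code {symbol: bitstring} from symbol frequencies."""
    if len(freqs) == 1:
        return {sym: "0" for sym in freqs}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encode_offset(offset, h2):
    """H2(|offset|) followed by a sign bit, except when the offset is 0."""
    bits = h2[abs(offset)]
    if offset != 0:
        bits += "1" if offset < 0 else "0"  # assumed sign convention
    return bits


# Invented sample: offsets cluster around 0.
offsets = [0, 0, 0, 0, 1, -1, 1, 2, -2, 0]
h2 = huffman_code(Counter(abs(o) for o in offsets))
bitstream = "".join(encode_offset(o, h2) for o in offsets)
```

Coding absolute values keeps the H2 alphabet roughly half the size it would otherwise be, and the extra sign bit is spent only on the non-zero offsets.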
Empirical results • English-French responsa collection of the European Parliament (ARCADE project) • Sizes do not include the codes for HWORD and TRANS, nor the dictionaries for TRANS • Dictionaries exist anyway in large IR systems • Heaps' law: dictionary size is αN^β, where 0.4 ≤ β ≤ 0.6 • For large corpora, size negligible
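The "size negligible" claim follows from Heaps' law growing sublinearly in N. A quick calculation makes this concrete; α = 10 and β = 0.5 are arbitrary illustrative values (the slides give only the range for β), and the result counts dictionary entries per text token rather than bytes.

```python
def heaps_dictionary_size(n_tokens, alpha, beta):
    """Heaps' law estimate of the number of distinct dictionary entries."""
    return alpha * n_tokens ** beta


def dictionary_share(n_tokens, alpha=10.0, beta=0.5):
    """Dictionary entries per text token; alpha and beta are assumed values."""
    return heaps_dictionary_size(n_tokens, alpha, beta) / n_tokens


share_small = dictionary_share(10**6)   # one million tokens
share_large = dictionary_share(10**8)   # one hundred million tokens
```

For any β below 1 the share behaves like αN^(β-1), so it keeps shrinking as the corpus grows.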
Future work • Other test corpora • Other languages • Compress target using lemmatized source • Improve encoding • Bidirectional scheme • Pattern matching within compressed text • Improved model for k languages