Multilingual Text Compression Using Alignment Techniques

Using Alignment for Multilingual-Text Compression Ehud S. Conley and Shmuel T. Klein

Outline • Multilingual text • Problem definition • Multilingual-text alignment • Compression of multilingual texts using alignment • Algorithm • Results • Future work

Multilingual text • Same contents in two or more (natural) languages • Legislative texts of the European Union in all EU languages Subject: Supplies of military equipment to Iraq Objet: Livraisons de matériel militaire à l’Irak

Problem definition • How can multilingual texts be compressed more efficiently relative to compression of each language separately? • Can semantic equivalence be exploited to reduce aggregate corpus size?

Multilingual-text alignment (1) • Mapping of equivalent text fragments to each other • Paragraph/sentence and word/phrase levels • Algorithms for both levels • Tokenization, lemmatization, shallow parsing • Alignment possibly partial

Multilingual-text alignment (2)

Linear alignment • Given two parallel fragments S and T, the linear alignment of a token t_j in T is the token s_i in S such that:

Correct vs. linear alignment

Offset from linear alignment • Signed distance between correct and linear alignments • Usually very small values (mostly [-10, 10])
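The offset computation can be sketched in Python. The formula for the linear alignment is not reproduced in this transcript, so the sketch assumes a simple proportional mapping (position j in T scaled by |S|/|T| and rounded to the nearest position); that mapping and the function names are assumptions made for illustration, not necessarily the authors' definition.

```python
def linear_alignment(j, len_s, len_t):
    """Assumed proportional mapping of 0-based position j in T to a position in S."""
    # Scale j by |S|/|T|, round to nearest, and clamp to the last index of S.
    return min(int(j * len_s / len_t + 0.5), len_s - 1)


def offset_from_linear(correct_i, j, len_s, len_t):
    """Signed distance between the correct alignment and the linear alignment."""
    return correct_i - linear_alignment(j, len_s, len_t)


# S = "lower bound", T = "borne inférieure": the two words are swapped,
# so the correct alignments are one position away from the linear ones.
S = ["lower", "bound"]
T = ["borne", "inférieure"]
offsets = [
    offset_from_linear(1, 0, len(S), len(T)),  # borne -> bound
    offset_from_linear(0, 1, len(S), len(T)),  # inférieure -> lower
]
```

Under this assumed mapping, the two offsets come out as +1 and -1, small values of the kind the slide describes.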

Compression of multilingual texts using alignment: Basic idea (1) • Compress by replacing words/phrases with pointers to their translations within the other text • Original text restored using bilingual dictionary • Store offsets relative to linear alignment • Small values → small number of values → efficient encoding

Compression of multilingual texts using alignment: Basic idea (2) • Store number of words in pointed fragment • Might be a multi-word phrase • bilan → balance sheet • Single pointer may replace multi-word phrase • matériel militaire → pointer to military equipment • chemin de fer → railway

Basic scheme: Example (option 1) • Prefixes: 0 - word, 1 - pointer • 1(offset, length)

Basic scheme: Example (option 2) • matériel militaire → pointer to military equipment • Offset relative to first words
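A minimal Python sketch of the basic scheme with one pointer per phrase, applied to the Subject/Objet example from the earlier slide. The function names, the `phrase_links` input format, and the proportional linear-alignment formula are assumptions made for illustration; the alignment itself is taken as given.

```python
def linear_alignment(j, len_s, len_t):
    """Assumed proportional mapping of 0-based position j in T to a position in S."""
    return min(int(j * len_s / len_t + 0.5), len_s - 1)


def encode_basic(t_tokens, s_tokens, phrase_links):
    """Encode T as (0, word) items and (1, offset, length) pointers into S.

    phrase_links (hypothetical format) maps the index of the first word of an
    aligned phrase in T to (t_len, s_start, s_len).
    """
    out = []
    j = 0
    while j < len(t_tokens):
        if j in phrase_links:
            t_len, s_start, s_len = phrase_links[j]
            # Offset is measured from the linear alignment of the first word.
            offset = s_start - linear_alignment(j, len(s_tokens), len(t_tokens))
            out.append((1, offset, s_len))
            j += t_len  # one pointer replaces the whole phrase in T
        else:
            out.append((0, t_tokens[j]))
            j += 1
    return out


S = ["Supplies", "of", "military", "equipment", "to", "Iraq"]
T = ["Livraisons", "de", "matériel", "militaire", "à", "l’Irak"]
# "matériel militaire" (T[2:4]) is aligned with "military equipment" (S[2:4]).
encoded = encode_basic(T, S, {2: (2, 2, 2)})
```

Here the two-word phrase collapses into the single pointer `(1, 0, 2)`: prefix 1, offset 0 from the linear alignment, and a pointed fragment of two words.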

Complication: Words with multiple possible translations • Sometimes more than one possible translation per word • equipment 1. équipement 2. matériel • Must encode correct translation within pointer • Store index of translation

Complication: Morphological variants (1) • Bilingual dictionary must use one morphological form (lemma) • go → aller stands for: {go, went, gone, going} → {aller, vais, vas, va, etc.}

Complication: Morphological variants (2) • Texts include inflected forms • More than one possible lemma (bound → {bind, bound}) → must indicate correct lemmas for S to enable dictionary lookup • Several variants per lemma → must indicate correct inflections of translation words to enable restoration of T

Complication: Morphological variants (3) • lower bound • borne inférieure • 1(1,1,0,2,0) 1(-1,1,0,4,1) • borne inférieure • 1(offset, length, lemma(s), translation, variant(s)) • Multiple values for multiple words
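The way a decoder might resolve such a pointer can be sketched in Python. The three toy dictionaries below are invented so that the indices in the slide's example resolve to the right words; their contents, the function name, and the proportional linear-alignment formula are all assumptions for illustration.

```python
# Hypothetical toy dictionaries (contents invented for this example only).
LEMMAS = {"bound": ["bound", "bind"], "lower": ["low", "lower"]}
TRANSLATIONS = {
    "bound": ["limite", "lié", "borne"],
    "low": ["bas", "faible", "petit", "grave", "inférieur"],
}
VARIANTS = {
    "borne": ["borne", "bornes"],
    "inférieur": ["inférieur", "inférieure", "inférieurs", "inférieures"],
}


def decode_pointer(pointer, j, s_tokens, t_len):
    """Restore the T words replaced by a pointer found at position j of T."""
    offset, length, lemma_idx, trans_idx, variant_idx = pointer
    # Assumed proportional linear alignment of position j onto S.
    linear = min(int(j * len(s_tokens) / t_len + 0.5), len(s_tokens) - 1)
    start = linear + offset
    fragment = s_tokens[start:start + length]
    # 1. Pick the correct lemma of each pointed source word.
    lemmas = [LEMMAS[word][k] for word, k in zip(fragment, lemma_idx)]
    # 2. Pick the correct translation of the lemmatized source phrase.
    target_lemmas = TRANSLATIONS[" ".join(lemmas)][trans_idx].split()
    # 3. Pick the correct inflected variant of each translation word.
    return [VARIANTS[t][k] for t, k in zip(target_lemmas, variant_idx)]


S = ["lower", "bound"]
restored = (
    decode_pointer((1, 1, [0], 2, [0]), 0, S, 2)
    + decode_pointer((-1, 1, [0], 4, [1]), 1, S, 2)
)
```

With these toy dictionaries, the two pointers 1(1,1,0,2,0) and 1(-1,1,0,4,1) restore "borne" and "inférieure" respectively.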

Optimizations • No encoding for single option • Relevant for all 3 dictionaries • Sort options by descending order of frequencies • Large number of small values → better encoding • Encode length as (length – 1) • length never 0

Binary encoding (1) • Use 3 Huffman codes • H1: words + pointer prefix • H2: absolute values of offsets • sign bit follows, except for 0 • H3: lengths + indices

Binary encoding (2) • Words: H1(lemma) [H3(variant)] • Pointers: l = length, m = (# of words in translation) H1(ptr_prefix) H2(offset) [sign_bit] H3(l – 1) [H3(lemma_0)] … [H3(lemma_(l–1))] [H3(translation)] [H3(variant_0)] … [H3(variant_(m–1))]
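The H2 part of this layout (Huffman code over absolute offsets, followed by a sign bit unless the offset is 0) can be sketched in Python. The Huffman construction is the textbook one; the sample offsets, the function names, and the choice of "1" for a negative sign are assumptions for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(freqs):
    """Build a prefix code {symbol: bitstring} from a frequency table."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codewords with 0 and 1.
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encode_offset(offset, h2):
    """H2(|offset|) followed by a sign bit, which is omitted for offset 0."""
    bits = h2[abs(offset)]
    if offset != 0:
        bits += "1" if offset < 0 else "0"  # sign-bit polarity is an assumption
    return bits


sample_offsets = [0, 0, 0, 0, 1, -1, 1, 2]  # made-up sample, mostly small values
h2 = huffman_code(Counter(abs(o) for o in sample_offsets))
bitstream = "".join(encode_offset(o, h2) for o in sample_offsets)
```

Coding absolute values plus a sign bit keeps the H2 alphabet roughly half the size it would have if signed offsets were coded directly, and the most frequent offset, 0, costs no sign bit at all.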

Empirical results • English-French responsa collection of the European Parliament (ARCADE project) • Sizes do not include codes for HWORD and TRANS; also not dictionaries for TRANS • Dictionaries exist anyway in large IR systems • Heaps' law: dictionary size is αN^β, where 0.4 ≤ β ≤ 0.6 • For large corpora, size negligible
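The Heaps' law argument can be checked with a few lines of Python. The values α = 10 and β = 0.5 are arbitrary illustrative choices within the stated range, not figures from the presentation, and the function names are hypothetical.

```python
def heaps_dictionary_size(n_tokens, alpha=10.0, beta=0.5):
    """Heaps' law estimate of dictionary size: alpha * N**beta."""
    return alpha * n_tokens ** beta


def dictionary_to_corpus_ratio(n_tokens, alpha=10.0, beta=0.5):
    """Dictionary size relative to corpus size N; shrinks as N grows when beta < 1."""
    return heaps_dictionary_size(n_tokens, alpha, beta) / n_tokens


# Ratios for corpora of 10^6, 10^8 and 10^10 tokens (illustrative parameters).
ratios = [dictionary_to_corpus_ratio(n) for n in (10**6, 10**8, 10**10)]
```

Because β < 1, the ratio behaves like αN^(β–1) and falls steadily as the corpus grows, which is the sense in which the dictionary size becomes negligible for large corpora.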

Empirical results (2)

Future work • Other test corpora • Other languages • Compress target using lemmatized source • Improve encoding • Bidirectional scheme • Pattern matching within compressed text • Improved model for k languages
