
Multilingual Text Compression Using Alignment Techniques

Explore how aligning multilingual text fragments enables efficient text compression, with detailed algorithms, results on European Parliament texts, and future research directions.


Presentation Transcript


  1. Using Alignment for Multilingual-Text Compression • Ehud S. Conley and Shmuel T. Klein

  2. Outline • Multilingual text • Problem definition • Multilingual-text alignment • Compression of multilingual texts using alignment • Algorithm • Results • Future work

  3. Multilingual text • Same contents in two or more (natural) languages • Legislative texts of the European Union in all EU languages • Example (English): Subject: Supplies of military equipment to Iraq • Example (French): Objet: Livraisons de matériel militaire à l’Irak

  4. Problem definition • How can multilingual texts be compressed more efficiently relative to compression of each language separately? • Can semantic equivalence be exploited to reduce aggregate corpus size?

  5. Multilingual-text alignment (1) • Mapping of equivalent text fragments to each other • Paragraph/sentence and word/phrase levels • Algorithms for both levels • Tokenization, lemmatization, shallow parsing • Alignment possibly partial

  6. Multilingual-text alignment (2)

  7. Linear alignment • Given two parallel fragments S and T, the linear alignment of a token t_j in T is the token s_i in S such that:

  8. Correct vs. linear alignment

  9. Offset from linear alignment • Signed distance between correct and linear alignments • Usually very small values (mostly [-10, 10])
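
The two definitions above can be sketched in a few lines. The transcript does not preserve the formula for linear alignment, so the proportional mapping used here (position j in T maps to the rounded position j·|S|/|T| in S) is an assumption, and the function names are hypothetical. The toy fragment is the "lower bound" / "borne inférieure" pair that appears later in the talk.

```python
def linear_alignment(j: int, len_s: int, len_t: int) -> int:
    """Index i in S that position j in T maps to.

    ASSUMPTION: a proportional mapping with rounding; the exact formula
    is not preserved in the slide transcript.
    """
    if len_s <= 0 or len_t <= 0:
        raise ValueError("both fragments must be non-empty")
    if not 0 <= j < len_t:
        raise IndexError("j must be a valid position in T")
    # Integer form of round(j * len_s / len_t), rounding halves up.
    i = (2 * j * len_s + len_t) // (2 * len_t)
    return min(i, len_s - 1)


def offset_from_linear(j: int, correct_i: int, len_s: int, len_t: int) -> int:
    """Signed distance between the correct and the linear alignment."""
    return correct_i - linear_alignment(j, len_s, len_t)


source = ["lower", "bound"]
target = ["borne", "inférieure"]
correct = [1, 0]  # borne <-> bound, inférieure <-> lower

offsets = [
    offset_from_linear(j, i, len(source), len(target))
    for j, i in enumerate(correct)
]
print(offsets)  # [1, -1]
```

Under this assumed mapping the offsets come out as 1 and -1, the same values that lead the two pointers in the morphological-variants example of slide 17.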

  10. Compression of multilingual texts using alignment: Basic idea (1) • Compress by replacing words/phrases with pointers to their translations within the other text • Original text restored using bilingual dictionary • Store offsets relative to linear alignment • Small values → small number of values → efficient encoding

  11. Compression of multilingual texts using alignment: Basic idea (2) • Store number of words in pointed fragment • Might be a multi-word phrase • bilan → balance sheet • Single pointer may replace multi-word phrase • matériel militaire → pointer to military equipment • chemin de fer → railway

  12. Basic scheme: Example (option 1) • Prefixes: 0 - word, 1 - pointer • 1(offset, length)

  13. Basic scheme: Example (option 2) • matériel militaire → pointer to military equipment • Offset relative to first words
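
A minimal decoder sketch for the basic scheme, using the "military equipment" example. Several things here are assumptions rather than the authors' implementation: the proportional form of linear alignment, measuring a pointer's offset from the position of the first target word it replaces, storing the length of T alongside the encoded stream, treating "l’Irak" as one token, and the one-entry toy dictionary. Plain Python strings and tuples stand in for the 0/1 prefixes.

```python
from typing import Dict, List, Tuple, Union

# A plain word (prefix 0 on the slide) or a pointer (prefix 1): (offset, length).
Item = Union[str, Tuple[int, int]]


def linear(j: int, len_s: int, len_t: int) -> int:
    """ASSUMED proportional linear alignment of position j in T."""
    return min((2 * j * len_s + len_t) // (2 * len_t), len_s - 1)


def decode(
    source: List[str],
    items: List[Item],
    len_t: int,
    dictionary: Dict[Tuple[str, ...], Tuple[str, ...]],
) -> List[str]:
    """Restore T from S, the word/pointer stream, and a bilingual dictionary."""
    target: List[str] = []
    for item in items:
        if isinstance(item, str):
            target.append(item)  # literal word, copied as is
            continue
        offset, length = item
        # The next target position is len(target); shift its linear alignment.
        start = linear(len(target), len(source), len_t) + offset
        phrase = tuple(source[start:start + length])
        target.extend(dictionary[phrase])  # one pointer may yield several words
    return target


source = ["Supplies", "of", "military", "equipment", "to", "Iraq"]
expected = ["Livraisons", "de", "matériel", "militaire", "à", "l’Irak"]

# Hypothetical toy dictionary with a single phrase entry.
toy_dictionary = {("military", "equipment"): ("matériel", "militaire")}

encoded: List[Item] = ["Livraisons", "de", (0, 2), "à", "l’Irak"]
restored = decode(source, encoded, len(expected), toy_dictionary)
print(restored == expected)  # True
```

Here the pointer (0, 2) replaces two target words with two small integers; the offset is 0 because "military" sits exactly at the linear alignment of "matériel".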

  14. Complication: Words with multiple possible translations • Sometimes more than one possible translation per word • equipment: (1) équipement, (2) matériel • Must encode correct translation within pointer • Store index of translation

  15. Complication: Morphological variants (1) • Bilingual dictionary must use one morphological form (lemma) • go → aller stands for: {go, went, gone, going} → {aller, vais, vas, va, etc.}

  16. Complication: Morphological variants (2) • Texts include inflected forms • More than one possible lemma (bound → {bind, bound}) → must indicate correct lemmas for S to enable dictionary lookup • Several variants per lemma → must indicate correct inflections of translation words to enable restoration of T

  17. Complication: Morphological variants (3) • lower bound • borne inférieure • 1(1,1,0,2,0) 1(-1,1,0,4,1) • borne inférieure • 1(offset, length, lemma(s), translation, variant(s)) • Multiple values for multiple words
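
The extended pointer format can be illustrated by resolving the two pointers from this slide. This is only a sketch: the three toy dictionaries below (lemmas of source words, translations of source lemmas, inflected variants of target lemmas) are invented so that the indices 0, 2, 0 and 0, 4, 1 select the right entries, and the proportional linear alignment is again an assumption. All function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

Pointer = Tuple[int, int, Tuple[int, ...], int, Tuple[int, ...]]
# (offset, length, lemma index per source word, translation index,
#  variant index per translation word)


def linear(j: int, len_s: int, len_t: int) -> int:
    """ASSUMED proportional linear alignment of position j in T."""
    return min((2 * j * len_s + len_t) // (2 * len_t), len_s - 1)


def resolve(
    source: List[str],
    j: int,
    len_t: int,
    pointer: Pointer,
    lemmas: Dict[str, List[str]],
    translations: Dict[Tuple[str, ...], List[Tuple[str, ...]]],
    variants: Dict[str, List[str]],
) -> List[str]:
    """Turn one pointer at target position j back into inflected target words."""
    offset, length, lemma_idx, trans_idx, variant_idx = pointer
    start = linear(j, len(source), len_t) + offset
    fragment = source[start:start + length]
    # Pick the correct lemma of each source word to form the dictionary key.
    key = tuple(lemmas[word][k] for word, k in zip(fragment, lemma_idx))
    # Pick the correct translation, then the correct inflection of each word.
    target_lemmas = translations[key][trans_idx]
    return [variants[t][v] for t, v in zip(target_lemmas, variant_idx)]


# Toy data, invented for illustration only.
lemmas = {"lower": ["low", "lower"], "bound": ["bound", "bind"]}
translations = {
    ("bound",): [("limite",), ("bond",), ("borne",)],
    ("low",): [("bas",), ("faible",), ("petit",), ("grave",), ("inférieur",)],
}
variants = {
    "borne": ["borne", "bornes"],
    "inférieur": ["inférieur", "inférieure", "inférieurs", "inférieures"],
}

source = ["lower", "bound"]
first = resolve(source, 0, 2, (1, 1, (0,), 2, (0,)), lemmas, translations, variants)
second = resolve(source, 1, 2, (-1, 1, (0,), 4, (1,)), lemmas, translations, variants)
print(first + second)  # ['borne', 'inférieure']
```

Tuples are used for the lemma and variant indices because, as the slide notes, a multi-word fragment needs one value per word.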

  18. Optimizations • No encoding for single option • Relevant for all 3 dictionaries • Sort options by descending order of frequencies • Large number of small values → better encoding • Encode length as (length – 1) • length never 0

  19. Binary encoding (1) • Use 3 Huffman codes • H1: words + pointer prefix • H2: absolute values of offsets • sign bit follows, except for 0 • H3: lengths + indices

  20. Binary encoding (2) • Words: H1(lemma) [H3(variant)] • Pointers (l = length, m = number of words in translation): H1(ptr_prefix) H2(offset) [sign_bit] H3(l - 1) [H3(lemma_0)] … [H3(lemma_(l-1))] [H3(translation)] [H3(variant_0)] … [H3(variant_(m-1))]
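
A rough sketch of how the leading fields of a pointer might be serialized with three Huffman codes. It is not the authors' encoder: the Huffman builder is a generic textbook construction, the sign-bit convention (1 for negative) and the `<PTR>` marker are assumptions, the sample frequencies are made up, and the optional lemma, translation, and variant indices are left out for brevity.

```python
import heapq
from collections import Counter
from itertools import count
from typing import Dict, Hashable, Iterable


def huffman_code(symbols: Iterable[Hashable]) -> Dict[Hashable, str]:
    """Build a prefix-free binary code from observed symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        raise ValueError("need at least one symbol")
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in left.items()}
        merged.update({s: "1" + bits for s, bits in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encode_offset(offset: int, h2: Dict[int, str]) -> str:
    """H2(|offset|) followed by a sign bit, except for offset 0."""
    bits = h2[abs(offset)]
    if offset != 0:
        bits += "1" if offset < 0 else "0"  # ASSUMED convention: 1 = negative
    return bits


def encode_pointer(offset: int, length: int, h1, h2, h3, ptr_prefix="<PTR>") -> str:
    """H1(ptr_prefix) H2(offset) [sign_bit] H3(length - 1); indices omitted."""
    return h1[ptr_prefix] + encode_offset(offset, h2) + h3[length - 1]


# Made-up sample statistics, for illustration only.
h1 = huffman_code(["de", "de", "à", "<PTR>", "<PTR>", "<PTR>"])
h2 = huffman_code(abs(o) for o in [0, 0, 0, 1, -1, 0, 2, -1, 0, 1])
h3 = huffman_code([0, 0, 0, 1])  # values of (length - 1) and dictionary indices

print(encode_pointer(0, 2, h1, h2, h3))
print(encode_pointer(-1, 1, h1, h2, h3))
```

Because H2 is built over absolute values, offsets +1 and -1 share one codeword and differ only in the trailing sign bit, which keeps the H2 alphabet small.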

  21. Empirical results • English-French responsa collection of the European Parliament (ARCADE project) • Sizes do not include codes for HWORD and TRANS; also not dictionaries for TRANS • Dictionaries exist anyway in large IR systems • Heaps' law: dictionary size is αN^β, where 0.4 ≤ β ≤ 0.6 • For large corpora, size negligible
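
The Heaps' law argument can be checked with a little arithmetic: if the dictionary grows as αN^β with β below 1, its size relative to the corpus shrinks as αN^(β-1). The sketch below assumes N counts tokens and the dictionary is measured in entries; the values α = 10 and β = 0.5 are illustrative picks, not figures from the talk.

```python
def heaps_dictionary_size(n_tokens: int, alpha: float, beta: float) -> float:
    """Estimated number of dictionary entries for a corpus of n_tokens."""
    return alpha * n_tokens ** beta


ALPHA = 10.0  # hypothetical constant
BETA = 0.5    # inside the 0.4 to 0.6 range quoted on the slide

for n in (10**6, 10**8, 10**10):
    size = heaps_dictionary_size(n, ALPHA, BETA)
    print(f"N = {n:>11}: about {size:,.0f} entries, ratio {size / n:.6f}")
```

With these illustrative parameters, every hundredfold increase in corpus size cuts the dictionary-to-corpus ratio by a factor of ten, which is the sense in which the dictionary becomes negligible for large corpora.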

  22. Empirical results (2)

  23. Future work • Other test corpora • Other languages • Compress target using lemmatized source • Improve encoding • Bidirectional scheme • Pattern matching within compressed text • Improved model for k languages
