Language-Model-Based Text Compression • James Connor, Antoine El Daher
Compressing with Structure • Compression methods • Huffman • Arithmetic • Lempel-Ziv (LZ77, LZ78) • Most popular compression tools are based on LZ77 • Exploiting structure • Our goal: incorporate prior knowledge about the structure of the input sequence
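Since Huffman coding underlies everything that follows, a minimal sketch of building a Huffman code from symbol counts may help. This is illustrative Python, not the authors' code; the name huffman_code and its interface are assumptions.

```python
import heapq
from collections import Counter

def huffman_code(counts):
    """Build a prefix code {symbol: bitstring} from a symbol -> count mapping."""
    # Heap entries are (weight, tiebreaker, tree); a tree is either a leaf
    # symbol or a (left, right) pair. The tiebreaker keeps comparisons valid.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tie, (t1, t2)))
        tie += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):          # internal node
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf symbol
            code[tree] = prefix or "0"       # degenerate one-symbol alphabet
    walk(heap[0][2], "")
    return code

print(huffman_code(Counter("abracadabra")))  # frequent symbols get shorter codes
```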
Perplexity and Entropy • The compression rate is bounded below by the entropy of the sequence to be compressed: $\mathbb{E}[\text{bits per symbol}] \ge H(X) = -\sum_x p(x)\,\log_2 p(x)$ • A low-perplexity language model is also a low-entropy distribution: $PP = 2^{H}$, so $H = \log_2 PP$
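To make the relationship concrete, here is a small sketch (illustrative, not from the slides) that computes the empirical character entropy of a string and the matching perplexity:

```python
import math
from collections import Counter

def entropy_bits(text):
    """Empirical per-character entropy H = -sum p(x) log2 p(x), in bits/char."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

text = "the quick brown fox jumps over the lazy dog"
H = entropy_bits(text)
# Perplexity is 2^H; no lossless code can beat H bits/char on average
# for a memoryless source with this distribution.
print(f"entropy: {H:.3f} bits/char, perplexity: {2 ** H:.2f}")
```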
Character N-grams • Represent text as an nth-order Markov chain of characters • Maintain counts of character n-grams • Build a library of Huffman tables from these counts
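A sketch of the bookkeeping this slide describes, reusing the hypothetical huffman_code helper from the earlier sketch: each (n-1)-character context gets its own counter of successor characters, and hence its own Huffman table.

```python
from collections import Counter, defaultdict

def ngram_context_counts(text, n):
    """Map each (n-1)-character context to a Counter of the characters following it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i : i + n - 1], text[i + n - 1]
        counts[context][nxt] += 1
    return counts

text = "abracadabra abracadabra"
# One Huffman table per context, built from that context's counts
# (huffman_code is the helper sketched earlier).
tables = {ctx: huffman_code(c) for ctx, c in ngram_context_counts(text, 3).items()}
print(tables["ab"])  # codes for characters seen after the context "ab"
```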
Compressing the File • Training • For each bigram in the training set, we keep a map of all the words that can follow it, along with their probabilities. • E.g. “to have” → (“seen”, 0.1), (“been”, 0.1), (UNK, 0.1), etc. • Then for each bigram, we build a Huffman tree.
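One possible shape for this training step (again reusing the hypothetical huffman_code); the UNK pseudo-count of 1 is an assumption, as the slide only says an UNK entry exists:

```python
from collections import Counter, defaultdict

UNK = "<UNK>"

def train_trigram_tables(tokens):
    """For each word bigram, count the words that follow it, add an UNK escape,
    and build one Huffman table per bigram."""
    follow = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        follow[(w1, w2)][w3] += 1
    tables = {}
    for bigram, counts in follow.items():
        counts[UNK] += 1                     # assumed smoothing: reserve an escape code
        tables[bigram] = huffman_code(counts)
    return tables

tokens = "to have seen is to have been there".split()
tables = train_trigram_tables(tokens)
print(tables[("to", "have")])                # e.g. codes for "seen", "been", <UNK>
```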
Compressing the File • Compressing • We go through the input file, using the Huffman trees from the training set to code each word based on the two preceding words. • If the trigram is unknown, we code the UNK token, then revert to a unigram model (also coded using Huffman). • If the unigram is unknown, we use a character-level Huffman code (trained on the training set) to code it. • Decompression works similarly; we mimic the same behavior.
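The backoff logic might look like the sketch below. The escape tokens, the end-of-word marker, and the handling of an entirely unseen bigram context are all assumptions; because each escape is itself a codeword, the decoder can replay exactly the same decisions.

```python
UNK = "<UNK>"
EOW = "\x00"  # assumed end-of-word marker so the character decoder knows where to stop

def encode_word(word, prev2, trigram_tables, unigram_table, char_table):
    """Emit bits for one word, backing off trigram -> unigram -> characters."""
    bits = []
    tri = trigram_tables.get(prev2)          # prev2 = (w[i-2], w[i-1])
    if tri is not None:
        if word in tri:
            return tri[word]                 # trigram hit: a single codeword
        bits.append(tri[UNK])                # escape down to the unigram model
    if word in unigram_table:
        bits.append(unigram_table[word])
        return "".join(bits)
    bits.append(unigram_table[UNK])          # escape again: spell the word out
    for ch in word + EOW:
        bits.append(char_table[ch])          # character-level Huffman code
    return "".join(bits)
```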
Extensions • Sliding context window: while compressing a file, words seen in the window have their counts incremented when they enter and decremented when they leave. • This makes better use of the local context for trigrams/bigrams and gives more representative weights.
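A sketch of that sliding window, assuming a fixed window length; compressor and decompressor must apply identical updates (and rebuild the affected Huffman tables) so they stay in sync.

```python
from collections import Counter, deque

class SlidingTrigramCounts:
    """Trigram counts over the most recent `size` positions: incremented when a
    trigram enters the window, decremented when it leaves."""
    def __init__(self, size=1000):           # window length is an assumption
        self.size = size
        self.window = deque()
        self.counts = Counter()

    def push(self, trigram):
        self.counts[trigram] += 1
        self.window.append(trigram)
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.counts[old] -= 1            # forget trigrams leaving the window
            if self.counts[old] == 0:
                del self.counts[old]
```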
Results • Compression ratios competitive with gzip