Lossless Compression Based on the Sequence Memoizer Jan Gasthaus, Frank Wood, Yee Whye Teh Presented by Yingjian Wang, Jan. 11, 2011
Outline • Background • The sequence memoizer • The PLUMP • Experiment results
Background • 2006, Teh, ‘A hierarchical Bayesian language model based on Pitman-Yor processes’: an N-gram Markov-chain language model with an HPY prior. • 2009, Wood, ‘A Stochastic Memoizer for Sequence Data’: the Sequence Memoizer (SM), with a linear space/time inference scheme (lossless). • 2010, Gasthaus, ‘Lossless compression based on the Sequence Memoizer’: combines the SM with an arithmetic coder to build a compressor (dePLUMP), and introduces an incremental inference algorithm. • 2010, Gasthaus, ‘Improvements to the sequence memoizer’: an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms for the new representation. • 2010, Bartlett, ‘Forgetting Counts: Constant Memory Inference for a Dependent HPY’: constant-memory inference for the SM by using a dependent HPY (with loss).
Sequence memoizer • Non-Markov (unbounded-depth Markov) model: each symbol is predicted from its entire preceding context rather than a fixed-order one. • Hierarchical Pitman-Yor prior: the predictive distributions for contexts of all lengths are tied together through a hierarchy of Pitman-Yor processes (the standard specification is sketched below).
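A hedged reconstruction of the formulas this slide refers to, following the standard Sequence Memoizer specification of Wood et al. (2009) rather than anything copied from the slide:

    x_i | x_{1:i-1} = u      ~  G_u
    G_u | G_{\sigma(u)}      ~  PY( d_{|u|}, 0, G_{\sigma(u)} )    for every non-empty context u
    G_{\varepsilon}          ~  PY( d_0, 0, H )

Here \sigma(u) is the context u with its oldest symbol removed, d_{|u|} is a depth-dependent discount, the concentration parameter is fixed at 0, and H is the base distribution over symbols (uniform over the 256 byte values in the compression setting).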
Sequence memoizer (2) • Linear time/space inference is achieved by collapsing chains of non-branching restaurants, as in a suffix tree (see the sketch below). • Number of restaurants: O(T^2) → O(T).
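A minimal sketch (not the authors' code) of why collapsing helps: build the trie of all contexts (suffixes) of a string, which has O(T^2) nodes in the worst case, then keep only the root, branching nodes, and leaves, as path compression in a suffix tree would. The helper names are illustrative only.

    import random

    def build_suffix_trie(s):
        """Uncompressed trie over all suffixes of s: O(T^2) nodes in the worst case."""
        root = {}
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
        return root

    def count_nodes(node):
        return 1 + sum(count_nodes(child) for child in node.values())

    def count_after_collapsing(node, is_root=True):
        """Nodes surviving path compression: the root, leaves, and branching nodes."""
        kept = 1 if (is_root or len(node) != 1) else 0
        return kept + sum(count_after_collapsing(c, False) for c in node.values())

    s = "".join(random.choice("abcd") for _ in range(200))
    trie = build_suffix_trie(s)
    # far more nodes before collapsing than after (roughly O(T^2) vs. at most 2T)
    print(count_nodes(trie), count_after_collapsing(trie))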
PLUMP (Power Law Unbounded Markov Prediction, www.deplump.com) • The first contribution: a byte-level compressor. Symbol: one byte; vocabulary size: 2^8 = 256; coding: the SM's predictive probabilities drive an entropy (arithmetic) coder, so performance is measured in the average number of bits per symbol.
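To make the coding step concrete: an arithmetic coder only needs p(next byte | context), and its output length approaches the sum of -log2 p(x_i | x_{1:i-1}) bits. The sketch below shows that bits-per-symbol accounting; the SM itself is too large for a short example, so a simple order-0 adaptive byte model stands in for it (the function name and smoothing choice are illustrative, not from the paper).

    import math

    def average_bits_per_symbol(data: bytes) -> float:
        """Ideal arithmetic-coding cost of `data` under a stand-in adaptive byte model."""
        counts = [0.5] * 256              # add-1/2 smoothing over the 256-byte vocabulary
        total = 0.5 * 256
        bits = 0.0
        for b in data:
            p = counts[b] / total         # predictive probability before observing b
            bits += -math.log2(p)         # bits an arithmetic coder would (nearly) spend on b
            counts[b] += 1.0              # online update, in the spirit of incremental inference
            total += 1.0
        return bits / len(data)

    print(average_bits_per_symbol(b"to be or not to be, " * 50))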
PLUMP (2) • The second contribution: a new inference method, 1PF. Because MCMC is computationally intensive, when the i-th observation arrives only the restaurants associated with that observation are updated; this corresponds to sequential Monte Carlo, i.e. a particle filter with a single particle (sketched below). • Two justifications: first, the posterior is unimodal, so the particle filter easily obtains a sample close to the mode, and that single sample is likely to perform well; second, much of the SM's predictive ability comes from the hierarchical sharing of counts, which is already present in a single sample.
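A hedged sketch of the single-particle idea under simplifying assumptions (a fixed truncation depth, one shared discount, concentration 0, and a flat byte base distribution; the real SM uses the collapsed unbounded-depth tree and depth-dependent discounts, and the names cust, tabl, prob, add_customer are purely illustrative). Each observation is seated once in the restaurant of its context, recursing to the parent only when a new table opens, and is never revisited.

    import random
    from collections import defaultdict

    D = 0.5            # single discount value, a simplification for the sketch
    ALPHABET = 256     # byte vocabulary, as in dePLUMP
    MAX_DEPTH = 4      # truncation for the sketch; the SM itself is unbounded-depth

    # per-context restaurant statistics: customer and table counts per dish (symbol)
    cust = defaultdict(lambda: defaultdict(int))
    tabl = defaultdict(lambda: defaultdict(int))

    def prob(u, s):
        """Predictive probability of symbol s in context u (tuple, oldest symbol first)."""
        base = prob(u[1:], s) if u else 1.0 / ALPHABET   # parent context drops the oldest symbol
        c_u = sum(cust[u].values())
        if c_u == 0:
            return base
        t_u = sum(tabl[u].values())
        return (cust[u][s] - D * tabl[u][s] + D * t_u * base) / c_u

    def add_customer(u, s):
        """Seat one customer for dish s in restaurant u; recurse upward on a new table."""
        c_u = sum(cust[u].values())
        t_u = sum(tabl[u].values())
        base = prob(u[1:], s) if u else 1.0 / ALPHABET
        p_share = max(cust[u][s] - D * tabl[u][s], 0.0)  # join an existing table serving s
        p_new = D * t_u * base if c_u > 0 else 1.0       # open a new table
        cust[u][s] += 1
        if random.random() < p_new / (p_share + p_new):
            tabl[u][s] += 1
            if u:                                        # a new table sends one customer to the parent
                add_customer(u[1:], s)

    # the "one-particle filter": a single online pass, predicting then updating
    data = b"abracadabra"
    for i, s in enumerate(data):
        u = tuple(data[max(0, i - MAX_DEPTH):i])
        p = prob(u, s)        # in dePLUMP, this probability is what drives the arithmetic coder
        add_customer(u, s)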
Experiment results Calgary corpus: a well-known compression benchmark consisting of 14 files of different types and varying lengths.
Experiment results (2) • With a larger vocabulary: a Chinese translation of the Bible. • Vocabulary size: 2^16, treating Big-5 encoded Chinese text as 16-bit characters. • Average bits per symbol: 4.35 bits per Chinese character, better than PPMZ's 5.44 bits.