Lossless Compression Based on the Sequence Memoizer Jan Gasthaus, Frank Wood, Yee Whye Teh Presented by Yingjian Wang, Jan. 11, 2011
Outline • Background • The sequence memoizer • The PLUMP • Experiment results
Background • 2006, Teh, ‘A hierarchical Bayesian language model based on Pitman-Yor processes’: an N-gram Markov-chain language model with an HPY prior. • 2009, Wood, ‘A Stochastic Memoizer for Sequence Data’: the Sequence Memoizer (SM), with a linear space/time inference scheme (lossless). • 2010, Gasthaus, ‘Lossless compression based on the Sequence Memoizer’: combines the SM with an arithmetic coder to build a compressor (dePLUMP), and introduces an incremental inference algorithm. • 2010, Gasthaus, ‘Improvements to the sequence memoizer’: an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms for the new representation. • 2010, Bartlett, ‘Forgetting Counts: Constant Memory Inference for a Dependent HPY’: constant-memory inference for the SM by using a dependent HPY (with loss).
Sequence memoizer • Non-Markov (unbounded-depth Markov) model: each symbol is predicted from its entire preceding context rather than a fixed-order one. • Hierarchical Pitman-Yor prior: the predictive distributions for contexts of all lengths are tied together through a hierarchy of Pitman-Yor processes (the standard specification is sketched below).
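A hedged reconstruction of the formulas this slide refers to, following the standard Sequence Memoizer specification of Wood et al. (2009) rather than anything copied from the slide:

    x_i | x_{1:i-1} = u      ~  G_u
    G_u | G_{\sigma(u)}      ~  PY( d_{|u|}, 0, G_{\sigma(u)} )    for every non-empty context u
    G_{\varepsilon}          ~  PY( d_0, 0, H )

Here \sigma(u) is the context u with its oldest symbol removed, d_{|u|} is a depth-dependent discount, the concentration parameter is fixed at 0, and H is the base distribution over symbols (uniform over the 256 byte values in the compression setting).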
Sequence memoizer (2) • Linear time/space inference is achieved by collapsing chains of non-branching restaurants, as in a suffix tree (see the sketch below). • Number of restaurants: O(T^2) → O(T).
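A minimal sketch (not the authors' code) of why collapsing helps: build the trie of all contexts (suffixes) of a string, which has O(T^2) nodes in the worst case, then keep only the root, branching nodes, and leaves, as path compression in a suffix tree would. The helper names are illustrative only.

    import random

    def build_suffix_trie(s):
        """Uncompressed trie over all suffixes of s: O(T^2) nodes in the worst case."""
        root = {}
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
        return root

    def count_nodes(node):
        return 1 + sum(count_nodes(child) for child in node.values())

    def count_after_collapsing(node, is_root=True):
        """Nodes surviving path compression: the root, leaves, and branching nodes."""
        kept = 1 if (is_root or len(node) != 1) else 0
        return kept + sum(count_after_collapsing(c, False) for c in node.values())

    s = "".join(random.choice("abcd") for _ in range(200))
    trie = build_suffix_trie(s)
    # far more nodes before collapsing than after (roughly O(T^2) vs. at most 2T)
    print(count_nodes(trie), count_after_collapsing(trie))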
PLUMP (Power Law Unbounded Markov Prediction, www.deplump.com) • The first contribution: a byte-level compressor. Symbol: one byte; vocabulary size: 2^8 = 256; coding: the SM's predictive probabilities drive an entropy (arithmetic) coder, so performance is measured in the average number of bits per symbol.
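To make the coding step concrete: an arithmetic coder only needs p(next byte | context), and its output length approaches the sum of -log2 p(x_i | x_{1:i-1}) bits. The sketch below shows that bits-per-symbol accounting; the SM itself is too large for a short example, so a simple order-0 adaptive byte model stands in for it (the function name and smoothing choice are illustrative, not from the paper).

    import math

    def average_bits_per_symbol(data: bytes) -> float:
        """Ideal arithmetic-coding cost of `data` under a stand-in adaptive byte model."""
        counts = [0.5] * 256              # add-1/2 smoothing over the 256-byte vocabulary
        total = 0.5 * 256
        bits = 0.0
        for b in data:
            p = counts[b] / total         # predictive probability before observing b
            bits += -math.log2(p)         # bits an arithmetic coder would (nearly) spend on b
            counts[b] += 1.0              # online update, in the spirit of incremental inference
            total += 1.0
        return bits / len(data)

    print(average_bits_per_symbol(b"to be or not to be, " * 50))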
PLUMP (2) • The second contribution: a new inference method, 1PF. Because MCMC is computationally intensive, when the i-th observation arrives only the restaurants associated with that observation are updated; this corresponds to sequential Monte Carlo, i.e. a particle filter with a single particle (sketched below). • Two justifications: first, the posterior is unimodal, so the particle filter easily obtains a sample close to the mode, and that single sample is likely to perform well; second, much of the SM's predictive ability comes from the hierarchical sharing of counts, which is already present in a single sample.
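A hedged sketch of the single-particle idea under simplifying assumptions (a fixed truncation depth, one shared discount, concentration 0, and a flat byte base distribution; the real SM uses the collapsed unbounded-depth tree and depth-dependent discounts, and the names cust, tabl, prob, add_customer are purely illustrative). Each observation is seated once in the restaurant of its context, recursing to the parent only when a new table opens, and is never revisited.

    import random
    from collections import defaultdict

    D = 0.5            # single discount value, a simplification for the sketch
    ALPHABET = 256     # byte vocabulary, as in dePLUMP
    MAX_DEPTH = 4      # truncation for the sketch; the SM itself is unbounded-depth

    # per-context restaurant statistics: customer and table counts per dish (symbol)
    cust = defaultdict(lambda: defaultdict(int))
    tabl = defaultdict(lambda: defaultdict(int))

    def prob(u, s):
        """Predictive probability of symbol s in context u (tuple, oldest symbol first)."""
        base = prob(u[1:], s) if u else 1.0 / ALPHABET   # parent context drops the oldest symbol
        c_u = sum(cust[u].values())
        if c_u == 0:
            return base
        t_u = sum(tabl[u].values())
        return (cust[u][s] - D * tabl[u][s] + D * t_u * base) / c_u

    def add_customer(u, s):
        """Seat one customer for dish s in restaurant u; recurse upward on a new table."""
        c_u = sum(cust[u].values())
        t_u = sum(tabl[u].values())
        base = prob(u[1:], s) if u else 1.0 / ALPHABET
        p_share = max(cust[u][s] - D * tabl[u][s], 0.0)  # join an existing table serving s
        p_new = D * t_u * base if c_u > 0 else 1.0       # open a new table
        cust[u][s] += 1
        if random.random() < p_new / (p_share + p_new):
            tabl[u][s] += 1
            if u:                                        # a new table sends one customer to the parent
                add_customer(u[1:], s)

    # the "one-particle filter": a single online pass, predicting then updating
    data = b"abracadabra"
    for i, s in enumerate(data):
        u = tuple(data[max(0, i - MAX_DEPTH):i])
        p = prob(u, s)        # in dePLUMP, this probability is what drives the arithmetic coder
        add_customer(u, s)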
Experiment results Calgary corpus: a well-known compression benchmark consisting of 14 files of different types and varying lengths.
Experiment results (2) • With a larger vocabulary: a Chinese translation of the Bible. • Vocabulary size: 2^16, treating Big-5 encoded Chinese text as 16-bit characters. • Average bits per symbol: 4.35 bits per Chinese character, better than PPMZ's 5.44 bits.