StarNT: Dictionary-based Fast Transform
Weifeng Sun, wsun@cs.ucf.edu
School of Electrical Engineering and Computer Science, University of Central Florida
M5 research group, University of Central Florida
25 April 2003
Table of Contents
• Preprocessing/Postprocessing Model
• Star Transform
• StarNT Transform
• StarZip
• Domain Specific Text Compression Tool
• Review
Current Text Compression Model
• First-order Entropy Coder
  • Huffman (word, canonical)
  • Arithmetic: arbitrary precision
• Statistical Models
  • PPM (BWT): prediction by context
  • DMC
• Dictionary Models
  • LZ-family: good compression, fast
Preprocessing/Postprocessing Model
[Diagram] Compression: Text File → Preprocessor → Compression Algorithm → Compressed File
[Diagram] Decompression: Compressed File → Decompression Algorithm → Postprocessor → Text File
Goal of Preprocessor
• Accelerate the backend compression algorithm
  • The shorter, the faster
• Backend compressor oriented
  • More “delicious” input
  • Preserve some original context
  • Provide some “artificial” context
• Universal
  • Text transform
StarNT: Transform Paradigm
[Diagram] Compression: Text File → Transform Encoding (with Dictionary) → Transformed File → Compression Algorithm → Compressed File
[Diagram] Decompression: Compressed File → Decompression Algorithm → Transformed File → Transform Decoding (with Dictionary) → Text File
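The two-stage paradigm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six-word dictionary, the "*" plus one lowercase letter codeword format, and the use of zlib as the backend are all hypothetical stand-ins, not the StarNT dictionary, codewords, or the compressors measured in these slides. The sketch also assumes the input never contains a literal "*".

```python
import re
import zlib

# Hypothetical toy dictionary shared by encoder and decoder (at most 26 words).
DICTIONARY = ["the", "to", "is", "a", "long", "example"]
WORD_TO_INDEX = {word: i for i, word in enumerate(DICTIONARY)}


def transform_encode(text):
    """Replace each dictionary word with '*' plus a letter naming its index."""
    def replace(match):
        word = match.group(0)
        index = WORD_TO_INDEX.get(word)
        if index is None:
            return word  # words outside the dictionary pass through unchanged
        return "*" + chr(ord("a") + index)
    return re.sub(r"[A-Za-z]+", replace, text)


def transform_decode(text):
    """Invert transform_encode by looking each codeword up in the dictionary."""
    def restore(match):
        return DICTIONARY[ord(match.group(1)) - ord("a")]
    return re.sub(r"\*([a-z])", restore, text)


def compress(text):
    """Preprocess with the transform, then hand the result to the backend."""
    return zlib.compress(transform_encode(text).encode("utf-8"), 9)


def decompress(blob):
    """Run the backend decompressor, then undo the transform."""
    return transform_decode(zlib.decompress(blob).decode("utf-8"))


sample = "this is a long example to show the pipeline"
assert decompress(compress(sample)) == sample
```

The backend never needs to know about the transform; only the dictionary has to be available on both sides.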
Example: Star-encoding
Input text:
  This is a long example to demonstrate the “substitution” method.
Transform dictionary:
  a            *
  is           **
  to           *a
  the          ***
  long         ****
  this         ***a
  test         ***b
  method       ******
  example      *******
  demonstrate  ***********
Transformed text:
  ***a^ ** * **** ******* *a *********** *** “substitution” ******.
Compressed output:
  100111001100000101011010011100
Lots of compression gain!
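The substitution in this example can be reproduced with a short Python sketch. The dictionary is copied from the slide. Two points are assumptions inferred from the example output rather than stated rules: a trailing "^" marks a word whose first letter was capitalized, and words missing from the dictionary are copied through unchanged. The sketch uses straight quotes and assumes the input contains no literal "*" or "^".

```python
import re

# Dictionary copied from the slide: word -> star codeword.
STAR_DICTIONARY = {
    "a": "*",
    "is": "**",
    "to": "*a",
    "the": "***",
    "long": "****",
    "this": "***a",
    "test": "***b",
    "method": "*" * 6,
    "example": "*" * 7,
    "demonstrate": "*" * 11,
}
CODE_TO_WORD = {code: word for word, code in STAR_DICTIONARY.items()}


def star_encode(text):
    """Substitute dictionary words with their star codewords."""
    def replace(match):
        word = match.group(0)
        code = STAR_DICTIONARY.get(word.lower())
        if code is None:
            return word                      # not in the dictionary
        if word == word.lower():
            return code
        if word == word.lower().capitalize():
            return code + "^"                # assumed capitalization marker
        return word                          # other casings are left alone
    return re.sub(r"[A-Za-z]+", replace, text)


def star_decode(text):
    """Map star codewords back to words, restoring marked capitals."""
    def restore(match):
        word = CODE_TO_WORD.get(match.group(1))
        if word is None:
            return match.group(0)
        return word.capitalize() if match.group(2) else word
    return re.sub(r"(\*[*A-Za-z]*)(\^?)", restore, text)


TEXT = 'This is a long example to demonstrate the "substitution" method.'
ENCODED = star_encode(TEXT)
print(ENCODED)
assert star_decode(ENCODED) == TEXT
```

Because most of the transformed text is the single symbol "*", a backend compressor sees far more repetition than in the original sentence.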
Example: LIPT Transform
Input text:
  This is a long example to demonstrate the “substitution” method.
Transform dictionary:
  a            *a
  is           *bq
  to           *be
  the          *cd
  long         *dfa
  this         *dr
  test         *dB
  method       *fb
  example      *gY
  demonstrate  *key
Transformed text:
  *dr^ *bq *a *dfa *gY *be *key *cd “substitution” *fb.
Compressed output:
  1001110011000001010110
MORE gain!
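In the dictionary on this slide, the character right after "*" appears to encode the word length (a = 1, b = 2, ..., k = 11). The Python check below confirms that pattern for every entry shown. It is only a sketch: the remaining letters of each codeword are treated as opaque, since the slide does not say how they are derived, and the helper names are hypothetical.

```python
import string

# Dictionary copied from the slide: word -> LIPT codeword.
LIPT_DICTIONARY = {
    "a": "*a",
    "is": "*bq",
    "to": "*be",
    "the": "*cd",
    "long": "*dfa",
    "this": "*dr",
    "test": "*dB",
    "method": "*fb",
    "example": "*gY",
    "demonstrate": "*key",
}


def length_letter(word):
    """Return the letter for the word length: 'a' for 1, 'b' for 2, and so on."""
    if not 1 <= len(word) <= 26:
        raise ValueError("word length must be between 1 and 26")
    return string.ascii_lowercase[len(word) - 1]


def follows_length_rule(word, code):
    """True if the codeword is '*' followed by the word's length letter."""
    return len(code) >= 2 and code[0] == "*" and code[1] == length_letter(word)


for word, code in LIPT_DICTIONARY.items():
    print(word, code, follows_length_rule(word, code))
```

Every LIPT codeword here is at most four characters long, whereas the star codeword is always as long as the word it replaces, which is one reason the transformed text shrinks.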
StarNT Transform
• Fast Transform Encoding/Decoding
  • Ternary search tree
• Fast Backend Compression/Decompression
  • Shorter transform output
• Higher Compression Ratio
  • More efficient transform
• StarZip: Multi-corpus Compression Tool
Example: Ternary Search Tree
• Hash table
• Binary tree
• Digital search tries
• Ternary search trees
Searching for a string of length k in a balanced ternary search tree holding n strings requires at most O(log n + k) character comparisons.
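A minimal Python sketch of a ternary search tree used as a word-to-codeword map is given below. It is a generic textbook structure, not the StarNT implementation, and it does no balancing, so the comparison bound above holds only when the insertion order happens to keep the tree balanced. The sample entries reuse the star dictionary from the earlier example.

```python
class Node:
    """One character of a key, with lower/equal/higher child links."""
    __slots__ = ("ch", "lo", "eq", "hi", "value")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.value = None  # set only on the node that ends a stored key


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word, value):
        """Store value under word (word must be non-empty)."""
        if not word:
            raise ValueError("empty keys are not supported")
        self.root = self._insert(self.root, word, 0, value)

    def _insert(self, node, word, i, value):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i, value)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i, value)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1, value)
        else:
            node.value = value
        return node

    def search(self, word):
        """Return the value stored under word, or None if it is absent."""
        node, i = self.root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            elif i + 1 < len(word):
                node, i = node.eq, i + 1
            else:
                return node.value
        return None


tree = TernarySearchTree()
for word, code in [("the", "***"), ("to", "*a"), ("this", "***a"), ("is", "**")]:
    tree.insert(word, code)
print(tree.search("this"), tree.search("th"))
```

Each step compares a single character, so a lookup never hashes or copies the whole word, which suits an encoder that probes the dictionary once per input word.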
StarNT: Efficient Transform
• Maintain some original context, provide new “artificial” context
  • Preserve word frequency information
  • Use word length information
• Index encoding
  • Codeword denotes the index of the word in the dictionary
  • Lightning transform decoding
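Index encoding can be sketched in Python as a bijection between dictionary positions and short letter strings. The 52-letter alphabet (a-z followed by A-Z) and the shortest-codes-first ordering are assumptions made for illustration; the slide states only that the codeword denotes the word's index in the dictionary.

```python
import string

# Assumed codeword alphabet: a-z then A-Z (52 symbols).
ALPHABET = string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def index_to_codeword(index):
    """Map 0 -> 'a', 51 -> 'Z', 52 -> 'aa', and so on (shortest codes first)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    n = index + 1
    chars = []
    while n > 0:
        n, r = divmod(n - 1, BASE)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))


def codeword_to_index(code):
    """Invert index_to_codeword with plain arithmetic, no searching."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch) + 1
    return n - 1


# Decoding is then a direct list lookup in the shared dictionary.
dictionary = ["the", "of", "and", "to"]  # hypothetical, ordered by frequency
print(dictionary[codeword_to_index("c")])
```

If the dictionary is ordered so that frequent words come first, the most common words receive one-letter codewords, and the decoder recovers a word with a few multiplications and one array access.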
StarNT: Fast Backend Compression/Decompression
• Shorter transformed intermediate file
• The meaning of the symbol ‘*’ changed!
StarNT: Compression Performance
  Bzip2 -9 + StarNT:     11.2%
  Gzip -9 + StarNT:      16.4%
  PPMD (k=5) + StarNT:   10.2%
• StarNT is better than LIPT
• bzip2 + StarNT is better than PPMD
  • in time complexity
  • in compression performance
StarNT: Timing Performance (Compared with LIPT)
• Encoding: 76.3%
• Decoding: 84.9%
StarNT: Timing Performance (Compared with Backend Compressor)
            Bzip2 -9    Gzip -9         PPMD (k=5)
  Encoding  28.1%       50.4%           21.2%
  Decoding  18.6%       Some increase   Negligible
StarZip: Domain Specific Dictionary
• Five corpora used (from ibiblio.com)
StarZip: Preliminary Result (Compression Performance)
  Bzip2 -9 + StarZip:     13%
  Gzip -9 + StarZip:      19%
  PPMD (k=5) + StarZip:   10%
Review: Philosophy of Preprocessing/Postprocessing
• Transfom th txt into som intermdiate form whic can b compresed with betr eficency.
• Xploit th natral redndancy of the laguage in makng this tranformaton.