Truly Parallel Burrows-Wheeler Compression and Decompression
James A. Edwards, Uzi Vishkin
University of Maryland
Introduction
• Lossless data compression: a common tool for making better use of memory (e.g., disk space) and network bandwidth.
• Burrows-Wheeler (BW) compression (e.g., bzip2): relatively higher compression ratio (pro) but slower (con).
• Snappy (Google): lower compression ratios but fast. Example: for MPI on large machines, speed is critical.
• Our motivation: fast and high compression ratio.
• Unexpected: prior work unknown to us made the empirical follow-up … stronger.
• Assumption throughout: fixed constant-size alphabet.
State of the field
• Irregular algorithms are prevalent in the CS curriculum and in daily work (open-ended problems/programs).
• Yet they get very limited support on today's parallel hardware, and even less with strong scaling.
• Low support for irregular parallel code in HW → SW developers limit themselves to regular algorithms → HW benchmarks optimize HW for regular code → …
• Namely, parallel data compression is of general interest as an undeniable application representing a big underrepresented “application crowd”.
“Truly Parallel” BW compression
• Existing parallel approach: break the input into blocks and compress the blocks in parallel
  • Practical drawback: good compression and speed only with large input
  • Theory drawback: not really parallel
• Truly parallel: compress the entire input using a parallel algorithm
  • Works for both large and small inputs
  • Can be combined with the block-based approach
• Applications of small inputs:
  • Faster (decompression) and greater compression → better use of main memory [ISCA05] and cache [ISCA12]
  • Warehouse-scale computers: bandwidth between various pairs of nodes can be extremely different; for MPI and MapReduce, low bandwidth between pairs is debilitating [HP 5th ed.] (i.e., Snappy was a solution)
Attempts at truly parallel BW compression
• A 2011 survey paper [Eirola] stipulates that parallelizing BW could hardly work on GPGPU, and that decompression would fall behind further.
  • Portions require “very random memory accessing”
  • “…it seems unlikely that efficient Huffman-tree GPGPU algorithms will be possible.”
• The best GPGPU result: even more painful
  • In 2012, Patel et al. concurrently attempted to develop parallel code for BW compression on GPUs, but their best result was a 2.8X slowdown.
  • Patel reported separately a 1.2X speedup for decompression (hence, not referenced in the SPAA13 version).
Stages of BW compression & decompression
Compression: S → Block-Sorting Transform (BST) → SBST → Move-to-Front (MTF) encoding → SMTF → Huffman encoding → SBW
Decompression: SBW → Huffman decoding → SMTF → Move-to-Front (MTF) decoding → SBST → Inverse Block-Sorting Transform (IBST) → S
Inverse Block-Sorting Transform
• Serial algorithm:
  • Sort the characters of SBST; the sorted order T[i] forms a ring i → T[i]
  • Starting with $, traverse the ring to recover S
• Parallel algorithm:
  • Use parallel integer sorting to find T[i]
  • Use parallel list ranking to traverse the ring
  • Both steps require O(log n) time and O(n) work
• On current parallel HW list ranking gets you – why we chose this step
Example: linked ring i → T[i] for SBST = annb$aa, listed in traversal order starting from $:
i        4  0  1  5  2  6  3  (END)
rank[i]  6  5  4  3  2  1  0
SBST[i]  $  a  n  a  n  a  b
Read right to left, the SBST[i] row spells S = banana$.
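The inverse transform can be sketched in Python as follows. This is a sequential simulation under stated assumptions, not the authors' implementation: `inverse_bst` is a hypothetical name, the end marker `$` is assumed to be unique and smaller than every other character, and list ranking is done with simple pointer jumping, which takes O(log n) rounds but O(n log n) total work (the work-optimal O(n) list-ranking algorithm is more involved). Each round of the `while` loop corresponds to one parallel step.

```python
def inverse_bst(s_bst, end="$"):
    """Recover S from its block-sorting transform (illustrative sketch)."""
    n = len(s_bst)
    # Stable sort of positions by character; T[i] is the rank of position i.
    order = sorted(range(n), key=lambda i: s_bst[i])
    T = [0] * n
    for rank_in_sorted, i in enumerate(order):
        T[i] = rank_in_sorted
    start = s_bst.index(end)
    # Cut the ring i -> T[i] just before it returns to the start node,
    # which turns it into a linked list beginning at `start`.
    nxt = [None if T[i] == start else T[i] for i in range(n)]
    rank = [0 if nxt[i] is None else 1 for i in range(n)]
    # Pointer jumping: every node doubles the distance it has covered.
    while any(p is not None for p in nxt):
        new_rank = [rank[i] + (rank[nxt[i]] if nxt[i] is not None else 0)
                    for i in range(n)]
        new_nxt = [nxt[nxt[i]] if nxt[i] is not None else None
                   for i in range(n)]
        rank, nxt = new_rank, new_nxt
    # rank[i] is the distance from i to the end of the list; the ring spells
    # S from right to left, so that distance is the position of s_bst[i] in S.
    out = [None] * n
    for i in range(n):
        out[rank[i]] = s_bst[i]
    return "".join(out)
```

For the slide's example, `inverse_bst("annb$aa")` yields `banana$`, and the start node (i = 4) receives rank 6, matching the table above.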
Conclusion and where to go from here?
• Despite being originally described as a serial algorithm, BW compression can be accomplished by a parallel algorithm.
• Material for a few good exercises on prefix sums & list ranking?
• For a more detailed description of our algorithm, see reference [4] in our brief announcement.
• This algorithm demonstrates the importance of parallel primitives such as prefix sums and list ranking. It requires support for fine-grained, irregular parallelism and sometimes also strong scaling: issues on all current parallel hardware. Indeed:
  • Recent work from UC Davis (2012) on parallel BW compression on GPUs, which we missed, taxed ~20% of our originality (same Step 2),
  • but it failed to achieve any speedup on compression: instead, a slowdown of 2.8X. For decompression: 1.2X speedup.
  • On the UMD experimental Explicit Multi-Threading (XMT) architecture, we achieved speedups of 25X for compression and 13X for decompression [5]. On balance, the UC Davis paper was a huge gift: 70X vs. GPU for compression and 11X for decompression.
Where to go from here?
• Remaining options for the community:
  • Figure out how to do it on current HW
  • Or, bash PRAM
  • Or, the alternative we pursued: develop a parallel algorithm that will work well on buildable HW designed to support the best-established parallel algorithmic theory
Final thought, connecting to several other SPAA presentations
• This is an example where MPI on large systems works in tandem with PRAM-like support on small systems.
• Intra-node (of a large system): use PRAM compression & decompression algorithms for inter-node MPI messages
• Counter-argument to an often unstated position: that we need the same parallel programming model at very large and small scales
References
• [4] J. A. Edwards and U. Vishkin. Parallel algorithms for Burrows-Wheeler compression and decompression. TR, UMD, 2012. http://hdl.handle.net/1903/13299
• [5] J. A. Edwards and U. Vishkin. Empirical speedup study of truly parallel data compression. TR, UMD, 2013. http://hdl.handle.net/1903/13890
Block-Sorting Transform (BST)
• Goal: bring occurrences of characters together
• Serial algorithm:
  • Form a list of all rotations of the input string
  • Sort the list lexicographically
  • Take the last column of the list as output
• Equivalent to sorting the suffixes of the input string
Example:
Input to BST: banana$
List of rotations: banana$, anana$b, nana$ba, ana$ban, na$bana, a$banan, $banana
Sorted list: $banana, a$banan, ana$ban, anana$b, banana$, na$bana, nana$ba
Output of BST (last column): annb$aa
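The serial algorithm can be written almost word for word in Python. This is a minimal sketch for small inputs (`bst_by_rotations` is a hypothetical name); it materializes all n rotations and therefore needs quadratic space, which a practical compressor would avoid.

```python
def bst_by_rotations(s):
    """Block-sorting transform by explicitly sorting all rotations of s."""
    # Form the list of all rotations of the input string.
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    # Sort the list lexicographically ('$' sorts before letters in ASCII).
    rotations.sort()
    # The output is the last column of the sorted list.
    return "".join(rotation[-1] for rotation in rotations)
```

Applied to the slide's input, `bst_by_rotations("banana$")` returns `annb$aa`.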
Block-Sorting Transform (BST)
• Parallel algorithm:
  • Find the suffix tree of S (O(log² n) time, O(n) work)
  • Find the suffix array SA of S by traversing the suffix tree (Euler tour technique: O(log n) time, O(n) work)
  • Permute characters according to SA (O(1) time, O(n) work)
[Figure: suffix tree of banana$, with each leaf labeled by the starting position (0 to 6) of its suffix]
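The final permutation step is easy to illustrate once a suffix array is available. In the sketch below the suffix array is built by naively sorting suffixes, which is only a stand-in for the suffix-tree construction and Euler tour named on the slide; `suffix_array` and `bst_from_suffix_array` are hypothetical names, and the input is assumed to end with a unique smallest marker `$`.

```python
def suffix_array(s):
    """Naive suffix array: start positions of the suffixes in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bst_from_suffix_array(s):
    """Permute characters according to SA: output[i] = s[SA[i] - 1]."""
    sa = suffix_array(s)
    # Every output character depends on one SA entry only, so all n
    # characters could be written independently (O(1) time, O(n) work).
    # The index -1 wraps around to the end marker for the suffix at 0.
    return "".join(s[i - 1] for i in sa)
```

For `banana$` the suffix array is `[6, 5, 3, 1, 0, 4, 2]` and the permuted output is `annb$aa`, the same string the rotation-based serial algorithm produces.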
Move-to-Front (MTF) encoding
• Goal: assign low codes to repeated characters
• Serial algorithm: maintain a list of characters in order last seen
• Parallel algorithm: use prefix sums to compute the MTF list for each character (O(log n) time, O(n) work)
  • Associative binary operator: X + Y = Y concat (X - Y)
Example (assumed prefix list L0 = $, a, b, n):
i        0        1        2        3
SBST[i]  a        n        n        b
Li       $,a,b,n  a,$,b,n  n,a,$,b  n,a,$,b
SMTF[i]  1        3        0        3
[Figure: tree of partial MTF lists combined by prefix sums over the assumed prefix followed by SBST]
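The prefix-sums formulation can be sketched as follows. The names `mtf_combine` and `mtf_encode` are hypothetical, and `itertools.accumulate` (with `initial`, available from Python 3.8) evaluates the prefix sums sequentially; because the operator is associative, the same sums could be computed by any parallel scan. Lists never grow beyond the alphabet size, so each combine step is constant time under the fixed-alphabet assumption.

```python
from itertools import accumulate


def mtf_combine(x, y):
    """X + Y = Y concat (X - Y): y's characters, then those of x not in y."""
    return y + [c for c in x if c not in y]


def mtf_encode(s_bst, initial_list):
    """MTF-encode s_bst, starting from the assumed prefix list."""
    # Exclusive prefix sums: lists[i] is the MTF list just before position i.
    lists = list(accumulate(([c] for c in s_bst), mtf_combine,
                            initial=list(initial_list)))
    # Each code is the position of the character in its own MTF list.
    return [lists[i].index(c) for i, c in enumerate(s_bst)]
```

With the slide's data, `mtf_encode("annb", "$abn")` gives `[1, 3, 0, 3]`.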
Move-to-Front (MTF) decoding
• Same algorithm as encoding, with the following changes:
  • Serial: the MTF lists are used in reverse
  • Parallel: instead of combining MTF lists, combine permutation functions
Example (SMTF = 1 3 0 3): each code corresponds to a permutation function. Combining the permutation for code 1 (0→1, 1→0, 2→2, 3→3) with the permutation for code 3 (0→3, 1→0, 2→1, 3→2) gives (0→3, 1→1, 2→0, 3→2).
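A sketch of the permutation-based decoder, under the assumption that the permutation for code k maps slot j of the new MTF list to the slot of the old list it is taken from (so slot 0 takes old slot k). The names are hypothetical, and `itertools.accumulate` again stands in for a parallel prefix-sums routine over the associative `compose` operator.

```python
from itertools import accumulate


def code_to_perm(k, size):
    """Permutation for code k: new slot 0 takes old slot k, the rest shift."""
    return [k] + [j for j in range(size) if j != k]


def compose(p, q):
    """Apply p first, then q: slot j of the result reads old slot p[q[j]]."""
    return [p[j] for j in q]


def mtf_decode(codes, initial_list):
    """Decode MTF codes by combining permutation functions."""
    size = len(initial_list)
    perms = [code_to_perm(k, size) for k in codes]
    # Exclusive prefix sums: prefix[i] describes the MTF list just before
    # position i as a permutation of the initial list.
    prefix = list(accumulate(perms, compose, initial=list(range(size))))
    return "".join(initial_list[prefix[i][k]] for i, k in enumerate(codes))
```

Here `compose([1, 0, 2, 3], [3, 0, 1, 2])` equals `[3, 1, 0, 2]`, the combination shown in the slide's example, and `mtf_decode([1, 3, 0, 3], "$abn")` recovers `annb`.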
Huffman Encoding
• Goal: assign shorter bit strings to more-frequent MTF codes
• The parallelization of this step is already well known
• Serial algorithm:
  • Count frequencies of characters
  • Build the Huffman table based on the frequencies
  • Encode characters using the table
• Parallel algorithm:
  • Use integer sorting to count frequencies (O(log n) time, O(n) work)
  • Build the Huffman table using the (standard, heap-based) serial algorithm (O(1) time and work, since the alphabet has constant size)
  • (a) Compute the prefix sums of the code lengths to determine where in the output to write the code for each character (O(log n) time, O(n) work)
  • (b) Actually write the output (O(1) time, O(n) work)
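Steps (a) and (b) can be illustrated with a short Python sketch. It is not the authors' code: `huffman_table` and `huffman_encode` are hypothetical names, `collections.Counter` stands in for counting by integer sorting, and the prefix sums are computed sequentially with `itertools.accumulate`. The point is that once the offsets are known, every codeword is written to its own disjoint slice of the output, so the writes are independent.

```python
import heapq
from collections import Counter
from itertools import accumulate


def huffman_table(freqs):
    """Standard heap-based Huffman construction; returns {symbol: bits}."""
    # The integer tiebreaker keeps the heap from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in freqs}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def huffman_encode(symbols):
    """Encode symbols; returns (bit string, table)."""
    table = huffman_table(Counter(symbols))
    lengths = [len(table[s]) for s in symbols]
    # (a) Exclusive prefix sums of the code lengths give each write offset.
    offsets = list(accumulate(lengths, initial=0))
    out = [None] * offsets[-1]
    # (b) Each symbol writes to its own slice; iterations are independent.
    for sym, off in zip(symbols, offsets):
        code = table[sym]
        out[off:off + len(code)] = code
    return "".join(out), table
```

A quick sanity check is that the result equals the plain concatenation of the codewords, for example for the MTF codes `[1, 3, 0, 3, 3, 3, 0]`.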
Huffman Decoding
• Serial algorithm: read through the compressed data, decoding one character at a time
• Parallel algorithm: partition the input and apply the serial algorithm to each partition
  • Problem: decoding cannot start in the middle of the codeword for a character
  • Solution: identify a set of valid starting bits using prefix sums (O(log n) time, O(n) work)
[Figure: example compressed bit string 1 1 0 1 0 0 0 0 1 0]
Huffman Decoding
• How to identify valid starting positions:
  • Divide the input string into partitions of length l (the length of the longest Huffman codeword)
  • Assign a processor to each bit in the input. Processor i decodes the compressed input starting at index i and stops when it crosses a partition boundary, recording the index where it stopped. (O(1) time, O(n) work)
  • Now each partition has l pointers entering it, all of which originate from the immediately preceding partition.
  • Use prefix sums to merge consecutive pointers. (O(log n) time, O(n) work)
  • Now each partition still has l pointers entering it, but they all originate from the first partition.
  • For each bit in the input, mark it as a valid starting position if and only if the pointer that points to that bit originates from the first bit (index 0) of the first partition (O(1) time, O(n) work)
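The steps above can be sketched in Python as a sequential simulation. All names are hypothetical, the table is assumed to map symbols to prefix-free bit strings, and a decoding path that runs into a truncated codeword is sent to the sentinel index n. Each list comprehension or loop iteration plays the role of one processor, and `itertools.accumulate` stands in for a parallel prefix-sums routine over the associative `merge` operator.

```python
from itertools import accumulate


def codeword_end(bits, pos, inv, l):
    """Index just past the codeword starting at pos, or None if truncated."""
    for length in range(1, l + 1):
        if bits[pos:pos + length] in inv:
            return pos + length
    return None


def jump(bits, i, inv, l):
    """Decode from bit i until the partition boundary is crossed."""
    n, block, pos = len(bits), i // l, i
    while pos < n and pos // l == block:
        end = codeword_end(bits, pos, inv, l)
        if end is None:
            return n  # sentinel: this path hit a truncated codeword
        pos = end
    return pos


def valid_starts(bits, table):
    """One valid starting bit per partition, reached from index 0."""
    inv = {code: sym for sym, code in table.items()}
    l = max(len(code) for code in table.values())
    n = len(bits)
    blocks = (n + l - 1) // l
    # f[p][k]: where decoding from bit p*l + k lands in partition p + 1.
    f = [[jump(bits, p * l + k, inv, l) if p * l + k < n else n
          for k in range(l)] for p in range(blocks)]

    def merge(x, y):
        # Follow the pointers in x, then the pointers in y.
        return [y[t % l] if t < n else n for t in x]

    merged = list(accumulate(f, merge))
    # Keep the pointers that originate from bit 0 of the first partition.
    return [0] + [m[0] for m in merged if m[0] < n]


def parallel_huffman_decode(bits, table):
    """Decode each partition independently from its valid starting bit."""
    inv = {code: sym for sym, code in table.items()}
    l = max(len(code) for code in table.values())
    out = []
    for start in valid_starts(bits, table):  # iterations are independent
        block, pos = start // l, start
        while pos < len(bits) and pos // l == block:
            end = codeword_end(bits, pos, inv, l)
            out.append(inv[bits[pos:end]])
            pos = end
    return out
```

As an illustration with a hypothetical table `{1: "00", 0: "01", 3: "1"}` and input `0010111101` (l = 2, five partitions), the valid starting bits are 0, 2, 5, 6, and 8, and the decoded symbols are `[1, 3, 0, 3, 3, 3, 0]`.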
Lossless data compression on GPGPU architectures (2011)
• Inverse BST: “Problems would possibly arise from poor GPU performance of the very random memory accessing caused by the scattering of characters throughout the string.”
• MTF decoding: “Speeding up decoding on GPGPU platforms might be more challenging since the character lookup is already constant time on serial implementations, and starting decoding from multiple places is difficult since the state of the stack is not known at the other places.”
• Huffman decoding: “Here again, decompression is harder. This is due to the fact that the decoder doesn’t know where one codeword ends and another begins before it has decoded the whole prior input.”
• “As for the codeword tables for the VLE, it seems unlikely that efficient Huffman-tree GPGPU algorithms will be possible.”