Gzip Compression and Decompression
• 1. Gzip file format
• 2. Gzip compression algorithm
    . LZ77 algorithm
    . Dynamic Huffman coding algorithm
• 3. Gzip decompression algorithm
• 4. Other methods of data compression and open questions
Gzip file format
• A gzip file consists of a series of "members". The members simply appear one after another in the file, with no additional information before, between, or after them.
• Member format. Each member has the following format (boxes drawn with = hold a variable number of bytes):

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more->)
+---+---+---+---+---+---+---+---+---+---+

if FLG.FEXTRA set
+---+---+=================================+
| XLEN  |...XLEN bytes of "extra field"...| (more->)
+---+---+=================================+

if FLG.FNAME set
+=========================================+
|...original file name, zero-terminated...| (more->)
+=========================================+

if FLG.FCOMMENT set
+===================================+
|...file comment, zero-terminated...| (more->)
+===================================+

if FLG.FHCRC set
+---+---+
| CRC16 |
+---+---+

+=======================+
|...compressed blocks...| (more->)
+=======================+

+---+---+---+---+---+---+---+---+
|     CRC32     |     ISIZE     |
+---+---+---+---+---+---+---+---+

ID1 = 31, ID2 = 139: these two bytes identify the file as being in gzip format.
CM (compression method): identifies the compression method used in the file. CM = 0-7 are reserved. CM = 8 denotes the "deflate" compression method, which is the one customarily used by gzip and which is documented elsewhere.
FLG (flags):
    bit 0  FTEXT
    bit 1  FHCRC
    bit 2  FEXTRA
    bit 3  FNAME
    bit 4  FCOMMENT
    bits 5-7 reserved
CRC32: CRC-32 checksum of the uncompressed data.
ISIZE: size of the original (uncompressed) data mod 2^32.
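The member layout above can be read back with a few lines of Python. This is only a minimal sketch: the helper name `parse_gzip_member_header` and the keys of the returned dictionary are made up for illustration, and the trailer is read from the last 8 bytes, which assumes the input holds exactly one member.

```python
import gzip
import struct

# Bit masks for the FLG byte, in the bit order listed above.
FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 0x01, 0x02, 0x04, 0x08, 0x10


def parse_gzip_member_header(data: bytes) -> dict:
    """Parse the header and trailer of a single-member gzip file (illustrative helper)."""
    # Fixed 10-byte part: ID1, ID2, CM, FLG, MTIME (4 bytes, little-endian), XFL, OS.
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack_from("<BBBBIBB", data, 0)
    if (id1, id2) != (31, 139):
        raise ValueError("not a gzip member")
    header = {"cm": cm, "flg": flg, "mtime": mtime, "xfl": xfl, "os": os_code}
    pos = 10

    if flg & FEXTRA:                       # 2-byte XLEN, then XLEN bytes
        (xlen,) = struct.unpack_from("<H", data, pos)
        header["extra"] = data[pos + 2 : pos + 2 + xlen]
        pos += 2 + xlen
    if flg & FNAME:                        # zero-terminated file name
        end = data.index(b"\x00", pos)
        header["name"] = data[pos:end].decode("latin-1")
        pos = end + 1
    if flg & FCOMMENT:                     # zero-terminated comment
        end = data.index(b"\x00", pos)
        header["comment"] = data[pos:end].decode("latin-1")
        pos = end + 1
    if flg & FHCRC:                        # 2-byte header CRC
        (header["crc16"],) = struct.unpack_from("<H", data, pos)
        pos += 2

    header["data_offset"] = pos            # where the compressed blocks start
    # Trailer: CRC32 and the uncompressed size mod 2^32 (assumes one member).
    header["crc32"], header["isize"] = struct.unpack("<II", data[-8:])
    return header


if __name__ == "__main__":
    info = parse_gzip_member_header(gzip.compress(b"hello, gzip"))
    print(info["cm"], info["isize"])       # prints: 8 11
```

The bytes between `data_offset` and the 8-byte trailer are the raw deflate stream, so they can be handed to a raw inflater such as `zlib.decompress(raw, -15)`.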
2. Gzip compression algorithm

Introduction
Gzip combines the LZ77 algorithm and dynamic Huffman coding to compress data. It first compresses the data with LZ77 and then compresses the LZ77 output with dynamic Huffman coding.

2.1 LZ77 compression algorithm

Terms used in the algorithm:
. input stream: the sequence of characters to be compressed.
. character: the basic element of the input stream.
. coding position: the position in the input stream currently being coded (the beginning of the lookahead buffer).
. lookahead buffer: the character sequence from the coding position to the end of the input stream.
. window: the w characters immediately before the coding position, i.e. the last w characters processed.
. pointer: a reference to a match in the window, which also specifies the length of the match.

The principle of encoding
The algorithm searches the window for the longest match with the beginning of the lookahead buffer and outputs a pointer to that match. When a match is found, the data pair <offset, length> takes the place of the matched characters.
Offset: the position of the beginning of the match, counted from the window's left bound. This is the convention used in the example below; an equivalent convention counts the distance back from the coding position to the beginning of the match.
Length: the length of the match.

The encoding algorithm
step1: set the coding position to the beginning of the input stream.
step2: if the coding position is not at the end of the input stream, search the window for the longest match with the lookahead buffer; otherwise the algorithm terminates.
step3: if a match is found, output (off, length, c), where c is the character following the match in the lookahead buffer; move the coding position and the window length+1 bytes forward and go to step2. Otherwise go to step4.
step4: output the current character at the coding position, move the coding position and the window 1 byte forward, and go to step2.

The following example illustrates the algorithm. Assume the window size is 10, the window content is "abcdbbccaa", and the string to be coded is "abaeaaabaee". The encoding proceeds as follows:
step1: the longest match between the string and the window is "ab" at offset 0, followed by 'a'. Output (0,2,a); the window and the coding position move forward 3 bytes.
step2: the character at the coding position is 'e'. The window content is "dbbccaaaba", which contains no 'e', so output 'e'. The window and the coding position move forward 1 byte.
step3: the window content is "bbccaaabae" and the lookahead buffer is "aaabaee". The longest match is "aaabae" at offset 4, followed by 'e'. Output (4,6,e). The input is now exhausted and the algorithm terminates.

A practical implementation has to deal with many further details; refer to the gzip source code and documentation.
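The four encoding steps and the worked example can be reproduced with a short Python sketch. It follows the conventions of the example (offset counted from the window's left bound, matches confined to the window, output of either a triple or a single character); the function names and the `history` parameter, which pre-fills the window, are assumptions made for illustration and are not taken from the gzip source.

```python
def lz77_encode(data, window_size=10, history=""):
    """Encode `data` into (offset, length, next_char) triples and literal characters."""
    text = history + data
    pos = len(history)                        # step1: coding position
    out = []
    while pos < len(text):                    # step2: stop at end of input
        start = max(0, pos - window_size)     # left bound of the window
        window = text[start:pos]
        lookahead = text[pos:]
        max_len = len(lookahead) - 1          # keep one character for c
        best_off, best_len = 0, 0
        for off in range(len(window)):
            length = 0
            while (length < max_len
                   and off + length < len(window)
                   and window[off + length] == lookahead[length]):
                length += 1
            if length > best_len:             # first longest match wins
                best_off, best_len = off, length
        if best_len > 0:                      # step3: emit (off, length, c)
            out.append((best_off, best_len, lookahead[best_len]))
            pos += best_len + 1
        else:                                 # step4: emit a literal
            out.append(lookahead[0])
            pos += 1
    return out


def lz77_decode(tokens, window_size=10, history=""):
    """Invert lz77_encode, given the same window size and initial history."""
    text = history
    for tok in tokens:
        if isinstance(tok, tuple):
            off, length, c = tok
            start = max(0, len(text) - window_size)
            text += text[start + off : start + off + length] + c
        else:
            text += tok
    return text[len(history):]


if __name__ == "__main__":
    tokens = lz77_encode("abaeaaabaee", window_size=10, history="abcdbbccaa")
    print(tokens)   # prints: [(0, 2, 'a'), 'e', (4, 6, 'e')]
```

The brute-force search over every window offset is chosen for readability only; it is quadratic in the window size and is not how a production compressor finds matches.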
2.2 Dynamic Huffman coding

Static Huffman coding algorithm: given a set of characters and their frequencies, the Huffman algorithm assigns a code to each character.

Dynamic Huffman coding builds the Huffman tree incrementally while the input is being read. Neither the characters nor their frequencies are known in advance. The following example illustrates the process:

String: TENNESSEE

While the tree is being built, one rule must be obeyed: maintain the sibling property. A tree has the sibling property if each node (except the root) has a sibling and if the nodes can be numbered in order of nondecreasing weight with each node adjacent to its sibling. Moreover, the parent of a node is higher in the numbering.
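The "Order:" lists on the slides that follow give the node weights in numbering order, so the nondecreasing-weight half of the sibling property can be checked mechanically. The sketch below does only that; it assumes the nodes are already numbered so that siblings are adjacent, and the function name `first_violation` is made up for illustration.

```python
def first_violation(weights):
    """Return the index of the first node heavier than its successor, or None.

    `weights` lists the node weights in numbering order (root excluded),
    exactly as in the "Order:" lines of the slides.
    """
    for i in range(len(weights) - 1):
        if weights[i] > weights[i + 1]:
            return i
    return None


if __name__ == "__main__":
    # Stage 3 order 0, n(1), 1, e(1), 2, t(1): the node of weight 2 is out of place.
    print(first_violation([0, 1, 1, 1, 2, 1]))   # prints: 4
    # After the reorder, 0, n(1), 1, e(1), t(1), 2, the property holds.
    print(first_violation([0, 1, 1, 1, 1, 2]))   # prints: None
```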
T
• Stage 1 (First occurrence of t)

        r #9
       /    \
   0 #7      t(1) #8

• Order: 0, t(1)
* r represents the root
* 0 represents the null node
* t(1) denotes the occurrence of T with a frequency of 1
* #n is the number of the node in the ordering
TE
• Stage 2 (First occurrence of e)

            r #9
           /    \
       1 #7      t(1) #8
      /    \
  0 #5      e(1) #6

• Order: 0, e(1), 1, t(1)
TEN
• Stage 3 (First occurrence of n)

                r #9
               /    \
           2 #7      t(1) #8
          /    \
      1 #5      e(1) #6
     /    \
 0 #3      n(1) #4

• Order: 0, n(1), 1, e(1), 2, t(1)
• This is not a Huffman tree: the node of weight 2 (#7) precedes the lighter t(1) (#8) in the ordering, so the tree must be adjusted.
Reorder: TEN

        r #9
       /    \
t(1) #7      2 #8
            /    \
        1 #5      e(1) #6
       /    \
   0 #3      n(1) #4

• Order: 0, n(1), 1, e(1), t(1), 2
TENN
• Stage 4 (Repetition of n)

        r #9
       /    \
t(1) #7      3 #8
            /    \
        2 #5      e(1) #6
       /    \
   0 #3      n(2) #4

• Order: 0, n(2), 2, e(1), t(1), 3
• The sibling property no longer holds (n(2) precedes the lighter e(1) and t(1)), so the tree must be rebuilt.
• Swap the node with the highest-numbered node in the block (here the block of weight 1, to which n belonged before this occurrence).
• Block: the set of nodes that have the same weight.
• To maintain the sibling property, swap node n with node t. If a swapped node has a subtree, the subtree moves with it.
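The swap rule can be sketched on the flat "Order:" list alone. The sketch assumes the common formulation in which the swap with the highest-numbered node of the block happens before the weight is incremented; it leaves out two parts of a full update (skipping a block leader that is the node's own parent, and repeating the step for each ancestor up to the root). The names `block_leader` and `swap_and_increment` are hypothetical.

```python
def block_leader(order, index):
    """Index of the highest-numbered node with the same weight as order[index]."""
    weight = order[index][1]
    leader = index
    for i in range(index + 1, len(order)):
        if order[i][1] != weight:      # weights are nondecreasing: block ends here
            break
        leader = i
    return leader


def swap_and_increment(order, label):
    """Swap the node `label` with its block leader, then add 1 to its weight."""
    order = list(order)                # work on a copy
    index = next(i for i, (name, _) in enumerate(order) if name == label)
    leader = block_leader(order, index)
    order[index], order[leader] = order[leader], order[index]
    name, weight = order[leader]
    order[leader] = (name, weight + 1)
    return order


if __name__ == "__main__":
    # Order after "TEN" (reordered): 0, n(1), 1, e(1), t(1), 2
    before = [("0", 0), ("n", 1), ("1", 1), ("e", 1), ("t", 1), ("2", 2)]
    print(swap_and_increment(before, "n"))
    # prints: [('0', 0), ('t', 1), ('1', 1), ('e', 1), ('n', 2), ('2', 2)]
```

The result corresponds to the order 0, t(1), 1, e(1), n(2), 2, in which n and t have exchanged places.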
Reorder: TENN

        r #9
       /    \
n(2) #7      2 #8
            /    \
        1 #5      e(1) #6
       /    \
   0 #3      t(1) #4

• Order: 0, t(1), 1, e(1), n(2), 2
• t(1) and n(2) are swapped
TENNE
• Stage 5 (Repetition of e)

        r #9
       /    \
n(2) #7      3 #8
            /    \
        1 #5      e(2) #6
       /    \
   0 #3      t(1) #4

• Order: 0, t(1), 1, e(2), n(2), 3
TENNES
• Stage 6 (First occurrence of s)

            r #9
           /    \
    n(2) #7      4 #8
                /    \
            2 #5      e(2) #6
           /    \
       1 #3      t(1) #4
      /    \
  0 #1      s(1) #2

• Order: 0, s(1), 1, t(1), 2, e(2), n(2), 4
TENNESS
• Stage 7 (Repetition of s)

            r #9
           /    \
    n(2) #7      5 #8
                /    \
            3 #5      e(2) #6
           /    \
       2 #3      t(1) #4
      /    \
  0 #1      s(2) #2

• Order: 0, s(2), 2, t(1), 3, e(2), n(2), 5
• The sibling property does not hold. Adjust the tree to restore it.
Reorder: TENNESS

                  r #9
              /          \
         3 #7              4 #8
        /    \            /    \
    1 #3      s(2) #4  n(2) #5  e(2) #6
   /    \
0 #1    t(1) #2

• Order: 0, t(1), 1, s(2), n(2), e(2), 3, 4
• s(2) and t(1) are swapped
• The internal node of weight 3 and n(2) also need to be swapped
TENNESSE
• Stage 8 (Second repetition of e)

                  r #9
              /          \
         3 #7              5 #8
        /    \            /    \
    1 #3      s(2) #4  n(2) #5  e(3) #6
   /    \
0 #1    t(1) #2

• Order: 0, t(1), 1, s(2), n(2), e(3), 3, 5
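TENNESSEE
• Stage 9 (Third repetition of e)

                  r #9
              /          \
         3 #7              6 #8
        /    \            /    \
    1 #3      s(2) #4  n(2) #5  e(4) #6
   /    \
0 #1    t(1) #2

• Order: 0, t(1), 1, s(2), n(2), e(4), 3, 6
• The sibling property does not hold (e(4) precedes the lighter internal node of weight 3), so the Huffman tree must be rebuilt: e and the internal node of weight 3 are swapped.

Reorder: TENNESSEE

        r #9
       /    \
e(4) #7      5 #8
            /    \
     n(2) #5      3 #6
                 /    \
             1 #3      s(2) #4
            /    \
        0 #1      t(1) #2

• Order: 0, t(1), 1, s(2), n(2), 3, e(4), 5

Adaptive Huffman decoding is the inverse procedure of encoding: the decoder builds and updates the same tree as the encoder.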
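As a small illustration, the codes implied by the final TENNESSEE tree can be read off by walking from the root to each leaf. The sketch assumes the common convention of 0 for a left branch and 1 for a right branch, which the slides do not state; the nested-tuple encoding of the tree and the names `FINAL_TREE` and `assign_codes` are also made up for this example.

```python
# Final tree after "TENNESSEE": each internal node is a (left, right) pair,
# each leaf is a label. "null" stands for the null node 0.
FINAL_TREE = ("e", ("n", (("null", "t"), "s")))


def assign_codes(node, prefix="", codes=None):
    """Collect the root-to-leaf path of every leaf (0 = left, 1 = right)."""
    if codes is None:
        codes = {}
    if isinstance(node, str):          # leaf: record the path that led here
        codes[node] = prefix
    else:                              # internal node: descend into both children
        left, right = node
        assign_codes(left, prefix + "0", codes)
        assign_codes(right, prefix + "1", codes)
    return codes


if __name__ == "__main__":
    print(assign_codes(FINAL_TREE))
    # prints: {'e': '0', 'n': '10', 'null': '1100', 't': '1101', 's': '111'}
```

The most frequent character, e, ends up with the shortest code. These are the codes only for the state of the tree after the whole string has been processed; during adaptive coding the codes change as the tree is updated.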