2. Text Compression

2. Text Compression: Lecture Notes (2 weeks)

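Why compression is needed • Advances in computer hardware vs. the rate at which the amount of required data grows • Internet web pages • New applications → multimedia, genome data, digital libraries, e-commerce, intranets • Compressed data can also be processed faster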

History • 1950s: Huffman coding • 1970s: Ziv-Lempel, arithmetic coding • English text • Huffman (5 bits/character) → • Ziv-Lempel (4 bits/character) → • Arithmetic coding (2 bits/character) • PPM: Prediction by Partial Matching • Slow and requires a large amount of memory

Lecture outline • Models • Adaptive models • Coding • Symbolwise models • Dictionary models • Synchronization • Performance comparison

Symbolwise methods • Estimate the probabilities of symbols • Huffman coding or arithmetic coding • Modeling: estimating probabilities • Coding: converting the probabilities into bitstreams

Dictionary methods • Code references to entries in the dictionary • Several symbols are coded as one output codeword • Statistical method • Ziv-Lempel coding references (points to) previous occurrences of strings → adaptive • Hybrid schemes → compression is not as good as with symbolwise schemes, but speed improves

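Models • Prediction (Fig. 2.1) • Information content: $I(s) = -\log_2 \Pr[s]$ (bits) • Entropy of the probability distribution: $H = \sum_s \Pr[s] \cdot I(s) = -\sum_s \Pr[s] \cdot \log_2 \Pr[s]$ • When prediction is very good → Huffman coding performs poorly, because every codeword costs at least 1 bit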

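Pr[] • If the probability is 1, nothing needs to be transmitted • If the probability is 0, the symbol cannot be coded • If 'u' has probability 2%, it needs 5.6 bits • If 'u' follows 'q' with probability 95% → it needs 0.074 bits • A wrong prediction costs additional bits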
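The figures on this slide follow directly from I(s) = -log2 Pr[s]. A minimal sketch, assuming base-2 logarithms; the function names and the three-symbol distribution at the end are illustrative, not taken from the slides:

```python
import math

def information_content(p):
    """Bits needed to code a symbol of probability p: I(s) = -log2 Pr[s]."""
    return -math.log2(p)

def entropy(probs):
    """Average bits per symbol: the sum of Pr[s] * I(s) over symbols with Pr[s] > 0."""
    return sum(p * information_content(p) for p in probs if p > 0)

print(round(information_content(0.02), 2))   # 'u' at 2%           -> 5.64
print(round(information_content(0.95), 3))   # 'u' after 'q' at 95% -> 0.074
print(round(information_content(0.05), 2))   # anything else after 'q' -> 4.32
print(entropy([0.5, 0.25, 0.25]))            # made-up distribution -> 1.5
```

The third line shows the cost of a wrong prediction: if the model gives 'u' 95% after 'q', all other symbols share the remaining 5%, so each of them costs at least 4.32 bits.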

Models • Finite-context model of order m - predicts using the m preceding symbols • Finite-state model [Figure 2.2]
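A finite-context model of order m can be sketched as a table of counts keyed by the m preceding symbols. The code below is an illustration only; `context_counts`, `probability`, and the sample string are hypothetical names and data, and unseen contexts simply return probability 0 (the zero-frequency problem is not handled here).

```python
from collections import Counter, defaultdict

def context_counts(text, m):
    """Count how often each symbol follows each context of m symbols."""
    counts = defaultdict(Counter)
    for i in range(m, len(text)):
        counts[text[i - m:i]][text[i]] += 1
    return counts

def probability(counts, context, symbol):
    """Estimated Pr[symbol | context] from raw counts (0.0 if the context is unseen)."""
    seen = counts[context]
    total = sum(seen.values())
    return seen[symbol] / total if total else 0.0

sample = "the thin thing"
order1 = context_counts(sample, 1)
print(probability(order1, "t", "h"))   # 'h' always follows 't' in the sample
print(probability(order1, "h", "i"))   # 'i' follows 'h' in 2 of 3 cases
```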

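Modeling methods • Static modeling - always uses the same model regardless of the text • Semi-static modeling - builds a new model for each file - the model must be transmitted before the data • Adaptive modeling - the probability distribution changes each time a new symbol is encountered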

Adaptive models • Zero-order model → character by character • Zero-frequency problem - arises when a character has never appeared so far - 1/(46·(768,078+1)) or 1/(768,078+128)? • Higher-order models • First-order model: 'h' occurs 37,526 times and is followed by 't' 1,139 times → 1,139/37,526 ≈ 3.0% (ignoring zero probabilities) • Second-order model: 'gh' followed by 't' (64%, 0.636 bits)
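The two candidate estimates for a never-seen character can be compared numerically. This sketch assumes the reading that 768,078 characters have been seen, 46 of the 128 characters are still unseen, and the two options are (a) one extra count shared among the unseen characters and (b) adding 1 to every character's count; the function names are hypothetical.

```python
import math

def shared_extra_count(n, unseen):
    """One extra count split evenly among the unseen symbols: 1 / (unseen * (n + 1))."""
    return 1 / (unseen * (n + 1))

def add_one_to_all(n, alphabet_size):
    """Every symbol's count inflated by 1: 1 / (n + alphabet_size)."""
    return 1 / (n + alphabet_size)

n = 768_078
for p in (shared_extra_count(n, 46), add_one_to_all(n, 128)):
    print(round(-math.log2(p), 2), "bits for a novel character")
```

Under these assumptions the first estimate charges a novel character about 25 bits and the second about 19.5 bits.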

Adaptive modeling • Advantages - robust, reliable, flexible • Disadvantages - random access is impossible - fragile in the presence of communication errors • Good for general-purpose compression utilities but not for full-text retrieval

Coding • Role of the coder - decides how to represent each symbol, based on the probability distribution supplied by the model • Points to watch when coding - short codewords for likely symbols - long codewords for rare symbols - speed • Huffman coding • Arithmetic coding

Huffman Coding • Encoding and decoding are fast when a static model is used • Adaptive Huffman coding - needs a lot of memory or time • Useful for full-text retrieval applications • Random access is easy

Examples
Symbol  Codeword  Probability
a       0000      0.05
b       0001      0.05
c       001       0.1
d       01        0.2
e       10        0.3
f       110       0.2
g       111       0.1
• eefggfed → 10101101111111101001
• Prefix(-free) code
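Because no codeword in the example table is a prefix of another, a decoder can emit a symbol as soon as the bits read so far match a codeword. A minimal sketch using the codewords from the slide; `encode` and `decode` are illustrative helper names:

```python
# Codeword table copied from the example slide.
CODE = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}

def encode(text, code=CODE):
    """Concatenate the codeword of each symbol."""
    return "".join(code[ch] for ch in text)

def decode(bits, code=CODE):
    """Read bits left to right; a prefix-free code never needs lookahead."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(out)

print(encode("eefggfed"))                 # 10101101111111101001
print(decode("10101101111111101001"))     # eefggfed
```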

Algorithm • Walk through Fig. 2.6 • Fast for both encoding and decoding • Adaptive Huffman coding exists, but arithmetic coding is the better choice in the adaptive setting
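Fig. 2.6 is not reproduced in these notes, so the following is only a generic sketch of Huffman's algorithm: repeatedly merge the two lightest groups, and every merge adds one bit to the codewords of all symbols in the merged groups. It computes code lengths rather than a tree; the function name and the integer weights (the example probabilities scaled by 100, with b taken as 0.05) are assumptions.

```python
import heapq
from itertools import count

def huffman_code_lengths(weights):
    """Return symbol -> codeword length for a dict of symbol -> weight."""
    if len(weights) == 1:
        return {next(iter(weights)): 1}
    tiebreak = count()                      # keeps heap entries comparable on equal weights
    heap = [(w, next(tiebreak), [s]) for s, w in weights.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        for s in group1 + group2:           # merged symbols sink one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), group1 + group2))
    return lengths

weights = {"a": 5, "b": 5, "c": 10, "d": 20, "e": 30, "f": 20, "g": 10}
lengths = huffman_code_lengths(weights)
total_bits = sum(weights[s] * lengths[s] for s in weights)
print(lengths, total_bits / 100, "bits per symbol on average")
```

Ties between equal weights can be broken in different ways, so the individual lengths may differ from a textbook table while the average length stays optimal.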

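Canonical Huffman Coding I • Uses codewords of the same lengths as the Huffman code • Codewords are stored longest first • Words that occur with the same frequency are listed in alphabetical order • Encoding is easy: it needs only the code length, the first code of that length, and the word's relative position among the codes of that length • Example: in Table 2.2, 'said' is the 10th of the 7-bit codewords and the first 7-bit code is '1010100', so '1010100' + '1001' = '1011101'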

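Canonical Huffman Coding II • Decoding: store the symbols in codeword order, plus the first code of each code length • Example: input 1100000101… with first codes '1010100' (7 bits), '110001' (6 bits), … → the first 6 bits (110000) are below the first 6-bit code, so the codeword is the 7-bit 1100000, 12 places after the first 7-bit code ('with') • No decoding tree is used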

Canonical Huffman Coding III • Unique once the words and their probabilities are fixed • See Table 2.3 • A canonical Huffman code may not be one that Huffman's algorithm would generate • Following Huffman's formulation, the algorithm has to be changed: it is used to compute code lengths • For n symbols → $2^{n-1}$ codes • One of them is the canonical Huffman code

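Canonical Huffman Coding IV • No tree has to be built, which saves memory • No tree has to be searched, which saves time • The code lengths are found first, then code values are assigned by computing positions (method explained in lecture) • Start with the longest codes, add 1 for each successive codeword, and truncate to the required length • First code of a length = [first code of the next longer length + number of codes of that longer length + 1], truncated to the length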
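The assignment rule above can be sketched as follows, reading "truncate to the length" as dropping the last bit, i.e. first[l] = (first[l+1] + count[l+1] + 1) >> 1. The function names are hypothetical and the code lengths are those of the a to g example (4, 4, 3, 2, 2, 3, 3); the resulting codewords differ from the example table because the canonical code orders codewords by length.

```python
from collections import defaultdict

def canonical_codes(lengths):
    """lengths: symbol -> code length. Returns (codes, firstcode, symbols by length)."""
    max_len = max(lengths.values())
    by_length = defaultdict(list)
    for sym in sorted(lengths):                    # fixed (alphabetical) order per length
        by_length[lengths[sym]].append(sym)
    firstcode = {max_len: 0}                       # longest codes start at 0
    for l in range(max_len - 1, 0, -1):
        firstcode[l] = (firstcode[l + 1] + len(by_length[l + 1]) + 1) >> 1
    codes = {}
    for l, syms in by_length.items():
        for offset, sym in enumerate(syms):        # first code + relative position
            codes[sym] = format(firstcode[l] + offset, "0{}b".format(l))
    return codes, firstcode, by_length

def canonical_decode(bits, firstcode, by_length):
    """Decode without a tree: extend the value bit by bit until it reaches firstcode."""
    out, pos = [], 0
    while pos < len(bits):
        value, l = int(bits[pos]), 1
        pos += 1
        while value < firstcode[l]:
            value = 2 * value + int(bits[pos])
            pos += 1
            l += 1
        out.append(by_length[l][value - firstcode[l]])
    return out

codes, firstcode, by_length = canonical_codes(
    {"a": 4, "b": 4, "c": 3, "d": 2, "e": 2, "f": 3, "g": 3})
print(codes)
print(canonical_decode(codes["d"] + codes["a"] + codes["g"], firstcode, by_length))

# The 'said' example: first 7-bit code plus relative position 9 (the 10th word).
print(format(int("1010100", 2) + 9, "07b"))        # 1011101
```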

Algorithm • Building the tree naively → 24n bytes • Each node holds a value and 2 pointers • Internal nodes + leaf nodes → 2n nodes • An 8n-byte algorithm • Uses a heap • An array of 2n integers • The algorithm is worked through by hand in the lecture

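Arithmetic Coding • Achieves high compression with sophisticated models - codes at a length close to the entropy • Can represent a symbol in less than 1 bit → especially advantageous when one symbol occurs with high probability • Needs little memory because no tree is stored • Slower than Huffman coding in static or semi-static applications • Random access is difficult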

Huffman Code vs. Arithmetic Code

Transmission of output • low = 0.6334, high = 0.6667 • Output '6' → low = 0.334, high = 0.667 • Using 32-bit precision does not reduce compression significantly

Arithmetic Coding (Static Model)

Decoding(Static Model)
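The figures for these two slides are missing, so here is a hedged sketch of static-model arithmetic coding: each symbol narrows the current interval to its slice, and the decoder repeats the same narrowing to recover the symbols. Exact fractions stand in for the fixed-precision arithmetic a real coder would use; the probabilities and function names are made up for illustration.

```python
from fractions import Fraction

def slices(probs):
    """Map each symbol to its [low, high) slice of [0, 1)."""
    ranges, low = {}, Fraction(0)
    for sym, p in probs.items():
        ranges[sym] = (low, low + p)
        low += p
    return ranges

def encode(message, probs):
    """Return the final interval [low, high) for the whole message."""
    ranges = slices(probs)
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        width = high - low
        s_low, s_high = ranges[sym]
        low, high = low + width * s_low, low + width * s_high
    return low, high

def decode(value, length, probs):
    """Recover `length` symbols from any value inside the final interval."""
    ranges = slices(probs)
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        width = high - low
        target = (value - low) / width
        for sym, (s_low, s_high) in ranges.items():
            if s_low <= target < s_high:
                out.append(sym)
                low, high = low + width * s_low, low + width * s_high
                break
    return "".join(out)

probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
low, high = encode("abca", probs)
print(low, high)                 # 11/32 23/64
print(decode(low, 4, probs))     # abca
```

The interval width is 1/64, the product of the four symbol probabilities, so about 6 bits identify the message.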

Arithmetic Coding (Adaptive Model)

Decoding(Adaptive Model)
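With an adaptive model the interval narrowing is unchanged, but the counts are updated after every symbol. The sketch below assumes a three-symbol alphabet {a, b, c} with every count starting at 1; under that assumption, coding "bcc" gives an interval that matches the low = 0.6334, high = 0.6667 values quoted on the "Transmission of output" slide. The function name is hypothetical.

```python
from fractions import Fraction

def adaptive_encode(message, alphabet):
    """Arithmetic-code `message`, updating symbol counts as it goes."""
    counts = {s: 1 for s in alphabet}        # assumption: every symbol starts at count 1
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        total = sum(counts.values())
        below = 0                             # cumulative count of symbols before `sym`
        for s in alphabet:
            if s == sym:
                break
            below += counts[s]
        width = high - low
        high = low + width * Fraction(below + counts[sym], total)
        low = low + width * Fraction(below, total)
        counts[sym] += 1                      # the decoder makes the same update
    return low, high

low, high = adaptive_encode("bcc", "abc")
print(float(low), float(high))
```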

Cumulative Count Calculation • Method explained in lecture • Heap • Encoding 101101 → 101101, 1011, 101, 1 • Rule explained in lecture

Symbolwise models • Symbolwise model + coder (arithmetic, Huffman) • Three approaches - PPM (Prediction by Partial Matching) - DMC (Dynamic Markov Compression) - Word-based compression

PPM (Prediction by Partial Matching) • Finite-context models of characters • Variable-length code - partial matching against the previously coded text • Zero-frequency problem - escape symbol - the escape symbol is given a count of 1 (PPMA)

Escape method • Escape method A (PPMA) → count 1 • Exclusion • Method C: escape probability r/(n+r), where n is the total count and r the number of distinct symbols; symbol i gets c_i/(n+r) • Method D: r/(2n) • Method X: with t1 the number of symbols of frequency 1, (t1+1)/(n+t1+1) • PPMZ, Swiss Army Knife Data Compression (SAKDC) • Figure 2.24
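The escape estimates can be compared on a small context. A sketch with a made-up count table; method A is taken as an escape count of 1, i.e. 1/(n+1), and the function names are hypothetical.

```python
from fractions import Fraction

def escape_probabilities(counts):
    """Escape probability under methods A, C, D and X for one context's counts."""
    n = sum(counts.values())                           # total symbols seen in the context
    r = len(counts)                                    # distinct symbols seen
    t1 = sum(1 for c in counts.values() if c == 1)     # symbols seen exactly once
    return {
        "A": Fraction(1, n + 1),
        "C": Fraction(r, n + r),
        "D": Fraction(r, 2 * n),
        "X": Fraction(t1 + 1, n + t1 + 1),
    }

def symbol_probability_c(counts, symbol):
    """Method C probability of an already-seen symbol: c_i / (n + r)."""
    n, r = sum(counts.values()), len(counts)
    return Fraction(counts[symbol], n + r)

counts = {"a": 3, "b": 1}                              # illustrative context counts
print(escape_probabilities(counts))
print(symbol_probability_c(counts, "a"))
```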

Block-sorting compression

DMC (Dynamic Markov Compression) • Finite-state model • Adaptive model - probabilities and the structure of the finite-state machine • Figure 2.13 • Avoids the zero-frequency problem • Figure 2.14 • Cloning - heuristic - the adaptation of the structure of a DMC

Word-based Compression • Parse a document into "words" and "nonwords" • Textual and non-textual parts are compressed separately - textual: zero-order model • Suitable for large full-text databases • Low-frequency words - inefficient - e.g., runs of digits, page numbers
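Parsing into alternating words and nonwords can be sketched with a regular expression. This assumes a "word" is a maximal run of ASCII letters and digits and everything else is a "nonword"; real systems use their own definitions, and the function names here are illustrative.

```python
import re
from collections import Counter

def parse(text):
    """Split text into maximal runs of word characters and non-word characters."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)

def zero_order_counts(text):
    """Separate frequency tables, one for words and one for nonwords."""
    words = Counter(re.findall(r"[A-Za-z0-9]+", text))
    nonwords = Counter(re.findall(r"[^A-Za-z0-9]+", text))
    return words, nonwords

tokens = parse("Hello, world! Hello again.")
print(tokens)
words, nonwords = zero_order_counts("Hello, world! Hello again.")
print(words.most_common(1), nonwords.most_common(1))
```

Joining the tokens reproduces the input exactly, which is what lets the two token streams be coded with separate models.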

Dictionary Models • Principle: replace substrings in a text with codewords • Adaptive dictionary compression models: LZ77, LZ78 • Approaches - LZ77 - Gzip - LZ78 - LZW

Dictionary Model - LZ77 • Adaptive dictionary model • Characteristics - easy to implement - quick decoding - uses a small amount of memory • Figure 2.16 • Triples <offset, length of phrase, character>

Dictionary Model - LZ77 (continued) • Improvements - offset: shorter codewords for recent matches - match length: variable-length code - character: included only when necessary (raw data is sent) • Figure 2.17
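The basic triple scheme, without the refinements listed above, can be sketched as follows. The brute-force match search, the window size, and the function names are assumptions for illustration; the decoder is the part that stays simple and fast.

```python
def lz77_encode(text, window=4096):
    """Greedy LZ77: emit (offset, match length, next character) triples."""
    out, i = [], 0
    while i < len(text):
        best_offset, best_length = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # A match may overlap the text being coded, but must leave one
            # character over to fill the third field of the triple.
            while i + k < len(text) - 1 and text[j + k] == text[i + k]:
                k += 1
            if k > best_length:
                best_offset, best_length = i - j, k
        out.append((best_offset, best_length, text[i + best_length]))
        i += best_length + 1
    return out

def lz77_decode(triples):
    """Copy `length` characters from `offset` back, then append the character."""
    buf = []
    for offset, length, ch in triples:
        for _ in range(length):
            buf.append(buf[-offset])       # one at a time, so overlapping copies work
        buf.append(ch)
    return "".join(buf)

triples = [(0, 0, "a"), (0, 0, "b"), (2, 1, "a"), (3, 2, "b"), (5, 3, "b"), (1, 10, "a")]
print(lz77_decode(triples))
print(lz77_encode("abaababaabb"))
```

The last triple above copies 10 characters from an offset of 1, which is only possible because the copy proceeds one character at a time.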

Dictionary Model - Gzip • Based on LZ77 • Hash table • Tuples <offset, matched length> • Uses Huffman codes - semi-static / canonical Huffman code - 64K blocks - code table: stored at the start of each block
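For a quick feel of the method in practice, Python's standard `gzip` module can be used as is. The sample text is made up, and the exact compressed size depends on the library version, so it is printed rather than stated.

```python
import gzip

# Highly repetitive input: later copies are coded as <offset, length> references.
text = ("to be or not to be, that is the question. " * 200).encode("ascii")

packed = gzip.compress(text)
restored = gzip.decompress(packed)

print(len(text), "bytes in,", len(packed), "bytes out")
print(restored == text)
```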

Dictionary Model - LZ78 • Adaptive dictionary model • References previously parsed phrases • Figure 2.18 • Tuples - <phrase number, character> - phrase 0: empty string • Figure 2.19

Dictionary Model - LZ78 (continued) • Characteristics - hash table: simple, fast - encoding: fast - decoding: slow - trie: uses a lot of memory
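The <phrase number, character> scheme can be sketched with a Python dict standing in for the hash table or trie. The function names are hypothetical, and the empty character emitted when the input ends inside a known phrase is a choice made for this sketch.

```python
def lz78_encode(text):
    """Emit (phrase number, character) pairs; phrase 0 is the empty string."""
    dictionary = {"": 0}
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                          # keep extending the longest known phrase
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                                    # input ended inside a known phrase
        out.append((dictionary[phrase], ""))
    return out

def lz78_decode(pairs):
    """Rebuild the phrase list while decoding, mirroring the encoder."""
    phrases, out = [""], []
    for number, ch in pairs:
        phrase = phrases[number] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

pairs = lz78_encode("abaababaa")
print(pairs)
print(lz78_decode(pairs))
```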

Dictionary Model - LZW • Variant of LZ78 • Encodes only the phrase number; the output contains no explicit characters • A new phrase is formed by appending the first character of the next phrase • Figure 2.20 • Characteristics - good compression - easy to implement
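A sketch of LZW under the assumption that the dictionary starts with the 256 single characters of code 0 to 255 (so the input must stay within that range); the function names are hypothetical. The decoder's special case covers a phrase number that refers to the phrase still being built.

```python
def lzw_encode(text):
    """Output phrase numbers only; each new phrase = old phrase + next character."""
    table = {chr(i): i for i in range(256)}
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in table:
            phrase += ch
        else:
            out.append(table[phrase])
            table[phrase + ch] = len(table)
            phrase = ch                           # next phrase starts with this character
    if phrase:
        out.append(table[phrase])
    return out

def lzw_decode(codes):
    """Rebuild the table one step behind the encoder."""
    if not codes:
        return ""
    table = [chr(i) for i in range(256)]
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        else:                                     # phrase not finished yet on the decoder side
            entry = previous + previous[0]
        table.append(previous + entry[0])         # append first character of the next phrase
        out.append(entry)
        previous = entry
    return "".join(out)

codes = lzw_encode("abababab")
print(codes)
print(lzw_decode(codes))
```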

Synchronization • Random access • Random access is impossible with - variable-length codes - adaptive models • Synchronization points • Synchronization with an adaptive model - large file → break into small sections

Creating synchronization points • Main text: consists of a number of documents • Bit offset - extra bits at the start/end of each document indicate its length • Byte offset - end-of-document symbol - length of each document at its beginning - end of file
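One way to picture byte-offset synchronization points: compress each document on its own, record where it starts, and decode a single document without touching the rest. The sketch uses Python's standard `zlib` as the per-document compressor purely for illustration; the function names and sample documents are made up.

```python
import zlib

def build_archive(documents):
    """Compress documents independently and remember each one's byte offset."""
    blob, offsets = bytearray(), []
    for doc in documents:
        offsets.append(len(blob))                 # synchronization point for this document
        blob += zlib.compress(doc.encode("utf-8"))
    offsets.append(len(blob))                     # end of file closes the last document
    return bytes(blob), offsets

def read_document(blob, offsets, i):
    """Random access: decode only the bytes between two synchronization points."""
    return zlib.decompress(blob[offsets[i]:offsets[i + 1]]).decode("utf-8")

docs = ["first document", "second document, a bit longer", "third"]
blob, offsets = build_archive(docs)
print(offsets)
print(read_document(blob, offsets, 1))
```

Restarting the model at each document costs some compression, which is the price of being able to jump straight to a document.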

Self-synchronizing codes • Not useful for full-text retrieval • Motivation - decoding from the middle of compressed text: find the synchronizing cycle, then decode - part of the text is corrupted, or the beginning is missing • Fixed-length codes: cannot self-synchronize • Table 2.3 • Figure 2.22

Performance comparisons • Considerations - compression speed - compression performance - computing resources • Table 2.4

Compression Performance • Calgary corpus - English text, program source code, bilevel facsimile image - geological data, program object code • Bits per character • Figure 2.24

Compression speed • Speed depends on - method of implementation - architecture of the machine - compiler • Better compression → slower program • Ziv-Lempel based methods: decoding is faster than encoding • Table 2.6

Other performance considerations • Memory usage - adaptive models: use a lot of memory - Ziv-Lempel << symbolwise models • Random access - synchronization points
