2. Text Compression

Lecture notes (2 weeks)

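Why compression is needed
• Reasons
  - Hardware advances cannot keep up with the growth in the amount of data to be transmitted and stored: Parkinson's Law
  - Internet home pages
  - New applications → multimedia, genome data, digital libraries, e-commerce, intranets
• Compression also makes processing faster
  - Hard disk access
  - Communication speed
• Long-standing examples
  - Morse code, Braille code, stenography keyboards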

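Recent developments
• Trends
  - PC clusters and RAID have become commonplace
  - Use of main-memory databases
  - Faster communication (Internet, internal networks)
  - Network is computing!
• However
  - Multimedia and large-volume data still need compression
  - Compressing large inverted files has become relatively less important, but is still needed at times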

Types
• Text compression → the original is restored exactly (lossless)
• Multimedia → small changes or noise are tolerated (lossy)

History
• 1950s: Huffman coding
• 1970s: Ziv-Lempel (Lempel-Ziv-Welch, used in GIF), arithmetic coding
• English text
  - Huffman (about 5 bits/character)
  - Adaptive methods
    - Ziv-Lempel (about 4 bits/character): 1970s
    - Arithmetic coding (about 2 bits/character) with PPM (Prediction by Partial Matching): early 1980s; slow and requires a large amount of memory
  - No more effective method has appeared since; later work only trades a little compression for less time or memory
  - Memory: 0.5 to 1 Mbytes; below 0.1 Mbytes, Ziv-Lempel is the effective choice
  - The limit for English text is thought to be about 1 bit per character; going further would require semantic relationships or other external knowledge (using grammar, restoring spaces)

Lecture contents
• Models
• Adaptive models
• Coding
• Symbolwise models
• Dictionary models
• Synchronization
• Performance comparison

Broad classification
• Methods
  - Symbolwise methods
  - Dictionary methods
• Compression
  - Models: static vs. adaptive
  - Coding

Symbolwise methods
• Estimate the probabilities of symbols
• Statistical methods
  - Huffman coding or arithmetic coding
• Modeling: estimating probabilities
• Coding: converting the probabilities into bitstreams

Dictionary methods
• Code references to entries in the dictionary
• Several symbols become one output codeword
• Group symbols → dictionary
• Ziv-Lempel coding: references (pointers) to previous occurrences of strings → adaptive
• Hybrid schemes → compression is not as good as with symbolwise schemes, but speed improves

Models
• Prediction
  - To predict symbols, which amounts to providing a probability distribution for the next symbol to be coded
  - Role of the model: used in both coding and decoding
• Information content: I(s) = -log2 Pr[s] (bits)
• Entropy of a probability distribution (Claude Shannon): H = Σ Pr[s]·I(s) = -Σ Pr[s]·log2 Pr[s] (a lower bound on compression)
• As the entropy approaches 0, the potential for compression is greatest → yet Huffman coding performs poorly there. Why?
• Zero probability: a symbol with probability 0 has infinite information content, so it cannot be represented by a code

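Pr[]
• If the probability is 1, nothing needs to be transmitted
• If the probability is 0, the symbol cannot be coded
• If 'u' has probability 2%, it needs 5.6 bits
• If 'u' follows 'q' with probability 95% → 0.074 bits
• → A wrong prediction costs extra bits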
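The figures above follow directly from I(s) = -log2 Pr[s]. A minimal sketch that reproduces them; the function names `information_content` and `entropy` are illustrative choices, not part of the notes:

```python
import math

def information_content(p):
    """Bits needed for a symbol of probability p: I(s) = -log2 Pr[s]."""
    if p <= 0.0:
        raise ValueError("a zero-probability symbol cannot be coded")
    return math.log2(1.0 / p)

def entropy(probs):
    """H = sum of Pr[s] * I(s), skipping symbols with probability 0."""
    return sum(p * information_content(p) for p in probs if p > 0)

print(round(information_content(0.02), 1))   # 'u' at 2%
print(round(information_content(0.95), 3))   # 'u' after 'q' at 95%
print(entropy([0.5, 0.5]))                   # a fair coin
```

A probability of 1 gives 0 bits, which matches the remark that nothing needs to be transmitted.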

Representing a model
• Finite-context model of order m: predicts from the m preceding symbols
• Finite-state model: [Figure 2.2]
• The decoder works with an identical probability distribution
  - Synchronization
  - On error, synchronization would be lost
• Formal languages such as C, Java
  - Grammars, …
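As an illustration of a finite-context model of order m, the sketch below counts which symbol follows each m-symbol context and turns the counts into a probability estimate. The names `build_context_model` and `predict` are hypothetical, and the sketch ignores the zero-frequency problem:

```python
from collections import Counter, defaultdict

def build_context_model(text, m):
    """Count which symbol follows each context of the m preceding symbols."""
    counts = defaultdict(Counter)
    for i in range(m, len(text)):
        counts[text[i - m:i]][text[i]] += 1
    return counts

def predict(counts, context):
    """Estimated distribution of the next symbol after `context`."""
    seen = counts.get(context, Counter())
    total = sum(seen.values())
    return {sym: c / total for sym, c in seen.items()} if total else {}

model = build_context_model("abracadabra", 1)
print(predict(model, "a"))
```

An encoder and a decoder that both build the table from the same text arrive at the same distribution, which is the synchronization requirement stated above.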

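Estimation of probabilities in a model
• Static modeling
  - Always uses the same model regardless of the content of the text
  - English text: documents dominated by one character vs. documents dominated by another → Morse code
  - Different patterns even within the same document?
• Semi-static (semi-adaptive) modeling
  - A new model is built at the encoder for each file and transmitted
  - The cost of sending the model in advance can be severe when the model is complex
• Adaptive modeling
  - Starts from a poor model and changes it according to the content seen so far
  - The probability distribution changes every time a new symbol is encountered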

Adaptive models
• Zero-order model → character by character
• Zero-frequency problem
  - Arises when a character (e.g. 'z') has never appeared so far
  - Example: 82 of the 128 ASCII characters have appeared and 46 have not
  - 1/(46 × (768,078 + 1)) → 25.07 bits
  - 1/(768,078 + 128) → 19.6 bits
  - Unimportant for large documents, but it matters for small documents, varied character sets, or shifts in context
• Higher-order models (zero probabilities set aside for now)
  - First-order model: 'h' occurs 37,526 times and is followed by 't' 1,139 times; 1,139/37,526 ≈ 3.04% → about 5.05 bits (worse than zero-order) → why?
  - Second-order model: 'gh' → 't' (64%, 0.636 bits)
• Many variations are possible, as long as the encoder and the decoder use the same model (synchronization)
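The two zero-frequency estimates can be checked numerically. This is a sketch under the assumption that the first scheme reserves one extra count for "a novel symbol" and splits it evenly over the 46 unseen characters, while the second starts all 128 characters with a count of 1; the variable names are illustrative:

```python
import math

seen_total = 768_078   # characters coded so far (figure from the notes)
alphabet = 128         # ASCII symbols
unseen = 46            # symbols that have not appeared yet

# Scheme 1: one reserved count, shared equally by the unseen symbols.
p_shared_escape = 1 / (unseen * (seen_total + 1))
# Scheme 2: every symbol starts with a count of 1.
p_initial_one = 1 / (seen_total + alphabet)

bits_shared_escape = -math.log2(p_shared_escape)
bits_initial_one = -math.log2(p_initial_one)
print(round(bits_shared_escape, 2), round(bits_initial_one, 1))
```

The gap of about 5.5 bits is paid once per novel character, which is why the choice hardly matters for a large document.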

Adaptive modeling
• Advantages
  - Robust, reliable, flexible
• Disadvantages
  - Random access is impossible
  - Fragile on communication errors
• Good for general compression utilities but not good for full-text retrieval

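Coding
• Function of coding: decides how to represent a symbol, based on the probability distribution supplied by the model
• Points to watch
  - Code length
    - Short codewords for likely symbols
    - Long codewords for rare symbols
    - The probability distribution fixes the minimum average length; the coder aims to get close to it
  - Speed
    - If speed matters, some compression is sacrificed
• A symbolwise scheme depends on its coder → unlike dictionary methods
  - Huffman coding: fast
  - Arithmetic coding: compression close to the theoretical limit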

Huffman Coding
• Encoding and decoding are fast when a static model is used
• Adaptive Huffman coding: needs a lot of memory or time
• Useful for full-text retrieval applications: random access is easy

Examples
• Code table

  Symbol  Codeword  Probability
  a       0000      0.05
  b       0001      0.05
  c       001       0.1
  d       01        0.2
  e       10        0.3
  f       110       0.2
  g       111       0.1

• eefggfed → 10101101111111101001
• Prefix(-free) code
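The example can be reproduced with a few lines of code. This sketch hard-codes the codeword table from the slide; `encode` and `decode` are illustrative names. Decoding works bit by bit because no codeword is a prefix of another:

```python
CODE = {"a": "0000", "b": "0001", "c": "001", "d": "01",
        "e": "10", "f": "110", "g": "111"}

def encode(text):
    return "".join(CODE[ch] for ch in text)

def decode(bits):
    inverse = {cw: sym for sym, cw in CODE.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:      # prefix-free: the first match is the symbol
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

print(encode("eefggfed"))
```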

Huffman coding: Algorithm
• Explanation of Fig. 2.6
• Fast for both encoding and decoding
• Adaptive Huffman coding also exists, but arithmetic coding is the better choice there
  - Random access is ultimately impossible
  - No advantage in memory or speed
• Gives good results when combined with a word-based approach
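Fig. 2.6 is not reproduced in these notes, so the following is only a generic sketch of Huffman's construction using Python's `heapq`, not the figure's exact procedure. It repeatedly merges the two lightest subtrees and records code lengths only. The weights are the example probabilities scaled by 100 (an assumption made to stay in integers), and `huffman_code_lengths` is an illustrative name:

```python
import heapq
from itertools import count

def huffman_code_lengths(weights):
    """Return {symbol: codeword length} for a dict of positive weights."""
    if len(weights) == 1:
        return {sym: 1 for sym in weights}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), [sym]) for sym, w in weights.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(weights, 0)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:   # symbols under the merged node sink one level
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths

weights = {"a": 5, "b": 5, "c": 10, "d": 20, "e": 30, "f": 20, "g": 10}
lengths = huffman_code_lengths(weights)
average = sum(weights[s] * lengths[s] for s in weights) / sum(weights.values())
print(lengths, average)
```

Ties between equal weights may be broken differently from the slide's table, so individual lengths can differ, but the average codeword length is the same.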

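Canonical Huffman Coding I
• A static zero-order word-level canonical Huffman code: Table 2.2
• Uses codewords of the same lengths as the Huffman code
  - Codewords are stored longest first
  - Words with the same frequency are kept in alphabetical order
  - Encoding is easy: it needs only the codeword length, the word's position among the codewords of that length, and the first codeword of that length
  - Example: in Table 2.2, 'said' is the 10th of the 7-bit codewords and the first 7-bit codeword is '1010100', so '1010100' + '1001' = '1011101'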
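The 'said' example is plain binary addition: the first codeword of the length plus the zero-based position. A small sketch, with `canonical_encode` as an illustrative name:

```python
def canonical_encode(first_code, position):
    """Codeword = first code of that length + (position - 1), same width."""
    width = len(first_code)
    value = int(first_code, 2) + (position - 1)   # position is 1-based
    return format(value, f"0{width}b")

print(canonical_encode("1010100", 10))  # 'said': 10th of the 7-bit codewords
```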

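Canonical Huffman Coding II
• Decoding: store the symbols in codeword order, plus the first codeword for each code length
• 1100000101…: first 7-bit codeword '1010100', first 6-bit codeword '110001' … → 12 places after the first 7-bit codeword ('with')
• No decoding tree is used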
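A sketch of tree-free decoding, under stated assumptions: Table 2.2 is not available here, so the code uses a small hypothetical code with four 5-bit, one 3-bit and three 2-bit codewords, and the names `FIRST_CODE`, `SYMBOLS` and `canonical_decode` are illustrative. Bits are read until the value so far is no smaller than the first codeword of the current length:

```python
# Hypothetical code: a=00000 b=00001 c=00010 x=00011 d=001 e=01 f=10 g=11
FIRST_CODE = {1: 2, 2: 1, 3: 1, 4: 2, 5: 0}   # first codeword value per length
SYMBOLS = {2: ["e", "f", "g"], 3: ["d"], 5: ["a", "b", "c", "x"]}  # codeword order

def canonical_decode(bits):
    out, pos = [], 0
    while pos < len(bits):
        value, length = int(bits[pos]), 1
        pos += 1
        while value < FIRST_CODE[length]:      # too small for this length: read on
            value = 2 * value + int(bits[pos])
            pos += 1
            length += 1
        out.append(SYMBOLS[length][value - FIRST_CODE[length]])
    return "".join(out)

print(canonical_decode("01" + "001" + "00011" + "11"))
# The notes' own example: offset of 1100000 from the first 7-bit codeword.
offset = int("1100000", 2) - int("1010100", 2)
print(offset)
```

Lengths 1 and 4 have no codewords in this example; their entries in `FIRST_CODE` are set so that the loop always moves on to the next length.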

Canonical Huffman Coding III
• Unique once the words and their probabilities are fixed
• See Table 2.3
• A canonical Huffman code may be one that Huffman's algorithm cannot produce → any prefix-free assignment of codewords where the length of each code is equal to the depth of that symbol in a Huffman tree
• Accordingly, Huffman's algorithm should be restated as one that computes code lengths
• For n symbols → 2^(n-1) codes
• One of them is the canonical Huffman code

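Canonical Huffman Coding IV
• Advantages
  - No tree has to be built, which saves memory
  - No tree has to be traversed, which saves time
• The code lengths are known first; codeword values are then assigned by computing positions
  - Start with the longest codewords, add 1 for each codeword, and truncate to the required length
  - First codeword of a length = (first codeword of the next longer length + number of codewords of that length), truncated to the shorter length
• Example: four 5-bit, one 3-bit and three 2-bit codewords → 00000, 00001, 00010, 00011, 001, 01, 10, 11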
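The assignment rule can be sketched as follows. Truncation by one bit is written as integer division by 2, and `canonical_codes` is an illustrative name; the input is assumed to be a valid set of Huffman code lengths:

```python
def canonical_codes(lengths):
    """Assign canonical codewords to a list of code lengths (longest first)."""
    max_len = max(lengths)
    num = [0] * (max_len + 2)              # num[l] = number of codewords of length l
    for l in lengths:
        num[l] += 1
    first = [0] * (max_len + 2)            # first[l] = value of first codeword of length l
    for l in range(max_len - 1, 0, -1):
        first[l] = (first[l + 1] + num[l + 1]) // 2   # drop one bit
    next_code = first[:]
    codes = []
    for l in lengths:
        codes.append(format(next_code[l], f"0{l}b"))
        next_code[l] += 1                  # add 1 for each codeword of the same length
    return codes

print(canonical_codes([5, 5, 5, 5, 3, 2, 2, 2]))
```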

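Algorithm
• Building the tree naively → 24n bytes
  - Each node holds a value and two pointers
  - Internal nodes + leaf nodes → 2n nodes
• An 8n-byte algorithm
  - Uses a heap
  - An array of 2n integers
  - The algorithm is written out and explained in the lecture
  - It computes the code lengths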

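Arithmetic Coding
• On average, it is impossible to compress to fewer bits than the entropy
• Achieves high compression with complex models: codes at a length close to the entropy
• Can represent a symbol in less than 1 bit → especially advantageous when one symbol occurs with high probability
• Needs little memory because no tree is stored
• Slower than Huffman coding in static or semi-static applications
• Random access is difficult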

Huffman Code and Arithmetic Code

A practical example
• Two symbols occurring with probabilities 0.99 and 0.01
  - Arithmetic coding: about 0.015 bits for the likely symbol
  - Huffman coding: the inefficiency per symbol is at most Pr[s1] + log2(2·log2(e)/e) ≈ Pr[s1] + 0.086, where s1 is the most frequent symbol: 1.076 bits
• English text entropy: 5 bits per character (zero-order, character level)
  - Proportion of spaces: 0.18 → 0.266
  - 0.266 / 5 bits → 5.3% inefficiency
• Images: mostly two symbols → arithmetic coding
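The constants in this example can be recomputed directly. A short sketch, with `huffman_inefficiency_bound` as an illustrative name for the bound Pr[s1] + log2(2·log2(e)/e):

```python
import math

def huffman_inefficiency_bound(p_most_likely):
    """Upper bound on Huffman's excess over the entropy, in bits per symbol."""
    return p_most_likely + math.log2(2 * math.log2(math.e) / math.e)

constant = math.log2(2 * math.log2(math.e) / math.e)
likely_symbol_bits = -math.log2(0.99)       # cost of the 0.99 symbol under arithmetic coding
space_bound = huffman_inefficiency_bound(0.18)

print(round(constant, 3))                         # the 0.086 term
print(round(huffman_inefficiency_bound(0.99), 3)) # two-symbol case
print(round(likely_symbol_bits, 4))
print(round(100 * space_bound / 5, 1))            # percent of 5 bits/character
```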

Transmission of output
• low = 0.6334, high = 0.6667
• Output '6'; the interval becomes 0.334 → 0.667
• Working with 32-bit precision causes no significant loss of compression
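The step from (0.6334, 0.6667) to (0.334, 0.667) can be sketched as follows: digits shared by both ends of the interval are output and both ends are rescaled. `Fraction` is used so the decimal values stay exact; `emit_common_digits` is an illustrative name, and a real coder would use fixed-width integers instead:

```python
import math
from fractions import Fraction

def emit_common_digits(low, high):
    """Output leading decimal digits shared by low and high, then rescale."""
    digits = []
    while low < high and math.floor(low * 10) == math.floor(high * 10):
        d = math.floor(low * 10)
        digits.append(str(d))
        low, high = low * 10 - d, high * 10 - d   # shift the digit out
    return "".join(digits), low, high

out, low, high = emit_common_digits(Fraction(6334, 10000), Fraction(6667, 10000))
print(out, float(low), float(high))
```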

Arithmetic Coding (Static Model)

Decoding (Static Model)
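The figures for these two slides are not included in the notes. As a stand-in, here is a generic textbook-style sketch of arithmetic encoding and decoding with a static model; it is not necessarily the algorithm shown in the lecture. The three-symbol `MODEL` is a hypothetical example, exact `Fraction` arithmetic replaces the fixed-precision integers a real coder would use, and the decoder is told the message length:

```python
from fractions import Fraction

# Hypothetical static model: symbol -> (cumulative low, cumulative high).
MODEL = {"a": (Fraction(0), Fraction(1, 2)),
         "b": (Fraction(1, 2), Fraction(3, 4)),
         "c": (Fraction(3, 4), Fraction(1))}

def arith_encode(message):
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        sym_low, sym_high = MODEL[sym]
        low += width * sym_low             # narrow the interval to the symbol's share
        width *= sym_high - sym_low
    return low, low + width                # any value in [low, high) identifies the message

def arith_decode(value, n_symbols):
    out = []
    low, width = Fraction(0), Fraction(1)
    for _ in range(n_symbols):
        target = (value - low) / width     # position of value inside the current interval
        for sym, (sym_low, sym_high) in MODEL.items():
            if sym_low <= target < sym_high:
                out.append(sym)
                low += width * sym_low
                width *= sym_high - sym_low
                break
    return "".join(out)

low, high = arith_encode("abca")
print(low, high, arith_decode(low, 4))
```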

Arithmetic Coding (Adaptive Model)

Decoding (Adaptive Model)

Cumulative Count Calculation
• Explanation of the method
• Heap
• Encoding 101101 → 101101, 1011, 101, 1
• Explanation of the rule
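The notes do not spell the rule out. One structure that produces exactly the sequence 101101, 1011, 101, 1 is a binary indexed (Fenwick) tree, where each step strips the lowest set bit of the index. The sketch below assumes that this is the intended scheme; `CumulativeCounts` and `visited_nodes` are illustrative names:

```python
class CumulativeCounts:
    """Binary indexed tree over symbols numbered 1..n (assumed structure)."""

    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def add(self, symbol, amount=1):
        while symbol < len(self.tree):
            self.tree[symbol] += amount
            symbol += symbol & -symbol     # next node whose range covers this symbol

    def cumulative(self, symbol):
        total = 0
        while symbol > 0:
            total += self.tree[symbol]
            symbol -= symbol & -symbol     # strip the lowest set bit
        return total

def visited_nodes(symbol):
    """Nodes summed for `symbol`, written without trailing zeros."""
    path = []
    while symbol > 0:
        path.append(format(symbol, "b").rstrip("0"))
        symbol -= symbol & -symbol
    return path

print(visited_nodes(0b101101))
```

Both updating a count and reading a cumulative count take a number of steps proportional to the number of bits in the symbol index, which suits adaptive arithmetic coding.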

Symbolwise models
• Symbolwise model + coder (arithmetic, Huffman)
• Three approaches
  - PPM (Prediction by Partial Matching)
  - DMC (Dynamic Markov Compression)
  - Word-based compression

PPM (Prediction by Partial Matching)
• Finite-context models of characters
• Variable-length code: partial matching against the previously coded text
• Zero-frequency problem
  - Escape symbol
  - PPMA: escape method A gives the escape symbol a count of 1

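Escape method
• Escape method A (PPMA) → count of 1
• Exclusion → counts that are duplicated but can no longer apply are left out; e.g. 'lie' followed by 's' (201 occurrences of 'lie', 22 of them with 's'): since 's' is already handled in the context '?lie', 'lie' counts 179 times, and with 'lie' followed by 'r' 19 times … 19/202 → 19/180
• Method C: escape probability r/(n+r) for n symbols in total and r distinct symbols; symbol i gets ci/(n+r) → 2.5 bits per character for Hardy's book
• Method D: r/(2n)
• Method X: t1 = number of symbols of frequency 1; escape probability (t1+1)/(n+t1+1)
• PPMZ, Swiss Army Knife Data Compression (SAKDC): PhD theses of 1991 and 1997
• Figure 2.24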
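The escape formulas are easy to compare side by side. In this sketch the function names are illustrative, and the split of the 'lie' context into 's' (22), 'r' (19) and 160 other occurrences is a hypothetical filler chosen so that the total is 201:

```python
from fractions import Fraction

def escape_probability(counts, method):
    """counts: {symbol: frequency in the current context}."""
    n = sum(counts.values())                        # total occurrences
    r = len(counts)                                 # distinct symbols
    t1 = sum(1 for c in counts.values() if c == 1)  # symbols seen exactly once
    if method == "A":
        return Fraction(1, n + 1)
    if method == "C":
        return Fraction(r, n + r)
    if method == "D":
        return Fraction(r, 2 * n)
    if method == "X":
        return Fraction(t1 + 1, n + t1 + 1)
    raise ValueError(method)

def symbol_probability_a(counts, symbol, excluded=()):
    """Method A probability of `symbol`, ignoring already-excluded symbols."""
    kept = {s: c for s, c in counts.items() if s not in excluded}
    return Fraction(kept[symbol], sum(kept.values()) + 1)

lie = {"s": 22, "r": 19, "other": 160}   # "other" is a made-up remainder
print(symbol_probability_a(lie, "r"), symbol_probability_a(lie, "r", excluded=("s",)))
```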

Block-sorting compression
• Introduced in 1994
• Transforms the text so that it becomes easier to compress
• Comparable to the discrete cosine transform or the Fourier transform in image compression
• The input has to be divided into blocks
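The block-sorting transform introduced in 1994 is the Burrows-Wheeler transform. A naive sketch for one block, assuming an end marker '$' that does not occur in the text and sorts before every other character; the quadratic rotation sort is for illustration only, and the function names are illustrative:

```python
def bwt(text, end="$"):
    """Last column of the sorted rotations of text + end marker."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last_column, end="$"):
    """Rebuild the block by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in last_column:
        table = sorted(ch + row for ch, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]

print(bwt("banana"))
```

The output groups equal characters together, which is what makes the transformed block easier for a later coding stage.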

DMC (Dynamic Markov Compression)
• Finite-state model
• Adaptive model
  - Adapts both the probabilities and the structure of the finite-state machine
• Figure 2.13
• Avoids the zero-frequency problem
• Figure 2.14
• Cloning
  - Heuristic
  - The adaptation of the structure of a DMC

Word-based Compression
• Parse a document into "words" and "nonwords"
• Textual and non-textual parts are compressed separately
  - Textual: zero-order model
• Suitable for large full-text databases
• Low-frequency words
  - Inefficient
  - e.g. runs of digits, page numbers
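A possible way to do the parsing step, assuming a "word" is a run of ASCII letters and digits and a "nonword" is whatever lies between two words; the definition and the name `parse_words_nonwords` are assumptions made for illustration:

```python
import re

def parse_words_nonwords(text):
    """Split text into alphanumeric 'words' and the 'nonwords' between them."""
    tokens = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)
    words = [t for t in tokens if re.match(r"[A-Za-z0-9]", t)]
    nonwords = [t for t in tokens if not re.match(r"[A-Za-z0-9]", t)]
    return words, nonwords

words, nonwords = parse_words_nonwords("Hello, world! 42")
print(words, nonwords)
```

Each of the two streams would then get its own zero-order model. Tokens such as "42" show why long digit runs and page numbers end up as rare words.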

Dictionary Models
• Principle of replacing substrings in a text with codewords
• Adaptive dictionary compression models: LZ77, LZ78
• Approaches
  - LZ77
    - Gzip
  - LZ78
    - LZW

Dictionary Model - LZ77
• Adaptive dictionary model
• Characteristics
  - Easy to implement
  - Quick decoding
  - Uses a small amount of memory
• Figure 2.16
• Triples <offset, length of phrase, character>
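A minimal sketch of the triple format <offset, length of phrase, character>. The brute-force search, the window size of 32 and the function names are assumptions for illustration; real implementations search far more efficiently. The decoder shows why decoding is quick: it only copies earlier output.

```python
def lz77_encode(text, window=32):
    out, i = [], 0
    while i < len(text):
        best_off, best_len = 0, 0
        for off in range(1, min(i, window) + 1):
            length = 0
            # Extend the match; always leave one character for the literal.
            while (i + length < len(text) - 1
                   and text[i + length - off] == text[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])   # copy from `off` characters back
        out.append(ch)
    return "".join(out)

print(lz77_encode("aaaa"))
```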

Dictionary Model - LZ77 (continued)
• Improvements
  - Offset: shorter codewords for recent matches
  - Match length: variable-length code
  - Character: included only when needed (raw data is sent)
• Figure 2.17

Dictionary Model - Gzip
• Based on LZ77
• Hash table
• Tuples <offset, matched length>
• Uses Huffman codes
  - Semi-static / canonical Huffman code
  - 64K blocks
  - Code table: at the start of each block

Dictionary Model - LZ78
• Adaptive dictionary model
• References previously parsed phrases
• Figure 2.18
• Tuples
  - <phrase number, character>
  - Phrase 0: empty string
• Figure 2.19
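A short sketch of the <phrase number, character> scheme with phrase 0 as the empty string. A plain Python dict stands in for the hash table or trie, and the function names are illustrative:

```python
def lz78_encode(text):
    dictionary = {"": 0}                   # phrase 0 is the empty string
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                   # keep extending the known phrase
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                             # input ended inside a known phrase
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_decode(pairs):
    phrases, out = [""], []
    for number, ch in pairs:
        phrase = phrases[number] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

print(lz78_encode("abaababaa"))
```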

Dictionary Model - LZ78 (continued)
• Characteristics
  - Hash table: simple, fast
  - Encoding: fast
  - Decoding: slow
  - Trie: uses a lot of memory

Dictionary Model - LZW
• Variant of LZ78
• Encodes only the phrase number; the output contains no explicit characters
• A new phrase is formed by appending the first character of the next phrase
• Figure 2.20
• Characteristics
  - Good compression
  - Easy to implement
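A compact LZW sketch. The dictionary is initialised with a caller-supplied `alphabet` (an assumption; every input character must be in it), the output contains phrase numbers only, and each new phrase is the previous phrase plus the first character of the next one. Function names are illustrative:

```python
def lzw_encode(text, alphabet):
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            out.append(dictionary[phrase])
            dictionary[phrase + ch] = len(dictionary)
            phrase = ch                     # next phrase starts with this character
    if phrase:
        out.append(dictionary[phrase])
    return out

def lzw_decode(codes, alphabet):
    if not codes:
        return ""
    phrases = list(alphabet)
    previous = phrases[codes[0]]
    out = [previous]
    for code in codes[1:]:
        # The code may name the phrase that is only now being completed.
        entry = phrases[code] if code < len(phrases) else previous + previous[0]
        phrases.append(previous + entry[0])
        out.append(entry)
        previous = entry
    return "".join(out)

print(lzw_encode("abababa", "ab"))
```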

Synchronization
• Random access
• Random access is impossible with
  - Variable-length codes
  - Adaptive models
• Synchronization points
• Synchronization with an adaptive model
  - Large file → break it into small sections

Creating synchronization points
• Main text: consists of a number of documents
• Bit offset
  - Extra bits at the start or end of a document give its length
• Byte offset
  - End-of-document symbol
  - Length of each document at its beginning
  - End of file

Self-synchronizing codes
• Not useful for full-text retrieval
• Motivation
  - Decoding can start in the middle of compressed text by finding the synchronizing cycle
  - Useful when part of the text is corrupted or the beginning is missing
• Fixed-length codes cannot self-synchronize
• Table 2.3
• Figure 2.22

Performance comparisons
• Considerations
  - Compression speed
  - Compression performance
  - Computing resources
• Table 2.4
