Shift-And Approach to Pattern Matching in LZW Compressed Text
Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Setsuo ARIKAWA
Department of Informatics, Kyushu University, Japan

Motivation
• The available storage devices are limited!
• I am eager to store as much information as possible!
• I want to do pattern matching as fast as possible!
...Yes! Data compression!
...but a suffix trie is very large...
[Figure: the kinds of data to be stored: phone numbers, electronic book, address book, memo, dictionary, e-mail, schedule, database]

Our goal
[Figure: the usual approach decompresses the compressed text and runs a pattern matching machine on the original text; our goal is a new machine that runs directly on the compressed text.]

Previous research
year  researchers                          compression method
1988  Eliam-Tsoreff and Vishkin            run-length
1992  Amir, Landau, and Vishkin            two-dimensional run-length
1992  Amir and Benson                      two-dimensional run-length
1994  Amir, Benson, and Farach             two-dimensional run-length
1994  Manber                               original compression scheme
1995  Farach and Thorup                    LZ77
1996  Gasieniec, et al.                    LZ77
1996  Amir, Benson and Farach              LZW
1997  Karpinski, Rytter, and Shinohara     straight-line programs
1997  Miyazaki, Shinohara, and Takeda      straight-line programs
1997  Takeda                               finite state encoding
1998  Fukamachi, Shinohara, and Takeda     Huffman encoding
1998  Shibata                              byte pair encoding
1998  Kida, et al.                         LZW (AC automaton, DCC’98)

Recent research
year  researchers                                    compression method
1999  Shibata, Takeda, Shinohara, and Arikawa        Antidictionaries
1999  de Moura, Navarro, Ziviani, and Baeza-Yates    Word based encoding
1998  Navarro and Raffinot                           LZ family
1999  Kida, Takeda, Shinohara, and Arikawa           LZW (Shift-And algorithm)
1999  Shibata, et al.                                Byte pair encoding
1999  Kida, et al.                                   Dictionary based methods (Collage system)
Venue labels on the slide: CPM’99 (three entries), SPIRE’99 (one entry).

Our main results
• The new algorithm scans a compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern, after an O(m+|Σ|) time and O(|Σ|) space preprocessing.
• The algorithm is about 1.3 times faster than our previous one, which simulates the AC automaton.
• The algorithm is about 1.5 times faster than a decompression followed by a simple search using the Shift-And algorithm.
|D| : size of the dictionary trie
n : compressed text length
m : pattern length
r : number of pattern occurrences
|Σ| : alphabet size

Lempel-Ziv-Welch Compression: how to compress and decompress

Lempel-Ziv-Welch (LZW) compression
Original text:   a b ab ab ba b c aba bc abab aba
Compressed text: 1 2 4 4 5 2 3 6 9 11 6
[Figure: dictionary trie with root 0 and nodes 1 to 12: 1 = a, 2 = b, 3 = c, 4 = ab, 5 = ba, 6 = aba, 7 = abb, 8 = bab, 9 = bc, 10 = ca, 11 = abab, 12 = bca]
Size of the dictionary trie: O(|D|) = O(n)

How to compress a text
Original text:   a b ab ab ba b c aba bc abab
Compressed text: 1 2 4 4 5 2 3 6 9 11
[Figure: animation of the compression; the dictionary trie (nodes 0 to 12) grows by one node for each code written.]

How to decompress a compressed text
Compressed text: 1 2 4 4 5 2 3 6 9 11
Original text:   a b ab ab ba b c aba bc abab
[Figure: animation of the decompression; the same dictionary trie (nodes 0 to 12) is rebuilt while the codes are read. Time labels on the slide: O(N) time, O(n) time.]

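A minimal Python sketch of the two moves above. The function names and the explicit alphabet argument are assumptions made for illustration; codes start at 1 so that they agree with the node numbers of the dictionary trie in the example. For brevity the dictionary is a plain dict keyed by strings, not a real trie, so this sketch does not claim the time bounds stated on the slides.

```python
def lzw_compress(text, alphabet):
    """LZW-compress `text` (all characters must be in `alphabet`)."""
    # Single characters get the codes 1..len(alphabet); 0 is the trie root.
    codes = {ch: i + 1 for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    out = []
    current = ""
    for ch in text:
        if current + ch in codes:
            current += ch                    # keep walking down the trie
        else:
            out.append(codes[current])       # emit the longest known string
            codes[current + ch] = next_code  # add exactly one new node
            next_code += 1
            current = ch
    if current:
        out.append(codes[current])
    return out


def lzw_decompress(codes_in, alphabet):
    """Invert lzw_compress, rebuilding the same dictionary on the fly."""
    strings = {i + 1: ch for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    out = []
    prev = None
    for code in codes_in:
        if code in strings:
            phrase = strings[code]
        else:
            # The code names the node that is being created in this step.
            phrase = prev + prev[0]
        if prev is not None:
            strings[next_code] = prev + phrase[0]
            next_code += 1
        out.append(phrase)
        prev = phrase
    return "".join(out)
```

With the alphabet "abc", compressing the slide's text abababbabcababcabab gives the codes 1 2 4 4 5 2 3 6 9 11, and decompressing them returns the text.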
Compressed Pattern Matching in LZW Compressed Text with the Shift-And approach

Shift-And approach to pattern matching (Baeza-Yates and Gonnet [1992], Wu and Manber [1992])
pattern: aabac
text:    aabaacaabacab
[Figure: the mask bits for a, b, and c, and the m-bit state after each text character. Each step shifts the state by one bit, sets the lowest bit, and ANDs the result with the mask bits of the current character. When the bit for the last pattern position becomes 1: Pattern was found!]

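A minimal Python sketch of the Shift-And search on an uncompressed text. The function names are assumptions for illustration; bit i-1 of an integer stands for pattern position i, and the pattern is assumed to be non-empty.

```python
def build_masks(pattern):
    """Mask bits: masks[a] has bit i-1 set iff pattern position i holds a."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def shift_and_search(pattern, text):
    """Return the 1-based end positions of all occurrences of pattern."""
    m = len(pattern)
    masks = build_masks(pattern)
    accept = 1 << (m - 1)        # bit of the last pattern position
    state = 0                    # no prefix matched yet
    hits = []
    for pos, ch in enumerate(text, start=1):
        # shift, OR in the lowest bit, AND with the mask of this character
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos)
    return hits
```

For the slide's example, shift_and_search("aabac", "aabaacaabacab") returns [11]: the only occurrence ends at text position 11.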
Properties of the Shift-And approach
• Simple, but very fast when the pattern length m is not greater than the word length of typical computers (32 or 64).
• Assuming m ≤ 32 (or 64) and that bit-shift operations and bitwise logical operations on integers can be performed in constant time, it runs in O(n) time.
• This method has many variations:
  • generalized pattern matching
  • pattern matching with k mismatches
  • pattern matching for multiple patterns

Basic idea of our algorithm
pattern: aabac
text:    aabaacaabacab
compressed text: 1 6 15
[Figure: the Shift-And state table for the text. Instead of advancing one character at a time, the state jumps (Jump!) over the whole string that each code of the compressed text stands for.]
Can each jump be done in O(1) time?

Basic idea of our algorithm (continued)
[Figure: the same state table after the jumps; the bit for the last pattern position becomes 1 inside the string of a code. Pattern was found!]
We need a mechanism for reporting all pattern occurrences.

Technical details
Lemma 1 (Realization of ‘Jump’): The state transition function can be realized in O(|D|+m) time using O(|D|) space so that it returns its value in O(1) time.
Lemma 2 (Realization of ‘Output’): The procedure which enumerates the pattern occurrences can be realized in O(|D|+m) time using O(|D|) space so that it runs in O(r) time.
|D| : size of the dictionary trie
m : pattern length
r : number of pattern occurrences

Overview of the algorithm
Input. A pattern P and an LZW compressed text u1, u2, …, un.
Output. All occurrences of the pattern.
  Construct the mask bits from P.
  Initialize the dictionary trie, M̂, U, and V;
  l := 0; S := ∅;
  for i := 1 to n do begin
    for each d ∈ Output(S, ui) do
      report ‘pattern occurs at position l + d’;
    S := f̂(S, ui);   /* Jump the state! */
    l := l + |ui|;    /* increment the offset */
    Update the dictionary trie, M̂, U, and V;
  end

Details of our algorithm: realization of Jump and Output

Detail of ‘Jump’
For a ∈ Σ and S ⊆ {1, …, m}:
  M(a) := {1 ≤ i ≤ m | Pattern[i] = a}
  f(S, a) := ((S ⊕ 1) ∪ {1}) ∩ M(a), where S ⊕ k = {i + k | i ∈ S}
(⊕ is realized by a bit shift, ∪ by a bitwise OR, ∩ by a bitwise AND.)
Mask bits for the pattern aabac: M(a) = {1, 2, 4}, M(b) = {3}, M(c) = {5}.
[Figure: one state transition starting from the state S = {1, 3}. For example, f({1, 3}, a) = ({2, 4} ∪ {1}) ∩ {1, 2, 4} = {1, 2, 4}.]

Detail of ‘Jump’ (continued)
For a ∈ Σ, u ∈ Σ*, and S ⊆ {1, …, m}, define f̂ recursively:
  f̂(S, ε) := S
  f̂(S, ua) := f(f̂(S, u), a)
and let
  M̂(u) := f̂({1, …, m}, u).
Then
  f̂(S, u) = ((S ⊕ |u|) ∪ {1, …, |u|}) ∩ M̂(u),
which takes O(1) time once M̂(u) is known.
Recall: M(a) := {1 ≤ i ≤ m | Pattern[i] = a} and f(S, a) := ((S ⊕ 1) ∪ {1}) ∩ M(a).

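A small Python check of the identity for f̂, written as a sketch. All names are assumptions for illustration; a state is an integer whose bit i-1 stands for pattern position i, and M̂(u) is recomputed naively here, whereas the algorithm keeps it per trie node.

```python
from itertools import product


def build_masks(pattern):
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def f(state, ch, masks):
    """One step: shift by one, OR in position 1, AND with M(ch)."""
    return ((state << 1) | 1) & masks.get(ch, 0)


def f_hat(state, u, masks):
    """Recursive definition: apply f once per character of u."""
    for ch in u:
        state = f(state, ch, masks)
    return state


def f_hat_closed(state, u, masks, m):
    """Closed form: one shift, one OR, one AND with M^(u)."""
    full = (1 << m) - 1                    # the set {1, ..., m}
    m_hat_u = f_hat(full, u, masks)        # M^(u)
    prefix = (1 << min(len(u), m)) - 1     # {1, ..., |u|}, clipped to m bits
    return ((state << len(u)) | prefix) & m_hat_u


def closed_form_holds(pattern, alphabet, max_len):
    """Compare both forms for every state and every u up to max_len."""
    m = len(pattern)
    masks = build_masks(pattern)
    for n in range(max_len + 1):
        for u in map("".join, product(alphabet, repeat=n)):
            for state in range(1 << m):
                if f_hat(state, u, masks) != f_hat_closed(state, u, masks, m):
                    return False
    return True
```

The identity holds because shifting distributes over the intersection: each step applies the same shift, OR, and AND both to the shifted state and to M̂(u).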
Move of f̂(S, u)
text: aabaacaabacab
[Figure, two animation frames: the state table for the text, with the strings aba and acaabac marked; for such a string u the state S is shifted by |u|, the lowest |u| bits are set, and the result is ANDed with M̂(u).]

How to calculate M̂(u)
  M̂(ua) = f̂({1, …, m}, ua)
         = f(f̂({1, …, m}, u), a)
         = f(M̂(u), a)
         = ((M̂(u) ⊕ 1) ∪ {1}) ∩ M(a)
[Figure: in the dictionary trie D, node u stores M̂(u); its child ua, reached by the edge labeled a, gets M̂(ua) in O(1) time.]
Total: O(|D|) time and space.

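A Python sketch of this update, filling in M̂ for every node of the dictionary trie while the codes are read. The function name, the parallel lists, and the alphabet argument are assumptions for illustration; codes start at 1 and node 0 is the root.

```python
def m_hat_table(codes, pattern, alphabet):
    """Return (m_hat, parent, label), one entry per trie node."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    m_hat = [(1 << m) - 1]   # M^(empty string) = {1, ..., m}
    parent = [0]
    label = [""]             # character on the edge from the parent
    first = [""]             # first character of the node's string

    def add_node(u, a):
        # M^(ua) = ((M^(u) shifted by 1) OR {1}) AND M(a)
        m_hat.append(((m_hat[u] << 1) | 1) & masks.get(a, 0))
        parent.append(u)
        label.append(a)
        first.append(first[u] or a)

    for a in alphabet:       # nodes 1..len(alphabet)
        add_node(0, a)

    prev = None
    for code in codes:
        if prev is not None:
            # New node: previous string plus the first character of the
            # current one; if the code is the node being created right
            # now, that character is the first character of prev.
            a = first[code] if code < len(m_hat) else first[prev]
            add_node(prev, a)
        prev = code
    return m_hat, parent, label
```

Each node costs one shift, one OR, and one AND, which matches the O(|D|) total stated on the slide as long as m fits in a machine word.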
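How to enumerate the occurrences
For S ∈ 2^{1, …, m} and u ∈ D:
  Output(S, u) := {1 ≤ j ≤ |u| | m ∈ f̂(S, u[1..j])}
[Figure: the string u read from the state S. An occurrence either starts in the length-i prefix of the pattern that was already matched before u (for the largest i ∈ S) or lies entirely inside u. In the example, Output(S, u) = {2, 11}.]
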
Realization of Output(S, u)
  Output(S, u) = ((m ⊖ S) ∩ U(u)) ∪ V(u), where m ⊖ S = {m − i | i ∈ S}
  U(u) := {1 ≤ i ≤ |u| | i < m and u[1..i] = Pattern[m−i+1..m]}
  V(u) := {1 ≤ i ≤ |u| | i ≥ m and u[i−m+1..i] = Pattern}
The first part, (m ⊖ S) ∩ U(u), depends on S; the second part, V(u), is independent of S.

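A Python sketch that compares this formula with a direct simulation of the Shift-And machine. It uses Python sets instead of bit vectors for readability, and all function names are assumptions for illustration.

```python
def occurrences_direct(S, u, pattern):
    """Positions j in u after which the state started at S contains m."""
    m = len(pattern)
    state, out = set(S), set()
    for j, ch in enumerate(u, start=1):
        # f(S, a): extend every matched prefix (and the empty one) by ch
        state = {i + 1 for i in state | {0} if i < m and pattern[i] == ch}
        if m in state:
            out.add(j)
    return out


def u_set(u, pattern):
    """U(u): lengths i < m such that u[1..i] is a suffix of the pattern."""
    m = len(pattern)
    return {i for i in range(1, min(len(u), m - 1) + 1)
            if u[:i] == pattern[m - i:]}


def v_set(u, pattern):
    """V(u): end positions of occurrences lying entirely inside u."""
    m = len(pattern)
    return {i for i in range(m, len(u) + 1) if u[i - m:i] == pattern}


def occurrences_formula(S, u, pattern):
    """((m - S) intersected with U(u)) united with V(u)."""
    m = len(pattern)
    return ({m - s for s in S} & u_set(u, pattern)) | v_set(u, pattern)
```

For the pattern aabac, the state S = {1, 4}, and u = ca, both functions return {1}: the prefix aaba matched before u is completed by the first character of u.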
How to calculate U(u) and V(u)
  if m ∈ M̂(ua) (and |ua| < m) then U(ua) = U(u) ∪ {|ua|}
  else U(ua) = U(u);
This takes O(1) time per node.
V(u) can be handled in the same way as in [DCC’98].
[Figure: in the dictionary trie D, node u stores U(u) and V(u); its child ua gets U(ua) and V(ua).]
Total: O(|D|) time and space.

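Putting the pieces together, a Python sketch of the whole search on an LZW code sequence. This is an illustration under stated assumptions, not the authors' implementation: the function name and the table layout are invented here, U(u) is stored mirrored so that one AND with the state yields (m ⊖ S) ∩ U(u), and V(u) is kept as a chain of links through the trie, which is only one possible way to enumerate it (the slide defers V to [DCC’98]). Codes start at 1 and the pattern is assumed to be non-empty.

```python
def search_lzw(codes, pattern, alphabet):
    """Return the 1-based end positions of pattern in the decoded text."""
    m = len(pattern)
    accept = 1 << (m - 1)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    # One table entry per trie node; node 0 is the root (empty string).
    parent, length, first = [0], [0], [""]
    m_hat = [(1 << m) - 1]   # M^(u)
    u_rev = [0]              # mirrored U(u): bit s-1 set iff m - s in U(u)
    v_link = [0]             # deepest prefix of u ending with the pattern

    def add_node(u, a):
        node, n = len(parent), length[u] + 1
        mh = ((m_hat[u] << 1) | 1) & masks.get(a, 0)
        ends = bool(mh & accept)             # is m in M^(ua)?
        parent.append(u)
        length.append(n)
        first.append(first[u] or a)
        m_hat.append(mh)
        u_rev.append(u_rev[u] | (1 << (m - n - 1)) if ends and n < m
                     else u_rev[u])
        v_link.append(node if ends and n >= m else v_link[u])

    for a in alphabet:
        add_node(0, a)

    state, offset, prev, hits = 0, 0, None, []
    for code in codes:
        if prev is not None:
            # Decoder update, done before the code is used so that a code
            # naming the node created in this very step is already known.
            add_node(prev, first[code] if code < len(parent) else first[prev])
        # Output, part 1: occurrences that started before this string.
        cross = state & u_rev[code]
        while cross:
            low = cross & -cross
            hits.append(offset + m - low.bit_length())
            cross ^= low
        # Output, part 2: V(u), occurrences inside this string.
        node = v_link[code]
        while node:
            hits.append(offset + length[node])
            node = v_link[parent[node]]
        # Jump: shift by |u|, set the lowest |u| bits, AND with M^(u).
        n = length[code]
        state = ((state << n) | ((1 << min(n, m)) - 1)) & m_hat[code]
        offset += n
        prev = code
    return sorted(hits)
```

For example, search_lzw([1, 1, 2, 4, 3, 4, 6, 8, 2], "aabac", "abc") returns [11]; the codes are the LZW encoding of aabaacaabacab over the alphabet abc. Python integers are unbounded, so the shift by |u| is not a true constant-time word operation here; a word-based implementation would cap the shift at m.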
But... is it really fast? Is this really practical?

Experimental Comparisons
◆ Method 1: decompress the compressed text, then search it with Shift-And
◆ Method 2: our previous algorithm (DCC’98), run on the compressed text
◆ Method 3: our new algorithm, run on the compressed text

Experimental Comparisons (setup)
Original text: "The Brown corpus", 6.8 Mbytes
Compressed text: 3.4 Mbytes, produced by compress (UNIX command)
Language: C (with gcc compiler)
Machine: Sun SPARCstation 20 with remote disk storage
File transfer rate: 0.96 Mbyte/sec

Experimental results
Method                              CPU time (s)   elapsed time (s)
Shift-And with decompression        7.52           8.16
Our previous algorithm (DCC’98)     6.57           7.31
New algorithm                       5.15           6.05
Elapsed time = CPU time + file I/O time.
In CPU time, the new algorithm is about 1.5 times faster than Shift-And with decompression and about 1.3 times faster than our previous algorithm.

Experimental results (with the search on the uncompressed text)
Method                              CPU time (s)   elapsed time (s)
Shift-And with decompression        7.52           8.16
Our previous algorithm (DCC’98)     6.57           7.31
New algorithm                       5.15           6.05
Shift-And in original text          3.09           9.36

Conclusion
• The proposed algorithm scans an LZW compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern, after an O(m+|Σ|) time and O(|Σ|) space preprocessing.
• We implemented the algorithm and showed that it is approximately 1.3 times faster than our previous algorithm.
• Our new algorithm has several extensions:
  • generalized pattern matching
  • pattern matching with k mismatches
  • pattern matching for multiple patterns