Shift-And Approach to Pattern Matching in LZW Compressed Text

Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Setsuo ARIKAWA
Department of Informatics, Kyushu University, Japan

Motivation
• Available storage is limited!
• I want to store as much information as possible!
• I want to do pattern matching as fast as possible!
...Yes! Data compression! ...but a suffix trie is very large...
Data to be stored: phone numbers, electronic books, address book, memos, dictionary, e-mail, schedule, database.

Our goal
Conventional approach: Compressed Text → decompress → Original Text → Pattern Matching Machine
Our goal: Compressed Text → New Machine!

Previous research
year | researchers | compression method
1988 | Eilam-Tzoreff and Vishkin | run-length
1992 | Amir, Landau, and Vishkin | two-dimensional run-length
1992 | Amir and Benson | two-dimensional run-length
1994 | Amir, Benson, and Farach | two-dimensional run-length
1994 | Manber | original compression scheme
1995 | Farach and Thorup | LZ77
1996 | Gasieniec, et al. | LZ77
1996 | Amir, Benson, and Farach | LZW
1998 | Kida, et al. | LZW (AC automaton, DCC’98)
1997 | Karpinski, Rytter, and Shinohara | straight-line programs
1997 | Miyazaki, Shinohara, and Takeda | straight-line programs
1997 | Takeda | finite state encoding
1998 | Fukamachi, Shinohara, and Takeda | Huffman encoding
1998 | Shibata | byte pair encoding

Recent research
year | researchers | compression method
1999 | Shibata, Takeda, Shinohara, and Arikawa | Antidictionaries (CPM’99)
1998 | de Moura, Navarro, Ziviani, and Baeza-Yates | Word based encoding
1999 | Navarro and Raffinot | LZ family (CPM’99)
1999 | Kida, Takeda, Shinohara, and Arikawa | LZW (Shift-And algorithm, CPM’99)
1999 | Shibata, et al. | Byte pair encoding
1999 | Kida, et al. | Dictionary based methods (Collage system, SPIRE’99)

Our main results
• The new algorithm scans a compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern after O(m+|Σ|) time and O(|Σ|) space preprocessing.
• The algorithm is about 1.3 times faster than our previous one, which simulates the AC automaton.
• The algorithm is about 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm.
|D| : size of the dictionary trie
n : compressed text length
m : pattern length
r : number of pattern occurrences

Lempel-Ziv-Welch Compression: how to compress and decompress

Lempel-Ziv-Welch (LZW) compression
Original text: a b ab ab ba b c aba bc abab aba
Compressed text: 1 2 4 4 5 2 3 6 9 11 6
Dictionary trie (node: string): 0: ε (root), 1: a, 2: b, 3: c, 4: ab, 5: ba, 6: aba, 7: abb, 8: bab, 9: bc, 10: ca, 11: abab, 12: bca
Size of the dictionary trie: O(|D|) = O(n)

How to compress a text
Original text: a b ab ab ba b c aba bc abab
Compressed text: 1 2 4 4 5 2 3 6 9 11
[Figure: animation of the compression; the dictionary trie grows from the root (node 0) and the single-character nodes 1 (a), 2 (b), 3 (c) up to node 12.]
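The compression step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `lzw_compress` is hypothetical, a Python dict stands in for the dictionary trie, and codes are plain integers rather than the variable-width codes a real tool would emit. It assumes every character of the text belongs to the given alphabet.

```python
def lzw_compress(text, alphabet):
    """Return the list of LZW codes for text (sketch; dict instead of a trie)."""
    # Codes 1..|alphabet| stand for the single characters, matching the
    # slide's trie where node 0 is the root (the empty string).
    dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    codes = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            # Keep extending the current match while it is in the dictionary.
            current += ch
        else:
            # Emit the longest match and register match + next character.
            codes.append(dictionary[current])
            dictionary[current + ch] = next_code
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes


print(lzw_compress("abababbabcababcabab", "abc"))
# [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
```

The printed codes agree with the compressed text shown on the slide.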

How to decompress a compressed text
Compressed text: 1 2 4 4 5 2 3 6 9 11
Original text: a b ab ab ba b c aba bc abab
[Figure: animation of the decompression; the dictionary trie is rebuilt from the root (node 0) up to node 12 while the codes are read. Annotations on the slide: O(N) time, O(n) time.]
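Decompression rebuilds the same dictionary while reading the codes. The sketch below is an illustration under the same assumptions as the slide's example (codes 1, 2, 3 for a, b, c); `lzw_decompress` is a hypothetical name, and a dict of strings replaces the trie for brevity, which a real decoder would avoid.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the original text from LZW codes (sketch; stores full strings)."""
    strings = {i + 1: ch for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    out = []
    prev = None
    for code in codes:
        if code in strings:
            entry = strings[code]
        elif code == next_code and prev is not None:
            # The code refers to the entry that is being created right now:
            # it must be the previous string plus its own first character.
            entry = prev + prev[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        if prev is not None:
            # New dictionary entry: previous string + first char of this one.
            strings[next_code] = prev + entry[0]
            next_code += 1
        out.append(entry)
        prev = entry
    return "".join(out)


print(lzw_decompress([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc"))
# abababbabcababcabab
```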

Compressed Pattern Matching in LZW Compressed Text with Shift-And approach

Shift-And approach to pattern matching (Baeza-Yates and Gonnet [1992], Wu and Manber [1992])
pattern: aabac
text: aabaacaabacab
[Figure: the mask bits for the characters a, b, c and the state bit vector after each text character. Each step shifts the state and takes the bitwise AND (&) with the mask of the character read. When the bit for the last pattern position becomes 1: "Pattern was found!"]

Properties of Shift-And approach
• Simple, but very fast when the pattern length m is not greater than the word length of typical computers (32 or 64).
• Assuming m ≤ 32 (or 64) and that bit-shift operations and bitwise logical operations on integers can be performed in constant time, it runs in O(n) time.
• This method has many variations:
  • generalized pattern matching
  • pattern matching with k mismatches
  • pattern matching for multiple patterns
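For reference, the plain (uncompressed) Shift-And search described above can be sketched like this. The function name `shift_and_search` is hypothetical; bit i-1 of the state stands for pattern position i, and Python integers are unbounded, so the word-length restriction on m does not show up here.

```python
def shift_and_search(pattern, text):
    """Return the 1-based end positions of all occurrences of pattern in text."""
    m = len(pattern)
    # masks[c] has bit i-1 set iff pattern[i] == c (positions are 1-based).
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0
    ends = []
    for pos, ch in enumerate(text, start=1):
        # Shift, switch on position 1, then AND with the mask of the character.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            ends.append(pos)
    return ends


print(shift_and_search("aabac", "aabaacaabacab"))
# [11]
```

With the slide's pattern and text, the single occurrence ends at text position 11.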

Basic idea of our algorithm
pattern: aabac
text: aabaacaabacab
compressed text: 1 6 15
[Figure: the Shift-And state bit vectors over the text. Instead of advancing one character at a time, the state jumps over the whole string represented by each code ("Jump! Jump!"). Can each jump be done in O(1) time?]

Basic idea of our algorithm
pattern: aabac
text: aabaacaabacab
compressed text: 1 6 15
[Figure: the same state bit vectors; a pattern occurrence ("Pattern was found!") lies inside the string represented by a single code.]
We need a mechanism for reporting all pattern occurrences.

Technical details
Lemma 1 (Realization of ‘Jump’). The state transition function can be realized in O(|D|+m) time using O(|D|) space, and returns its value in O(1) time.
Lemma 2 (Realization of ‘Output’). The procedure which enumerates the pattern occurrences can be realized in O(|D|+m) time using O(|D|) space, and runs in O(r) time.
|D| : size of the dictionary trie
m : pattern length
r : number of pattern occurrences

Overview of the algorithm
Input. Pattern P, and $u_1, u_2, \dots, u_n$: LZW compressed text.
Output. All occurrences of the pattern.

Construct mask bits from P.
Initialize the dictionary trie, $\hat{M}$, U, and V;
l := 0; S := ∅;
for i := 1 to n do begin
    for each d ∈ Output(S, $u_i$) do
        report ‘pattern occurs at position l + d’;
    S := $\hat{f}(S, u_i)$; /* Jump the state! */
    l := l + $|u_i|$; /* increment the offset */
    Update the dictionary trie, $\hat{M}$, U, and V;
end
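The loop above can be put together in a compact sketch. Several points are assumptions made for illustration and are not taken from the slides: the name `search_lzw`, the use of Python dicts instead of a trie, storing U(u) in reversed form so that the set (m minus S) intersected with U(u) becomes a single AND, keeping V(u) as a copied tuple (the slides refer to DCC’98 for V), and adding the new dictionary node before the current code is processed so that a code referring to the node under construction is handled. The inner bit loop costs O(m) per code, so this sketch does not reach the stated O(n+r) bound.

```python
def search_lzw(pattern, codes, alphabet):
    """Return 1-based end positions of pattern in the text encoded by codes."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    top = 1 << (m - 1)

    def f(S, a):
        return ((S << 1) | 1) & masks.get(a, 0)

    # Per-node tables; node 0 is the root (empty string).
    length, first = {0: 0}, {}
    m_hat = {0: (1 << m) - 1}
    U_rev = {0: 0}   # bit i-1 set iff m-i is in U(u)  (reversed storage)
    V = {0: ()}      # end positions of occurrences lying inside u

    def add_node(node, parent, a):
        n = length[node] = length[parent] + 1
        first[node] = first.get(parent, a)
        m_hat[node] = f(m_hat[parent], a)
        hit = bool(m_hat[node] & top)
        U_rev[node] = U_rev[parent] | (1 << (m - n - 1)) if hit and n < m else U_rev[parent]
        V[node] = V[parent] + (n,) if hit and n >= m else V[parent]

    for i, ch in enumerate(alphabet):
        add_node(i + 1, 0, ch)
    next_code = len(alphabet) + 1

    S, offset, prev, found = 0, 0, None, []
    for code in codes:
        if prev is not None:
            # New node = previous string + first character of the current one.
            a = first[code] if code in first else first[prev]
            add_node(next_code, prev, a)
            next_code += 1
        # Output(S, u): occurrences that end inside u.
        hits = S & U_rev[code]
        for i in range(1, m):
            if (hits >> (i - 1)) & 1:
                found.append(offset + m - i)
        found.extend(offset + j for j in V[code])
        # Jump: closed form of the extended transition function.
        k = length[code]
        S = ((S << k) | ((1 << k) - 1)) & m_hat[code]
        offset += k
        prev = code
    return sorted(found)


print(search_lzw("aba", [1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc"))
# [3, 5, 13, 18]
```

The codes used here are the compressed text from the LZW example slides, which decodes to abababbabcababcabab.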

Detail of our Algorithm: Realization of Jump and Output

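Detail of ‘Jump’
For $a \in \Sigma$, $u \in \Sigma^*$, and $S \subseteq \{1, \dots, m\}$:
• $M(a) = \{\, 1 \le i \le m \mid \mathrm{Pattern}[i] = a \,\}$
• $f(S, a) = ((S \oplus 1) \cup \{1\}) \cap M(a)$, where $S \oplus 1 = \{\, i + 1 \mid i \in S \,\}$ is realized by a bit shift, $\cup$ by a bitwise OR, and $\cap$ by a bitwise AND.
Example (pattern: aabac): $M(a) = \{1, 2, 4\}$, $M(b) = \{3\}$, $M(c) = \{5\}$; state $S = \{1, 3\}$.
[Figure: the mask bits of a, b, c for the pattern aabac and one state transition computed by shift, OR, and AND.]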
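The two definitions on this slide translate directly into set operations. The sketch below uses Python sets rather than bit vectors to stay close to the notation; `mask_sets` and `f` are illustrative names, and the transition on character a from the state {1, 3} is a worked example chosen here, not one read off the slide's figure.

```python
def mask_sets(pattern):
    """M(a) = set of 1-based positions i with Pattern[i] == a."""
    masks = {}
    for i, ch in enumerate(pattern, start=1):
        masks.setdefault(ch, set()).add(i)
    return masks


def f(S, a, masks):
    """f(S, a) = ((S shifted by 1) union {1}) intersected with M(a)."""
    shifted = {i + 1 for i in S}
    return (shifted | {1}) & masks.get(a, set())


M = mask_sets("aabac")
print(sorted(M["a"]), sorted(M["b"]), sorted(M["c"]))
# [1, 2, 4] [3] [5]
print(sorted(f({1, 3}, "a", M)))
# [1, 2, 4]
```

Intersecting with M(a) also discards any shifted position larger than m, so no explicit bound check is needed.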

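Detail of ‘Jump’
For $a \in \Sigma$, $u \in \Sigma^*$, and $S \subseteq \{1, \dots, m\}$:
• $M(a) = \{\, 1 \le i \le m \mid \mathrm{Pattern}[i] = a \,\}$
• $f(S, a) = ((S \oplus 1) \cup \{1\}) \cap M(a)$
Define $\hat{f}$ recursively:
• $\hat{f}(S, \varepsilon) = S$
• $\hat{f}(S, ua) = f(\hat{f}(S, u), a)$
With $\hat{M}(u) = \hat{f}(\{1, \dots, m\}, u)$, the extended transition has the closed form
$\hat{f}(S, u) = ((S \oplus |u|) \cup \{1, \dots, |u|\}) \cap \hat{M}(u)$,
where $S \oplus k = \{\, i + k \mid i \in S \,\}$. Given $\hat{M}(u)$ and $|u|$, this is computed in O(1) time.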
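The closed form can be checked against the recursive definition on small inputs. The following sketch uses integers as bit vectors (bit i-1 for position i); all function names are hypothetical, and M-hat is recomputed from scratch for each call here, whereas the slides store it per dictionary node.

```python
from itertools import product


def char_masks(pattern):
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def f(S, a, masks):
    # One Shift-And step: shift, OR in position 1, AND with M(a).
    return ((S << 1) | 1) & masks.get(a, 0)


def f_hat_recursive(S, u, masks):
    # f_hat(S, empty) = S;  f_hat(S, ua) = f(f_hat(S, u), a)
    for a in u:
        S = f(S, a, masks)
    return S


def f_hat_closed(S, u, masks, m):
    # ((S shifted by |u|) OR {1..|u|}) AND M_hat(u)
    full = (1 << m) - 1
    m_hat_u = f_hat_recursive(full, u, masks)
    k = len(u)
    return ((S << k) | ((1 << k) - 1)) & m_hat_u


pattern = "aabac"
masks = char_masks(pattern)
m = len(pattern)
agree = all(
    f_hat_closed(S, "".join(u), masks, m) == f_hat_recursive(S, "".join(u), masks)
    for S in range(1 << m)
    for k in range(4)
    for u in product("abc", repeat=k)
)
print(agree)
# True
```

The check runs over every subset S of {1, ..., 5} and every string u of length at most 3 over {a, b, c}.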

Move of $\hat{f}(S, u)$
text: aabaacaabacab
[Figure: animation of the jump over the strings aba and acaabac represented by two consecutive codes; the shifted state is combined by a bitwise AND (&) with $\hat{M}(u)$.]

Move of $\hat{f}(S, u)$ (continued)
text: aabaacaabacab
[Figure: next step of the same animation over the strings aba and acaabac, showing the state after the bitwise AND (&) with $\hat{M}(u)$.]

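How to calculate $\hat{M}(u)$
$\hat{M}(u \cdot a) = \hat{f}(\{1, \dots, m\}, u \cdot a)$
$= f(\hat{f}(\{1, \dots, m\}, u), a)$
$= f(\hat{M}(u), a)$
$= ((\hat{M}(u) \oplus 1) \cup \{1\}) \cap M(a)$
[Figure: in the dictionary trie D, the node for $u \cdot a$ is the child of the node for $u$ along the edge labelled $a$; $\hat{M}(u \cdot a)$ is computed from $\hat{M}(u)$ in O(1) time.]
Total: O(|D|) time and space.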
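A sketch of this update while the dictionary is rebuilt from the codes is given below. The name `m_hat_table` and the dict-based tables are assumptions for illustration; only the first character of each node's string is stored, since that is all that is needed to label the new trie edge.

```python
def m_hat_table(pattern, codes, alphabet):
    """Map each dictionary node to M_hat(u) as an integer bit vector."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    def f(S, a):
        return ((S << 1) | 1) & masks.get(a, 0)

    m_hat = {0: (1 << m) - 1}   # root: M_hat(empty) = {1, ..., m}
    first = {}                  # first character of each node's string
    for i, ch in enumerate(alphabet):
        m_hat[i + 1] = f(m_hat[0], ch)
        first[i + 1] = ch
    next_code = len(alphabet) + 1
    prev = None
    for code in codes:
        if prev is not None:
            # New node = string of prev + first character of the current string;
            # if the current code is the node being created, that is first[prev].
            a = first[code] if code in first else first[prev]
            m_hat[next_code] = f(m_hat[prev], a)   # O(1) per new node
            first[next_code] = first[prev]
            next_code += 1
        prev = code
    return m_hat


table = m_hat_table("aba", [1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc")
print(bin(table[4]), bin(table[6]))
# 0b10 0b101
```

Node 4 is the string ab and node 6 is aba in the LZW example, so for the pattern aba the values correspond to the sets {2} and {1, 3}.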

How to enumerate the occurrences
$\mathrm{Output}(S, u) = \{\, 1 \le j \le |u| \mid m \in \hat{f}(S, u[1..j]) \,\}$, defined on $2^{\{1, \dots, m\}} \times D$.
[Figure: the string u preceded by the length-i prefix of the pattern for the largest $i \in S$, with two pattern occurrences marked; in this example Output(S, u) = {2, 11}.]

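Realization of Output(S, u)
$\mathrm{Output}(S, u) = ((m \ominus S) \cap U(u)) \cup V(u)$, where $m \ominus S = \{\, m - i \mid i \in S \,\}$.
• $U(u) = \{\, 1 \le i \le |u| \mid i < m \text{ and } u[1..i] = \mathrm{Pattern}[m-i+1..m] \,\}$
• $V(u) = \{\, 1 \le i \le |u| \mid i \ge m \text{ and } u[i-m+1..i] = \mathrm{Pattern} \,\}$
The first term depends on S; V(u) is independent of S.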
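The decomposition can be compared with the direct definition of Output on a small example. The sketch below computes U(u) and V(u) straight from their definitions with Python sets; the function names and the example state {3} with u = acaabac are chosen here for illustration.

```python
def f_hat(S, u, pattern):
    """Extended transition on sets of 1-based pattern positions."""
    m = len(pattern)
    for a in u:
        S = {i for i in ({j + 1 for j in S} | {1}) if i <= m and pattern[i - 1] == a}
    return S


def output_by_definition(S, u, pattern):
    m = len(pattern)
    return {j for j in range(1, len(u) + 1) if m in f_hat(S, u[:j], pattern)}


def U(u, pattern):
    # Prefixes of u (shorter than the pattern) that are suffixes of the pattern.
    m = len(pattern)
    return {i for i in range(1, len(u) + 1) if i < m and u[:i] == pattern[m - i:]}


def V(u, pattern):
    # End positions of occurrences lying completely inside u.
    m = len(pattern)
    return {i for i in range(1, len(u) + 1) if i >= m and u[i - m:i] == pattern}


def output_by_formula(S, u, pattern):
    m = len(pattern)
    return ({m - i for i in S} & U(u, pattern)) | V(u, pattern)


# State {3}: the text read so far ends with "aab", a prefix of "aabac".
print(sorted(output_by_formula({3}, "acaabac", "aabac")))
# [2, 7]
print(output_by_formula({3}, "acaabac", "aabac") == output_by_definition({3}, "acaabac", "aabac"))
# True
```

Position 2 comes from the S-dependent term (aab followed by ac), and position 7 from V(u) (the occurrence inside u).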

How to calculate U(u) and V(u)
if $m \in \hat{M}(u \cdot a)$ then $U(u \cdot a) = U(u) \cup \{|u \cdot a|\}$ else $U(u \cdot a) = U(u)$;
This update takes O(1) time per node.
We can deal with V(u) in the same way as in [DCC’98].
[Figure: in the dictionary trie D, U and V of the node for $u \cdot a$ are computed from U and V of its parent node for $u$.]
Total: O(|D|) time and space.
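The update rule can be traced along one root-to-node path of the trie, that is, along the prefixes of a single string u. In this sketch the function name is hypothetical, the extra test n < m keeps U(u) exactly as defined (positions of m or more can never be selected by the S-dependent term anyway), and the V update simply copies the parent's set, which is an assumption made here and not the DCC’98 technique the slide refers to.

```python
def u_v_along_prefixes(u, pattern):
    """Return (U, V) for u, updated one character at a time from the parent."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    top = 1 << (m - 1)
    m_hat = (1 << m) - 1       # M_hat of the empty string (root)
    U, V = set(), set()
    for n, a in enumerate(u, start=1):
        # M_hat(u.a) = f(M_hat(u), a)
        m_hat = ((m_hat << 1) | 1) & masks.get(a, 0)
        if m_hat & top:        # m is in M_hat(u.a)
            if n < m:
                U = U | {n}    # the prefix of length n is a suffix of the pattern
            else:
                V = V | {n}    # assumption: naive copy, see lead-in
    return U, V


print(u_v_along_prefixes("acaabac", "aabac"))
# ({2}, {7})
```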

But... Is it really fast? Uhmm.... Is this really practical?

Experimental Comparisons
◆ Method 1: Compressed Text → decompress → Shift-And
◆ Method 2: Compressed Text → our previous algorithm (DCC’98)
◆ Method 3: Compressed Text → our new algorithm

Experimental Comparisons
• Original text: "The Brown corpus", 6.8 Mbytes
• Compressed text: 3.4 Mbytes, produced by compress (UNIX command)
• Language: C (with gcc compiler)
• Machine: Sun SPARCstation 20 with remote disk storage
• File transfer ratio: 0.96 Mbyte/sec

Experimental results
Method | CPU time (s) | elapsed time (s)
Shift-And with decompression | 7.52 | 8.16
Our previous algorithm (DCC’98) | 6.57 | 7.31
New algorithm | 5.15 | 6.05
Elapsed time = CPU time + file I/O time.
The new algorithm is about 1.5 times faster than Shift-And with decompression, and about 1.3 times faster than our previous algorithm.
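The two speed-up factors quoted on this slide appear to follow from the first timing figure of each method (7.52 s, 6.57 s, and 5.15 s); this reading is an inference from the numbers, since the slide does not state which column the factors refer to:

$$\frac{7.52}{5.15} \approx 1.46 \approx 1.5, \qquad \frac{6.57}{5.15} \approx 1.28 \approx 1.3$$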

Experimental results
Method | CPU time (s) | elapsed time (s)
Shift-And with decompression | 7.52 | 8.16
Our previous algorithm (DCC’98) | 6.57 | 7.31
New algorithm | 5.15 | 6.05
Shift-And in original text | 9.36 | 3.09

Conclusion
• The proposed algorithm scans an LZW compressed text in O(n+r) time using O(|D|) space, and reports all occurrences of the pattern after O(m+|Σ|) time and O(|Σ|) space preprocessing.
• We implemented the algorithm, and showed that it is approximately 1.3 times faster than our previous algorithm.
• Our new algorithm has several extensions:
  • generalized pattern matching
  • pattern matching with k mismatches
  • pattern matching for multiple patterns
