
Multiple Pattern Matching in LZW Compressed Text

Nagano. Fukuoka. Multiple Pattern Matching in LZW Compressed Text. Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Masamichi MIYAZAKI, Setsuo ARIKAWA. Department of Informatics, Kyushu University, Japan.


Presentation Transcript


  1. Nagano. Fukuoka. Multiple Pattern Matching in LZW Compressed Text. Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Masamichi MIYAZAKI, Setsuo ARIKAWA. Department of Informatics, Kyushu University, Japan.

  2. Our Goal: a pattern matching machine normally runs on the original text. We want a new machine that runs directly on the compressed text.

  3. Previous studies (year: researcher, compression method)
     • 1988: Eilam-Tsoreff and Vishkin, run-length
     • 1992: Amir, Landau, and Vishkin, two-dimensional run-length
     • 1992: Amir and Benson, two-dimensional run-length
     • 1995: Farach and Thorup, LZ77
     • 1996: Gasieniec, et al., LZ77
     • 1996: Amir, Benson, and Farach, LZW
     • 1997: Karpinski, et al., straight-line programs
     • 1997: Miyazaki, et al., straight-line programs

  4. Previous result vs. our result
     • Amir, Benson, and Farach's algorithm (JCSS 1996), "Let sleeping files lie: Pattern matching in Z-compressed files"
       • deals with only a single pattern.
       • can find only the first occurrence of the pattern.
       • takes O(n + m^2) time and space, where n is the length of the compressed text and m is the length of the pattern.
     • Our algorithm
       • deals with multiple patterns.
       • can find all occurrences of the patterns.
       • takes O(n + m^2 + r) time and O(n + m^2) space, where m is the total length of the patterns and r is the number of pattern occurrences.

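  5. Lempel-Ziv-Welch compression
     • Alphabet: Σ = {a, b, c}
     • Original text, split into phrases: a b ab ab ba b c aba bc abab
     • Compressed text: 1 2 4 4 5 2 3 6 9 11
     • Dictionary trie D (node: string), with node 0 as the root: 1: a, 2: b, 3: c, 4: ab, 5: ba, 6: aba, 7: abb, 8: bab, 9: bc, 10: ca, 11: abab, 12: bca
     • O(|D|) = O(n)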
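
The example on this slide can be reproduced with a short LZW encoder. This is a minimal sketch, not the authors' implementation: the function name `lzw_compress` is made up for illustration, and a plain Python dict of strings stands in for the dictionary trie. Codes 1 to 3 are assumed to be reserved for a, b, c, as in the slide's node numbering.

```python
def lzw_compress(text, alphabet):
    """Return the list of LZW codes for `text`.

    Codes 1..len(alphabet) are the single characters; node 0 of the
    slide's trie (the empty string) never appears in the output.
    Assumes every character of `text` is in `alphabet`.
    """
    dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}
    next_code = len(alphabet) + 1
    codes = []
    w = ""                                  # longest dictionary match so far
    for ch in text:
        if w + ch in dictionary:
            w += ch                         # keep extending the match
        else:
            codes.append(dictionary[w])     # emit the code of the match
            dictionary[w + ch] = next_code  # new trie node: match + next char
            next_code += 1
            w = ch
    if w:
        codes.append(dictionary[w])
    return codes


print(lzw_compress("abababbabcababcabab", "abc"))
# → [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
```

Each emitted code adds exactly one dictionary entry, which is why the dictionary size stays linear in the length n of the compressed text.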

  6. Basic Idea (Amir et al.): a KMP automaton for the pattern abab over Σ, with states -1, 0, 1, 2, 3, 4 and output {abab} at state 4. The diagram legend distinguishes the goto function, the failure function, and the output sets { }. Run over the original text one character at a time, the automaton reports "found!" at each occurrence of the pattern.

  7. Basic Idea (Amir et al.): the same KMP automaton (pattern abab), with transitions now labeled by whole strings such as ab, bab, bc, bca, ca, ba, aba, and abab instead of single characters. Example: Next(0, bab) = 2.

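  8. Basic Idea (Amir et al.): Next(2, abc) = 0 and Output(2, abc) = { 〈2, abab〉 }. A single jump from state 2 over the string abc passes through the accepting state 4 without stopping there, so who is watching the occurrences of the pattern?! That is the job of Output.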
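
The two values on this slide can be checked by running a KMP automaton over a whole string and recording where it passes through the accepting state. The sketch below is illustrative only: the helper names are hypothetical, states are numbered 0 to 4 (the slide's extra state -1 is folded into the transition loop), and Next and Output are computed by brute-force stepping rather than by the table lookups the talk is about.

```python
def kmp_failure(pattern):
    """fail[j] = length of the longest proper border of pattern[:j]."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j + 1] = k
    return fail


def step(pattern, fail, state, ch):
    """One character transition; state = number of pattern characters matched."""
    while state > 0 and (state == len(pattern) or pattern[state] != ch):
        state = fail[state]
    if pattern[state] == ch:
        state += 1
    return state


def next_and_output(pattern, state, u):
    """Return (Next(state, u), Output(state, u)) by stepping through u."""
    fail = kmp_failure(pattern)
    found = set()
    for i, ch in enumerate(u, start=1):
        state = step(pattern, fail, state, ch)
        if state == len(pattern):
            found.add((i, pattern))   # pattern ends at the i-th character of u
    return state, found


print(next_and_output("abab", 0, "bab"))   # → (2, set())
print(next_and_output("abab", 2, "abc"))   # → (0, {(2, 'abab')})
```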

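  9. For multiple patterns: the Aho-Corasick pattern matching machine
     • Patterns: Π = {aba, ababb, abca, bb}
     • States (state: string): 0: empty string, 1: a, 2: ab, 3: aba, 4: abab, 5: ababb, 6: abc, 7: abca, 8: b, 9: bb
     • Outputs: {aba} at state 3, {ababb, bb} at state 5, {abca} at state 7, {bb} at state 9
     • Legend: goto function, failure function, output sets { }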

  10. Our Algorithm
     Input. Π: set of patterns; u1, u2, …, un: LZW compressed text.
     Output. All occurrences of the patterns.
     Construct from Π the AC machine and the generalized suffix trie;
     Initialize the dictionary trie, Next and Output;
     l := 0; state := q0;
     for i := 1 to n do begin
       for each 〈d, π〉 ∈ Output(state, ui) do
         report "pattern π occurs at position l+d";
       state := Next(state, ui);
       l := l + |ui|;
       Update the dictionary trie, Next and Output
     end.
     Costs annotated on the slide: O(m^2) for the construction from Π, plus O(n) and O(n + r) for the scan over the n codes with r reported occurrences.

  11. Ok! Let’s go!

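  12. State Transition Function Next(q, u)
     • Next: Q × D → Q, where Q is the set of states of the AC machine, D is the set of strings represented by the dictionary trie, and m is the total length of the patterns.
     • Stored directly as a table, Next costs O(m × |D|)!!
     • Instead: Next(q, u) = N1(q, u)・u if u ∈ Factor(Π), and Next(q, u) = Next(0, u) otherwise.
     • The N1 part costs O(m × m^2) and the Next(0, u) part costs O(|D|).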
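
The "otherwise" case holds because a string u that is not a factor of any pattern wipes out whatever the machine had matched before it. A brute-force check of that claim is sketched below, assuming AC states are represented by the strings they spell (so state 0 is the empty string) and using the slide-9 pattern set. The names `next_state` and `check` are made up for this illustration.

```python
from itertools import product

PATTERNS = ["aba", "ababb", "abca", "bb"]
# AC states, represented by the pattern prefixes they spell ("" is state 0).
PREFIXES = {p[:i] for p in PATTERNS for i in range(len(p) + 1)}
# Factor(Π): all non-empty substrings of the patterns.
FACTORS = {p[i:j] for p in PATTERNS
           for i in range(len(p)) for j in range(i + 1, len(p) + 1)}


def next_state(q, u):
    """Next(q, u): the longest suffix of q + u that is a prefix of some pattern."""
    s = q + u
    for k in range(len(s) + 1):
        if s[k:] in PREFIXES:
            return s[k:]


def check(max_len=5, alphabet="abc"):
    """Verify Next(q, u) == Next(0, u) for every non-factor u up to max_len."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            u = "".join(letters)
            if u not in FACTORS:
                assert all(next_state(q, u) == next_state("", u) for q in PREFIXES)
    return True


print(len(FACTORS), check())   # → 17 True
```

Only the factors need a per-state table, and a pattern set of total length m has O(m^2) factors, which is where the O(m × m^2) term comes from.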

  13. State Transition Function Next(q, u): O(|D| + m^3)
     • Table of N1(q, u)・u for Π = {aba, ababb, abca, bb}: O(m × m^2)
     • Rows are the factors u, columns are the states q:

     u \ q    0 1 2 3 4 5 6 7 8 9
     a        1 1 3 1 3 1 7 1 1 1
     b        8 2 9 4 5 9 8 2 9 9
     c        0 0 6 0 6 0 0 0 0 0
     ab       2 2 4 2 4 2 2 2 2 2
     ba       1 3 1 3 1 1 1 3 1 1
     bb       9 9 9 5 9 9 9 9 9 9
     bc       0 6 0 6 0 0 0 6 0 0
     ca       1 1 7 1 7 1 1 1 1 1
     aba      3 3 3 3 3 3 3 3 3 3
     abb      9 9 5 9 5 9 9 9 9 9
     abc      6 6 6 6 6 6 6 6 6 6
     bab      2 4 2 4 2 2 2 4 2 2
     bca      1 7 1 7 1 1 1 7 1 1
     abab     4 4 4 4 4 4 4 4 4 4
     abca     7 7 7 7 7 7 7 7 7 7
     babb     9 5 9 5 9 9 9 5 9 9
     ababb    5 5 5 5 5 5 5 5 5 5

  14. Generalized Suffix Trie for Π = {aba, ababb, abca, bb}: the trie has O(m^2) nodes, of which O(m) are explicit nodes; the rest are nonexplicit nodes.

  15. State Transition Function Next(q, u): O(|D| + m^3) improves to O(|D| + m^2)
     • Table of N1(q, u)・u: O(m × m)
     • The slide repeats the table from slide 13.

  16. Ancestor(q, k): the ancestor of node q at distance k in the trie of the AC machine. u : one of the explicit descendants of node u in the generalized suffix trie.

  17. Output Function: Output(q, u) = { 〈i, π〉 | 1 ≦ i ≦ |u|, π ∈ Π, and π is a suffix of the string q・u[1..i] }. Stored directly as a table, this costs O(m × |D|)!!!

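  18. Output Function: let ũ be the longest prefix of u such that ũ is a suffix of some pattern. The occurrences in Output(q, u) (π1, π2, π3 in the diagram) split into a part dependent on q and a part independent of q. Costs annotated on the slide: O(|D|) and O(m^2).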
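
The split into a q-dependent and a q-independent part can be checked directly against the definition of Output from the previous slide. The sketch below is a brute-force illustration with hypothetical function names. It represents a state q by the string it spells, writes ũ as `u_tilde`, and uses the slide-9 pattern set. An occurrence that lies entirely inside u does not depend on q, while one that starts inside q must end within ũ.

```python
PATTERNS = ["aba", "ababb", "abca", "bb"]


def output(q, u, patterns=PATTERNS):
    """Output(q, u) straight from the definition (q is the state's string)."""
    return {(i, p) for i in range(1, len(u) + 1) for p in patterns
            if (q + u[:i]).endswith(p)}


def u_tilde(u, patterns=PATTERNS):
    """Longest prefix of u that is a suffix of some pattern ("" if none)."""
    for k in range(len(u), 0, -1):
        if any(p.endswith(u[:k]) for p in patterns):
            return u[:k]
    return ""


def output_split(q, u, patterns=PATTERNS):
    """Return (part dependent on q, part independent of q)."""
    # Occurrences lying entirely inside u: no reference to q at all.
    inside = {(i, p) for i in range(1, len(u) + 1) for p in patterns
              if u[:i].endswith(p)}
    # Occurrences that start inside q: they can only end within u_tilde.
    limit = len(u_tilde(u, patterns))
    crossing = {(i, p) for i in range(1, limit + 1) for p in patterns
                if len(p) > i and (q + u[:i]).endswith(p)}
    return crossing, inside


crossing, inside = output_split("ab", "abb")
print(sorted(crossing), sorted(inside))
# → [(1, 'aba'), (3, 'ababb')] [(3, 'bb')]
```

The q-independent part can be attached to each dictionary string once, while the q-dependent part only ever involves suffixes of patterns.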

  19. But... is it really fast? Hmm...

  20. Experiment
     ◆ Method 1: decompress the compressed text to the original text, then run the AC machine on it.
     ◆ Method 2: decompress the compressed text and feed the decoded characters directly to the AC machine.
     ◆ Method 3: run our algorithm on the compressed text, without decompression.

  21. Experiment
     • Original text: "The Brown corpus", 6.8 Mbytes
     • Compressed text: 3.4 Mbytes, produced by compress (UNIX command)
     • Language: C++ (gcc without optimization)
     • Machine: Sun SPARCstation 20

  22. Result of the Experiment (plot): CPU time (s) against occurrence rate (%), that is, number of pattern occurrences / original text length, with one curve each for Method 1, Method 2, and Method 3 (our algorithm).

  23. Conclusion
     • Previous result
       • deals with only a single pattern
       • can find only the first occurrence of the pattern
       • takes O(n + m^2) time and space
       • no practical evaluation
     • Our result
       • deals with multiple patterns
       • can find all occurrences of the patterns
       • can answer in O(n + m^2 + r) time
       • takes O(n + m^2) space
       • about twice as fast as decompression followed by the AC machine
