A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems

Introducing a novel algorithm for fully compressed pattern matching to efficiently identify patterns in compressed text without decompression. Explore various dictionary-based compression methods and their applications in simple collage systems. Discover how the FCPM algorithm extends existing techniques and its significance in pattern matching tasks.

Presentation Transcript


  1. A Fully Compressed Pattern Matching Algorithm for Simple Collage Systems. Shunsuke Inenaga (Helsinki, Finland), Ayumi Shinohara (Kyushu, Japan), Masayuki Takeda (Kyushu, Japan)

  2. Fully Compressed Pattern Matching • Fully Compressed Pattern Matching • What is Fully Compressed Pattern Matching • Why FCPM? • Previous Studies on FCPM • Simple Collage Systems • Simple Straight Line Programs • FCPM Algorithm for SSLP • Further Work

  3. Classic Pattern Matching. pattern: compress. text: We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary based compressions. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method. Technically, our pattern matching algorithm extremely extends that for LZW compressed text presented by Amir, Benson and Farach [Amir94].

  4. Compressed Pattern Matching. Document Files: We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary based compressions. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in Compressed Document Files: aldoghqu3850pcxps;lafdjaeqw09bjzpafq05^@62:vzZIAPF’(90rwDEVcx0832nkvl;pzp99OPF:eDfja

  5. Compressed Pattern Matching. pattern: compress. compressed text: geoiy083qa0gj(#*gpfomo)#(JGWRE$(U)%ARY)(JPED(A%RJG)ER%U)JGODAAQWT$JGWRE)$RJ)REWJFDOPIJKSeoiy083qa0gj(#*gpfomo)#(JGWRE$(U)%ARY)(JPED(A%RJG)ER%U)JGODAAQWT$JGWRE)$

  6. Fully Compressed Pattern Matching • normal pattern matching algorithm: uncompressed pattern, uncompressed text • compressed pattern matching algorithm: uncompressed pattern, compressed text • fully compressed pattern matching algorithm: compressed pattern, compressed text

  7. Fully Compressed Pattern Matching. Instance: compressed text $\mathcal{T} = \mathrm{compress}(T)$ and compressed pattern $\mathcal{P} = \mathrm{compress}(P)$. Query: Find all occurrences of pattern $P$ in text $T$ without decompressing either $\mathcal{T}$ or $\mathcal{P}$. (Figure labels: compressed text, compressed pattern, decompress, matching. Sample plain text: "Pattern matching is a task to find all occurrences of pattern P in text T. It is one of the most fundamental problems in string processing. In the last decade, pattern matching on compressed objects has attracted more and more interests.")

  8. But Why FCPM? text: I’m here. pattern: where.jpg wally.jpg

  9. FCPM Algorithms (year: researchers, compression) • 1995: Karpinski, Rytter, Shinohara (SLP) • 1996: Gasieniec, Karpinski, Plandowski, Rytter (LZ77) • 1997: Karpinski, Rytter, Shinohara (SLP) • 1997: Miyazaki, Shinohara, Takeda (SLP) • 1999: Gasieniec, Rytter (LZW) • 1999: Karhumäki, Plandowski, Rytter (deterministic automata) • 2000: Hirao, Shinohara, Takeda, Arikawa (balanced SLP) • 2002: Rytter (LZ77) • 2004: Inenaga, Shinohara, Takeda (simple collage systems)

  10. Simple Collage Systems • Fully Compressed Pattern Matching • Simple Collage Systems • Collage Systems – Unifying Framework • LZW & LZ78 in Collage Systems • Hierarchy of Collage Systems • Our Result on Simple Collage Systems • Simple Straight Line Programs • FCPM Algorithm for Simple Collage Systems • Further Work

  11. Collage Systems: a unifying framework for compressed pattern matching. T. Kida et al. (1999), SPIRE 1999 • Previous: a specific algorithm for each format (one for LZ77 compressed text, one for LZ78 compressed text, one for LZW compressed text) • Collage System: LZ77, LZ78 and LZW compressed texts are all handled by one compressed pattern matching algorithm for collage systems

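  12. Definition of Collage Systems. Collage System $\langle D, S \rangle$ • $D$: sequence of assignments $X_1 = \mathit{expr}_1;\ X_2 = \mathit{expr}_2;\ \ldots;\ X_h = \mathit{expr}_h$, where $X_k$ is a variable and $\mathit{expr}_k$ is one of: $a$ where $a \in \Sigma \cup \{\varepsilon\}$ (primitive assignment); $X_i X_j$ where $i, j < k$ (concatenation); $(X_i)^j$ where $i < k$, $j \in \mathbb{N}^+$ (repetition); ${}^{[j]}X_i$ where $i < k$, $j \in \mathbb{N}^+$ (prefix truncation); $X_i^{[j]}$ where $i < k$, $j \in \mathbb{N}^+$ (suffix truncation) • $S$: sequence $X_{i_1} X_{i_2} \cdots X_{i_s}$ of variables in $D$ • $\|D\| = h$, $|S| = s$, $|\langle D, S \rangle| = \|D\| + |S| = h + s$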

  13. Example of Collage System. D: X1 = a; X2 = b; X3 = X1 X2 (a b); X4 = X1 X3 (a a b); X5 = X3 X1 (a b a); X6 = X2 X2 (b b). S: X3 X4 X5 X6 X5. ||D|| = 6, |S| = 5. Text: a b a a b a b a b b a b a
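The example on this slide can be checked mechanically. The sketch below is only an illustration: the dictionary encoding (a string for a primitive assignment, a pair for a concatenation) and the helper names `expand` and `text_of` are assumptions, not part of the slides.

```python
from functools import lru_cache

# D from the slide: a string is a primitive assignment,
# a pair (left, right) is a concatenation of two earlier variables.
D = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X2"),
    "X4": ("X1", "X3"),
    "X5": ("X3", "X1"),
    "X6": ("X2", "X2"),
}
S = ["X3", "X4", "X5", "X6", "X5"]


@lru_cache(maxsize=None)
def expand(var: str) -> str:
    """Return the string derived from one variable of D."""
    rule = D[var]
    if isinstance(rule, str):
        return rule
    left, right = rule
    return expand(left) + expand(right)


def text_of(sequence) -> str:
    """Concatenate the expansions of the variables listed in S."""
    return "".join(expand(v) for v in sequence)


print(text_of(S))            # abaabababbaba
print(len(D) + len(S))       # size of the collage system: 6 + 5 = 11
```

Explicit expansion is used here only to confirm the example; the point of FCPM is to avoid it.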

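  14. LZW & LZ78 in Collage Systems • LZW: $S$: $X_{i_1} \cdots X_{i_s}$; $D$: $X_1 = a_1;\ \ldots;\ X_q = a_q;\ X_{q+1} = X_{i_1} X_{\sigma(i_2)};\ \ldots;\ X_{q+s-1} = X_{i_{s-1}} X_{\sigma(i_s)}$, where $\Sigma = \{a_1, a_2, \ldots, a_q\}$, $1 \le i_1 \le q$, and $\sigma(j)$ denotes the integer $k$ ($1 \le k \le q$) s.t. $a_k$ is the first symbol of $X_j$ • LZ78: $S$: $X_1 \cdots X_s$; $D$: $X_0 = \varepsilon;\ X_1 = X_{i_1} b_1;\ \ldots;\ X_s = X_{i_s} b_s$, where $b_j$ is a symbol in $\Sigma$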
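As a rough illustration of the LZ78 case, the sketch below parses a plain string into phrases and records each phrase as a rule of the form X_k = X_{i_k} b_k with X_0 the empty string. The function names and the tuple encoding are assumptions made for this example. One deviation from the slide is also an assumption: if the input ends inside an already known phrase, that phrase's variable is simply appended to S.

```python
def lz78_collage(text):
    """Return (D, S) where D[k] = (i_k, b_k) encodes X_k = X_{i_k} b_k."""
    D = {0: None}          # X0 = empty string
    index_of = {"": 0}     # phrase -> variable index
    S = []
    current = ""
    for symbol in text:
        if current + symbol in index_of:
            current += symbol              # extend the longest known phrase
            continue
        k = len(D)
        D[k] = (index_of[current], symbol)  # new phrase = known phrase + one symbol
        index_of[current + symbol] = k
        S.append(k)
        current = ""
    if current:                             # leftover that is already a phrase
        S.append(index_of[current])
    return D, S


def expand(D, k):
    """Follow the chain X_k -> X_{i_k} -> ... -> X_0, collecting symbols."""
    symbols = []
    while D[k] is not None:
        k, symbol = D[k]
        symbols.append(symbol)
    return "".join(reversed(symbols))


D, S = lz78_collage("abaabababbaba")
print(S)                                    # [1, 2, 3, 4, 5, 6]
print([expand(D, k) for k in S])            # ['a', 'b', 'aa', 'ba', 'bab', 'baba']
```

Every rule appends a single symbol on the right, which is why LZ78 output falls into the simple collage system class.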

  15. Hierarchy of Collage Systems • Simple Collage System: concatenation $X = X_l X_r$ with $|X_l| = 1$ or $|X_r| = 1$ (LZW, LZ78) • Regular Collage System: concatenation $X = X_l X_r$ (SLP, Sequitur, Re-Pair, MPM, BPE) • Collage System: concatenation $X = X_l X_r$, repetition $X = (X_i)^j$, truncation $X = {}^{[j]}X_i$ and $X = X_i^{[j]}$ (Run-length, LZSS, LZ77)

  16. Our Result. Given two simple collage systems $\langle D, S \rangle$ and $\langle D', S' \rangle$ that are descriptions of text $T$ and pattern $P$, respectively, $\mathit{Occ}(T, P)$ can be computed in $O(\|D\|^2 + mn \log |S|)$ time using $O(\|D\|^2 + mn)$ space • $\mathit{Occ}(T, P)$: set of all occurrences of $P$ in $T$ • $n = \|D\| + |S|$ (the size of $\langle D, S \rangle$) • $m = \|D'\| + |S'|$ (the size of $\langle D', S' \rangle$) • Previous best result (Miyazaki et al., 1997): $O(m^2 n^2)$ time, $O(mn)$ space

  17. Gasieniec & Rytter’s Work • FCPM Algorithm for LZW • Claims $O((m+n)\log(m+n))$ time • Actually decompresses a prefix of P when it is short (has length at most $2n$) • Doesn’t really suit our problem setting • Our challenge: FCPM for SCS (including LZW) without any explicit decompression

  18. Simple Straight Line Programs • Fully Compressed Pattern Matching • Simple Collage Systems • Simple Straight Line Programs • Straight Line Programs (SLPs) • Translation of SCS’s to SSLPs • Variables of SSLPs • FCPM Algorithm for Simple Collage Systems • Further Work

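  19. Straight Line Program. SLP $\mathcal{T}$: sequence of assignments $X_1 = \mathit{expr}_1;\ X_2 = \mathit{expr}_2;\ \ldots;\ X_n = \mathit{expr}_n$, where $X_k$ is a variable and $\mathit{expr}_k$ is either $a$ where $a \in \Sigma \cup \{\varepsilon\}$, or $X_i X_j$ where $i, j < k$ • SLP $\mathcal{T}$ is a CFG in Chomsky normal form. • SLP $\mathcal{T}$ for string $w$ is a CFG s.t. $L(\mathcal{T}) = \{w\}$.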

  20. Translation of SCS into SLP. Simple Collage System: D: X1 = a; X2 = b; X3 = X1 X2 (a b); X4 = X1 X3 (a a b); X5 = X3 X1 (a b a); X6 = X2 X2 (b b); S: X3 X4 X5 X6 X5. Simple Straight Line Program: X1 = a; X2 = b; X3 = X1 X2; X4 = X1 X3; X5 = X3 X1; X6 = X2 X2; X7 = X3 X4; X8 = X7 X5; X9 = X8 X6; X10 = X9 X5.
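The translation shown on this slide appears to fold the sequence S from left to right into new concatenation rules. A minimal sketch under that reading follows; the function name `to_sslp`, the tuple encoding, and the assumption that variables are named X1..Xh in order are all illustrative choices.

```python
def to_sslp(D, S):
    """Append one concatenation rule per extra element of S, folding left to right."""
    sslp = dict(D)                  # rules of D are kept unchanged
    next_index = len(sslp) + 1      # assumes D uses the names X1..Xh
    root = S[0]
    for var in S[1:]:
        name = f"X{next_index}"
        sslp[name] = (root, var)    # e.g. X7 = X3 X4, then X8 = X7 X5, ...
        root = name
        next_index += 1
    return sslp, root


def expand(slp, var):
    """Explicit expansion, used here only to check the translation."""
    rule = slp[var]
    if isinstance(rule, str):
        return rule
    return expand(slp, rule[0]) + expand(slp, rule[1])


D = {
    "X1": "a", "X2": "b",
    "X3": ("X1", "X2"), "X4": ("X1", "X3"),
    "X5": ("X3", "X1"), "X6": ("X2", "X2"),
}
S = ["X3", "X4", "X5", "X6", "X5"]

sslp, root = to_sslp(D, S)
print(root, sslp["X7"], sslp["X10"])   # X10 ('X3', 'X4') ('X9', 'X5')
print(expand(sslp, root))              # abaabababbaba
```

The translation adds |S| - 1 rules, so the resulting program has ||D|| + |S| - 1 variables.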

  21. Derivation Tree of SSLP. Simple Straight Line Program: X1 = a; X2 = b; X3 = X1 X2; X4 = X1 X3; X5 = X3 X1; X6 = X2 X2; X7 = X3 X4; X8 = X7 X5; X9 = X8 X6; X10 = X9 X5. (Derivation tree rooted at X10, with leaves a b a a b a b a b b a b a.) X (= Xl Xr) is left simple if |Xl| = 1.

  22. Derivation Tree of SSLP (same program and derivation tree as on slide 21). X (= Xl Xr) is right simple if |Xr| = 1.

  23. Derivation Tree of SSLP (same program and derivation tree as on slide 21). X (= Xl Xr) is complex otherwise.
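The three variable kinds can be read off from the lengths of the derived strings, which are computable bottom-up without expanding anything. The sketch below assumes the SSLP of slide 21 with X3 = X1 X2 (as in the collage system it was translated from). The tie-break of calling a variable "left simple" when both sides have length 1 is a choice made for this example only.

```python
SSLP = {
    "X1": "a", "X2": "b",
    "X3": ("X1", "X2"), "X4": ("X1", "X3"),
    "X5": ("X3", "X1"), "X6": ("X2", "X2"),
    "X7": ("X3", "X4"), "X8": ("X7", "X5"),
    "X9": ("X8", "X6"), "X10": ("X9", "X5"),
}


def lengths(slp):
    """Length of each derived string; rules only refer to earlier variables."""
    length = {}
    for var, rule in slp.items():
        if isinstance(rule, str):
            length[var] = len(rule)
        else:
            length[var] = length[rule[0]] + length[rule[1]]
    return length


def classify(slp):
    """Label every variable as primitive, left simple, right simple or complex."""
    length = lengths(slp)
    kinds = {}
    for var, rule in slp.items():
        if isinstance(rule, str):
            kinds[var] = "primitive"
        elif length[rule[0]] == 1:
            kinds[var] = "left simple"    # |Xl| = 1 (checked first: a tie-break assumption)
        elif length[rule[1]] == 1:
            kinds[var] = "right simple"   # |Xr| = 1
        else:
            kinds[var] = "complex"
    return kinds


for var, kind in classify(SSLP).items():
    print(var, kind)
```

In this example the complex variables are exactly X7 to X10, the ones introduced by translating the sequence S.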

  24. FCPM Algorithm for Simple Collage Systems • Fully Compressed Pattern Matching • Simple Collage Systems • Simple Straight Line Programs • FCPM Algorithm for Simple Collage Systems • Reduction of Computing Occ(X,Y) • Efficient Computation by Dynamic Programming • Further Work

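  25. $\mathit{Occ}^{\triangle}(X, Y)$. $X$: variable of $\mathcal{T}$, with $X = X_l X_r$; $Y$: variable of $\mathcal{P}$. $\mathit{Occ}^{\triangle}(X, Y) = \{\, i \in \mathit{Occ}(X, Y) \mid |X_l| - |Y| \le i \le |X_l| \,\}$: the set of occurrences of $Y$ that cover or touch the boundary of $X_l$ and $X_r$.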

  26. Property of $\mathit{Occ}^{\triangle}(X, Y)$: $\mathit{Occ}^{\triangle}(X, Y)$ forms a single arithmetic progression, so it can be stored in $O(1)$ space.
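The arithmetic progression property can be observed on plain strings. The sketch below uses explicit strings and 0-based starting positions, both of which are assumptions made for illustration; the helper names are hypothetical. It collects the occurrences of a pattern that cover or touch a boundary and stores them as a (first, step, count) triple.

```python
def occ(text, pattern):
    """All 0-based starting positions of pattern in text."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def boundary_occ(xl, xr, y):
    """Occurrences of y in xl + xr that cover or touch the boundary."""
    low, high = len(xl) - len(y), len(xl)
    return [i for i in occ(xl + xr, y) if low <= i <= high]


def is_arithmetic_progression(values):
    """True if all consecutive gaps in the sorted list are equal."""
    gaps = {b - a for a, b in zip(values, values[1:])}
    return len(gaps) <= 1


def as_triple(values):
    """Constant-size encoding (first, step, count) of a progression."""
    if not values:
        return None
    step = values[1] - values[0] if len(values) > 1 else 0
    return (values[0], step, len(values))


positions = boundary_occ("aaa", "aa", "aa")
print(positions, as_triple(positions))   # [1, 2, 3] (1, 1, 3)
```

The triple is what makes one table entry fit in constant space, however many boundary occurrences there are.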

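  27. Text Decomposition. $\mathit{Occ}(X, Y) = \mathit{Occ}(X_l, Y) \cup \mathit{Occ}^{\triangle}(X, Y) \cup (\mathit{Occ}(X_r, Y) \oplus |X_l|)$, where $U \oplus k$ adds $k$ to every element of $U$. Computing $\mathit{Occ}(X, Y)$ is reduced to computing $\mathit{Occ}^{\triangle}(X, Y)$.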
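The text decomposition can be turned directly into a recursion over the variables of the SSLP from slide 21 (with X3 = X1 X2). The sketch below is not the FCPM algorithm: it expands strings explicitly to find the boundary occurrences and takes the pattern as a plain string, both simplifications assumed here to keep the example short. Positions are 0-based.

```python
from functools import lru_cache

SSLP = {
    "X1": "a", "X2": "b",
    "X3": ("X1", "X2"), "X4": ("X1", "X3"),
    "X5": ("X3", "X1"), "X6": ("X2", "X2"),
    "X7": ("X3", "X4"), "X8": ("X7", "X5"),
    "X9": ("X8", "X6"), "X10": ("X9", "X5"),
}


@lru_cache(maxsize=None)
def expand(var):
    rule = SSLP[var]
    return rule if isinstance(rule, str) else expand(rule[0]) + expand(rule[1])


def occ(text, pattern):
    return {i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)}


def boundary_occ(xl, xr, y):
    """Search only the window of at most |y| symbols on each side of the boundary."""
    start = max(0, len(xl) - len(y))
    window = xl[start:] + xr[:len(y)]
    return {start + i for i in occ(window, y)}


def occ_by_decomposition(var, y):
    """Occ(X, Y) = Occ(Xl, Y) u boundary occurrences u (Occ(Xr, Y) shifted by |Xl|)."""
    rule = SSLP[var]
    if isinstance(rule, str):
        return {0} if rule == y else set()
    left, right = rule
    xl, xr = expand(left), expand(right)
    shifted = {i + len(xl) for i in occ_by_decomposition(right, y)}
    return occ_by_decomposition(left, y) | boundary_occ(xl, xr, y) | shifted


print(sorted(occ_by_decomposition("X10", "aba")))   # [0, 3, 5, 10]
```

Only the boundary term needs real work at each variable, which is why the slides concentrate on it.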

  28. Dynamic Programming for $\mathit{Occ}^{\triangle}(X, Y)$. Table with rows $X_1, X_2, \ldots, X_i, \ldots, X_n$ and columns $Y_1, \ldots, Y_j, \ldots, Y_m$; the entry for row $X_i$ and column $Y_j$ is $\mathit{Occ}^{\triangle}(X_i, Y_j)$. The table yields $\mathit{Occ}(T, P)$.

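  29. Pattern Decomposition. For $X = X_l X_r$ and $Y = Y_l Y_r$: $\mathit{Occ}^{\triangle}(X, Y) = \mathit{Occ}_l(X, Y) \cup \mathit{Occ}_r(X, Y)$, where $\mathit{Occ}_l(X, Y) = \mathit{Occ}^{\triangle}(X, Y_l) \cap ((\mathit{Occ}(X_r, Y_r) \oplus |X_l|) \ominus |Y_l|)$ and $\mathit{Occ}_r(X, Y) = \mathit{Occ}(X_l, Y_l) \cap (\mathit{Occ}^{\triangle}(X, Y_r) \ominus |Y_l|)$. Here $U \oplus k$ adds $k$ to every element of $U$ and $U \ominus k$ subtracts it.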
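The pattern decomposition can be sanity-checked on plain strings. The sketch below assumes 0-based starting positions and explicit strings for Xl, Xr, Yl, Yr; all helper names are hypothetical. It computes the two parts of the union separately and compares the result with a direct search.

```python
def occ(text, pattern):
    """All 0-based starting positions of pattern in text."""
    return {i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)}


def boundary_occ(xl, xr, y):
    """Occurrences of y in xl + xr that cover or touch the boundary."""
    low, high = len(xl) - len(y), len(xl)
    return {i for i in occ(xl + xr, y) if low <= i <= high}


def occ_left_part(xl, xr, yl, yr):
    """yl covers or touches the boundary and yr follows inside xr."""
    shifted = {i + len(xl) - len(yl) for i in occ(xr, yr)}
    return boundary_occ(xl, xr, yl) & shifted


def occ_right_part(xl, xr, yl, yr):
    """yl lies inside xl and yr covers or touches the boundary."""
    shifted = {i - len(yl) for i in boundary_occ(xl, xr, yr)}
    return occ(xl, yl) & shifted


def boundary_occ_by_pattern_split(xl, xr, yl, yr):
    return occ_left_part(xl, xr, yl, yr) | occ_right_part(xl, xr, yl, yr)


xl, xr, yl, yr = "abaab", "ababb", "ab", "a"
print(sorted(boundary_occ_by_pattern_split(xl, xr, yl, yr)))
print(sorted(boundary_occ(xl, xr, yl + yr)))   # same set, found directly
```

Each part is an intersection of sets that are already available for smaller variables, which is what lets the table be filled bottom-up.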

  30. Occl(X,Y) X Xl Xr Yl Yr

  31. Occl (X,Y) X Xl Xr Yl Yr

  32. Occl(X, Y) = Occ (X, Yl)(Occ (Xr, Yr, k) offset) Occl (X,Y) ? X O(1)time Xl Xr Yl Yr single arithmetic progression!! k

  33. Simple Variable Case. X: any simple variable of T; Y: any variable of P. Occ (X, Y, k) can be computed in $O(1)$ time with extra $O(h^2 + mh)$ work time and space. h: the number of simple variables in T; m: the size of P.

  34. Complex Variable Case. X: any complex variable of T; Y: any variable of P. Occ (X, Y, k) can be computed in $O(\log s)$ time with extra $O(ms)$ work time and space. s: the number of complex variables in T; m: the size of P.

  35. Final Result. Table of $\mathit{Occ}^{\triangle}(X_i, Y_j)$ with $n$ rows ($X_1, \ldots, X_n$) and $m$ columns ($Y_1, \ldots, Y_m$) • Each entry: $O(\log s)$ time, $O(1)$ space • Whole table: $O(mn \log s)$ time, $O(mn)$ space • Extra work time and space: $O(h^2 + mh) + O(ms) = O(h^2 + mn)$ • Total: $O(h^2 + mn \log s) = O(\|D\|^2 + mn \log |S|)$ time, $O(h^2 + mn) = O(\|D\|^2 + mn)$ space

  36. Further Work • Fully Compressed Pattern Matching • Simple Collage Systems • Simple Straight Line Programs • FCPM Algorithm for Simple Collage Systems • Further Work • Multilevel Pattern Matching Code • Good Features of MPM Code • Faster FCPM Algorithm for MPM Code

  37. Further Work: Multilevel Pattern Matching Code (Kieffer & Yang, 2000). MPM Code: X1 = a; X2 = b; X3 = X1 X2; X4 = X1 X1; X5 = X2 X1; X6 = X2 X2; X7 = X3 X4; X8 = X5 X5; X9 = X6 X3; X10 = X7 X8; X11 = X9 X2; X12 = X10 X11. (Derivation tree rooted at X12, with leaves a b a a b a b a b b a b a.)

  38. Features of MPM Code • Subclass of SLP • Runs in linear time • Exponentially small representation • Hierarchical structure

  39. FCPM on MPM Code • Previous Best Result (Miyazaki et al. 1997): $O(m^2 n^2)$ time, $O(mn)$ space • Our New Result: $O(mn^2)$ time, $O(mn)$ space (Submitted to DLT’04)
