Pattern Matching in the streaming model

Pattern Matching in the streaming model. Ely Porat, Google Inc. & Bar-Ilan University. Problem definition (pattern matching): given a text T of length n and a pattern P of length m, find all substrings of T that are equal to P. Problem definition (online pattern matching): the text arrives character by character.

Presentation Transcript


  1. Pattern Matching in the streaming model. Ely Porat, Google Inc. & Bar-Ilan University

  2. Problem definition - Pattern Matching. Given a text T of length n and a pattern P of length m, the problem is to find all substrings of T that are equal to P.

  3. Problem definition - Online Pattern Matching • We get the text T character by character.

  4. Motivation… • Stock market

  5. Motivation… • Espionage: the rest we monitor

  6. Motivation… • Viruses and malware. Software solutions: Snort 73.5Mb, ClamAV 1.48Gb. Using TCAMs: Snort 680Kb, ClamAV 25Mb. Our solution (software): Snort 51Kb, ClamAV 216Kb.

  7. Motivation… • Monitoring internet traffic

  8. Streaming model [figure label: 250 BPS] • We can't store the whole input. • In our case we seek algorithms that require poly(log m) space.

  9. Related work • Karp-Rabin: Randomized Algorithm for exact pattern matching • Clifford, Porat, and Porat: A black box algorithm for online approximate pattern matching • Almost any pattern matching algorithm can be converted to run online.

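  10. Karp-Rabin Algorithm. Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$, with signature $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$, where $r$ is chosen at random. Text: $t_0\, t_1\, t_2 \ldots t_i\, t_{i+1} \ldots t_{i+m-1}\, t_{i+m} \ldots t_n$. Window signatures: $S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \cdots + t_{i+m-1} \bmod q$ and $S_{i+1} = t_{i+1} r^{m-1} + \cdots + t_{i+m-1} r + t_{i+m} \bmod q$, so $S_{i+1} = S_i r + t_{i+m} - t_i r^m \bmod q$. Requires O(m) memory.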
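
A minimal Python sketch of the rolling signature on this slide may help. The function name `karp_rabin_matches`, the choice of the prime q = 2^61 - 1 and the use of `ord` to turn characters into numbers are assumptions for illustration, not taken from the slides. The text is indexed directly, which corresponds to the O(m) memory noted above, since the character leaving the window must still be available.

```python
import random

def karp_rabin_matches(text, pattern, q=(1 << 61) - 1, r=None):
    """Return start indices whose window signature equals the pattern's.

    Randomized: a reported index can be a false positive with tiny probability.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    if r is None:
        r = random.randrange(2, q - 1)   # r is chosen at random, as on the slide
    r_m = pow(r, m, q)                   # r^m mod q, used to drop t_i

    sig_p = 0                            # p_0 r^(m-1) + ... + p_(m-1) mod q
    s = 0                                # S_0, signature of the first window
    for k in range(m):
        sig_p = (sig_p * r + ord(pattern[k])) % q
        s = (s * r + ord(text[k])) % q

    hits = []
    for i in range(n - m + 1):
        if s == sig_p:
            hits.append(i)
        if i + m < n:
            # S_(i+1) = S_i * r + t_(i+m) - t_i * r^m  (mod q)
            s = (s * r + ord(text[i + m]) - ord(text[i]) * r_m) % q
    return hits
```

For example, `karp_rabin_matches("abcabcab", "abc")` scans the six windows of length 3 and reports the two where the signatures agree.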

  11. The idea - Simple case. The pattern starts with z, and there are no more z's in the pattern: P = z … (signature). In the text T, start signing at every z and compare against the pattern's signature.
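
The simple case can be sketched in Python as follows. Because the pattern contains no second z, seeing a z in the text cancels any attempt in progress, so a single running signature is enough. The name `stream_match_unique_first` and the fixed values of q and r are assumptions chosen for reproducibility (the slides choose r at random).

```python
def stream_match_unique_first(stream, pattern, q=(1 << 61) - 1, r=1_000_003):
    """Yield start positions of pattern in stream, one character at a time.

    Assumes pattern[0] does not occur again in the pattern, so at most one
    signing attempt is active at any time (constant working memory).
    """
    z, m = pattern[0], len(pattern)
    assert z not in pattern[1:], "sketch only covers the simple case"

    sig_p = 0
    for c in pattern:
        sig_p = (sig_p * r + ord(c)) % q

    sig, length = 0, 0          # length == 0 means no attempt in progress
    for pos, c in enumerate(stream):
        if c == z:
            # Start signing here; any earlier attempt cannot match anymore.
            sig, length = ord(c) % q, 1
        elif length:
            sig = (sig * r + ord(c)) % q
            length += 1
        if length == m:
            if sig == sig_p:
                yield pos - m + 1
            length = 0
```

Calling `list(stream_match_unique_first("xzabzabzab", "zab"))` consumes the text once and keeps only the current signature and its length.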

  12. Case 1. There is a prefix U, of length ≤ m/2, such that U appears only once in the pattern: P = U … (signature). Seek U in recursion; at every U found in the text T, start signing.

  13. Case 2: No small U. Look at the first m/2 characters: they appear again somewhere in the pattern. Option 1: P = v v v v v v v v followed by a prefix of v. Option 2: P = v v v v w, where w isn't a prefix of v, v isn't a prefix of w, and |v| ≤ m/2.

  14. Solving case 2, option 2: P = v v v v w, |v| ≤ m/2. Sign on w. Search in recursion for v, and count how many times you found it. In the text T, start signing at the occurrences of v.

  15. Solving case 2 - continued. Option 2: P = v v v v w, |v| ≤ m/2. Sign on w. Search in recursion for v, and count how many times you found it. [Figure: runs of v in the text T, with gaps marked <m/2 and >m/2, and the points where signing starts.] Uses O(log m) signatures and counters in the worst case.

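  16. Karp-Rabin Algorithm. Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$, with signature $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$, where $r$ is chosen at random. Text: $t_0\, t_1\, t_2 \ldots t_i\, t_{i+1} \ldots t_{i+m-1}\, t_{i+m} \ldots t_n$. Window signatures: $S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \cdots + t_{i+m-1} \bmod q$ and $S_{i+1} = t_{i+1} r^{m-1} + \cdots + t_{i+m-1} r + t_{i+m} \bmod q$, so $S_{i+1} = S_i r + t_{i+m} - t_i r^m \bmod q$.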

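  17. Rothschild signature 07. Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$. Karp-Rabin uses $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$; here the powers run forward: $p_0 + p_1 r + p_2 r^2 + \cdots + p_{m-1} r^{m-1} \bmod q$. Text: $t_0\, t_1\, t_2\, t_3 \ldots t_i$.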

  18. Forward signatures. There is a prefix U, of length ≤ m/2, such that U appears only once in the pattern. Seek U in recursion. At position i of the text T, calculate $X = S_i + \mathrm{Sig} \cdot r^{i+1} \bmod q$ and remember X for this position; later, check whether the text signature is equal to X.
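
The identity behind X can be sketched in Python. With forward prefix signatures $S_k = t_0 + t_1 r + \cdots + t_k r^k \bmod q$, an occurrence of the pattern starting right after position i forces $S_{i+m} = S_i + \mathrm{Sig} \cdot r^{i+1}$. In this sketch every text position is treated as a candidate, so up to m predictions are pending at once; the recursive search for U that keeps the number of candidates small is not reproduced. The function name and the fixed q and r are assumptions for illustration.

```python
def forward_signature_matches(text, pattern, q=(1 << 61) - 1, r=1_000_003):
    """Report pattern starts by predicting the future prefix signature X."""
    m = len(pattern)

    # Forward signature of the pattern: p_0 + p_1 r + ... + p_(m-1) r^(m-1).
    sig_p, power = 0, 1
    for c in pattern:
        sig_p = (sig_p + ord(c) * power) % q
        power = power * r % q

    pending = {}        # end position -> predicted prefix signature X
    s, r_pow = 0, 1     # prefix signature before t_i, and r^i
    hits = []
    for i, c in enumerate(text):
        # If the pattern starts at i, the prefix signature at i + m - 1
        # must equal (signature before i) + sig_p * r^i.
        pending[i + m - 1] = (s + sig_p * r_pow) % q
        s = (s + ord(c) * r_pow) % q
        r_pow = r_pow * r % q
        x = pending.pop(i, None)
        if x is not None and x == s:
            hits.append(i - m + 1)
    return hits
```

Unlike the rolling window of Karp-Rabin, nothing is ever subtracted here, so no old text character needs to be kept; only the remembered values X are stored.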

  19. Example: q=7, r=3. P: 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1 (shown with the value 5); prefix 0, 1, 1, 0, 1, 1, 1 (shown with the value 1); prefix 0, 1, 1 (shown with the value 4). [Figure: the powers r^i and the level 1, level 2 and level 3 signatures computed along the text T.]

  20. Worst case - time. Text: $t_0\, t_1\, t_2\, t_3 \ldots t_i$, with remembered values $X_1, X_2, \ldots, X_{\log m}$. Amortized O(1), but what about the worst case? What if $X_1 = X_2 = \cdots = X_{\log m}$? Check using a hash table. We can work with a lazy approach without a blowup in memory. Time: O(1).

  21. Average / random / smooth case. P has length m; the levels have lengths $m$, $\log_\Sigma m$, $\log_\Sigma \log_\Sigma m$, … Total number of iterations is $O(\log^*_\Sigma m)$.

  22. Worst case. P has length m; the levels have lengths m, m/2, m/4, … Total number of iterations is O(log m), which gives O(log m log δ) space.

  23. Multi-Pattern search (dictionary matching) • Given a set of patterns $D = \{P_1, P_2, P_3, \ldots, P_d\}$ • The patterns can be of different lengths • We want to report whenever one of the patterns appears. • Our algorithm requires $O(\sum_{i=1}^{d} \log |P_i|)$ memory and O(log d) time per text character.

  24. Multi-Pattern search (dictionary matching) • Denote $M = \max_i |P_i|$ • Our algorithm has 2 cases: • Case 1: d>M • Case 2: d<M

  25. Case 1: d>M • In this case we can allocate an array of size M+1. Text: $t_0\, t_1\, t_2\, t_3 \ldots t_{l-M}\, t_{l-M+1} \ldots t_l$; window: $S_{l-M}, S_{l-M+1}, \ldots, S_l$. It is easy to maintain such a sliding window in O(1) time and O(M) memory.

  26. Case 1: d>M - continued
      For each P_i in D (P_i = a_0 a_1 a_2 … a_{m_i-1}):
          e = m_i
          while e != 0:
              find j s.t. 2^j <= e and 2^{j+1} > e
              e = e - 2^j
              if e != 0: HashTable(Sig(a_e a_{e+1} … a_{m_i-1}))
          HashTable(Sig(a_0 a_1 … a_{m_i-1}), match_i)
      Example: P_i = a_0 a_1 a_2 … a_38. We will store in the hash table: Sig(a_7 a_8 … a_38), Sig(a_3 a_4 … a_38), Sig(a_1 a_2 … a_38), and Sig(a_0 a_1 … a_38) with match_i. We will store at most log |P_i| points.

  27. Case 1: d>M - continued. Levels: $2^i$, $2^i + 2^j$, $2^i + 2^j + 2^l$, … At most $\log |P_i|$ levels.

  28. Case 1: d>M • In this case we can allocate an array of size M+1. Text: $t_0\, t_1\, t_2\, t_3 \ldots t_{l-M}\, t_{l-M+1} \ldots t_l$; window: $S_{l-M}, S_{l-M+1}, \ldots, S_l$. Notice that it takes O(1) time to calculate $\mathrm{Sig}(t_i t_{i+1} \ldots t_l)$.
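
One way to get the O(1) substring signature is sketched below in Python, assuming forward prefix signatures $S_k = t_0 + t_1 r + \cdots + t_k r^k \bmod q$ and a prime q, so that $\mathrm{Sig}(t_i \ldots t_l) = (S_l - S_{i-1}) \cdot r^{-i} \bmod q$. The class name `PrefixWindow`, the ring buffer layout and the stored inverse powers are assumptions, not details given on the slides. `pow(r, -1, q)` needs Python 3.8 or later.

```python
def forward_sig(s, q=(1 << 61) - 1, r=1_000_003):
    """Forward signature s_0 + s_1 r + s_2 r^2 + ... mod q (reference helper)."""
    h, power = 0, 1
    for c in s:
        h = (h + ord(c) * power) % q
        power = power * r % q
    return h

class PrefixWindow:
    """Ring buffer of the last M+1 prefix signatures, with inverse powers of r."""

    def __init__(self, M, q=(1 << 61) - 1, r=1_000_003):
        self.q, self.r = q, r
        self.r_inv = pow(r, -1, q)          # modular inverse of r
        self.size = M + 1
        # Slot c % size holds (prefix signature after c chars, r^(-c)).
        self.buf = [(0, 1)] * self.size
        self.count = 0                      # characters read so far
        self.r_pow = 1                      # r^count

    def push(self, ch):
        s_prev, inv = self.buf[self.count % self.size]
        s = (s_prev + ord(ch) * self.r_pow) % self.q
        self.r_pow = self.r_pow * self.r % self.q
        self.count += 1
        self.buf[self.count % self.size] = (s, inv * self.r_inv % self.q)

    def sig_last(self, length):
        """Forward signature of the last `length` characters, in O(1)."""
        assert 1 <= length <= min(self.count, self.size - 1)
        s_l, _ = self.buf[self.count % self.size]
        s_before, inv = self.buf[(self.count - length) % self.size]
        return (s_l - s_before) * inv % self.q
```

After pushing the characters of a text, `sig_last(k)` agrees with `forward_sig` of the last k characters, so it can be looked up directly in a table of pattern signatures built with the same q and r.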

  29. Case 1: d>M - continued. We do a binary search over the sliding window $S_{l-M}, S_{l-M+1}, \ldots, S_l$, probing positions such as $l-2^{j-1}$, $l-2^j$ and $l-2^{j-1}-2^{j-2}$. At each probe we ask "Is it in the HashTable?" and branch on Yes / No.

  30. Case 2: d<M • In this case we split our dictionary D into 2 dictionaries: • D1: all the patterns shorter than d. On this dictionary we run case 1. • D2: all the patterns longer than d. We need only to deal with this case.

  31. Case 2: d<M - continued. For each P_i in D2: $P_i = a_0 a_1 a_2 \ldots a_{d-1} a_d \ldots a_m$ and $SP_i = \mathrm{Sig}(a_0 a_1 \ldots a_{d-1})$. Store $SP_i$ in the hash table.

  32. Case 2: d<M - continued. If P_i contains a periodic prefix of length more than d, e.g. P_i = u u u u u u v … a_m, then SP_i is seen at each repetition of u (w.h.p. the rest won't be SP_i). We store as well the number of times we need to see SP_i. We start a process which seeks P_i only after seeing enough SP_i. Therefore the number of characters we have to see between 2 processes of P_i is at least d.

  33. Case 2: d<M - continued • We run the algorithm from the beginning of the lecture. • Amortized, it takes O(1/d) per pattern per text character. • Overall it takes O(1) amortized time per text character. • By the lazy approach we get O(1) time in the worst case.

  34. Open problems • Multi-pattern search case 2 takes O(1) time, however case 1 takes O(log d) • Improve case 1 to be O(1) • With heuristics almost all dictionaries take O(1) time, and O(1) space per pattern. • Lower bound • We believe that the single pattern search lower bound is Ω(log m log δ) • Supporting wildcards & mismatches
