Pattern Matching in the Streaming Model. Ely Porat, Google Inc. & Bar-Ilan University
Problem definition - Pattern Matching. Given a text T of length n and a pattern P of length m, find all substrings of T that are equal to P.
Problem definition - Online Pattern Matching • We receive the text T character by character, while the pattern P is known in advance.
Motivation… • Stock market
Motivation… • Espionage (figure caption: "The rest we monitor")
Motivation… • Viruses and malware
Approach                  Snort     ClamAV
Software solutions        73.5Mb    1.48Gb
Using TCAMs               680Kb     25Mb
Our solution (software)   51Kb      216Kb
Motivation… • Monitoring internet traffic
Streaming model (figure: input stream labelled 250 BPS) • We can't store the whole input. • In our case we seek an algorithm that requires poly(log m) space.
Related work • Karp-Rabin: Randomized Algorithm for exact pattern matching • Clifford, Porat, and Porat: A black box algorithm for online approximate pattern matching • Almost any pattern matching algorithm can be converted to run online.
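Karp-Rabin Algorithm
Pattern $p_0 p_1 p_2 p_3 \dots p_{m-1}$ with signature $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \dots + p_{m-1} \bmod q$, where r is chosen randomly.
Text $t_0\, t_1\, t_2 \dots t_i\, t_{i+1} \dots t_{i+m-1}\, t_{i+m} \dots t_n$ with window signatures:
$S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \dots + t_{i+m-1} \bmod q$
$S_{i+1} = t_{i+1} r^{m-1} + \dots + t_{i+m-1} r + t_{i+m} \bmod q$
$S_{i+1} = S_i r + t_{i+m} - t_i r^m \bmod q$
Requires O(m) memory.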
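The rolling update on this slide can be sketched in Python as follows. This is a minimal illustration rather than the speaker's code: the function name `karp_rabin`, the default modulus `q = 2^61 - 1`, and the explicit substring check on a signature hit are assumptions made for the example. The check is what forces the window (and so O(m) memory) to be kept.

```python
import random

def karp_rabin(text, pattern, q=(1 << 61) - 1, r=None):
    """Return all start indices i with text[i:i+m] == pattern."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    if r is None:
        r = random.randrange(2, q)      # random base r, as on the slide
    r_m = pow(r, m, q)                  # r^m mod q, used to drop t_i

    def signature(s):
        # p_0 r^(m-1) + p_1 r^(m-2) + ... + p_(m-1) mod q (Horner's rule)
        h = 0
        for ch in s:
            h = (h * r + ord(ch)) % q
        return h

    p_sig = signature(pattern)
    s_i = signature(text[:m])           # S_0
    matches = []
    for i in range(n - m + 1):
        # Assumption: verify the window so a signature collision
        # cannot produce a false match.
        if s_i == p_sig and text[i:i + m] == pattern:
            matches.append(i)
        if i + m < n:
            # S_{i+1} = S_i * r + t_{i+m} - t_i * r^m  (mod q)
            s_i = (s_i * r + ord(text[i + m]) - ord(text[i]) * r_m) % q
    return matches
```

For example, `karp_rabin("abracadabra", "abra")` returns `[0, 7]`.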
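The idea - Simple case. The pattern P starts with z, and z does not occur anywhere else in the pattern. We keep the signature of P; in the text T, every occurrence of z is a point where we start signing. (Figure: P beginning with z; T with two occurrences of z, each marked "start signing".)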
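A minimal Python sketch of this simple case, assuming a Karp-Rabin style signature and a hypothetical generator name `stream_matches`. Because z occurs only at the start of the pattern, a new z kills any candidate in progress, so a single running signature and a length counter are enough. Matches are reported by signature only, so a false positive is possible with small probability.

```python
import random

Q = (1 << 61) - 1   # assumed modulus for the illustration

def stream_matches(chars, pattern, r=None, q=Q):
    """Yield start positions of pattern in the stream `chars`.

    Assumes pattern[0] occurs nowhere else in the pattern.
    """
    if not pattern or pattern[0] in pattern[1:]:
        raise ValueError("first character must be unique in the pattern")
    if r is None:
        r = random.randrange(2, q)
    m, z = len(pattern), pattern[0]

    p_sig = 0
    for ch in pattern:                   # signature of the whole pattern
        p_sig = (p_sig * r + ord(ch)) % q

    cur, length = 0, 0                   # length == 0: not signing
    for pos, ch in enumerate(chars):
        if ch == z:
            # Start signing; an older candidate cannot contain this z.
            cur, length = ord(ch) % q, 1
        elif length:
            cur = (cur * r + ord(ch)) % q
            length += 1
        if length == m:
            if cur == p_sig:
                yield pos - m + 1
            length = 0                   # wait for the next z
```

For example, `list(stream_matches("xzabzabczabc", "zabc"))` gives `[4, 8]`: the candidate started at position 1 is discarded when the z at position 4 arrives.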
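Case 1. There is a prefix U of the pattern P, with |U| ≤ m/2, such that U appears only once in the pattern. We seek U recursively; in the text T, every occurrence of U is a point where we start signing. (Figure: P of length m with prefix U; T with two occurrences of U, each marked "start signing".)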
Case 2: No small U
Look at the first m/2 characters: they appear again somewhere in the pattern.
Option 1: P = v v v v v v v v followed by a prefix of v.
Option 2: P = v v v v w, where w is not a prefix of v, v is not a prefix of w, and |v| ≤ m/2.
Solving case 2. Option 2: P = v v v v w, |v| ≤ m/2. • Sign w. • Search recursively for v, and count how many times you found it. (Figure: text T containing v v v v v, with two "start signing" points.)
Solving case 2 - continue. Option 2: P = v v v v w, |v| ≤ m/2. • Sign w. • Search recursively for v, and count how many times you found it. (Figure: text T containing v v v v v v v v, with gaps marked <m/2 and >m/2 and two "start signing" points.) Using O(log m) signatures and counters in the worst case.
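Karp-Rabin Algorithm (shown again)
Pattern $p_0 p_1 p_2 p_3 \dots p_{m-1}$ with signature $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \dots + p_{m-1} \bmod q$, where r is chosen randomly.
$S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \dots + t_{i+m-1} \bmod q$
$S_{i+1} = t_{i+1} r^{m-1} + \dots + t_{i+m-1} r + t_{i+m} \bmod q$
$S_{i+1} = S_i r + t_{i+m} - t_i r^m \bmod q$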
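Rothschild signature 07
Pattern $p_0 p_1 p_2 p_3 \dots p_{m-1}$
Karp-Rabin form: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \dots + p_{m-1} \bmod q$
Forward form: $p_0 + p_1 r + p_2 r^2 + \dots + p_{m-1} r^{m-1} \bmod q$
Text: $t_0\, t_1\, t_2\, t_3 \dots t_i$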
Forward signatures. There is a prefix U of the pattern P, with |U| ≤ m/2, such that U appears only once in the pattern. We seek U recursively in the text T. • Calculate $X = S_i + \mathrm{Sig} \cdot r^{i+1}$. • Remember X for this position. • Later, check whether the text signature is equal to X.
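The identity behind $X = S_i + \mathrm{Sig} \cdot r^{i+1}$ can be checked with a few lines of Python. This sketch assumes that $S_i$ is the forward signature of $t_0 \dots t_i$ and that Sig is the forward signature of the characters still expected after position i; the function names `forward_sig` and `expected_sig` are hypothetical.

```python
def forward_sig(symbols, r, q):
    """Forward signature x_0 + x_1 r + x_2 r^2 + ... mod q."""
    h, power = 0, 1
    for x in symbols:
        h = (h + x * power) % q
        power = power * r % q
    return h

def expected_sig(s_i, i, rest_sig, r, q):
    """X = S_i + Sig * r^(i+1) mod q.

    s_i      -- forward signature of t_0 .. t_i (assumed meaning of S_i)
    rest_sig -- forward signature of the characters expected next
                (assumed meaning of Sig)
    """
    return (s_i + rest_sig * pow(r, i + 1, q)) % q

# Usage with small illustrative parameters q = 7, r = 3.
q, r = 7, 3
text = [0, 1, 1, 0, 1, 1, 1]
i = 2
x = expected_sig(forward_sig(text[:i + 1], r, q), i,
                 forward_sig(text[i + 1:], r, q), r, q)
# If the expected characters do arrive, the prefix signature equals X.
assert x == forward_sig(text, r, q)
```

The point of the forward form is that X can be computed the moment position i is reached, so only one number has to be remembered per pending candidate.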
Example: q=7, r=3
P: 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1
Shorter prefixes of P shown in the figure: 0, 1, 1, 0, 1, 1, 1 and 0, 1, 1
(Figure: powers r^i, a binary text T, and the per-position signature values for Level 1, Level 2 and Level 3. The individual values are not recoverable from the extracted text.)
Worst case - time. Text $t_0\, t_1\, t_2\, t_3 \dots t_i$ with remembered values $X_1, X_2, \dots, X_{\log m}$. • Amortized O(1), but what about the worst case? • Check using a hash table. • What if $X_1 = X_2 = \dots = X_{\log m}$? • We can work lazily without a blowup in memory. • Time: O(1)
Average / random / smooth case. P: prefix lengths $m$, $\log_\Sigma m$, $\log_\Sigma \log_\Sigma m$, … The total number of iterations is $O(\log^*_\Sigma m)$.
Worst case. P: prefix lengths m, m/2, m/4, … The total number of iterations is O(log m), which gives O(log m log δ) space.
Multi-Pattern search (dictionary matching) • Given a set of patterns $D = \{P_1, P_2, P_3, \dots, P_d\}$ • The patterns can be of different lengths. • We want to report whenever one of the patterns appears. • Our algorithm requires $O(\sum_{i=1}^{d} \log |P_i|)$ memory and O(log d) time per text character.
Multi-Pattern search (dictionary matching) • Denote $M = \max_i |P_i|$ • Our algorithm has 2 cases: • Case 1: d>M • Case 2: d<M
Case 1: d>M • In this case we can allocate an array of size M+1.
Text: $t_0\, t_1\, t_2\, t_3 \dots t_{l-M}\, t_{l-M+1} \dots t_l$
Window: $S_{l-M}\, S_{l-M+1} \dots S_l$
It is easy to maintain such a sliding window in O(1) time and O(M) memory.
Case 1: d>M - continue
For each $P_i$ in D ($P_i = a_0 a_1 a_2 \dots a_{m_i-1}$):
    e = m_i
    while e != 0:
        find j s.t. 2^j <= e and 2^(j+1) > e
        e = e - 2^j
        if e != 0: HashTable(Sig(a_e a_(e+1) … a_(m_i-1)))
    HashTable(Sig(a_0 a_1 … a_(m_i-1)), match_i)
Example: $P_i = a_0 a_1 a_2 \dots a_{38}$. We will store in the hash table: Sig($a_7 a_8 \dots a_{38}$), Sig($a_3 a_4 \dots a_{38}$), Sig($a_1 a_2 \dots a_{38}$), and Sig($a_0 a_1 \dots a_{38}$) together with match_i.
We will store at most log |P_i| points.
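The loop on this slide can be written as a short Python helper. The name `split_points` is an assumption made for illustration; it returns the start indices e of the suffixes whose signatures get stored, ending with 0 for the full pattern.

```python
def split_points(m):
    """Start indices of the stored suffixes for a pattern of length m.

    Repeatedly strips the largest power of two 2^j <= e from e,
    mirroring the loop on the slide. The final 0 is the full pattern,
    which is stored together with its match marker.
    """
    points = []
    e = m
    while e != 0:
        j = e.bit_length() - 1      # largest j with 2^j <= e < 2^(j+1)
        e -= 1 << j
        points.append(e)
    return points

# Pattern a_0 ... a_38 has length 39 = 32 + 4 + 2 + 1.
print(split_points(39))                       # [7, 3, 1, 0]
print([39 - e for e in split_points(39)])     # suffix lengths [32, 36, 38, 39]
```

The number of stored points equals the number of 1 bits in m, so it never exceeds the number of bits of m.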
Case 1: d>M - continue. (Figure: stored suffix lengths $2^i$, $2^i + 2^j$, $2^i + 2^j + 2^l$, …) At most $\log |P_i|$ levels.
Case 1: d>M • In this case we can allocate an array of size M+1.
Text: $t_0\, t_1\, t_2\, t_3 \dots t_{l-M}\, t_{l-M+1} \dots t_l$
Window: $S_{l-M}\, S_{l-M+1} \dots S_l$
Notice that it takes O(1) time to calculate Sig($t_i t_{i+1} \dots t_l$).
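One way to get the O(1) window query is to keep the last M+1 prefix signatures in a ring buffer. The sketch below is an assumption-laden illustration, not the construction from the talk: the class name `SlidingSignatures` is hypothetical, and it uses Karp-Rabin style prefix signatures with a precomputed table of powers of r, where the slides use the forward form.

```python
class SlidingSignatures:
    """Last M+1 prefix signatures of a stream, in a ring buffer."""

    def __init__(self, M, r, q):
        self.M, self.r, self.q = M, r, q
        # powers[k] = r^k mod q for k = 0..M (O(M) memory)
        self.powers = [1] * (M + 1)
        for k in range(1, M + 1):
            self.powers[k] = self.powers[k - 1] * r % q
        # ring[k % (M+1)] holds the signature of the first k characters
        self.ring = [0] * (M + 1)
        self.count = 0

    def push(self, c):
        """Append one character (an integer code) in O(1) time."""
        size = self.M + 1
        prev = self.ring[self.count % size]
        self.count += 1
        self.ring[self.count % size] = (prev * self.r + c) % self.q

    def suffix_sig(self, length):
        """Signature of the last `length` characters in O(1) time."""
        if not 0 <= length <= min(self.M, self.count):
            raise ValueError("length outside the stored window")
        size = self.M + 1
        end = self.ring[self.count % size]
        start = self.ring[(self.count - length) % size]
        return (end - start * self.powers[length]) % self.q
```

After pushing 3, 1, 4, 1, 5, 9, 2, 6 into `SlidingSignatures(4, 3, 1_000_003)`, `suffix_sig(3)` equals the direct signature of 9, 2, 6, which is 9·3² + 2·3 + 6 = 93.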
Case 1: d>M - continue. We do a binary search over the sliding window $S_{l-M}\, S_{l-M+1} \dots S_l$. At each probe position, such as $l - 2^j$, $l - 2^{j-1}$ or $l - 2^{j-1} - 2^{j-2}$, we ask: is it in the hash table? The yes/no answer decides the next probe.
Case 2: d<M • In this case we split our dictionary D into 2 dictionaries: • D1 – all the patterns shorter than d. On this dictionary we run case 1. • D2 – all the patterns longer than d. This is the only case we still need to deal with.
Case 2: d<M - continue. For each $P_i$ in D2: $P_i = a_0 a_1 a_2 \dots a_{d-1} a_d \dots a_m$ and $SP_i = \mathrm{Sig}(a_0 a_1 \dots a_{d-1})$. Store $SP_i$ in the hash table.
Case 2: d<M - continue. If $P_i$ contains a periodic prefix of length more than d, say $P_i = u u u u u u v \dots a_m$, then $SP_i$ is seen several times along the periodic part, and w.h.p. what follows it won't be $SP_i$. • We also store the number of times we need to see $SP_i$. • We start a process that seeks $P_i$ only after seeing $SP_i$ enough times. • Therefore the number of characters we have to see between 2 processes for $P_i$ is at least d.
Case 2: d<M - continue • We run the algorithm from the beginning of the lecture. • Amortized, it takes O(1/d) time per pattern per text character. • Overall, it takes O(1) amortized time per text character. • With the lazy approach we get O(1) time in the worst case.
Open problems • Multi-pattern search: case 2 takes O(1) time, but case 1 takes O(log d). Improve case 1 to O(1). With heuristics, almost all dictionaries take O(1) time and O(1) space per pattern. • Lower bound: we believe that the lower bound for single-pattern search is Ω(log m log δ). • Supporting wildcards & mismatches