Exact and Approximate Pattern Matching in the Streaming Model. Benny Porat and Ely Porat, FOCS 2009. Presented by Tanushree Mitra.
Exact and Approximate Pattern Matching in the Streaming Model
Benny Porat and Ely Porat, FOCS 2009
Presented by Tanushree Mitra
Problem Statement • Find all instances of pattern P of length m, as a contiguous substring in a text string T, of length n, where m < n.
Contributions
• Exact pattern matching: a fully online randomized algorithm for the classical pattern matching problem.
  • Time complexity: $O(\log m)$ per character that arrives
  • Space complexity: $O(\log m)$, breaking the $O(m)$ barrier that held for this problem for a long time
• Approximate pattern matching: an algorithm for the pattern matching with k mismatches problem.
  • Time complexity: $O(k^2\,\mathrm{poly}(\log m))$ per character
  • Space complexity: $O(k^3\,\mathrm{poly}(\log m))$
Applications • Monitoring Internet traffic • Computational Biology • Large Scale web searching • Viruses and Malware detection • Automatic Stock market analysis • Robotics
Background
Brute Force Algorithm
• Slide the pattern along the text, and
• Compare it to the corresponding portion of the text.
Time complexity: $O(mn)$
A speedup is possible in each of these 2 steps.
• Sliding step: speed up by pre-processing the pattern
  • Knuth-Morris-Pratt algorithm
  • Boyer-Moore algorithm
  • Ukkonen's algorithm to construct suffix trees
• Comparison step
  • Rabin-Karp algorithm
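The brute-force baseline above can be sketched in a few lines of Python. This is only an illustration of the slide-and-compare idea; the function name `brute_force_matches` is made up for this example and does not come from the paper.

```python
def brute_force_matches(text, pattern):
    """Return every 0-based start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    # Sliding step: try every alignment of the pattern against the text.
    for start in range(n - m + 1):
        # Comparison step: up to m character comparisons per alignment,
        # which is where the O(mn) worst-case bound comes from.
        if text[start:start + m] == pattern:
            matches.append(start)
    return matches
```

Note that this baseline keeps the whole text and pattern in memory, which is exactly what a streaming algorithm has to avoid.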
The Intuition
• When the Rabin-Karp algorithm is done with the i'th character and advances to the next position in the text, it does not use any of the information gathered.
• The KMP algorithm, on the other hand, puts that information to good use.
The Idea
• Combine the key features of the KMP and Rabin-Karp algorithms to achieve an online algorithm that uses less space.
Definitions - Fingerprints
• Fingerprint: string S → φ(S)
• Sliding fingerprint
• Polynomial fingerprint: $\phi_{r,p}(S) = s_1 r + s_2 r^2 + \dots + s_l r^l \bmod p$, where $p \in \Theta(n^4)$ and $r \in F_p$
• False positives: if $S_1 \ne S_2$, then the probability that $\phi_{r,p}(S_1) = \phi_{r,p}(S_2)$ is $< 1/n^3$
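A minimal Python sketch of such a polynomial fingerprint, together with the two operations that make it usable as a "sliding" fingerprint (appending a suffix and dropping a prefix). The modulus `P` and base `R` below are fixed constants chosen only so the example is reproducible; they are assumptions of this sketch, whereas the slides take a prime $p \in \Theta(n^4)$ and an $r \in F_p$. The helper names `fingerprint`, `concat` and `drop_prefix` are hypothetical.

```python
P = (1 << 61) - 1   # assumed prime modulus (a Mersenne prime), not from the slides
R = 1_000_003       # assumed fixed base; a real run would draw r at random from F_p

def fingerprint(s, r=R, p=P):
    """phi(S) = s_1*r + s_2*r^2 + ... + s_l*r^l (mod p)."""
    total, power = 0, r
    for ch in s:
        total = (total + ord(ch) * power) % p
        power = (power * r) % p
    return total

def concat(fp_a, len_a, fp_b, r=R, p=P):
    """Fingerprint of A+B from the fingerprints of A and B."""
    return (fp_a + pow(r, len_a, p) * fp_b) % p

def drop_prefix(fp_ab, fp_a, len_a, r=R, p=P):
    """Fingerprint of B from the fingerprints of A+B and A."""
    # Modular inverse of r^len_a via Fermat's little theorem (p is prime).
    inverse = pow(pow(r, len_a, p), p - 2, p)
    return ((fp_ab - fp_a) * inverse) % p
```

Both operations need only the fingerprints and the prefix length, never the characters themselves, which is what lets a process slide its window without storing the text.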
Definitions - Period
• Period: a prefix $S_p = s_1, s_2, \dots, s_l$ of a string S is defined to be a period of S iff $s_i = s_{i+l}$ for $1 \le i \le n - l$ (here n is the length of S).
• $period_{P_l}$: for a pattern $P = p_1, p_2, \dots, p_m$, the prefix of length l is $P_l = p_1, p_2, \dots, p_l$, $0 \le l \le m$. The shortest period of $P_l$ is $period_{P_l}$.
Put the information to good use
• If $P_l$ matches the text at a given index i, then there cannot be another match starting strictly between i and $i + |period_{P_l}|$.
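As an illustration of the definition, the length of the shortest period of every prefix can be computed offline with the KMP failure (border) function. This is a standard technique swapped in here for clarity, not something the slides prescribe, and the name `prefix_period_lengths` is hypothetical.

```python
def prefix_period_lengths(pattern):
    """For each prefix P_l (l = 1..m), return the length of its shortest period."""
    m = len(pattern)
    border = [0] * m  # border[i] = longest proper border of pattern[:i + 1]
    k = 0
    for i in range(1, m):
        # Fall back through shorter borders until the next character fits.
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        border[i] = k
    # A prefix of length l with longest border b has shortest period l - b.
    return [l - border[l - 1] for l in range(1, m + 1)]
```

For example, `prefix_period_lengths("abcabcab")` gives `[1, 2, 3, 3, 3, 3, 3, 3]`: once the prefix reaches "abc", every longer prefix is a repetition of "abc", so two matches of it cannot start fewer than 3 positions apart.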
The Idea
• A match at the i'th index indicates that we know the last m characters, so is there any point in saving them?
• Preprocessing phase: calculate the sliding fingerprint of the pattern, $\phi_P$, and of its shortest period, $\phi_{period_P}$.
• Online phase: slide the fingerprint φ over the entire text.
  • While $\phi = \phi_P$, slide φ by $|period_{P_l}|$ characters.
  • If we do not reach the end of the text, abort.
Notes
• Slide over the $|period_{P_l}|$ positions that could be a match.
• False positives? Their probability is very low.
• The text and pattern must satisfy stringent restrictions.
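Go for subpatterns
• log m subpatterns of the pattern $p_1, p_2, p_3, \dots, p_{m-3}, p_{m-2}, p_{m-1}, p_m$
[Figure: the pattern with subpatterns $P_1$ ($p_m$), $P_2$ ($p_{m-2}, p_{m-1}$), $P_4$ ($p_{m-6}, p_{m-5}, p_{m-4}, p_{m-3}$), …, $P_{m/2}$ ($p_1, p_2, p_3, \dots, p_{m/2}$) marked]
• Starting point: find a position at which the smallest subpattern matches the text. The smallest subpattern has length 1, so this is easy to find.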
Algorithm
Guidelines
• Find a position where $P_i$ is a match, then try to match $P_{i+1}$ from the same starting point as $P_i$.
• If $P_{i+1}$ does not match, use the information that $P_i$ is a match.
• Check in jumps of $|period_{P_i}|$ until there is no overlap with the area where $P_i$ matches.
Process
• Initialize an empty sliding fingerprint φ.
• For each character that arrives:
  • Extend φ to include the new character.
  • If $|\phi| = 2^i$ and $\phi = \phi_i$ for some $0 \le i \le \log m$.
  • If φ overlaps the last match by at least $|period_{P_{i-1}}|$ characters, slide φ by $|period_{P_{i-1}}|$ characters.
  • Else, abort.
What if there is a match that starts in the substring of the 1st process and ends in the substring of the 2nd process?
Exact_PM: Final Algorithm
Introduce Checkpoints
• Checkpoint: start a new process at the last checkpoint of each process.
Algorithm
• Preprocessing
  • Initialize an empty sliding fingerprint φ.
  • For each $0 \le i \le \log m$, calculate the sliding fingerprint
    • $\phi_i$ of $P_i$, and
    • $\phi_{i,period}$ of the period of $P_i$.
Final Algorithm - Online Phase
• Start a new process.
• Send every character that arrives to all the processes.
• If some process aborts, start a new process.
• If some process A reaches a checkpoint:
  • Stop the 'son process' of A (if it has one).
  • Start a new 'son process' of A.
Complexity
• Space
  • All fingerprints from preprocessing use $O(\log m)$ space.
  • Each process saves another fingerprint, and there can be at most log m processes in parallel.
  • Overall usage: $O(\log m)$ space
• Time
  • Each process spends $O(1)$ time for each new character that arrives.
  • At any time there are at most 3 log m processes running (1. process A, 2. son-process of A, 3. grandson-process of A; A has to die when the great-grandson of A is created).
  • Overall running time: $O(\log m)$ per character
Pattern Matching (1-Mismatch)
• Partition the pattern and the text.
• We need to align every partition $P_{q_i,j}$ of the pattern to $q_i$ text shifts.
Intuition
• For each $P_{q_i,j}$, run $q_i$ processes of Exact_PM.
• $Process_{q_i,j,\sigma}$ is the σ'th process of the subpattern $P_{q_i,j}$, for $0 \le \sigma < q_i$. It tries to match $P_{q_i,j}$ to the text by considering the text as if it starts from the σ'th character ($\tau \bmod q_i = j - \sigma$).
• If, for all $q_i$,
  • $numOfNotMatch_{q_i,\sigma} = 0$: 'match'
  • $numOfNotMatch_{q_i,\sigma} = 1$: 'exactly 1-mismatch'
  • Otherwise: 'more than 1-mismatch'
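The counting rule above can be illustrated offline for one fixed alignment. The sketch below compares a pattern with an equal-length text window directly, character by character, instead of running Exact_PM processes on fingerprints, so it shows only the partition-by-residue idea; the function name `classify_window` and the choice of primes are assumptions of this example. The classification is reliable when the product of the primes is at least the pattern length, since two distinct mismatch positions then fall into different residue classes for at least one prime.

```python
def classify_window(pattern, window, primes):
    """Classify one alignment using the residue-class partitions mod each prime."""
    if len(pattern) != len(window):
        raise ValueError("pattern and window must have the same length")
    worst = 0
    for q in primes:
        # Partition j holds the characters at positions j, j + q, j + 2q, ...
        not_matching = sum(
            1 for j in range(q) if pattern[j::q] != window[j::q]
        )
        worst = max(worst, not_matching)
    if worst == 0:
        return "match"
    if worst == 1:
        return "exactly 1-mismatch"
    return "more than 1-mismatch"
```

With the pattern "abcdefgh" and primes [2, 3, 5], the window "xbcdefxh" has mismatches at positions 0 and 6. Those positions share a residue class mod 2 and mod 3, so only the prime 5 separates them, which is why a single prime is not enough.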
Complexity
• Facts
  • Run $\sum_{i=1}^{l} q_i^2$ processes of Exact_PM.
  • There exists a constant c such that for any x, there exist x / log m prime numbers between x and cx.
  • We have $q_1, q_2, \dots, q_l$ groups of partitions. Each $q_i$ is a prime number.
• Space: $O(\log^4 m / \log\log m)$
• Time: $O(\log^3 m / \log\log m)$
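To get a feel for the process count $\sum_{i=1}^{l} q_i^2$, here is a small Python sketch. It simply takes the smallest primes whose product reaches m, which is an assumption made for illustration; the slides instead draw the primes from a range between x and cx. Both function names are hypothetical.

```python
def smallest_primes_covering(m):
    """Smallest primes whose product is at least m (illustrative choice only)."""
    primes, product, candidate = [], 1, 2
    while product < m:
        # Trial division by every smaller prime found so far.
        if all(candidate % p for p in primes):
            primes.append(candidate)
            product *= candidate
        candidate += 1
    return primes

def exact_pm_process_count(primes):
    """Each prime q gives q partitions, each run at q text shifts: q*q processes."""
    return sum(q * q for q in primes)
```

For m = 1000 this picks the primes 2, 3, 5, 7, 11 (product 2310) and gives 4 + 9 + 25 + 49 + 121 = 208 processes of Exact_PM.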
Pattern Matching (k-Errors)
• Preprocessing phase: initialize a process $Process_{q_i,j,\sigma}$ of 1-mismatch for each $q_i \in \{q_1, q_2, \dots, q_l\}$, $0 \le j < q_i$ and $0 \le \sigma < q_i$.
• Online phase: send the τ'th character to each $Process_{q_i,j,\sigma}$ such that $\tau \bmod q_i = j - \sigma$.
  • d = the number of mismatches collected from all processes that return 'exactly 1-mismatch'
  • If d > k, there are more than k mismatches.
Complexity
• Space
  • Run $\sum_{i=1}^{k \log m} q_i^2 \in O(k^3 \log^4 m / \log\log m)$ processes of 1-mismatch in parallel.
  • Each process requires $\log^4 m$ space.
  • Overall: $O(k^3\,\mathrm{poly}(\log m))$
• Time
  • The number of processes of the 1-mismatch algorithm is bounded by $\sum_{i=1}^{k \log m} q_i^2 \in O(k^3 \log^4 m / \log\log m)$.
  • Running time for each character: $O(\log^3 m)$
  • Overall: $O(k^2\,\mathrm{poly}(\log m))$
Concluding Discussion • The Two-Dimensional String-Matching Problem • The String-Matching Problem with Wild Characters – Example: pattern P = {abc#abc#} is found in texts T1 = {abcdcadbaccabc}, T2 = {abcabc} • String matching with weighted mismatch