
Exact and Approximate Pattern in the Streaming Model

Exact and Approximate Pattern in the Streaming Model. Benny Porat and Ely Porat 2009 FOCS. Presented by - Tanushree Mitra. Problem Statement. Find all instances of pattern P of length m, as a contiguous substring in a text string T, of length n, where m < n. Contributions.


Presentation Transcript


  1. Exact and Approximate Pattern in the Streaming Model
     Benny Porat and Ely Porat, FOCS 2009
     Presented by Tanushree Mitra

  2. Problem Statement
     • Find all instances of a pattern P of length m as a contiguous substring of a text string T of length n, where m < n.

  3. Contributions
     • Exact pattern matching: a fully online randomized algorithm for the classical pattern matching problem.
       Time complexity: O(log m) per arriving character.
       Space complexity: O(log m), breaking the O(m) barrier that had held for this problem for a long time.
     • Approximate pattern matching: an algorithm for the pattern matching with k mismatches problem.
       Time complexity: O(k^2 poly(log m)) per character.
       Space complexity: O(k^3 poly(log m)).

  4. Applications
     • Monitoring Internet traffic
     • Computational biology
     • Large-scale web searching
     • Virus and malware detection
     • Automatic stock market analysis
     • Robotics

  5. Background
     Brute-force algorithm:
     • Slide the pattern along the text, and
     • Compare it to the corresponding portion of the text.
     Time complexity: O(mn).
     A speedup is possible in each of these two steps:
     • Sliding-step speedup, by preprocessing the pattern: Knuth-Morris-Pratt algorithm; Boyer-Moore algorithm; Ukkonen's algorithm to construct suffix trees.
     • Comparison-step speedup: Rabin-Karp algorithm.
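
A minimal Python sketch of the brute-force baseline named on this slide. The function name is illustrative and does not come from the presentation; indices are 0-based.

```python
def brute_force_matches(text: str, pattern: str) -> list[int]:
    """Return every 0-based start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):             # sliding step: try every shift
        if text[start:start + m] == pattern:   # comparison step: O(m) work
            matches.append(start)
    return matches


if __name__ == "__main__":
    print(brute_force_matches("abracadabra", "abra"))  # [0, 7]
```

The two loops correspond to the two steps listed above: KMP-style methods shorten the sliding step, while Rabin-Karp replaces the O(m) comparison with a fingerprint check.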

  6. Quick History

  7. The Intuition
     • When the Rabin-Karp algorithm is done with the i'th character and advances to the next position in the text, it does not use any of the information it has gathered.
     • The KMP algorithm, on the other hand, puts that information to good use.
     The Idea
     • Combine the key features of the KMP and Rabin-Karp algorithms to obtain an online algorithm that uses less space.

  8. Definitions - Fingerprints
     • Fingerprint: a string S is mapped to a fingerprint ф(S).
     • Sliding fingerprint: the polynomial fingerprint ф_{r,p}(S) = s_1·r + s_2·r^2 + … + s_l·r^l mod p, where p ∈ Θ(n^4) and r ∈ F_p.
     • False positives: if S1 ≠ S2, then the probability that ф_{r,p}(S1) = ф_{r,p}(S2) is < 1/n^3.

  9. Definitions - period_{P_l}
     • Period: a prefix S_p = s_1, s_2, …, s_l of a string S is defined to be a period of S iff s_i = s_{i+l} for 0 ≤ i ≤ n - l.
     • period_{P_l}: for a pattern P = p_1, p_2, …, p_m, the prefix of length l is P_l = p_1, p_2, …, p_l, 0 ≤ l ≤ m. The shortest period of P_l is period_{P_l}.
     Put the information to good use:
     • If P_l matches the text at a given index i, then there cannot be another match of P_l starting strictly between i and i + |period_{P_l}|.
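
As a worked illustration of the definition above, the length of the shortest period of every prefix can be computed with the KMP failure function. The presentation does not say how the periods are obtained, so this choice is an assumption, and the function name is hypothetical.

```python
def shortest_period_lengths(pattern: str) -> list[int]:
    """periods[l] = length of the shortest period of the prefix of length l.

    Uses the KMP failure function: fail[l] is the length of the longest
    proper border of pattern[:l], and the shortest period is l - fail[l].
    """
    n = len(pattern)
    fail = [0] * (n + 1)
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k]          # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    return [l - fail[l] for l in range(n + 1)]


if __name__ == "__main__":
    # Prefixes of "abcabcab" of length >= 3 all have shortest period "abc".
    print(shortest_period_lengths("abcabcab"))  # [0, 1, 2, 3, 3, 3, 3, 3, 3]
```

For example, after "abcabcab" matches at index i, the next possible match of the same prefix starts at i + 3.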

 10. The Idea
     • A match at the i'th index indicates that we know the last m characters, so is there any point in saving them?
     • Preprocessing phase: calculate the sliding fingerprint ф_P of the pattern and ф_period of its shortest period.
     • Online phase: slide the fingerprint ф over the entire text.
       • While ф = ф_P, slide ф by |period_{P_l}| characters.
       • If we do not reach the end of the text, abort.
     • False positives? We slide only over the |period_{P_l}| positions that could be a match, so the probability of a false positive is very low.
     • Limitation: the text and the pattern must satisfy stringent restrictions.
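
One way to read "slide ф by |period| characters" without storing the text is through fingerprint arithmetic: for a string AB, ф(AB) = ф(A) + r^|A|·ф(B) mod p, so ф(B) can be recovered from ф(AB) and ф(A). The sketch below assumes the polynomial fingerprint from slide 8 and an assumed prime modulus 2^61 - 1; the helper names are hypothetical and the paper's exact bookkeeping may differ.

```python
P = (1 << 61) - 1  # assumed prime modulus


def fingerprint(s: str, r: int, p: int = P) -> int:
    """Compute s_1*r + s_2*r^2 + ... + s_l*r^l mod p."""
    acc, power = 0, 1
    for ch in s:
        power = power * r % p
        acc = (acc + ord(ch) * power) % p
    return acc


def drop_prefix(fp_whole: int, fp_prefix: int, prefix_len: int, r: int, p: int = P) -> int:
    """Given fingerprints of AB and A, return the fingerprint of B.

    Uses fp(AB) = fp(A) + r^|A| * fp(B) (mod p), so no character of A
    has to be kept in memory, only fp(A) and |A|.
    """
    inv = pow(pow(r, prefix_len, p), -1, p)  # r^(-|A|) mod p (Python 3.8+)
    return (fp_whole - fp_prefix) * inv % p


if __name__ == "__main__":
    r = 31
    whole = fingerprint("abcabcab", r)
    period = fingerprint("abc", r)
    # Sliding past one period "abc" leaves the fingerprint of "abcab".
    assert drop_prefix(whole, period, 3, r) == fingerprint("abcab", r)
```

Because a matched window starts with a known period, ф_period from the preprocessing phase is all that is needed to perform the slide.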

 11. Go for subpatterns
     • log m subpatterns of the pattern p_1, p_2, p_3, …, p_{m-3}, p_{m-2}, p_{m-1}, p_m.
       [Figure: P_1 = p_m; P_2 = p_{m-2}, p_{m-1}; P_4 = p_{m-6}, p_{m-5}, p_{m-4}, p_{m-3}; …; P_{m/2} = p_1, p_2, p_3, …, p_{m/2}]
     • Starting point: find a position at which the smallest subpattern matches the text. The smallest subpattern has length 1, so this is easy to find.

 12. Algorithm
     Guidelines:
     • Find a position where P_i is a match, then try to match P_{i+1} from the same starting point as P_i.
     • If P_{i+1} does not match, use the information that P_i is a match.
     • Check in jumps of |period_{P_i}| until there is no overlap with the area where P_i matches.
     Process:
     • Initialize an empty sliding fingerprint ф.
     • For each character that arrives:
       • Extend ф to include the new character.
       • If |ф| = 2^i and ф = ф_i for some 0 ≤ i ≤ log m:
         • If ф overlaps the last match by at least |period_{P_{i-1}}| characters, slide ф by |period_{P_{i-1}}| characters.
         • Else, abort.
     Open question: what if there is a match that starts in the substring of the 1st process and ends in the substring of the 2nd process?

 13. Exact_PM: Final Algorithm - Introduce Checkpoints
     • Checkpoint: start a new process at the last checkpoint of each process.
     Algorithm
     • Preprocessing:
       • Initialize an empty sliding fingerprint ф.
       • For each 0 ≤ i ≤ log m, calculate the sliding fingerprint ф_i of P_i and ф_{i,period} of the period of P_i.

 14. Final Algorithm - Online Phase
     • Start a new process.
     • Send every character that arrives to all the processes.
     • If some process aborts, start a new process.
     • If some process A reaches a checkpoint:
       • Stop the 'son process' of A (if it has one).
       • Start a new 'son process' of A.

 15. Complexity
     Space:
     • All fingerprints from preprocessing use O(log m) space.
     • Each process saves another fingerprint, and there can be at most log m processes in parallel.
     • Overall usage: O(log m) space.
     Time:
     • Each process spends O(1) time on each new character that arrives.
     • At any time there are at most 3 log m processes running (1. process A, 2. son process of A, 3. grandson process of A; A has to die when the great-grandson of A is created).
     • Overall running time: O(log m) per character.

 16. Pattern Matching (1-Mismatch)
     • Partition the pattern and the text.
     • We need to align every partition P_{q_i,j} of the pattern to q_i text shifts.

 17. Intuition
     • For each P_{q_i,j}, run q_i processes of Exact_PM.
     • Process_{q_i,j,σ}: the σ'th process of the subpattern P_{q_i,j}, for 0 ≤ σ < q_i. It tries to match P_{q_i,j} to the text by considering the text as if it starts from the σ'th character (τ mod q_i = j - σ).
     • If, for all q_i:
       • numOfNotMatch_{q_i,σ} = 0: 'match'.
       • numOfNotMatch_{q_i,σ} = 1: 'exactly 1-mismatch'.
       • Otherwise: 'more than 1-mismatch'.
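
The decision rule above can be illustrated offline in Python for a single alignment. Direct string comparison stands in for the Exact_PM processes, and the sketch assumes that the product of the chosen primes is at least m, so that two distinct mismatch positions fall into different residue classes for at least one prime. The function name and the prime list are illustrative.

```python
def classify_alignment(text: str, pattern: str, start: int, primes: list[int]) -> str:
    """Classify the alignment of pattern at text[start:] by residue-class subpatterns.

    For each prime q, the pattern is split into q subpatterns P_{q,j}
    (positions congruent to j mod q) and we count how many of them differ
    from the corresponding characters of the text window.
    """
    m = len(pattern)
    window = text[start:start + m]
    if len(window) != m:
        raise ValueError("alignment runs past the end of the text")
    counts = []
    for q in primes:
        not_match = sum(1 for j in range(q) if pattern[j::q] != window[j::q])
        counts.append(not_match)
    if all(c == 0 for c in counts):
        return "match"
    if all(c == 1 for c in counts):
        return "exactly 1-mismatch"
    return "more than 1-mismatch"


if __name__ == "__main__":
    primes = [2, 3, 5]  # product 30 >= m = 8
    print(classify_alignment("xxabcdefghxx", "abcdefgh", 2, primes))  # match
    print(classify_alignment("xxabcXefghxx", "abcdefgh", 2, primes))  # exactly 1-mismatch
    # Mismatches at positions 0 and 6 collide mod 2 and mod 3 but not mod 5.
    print(classify_alignment("xxXbcdefYhxx", "abcdefgh", 2, primes))  # more than 1-mismatch
```

The last call shows why several primes are needed: a single prime can place two mismatches in the same subpattern and report only one failing class.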

 18. Complexity
     Facts:
     • Run ∑_{i=1}^{l} q_i^2 processes of Exact_PM.
     • There exists a constant c such that, for any x, there exist x / log m prime numbers between x and cx.
     • We have q_1, q_2, …, q_l groups of partitions. Each q_i is a prime number.
     Space: O(log^4 m / log log m)
     Time: O(log^3 m / log log m)

 19. Pattern Matching (k-Errors)
     • Preprocessing phase: initialize a process Process_{q_i,j,σ} of 1-mismatch for each q_i ∈ {q_1, q_2, …, q_l}, 0 ≤ j < q_i and 0 ≤ σ < q_i.
     • Online phase: send the τ'th character to each Process_{q_i,j,σ} such that τ mod q_i = j - σ.
     • d = all mismatches from all processes that return 'exactly 1-mismatch'.
     • If d > k: more than k mismatches.

 20. Complexity
     Space:
     • Run ∑_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m) processes of 1-mismatch in parallel.
     • Each process requires log^4 m space.
     • Overall: O(k^3 poly(log m))
     Time:
     • The number of processes of the 1-mismatch algorithm is bounded by ∑_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m).
     • Running time for each character: O(log^3 m).
     • Overall: O(k^2 poly(log m))

 21. Concluding Discussion
     • The two-dimensional string-matching problem
     • The string-matching problem with wild characters. Example: pattern P = {abc#abc#} is found in texts T1 = {abcdcadbaccabc}, T2 = {abcabc}
     • String matching with weighted mismatch
