Dynamic Text and Static Pattern Matching. Amihood Amir Gad M. Landau Moshe Lewenstein Dina Sokol Bar-Ilan University. Classical Pattern Matching. Output: locations of T where P appears.
Dynamic Text and Static Pattern Matching Amihood Amir Gad M. Landau Moshe Lewenstein Dina Sokol Bar-Ilan University
Classical Pattern Matching • Input: pattern P = p1 p2 … pm and text T = t1 t2 t3 … tn over alphabet Σ. • m is the pattern size. • n is the text size. • Output: all locations in T where P appears.
Pattern Matching (e.g.) • Input: T = aaagcattagctagcagcat, P = agca, Σ = {a,g,c,t} • Output: 3, 13, 16 (1-indexed start locations of P in T)
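The example above can be reproduced with a naive scan. This is only a minimal sketch for checking the slide's numbers; the function name `find_occurrences` is an illustrative choice, not something from the talk.

```python
def find_occurrences(text, pattern):
    """Return the 1-indexed start locations of pattern in text (naive O(n*m) scan)."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


# The slide's example: T = aaagcattagctagcagcat, P = agca
print(find_occurrences("aaagcattagctagcagcat", "agca"))  # [3, 13, 16]
```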
“Dynamic” Pattern Matching • A. Static Text and Dynamic Pattern. • B. Dynamic Text and Dynamic Pattern. • C. Dynamic Text and Static Pattern.
“Dynamic” Pattern Matching • A. Static Text and Dynamic Pattern, a.k.a. the indexing problem. Solution: preprocess the text and answer pattern queries. Data structure: suffix trees [Wei73, McC75, Ukk95, Far97]. Time: O(n) preprocessing, O(m) query.
“Dynamic” Pattern Matching • A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query. • B. Dynamic Text and Dynamic Pattern, a.k.a. the dynamic indexing problem. Solution: sophisticated data structures [SV96, ABR00]. Time: O(m + log^2 n) query, O(log^2 n) change.
“Dynamic” Pattern Matching • A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query. • B. Dynamic Text and Dynamic Pattern. Time: O(m + log^2 n) query, O(log^2 n) change. • C. Dynamic Text and Static Pattern?
Dynamic Text and Static Pattern Matching • Pattern is non-changing • Text changes over time • Goal: report new occurrences of the pattern without performing a new search.
Motivation • 1. Intrusion detection systems • 2. Info alerts • 3. Two-dimensional run-length compressed matching problem [ALS03], e.g. a FAX image stored as run-length encoded rows: a^14, a^4 b^2 c^3 d^5, c^8 a^6
Problem Definition • Input: T and P over Σ = {1, …, m}. • Output: 1. At start: all occurrences of P in T. 2. After each change operation: a. report all new occurrences of P in T; b. discard all old occurrences of P in T that are no longer valid. • Change operation: change one character in the text, e.g. location 5 from a to b.
Example • Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}, T = g a g a g c t a g c g a g c a t
Example • Change: location 10 from c to a, giving T = g a g a g c t a g a g a g c a t
Example • After the change, P occurs at locations 8 to 14 of T. • Output: {8}
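As a baseline for this example, here is a minimal sketch that re-checks only the locations a single replacement can affect, namely the m start locations whose window covers the changed character. It is a brute-force O(m^2) illustration of the problem statement, not the algorithm of the talk; the class name `NaiveDynamicMatcher` and its methods are hypothetical.

```python
class NaiveDynamicMatcher:
    """Brute-force baseline: locations are 1-indexed, as on the slides."""

    def __init__(self, text, pattern):
        self.text = list(text)
        self.pattern = pattern
        m = len(pattern)
        # Static stage: record every occurrence once.
        self.matches = {i + 1 for i in range(len(text) - m + 1)
                        if text[i:i + m] == pattern}

    def replace(self, loc, ch):
        """Set text[loc] = ch; return (new occurrences, discarded occurrences)."""
        m = len(self.pattern)
        self.text[loc - 1] = ch
        # Only starts in [loc - m + 1, loc] can cover the changed location.
        lo = max(1, loc - m + 1)
        hi = min(loc, len(self.text) - m + 1)
        window = range(lo, hi + 1)
        old = {s for s in window if s in self.matches}
        now = {s for s in window
               if "".join(self.text[s - 1:s - 1 + m]) == self.pattern}
        self.matches = (self.matches - old) | now
        return now - old, old - now


matcher = NaiveDynamicMatcher("gagagctagcgagcat", "agagagc")
print(matcher.replace(10, "a"))  # ({8}, set())
```

The point of the talk is to replace the rescan of the window by a much faster update; the sketch only pins down the expected input and output.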
Results After O(n log log m + ) preprocessing time, O(log log m) time per replacement.
“Dynamic” Pattern Matching • A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query. • B. Dynamic Text and Dynamic Pattern. Time: O(m + log^2 n) query, O(log^2 n) change. • C. Dynamic Text and Static Pattern. Time: O(log log m) to change and announce.
Static Stage • To initially find all occurrences of P in T, use KMP [Knuth-Morris-Pratt ‘77]. • All pattern occurrences in a text of length 2m can be stored in O(1) space.
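For reference, a compact sketch of KMP for the static stage. It is the textbook formulation written for a non-empty pattern; the function names are illustrative, and the 1-indexed output follows the slides' convention.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the 1-indexed start locations of a non-empty pattern in text, in O(n + m)."""
    fail = failure_function(pattern)
    m = len(pattern)
    out = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            out.append(i - m + 2)  # 0-based start is i - m + 1
            k = fail[k - 1]
    return out


print(kmp_search("aaagcattagctagcagcat", "agca"))  # [3, 13, 16]
```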
Succinct Output • Assumption: the text is of size 2m. (Break the text T into overlapping strings of length 2m-1.) [Figure: T divided at m, 2m, 3m, 4m, with P aligned below]
Succinct Output (cont.) • P is periodic: a string P is periodic if it matches a shifted copy of itself that starts before position |P|/2, e.g. P = abcabcabca matches itself at a shift of 3. Store the output as a ‘chain’ of pattern occurrences. • P is non-periodic: by definition, there are no more than two occurrences in a text of length 2m.
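A small sketch of the constant-space idea, under two assumptions that are not stated on the slide: "periodic" is read as "smallest period at most |P|/2", and a chain is stored as a hypothetical triple (first, step, count). The helper names are illustrative.

```python
def smallest_period(pattern):
    """Smallest period of pattern = length minus its longest proper border."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return len(pattern) - fail[-1]


def is_periodic(pattern):
    # Assumed reading of the slide's definition.
    return smallest_period(pattern) <= len(pattern) // 2


def compress_chain(occurrences):
    """Store sorted, equally spaced occurrences as (first, step, count)."""
    if len(occurrences) < 2:
        return (occurrences[0] if occurrences else None, 0, len(occurrences))
    step = occurrences[1] - occurrences[0]
    # The sketch only handles occurrences that really form a chain.
    assert all(b - a == step for a, b in zip(occurrences, occurrences[1:]))
    return (occurrences[0], step, len(occurrences))


p = "abcabcabca"
t = "abc" * 6 + "a"  # a block of length 2m - 1 = 19
occ = [i for i in range(len(t) - len(p) + 1) if t.startswith(p, i)]  # 0-based starts
print(is_periodic(p), occ, compress_chain(occ))
```

With this representation, truncating a chain after a replacement amounts to adjusting `first` or `count`.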
On-line Algorithm Following each replacement: • Delete old matches that are no longer pattern occurrences. • Find new matches.
Delete Old Matches • Deleting is trivial since we store the matches in constant space: • P is periodic: truncate the chain of pattern occurrences. • P is non-periodic: discard all matches that begin within distance m before the replacement.
Find New Matches • Challenge: How can we locate occurrences of P, following each replacement, without actually searching for P?
Main Idea - Text Covers • We ‘cover’ the text with substrings of the pattern, i.e. store the text in terms of P. • Pattern = a g a g a g c (positions 1 to 7) • Text = g a g a g c | t | a g c | g a g c | a | t • Cover: [2,7] t [5,7] [4,7] [1,1] t (t does not occur in P, so it stands alone)
Text Cover (cont.) The text cover must satisfy two properties: • Substring Property: each element of the cover is a substring of P, or a character not included in P. • Maximality Property: no two adjacent elements can concatenate to form a substring of P.
Text Cover (cont.) • Initially, in the static stage, we construct a text cover for T. • We ensure that the cover satisfies both the substring and maximality property. How does a replacement in the text affect the text cover?
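Text Cover following replacement • Pattern = a g a g a g c (positions 1 to 7) • Before: Text = g a g a g c t a g c g a g c a t, Cover: (2,7) t (5,7) (4,7) (1,1) t • Replace location 10 (c becomes a): the piece (5,7) splits into (5,6) (1,1), giving (2,7) t (5,6) (1,1) (4,7) (1,1) t • Restore maximality: (5,6)(1,1) = a g a = (1,3), then (1,3)(4,7) = a g a g a g c = (1,7) • After: Cover: (2,7) t (1,7) (1,1) t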
Updating the Text Cover At most 5 pieces can violate the maximality property.
Substring Concatenation Query • Query: given two substrings of P, P[i,j] and P[k,l], is their concatenation also a substring of P? • Query time: O(log log m). • Preprocessing time: (also uses - [BG00]) • Hence, in O(log log m) time we can update the cover so that it satisfies both properties.
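To make the role of the query concrete, here is a naive sketch of a cover update after one replacement: split the affected piece, then merge adjacent pieces while their concatenation is a substring of P. The query is answered with Python's `in` operator rather than the O(log log m) structure, pieces are kept as plain strings for brevity, and the whole cover is rescanned instead of only the few pieces near the change; all names are illustrative.

```python
def concat_is_substring(pattern, a, b):
    """Naive substring concatenation query (stand-in for the O(log log m) structure)."""
    return (a + b) in pattern


def replace_and_repair(cover, pattern, pos, new_char):
    """cover: list of string pieces; pos: 0-based text index of the replacement."""
    offset = 0
    for idx, piece in enumerate(cover):
        if offset <= pos < offset + len(piece):
            break
        offset += len(piece)
    else:
        raise IndexError("position outside the text")
    rel = pos - offset
    # Split the affected piece around the replaced character.
    parts = [s for s in (piece[:rel], new_char, piece[rel + 1:]) if s]
    cover[idx:idx + 1] = parts
    # Restore maximality by merging adjacent pieces where possible.
    k = 0
    while k + 1 < len(cover):
        if concat_is_substring(pattern, cover[k], cover[k + 1]):
            cover[k:k + 2] = [cover[k] + cover[k + 1]]
        else:
            k += 1
    return cover


cover = ["gagagc", "t", "agc", "gagc", "a", "t"]
print(replace_and_repair(cover, "agagagc", 9, "a"))
# ['gagagc', 't', 'agagagc', 'a', 't']
```

The result corresponds to the cover (2,7) t (1,7) (1,1) t for the running example, where text location 10 is index 9 in 0-based terms.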
Find New Matches • Given: a text cover which satisfies both the substring and maximality properties. • Find: all new locations of the pattern in the text.
Key Observations • A new match must begin within distance m before the change. • A new match can include at most one entire piece of the cover. • It can span at most three pieces of the cover.
Furthermore • A new match can begin in one of at most three pieces of the cover: • the piece with the change • the previous piece • the one previous to that
Simplified Problem • The search starts within a single piece X of the cover. • Simple O(m) time algorithm: check each location in X for a pattern start, using suffix trees and LCA queries to compare substrings in constant time.
Improved Algorithm • Really, we only have to check each suffix of X that is also a pattern prefix, e.g. X = a g a g a. • The KMP automaton can give the necessary information. • However, the time is still O(m)!
Improved Algorithm • We can group the prefixes of P by their periods. • Each group of prefixes can be checked in constant time! • There are at most O(log m) groups.
Groups (e.g.) • Pattern = a g a g a g c (positions 1 to 7), X = a g a g a • There are three suffixes of X that are also pattern prefixes: agaga, aga, a • Groups: {agaga, aga} and {a} • Prefixes with the same period fall into a single group.
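The grouping in this example can be reproduced with a short sketch. It assumes that "period" means the smallest period of each prefix, computed from the KMP failure function, which matches the slide's example; the function name `prefix_suffix_groups` is hypothetical, and the linear scan over suffixes is for illustration only.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def prefix_suffix_groups(x, pattern):
    """Group the suffixes of x that are prefixes of pattern by their smallest period."""
    fail = failure_function(pattern)
    groups = {}
    for k in range(len(x)):
        suffix = x[k:]
        if pattern.startswith(suffix):
            period = len(suffix) - fail[len(suffix) - 1]
            groups.setdefault(period, []).append(suffix)
    return groups


print(prefix_suffix_groups("agaga", "agagagc"))
# {2: ['agaga', 'aga'], 1: ['a']}
```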
Checking a Group in Constant Time • Pattern = a g a g a g c (positions 1 to 7), X = a g a g a [Figure: candidate alignments of the pattern against the text that follows X] • Idea: match the period ‘ag’ as far as possible. As soon as (ag)* doesn’t match, check for a ‘c’.
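The idea can be sketched as follows: extend the periodic run in the text as far as it goes, find where the pattern itself stops being periodic, and align the two break points. At most one candidate of the group survives unless the whole pattern has that period. This is a hedged illustration only: the function `check_period_group` and its parameters are hypothetical, positions are 0-based, and the linear scans and the final slice comparison stand in for the constant-time suffix-tree and LCA queries mentioned earlier.

```python
def check_period_group(text, pattern, start, period):
    """Return 0-based matches among the candidates start, start + period, ...

    start is where a run with the given period begins in the text.
    """
    m, n = len(pattern), len(text)
    # r: length of the longest prefix of the pattern that has this period.
    r = period
    while r < m and pattern[r] == pattern[r - period]:
        r += 1
    # e: exclusive end of the maximal run with this period in the text.
    e = min(start + period, n)
    while e < n and text[e] == text[e - period]:
        e += 1
    if r == m:
        # The whole pattern is periodic: every candidate that fits in the run.
        candidates = range(start, e - m + 1, period)
    else:
        # The period break in the pattern must line up with the break in the text.
        c = e - r
        candidates = [c] if c >= start and (c - start) % period == 0 else []
    # Final verification (done with O(1) substring comparisons in the real setting).
    return [c for c in candidates if text[c:c + m] == pattern]


# Running example after the replacement: the run 'agagag' starts at index 7.
print(check_period_group("gagagctagagagcat", "agagagc", 7, 2))  # [7]
```

Index 7 corresponds to location 8 in the slides' 1-indexed numbering.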
Groups • A string cannot have more than O(log m) border groups. • Hence, the time of the algorithm is O(log m). • Intuition: each new group has a new period, which has to be at least double the size of the old period, e.g. aagaagaa.
Even Better... • We check only a constant number of groups. • Choosing these O(1) groups takes O(log log m) time. • Hence, our algorithm takes O(log log m) time per replacement.
Open Problems • Allowing insertions and deletions to the text. • Searching for a set of multiple static patterns.