Joint Advanced Student School 2004: Complexity Analysis of String Algorithms. Sequential Pattern Matching: Analysis of Knuth-Morris-Pratt type algorithms using the Subadditive Ergodic Theorem.
Overview
• Pattern Matching
  • Sequential Algorithms
  • Knuth-Morris-Pratt Algorithm
• Probabilistic tools
  • Subadditive Ergodic Theorem
  • Martingales and Azuma's Inequality
• Analysis of KMP Algorithms
  • Properties of KMP
  • Establishing subadditivity
  • Analysis
Pattern Matching
• Text t = t_1 t_2 ... t_n, pattern p = p_1 p_2 ... p_m.
• Comparison: the text symbol t_l is compared with the pattern symbol p_{l-k+1} when the pattern is aligned at text position k.
• Alignment position (AP): a text position k at which the pattern is aligned, i.e. t_l is compared with p_{l-k+1} for some l. Such a pattern-text comparison is recorded as M(l,k) = 1.
Example: pattern p = abcde aligned at some alignment position (AP) of the text t = xxxxxabxxxabcxxxabcde.
Sequential Algorithms - Definition
• Semi-sequential: (i) the alignment positions are non-decreasing.
• Strongly semi-sequential: (i) and (ii) the comparisons define non-decreasing text positions.
• Sequential: (i) and (iii) a text symbol is compared only if it follows a prefix of the pattern, i.e. t_l is compared with p_{l-k+1} only if t_k ... t_{l-1} = p_1 ... p_{l-k}.
• Strongly sequential: (i), (ii) and (iii).
Example: pattern abcde, text xxxxxabxxxabcxxxabcde.
Example: Naive / brute force algorithm
• Every text position is an alignment position.
• The text is scanned until...
  • the pattern is found - then done.
  • a mismatch occurs - then shift the pattern by one position ("+1") and retry.
• This is a sequential algorithm.
Example: pattern abcde shifted one position at a time along the text xxxxxabxxxabcxxxabcde.
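As a concrete illustration of the brute-force strategy, here is a minimal Python sketch (the name `naive_search` and the comparison counter are mine, not from the slides); it reports the match positions and the number of character comparisons, which is the quantity analysed in the rest of the talk.

```python
def naive_search(text: str, pattern: str):
    """Brute-force matcher: every text position is an alignment position.

    Returns (list of match positions, number of character comparisons).
    """
    n, m = len(text), len(pattern)
    matches, comparisons = [], 0
    for k in range(n - m + 1):          # alignment position k
        j = 0
        while j < m:
            comparisons += 1            # compare t[k+j] with p[j]
            if text[k + j] != pattern[j]:
                break                   # mismatch: shift alignment by one
            j += 1
        if j == m:
            matches.append(k)           # pattern found at alignment k
    return matches, comparisons


if __name__ == "__main__":
    print(naive_search("xxxxxabxxxabcxxxabcde", "abcde"))
```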
Knuth-Morris-Pratt type algorithms (1)
• Idea (Morris-Pratt): disregard alignment positions that are already known not to be followed by a prefix of p.
• Knowledge used:
  • the already processed part of the pattern,
  • pre-processing of p.
• Strongly sequential algorithm.
Example: pattern ababcde shifted by more than one position ("+S") along the text xxxxxabxxxabcxxxabcde.
Knuth-Morris-Pratt type algorithms (2)
• Morris-Pratt: the shift is determined by the matched pattern prefix alone (pattern ababcde over text xxxxxabxxxabcxxxabcde).
• Knuth-Morris-Pratt: KMP additionally uses the mismatching letter, so it also skips alignment positions at which the same mismatch would occur again.
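A minimal Python sketch of a Morris-Pratt-style matcher (identifiers such as `mp_failure` and `mp_search` are mine; the KMP variant would only strengthen the failure function by also taking the mismatching letter into account). It counts comparisons so it can be reused for the experiments mentioned later in the talk.

```python
def mp_failure(pattern: str):
    """Morris-Pratt failure function: fail[j] = length of the longest proper
    border (prefix that is also a suffix) of pattern[:j]."""
    m = len(pattern)
    fail = [0] * (m + 1)
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j + 1] = k
    return fail


def mp_search(text: str, pattern: str):
    """Morris-Pratt matching: alignment positions only move forward and the
    text pointer never goes back, so at most ~2*len(text) comparisons occur."""
    fail = mp_failure(pattern)
    matches, comparisons, j = [], 0, 0
    for i, c in enumerate(text):
        while True:
            comparisons += 1
            if j < len(pattern) and c == pattern[j]:
                j += 1
                break
            if j == 0:
                break
            j = fail[j]              # slide the pattern via the failure function
        if j == len(pattern):
            matches.append(i - j + 1)
            j = fail[j]
    return matches, comparisons


if __name__ == "__main__":
    print(mp_search("xxxxxabxxxabcxxxabcde", "abcde"))
```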
Pattern Matching - Complexity
• Overall complexity: the number of comparisons c_{1,n} made while processing the text t_1 ... t_n (more generally, c_{r,n} denotes the comparisons made on the segment t_r ... t_n).
• If pattern or text is a realization of a random sequence, c_{1,n} becomes a random variable.
• Question: what is the complexity of KMP?
Subadditivity – Deterministic Sequence (Fekete 1923)
• Subadditivity: if x_{n+m} <= x_n + x_m for all n, m, then lim_{n→∞} x_n / n = inf_{n} x_n / n exists.
• Superadditivity: if x_{n+m} >= x_n + x_m for all n, m, then lim_{n→∞} x_n / n = sup_{n} x_n / n exists.
Example: Longest Common Subsequence
• Let L_n be the expected length of the longest common subsequence (LCS) of two random sequences of length n. L_n is superadditive: L_{n+m} >= L_n + L_m.
• Hence lim_{n→∞} L_n / n exists (the value of this limit was conjectured by Steele in 1982).
Example: ababcafbcdabcde and abcdeabcdfabcab have LCS "abcabcdabc" (length 10); splitting them into ababcafb / cdabcde and abcdeabc / dfabcab gives LCSs "abcab" (5) and "dabc" (4), and indeed 10 >= 5 + 4.
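A small Python sketch (all names are mine) that estimates L_n / n by simulation for random binary strings; by superadditivity and Fekete's lemma this ratio tends to the constant mentioned above, although a short simulation only hints at the trend.

```python
import random


def lcs_length(a: str, b: str) -> int:
    """Classic O(|a|*|b|) dynamic program for the longest common subsequence."""
    dp = [0] * (len(b) + 1)
    for ch in a:
        prev = 0
        for j, bj in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if ch == bj else max(dp[j], dp[j - 1])
            prev = cur
    return dp[len(b)]


def avg_lcs(n: int, trials: int = 200, alphabet: str = "ab") -> float:
    """Monte Carlo estimate of E[L_n] for two random strings of length n."""
    total = 0
    for _ in range(trials):
        a = "".join(random.choice(alphabet) for _ in range(n))
        b = "".join(random.choice(alphabet) for _ in range(n))
        total += lcs_length(a, b)
    return total / trials


if __name__ == "__main__":
    # E[L_n]/n tends to a constant (superadditivity + Fekete's lemma).
    for n in (10, 20, 40, 80):
        print(n, round(avg_lcs(n) / n, 3))
```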
Subadditivity – "Almost subadditive" (De Bruijn and Erdős 1952)
• Let (A_n) be a positive and non-decreasing sequence with Σ_n A_n / n² < ∞.
• "Almost subadditive": x_{n+m} <= x_n + x_m + A_{n+m}.
• Then lim_{n→∞} x_n / n still exists.
Subadditive Ergodic Theorem (Kingman 1976, Liggett 1985)
Let the doubly indexed process (X_{m,n})_{0 <= m < n} satisfy:
• X_{0,n} <= X_{0,m} + X_{m,n} (subadditivity),
• {X_{(n-1)k, nk} : n >= 1} is a stationary sequence for every k,
• the distribution of {X_{m,m+k} : k >= 1} does not depend on m,
• E[X_{0,1}^+] < ∞ and E[X_{0,n}] >= -c·n for some constant c.
Then lim_{n→∞} X_{0,n} / n exists almost surely and in the mean; if the stationary sequences above are ergodic, the limit is a constant.
Almost Subadditive Ergodic Theorem (Derriennic 1983)
• The subadditivity can be relaxed to X_{0,n} <= X_{0,m} + X_{m,n} + A_n, with an error term satisfying lim_{n→∞} E[A_n] / n = 0.
• Then, too: lim_{n→∞} X_{0,n} / n exists almost surely.
Martingales
• A sequence (X_n) is a martingale with respect to the filtration (F_n) if for all n: X_n is F_n-measurable, E[|X_n|] < ∞, and E[X_{n+1} | F_n] = X_n.
• E[X_{n+1} | F_n] defines a random variable depending on the knowledge contained in F_n.
Martingale Differences
• The martingale difference is defined as D_n = X_n - X_{n-1}, so that X_n = X_0 + Σ_{i=1}^{n} D_i.
• Observe: E[D_n | F_{n-1}] = 0.
Azuma's Inequality (1)
• Let X_i = E[X | F_i] be a martingale, where X = X(t_1, ..., t_n) depends on the underlying random symbols t_1, ..., t_n and F_i = σ(t_1, ..., t_i).
• Define the martingale difference as D_i = E[X | F_i] - E[X | F_{i-1}] (the mean of the same element, but depending on different knowledge).
• Observe: Σ_{i=1}^{n} D_i = X_n - X_0 = X - E[X] (the deviation from the mean).
Hoeffding's Inequality
• Let (X_n) be a martingale.
• Let there exist constants c_i such that |X_i - X_{i-1}| <= c_i for all i.
• Then: P( |X_n - X_0| >= x ) <= 2 exp( -x² / (2 Σ_{i=1}^{n} c_i²) ).
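A quick Python check of the inequality on the simplest bounded-difference martingale, a ±1 random walk with c_i = 1 (the helper `azuma_demo` and its parameters are my own choices); the empirical tail should sit below the bound 2·exp(-x²/(2n)).

```python
import math
import random


def azuma_demo(n: int = 200, x: float = 30.0, trials: int = 20000) -> None:
    """Empirical tail of a +/-1 random walk (a martingale with |X_i - X_{i-1}| = 1)
    versus the Azuma-Hoeffding bound 2*exp(-x^2 / (2*n))."""
    exceed = 0
    for _ in range(trials):
        s = sum(random.choice((-1, 1)) for _ in range(n))
        if abs(s) >= x:
            exceed += 1
    empirical = exceed / trials
    bound = 2 * math.exp(-x * x / (2 * n))
    print(f"P(|X_n - X_0| >= {x}) ~ {empirical:.4f}  <=  bound {bound:.4f}")


if __name__ == "__main__":
    azuma_demo()
```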
Azuma's Inequality (2)
• Summary: if the differences D_i are bounded, we know how to assess the deviation from the mean.
• So now we need a bound on D_i.
• Trick: let t_i' be an independent copy of the symbol t_i.
• Then: D_i = E[ X(t_1, ..., t_i, ..., t_n) - X(t_1, ..., t_i', ..., t_n) | F_i ].
Azuma's Inequality (3)
• Hence: |D_i| <= sup | X(t_1, ..., t_i, ..., t_n) - X(t_1, ..., t_i', ..., t_n) |.
• And we can postulate: there are constants c_i with | X(t_1, ..., t_i, ..., t_n) - X(t_1, ..., t_i', ..., t_n) | <= c_i.
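The step from the independent copy to a bound on D_i can be written out as follows (a sketch in my own notation: X = X(t_1, ..., t_n) with independent symbols t_i, F_i = σ(t_1, ..., t_i), and t_i' an independent copy of t_i, independent of everything else).

```latex
% Independent-copy representation of the martingale difference (sketch).
\begin{align*}
  D_i &= E[X \mid \mathcal{F}_i] - E[X \mid \mathcal{F}_{i-1}] \\
      &= E\bigl[\, X(t_1,\dots,t_i,\dots,t_n) - X(t_1,\dots,t_i',\dots,t_n) \,\bigm|\, \mathcal{F}_i \,\bigr],
\end{align*}
% because replacing t_i by the independent copy t_i' and then conditioning on
% \mathcal{F}_i reproduces E[X \mid \mathcal{F}_{i-1}] (the copy carries no
% information about t_i).  Consequently
\[
  |D_i| \;\le\; \sup \bigl| X(t_1,\dots,t_i,\dots,t_n) - X(t_1,\dots,t_i',\dots,t_n) \bigr| ,
\]
% so any bound on the effect of changing a single input symbol immediately
% bounds the martingale differences, which is what Azuma's inequality needs.
```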
Azuma's Inequality (4)
• Let X_i = E[X | F_i] be a martingale as above.
• If there exist constants c_i such that | X(t_1, ..., t_i, ..., t_n) - X(t_1, ..., t_i', ..., t_n) | <= c_i, where t_i' is an independent copy of t_i,
• then: P( |X - E[X]| >= x ) <= 2 exp( -x² / (2 Σ_{i=1}^{n} c_i²) ).
KMP: Unavoidable alignment positions
• A position k in the text is called an unavoidable AP if for any r, l with r <= k <= l it is an AP when the algorithm is run on the text segment t_r ... t_l.
• KMP-like algorithms all have the same set of unavoidable alignment positions.
Example: pattern abcde, text xxxxxabxxxabcxxxabcde.
Pattern Matching: l-convergence
• An algorithm is l-convergent if there exists an increasing sequence k_1 < k_2 < ... of unavoidable alignment positions satisfying k_{i+1} - k_i <= l.
• l-convergence bounds the maximum size of the "jumps" an algorithm can make.
KMP: Establishing m-convergence
• Let k be an alignment position (AP).
• Define k' = min { unavoidable AP > k }.
• Hence k' - k <= m, and so KMP-like algorithms are m-convergent.
KMP: Establishing subadditivity (1)
• If the number of comparisons c_{1,n} is subadditive, we can prove linear complexity of KMP-like algorithms.
• We have to show that c_{1,n} is (almost) subadditive.
• Approach: an l-convergent sequential algorithm satisfies c_{1,n} <= c_{1,r} + c_{r,n} + d_{l,m} for every 1 <= r <= n, where the overhead d_{l,m} depends only on l and m, not on n.
KMP: Establishing subadditivity (2)
• Proof:
• Let s be the smallest unavoidable AP greater than r (by l-convergence it exists and s - r <= l).
• We split the comparisons counted by c_{1,n} into those made with an AP before s and those made with an AP at or after s, and compare them with c_{1,r} and c_{r,n}.
KMP: Establishing subadditivity (3)
• Comparisons done after position r but with an AP before r: their number is bounded in terms of m alone (such an AP lies at most m - 1 positions before r).
• Comparisons with an AP between r and s: their number is bounded in terms of l and m alone.
• No more than m comparisons can be saved at s.
(Slide figure: the comparisons are grouped into regions S1, S2 according to whether they contribute to c_{1,r} only or to both c_{1,r} and c_{r,n}.)
KMP: Establishing subadditivity (4)
• Comparisons with an AP between r and s: bounded in terms of l and m alone.
• No more than m comparisons can be saved at s.
(Slide figure: region S3 of comparisons, again split according to whether they contribute to c_{1,r} only or to both c_{1,r} and c_{r,n}.)
KMP: Establishing subadditivity (5)
• So we are able to bound: c_{1,n} <= c_{1,r} + c_{r,n} + d_{l,m}, with d_{l,m} independent of n.
• We have shown that c_{1,n} is (almost) subadditive.
• Now we are able to apply the Subadditive Ergodic Theorem.
KMP: Different Modeling Assumptions
• Deterministic model: text and pattern are non-random.
• Semi-random model: the text is a realization of a stationary and ergodic sequence, the pattern is given.
• Stationary model: both text and pattern are realizations of stationary and ergodic sequences.
KMP: Applying the Subadditive Ergodic Theorem
• We have shown that c_{1,n} is (almost) subadditive.
• Deterministic model: taking the worst case over texts, Fekete's lemma and its De Bruijn-Erdős extension give the existence of a worst-case linearity constant lim_{n→∞} c_{1,n} / n.
• Semi-random model: c_{1,n} / n converges almost surely to a constant (Subadditive Ergodic Theorem).
• Stationary model: c_{1,n} / n again converges almost surely to a constant.
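The existence of the linearity constant can be illustrated empirically with a short Python sketch (all identifiers are mine; a Morris-Pratt comparison counter is inlined so the snippet runs on its own): for i.i.d. random text and a fixed pattern, the ratio c_{1,n}/n settles around a constant as n grows.

```python
import random


def mp_comparisons(text: str, pattern: str) -> int:
    """Number of symbol comparisons made by a Morris-Pratt-style matcher."""
    m = len(pattern)
    fail = [0] * (m + 1)
    k = 0
    for j in range(1, m):                      # failure function of the pattern
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j + 1] = k
    comparisons, j = 0, 0
    for c in text:                             # scan the text left to right
        while True:
            comparisons += 1
            if j < m and c == pattern[j]:
                j += 1
                break
            if j == 0:
                break
            j = fail[j]
        if j == m:
            j = fail[j]
    return comparisons


if __name__ == "__main__":
    random.seed(1)
    pattern = "ababb"
    for n in (1_000, 10_000, 100_000):
        text = "".join(random.choice("ab") for _ in range(n))
        print(n, mp_comparisons(text, pattern) / n)   # ratio should stabilise
```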
KMP: Applying Azuma's Inequality
• c_{1,n} satisfies a bounded-difference condition: | c_{1,n}(t_1, ..., t_i, ..., t_n) - c_{1,n}(t_1, ..., t_i', ..., t_n) | <= d, where t_i' is an independent copy of t_i and the bound d does not depend on n.
• So, using Azuma's Inequality: P( |c_{1,n} - E[c_{1,n}]| >= x ) <= 2 exp( -x² / (2 n d²) ).
• Hence c_{1,n} is concentrated around its mean.
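To see why this means concentration "around the mean" in a relative sense, one can combine the bound with the observation that a sequential algorithm compares every text symbol at least once, so E[c_{1,n}] >= n (here d is my shorthand for the single-symbol bound from the bullet above).

```latex
% Relative concentration, assuming |c_{1,n}(t) - c_{1,n}(t^{(i)})| <= d for all i,
% where t^{(i)} replaces t_i by an independent copy, and E[c_{1,n}] >= n
% (every text symbol is compared at least once).
\[
  P\bigl( |c_{1,n} - E[c_{1,n}]| \ge \varepsilon\, E[c_{1,n}] \bigr)
  \;\le\; 2 \exp\!\left( - \frac{\varepsilon^2\, E[c_{1,n}]^2}{2 n d^2} \right)
  \;\le\; 2 \exp\!\left( - \frac{\varepsilon^2\, n}{2 d^2} \right),
\]
% i.e. relative deviations of any fixed size \varepsilon become exponentially
% unlikely as the text length n grows.
```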
Conclusion
• Using the Subadditive Ergodic Theorem we can show that a linearity constant exists for the worst and the average case, respectively: KMP has linear complexity.
• The Subadditive Ergodic Theorem proves the existence of this constant but says nothing about how to compute it.
• Using Azuma's Inequality we can show that the number of comparisons is well concentrated around its mean.