Pattern Matching Algorithms: An Overview
Shoshana Neuburger, The Graduate Center, CUNY
9/15/2009
Overview • Pattern Matching in 1D • Dictionary Matching • Pattern Matching in 2D • Indexing • Suffix Tree • Suffix Array • Research Directions
What is Pattern Matching? Given a pattern and text, find the pattern in the text.
What is Pattern Matching? • Σ is an alphabet. • Input: Text T = t1 t2 … tn and pattern P = p1 p2 … pm, where all characters in pattern and text belong to Σ. • Output: All i such that ti ti+1 … ti+m−1 = p1 p2 … pm, i.e., every position of T at which an occurrence of P begins.
Pattern Matching - Example Input: Σ = {a,g,c,t}, P = cagc, T = acagcatcagcagctagcat • Output: {2, 8, 11} (occurrences of P begin at positions 2, 8, and 11 of T).
Pattern Matching Algorithms • Naïve Approach • Compare pattern to text at each location. • O(mn) time. • More efficient algorithms utilize information from previous comparisons.
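The naïve approach can be sketched in a few lines; a minimal version using 1-based positions to match the slides' convention (the function name is illustrative):

```python
def naive_match(pattern, text):
    """Compare the pattern to the text at every position: O(mn) worst case."""
    m, n = len(pattern), len(text)
    # Report 1-based starting positions of every occurrence.
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]

print(naive_match("cagc", "acagcatcagcagctagcat"))  # [2, 8, 11]
```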
Pattern Matching Algorithms • Linear time methods have two stages • preprocess pattern in O(m) time and space. • scan text in O(n) time and space. • Knuth, Morris, Pratt (1977): automata method • Boyer, Moore (1977): can be sublinear
KMP Automaton P = ababcb
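The KMP method can be sketched with the standard failure-table formulation rather than the automaton diagram shown on the slide (function names are illustrative; positions are 1-based):

```python
def failure_table(p):
    """fail[i] = length of the longest proper border of p[:i+1]."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]      # fall back to the next-longest border
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(p, t):
    """Scan the text once, never re-reading a text character: O(n + m)."""
    fail, k, out = failure_table(p), 0, []
    for i, c in enumerate(t):
        while k > 0 and c != p[k]:
            k = fail[k - 1]
        if c == p[k]:
            k += 1
        if k == len(p):
            out.append(i - len(p) + 2)   # 1-based start of the occurrence
            k = fail[k - 1]
    return out

print(failure_table("ababcb"))  # [0, 0, 1, 2, 0, 0]
```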
Dictionary Matching • Σ is an alphabet. • Input: Text T = t1 t2 … tn and dictionary of patterns D = {P1, P2, …, Pk}. All characters in patterns and text belong to Σ. • Output: All i, j such that ti ti+1 … ti+mj−1 = Pj, where mj = |Pj|.
Dictionary Matching Algorithms • Naïve Approach: • Use an efficient pattern matching algorithm for each pattern in the dictionary. • O(kn) time. More efficient algorithms process text once.
AC Automaton • Aho and Corasick extended the KMP automaton to dictionary matching. • Preprocessing time: O(d), where d is the total length of the patterns in D. • Matching time: O(n log |Σ| + occ), where occ is the number of occurrences reported. Independent of dictionary size!
AC Automaton D = {ab, ba, bab, babb, bb}
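A minimal Aho-Corasick sketch, with the trie kept as parallel arrays of edge dictionaries (a real implementation would branch over Σ directly; function names are illustrative):

```python
from collections import deque

def build_ac(patterns):
    """Build goto/fail/output functions for the dictionary."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    # BFS to set failure links; depth-1 nodes keep fail = root.
    q = deque(goto[0].values())
    while q:
        s = q.popleft()
        for c, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][c] if (c in goto[f] and goto[f][c] != t) else 0
            out[t] |= out[fail[t]]      # inherit outputs along fail links
    return goto, fail, out

def ac_search(patterns, text):
    """Single pass over the text, reporting (1-based start, pattern)."""
    goto, fail, out = build_ac(patterns)
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 2, p))
    return hits

print(sorted(ac_search(["ab", "ba", "bab", "babb", "bb"], "babb")))
# [(1, 'ba'), (1, 'bab'), (1, 'babb'), (2, 'ab'), (3, 'bb')]
```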
Dictionary Matching • The KMP automaton does not depend on alphabet size, while the AC automaton does, due to branching at each node. • Dori, Landau (2006): the AC automaton can be built in linear time for integer alphabets. • Breslauer (1995) eliminates the log factor in the text scanning stage.
Periodicity A crucial task in the preprocessing stage of most pattern matching algorithms is computing periodicity. It takes many forms: • failure table • witnesses
Periodicity • A periodic pattern can be superimposed on itself without mismatch before its midpoint. • Why is periodicity useful? Can quickly eliminate many candidates for pattern occurrence.
Periodicity Definition: • S is periodic if S = u^j u′, where u′ is a proper prefix of u and j ≥ 2. • Equivalently, S is periodic if its longest prefix that is also a suffix (its longest border) has length at least |S|/2. • The shortest period corresponds to the longest border: |period| = |S| − |border|.
Periodicity - Example S = abcabcabcab, |S| = 11 • Longest border of S: b = abcabcab; |b| = 8 ≥ |S|/2, so S is periodic. • Shortest period of S: u = abc; |u| = 3 ≤ |S|/2, so S is periodic.
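The longest border, and hence the shortest period, falls out of the KMP failure table; a minimal sketch (the function name is illustrative):

```python
def shortest_period(s):
    """Shortest period of s via its longest border (KMP failure table)."""
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    border = fail[-1]           # longest proper prefix that is also a suffix
    return len(s) - border      # |period| = |S| - |border|

s = "abcabcabcab"
p = shortest_period(s)
print(p, p <= len(s) // 2)     # 3 True, so S is periodic
```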
Witnesses Popular paradigm in pattern matching: • find consistent candidates • verify candidates consistent candidates → verification is linear
Witnesses • Vishkin introduced the duel to choose between two candidates by checking the value of a witness. • Alphabet-independent method.
Witnesses Preprocess pattern: • Compute a witness for each location of self-overlap. • Size of witness table: the length of the period if P is periodic; m/2 otherwise.
Witnesses • WIT[i] = any k such that P[k] ≠ P[k−i+1]. • WIT[i] = 0 if there is no such k. k is a witness against i being a period of P. (figure: an example pattern and its witness table)
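A quadratic-time sketch of witness-table computation under the slide's convention (linear-time constructions exist; the function name is illustrative):

```python
def witness_table(p):
    """wit[i] (1-based) = some k with P[k] != P[k-i+1], or 0 if the
    self-overlap at shift i-1 is mismatch-free."""
    m = len(p)
    wit = [0] * (m + 1)            # wit[0] unused; wit[1] = 0 (trivial shift)
    for i in range(2, m + 1):
        for k in range(i, m + 1):  # compare P[k] with P[k-i+1], 1-based
            if p[k - 1] != p[k - i]:
                wit[i] = k
                break
    return wit

print(witness_table("abab"))  # [0, 0, 2, 0, 4]: wit[3] = 0, so 2 is a period
```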
Witnesses Let j>i. Candidates i and j are consistent if • they are sufficiently far from each other OR • WIT[j-i]=0.
Duel Scan text: • If a pair of candidates is close and inconsistent, perform a duel to eliminate one (or both). • Sufficient to identify pairwise consistent candidates: consistency of positions is transitive. (figure: a duel between candidates i and j, decided by the text character at a witness position)
2D Pattern Matching • Σ is an alphabet. • Input: Text T[1…n, 1…n] and pattern P[1…m, 1…m]. • Output: All (i, j) such that T[i+k−1, j+l−1] = P[k, l] for all 1 ≤ k, l ≤ m.
2D Pattern Matching - Example Input: Σ = {A,B}; pattern P and text T over Σ (shown in the slide figure). Output: {(1,4), (2,2), (4,3)}.
Bird / Baker • First linear-time 2D pattern matching algorithm. • View each pattern row as a metacharacter to linearize problem. • Convert 2D pattern matching to 1D.
Bird / Baker Preprocess pattern: • Name rows of pattern using AC automaton. • Using names, pattern has 1D representation. • Construct KMP automaton of pattern. Identical rows receive identical names.
Bird / Baker Scan text: • Name positions of text that match a row of pattern, using AC automaton within each row. • Run KMP on named columns of text. Since the 1D names are unique, only one name can be given to a text location.
Bird / Baker - Example (figures: the naming of pattern rows and the KMP scan of named text columns, worked on a small example)
Bird / Baker • Complexity of the Bird / Baker algorithm: O((n² + m²) log |Σ|) time and O(m²) space. • Alphabet-dependent. • Real-time, since it scans each text character once. • Can be used for dictionary matching: replace KMP with the AC automaton.
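The Bird / Baker reduction can be sketched compactly. Row naming and the column scan are done naively here for clarity; the real algorithm names rows with an AC automaton and scans columns with KMP (function names and the 1-based output convention are illustrative):

```python
def bird_baker(pattern, text):
    """2D matching by linearization: name pattern rows, then do 1D
    matching of the name sequence down each column of the text."""
    m, n = len(pattern), len(text)
    names = {}
    # Identical pattern rows receive identical names.
    pat1d = [names.setdefault(row, len(names) + 1) for row in pattern]
    # grid[r][c] = name of the pattern row equal to text[r][c:c+m], else 0.
    grid = [[names.get(text[r][c:c + m], 0) for c in range(n - m + 1)]
            for r in range(n)]
    hits = []
    for c in range(n - m + 1):
        col = [grid[r][c] for r in range(n)]
        for r in range(n - m + 1):          # naive 1D scan down the column
            if col[r:r + m] == pat1d:
                hits.append((r + 1, c + 1)) # 1-based (row, col)
    return hits

# Pattern ["ab", "ab"] occurs in the 3x3 text at row 2, column 1.
print(bird_baker(["ab", "ab"], ["aab", "abb", "abx"]))  # [(2, 1)]
```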
2D Witnesses • Amir et al.: a 2D witness table can be used for linear-time, linear-space, alphabet-independent 2D matching. • The order of duels is significant. • Duels are performed in 2 waves over the text.
Indexing • Index text • Suffix Tree • Suffix Array • Find pattern in O(m) time • Useful paradigm when text will be searched for several patterns.
Suffix Trie T = banana$
Suffixes: banana$, anana$, nana$, ana$, na$, a$, $
• One leaf per suffix.
• An edge represents one character.
• Concatenation of edge-labels on the path from the root to leaf i spells the suffix that starts at position i.
(figure: suffix trie of banana$, with leaves suf1 … suf7)
Suffix Tree T = banana$
• Compact representation of the trie.
• A node with one child is merged with its parent.
• Up to n internal nodes.
• O(n) space by using indices to label edges, e.g., the label [3,4] stands for t3 t4 = na.
(figure: suffix tree of banana$)
Suffix Tree Construction • Naïve approach: O(n²) time. • Linear-time algorithms: Weiner (1973), McCreight (1976), Ukkonen (1995, online), Farach (1997, integer alphabets).
Suffix Tree Construction • Linear-time suffix tree construction algorithms rely on suffix links to facilitate traversal of tree. • A suffix link is a pointer from a node labeled xS to a node labeled S; x is a character and S a possibly empty substring. • Alphabet-dependent suffix links point from a node labeled S to a node labeled xS, for each character x.
Index of Patterns • Can answer Lowest Common Ancestor (LCA) queries in constant time if preprocess tree accordingly. • In suffix tree, LCA corresponds to Longest Common Prefix (LCP) of strings represented by leaves.
Index of Patterns To index several patterns: • Concatenate patterns with unique characters separating them and build suffix tree. Problem: inserts meaningless suffixes that span several patterns. OR • Build generalized suffix tree – single structure for suffixes of individual patterns. Can be constructed with Ukkonen’s algorithm.
Suffix Array • The Suffix Array stores lexicographic order of suffixes. • More space efficient than suffix tree. • Can locate all occurrences of a substring by binary search. • With Longest Common Prefix (LCP) array can perform even more efficient searches. • LCP array stores longest common prefix between two adjacent suffixes in suffix array.
Suffix Array - Example T = mississippi (sort the suffixes alphabetically)
Index  Suffix         SA   Sorted suffix   LCP
1      mississippi    11   i               0
2      ississippi      8   ippi            1
3      ssissippi       5   issippi         1
4      sissippi        2   ississippi      4
5      issippi         1   mississippi     0
6      ssippi         10   pi              0
7      sippi           9   ppi             1
8      ippi            7   sippi           0
9      ppi             4   sissippi        2
10     pi              6   ssippi          1
11     i               3   ssissippi       3
Search in Suffix Array O(m log n). Idea: two binary searches: • search for the leftmost position of X • search for the rightmost position of X In between are all suffixes that begin with X. With the LCP array: O(m + log n) search.
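The two binary searches can be sketched as follows, with a naïve O(n² log n) construction for brevity; the right-boundary sentinel assumes "\uffff" sorts after every text character (function names are illustrative):

```python
from bisect import bisect_left, bisect_right

def suffix_array(t):
    """Naive construction: sort the 1-based suffix start positions."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def find_occurrences(t, sa, x):
    """Two binary searches bound the run of suffixes that begin with x."""
    suffixes = [t[i - 1:] for i in sa]        # materialized for clarity
    lo = bisect_left(suffixes, x)
    hi = bisect_right(suffixes, x + "\uffff") # sentinel past any prefix x...
    return sorted(sa[lo:hi])

t = "mississippi"
sa = suffix_array(t)
print(sa)                              # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(find_occurrences(t, sa, "ssi"))  # [3, 6]
```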
Suffix Array Construction • Naïve approach: O(n²) time. • Indirect construction: preorder traversal of the suffix tree, with LCA queries for the LCP array. Problem: does not achieve better space efficiency than the suffix tree itself.
Suffix Array Construction • Direct linear-time construction algorithms: Kärkkäinen and Sanders (skew), Ko and Aluru, Kim et al. • LCP array construction: Kasai et al., or via range-minima queries.
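LCP construction from a finished suffix array can be done in O(n) with Kasai's algorithm, which reuses match lengths between consecutive text suffixes; a sketch using the 1-based positions of the mississippi example (the function name is illustrative):

```python
def lcp_array(t, sa):
    """Kasai's O(n) LCP construction. sa holds 1-based suffix starts;
    lcp[i] = LCP of the suffixes at sa[i-1] and sa[i], with lcp[0] = 0."""
    n = len(t)
    rank = [0] * (n + 1)
    for i, s in enumerate(sa):
        rank[s] = i
    lcp, h = [0] * n, 0
    for s in range(1, n + 1):          # suffix starting at text position s
        if rank[s] > 0:
            j = sa[rank[s] - 1]        # predecessor in sorted order
            while s + h <= n and j + h <= n and t[s + h - 1] == t[j + h - 1]:
                h += 1
            lcp[rank[s]] = h
            if h:
                h -= 1                 # match length drops by at most 1
        else:
            h = 0
    return lcp

print(lcp_array("mississippi", [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]))
# [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```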
Compressed Indices Suffix Tree: O(n) words = O(n log n) bits Compressed suffix tree • Grossi and Vitter (2000) • O(n) space. • Sadakane (2007) • O(n log |Σ|) space. • Supports all suffix tree operations efficiently. • Slowdown of only polylog(n).
Compressed Indices Suffix array is an array of n indices, which is stored in: O(n) words = O(n log n) bits Compressed Suffix Array (CSA) Grossi and Vitter (2000) • O(n log |Σ|) bits • access time increased from O(1) to O(log^ε n) Sadakane (2003) • Pattern matching as efficient as in the uncompressed SA. • O(n H0) bits • Compressed self-index
Compressed Indices FM-index • Ferragina and Manzini (2005) • Self-indexing data structure • First compressed suffix array whose size is bounded in terms of the high-order empirical entropy • Size relative to the compressed text length. • Improved by Navarro and Mäkinen (2007)
Dynamic Suffix Tree Dynamic Suffix Tree • Choi and Lam (1997) • Strings can be inserted or deleted efficiently. • Update time proportional to string inserted/deleted. • No edges labeled by a deleted string. • Two-way pointer for each edge, which can be done in space linear in the size of the tree.
Dynamic Suffix Array • Recent work by Salson et al. • Can update the suffix array after construction when the text changes. • More efficient in practice than rebuilding the suffix array. • Open problems: • Worst case remains O(n log n). • No online algorithm yet.