Parallel String Matching Algorithm(s) Using Associative Processors

Original work by Mary Esenwein and Dr. Johnnie Baker. Presented by Shannon Steinfadt, April 18, 2007.

String Matching Problem
• Also known as pattern matching or string searching
• Useful in many applications, such as text editing, information retrieval, DNA analysis, and homeland security

What are we doing?
• Given a pattern and some text, find out whether the pattern is in the text
• Example: is pattern AB in the text ABAA? If so, where?

What’s the notation?
• P is a pattern string of length m
• T is a text string of length n, usually n ≥ m

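Goal of String Matching
• To find all occurrences of a pattern string in the text string
• Locate all positions i in T such that T[i+j-1] = P[j] for all j, 1 ≤ j ≤ m
• Why use P[j], and how does it relate to T[i+j-1]? With the pattern aligned at text position i, pattern character P[j] sits over text character T[i+j-1]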
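The definition above can be checked directly with a short sequential sketch in Python. This is not part of the original slides, and the function name `find_occurrences` is only an illustrative choice; it applies the condition T[i+j-1] = P[j] literally, using the slides' 1-based indices.

```python
def find_occurrences(text, pattern):
    """Return every 1-based position i with T[i+j-1] == P[j] for all 1 <= j <= m."""
    n, m = len(text), len(pattern)
    positions = []
    for i in range(1, n - m + 2):  # candidate start positions 1 .. n-m+1
        # Python strings are 0-based, so T[i+j-1] is text[i+j-2] and P[j] is pattern[j-1].
        if all(text[i + j - 2] == pattern[j - 1] for j in range(1, m + 1)):
            positions.append(i)
    return positions


print(find_occurrences("ABAA", "AB"))          # [1]
print(find_occurrences("ABBBABBBABA", "BBA"))  # [3, 7]
```

This is the naïve O(mn) approach; it serves only as a reference against which the parallel algorithms can be compared.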

Pattern Variations
• An exact pattern
• A “Don’t Care” character (*) in the pattern
  • Gives flexibility in matching
  • * indicates character(s) of the text that are irrelevant to the matching process

Characteristics of the General “Don’t Care” Character (*)
The * can stand for:
• A single character of text
• Multiple consecutive text characters
• No characters
• A combination of the above three
Example:
• Pattern AB*CD could match ABBCD, ABBBBBCD, or ABCD (* matches no characters)
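To make the semantics of the general “don’t care” concrete, here is a small Python sketch. It is not the ASC algorithm: it simply translates the pattern into a regular expression using the standard `re` module, and the helper name `vldc_to_regex` is an assumption made for illustration.

```python
import re


def vldc_to_regex(pattern):
    """Translate a pattern with '*' don't-cares into a compiled regular expression."""
    # Escape each literal segment, then let '.*' stand for zero or more arbitrary characters.
    parts = [re.escape(segment) for segment in pattern.split("*")]
    return re.compile(".*".join(parts), re.DOTALL)


rx = vldc_to_regex("AB*CD")
for candidate in ["ABBCD", "ABBBBBCD", "ABCD", "ABC"]:
    print(candidate, bool(rx.fullmatch(candidate)))
```

The first three candidates are the examples from the slide and are accepted; "ABC" is rejected because the segment CD never appears.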

String Matching using ASC
• Three parallel algorithms using associative computing (on a 1-D mesh)
  • String matching for exact match
  • String matching with fixed length “don’t care”, i.e., exactly 1 character
  • String matching with variable length “don’t care”: a “don’t care” can have any length or be null

ASC Exact Match Algorithm
    patt_counter = 0;   /* initial value assumed; not shown on the slide */
    for (j = patt_length - 1; j >= 0; j--) {
        Responders are text[$] == patt_string[j] and counter[$] == patt_counter;
        Responders add 1 to counter[$] and store result in counter[$] of preceding cell;
        patt_counter++;
    }
    /* When pattern has been processed */
    Responders are counter[$] == patt_length;
    Responders set match[$] = 1 in next cell;
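The pseudocode can be simulated sequentially to see how the counters propagate. The Python sketch below is an illustration, not the authors' implementation: it assumes counters start at 0, that cell i holds T[i], and that one blank cell 0 precedes the text so the "preceding cell" always exists. Each list comprehension stands in for one parallel responder search.

```python
def asc_exact_match(text, pattern):
    """Sequentially simulate the ASC exact-match pseudocode; returns 1-based match positions."""
    n, m = len(text), len(pattern)
    cells = [None] + list(text)   # cell 0 is an assumed blank pad; cell i holds T[i]
    counter = [0] * (n + 1)
    match = [0] * (n + 1)
    patt_counter = 0
    for j in range(m - 1, -1, -1):            # scan the pattern right to left
        # Select all responders first, then write, to mimic one lockstep parallel step.
        responders = [c for c in range(1, n + 1)
                      if cells[c] == pattern[j] and counter[c] == patt_counter]
        for c in responders:
            counter[c - 1] = patt_counter + 1  # store the result in the preceding cell
        patt_counter += 1
    for c in range(n):                         # when the pattern has been processed
        if counter[c] == m:
            match[c + 1] = 1                   # set match in the next cell
    return [c for c in range(1, n + 1) if match[c]]


print(asc_exact_match("ABBBABBBABA", "BBA"))  # [3, 7]
```

The outer loop runs m times and each iteration is a constant number of parallel operations on the associative machine, which is where the O(m) time bound on O(n) PEs comes from.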

[Figure: trace of the exact match algorithm. Pattern: BBA; text: ABBBABBBABA. Columns: Text[$], Match[$], Counter[$]; scalars: patt_counter, patt_length. Notation: m = pattern length, n = text length, j = pattern index, i = text index.]

[Figure: final state of the exact match algorithm. Pattern: BBA; text: ABBBABBBABA. Columns: Text[$], Match[$], Counter[$].]

Algorithm for unit length "don't cares" using ASC
    patt_counter = 0;   /* initial value assumed; not shown on the slide */
    for (j = patt_length - 1; j >= 0; j--) {
        if (pattern[j] == '*')
            Responders are counter[$] == patt_counter;
        else   /* pattern[j] is not the "don't care" character */
            Responders are text[$] == pattern[j] and counter[$] == patt_counter;
        If no Responders are detected, exit;
        Responders add 1 to counter[$] and store result in counter[$] of preceding cell;
        patt_counter++;
    }
    /* When pattern has been processed */
    Responders are counter[$] == patt_length;
    Responders set match[$] = 1 in next cell;
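A sequential Python sketch of the unit length "don't care" variant follows. As before, it is only an illustration under assumptions not stated on the slides: counters start at 0, cell i holds T[i], and a blank cell 0 precedes the text. The only changes from exact matching are the relaxed responder test for '*' and the early exit.

```python
def asc_unit_dont_care(text, pattern, wildcard="*"):
    """Simulate the unit-length don't-care pseudocode; returns 1-based match positions."""
    n, m = len(text), len(pattern)
    cells = [None] + list(text)   # cell 0 is an assumed blank pad; cell i holds T[i]
    counter = [0] * (n + 1)
    patt_counter = 0
    for j in range(m - 1, -1, -1):
        # A '*' drops the text comparison, so any cell with the right counter responds.
        responders = [c for c in range(1, n + 1)
                      if counter[c] == patt_counter
                      and (pattern[j] == wildcard or cells[c] == pattern[j])]
        if not responders:
            return []                          # no responders: no match is possible
        for c in responders:
            counter[c - 1] = patt_counter + 1  # store the result in the preceding cell
        patt_counter += 1
    # A counter equal to the pattern length marks a match starting in the next cell.
    return [c + 1 for c in range(n) if counter[c] == m]


print(asc_unit_dont_care("ABBBABBBABA", "B*A"))  # [3, 7]
```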


[Figure: trace of the unit length "don't care" algorithm. Pattern: B*A; text: ABBBABBBABA. Columns: Text[$], Match[$], Counter[$]; scalars: patt_counter, patt_length. Notation: m = pattern length, n = text length, j = pattern index, i = text index.]

[Figure: final state of the unit length "don't care" algorithm. Pattern: B*A; text: ABBBABBBABA. Columns: Text[$], Match[$], Counter[$].]

VLDC (Variable Length “Don’t Care”) Algorithm
• Works on each “segment” of the pattern, where segments are separated by the * character
  • AB*BB*A has three segments
• Consecutive * characters (**) are not necessary and not allowed
• This VLDC algorithm is unique in providing the information needed to find all continuation points of all matches following each “*”

VLDC Algorithm Using ASC
    int patt_length = m;
    int maxcell = n + 2;
    int patt_counter = 0, k = 0;   /* initial values assumed; not shown on the slide */
    /* Special handling for '*' at end of pattern */
    if (pattern[m-1] == '*') {
        Responders are cell index > 1;
        Responders set segment$[0] = 1;
        patt_counter = 1;
        k = 1;   /* Reset initial segment index */
    }
    while ((patt_length -= patt_counter) > 0 && maxcell > 0) {
        patt_counter = 0;
        for (I = patt_length - 1; I >= 0 && pattern[I] != '*'; I--) {
            Responders are text$ == pattern[I] and counter$ == patt_counter and cell index < maxcell;
            Responders add 1 to counter$ and store result in counter$ of preceding cell;
            patt_counter++;
        }
        Responders are counter$ == patt_counter;
        Responders set segment$[k] = patt_counter in next cell;
        Responders are segment$[k] > 0;
        if any Responders
            maxcell = maximum cell index value of Responders;
        else
            maxcell = 0;
        All cells become Responders and set counter$ = 0;
        patt_counter++;
        k++;
    }
    /* When pattern has been processed */
    Responders are segment$[--k] > 0;
    Responders set match$ = 1;
    /* Special handling for '*' at start of pattern */
    if (pattern[0] == '*') {
        Responders are cell index < maxcell and cell index > 1;
        Responders set match$ = 1;
    }
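The VLDC pseudocode can also be simulated sequentially. The Python sketch below is an illustration under stated assumptions, not the authors' code: patt_counter and k start at 0, cells are numbered 1 to n+1 with cell 1 blank and the text in cells 2 to n+1 (inferred from maxcell = n + 2 and the "cell index > 1" tests), and the pattern contains no consecutive '*' characters. The list `segments` plays the role of segment$[k].

```python
def asc_vldc(text, pattern):
    """Sequentially simulate the VLDC pseudocode; returns (match_cells, segments)."""
    n, m = len(text), len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    size = n + 2                            # list index == cell index; index 0 is unused
    cell_text = [None, None] + list(text)   # cell 1 is blank, cells 2..n+1 hold the text
    counter = [0] * size
    segments = []                           # segments[k][c] stands in for segment$[k] in cell c
    patt_length, maxcell, patt_counter = m, n + 2, 0

    # Special handling for '*' at the end of the pattern.
    if pattern[-1] == "*":
        segments.append([0, 0] + [1] * n)
        patt_counter = 1

    while True:
        patt_length -= patt_counter
        if patt_length <= 0 or maxcell <= 0:
            break
        patt_counter = 0
        i = patt_length - 1
        while i >= 0 and pattern[i] != "*":
            # Select responders first, then write, to mimic one lockstep parallel step.
            responders = [c for c in range(2, size)
                          if cell_text[c] == pattern[i]
                          and counter[c] == patt_counter and c < maxcell]
            for c in responders:
                counter[c - 1] = patt_counter + 1
            patt_counter += 1
            i -= 1
        seg = [0] * size
        for c in range(1, size - 1):
            if counter[c] == patt_counter:
                seg[c + 1] = patt_counter    # record the segment length in the next cell
        segments.append(seg)
        starts = [c for c in range(1, size) if seg[c] > 0]
        maxcell = max(starts) if starts else 0
        counter = [0] * size                 # all cells reset their counters
        patt_counter += 1                    # also step over the '*' before this segment

    # When the pattern has been processed, the last segment table marks the match starts.
    match_cells = [c for c in range(1, size) if segments[-1][c] > 0]
    # Special handling for '*' at the start of the pattern.
    if pattern[0] == "*":
        match_cells = sorted(set(match_cells) | {c for c in range(2, size) if c < maxcell})
    return match_cells, segments


cells, segs = asc_vldc("ABBBABBBABA", "AB*BB*A")
print(cells)  # [2, 6]
```

Because the pattern is scanned right to left, `segs[0]` belongs to the last pattern segment (A), `segs[1]` to BB, and `segs[2]` to AB.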

After third pattern segment in VLDC Algorithm
[Figure: trace table for pattern AB*BB*A and text ABBBABBBABA over cells 1 to 12. Columns: T$, M$, C$ (Counter$), S0$, S1$, S2$, Responder$; scalars: patt_counter, maxcell.]

After second pattern segment in VLDC Algorithm
[Figure: same table layout. Maxcell is used to keep pattern segments in order, i.e. AB occurs before BB.]

After first pattern segment in VLDC Algorithm
[Figure: same table layout.]

Final State in VLDC Algorithm
[Figure: same table layout.]

Finding All Continuation Points
• A match starts where M$ = 1
• A match to any pattern segment begins where S$[x] == segment length, i.e. where any S$[x] > 0
• The match continues at any S$[x-1] entry whose cell/PE index is >= (cell/PE index of the S$[x] entry) + (segment length stored in S$[x])

Using the Final State in VLDC Algorithm
Pattern: AB*BB*A   Text: ABBBABBBABA
[Figure: final-state table over cells 1 to 12 with columns S0$, S1$, S2$, T$, M$, C$.]
• Start with index 2, where there is a match (M$ = 1)
• Work from S2$ down and left: count down 2 values and move into S1$, count down 2 values and move to S0$
• That produces cells 2, 4, 6: ABBBA
• Any index >= 4 in S1$ whose value is > 0 will also produce a correct match:
  • 2, 7, 10: ABBBABBBA
  • 2, 8, 10: ABBBABBBA
• Some of the additional matches are:
  • 2, 4, 10: ABBBABBBA
  • 2, 4, 12: ABBBABBBABA
  • 2, 8, 12: ABBBABBBABA
  • 6, 8, 10: ABBBA
  • 6, 8, 12: ABBBABA
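The continuation rule can be sketched in Python as a small enumeration. This is an illustration only: instead of reading the S$ columns produced by the ASC algorithm, the hypothetical helper `segment_starts` finds segment start cells by plain substring search, and it assumes the slides' numbering in which the text begins in cell 2. Chains are then built by requiring each next segment to start at or after the previous start plus the previous segment length.

```python
def segment_starts(text, segment):
    """Cell indices where `segment` starts, assuming the text begins in cell 2."""
    return [p + 2 for p in range(len(text) - len(segment) + 1)
            if text.startswith(segment, p)]


def continuation_chains(text, pattern):
    """Enumerate every chain of segment start cells that forms a complete match."""
    segs = [s for s in pattern.split("*") if s]
    tables = [segment_starts(text, s) for s in segs]
    results = []

    def extend(idx, min_cell, chain):
        if idx == len(segs):
            results.append(tuple(chain))
            return
        for c in tables[idx]:
            if c >= min_cell:                 # continuation must not overlap the previous segment
                extend(idx + 1, c + len(segs[idx]), chain + [c])

    extend(0, 2, [])
    return results


text = "ABBBABBBABA"
for chain in continuation_chains(text, "AB*BB*A"):
    # Matched text runs from the first segment's cell to the end of the last segment.
    print(chain, text[chain[0] - 2:chain[-1] - 1])
```

For this pattern and text the enumeration yields nine chains, including all eight listed on the slide.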

Existing Algorithms
• Sequential algorithms
  • Naïve algorithm: O(mn)
  • Knuth, Morris, & Pratt, or Boyer-Moore: O(m+n)
• Parallel algorithms
  • PRAM exact string matching: O(n)
  • On a reconfigurable mesh: O(1) on n(n-m+1) PEs
  • On a SIMD hypercube (limited to {0,1}): O(lg n) on n/lg n PEs
  • On a neural network: O(1) on nm PEs
  • ASC algorithms: O(m) time on O(n) PEs

Question to consider
• The “don’t care” character allows a non-match of arbitrary length, as discussed on slide 13. Instead, consider “*” to allow a non-match for two characters, and make the necessary changes to the trace on slides 15-16.
