Parallel String Matching Algorithm(s) Using Associative Processors
Original work by Mary Esenwein and Dr. Johnnie Baker
Presented by Shannon Steinfadt, April 18, 2007
String Matching Problem
• Also known as pattern matching or string searching
• Useful in many applications, such as text editing, information retrieval, DNA analysis, and homeland security
What are we doing?
• Given a pattern and some text, find out whether the pattern is IN the text
• Is pattern AB in the text ABAA? If so, where?
  Pattern: AB
  Text: ABAA
What’s the notation?
• P is a pattern string of length m
• T is a text string of length n, usually with n ≥ m
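Goal of String Matching
• To find all occurrences of a pattern string in the text string
• Locate all positions i in T such that T[i+j-1] = P[j] for all j, 1 ≤ j ≤ m
• Why use P[j]? How does it relate to T[i+j-1]?
  • j indexes the pattern; when the pattern is aligned to start at text position i, T[i+j-1] is the text character that lines up with P[j]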
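The matching condition above can be checked directly with a brute-force scan. The following is a minimal sequential sketch, not part of the original slides; the function name `naive_match` is hypothetical, and positions are reported 1-based to agree with the T[i+j-1] = P[j] notation.

```python
def naive_match(text, pattern):
    """Return all 1-based positions i with T[i+j-1] == P[j] for 1 <= j <= m."""
    n, m = len(text), len(pattern)
    positions = []
    # Candidate start positions i run from 1 to n - m + 1 (1-based).
    for i in range(1, n - m + 2):
        # Python strings are 0-based, so T[i+j-1] is text[i+j-2] and P[j] is pattern[j-1].
        if all(text[i + j - 2] == pattern[j - 1] for j in range(1, m + 1)):
            positions.append(i)
    return positions


if __name__ == "__main__":
    print(naive_match("ABAA", "AB"))
    print(naive_match("ABBBABBBABA", "BBA"))
```

This is the O(mn) baseline that the parallel algorithms on the later slides are compared against.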
Pattern Variations
• An exact pattern
• A “Don’t Care” character (*) in the pattern
  • Gives flexibility in matching
  • * indicates character(s) of the text that are irrelevant to the matching process
Characteristics of the General “Don’t Care” Character (*)
The * can stand for:
• A single character of text
• Multiple consecutive text characters
• No characters
• Any combination of the above three
Example:
• Pattern AB*CD could match ABBCD, ABBBBBCD, or ABCD (* is null)
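One quick way to experiment with these semantics on a sequential machine is to translate the pattern into a regular expression, with each * becoming `.*`. This is only an illustrative stand-in for the ASC algorithm, not the method from the slides; the helper name `vldc_to_regex` is hypothetical.

```python
import re


def vldc_to_regex(pattern):
    """Translate a pattern with variable-length '*' wildcards into a compiled regex."""
    # Escape each literal segment so that only '*' is treated as special.
    parts = [re.escape(segment) for segment in pattern.split("*")]
    # '.*' matches zero or more arbitrary characters, like the general "don't care".
    return re.compile(".*".join(parts), re.DOTALL)


if __name__ == "__main__":
    rx = vldc_to_regex("AB*CD")
    for candidate in ["ABBCD", "ABBBBBCD", "ABCD", "ABCE"]:
        print(candidate, bool(rx.fullmatch(candidate)))
```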
String Matching using ASC
• Three parallel algorithms using associative computing (using 1-D mesh)
  • String matching for exact match
  • String matching with fixed length “don’t care”
    • I.e., exactly 1 character
  • String matching with variable length “don’t care”
    • A “don’t care” can have any length or be null
ASC Exact Match Algorithm

for (j = patt_length - 1; j >= 0; j--) {
    Responders are text[$] == patt_string[j] and counter[$] == patt_counter;
    Responders add 1 to counter[$] and store result in counter[$] of preceding cell;
    patt_counter++;
}
/* When pattern has been processed */
Responders are counter[$] == patt_length;
Responders set match[$] = 1 in next cell;
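To see how the responder steps behave, the pseudocode can be simulated sequentially. The sketch below is an interpretation rather than the authors' implementation: it assumes counter[$] and patt_counter start at 0, that the text occupies cells 1..n, and that an extra padding cell 0 exists to receive the "preceding cell" write from cell 1. The function name `asc_exact_match` is hypothetical.

```python
def asc_exact_match(text, pattern):
    """Sequentially simulate the ASC exact-match pseudocode; return 1-based match starts."""
    n, m = len(text), len(pattern)
    cell_text = [None] + list(text)   # assumed layout: cell 0 is padding, text in cells 1..n
    counter = [0] * (n + 1)           # assumed initial value of counter[$]
    patt_counter = 0                  # assumed initial value
    # Process the pattern from its last character to its first.
    for j in range(m - 1, -1, -1):
        # Pick all responders first so the step behaves like one parallel operation.
        responders = [c for c in range(1, n + 1)
                      if cell_text[c] == pattern[j] and counter[c] == patt_counter]
        for c in responders:
            counter[c - 1] = patt_counter + 1   # store counter + 1 in the preceding cell
        patt_counter += 1
    # Cells whose counter reached the pattern length flag match[$] in the next cell.
    return [c + 1 for c in range(n) if counter[c] == m]


if __name__ == "__main__":
    print(asc_exact_match("ABBBABBBABA", "BBA"))
```

Each pass of the loop corresponds to one constant-time associative step, which is where the O(m) running time on O(n) PEs comes from.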
Pattern: BBA
Text: ABBBABBBABA
(Figure: per-cell fields Text[$], Match[$], Counter[$], with scalars patt_counter and patt_length. m = pattern length, n = text length, j = pattern index, i = text index.)

Final State of Exact Match Algorithm
Pattern: BBA
Text: ABBBABBBABA
(Figure: per-cell fields Text[$], Match[$], Counter[$]. m = pattern length, n = text length, j = pattern index, i = text index.)
Algorithm for unit length "don't cares" using ASC

for (j = patt_length - 1; j >= 0; j--) {
    if (pattern[j] == '*')
        Responders are counter[$] == patt_counter;
    else /* pattern[j] is not the "don't care" character */
        Responders are text[$] == pattern[j] and counter[$] == patt_counter;
    If no Responders are detected, exit;
    Responders add 1 to counter[$] and store result in counter[$] of preceding cell;
    patt_counter++;
}
/* When pattern has been processed */
Responders are counter[$] == patt_length;
Responders set match[$] = 1 in next cell;
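A sequential simulation of the unit-length "don't care" variant differs from the exact-match case only in how responders are selected. As before, this is a sketch under assumptions: counters start at 0, text sits in cells 1..n with a padding cell 0, and the padding cell is never allowed to respond to a * (it holds no text character). The name `asc_unit_dont_care_match` is hypothetical.

```python
def asc_unit_dont_care_match(text, pattern, wildcard="*"):
    """Simulate the unit-length "don't care" pseudocode; return 1-based match starts."""
    n, m = len(text), len(pattern)
    cell_text = [None] + list(text)   # assumed layout: cell 0 is padding, text in cells 1..n
    counter = [0] * (n + 1)           # assumed initial value of counter[$]
    patt_counter = 0                  # assumed initial value
    for j in range(m - 1, -1, -1):
        # A '*' accepts any text character; otherwise the character must match.
        responders = [c for c in range(1, n + 1)
                      if counter[c] == patt_counter
                      and (pattern[j] == wildcard or cell_text[c] == pattern[j])]
        if not responders:
            return []                 # "If no Responders are detected, exit"
        for c in responders:
            counter[c - 1] = patt_counter + 1
        patt_counter += 1
    return [c + 1 for c in range(n) if counter[c] == m]


if __name__ == "__main__":
    print(asc_unit_dont_care_match("ABBBABBBABA", "B*A"))
    print(asc_unit_dont_care_match("ABBBABBBABA", "A*B"))
```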
Pattern: B*A
Text: ABBBABBBABA
(Figure: per-cell fields Text[$], Match[$], Counter[$], with scalars patt_counter and patt_length. m = pattern length, n = text length, j = pattern index, i = text index.)

Final State of Unit Length "Don't Care" Algorithm
Pattern: B*A
Text: ABBBABBBABA
(Figure: per-cell fields Text[$], Match[$], Counter[$]. m = pattern length, n = text length, j = pattern index, i = text index.)
VLDC Algorithm (added)
• Works on each “segment” of the pattern, where segments are separated by the * character
  • AB*BB*A has three segments
• Consecutive * characters (**) are not necessary and not allowed
• This VLDC algorithm is unique
  • It provides the information needed to find all continuation points of all matches following each “*”
VLDC ALGORITHM USING ASC

int patt_length = m;
int maxcell = n + 2;
int patt_counter = 0, k = 0;  /* assumed initial values; not shown on the slide */

/* Special handling for '*' at end of pattern */
if (pattern[m-1] == '*') {
    Responders are cell index > 1;
    Responders set segment$[0] = 1;
    patt_counter = 1;
    k = 1;  /* Reset initial segment index */
}
while ((patt_length -= patt_counter) > 0 && maxcell > 0) {
    patt_counter = 0;
    for (I = patt_length - 1; I >= 0 && pattern[I] != '*'; I--) {
        Responders are text$ == pattern[I] and counter$ == patt_counter and cell index < maxcell;
        Responders add 1 to counter$ and store result in counter$ of preceding cell;
        patt_counter++;
    }
    Responders are counter$ == patt_counter;
    Responders set segment$[k] = patt_counter in next cell;
    Responders are segment$[k] > 0;
    if any Responders
        maxcell = maximum cell index value of Responders;
    else
        maxcell = 0;
    All cells become Responders and set counter$ = 0;
    patt_counter++;
    k++;
}
/* When pattern has been processed */
Responders are segment$[--k] > 0;
Responders set match$ = 1;

/* Special handling for '*' at start of pattern */
if (pattern[0] == '*') {
    Responders are cell index < maxcell and cell index > 1;
    Responders set match$ = 1;
}
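The VLDC pseudocode can also be stepped through sequentially. The following sketch is one reading of the slides, not the authors' code. It assumes patt_counter and k start at 0, that cells are numbered from 1 with cell 1 as padding and the text in cells 2..n+1 (inferred from `maxcell = n + 2` and `cell index > 1`), and it rejects patterns containing `**`. The name `asc_vldc_match` is hypothetical.

```python
def asc_vldc_match(text, pattern):
    """Simulate the VLDC pseudocode; return (sorted match cells, segment table)."""
    if not pattern or "**" in pattern:
        raise ValueError("pattern must be non-empty and must not contain '**'")
    n, m = len(text), len(pattern)
    last = n + 1                                  # assumed: text occupies cells 2..n+1
    cell_text = [None, None] + list(text)         # index 0 unused, cell 1 is padding
    counter = [0] * (n + 3)
    segment = [[0] * (n + 3) for _ in range(pattern.count("*") + 1)]
    patt_length, maxcell = m, n + 2
    patt_counter, k = 0, 0                        # assumed initial values
    if pattern[-1] == "*":                        # '*' at end of pattern
        for c in range(2, last + 1):
            segment[0][c] = 1
        patt_counter, k = 1, 1
    while True:
        patt_length -= patt_counter
        if patt_length <= 0 or maxcell <= 0:
            break
        patt_counter = 0
        i = patt_length - 1
        while i >= 0 and pattern[i] != "*":       # scan one segment right to left
            responders = [c for c in range(2, last + 1)
                          if cell_text[c] == pattern[i]
                          and counter[c] == patt_counter and c < maxcell]
            for c in responders:
                counter[c - 1] = patt_counter + 1
            patt_counter += 1
            i -= 1
        # Cells after a completed segment record its length in segment$[k].
        starts = [c + 1 for c in range(1, last + 1) if counter[c] == patt_counter]
        for c in starts:
            segment[k][c] = patt_counter
        maxcell = max(starts) if starts else 0
        counter = [0] * (n + 3)                   # all cells reset counter$
        patt_counter += 1                         # skip over the '*' separator
        k += 1
    k -= 1
    match = {c for c in range(1, last + 1) if segment[k][c] > 0}
    if pattern[0] == "*":                         # '*' at start of pattern
        match |= {c for c in range(2, last + 1) if c < maxcell}
    return sorted(match), segment


if __name__ == "__main__":
    cells, seg = asc_vldc_match("ABBBABBBABA", "AB*BB*A")
    print(cells)
```

Under these assumptions the reported indices are cell indices, so a match beginning at the first text character is reported at index 2.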
After third pattern segment in VLDC Algorithm
Pattern: AB*BB*A
Text: ABBBABBBABA
(Figure: state table for cells 1-12 with columns T$, M$, C$, S0$, S1$, S2$, Responder$, plus scalars Patt_counter and Maxcell.)

After second pattern segment in VLDC Algorithm
Pattern: AB*BB*A
Text: ABBBABBBABA
(Figure: state table for cells 1-12 with columns T$, M$, Counter$, S0$, S1$, S2$, Responder$, plus scalars Patt_counter and Maxcell. Maxcell is used to keep pattern segments in order, i.e. AB occurs before BB.)

After first pattern segment in VLDC Algorithm
Pattern: AB*BB*A
Text: ABBBABBBABA
(Figure: state table for cells 1-12 with columns T$, M$, Counter$, S0$, S1$, S2$, Responder$, plus scalars Patt_counter and Maxcell. Maxcell is used to keep pattern segments in order, i.e. AB occurs before BB.)

Final State in VLDC Algorithm
Pattern: AB*BB*A
Text: ABBBABBBABA
(Figure: state table for cells 1-12 with columns T$, M$, Counter$, S0$, S1$, S2$, Responder$, plus scalars Patt_counter and Maxcell. Maxcell is used to keep pattern segments in order, i.e. AB occurs before BB.)
Finding All Continuation Points
• A match starts where M$ = 1
• A match to any pattern segment begins where S$[x] == segment length
  • I.e., where any S$[x] > 0
• The match continues at any cell/PE in S$[x-1] with a value > 0 whose index is >= (index of the S$[x] cell/PE + that segment's size)
Using the Final State in VLDC Algorithm
Pattern: AB*BB*A
Text: ABBBABBBABA
(Figure: final state table for cells 1-12 with columns S0$, S1$, S2$, T$, M$, C$.)
• Start with index 2, where there’s a match (M$ = 1)
• Work from S2$ down and left: count down 2 values and move into S1$, then count down 2 values and move to S0$
• That produces: 2, 4, 6 (ABBBA)
• Any index >= 4 in S1$ whose value is > 0 will also produce a correct match
  • 2, 7, 10 (ABBBABBBA)
  • 2, 8, 10 (ABBBABBBA)
• Some of the additional matches are:
  • 2, 4, 10 (ABBBABBBA)
  • 2, 4, 12 (ABBBABBBABA)
  • 2, 8, 12 (ABBBABBBABA)
  • 6, 8, 10 (ABBBA)
  • 6, 8, 12 (ABBBABA)
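The continuation rule can be checked by enumerating every chain of segment start indices. The sketch below does not read the S$ fields; as a stand-in it finds segment occurrences by brute force. It assumes cell index = text position + 1 (inferred from the example, where the match at the first text character is reported at index 2) and a pattern with no leading or trailing *. The names `occurrences` and `continuation_chains` are hypothetical.

```python
def occurrences(text, seg):
    """Cell indices (assumed: text position + 1) where segment seg starts in text."""
    return [i + 2 for i in range(len(text) - len(seg) + 1) if text.startswith(seg, i)]


def continuation_chains(text, pattern):
    """Enumerate all tuples of segment start cells that form a full match."""
    segments = pattern.split("*")   # assumes no leading/trailing '*' and no '**'
    occ = [occurrences(text, s) for s in segments]
    results = []

    def extend(level, min_cell, prefix):
        if level == len(segments):
            results.append(tuple(prefix))
            return
        for c in occ[level]:
            # The next segment must start at or after the end of the previous one.
            if c >= min_cell:
                extend(level + 1, c + len(segments[level]), prefix + [c])

    extend(0, 0, [])
    return results


if __name__ == "__main__":
    for chain in continuation_chains("ABBBABBBABA", "AB*BB*A"):
        print(chain)
```

For the example pattern and text, the chains include the index triples listed on the slide, such as (2, 4, 6) and (6, 8, 12).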
Existing Algorithms
• Sequential Algorithms
  • Naïve algorithm: O(mn)
  • Knuth, Morris, & Pratt, or Boyer-Moore: O(m+n)
• Parallel Algorithms
  • A PRAM exact string matching: O(n)
  • On a reconfigurable mesh: O(1) on n(n-m+1) PEs
  • On a SIMD hypercube (limited to {0,1}): O(lg n) on n/lg n PEs
  • On a neural network: O(1) on nm PEs
  • ASC algorithms: O(m) time on O(n) PEs
Question to consider
• The “don’t care” character allows a non-match of arbitrary length. This is discussed on slide 13. Instead, consider “*” to allow a non-match of exactly two characters, and make the necessary changes to the trace on slides 15-16.