Aho-Corasick Algorithm • Generalizes KMP to handle sets of strings • New ideas • keyword trees • failure functions/links • output links
Problem Definition • Input • P, a set of z patterns {P1, …, Pz} (total length n) • text T, length m • Task • Output the location of all occurrences of each pattern Pi in T • Bounds • O(n+zm) bound by running a linear-time exact string matching algorithm once per pattern • Goal: O(n+m+k) bound where k is the total number of occurrences in T of the patterns in P
Keyword Tree • P = {poet, pope, popo, too} • [Figure: keyword tree for P; the nodes that end a pattern are numbered 1 to 4]
Observations • Keyword tree K construction • Can be done in O(n) time • remember n is total length of all patterns • Naïve search algorithm with keyword tree K • Align tree to each position in T and see if there is a match • O(nm) time • Use KMP ideas to speed this up
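The construction and the naive search above can be sketched as follows. This is a minimal illustration, not a reference implementation: the integer node ids, the one-dict-per-node layout, and the names build_keyword_tree and naive_search are assumptions made for this example. Positions are reported 0-based.

```python
def build_keyword_tree(patterns):
    """Build the keyword tree in time proportional to the total pattern length.

    Returns (children, number): children[v] maps an edge character to a child
    id, and number[v] is the 1-based index of the pattern ending at v, or None.
    Assumes the patterns are distinct and non-empty.
    """
    children = [{}]   # node 0 is the root
    number = [None]
    for i, p in enumerate(patterns, start=1):
        v = 0
        for ch in p:
            if ch not in children[v]:
                children.append({})
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = i
    return children, number


def naive_search(patterns, text):
    """Align the tree to every start position of text and walk down from the root."""
    children, number = build_keyword_tree(patterns)
    hits = []
    for start in range(len(text)):
        v = 0
        for c in range(start, len(text)):
            v = children[v].get(text[c])
            if v is None:          # no edge labeled text[c]: give up on this start
                break
            if number[v] is not None:
                hits.append((start, number[v]))
    return hits
```

For example, naive_search(["poet", "pope", "popo", "too"], "popoet") restarts from the root at each of the six start positions, which is the repeated work that failure links are meant to avoid.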
Failure functions • Temporary assumption • no pattern in P is a proper substring of another pattern in P • Definitions • For each node v of K, L(v) denotes the concatenation of the characters on the path from the root to node v • For any node v of K, define lp(v) to be the length of the longest proper suffix of L(v) that is a prefix of some pattern in P • For a node v of K, let f(v) denote the unique node in K whose label L(f(v)) is the suffix of L(v) of length lp(v) • Note: f(v) is the root of K if lp(v) = 0 • Directed edge (v, f(v)) is a failure link
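The definitions of lp(v) and f(v) can be checked directly with a brute-force sketch. It assumes, purely for readability, that each node v is identified with its label L(v) (so the root is the empty string); the function name is hypothetical, and this is not the efficient construction.

```python
def failure_links_by_definition(patterns):
    """Map each node label L(v) to the label of f(v), straight from the definition."""
    # Every prefix of every pattern is the label of exactly one node of K.
    nodes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    f = {"": ""}  # convention: the root points to itself
    for label in nodes:
        # Try proper suffixes from longest to shortest; the empty suffix
        # (the root) always qualifies, so every non-root node gets a link.
        for start in range(1, len(label) + 1):
            if label[start:] in nodes:
                f[label] = label[start:]
                break
    return f


f = failure_links_by_definition(["poet", "pope", "popo", "too"])
lp = {label: len(target) for label, target in f.items()}  # lp(v) = |L(f(v))|
```

With P = {poet, pope, popo, too}, this gives f("popo") = "po" and f("poet") = "t", while nodes such as "pope" and "too" fall back to the root.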
Keyword Tree and failure links • P = {poet, pope, popo, too} • [Figure: keyword tree for P with numbered nodes 1 to 4 and failure links]
Using failure links in search • Setting: matched up to node v in K, ending at T(c-1) in T • T(c) does not label any edge out of v • Update • Move the start of the alignment in T to position c - lp(v) (the “shift”) • This lines up T with the longest prefix of some pattern in P that is guaranteed to match by the definition of lp(v) • v = f(v) • Next comparison will still be T(c) against the edges out of the new node v • Full details on page 56
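The update rule can be sketched as a search loop under the temporary no-substring assumption. As an illustration only, nodes are again identified with their labels and f(v) is computed by brute force from its definition, so this sketch does not achieve the linear bound; the function name and the 0-based reporting are assumptions.

```python
def search_with_failure_links(patterns, text):
    """Report (start, pattern number) pairs, assuming no pattern is a
    proper substring of another pattern."""
    nodes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    numbered = {p: i for i, p in enumerate(patterns, start=1)}

    def f(label):
        # Longest proper suffix of label that is itself a node (brute force).
        return next(label[s:] for s in range(1, len(label) + 1)
                    if label[s:] in nodes)

    hits = []
    v = ""   # current node, identified by L(v); "" is the root
    c = 0    # index of the next text character to compare
    while c < len(text):
        if v + text[c] in nodes:      # an edge labeled T(c) leaves v
            v += text[c]
            c += 1
            if v in numbered:
                # The alignment starts at c - len(L(v)).
                hits.append((c - len(v), numbered[v]))
        elif v == "":
            c += 1                    # mismatch at the root: skip T(c)
        else:
            v = f(v)                  # c is unchanged: T(c) is compared again
    return hits
```

Note that c never moves backward: after a mismatch only v changes, and the implied start of the alignment becomes c - lp(v).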
Recursive structure for computing failure links • Base Case • v is the root or a direct child of the root: f(v) = root • Recursive Case • Compute f(v) for v that is k+1 steps away from the root, assuming f(w) has been computed for all w that are at most k steps away • Observation • L(v) = L(parent(v)) concatenated with x • x is the character labeling edge (parent(v), v) • Thus, f(parent(v)) can help
Computing failure links • Def: x is the character on edge (parent(v), v); r is the root of K • Algorithm for node v:
    w = f(parent(v));  /* using information about parent to help */
    while (there is no edge out of w labeled x) and (w is not equal to r)
        w = f(w);
    if there is an edge (w, w’) out of w labeled x
        f(v) = w’
    else
        f(v) = r
• Do this in a breadth-first manner through the tree
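A runnable sketch of this breadth-first computation, assuming integer node ids with the root as node 0 and one dict of outgoing edges per node (the function name and data layout are choices made for this example):

```python
from collections import deque


def build_tree_with_failure_links(patterns):
    """Return (children, f): the keyword tree and the failure link of each node."""
    children = [{}]                   # children[v]: edge character -> child id
    for p in patterns:
        v = 0
        for ch in p:
            if ch not in children[v]:
                children.append({})
                children[v][ch] = len(children) - 1
            v = children[v][ch]

    f = [0] * len(children)           # base case: root and its children -> root
    queue = deque(children[0].values())
    while queue:                      # breadth-first: shallower nodes come first
        parent = queue.popleft()
        for x, v in children[parent].items():
            w = f[parent]             # use the parent's link to help
            while x not in children[w] and w != 0:
                w = f[w]
            # If an edge labeled x leaves w, follow it; otherwise f(v) is the root.
            f[v] = children[w].get(x, 0)
            queue.append(v)
    return children, f
```

The queue only ever holds non-root nodes, so w always starts strictly shallower than v and f[v] can never point back at v itself.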
Keyword Tree and failure links • P = {poet, pope, popo, too} • [Figure: keyword tree for P with numbered nodes 1 to 4 and failure links]
Linear time argument • Consider a single pattern p of length t • Let p also denote the path of p in K • Claim: the time to compute failure links for all nodes on p is O(t) • For any v on p, lp(v) <= lp(parent(v)) + 1 • Therefore, lp increases by at most t in total along p • Each assignment w = f(w) inside the while loop decreases lp(w) by at least one • lp(w) is never negative along the whole path p • So the maximum number of decrements of lp(w), and thus the maximum number of assignments to w inside the while loop over all nodes on path p, is t • Total number of assignments is O(t)
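The counting argument can be written out as a short derivation. The notation below is an assumption introduced for this sketch: v_0 is the root, v_1, …, v_t are the nodes on the path of p, a_i is the number of times w = f(w) executes while computing f(v_i), and lp(w) is read as the depth of the current node w.

```latex
% w starts at f(v_{i-1}), at depth lp(v_{i-1}); each execution of w = f(w)
% lowers its depth by at least one; f(v_i) is at most one level below the final w.
\begin{align*}
lp(v_i) &\le lp(v_{i-1}) + 1 - a_i \qquad (1 \le i \le t) \\
\sum_{i=1}^{t} a_i &\le t + lp(v_0) - lp(v_t) \;\le\; t
\end{align*}
% The last step uses lp(v_0) = 0 and lp(v_t) \ge 0.
```

Summing the first inequality over i telescopes the lp terms, which is why the bound holds for the whole path rather than per node.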
Allowing substrings • Remove assumption • no pattern in P is a proper substring of another pattern in P • Definitions • The output link (if there is one) at node v points to the numbered node (other than v) that is reachable from v by following the fewest failure links • Adding output link computation to the algorithm for f(v) • If f(v) is a numbered node, then output(v) = f(v) • else if output(f(v)) is defined, then output(v) = output(f(v)) • else output(v) is undefined
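Putting the pieces together, the following sketch adds the output-link rule to the breadth-first failure-link pass and then searches a text. The function name, the integer node ids, and the 0-based (start, pattern number) result format are assumptions for this example; patterns are assumed distinct and non-empty.

```python
from collections import deque


def aho_corasick(patterns, text):
    # Keyword tree: children[v] maps a character to a child id; node 0 is the root.
    children, number = [{}], [None]
    for i, p in enumerate(patterns, start=1):
        v = 0
        for ch in p:
            if ch not in children[v]:
                children.append({})
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = i                  # v is the numbered node for pattern i

    f = [0] * len(children)            # failure links
    output = [None] * len(children)    # output links (None = undefined)
    queue = deque(children[0].values())
    while queue:
        parent = queue.popleft()
        for x, v in children[parent].items():
            w = f[parent]
            while x not in children[w] and w != 0:
                w = f[w]
            f[v] = children[w].get(x, 0)
            # Output-link rule: f(v) itself if numbered, else inherit from f(v).
            output[v] = f[v] if number[f[v]] is not None else output[f[v]]
            queue.append(v)

    hits = []
    v = 0
    for c, ch in enumerate(text):
        while ch not in children[v] and v != 0:
            v = f[v]
        v = children[v].get(ch, 0)
        # Report the pattern ending at v, then every pattern on the output chain.
        u = v if number[v] is not None else output[v]
        while u is not None:
            i = number[u]
            hits.append((c - len(patterns[i - 1]) + 1, i))
            u = output[u]
    return hits
```

With P = {at, pot, potato, tatter} and the text "potatter", the occurrence of "at" is found through an output link while the search is still walking the path of "potato", which a search without output links would miss.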
Keyword Tree and output links • P = {at, pot, potato, tatter} • [Figure: keyword tree for P with numbered nodes 1 to 4 and output links]