
Aho-Corasick Algorithm

sylvie

Presentation Transcript


  1. Aho-Corasick Algorithm • Generalizes KMP to handle sets of strings • New ideas • keyword trees • failure functions/links • output links

  2. Problem Definition • Input • P, a set of z patterns {P1, …, Pz} (total length n) • text T, length m • Task • Output the location of every occurrence of each pattern Pi in T • Bounds • O(n+zm) by running a linear-time exact string matching algorithm (e.g. KMP) once per pattern • Goal: O(n+m+k), where k is the total number of occurrences in T of patterns from P

  3. Keyword Tree • P = {poet, pope, popo, too} • [Figure: keyword tree for P, with the node that ends each pattern numbered 1 to 4]

  4. Observations • Keyword tree K construction • Can be done in O(n) time • remember n is total length of all patterns • Naïve search algorithm with keyword tree K • Align tree to each position in T and see if there is a match • O(nm) time • Use KMP ideas to speed this up
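
The construction and the naïve search described on this slide can be sketched as follows. This is a minimal illustration, not code from the slides: the names `build_keyword_tree` and `naive_search`, the list-of-dicts node layout, and the 0-based text positions are all assumptions made for the example.

```python
def build_keyword_tree(patterns):
    """Build the keyword tree K in O(n) time, n = total pattern length.

    children[v] maps an edge character to a child node id (root is node 0).
    number[v] is the 1-based index of the pattern ending at v, else None.
    """
    children = [{}]
    number = [None]
    for i, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = i
    return children, number


def naive_search(patterns, text):
    """Align K at every start position of the text: O(nm) time overall."""
    children, number = build_keyword_tree(patterns)
    hits = []
    for start in range(len(text)):
        v = 0
        for pos in range(start, len(text)):
            v = children[v].get(text[pos])
            if v is None:          # no edge for this character: alignment fails
                break
            if number[v] is not None:
                hits.append((start, number[v]))
    return hits


children, number = build_keyword_tree(["poet", "pope", "popo", "too"])
print(len(children))  # root plus one node per distinct non-empty prefix
print(naive_search(["poet", "pope", "popo", "too"], "popoet"))
```

Each hit is a pair (start position in T, pattern number). Restarting from the root at every position is what the failure links are meant to avoid.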

  5. Failure functions • Temporary assumption • no pattern in P is a proper substring of another pattern in P • Definitions • For each node v of K, L(v) denotes the concatenation of the characters on the path from the root to node v • For any node v of K, define lp(v) to be the length of the longest proper suffix of L(v) that is a prefix of some pattern in P • For a node v of K, let f(v) denote the unique node in K such that L(f(v)) is the suffix of L(v) of length lp(v) • Note, f(v) = the root of K if lp(v) = 0 • The directed edge (v, f(v)) is a failure link
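
As a worked example of the definition, lp(v) can be computed by brute force directly from L(v). The sketch below assumes a node is identified by its label string L(v); the function name `lp` mirrors the slide's notation, and the implementation is only meant to check small cases, not to be efficient.

```python
def lp(label, patterns):
    """Length of the longest proper suffix of `label` that is a prefix
    of some pattern (brute force, straight from the definition)."""
    prefixes = {p[:i] for p in patterns for i in range(1, len(p) + 1)}
    for length in range(len(label) - 1, 0, -1):   # longest candidates first
        if label[-length:] in prefixes:
            return length
    return 0


P = ["poet", "pope", "popo", "too"]
for label in ["pop", "popo", "poet", "pope", "too"]:
    print(label, lp(label, P))
```

For P = {poet, pope, popo, too} this gives lp = 2 for the node labeled popo (suffix "po") and lp = 1 for poet (suffix "t"), while pope and too have lp = 0, so their failure links go to the root.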

  6. Keyword Tree and failure links • P = {poet, pope, popo, too} • [Figure: keyword tree for P with failure links drawn]

  7. Using failure links in search • Setting: matched up to node v in K and up to T(c-1) in T • T(c) does not label any edge out of v • Update • “Shift” the alignment so that it starts at position c - lp(v) of T • This lines up T with the maximal prefix of some pattern in P that is guaranteed by the definition of lp(v) • v = f(v) • The next comparison will still be T(c) against the edges out of the new node v • Full details on page 56
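
The update rule can be sketched in a few lines if each node is represented by its label string L(v). This is an illustration under the slide's temporary assumption (no pattern is a proper substring of another). The name `search_with_failure_links` is hypothetical, and f is recomputed by brute force from the definition on each failure, so the sketch shows the control flow only and does not achieve the linear time bound.

```python
def search_with_failure_links(patterns, text):
    """Return (start, pattern) pairs, 0-based, assuming no pattern is a
    proper substring of another. Nodes are label strings; "" is the root."""
    nodes = {p[:i] for p in patterns for i in range(len(p) + 1)}

    def f(label):
        # Failure link: longest proper suffix of the label that is a node.
        for length in range(len(label) - 1, 0, -1):
            if label[-length:] in nodes:
                return label[-length:]
        return ""

    hits = []
    v = ""          # current node
    c = 0           # next text position to compare
    while c < len(text):
        if v + text[c] in nodes:      # edge labeled T(c) exists: follow it
            v += text[c]
            c += 1
            if v in patterns:
                hits.append((c - len(v), v))
        elif v == "":                 # mismatch at the root: advance in T
            c += 1
        else:                         # mismatch elsewhere: v = f(v), c unchanged
            v = f(v)
    return hits


print(search_with_failure_links(["poet", "pope", "popo", "too"], "popoet"))
```

On the text popoet, the search matches popo, fails on e, follows the failure link from popo to po without moving c, and then continues along the path for poet.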

  8. Recursive structure for computing failure links • Base Case • v is the root or v is a direct child of the root: f(v) = root • Recursive Case • Compute f(v) for a node v that is k+1 edges from the root, assuming f(w) has been computed for all nodes w at most k edges from the root • Observation • L(v) = L(parent(v)) concatenated with x • x is the character labeling edge (parent(v), v) • Thus, f(parent(v)) can help

  9. Computing failure links • Def: x is the character on edge (parent(v), v) and r is the root of K • Algorithm for node v
       w = f(parent(v));   /* using information about parent to help */
       while (there is no edge out of w labeled x) and (w is not equal to r)
           w = f(w);
       if there is an edge (w, w') out of w labeled x
           f(v) = w'
       else
           f(v) = r
     • Do this in a breadth-first manner through the tree
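
A runnable version of this procedure might look like the sketch below. The function name `build_failure_links`, the node-id layout (root is node 0), and the returned `label` list are assumptions added for illustration; the inner loop follows the slide's pseudocode step for step.

```python
from collections import deque


def build_failure_links(patterns):
    """Return (label, fail): label[v] is L(v), fail[v] is the node id f(v)."""
    children = [{}]          # children[v]: edge character -> child node id
    label = [""]             # root is node 0 with the empty label
    for pattern in patterns:
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                label.append(label[v] + ch)
                children[v][ch] = len(children) - 1
            v = children[v][ch]

    fail = [0] * len(children)            # base case: root and its children -> root
    queue = deque(children[0].values())   # breadth-first, starting at depth 1
    while queue:
        u = queue.popleft()               # u plays the role of parent(v)
        for x, v in children[u].items():
            w = fail[u]
            while x not in children[w] and w != 0:
                w = fail[w]
            fail[v] = children[w].get(x, 0)   # w' if edge (w, w') exists, else root
            queue.append(v)
    return label, fail


label, fail = build_failure_links(["poet", "pope", "popo", "too"])
for v in range(1, len(label)):
    print(repr(label[v]), "->", repr(label[fail[v]]))
```

Processing nodes in breadth-first order guarantees that f(w) is already known for every node w closer to the root than v, which is exactly what the while loop relies on.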

  10. Keyword Tree and failure links • P = {poet, pope, popo, too} • [Figure: keyword tree for P with failure links drawn]

  11. Linear time argument • Consider a single pattern p of length t • Let p also denote the path of p in K • Claim: the time to compute failure links for all nodes on p is O(t) • For any v on p, lp(v) <= lp(parent(v)) + 1, so lp increases by at most t in total along p • Each assignment w = f(w) inside the while loop of the failure link algorithm decreases lp(w) by at least one • lp(w) is never negative along the whole path p • Therefore the number of decrements of lp(w), and thus the number of assignments to w inside the while loop over all nodes on p, is at most t • Total number of assignments is O(t)
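
One way to write the counting argument out explicitly is the telescoping sum below. The symbols v_0, ..., v_t (the nodes on path p, with v_0 the root) and a_i (the number of executions of w = f(w) while processing v_i) are notation introduced here for illustration and do not appear on the slides.

```latex
% Processing v_i starts with w = f(v_{i-1}), whose depth is lp(v_{i-1}).
% Each of the a_i assignments w = f(w) lowers the depth of w by at least 1,
% and following the edge labeled x raises it by at most 1, hence
\[
  lp(v_i) \;\le\; lp(v_{i-1}) + 1 - a_i , \qquad i = 1, \dots, t .
\]
% Summing over the path and using lp(v_0) = 0 and lp(v_t) \ge 0:
\[
  \sum_{i=1}^{t} a_i \;\le\; t + lp(v_0) - lp(v_t) \;\le\; t .
\]
```

Summing the bound over all patterns gives at most n assignments inside the while loop in total, which is consistent with the O(n) construction time claimed earlier.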

  12. Allowing substrings • Remove the assumption • no pattern in P is a proper substring of another pattern in P • Definitions • The output link (if there is one) at node v points to the numbered node other than v that is reachable from v by following the fewest failure links • Adding output link computation to the algorithm for f(v) • If f(v) is a numbered node, then output(v) = f(v) • else if output(f(v)) is defined, then output(v) = output(f(v)) • else output(v) is undefined
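
Putting the pieces together, a complete search with output links could be sketched as follows. The function name `aho_corasick`, the node-id layout, and the 0-based (start, pattern) result format are assumptions for this example, which also assumes the patterns are distinct and non-empty.

```python
from collections import deque


def aho_corasick(patterns, text):
    """Return (start, pattern) for every occurrence of every pattern."""
    children, number = [{}], [None]       # node 0 is the root
    for i, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = i                     # v is the numbered node for pattern i

    fail = [0] * len(children)
    output = [None] * len(children)       # None means "undefined"
    queue = deque(children[0].values())
    while queue:
        u = queue.popleft()
        for x, v in children[u].items():
            w = fail[u]
            while x not in children[w] and w != 0:
                w = fail[w]
            fail[v] = children[w].get(x, 0)
            # Output link rule from the slide.
            if number[fail[v]] is not None:
                output[v] = fail[v]
            else:
                output[v] = output[fail[v]]
            queue.append(v)

    hits = []
    v = 0
    for c, ch in enumerate(text):
        while ch not in children[v] and v != 0:
            v = fail[v]
        v = children[v].get(ch, 0)
        # Report v itself if numbered, then everything on its output chain.
        u = v if number[v] is not None else output[v]
        while u is not None:
            found = patterns[number[u] - 1]
            hits.append((c - len(found) + 1, found))
            u = output[u]
    return hits


print(aho_corasick(["at", "pot", "potato", "tatter"], "potatter"))
```

Following output links costs one step per reported occurrence, which is where the k term in the O(n+m+k) goal comes from.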

  13. Keyword Tree and output links • P = {at, pot, potato, tatter} • [Figure: keyword tree for P with output links drawn; the node that ends each pattern is numbered 1 to 4]
