Construction of Aho Corasick automaton in Linear time for Integer Alphabets. Shiri Dori & Gad M. Landau University of Haifa. Overview. Classic Aho Corasick Our algorithm Goto Function Failure Function Combining the two Queries in O(m log| Σ |). Set Pattern Matching Problem.
Construction of Aho Corasick automaton in Linear time for Integer Alphabets Shiri Dori & Gad M. Landau University of Haifa
Overview • Classic Aho Corasick • Our algorithm • Goto Function • Failure Function • Combining the two • Queries in O(m log|Σ|)
Set Pattern Matching Problem • Find patterns in text • P={P1, P2, ... Pq}, in T • Aho and Corasick solved it in ’75 • Generalized version of KMP • Uses a state machine
Aho Corasick - Example • P = {her, iris, he, is} • [Figure: trie of P with nodes h, i, he, ir, is, her, iri, iris] • Travel along the Goto function, which is a trie of all patterns • If stuck, travel along KMP-style Failure link • When found a pattern, output it
Aho Corasick Definitions • Goto function: a trie of the patterns • Failure function: for each label, the longest proper suffix which is a prefix of a pattern • Like KMP, but a prefix of any pattern qualifies • Output function: patterns ending at this label
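The three functions defined above can be sketched in a few lines. This is a minimal illustration of the classic construction (trie plus breadth-first Failure links), not the linear-time algorithm of this talk; the names `build_automaton` and `search` are chosen here for illustration, and Python dicts stand in for the branching structure.

```python
from collections import deque

def build_automaton(patterns):
    """Classic construction: trie (goto), BFS failure links, merged outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state of the longest proper suffix
    out = [[]]    # out[state] -> patterns ending at this state
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit patterns ending at the suffix
            queue.append(t)
    return goto, fail, out

def search(text, automaton):
    """Return (start index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]               # stuck: follow the Failure link
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

For P = {her, iris, he, is}, searching the text "irisher" reports "iris" and "is" when the scan reaches the first "s", then "he" and "her".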
Classic Aho Corasick – Analysis • Constructed in O(n) (cumulative pattern length) • Answers queries in O(m + k) (m = text length, k = number of occurrences) • ... For constant alphabets only! • For integer alphabets, |Σ| = O(n^c), the bounds change depending on the branching method • List, Array or Search Tree • Recent developments suggest we can do better! • Farach-97; Kärkkäinen & Sanders-03; • Ko & Aluru-03; Kim, Sim, Park & Park-03
Our Results • Our algorithm achieves better results: • Construction in O(n) time, O(n) space • Query in O(m log|Σ|) • Works for integer alphabets, |Σ| = O(n^c)
Algorithm: Goto Function • Sort patterns in time linear in their total length • By building suffix array of Sp=$P1$P2$...$Pq$, and just ignoring non-pattern suffixes • Or by two-pass radix sort, O(D + |Σ|) = O(n) • Paige & Tarjan, ’87; Andersson & Nilsson, ’94 • Now create the trie in lexicographic order • Each node holds a list of children; insert each new node at the end of the list
Example – Goto Function P = {the, than, this, then} Sorting Patterns P’ = {than, the, then, this}
Example – Goto Function • P’ = {than, the, then, this} • [Figure: the trie after inserting than; than, the; than, the, then; than, the, then, this. Nodes: t, th, tha, the, thi, than, then, this]
Example – Goto Function • P’ = {than, the, then, this} • [Figure: node th with children a, e, i leading to tha, the, thi, then than, then, this] • Sorted List, keep the tail
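The insertion step in this example can be sketched as follows. It is a minimal illustration only: Python's built-in `sorted` stands in for the linear-time sort described earlier, and the `Node` class and `build_sorted_trie` name are assumptions made for this sketch.

```python
class Node:
    def __init__(self, char=""):
        self.char = char
        self.children = []   # stays sorted because new nodes go at the tail

def build_sorted_trie(patterns):
    root = Node()
    # sorted() is a stand-in for the linear-time pattern sort
    for p in sorted(patterns):
        node = root
        for ch in p:
            last = node.children[-1] if node.children else None
            if last is not None and last.char == ch:
                node = last                  # shared prefix with an earlier pattern
            else:
                child = Node(ch)
                node.children.append(child)  # larger than every existing child
                node = child
    return root
```

Because the patterns arrive in lexicographic order, only the tail of each child list ever needs to be inspected, so no search inside a node is required during construction.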
Algorithm: Failure Function • We need to construct Failure links on trie • Original algorithm included traversing trie • We found a deep connection between: • Failure function of the patterns, and • Suffix Tree of the reversed patterns • Or Enhanced Suffix Array • Abouelhoda, Kurtz & Ohlebusch-04; Kim, Jeon & Park-04 • We’ll “learn by example”...
Example – Failure Function • P = {he, her, iris, is} • PR = {eh, reh, siri, si} • Failure function: “iris” → “is” • The reverses: “siri”, “si” • “si” is a prefix of “siri” • (with $ so “is” prefix of a pattern) • [Figure: trie of P (h, i, he, ir, is, her, iri, iris) beside the suffix tree of the reversed patterns, with nodes written as reversed label (original label): h (h), i (i), r (r), si (is), eh (he), iri (iri), reh (her), ri (ir), siri (iris), and $ edges below the marked nodes]
Understanding Failure Function • Failure function is defined as: largest suffix, which is a prefix of any pattern • Reverse: “largest suffix” → “largest prefix” • Any prefix of a label will be its ancestor in ST • Largest means nearest • “prefix of pattern” → “suffix of pattern” • It will be a node in the ST, marked by a $ • So: closest ancestor which is marked by $
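The equivalence argued on this slide can be checked by brute force on small inputs. The sketch below computes the Failure target twice, once from the definition and once on the reversed strings; it uses plain sets of strings rather than a suffix tree, and all function names are chosen here for illustration.

```python
def prefixes(patterns):
    """All non-empty prefixes of the patterns, i.e. the trie node labels."""
    return {p[:i] for p in patterns for i in range(1, len(p) + 1)}

def failure_by_definition(label, labels):
    # Longest proper suffix of the label that is itself a trie label.
    for start in range(1, len(label)):
        if label[start:] in labels:
            return label[start:]
    return ""   # the root

def failure_by_reversal(label, labels):
    reversed_labels = {l[::-1] for l in labels}
    rev = label[::-1]
    # Longest proper prefix of the reversed label that is a reversed trie label.
    for end in range(len(rev) - 1, 0, -1):
        if rev[:end] in reversed_labels:
            return rev[:end][::-1]
    return ""   # the root
```

With P = {he, her, iris, is}, both functions map "iris" to "is", matching the example.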
Algorithm: Failure Function • We found a deep connection between: • Failure function of the patterns, and • Suffix Tree of the reversed patterns • We define Sp=$P1$P2$...$Pq$ • We define TR to be the suffix tree of (Sp)R • TR can be built in linear time • Can use Enhanced Suffix Array, ER, instead • Note: TR is a Generalized Suffix Tree • How will we link the trie and TR?
Example – 1-to-1 Mapping • [Figure: trie nodes h, i, he, ir, is, her, iri, iris linked to the $-marked nodes of TR: h (h), i (i), si (is), eh (he), iri (iri), reh (her), ri (ir), siri (iris)] • Note: “r” doesn’t get a link since it’s not marked by a $
Algorithm: Review • Build Goto function (trie) • Sort patterns • Construct trie • Build Failure function • Construct TR • Compute proper ancestor for $-marked nodes • Combine information • Through mapping, create Failure links on trie
Adjustment for Integer Alphabet • We used recent developments (SA, ST) • Constructed Goto: using suffix array • Found a connection between Failure function and suffix trees • Thus, reduced the construction to O(n) • Yet we manage to keep queries at O(m log|Σ|) • Again - how?
Queries in O(m log|Σ|) • We’ve built the trie in O(n) • But each node keeps its children in a sorted list • Searching a list means scanning it, up to |Σ| steps per node • Our simple solution…
Example – Goto Function • P’ = {than, the, then, this} • [Figure: node th, whose child list a, e, i is replaced by an array a, e, i pointing to tha, the, thi, then than, then, this] • Array can be searched in log(#children)
Queries in O(m log|Σ|) • Once the trie is complete • Convert lists in each node to arrays • Array’s size is known; O(n) space overall • Binary search can now be employed • Reduce the time spent in each node to log(# children) = O(log|Σ|) • Can be applied to Suffix Tree built from Suffix Array + LCP
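The conversion and lookup described above might look like this. It is a sketch under stated assumptions: the trie is given as nested dicts, `freeze`, `find_child` and `walk` are names invented here, and `sorted` merely reproduces the order the child lists already have after construction. Python lists are array-backed already, so the tuples only mimic the list-to-array step.

```python
from bisect import bisect_left

def freeze(trie):
    """Turn a dict-of-dicts trie into nested (keys, kids) sorted arrays."""
    keys = tuple(sorted(trie))
    kids = tuple(freeze(trie[k]) for k in keys)
    return keys, kids

def find_child(keys, kids, ch):
    """Binary search among a node's children: O(log #children)."""
    i = bisect_left(keys, ch)
    if i < len(keys) and keys[i] == ch:
        return kids[i]
    return None

def walk(node, word):
    """Follow word from node; return the node reached, or None if stuck."""
    for ch in word:
        keys, kids = node
        node = find_child(keys, kids, ch)
        if node is None:
            return None
    return node
```

Each step of `walk` costs one binary search, so following a string of length m costs O(m log|Σ|) in the worst case.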
The End Thanks!
Algorithm: Combining the two • Build a 1-to-1 mapping between $-marked nodes in TR and trie nodes • We compute mapping through the string: • For each char in Sp, we keep its Goto node • For each suffix tree node, we know what indices it represents (in (Sp)R, and so in Sp) • Now, build Failure links atop the trie • Like we saw in the example
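The mapping "through the string" can be illustrated as follows. This sketch records, for every character position of Sp, the trie node reached there, and then translates a position in (Sp)R back to Sp. It assumes the patterns do not contain "$"; the names `build_trie`, `map_positions` and `node_at` are illustrative, and no suffix tree is built here.

```python
def build_trie(patterns):
    """Trie as a list of dicts; labels[v] is the string spelled by node v."""
    goto, labels = [{}], [""]
    for p in patterns:
        v = 0
        for ch in p:
            if ch not in goto[v]:
                goto.append({})
                labels.append(labels[v] + ch)
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
    return goto, labels

def map_positions(patterns):
    goto, labels = build_trie(patterns)
    sp = "$" + "$".join(patterns) + "$"
    node_at = [None] * len(sp)   # stays None at the '$' separators
    v = 0
    for i, ch in enumerate(sp):
        if ch == "$":
            v = 0                # restart at the root for the next pattern
        else:
            v = goto[v][ch]
            node_at[i] = v       # Goto node reached after reading sp[i]
    return sp, node_at, labels

sp, node_at, labels = map_positions(["he", "her", "iris", "is"])
rev = sp[::-1]
j = 1                                    # a suffix of (Sp)R starting at index j
mirror = len(sp) - 1 - j                 # the same character's index in Sp
suffix_label = rev[j:rev.index("$", j)]  # the suffix, cut at the next '$'
print(suffix_label, labels[node_at[mirror]])
```

The suffix of (Sp)R that starts at index j, cut at the next "$", is the reverse of the label of the trie node stored at the mirrored index in Sp.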
Algorithm: Failure Function • For each node, find its “proper ancestor” • Closest ancestor marked with a $ • Found with a simple preorder traversal • The properties of TR ensure that... • For each failure link v1 → v2 • And their corresponding nodes, u1 and u2 • u2 = proper ancestor of u1 • If we link trie and TR, we find the Failure! • How will we link them?
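The "proper ancestor" computation can be imitated without a suffix tree. In the sketch below, sorting the reversed trie labels plays the role of the preorder traversal over the $-marked nodes, and a stack holds the chain of $-marked ancestors. This is a swapped-in simplification that is not linear time; `failure_links` is a name chosen for illustration.

```python
def failure_links(patterns):
    """Map each trie label to its failure label ("" stands for the root)."""
    labels = {p[:i] for p in patterns for i in range(1, len(p) + 1)}
    failure = {}
    stack = []   # chain of $-marked ancestors of the current node
    # Sorted reversed labels visit the $-marked nodes in preorder.
    for rev in sorted(l[::-1] for l in labels):
        while stack and not rev.startswith(stack[-1]):
            stack.pop()          # leave subtrees that do not contain rev
        failure[rev[::-1]] = stack[-1][::-1] if stack else ""
        stack.append(rev)
    return failure
```

For P = {her, their, eye, iris, he, is} this yields their → ir, iris → is, the → he, he → e and eye → e, while "her" fails to the root.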
Example - automaton - Goto • P = {her, their, eye, iris, he, is} • [Figure: trie of P with nodes e, h, i, t, ey, he, ir, is, th, eye, her, iri, the, iris, thei, their] • Travel along the Goto function, which is a trie of all patterns
Example - TR • P = {her, their, eye, iris, he, is} • [Figure: TR with nodes written as reversed label (original label): e (e), h (h), i (i), r (r), si (is), t (t), ye (ey), eh (he), eye (eye), ht (th), ieht (thei), iri (iri), reh (her), ri (ir), siri (iris), eht (the), rieht (their), and $ edges below the marked nodes]
Example - TR and Failure • P = {her, their, eye, iris, he, is} • [Figure: the trie of P beside TR, with Failure links drawn between corresponding nodes] • Failure links shown: iris → is, the → he, he → e, eye → e, their → ir
TR - Reversed Suffix Tree • We defined Sp=$P1$P2$...$Pq$ • We define TR to be the suffix tree of (Sp)R • This tree has interesting properties: • Each trie node v is represented by exactly one TR node u, so that Label(v) = Label(u)R • In TR, a node’s label is a prefix of its child’s label; in terms of the original (unreversed) strings, it is a suffix • A $-marked node in TR means that the original label is a prefix of a pattern
Example - TR • P = {her, iris, he, is} • [Figure: TR with nodes written as reversed label (original label): e (e), h (h), i (i), r (r), si (is), eh (he), iri (iri), reh (her), ri (ir), siri (iris), and $ edges below the marked nodes]
Example - TR and Failure • P = {her, their, eye, iris, he, is} • We took: • Failure of “their” → “ir” (from “iris”) • Largest suffix, which is a prefix of a pattern • Their reverse strings are “rieht”, “ri” • Now “ri” is a prefix of “rieht”, so it is its ancestor in a suffix tree! • To be a prefix of a pattern px, its reverse should be a suffix of the reverse pattern (px)R • So it will be in the suffix tree, and end with a $