Efficient String Matching: An Aid to Bibliographic Search. Alfred V. Aho and Margaret J. Corasick, Bell Laboratories.
Virus Definition • Each virus has its own distinctive signature • Example in ClamAV • _0017_0001_000=21b8004233c999cd218bd6b90300b440cd218b4c198b541bb80157cd21b43ecd2132ed • _0017_0001_000 is the virus index • Hex(21)=Dec(33)=‘!’ • Match the signature to detect the virus
Regular Expression • Use a regular expression (RE) to describe the signature • ? matches any one char • W32.Hybris.C (Clam)=4000?????????????83??????75f2e9????ffff00000000 • * matches any sequence of chars (including none) • Oror-fam (Clam)=495243*56697275*53455859330f5455*4b617a61*536e617073686f • {n1-n2} means there are between n1 and n2 chars between the two parts • Worm.Bagle.AG-empty (Clam)=6e74656e742d547970653a206170706c69636174696f6e2f6f637465742d73747265616d3b{40-130}2d2d2d2d2d2d2d2d
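The wildcard notation above can be sketched as a translation into an ordinary regular expression over the hex dump of the data. This is only an illustration under stated assumptions, not ClamAV's actual matcher: the names signature_to_regex and matches are hypothetical, "?" is assumed to stand for one hex digit, and "*" and "{n1-n2}" are assumed to count whole bytes (two hex digits each).

```python
import re

HEX = "[0-9a-f]"

def signature_to_regex(signature):
    """Translate the slide's wildcard notation into a compiled regex.

    Assumptions (not taken from ClamAV documentation):
      ?        -> any single hex digit
      *        -> any number of whole bytes, including none
      {n1-n2}  -> between n1 and n2 whole bytes
    Characters that are not part of this notation are ignored.
    """
    parts = []
    for token in re.findall(r"\{\d+-\d+\}|[*?]|[0-9a-fA-F]", signature):
        if token == "?":
            parts.append(HEX)
        elif token == "*":
            parts.append("(?:%s{2})*" % HEX)
        elif token.startswith("{"):
            low, high = token[1:-1].split("-")
            parts.append("(?:%s{2}){%s,%s}" % (HEX, low, high))
        else:
            parts.append(token.lower())
    return re.compile("".join(parts))

def matches(signature, data):
    """Return True if the signature occurs in data (a bytes object)."""
    pattern = signature_to_regex(signature)
    hex_text = data.hex()
    # Try only even offsets so a match always starts on a byte boundary.
    return any(pattern.match(hex_text, i) for i in range(0, len(hex_text), 2))
```

For example, matches("21b8??42*cd21{1-2}ff", bytes.fromhex("0021b8004233c9cd21aaff")) is True, while matches("1b", bytes.fromhex("21b8")) is False because "1b" only appears across a byte boundary. A regex per signature does not scale to many signatures, which is the motivation for the single-pass keyword matcher described in the following slides.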
Introduction • Goal: locate all occurrences of any of a finite number of keywords in a string of text. • The algorithm consists of two parts: • constructing a finite state pattern matching machine from the keywords • using the pattern matching machine to process the text string in a single pass.
Pattern Matching Machine(1) • Our problem is to locate and identify all substrings of x that are keywords in K. • K = {y1, y2, …, yk} is a finite set of strings, which we call keywords. • x is an arbitrary string, which we call the text string. • The behavior of the pattern matching machine is dictated by three functions: a goto function g, a failure function f, and an output function output.
Pattern Matching Machine(2) • g(s,a) = s’ or fail: maps a pair consisting of a state and an input symbol into a state or the message fail. • f(s) = s’: maps a state into a state, and is consulted whenever the goto function reports fail. • output(s) = keywords: associates a set of keywords (possibly empty) with every state.
Pattern Matching Machine Example with keywords {he,she,his,hers}
Start state is state 0. • Let s be the current state and a the current symbol of the input string x. • Operating cycle • If g(s,a)=s’, the machine makes a goto transition: it enters state s’ and the next symbol of x becomes the current input symbol. • If g(s,a)=fail, the machine makes a failure transition: if f(s)=s’, it repeats the cycle with s’ as the current state and a as the current input symbol.
Example • Text: u s h e r s • State sequence: 0 0 3 4 5 (2) 8 9, where state 2 is entered by a failure transition from state 5 on input r • In state 4, since g(4,e)=5, the machine enters state 5, finds the keywords “she” and “he” ending at position four of the text string, and emits output(5).
Example Cont’d • In state 5 on input symbol r, the machine makes two state transitions in its operating cycle. • Since g(5,r)=fail, M enters state 2=f(5) . Then since g(2,r)=8, M enters state 8 and advances to the next input symbol. • No output is generated in this operating cycle.
Algorithm 1. Pattern matching machine.
Input. A text string x = a1 a2 … an where each ai is an input symbol, and a pattern matching machine M with goto function g, failure function f, and output function output, as described above.
Output. Locations at which keywords occur in x.
Method.
    begin
        state ← 0
        for i ← 1 until n do
            begin
                while g(state, ai) = fail do state ← f(state)
                state ← g(state, ai)
                if output(state) ≠ empty then
                    begin
                        print i
                        print output(state)
                    end
            end
    end
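As a minimal sketch of Algorithm 1, the Python below hard-codes the machine for the keywords {he, she, his, hers} using the state numbers from the example slides. The table names GOTO, FAILURE and OUTPUT, the helper g, and the choice of None for fail are illustrative assumptions, not part of the original presentation.

```python
FAIL = None  # assumed representation of the "fail" message

# Goto edges of the example machine, keyed by (state, symbol).
GOTO = {
    (0, "h"): 1, (1, "e"): 2, (2, "r"): 8, (8, "s"): 9,
    (1, "i"): 6, (6, "s"): 7,
    (0, "s"): 3, (3, "h"): 4, (4, "e"): 5,
}
# Failure function for states 1..9 (state 0 never fails).
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
# Output function; states not listed have an empty output set.
OUTPUT = {2: ["he"], 5: ["he", "she"], 7: ["his"], 9: ["hers"]}

def g(state, symbol):
    """Goto function: state 0 loops on every symbol without an edge."""
    if (state, symbol) in GOTO:
        return GOTO[(state, symbol)]
    return 0 if state == 0 else FAIL

def search(text):
    """Return (position, keywords) pairs, with 1-based end positions."""
    state = 0
    hits = []
    for i, a in enumerate(text, start=1):
        while g(state, a) is FAIL:      # failure transitions
            state = FAILURE[state]
        state = g(state, a)             # goto transition
        if OUTPUT.get(state):
            hits.append((i, OUTPUT[state]))
    return hits
```

Calling search("ushers") returns [(4, ["he", "she"]), (6, ["hers"])], which agrees with the worked example: "she" and "he" end at position four, and "hers" ends at position six.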
Construction of the functions • The construction has two parts • First: determine the states and the goto function. • Second: compute the failure function. • The output function is started in the first part and completed in the second.
Construction of Goto function • Construct a goto graph like the one on the next slide. • Add new vertices and edges to the graph, starting at the start state. • Add new edges only when necessary. • Add a loop from state 0 to state 0 on all input symbols other than those that begin a keyword.
Construction of Goto function with keywords {he,she,his,hers}
Algorithm 2. Construction of the goto function.
Input. Set of keywords K = {y1, y2, …, yk}.
Output. Goto function g and a partially computed output function output.
Method. We assume output(s) is empty when state s is first created, and g(s, a) = fail if a is undefined or if g(s, a) has not yet been defined. The procedure enter(y) inserts into the goto graph a path that spells out y.
    begin
        newstate ← 0
        for i ← 1 until k do enter(yi)
        for all a such that g(0, a) = fail do g(0, a) ← 0
    end
    procedure enter(a1 a2 … am):
    begin
        state ← 0; j ← 1
        while g(state, aj) ≠ fail do
            begin
                state ← g(state, aj)
                j ← j + 1
            end
        for p ← j until m do
            begin
                newstate ← newstate + 1
                g(state, ap) ← newstate
                state ← newstate
            end
        output(state) ← { a1 a2 … am }
    end
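Algorithm 2 can be sketched in Python as follows. The function name build_goto and the list-of-dicts representation are assumptions made for illustration; a missing dictionary entry plays the role of fail, and the loop on state 0 is left implicit (a lookup in state 0 that finds no edge is treated as a transition back to 0 by the caller).

```python
def build_goto(keywords):
    """Build the goto graph and the partial output function.

    goto[s] maps an input symbol to the next state; a missing entry
    means "fail" (or, for state 0, the implicit loop back to 0).
    output[s] is the set of keywords that end exactly at state s.
    """
    goto = [{}]          # state 0 is the start state
    output = [set()]
    for keyword in keywords:
        state = 0
        for a in keyword:
            if a not in goto[state]:
                # Add a new state only when no edge exists yet.
                goto.append({})
                output.append(set())
                goto[state][a] = len(goto) - 1
            state = goto[state][a]
        output[state].add(keyword)
    return goto, output
```

With the keywords entered in the order he, she, his, hers, this numbering reproduces the states of the example: goto[0] is {"h": 1, "s": 3}, goto[1] is {"e": 2, "i": 6}, and output[5] is {"she"} at this stage ("he" is merged into it later, when the failure function is computed).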
Construction of Failure function • Depth of s:the length of the shortest path from the start state to state s. • The states of depth d can be determined from the states of depth d-1. • Make f(s)=0 for all states s of depth 1.
Construction of Failure function Cont’d • To compute the failure function for the states of depth d, consider each state r of depth d-1: • 1. If g(r,a)=fail for all a, do nothing. • 2. Otherwise, for each a such that g(r,a)=s, do the following: • a. Set state=f(r). • b. Execute state ← f(state) zero or more times, until a value for state is obtained such that g(state,a)≠fail. • c. Set f(s)=g(state,a).
Algorithm 3. Construction of the failure function.
Input. Goto function g and output function output from Algorithm 2.
Output. Failure function f and output function output.
Method.
    begin
        queue ← empty
        for each a such that g(0, a) = s ≠ 0 do
            begin
                queue ← queue ∪ {s}
                f(s) ← 0
            end
        while queue ≠ empty do
            begin
                let r be the next state in queue
                queue ← queue - {r}
                for each a such that g(r, a) = s ≠ fail do
                    begin
                        queue ← queue ∪ {s}
                        state ← f(r)
                        while g(state, a) = fail do state ← f(state)
                        f(s) ← g(state, a)
                        output(s) ← output(s) ∪ output(f(s))
                    end
            end
    end
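The breadth-first computation of Algorithm 3 can be sketched in Python as below. The names build_goto and build_failure and the list-of-dicts representation are illustrative assumptions; the goto construction is repeated here only so that the example runs on its own.

```python
from collections import deque

def build_goto(keywords):
    """Goto graph as a list of dicts (missing entry = fail)."""
    goto, output = [{}], [set()]
    for keyword in keywords:
        state = 0
        for a in keyword:
            if a not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][a] = len(goto) - 1
            state = goto[state][a]
        output[state].add(keyword)
    return goto, output

def build_failure(goto, output):
    """Compute f by increasing depth and complete the output sets."""
    failure = [0] * len(goto)
    queue = deque(goto[0].values())      # depth-1 states have f = 0
    while queue:
        r = queue.popleft()
        for a, s in goto[r].items():
            queue.append(s)
            state = failure[r]
            # g(0, a) never fails, so the walk always stops at state 0.
            while state != 0 and a not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(a, 0)
            output[s] |= output[failure[s]]
    return failure
```

For the keywords he, she, his, hers this yields the failure values f(4)=1, f(5)=2, f(7)=3, f(9)=3 and 0 for every other state, and output[5] becomes {"he", "she"}.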
About construction • When we determine f(s)=s’, we merge the outputs of state s with the outputs of state s’. • The failure function is not always optimal: if the keyword “his” were not present, the machine could go directly from state 4 to state 0, skipping an unnecessary intermediate transition to state 1. • To avoid such unnecessary transitions, we can use a deterministic finite automaton, which is discussed later.
Properties of Algorithms 1,2,3 • Lemma 1: Suppose that in the goto graph state s is represented by the string u and state t is represented by the string v. Then f(s)=t iff v is the longest proper suffix of u that is also a prefix of some keyword. • Proof: • Suppose u=a1a2…aj, and a1a2…aj-1 represents state r. Let r1,r2,…,rn be the sequence of states such that: 1. r1=f(r); 2. ri+1=f(ri); 3. g(ri,aj)=fail for 1≤i<n; 4. g(rn,aj)=t. • Suppose vi represents state ri. By the inductive hypothesis, v1 is the longest proper suffix of a1a2…aj-1 that is a prefix of some keyword; v2 is the longest proper suffix of v1 that is a prefix of some keyword, and so on. • Thus vn is the longest proper suffix of a1a2…aj-1 such that vnaj is a prefix of some keyword. Since Algorithm 3 sets f(s)=g(rn,aj)=t, the lemma follows.
Properties of Algorithms 1,2,3 • Lemma 2: The set output(s) contains y if and only if y is a keyword that is a suffix of the string representing state s. • Proof: • Let u be the string representing state s, and consider a string y in output(s). • If y is added to output(s) by Algorithm 2, then y=u and y is a keyword. • If y is added to output(s) by Algorithm 3, then y is in output(f(s)). • Conversely, if y is a keyword that is a proper suffix of u, then from the inductive hypothesis and Lemma 1 we know output(f(s)) contains y, and Algorithm 3 adds output(f(s)) to output(s).
Properties of Algorithms 1,2,3 • Lemma 3 : After the jth operating cycle, Algorithm 1 will be in state s iff s is represented by the longest suffix of a1a2…aj that is a prefix of some keyword. • Proof : Similar to Lemma 1. • THEOREM 1 :Algorithms 2 and 3 produce valid goto,failure, and output functions. • Proof : By Lemmas 2 and 3.
Time Complexity of Algorithms 1, 2, and 3 • THEOREM 2: Using the goto, failure and output functions created by Algorithms 2 and 3, Algorithm 1 makes fewer than 2n state transitions in processing a text string of length n. • From a state s of depth d, Algorithm 1 makes at most d failure transitions in one operating cycle. • Each failure transition decreases the depth by at least one and each goto transition increases it by at most one, so the number of failure transitions must be at least one less than the number of goto transitions. • In processing an input of length n, Algorithm 1 makes exactly n goto transitions. Therefore the total number of state transitions is less than 2n.
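The bound in Theorem 2 can be checked empirically with a small sketch that counts goto and failure transitions separately. The function names build_machine and count_transitions, and the list-of-dicts representation of the machine, are assumptions made for this illustration.

```python
from collections import deque

def build_machine(keywords):
    """Build goto, failure and output (Algorithms 2 and 3 combined)."""
    goto, output = [{}], [set()]
    for keyword in keywords:
        state = 0
        for a in keyword:
            if a not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][a] = len(goto) - 1
            state = goto[state][a]
        output[state].add(keyword)
    failure = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for a, s in goto[r].items():
            queue.append(s)
            state = failure[r]
            while state != 0 and a not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(a, 0)
            output[s] |= output[failure[s]]
    return goto, failure, output

def count_transitions(goto, failure, text):
    """Run Algorithm 1 and return (goto_moves, failure_moves)."""
    state = goto_moves = failure_moves = 0
    for a in text:
        while state != 0 and a not in goto[state]:
            state = failure[state]
            failure_moves += 1
        state = goto[state].get(a, 0)    # implicit loop on state 0
        goto_moves += 1
    return goto_moves, failure_moves
```

On the text "ushers" with keywords he, she, his, hers, count_transitions returns (6, 1): six goto transitions (one per input symbol) and a single failure transition from state 5 to state 2, for a total of 7, which is below 2n = 12.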
Time Complexity of Algorithms 1, 2, and 3 • THEOREM 3: Algorithm 2 requires time linearly proportional to the sum of the lengths of the keywords. • Proof: • Straightforward. • THEOREM 4: Algorithm 3 can be implemented to run in time proportional to the sum of the lengths of the keywords. • Proof: • The total number of executions of state ← f(state) is bounded by the sum of the lengths of the keywords. • Using linked lists to represent the output set of a state, we can execute the statement output(s) ← output(s) ∪ output(f(s)) in constant time.
Eliminating Failure Transitions • Use a deterministic finite automaton in Algorithm 1 in place of the goto and failure functions. • The automaton has a next move function δ such that, for each state s and input symbol a, δ(s, a) is a state. • By using the next move function δ, we can dispense with all failure transitions, and make exactly one state transition per input character.
Algorithm 4. Construction of a deterministic finite automaton.
Input. Goto function g from Algorithm 2 and failure function f from Algorithm 3.
Output. Next move function δ.
Method.
    begin
        queue ← empty
        for each symbol a do
            begin
                δ(0, a) ← g(0, a)
                if g(0, a) ≠ 0 then queue ← queue ∪ {g(0, a)}
            end
        while queue ≠ empty do
            begin
                let r be the next state in queue
                queue ← queue - {r}
                for each symbol a do
                    if g(r, a) = s ≠ fail then
                        begin
                            queue ← queue ∪ {s}
                            δ(r, a) ← s
                        end
                    else δ(r, a) ← δ(f(r), a)
            end
    end
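A minimal Python sketch of Algorithm 4 follows, using hard-coded goto and failure tables for the keywords {he, she, his, hers}. The names GOTO, FAILURE, build_delta and run are illustrative assumptions. The sketch also assumes that the alphabet can be restricted to the symbols that occur in the keywords, with every other symbol leading to state 0.

```python
from collections import deque

# Goto and failure tables of the example machine (states 0..9).
GOTO = [
    {"h": 1, "s": 3}, {"e": 2, "i": 6}, {"r": 8}, {"h": 4}, {"e": 5},
    {}, {"s": 7}, {}, {"s": 9}, {},
]
FAILURE = [0, 0, 0, 0, 1, 2, 0, 3, 0, 3]

def build_delta(goto, failure):
    """Compute the next move function as a list of dicts."""
    alphabet = {a for edges in goto for a in edges}
    delta = [dict() for _ in goto]
    queue = deque()
    for a in alphabet:
        delta[0][a] = goto[0].get(a, 0)
        if delta[0][a] != 0:
            queue.append(delta[0][a])
    while queue:
        r = queue.popleft()
        for a in alphabet:
            if a in goto[r]:
                s = goto[r][a]
                queue.append(s)
                delta[r][a] = s
            else:
                # f(r) is shallower than r, so its row is already filled.
                delta[r][a] = delta[failure[r]][a]
    return delta

def run(delta, text):
    """Return the state entered after each input symbol."""
    state, trace = 0, []
    for a in text:
        state = delta[state].get(a, 0)   # symbols outside the alphabet
        trace.append(state)
    return trace
```

Here run(build_delta(GOTO, FAILURE), "ushers") returns [0, 3, 4, 5, 8, 9]: exactly one transition per input symbol, with the move from state 5 to state 8 on r taken directly instead of passing through state 2.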
Fig. 3. Next move function. ("." stands for any other input symbol.)
    state       input symbol → next state
    0           h → 1, s → 3, . → 0
    1           e → 2, i → 6, h → 1, s → 3, . → 0
    3, 7, 9     h → 4, s → 3, . → 0
    2, 5        r → 8, h → 1, s → 3, . → 0
    6           s → 7, h → 1, . → 0
    4           e → 5, i → 6, h → 1, s → 3, . → 0
    8           s → 9, h → 1, . → 0
Conclusion • The approach is attractive when the number of keywords is large, since all keywords can be matched simultaneously in one pass. • Using the next move function • can potentially reduce the number of state transitions by 50%, at the cost of more memory. • The machine spends most of its time in state 0, from which there are no failure transitions.
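Goto graph for keywords {he,she,his,hers}: 0 -h-> 1 -e-> 2 -r-> 8 -s-> 9; 1 -i-> 6 -s-> 7; 0 -s-> 3 -h-> 4 -e-> 5; state 0 loops to itself on every symbol other than h and s.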