Regular Expression: Syntax for Specifying String Patterns • Basic Alphabet • empty string: ε • any symbol a in the input symbol set Σ • Basic Operators • disjunction (OR, union): s | t • concatenation (AND): s t • closure (repetition): s* • Extended operators: • ?, +, [a-z], {m,n}, escape, meta-symbols, registers • Chomsky Hierarchy: • regular set (R.E.) • context-free • context-sensitive • recursively enumerable (Turing Machine) Jing-Shin Chang
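The operators listed on this slide can be tried out with Python's standard `re` module. This is only an illustrative sketch: the helper name `matches` and the sample patterns are made up for the example, and Python's notation is just one concrete syntax for these operators.

```python
import re


def matches(pattern, text):
    """Return True if the whole of `text` is in the language of `pattern`."""
    return re.fullmatch(pattern, text) is not None


# Basic operators
print(matches(r"a|b", "b"))          # disjunction: s | t
print(matches(r"ab", "ab"))          # concatenation: s t
print(matches(r"(ab)*", ""))         # closure also accepts the empty string
print(matches(r"(ab)*", "ababab"))   # ... and any number of repetitions

# Extended operators
print(matches(r"colou?r", "color"))  # ?     : zero or one occurrence
print(matches(r"a+", "aaa"))         # +     : one or more, i.e. a a*
print(matches(r"[a-c]x", "bx"))      # [a-c] : shorthand for a|b|c
print(matches(r"a{2,3}", "aa"))      # {m,n} : between m and n repetitions
print(matches(r"a\*", "a*"))         # escape turns the meta-symbol * into a literal

# Registers (capture groups) remember the text matched by a sub-pattern
m = re.fullmatch(r"(\w+)@(\w+)", "user@host")
print(m.group(1), m.group(2))
```

Every call to `matches` above returns `True`, and the last line prints `user host`.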
Regular Expression: Syntax for Specifying String Patterns • Applications: • wildcard characters (shell commands, filename expansion) • string pattern matching (grep, awk) • search engine (keyword matching, fuzzy match) • string pattern editing/processing (sed, vi, tr)
Recognition of Regular Expression • Finite (State) Automata • Definition: • a set of states: S • a set of input symbols: Σ (the input symbol alphabet) • a transition (move) function: δ(s,a) = s’ • initial (start) state: s0 • a set of final (accepting) states: F • Implementation: • state transition table • Deterministic (DFA) • exactly one transition for every state on every input symbol • Non-deterministic (NFA) • more than one transition for at least one state on some input symbol
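As a rough illustration of the five-part definition and the state-transition-table implementation, the sketch below stores a DFA in a small Python class. The class name `DFA`, its field names, and the example automaton (a DFA for `(a|b)*abb`) are assumptions chosen for the example, not part of the slides.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    states: set      # S
    alphabet: set    # input symbol alphabet
    delta: dict      # transition table: (state, symbol) -> next state
    start: int       # s0
    finals: set      # F

    def accepts(self, text):
        """Run the table over the whole input and test the stopping state."""
        state = self.start
        for symbol in text:
            if (state, symbol) not in self.delta:
                return False          # symbol outside the alphabet
            state = self.delta[(state, symbol)]
        return state in self.finals


# Hypothetical example: a DFA for (a|b)*abb with states 0..3
ABB_DFA = DFA(
    states={0, 1, 2, 3},
    alphabet={"a", "b"},
    delta={
        (0, "a"): 1, (0, "b"): 0,
        (1, "a"): 1, (1, "b"): 2,
        (2, "a"): 1, (2, "b"): 3,
        (3, "a"): 1, (3, "b"): 0,
    },
    start=0,
    finals={3},
)

print(ABB_DFA.accepts("aabb"))   # True
print(ABB_DFA.accepts("abab"))   # False
```

The table has exactly one entry per (state, symbol) pair, which is what makes the automaton deterministic.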
Recognition of Regular Expression • Simulating Deterministic Finite Automata (DFA) • initialization: current_state = s0; input_symbol = 1st symbol • while (current_state not in final_states && input_symbol != EOF) • next_state = δ(current_state, input_symbol) • current_state = next_state • input_symbol = next_input_symbol • Simulating Non-deterministic Finite Automata (NFA) • Backtrack/Backup: • remember the next alternative configuration (current input & next alternative state) whenever alternative choices are possible • Parallelism: • trace all possible alternatives in parallel • Look-ahead: • look at more input symbols to make the choice deterministic
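Two of the NFA simulation strategies above, backtracking and parallelism, can be sketched as follows. The NFA used here (for `(a|b)*abb`, without ε-moves) and the names `NFA_DELTA`, `accepts_parallel`, and `accepts_backtrack` are assumptions made for the illustration.

```python
# Hypothetical NFA for (a|b)*abb: state 0 has two choices on "a"
NFA_DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
FINALS = {3}


def accepts_parallel(text):
    """Parallelism: track the set of all states reachable so far."""
    current = {START}
    for symbol in text:
        current = {t for s in current for t in NFA_DELTA.get((s, symbol), ())}
        if not current:
            return False              # every alternative has died
    return bool(current & FINALS)


def accepts_backtrack(text, state=START, pos=0):
    """Backtracking: try one alternative, fall back to the next on failure."""
    if pos == len(text):
        return state in FINALS
    alternatives = sorted(NFA_DELTA.get((state, text[pos]), ()))
    return any(accepts_backtrack(text, nxt, pos + 1) for nxt in alternatives)


for sample in ["abb", "aabb", "ab"]:
    print(sample, accepts_parallel(sample), accepts_backtrack(sample))
```

Both functions give the same answers; the parallel version reads each input symbol once, while the backtracking version may revisit the same input position several times.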
Constructing Automata from R.E. • R.E. => NFA (Thompson’s construction) => DFA => State Minimization • decompose the R.E. into basic alphabet symbols & operators • construct an FA for each basic symbol • merge the FAs according to each operator • R.E. => DFA: state_transition <=> position transition in pattern • annotate RE symbols with position labels • get the syntax tree of the annotated pattern • compute {nullable, firstpos, lastpos} • compute follow(i) • s0 = firstpos(root) • construct the transition function according to follow(i)
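The direct R.E. => DFA steps can be sketched for the annotated pattern `(a|b)*abb#`, where `#` is an end marker at position 6. The tuple encoding of the syntax tree and the names `analyze`, `build_dfa`, `accepts`, `TREE`, and `SYMBOL_AT` are assumptions made for this example; it is a minimal sketch rather than a full regex compiler.

```python
from collections import defaultdict


def sym(char, pos):
    """Leaf of the syntax tree: a symbol annotated with its position label."""
    return ("sym", char, pos)


def analyze(node, follow):
    """Return (nullable, firstpos, lastpos) of node and fill follow(i)."""
    kind = node[0]
    if kind == "sym":
        return False, {node[2]}, {node[2]}
    if kind == "star":
        _, first, last = analyze(node[1], follow)
        for p in last:                     # repetition loops back to the start
            follow[p] |= first
        return True, first, last
    n1, f1, l1 = analyze(node[1], follow)
    n2, f2, l2 = analyze(node[2], follow)
    if kind == "or":
        return n1 or n2, f1 | f2, l1 | l2
    for p in l1:                           # kind == "cat"
        follow[p] |= f2
    first = (f1 | f2) if n1 else f1
    last = (l1 | l2) if n2 else l2
    return n1 and n2, first, last


def build_dfa(tree, symbol_at, end_pos):
    """DFA states are sets of positions; s0 = firstpos(root)."""
    follow = defaultdict(set)
    _, first, _ = analyze(tree, follow)
    start = frozenset(first)
    delta, todo, seen = {}, [start], {start}
    while todo:
        state = todo.pop()
        for s in {symbol_at[p] for p in state if p != end_pos}:
            target = frozenset(
                q for p in state if p != end_pos and symbol_at[p] == s
                for q in follow[p])
            delta[(state, s)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    finals = {state for state in seen if end_pos in state}
    return start, delta, finals


def accepts(text, start, delta, finals):
    state = start
    for ch in text:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in finals


# (a|b)*abb# with positions a=1, b=2, a=3, b=4, b=5, #=6
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
TREE = ("cat", ("cat", ("cat", ("cat",
        ("star", ("or", sym("a", 1), sym("b", 2))),
        sym("a", 3)), sym("b", 4)), sym("b", 5)), sym("#", 6))

START, DELTA, FINALS = build_dfa(TREE, SYMBOL_AT, 6)
print(sorted(START))                          # [1, 2, 3]
print(accepts("aabb", START, DELTA, FINALS))  # True
```

For this pattern the construction yields four DFA states, {1,2,3}, {1,2,3,4}, {1,2,3,5} and {1,2,3,6}, with the last one accepting because it contains the end-marker position.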
R.E. and Pattern Matching • Naïve Pattern Matching: • Specify the pattern with a regular expression R.E. for each keyword • Construct an FA for each such R.E., and conduct left-to-right matching: • DFA = State_Transition_Table = Construct_DFA(R.E.) • while (input_pointer != EOF) • stop_state = recognize(input_pointer, DFA) • on failure (stop_state not in final_states): move the input pointer forward by one character • on success (stop_state in final_states): output the matching status & skip over the matched pattern • Why Is It Slow? • multiple keywords are matched multiple times • for each keyword, on failure the input pointer moves back to the character after the start of the last attempt and the FA is reset to its initial state, even though part of the recently matched string might repeat a prefix of the pattern • in most applications failure is far more probable than success (a match succeeds only a few times) • most of the time, therefore, the next matching session starts just one character after the starting position of the previous attempt
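The naïve loop above can be sketched as follows. To keep the example short, `str.startswith` stands in for running a per-keyword DFA from the current input pointer; that substitution, the function name `naive_match`, and the sample keywords are assumptions for the illustration.

```python
def naive_match(text, keywords):
    """Left-to-right matching that restarts from scratch at every position."""
    keywords = [k for k in keywords if k]   # ignore empty keywords
    matches = []
    i = 0
    while i < len(text):
        # Try every keyword again from position i (the repeated work).
        hit = next((k for k in keywords if text.startswith(k, i)), None)
        if hit is None:
            i += 1                          # failure: advance one character
        else:
            matches.append((i, hit))        # success: report the match ...
            i += len(hit)                   # ... and skip over it
    return matches


print(naive_match("ushers", ["he", "she", "his", "hers"]))   # [(1, 'she')]
```

Each failed attempt throws away everything learned about the text, so the work grows with both the number of keywords and the number of starting positions. Skipping over a match also hides the overlapping occurrences of `he` and `hers` inside `ushers`.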
R.E. and Pattern Matching • RE vs. Pattern Matching • R.E. <=> FA for recognizing one of a set of keywords/patterns in the input string • say “yes” if the input string is in Lang(R.E.) (the regular language for the expression) • Pattern Matching (PM): recognizing the occurrence of any keyword/pattern specified by a regular expression within a text document • specify the pattern/keywords with an RE • output all occurrences, in addition to saying yes/no
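The difference between recognition and pattern matching can be seen with Python's standard `re` module. The keyword set and the function names `recognize` and `find_all` are assumptions made for this sketch.

```python
import re

# RE = K1 | K2 | K3 | K4 for four sample keywords
KEYWORDS = re.compile(r"he|she|his|hers")


def recognize(string):
    """Recognition: is the whole string in Lang(RE)? Answers yes/no only."""
    return KEYWORDS.fullmatch(string) is not None


def find_all(text):
    """Pattern matching: report where keywords occur inside a longer text."""
    return [(m.start(), m.group()) for m in KEYWORDS.finditer(text)]


print(recognize("hers"))            # True
print(recognize("ushers"))          # False
print(find_all("ushers and his"))   # [(1, 'she'), (11, 'his')]
```

Note that `finditer` reports non-overlapping matches only, so the `he` and `hers` hidden inside `ushers` are not listed here.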
R.E. and Pattern Matching • Formal Method for Pattern Matching (PM) • Constructing an FA for (single/multi-keyword) PM is equivalent to constructing an FA that recognizes the regular expression PM = (.* | RE)*, and outputting a keyword upon visiting a final state of the original FA for recognizing RE • RE = K1 | K2 | K3 | … | Kn (the regular expression for all specified keywords) • “.” : any character not in K1 ~ Kn • “.*”: unspecified patterns (or unknown keywords) • Constructing FA1 for recognizing RE = K1 | K2 | … | Kn • equivalent to merging the prefixes of the keywords to avoid redundant forward matching => TRIE lexicon tree = a DFA for RE • Constructing FA2 for recognizing PM = (.*|RE)* • extending FA1 by (a) including ‘unknown keywords’ and (b) introducing epsilon-moves from the original final states to the original initial state • on matching failure, redundant backward matching can be avoided if a sub-string preceding the current input pointer is the prefix of another keyword • failure function: the state (in the TRIE) to back off to on failure (!= initial state if the above-mentioned sub-string exists and is non-null) • epsilon-moves & the failure function make FA2 an NFA, whose DFA counterpart can be simulated by backtracking
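The TRIE plus failure function described above can be sketched in the style of the Aho-Corasick multi-keyword matcher. The list-based representation and the names `build_automaton` and `search` are assumptions made for this example, and non-empty keywords are assumed.

```python
from collections import deque


def build_automaton(keywords):
    """Build the keyword TRIE (goto), the failure function, and the outputs."""
    goto = [{}]        # goto[state][char] -> next state; state 0 is the root
    output = [[]]      # keywords recognized on entering each state
    for word in keywords:
        state = 0
        for ch in word:                    # shared prefixes share TRIE states
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    fail = [0] * len(goto)                 # depth-1 states fail to the root
    queue = deque(goto[0].values())
    while queue:                           # breadth-first over the TRIE
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, keywords):
    """Scan text once, never moving the input pointer backward."""
    goto, fail, output = build_automaton(keywords)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]            # back off instead of restarting
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits


print(search("ushers", ["he", "she", "his", "hers"]))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

After reading `she`, the failure function sends the automaton to the TRIE state for `he` rather than to the root, which is how `hers` is still found without re-reading any input.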
R.E. and Fast Methods for Pattern Matching • Fast Single Keyword Matching [KMP - Knuth, Morris & Pratt 1977] • Reference: [Aho et al. 1986, Ex. 3.26-3.27] • keyword => state_transition_table • reduce the repeated matching suggested by the keyword pattern • failure function: where to back off to on failure • Fast Multiple Keyword Matching [AC (Aho-Corasick), Cherry 1982] • Reference: [Aho, Ex. 3.31-32] • keywords => TRIE (state_transition_table) • reduce the repeated matching suggested by the TRIE of the keywords • TRIE • failure function • Boyer & Moore [1977] • Harrison [1971]: Hashing Method
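For the single-keyword case, the failure function and the scan it enables can be sketched as below, in the spirit of KMP. The function names `failure_function` and `kmp_search` and the 0-based indexing are choices made for this illustration.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]               # back off to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every (possibly overlapping) occurrence."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    hits, k = [], 0                       # k = pattern characters matched so far
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]               # reuse what was already matched
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]               # keep going for overlapping matches
    return hits


print(failure_function("ababaa"))   # [0, 0, 1, 2, 3, 1]
print(kmp_search("ababa", "aba"))   # [0, 2]
```

The input pointer `i` only moves forward; on a mismatch only the pattern position `k` backs off, so the scan is linear in the text length plus the pattern length.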