
Regular Expression: Syntax for Specifying String Patterns



Presentation Transcript


  1. Regular Expression: Syntax for Specifying String Patterns
     • Basic Alphabet
       • empty string: ε
       • any symbol a in the input symbol set Σ
     • Basic Operators
       • disjunction (OR, union): s | t
       • concatenation (AND): s t
       • closure (repetition): s*
     • Extended operators:
       • ?, +, [a-z], {m,n}, escape, meta-symbols, registers
     • Chomsky Hierarchy:
       • regular set (R.E.)
       • context-free
       • context-sensitive
       • recursively enumerable (Turing Machine)
     Jing-Shin Chang
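The operators listed on this slide can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the table name `EXAMPLES` and the helper `check_all` are illustrative and not part of the slides. "Registers" is read here as capture groups with back-references, which is an assumption about the slide's intent; note that back-references go beyond what a strictly regular expression can describe.

```python
import re

# (pattern, input string, expected result of a full match)
EXAMPLES = [
    (r"a|b", "b", True),         # disjunction: s | t
    (r"a|b", "ab", False),
    (r"ab", "ab", True),         # concatenation: s t
    (r"a*", "", True),           # closure: zero or more repetitions
    (r"a*", "aaa", True),
    (r"ab?", "a", True),         # ? : optional symbol
    (r"a+", "", False),          # + : one or more repetitions
    (r"[a-z]", "q", True),       # character class
    (r"[a-z]", "Q", False),
    (r"a{2,3}", "aa", True),     # bounded repetition {m,n}
    (r"a{2,3}", "aaaa", False),
    (r"a\.b", "a.b", True),      # escape: a literal dot
    (r"a\.b", "axb", False),
    (r"(a|b)\1", "aa", True),    # register (capture group) reused by \1
    (r"(a|b)\1", "ab", False),
]

def check_all():
    """Return True if every example behaves as the table claims."""
    return all(
        (re.fullmatch(pattern, text) is not None) == expected
        for pattern, text, expected in EXAMPLES
    )

if __name__ == "__main__":
    print(check_all())
```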

  2. Regular Expression: Syntax for Specifying String Patterns
     • Applications:
       • wildcard characters (shell commands, filename expansion)
       • string pattern matching (grep, awk)
       • search engines (keyword matching, fuzzy match)
       • string pattern editing/processing (sed, vi, tr)

  3. Recognition of Regular Expression
     • Finite (State) Automata
       • Definition:
         • a set of states: S
         • a set of input symbols: Σ (the input symbol alphabet)
         • a transition (move) function: δ(s, a) = s’
         • initial (start) state: s0
         • a set of final (accepting) states: F
       • Implementation:
         • state transition table
       • Deterministic (DFA)
         • exactly one transition for each state on each input symbol
       • Non-deterministic (NFA)
         • more than one transition for at least one state on some input symbol
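As an illustration of the state transition table, here is a minimal sketch of a DFA stored as a Python dictionary. The language chosen, (a|b)*abb, and the names `START`, `FINAL`, `DELTA` and `accepts` are assumptions made for the example; the slides do not prescribe any particular encoding.

```python
# DFA for the language (a|b)*abb over the alphabet {a, b}.
START = 0                      # initial state s0
FINAL = {3}                    # set of accepting states F
DELTA = {                      # transition function as a table: (state, symbol) -> state
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}

def accepts(text):
    """Run the DFA over the whole input and report whether it ends in a final state."""
    state = START
    for ch in text:
        if (state, ch) not in DELTA:   # symbol outside the alphabet: reject
            return False
        state = DELTA[(state, ch)]
    return state in FINAL

if __name__ == "__main__":
    for word in ["abb", "babb", "ab", "abc"]:
        print(word, accepts(word))
```

Because every (state, symbol) pair has exactly one entry, the table is deterministic; an NFA would need a set of target states per entry.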

  4. Recognition of Regular Expression
     • Simulating Deterministic Finite Automata (DFA)
       • initialization: current_state = s0; input_symbol = 1st symbol
       • while (current_state not in final_states && input_symbol != EOF)
         • current_state = δ(current_state, input_symbol)
         • input_symbol = next_input_symbol
     • Simulating Non-deterministic Finite Automata (NFA)
       • Backtrack/Backup:
         • remember the next alternative configuration (current input & next alternative state) whenever alternative choices are possible
       • Parallelism:
         • trace all possible alternatives in parallel
       • Look-ahead:
         • look at more input symbols to make the choice deterministic
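The "parallelism" strategy for NFA simulation can be sketched by tracking the set of all states the automaton could be in. This is a minimal sketch under assumptions: the dictionary encoding, the use of `None` as the ε label, and the names `closure` and `nfa_accepts` are all illustrative choices, not taken from the slides.

```python
EPS = None  # label used for epsilon-moves in the tables below (an assumed convention)

# NFA for (a|b)*abb: state 0 has two choices on "a".
NFA_ABB = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

# NFA for a?b, using an epsilon-move to skip the optional "a".
NFA_OPT = {
    (0, EPS): {1},
    (0, "a"): {1},
    (1, "b"): {2},
}

def closure(states, nfa):
    """Return all states reachable from `states` through epsilon-moves only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def nfa_accepts(text, nfa, start, finals):
    """Trace every alternative in parallel by keeping a set of current states."""
    current = closure({start}, nfa)
    for ch in text:
        moved = set()
        for s in current:
            moved |= nfa.get((s, ch), set())
        current = closure(moved, nfa)
        if not current:            # every alternative has died
            return False
    return bool(current & finals)

if __name__ == "__main__":
    print(nfa_accepts("aabb", NFA_ABB, 0, {3}))
    print(nfa_accepts("b", NFA_OPT, 0, {2}))
```

Compared with backtracking, this approach never re-reads input: each symbol is consumed once, at the cost of a set operation per step.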

  5. Constructing Automata from R.E.
     • R.E. => NFA (Thompson’s construction) => DFA => State Minimization
       • decompose the R.E. into basic alphabet symbols & operators
       • construct an FA for each basic alphabet symbol
       • merge the FAs according to each operator
     • R.E. => DFA: state transition <=> position transition in the pattern
       • annotate the RE symbols with position labels
       • get the syntax tree of the annotated pattern
       • compute {nullable, firstpos, lastpos}
       • compute follow(i)
       • s0 = firstpos(root)
       • construct the transition function according to follow(i)
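The direct R.E. => DFA steps can be sketched on a small example. The code below is an illustration under assumptions: the syntax tree for (a|b)*abb# is written by hand as nested tuples rather than parsed, "#" is used as an end marker at position 6, and the names `sym`, `analyze`, `build_dfa`, `TREE` and `SYMBOL_AT` are hypothetical.

```python
from collections import defaultdict

def sym(ch, pos):
    """Leaf node: one pattern symbol annotated with its position label."""
    return ("sym", ch, pos)

def analyze(node, follow):
    """Return (nullable, firstpos, lastpos) for node; fill follow[i] as a side effect."""
    kind = node[0]
    if kind == "sym":
        return False, {node[2]}, {node[2]}
    if kind == "star":
        _, first, last = analyze(node[1], follow)
        for p in last:                      # after a repetition, the body may start again
            follow[p] |= first
        return True, first, last
    n1, f1, l1 = analyze(node[1], follow)
    n2, f2, l2 = analyze(node[2], follow)
    if kind == "or":
        return n1 or n2, f1 | f2, l1 | l2
    # kind == "cat": whatever ends the left part is followed by what starts the right part
    for p in l1:
        follow[p] |= f2
    first = f1 | f2 if n1 else f1
    last = l1 | l2 if n2 else l2
    return n1 and n2, first, last

def build_dfa(root, symbol_at, end_pos):
    """DFA states are sets of positions; s0 = firstpos(root)."""
    follow = defaultdict(set)
    _, first, _ = analyze(root, follow)
    start = frozenset(first)
    trans, todo, seen = {}, [start], {start}
    while todo:
        state = todo.pop()
        for ch in {symbol_at[p] for p in state if p != end_pos}:
            target = frozenset().union(
                *(follow[p] for p in state if symbol_at[p] == ch))
            trans[(state, ch)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    accepting = {s for s in seen if end_pos in s}
    return start, trans, accepting, follow

def dfa_accepts(text, start, trans, accepting):
    state = start
    for ch in text:
        if (state, ch) not in trans:
            return False
        state = trans[(state, ch)]
    return state in accepting

# Annotated pattern: (a1 | b2)* a3 b4 b5 #6
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}
TREE = ("cat", ("cat", ("cat", ("cat",
        ("star", ("or", sym("a", 1), sym("b", 2))),
        sym("a", 3)), sym("b", 4)), sym("b", 5)), sym("#", 6))

if __name__ == "__main__":
    start, trans, accepting, follow = build_dfa(TREE, SYMBOL_AT, 6)
    print(sorted(start), dfa_accepts("aabb", start, trans, accepting))
```

No state minimization is attempted here; the sketch stops once the transition function has been derived from follow(i).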

  6. R.E. and Pattern Matching
     • Naïve Pattern Matching:
       • Specify the pattern with a regular expression (R.E.) for each keyword
       • Construct an FA for each such R.E., and conduct left-to-right matching:
         • DFA = State_Transition_Table = Construct_DFA(R.E.)
         • while (input_pointer != EOF)
           • stop_state = recognize(input_pointer, DFA)
           • on failure (stop_state not in final_states): move the input pointer forward by one character
           • on success (stop_state in final_states): output the matching status & skip over the matched pattern
     • Why Is It Slow?
       • multiple keywords are matched multiple times
       • for each keyword, on failure the input pointer moves back to the character just after the position where the last matching attempt began, and the automaton is reset to its initial state, even though some repeated pattern might appear in the recently matched partial string
       • in most applications the probability of failure is significantly larger than the probability of a successful match (a match succeeds only a few times)
       • most of the time, the next matching session therefore starts with the input pointer just one character past the starting position of the previous attempt
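The naïve loop can be sketched as follows. As a simplifying assumption, `str.startswith` stands in for running one DFA per keyword from the current input position; the function name `naive_search` is illustrative.

```python
def naive_search(text, keywords):
    """Left-to-right matching: restart every keyword at each input position."""
    keywords = [k for k in keywords if k]      # ignore empty keywords
    matches = []
    i = 0
    while i < len(text):
        # Stand-in for recognize(input_pointer, DFA), tried once per keyword.
        hit = next((k for k in keywords if text.startswith(k, i)), None)
        if hit is None:
            i += 1                             # failure: advance by one character
        else:
            matches.append((i, hit))           # success: report the match ...
            i += len(hit)                      # ... and skip over the matched pattern
    return matches

if __name__ == "__main__":
    print(naive_search("ushers", ["he", "she", "his", "hers"]))
```

Every failed attempt discards what was learned about the characters just read, which is the redundancy the slide describes. Skipping over a matched keyword also means overlapping occurrences (here "he" and "hers" inside "ushers") are not reported.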

  7. R.E. and Pattern Matching
     • RE vs. Pattern Matching
       • R.E. <=> FA for recognizing one of a set of keywords/patterns as the input string
         • say “yes” if the input string is in Lang(R.E.) (the regular language for the expression)
       • Pattern Matching (PM): recognizing the occurrences of any keyword/pattern specified by a regular expression within a text document
         • specify the patterns/keywords with an RE
         • output all occurrences, in addition to saying yes/no
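The distinction between recognition and pattern matching can be seen with Python's standard `re` module. This is only an illustration: `re.fullmatch` plays the role of the yes/no recognizer and `re.finditer` the role of the pattern matcher; the helper names `recognizes` and `occurrences` are hypothetical.

```python
import re

KEYWORDS_RE = re.compile(r"he|she|his|hers")   # RE = K1 | K2 | K3 | K4

def recognizes(text):
    """Recognition: is the whole input string in Lang(RE)?"""
    return KEYWORDS_RE.fullmatch(text) is not None

def occurrences(text):
    """Pattern matching: where do keywords occur inside a longer text?"""
    return [(m.start(), m.group()) for m in KEYWORDS_RE.finditer(text)]

if __name__ == "__main__":
    print(recognizes("she"), recognizes("ushers"))
    print(occurrences("ushers"))
```

Note that `re.finditer` returns non-overlapping matches only, so it reports fewer occurrences than an automaton that outputs every keyword ending at each position.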

  8. R.E. and Pattern Matching
     • Formal Method for Pattern Matching (PM)
       • Constructing an FA for (single/multi-keyword) PM is equivalent to constructing an FA that recognizes the regular expression PM = (.* | RE)*, and outputting a keyword upon visiting a final state of the original FA for recognizing RE
         • RE = K1 | K2 | K3 | … | Kn (the regular expression for all specified keywords)
         • “.”: any character not in K1 ~ Kn
         • “.*”: unspecified patterns (or unknown keywords)
       • Constructing FA1 for recognizing RE = K1 | K2 | … | Kn
         • equivalent to merging the prefixes of the keywords to avoid redundant forward matching => TRIE lexicon tree = a DFA for RE
       • Constructing FA2 for recognizing PM = (.* | RE)*
         • extend FA1 by (a) including ‘unknown keywords’ and (b) introducing epsilon-moves from the original final states to the original initial state
         • on matching failure, redundant backward matching can be avoided if a sub-string preceding the current input pointer is the prefix of another keyword
         • failure function: the state (in the TRIE) to back off to on failure (!= the initial state if the above-mentioned sub-string exists and is non-null)
         • epsilon-moves & the failure function make FA2 an NFA, whose DFA counterpart can be simulated by backtracking
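The TRIE plus failure function can be sketched in the style of the Aho-Corasick algorithm. This is a minimal sketch, not the construction from the slides verbatim: states are list indices with state 0 as the root, and the names `build_automaton`, `search`, `goto`, `fail` and `out` are illustrative.

```python
from collections import deque

def build_automaton(keywords):
    """Build the keyword TRIE (goto), the failure function, and per-state outputs."""
    goto, fail, out = [{}], [0], [[]]
    for word in keywords:
        state = 0
        for ch in word:                        # shared prefixes share TRIE states
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())            # depth-1 states fail to the root
    while queue:                               # breadth-first, so shallower states come first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)     # longest proper suffix that is a keyword prefix
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def search(text, keywords):
    """Report (start_index, keyword) for every occurrence, reading each character once."""
    goto, fail, out = build_automaton(keywords)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]                # back off instead of rewinding the input
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits

if __name__ == "__main__":
    print(search("ushers", ["he", "she", "his", "hers"]))
```

On failure the automaton moves to a state other than the root whenever the text just read ends with a prefix of another keyword, so the input pointer never moves backward.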

  9. R.E. and Fast Methods for Pattern Matching
     • Fast Single Keyword Matching [KMP - Knuth, Morris & Pratt 1977]
       • Reference: [Aho et al. 1986, Ex. 3.26-3.27]
       • keyword => state_transition_table
       • reduce the repeated matching suggested by the keyword pattern
       • failure function: where to back off to on failure
     • Fast Multiple Keyword Matching [AC, Cherry 1982]
       • Reference: [Aho, Ex. 3.31-32]
       • keywords => TRIE (state_transition_table)
       • reduce the repeated matching suggested by the TRIE of the keywords
         • TRIE
         • failure function
     • Boyer & Moore [1977]
     • Harrison [1971]: Hashing Method
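For the single-keyword case, the failure function and its use can be sketched as below. This is a common textbook formulation of KMP using 0-based indices; the names `failure_function` and `kmp_search` are illustrative.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]                # back off to the next shorter candidate prefix
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start index of every (possibly overlapping) occurrence of pattern."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    hits, k = [], 0                        # k = number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]                # reuse the matched prefix; the input never rewinds
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

if __name__ == "__main__":
    print(failure_function("abababaab"))
    print(kmp_search("abababab", "abab"))
```

The multiple-keyword method generalizes the same idea from a single pattern to a TRIE of patterns.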
