Foundations of Software Design Lecture 22: Regular Expressions and Finite Automata Marti Hearst Fall 2002
Regular Expressions
• Language = set of strings
• A language is defined by a regular expression: it is the set of strings that match the expression.

Regular Exp.      Corresponding Set of Strings
ε                 {""}
a                 {"a"}
(ab)*             {"", "ab", "abab", "ababab", ...}
a | b | c         {"a", "b", "c"}
(a | b | c)*      {"", "a", "b", "c", "aa", "ab", ..., "bccabb", ...}

Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
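The idea that a regular expression defines a set of strings can be checked mechanically. The sketch below uses Python's `re.fullmatch` to test membership; the helper name `in_language` is made up for illustration and is not part of the lecture.

```python
import re

def in_language(regex: str, s: str) -> bool:
    """True if the whole string s belongs to the language defined by regex."""
    return re.fullmatch(regex, s) is not None

# (ab)* contains "", "ab", "abab", ... but not "aba" or "ba".
candidates = ["", "ab", "abab", "aba", "ba"]
print([s for s in candidates if in_language(r"(ab)*", s)])  # ['', 'ab', 'abab']

# (a|b|c)* contains every string over the alphabet {a, b, c}.
print(in_language(r"(a|b|c)*", "bccabb"))  # True
print(in_language(r"(a|b|c)*", "abd"))     # False
```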
Regex rules (Perl)
• Goal: match patterns
• / / A string of characters matches the same string
    /woodchuck/ “how much wood does a woodchuck chuck?”
    /p/ “you are a good programmer”
    /pink elephant/ “this is not a pink elephant.”
    /!/ “Keep your hands to yourself!”
• [ ] Disjunction
    /[wW]ood/ “how much wood does a Woodchuck chuck?”
    /[abcd]*/ “you are a good programmer”
    /[A-Za-z]*/ (any letter sequence)
• Special rule: when ^ is FIRST WITHIN BRACKETS it means NOT
    /[^A-Z]*/ (any sequence of characters that are not upper case letters)
    /a\^b/ “look up a^b now” (a ^ that is not first within brackets does not negate; outside brackets Perl treats a bare ^ as the start-of-line anchor, so a literal ^ is escaped)
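A small sketch of the bracket rules, written in Python rather than Perl (the `re` module follows the same conventions for these constructs). The test string "ABCdefGHI" and the variable names are invented for illustration.

```python
import re

sentence = "how much wood does a Woodchuck chuck?"

# [wW] matches either a lower-case or an upper-case w.
wood_hits = re.findall(r"[wW]ood", sentence)

# ^ first inside brackets negates the class: a run of non-upper-case characters.
not_upper = re.search(r"[^A-Z]+", "ABCdefGHI").group()

# Outside brackets ^ is an anchor, so a literal caret needs a backslash.
escaped = re.search(r"a\^b", "look up a^b now")
unescaped = re.search(r"a^b", "look up a^b now")

print(wood_hits)                   # ['wood', 'Wood']
print(not_upper)                   # def
print(escaped.group(), unescaped)  # a^b None
```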
Regex Rules, cont.
• ? The preceding character or nothing
    /woodchucks?/ “how much wood does a woodchuck chuck?”
    /behaviou?r/ “behaviour is the British spelling”
• * Kleene closure; zero or more occurrences of the preceding character or regular expression
    /baa*/ ba, baa, baaa, baaaa …
    /ba*/ b, ba, baa, baaa, baaaa …
    /[ab]*/ (empty string), a, b, ab, ba, baaa, aaabbb, …
    /[0-9][0-9]*/ any non-negative integer (one or more digits)
• + Kleene plus; one or more occurrences of the preceding character or regular expression
    /ba+/ ba, baa, baaa, baaaa …
• . Wildcard; matches any single character at that position
    /p.nt/ pant, pint, punt
    /cat.*cat/ A string where “cat” appears twice anywhere
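The difference between ?, * and + is easiest to see by testing whole strings against each pattern. This is a minimal Python sketch; the helper `whole` is a hypothetical name, and `re.fullmatch` is used so that a pattern must cover the entire string.

```python
import re

def whole(pattern: str, s: str) -> bool:
    """True if pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None

print(whole(r"woodchucks?", "woodchuck"), whole(r"woodchucks?", "woodchucks"))  # True True
print(whole(r"ba*", "b"), whole(r"baa*", "b"), whole(r"ba+", "b"))              # True False False
print(whole(r"[0-9][0-9]*", "007"), whole(r"[0-9][0-9]*", ""))                  # True False
print([w for w in ["pant", "pint", "punt", "pnt"] if whole(r"p.nt", w)])        # ['pant', 'pint', 'punt']
```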
Regex Rules, cont.
• | Disjunction
    /(cats?|dogs?)+/ “It’s raining cats and a dog.”
• ( ) Grouping
    /(gupp(y|ies))*/ “His guppy is the king of guppies.”
• ^ $ \b Anchors (start of line, end of line, word boundary)
    /^The/ “The cat in the hat.”
    /^The end\.$/ “The end.”
    /^The .* end\.$/ “The bitter end.”
    /(the)*/ “I saw him the other day.”
    /(\bthe\b)*/ “I saw him the other day.”
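The effect of the word-boundary anchor can be seen by counting matches in the slide's sentence. A short Python sketch follows; the starred groups are dropped here so that `findall` reports only non-empty matches, which is a simplification of the slide's patterns.

```python
import re

text = "I saw him the other day."

# Without anchors, "the" is also found inside "other".
plain = re.findall(r"the", text)

# \b requires a word boundary on both sides, so only the word "the" matches.
bounded = re.findall(r"\bthe\b", text)

# ^ and $ tie the pattern to the start and end of the line.
framed = re.search(r"^The .* end\.$", "The bitter end.") is not None

print(plain)    # ['the', 'the']
print(bounded)  # ['the']
print(framed)   # True
```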
Regexp for Dollars
• No commas
    /\$[0-9]+(\.[0-9][0-9])?/
• With commas
    /\$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])?/
• With or without commas
    /\$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*|[0-9]*)(\.[0-9][0-9])?/
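The three dollar patterns can be compared on a few sample amounts. In this Python sketch the dollar sign is escaped as `\$` (an unescaped `$` is the end-of-line anchor) and the patterns contain no spaces; the dictionary keys and sample amounts are invented for illustration.

```python
import re

PATTERNS = {
    "no_commas": r"\$[0-9]+(\.[0-9][0-9])?",
    "with_commas": r"\$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])?",
    "either": r"\$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*|[0-9]*)(\.[0-9][0-9])?",
}

def matches(name: str, amount: str) -> bool:
    """True if the named pattern matches the whole amount string."""
    return re.fullmatch(PATTERNS[name], amount) is not None

for amount in ["$5", "$1234.56", "$1,234.56", "$12,34"]:
    print(amount, {name: matches(name, amount) for name in PATTERNS})
```

Only the last pattern accepts both "$1234.56" and "$1,234.56", and all three reject the malformed "$12,34".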
Regexps and Substitutions
• s/(title)/<\1>/
    title becomes <title>
• /the (.*)er they are, the \1er they will be/
    the bigger they are, the bigger they will be
• /the (.*)er they (.*), the \1er they \2/
    the bigger they were, the bigger they were
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
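Backreferences can be tried out in Python, whose `re.sub` also accepts `\1` in the replacement text. This is a sketch with lower-case test sentences, so no case-insensitive flag is needed; note that `\1` only has a value when the pattern contains a parenthesized group. The second test sentence is invented to show a failing backreference.

```python
import re

# \1 in the replacement refers to what the first group captured.
wrapped = re.sub(r"(title)", r"<\1>", "title")

# \1 and \2 inside the pattern must repeat the earlier captures exactly.
pattern = r"the (.*)er they (.*), the \1er they \2"
same = re.search(pattern, "the bigger they were, the bigger they were") is not None
different = re.search(pattern, "the bigger they were, the faster they were") is not None

print(wrapped)    # <title>
print(same)       # True
print(different)  # False
```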
How to Simulate a Therapist
• Eliza examples
    s/.* all .*/IN WHAT WAY/
    s/.* I am (sad|depressed).*/I AM SORRY TO HEAR YOU ARE \1/
    s/.* I am (happy|glad).*/I AM GLAD TO HEAR YOU ARE \1/
    s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
    s/.*/TELL ME ABOUT YOUR MOTHER/
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
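The substitution rules can be strung together into a tiny responder. The sketch below is one possible arrangement, not the original Eliza program: it assumes the rules are tried in order and the first match wins, and the names `RULES` and `respond` are made up. The sample utterances are also invented.

```python
import re

# (pattern, response) pairs, tried in order; the first match wins (an assumption).
RULES = [
    (r".* all .*", "IN WHAT WAY"),
    (r".* I am (sad|depressed).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I am (happy|glad).*", r"I AM GLAD TO HEAR YOU ARE \1"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    (r".*", "TELL ME ABOUT YOUR MOTHER"),
]

def respond(utterance: str) -> str:
    """Return the response of the first rule whose pattern covers the utterance."""
    for pattern, response in RULES:
        m = re.fullmatch(pattern, utterance)
        if m:
            return m.expand(response)  # fills in \1 from the captured group
    return ""

print(respond("Men are all alike."))
print(respond("Well I am depressed these days"))
print(respond("Hello"))
```

Because each pattern starts with `.* ` followed by a space, an utterance that begins directly with "I am sad" falls through to the last rule; that behavior is inherited from the slide's patterns.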
Practice: Regexps for email addresses
Handle cases like these:
    hearst@sims.berkeley.edu
    Wacky@yahoo.com
    Vip@ic-arda.gov
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
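One possible answer to the exercise, sketched in Python. The pattern below is an assumption chosen to cover the three sample addresses (letters, digits, dots, underscores and hyphens before the @, then at least two dot-separated domain labels); it is not a complete description of valid email addresses.

```python
import re

# Hypothetical pattern: local part, "@", then two or more dot-separated labels.
EMAIL = r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+"

def looks_like_email(s: str) -> bool:
    """True if the whole string fits the simplified EMAIL pattern."""
    return re.fullmatch(EMAIL, s) is not None

for address in ["hearst@sims.berkeley.edu", "Wacky@yahoo.com", "Vip@ic-arda.gov", "no-at-sign.com"]:
    print(address, looks_like_email(address))
```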
Regexps and Scanners
• Regular expressions are used to define the language recognized by the scanner for the parser
• We create rules in which names stand for regular expressions
• Example:
    digit: [0-9]
    letter: [A-Za-z]
Precedence of operators
What is the difference?
    letter letter | digit*
    letter (letter | digit)*
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
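The two expressions can be compared by expanding the named rules digit: [0-9] and letter: [A-Za-z] and testing a few strings. This Python sketch relies on the usual precedence (star binds tightest, then concatenation, then |), which Python's `re` shares; the variable names are invented.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

# letter letter | digit*   parses as   (letter letter) | (digit*)
WITHOUT_PARENS = f"{LETTER}{LETTER}|{DIGIT}*"
# letter (letter | digit)*   is a letter followed by any mix of letters and digits
WITH_PARENS = f"{LETTER}({LETTER}|{DIGIT})*"

for s in ["ab", "123", "", "a1", "tmp2"]:
    print(repr(s),
          re.fullmatch(WITHOUT_PARENS, s) is not None,
          re.fullmatch(WITH_PARENS, s) is not None)
```

The first expression accepts exactly two letters or any run of digits (including the empty string); the second accepts identifier-like strings such as "a1" and "tmp2".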
Practicing Regexps
• Describe (in English) the language defined by each of the following regular expressions:
    letter (letter | digit*)
    digit digit* "." digit digit*
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Three Equivalent Representations
• Regular expressions
• Finite automata
• Regular languages
Each can describe the others.
Adapted from Jurafsky & Martin 2000
Finite Automata
• A FA is similar to a compiler in that:
    • A compiler recognizes legal programs in some (source) language.
    • A finite-state machine recognizes legal strings in some language.
• Example: Pascal Identifiers
    • sequences of one or more letters or digits, starting with a letter:
[State graph: start state S goes to accepting state A on "letter"; A loops to itself on "letter | digit"]
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Finite-Automata State Graphs
• A state
• The start state
• An accepting state
• A transition (labeled with an input symbol, e.g. a)
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
Finite Automata
• Transition: s1 -a-> s2
• Is read: in state s1, on input “a”, go to state s2
• If end of input
    • If in accepting state => accept
    • Otherwise => reject
• If no transition possible (got stuck) => reject
• FSA = Finite-State Automaton
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
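The accept/reject rules above translate almost line for line into a simulation loop. The following Python sketch stores transitions in a dictionary keyed by (state, input symbol); the function name `accepts` and the example machine for (ab)* are illustrative assumptions, not taken from the lecture.

```python
def accepts(transitions, start, accepting, text):
    """Run a deterministic FSA over text and report accept (True) or reject (False)."""
    state = start
    for ch in text:
        if (state, ch) not in transitions:
            return False               # no transition possible (got stuck) => reject
        state = transitions[(state, ch)]
    return state in accepting          # end of input: accept only in an accepting state

# Hypothetical two-state machine for (ab)*: s1 -a-> s2, s2 -b-> s1, with s1 accepting.
AB_STAR = {("s1", "a"): "s2", ("s2", "b"): "s1"}

print(accepts(AB_STAR, "s1", {"s1"}, "abab"))  # True
print(accepts(AB_STAR, "s1", {"s1"}, "aba"))   # False (input ends in non-accepting s2)
print(accepts(AB_STAR, "s1", {"s1"}, "abba"))  # False (stuck in s1 on "b")
```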
Language defined by FSA
• The language defined by an FSA is the set of strings accepted by the FSA.
• In the language of the FSA shown below:
    • x, tmp2, XyZzy, position27
• Not in the language of the FSA shown below:
    • 123, a?, 13apples
[State graph: start state S goes to accepting state A on "letter"; A loops to itself on "letter | digit"]
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
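The identifier machine can be written out directly and run on the example strings. This Python sketch assumes the graph has a start state S, a single accepting state A, an edge from S to A on a letter, and a loop on A for letters and digits; it also assumes "letter" and "digit" mean ASCII letters and digits. The function name is hypothetical.

```python
import string

LETTERS = string.ascii_letters
LETTERS_OR_DIGITS = string.ascii_letters + string.digits

def is_identifier(text: str) -> bool:
    """Simulate the two-state identifier FSA: S -letter-> A, A -letter|digit-> A."""
    state = "S"
    for ch in text:
        if state == "S" and ch in LETTERS:
            state = "A"
        elif state == "A" and ch in LETTERS_OR_DIGITS:
            state = "A"
        else:
            return False      # stuck: no edge for this character
    return state == "A"       # A is the only accepting state

print([s for s in ["x", "tmp2", "XyZzy", "position27"] if is_identifier(s)])
print([s for s in ["123", "a?", "13apples"] if is_identifier(s)])  # []
```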
Example: Integer Literals
• FSA that accepts integer literals with an optional + or - sign:
    /(\+|-)?[0-9]+/
• Note: two different edges from S to A
[State graph: start state S goes to A on "+" and on "-"; S and A each go to accepting state B on "digit"; B loops to itself on "digit"]
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
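The integer-literal machine and its regular expression should agree on every input, which a table-driven simulation can check. The state names S, A and B come from the slide; the edge layout used here (S to A on a sign, S to B and A to B on a digit, a digit loop on B, B accepting) is inferred from the slide's labels, and the function names are invented for this sketch.

```python
import re

# Transition table keyed by (state, symbol class).
TRANSITIONS = {
    ("S", "sign"): "A",    # stands for two edges: one on "+", one on "-"
    ("S", "digit"): "B",
    ("A", "digit"): "B",
    ("B", "digit"): "B",
}

def symbol_class(ch: str):
    """Map an input character to the label used on the edges, or None."""
    if ch in "+-":
        return "sign"
    if ch in "0123456789":
        return "digit"
    return None

def is_integer_literal(text: str) -> bool:
    state = "S"
    for ch in text:
        state = TRANSITIONS.get((state, symbol_class(ch)))
        if state is None:
            return False   # stuck
    return state == "B"

# The FSA and the regular expression should give the same verdict.
for s in ["42", "+7", "-003", "+", "", "4-2", "--1"]:
    by_regex = re.fullmatch(r"(\+|-)?[0-9]+", s) is not None
    print(repr(s), is_integer_literal(s), by_regex)
```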