Foundations of Software Design

Lecture 22: Regular Expressions and Finite Automata. Marti Hearst, Fall 2002.

  1. Foundations of Software Design
     Lecture 22: Regular Expressions and Finite Automata
     Marti Hearst, Fall 2002

  2. Regular Expressions
     • Language = set of strings
     • A language is defined by a regular expression: it is the set of strings that match the expression.

     Regular Exp.     Corresponding Set of Strings
     ε                {""}
     a                {"a"}
     (ab)*            {"", "ab", "abab", "ababab", ...}
     a | b | c        {"a", "b", "c"}
     (a | b | c)*     {"", "a", "b", "c", "aa", "ab", ..., "bccabb", ...}

     Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
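
The idea that a regular expression denotes a set of strings can be checked by brute force. The sketch below uses Python's `re` module as a stand-in for the notation on the slide; the helper name `language`, the alphabet, and the length cap are assumptions made for illustration, since the real languages for the starred expressions are infinite.

```python
import re
from itertools import product

def language(pattern, alphabet="abc", max_len=4):
    """List every string over `alphabet`, up to `max_len` characters,
    that the pattern matches in full."""
    matches = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            candidate = "".join(chars)
            if re.fullmatch(pattern, candidate):
                matches.append(candidate)
    return matches

print(language(r"(ab)*"))                     # ['', 'ab', 'abab']
print(language(r"a|b|c"))                     # ['a', 'b', 'c']
print(len(language(r"(a|b|c)*", max_len=2)))  # 13 (1 + 3 + 9 strings)
```

Raising `max_len` lists more members of each language, which is why the table ends the starred rows with an ellipsis.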

  3. Regex rules (Perl)
     • Goal: match patterns
     • / / A string of characters matches the same string
         /woodchuck/        “how much wood does a woodchuck chuck?”
         /p/                “you are a good programmer”
         /pink elephant/    “this is not a pink elephant.”
         /!/                “Keep your hands to yourself!”
     • [ ] Disjunction
         /[wW]ood/          “how much wood does a Woodchuck chuck?”
         /[abcd]*/          “you are a good programmer”
         /[A-Za-z]*/        (any letter sequence)
     • Special rule: when ^ is FIRST WITHIN BRACKETS it means NOT
         /[^A-Z]*/          (any sequence with no upper case letters)
         /a^b/              “look up a^b now” (here ^ is not first within brackets, so it does not mean NOT; in actual Perl an unbracketed ^ is a start-of-line anchor, so a literal caret is written /a\^b/)
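
A minimal sketch of these rules, assuming Python's `re` module behaves like the Perl notation for these simple cases. The variable names are illustrative, and `+` replaces `*` in the negated class only to avoid listing empty matches.

```python
import re

# A plain pattern matches the same characters wherever they occur.
found = re.search(r"woodchuck", "how much wood does a woodchuck chuck?")
print(found.group())  # woodchuck

# [wW] matches either letter in that position.
wood_hits = re.findall(r"[wW]ood", "how much wood does a Woodchuck chuck?")
print(wood_hits)  # ['wood', 'Wood']

# A caret placed first inside brackets negates the class.
not_upper = re.findall(r"[^A-Z]+", "Hello World")
print(not_upper)  # ['ello ', 'orld']

# Outside brackets the caret is an anchor, so a literal caret is escaped.
escaped = re.search(r"a\^b", "look up a^b now")
unescaped = re.search(r"a^b", "look up a^b now")
print(escaped.group(), unescaped)  # a^b None
```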

  4. Regex Rules, cont.
     • ? The preceding character or nothing
         /woodchucks?/      “how much wood does a woodchuck chuck?”
         /behaviou?r/       “behaviour is the British spelling”
     • * Kleene closure; zero or more occurrences of the preceding character or regular expression
         /baa*/             ba, baa, baaa, baaaa …
         /ba*/              b, ba, baa, baaa, baaaa …
         /[ab]*/            (empty string), a, b, ab, ba, baaa, aaabbb, …
         /[0-9][0-9]*/      any unsigned integer (one or more digits)
     • + Positive closure (Kleene plus); one or more occurrences of the preceding character or regular expression
         /ba+/              ba, baa, baaa, baaaa …
     • . Wildcard; matches any character at that position
         /p.nt/             pant, pint, punt
         /cat.*cat/         A string where “cat” appears twice anywhere
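
The difference between `?`, `*`, `+`, and `.` is easiest to see by testing a few candidate words against each pattern. This is a small sketch using Python's `re` module; the helper `whole` is a hypothetical name, and it uses full matches so that each word is accepted or rejected as a whole.

```python
import re

def whole(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["b", "ba", "baa", "baaa"]
print(whole(r"baa*", words))  # ['ba', 'baa', 'baaa']
print(whole(r"ba*", words))   # ['b', 'ba', 'baa', 'baaa']
print(whole(r"ba+", words))   # ['ba', 'baa', 'baaa']

print(whole(r"behaviou?r", ["behavior", "behaviour", "behaviouur"]))
# ['behavior', 'behaviour']

print(whole(r"p.nt", ["pant", "pint", "punt", "pnt"]))
# ['pant', 'pint', 'punt']

print(whole(r"[0-9][0-9]*", ["0", "42", "007", "", "4a"]))
# ['0', '42', '007']
```

Note that `/baa*/` and `/ba+/` accept the same words: both require a `b` followed by at least one `a`.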

  5. Regex Rules, cont.
     • | Disjunction
         /(cats?|dogs?)+/     “It’s raining cats and a dog.”
     • ( ) Grouping
         /(gupp(y|ies))*/     “His guppy is the king of guppies.”
     • ^ $ \b Anchors (start of the line, end of the line, word boundary)
         /^The/               “The cat in the hat.”
         /^The end\.$/        “The end.”
         /^The .* end\.$/     “The bitter end.”
         /(the)*/             “I saw him the other day.”
         /(\bthe\b)*/         “I saw him the other day.”
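
The sketch below runs the slide's sample sentences through Python's `re` module to show what alternation, grouping, and anchors pick out. The outer `( )*` and `( )+` wrappers from the slide are dropped here as a simplifying assumption, so that `findall` lists each hit separately.

```python
import re

pets = re.findall(r"cats?|dogs?", "It's raining cats and a dog.")
print(pets)  # ['cats', 'dog']

# (?:...) groups without capturing, so findall returns whole matches.
guppies = re.findall(r"gupp(?:y|ies)", "His guppy is the king of guppies.")
print(guppies)  # ['guppy', 'guppies']

print(bool(re.search(r"^The", "The cat in the hat.")))        # True
print(bool(re.search(r"^The end\.$", "The end.")))            # True
print(bool(re.search(r"^The .* end\.$", "The bitter end.")))  # True

# Without word boundaries, "the" is also found inside "other".
loose = re.findall(r"the", "I saw him the other day.")
strict = re.findall(r"\bthe\b", "I saw him the other day.")
print(loose, strict)  # ['the', 'the'] ['the']
```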

  6. Regexp for Dollars
     • No commas
         /\$[0-9]+(\.[0-9][0-9])?/
     • With commas
         /\$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])?/
     • With or without commas
         /\$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*|[0-9]*)(\.[0-9][0-9])?/
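
To see how the three dollar patterns differ, the sketch below tests a handful of amounts against each one. It assumes Python's `re` module, writes the dollar sign as `\$` so that it is read literally rather than as an end-of-line anchor, and uses sample amounts chosen for illustration.

```python
import re

NO_COMMAS = r"\$[0-9]+(\.[0-9][0-9])?"
WITH_COMMAS = r"\$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])?"
EITHER = r"\$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*|[0-9]*)(\.[0-9][0-9])?"

def matches(pattern, text):
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

for amount in ["$5", "$1234.50", "$1,234.50", "$12,34", "5.00"]:
    print(amount,
          matches(NO_COMMAS, amount),
          matches(WITH_COMMAS, amount),
          matches(EITHER, amount))
```

Only the third pattern accepts both `$1234.50` and `$1,234.50`, while all three reject the malformed `$12,34` and the amount with no dollar sign.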

  7. Regexps and Substitutions
     • s/(title)/<\1>/
         title → <title>
     • /the (.*)er they are, the \1er they will be/
         The bigger they are, the bigger they will be.
     • /the (.*)er they (.*), the \1er they \2/
         The bigger they were, the bigger they were
     Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
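
Substitution and backreferences can be tried out with `re.sub` and `re.search`, which play the role of Perl's `s/.../.../` and `/.../` here. In this sketch the first pattern is wrapped in parentheses so that `\1` has a group to refer to, and `re.IGNORECASE` is an added assumption because the sample sentence starts with a capital letter.

```python
import re

# \1 in the replacement refers to the text captured by the first group.
tagged = re.sub(r"(title)", r"<\1>", "title")
print(tagged)  # <title>

# \1 inside the pattern must repeat exactly what the group matched earlier.
pattern = r"the (.*)er they are, the \1er they will be"
same = re.search(pattern, "The bigger they are, the bigger they will be.", re.IGNORECASE)
different = re.search(pattern, "The bigger they are, the faster they will be.", re.IGNORECASE)
print(bool(same), bool(different))  # True False
print(same.group(1))                # bigg
```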

  8. How to Simulate a Therapist
     • Eliza examples
         s/.* all .*/IN WHAT WAY/
         s/.* I am (sad|depressed).*/I AM SORRY TO HEAR YOU ARE \1/
         s/.* I am (happy|glad).*/I AM GLAD TO HEAR YOU ARE \1/
         s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
         s/.*/TELL ME ABOUT YOUR MOTHER/
     Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
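
The Eliza rules amount to an ordered list of substitutions in which the first matching pattern wins. Below is a minimal sketch of that idea in Python; the function name `respond` is hypothetical, and the patterns use `\b` word boundaries instead of the literal spaces on the slide (an adjustment, so that a sentence beginning with "I am" also matches).

```python
import re

# Ordered rules: the first pattern that matches wins; the last is a catch-all.
RULES = [
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\bI am (sad|depressed)\b.*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\bI am (happy|glad)\b.*", r"I AM GLAD TO HEAR YOU ARE \1"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
    (r".*", "TELL ME ABOUT YOUR MOTHER"),
]

def respond(sentence):
    """Return the reply produced by the first rule matching the sentence."""
    for pattern, template in RULES:
        match = re.fullmatch(pattern, sentence)
        if match:
            # expand() fills in \1 from the captured group, if any.
            return match.expand(template).upper()
    return ""

print(respond("They are all alike"))    # IN WHAT WAY
print(respond("I am sad today"))        # I AM SORRY TO HEAR YOU ARE SAD
print(respond("You always interrupt"))  # CAN YOU THINK OF A SPECIFIC EXAMPLE
print(respond("Hello"))                 # TELL ME ABOUT YOUR MOTHER
```

Rule order matters: because the catch-all `.*` matches everything, it has to come last.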

  9. Practice: Regexps for email addresses
     Handle cases like these:
       hearst@sims.berkeley.edu
       Wacky@yahoo.com
       Vip@ic-arda.gov
     Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
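
One possible answer to this exercise, offered as a sketch rather than the lecture's intended solution: a local part, an `@`, and at least two dot-separated domain labels. The pattern is deliberately simplified and is not a full validator for real-world address syntax.

```python
import re

# local part, "@", then two or more labels separated by dots
EMAIL = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")

def is_email(text):
    """True when the whole text has the simplified address shape."""
    return EMAIL.fullmatch(text) is not None

for address in ["hearst@sims.berkeley.edu", "Wacky@yahoo.com", "Vip@ic-arda.gov"]:
    print(address, is_email(address))  # each prints True

print(is_email("no-at-sign.com"))  # False
print(is_email("user@localhost"))  # False: the domain needs at least one dot
```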

  10. Regexps and Scanners
      • Regular expressions are used to define the language recognized by the scanner for the parser
      • We create rules in which names stand for regular expressions
      • Example:
          • digit: [0-9]
          • letter: [A-Za-z]
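
Named rules can be composed into a tiny scanner. The sketch below builds on `digit` and `letter` from the slide; the token names `IDENT` and `INT` and the function `scan` are assumptions made for illustration, and characters that match no rule are simply skipped.

```python
import re

# Names standing for regular expressions, as on the slide.
DIGIT = r"[0-9]"
LETTER = r"[A-Za-z]"

# Token rules built from the named pieces (token names are hypothetical).
TOKEN_RULES = [
    ("IDENT", rf"{LETTER}(?:{LETTER}|{DIGIT})*"),
    ("INT", rf"{DIGIT}+"),
]

# One alternation with a named group per rule; lastgroup reports which matched.
SCANNER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_RULES))

def scan(text):
    """Return (token name, matched text) pairs found in the input."""
    return [(m.lastgroup, m.group()) for m in SCANNER.finditer(text)]

print(scan("x1 = 42 + tmp2"))
# [('IDENT', 'x1'), ('INT', '42'), ('IDENT', 'tmp2')]
```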

  11. Precedence of operators
      What is the difference?
        letter letter | digit*
        letter (letter | digit)*
      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
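
The two expressions can be compared directly. In the sketch below, `letter` and `digit` are expanded to character classes (an assumption consistent with the scanner rules); because concatenation and `*` bind more tightly than `|`, the first expression reads as "two letters, or any run of digits".

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

# Parsed as (letter letter) | (digit*)
unparenthesized = f"{LETTER}{LETTER}|{DIGIT}*"
# A letter followed by any mix of letters and digits
parenthesized = f"{LETTER}({LETTER}|{DIGIT})*"

def full(pattern, text):
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

for s in ["ab", "123", "", "a1b2", "x"]:
    print(repr(s), full(unparenthesized, s), full(parenthesized, s))
# 'ab'   True  True
# '123'  True  False
# ''     True  False
# 'a1b2' False True
# 'x'    False True
```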

  12. Practicing Regexps
      • Describe (in English) the language defined by each of the following regular expressions:
          • letter (letter | digit*)
          • digit digit* "." digit digit*
      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  13. Three Equivalent Representations
      • Regular expressions
      • Finite automata
      • Regular languages
      Each can describe the others.
      Adapted from Jurafsky & Martin 2000

  14. Finite Automata
      • An FA is similar to a compiler in that:
          • A compiler recognizes legal programs in some (source) language.
          • A finite-state machine recognizes legal strings in some language.
      • Example: Pascal Identifiers
          • sequences of one or more letters or digits, starting with a letter:
      [State graph: start state S goes to accepting state A on letter; A loops back to itself on letter | digit]
      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  15. Finite-Automata State Graphs
      • A state
      • The start state
      • An accepting state
      • A transition (an arrow labeled with an input symbol, here a)
      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  16. Finite Automata
      • Transition: s1 --a--> s2
      • Is read: in state s1 on input “a” go to state s2
      • If end of input
          • If in accepting state => accept
          • Otherwise => reject
      • If no transition possible (got stuck) => reject
      • FSA = Finite State Automaton
      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

  17. Language defined by FSA
      • The language defined by an FSA is the set of strings accepted by the FSA.
      • in the language of the FSA shown below:
          • x, tmp2, XyZzy, position27.
      • not in the language of the FSA shown below:
          • 123, a?, 13apples.
      [State graph: start state S goes to accepting state A on letter; A loops back to itself on letter | digit]
      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
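
The accept and reject rules for finite automata can be sketched as a short table-driven simulator. The code below encodes the identifier machine (start state S, accepting state A) and runs the sample strings from this slide; the names `classify`, `TRANSITIONS`, and `accepts` are illustrative, and restricting letters and digits to ASCII is an assumption.

```python
def classify(ch):
    """Map a character to the input class used on the transition labels."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

# (state, input class) -> next state
TRANSITIONS = {
    ("S", "letter"): "A",
    ("A", "letter"): "A",
    ("A", "digit"): "A",
}
START = "S"
ACCEPTING = {"A"}

def accepts(text):
    """Run the automaton over the text and report accept or reject."""
    state = START
    for ch in text:
        key = (state, classify(ch))
        if key not in TRANSITIONS:
            return False  # no transition possible (stuck): reject
        state = TRANSITIONS[key]
    return state in ACCEPTING  # end of input: accept only in an accepting state

for s in ["x", "tmp2", "XyZzy", "position27"]:
    print(s, accepts(s))  # each prints True
for s in ["123", "a?", "13apples"]:
    print(s, accepts(s))  # each prints False
```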

  18. Example: Integer Literals
      • FSA that accepts integer literals with an optional + or - sign:
      • Note: two different edges from S to A
      • /(\+|-)?[0-9]+/
      [State graph: start state S goes to A on + and on -; S and A each go to accepting state B on digit; B loops back to itself on digit]
      Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html
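
As a check that the automaton and the regular expression describe the same language, the sketch below simulates the machine and compares its verdict with a full regex match on a few inputs. The transition table is one reading of the flattened diagram (S to A on a sign, S or A to B on a digit, B looping on digits), so treat it as an assumption; the sample inputs are chosen for illustration.

```python
import re

# state -> {input symbol -> next state}
TRANSITIONS = {
    "S": {"+": "A", "-": "A", "digit": "B"},
    "A": {"digit": "B"},
    "B": {"digit": "B"},
}
ACCEPTING = {"B"}

def accepts(text):
    """Simulate the integer-literal automaton on the text."""
    state = "S"
    for ch in text:
        symbol = "digit" if ch in "0123456789" else ch
        state = TRANSITIONS[state].get(symbol)
        if state is None:
            return False  # stuck: no transition for this input
    return state in ACCEPTING

INTEGER = re.compile(r"(\+|-)?[0-9]+")

for s in ["42", "+7", "-300", "", "+", "4-2", "--1"]:
    by_fsa = accepts(s)
    by_regex = INTEGER.fullmatch(s) is not None
    print(repr(s), by_fsa, by_regex)  # the two columns agree on every row
```

A lone sign is rejected because the machine stops in A, which is not an accepting state; the regex rejects it because `[0-9]+` needs at least one digit.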
