CMSC 723 / LING 645: Intro to Computational Linguistics

September 8, 2004 (Monz): Regular Expressions and Finite State Automata (J&M 2). Prof. Bonnie J. Dorr, Dr. Christof Monz. TA: Adam Lee.

  2. Regular Expressions and Finite State Automata • REs: a language for specifying text strings • Search for documents containing a string • Searching for “woodchuck” • How much wood would a woodchuck chuck if a woodchuck would chuck wood? • Searching for “woodchucks” with an optional final “s” • Finite-state automata (FSA) (singular: automaton)

  3. Regular Expressions • Basic regular expression patterns • Perl-based syntax (slightly different from other notations for regular expressions) • Disjunctions /[wW]oodchuck/

  4. Regular Expressions • Ranges [A-Z] • Negations [^Ss]
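The bracket patterns on slides 3 and 4 can be tried out directly. This is a minimal sketch that assumes Python's `re` module as a stand-in for Perl (the two share this bracket syntax); the helper `first_match` and the sample strings are made up for illustration.

```python
import re

# Disjunction: either case of the first letter.
woodchuck = re.compile(r"[wW]oodchuck")
# Range: any single uppercase ASCII letter.
upper = re.compile(r"[A-Z]")
# Negation: any single character that is neither "S" nor "s".
not_s = re.compile(r"[^Ss]")

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(first_match(woodchuck, "Woodchucks are rodents"))
print(first_match(upper, "we should call it Drenched Blossoms"))
print(first_match(not_s, "Ssss"))
```

Note that inside brackets the caret means negation only when it comes first; anywhere else it is a literal caret.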

  5. Regular Expressions • Optional and repeated characters: ?, * and + (* and + are named after Stephen Cole Kleene) • ? (0 or 1) • /colou?r/ → color or colour • * (0 or more) • /oo*h!/ → oh! or ooh! or ooooh! • + (1 or more) • /o+h!/ → oh! or ooh! or ooooh! • Wildcard . • /beg.n/ → begin or began or begun
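A quick way to check what ?, *, + and . accept is to run each pattern against a few candidates. The sketch below assumes Python's `re` module in place of Perl; `full_matches` and the candidate lists are hypothetical, chosen so that each pattern has one string it rejects.

```python
import re

examples = {
    r"colou?r": ["color", "colour", "colouur"],
    r"oo*h!": ["oh!", "ooooh!", "h!"],
    r"o+h!": ["oh!", "ooooh!", "h!"],
    r"beg.n": ["begin", "begun", "begn"],
}

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

for pattern, candidates in examples.items():
    print(pattern, "->", full_matches(pattern, candidates))
```

/oo*h!/ and /o+h!/ accept the same strings: "one o followed by zero or more o" is the same as "one or more o".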

  6. Regular Expressions • Anchors ^ and $ • /^[A-Z]/ → “Ramallah, Palestine” • /^[^A-Z]/ → “¿verdad?” “really?” • /\.$/ → “It is over.” • /.$/ → ? • Boundaries \b and \B • /\bon\b/ → “on my way” but not “Monday” • /\Bon\b/ → “automaton” • Disjunction | • /yours|mine/ → “it is either yours or mine”
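The anchor and boundary examples can be verified one by one. This sketch assumes Python's `re` module, whose ^, $, \b and \B behave like Perl's on these inputs; `found` is a hypothetical helper.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

print(found(r"^[A-Z]", "Ramallah, Palestine"))   # True: starts with a capital
print(found(r"^[^A-Z]", "¿verdad?"))             # True: starts with a non-capital
print(found(r"\.$", "It is over."))              # True: literal period at the end
print(found(r".$", "really?"))                   # True: any character at the end
print(found(r"\bon\b", "on my way"))             # True: "on" as a whole word
print(found(r"\bon\b", "Monday"))                # False: "on" is inside a word
print(found(r"\Bon\b", "automaton"))             # True: word-final "on"
print(found(r"yours|mine", "it is either yours or mine"))  # True
```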

  7. Disjunction, Grouping, Precedence • Column 1 Column 2 Column 3 … How do we express this? • /Column [0-9]+ */ matches a single column • /(Column [0-9]+ +)*/ matches the repeated sequence • Precedence, from highest to lowest: • Parentheses () • Counters * + ? {} • Sequences and anchors: the ^my end$ • Disjunction | • REs are greedy!
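Grouping, precedence and greediness are easiest to see side by side. A minimal sketch, assuming Python's `re` module; the sample line (with a trailing space) is made up for illustration.

```python
import re

line = "Column 1 Column 2 Column 3 "

# Without parentheses, * applies only to the trailing space.
single = re.match(r"Column [0-9]+ *", line)
# With parentheses, * repeats the whole "Column N" group.
repeated = re.match(r"(Column [0-9]+ +)*", line)

print(repr(single.group(0)))    # 'Column 1 '
print(repr(repeated.group(0)))  # 'Column 1 Column 2 Column 3 '

# Counters bind tighter than sequence: "the*" is "th" followed by any number of "e".
print(bool(re.fullmatch(r"the*", "theee")))     # True
print(bool(re.fullmatch(r"the*", "thethe")))    # False
print(bool(re.fullmatch(r"(the)*", "thethe")))  # True

# Greediness: * consumes as many characters as it can.
print(re.match(r"[a-z]*", "once upon a time").group(0))  # once
```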

  8. Perl Commands
     while ($line = <STDIN>) {
         if ($line =~ /the/) {
             print "MATCH: $line";
         }
     }

  9. Writing correct expressions • Exercise: Write a Perl regular expression to match the English article “the”:
     /the/
     /[tT]he/
     /\b[tT]he\b/
     /[^a-zA-Z][tT]he[^a-zA-Z]/
     /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
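Each refinement in this exercise fixes a false positive or a false negative of the previous pattern. The sketch below runs all five against a small test set; it assumes Python's `re` module, and the sample strings and the helper `hits` are hypothetical.

```python
import re

patterns = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]

samples = ["the cat", "The cat", "other", "the_end", "see the cat"]

def hits(pattern, strings):
    """Return the strings in which the pattern matches somewhere."""
    return [s for s in strings if re.search(pattern, s)]

for p in patterns:
    print(p, "->", hits(p, samples))
```

The first pattern misses "The", the second still fires inside "other", the third rejects "the_end" because \b treats the underscore as a word character, and the fourth fails at the start of a line because it demands a preceding character. The last pattern still requires a character after "the", so a line that ends in "the" is not matched.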

  10. A more complex example • Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:
     /\$[0-9]+/
     /\$[0-9]+\.[0-9][0-9]/
     /\b\$[0-9]+(\.[0-9][0-9])?\b/
     /\b\$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/
     /\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b/
     /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
     /\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
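The patterns only locate prices and specifications in text; comparing the numbers against the limits is a separate step. A minimal sketch, assuming Python's `re` module and a made-up advertisement string. Two choices here are assumptions of this sketch: the price pattern escapes the dollar sign and drops the leading \b (a \b before `$` only matches when a word character precedes it), and the disk pattern treats the decimal part as optional so that whole numbers such as "40 Gb" match.

```python
import re

# Hypothetical ad text, made up for illustration.
ad = "Fast PC: 600 MHz, 40 Gb disk, 128 Mb RAM, now only $999.99"

PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
SPEED = re.compile(r"\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

def first(pattern, text):
    """Return the first matched substring, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

def price_below(text, limit):
    """True if the first price found in the text is below the limit."""
    m = PRICE.search(text)
    return m is not None and float(m.group(0)[1:]) < limit

print(first(PRICE, ad))        # $999.99
print(first(SPEED, ad))        # 600 MHz
print(first(DISK, ad))         # 40 Gb
print(price_below(ad, 1000))   # True
```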

  11. Advanced operators (operator table not recovered; slide annotation: “should be _”)
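The operator table itself is missing here. Assuming it listed the usual Perl shorthand classes (\d for digits, \w for word characters, \s for whitespace), the sketch below shows them in Python's `re` module; the sample text is made up. The `re.ASCII` flag is used so that \w means exactly [a-zA-Z0-9_], underscore included.

```python
import re

text = "Party of 5 at room_12, 3pm"

digits = re.findall(r"\d", text, re.ASCII)    # same as [0-9]
words = re.findall(r"\w+", text, re.ASCII)    # same as [a-zA-Z0-9_]+
spaces = re.findall(r"\s", text)              # whitespace characters

print(digits)
print(words)
print(len(spaces))
```

Because the underscore counts as a word character, "room_12" comes out as a single token.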

  12. Substitutions and Memory • Substitutions • s/colour/color/ (first occurrence only) • s/colour/color/g (substitute as many times as possible) • s/colour/color/i (case-insensitive matching) • Memory (\1, \2, etc. refer back to earlier matches inside a pattern; $1, $2 in the replacement) • s/([Cc])olour/$1olor/ • /the (.*)er they were, the \1er they will be/ • /the (.*)er they (.*), the \1er they \2/
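Substitution flags and memory map onto `re.sub` in Python, used here as a stand-in for Perl's s///. In this sketch `count=1` plays the role of a substitution without /g, and `re.IGNORECASE` the role of /i; the sample strings are made up.

```python
import re

text = "colour, Colour, colour"

# Like s/colour/color/ : replace only the first match.
print(re.sub(r"colour", "color", text, count=1))   # color, Colour, colour
# Like s/colour/color/g : replace every match.
print(re.sub(r"colour", "color", text))            # color, Colour, color
# Like s/colour/color/gi : ignore case as well.
print(re.sub(r"colour", "color", text, flags=re.IGNORECASE))  # color, color, color
# Memory: keep whichever "C" or "c" was matched.
print(re.sub(r"([Cc])olour", r"\1olor", text))     # color, Color, color

# A backreference inside the pattern forces both gaps to hold the same text.
same = r"the (.*)er they were, the \1er they will be"
print(bool(re.search(same, "the bigger they were, the bigger they will be")))  # True
print(bool(re.search(same, "the bigger they were, the faster they will be")))  # False
```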

  13. Eliza [Weizenbaum, 1966]
      User: Men are all alike
      ELIZA: IN WHAT WAY
      User: They’re always bugging us about something or other
      ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
      User: Well, my boyfriend made me come here
      ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
      User: He says I’m depressed much of the time
      ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

  14. Eliza-style regular expressions
      Step 1: replace first person with second person references
        s/\bI('m| am)\b/YOU ARE/g
        s/\bmy\b/YOUR/g
        s/\bmine\b/YOURS/g
      Step 2: use additional regular expressions to generate replies
        s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
        s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
        s/.* all .*/IN WHAT WAY/
        s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
      Step 3: use scores to rank possible transformations
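The first two steps can be chained into a tiny responder. This is a sketch under stated assumptions, not Weizenbaum's program: it uses Python's `re` module, expects straight apostrophes in the input, takes the first reply rule that matches instead of ranking rules by score (Step 3), and falls back to upper-casing the swapped sentence when no rule applies. The names `PERSON_SWAPS`, `REPLIES` and `respond` are hypothetical.

```python
import re

# Step 1: first person -> second person.
PERSON_SWAPS = [
    (r"\bI('m| am)\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bmine\b", "YOURS"),
]

# Step 2: reply templates, tried in order (an assumption replacing Step 3's scores).
REPLIES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    """Apply the person swaps, then the first reply rule that matches."""
    text = utterance
    for pattern, replacement in PERSON_SWAPS:
        text = re.sub(pattern, replacement, text)
    for pattern, replacement in REPLIES:
        if re.fullmatch(pattern, text):
            return re.sub(pattern, replacement, text, count=1)
    # Assumed fallback: echo the swapped sentence in capitals.
    return text.upper()

print(respond("Men are all alike"))
print(respond("He says I'm depressed much of the time"))
```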

  15. Finite-state Automata • Finite-state automata (FSA) • Regular languages • Regular expressions

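  16. Finite-state Automata (Machines) • Sheep language: baa! baaa! baaaa! baaaaa! ... • /^baa+!$/ • States: q0, q1, q2, q3, q4 (q4 is the final state) • Transitions: q0 -b-> q1, q1 -a-> q2, q2 -a-> q3, q3 -a-> q3 (loop), q3 -!-> q4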

  17. Input Tape (figure) • Tape: a b a ! b • Machine in start state q0 at the first cell • Result: REJECT

  18. Input Tape (figure) • Tape: b a a a ! • State sequence: q0 q1 q2 q3 q3 q4 • Result: ACCEPT

  19. Finite-state Automata • Q: a finite set of N states • Q = {q0, q1, q2, q3, q4} • Σ: a finite input alphabet of symbols • Σ = {a, b, !} • q0: the start state • F: the set of final states • F = {q4} • δ(q,i): transition function • Given state q and input symbol i, return new state q' • δ(q3,!) → q4

  20. State-transition Tables

  21. D-RECOGNIZE
      function D-RECOGNIZE(tape, machine) returns accept or reject
        index ← Beginning of tape
        current-state ← Initial state of machine
        loop
          if End of input has been reached then
            if current-state is an accept state then
              return accept
            else
              return reject
          elsif transition-table[current-state, tape[index]] is empty then
            return reject
          else
            current-state ← transition-table[current-state, tape[index]]
            index ← index + 1
        end
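D-RECOGNIZE translates almost line for line into runnable code. The sketch below is in Python and assumes the sheep-language machine for /baa+!/ with states numbered 0 to 4; the dictionary encoding of the state-transition table (a missing key means an empty cell) and the names `TRANSITIONS`, `ACCEPT_STATES` and `d_recognize` are choices of this sketch.

```python
# State-transition table for /baa+!/; a missing (state, symbol) key is an empty cell.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPT_STATES = {4}

def d_recognize(tape, transitions=TRANSITIONS, start=0, accept=ACCEPT_STATES):
    """Return True (accept) or False (reject) for the given input tape."""
    index = 0
    current_state = start
    while True:
        if index == len(tape):                 # end of input reached
            return current_state in accept
        key = (current_state, tape[index])
        if key not in transitions:             # empty table cell
            return False
        current_state = transitions[key]
        index += 1

print(d_recognize("baaa!"))   # True
print(d_recognize("aba!b"))   # False
```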

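  22. Adding a failing state (figure) • Same automaton (q0 to q4) plus a fail state qF • Every state gets an arc to qF for each symbol it has no other transition on: q0 on a, !; q1 on b, !; q2 on b, !; q3 on b; q4 on a, b, !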

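  23. Adding an “all else” arc (figure) • Same automaton (q0 to q4) plus a fail state qF • Arcs labeled “=” lead to qF and stand for every symbol without an explicit arc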
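With a fail state, the transition function is defined for every state and symbol, so the recognizer never has to test for an empty cell. A minimal Python sketch, assuming the sheep-language machine; the “all else” arc is modeled as the default value of a dictionary lookup, and the names `ARCS`, `FAIL`, `step`, `trace` and `recognize` are hypothetical.

```python
ARCS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}
FAIL = "qF"

def step(state, symbol):
    """Follow the explicit arc if there is one, otherwise the "all else" arc to qF."""
    return ARCS.get((state, symbol), FAIL)

def trace(tape):
    """Return the full sequence of states visited while reading the tape."""
    states = ["q0"]
    for symbol in tape:
        states.append(step(states[-1], symbol))
    return states

def recognize(tape):
    return trace(tape)[-1] == "q4"

print(trace("baaa!"))   # ['q0', 'q1', 'q2', 'q3', 'q3', 'q4']
print(trace("aba!b"))   # ['q0', 'qF', 'qF', 'qF', 'qF', 'qF']
```

Once the machine enters qF it stays there, since qF has no explicit outgoing arcs.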

  24. Languages and Automata • Can use FSA as a generator as well as a recognizer • Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language. • L(M) = {baa!, baaa!, baaaa!, …} • Regular languages vs. non-regular languages
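The same transition table that recognizes strings can also generate them: walk the arcs from the start state and emit a string whenever a final state is reached. The sketch below, in Python, enumerates L(M) breadth first up to a length bound, since the language itself is infinite; the names `ARCS`, `ACCEPT` and `generate` are assumptions of this sketch.

```python
from collections import deque

ARCS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPT = {4}

def generate(max_length):
    """Enumerate every accepted string of at most max_length symbols, shortest first."""
    results = []
    queue = deque([(0, "")])          # (state, string generated so far)
    while queue:
        state, prefix = queue.popleft()
        if state in ACCEPT:
            results.append(prefix)
        if len(prefix) < max_length:
            for (source, symbol), target in ARCS.items():
                if source == state:
                    queue.append((target, prefix + symbol))
    return results

print(generate(6))   # ['baa!', 'baaa!', 'baaaa!']
```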

  25. Languages and Automata • Deterministic vs. Non-deterministic FSAs • Epsilon (ε) transitions
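An ε-transition lets the machine change state without consuming input. One common way to handle it, sketched here in Python, is to compute the ε-closure of the current state set after every step. The machine below is an assumed example, not one taken from the slides: the sheep-language automaton with an ε arc from state 3 back to state 2 in place of the a-loop. The empty string stands for ε.

```python
EPS = ""   # marker for an epsilon arc (an encoding chosen for this sketch)

ARCS = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, EPS): {2},    # jump back without reading input
    (3, "!"): {4},
}
ACCEPT = {4}

def closure(states):
    """Return all states reachable from the given ones using only epsilon arcs."""
    seen = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in ARCS.get((state, EPS), set()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

def accepts(tape):
    states = closure({0})
    for symbol in tape:
        moved = {t for s in states for t in ARCS.get((s, symbol), set())}
        states = closure(moved)
    return bool(states & ACCEPT)

print(accepts("baaaa!"))   # True
print(accepts("ba!"))      # False
```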

  26. Using NFSAs to accept strings • Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point. • Look-ahead: look ahead in input • Parallelism: look at alternatives in parallel
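Two of these strategies are sketched below in Python: backup, using an agenda of unexplored (state, position) pairs as the markers at choice points, and parallelism, tracking the set of all states the machine could be in. The non-deterministic sheep-language machine used here, with two a-arcs leaving state 2, is an assumed example; `recognize_backup` and `recognize_parallel` are hypothetical names.

```python
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # the choice point: stay in 2 or move on to 3
    (3, "!"): {4},
}
ACCEPT = {4}

def recognize_backup(tape):
    """Depth-first search; the agenda holds alternatives still to be explored."""
    agenda = [(0, 0)]                       # (state, index into tape)
    while agenda:
        state, index = agenda.pop()
        if index == len(tape):
            if state in ACCEPT:
                return True
            continue                        # dead end: back up to another alternative
        for target in NFSA.get((state, tape[index]), set()):
            agenda.append((target, index + 1))
    return False

def recognize_parallel(tape):
    """Follow every alternative at once by tracking a set of current states."""
    states = {0}
    for symbol in tape:
        states = {t for s in states for t in NFSA.get((s, symbol), set())}
    return bool(states & ACCEPT)

for tape in ["baa!", "baaaa!", "ba!"]:
    print(tape, recognize_backup(tape), recognize_parallel(tape))
```

Both functions accept exactly the same strings; they differ only in how the alternatives are scheduled.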

  27. Using NFSAs

  28. Readings for next time • J&M Chapter 3

