Regular Expressions and Automata. Lecture #2-2. September 10 2009. Finite State Automata. Regular Expressions (REs) can be viewed as a way to describe machines called Finite State Automata (FSA, also known as automata, finite automata).
Regular Expressions and Automata Lecture #2-2 September 10 2009
Finite State Automata • Regular Expressions (REs) can be viewed as a way to describe machines called Finite State Automata (FSA, also known as automata, finite automata). • FSAs and their close variants are a theoretical foundation of much of the field of NLP.
Finite State Automata • FSAs recognize the regular languages represented by regular expressions • SheepTalk: /baa+!/ • Directed graph with labeled nodes and arc transitions • Five states: q0 the start state, q4 the final state, 5 transitions • [Figure: SheepTalk FSA with states q0, q1, q2, q3, q4 and arcs labeled b, a, a, a, !]
Formally • An FSA is a 5-tuple consisting of • Q: a set of states {q0,q1,q2,q3,q4} • Σ: a finite alphabet of symbols {a,b,!} • q0: a start state • F: a set of accept/final states in Q {q4} • δ(q,i): a transition function mapping Q × Σ to Q
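The 5-tuple can be written down directly as data. The sketch below is one possible encoding, not the lecture's own: the class name `FSA` and its field names are hypothetical, and the arc layout (including the `a` self-loop on q3) is assumed from the expression /baa+!/ and the arc labels in the figure.

```python
from dataclasses import dataclass


@dataclass
class FSA:
    """Hypothetical container for the 5-tuple (Q, Sigma, q0, F, delta)."""
    states: frozenset    # Q
    alphabet: frozenset  # Sigma
    start: str           # q0
    finals: frozenset    # F, a subset of Q
    delta: dict          # maps (state, symbol) -> next state


# Assumed layout: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with an a-loop on q3.
SHEEPTALK = FSA(
    states=frozenset({"q0", "q1", "q2", "q3", "q4"}),
    alphabet=frozenset({"a", "b", "!"}),
    start="q0",
    finals=frozenset({"q4"}),
    delta={
        ("q0", "b"): "q1",
        ("q1", "a"): "q2",
        ("q2", "a"): "q3",
        ("q3", "a"): "q3",
        ("q3", "!"): "q4",
    },
)
```

Pairs missing from `delta` stand for "no legal transition", so the function is partial in this encoding.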
Recognition • Recognition (or acceptance) is the process of determining whether or not a given input should be accepted by a given machine. • Or… it’s the process of determining if a string is in the language we’re defining with the machine • In terms of REs, it’s the process of determining whether or not a given input matches a particular regular expression. • Traditionally, recognition is viewed as processing an input written on a tape consisting of cells containing elements from the alphabet.
FSA recognizes (accepts) strings of a regular language • baa! • baaa! • baaaa! • … • Tape metaphor: a rejected input • [Figure: tape with cells a b a ! b, machine starting in state q0]
Recognition • Simply a process of starting in the start state • Examining the current input • Consulting the table • Going to a new state and updating the tape pointer. • Until you run out of tape.
[Figure: state-by-state trace of the machine over an input tape, using states q0 through q4]
Key Points • Deterministic means that at each point in processing there is always one unique thing to do (no choices). • D-recognize is a simple table-driven interpreter • The algorithm is universal for all unambiguous regular languages. • To change the machine, you change the table.
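A table-driven recognizer of this kind fits in a few lines. This is a minimal sketch rather than the lecture's own D-recognize listing: the function name `d_recognize`, the dictionary encoding of the table, and the SheepTalk arc layout are assumptions made for illustration.

```python
def d_recognize(tape, table, start, finals):
    """Return True if the deterministic machine accepts the whole tape."""
    state = start
    for symbol in tape:                 # advance the tape pointer one cell at a time
        nxt = table.get((state, symbol))
        if nxt is None:                 # no legal transition: reject
            return False
        state = nxt
    return state in finals              # out of tape: accept only in a final state


# Assumed SheepTalk table for /baa+!/ (a-loop on q3).
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

for tape in ["baa!", "baaaa!", "aba!b"]:
    print(tape, d_recognize(tape, SHEEP_TABLE, "q0", {"q4"}))
```

Swapping in a different dictionary changes the language without touching `d_recognize`, which is the sense in which the machine is the table.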
Key Points • Crudely therefore… matching strings with regular expressions (à la Perl) is a matter of • translating the expression into a machine (table) and • passing the table to an interpreter
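From the user's side, the same two steps look like "compile, then match". The sketch below uses Python's `re` module in place of Perl; the helper name `is_sheeptalk` is hypothetical, and how the library matches internally is an implementation detail that need not be a table-driven FSA.

```python
import re

# "Translate the expression into a machine": compile the pattern once.
SHEEPTALK_RE = re.compile(r"baa+!")


def is_sheeptalk(text):
    """True if the whole string matches /baa+!/ (fullmatch anchors both ends)."""
    return SHEEPTALK_RE.fullmatch(text) is not None


for text in ["baa!", "baaaa!", "ba!", "aba!b"]:
    print(text, is_sheeptalk(text))
```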
Recognition as Search • You can view this algorithm as a degenerate kind of state-space search. • States are pairings of tape positions and state numbers. • Operators are compiled into the table • Goal state is a pairing with the end of tape position and a final accept state • It's degenerate because?
Formal Languages • Formal Languages are sets of strings composed of symbols from a finite set of symbols. • Finite-state automata define formal languages (without having to enumerate all the strings in the language) • Given a machine m (such as a particular FSA) L(m) means the formal language characterized by m. • L(Sheeptalk FSA) = {baa!, baaa!, baaaa!, …} (an infinite set)
Generative Formalisms • The term Generative is based on the view that you can run the machine as a generator to get strings from the language. • FSAs can be viewed from two perspectives: • Acceptors that can tell you if a string is in the language • Generators to produce all and only the strings in the language
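Running the machine as a generator can be sketched as a breadth-first walk over its arcs that emits a string whenever a final state is reached. Because the language is infinite, the sketch stops at a length bound; `generate`, `max_len`, and the SheepTalk table layout are assumptions introduced for illustration.

```python
from collections import deque


def generate(table, start, finals, max_len):
    """List every string of length <= max_len that the machine accepts."""
    results = []
    queue = deque([(start, "")])        # (machine state, string produced so far)
    while queue:
        state, prefix = queue.popleft()
        if state in finals:
            results.append(prefix)
        if len(prefix) == max_len:      # length bound keeps the walk finite
            continue
        for (src, symbol), dst in sorted(table.items()):
            if src == state:
                queue.append((dst, prefix + symbol))
    return results


# Assumed SheepTalk table for /baa+!/ (a-loop on q3).
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}

print(generate(SHEEP_TABLE, "q0", {"q4"}, 6))  # ['baa!', 'baaa!', 'baaaa!']
```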
Three Views • Three equivalent formal ways to look at what we’re up to (not including tables – and we’ll find more…) • Regular Expressions • Finite State Automata • Regular Languages
Determinism • Let’s take another look at what is going on with d-recognize. • In particular, let’s look at what it means to be deterministic here and see if we can relax that notion. • How would our recognition algorithm change? • What would it mean for the accepted language?
Determinism and Non-Determinism • Deterministic: There is at most one transition that can be taken given a current state and input symbol. • Non-deterministic: There is a choice of several transitions that can be taken given a current state and input symbol. (The machine doesn’t specify how to make the choice.)
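In code, the difference shows up in the shape of the transition table: a function returns one state, a relation returns a set of states. The sketch below assumes a hypothetical encoding in which each (state, symbol) pair maps to a set of targets and the empty string marks an arc that consumes no input; the non-deterministic layout (an `a` loop on q2 next to an `a` arc from q2 to q3) is likewise an assumption.

```python
def is_deterministic(relation):
    """True if no (state, symbol) pair offers a choice and no arc is input-free."""
    return all(
        symbol != "" and len(targets) <= 1
        for (_state, symbol), targets in relation.items()
    )


# Deterministic SheepTalk: every pair has exactly one target.
DFSA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q3"},
    ("q3", "a"): {"q3"},
    ("q3", "!"): {"q4"},
}

# Non-deterministic variant: in q2, reading "a" may stay in q2 or move to q3.
NFSA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}

print(is_deterministic(DFSA), is_deterministic(NFSA))
```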
Non-Deterministic FSAs for SheepTalk • [Figures: two non-deterministic FSAs over states q0 to q4; recovered arc labels are b, a, a, a, ! for the first and b, a, a, ! for the second]
FSAs as Grammars for Natural Language • [Figure: FSA over states q0 to q6 with arcs labeled dr, the, rev, mr, ms, hon, mrs, pat, l., robinson] • Can you use a regular expression to capture this too?
Equivalence • Non-deterministic machines can be converted to deterministic ones with a fairly simple construction (essentially building “set states” that are reached by following all possible transitions in parallel) • That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones • It also means that one way to do recognition with a non-deterministic machine is to turn it into a deterministic one. • Problems: translating gives us a not very intuitive machine, and this machine can have LOTS of states (up to 2^n for a non-deterministic machine with n states)
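The "set states" idea can be sketched as the usual subset construction. The version below is a simplified illustration that assumes the non-deterministic machine has no input-free arcs; the names `determinize` and `run`, and the example machine with an `a` loop on q2, are hypothetical.

```python
def determinize(relation, start, finals):
    """Subset construction for a machine without input-free arcs."""
    start_set = frozenset({start})
    table, dfa_finals = {}, set()
    seen, todo = {start_set}, [start_set]
    while todo:
        current = todo.pop()
        if current & finals:            # a set state accepts if any member does
            dfa_finals.add(current)
        symbols = {sym for (src, sym) in relation if src in current}
        for sym in symbols:
            target = frozenset(
                dst for src in current for dst in relation.get((src, sym), ())
            )
            table[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return table, start_set, dfa_finals


def run(tape, table, start, finals):
    """Deterministic, table-driven recognition over the set states."""
    state = start
    for sym in tape:
        if (state, sym) not in table:
            return False
        state = table[(state, sym)]
    return state in finals


# Assumed non-deterministic SheepTalk: a-loop on q2 plus an a-arc q2 -> q3.
NFSA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}

table, start, finals = determinize(NFSA, "q0", {"q4"})
print(run("baaa!", table, start, finals), run("ba!", table, start, finals))
```

Only set states reachable from the start are built, so small machines like this one stay small; the exponential worst case arises when many subsets are reachable.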
Non-Deterministic Recognition • In an ND FSA, for every string in the language defined by the machine, at least one path directed through the machine by that string leads to an accept condition. • But not all paths directed through the machine by an accepted string lead to an accept state. It is OK for some paths to lead to a reject condition. • In an ND FSA, no path directed through the machine by a string outside the language leads to an accept condition.
Non-Deterministic Recognition • So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept. • However, being driven to a reject condition by an input does not imply it should be rejected. • Failure occurs only when none of the possible paths lead to an accept state. • This means that the problem of non-deterministic recognition can be thought of as a standard search problem.
The Problem of Choice • Choice in non-deterministic models comes up again and again in NLP. Several Standard Solutions • Backup (search, this chapter) • Save input/state of machine at choice points • If wrong choice, use this saved state to back up and try another choice • Lookahead • Look ahead in the input to help make a choice • Parallelism • Look at all choices in parallel
Backup • After a wrong choice leads to a dead-end (either no input left in a non-accept state, or no legal transitions), return to a previous choice point to pursue another unexplored choice. • Thus, at each choice point, the search process needs to remember the (unexplored) choices. • Standard State Space Search. • State = (FSA node or machine state, tape-position)
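The backup strategy can be sketched with an explicit agenda of unexplored search states. This is a minimal illustration, not the lecture's own listing: it assumes a machine without input-free arcs, uses the agenda as a stack (depth-first), and the names `nd_recognize` and `NFSA` are hypothetical.

```python
def nd_recognize(tape, relation, start, finals):
    """Depth-first search over (machine state, tape position) pairs."""
    agenda = [(start, 0)]               # unexplored choices
    while agenda:
        state, pos = agenda.pop()       # back up to the most recent choice point
        if pos == len(tape):
            if state in finals:
                return True             # one accepting path is enough
            continue                    # dead end: input used up in a non-accept state
        for dst in relation.get((state, tape[pos]), ()):
            agenda.append((dst, pos + 1))
        # If no arc matched, nothing was added: dead end, so the loop backs up.
    return False                        # every path failed


# Assumed non-deterministic SheepTalk: a-loop on q2 plus an a-arc q2 -> q3.
NFSA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}

print(nd_recognize("baaa!", NFSA, "q0", {"q4"}))
```

Popping from the front of the agenda instead of the back would turn the same loop into a breadth-first search.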
Example • [Figure: input tape and the non-deterministic SheepTalk FSA over states q0 to q4]
Example Agenda:
Key Points • States in the search space are pairings of tape positions and states in the machine. • By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.
Infinite Search • If you’re not careful such searches can go into an infinite loop. • How?
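One way this can happen (an assumption here, since the slide leaves the question open) is a cycle of arcs that consume no input: the search keeps revisiting the same (state, position) pair. The sketch below guards against that by remembering search states it has already queued. The empty string as the label of an input-free arc, and the names `nd_recognize_safe` and `LOOPY`, are hypothetical conventions.

```python
def nd_recognize_safe(tape, relation, start, finals):
    """Agenda-based search that never queues the same search state twice."""
    agenda = [(start, 0)]
    seen = {(start, 0)}
    while agenda:
        state, pos = agenda.pop()
        if pos == len(tape) and state in finals:
            return True
        # Input-free arcs keep the tape position; labeled arcs advance it.
        successors = [(dst, pos) for dst in relation.get((state, ""), ())]
        if pos < len(tape):
            successors += [
                (dst, pos + 1) for dst in relation.get((state, tape[pos]), ())
            ]
        for nxt in successors:
            if nxt not in seen:         # without this check the cycle below loops forever
                seen.add(nxt)
                agenda.append(nxt)
    return False


# Toy machine with an input-free cycle between q1 and q2.
LOOPY = {
    ("q0", "b"): {"q1"},
    ("q1", ""): {"q2"},
    ("q2", ""): {"q1"},
    ("q2", "!"): {"q3"},
}

print(nd_recognize_safe("b!", LOOPY, "q0", {"q3"}))
print(nd_recognize_safe("b", LOOPY, "q0", {"q3"}))
```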
Why Bother? • Non-determinism doesn’t get us more formal power and it causes headaches, so why bother? • More natural solutions • Deterministic machines built by the conversion construction are too big
Compositional Machines • Formal languages are just sets of strings • Therefore, we can talk about various set operations (intersection, union, concatenation) • This turns out to be a useful exercise
Union • Accept a string in either of two languages
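A common way to build a union machine is to add a fresh start state with input-free arcs into both original start states. The sketch below assumes the two machines use disjoint state names and that the empty string labels an input-free arc; `union`, `nd_accepts`, and the toy `COW` machine for /moo+/ are hypothetical.

```python
def nd_accepts(tape, relation, start, finals):
    """Agenda-based non-deterministic recognition with input-free arcs."""
    agenda, seen = [(start, 0)], {(start, 0)}
    while agenda:
        state, pos = agenda.pop()
        if pos == len(tape) and state in finals:
            return True
        moves = [(d, pos) for d in relation.get((state, ""), ())]
        if pos < len(tape):
            moves += [(d, pos + 1) for d in relation.get((state, tape[pos]), ())]
        for move in moves:
            if move not in seen:
                seen.add(move)
                agenda.append(move)
    return False


def union(m1, m2, new_start="u0"):
    """New start state with input-free arcs to both old start states."""
    (r1, s1, f1), (r2, s2, f2) = m1, m2
    relation = {**r1, **r2}             # assumes disjoint state names
    relation[(new_start, "")] = {s1, s2}
    return relation, new_start, set(f1) | set(f2)


SHEEP = ({("q0", "b"): {"q1"}, ("q1", "a"): {"q2"}, ("q2", "a"): {"q3"},
          ("q3", "a"): {"q3"}, ("q3", "!"): {"q4"}}, "q0", {"q4"})
COW = ({("c0", "m"): {"c1"}, ("c1", "o"): {"c2"}, ("c2", "o"): {"c3"},
        ("c3", "o"): {"c3"}}, "c0", {"c3"})

EITHER = union(SHEEP, COW)
print(nd_accepts("baa!", *EITHER), nd_accepts("mooo", *EITHER))
```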
Concatenation • Accept a string consisting of a string from language L1 followed by a string from language L2.
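Concatenation can be sketched by linking every final state of the first machine to the start state of the second with an input-free arc, and keeping only the second machine's final states. As before, disjoint state names and the empty-string arc label are assumptions, and `concatenate`, `nd_accepts`, and the toy `COW` machine for /moo+/ are hypothetical names.

```python
def nd_accepts(tape, relation, start, finals):
    """Agenda-based non-deterministic recognition with input-free arcs."""
    agenda, seen = [(start, 0)], {(start, 0)}
    while agenda:
        state, pos = agenda.pop()
        if pos == len(tape) and state in finals:
            return True
        moves = [(d, pos) for d in relation.get((state, ""), ())]
        if pos < len(tape):
            moves += [(d, pos + 1) for d in relation.get((state, tape[pos]), ())]
        for move in moves:
            if move not in seen:
                seen.add(move)
                agenda.append(move)
    return False


def concatenate(m1, m2):
    """Accept a string from L(m1) followed by a string from L(m2)."""
    (r1, s1, f1), (r2, s2, f2) = m1, m2
    # Copy the target sets so the input machines are left unchanged.
    relation = {key: set(dsts) for key, dsts in {**r1, **r2}.items()}
    for final in f1:
        relation.setdefault((final, ""), set()).add(s2)
    return relation, s1, set(f2)


SHEEP = ({("q0", "b"): {"q1"}, ("q1", "a"): {"q2"}, ("q2", "a"): {"q3"},
          ("q3", "a"): {"q3"}, ("q3", "!"): {"q4"}}, "q0", {"q4"})
COW = ({("c0", "m"): {"c1"}, ("c1", "o"): {"c2"}, ("c2", "o"): {"c3"},
        ("c3", "o"): {"c3"}}, "c0", {"c3"})

BOTH = concatenate(SHEEP, COW)
print(nd_accepts("baa!moo", *BOTH), nd_accepts("baa!", *BOTH))
```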
Negation • Construct a machine M2 to accept all strings not accepted by machine M1 and reject all the strings accepted by M1 • Swap the accept and non-accept states in M1 • Does that work for non-deterministic machines?
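For a deterministic machine the swap can be sketched as follows. One detail has to be handled first: missing transitions must be routed to an explicit dead state so that every string over the alphabet ends in some state. The names `complement`, `accepts`, and the `sink` state are hypothetical, and the complement is taken relative to strings over the given alphabet only.

```python
def complement(states, alphabet, table, start, finals, sink="sink"):
    """Complete a deterministic table with a sink state, then swap accept/non-accept."""
    all_states = set(states) | {sink}
    total = dict(table)
    for state in all_states:
        for symbol in alphabet:
            total.setdefault((state, symbol), sink)   # missing arcs go to the sink
    return all_states, total, start, all_states - set(finals)


def accepts(tape, table, start, finals):
    """Deterministic, table-driven recognition."""
    state = start
    for symbol in tape:
        if (state, symbol) not in table:
            return False
        state = table[(state, symbol)]
    return state in finals


# Assumed SheepTalk table for /baa+!/ (a-loop on q3).
SHEEP_TABLE = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",
    ("q3", "!"): "q4",
}
STATES = {"q0", "q1", "q2", "q3", "q4"}

_, NOT_TABLE, NOT_START, NOT_FINALS = complement(
    STATES, {"a", "b", "!"}, SHEEP_TABLE, "q0", {"q4"}
)
print(accepts("ba!", NOT_TABLE, NOT_START, NOT_FINALS))
print(accepts("baa!", NOT_TABLE, NOT_START, NOT_FINALS))
```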
Intersection • Accept a string that is in both of two specified languages • An indirect construction… • A^B = ~(~A or ~B)
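For comparison, intersection also has a direct construction: run both deterministic machines in lockstep over pairs of states. That product construction is a different route from the indirect one named on the slide; the sketch below builds it and then checks, string by string, that it agrees with the De Morgan identity. The names `product`, `run`, `via_de_morgan`, and the toy `EVEN_A` machine (an even number of a's) are hypothetical.

```python
def run(tape, table, start, finals):
    """Deterministic, table-driven recognition."""
    state = start
    for symbol in tape:
        if (state, symbol) not in table:
            return False
        state = table[(state, symbol)]
    return state in finals


def product(m1, m2):
    """Pair up states; a pair accepts only if both components accept."""
    (t1, s1, f1), (t2, s2, f2) = m1, m2
    table = {}
    for (p, sym1), p_next in t1.items():
        for (q, sym2), q_next in t2.items():
            if sym1 == sym2:
                table[((p, q), sym1)] = (p_next, q_next)
    finals = {(p, q) for p in f1 for q in f2}
    return table, (s1, s2), finals


def via_de_morgan(tape, m1, m2):
    """Membership form of A ^ B = ~(~A or ~B)."""
    return not ((not run(tape, *m1)) or (not run(tape, *m2)))


SHEEP = ({("q0", "b"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q3",
          ("q3", "a"): "q3", ("q3", "!"): "q4"}, "q0", {"q4"})
EVEN_A = ({("e", "a"): "o", ("o", "a"): "e", ("e", "b"): "e",
           ("o", "b"): "o", ("e", "!"): "e", ("o", "!"): "o"}, "e", {"e"})

SHEEP_AND_EVEN = product(SHEEP, EVEN_A)
for tape in ["baa!", "baaa!", "baaaa!"]:
    print(tape, run(tape, *SHEEP_AND_EVEN), via_de_morgan(tape, SHEEP, EVEN_A))
```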
Why Bother? • FSAs can be useful tools for recognizing – and generating – subsets of natural language • But they cannot represent all NL phenomena (Center Embedding: The mouse the cat ... chased died.)