380 likes | 776 Views
Finite state automaton (FSA). LING 570 Fei Xia Week 3: 10/8/2007. TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A A. Hw1. Need to test your code on patas Windows carriage return vs. unix newline Your machine could yield different results
E N D
Finite state automaton (FSA) LING 570 Fei Xia Week 3: 10/8/2007 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAA
Hw1 • Need to test your code on patas • Windows carriage return vs. unix newline • Your machine could yield different results • Compile your code and include the binary • Include the shell scripts • Make sure that your code does not crash • Set the executable bit in *.(sh|pl|py|…) • Including path names in the code could be a problem. • See Bill’s GoPost message: “how to make sure that we can run your code”
Hw1 (cont) • Whitespace means \s+ • Your code should be able to handle empty lines, etc. • make_voc.* runs slowly use a hash table • It is really important to follow the instructions: • Ex: Retain the same line break • Ex: the output of make_voc.sh Should be of 1152 instead > of 1152 1: (‘of’, 1152)
Hw1 (cont) • Grading criteria: • Incorrect input / output format: -2 • Runtime error (e.g., it does not process entire file): -10 -5 for hw1 • Missing shell script: -2 • No binary, Unix-incompatible: 0 only for this time. Later it will be treated as a runtime error. • Grades: 20 students • Average: 57.9 • Median: 60 (12 got 60) • Time spent on the homework: 13 students • Average: 12.2 hours • Median: 10.5 hours
Hw2 • Any questions? • n >= 0
Formal grammar • A formal grammar is a 4-tuple: • Grammars generate languages. • Chomsky Hierarchy: • Unrestricted, context sensitive, context free, regular • There are other types of grammars. • Human languages are beyond context-free.
Formal language • A language is a set of strings. • Regular language: defined by recursion • They can be generated by regular grammars • They can be expressed by regular expressions • Given a regular language, grammar, or expression, how can we tell whether a string belongs to a language? creating an FSA as an acceptor
FSA / FST • It is one of the most important techniques in NLP. • Multiple FSTs can be combined to form a larger, more powerful FST. • Any regular language can be recognized by an FSA. • Any regular relation can be recognized by an FST.
FST Toolkits • AT&T: http://www.research.att.com/~fsmtools/fsm/man.html • NLTK: http://nltk.sf.net/docs.html • ISI: Carmel • …
Outline • Deterministic FSA (DFA) • Non-deterministic FSA (NFA) • Probabilistic FSA (PFA) • Weighted FSA (WFA)
Definition of DFA An automaton is a 5-tuple = • An alphabet input symbols • A finite set of states Q • A start state q0 • A set of final states F • A transition function:
a b b q1 q0 = {a, b} S = {q0, q1} F = {q1} = { q0£ a ! q0, q0£ b ! q1, q1£ b ! q1 } What about q1£ a ?
Representing an FSA as a directed graph • The vertices denote states: • Final states are represented as two concentric circles. • The transitions forms the edges. • The edges are labeled with symbols.
An example a b b q1 q0 b a q2 a a b b a a a b b a b q0 q0 q1 q1 q2 q1 q0 q0 q1 q1 q2 q0
DFA as an acceptor • A string is said to be accepted by an FSA if the FSA is in a final state when it stops working. • that is, there is a path from the initial state to a final state which yields the string. • Ex: does the FSA accept “abab”? • The set of the strings that can be accepted by an FSA is called the language accepted by the FSA.
a b b q1 q0 An example FSA: Regular language: {b, ab, bb, aab, abb, …} Regular expression: a* b+ Regular grammar: q0 a q0 q0 b q1 q1 b q1 q1 ²
NFA • A transition can lead to more than one state. • There could be multiple start states. • Transitions can be labeled with ², meaning states can be reached without reading any input. now the transition function is:
b b q1 q0 b a q2 b NFA example a b b a b b b b a b b q0 q0 q1 q1 q2 q1 q0 q1 q2 q0 q0 q1 q0 q1 q2 q0 q0 q0 q0 q1 q2 q0 q1 q2
Definition of regular expression • The set of regular expressions is defined as follows: (1) Every symbol of is a regular expression (2) ² is a regular expression (3) If r1and r2are regular expressions, so are (r1), r1 r2, r1| r2 , r1* (4) Nothing else is a regular expression.
Regular expression NFA Base case: Concatenation: connecting the final states of FSA1 to the initial state of FSA2 by an ²-translation. Union: Creating a new initial state and add ²-transitions from it to the initial states of FSA1 and FSA2. Kleene closure:
Regular expression NFA (cont) Kleene closure: An example: \d+(\.\d+)?(e\-?\d+)?
Regular grammar and FSA • Regular grammar: • FSA: • Conversion between them Hw3
Relation between DFA and NFA • DFA and NFA are equivalent. • The conversion from NFA to DFA: • Create a new state for each equivalent class in NFA • The max number of states in DFA is 2N, where N is the number of states in NFA. • Why do we need both?
Common algorithms for FSA packages • Converting regular expressions to NFAs • Converting NFAs to regular expressions • Determinization: converting NFA to DFA • Other useful closure properties: union, concatenation, Kleene closure, intersection
So far • A DFA is a 5-tuple: • A NFA is a 5-tuple: • DFA and NFA are equivalent. • Any regular language can be recognized by an FSA. • Regular language Regex NFA DFA Regular grammar
Outline • Deterministic finite state automata (DFA) • Non-deterministic finite state automata (NFA) • Probabilistic finite state automata (PFA) • Weighted Finite state automata (WFA)
b:0.8 a:1.0 q0:0 q1:0.2 An example of PFA F(q0)=0 F(q1)=0.2 I(q0)=1.0 I(q1)=0.0 P(abn)=I(q0)*P(q0,abn,q1)*F(q1) =1.0*1.0*0.8n*0.2
Formal definition of PFA A PFA is • Q: a finite set of N states • Σ: a finite set of input symbols • I: Q R+ (initial-state probabilities) • F: Q R+ (final-state probabilities) • : the transition relation between states. • P: (transition probabilities)
Constraints on function: Probability of a string:
PFA • Informally, in a PFA, each arc is associated with a probability. • The probability of a path is the multiplication of the arcs on the path. • The probability of a string x is the sum of the probabilities of all the paths for x. • Tasks: • Given a string x, find the best path for x. • Given a string x, find the probability of x in a PFA. • Find the string with the highest probability in a PFA • …
Another PFA example: A bigram language model P( BOS w1 w2 … wn EOS) = P(BOS) * P(w1 | BOS) P(w2 | w1) * …. P(wn | wn-1) * P(EOS | wn) Examples: I bought two/to/too books How many states?
Weighted finite-state automata (WFA) • Each arc is associated with a weight. • “Sum” and “Multiplication” can have other meanings. • Ex: weight is –log prob - “multiplication” addition - “Sum” power
Summary • DFA and NFA are 5-tuple: • They are equivalent • Algorithm for constructing NFAs for Regexps • PFA and WFA are 6-tuple: • Existing packages for FSA/FSM algorithms: • Ex: intersection, union, Kleene closure, difference, complementation, …