290 likes | 853 Views
Finite state transducer (FST). LING 570 Fei Xia Week 3: 10/10/2007. Applications of FSTs. ASR Tokenization Stemmer Text normalization Parsing …. Outline. Regular relation Finite-state transducer (FST) Hw3 Carmel: an FST package. Regular relation. Definition of regular relation.
E N D
Finite state transducer (FST) LING 570 Fei Xia Week 3: 10/10/2007
Applications of FSTs • ASR • Tokenization • Stemmer • Text normalization • Parsing • …
Outline • Regular relation • Finite-state transducer (FST) • Hw3 • Carmel: an FST package
Definition of regular relation • The set of regular relations is defined as follows: • For all , {(x, y)} is a regular relation • The empty set is a regular relation • If R1, R2 are regular relations, so are R1¢ R2 = {(x1 x2, y1 y2) | (x1, y1) 2 R1, (x2, y2) 2 R2}. R1 R2, and R*. • Nothing else is a regular relation.
Closure properties • Like regular languages, regular relations are closed under • union • concatenation • Kleene closure • Unlike regular languages, regular relations are not closed under • Intersection • difference • complementation
Closure properties (cont) • New operations for regular relations: • Composition: • Projection: extract x or y in the pairs • Inversion: switch the x and y in (x,y) pairs • Take a regular language and create the identity regular relation • Take two regular languages and create the cross product relation
Finite-state transducers • x:y is a notation for a mapping between two alphabets: • An FST processes an input string, and outputs another string as the output. • Finite-state automata equate to regular languages, and FSTs equate to regular relations. • Ex: R = { (an, bn) | n >= 0} is a regular relation. It maps a string of a’s into an equal length string of b’s
b:y a:x q0 q1 An FST R(T) = { (a, x), (ab, xy), (abb, xyy), …}
Definition of FST A FST is • Q: a finite set of states • Σ: a finite set of input symbols • Γ: a finite set of output symbols • I: the set of initial states • F: the set of final states • : the transition relation between states. FSA can be seen as a special case of FST
Definition of transduction • The extended transition relation is the smallest set such that • T transduces a string x into a string y if there exists a path from the initial state to a final state whose input is x and whose output is y:
More FST examples • Lowercase a string of any length • Tokenize a string he said:”Go away.” he said : “ Go away . “ • Convert a word to its morpheme sequence • Ex: cats cat s • POS tagging: • Ex: He called Mary PN V N • Map Arabic numbers to words • Ex: 123 one hundred and twenty three
Operations on FSTs • Union: • Concatenation: • Composition:
b:y a:x q0 q1 x:ε y:z q0 b:z a:² q0 q1 An example of composition operation
FST Algorithms • Recognition: Is a given pair of strings accepted by an FST? • (x,y) yes/no • Composition: Given two FSTs T1and T2 defining regular relations R1 and R2, create the FSTthat computes the composition of R1 and R2. • R1={(x,y)}, R2={(y,z)} {(x,z) | (x,y) 2 R1, (y,z) 2 R2} • Transduction: given an input string and an FST, provide the output as defined by the regular relation? • x y
Weighted FSTs A FST is • Q: a finite set of states • Σ: a finite set of input symbols • Γ: a finite set of output symbols • I: Q R+ (initial-state probabilities) • F: Q R+ (final-state probabilities) • : the transition relation between states. • P: (transition probabilities)
An example: build a unigram tagger P(t1 … tn | w1 … wn) ¼ P(t1|w1) * … * P(tn | wn) Training time: Collect (word, tag) counts, and store P(t | w) in an FST. Test time: in order to choose the best tag sequence, • create an FSA for the input sentence • compose it with the FST. • choose the best path in the new FST
Summary • Finite state transducers specify regular relations • FST closure properties: union, concatenation, composition • FST special operations: • creating regular relations from regular languages (Id, crossproduct); • creating regular languages from regular relations (projection) • FST algorithms • Recognition • Transduction • Composition • … • Not all FSTs can be determinized. • Weighted FSTs are used often in NLP.
Part III: Creating a unigram POS tagger using FSTs • Input: w1 w2 … wn • Output: w1/t1 w2/t2 … wn/tn • Training data: w1/t1 w2/t2 … wn/tn
Major steps • Training time: create an FST from the training data: • calc_unigram_prob.sh: create “word tag prob cnt” • create_fst.sh: create an FST from the unigram_voc • Test time: • Preprocessing: preproc.sh • Decoding (finding the best path): run carmel with some options • Postprocessing: postproc.sh • Calculate tagging accuracy • Write a wrapper
The format of FSA / FST final_state (from_state1 (to_state1 “input_symbol” “output_symbol”? weight?)* ) (from_state2 (to_state2 “input_symbol” “output_symbol”? weight?)* ) … A state can be a number or string. The from_state in the first edge-line is the start state. ² is represented as *e* output_symbol and prob are optional.
An FSA example: fsa1 0 1 2 3 4 5
To use Carmel • carmel fst1 fst2 => return a new fst, which composes fst1 and fst2. • carmel -k N wfst1 => return the N most probable paths • carmel -Ok N wfst1 => return the N most probable output strings
To use Carmel (cont) • cat input_file | carmel –sli fst1 • create a foo_fst that corresponds to the first line in input_file • carmel foo_fst fst1 • Ex: input_file is “they” “can” “fish” • cat input_file | carmel –sri fst1 • create a foo_fst that corresponds to the first line in input_file • carmel fst1 foo_fst • Ex: input_file is “PRO” “AUX” “VERB” • cat input_file | carmel –b –sli fst1