Parameterized Finite-State Machines and Their Training
Jason Eisner, Johns Hopkins University
March 4, 2004 — Saarbrücken
Outline – The Vision Slide! • Finite-state machines as a shared language for models & data. • The training gizmo (an algorithm). Should use out-of-the-box finite-state gizmos to build and train most of our current models. Easier, faster, better, & enables fancier models.
Function from strings to ...
• Acceptors (FSAs), unweighted: {false, true}
• Transducers (FSTs), unweighted: strings
• Acceptors (FSAs), weighted: numbers
• Transducers (FSTs), weighted: (string, num) pairs
(The slide illustrates each cell with a small machine: acceptor arcs a, c, e; transducer arcs a:x, c:z, e:y; weights such as /.5, /.7, .3 on the weighted versions.)
Sample functions
• Acceptors (FSAs), unweighted ({false, true}): Grammatical?
• Transducers (FSTs), unweighted (strings): Markup, Correction, Translation
• Acceptors (FSAs), weighted (numbers): How grammatical? Better: how likely?
• Transducers (FSTs), weighted ((string, num) pairs): Good markups, Good corrections, Good translations
Sample data, encoded the same way
• Acceptors (FSAs), unweighted: Input string, Corpus, Dictionary
• Transducers (FSTs), unweighted: Bilingual corpus, Bilingual lexicon, Database (WordNet)
• Acceptors (FSAs), weighted: Input lattice, Reweighted corpus, Weighted dictionary
• Transducers (FSTs), weighted: Prob. bilingual lexicon, Weighted database
(The slide illustrates the encoding with a small lattice over letters.)
Some Applications • Prediction, classification, generation of text • More generally, “filling in the blanks” (probabilistic reconstruction of hidden data) • Speech recognition • Machine translation, OCR, other noisy-channel models • Sequence alignment / Edit distance / Computational biology • Text normalization, segmentation, categorization • Information extraction • Stochastic phonology/morphology, including lexicon • Tagging, chunking, finite-state parsing • Syntactic transformations (smoothing PCFG rulesets)
Finite-state “programming”: Programming Langs vs. Finite-State World
• Function → Function on strings
• programmer → programmer
• Source code → Regular expression, e.g. a?c*
• compiler → regexp compiler
• Object code → Finite-state machine
• optimizer → determinization, minimization, pruning
• Better object code → Better object code
Finite-state “programming” (continued): Programming Langs vs. Finite-State World
• Function composition → FST/WFST composition
• Function inversion (available in Prolog) → FST inversion
• Higher-order functions → Finite-state operators
• Small modular cooperating functions (structured programming) → Small modular regexps, combined via operators
Finite-state “programming” (continued): more features you wish other languages had! (The Programming Langs vs. Finite-State World table continues.)
“If you build it, it will train” How About Learning??? Training Probabilistic FSMs • State of the world – surprising: • Training for HMMs, alignment, many variants • But no basic training algorithm for all FSAs • Fancy toolkits for building them, but no learning • New algorithm: • Training for FSAs, FSTs, … (collectively FSMs) • Supervised, unsupervised, incompletely supervised … • Train components separately or all at once • Epsilon-cycles OK • Complicated parameterizations OK
Current Limitation
Knight & Graehl 1997 – transliteration: p(English text) o p(English text → English phonemes) o p(English phonemes → Japanese phonemes) o p(Japanese phonemes → Japanese text)
• Big FSM must be made of separately trainable parts.
• Need explicit training data for this part (smaller loanword corpus). A pity – would like to use guesses.
• Topology must be simple enough to train by current methods. A pity – would like to get some of that expert knowledge in here!
• Topology: sensitive to syllable structure?
• Parameterization: /t/ and /d/ are similar phonemes … parameter tying?
Probabilistic FSA
Example: ab is accepted along 2 paths: p(ab) = (.5 × .7 × .3) + (.5 × .6 × .4) = .225
Regexp: (a*.7 b) +.5 (ab*.6)
Theorem: Any probabilistic FSM has a regexp like this.
(The slide draws the machine: with weight .5 each, enter either a branch with an a/.7 loop and a b/.3 arc, or a branch with an a/1 arc, a b/.6 loop, and stopping weight .4.)
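The path sum above is easy to compute mechanically. Below is a minimal Python sketch with a hypothetical arc encoding of the two-path example (not the machine exactly as drawn on the slide): arcs and stopping weights are listed explicitly, and a forward pass sums the probability of every accepting path.

```python
from collections import defaultdict

# Hypothetical encoding of the two-path example: (source, label, target, weight)
arcs = [
    (0, 'a', 1, 0.5), (1, 'b', 2, 0.7),   # first path:  .5 * .7 * stop .3
    (0, 'a', 3, 0.5), (3, 'b', 4, 0.6),   # second path: .5 * .6 * stop .4
]
stop = {2: 0.3, 4: 0.4}                   # stopping weight of each final state
start = 0

def string_prob(s):
    alpha = defaultdict(float)            # forward weight of reaching each state
    alpha[start] = 1.0
    for symbol in s:
        new_alpha = defaultdict(float)
        for src, label, dst, w in arcs:
            if label == symbol:
                new_alpha[dst] += alpha[src] * w
        alpha = new_alpha
    return sum(alpha[q] * w for q, w in stop.items())

print(string_prob('ab'))                  # 0.225, matching the slide
```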
Weights Need Not be Reals
Example: ab is accepted along 2 paths: weight(ab) = (p ⊗ q ⊗ r) ⊕ (w ⊗ x ⊗ y ⊗ z)
If ⊕, ⊗ satisfy “semiring” axioms, the finite-state constructions continue to work correctly.
(The slide draws the same two-branch machine with symbolic weights: one branch through ε/p, a/q, b/r; the other through ε/w, a/x, b/y and stopping weight z.)
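To see what the semiring abstraction buys, here is a small sketch: the same path-combining code computes a total probability with (plus = +, times = ×) and a cheapest-path cost with the tropical semiring (plus = min, times = +). The semiring dictionaries are illustrative, not from any FSM toolkit.

```python
# plus plays the role of ⊕ and times the role of ⊗.
real = dict(plus=lambda a, b: a + b, times=lambda a, b: a * b,
            zero=0.0, one=1.0)                  # probability semiring
tropical = dict(plus=min, times=lambda a, b: a + b,
                zero=float('inf'), one=0.0)     # min-cost semiring

def pathsum(paths, K):
    """⊕ over paths of the ⊗-product of the arc weights along each path."""
    total = K['zero']
    for path in paths:
        w = K['one']
        for arc_weight in path:
            w = K['times'](w, arc_weight)
        total = K['plus'](total, w)
    return total

print(pathsum([[0.5, 0.7, 0.3], [0.5, 0.6, 0.4]], real))  # 0.225: total probability
print(pathsum([[1, 2, 3], [4, 1]], tropical))             # 5: cost of the cheapest path
```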
Goal: Parameterized FSMs
• Parameterized FSM: an FSM whose arc probabilities depend on parameters: they are formulas.
Expert first: Construct the FSM (topology & parameterization).
Automatic takes over: Given training data, find parameter values that optimize arc probs.
(The slide shows a machine whose arc weights are formulas such as p, q, r, (1-q)r, 1-s, 1-p, q·exp(t+u), exp(t+v).)
Goal: Parameterized FSMs
• Parameterized FSM: an FSM whose arc probabilities depend on parameters: they are formulas.
Expert first: Construct the FSM (topology & parameterization).
Automatic takes over: Given training data, find parameter values that optimize arc probs.
(The same slide with the parameters instantiated: the arc formulas have become concrete probabilities such as .1, .2, .3, .7, .8, .9, .44, .56.)
Goal: Parameterized FSMs
• FSM whose arc probabilities are formulas. “Would like to get some of that expert knowledge in here”
Use probabilistic regexps like (a*.7 b) +.5 (ab*.6) … If the probabilities are variables (a*x b) +y (ab*z) … then arc weights of the compiled machine are nasty formulas. (Especially after minimization!)
(Knight & Graehl 1997 transliteration cascade shown again in the background.)
Goal: Parameterized FSMs
• An FSM whose arc probabilities are formulas. “/t/ and /d/ are similar …”
Tied probs for doubling them: /t/:/tt/ gets probability p, and /d/:/dd/ gets the same probability p.
(Knight & Graehl 1997 transliteration cascade shown again in the background.)
Goal: Parameterized FSMs
• An FSM whose arc probabilities are formulas. “/t/ and /d/ are similar …”
Loosely coupled probabilities: /t/:/tt/ gets exp(p+q+r) (coronal, stop, unvoiced) and /d/:/dd/ gets exp(p+q+s) (coronal, stop, voiced) (with normalization).
(Knight & Graehl 1997 transliteration cascade shown again in the background.)
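A sketch of what loose coupling looks like numerically: each arc's score is exp of a sum of shared feature parameters, normalized over the competing arcs. The feature names match the slide, but the parameter values, and the restriction to just these two competing arcs, are invented for illustration.

```python
import math

params = {'coronal': 0.4, 'stop': 0.1, 'unvoiced': -0.3, 'voiced': 0.2}

# The two doubling arcs share 'coronal' and 'stop', so their probabilities
# move together whenever those shared parameters are updated.
arc_features = {
    't:tt': ['coronal', 'stop', 'unvoiced'],   # exp(p + q + r) on the slide
    'd:dd': ['coronal', 'stop', 'voiced'],     # exp(p + q + s) on the slide
}

scores = {arc: math.exp(sum(params[f] for f in feats))
          for arc, feats in arc_features.items()}
Z = sum(scores.values())                       # "with normalization"
probs = {arc: s / Z for arc, s in scores.items()}
print(probs)
```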
Outline of this talk • What can you build with parameterized FSMs? • How do you train them?
Finite-State Operations
• Projection GIVES YOU marginal distribution: domain(p(x,y)) = p(x); range(p(x,y)) = p(y)
(The slide illustrates with a one-arc machine a:b/0.3 and its two projections.)
Finite-State Operations
• Probabilistic union GIVES YOU mixture model: p(x) +0.3 q(x) = 0.3 p(x) + 0.7 q(x)
Finite-State Operations
• Probabilistic union GIVES YOU mixture model: p(x) +λ q(x) = λ p(x) + (1-λ) q(x)
• Learn the mixture parameter λ!
Finite-State Operations
• Composition GIVES YOU chain rule: p(x|y) o p(y|z) = p(x|z); composing with an observed z gives p(x|z) o z = p(x, z)
• The most popular statistical FSM operation
• Cross-product construction
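As a toy illustration of why composition implements the chain rule: for single-state (memoryless) transducers over a two-symbol alphabet, composing p(x|y) with p(y|z) just sums out the middle tape y, which is a matrix product. The probability tables are made up.

```python
import numpy as np

# Rows index the output symbol, columns the conditioning symbol;
# each column sums to 1.
p_x_given_y = np.array([[0.9, 0.2],
                        [0.1, 0.8]])
p_y_given_z = np.array([[0.7, 0.4],
                        [0.3, 0.6]])

# Composition of memoryless transducers = marginalize out y.
p_x_given_z = p_x_given_y @ p_y_given_z
print(p_x_given_z)              # the chain rule: still column-stochastic
print(p_x_given_z.sum(axis=0))  # [1. 1.]
```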
Finite-State Operations
• Concatenation and probabilistic closure HANDLE unsegmented text: p(x) q(x), p(x)*0.3
• Just glue together machines for the different segments, and let them figure out how to align with the text
Finite-State Operations
• Directed replacement MODELS noise or postprocessing: p(x,y) o D = p(x, noisy y), where D is a noise model defined by directed replacement
• Resulting machine compensates for noise or postprocessing
Finite-State Operations
• Intersection GIVES YOU product models: p(x) & q(x) = p(x)·q(x)
• e.g., exponential / maxent, perceptron, Naïve Bayes, …: p(A(x)|y) & p(B(x)|y) & p(y) yields pNB(y | x)
• Need a normalization op too – computes Σx f(x), the “pathsum” or “partition function”
• Cross-product construction (like composition)
Finite-State Operations
• Conditionalization (a new operation): condit(p(x,y)) = p(y | x)
• Construction: reciprocal(determinize(domain(p(x,y)))) o p(x,y) (weighted determinization is not possible for all weighted FSAs)
• Resulting machine can be composed with other distributions: p(y | x) * q(x)
Other Useful Finite-State Constructions • Complete graphs YIELD n-gram models • Other graphs YIELD fancy language models (skips, caching, etc.) • Compilation from other formalisms → FSM: • Wordlist (cf. trie), pronunciation dictionary ... • Speech hypothesis lattice • Decision tree (Sproat & Riley) • Weighted rewrite rules (Mohri & Sproat) • TBL or probabilistic TBL (Roche & Schabes) • PCFG (approximation!) (e.g., Mohri & Nederhof) • Optimality theory grammars (e.g., Eisner) • Logical description of set (Vaillette; Klarlund)
Regular Expression Calculus as a Programming Language: Programming Langs vs. Finite-State World
• Function → Function on strings
• programmer → programmer
• Source code → Regular expression, e.g. a?c*
• compiler → regexp compiler
• Object code → Finite-state machine
• optimizer → determinization, minimization, pruning
• Better object code → Better object code
Regular Expression Calculus as a Modelling Language • Oops! Statistical FSMs still done “in assembly language”! • Build machines by manipulating arcs and states • For training, • get the weights by some exogenous procedure and patch them onto arcs • you may need extra training data for this • you may need to devise and implement a new variant of EM • Would rather build models declaratively ((a*.7 b) +.5 (ab*.6)) repl.9((a:(b +.3))*,L,R)
A Simple Example: Segmentation
tapirseatgrass → tapirs eat grass? tapir seat grass? tap irse at grass? ...
Strategy: build a finite-state model of p(spaced text, spaceless text), then maximize p(???, tapirseatgrass)
• Start with a distribution p(English word): a machine D (for dictionary)
• Construct p(spaced text): (D space)*0.99 D
• Compose with p(spaceless | spaced): ((¬space) + (space:ε))*
A Simple Example: Segmentation (cont.)
Strategy: build a finite-state model of p(spaced text, spaceless text), then maximize p(???, tapirseatgrass) – train on spaced or spaceless text
• Start with a distribution p(English word): a machine D (for dictionary)
• Construct p(spaced text): (D space)*0.99 D
• Compose with p(spaceless | spaced): ((¬space) + (space:ε))*
• D should include novel words: D = KnownWord +0.99 (Letter*0.85 Suffix)
• Could improve to consider letter n-grams, morphology ...
• Noisy channel could do more than just delete spaces: vowel deletion (Semitic); OCR garbling (cl → d, ri → n, rn → m ...)
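To make the strategy concrete, here is a toy Python sketch that maximizes a unigram word model over segmentations of the spaceless string. The dictionary, the 0.99/0.01 known-vs-novel split, and the per-letter probability are invented stand-ins for the machines D and Letter above, and the dynamic program stands in for finding the best path through the composed FST.

```python
import math

dictionary = {'tapirs': 0.2, 'tapir': 0.2, 'seat': 0.1, 'eat': 0.2, 'grass': 0.3}
LETTER = 0.01                  # per-letter probability when spelling a novel word

def word_logprob(w):
    if w in dictionary:
        return math.log(0.99 * dictionary[w])
    return math.log(0.01) + len(w) * math.log(LETTER)    # novel-word branch

def segment(text):
    # best[i] = (log-prob, words) of the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(i):
            score = best[j][0] + word_logprob(text[j:i])
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [text[j:i]])
    return best[-1][1]

print(segment('tapirseatgrass'))   # ['tapirs', 'eat', 'grass']
```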
Outline • What can you build with parameterized FSMs? • How do you train them? Hint: Make the finite-state machinery do the work.
How Many Parameters?
Final machine p(x,z): 17 weights minus 4 sum-to-one constraints = 13 apparently free parameters.
But really I built it as p(x,y) o p(z|y): 5 free parameters + 1 free parameter.
How Many Parameters? (cont.)
But really I built it as p(x,y) o p(z|y): 5 free parameters + 1 free parameter.
Even these 6 numbers could be tied ... or derived by formula from a smaller parameter set.
How Many Parameters? (cont.)
But really I built it as p(x,y) o p(z|y): 5 free parameters + 1 free parameter.
Really I built this as (a:p)*.7 (b:(p +.2 q))*.5: 3 free parameters.
Training a Parameterized FST
Given: an expression (or code) to build the FST from a parameter vector θ
1. Pick an initial value of θ
2. Build the FST – implements fast prob. model
3. Run FST on some training examples to compute an objective function F(θ)
4. Collect E-counts or gradient ∇F(θ)
5. Update θ to increase F(θ)
6. Unless we converged, return to step 2
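A schematic Python sketch of this loop. The functions build_fst (compile the machine from θ) and pair_prob (return p(xi, yi), e.g. by summing the trellis xi o T o yi) are assumed to be supplied by the caller, and the finite-difference gradient is a crude stand-in for the E-count / gradient machinery of step 4.

```python
import math

def train(theta, data, build_fst, pair_prob, lr=0.1, eps=1e-4, iters=50):
    """theta: list of parameters; data: list of (x, y) training pairs."""

    def objective(th):
        T = build_fst(th)                        # step 2: build the FST from theta
        return sum(math.log(pair_prob(T, x, y))  # step 3: run it on training pairs
                   for x, y in data)             #         to get F(theta)

    for _ in range(iters):                       # step 6: repeat until converged
        F = objective(theta)
        grad = [(objective(theta[:k] + [theta[k] + eps] + theta[k + 1:]) - F) / eps
                for k in range(len(theta))]      # step 4: (numerical) gradient of F
        theta = [t + lr * g for t, g in zip(theta, grad)]   # step 5: increase F
    return theta
```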
Training a Parameterized FST
T = our current FST, reflecting our current guess of the parameter vector.
At each training pair (xi, yi), collect E-counts or gradients that indicate how to increase p(xi, yi).
(The slide shows T applied to a stream of training pairs x1/y1, x2/y2, x3/y3, …)
What are xi and yi?
T = our current FST, reflecting our current guess of the parameter vector; it is run on an input xi and an output yi.
What are xi and yi?
xi = banana, yi = bandaid.
T = our current FST, reflecting our current guess of the parameter vector.
What are xi and yi? Fully supervised:
xi and yi are single fully observed strings (banana and bandaid, each drawn as a straight-line automaton).
T = our current FST, reflecting our current guess of the parameter vector.
What are xi and yi? Loosely supervised:
xi (and/or yi) is given only as a lattice, a regular set of possible strings, rather than a single fully observed string.
T = our current FST, reflecting our current guess of the parameter vector.
What are xi and yi? Unsupervised, e.g., Baum-Welch:
The transition sequence xi is hidden, so xi = Σ* (anything); the emission sequence yi is observed.
T = our current FST, reflecting our current guess of the parameter vector.
Building the Trellis
COMPOSE to get the trellis: xi o T o yi.
Extracts paths from T that are compatible with (xi, yi). Tends to unroll loops of T, as in HMMs, but not always.
Summing the Trellis
The trellis xi o T o yi extracts paths from T that are compatible with (xi, yi); it tends to unroll loops of T, as in HMMs, but not always.
Let ti = total probability of all paths in the trellis = p(xi, yi). This is what we want to increase!
(xi, yi are regexps, denoting strings or sets of strings.)
How to compute ti? If the trellis is acyclic (exponentially many paths): dynamic programming. If cyclic (infinitely many paths): solve a sparse linear system.
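A sketch of the cyclic case. Writing beta for the vector whose i-th entry is the total weight of all paths from trellis state i to a final state, beta satisfies beta = A·beta + f, where A[i][j] is the total arc weight from state i to j and f holds the stopping weights; so beta = (I - A)^{-1} f whenever the cycle weights make the path sum converge. The 3-state trellis below, with a self-loop of weight .3, is invented for illustration.

```python
import numpy as np

A = np.array([[0.0, 0.6, 0.0],     # A[i, j]: total arc weight from state i to j
              [0.0, 0.3, 0.5],     # state 1 has a self-loop of weight .3
              [0.0, 0.0, 0.0]])
f = np.array([0.0, 0.0, 1.0])      # stopping weight of each state

beta = np.linalg.solve(np.eye(3) - A, f)   # solve (I - A) beta = f
t_i = beta[0]                              # total weight of all paths from the start state
print(t_i)                                 # 0.6 * 0.5 / (1 - 0.3) ≈ 0.42857
```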
Summing the Trellis (remark)
In principle, the FSM minimization algorithm already knows how to compute ti, although this is not the best method:
minimize(epsilonify(xi o T o yi)) collapses the trellis to a trivial machine whose single weight is ti, where epsilonify replaces all arc labels with ε.
Again, ti = total probability of all paths in the trellis = p(xi, yi), the quantity we want to increase.
Example: Baby Think & Baby Talk
The model maps hidden “think” symbols to observed “talk” sounds, via arcs such as X:b/.2, IWant:u/.8, X:m/.4, IWant:ε/.1, ε:m/.05, Mama:m (plus stopping weights such as .1 and .2).
Observe talk = mm; recover think by composition. Compatible think strings and their path weights: XX/.032, Mama/.005, Mama Iwant/.0005, Mama Iwant Iwant/.00005, …
Total = .0375555555
Joint Prob. by Double Composition
Compose think (unconstrained, i.e. any think string) o the model o talk = mm:
p(* : mm) = .0375555 = sum of paths
(The slide shows the composed machine, built from arcs such as X:b/.2, IWant:u/.8, X:m/.4, IWant:ε/.1, Mama:m, ε:m/.05.)
Joint Prob. by Double Composition (cont.)
Now constrain think = Mama IWant and compose: think o the model o talk = mm:
p(Mama IWant : mm) = .0005 = sum of paths