
Parameterized Finite-State Machines and Their Training

Learn about the vision behind using finite-state machines as a shared language for model training and data processing. This approach enables easier, faster, and better training, leading to the development of more sophisticated models across various applications such as text processing, speech recognition, machine translation, and more.


Presentation Transcript


  1. Parameterized Finite-State Machines and Their Training. Jason Eisner, Johns Hopkins University. March 4, 2004, Saarbrücken

  2. Outline – The Vision Slide! • Finite-state machines as a shared language for models & data. • The training gizmo (an algorithm). We should use out-of-the-box finite-state gizmos to build and train most of our current models. Easier, faster, better, & enables fancier models.

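  3. Function from strings to ... • Unweighted acceptors (FSAs): {false, true} • Unweighted transducers (FSTs): strings • Weighted acceptors (FSAs): numbers • Weighted transducers (FSTs): (string, num) pairs. [Figure: example arcs c, a, e (unweighted FSA); c:z, a:x, e:y (unweighted FST); c/.7, a/.5, e/.5, .3 (weighted FSA); c:z/.7, a:x/.5, e:y/.5, .3 (weighted FST)]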

  4. Sample functions • Unweighted acceptors (strings to {false, true}): Grammatical? • Unweighted transducers (strings to strings): Markup, Correction, Translation • Weighted acceptors (strings to numbers): How grammatical? Better, how likely? • Weighted transducers (strings to (string, num) pairs): Good markups, Good corrections, Good translations

  5. Sample data, encoded same way • Unweighted acceptors (strings to {false, true}): Input string, Corpus, Dictionary • Unweighted transducers (strings to strings): Bilingual corpus, Bilingual lexicon, Database (WordNet) • Weighted acceptors (strings to numbers): Input lattice, Reweighted corpus, Weighted dictionary • Weighted transducers (strings to (string, num) pairs): Prob. bilingual lexicon, Weighted database. [Figure: example strings drawn as chains of letter arcs]

  6. Some Applications • Prediction, classification, generation of text • More generally, “filling in the blanks” (probabilistic reconstruction of hidden data) • Speech recognition • Machine translation, OCR, other noisy-channel models • Sequence alignment / Edit distance / Computational biology • Text normalization, segmentation, categorization • Information extraction • Stochastic phonology/morphology, including lexicon • Tagging, chunking, finite-state parsing • Syntactic transformations (smoothing PCFG rulesets)

  7. Finite-state “programming” (Programming Langs / Finite-State World) • Function / Function on strings • Programmer / Programmer • Source code / Regular expression, e.g., a?c* • Compiler / Regexp compiler • Object code / Finite-state machine • Optimizer / Determinization, minimization, pruning • Better object code / Better object code

  8. Finite-state “programming” (Programming Langs / Finite-State World) • Function composition / FST/WFST composition • Function inversion (available in Prolog) / FST inversion • Higher-order functions / Finite-state operators • Small modular cooperating functions (structured programming) / Small modular regexps, combined via operators

  9. Finite-state “programming” More features you wish other languages had! Programming Langs Finite-State World

  10. Currently Two Finite-State Camps

  11. How About Learning??? “If you build it, it will train.” Training Probabilistic FSMs • State of the world – surprising: • Training for HMMs, alignment, many variants • But no basic training algorithm for all FSAs • Fancy toolkits for building them, but no learning • New algorithm: • Training for FSAs, FSTs, … (collectively FSMs) • Supervised, unsupervised, incompletely supervised … • Train components separately or all at once • Epsilon-cycles OK • Complicated parameterizations OK

  12. Current Limitation. Knight & Graehl 1997, transliteration: p(English text) o p(English text → English phonemes) o p(English phonemes → Japanese phonemes) o p(Japanese phonemes → Japanese text) • Big FSM must be made of separately trainable parts. • Need explicit training data for this part (smaller loanword corpus). • A pity – would like to use guesses. • Topology must be simple enough to train by current methods. • A pity – would like to get some of that expert knowledge in here! • Topology: sensitive to syllable struct? • Parameterization: /t/ and /d/ are similar phonemes … parameter tying?

  13. Probabilistic FSA. [Figure: one branch with arcs ε/.5, a/.7, b/.3; another branch with arcs ε/.5, a/1, b/.6 and stopping weight .4] Example: ab is accepted along 2 paths: p(ab) = (.5 × .7 × .3) + (.5 × .6 × .4) = .225. Regexp: (a*.7 b) +.5 (ab*.6). Theorem: Any probabilistic FSM has a regexp like this.
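The path sum on slide 13 can be checked with a minimal sketch. The state numbers and the arc table below are assumptions reconstructed from the regexp (a*.7 b) +.5 (ab*.6); the slide's figure does not name its states, and `string_prob` is a hypothetical helper, not part of any toolkit.

```python
# Hypothetical encoding of the slide-13 probabilistic FSA.
# ARCS[state] = list of (label, weight, next_state); label "" is epsilon.
ARCS = {
    0: [("", 0.5, 1), ("", 0.5, 2)],    # start: pick one of the two branches
    1: [("a", 0.7, 1), ("b", 0.3, 3)],  # branch for a* b
    2: [("a", 1.0, 4)],                 # branch for a b*
    3: [],
    4: [("b", 0.6, 4)],
}
STOP = {3: 1.0, 4: 0.4}  # stopping probabilities of the final states


def string_prob(s, state=0):
    """Sum the weights of all accepting paths that spell s.

    Assumes the machine has no epsilon cycles, so the recursion terminates.
    """
    total = STOP.get(state, 0.0) if s == "" else 0.0
    for label, weight, nxt in ARCS[state]:
        if label == "":
            total += weight * string_prob(s, nxt)
        elif s.startswith(label):
            total += weight * string_prob(s[len(label):], nxt)
    return total


p_ab = string_prob("ab")  # two paths: .5*.7*.3 and .5*1*.6*.4
```

Enumerating paths like this is only practical for tiny machines; it is meant to make the two-path arithmetic concrete.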

  14. Weights Need Not be Reals. [Figure: one branch with arcs ε/p, a/q, b/r; another branch with arcs ε/w, a/x, b/y and stopping weight z] Example: ab is accepted along 2 paths: weight(ab) = (p ⊗ q ⊗ r) ⊕ (w ⊗ x ⊗ y ⊗ z). If ⊕, ⊗, * satisfy the “semiring” axioms, the finite-state constructions continue to work correctly.
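As a rough illustration of why the semiring abstraction is useful, the sketch below computes the weight of ab twice from the same two paths, once with (+, ×) and once with (max, ×). The `Semiring` type and the numeric path weights (borrowed from slide 13) are assumptions made for this example.

```python
from functools import reduce
from typing import Callable, NamedTuple


class Semiring(NamedTuple):
    plus: Callable[[float, float], float]   # combines alternative paths
    times: Callable[[float, float], float]  # combines arcs along one path
    zero: float
    one: float


REAL = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0)


def total_weight(paths, sr):
    """Semiring sum over paths of the semiring product of arc weights."""
    result = sr.zero
    for path in paths:
        result = sr.plus(result, reduce(sr.times, path, sr.one))
    return result


# Assumed arc weights along the two accepting paths for "ab".
PATHS_FOR_AB = [[0.5, 0.7, 0.3], [0.5, 1.0, 0.6, 0.4]]

prob_of_ab = total_weight(PATHS_FOR_AB, REAL)      # total probability
best_path = total_weight(PATHS_FOR_AB, VITERBI)    # weight of the best path
```

Only the two operations change between the calls; the path structure stays the same.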

  15. Goal: Parameterized FSMs • Parameterized FSM: an FSM whose arc probabilities depend on parameters: they are formulas. [Figure: arcs labeled ε/p, ε/1-p, a/q, a/r, b/(1-q)r, a/q*exp(t+u), a/exp(t+v), and stopping weight 1-s] Expert first: Construct the FSM (topology & parameterization). Automatic takes over: Given training data, find parameter values that optimize arc probs.

  16. Goal: Parameterized FSMs • Parameterized FSM: an FSM whose arc probabilities depend on parameters: they are formulas. [Figure: arcs labeled ε/.1, ε/.9, a/.2, a/.3, b/.8, a/.44, a/.56, and stopping weight .7] Expert first: Construct the FSM (topology & parameterization). Automatic takes over: Given training data, find parameter values that optimize arc probs.

  17. Goal: Parameterized FSMs • FSM whose arc probabilities are formulas. Knight & Graehl 1997, transliteration: p(English text) o p(English text → English phonemes) o p(English phonemes → Japanese phonemes) o p(Japanese phonemes → Japanese text). “Would like to get some of that expert knowledge in here”: use probabilistic regexps like (a*.7 b) +.5 (ab*.6) … If the probabilities are variables, (a*x b) +y (ab*z) …, then arc weights of the compiled machine are nasty formulas. (Especially after minimization!)

  18. Goal: Parameterized FSMs • An FSM whose arc probabilities are formulas. Knight & Graehl 1997, transliteration: p(English text) o p(English text → English phonemes) o p(English phonemes → Japanese phonemes) o p(Japanese phonemes → Japanese text). “/t/ and /d/ are similar …” Tied probs for doubling them: /t/:/tt/ has probability p; /d/:/dd/ has probability p.

  19. Goal: Parameterized FSMs • An FSM whose arc probabilities are formulas. Knight & Graehl 1997, transliteration: p(English text) o p(English text → English phonemes) o p(English phonemes → Japanese phonemes) o p(Japanese phonemes → Japanese text). “/t/ and /d/ are similar …” Loosely coupled probabilities: /t/:/tt/ has weight exp(p+q+r) (coronal, stop, unvoiced); /d/:/dd/ has weight exp(p+q+s) (coronal, stop, voiced) (with normalization).
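One way to read "loosely coupled probabilities (with normalization)" is as a small log-linear model. The sketch below is an assumption-laden illustration: the slide does not say what the weights are normalized against, so it supposes each phoneme chooses between being doubled and staying single, with the single outcome given a baseline score of exp(0). The feature names follow the slide; the parameter values are made up.

```python
import math

# Phonological features shared by /t/ and /d/ (from the slide).
FEATURES = {
    "t": ("coronal", "stop", "unvoiced"),
    "d": ("coronal", "stop", "voiced"),
}


def doubling_prob(phoneme, theta):
    """P(phoneme is doubled) under an assumed two-outcome log-linear model."""
    score_double = math.exp(sum(theta[f] for f in FEATURES[phoneme]))
    score_single = 1.0  # exp(0): assumed baseline score for no doubling
    return score_double / (score_double + score_single)


# Hypothetical parameter values: p (coronal), q (stop), r (unvoiced), s (voiced).
theta = {"coronal": -1.0, "stop": 0.5, "unvoiced": 0.2, "voiced": -0.2}

p_t = doubling_prob("t", theta)
p_d = doubling_prob("d", theta)
```

Because /t/ and /d/ share the coronal and stop parameters, evidence about one phoneme moves the other's probability too, while the voicing parameters let them differ.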

  20. Outline of this talk • What can you build with parameterized FSMs? • How do you train them?

  21. Finite-State Operations • Projection GIVES YOU marginal distribution: domain(p(x,y)) = p(x); range(p(x,y)) = p(y). [Figure: example arc a : b / 0.3]

  22. Finite-State Operations • Probabilistic union GIVES YOU mixture model: p(x) +0.3 q(x) = 0.3 p(x) + 0.7 q(x). [Figure: branch of weight 0.3 into p(x), branch of weight 0.7 into q(x)]

  23. Finite-State Operations • Probabilistic union GIVES YOU mixture model: p(x) +λ q(x) = λ p(x) + (1-λ) q(x). Learn the mixture parameter λ!
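Learning the mixture parameter can be sketched with a few lines of EM. This is an illustration under stated assumptions, not the talk's algorithm: the two component distributions are hypothetical toy tables held fixed, and only the mixing weight (called `lam` here) is re-estimated.

```python
def em_mixture_weight(data, p, q, lam=0.5, iterations=200):
    """Estimate the mixing weight of lam*p(x) + (1-lam)*q(x) by EM.

    p and q are dicts from strings to probabilities and stay fixed.
    Assumes every observed string has nonzero probability under the mixture.
    """
    for _ in range(iterations):
        # E-step: posterior probability that each observation came from p.
        posteriors = []
        for x in data:
            from_p = lam * p.get(x, 0.0)
            from_q = (1.0 - lam) * q.get(x, 0.0)
            posteriors.append(from_p / (from_p + from_q))
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


# Hypothetical component distributions and training sample.
P = {"a": 0.9, "b": 0.1}
Q = {"a": 0.2, "b": 0.8}
DATA = ["a"] * 6 + ["b"] * 4

estimated = em_mixture_weight(DATA, P, Q)
```

With these toy numbers the maximum-likelihood weight solves 0.2 + 0.7·lam = 0.6, so the estimate settles near 4/7.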

  24. Finite-State Operations • Composition GIVES YOU chain rule: p(x|y) o p(y|z) = p(x|z); p(x|y) o p(y|z) o z = p(x,z) • The most popular statistical FSM operation • Cross-product construction
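The chain-rule reading of composition can be illustrated on weighted relations stored as plain dictionaries. This is a deliberate simplification: real FST composition uses a cross-product of states, whereas the sketch below only shows the sum over the shared middle string y. The tables and names are hypothetical.

```python
from collections import defaultdict


def compose(p_x_given_y, p_y_given_z):
    """Return p(x | z) = sum over y of p(x | y) * p(y | z).

    Each argument maps (output, input) pairs to probabilities.
    """
    result = defaultdict(float)
    for (x, y_left), w_left in p_x_given_y.items():
        for (y_right, z), w_right in p_y_given_z.items():
            if y_left == y_right:  # paths must agree on the middle string
                result[(x, z)] += w_left * w_right
    return dict(result)


# Hypothetical conditional tables.
P_X_GIVEN_Y = {("x1", "y1"): 0.9, ("x2", "y1"): 0.1,
               ("x1", "y2"): 0.3, ("x2", "y2"): 0.7}
P_Y_GIVEN_Z = {("y1", "z1"): 0.6, ("y2", "z1"): 0.4}

P_X_GIVEN_Z = compose(P_X_GIVEN_Y, P_Y_GIVEN_Z)
```

The result is again a conditional distribution: for each z, the values over x sum to one.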

  25. Finite-State Operations • Concatenation, p(x) q(x), and probabilistic closure, p(x) *0.3, HANDLE unsegmented text. [Figure: closure of p(x) with weights 0.3 and 0.7] • Just glue together machines for the different segments, and let them figure out how to align with the text

  26. Finite-State Operations • Directed replacement MODELS noise or postprocessing: p(x, noisy y) = p(x,y) o D, where D is the noise model defined by dir. replacement • Resulting machine compensates for noise or postprocessing

  27. Finite-State Operations • Intersection GIVES YOU product models: p(x) & q(x) = p(x)*q(x) • e.g., exponential / maxent, perceptron, Naïve Bayes, …: pNB(y | x) ∝ p(A(x)|y) & p(B(x)|y) & p(y) • Need a normalization op too – computes Σx f(x), the “pathsum” or “partition function” • Cross-product construction (like composition)
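A product model and its normalization step can be sketched on finite weighted string sets. The dictionaries below stand in for weighted acceptors and are made-up toy values; a real implementation would intersect machines rather than enumerate strings.

```python
def product_model(p, q):
    """Multiply two weighted string sets pointwise, then renormalize.

    Returns the normalized distribution and the normalizer, i.e. the
    pathsum (partition function) of the unnormalized product.
    """
    unnormalized = {x: p[x] * q[x] for x in p if x in q}
    z = sum(unnormalized.values())
    return {x: w / z for x, w in unnormalized.items()}, z


# Hypothetical weighted acceptors over a few strings.
P = {"ab": 0.5, "a": 0.3, "b": 0.2}
Q = {"ab": 0.2, "b": 0.8}

product, partition = product_model(P, Q)
```

The string "a" drops out because Q gives it no weight, which is why the product must be renormalized.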

  28. Finite-State Operations • Conditionalization (new operation): condit(p(x,y)) = p(y | x) • Construction: reciprocal(determinize(domain(p(x,y)))) o p(x,y); not possible for all weighted FSAs • Resulting machine can be composed with other distributions: p(y | x) * q(x)
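The construction for conditionalization can be mirrored on a joint table: take the domain to get p(x), invert its weights, and multiply back into the joint. The sketch assumes a finite joint distribution stored as a dictionary with positive marginals; determinization has no counterpart here because a dictionary already has one entry per x.

```python
from collections import defaultdict


def conditionalize(joint):
    """Turn p(x, y), keyed by (x, y), into p(y | x)."""
    marginal = defaultdict(float)  # plays the role of domain(p(x,y)) = p(x)
    for (x, _y), w in joint.items():
        marginal[x] += w
    reciprocal = {x: 1.0 / w for x, w in marginal.items()}  # weights 1/p(x)
    # Composing the 1/p(x) acceptor with the joint rescales every pair.
    return {(x, y): reciprocal[x] * w for (x, y), w in joint.items()}


# Hypothetical joint distribution.
JOINT = {("a", "x"): 0.1, ("a", "y"): 0.3, ("b", "x"): 0.6}

CONDITIONAL = conditionalize(JOINT)
```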

  29. Other Useful Finite-State Constructions • Complete graphs YIELD n-gram models • Other graphs YIELD fancy language models (skips, caching, etc.) • Compilation from other formalism → FSM: • Wordlist (cf. trie), pronunciation dictionary ... • Speech hypothesis lattice • Decision tree (Sproat & Riley) • Weighted rewrite rules (Mohri & Sproat) • TBL or probabilistic TBL (Roche & Schabes) • PCFG (approximation!) (e.g., Mohri & Nederhof) • Optimality theory grammars (e.g., Eisner) • Logical description of set (Vaillette; Klarlund)

  30. Regular Expression Calculus as a Programming Language (Programming Langs / Finite-State World) • Function / Function on strings • Programmer / Programmer • Source code / Regular expression, e.g., a?c* • Compiler / Regexp compiler • Object code / Finite-state machine • Optimizer / Determinization, minimization, pruning • Better object code / Better object code

  31. Regular Expression Calculus as a Modelling Language • Oops! Statistical FSMs still done “in assembly language”! • Build machines by manipulating arcs and states • For training, • get the weights by some exogenous procedure and patch them onto arcs • you may need extra training data for this • you may need to devise and implement a new variant of EM • Would rather build models declaratively: ((a*.7 b) +.5 (ab*.6)) ∘ repl.9((a:(b +.3 ε))*,L,R)

  32. A Simple Example: Segmentation. tapirseatgrass → tapirs eat grass? tapir seat grass? tap irse at grass? ... Strategy: build a finite-state model of p(spaced text, spaceless text). Then maximize p(???, tapirseatgrass). • Start with a distribution p(English word) → a machine D (for dictionary) • Construct p(spaced text) → (D space)*0.99 D • Compose with p(spaceless | spaced) → ((¬space)+(space:ε))*

  33. A Simple Example: Segmentation. Strategy: build a finite-state model of p(spaced text, spaceless text). Then maximize p(???, tapirseatgrass). • Start with a distribution p(English word) → a machine D (for dictionary) • Construct p(spaced text) → (D space)*0.99 D • Compose with p(spaceless | spaced) → ((¬space)+(space:ε))* • Train on spaced or spaceless text • D should include novel words: D = KnownWord +0.99 (Letter*0.85 Suffix). Could improve to consider letter n-grams, morphology ... • Noisy channel could do more than just delete spaces: vowel deletion (Semitic); OCR garbling (cl → d, ri → n, rn → m ...)
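The segmentation strategy can be sketched without any finite-state toolkit by running a Viterbi search directly over character positions. The dictionary D below is a made-up toy distribution, and the search is a stand-in for composing the machines, not the construction used in the talk. Because deleting spaces is deterministic, p(spaceless | spaced) is 1 for every consistent spaced text, so only p(spaced text) needs to be maximized.

```python
import math

# Hypothetical toy dictionary D: p(English word). The values are invented.
D = {"tapir": 0.2, "tapirs": 0.2, "eat": 0.2, "seat": 0.1,
     "grass": 0.2, "tap": 0.05, "at": 0.05}
P_CONTINUE = 0.99  # closure weight from (D space)*0.99 D


def segment(text):
    """Return the most probable spaced text for a spaceless string, or None."""
    # best[i] holds (log probability, words) for the best split of text[:i].
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is None or word not in D:
                continue
            score = best[start][0] + math.log(D[word])
            if start > 0:
                score += math.log(P_CONTINUE)  # one more trip around the loop
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    # The constant exit weight (1 - P_CONTINUE) does not affect the argmax.
    return None if best[-1] is None else " ".join(best[-1][1])


answer = segment("tapirseatgrass")
```

With this toy D, "tap irse at grass" is ruled out because "irse" has no probability; handling such novel words is the role of the Letter*0.85 Suffix branch on the slide.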

  34. Outline • What can you build with parameterized FSMs? • How do you train them? Hint: Make the finite-state machinery do the work.

  35. How Many Parameters? Final machine p(x,z): 17 weights – 4 sum-to-one constraints = 13 apparently free parameters. But really I built it as p(x,y) o p(z|y), from component machines with 5 free parameters and 1 free parameter.

  36. How Many Parameters? But really I built it as p(x,y) o p(z|y), from component machines with 5 free parameters and 1 free parameter. Even these 6 numbers could be tied ... or derived by formula from a smaller parameter set.

  37. How Many Parameters? But really I built it as p(x,y) o p(z|y), from component machines with 5 free parameters and 1 free parameter. Really I built this as (a:p)*.7 (b: (p +.2 q))*.5, which has 3 free parameters.

  38. Training a Parameterized FST. Given: an expression (or code) to build the FST from a parameter vector θ. (1) Pick an initial value of θ. (2) Build the FST – implements fast prob. model. (3) Run FST on some training examples to compute an objective function F(θ). (4) Collect E-counts or gradient ∇F(θ). (5) Update θ to increase F(θ). (6) Unless we converged, return to step 2.
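The training loop on this slide can be sketched on the smallest possible parameterized machine. Everything specific here is an assumption made for illustration: the machine is the one-parameter FSA for a*q b, so that p(aⁿb) = qⁿ(1 − q); the parameter vector is a single number theta with q = sigmoid(theta); the objective is average log-likelihood; and the update is plain gradient ascent rather than EM.

```python
import math


def train(strings, theta=0.0, rate=0.5, tol=1e-10, max_iters=100000):
    """Fit q in p(a^n b) = q^n (1 - q) by gradient ascent on theta.

    Assumes every training string has the form a...ab.
    """
    a_counts = [s.count("a") for s in strings]  # traversals of the a-loop
    for _ in range(max_iters):
        # Step 2: build the machine's arc weights from the parameter.
        q = 1.0 / (1.0 + math.exp(-theta))
        # Steps 3-4: gradient of the average log-likelihood
        # n*log(q) + log(1 - q) with respect to theta.
        grad = sum(n * (1.0 - q) - q for n in a_counts) / len(a_counts)
        # Step 6: stop once the gradient vanishes.
        if abs(grad) < tol:
            break
        # Step 5: move theta uphill.
        theta += rate * grad
    return 1.0 / (1.0 + math.exp(-theta))


q_hat = train(["aab", "b", "ab", "aaab"])
```

For this sample the average number of a's is 1.5, so the closed-form maximum-likelihood value is 1.5 / 2.5 = 0.6, which the loop should approach.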

  39. Training a Parameterized FST. T = our current FST, reflecting our current guess of the parameter vector. At each training pair (xi, yi), collect E counts or gradients that indicate how to increase p(xi, yi). [Figure: training pairs (x1, y1), (x2, y2), (x3, y3), … passed through T]

  40. What are xi and yi? T = our current FST, reflecting our current guess of the parameter vector. [Figure: xi on the input side of T, yi on the output side]

  41. What are xi and yi? xi = banana, yi = bandaid. T = our current FST, reflecting our current guess of the parameter vector.

  42. What are xi and yi? Fully supervised. T = our current FST, reflecting our current guess of the parameter vector. [Figure: xi and yi drawn as straight-line automata spelling banana and bandaid]

  43. What are xi and yi? Loosely supervised. T = our current FST, reflecting our current guess of the parameter vector. [Figure: xi drawn as an automaton with alternative letter arcs; yi drawn as a straight-line automaton spelling bandaid]

  44. What are xi and yi? Unsupervised, e.g., Baum-Welch: xi = Σ*. Transition seq xi is hidden; emission seq yi is observed. T = our current FST, reflecting our current guess of the parameter vector. [Figure: yi drawn as a straight-line automaton spelling bandaid]

  45. Building the Trellis. COMPOSE to get trellis: xi o T o yi. Extracts paths from T that are compatible with (xi, yi). Tends to unroll loops of T, as in HMMs, but not always.

  46. Summing the Trellis. Trellis: xi o T o yi. Extracts paths from T that are compatible with (xi, yi). Tends to unroll loops of T, as in HMMs, but not always. Let ti = total probability of all paths in trellis = p(xi, yi). This is what we want to increase! xi, yi are regexps (denoting strings or sets of strings). How to compute ti? If acyclic (exponentially many paths): dynamic programming. If cyclic (infinitely many paths): solve sparse linear system.
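For the cyclic case, the total path weight from each state satisfies x = A x + stop, where A holds the arc weights between states, so the pathsum is the solution of (I − A) x = stop. The sketch below solves that system with dense Gauss-Jordan elimination in plain Python for brevity; a real trellis would call for a sparse solver. The example machine is the slide-13 FSA under an assumed state numbering, not an actual trellis.

```python
def pathsum(arc_weight, stop, start=0):
    """Total weight of all accepting paths from `start`.

    arc_weight[i][j] is the summed weight of arcs from state i to state j,
    stop[i] is the stopping weight of state i. Assumes I - A is invertible.
    """
    n = len(stop)
    # Augmented matrix [I - A | stop].
    m = [[(1.0 if i == j else 0.0) - arc_weight[i][j] for j in range(n)]
         + [stop[i]] for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return m[start][n] / m[start][start]


# Assumed numbering: 0 start, 1 = a* b branch, 2 and 4 = a b* branch, 3 final.
A = [[0.0, 0.5, 0.5, 0.0, 0.0],
     [0.0, 0.7, 0.0, 0.3, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.6]]
STOP = [0.0, 0.0, 0.0, 1.0, 0.4]

total = pathsum(A, STOP)
```

Here the two loops contribute infinitely many paths, yet the total comes out as a finite number because each loop's geometric series converges.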

  47. Summing the Trellis. Let ti = total probability of all paths in trellis = p(xi, yi). This is what we want to increase! Remark: In principle, the FSM minimization algorithm already knows how to compute ti, although it is not the best method: minimize(epsilonify(xi o T o yi)) = ti, where epsilonify replaces all arc labels with ε.

  48. Example: Baby Think & Baby Talk. Observe talk: m m. Recover think, by composition. [Figure: think-to-talk transducer with arcs X:b/.2, IWant:u/.8, X:m/.4, IWant:ε/.1, ε:m/.05, Mama:m, and weights .2 and .1; intermediate label Mama/.05] Recovered think sequences: XX/.032, Mama/.005, Mama Iwant/.0005, Mama Iwant Iwant/.00005, ... Total = .0375555555
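The total on this slide can be checked as a geometric series. The sketch assumes, as the listed weights .005, .0005, .00005 suggest, that each additional IWant multiplies the path weight by .1 and that no other think sequences produce "m m".

```python
XX_PATH = 0.032    # think = X X
MAMA_PATH = 0.005  # think = Mama
RATIO = 0.1        # assumed factor for each extra IWant

# Closed form: .032 + .005 * (1 + .1 + .01 + ...) = .032 + .005 / (1 - .1)
total = XX_PATH + MAMA_PATH / (1.0 - RATIO)

# Partial sum over the paths that the slide lists explicitly.
listed = XX_PATH + sum(MAMA_PATH * RATIO ** k for k in range(3))
```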

  49. Joint Prob. by Double Composition. think = Σ*; talk = m m; compose. [Figure: think-to-talk transducer with arcs X:b/.2, IWant:u/.8, X:m/.4, IWant:ε/.1, ε:m/.05, Mama:m, and weights .2 and .1] p(Σ* : mm) = .0375555 = sum of paths

  50. Joint Prob. by Double Composition. think = Mama IWant; talk = m m; compose. [Figure: think-to-talk transducer with arcs X:b/.2, IWant:u/.8, X:m/.4, IWant:ε/.1, ε:m/.05, Mama:m, and weights .2 and .1] p(* : mm) = .0005 = sum of paths
