The estimation of stochastic context-free grammars using the Inside-Outside algorithm 1998. 10. 16. Oh-Woog Kwon KLE Lab. CSE POSTECH
Contents • Introduction • The Inside-Outside algorithm • Regular versus context-free grammar • Pre-training • The use of grammar minimization • Implementation • Conclusions
Introduction - 1 • HMM => SCFG in speech recognition tasks • The advantages of SCFG's • Ability to capture embedded structure within speech data • Useful at lower levels such as the phonological rule system • Learning: a simple extension of the Baum-Welch re-estimation procedure (the Inside-Outside algorithm) • Little previous work on SCFG's in speech • Two factors for the limited interest in speech • The increased power of CFG's is not useful for natural language: if the set of sentences is finite, a CFG is equivalent to an RG • The time complexity of the Inside-Outside algorithm: O(n³) in both the input string length and the number of grammar symbols
Introduction - 2 • Usefulness of CFG's in NL • The ability to model derivation probabilities matters more than the ability to determine language membership • So, this paper • introduces the Inside-Outside algorithm • compares CFG with RG using the entropy of the language generated by each grammar • Reduction of the time complexity of the In-Outside algorithm • This paper • describes a novel pre-training algorithm (fewer re-estimation iterations) • minimizes the number of non-terminals with grammar minimization (GM): fewer symbols • implements the In-Outside algorithm on a parallel transputer array: the training data is split across processors
The Inside-Outside algorithm - 1 • Chomsky Normal Form (CNF) in SCFG: every rule has the form i → j k or i → m • Generated observation sequence: O = O1, O2, …, OT • The matrices of parameters: A = {a[i,j,k] = P(i → j k)}, B = {b[i,m] = P(i → m)} • Applications of SCFG's • recognition: compute P(O | G) • training: re-estimate A and B so as to maximize P(O | G)
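As a concrete reference for this notation, here is a minimal sketch (Python/NumPy, not from the original slides) of how the CNF parameter matrices can be held: a[i,j,k] stores P(i → j k) and b[i,m] stores P(i → m), jointly normalised per non-terminal. The function name and the random initialisation are illustrative assumptions.

```python
import numpy as np

def random_cnf_scfg(N, M, seed=0):
    """Randomly initialise a CNF SCFG with N non-terminals and M terminals.

    a[i, j, k] = P(i -> j k)   (binary rules)
    b[i, m]    = P(i -> m)     (terminal rules)
    For every non-terminal i the probabilities of all its rules sum to 1.
    """
    rng = np.random.default_rng(seed)
    a = rng.random((N, N, N))
    b = rng.random((N, M))
    # joint normalisation: sum_jk a[i,j,k] + sum_m b[i,m] = 1 for each i
    z = a.sum(axis=(1, 2)) + b.sum(axis=1)
    a /= z[:, None, None]
    b /= z[:, None]
    return a, b
```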
The Inside-Outside algorithm - 2 • Definition of inner (e) and outer (f) probabilities • Inner probability: e(s,t,i) = P(i ⇒* Os … Ot), the probability that non-terminal i derives the subsequence Os … Ot • Outer probability: f(s,t,i) = P(S ⇒* O1 … Os-1 i Ot+1 … OT), the probability of deriving everything outside the span s … t with i left unexpanded • (figure: parse-tree diagrams of the inner and outer spans over positions 1 … s-1, s … t, t+1 … T)
The Inside-Outside algorithm - 3 • Inner probability: computed bottom-up • Case 1 (s = t): rules of the form i → m, so e(s,s,i) = b[i,Os] • Case 2 (s ≠ t): rules of the form i → j k, so e(s,t,i) = Σj Σk Σr=s..t-1 a[i,j,k] e(s,r,j) e(r+1,t,k) • (figure: a binary rule i → j k splitting the span s … t at positions r, r+1)
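A sketch of the bottom-up inner computation, assuming the matrix representation above and 0-based indices (the slides use 1..T); the nested loops make the cubic cost visible. Function and variable names are mine, not the paper's.

```python
import numpy as np

def inner_probabilities(a, b, O):
    """e[s, t, i] = P(non-terminal i derives O[s..t]), computed bottom-up.

    a, b are the CNF rule matrices (see the earlier sketch); O is a list of
    terminal indices of length T.
    """
    N = b.shape[0]
    T = len(O)
    e = np.zeros((T, T, N))
    # Case 1 (s == t): terminal rules i -> m
    for s in range(T):
        e[s, s, :] = b[:, O[s]]
    # Case 2 (s < t): binary rules i -> j k, split at position r
    for span in range(1, T):
        for s in range(T - span):
            t = s + span
            for r in range(s, t):
                # e[s,t,i] += sum_jk a[i,j,k] * e[s,r,j] * e[r+1,t,k]
                e[s, t, :] += np.einsum('ijk,j,k->i',
                                        a, e[s, r, :], e[r + 1, t, :])
    return e
```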
The Inside-Outside algorithm - 4 • Outer probability: computed top-down • Base case: f(1,T,i) = 1 if i = S, 0 otherwise • f(s,t,i) = Σj Σk [ Σr=1..s-1 a[j,k,i] f(r,t,j) e(r,s-1,k) + Σr=t+1..T a[j,i,k] f(s,r,j) e(t+1,r,k) ] • (figure: the two cases, with i as the right child of j over 1 … r … s-1 s … t … T and as the left child of j over 1 … s … t t+1 … r … T)
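A matching sketch of the top-down outer recursion, under the same assumptions (0-based spans, start symbol S at index 0, inner probabilities e from the previous sketch); it mirrors the two cases on the slide, where i appears as the right or the left child of its parent j.

```python
import numpy as np

def outer_probabilities(a, e, start=0):
    """f[s, t, i] = P(S derives O[0..s-1]  i  O[t+1..T-1]), computed top-down.

    `start` is the index of the start symbol S.  A direct, unoptimised
    transcription of the recursion on the slide.
    """
    T, _, N = e.shape
    f = np.zeros((T, T, N))
    f[0, T - 1, start] = 1.0                      # base case: the whole string
    for L in range(T - 1, 0, -1):                 # span lengths, largest first
        for s in range(T - L + 1):
            t = s + L - 1
            acc = np.zeros(N)
            # i is the right child: parent j -> k i, sibling k covers O[r..s-1]
            for r in range(0, s):
                acc += np.einsum('jki,j,k->i', a, f[r, t, :], e[r, s - 1, :])
            # i is the left child: parent j -> i k, sibling k covers O[t+1..r]
            for r in range(t + 1, T):
                acc += np.einsum('jik,j,k->i', a, f[s, r, :], e[t + 1, r, :])
            f[s, t, :] = acc
    return f
```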
The Inside-Outside algorithm - 5 • Recognition Process • By setting s=1, t=T: P(O|G) = e(1,T,S), the inner probability of the start symbol over the whole string • By setting s=t: P(O|G) = Σi e(t,t,i) f(t,t,i), for any position t
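A hypothetical usage snippet tying the two identities together, assuming random_cnf_scfg, inner_probabilities and outer_probabilities from the previous sketches are in scope; both expressions should yield the same P(O|G).

```python
import numpy as np

# Toy grammar and observation string (terminal indices); sizes are arbitrary.
a, b = random_cnf_scfg(N=3, M=2, seed=1)
O = [0, 1, 1, 0]
T = len(O)

e = inner_probabilities(a, b, O)
f = outer_probabilities(a, e, start=0)

# Setting s=1, t=T (slide notation): inner probability of S over the string.
p_whole = float(e[0, T - 1, 0])

# Setting s=t: for any single position t, summing e*f over non-terminals
# gives the same sentence probability.
t = 1
p_pos = float(e[t, t, :] @ f[t, t, :])
assert np.isclose(p_whole, p_pos)
```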
The Inside-Outside algorithm - 6 • Training Process • Use the inner and outer probabilities to compute, for each rule, the expected number of times it is used in deriving the training data, then re-estimate the A and B matrices from these counts (an EM procedure extending Baum-Welch)
The Inside-Outside algorithm - 7 • Re-estimation formulas for a[i,j,k] and b[i,m] • new a[i,j,k] = (expected number of times rule i → j k is used) / (expected number of times non-terminal i is used) = Σs Σt>s Σr=s..t-1 a[i,j,k] e(s,r,j) e(r+1,t,k) f(s,t,i) / Σs Σt≥s e(s,t,i) f(s,t,i) • new b[i,m] = (expected number of times i rewrites to terminal m) / (expected number of times i is used) = Σt:Ot=m e(t,t,i) f(t,t,i) / Σs Σt≥s e(s,t,i) f(s,t,i) • (for a single training string; the 1/P(O|G) factors cancel between numerator and denominator)
The Inside-Outside algorithm - 8 • The Inside-Outside algorithm 1. Choose suitable initial values for the A and B matrices 2. REPEAT A = … {Equation 20} B = … {Equation 21} P = … {Equation 11} UNTIL the change in P is less than a set threshold
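A hedged sketch of this loop for a single training string (so the 1/P(O|G) factors cancel, as noted above); the bracketed equation numbers refer to Lari & Young's paper and are not reproduced here. It reuses inner_probabilities and outer_probabilities from the earlier sketches; names and the stopping constants are my assumptions.

```python
import numpy as np

def reestimate(a, b, O, e, f):
    """One EM update of (a, b) from a single training string O.

    Numerators count expected rule uses, the shared denominator counts
    expected uses of each non-terminal; both are built from the inner (e)
    and outer (f) probabilities.
    """
    T = len(O)
    num_a = np.zeros_like(a)
    num_b = np.zeros_like(b)
    denom = np.zeros(b.shape[0])
    for s in range(T):
        for t in range(s, T):
            denom += e[s, t, :] * f[s, t, :]           # expected uses of i
            if s == t:
                num_b[:, O[s]] += e[s, s, :] * f[s, s, :]
            else:
                for r in range(s, t):
                    # expected uses of rule i -> j k with split point r
                    num_a += a * np.einsum('j,k,i->ijk',
                                           e[s, r, :], e[r + 1, t, :],
                                           f[s, t, :])
    return num_a / denom[:, None, None], num_b / denom[:, None]

def inside_outside(a, b, O, threshold=1e-4, max_iters=100):
    """Repeat re-estimation until the change in log P(O|G) is small."""
    prev = -np.inf
    for _ in range(max_iters):
        e = inner_probabilities(a, b, O)
        f = outer_probabilities(a, e)
        logp = np.log(e[0, len(O) - 1, 0])
        if logp - prev < threshold:
            break
        prev = logp
        a, b = reestimate(a, b, O, e, f)
    return a, b
```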
Regular versus context-free grammar • Measurements for the comparison • The entropy making an ε-representation of L • Empirical entropy • Language for the comparison: palindromes • The number of parameters for each grammar • SCFG: N (# of non-terminals), M (# of terminals) => N³ + NM • HMM (RG): K (# of states), M (# of terminals) => K² + (M+2)K • Condition for the comparison: N³ + NM ≈ K² + (M+2)K (comparable numbers of parameters) • The result (the ability to model derivation probabilities): SCFG > RG
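A small illustration of the parameter-matching condition; the concrete values of N, M and K below are examples of mine, not the figures used in the paper.

```python
def scfg_params(N, M):
    """Free parameters of a CNF SCFG: a[i,j,k] plus b[i,m]."""
    return N ** 3 + N * M

def hmm_params(K, M):
    """Free parameters of a K-state HMM/RG: transitions, emissions, entry/exit."""
    return K ** 2 + (M + 2) * K

# Example pairing: N = 5 non-terminals, M = 3 terminals gives 140 SCFG
# parameters; K = 10 HMM states gives a comparable 150, so models of roughly
# equal size can be compared.
print(scfg_params(5, 3), hmm_params(10, 3))
```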
Pre-training - 1 • Goal: start off with good initial estimates • reducing the number of re-estimation cycles required (by about 40%) • facilitating the generation of a good final model • Pre-training 1. Use the Baum-Welch algorithm (O(n²)) to obtain a set of RG rules 2. RG rules (final matrices) => SCFG rules (initial matrices) 3. Start off the Inside-Outside algorithm (O(n³)) with the initial matrices • Time complexity: a·n² + b·n³ << c·n³, if b << c
Pre-training - 2 • Modification (RG => SCFG) (a) For each bjk, define Yj → k with probability bjk. (b) For each aij, define Xi → Ya Xj with probability aij. (c) For each Si, define S → Xi with probability Si. • If Xi → Ya Xl with ail, then S → Ya Xl with probability Si·ail (d) For each Fj, define Xj → Ya with probability Fj. • If Ya → k with bak, then Xj → k with probability bak·Fj • If the remaining zero parameters are left at zero, the grammar is still effectively an RG • so all parameters += floor value (floor value = 1 / # of non-zero parameters) • re-normalization so that each non-terminal's rule probabilities again sum to one
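The flooring step can be stated compactly; this is a sketch of my reading of the slide (add the floor value to every parameter, then rescale each non-terminal's rules to sum to one), using the same NumPy matrices as in the earlier sketches. The function name is illustrative.

```python
import numpy as np

def add_floor_and_renormalise(a, b):
    """Lift the zero parameters left by the RG -> SCFG mapping.

    Left at zero, the grammar could never move away from its regular
    structure, so a floor of 1 / (number of non-zero parameters) is added to
    every entry and each non-terminal's rules are renormalised.
    """
    floor = 1.0 / (np.count_nonzero(a) + np.count_nonzero(b))
    a = a + floor
    b = b + floor
    z = a.sum(axis=(1, 2)) + b.sum(axis=1)
    return a / z[:, None, None], b / z[:, None]
```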
The use of grammar minimization - 1 • Goal: detect and eliminate redundant and/or useless symbols • A good grammar is self-embedding • A CFG is self-embedding if there is a non-terminal A such that A ⇒* wAx and neither w nor x is ε • Self-embedding requires more non-terminal symbols • Smaller n: speeds up the Inside-Outside algorithm • Constraining the In-Outside algo. • Greedy symbols: take up too many non-terminals • Constraints • allocate a non-terminal to each terminal symbol • force the remaining non-terminals to model the hidden branching process • Infeasible for practical applications (e.g. speech) because of inherent ambiguity
The use of grammar minimization - 2 • Two ways for GM to be incorporated into the In-Outside algo. • First approach: computationally intractable • In-Out algo.: start with a fixed maximum number of symbols • GM: periodically detect and eliminate redundant and useless symbols • Second approach: more practical • In-Out algo.: start with the desired number of non-terminals • GM: periodically (or whenever the change in log P(S) falls below a threshold) detect and reallocate redundant symbols
The use of grammar minimization - 3 • GM algorithm (ad hoc) 1. Detect greedy symbols in a bottom-up fashion 1.1 redundant non-terminals are replaced by a single non-terminal 1.2 free the redundant non-terminals (free non-terminals) 1.3 identical rules are collapsed into a single rule by adding their probabilities 2. Fix the parameters of the remaining non-terminals involved in the generation of greedy symbols (excluded from (3) and (4)) 3. For each free non-terminal i, 3.1 b[i,m] = zero if m is a greedy symbol, randomize b[i,m] otherwise 3.2 a[i,j,k] = zero if j and k are non-terminals of step 2, randomize a[i,j,k] otherwise 4. Randomize a[i,j,k] for i a non-terminal of step 2 and j, k free non-terminals
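A hedged sketch of my reading of steps 3 and 4 (re-randomising the freed non-terminals); step 1's detection of greedy symbols is not shown, the sets `free`, `fixed` and `greedy` are assumed to be given, and the final renormalisation is my addition so the matrices stay proper distributions.

```python
import numpy as np

def reallocate_free_nonterminals(a, b, free, fixed, greedy, seed=0):
    """Steps 3-4 of the ad hoc GM procedure (sketch, not the paper's code).

    free   - non-terminals released in step 1 (set of ints)
    fixed  - non-terminals kept to generate the greedy symbols (step 2)
    greedy - greedy terminal symbols (set of ints)
    """
    rng = np.random.default_rng(seed)
    a, b = a.copy(), b.copy()
    N, M = b.shape
    for i in free:
        # 3.1: freed symbols may no longer emit greedy terminals
        b[i, :] = [0.0 if m in greedy else rng.random() for m in range(M)]
        # 3.2: ...nor expand into the fixed non-terminals of step 2
        for j in range(N):
            for k in range(N):
                a[i, j, k] = 0.0 if (j in fixed and k in fixed) else rng.random()
    # 4: fixed symbols get fresh probabilities for rules over freed children
    for i in fixed:
        for j in free:
            for k in free:
                a[i, j, k] = rng.random()
    # renormalise so every non-terminal's rules sum to one (not spelt out
    # on the slide, but needed for a valid SCFG)
    z = a.sum(axis=(1, 2)) + b.sum(axis=1)
    return a / z[:, None, None], b / z[:, None]
```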
Implementation using a transputer array • Goal: • Speed up the In-Outside algo. (100 times faster) • Split the training data into several subsets • The In-Outside algo. works independently on each subset • Implementation (figure: a SUN host and control board driving a chain of 64 transputers: Transputer 1, Transputer 2, …, Transputer 64) • Each transputer works independently on its own data set • The control board computes the updated parameter set and transmits it down the chain to all the others
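To illustrate the data-parallel scheme (each processor accumulates expected counts on its own subset, the counts are summed, and the updated parameters are sent back to all workers), here is a sketch using Python multiprocessing in place of the transputer chain. It assumes the inner/outer helpers from the earlier sketches, 0-based indexing with the start symbol at index 0, and function names of my own.

```python
import numpy as np
from multiprocessing import Pool

def subset_counts(args):
    """E-step on one data subset (playing the role of a single transputer).

    Returns expected-count numerators and the shared denominator, which the
    controller can simply add across workers.
    """
    a, b, subset = args
    num_a, num_b = np.zeros_like(a), np.zeros_like(b)
    denom = np.zeros(b.shape[0])
    for O in subset:
        T = len(O)
        e = inner_probabilities(a, b, O)
        f = outer_probabilities(a, e)
        p = e[0, T - 1, 0]                       # P(O|G) for this string
        for s in range(T):
            for t in range(s, T):
                denom += e[s, t, :] * f[s, t, :] / p
                if s == t:
                    num_b[:, O[s]] += e[s, s, :] * f[s, s, :] / p
                else:
                    for r in range(s, t):
                        num_a += a * np.einsum('j,k,i->ijk', e[s, r, :],
                                               e[r + 1, t, :], f[s, t, :]) / p
    return num_a, num_b, denom

def parallel_update(a, b, data, workers=4):
    """One re-estimation step with the training data split across workers."""
    subsets = [data[w::workers] for w in range(workers)]
    with Pool(workers) as pool:
        parts = pool.map(subset_counts, [(a, b, s) for s in subsets])
    num_a = sum(p[0] for p in parts)
    num_b = sum(p[1] for p in parts)
    denom = sum(p[2] for p in parts)
    return num_a / denom[:, None, None], num_b / denom[:, None]
```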
Conclusions • Usefulness of CFG's in NL • This paper • introduced the Inside-Outside algorithm for speech recognition • compared CFG with RG, using the entropy of the language generated by each grammar, on a "toy" problem • Reduction of the time complexity of the In-Outside algorithm • This paper • described a novel pre-training algorithm (fewer re-estimation iterations) • proposed an ad hoc grammar minimization (GM): fewer symbols • implemented the In-Outside algorithm on a parallel transputer array: the training data is split across processors • Further Research • build SCFG models trained from real speech data