

  1. Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson, jhasegaw@uiuc.edu, University of Illinois at Urbana-Champaign, USA

  2. Lecture 7. Dynamic Bayesian Networks: Trees • Bayesian Network = Factored probability mass function • Definitions: Node, Edge, Graph, Directed, Acyclic, Tree, and Graph Semantics • Belief Propagation; Sum-Product Algorithm • Example: hidden Markov model • Dynamic Bayesian Networks = BN with dynamics • Viterbi approximate inference • Continuous-valued observations • Example: two-stream audiovisual speech recognition, with a non-deterministic phoneme-to-viseme mapping

  3. Example of a Bayesian Network • Does knowing a person’s HEIGHT tell you anything useful about the length of his or her HAIR? • H = height • L = hair length • Are H and L independent, i.e., is p(H,L) = p(H)p(L)?

  4. Solution to the Example • H and L are approximately conditionally independent, given a person’s gender • p(H,L,G) = p(H|G) p(L|G) p(G) • G = “male” or “female” • If you don’t know the gender, then H and L are not independent, because knowing the height can help you to guess the person’s gender • p(H,L) = Σ_G p(H|G) p(L|G) p(G) ≠ p(H) p(L)
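
The check on slide 4 is easy to carry out numerically. The sketch below uses made-up tables for p(G), p(H|G), and p(L|G) (they are not from the lecture) and verifies that p(H,L) = Σ_G p(H|G) p(L|G) p(G) differs from p(H) p(L):

```python
# Minimal sketch of slide 4: H and L are conditionally independent given G,
# but not marginally independent. All probability tables are hypothetical.
p_G = {"male": 0.5, "female": 0.5}
p_H_given_G = {"male": {"tall": 0.7, "short": 0.3},
               "female": {"tall": 0.3, "short": 0.7}}
p_L_given_G = {"male": {"long": 0.2, "short": 0.8},
               "female": {"long": 0.8, "short": 0.2}}

def p_HL(h, l):
    """p(H,L) = sum over G of p(H|G) p(L|G) p(G)."""
    return sum(p_H_given_G[g][h] * p_L_given_G[g][l] * p_G[g] for g in p_G)

p_H = {h: sum(p_HL(h, l) for l in ("long", "short")) for h in ("tall", "short")}
p_L = {l: sum(p_HL(h, l) for h in ("tall", "short")) for l in ("long", "short")}

print(p_HL("tall", "long"))        # joint probability: 0.19
print(p_H["tall"] * p_L["long"])   # product of marginals: 0.25, so H and L are dependent
```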

  5. Components of a Bayesian Network • Nodes = Random Variables • Directed Edges (Arrows) = Dependencies among variables • Parent: a variable’s “parents” are the variables upon which it is dependent, e.g., G is the parent of both H and L (H and L are “sister nodes”; H and L are the “daughters” of G).

  6. Components of a Bayesian Network • Probability distributions: One per variable • Number of columns = number of different values the variable can take • Number of rows = number of different values the variable’s parents can take

  7. Inference in a Bayesian Network • Problem: suppose we know that a person is tall. Estimate the probability that he or she has short hair. • p(L|H) = p(L,H) / Σ_L p(L,H) • p(L,H) = Σ_G p(G,H,L) = Σ_G p(L|G) p(G) p(H|G) • In the graph, unobserved (“hidden”) variables are drawn as circles; observed variables are drawn as squares.

  8. A More Interesting Inference Problem • Does knowing a person’s height tell you how much shampoo he or she uses? • S = amount of shampoo; depends on hair length • p(L,H,G,S) = p(S|L) p(L|G) p(H|G) p(G)

  9. A More Interesting Inference Problem • Suppose we observe that H = “tall”. Then: p(S = “a lot”, H = “tall”) = Σ_L p(S = “a lot” | L) Σ_G p(L|G) p(G) p(H = “tall”|G) • The Reason for Bayesian Networks: Modularity of the Graph → Modularity of Computation (sum over G, independent of S; then for each S, sum over L)
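
The modular computation order on slide 9 can be sketched directly: sum over G once (independently of S), then sum over L for each value of S. Every table below, including p(S|L), is hypothetical.

```python
# Sketch of slide 9: exploit the factorization p(L,H,G,S) = p(S|L) p(L|G) p(H|G) p(G).
p_G = {"m": 0.5, "f": 0.5}
p_H_given_G = {"m": {"tall": 0.7, "short": 0.3}, "f": {"tall": 0.3, "short": 0.7}}
p_L_given_G = {"m": {"long": 0.2, "short": 0.8}, "f": {"long": 0.8, "short": 0.2}}
p_S_given_L = {"long": {"a lot": 0.9, "a little": 0.1},
               "short": {"a lot": 0.2, "a little": 0.8}}

# Step 1: sum over G (does not involve S): p(L, H="tall") for each L
p_L_and_tall = {l: sum(p_L_given_G[g][l] * p_G[g] * p_H_given_G[g]["tall"] for g in p_G)
                for l in ("long", "short")}

# Step 2: for each value of S, sum over L
p_S_and_tall = {s: sum(p_S_given_L[l][s] * p_L_and_tall[l] for l in p_L_and_tall)
                for s in ("a lot", "a little")}

# Normalize to get p(S | H="tall")
total = sum(p_S_and_tall.values())
print({s: v / total for s, v in p_S_and_tall.items()})
```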

  10. Definition: Graph • Node: any unique identifier (letter, integer, phoneme, word) • Node List: a set of nodes, without repetition • {a,c,b,e,d,f} is a valid Node List • {a,c,c,b,e,d,f,f} is not • Edge: a unique identifier, linked to a pair of nodes • Edge List: a set of edges, without repetition • { 1:ac, 2:cb, 3:cd, 4:ce, 5:df, 6:ef } • Graph: A Node List and an Edge List, such that all edges connect nodes selected from the Node List
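
A minimal sketch of the node-list / edge-list representation on slide 10, using the same six nodes and six edges; the validity check simply confirms that every edge connects nodes drawn from the node list.

```python
# Node list as a set (forbids repetition); edge list maps edge id -> (node, node).
nodes = {"a", "b", "c", "d", "e", "f"}
edges = {1: ("a", "c"), 2: ("c", "b"), 3: ("c", "d"),
         4: ("c", "e"), 5: ("d", "f"), 6: ("e", "f")}

def is_valid_graph(nodes, edges):
    """A graph is valid if every edge connects nodes selected from the node list."""
    return all(u in nodes and v in nodes for (u, v) in edges.values())

print(is_valid_graph(nodes, edges))  # True
```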

  11. Directed Graph • Directed graph: a graph in which each edge is directed, i.e., the order of the node pair is important • Example: • Nodes = {a,b,c,d,e,f} • Edges = { 1:ac, 2:bc, 3:cd, 4:ce, 5:df, 6:ef } • Parent node/Mother node: the node listed first on an edge • { a, b, c, c, d, e } are the parents of { c, c, d, e, f, f } • Daughter node: the node listed second on an edge

  12. Acyclic Graph • Acyclic Graph: A graph in which there is only one path from any node to any other node • It is always possible to turn a Cyclic Graph into an Acyclic Graph by deleting edges

  13. Tree • A tree is a directed, acyclic graph, with the following extra limitations: • No node has more than one parent • There is a unique root node • There is a unique directed path from the root node to any other node
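
The conditions on slide 13 translate into a small check on a directed edge list. The sketch below is one possible implementation, not the lecture's; the two example graphs are hypothetical (the first reuses the two-parent graph of slide 11, so it fails the test).

```python
# Check the tree conditions of slide 13 for a directed graph given as
# {edge_id: (mother, daughter)}: at most one parent per node, a unique root,
# and a directed path from the root to every other node.
def is_tree(nodes, edges):
    parents = {}                         # daughter -> mother
    children = {n: [] for n in nodes}
    for (mother, daughter) in edges.values():
        if daughter in parents:          # a second parent: not a tree
            return False
        parents[daughter] = mother
        children[mother].append(daughter)
    roots = [n for n in nodes if n not in parents]
    if len(roots) != 1:                  # must have a unique root node
        return False
    reached, stack = set(), [roots[0]]   # every node must be reachable from the root
    while stack:
        n = stack.pop()
        reached.add(n)
        stack.extend(children[n])
    return reached == set(nodes)

nodes = {"a", "b", "c", "d", "e", "f"}
graph_slide11 = {1: ("a", "c"), 2: ("b", "c"), 3: ("c", "d"),
                 4: ("c", "e"), 5: ("d", "f"), 6: ("e", "f")}
example_tree = {1: ("a", "c"), 2: ("c", "b"), 3: ("c", "d"),
                4: ("c", "e"), 5: ("d", "f")}
print(is_tree(nodes, graph_slide11))  # False: c has two parents, f has two parents
print(is_tree(nodes, example_tree))   # True: unique root a, one parent per node
```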

  14. A Few Trees

  15. Descendants and Non-descendants in a Tree (figure: a tree with node B marked; the subtree below B is labeled “Descendants of Node B,” the rest of the tree is labeled “Non-descendants of Node B”)

  16. Graph Semantics • The “semantics” of a graph is the set of meanings applied to its nodes and edges • Example: Finite State Diagram • Each NODE represents a state that the system can occupy for one period of time • Each EDGE represents a possible state transition • Example: Markov State Diagram • Same as a regular Finite State Diagram, but • There can be more than one edge leaving a node • When multiple edges leave a node, each has an associated probability

  17. Bayesian Network • A “Bayesian Network” is a particular SEMANTICS for any directed graph • Each NODE is a random variable • The probability distribution for each node depends only on its parents • Example: • p(a,b,c,d,e,f) = p(f|d,e) p(d|c) p(e|c) p(c|a,b) p(a) p(b)

  18. Factored Probabilities • Computations using a graph are simplified because the probability mass function (PMF) is factored • Factorization of the PMF is shown by the graph • p(a,b,c,d,e,f) = p(f|d,e) p(d|c) p(e|c) p(c|a,b) p(a) p(b)

  19. Belief Propagation • Belief propagation = • Given knowledge about some of the variables in the graph, • Compute the posterior probabilities of all other variables • Because the PMF is factored, belief propagation is entirely local • For example, given p(f|c), we can compute p(a,f) locally: p(a,f) = Σ_c Σ_b p(f|c) p(c|a,b) p(b) p(a)

  20. Belief Propagation: Easier in Trees • In general, belief propagation in an arbitrary graph is difficult (wait until next lecture), but belief propagation in a tree can use an efficient algorithm called the “sum-product algorithm.”

  21. The Sum-Product Algorithm • Dv = The set of all observed descendants of node v • Nv = The set of all observed nondescendants of node v • Propagate Up: for each variable v, • For each daughter di, compute the sum: p(Ddi | v) = Σdi p(di | v) p(Ddi | di) • Combine different daughters d1,…,dN using the product: p(Dv | v) = Πi p(Ddi | v) • Propagate Down: for each variable v, whose mother is m: • Combine the other daughters of m, s1,…,sM, using the product: p(m, Nv) = p(m, Nm) Πi,si≠v p(Dsi | m) • Compute the sum: p(v, Nv) = Σm p(v | m) p(m, Nv) • Multiply: p(v | observations) = p(v, Nv) p(Dv | v) / Σv p(v, Nv) p(Dv | v)
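
A compact sketch of the recursions on slide 21, for a small tree of binary variables. The tree structure, conditional probability tables, and evidence below are hypothetical (they are not the lecture's Example #1), but the two passes follow the slide: a sum over each daughter followed by a product, then a product over sisters followed by a sum over the mother.

```python
# Sum-product on a tree of discrete variables (a hypothetical four-node tree).
states = ["0", "1"]                      # every variable is binary for simplicity
parent = {"c": "a", "b": "c", "d": "c"}  # daughter -> mother; "a" is the root
children = {"a": ["c"], "c": ["b", "d"], "b": [], "d": []}
p_root = {"0": 0.6, "1": 0.4}            # p(a)
# cpt[v][m][x] = p(v = x | mother(v) = m)
cpt = {"c": {"0": {"0": 0.7, "1": 0.3}, "1": {"0": 0.2, "1": 0.8}},
       "b": {"0": {"0": 0.9, "1": 0.1}, "1": {"0": 0.4, "1": 0.6}},
       "d": {"0": {"0": 0.5, "1": 0.5}, "1": {"0": 0.1, "1": 0.9}}}
observed = {"b": "1", "d": "0"}          # evidence on the leaves

up = {}    # up[v][x] = p(Dv | v = x), observed descendants of v
def propagate_up(v):
    for d in children[v]:
        propagate_up(d)
    up[v] = {}
    for x in states:
        prod = 1.0
        for d in children[v]:            # one sum per daughter, then a product
            vals = [observed[d]] if d in observed else states
            prod *= sum(cpt[d][x][y] * up[d][y] for y in vals)
        up[v][x] = prod

down = {}  # down[v][x] = p(v = x, Nv), observed nondescendants of v
def propagate_down(v):
    if v not in parent:
        down[v] = dict(p_root)           # root: p(v, Nv) = p(v)
    else:
        m = parent[v]
        msg = {}
        for xm in states:                # combine the sisters of v, then sum over m
            prod = down[m][xm]
            for s in children[m]:
                if s == v:
                    continue
                vals = [observed[s]] if s in observed else states
                prod *= sum(cpt[s][xm][y] * up[s][y] for y in vals)
            msg[xm] = prod
        down[v] = {x: sum(cpt[v][xm][x] * msg[xm] for xm in states) for x in states}
    for d in children[v]:
        propagate_down(d)

propagate_up("a")
propagate_down("a")
for v in ["a", "c"]:                     # posterior of each hidden variable
    joint = {x: down[v][x] * up[v][x] for x in states}
    z = sum(joint.values())              # z equals p(evidence) for every node
    print(v, {x: round(joint[x] / z, 3) for x in states})
```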

  22. Example #1: Six-Node Network • Propagating Up: • Nodes g, f, b, and e have no daughters, so we define • p(Dg | g) = 1 • p(Df | f) = 1 • p(De | e) = 1 • p(Db | b) = 1

  23. Example #1: Six-Node Network • Propagating Up: • Node d: Sums • First daughter: g is observed, so p(Dg | d) = Σg p(g | d) p(Dg | g) = p(g | d) • Second daughter: f is observed, so p(Df | d) = p(f | d) • Product: p(Dd | d) = p(g,f|d) = p(g|d) p(f|d)

  25. Example #1: Six-Node Network • Propagating Up: • Node c: Sums • First daughter: p(Db | c) = Σb p(Db | b) p(b | c) = p(b | c) • Second daughter: p(Dd | c) = Σd p(d|c) p(Dd | d) • Third daughter: p(De | c) = Σe p(De | e) p(e | c) = Σe p(e|c) = 1 • Node c: Product: • p(Dc | c) = p(b,f,g|c) = p(b|c) p(f,g|c)

  28. Example #1: Six-Node Network • Propagating Up: • Node c: Sums • First daughter: p(Db | c) = Σb p(Db | b) p(b | c) = p(b | c) • Second daughter: p(Dd | c) = Σd p(d|c) p(Dd | d) • Third daughter: p(De | c) = Σe p(De | e) p(e | c) = Σe p(e|c) = 1 • Node c: Product: • p(Dc | c) = p(b,f,g|c) = p(Db|c) p(Dd|c) p(De|c)

  29. Example #1: Six-Node Network • Propagating Up: • Node a: Sums • One Daughter: p(Dc | a) = p(b,f,g|a) = Σc p(c|a) p(Dc|c) • Node a: Product • p(Da | a) = p(Dc | a)

  30. Example #1 • Propagating Down: • p(a,Na) = p(a)

  31. Example #1 • Propagating Down: • Node c: Product • None: p(a,Nc) = p(a,Na) because c has no sisters • Node c: Sum (Marginalize out the mother) • p(c,Nc) = p(c) = Σa p(c|a) p(a,Nc)

  33. Example #1 • Propagating Down: • Node d: Product • p(c,Nd) = p(c,Nc) p(Db|c) p(De|c) = p(c) p(b|c) • Node d: Sum • p(d,Nd) = p(d,b) = Σc p(d|c) p(c,b)

  35. Example #1 • Propagating Down: • Node e: Product • p(c,Ne) = p(c,Nc) p(Db|c) p(Dd|c) • Node e: Sum • p(e,Ne) = Σc p(e|c) p(c,Ne)

  37. Example #1 • Multiply: • p(a|b,f,g) = p(a,Na) p(Da|a) / Σa p(a,Na) p(Da|a) • p(c|b,f,g) = p(c,Nc) p(Dc|c) / Σc p(c,Nc) p(Dc|c) • p(d|b,f,g) = p(d,Nd) p(Dd|d) / Σd p(d,Nd) p(Dd|d) • p(e|b,f,g) = p(e,Ne) p(De|e) / Σe p(e,Ne) p(De|e)

  38. Example #2: Hidden Markov Model • qt = hidden state variable at time t, 1 ≤ t ≤ T • “N states” → 1 ≤ qt ≤ N • xt = discrete observation at time t, 1 ≤ t ≤ T • xt quantized to K different levels or codes: 1 ≤ xt ≤ K • The “speech recognition problem:” • Several different “word models” available • “Word model” ≡ Parameter values for p(qt|qt-1) and p(xt|qt) • For each word model, compute the likelihood p(x1,…,xT) • The “correct” word is the one with max p(x1,…,xT)

  39. Example #2: Hidden Markov Model • Propagate Up (the “backward algorithm”): • p(DqT | qT) = p(xT|qT) • … • p(Dqt | qt): • Sum: p(Dqt+1 | qt) = Σqt+1 p(qt+1 | qt) p(Dqt+1 | qt+1) • Product: p(Dqt | qt) = p(xt | qt) p(Dqt+1 | qt) • …

  40. Example #2: Hidden Markov Model • Propagate Down (the “forward algorithm”): • p(q1,Nq1) = p(q1) • … • p(qt+1,Nqt+1): • Product: p(qt,Nqt+1) = p(xt | qt) p(qt, Nqt) • Sum: p(qt+1,Nqt+1) = Σqt p(qt+1|qt) p(qt,Nqt+1) • …

  41. Example #2: Hidden Markov Model • Multiply: • p(qT|x1,…,xT) = p(qT,NqT) p(DqT | qT) / ΣqT p(qT,NqT) p(DqT | qT) • The speech recognition problem: • p(x1,…,xT) = ΣqT p(qT,NqT) p(DqT | qT) • Small-vocabulary isolated word recognition: the operations above are repeated for every word model. Choose the word model with maximum p(x1,…,xT)
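
Slides 39-41 are exactly the backward and forward recursions of an HMM. A minimal NumPy sketch is below; the transition matrix A, observation matrix B, initial distribution, and observation sequence are all hypothetical.

```python
import numpy as np

A = np.array([[0.9, 0.1],       # A[i, j] = p(q_{t+1} = j | q_t = i)
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],  # B[i, k] = p(x_t = k | q_t = i)
              [0.1, 0.3, 0.6]])
pi = np.array([0.5, 0.5])       # p(q_1)
x = [0, 1, 2, 2]                # observation sequence x_1..x_T (codebook indices)
T, N = len(x), len(pi)

# Backward pass ("propagate up"): beta[t, i] = p(Dqt | qt = i) = p(x_t..x_T | qt = i)
beta = np.zeros((T, N))
beta[T - 1] = B[:, x[T - 1]]
for t in range(T - 2, -1, -1):
    beta[t] = B[:, x[t]] * (A @ beta[t + 1])

# Forward pass ("propagate down"): alpha[t, i] = p(qt = i, Nqt) = p(qt = i, x_1..x_{t-1})
alpha = np.zeros((T, N))
alpha[0] = pi
for t in range(1, T):
    alpha[t] = (alpha[t - 1] * B[:, x[t - 1]]) @ A

# Likelihood of the whole sequence, as on slide 41, and posterior of the final state
likelihood = np.sum(alpha[T - 1] * beta[T - 1])
posterior_qT = alpha[T - 1] * beta[T - 1] / likelihood
print(likelihood, posterior_qT)
```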

  42. Three More Useful Concepts • Dynamic Bayesian Network • Equal to a Bayesian Network with a periodically repeating central portion • Max-Product Algorithm • An approximate form of inference, used to find the hidden variable sequence q1,…,qM that maximizes p(q1,…,qM,x1,…,xN) • Continuous-Valued Random Variables • Continuous-valued observations: PDF replaces PMF; no effect on the sum-product algorithm; works well • Continuous-valued hidden variables: integral replaces sum in the sum-product algorithm; often incomputable

  43. Hidden Markov Model is an Example of a Dynamic Bayesian Network • A “Dynamic” Bayesian Network is a Bayesian Network with 3 parts: • The initial part, corresponding to the first frame • q1 and x1 • PMF: q1 has no parents, parent of x1 is q1 • The periodic part, which is duplicated T-2 times in order to match a speech signal with T frames • qt and xt • PMF: The parent of qt is qt-1, the parent of xt is qt • The final part, corresponding to the last frame • qT and xT • In an HMM, the final part is the same as the periodic part, but that’s not true for all DBNs

  44. The Max-Product Algorithm • Let the hidden variables in any Bayesian network (dynamic or non-dynamic) be called q1,…,qM • Let the observed variables be called x1,…,xN • The Max-Product Algorithm finds • (q1*,…,qM*)=argmax p(q1,…,qM,x1,…,xN) • Algorithm detail: exactly like the sum-product algorithm, except that every sum is replaced by a maximum • Common use: continuous speech recognition • 1≤qt≤N, where N=Nw*Lw, Nw=number of words, Lw=length of each word • Word wi (0≤i≤Nw-1) is the sequence of states iLw+1 ≤ qt ≤ (i+1)Lw • Max-product algorithm computes the optimum word sequence (w1*,…,wK*)

  45. The Max-Product Algorithm • Dv* = The set of all OPTIMUM descendants of node v • Observed descendants: set to their observed value • Hidden/unobserved descendants: set to their optimum values, d* • Nv* = The set of all OPTIMUM nondescendants of node v • (Nv-m)* = An optimized set including all nondescendants except variable m • Propagate Up: for each hidden variable v, • For each daughter di, compute the max: p(di*, Ddi* | v) = maxdi p(di | v) p(Ddi* | di) • Combine different daughters d1,…,dN using the product: p(Dv* | v) = Πi p(di*, Ddi* | v) • Propagate Down: for each variable v, whose mother is m: • Combine the other daughters of m, s1,…,sM, using the product: p(m, (Nv-m)*) = p(m, Nm*) Πi,si≠v p(si*, Dsi* | m) • Compute the max (choose the optimum value m*): p(v, Nv*) = maxm p(v | m) p(m, (Nv-m)*)
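
Specialized to the HMM of Example #2, the max-product recipe on slide 45 becomes the familiar Viterbi recursion: each sum over qt is replaced by a max, and back-pointers store the optimizing values. The model tables below are the same hypothetical ones used in the forward-backward sketch.

```python
import numpy as np

A = np.array([[0.9, 0.1], [0.2, 0.8]])            # p(q_{t+1} | q_t)
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])  # p(x_t | q_t)
pi = np.array([0.5, 0.5])                         # p(q_1)
x = [0, 1, 2, 2]
T, N = len(x), len(pi)

delta = np.zeros((T, N))           # delta[t, j] = max over q_1..q_{t-1} of p(q_1..q_t=j, x_1..x_t)
psi = np.zeros((T, N), dtype=int)  # back-pointers (the optimum nondescendants)
delta[0] = pi * B[:, x[0]]
for t in range(1, T):
    scores = delta[t - 1][:, None] * A  # scores[i, j] = delta[t-1, i] * A[i, j]
    psi[t] = np.argmax(scores, axis=0)
    delta[t] = np.max(scores, axis=0) * B[:, x[t]]

# Termination and back-trace: the optimum hidden state sequence q_1*..q_T*
q_star = [int(np.argmax(delta[T - 1]))]
for t in range(T - 1, 0, -1):
    q_star.append(int(psi[t, q_star[-1]]))
q_star.reverse()
print(np.max(delta[T - 1]), q_star)
```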

  46. Max-Product Termination • The maximum alignment probability is max p(q1,…,qM,x1,…,xN) = maxv p(v,Nv*) p(Dv* | v) • The optimum hidden variable sequence is the set of all optimum hidden nondescendants of v (stored in the pointer Nv*), plus the set of all optimum hidden descendants of v (stored in the pointer Dv*), plus the optimum value of v itself: (q1*,…,qM*) = argmax p(v,Nv*) p(Dv* | v)

  47. Continuous-Valued Variables • Either the observed variables or the hidden variables can be continuous-valued. • A continuous-valued variable has a probability density function (PDF) instead of a probability mass function (PMF). • Observed variables: • Discrete: variable xi depends on some hidden variables qj,qk, so its PMF is given by the table p(xi|qj,qk) • Continuous: the PDF is written the same way, but now it refers to some function of a continuous-valued variable xi, for example, p(xi|qj,qk) = exp(−(xi−μjk)²/2σjk²) / (σjk√(2π)) • Computation of the Sum-Product algorithm proceeds without change • Hidden variables: • Now, instead of the sum in the sum-product algorithm, it is necessary to use an integral. • Most such integrals have no analytic solution, and must be approximated numerically. • The only analytically available solution is the Kalman filter, for the case when ALL hidden variables are continuous with Gaussian distributions.
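
For continuous observations, only the emission term changes: the table lookup p(xt | qt) is replaced by a Gaussian density evaluated at xt, as in the formula above, and the sum-product recursions are otherwise untouched. A tiny sketch with hypothetical means and variances:

```python
import numpy as np

mu = np.array([0.0, 3.0])      # hypothetical mean of x_t given q_t = 0 or 1
sigma = np.array([1.0, 0.5])   # hypothetical standard deviation given q_t = 0 or 1

def emission_likelihood(x_t):
    """p(x_t | q_t) as a Gaussian PDF, one value per hidden state."""
    return np.exp(-(x_t - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

# Used exactly where B[:, x[t]] appeared in the discrete-observation recursions:
print(emission_likelihood(2.4))
```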

  48. Example #3: Audiovisual Speech Recognition (AVSR) • A viseme is a group of phonemes with identical lip and tongue blade features, e.g., vt = BILABIAL may correspond to qt in {p,b,m}. • Mapping from phonemes to visemes may be probabilistic, e.g., /o/ may not always look ROUNDED_HIGH, thus p(vt|qt) is a PMF • xt and yt may be discrete or continuous • Figure: audio phoneme states q1,…,qT form a Markov chain; each qt has daughters xt (audio spectral observation) and vt (viseme state), and each vt has daughter yt (video observation)

  49. AVSR: Max-Product Algorithm • Propagate Up: maximization step, for the three daughters of qt • p(xt | qt) • p(qt+1*, Dqt+1* | qt) = maxqt+1 p(Dqt+1* | qt+1) p(qt+1 | qt) • p(vt*, Dvt* | qt) = maxvt p(yt | vt) p(vt | qt)

  50. AVSR: Max-Product Algorithm • Propagate Up: product step, multiply together the three daughters of qt • p(Dqt* | qt) = p(vt*, Dvt* | qt) p(qt+1*, Dqt+1* | qt) p(xt | qt)
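
A one-frame sketch of the upward message for qt on slides 49-50: maximize over the viseme daughter, then multiply the three daughter terms together. Every table below, and the incoming message from qt+1, is hypothetical.

```python
import numpy as np

p_x_given_q = np.array([0.4, 0.1, 0.2])        # p(x_t | q_t), one entry per phoneme state
p_v_given_q = np.array([[0.8, 0.2],            # p(v_t | q_t): rows = q_t, columns = v_t
                        [0.3, 0.7],
                        [0.5, 0.5]])
p_y_given_v = np.array([0.6, 0.05])            # p(y_t | v_t), one entry per viseme
msg_from_qnext = np.array([0.02, 0.07, 0.01])  # p(q_{t+1}*, Dq_{t+1}* | q_t), assumed already computed

# Maximization over the viseme daughter (slide 49): p(v_t*, Dv_t* | q_t)
viseme_msg = np.max(p_v_given_q * p_y_given_v[None, :], axis=1)

# Product over the three daughters of q_t (slide 50): p(Dq_t* | q_t)
upward_msg = p_x_given_q * viseme_msg * msg_from_qnext
print(upward_msg)
```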
