Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, jhasegaw@uiuc.edu
University of Illinois at Urbana-Champaign, USA
Lecture 7. Dynamic Bayesian Networks: Trees
• Bayesian Network = factored probability mass function
• Definitions: Node, Edge, Graph, Directed, Acyclic, Tree, and Graph Semantics
• Belief propagation; the sum-product algorithm
• Example: hidden Markov model
• Dynamic Bayesian Networks = BN with dynamics
• Viterbi approximate inference
• Continuous-valued observations
• Example: two-stream audiovisual speech recognition, with a non-deterministic phoneme-to-viseme mapping
Example of a Bayesian Network
• Does knowing a person's HEIGHT tell you anything useful about the length of his or her HAIR?
• H = height
• L = hair length
• Are H and L independent, i.e., is p(H,L) = p(H) p(L)?
Solution to the Example
• H and L are approximately conditionally independent, given a person's gender:
  • p(H,L,G) = p(H|G) p(L|G) p(G)
  • G = "male" or "female"
• If you don't know the gender, then H and L are not independent, because knowing the height can help you to guess the person's gender:
  • p(H,L) = Σ_G p(H|G) p(L|G) p(G) ≠ p(H) p(L)
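As a quick numeric check, here is a minimal sketch with made-up tables for p(G), p(H|G), and p(L|G); it shows that p(H,L) = Σ_G p(H|G) p(L|G) p(G) does not equal p(H) p(L). All numbers and names are hypothetical illustrations, not from the lecture.

```python
# Hypothetical CPTs (illustrative numbers only).
p_G = {"male": 0.5, "female": 0.5}
p_H_given_G = {"male": {"tall": 0.7, "short": 0.3},
               "female": {"tall": 0.3, "short": 0.7}}
p_L_given_G = {"male": {"long": 0.2, "short": 0.8},
               "female": {"long": 0.8, "short": 0.2}}

# Joint p(H, L) = sum over G of p(H|G) p(L|G) p(G)
def p_HL(h, l):
    return sum(p_H_given_G[g][h] * p_L_given_G[g][l] * p_G[g] for g in p_G)

# Marginals p(H) and p(L)
def p_H(h):
    return sum(p_H_given_G[g][h] * p_G[g] for g in p_G)

def p_L(l):
    return sum(p_L_given_G[g][l] * p_G[g] for g in p_G)

# p(H=tall, L=long) = 0.7*0.2*0.5 + 0.3*0.8*0.5 = 0.19,
# but p(H=tall) * p(L=long) = 0.5 * 0.5 = 0.25, so H and L are NOT independent.
print(p_HL("tall", "long"), p_H("tall") * p_L("long"))
```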
Components of a Bayesian Network
• Nodes = random variables
• Directed edges (arrows) = dependencies among variables
• Parent: a variable's "parents" are the variables upon which it is dependent, e.g., G is the parent of both H and L (H and L are "sister nodes"; H and L are the "daughters" of G).
Components of a Bayesian Network
• Probability distributions: one per variable
  • Number of columns = number of different values the variable can take
  • Number of rows = number of different values the variable's parents can take
Inference in a Bayesian Network
• Problem: suppose we know that a person is tall. Estimate the probability that he or she has short hair.
• p(L|H) = p(L,H) / Σ_L p(L,H)
• p(L,H) = Σ_G p(G,H,L) = Σ_G p(L|G) p(G) p(H|G)
• In the figures, an unobserved ("hidden") variable is drawn as a circle; an observed variable is drawn as a square.
A More Interesting Inference Problem
• Does knowing a person's height tell you how much shampoo he or she uses?
• S = amount of shampoo; depends on hair length
• p(L,H,G,S) = p(S|L) p(L|G) p(H|G) p(G)
A More Interesting Inference Problem
• Suppose we observe that H = "tall". Then: p(S="a lot", H="tall") = Σ_L p(S="a lot" | L) Σ_G p(L|G) p(G) p(H="tall"|G)
• The reason for Bayesian networks: modularity of the graph ⇒ modularity of computation (sum over G, independent of S; then, for each S, sum over L)
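A minimal sketch of this modular computation, with a made-up table for p(S|L) added to the hypothetical tables above (re-declared here so the snippet runs on its own): the sum over G is computed once, since it does not depend on S, and only the outer sum over L is repeated for each value of S.

```python
# Hypothetical tables again (illustrative numbers only).
p_G = {"male": 0.5, "female": 0.5}
p_H_given_G = {"male": {"tall": 0.7, "short": 0.3}, "female": {"tall": 0.3, "short": 0.7}}
p_L_given_G = {"male": {"long": 0.2, "short": 0.8}, "female": {"long": 0.8, "short": 0.2}}
p_S_given_L = {"long": {"a lot": 0.9, "a little": 0.1}, "short": {"a lot": 0.2, "a little": 0.8}}

# Step 1: sum over G once; this intermediate message depends on L but not on S.
msg_L = {l: sum(p_L_given_G[g][l] * p_G[g] * p_H_given_G[g]["tall"] for g in p_G)
         for l in ("long", "short")}

# Step 2: for each value of S, sum over L.
p_S_and_tall = {s: sum(p_S_given_L[l][s] * msg_L[l] for l in msg_L)
                for s in ("a lot", "a little")}

print(p_S_and_tall["a lot"])  # p(S="a lot", H="tall")
```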
Definition: Graph
• Node: any unique identifier (letter, integer, phoneme, word)
• Node List: a set of nodes, without repetition
  • {a,c,b,e,d,f} is a valid Node List
  • {a,c,c,b,e,d,f,f} is not
• Edge: a unique identifier, linked to a pair of nodes
• Edge List: a set of edges, without repetition
  • { 1:ac, 2:cb, 3:cd, 4:ce, 5:df, 6:ef }
• Graph: a Node List and an Edge List, such that all edges connect nodes selected from the Node List
Directed Graph
• Directed graph: a graph in which each edge is directed, i.e., the order of the node pair is important
• Example:
  • Nodes = {a,b,c,d,e,f}
  • Edges = { 1:ac, 2:bc, 3:cd, 4:ce, 5:df, 6:ef }
• Parent node/mother node: the node listed first on an edge
  • {a, b, c, c, d, e} are the parents of {c, c, d, e, f, f}
• Daughter node: the node listed second on an edge
Acyclic Graph
• Acyclic graph: a graph in which there is only one path from any node to any other node
• It is always possible to turn a cyclic graph into an acyclic graph by deleting edges
Tree
• A tree is a directed, acyclic graph, with the following extra limitation:
  • No node has more than one parent
• As a result, there is a unique root node, and there is a unique directed path from the root node to any other node
Descendants and Non-descendants in a Tree
• [Figure: a tree with node B highlighted; the subtree below B contains the descendants of B, and all other nodes are the non-descendants of B]
Graph Semantics
• The "semantics" of a graph is the set of meanings applied to its nodes and edges
• Example: finite state diagram
  • Each NODE represents a state that the system can occupy for one period of time
  • Each EDGE represents a possible state transition
• Example: Markov state diagram
  • Same as a regular finite state diagram, but
  • There can be more than one edge leaving a node
  • When multiple edges leave a node, each has an associated probability
• [Figure: three-state Markov diagram, states 1, 2, 3, with transition probabilities 0.7, 0.3, 0.9, 0.1, 1.0]
Bayesian Network
• A "Bayesian network" is a particular SEMANTICS for any directed graph
• Each NODE is a random variable
• The probability distribution for each node depends only on its parents
• Example: p(a,b,c,d,e,f) = p(f|d,e) p(d|c) p(e|c) p(c|a,b) p(a) p(b)
Factored Probabilities
• Computations using a graph are simplified because the probability mass function (PMF) is factored
• Factorization of the PMF is shown by the graph
• p(a,b,c,d,e,f) = p(f|d,e) p(d|c) p(e|c) p(c|a,b) p(a) p(b)
Belief Propagation
• Belief propagation:
  • Given knowledge about some of the variables in the graph,
  • Compute the posterior probabilities of all other variables
• Because the PMF is factored, belief propagation is entirely local
• For example, given p(f|c), we can compute p(a,f) locally: p(a,f) = Σ_c Σ_b p(f|c) p(c|a,b) p(b) p(a)
Belief Propagation: Easier in Trees
• In general, belief propagation in an arbitrary graph is difficult (wait until next lecture), but belief propagation in a tree can use an efficient algorithm called the "sum-product algorithm."
The Sum-Product Algorithm
• Dv = the set of all observed descendants of node v
• Nv = the set of all observed nondescendants of node v
• Propagate up: for each variable v,
  • For each daughter di, compute the sum: p(Ddi | v) = Σ_di p(di | v) p(Ddi | di)
  • Combine the different daughters d1,…,dN using the product: p(Dv | v) = Π_i p(Ddi | v)
• Propagate down: for each variable v, whose mother is m:
  • Combine the other daughters of m, s1,…,sM, using the product: p(m, Nv) = p(m, Nm) Π_{i: si≠v} p(Dsi | m)
  • Compute the sum: p(v, Nv) = Σ_m p(v | m) p(m, Nv)
• Multiply: p(v | observations) = p(v,Nv) p(Dv|v) / Σ_v p(v,Nv) p(Dv|v)
Example #1: Six-Node Network
• [Figure: tree with root a; a → c; c → b, d, e; d → f, g; the observed nodes b, f, g are drawn as squares]
• Propagating up:
• Nodes g, f, b, and e have no daughters, so we define:
  • p(Dg | g) = 1
  • p(Df | f) = 1
  • p(De | e) = 1
  • p(Db | b) = 1
Example #1: Six-Node Network
• Propagating up, node d:
  • Sums:
    • First daughter: g is observed, so p(Dg | d) = Σ_g p(g | d) p(Dg | g) = p(g | d)
    • Second daughter: f is observed, so p(Df | d) = p(f | d)
  • Product: p(Dd | d) = p(g,f|d) = p(g|d) p(f|d)
Example #1: Six-Node Network
• Propagating up, node c:
  • Sums:
    • First daughter: p(Db | c) = Σ_b p(Db | b) p(b | c) = p(b | c)
    • Second daughter: p(Dd | c) = Σ_d p(d|c) p(Dd | d)
    • Third daughter: p(De | c) = Σ_e p(De | e) p(e | c) = Σ_e p(e|c) = 1
  • Product: p(Dc | c) = p(b,f,g|c) = p(Db|c) p(Dd|c) p(De|c) = p(b|c) p(f,g|c)
Example #1: Six-Node Network
• Propagating up, node a:
  • Sum (one daughter): p(Dc | a) = p(b,f,g|a) = Σ_c p(c|a) p(Dc|c)
  • Product: p(Da | a) = p(Dc | a)
Example #1
• Propagating down:
  • p(a,Na) = p(a)
Example #1
• Propagating down, node c:
  • Product: none; p(a,Nc) = p(a,Na) because c has no sisters
  • Sum (marginalize out the mother): p(c,Nc) = p(c) = Σ_a p(c|a) p(a,Nc)
Example #1
• Propagating down, node d:
  • Product: p(c,Nd) = p(c,Nc) p(Db|c) p(De|c) = p(c) p(b|c)
  • Sum: p(d,Nd) = p(d,b) = Σ_c p(d|c) p(c,b)
Example #1
• Propagating down, node e:
  • Product: p(c,Ne) = p(c,Nc) p(Db|c) p(Dd|c)
  • Sum: p(e,Ne) = Σ_c p(e|c) p(c,Ne)
Example #1
• Multiply:
  • p(a|b,f,g) = p(a,Na) p(Da|a) / Σ_a p(a,Na) p(Da|a)
  • p(c|b,f,g) = p(c,Nc) p(Dc|c) / Σ_c p(c,Nc) p(Dc|c)
  • p(d|b,f,g) = p(d,Nd) p(Dd|d) / Σ_d p(d,Nd) p(Dd|d)
  • p(e|b,f,g) = p(e,Ne) p(De|e) / Σ_e p(e,Ne) p(De|e)
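Putting the whole example together, below is a minimal sketch of the sum-product algorithm on the six-node tree of Example #1 (a → c; c → b, d, e; d → f, g, with b, f, g observed). The CPT values, the observed values, and all names (`children`, `cpt`, `evidence`, the function names) are hypothetical choices for illustration, not from the lecture.

```python
# Minimal sum-product sketch on the tree of Example #1, with made-up binary CPTs.
children = {"a": ["c"], "c": ["b", "d", "e"], "d": ["f", "g"],
            "b": [], "e": [], "f": [], "g": []}
parent = {c: p for p, cs in children.items() for c in cs}
values = [0, 1]                      # every variable is binary in this sketch
root = "a"
prior = {0: 0.6, 1: 0.4}             # p(a)
# cpt[v][parent_value][v_value] = p(v = v_value | parent = parent_value)
cpt = {v: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}} for v in parent}
evidence = {"b": 1, "f": 0, "g": 1}  # observed leaves, as in Example #1

def consistent(v):
    """Values of v consistent with the evidence."""
    return [evidence[v]] if v in evidence else values

# Propagate up: up[v][x] = p(Dv | v = x)
up = {}
def propagate_up(v):
    for c in children[v]:
        propagate_up(c)
    up[v] = {x: 1.0 for x in values}
    for x in values:
        for c in children[v]:        # product over daughters of the per-daughter sums
            up[v][x] *= sum(cpt[c][x][y] * up[c][y] for y in consistent(c))

# Propagate down: down[v][x] = p(v = x, Nv); for the root, Nv is empty, so down = prior
down = {root: dict(prior)}
def propagate_down(v):
    for c in children[v]:
        # combine the other daughters of v, then marginalize v out
        msg = {x: down[v][x] for x in values}
        for s in children[v]:
            if s != c:
                for x in values:
                    msg[x] *= sum(cpt[s][x][y] * up[s][y] for y in consistent(s))
        down[c] = {y: sum(cpt[c][x][y] * msg[x] for x in consistent(v)) for y in values}
        propagate_down(c)

propagate_up(root)
propagate_down(root)

# Multiply and normalize: p(v | b, f, g) for the hidden nodes a, c, d, e
for v in children:
    if v in evidence:
        continue
    joint = {x: down[v][x] * up[v][x] for x in values}
    z = sum(joint.values())
    print(v, {x: round(p / z, 3) for x, p in joint.items()})
```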
Example #2: Hidden Markov Model
• [Figure: chain of hidden states q1 → … → qT; each observation xt is a daughter of qt]
• qt = hidden state variable at time t, 1 ≤ t ≤ T
  • "N states": 1 ≤ qt ≤ N
• xt = discrete observation at time t, 1 ≤ t ≤ T
  • xt quantized to K different levels or codes: 1 ≤ xt ≤ K
• The "speech recognition problem":
  • Several different "word models" are available
  • "Word model" ≡ parameter values for p(qt|qt-1) and p(xt|qt)
  • For each word model, compute the likelihood p(x1,…,xT)
  • The "correct" word is the one with maximum p(x1,…,xT)
Example #2: Hidden Markov Model
• Propagate up (the "backward algorithm"):
  • p(DqT | qT) = p(xT|qT)
  • …
  • p(Dqt | qt):
    • Sum: p(Dqt+1 | qt) = Σ_qt+1 p(qt+1 | qt) p(Dqt+1 | qt+1)
    • Product: p(Dqt | qt) = p(xt | qt) p(Dqt+1 | qt)
  • …
Example #2: Hidden Markov Model
• Propagate down (the "forward algorithm"):
  • p(q1,Nq1) = p(q1)
  • …
  • p(qt+1,Nqt+1):
    • Product: p(qt,Nqt+1) = p(xt | qt) p(qt, Nqt)
    • Sum: p(qt+1,Nqt+1) = Σ_qt p(qt+1|qt) p(qt,Nqt+1)
  • …
Example #2: Hidden Markov Model
• Multiply: p(qT|x1,…,xT) = p(qT,NqT) p(DqT | qT) / Σ_qT p(qT,NqT) p(DqT | qT)
• The speech recognition problem: p(x1,…,xT) = Σ_qT p(qT,NqT) p(DqT | qT)
• Small-vocabulary isolated word recognition: the operations above are repeated for every word model. Choose the word model with maximum p(x1,…,xT).
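As a concrete companion to Example #2, here is a minimal sketch of the forward ("propagate down") and backward ("propagate up") passes for a discrete-observation HMM. The transition matrix `A`, emission matrix `B`, initial distribution `pi`, and the observation sequence are all hypothetical; note that `beta` below follows the common textbook convention p(xt+1,…,xT | qt), which differs from the slide's p(Dqt | qt) by the factor p(xt | qt).

```python
import numpy as np

# Hypothetical parameters: N = 3 states, K = 4 discrete observation symbols.
pi = np.array([1.0, 0.0, 0.0])                      # p(q1)
A = np.array([[0.7, 0.3, 0.0],                      # A[i, j] = p(q_{t+1} = j | q_t = i)
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
B = np.array([[0.6, 0.2, 0.1, 0.1],                 # B[i, k] = p(x_t = k | q_t = i)
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.2, 0.6]])

def forward_likelihood(x, pi, A, B):
    """Forward algorithm: returns p(x_1, ..., x_T) for one word model."""
    alpha = pi * B[:, x[0]]                          # alpha_1(i) = p(q_1 = i, x_1)
    for k in x[1:]:
        alpha = (alpha @ A) * B[:, k]                # alpha_{t+1} = (alpha_t A) * b(x_{t+1})
    return float(alpha.sum())

def backward_messages(x, A, B):
    """Backward algorithm: beta_t(i) = p(x_{t+1}, ..., x_T | q_t = i)."""
    T, N = len(x), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

# Isolated-word recognition would score the same observation sequence under
# each word model (pi, A, B) and pick the word with the largest likelihood.
x = [0, 1, 1, 3, 2]                                  # hypothetical quantized observations
beta = backward_messages(x, A, B)
print(forward_likelihood(x, pi, A, B))
print(float((pi * B[:, x[0]] * beta[0]).sum()))      # same likelihood via backward messages
```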
Three More Useful Concepts
• Dynamic Bayesian Network
  • Equal to a Bayesian network with a periodically repeating central portion
• Max-Product Algorithm
  • An approximate form of inference, used to find the hidden variable sequence q1,…,qM that maximizes p(q1,…,qM,x1,…,xN)
• Continuous-Valued Random Variables
  • Continuous-valued observations: PDF replaces PMF; no effect on the sum-product algorithm; works well
  • Continuous-valued hidden variables: integral replaces sum in the sum-product algorithm; often incomputable
Hidden Markov Model is an Example of a Dynamic Bayesian Network
• A "dynamic" Bayesian network is a Bayesian network with 3 parts:
  • The initial part, corresponding to the first frame
    • q1 and x1
    • PMF: q1 has no parents; the parent of x1 is q1
  • The periodic part, which is duplicated T−2 times in order to match a speech signal with T frames
    • qt and xt
    • PMF: the parent of qt is qt-1; the parent of xt is qt
  • The final part, corresponding to the last frame
    • qT and xT
    • In an HMM, the final part is the same as the periodic part, but that's not true for all DBNs
The Max-Product Algorithm
• Let the hidden variables in any Bayesian network (dynamic or non-dynamic) be called q1,…,qM
• Let the observed variables be called x1,…,xN
• The max-product algorithm finds (q1*,…,qM*) = argmax p(q1,…,qM,x1,…,xN)
• Algorithm detail: exactly like the sum-product algorithm, except that every sum is replaced by a maximum
• Common use: continuous speech recognition
  • 1 ≤ qt ≤ N, where N = Nw·Lw, Nw = number of words, Lw = length of each word
  • Word wi (0 ≤ i ≤ Nw−1) is the sequence of states i·Lw+1 ≤ qt ≤ (i+1)·Lw
  • The max-product algorithm computes the optimum word sequence (w1*,…,wK*)
The Max-Product Algorithm
• Dv* = the set of all OPTIMUM descendants of node v
  • Observed descendants: set to their observed values
  • Hidden/unobserved descendants: set to their optimum values, d*
• Nv* = the set of all OPTIMUM nondescendants of node v
  • (Nv−m)* = an optimized set including all nondescendants except variable m
• Propagate up: for each hidden variable v,
  • For each daughter di, compute the max: p(di*, Ddi* | v) = max_di p(di | v) p(Ddi* | di)
  • Combine the different daughters d1,…,dN using the product: p(Dv* | v) = Π_i p(di*, Ddi* | v)
• Propagate down: for each variable v, whose mother is m:
  • Combine the other daughters of m, s1,…,sM, using the product: p(m, (Nv−m)*) = p(m, Nm*) Π_{i: si≠v} p(si*, Dsi* | m)
  • Compute the max (choose the optimum value m*): p(v, Nv*) = max_m p(v | m) p(m, (Nv−m)*)
Max-Product Termination
• The maximum alignment probability is max p(q1,…,qM,x1,…,xN) = max_v p(v,Nv*) p(Dv* | v)
• The optimum hidden variable sequence is the set of all optimum hidden nondescendants of v (stored in the pointer Nv*), plus the set of all optimum hidden descendants of v (stored in the pointer Dv*), plus the optimum value of v itself: (q1*,…,qM*) = argmax p(v,Nv*) p(Dv* | v)
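Here is a minimal max-product (Viterbi) sketch for the HMM chain, with the same style of hypothetical parameters as in the forward-algorithm sketch above: every sum of the forward pass becomes a max, and back-pointers recover the optimum state sequence (q1*,…,qT*).

```python
import numpy as np

# Hypothetical word-model parameters (illustrative numbers only).
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
B = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.2, 0.6]])

def viterbi(x, pi, A, B):
    """Max-product on the HMM chain: returns (max alignment probability, best path)."""
    T, N = len(x), A.shape[0]
    delta = np.zeros((T, N))           # delta_t(j) = max over q_1..q_{t-1} of p(q_1..q_t=j, x_1..x_t)
    psi = np.zeros((T, N), dtype=int)  # back-pointers
    delta[0] = pi * B[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A      # scores[i, j] = delta_{t-1}(i) p(q_t=j | q_{t-1}=i)
        psi[t] = scores.argmax(axis=0)          # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, x[t]]
    # Termination: best final state, then trace the back-pointers
    q = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        q.append(int(psi[t][q[-1]]))
    return float(delta[-1].max()), q[::-1]

x = [0, 1, 1, 3, 2]                             # hypothetical quantized observations
print(viterbi(x, pi, A, B))
```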
Continuous-Valued Variables
• Either the observed variables or the hidden variables can be continuous-valued.
• A continuous-valued variable has a probability density function (PDF) instead of a probability mass function (PMF).
• Observed variables:
  • Discrete: variable xi depends on some hidden variables qj, qk, so its PMF is given by the table p(xi|qj,qk)
  • Continuous: the PDF is written the same way, but now it refers to some function of a continuous-valued variable xi, for example p(xi|qj,qk) = exp(−(xi−μjk)²/2σjk²) / (σjk√(2π))
  • Computation of the sum-product algorithm proceeds without change
• Hidden variables:
  • Now, instead of the sum in the sum-product algorithm, it is necessary to use an integral.
  • Most such integrals have no analytic solution, and must be approximated numerically.
  • The only analytically available solution is the Kalman filter, for the case when ALL hidden variables are continuous with Gaussian distributions.
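A minimal sketch of the continuous-observation case: the discrete emission table is replaced by a Gaussian PDF evaluated at each xt, and the forward recursion is otherwise unchanged. The per-state means, variances, and the observation sequence below are made-up illustrations.

```python
import math

# Hypothetical per-state Gaussian emission parameters and HMM transitions.
mu = [0.0, 2.0, 5.0]         # per-state means
sigma = [1.0, 1.0, 1.5]      # per-state standard deviations
pi = [1.0, 0.0, 0.0]
A = [[0.7, 0.3, 0.0],
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]]

def gauss(x, m, s):
    """p(x | q=i) = exp(-(x - mu_i)^2 / (2 sigma_i^2)) / (sigma_i sqrt(2 pi))."""
    return math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

def forward_likelihood_continuous(x_seq):
    """Same forward recursion as before; only the emission term changes."""
    N = len(mu)
    alpha = [pi[i] * gauss(x_seq[0], mu[i], sigma[i]) for i in range(N)]
    for x in x_seq[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * gauss(x, mu[j], sigma[j])
                 for j in range(N)]
    return sum(alpha)

print(forward_likelihood_continuous([0.1, 1.8, 2.2, 4.9, 5.3]))
```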
Example #3: Audiovisual Speech Recognition (AVSR)
• [Figure: two-stream DBN with audio phoneme states q1,…,qT forming a Markov chain; each viseme state vt is a daughter of qt; audio spectral observations x1,…,xT depend on the qt, and video observations y1,…,yT depend on the vt]
• A viseme is a group of phonemes with identical lip and tongue blade features, e.g., vt = BILABIAL may correspond to qt in {p, b, m}.
• The mapping from phonemes to visemes may be probabilistic, e.g., /o/ may not always look ROUNDED_HIGH, thus p(vt|qt) is a PMF.
• xt and yt may be discrete or continuous.
AVSR: Max-Product Algorithm
• Propagate up, maximization step, for the three daughters of qt:
  • p(xt | qt)
  • p(qt+1*, Dqt+1* | qt) = max_qt+1 p(Dqt+1* | qt+1) p(qt+1 | qt)
  • p(vt*, Dvt* | qt) = max_vt p(yt | vt) p(vt | qt)
AVSR: Max-Product Algorithm
• Propagate up, product step: multiply together the three daughters of qt:
  • p(Dqt* | qt) = p(vt*, Dvt* | qt) p(qt+1*, Dqt+1* | qt) p(xt | qt)
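A minimal sketch of the per-frame part of this computation, with made-up tables: for each phoneme state qt, the audio likelihood p(xt | qt) is multiplied by the best viseme explanation max_v p(yt | v) p(v | qt). The phoneme and viseme inventories, table values, and function names below are hypothetical illustrations; the recursion over t would then proceed exactly as in the Viterbi sketch above.

```python
# Hypothetical phoneme and viseme inventories.
phonemes = ["p", "b", "m", "o"]
visemes = ["BILABIAL", "ROUNDED_HIGH"]

# p(v | q): probabilistic phoneme-to-viseme mapping (illustrative numbers).
p_v_given_q = {"p": {"BILABIAL": 1.0, "ROUNDED_HIGH": 0.0},
               "b": {"BILABIAL": 1.0, "ROUNDED_HIGH": 0.0},
               "m": {"BILABIAL": 1.0, "ROUNDED_HIGH": 0.0},
               "o": {"BILABIAL": 0.1, "ROUNDED_HIGH": 0.9}}

def frame_score(q, p_x_given_q, p_y_given_v):
    """p(x_t | q_t) * max_v p(y_t | v_t) p(v_t | q_t) for one frame."""
    best_visual = max(p_y_given_v[v] * p_v_given_q[q][v] for v in visemes)
    return p_x_given_q[q] * best_visual

# Hypothetical per-frame likelihoods from the audio and video front ends.
p_x_given_q = {"p": 0.2, "b": 0.5, "m": 0.1, "o": 0.05}   # audio: p(x_t | q_t)
p_y_given_v = {"BILABIAL": 0.7, "ROUNDED_HIGH": 0.1}      # video: p(y_t | v_t)

for q in phonemes:
    print(q, frame_score(q, p_x_given_q, p_y_given_v))
```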