Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson (jhasegaw@uiuc.edu)
University of Illinois at Urbana-Champaign, USA
Lecture 9. Learning in Bayesian Networks
• Learning via Global Optimization of a Criterion
• Maximum-likelihood learning
  • The Expectation-Maximization (EM) algorithm
  • Solution for discrete variables using Lagrange multipliers
  • General solution for continuous variables
  • Example: Gaussian PDF
  • Example: Mixture Gaussian
  • Example: Bourlard-Morgan NN-DBN Hybrid
  • Example: BDFK NN-DBN Hybrid
• Discriminative learning criteria
  • Maximum Mutual Information
  • Minimum Classification Error
What is Learning?
Imagine that you are a student who needs to learn how to propagate belief in a junction tree.
• Level 1 Learning (Rule-Based): I tell you the algorithm. You memorize it.
• Level 2 Learning (Category Formation): You observe examples (FHMM). You memorize them. From the examples, you build a cognitive model of each of the steps (moralization, triangulation, cliques, sum-product).
• Level 3 Learning (Performance): You try a few problems. When you fail, you optimize your understanding of all components of the cognitive model in order to minimize the probability of future failures.
What is Machine Learning?
• Level 1 Learning (Rule-Based): The programmer tells the computer how to behave. This is not usually called "machine learning."
• Level 2 Learning (Category Formation): The program is given a numerical model of each category (e.g., a PDF, or a geometric model). Parameters of the numerical model are adjusted in order to represent the category.
• Level 3 Learning (Performance): All parameters in a complex system are simultaneously adjusted in order to optimize a global performance metric.
Maximum Likelihood Learning in a Dynamic Bayesian Network
• Given: a particular model structure.
• Given: a set of training examples for that model, $(b_m, o_m)$, $1 \le m \le M$.
• Estimate all model parameters ($p_\lambda(b \mid a)$, $p_\lambda(c \mid a)$, ...) in order to maximize $\sum_m \log p(b_m, o_m \mid \lambda)$.
• Recognition is nested within training: at each step of the training algorithm, we need to compute $p(b_m, o_m, a_m, \dots, q_m)$ for every training token, using the sum-product algorithm.
[Figure: example DBN with variables a, b, c, d, e, f, n, o, q]
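For concreteness, the criterion on this slide can be written out as follows (a sketch in my own notation: $\lambda$ is the full parameter set and $h_m$ collects the hidden variables of token $m$; none of these symbols come from the original slide):

\[
\hat\lambda = \arg\max_\lambda \sum_{m=1}^{M} \log p(b_m, o_m \mid \lambda),
\qquad
p(b_m, o_m \mid \lambda) = \sum_{h_m} \prod_{v} p_\lambda\big(v \mid \mathrm{pa}(v)\big),
\]

where the product runs over all variables of the network and the sum over $h_m$ (e.g., $a_m, \dots, q_m$) is exactly the quantity delivered by the sum-product algorithm on the junction tree.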
EM for a Discrete-Variable Bayesian Network
[Figure: re-estimation equations and example DBN with variables a, b, c, d, e, f, n, o, q]
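The re-estimation equations for this slide live in the original figure; as a rough sketch of what they compute, one EM iteration for a single discrete conditional probability table can be written as below. All names (em_update_discrete_cpt, posterior_marginal, and so on) are my own placeholders, not code from the course.

from collections import defaultdict

def em_update_discrete_cpt(tokens, child_values, parent_configs, posterior_marginal):
    """One EM re-estimation of a discrete CPT p(child | parents).

    tokens            : iterable of training tokens (b_m, o_m).
    child_values      : list of possible values j of the child variable.
    parent_configs    : list of possible joint values i of the parent variables.
    posterior_marginal: callable (token, j, i) -> p(child=j, parents=i | token, lambda_old),
                        the E-step posterior, assumed to come from the
                        sum-product / junction-tree algorithm.
    """
    # E-step: accumulate expected counts E[#(child=j, parents=i)] over all tokens.
    counts = defaultdict(float)
    for token in tokens:
        for i in parent_configs:
            for j in child_values:
                counts[(j, i)] += posterior_marginal(token, j, i)

    # M-step: the Lagrange-multiplier solution is simply "normalize the expected
    # counts over the child value j for each parent configuration i".
    cpt = {}
    for i in parent_configs:
        total = sum(counts[(j, i)] for j in child_values)
        for j in child_values:
            cpt[(j, i)] = counts[(j, i)] / total if total > 0 else 1.0 / len(child_values)
    return cpt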
EM for Continuous Observations (Liporace, IEEE Trans. Inf. Th., 1982)
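The continuous-observation re-estimates this slide refers to have the following standard form for a Gaussian PDF (my notation: $\gamma_m(t,q)$ is the E-step posterior of state $q$ at frame $t$ of token $m$, and $o_{m,t}$ is the observation vector); this is a sketch of the general result, not a transcription of the slide's equations:

\[
\hat\mu_q = \frac{\sum_m \sum_t \gamma_m(t,q)\, o_{m,t}}{\sum_m \sum_t \gamma_m(t,q)},
\qquad
\hat\Sigma_q = \frac{\sum_m \sum_t \gamma_m(t,q)\,\big(o_{m,t}-\hat\mu_q\big)\big(o_{m,t}-\hat\mu_q\big)^{\mathsf T}}{\sum_m \sum_t \gamma_m(t,q)}.
\]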
Example: Mixture Gaussian (Juang, Levinson, and Sondhi, IEEE Trans. Inf. Th., 1986)
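For the mixture-Gaussian case, the same pattern holds once the state posterior is split across mixture components (again my notation, assuming component weights $c_{qk}$, means $\mu_{qk}$, and covariances $\Sigma_{qk}$):

\[
\gamma_m(t,q,k) = \gamma_m(t,q)\,
\frac{c_{qk}\,\mathcal N(o_{m,t};\mu_{qk},\Sigma_{qk})}
     {\sum_{k'} c_{qk'}\,\mathcal N(o_{m,t};\mu_{qk'},\Sigma_{qk'})},
\qquad
\hat c_{qk} = \frac{\sum_{m,t}\gamma_m(t,q,k)}{\sum_{m,t}\gamma_m(t,q)},
\]

with $\hat\mu_{qk}$ and $\hat\Sigma_{qk}$ computed as on the previous slide, but weighted by $\gamma_m(t,q,k)$.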
Example: Bourlard-Morgan Hybrid (Morgan and Bourlard, IEEE Sign. Proc. Magazine, 1995)
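The defining step of the Bourlard-Morgan hybrid is to let an MLP estimate state posteriors $P(q \mid o_t)$ and divide by the state priors $P(q)$ to obtain scaled likelihoods for the HMM/DBN decoder. A minimal Python sketch of that conversion (the network and the priors are assumed to be given; the function name is mine):

import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert MLP state posteriors into scaled observation scores.

    log_posteriors: (T, Q) array of log P(q | o_t) from the network.
    log_priors:     (Q,)   array of log P(q), estimated from the state-level
                    alignment of the training data.
    Returns log[ P(q | o_t) / P(q) ] = log p(o_t | q) - log p(o_t), which the
    decoder can use in place of log observation densities.
    """
    return np.asarray(log_posteriors) - np.asarray(log_priors)[np.newaxis, :]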
Example: BDFK Hybrid (Bengio, De Mori, Flammia, and Kompe, Spe. Comm. 1992)
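As a schematic of the BDFK idea (notation mine, and only a sketch): the network computes the observation vectors $y_t = f_\theta(x_t)$ that the continuous-density HMM/DBN then models, and the network weights $\theta$ are trained on the same global likelihood as the HMM parameters by the chain rule,

\[
\frac{\partial}{\partial\theta} \log p(y_1,\dots,y_T \mid \lambda)
= \sum_t \frac{\partial \log p(y_1,\dots,y_T \mid \lambda)}{\partial y_t}\,
         \frac{\partial f_\theta(x_t)}{\partial \theta},
\]

so the error-backpropagation pass starts from the derivative of the DBN likelihood with respect to each observation vector.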
MMI for Databases with Different Kinds of Transcription
• If every word's start and end times are labeled, then $W_T$ is the true word label, and $W^*$ is the label of the false word (or words!) with maximum modeled probability.
• If the start and end times of individual words are not known, then $W_T$ is the true word sequence, and $W^*$ may be computed as the best path (or paths) through a word lattice or N-best list.
(Schlüter, Macherey, Müller, and Ney, Spe. Comm. 2001)
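Written out (my notation; $O_m$ is the $m$-th training utterance and $W_T^{(m)}$ its transcription), the MMI criterion being maximized is

\[
F_{\mathrm{MMI}}(\lambda) = \sum_m \log
\frac{p_\lambda\!\big(O_m \mid W_T^{(m)}\big)\, P\!\big(W_T^{(m)}\big)}
     {\sum_{W} p_\lambda(O_m \mid W)\, P(W)},
\]

where the denominator sum is approximated by the competing hypothesis (or hypotheses) $W^*$ described above, taken from all words, a lattice, or an N-best list depending on the transcription available.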
Minimum Classification Error (McDermott and Katagiri, Comput. Speech Lang. 1994)
• Define the empirical risk as "the number of word tokens for which the wrong HMM has a higher log-likelihood than the right HMM."
• This risk definition has two nonlinearities:
  • The zero-one loss function u(x). Replace it with a differentiable loss function, s(x).
  • The max operator. Replace it with a "softmax" (log-sum-exp) function, log(exp(a)+exp(b)+exp(c)).
• Differentiate the result; train all HMM parameters using error backpropagation.
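In symbols (my notation, following the general MCE recipe rather than the exact equations of the paper): with $g_j(O;\lambda)$ the log-likelihood of the $j$-th word model, define

\[
d(O) = -\,g_{\text{correct}}(O;\lambda)
       + \log \sum_{j \neq \text{correct}} \exp\!\big(g_j(O;\lambda)\big),
\qquad
\ell(d) = \frac{1}{1 + e^{-\alpha d}},
\]

and minimize $\sum_m \ell\big(d(O_m)\big)$ by gradient descent; $\ell$ is the differentiable replacement for the zero-one loss $u(x)$, and the log-sum-exp is the softened max.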
Summary
• What is Machine Learning?
  • Choose an optimality criterion.
  • Find an algorithm that adjusts the model parameters to optimize that criterion.
• Maximum Likelihood
  • Baum's theorem: increasing the EM auxiliary function E[log p] never decreases p.
  • Apply directly to discrete, Gaussian, and mixture-Gaussian models.
  • Nest within error backpropagation (EBP) for the Bourlard-Morgan and BDFK hybrids.
• Discriminative Criteria
  • Maximum Mutual Information (MMI)
  • Minimum Classification Error (MCE)