Hidden Markov Models

Hidden Markov Models 張智星 jang@mirlab.org http://mirlab.org/jang 多媒體資訊檢索實驗室台灣大學資訊工程系

Hidden Markov Models (HMM) • 目標 • 以統計的方式來建立每個類別的（動態）機率模型 • 此種模型特別適用於長度不固定的輸入向量 • 特性 • 常用於語者無關的大字彙語音辨識系統 • 分類 • DHMM: Discrete HMM • CHMM: Continuous HMM

HMM 模型的基本概念 • 數字「九」的模型： 0.7 0.8 1.0 0.3 0.2 s2 s3 s1 ㄐㄧㄡ Frames

Basic Concept of HMM • HMM for the pronunciation “four” 0.9 0.6 0.8 0.9 1.0 sil f ao r sil 0.2 0.1 0.4 0.1 Frames:

HMM 相關參數 • 數字「九」的 HMM 模型的相關參數 • Transition Probability A: A(i,j)是從state i 跳到 state j 的機率 • State Probability B: B(i,j)是 frame i 隸屬於 state j 的機率 0.7 0.8 1.0 0.3 0.2 s2 s3 s1 ㄐㄧㄡ

Parameters of an HMM • Parmeters for HMM of “four” • Transition Probability A: A(i,j) is the probability of moving from states i to j • State Probability B: B(k,j) is the probability that token k comes from state j 0.9 0.6 0.8 0.9 1.0 sil f ao r sil 0.2 0.1 0.6 0.1 2 3 4 5 1

HMM for Digit Recognition • Tasks: Recognition of the utterances of “0”,“1”, “2”, …, “9”. • Approach • Construct HMMs for “0”, “1”, …, “9” • Apply Viterbi Decoding to find the probability of an utterance with respect to each of the HMMs • HMM with the highest probability is the predicted digit

利用 HMM 進行辨識 • 以中文數字辨識為例 • 建立 0, 1, 2,…,9 各個數字的語音 HMM 模型（建立方法將在後面說明）。 • 利用 Viterbi Decoding 的方法，計算輸入語句和每個模型的相似度，機率最高者，即為辨識結果。 • 每一個音框的特徵向量都是39維的 MFCC。

Viterbi Decoding • 目標 • 計算輸入語句和每個模型的相似度 • 找出如何分配音框至各個狀態，才能得到最大的機率值 0.7 0.8 1.0 0.3 0.2 s2 s3 s1 ㄐㄧㄡ Frames

Viterbi Decoding • Compute the prob. of an utterance w.r.t. each HMM. • Find the best way of distributing frames into states such that the overall probability is maximized  DP 0.7 0.8 1.0 0.3 0.2 f ao r Frames:

Viterbi Decoding • 方法：Dynamic programming j ㄡ States ㄧ i ㄐ Frames 遞迴算式（based on log prob.）：路徑限制：

Viterbi Decoding • Method: Dynamic programming j r States ao i f Frames Recurrence (based on log prob.): Local constraint:

Viterbi Decoding • 有關於 B(k,j)的定義 • 在 DHMM，B(k,j)是由一個矩陣所定義，其中音框i必須先代換成相對應的群聚 k = O(i) ，然後在經由查表得到音框i屬於狀態j的機率B(O(i),j)。 • 在 CHMM，音框i屬於狀態j的機率B(O(i),j)是由一個機率密度函數所定義，可寫成 pj(xi)，其中xi是第i個frame的特徵向量， pj(‧)則是第j個state的GMM函數。

Parameter Estimation • 如何經由大量語料來估測每個HMM模型的最佳參數值 A 和 B 呢？ • 何謂最佳值：對某一個特定模型而言，最佳參數值 A 和 B 應能使語料產生最大的機率（Log prob.）總和 • 方法：Re-estimation • 非常類似 batch k-means (Forgy’s method)的方法，先猜 A 和 B 的值，再反覆進行 Viterbi decoding，然後再重新估算 A 和 B 的值，如此反覆逼近最佳值。

Parameter Estimation • 對所有frame進行VQ，找出對應的symbol（此即為對應的中心點或codeword） • 先猜一組 A 和 B • 反覆進行下列兩步驟，直到收斂 • Viterbi decoding：利用 Viterbi decoding，找出最佳路徑 • Re-estimation：利用最佳路徑，重新估算 A 和 B

Parameter Re-estimation for A • A for a single optimum path of an utterance: • A(1,1)=3/4, A(1,2)=1/4 • A(2,2)=4/5, A(2,3)=1/5 • A(3,3)=1 • The final A is based on all same-HMM utterances 3 States 2 1 Frames

Parameter Re-estimation for B • B for a single optimum path of an utterance: • State 1: B(1,1)=1/4, B(2,1)=1/4, B(3,1)=2/4 • State 2: B(1,2)=3/5, B(2,2)=2/5 • State 3: B(2,3)=2/6, B(4,3)=4/6 • B(k,j)=B(O(i),j) • Rrame i is related to symbol k via k=O(i) • The final B is based on all same-HMM utterances 3 States 2 1 Frame indice 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Symbols k=O(i)

Why It Works? • 假設我們的目標函數可以寫成 P(A,B,Path)，則上述求參數的方法會讓 P(A,B,Path)逐次遞增，直到收斂，因為： • 在 A, B 固定時，Viterbi decoding 會找出最佳的路徑並使 P(A,B,Path)有最大值 • 在路徑（Path）固定時，Re-estimation 會找出最佳的 A, B 值以使 P(A,B,Path)有最大值  Similar to k-means clustering!

Why It Works? • Assume the overall probability is denoted by P(A,B,Path). Then the above procedure will increase P(A,B,Path) until convergence. • When A, B are fixed, Viterbi decoding will find the optimum paths to maximize P(A,B,Path) • When Path is fixed, the re-estimation will find the optimum A and B to maximize P(A,B,Path)

Why Re-estimation of A Works? • 對任一個 state 而言，假設 • p: self-transition prob.（未知） • q: next-transition prob.（未知） • a: self-transition count（已知） • b: next-transition count（已知） • 則我們可以形成下列最佳化問題： • 利用算數平均數大於幾何平均數，我們可得到最佳參數值 p=a/(a+b), q=b/(a+b)

Why Re-estimation of A Works? • For any givestate, assume • p: self-transition prob. (unknown) • q: next-transition prob. (unknown) • a: self-transition count (known) • b: next-transition count (known) • Maximizing P(A, B, Path) is reduced to • By using AM>=GM, we have the optimizing parameters: p=a/(a+b), q=b/(a+b)

Why Re-estimation of B Works? • 對任一個 state 而言，假設 • State Prob: p1, p2, p3, p4（未知） • Symbol count: c1, c2, c3, c4（已知） • 則我們可以形成下列最佳化問題： • 利用算數平均數大於幾何平均數，我們可得到最佳參數值 p1=c1/(c1+c2+c3+c4), p2=c2/(c1+c2+c3+c4) , p3=c3/(c1+c2+c3+c4), p4=c4/(c1+c2+c3+c4)

Why Re-estimation of B Works? • For any given state, assume • State Prob: p1, p2, p3, p4 (unknown) • Symbol count: c1, c2, c3, c4 (known) • Maximizing P(A, B, Path) is reduce to: • By using AM>=GM, we have the optimizing parameters: p1=c1/(c1+c2+c3+c4), p2=c2/(c1+c2+c3+c4) , p3=c3/(c1+c2+c3+c4), p4=c4/(c1+c2+c3+c4)

Initial Values for A andB • Flat start: Use equal-divided paths to estimate A and B initially s2 s3 s1 Utterance 1 Utterance 2 Utterance 3

A 和 B 的起始值 • 以均分的方法來估測 A 和 B 的起始值 s2 s3 s1 第1句「9」第2句「9」第3句「9」

CHMM • 對於 CHMM 而言，B 的 re-estimation 改成使用 EM 的方法來求取 GMM 的最佳參數值，其餘均與 DHMM 相同。

Hidden Markov Models