HUMAN AND SYSTEMS ENGINEERING: Confidence Measures Based on Word Posteriors and Word Graphs Sridhar Raghavan1 and Joseph Picone2 1Graduate research assistant, 2Professor, Human and Systems Engineering, Electrical and Computer Engineering URL: www.isip.msstate.edu/publications/seminars/msstate/2005/confidence/
Abstract • Confidence measures using word posteriors: • There is a strong need to estimate the confidence of a word hypothesis in an LVCSR system, because in conventional Viterbi decoding the objective function minimizes the sentence error rate rather than the word error rate. • A good estimate of this confidence is the word posterior probability. • Word posteriors can be computed from a word graph. • A forward-backward algorithm is used to compute the word posteriors.
• Foundation [Figure: a word lattice plotted as word hypotheses (W) against time] The equation for computing the posterior probability of a word is given below [F. Wessel]: "The posterior probability of a word hypothesis is the sum of the posterior probabilities of all lattice paths of which the word is a part" (Lidia Mangu et al., "Finding consensus in speech recognition: word error minimization and other applications of confusion networks").
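The equation referenced above appeared as an image on the original slide; a reconstruction in the notation of Wessel et al., where $[w; s, e]$ denotes word $w$ hypothesized with start time $s$ and end time $e$ and $x_1^T$ is the acoustic observation sequence, is

$$P([w;s,e] \mid x_1^T) \;=\; \sum_{W:\,[w;s,e]\in W} P(W \mid x_1^T) \;=\; \frac{\sum_{W:\,[w;s,e]\in W} p(x_1^T \mid W)\, P(W)}{p(x_1^T)}$$

i.e., the sum over all lattice paths $W$ that contain the hypothesis, divided by the probability of the acoustics.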
Foundation: continued… We cannot compute the posterior probability directly, so we decompose it into a likelihood and a prior using Bayes' rule. [Figure: a lattice node N with six incoming and two outgoing paths] Since there are six different ways to reach node N and two different ways to leave it, we need both the forward probability and the backward probability to determine the probability of passing through N; this is where the forward-backward algorithm comes into the picture. The numerator is computed with the forward-backward algorithm, and the denominator is simply a by-product of the same computation.
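As a hedged reconstruction of the denominator term mentioned above: it is the total probability of the acoustics summed over every path in the lattice, which the forward-backward passes deliver directly,

$$p(x_1^T) \;=\; \sum_{\text{all paths } W} p(x_1^T \mid W)\, P(W) \;=\; \alpha(\text{last node}) \;=\; \beta(\text{first node}).$$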
Scaling Scaling is used to flatten the posterior distribution so that it is not dominated by the best path [G. Evermann]. Experimentally, an exponent of 1/(language model scale factor) has been found to work well for scaling down the acoustic model score. The acoustic score is scaled by the language model scale factor as follows:
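The scaling relation itself was shown as an image on the slide; a reconstruction, with $\gamma$ denoting the language model scale factor, is

$$\hat{p}_{ac}(x \mid w) \;=\; p_{ac}(x \mid w)^{1/\gamma},$$

so in the log domain the acoustic score is simply divided by $\gamma$ before the forward-backward passes.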
• How to combine word posteriors? • Word posteriors corresponding to the same word can be combined to obtain a better confidence estimate. There are several ways to do this; some of the methods are as follows: • Sum the posteriors of identical words that fall within the same time frame, or choose the maximum posterior value among the identical words in the same time frame [F. Wessel, R. Schlüter, K. Macherey, H. Ney, "Confidence Measures for Large Vocabulary Continuous Speech Recognition"] (see the sketch after this slide). • Build a confusion network, where the entire lattice is mapped onto a single linear graph, i.e., one in which the links pass through all the nodes in the same order. [Figure: the full lattice network ("sil this is a/the quest/guest/test sense/sentence sil") and the corresponding confusion network ("this is the test sentence")] Note: Redundant silence edges can be fused together in the full lattice network before computing the forward-backward probabilities; this saves a lot of computation when the lattice contains many silence edges.
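A minimal sketch of the first combination rule, assuming a simple Link record and an overlap test (the names Link, overlaps, and combined_confidence are illustrative, not the authors' code):

from dataclasses import dataclass

@dataclass
class Link:
    word: str          # word label on the lattice link
    start: int         # start frame
    end: int           # end frame
    posterior: float   # link posterior from the forward-backward pass

def overlaps(a: Link, b: Link) -> bool:
    """True if the two links share at least one time frame."""
    return a.start <= b.end and b.start <= a.end

def combined_confidence(target: Link, links: list[Link], mode: str = "sum") -> float:
    """Sum (or take the max of) the posteriors of same-word, overlapping links."""
    values = [l.posterior for l in links
              if l.word == target.word and overlaps(target, l)]
    return sum(values) if mode == "sum" else max(values)

# Example: two competing "test" hypotheses and one "guest" hypothesis.
links = [Link("test", 50, 80, 0.40), Link("test", 52, 79, 0.25),
         Link("guest", 51, 81, 0.20)]
print(combined_confidence(links[0], links, mode="sum"))  # 0.65
print(combined_confidence(links[0], links, mode="max"))  # 0.40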
Some challenges during posterior rescoring! Word posteriors are not a very good estimate of confidence when the WER on the data is very poor, as described by G. Evermann and P. C. Woodland ["Large Vocabulary Decoding and Confidence Estimation using Word Posterior Probabilities"]. The reason is that the posteriors are overestimated: the words in the lattice are not the full set of possible words, and when the WER is poor the lattice contains many wrong hypotheses. In such cases the depth of the lattice becomes a critical factor in determining how effective the confidence measure is. The paper above describes two techniques to address this problem: 1. a decision-tree based technique, and 2. a neural-network based technique. Different confidence measure techniques are judged on a metric known as normalized cross entropy (NCE).
• How can we compute the word posterior from a word graph? The word posterior probability is computed by considering the word's acoustic score, its language model score, and its position and history in a particular path through the word graph. An example of a word graph is given below; note that the nodes hold the start/stop times and the links hold the word labels, the language model score, and the acoustic score. [Figure: example word graph for the utterance "this is a test sentence", with competing hypotheses such as "quest", "guest", and "sense", and probabilities (1/6, 2/6, ...) on the links]
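A minimal sketch of how such a word graph could be represented in code; the Node, Edge, and WordGraph names and fields are illustrative assumptions, not the authors' data structures:

from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    time: float                     # start/stop time held by the node

@dataclass
class Edge:
    src: int                        # predecessor node id
    dst: int                        # successor node id
    word: str                       # word label carried by the link
    lm_score: float                 # language model probability
    ac_score: float                 # acoustic likelihood

@dataclass
class WordGraph:
    nodes: dict[int, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def incoming(self, node_id: int) -> list[Edge]:
        return [e for e in self.edges if e.dst == node_id]

    def outgoing(self, node_id: int) -> list[Edge]:
        return [e for e in self.edges if e.src == node_id]

The forward and backward passes on the following slides then only need the incoming and outgoing links of each node.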
• Example Let us consider the example shown below. The values on the links are the likelihoods, and some nodes are outlined in red to signify that they occur at the same time. [Figure: the same example word graph, with link probabilities such as 3/6, 2/6, and 4/6 on the links]
Forward-backward algorithm We use a forward-backward type algorithm to determine the link probability. The general equations used to compute the alphas and betas for an HMM can be found in any speech textbook. Computing alphas, Step 1: Initialization. In a conventional HMM forward-backward algorithm we would initialize the alphas as shown below. For a word graph we need a slightly modified version of this equation: the emission probability is replaced by the acoustic score, and in our implementation we simply initialize the first alpha value with a constant. Since we work in the log domain, we assign the first alpha value as 0.
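The initialization equation, shown as an image on the original slide, is the standard HMM forward initialization, with $\pi_i$ the initial state probability and $b_i(o_1)$ the emission probability of the first observation:

$$\alpha_1(i) = \pi_i\, b_i(o_1), \qquad 1 \le i \le N.$$

For the word graph this reduces to setting $\alpha(\text{first node}) = 1$, i.e., 0 in the log domain.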
Forward-backward algorithm continued… The α for the first node is 1. Step 2: Induction. The alpha values computed in the previous step are used to compute the alphas of the succeeding nodes. Note: unlike in HMMs, where we move from left to right at fixed time intervals, here we move from one node to the next based on the node indexes, which are time-aligned.
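The induction equation was also an image on the slide; the standard HMM form is

$$\alpha_t(j) = \Big[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\Big] b_j(o_t),$$

and a word-graph analogue consistent with the numbers worked out on the following slides (each incoming link carries a language model probability $p_{lm}$ and a word/acoustic probability $p_{ac}$, taken as 0.01 in this example) is

$$\alpha(n) = \sum_{\text{links } (m \to n)} \alpha(m)\; p_{lm}(m \to n)\; p_{ac}(m \to n).$$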
Forward-backward algorithm continued… Let us look at the computation of the alphas starting from node 2; the alpha for node 1 was initialized to 1 in the previous step. [Figure: the first few nodes of the word graph, giving α(1) = 1, α(2) = 0.005, α(3) = 0.005025, and α(4) = 1.675E-05; a code sketch of this forward pass follows.] The alpha calculation continues in this manner for all the remaining nodes. The forward-backward calculation on word graphs is similar to the one used for HMMs, but in word graphs the transition matrix is populated by the language model probabilities and the emission probability corresponds to the acoustic score.
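A minimal sketch (not the authors' code) of this forward pass, using a small hypothetical subset of the example lattice whose link probabilities were chosen to reproduce the alpha values quoted above; node ids are assumed to be numbered in topological (time) order, and the word probability of 0.01 matches the 100-word loop grammar assumption:

from collections import defaultdict

# (src, dst, lm_prob, word_prob) -- word labels omitted for brevity
edges = [
    (1, 2, 3/6, 0.01),
    (1, 3, 3/6, 0.01),
    (2, 3, 3/6, 0.01),
    (3, 4, 2/6, 0.01),
]

def forward(edges, start_node, last_node):
    """alpha(n) = sum over incoming links (m -> n) of alpha(m) * lm * word_prob."""
    alpha = defaultdict(float)
    alpha[start_node] = 1.0
    for n in range(start_node + 1, last_node + 1):
        alpha[n] = sum(alpha[m] * lm * wp for (m, d, lm, wp) in edges if d == n)
    return alpha

alpha = forward(edges, start_node=1, last_node=4)
print(alpha[2])   # ≈ 0.005
print(alpha[3])   # ≈ 0.005025
print(alpha[4])   # ≈ 1.675e-05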
Forward-backward algorithm continued… Once the alphas have been computed with the forward algorithm, we begin the beta computation using the backward algorithm. The backward algorithm is similar to the forward algorithm, but we start from the last node and proceed from right to left. Step 1: Initialization. Step 2: Induction.
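The backward equations, also images on the slide, mirror the forward pass. The standard HMM form is

$$\beta_T(i) = 1, \qquad \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j),$$

and the word-graph analogue used in this example (a reconstruction consistent with the beta values on the next slide) is

$$\beta(\text{last node}) = 1, \qquad \beta(m) = \sum_{\text{links } (m \to n)} p_{lm}(m \to n)\; p_{ac}(m \to n)\; \beta(n).$$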
Forward-backward algorithm continued… Let us look at the computation of the beta values from node 14 backwards. [Figure: the last few nodes of the word graph, giving β(15) = 1 (initialization), β(14) = 0.001667, β(13) = 0.00833, β(12) = 5.55E-5, and β(11) = 1.66E-5.]
Forward-backward algorithm continued… Node 11: In a similar manner we obtain the beta values for all nodes back to node 1. The alpha of the last node should equal the beta of the first node (both give the total probability of the lattice). We can now compute the probability of each link (between two nodes). Let us call this link probability Γ. For a link from node m to node n, Γ(m, n) is computed as the product α(m) · a_mn · β(n), where a_mn is the combined language model and acoustic score on the link. These values are the un-normalized posterior probabilities of the word on the link, taking into account all possible paths through that link.
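Putting the pieces together, a reconstruction of the normalized link posterior that is consistent with the values shown on the next slides is

$$\Gamma(m \to n) \;=\; \frac{\alpha(m)\; p_{lm}(m \to n)\; p_{ac}(m \to n)\; \beta(n)}{\alpha(\text{last node})},$$

where the denominator $\alpha(\text{last node}) = \beta(\text{first node})$ is the sum over all paths through the lattice.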
• Word graph showing the computed alphas and betas [Figure: the full example word graph with the alpha and beta values attached to every node; for instance, the first node has α = 1, β = 2.88E-16 and the last node has α = 2.88E-16, β = 1.] The assumption here is that the probability of occurrence of any word is 0.01, i.e., a loop grammar with 100 words.
• Link probabilities calculated from alphas and betas [Figure: the example word graph with the normalized link posterior Γ attached to every link, with values close to 1 (e.g., Γ = 0.996, 0.995) on the best-path links and small values (e.g., Γ = 4.98E-03, 8.93E-03) on the weaker alternatives.] The word graph shows each link with its posterior probability, normalized by the sum over all paths. By choosing the links with the maximum posterior probability, we include the most probable words in the final sequence.
Some alternate approaches… The posteriors are normalized by dividing each value by the sum of the posterior probabilities of all paths in the lattice. This example suffers from the fact that the lattice is not deep enough, so normalization can push the values of some links very close to 1; this phenomenon is explained in the paper by G. Evermann and P. C. Woodland. The paper by F. Wessel et al. ("Confidence Measures for Large Vocabulary Continuous Speech Recognition") describes alternative ways to compute the posterior. The drawback of the approach described above is that the lattice has to be very deep to provide enough links at the same time instant. To overcome this, we can use a soft time margin instead of a hard one, by considering words that overlap to a certain degree. However, the authors note that normalization then no longer works, since the probabilities are not accumulated within the same time frame and hence do not sum to one. They therefore suggest an approach in which the posteriors are computed frame by frame so that normalization remains possible. In the end it was found that the frame-by-frame normalization did not perform significantly better than the overlapping-time-marks approach.
Logarithmic computations: Instead of working with the probabilities directly, we can work with their logarithms so that the multiplications become additions, and we can use the acoustic and language model scores from the word graph as they are. To add two values in the log domain we use the following log trick: log(x + y) = log(x) + log(1 + y/x). [Figure: the example word graph with the log-domain alpha and beta values for each node, e.g., α = 0, β = -3.1438 for the first node and α = -3.1442, β = 0 for the last node.]
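A minimal sketch of the log-add trick in code (the helper name log_add is an assumption for illustration):

import math

def log_add(log_x: float, log_y: float) -> float:
    """Return log(x + y) given log(x) and log(y), using
    log(x + y) = log(x) + log(1 + y/x), with x taken as the larger of the
    two values for numerical stability."""
    if log_x < log_y:                      # make log_x the larger value
        log_x, log_y = log_y, log_x
    return log_x + math.log1p(math.exp(log_y - log_x))

# Example: adding the probabilities 0.005 and 0.000025 in the log domain.
print(math.exp(log_add(math.log(0.005), math.log(0.000025))))  # ≈ 0.005025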
• Logarithmic posterior probabilities [Figure: the example word graph with the log-domain link posteriors, with values close to 0 (e.g., p = -0.0075, -0.0086) on the best-path links and strongly negative values (e.g., p = -4.0224, -4.8978) on the weaker alternatives.] From this lattice we can obtain the best word sequence by picking the word with the highest posterior probability as we traverse from node to node.
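A minimal sketch of this selection rule (a greedy traversal over hypothetical link values, not the authors' decoder): from each node, follow the outgoing link with the highest log posterior.

# Links are (src, dst, word, log_posterior) tuples; the values are hypothetical.
links = [
    (1, 2, "Sil",  -0.0459),
    (2, 3, "this", -0.0075),
    (2, 4, "This", -3.6040),
    (3, 5, "is",   -0.0086),
    (4, 5, "is",   -1.0982),
    (5, 6, "Sil",  -0.0273),
]

def best_sequence(links, start, end):
    """Greedily follow the highest-posterior outgoing link until the end node."""
    words, node = [], start
    while node != end:
        src, dst, word, _ = max((l for l in links if l[0] == node),
                                key=lambda l: l[3])
        words.append(word)
        node = dst
    return words

print(best_sequence(links, start=1, end=6))  # ['Sil', 'this', 'is', 'Sil']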
References: • F. Wessel, R. Schlüter, K. Macherey, and H. Ney, "Confidence Measures for Large Vocabulary Continuous Speech Recognition," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 288-298, March 2001. • F. Wessel, K. Macherey, and R. Schlüter, "Using Word Probabilities as Confidence Measures," Proc. ICASSP '97. • G. Evermann and P. C. Woodland, "Large Vocabulary Decoding and Confidence Estimation using Word Posterior Probabilities," Proc. ICASSP 2000, pp. 2366-2369, Istanbul. • X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, ISBN 0-13-022616-5, 2001. • J. Deller et al., Discrete-Time Processing of Speech Signals, IEEE Press, ISBN 0-7803-5386-2, 2000.