Decoding Techniques for Automatic Speech Recognition
Florian Metze
Interactive Systems Laboratories
Outline
• Decoding in ASR
• Search Problem
• Evaluation Problem
• Viterbi Algorithm
• Tree Search
• Re-Entry
• Recombination
ESSLLI 2002, Trento
The ASR problem: $\hat{W} = \arg\max_W p(W|x)$
• Two major knowledge sources
  • Acoustic Model: $p(x|W)$
  • Language Model: $P(W)$
• Bayes: $p(W|x)\,P(x) = p(x|W)\,P(W)$
• Search problem: $\hat{W} = \arg\max_W p(x|W)\,P(W)$, since $P(x)$ does not depend on $W$
• $p(x|W)$ consists of Hidden Markov Models:
  • Dictionary defines state sequence: "hello" = /hh eh l ow/
  • Full model: concatenation of states (i.e. sounds)
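The decision rule above can be illustrated on a toy scale. The following is a minimal sketch, not the decoder described in these slides: the two candidate word sequences and all probabilities are invented, and the names `acoustic_logp`, `lm_logp` and `decode` are hypothetical. It only shows that the acoustic and language model scores are combined additively in the log domain and that $P(x)$ can be ignored.

```python
import math

# Hypothetical scores for two candidate word sequences W, given one utterance x.
# All numbers are invented for illustration only.
acoustic_logp = {  # log p(x|W)
    "hello world": math.log(1e-5),
    "hollow word": math.log(2e-5),
}
lm_logp = {  # log P(W)
    "hello world": math.log(1e-2),
    "hollow word": math.log(1e-4),
}

def decode(candidates):
    """Return argmax_W [log p(x|W) + log P(W)]; P(x) is constant over W."""
    return max(candidates, key=lambda w: acoustic_logp[w] + lm_logp[w])

best = decode(list(acoustic_logp))
```

In this invented example the acoustically better candidate loses because the language model strongly prefers the other one. A real decoder cannot enumerate all candidates, which is what makes this a search problem.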
Target Function / Measure
• %WER = minimum edit distance between reference and hypothesis (substitutions, deletions, insertions), divided by the number of reference words
• Example:
    REF: the  quick  brown  fox  jumps  *   over
    HYP: *    quick  brown  fox  jump   is  over
    ERR: D                       S      I
• WER = 3/6 = 50% (3 errors, 6 reference words)
• Different measure from max p(W|x)!
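The edit distance behind WER can be computed with a standard dynamic-programming recurrence. The sketch below is a minimal illustration with hypothetical function names (`edit_distance`, `wer`). It assumes the common convention of normalizing the error count by the number of reference words, which may differ from the normalization used for the figure quoted on the slide.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    # d[j] holds the distance between ref[:i] and hyp[:j] for the current row i.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = min(
                d[j] + 1,                                # deletion of ref[i-1]
                d[j - 1] + 1,                            # insertion of hyp[j-1]
                prev_diag + (ref[i - 1] != hyp[j - 1]),  # match or substitution
            )
            prev_diag, d[j] = d[j], cur
    return d[-1]

def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

errors = edit_distance(
    "the quick brown fox jumps over".split(),
    "quick brown fox jump is over".split(),
)
rate = wer("the quick brown fox jumps over", "quick brown fox jump is over")
```

For the example sentence pair this finds one deletion, one substitution and one insertion, i.e. three errors against six reference words.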
A simpler problem: Evaluation
• So far we have:
  • Dictionary: "hello" = /hh eh l ow/ …
  • Acoustic Model: $p_{hh}(x)$, $p_{eh}(x)$, $p_{l}(x)$, $p_{ow}(x)$ …
  • Language Model: P("hello world")
  • State sequence: /hh eh l ow w er l d/
• Given W and x: alignment needed!
[Figure: frames of x to be aligned with the states /hh eh l ow/]
The Viterbi Algorithm
• Beam search from left to right
• The resulting alignment is the best match between the state models $p_s(x)$ and the observations x
[Figure: alignment of x against the states hh, eh, l, ow]
The Viterbi Algorithm (cont'd)
• Evaluation problem: similar to Dynamic Time Warping
• The best alignment for given W, x, and state models $p_s(x)$ is found by locally adding scores (= −log p) for states and transitions
[Figure: alignment of x against the states hh, eh, l, ow]
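The alignment step can be sketched as a small dynamic program. This is a minimal illustration under simplifying assumptions, not the implementation behind these slides: the HMM is a plain left-to-right chain with only self-loops and single forward steps, there is no beam pruning, the alignment must start in the first state and end in the last one, and the cost matrix is invented. The names `viterbi_align`, `stay_cost` and `move_cost` are hypothetical.

```python
def viterbi_align(local_cost, stay_cost=0.0, move_cost=0.0):
    """Align T frames to S states of a left-to-right chain.

    local_cost[t][s] is the score -log p_s(x_t); lower is better.
    Returns (total score, list with one state index per frame).
    """
    T, S = len(local_cost), len(local_cost[0])
    INF = float("inf")
    score = [[INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = local_cost[0][0]  # the path must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s] + stay_cost
            move = score[t - 1][s - 1] + move_cost if s > 0 else INF
            if move < stay:
                score[t][s], back[t][s] = move + local_cost[t][s], s - 1
            else:
                score[t][s], back[t][s] = stay + local_cost[t][s], s
    # Trace back from the last state at the last frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[T - 1][S - 1], path

# Invented costs for 6 frames and the 4 states hh, eh, l, ow:
# cheap (0.1) where a frame matches the intended state, expensive (2.0) elsewhere.
intended = [0, 0, 1, 2, 2, 3]
costs = [[0.1 if s == k else 2.0 for s in range(4)] for k in intended]
total, alignment = viterbi_align(costs)
```

Each cell only looks at two predecessors, so the cost is linear in the number of frames times the number of states. A decoder would additionally prune cells whose score falls outside the beam.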
Pronunciation Prefix Trees (PPT)
• Tree representation of the search dictionary
• Very compact → fast!
• Viterbi Algorithm also works for trees
    BROADWAY: B R OA D W EY
    BROADLY:  B R OA D L IE
    BUT:      B AH T
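The compactness of a prefix tree can be seen on the three example words. The following is a minimal sketch that represents the tree as nested dictionaries. The representation, the `"#"` key marking word ends, and the names `build_ppt` and `count_nodes` are assumptions made for illustration.

```python
def build_ppt(lexicon):
    """Build a pronunciation prefix tree as nested dicts.

    Each key is a phone; the special key "#" lists the words ending at that node.
    """
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})  # shared prefixes reuse existing nodes
        node.setdefault("#", []).append(word)
    return root

def count_nodes(node):
    """Number of phone nodes below the given node."""
    return sum(1 + count_nodes(child) for key, child in node.items() if key != "#")

lexicon = {
    "BROADWAY": ["B", "R", "OA", "D", "W", "EY"],
    "BROADLY": ["B", "R", "OA", "D", "L", "IE"],
    "BUT": ["B", "AH", "T"],
}
ppt = build_ppt(lexicon)
tree_size = count_nodes(ppt)
flat_size = sum(len(phones) for phones in lexicon.values())
```

For these three words the tree needs 10 phone nodes where a flat lexicon needs 15, because the prefixes B and B R OA D are shared.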
Viterbi Search for PPTs
• A PPT is traversed in a time-synchronous way
• Apply the Viterbi Algorithm on
  • state level (sub-phonemic units: -b, -m, -e), constrained by the HMM topology
  • phone level, constrained by the PPT
• What do we do when we reach the end of a word?
Re-Entrant PPTs for continuous speech
• Isolated word recognition:
  • Search terminates in the leaves of the PPT
• Decoding of word sequences:
  • Re-enter the PPT and store the Viterbi path using a backpointer table
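A backpointer table lets the decoder recover the word sequence after the tree has been re-entered many times. The sketch below assumes one simple layout, in which each entry records a word, its end frame and the index of its predecessor entry. The layout, the names `BackPointer` and `traceback`, and the frame numbers are hypothetical.

```python
from typing import NamedTuple, Optional

class BackPointer(NamedTuple):
    word: str             # word that ended here
    end_frame: int        # frame at which the word end was reached
    prev: Optional[int]   # index of the predecessor entry, None at sentence start

def traceback(table, last):
    """Follow predecessor links from entry `last` back to the sentence start."""
    words = []
    idx = last
    while idx is not None:
        entry = table[idx]
        words.append((entry.word, entry.end_frame))
        idx = entry.prev
    return words[::-1]

# Invented entries: two competing first words, one of which was continued.
table = [
    BackPointer("hello", 42, None),
    BackPointer("hollow", 40, None),
    BackPointer("world", 90, 0),
]
result = traceback(table, 2)
```

Only the index of the best predecessor has to be carried into the root of the tree on re-entry; the full history stays in the table.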
Problem: Branching Factor
• Imagine a sequence of 3 words with a 10k vocabulary
• 10k ^ 3 = 1000G word sequences (potentially)
• Not everything will be expanded, of course
• Viterbi approximation → path recombination:
  • Possible when P(Candy | "hi I am") = P(Candy | "hello I am"), e.g. for a trigram LM, which only sees "I am"
[Figure: two paths, "hi I am" and "hello I am", both continuing with "Candy"]
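Path Recombination
At time t:
    Path1 = $w_1 \dots w_N$ with score $s_1$
    Path2 = $v_1 \dots v_M$ with score $s_2$
where
    $s_1 = p(x_1 \dots x_t | w_1 \dots w_N) \cdot \prod_i P(w_i | w_{i-1} w_{i-2})$
    $s_2 = p(x_1 \dots x_t | v_1 \dots v_M) \cdot \prod_i P(v_i | v_{i-1} v_{i-2})$
With a trigram LM, two paths that end in the same two words receive identical LM scores for every continuation, so only the better-scoring one needs to be kept.
In the end, we're only interested in the best path!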
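Path recombination amounts to keeping one survivor per language model history. The sketch below assumes a trigram LM, so that the history is the last two words, and uses log-probabilities as scores (higher is better). The function name `recombine`, the parameter `history_len` and all scores are invented for illustration.

```python
def recombine(paths, history_len=2):
    """Keep only the best-scoring path for each LM history.

    paths: list of (word tuple, log-probability score) pairs, all ending at the same time t.
    history_len: number of trailing words the LM conditions on (2 for a trigram LM).
    """
    best = {}
    for words, score in paths:
        key = words[-history_len:]  # paths with equal keys have identical futures
        if key not in best or score > best[key][1]:
            best[key] = (words, score)
    return list(best.values())

paths = [
    (("hi", "I", "am"), -12.0),
    (("hello", "I", "am"), -10.5),
    (("hello", "I", "can"), -11.0),
]
survivors = recombine(paths)
```

Here the first two paths share the history "I am", so only the better one survives, while the third path has a different history and is kept.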
Path Recombination (cont'd)
• To expand the search space into a new root:
  • Pick the path with the best score so far (Viterbi approximation)
  • Initialize scores and backpointers for the root node according to the best predecessor word
  • Store the left-context model information with the last phone from the predecessor (context-dependent acoustic models, e.g. /s ih t/, /l ih p/)
Problem with Re-Entry
• For a correct use of the Viterbi algorithm, the choice of the best path must include the score for the transition from the predecessor word to the successor word
• The word identity is not known at the root level; the best predecessor can therefore not be chosen at this point
Consequences
• Wrong predecessor words
  • Language model information is available only at the leaf level
• Wrong word boundaries
  • The starting point for the successor word is determined without any language model information
• Incomplete linguistic information
  • Open (wide) pruning thresholds are needed for beam search
Three-Pass search strategy
• Pass 1: Search on a tree-organized lexicon (PPT)
  • Aggressive path recombination at word ends
  • Use linguistic information only approximately
  • Generate a list of starting words for each frame
• Pass 2: Search on a flat-organized lexicon
  • Fix the word segmentation from the first pass
  • Full use of the language model (often needs a third pass)
Three-Pass Decoder: Results
• Q4g system with cache for acoustic scores:
  • 4000 acoustic models trained on BN+ESST
  • 40k vocabulary
  • Test on "readBN" data
One-Pass Decoder: Motivation
• The efficient use of all available knowledge sources as early as possible should result in faster decoding
• Use the same engine to decode with:
  • Statistical n-gram language models with arbitrary n
  • Context-free grammars (CFG)
  • Word graphs
Linguistic states
• Linguistic state, examples:
  • (n-1)-word history for a statistical n-gram LM
  • Grammar state for CFGs
  • (lattice node, word history) for word graphs
• To fully use the linguistic knowledge source, the linguistic state has to be kept during decoding
• Path recombination has to be delayed until the word identity is known
Linguistic context assignment
• Key idea: establish a linguistic polymorphism for each node of the PPT
• Maintain a list of linguistically morphed instances in each node
• Each instance stores its own backpointer and scores for each state of the underlying HMM with respect to the linguistic state of that instance
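PPT with linguistically morphed instances
[Figure: PPT with the branches B R OA D W EY, B R OA D L IE and B AH T, with linguistically morphed instances attached to the nodes]
• Typically: 3-gram LM, i.e. $P(W) = \prod_i P(w_i | W_i)$, where $W_i$ is the history of $w_i$
• Example: $P(w_i | W_i)$ = P(broadway | "bullets over")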
Language Model Lookahead
• Since the linguistic state is known, the complete LM information P(W) can be applied to the instances, given the possible successor words for that node of the PPT
• Let
    lct = linguistic context / state of instance i from node n
    path(w) = path of word w in the PPT
    $\pi(n, lct) = \max_{w : n \in path(w)} P(w | lct)$ (equivalently, the minimum over the scores −log P)
    $score(i) = p(x_1 \dots x_t | w_1 \dots w_N) \cdot P(w_{N-1} | \dots) \cdot \pi(n, lct)$
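The lookahead score of a node is the best LM probability of any word that can still be reached from it. The sketch below computes this for every node of a small tree in one fixed linguistic context. It works on probabilities and therefore takes the maximum, which corresponds to the minimum over −log P scores. Nodes are identified by their phone prefix, and the LM probabilities, the function name `lookahead` and the dictionary layout are invented for illustration.

```python
def lookahead(lexicon, lm_probs):
    """Map each PPT node (phone prefix) to max P(w | lct) over the words w below it."""
    pi = {}
    for word, phones in lexicon.items():
        p = lm_probs[word]
        for k in range(1, len(phones) + 1):
            node = tuple(phones[:k])           # every prefix of the word is a node on its path
            pi[node] = max(pi.get(node, 0.0), p)
    return pi

lexicon = {
    "BROADWAY": ["B", "R", "OA", "D", "W", "EY"],
    "BROADLY": ["B", "R", "OA", "D", "L", "IE"],
    "BUT": ["B", "AH", "T"],
}
# Hypothetical values of P(w | "bullets over"), for illustration only.
lm_probs = {"BROADWAY": 0.6, "BROADLY": 0.01, "BUT": 0.05}
pi = lookahead(lexicon, lm_probs)
```

At the shared node B the lookahead is 0.6. It drops to 0.05 on the branch towards BUT and to 0.01 once the path commits to BROADLY, at which point the word is unique and the exact LM probability is in place.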
LM Lookahead (cont'd)
• When the word becomes unique, the exact LM score is already incorporated and no explicit word transition needs to be computed
• The LM scores are updated on demand, based on a compressed PPT ("smearing" of LM scores)
• Tighter pruning thresholds can be used since the language model information is no longer delayed
Early Path Recombination
• Path recombination can be performed as soon as the word becomes unique, which is usually a few nodes before reaching the leaf. This reduces the number of unique linguistic contexts and instances
• This is particularly effective for cross-word models due to the fan-out in the right-context models
One-pass Decoder: Summary
• One-pass decoder based on
  • One copy of the tree with dynamically allocated instances
  • Early path recombination
  • Full language model lookahead
• Linguistic knowledge sources
  • Statistical n-grams with n > 3 possible
  • Context-free grammars
Results
Remarks on speed-up
• Speed-up ranges from a factor of almost 3 for the readBN task to 1.4 for the meeting data
• Speed-up depends strongly on matched domain conditions
• Decoder profits from sharp language models
• LM lookahead is less effective for weak language models due to unmatched conditions
Memory usage: Q4g
Summary
• Decoding is time- and memory-consuming
• Search errors occur when beams are too tight (trade-off) or when the Viterbi assumption is violated
• State of the art: one-pass decoder
  • Tree structure for efficiency
  • Linguistically morphed instances of nodes and leaves
• Other approaches exist (stack decoding, a-posteriori decoding, ...)