Coupling between ASR and MT in Speech-to-Speech Translation Arthur Chan Prepared for Advanced Machine Translation Seminar
This Seminar (~35 pages) • Introduction (6 slides) • Ringger's categorization of coupling between ASR and NLU (7 slides) • Interfaces in Loose Coupling • 1-best and N-best (5 slides) • Lattices/Confusion Networks/Confidence Estimation (9 slides) • Results from the literature (4 slides) • Tight Coupling • Ney's theory and 2 methods of implementation (4 slides) • (Sorry, no FST approaches will be discussed) • Much bonus material at the back
History of this presentation • V1: • Draft finished on Mar 1st • Tanja's comments: • Direct modeling could be skipped. • We could focus on explaining why/how ASR generates its current outputs. • Issues in MT search could be ignored.
History of this presentation (cont.) • V2 – V4: • Followed Tanja's comments and finished on Mar 19th. • Reviewers' comments: • Too long (70 pages) • Ney's search formulation is too difficult to follow • V5 – V6 • Significantly trimmed down the presentation • Moved a lot of material to the backup section. • V7 • Incorporated some comments from Alon, Stephan and the class.
4 Papers on Coupling in Speech-to-Speech Translation • H. Ney, "Speech translation: Coupling of recognition and translation," in Proc. ICASSP, 1999. • S. Saleem, S. C. Jou, S. Vogel, and T. Schultz, "Using word lattice information for a tighter coupling in speech translation systems," in Proc. ICSLP, 2004. • V. H. Quan et al., "Integrated N-best re-ranking for spoken language translation," in Proc. Eurospeech, 2005. • N. Bertoldi and M. Federico, "A new decoder for spoken language translation based on confusion networks," in Proc. IEEE ASRU Workshop, 2005.
A Conceptual Model of Speech-to-Speech Translation • Pipeline (diagram): waveforms → Speech Recognizer → decoding result(s) → Machine Translator → translation → Speech Synthesizer → waveforms
Motivation of Tight Coupling between ASR and MT • The 1-best ASR output could be wrong • MT could benefit from the wide range of supplementary information provided by ASR • N-best list • Lattice • Sentence/word-based confidence scores • E.g. word posterior probability • Confusion network • Or consensus decoding (Mangu 1999) • MT quality may depend on the WER of ASR (?)
Scope of this talk • Same pipeline (diagram), focusing on the interface between the Speech Recognizer and the Machine Translator: 1-best? N-best? Lattice? Confusion network? • Loose coupling / tight coupling
Topics Covered Today • The concept of Coupling • “Tightness” of coupling between ASR and Technology X. (Ringger 95) • Two questions: • What could ASR provide in loose coupling? • Discussion of interfaces between ASR and MT in loose coupling • What is the status of tight coupling? • Ney’s Formulation
Topics not covered • Direct modeling • Uses features from both ASR and MT • Sometimes referred to as "ASR and MT unification" • FST approaches • [V7: I only read two papers and couldn't do them justice.] • Implications of the MT search algorithms for the coupling • Generation of speech from text.
Classification of Coupling of ASR and Natural Language Understanding (NLU) • Proposed in Ringger 95, Harper 94 • 3 Dimensions of ASR/NLU • Complexity of the search algorithm • Simple N-gram? • Incrementality of the coupling • On-line? Left-to-right? • Tightness of the coupling • Tight? Loose? Semi-tight?
Tightness of Coupling: Tight / Semi-Tight / Loose
Notes: • Semi-tight coupling could appear as • A feedback loop between ASR and Technology X for the whole utterance of speech • Or a feedback loop between ASR and Technology X for every frame. • The Ringger framework • A good way to understand how speech-based systems are developed
Example 1: LM • Suppose someone asserts that ASR has to be used with a 13-gram LM. • In tight coupling • A search will be devised that finds the best word sequence under the combined acoustic score + 13-gram likelihood • In loose coupling • A simple search will be used to generate some outputs (N-best list, lattice, etc.) • The 13-gram will then be used to rescore those outputs (see the rescoring sketch below) • In semi-tight coupling • 1. A simple search will be used to generate results • 2. The 13-gram will be applied at word ends only (but the exact history will not be stored)
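To make the loose-coupling case concrete, here is a minimal sketch (my own illustration, not from the slides) of rescoring an N-best list with a separately trained higher-order LM; the toy LM function and the example scores are hypothetical placeholders for a real LM toolkit.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=0.7):
    """Loose coupling: re-rank an existing N-best list with an external LM.

    nbest      -- list of (hypothesis, acoustic_logprob) pairs from the decoder
    lm_logprob -- callable returning the higher-order LM log-probability
    lm_weight  -- interpolation weight between acoustic and LM scores
    """
    rescored = []
    for hyp, am_score in nbest:
        combined = (1.0 - lm_weight) * am_score + lm_weight * lm_logprob(hyp)
        rescored.append((combined, hyp))
    rescored.sort(reverse=True)              # highest combined score first
    return [hyp for _, hyp in rescored]

# Hypothetical stand-in for a 13-gram LM score (a real system would query an LM toolkit).
def toy_lm_logprob(hypothesis):
    return -0.5 * len(hypothesis.split())

nbest = [("recognize speech", -10.2), ("wreck a nice beach", -9.8)]
print(rescore_nbest(nbest, toy_lm_logprob))
```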
Example 2: Higher-order AM • Segmental models assume observation probabilities are not conditionally independent. • Suppose someone asserts that a segmental model is better than a plain HMM. • Tight coupling: direct search for the best word sequence using the segmental model. • Loose coupling: use the segmental model to rescore. • Semi-tight coupling: a hybrid HMM/segmental-model algorithm?
Implications for ASR/MT coupling • The framework generalizes many systems • Loose coupling • Any system which uses 1-best, N-best, lattice, or other inputs for one-way module communication • (Bertoldi 2005) • CMU system (Saleem 2004) • Tight coupling • (Ney 1999) • Semi-tight coupling • (Quan 2005)
Perspectives • ASR outputs • 1-best results • N-best results • Lattice • Consensus network • Confidence scores • How does ASR generate these outputs? • Why are they generated? • What if there are multiple ASR systems? • (and what if their results are combined?) • Note: we are talking about the state lattice now, not the word lattice.
Origin of the 1-best • Decoding in HMM-based ASR = searching for the best path in a huge HMM-state lattice. • The 1-best ASR result • The best path one could find by backtracking. • State lattice in ASR (next slide)
Note on the 1-best in ASR • Most of the time, the 1-best is a word sequence, not a state sequence • Why? • In LVCSR, storing the backtracking pointer table for the full state sequence takes a lot of memory (even nowadays) • [Compare this with the number of frames of scores one needs to store] • Usually a backtrack pointer stores • The previous word before the current word • Clever structures dynamically allocate the backtracking pointer table (see the backtrace sketch below)
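As a concrete illustration of word-level backtracking (my own sketch with hypothetical entries, not from the slides): each backpointer records only the word, its end frame and the index of the previous entry, and the 1-best word sequence is recovered by following those links backwards.

```python
# Each backpointer entry records: (word, end_frame, index of the previous entry).
# A decoder would append an entry whenever a word end survives pruning.
backpointers = [
    ("<s>", 0, None),
    ("how", 35, 0),
    ("are", 60, 1),
    ("you", 92, 2),
]

def backtrace(backpointers, last_index):
    """Follow previous-word links from the final entry to recover the 1-best."""
    words = []
    idx = last_index
    while idx is not None:
        word, _end_frame, prev = backpointers[idx]
        words.append(word)
        idx = prev
    return list(reversed(words))

print(backtrace(backpointers, len(backpointers) - 1))  # ['<s>', 'how', 'are', 'you']
```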
What is an N-best list? • Traceback not only from the 1st best, but also from the 2nd best, 3rd best, etc. • Pathways: • Directly from the search backtrack pointer table • Exact N-best algorithm (Chow 90) • Word-pair N-best algorithm (Chow 91) • A* search using the Viterbi score as heuristic (Chow 92) • Generate a lattice first, then generate the N-best from the lattice (see the sketch below)
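The last pathway above (lattice first, then N-best) can be sketched as a best-first expansion of partial paths over a toy word lattice; this is my own simplified illustration, and a real implementation would use A* with exact Viterbi backward scores as the heuristic and would merge duplicate word sequences.

```python
import heapq

# Toy word lattice: node -> list of (next_node, word, log_score).
lattice = {
    0: [(1, "i", -1.0), (1, "eye", -1.5)],
    1: [(2, "see", -0.8), (2, "sea", -1.2)],
    2: [],  # final node
}

def nbest_from_lattice(lattice, start, final, n):
    """Best-first expansion of partial paths; returns up to n complete hypotheses."""
    # Heap items: (negated log score, node, words so far); heapq pops the best path first.
    heap = [(0.0, start, [])]
    results = []
    while heap and len(results) < n:
        neg_score, node, words = heapq.heappop(heap)
        if node == final:
            results.append((-neg_score, " ".join(words)))
            continue
        for nxt, word, s in lattice[node]:
            heapq.heappush(heap, (neg_score - s, nxt, words + [word]))
    return results

print(nbest_from_lattice(lattice, start=0, final=2, n=3))
```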
Interfaces in Loose Coupling: Lattices, Consensus Networks and Confidence Estimation
What is a Lattice? • A word-based lattice • A compact representation of the state lattice • Only word nodes (or links) are involved • Difference between N-best and lattice • A lattice can be a compact representation of an N-best list.
How is a lattice generated? • From the decoding backtracking pointer table • Only record the links between word nodes. • From the N-best list • It becomes a compact representation of the N-best • [sometimes spurious links will be introduced] • Some complicated issues • Triphone contexts • Cause a lot of complications • When the lattice is too large • You want to prune it.
Conclusions on lattices • Lattice generation itself can be a complicated issue • Sometimes, what the post-processing stage (e.g. MT) gets are pre-filtered, pre-processed results.
Confusion Network and Consensus Hypothesis • Confusion network: • Also called a "sausage network" • Or a "consensus network"
Special Properties • More "local" than a lattice • One can apply simple criteria to find the best results • E.g. "consensus decoding" applies word posterior probabilities on the confusion network. • More tractable • In terms of size
Notes on Consensus Networks • Time information might not be preserved in a confusion network • The similarity function directly affects the final output of the consensus network. • Other ways to generate a confusion network • From the N-best list • Using ROVER • A mixture of voting and adding word confidences • (A small consensus-decoding sketch follows below)
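A minimal sketch of consensus decoding over such a sausage (my own illustration with hypothetical posteriors): in each bin, pick the word with the highest posterior, where the empty word represents deleting that slot.

```python
# A confusion network ("sausage"): a list of bins, each mapping word -> posterior.
# "" denotes the empty (epsilon) word, i.e. the option of deleting this slot.
sausage = [
    {"i": 0.9, "a": 0.1},
    {"see": 0.55, "sea": 0.40, "": 0.05},
    {"": 0.7, "uh": 0.3},
]

def consensus_decode(sausage):
    """Pick the highest-posterior entry in each bin and drop epsilon choices."""
    hypothesis = []
    for bin_posteriors in sausage:
        best_word = max(bin_posteriors, key=bin_posteriors.get)
        if best_word:  # skip the epsilon word
            hypothesis.append(best_word)
    return " ".join(hypothesis)

print(consensus_decode(sausage))  # -> "i see"
```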
Confidence Measures • Anything other than the likelihood which could tell whether the answer is useful • E.g. • Word posterior probability • P(W|A) • Usually computed using lattices (see the sketch below) • Language model backoff mode • Other posterior probabilities (frame, sentence)
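Word posteriors of the kind mentioned above can be sketched with a forward-backward pass over a small acyclic lattice (my own simplified illustration with made-up scores; real systems work in log space and apply acoustic/LM scaling).

```python
from collections import defaultdict

# Toy acyclic lattice: edges (from_node, to_node, word, probability-like score),
# listed in topological order.
edges = [
    (0, 1, "i", 0.6), (0, 1, "eye", 0.4),
    (1, 2, "see", 0.7), (1, 2, "sea", 0.3),
]
start, final = 0, 2

def edge_posteriors(edges, start, final):
    """Posterior of an edge = forward(from) * score * backward(to) / total mass."""
    forward = defaultdict(float)
    forward[start] = 1.0
    for frm, to, _word, s in edges:
        forward[to] += forward[frm] * s
    backward = defaultdict(float)
    backward[final] = 1.0
    for frm, to, _word, s in reversed(edges):
        backward[frm] += s * backward[to]
    total = forward[final]
    return [(word, forward[frm] * s * backward[to] / total)
            for frm, to, word, s in edges]

for word, posterior in edge_posteriors(edges, start, final):
    print(f"{word}: {posterior:.2f}")
```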
General Note • Coupling in SST is still pretty new • Papers were chosen according to whether certain ASR outputs have been used • Other techniques such as direct modeling might be mixed into these papers.
N-best list (Quan 2005) • Uses the N-best list for reranking • Interpolation weights for the AM and TM are then optimized (see the sketch below). • Summary: • Reranking gives improvements.
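The reranking idea can be sketched (in a much-simplified, hypothetical form that is not Quan 2005's actual setup) as an interpolation of ASR and translation-model scores per N-best entry, with the weight tuned on a small development set.

```python
def rerank(nbest, weight):
    """nbest: list of (hypothesis, am_logscore, tm_logscore); weight trades AM vs. TM."""
    scored = [(weight * am + (1.0 - weight) * tm, hyp) for hyp, am, tm in nbest]
    return max(scored)[1]

# Hypothetical scores for one utterance's N-best entries.
nbest = [("buenos dias", -12.0, -4.0), ("buenos diaz", -11.5, -9.0)]

# Crude weight tuning: pick the weight that most often selects the reference on dev data
# (a real system would optimize BLEU or a similar metric).
dev_data = [(nbest, "buenos dias")]
best_weight = max(
    (w / 10.0 for w in range(11)),
    key=lambda w: sum(rerank(nb, w) == ref for nb, ref in dev_data),
)
print(best_weight, rerank(nbest, best_weight))
```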
Lattices: CMU results (Saleem 2004) • Summary of results • Lattice word error rate improves as lattice density increases • Lattice density and the weight on acoustic scores turn out to be important parameters to tune • Too large or too small can hurt.
Consensus Networks • Bertoldi 2005 is probably the only work on a confusion-network-based method • Summary of results: • When direct modeling is applied • The consensus network doesn't beat the N-best method. • The authors argue for the speed and simplicity of the algorithm
Confidence: Does it help? • According to Zhang 2006, yes. • Confidence measure (CM) filtering is used to filter out unnecessary results in the N-best list (a generic filtering sketch follows below) • Note: the approach used is quite different.
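A generic sketch of the filtering idea (not Zhang 2006's actual method): drop N-best entries whose confidence falls below a threshold before handing them to MT, falling back to the single best entry if everything is filtered out.

```python
def filter_nbest(nbest, threshold=0.5):
    """Keep hypotheses whose confidence is at or above the threshold;
    never return an empty list (fall back to the single best entry)."""
    kept = [(hyp, conf) for hyp, conf in nbest if conf >= threshold]
    return kept if kept else nbest[:1]

# Hypothetical (hypothesis, confidence) pairs.
nbest = [("go to the station", 0.82), ("goat of the station", 0.31)]
print(filter_nbest(nbest))  # only the high-confidence hypothesis survives
```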
Conclusions on Loose Coupling • ASR can give a rich set of outputs. • It is still unknown what type of output should be used in the pipeline. • Currently, there is a lack of comprehensive experimental studies on which method is best. • Usage of confusion networks and confidence estimation seems under-explored.
Comments about Consensus Networks • From Stephan: • Reasons for not using consensus networks *now* • 1. The consensus network might occasionally introduce spurious links in each sausage segment. • 2. Lattices from the ASR teams could change from time to time; MT teams need time to consume them. • From Alon, Ralf and Stephan: • There is no big reason not to use consensus networks, because essentially it is just another type of network.
Theory (Ney 1999) • Bayes' rule: ê = argmax_e Pr(e|x) = argmax_e Pr(e) Pr(x|e) • Introduce the source sentence f as a hidden variable: Pr(x|e) = Σ_f Pr(f, x|e) • Bayes' rule again: Pr(f, x|e) = Pr(f|e) Pr(x|f, e) • Assume x doesn't depend on the target language: Pr(x|f, e) ≈ Pr(x|f) • Sum to max: ê ≈ argmax_e Pr(e) max_f Pr(f|e) Pr(x|f)
Layman's point of view • Three factors • Pr(e): target language model • Pr(f|e): translation model • Pr(x|f): acoustic model • Note: the assumption has been made that only the best-matching f for each e is used.
Comparison with SR • In SR: • Pr(f) : Source language model • In Tight coupling • Pr(f|e), Pr(e) : Translation model and Target language model
Algorithmic Point of View • Brute-force method: instead of incorporating only the LM into the standard Viterbi algorithm • Incorporate P(e) and P(f|e) into the search • => Very complicated (a compact form of the resulting criterion is sketched below) • The backup slides in the presentation have details about Ney's implementations.
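Roughly, the brute-force criterion would require the decoder to evaluate something like the following (my paraphrase of the earlier derivation, with s denoting an HMM state sequence over the T acoustic frames):

$$\hat{e} \approx \operatorname*{argmax}_{e}\Big\{\Pr(e)\,\max_{f}\Big[\Pr(f\mid e)\,\max_{s_1^T}\Pr(x_1^T, s_1^T\mid f)\Big]\Big\}$$

so the search runs jointly over target words e, source words f and state alignments s, which is why a direct implementation becomes so complicated.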
Experimental Results in Matusov, Kanthak and Ney 2005 • Summary of the results • Translation quality is only improved by tight coupling when the lattice density is not high. • As in Saleem 2004, incorporation of acoustic scores helps.
Conclusion: Possible Issues with Tight Coupling • Possibilities: • In SR, the source n-gram LM is already very close to the best configuration. • The complexity of the algorithm is too high; approximation is still necessary to make it work. • When the tight-coupling criterion is used, it is possible that the LM and the TM need to be jointly estimated. • The current approaches still haven't really implemented tight coupling. • There might be bugs in the programs.
Conclusion • Two major issues in the coupling of SST were discussed • In loose coupling: • Consensus networks and confidence scoring are still not fully utilized • In tight coupling: • The approach seems to be haunted by the very high complexity of constructing the search algorithm
Discussion • Ian: It could be quite difficult to characterize the relationship between WER and BLEU. • Alan asks: Why not jointly optimize the translation model and the acoustic model? • Arthur: direct modeling could be useful • Stephan: (rephrased) will it really help?