Integrating Speech Recognition and Machine Translation
Spyros Matsoukas, Ivan Bulyko, Bing Xiang, Kham Nguyen, Richard Schwartz, John Makhoul
Integration Issues
• Machine Translation (MT) system is trained on text data, so it expects
  • segments that correspond to foreign sentences
  • properly placed punctuation marks
  • numbers, dates, monetary amounts, abbreviations, etc., as they appear in ordinary text
• However, Speech-To-Text (STT) output
  • is segmented automatically on long pauses, so the resulting segments may be too short or may cross sentence boundaries
  • has no punctuation, so punctuation needs to be added automatically prior to translation
  • has numbers, dates, etc., in spoken form, so the output must be parsed to convert them to written form
STT/MT Pipeline
• Initial set of experiments ran MT on the 1-best hypothesis from STT (a schematic sketch follows below)
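The slides do not show the pipeline itself; the following is a minimal, runnable schematic of the 1-best STT-to-MT flow, where every component is a trivial hypothetical stand-in for the BBN systems described in these slides, not the actual implementation.

```python
# Schematic sketch of the 1-best STT -> MT pipeline. All functions below are
# hypothetical placeholders, not the actual BBN components.

def segment_on_pauses(audio_tokens, pause="<pause>"):
    """Split the stream at long-pause markers (stand-in for the audio segmenter)."""
    segs, cur = [], []
    for tok in audio_tokens:
        if tok == pause:
            if cur:
                segs.append(cur)
                cur = []
        else:
            cur.append(tok)
    if cur:
        segs.append(cur)
    return segs

def recognize(segment):
    """Stand-in for the STT decoder; here the 'audio' is already words."""
    return segment

def translate(words):
    """Stand-in for the phrase-based MT decoder."""
    return " ".join(words).upper()  # placeholder 'translation'

def pipeline(audio_tokens):
    segments = segment_on_pauses(audio_tokens)
    hyps = [recognize(s) for s in segments]  # 1-best hypotheses
    # In the full system, number normalization and sentence boundary
    # detection (see later slides) would be applied to hyps here.
    return [translate(h) for h in hyps]

print(pipeline("the talks ended today <pause> officials said".split()))
# -> ['THE TALKS ENDED TODAY', 'OFFICIALS SAID']
```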
STT Components
• STT-A
  • EARS RT04 Arabic BN system
  • Word pronunciations based on graphemes
  • Acoustic models estimated using Maximum Mutual Information (MMI) and Speaker Adaptive Training (SAT) on 100 hours of BN audio data
  • 3-gram language model trained on 400 million words of news text
• STT-B
  • Uses a morphological analyzer and automatic methods to infer short vowels in word pronunciations
  • Trained on an additional 50 hours of acoustic training data
• STT-C
  • Makes use of additional language model training data
MT Components
• MT-A
  • System developed during the period Sep 2004 – Apr 2005
  • Phrase-based translation model, trained on 100M words of Arabic/English UN and news bitext
  • 3-gram English LM, trained on 2 billion words of text (mostly newswire)
  • Translation based on the posterior probability P(English | Foreign)
• MT-B
  • Uses a combination of generative and posterior translation probabilities
  • Includes a phrase segmentation score
  • Uses a method to compensate for over-estimated translation probabilities
  • Optimizes decoding weights by minimizing TER on N-best lists
[Table: TER results on the 2002 and 2004 MT Eval sets]
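The slides list MT-B's features but not the model form; in the standard log-linear formulation used by phrase-based decoders of this period (an assumption, not a formula from the slides), the decoding weights tuned by minimizing TER on N-best lists are the λ_k below:

```latex
\hat{e} = \arg\max_{e} \sum_{k} \lambda_k \, h_k(e, f),
\qquad
h_k \in \left\{ \log P(e \mid f),\ \log P(f \mid e),\ \log P_{\mathrm{LM}}(e),\ \text{phrase segmentation score},\ \dots \right\}
```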
Test Data
• Tested integration on bnat05
  • 6-hour set from several sources, from Jan 2001 and Nov 2003
  • Test set consists of both Modern Standard Arabic (MSA) and Arabic dialect segments
• All system comparisons based on TER
  • MT system output automatically scored against a single reference transcription, with mixed case
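For reference, TER (Translation Edit Rate) is the minimum number of edits, including phrase shifts, needed to turn the system output into the reference, normalized by reference length:

```latex
\mathrm{TER} = \frac{\#\text{insertions} + \#\text{deletions} + \#\text{substitutions} + \#\text{shifts}}{\#\text{reference words}}
```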
Integration Results
• Effect of STT accuracy, segmentation and punctuation on MT accuracy
• At the current MT performance level:
  • large improvements in STT accuracy result in only a small TER gain
  • a significant TER reduction (2.7% absolute) can be obtained by improving sentence boundary detection
  • full punctuation helps translation only marginally
Optimizing STT Segmentation for MT
• Tuned the audio segmentation procedure to output segments that match the reference in terms of average length
  • 1.6% absolute TER gain from optimizing the segmentation
• Additional gains can be obtained by
  • converting spoken numbers to written form prior to translation (0.4–0.5% TER reduction); a sketch of such a conversion follows below
  • re-defining the STT output segmentation using linguistic information
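The slides do not describe the conversion procedure; the sketch below illustrates the idea on English number words with a few hypothetical rules (the actual system operated on Arabic STT output):

```python
# Minimal sketch of spoken-to-written number conversion (hypothetical rules;
# the real system parsed Arabic STT output, not English).
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
SCALES = {"hundred": 100, "thousand": 1000, "million": 10**6}

def normalize_numbers(tokens):
    """Replace maximal runs of number words with their digit form."""
    out, value, current, in_number = [], 0, 0, False
    for tok in tokens + ["<eos>"]:           # sentinel flushes a trailing number
        w = tok.lower()
        if w in UNITS:
            current += UNITS[w]; in_number = True
        elif w in TENS:
            current += TENS[w]; in_number = True
        elif w in SCALES and in_number:
            if SCALES[w] == 100:
                current *= 100               # "two hundred" -> 200
            else:
                value += current * SCALES[w] # thousand/million close a group
                current = 0
        else:
            if in_number:
                out.append(str(value + current))
                value, current, in_number = 0, 0, False
            if tok != "<eos>":
                out.append(tok)
    return out

print(" ".join(normalize_numbers("it cost two hundred fifty three dollars".split())))
# -> "it cost 253 dollars"
```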
Sentence Boundary Detection (SBD)
• Used a hidden-event language model (HELM) to detect sentence boundaries in the 1-best STT output
  • 4-gram HELM, trained on 850M words of Arabic news with Kneser-Ney smoothing
  • Silence duration can be integrated as an observation into the HMM search
• Explored various configurations (a sketch of the basic scheme follows after this list)
  • SBD-1: Use only the LM to insert periods within speaker turns
  • SBD-2: Use the LM and silence duration jointly
  • SBD-3: Bias the LM to insert boundaries at a higher rate (by 30–50%), then remove the boundaries with the lowest model posteriors while constraining the maximum sentence length
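A minimal sketch of LM-based boundary posteriors, assuming a toy bigram model in place of the 4-gram Kneser-Ney HELM (all probabilities below are illustrative, and silence-duration observations are omitted):

```python
import math

# Toy bigram "hidden-event" LM over (previous token, next token) pairs, where
# "<b>" is the hidden sentence-boundary event. Numbers are illustrative only.
BIGRAM = {
    ("said", "<b>"): 0.4, ("said", "the"): 0.1,
    ("<b>", "the"): 0.3, ("today", "<b>"): 0.5, ("today", "we"): 0.05,
}
FLOOR = 1e-4  # back-off probability for unseen bigrams

def logp(prev, nxt):
    return math.log(BIGRAM.get((prev, nxt), FLOOR))

def boundary_posterior(prev, nxt, bias=1.0):
    """P(boundary between prev and nxt), comparing '... prev <b> nxt ...'
    against '... prev nxt ...'. bias > 1 raises the insertion rate, as in
    the SBD-3 configuration."""
    with_b = logp(prev, "<b>") + logp("<b>", nxt) + math.log(bias)
    without = logp(prev, nxt)
    m = max(with_b, without)
    return math.exp(with_b - m) / (math.exp(with_b - m) + math.exp(without - m))

def insert_boundaries(words, threshold=0.5, bias=1.0):
    out = [words[0]]
    for prev, nxt in zip(words, words[1:]):
        if boundary_posterior(prev, nxt, bias) > threshold:
            out.append("<b>")
        out.append(nxt)
    return out

print(" ".join(insert_boundaries("we met today the plan worked".split())))
# -> "we met today <b> the plan worked"
# SBD-3 additionally removes the lowest-posterior boundaries subject to a
# maximum sentence length (not shown here).
```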
SBD Results
• Effect of HELM-based SBD on MT accuracy, starting from one of two audio segmentations
  • audio-seg-1: 9.47 sec/segment on average
  • audio-seg-2: 13.60 sec/segment on average
• The HELM has a larger effect on Modern Standard Arabic (MSA) regions, where STT accuracy is high
• SBD can be applied safely on top of any audio segmentation
Optimizing MT on Speech Data
• MT accuracy can be enhanced by optimizing MT decoding weights on broadcast speech data
  • Optimization can compensate for differences in style between newswire text and STT transcripts (especially on broadcast conversations)
• Optimization issue:
  • MT optimization requires a one-to-one mapping between translation hypotheses and references on the tuning set
  • This is non-trivial for translations of automatically segmented STT output
• Solutions (a re-segmentation sketch follows after this list):
  • Re-segment the STT output according to the reference segmentation prior to translation, then use the translation hypotheses for tuning
  • Tune based on translations of the STT reference transcriptions
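The slides do not detail the re-segmentation step; a common approach (assumed here, not confirmed by the slides) aligns the STT word stream to the reference words by Levenshtein distance and cuts the hypothesis at the positions aligned to reference segment ends:

```python
def align(hyp, ref):
    """Levenshtein alignment; returns, for each hyp position, the aligned ref index."""
    n, m = len(hyp), len(ref)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): D[i][0] = i
    for j in range(m + 1): D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i-1][j-1] + (hyp[i-1] != ref[j-1])
            D[i][j] = min(sub, D[i-1][j] + 1, D[i][j-1] + 1)
    i, j, link = n, m, [0] * n
    while i > 0 and j > 0:                   # trace back through the DP table
        if D[i][j] == D[i-1][j-1] + (hyp[i-1] != ref[j-1]):
            link[i-1] = j - 1; i, j = i - 1, j - 1
        elif D[i][j] == D[i-1][j] + 1:
            link[i-1] = j - 1; i -= 1        # hyp insertion: attach to last ref word
        else:
            j -= 1                           # hyp deletion
    while i > 0:
        link[i-1] = 0; i -= 1
    return link

def resegment(hyp_words, ref_segments):
    """Cut hyp_words at the positions aligned to reference segment ends."""
    ref_words = [w for seg in ref_segments for w in seg]
    ends, k = set(), -1
    for seg in ref_segments:
        k += len(seg); ends.add(k)           # index of last word in each ref segment
    link = align(hyp_words, ref_words)
    segs, cur = [], []
    for idx, (w, r) in enumerate(zip(hyp_words, link)):
        cur.append(w)
        nxt = link[idx + 1] if idx + 1 < len(hyp_words) else None
        if r in ends and (nxt is None or nxt > r):   # cut after last word on boundary
            segs.append(cur); cur = []
    if cur:
        segs.append(cur)
    return segs

hyp = "we met today the plan worked well".split()
refs = [["we", "met", "today"], ["the", "plan", "worked"]]
print(resegment(hyp, refs))
# -> [['we', 'met', 'today'], ['the', 'plan', 'worked', 'well']]
```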
MT Optimization Results
• Updated development sets
• Results
  • MT02: tuning on translations of the 2002 NIST MT evaluation set
  • BNC-STT: tuning on translations of manually segmented (according to the reference) STT output
  • BNC-REF: tuning on translations of reference transcripts
Conclusions and Future Research
• Results on 1-best STT/MT integration show that sentence boundary detection has a large impact on MT performance
  • Segmentation should be based on both the audio and the STT transcript
• Better performance is expected by coupling STT and MT more tightly
  • Have begun running MT on consensus networks from STT output
  • Will explore joint optimization of STT and MT system parameters
• At the current operating point, improvements in MT will have the largest effect