Update on WordWave Fisher Transcription
Owen Kimball, Chia-lin Kao, Jeff Ma, Rukmini Iyer, Rich Schwartz, John Makhoul
Outline
• Schedule update
• Investigating WordWave + auto segmentation quality
  • Updated evaluation method
  • Separating effect of transcripts and segmentation
  • Improved segmentation algorithm
• Plans
• Update on using Fisher data in training
Data Schedule
• BBN has received 925 hours from WordWave (WWave)
• Processed and released 478 hours via LDC:
  • 91 hrs on 8/1/03
  • 300 hrs on 9/24/03
  • 87 hrs on 10/21/03
• WWave is currently running more slowly than planned
  • Reason: CTS transcription is hard!
• They will complete 1600 hrs by the end of Jan 04, with the remaining 200 hrs to follow as quickly as possible
Segmentation Quality as of Sept 03
• Auto segmentation goal: given audio and a transcript with no timing info, break the audio into fairly short segments and align the correct text to each segment
• In September, we compared transcription and segmentation approaches on a 20-hour Swbd set:
  • LDC/MSU careful transcription and manual segmentation vs.
  • LDC fast transcription and manual segmentation vs.
  • WWave transcripts + BBN automatic segmentation
• Compared 2 different segmentation algorithms (sketch of Alg I below):
  • Alg I: run the recognizer and segment at "reliable" silences; decode using that segmentation and reject segments based on sclite alignment errors
  • Alg II: use the recognizer to get a coarse initial segmentation, then force-align within the coarse segments to find finer segments; final rejection pass as before
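A minimal sketch of Alg I's two steps, assuming the recognizer emits (word, start, end) triples; the silence threshold and WER cutoff are illustrative assumptions, and a plain edit distance stands in for sclite's alignment scoring:

def wer(hyp, ref):
    # Word error rate of hyp against ref via edit distance
    # (stand-in for sclite alignment scoring).
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
    return d[-1] / max(len(ref), 1)

def segment_at_silences(words, min_sil=0.5):
    # Split (word, start, end) triples wherever the inter-word gap
    # looks like a "reliable" silence (threshold is an assumption).
    segs, cur, prev_end = [], [], 0.0
    for w, s, e in words:
        if cur and s - prev_end >= min_sil:
            segs.append(cur)
            cur = []
        cur.append((w, s, e))
        prev_end = e
    if cur:
        segs.append(cur)
    return segs

def reject(segs, ref_texts, max_wer=0.3):
    # Drop segments whose decode disagrees too much with the
    # transcript text aligned to them.
    return [s for s, r in zip(segs, ref_texts)
            if wer([w for w, _, _ in s], r) <= max_wer]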
Performance Comparison in Sept
• Unadapted recognition; acoustic models trained on the 20-hour Swbd1 set, LM trained on full Switchboard
• ML, GI, VTL, HLDA-trained models
Improving the Evaluation Method
• A number of issues and shortcuts in the training and test setup clouded the comparisons
• We therefore:
  • Adopted an improved training sequence, including new binaries
  • Reduced pruning errors in the decode
  • Converted from fast, approximate VTL estimation to a more careful approach (sketch below)
  • Adopted more stable VTL models
    • VTL models trained on 20 hours differed dramatically for small changes in segmentation
    • This is a bug in our VTL model estimation that we need to fix
    • For the following experiments, we used stable VTL models from the RT03 eval
  • Switched from our historic LDC+MSU baseline to all-MSU for simplicity
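The slides do not spell out the "more careful" VTL approach; a common method is a per-speaker maximum-likelihood grid search over warp factors, sketched here with the front end and acoustic model abstracted into a caller-supplied scoring function (an assumption, not necessarily BBN's procedure):

def estimate_warp(frames, score, warps=None):
    # Pick the VTL warp factor that maximizes acoustic likelihood.
    # score(frames, warp) must warp the frequency axis and return the
    # model log-likelihood; it stands in for the real front end and
    # acoustic model.
    if warps is None:
        warps = [0.80 + 0.02 * i for i in range(21)]  # 0.80 .. 1.20
    return max(warps, key=lambda w: score(frames, w))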
Separating the Effect of Segmentation
• Compared segmentations using identical (MSU) transcripts
• With Alg I segmentation, WER is the same for WWave and MSU transcripts
• Segmentation may be the biggest (or only) problem
Segmentation Algorithm III
• Algorithm II used forced alignment within coarse segments provided by an initial recognition pass, but examination revealed unrecoverable errors (words in the wrong segment) from the coarse initial segmentation
• Tried forced alignment of complete conversation sides
• Overcame initial problems of failed alignments by:
  • Pre-chopping out long silences, where our system tends to get confused (sketch below)
    • Used the auto-segmenter developed for the RT03 CTS eval for this
  • Changing the forced alignment program to do much less pruning at the beginning and end of each conversation
    • This accommodated things like beeps, line noise, and words cut off by recording start and stop
• Forced alignment is followed by a script that breaks segments at silences, then a rejection pass
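A sketch of the pre-chopping step, assuming the auto-segmenter supplies sorted (start, end) silence intervals; the 3-second threshold is an illustrative assumption:

def prechop(total_dur, silences, min_gap=3.0):
    # Split a conversation side at long silences before forced
    # alignment, cutting at the midpoint of each long silence.
    # Returns (start, end) chunks covering the whole side.
    chunks, pos = [], 0.0
    for s, e in silences:
        if e - s >= min_gap and s > pos:
            mid = (s + e) / 2.0
            chunks.append((pos, mid))
            pos = mid
    chunks.append((pos, total_dur))
    return chunks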
Algorithm III with MSU transcripts
• Manually comparing MSU and Alg III segmentations showed that Alg III:
  • had more, shorter segments
  • had less silence padding around utterances
  • allowed utterances > 15 seconds when the speaker did not pause
• Modified Alg III to approximate MSU's statistics (sketch below)
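A sketch of the modified silence-breaking script, run over the (word, start, end) output of forced alignment; all three thresholds are illustrative assumptions tuned toward MSU-like statistics:

def break_segments(words, min_pause=0.3, max_len=15.0, pad=0.15):
    # Break at pauses of at least min_pause; once a segment would run
    # past max_len, break at the next word boundary even without a
    # pause; pad each segment with a little silence on both sides.
    segs, cur, cur_start, prev_end = [], [], None, None
    for w, s, e in words:
        if cur and (s - prev_end >= min_pause or e - cur_start > max_len):
            segs.append((max(cur_start - pad, 0.0), prev_end + pad, cur))
            cur, cur_start = [], None
        if cur_start is None:
            cur_start = s
        cur.append(w)
        prev_end = e
    if cur:
        segs.append((max(cur_start - pad, 0.0), prev_end + pad, cur))
    return segs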
Improved Algorithm III
• Matching MSU's utterance lengths and silence padding improves WER slightly
• Alg III seems good enough, at least for this task
Results with WordWave Transcripts
• WWave transcripts seem fine, given the improved segmentation
Plans
• Confirm quality of WWave transcripts with Alg III segmentation:
  • On the Swbd 20-hour set, train MMI models to compare all-MSU vs. WWave/Alg III
  • On the Swbd + 150-hour Fisher experiment, where we got gains using Alg I segmented data
    • Performance should not degrade
• Improve the speed of Alg III
• Resegment and redistribute all data that has been released so far
• Catch up with, and continue segmenting, the latest WWave transcript deliveries
Update on Adding Fisher Data
• In Martigny, showed a 1.4% gain from adding 150 hrs of Fisher data (Alg I segmented) to RT03 training
• Hoped to have results with 350 hours, but we had bugs in our initial runs
• Did train MMI on RT03 (sw370) vs. RT03+Fisher150
  • Results on the 2nd adaptation pass, with POS LM rescoring
• CAVEAT: non-rigorous comparison! The Fisher150 system was optimized (gains 0.1-0.2%) and used a different phone set and faster training (which degrades results by ~0.2% in other comparisons)