Automatic Detection-based Phone Recognition on TIMIT

Based on Chen and Wang in ISCSLP’08 and Interspeech’09
Hung-Shin Lee (李鴻欣)
12 July 2011 @ IIS, Academia Sinica

Detection-Based ASR
• Human SR: knowledge, detection, integration
• DB ASR: detectors, integrator, results, plus knowledge (higher level)
• Detectors: phonological attr., prosodic attr., acoustic attr., …
• Integrator: HMM, CRF, …
• Results: phone, syllable, word, sentence, semantic info, …

Phonological Systems

Phonological Feature Detection (1)
• Input: 13 MFCCs per frame, 9 frames of context (i-4 … i … i+4)
• Detectors: MLP (input layer, hidden layer), labelled time-delay / recurrent in the figure
• Output: posterior probabilities, quantized to binary values (0 / 1)
• Feature sets: SPE_14, GP_11
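The input and output stages of the detector can be sketched as below. This is only an illustration under stated assumptions: the edge padding (repeating the first and last frame) and the 0.5 quantization threshold are not given on the slide, and the function names are hypothetical.

```python
def stack_context(frames, context=4):
    """Concatenate each frame with its +/-context neighbours.

    Edges are handled by repeating the first/last frame (an assumption).
    """
    n = len(frames)
    stacked = []
    for i in range(n):
        window = []
        for k in range(i - context, i + context + 1):
            k = min(max(k, 0), n - 1)  # clamp the index at utterance edges
            window.extend(frames[k])
        stacked.append(window)
    return stacked


def quantize(posteriors, threshold=0.5):
    """Turn detector posteriors into binary feature values (threshold assumed)."""
    return [1 if p >= threshold else 0 for p in posteriors]


# 20 dummy frames of 13 MFCCs each; frame t holds the value t in every slot.
frames = [[float(t)] * 13 for t in range(20)]
stacked = stack_context(frames)
print(len(stacked), len(stacked[0]))      # 9 frames x 13 MFCCs = 117 inputs
print(quantize([0.91, 0.08, 0.50, 0.49]))
```

With 9 frames of 13 MFCCs, each MLP input vector has 117 values.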

Phonological Feature Detection (2)
• Input: 13 MFCCs per frame, 9 frames of context (i-4 … i … i+4)
• Detectors: one time-delay MLP per MV feature (6 MV features), e.g. MLP (Centrality), MLP (Front-Back), MLP (Roundness)
• Output: one active value per MLP (e.g. 0 1 0 0)
• Feature set: MV_29

Conditional Random Field (CRF) Integrator
• General chain CRF
• λj, μk: feature function weight parameters
• Feature functions: state feature function, transition feature function
• Input (phonological features) X: …, xi-1, xi, xi+1, …
• Output (phone) Y: …, yi-1, yi, …
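A minimal sketch of how a chain CRF scores a label sequence may help here. It assumes the weighted feature functions have already been summed into two tables: `state[i][y]` (all weighted state features at position i for label y) and `trans[a][b]` (all weighted transition features from label a to label b). The table names and the numbers are illustrative, not taken from the slides.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(y, state, trans):
    """Unnormalised log-score: state scores plus transition scores."""
    score = sum(state[i][y[i]] for i in range(len(y)))
    score += sum(trans[y[i - 1]][y[i]] for i in range(1, len(y)))
    return score


def log_partition(state, trans):
    """log Z(x), computed with the forward algorithm in the log domain."""
    n_labels = len(trans)
    alpha = list(state[0])
    for i in range(1, len(state)):
        alpha = [
            state[i][b]
            + logsumexp([alpha[a] + trans[a][b] for a in range(n_labels)])
            for b in range(n_labels)
        ]
    return logsumexp(alpha)


def log_prob(y, state, trans):
    """log P(y | x) for one label sequence."""
    return sequence_score(y, state, trans) - log_partition(state, trans)


# Toy example: 3 positions, 2 labels, arbitrary scores.
state = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.5]]
trans = [[0.6, -0.4], [-0.2, 0.8]]

# Sanity check: probabilities of all label sequences sum to one.
total = sum(
    math.exp(log_prob(y, state, trans)) for y in product(range(2), repeat=3)
)
print(round(total, 6))
```

The forward recursion gives the normaliser without enumerating every sequence; the enumeration at the end is only a check on the toy example.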

CRF Integrator – Training Issues
• Required labels for CRF training
– Phone: y
– Phonological features: x
• Oracle-data trained CRF (OT CRF): phone labels → mapping phones → phonological features → phonological features → CRF
• Detected-data trained CRF (DT CRF): speech → MLP detectors → phonological features (with errors), together with the phone labels → CRF
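The difference between the two kinds of training data can be sketched as follows. The mapping table is a hypothetical three-phone, three-feature toy, not SPE_14, GP_11 or MV_29, and the detection error is injected by hand rather than produced by an MLP.

```python
# Hypothetical toy table (illustration only): phone -> binary features.
FEATURES = ("vocalic", "nasal", "voiced")
PHONE_TO_FEATURES = {
    "aa": (1, 0, 1),
    "m": (0, 1, 1),
    "s": (0, 0, 0),
}


def oracle_features(phone_labels):
    """Map frame-level phone labels to error-free feature vectors."""
    return [PHONE_TO_FEATURES[p] for p in phone_labels]


def training_pairs(phone_labels, features):
    """Pair each frame's feature vector x with its phone label y."""
    if len(phone_labels) != len(features):
        raise ValueError("labels and features must have the same length")
    return list(zip(features, phone_labels))


phones = ["s", "s", "aa", "aa", "m"]

# Oracle data: x is derived from y through the mapping table.
oracle = training_pairs(phones, oracle_features(phones))

# Detected data: x would come from the detectors; one frame is flipped
# by hand here to mimic a detection error.
detected_x = oracle_features(phones)
detected_x[2] = (1, 1, 1)
detected = training_pairs(phones, detected_x)

print(oracle[2], detected[2])
```

In both cases the phone labels y are the same; only the feature side x changes.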

Experiments
• Corpus: TIMIT
– No SA1, SA2
– Training set (3296 utts), Dev set (400 utts)
– Test set (1344 utts)
• Phone set: TIMIT61
• Evaluation: CMU/MIT 39
• Baseline
– CI-HMM
• Toolkits
– Nico Toolkit (for MLP), CRF++ (for CRF)
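Phone-level scoring after folding the label set can be sketched as below. Two assumptions apply: `FOLD` lists only a few entries of the commonly used 61-to-39 folding (the full table is not reproduced here), and the alignment uses unit edit costs, which may differ from the scoring tool used for the reported results.

```python
# Partial folding table (assumed excerpt of the usual 61 -> 39 mapping).
FOLD = {"h#": "sil", "kcl": "sil", "tcl": "sil", "ix": "ih"}


def fold(phones):
    """Replace each phone by its folded class, if one is listed."""
    return [FOLD.get(p, p) for p in phones]


def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total edits, S, D, I) for ref[:i] against hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cost[i - 1][j - 1]
            diag = (t, s, d, k) if ref[i - 1] == hyp[j - 1] else (t + 1, s + 1, d, k)
            t, s, d, k = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, k)
            t, s, d, k = cost[i][j - 1]
            insertion = (t + 1, s, d, k + 1)
            cost[i][j] = min(diag, deletion, insertion)
    return cost[n][m][1:]


def phone_scores(ref, hyp):
    """Percent correct (N-S-D)/N and percent accuracy (N-S-D-I)/N."""
    s, d, i = align_counts(ref, hyp)
    n = len(ref)
    return 100.0 * (n - s - d) / n, 100.0 * (n - s - d - i) / n


ref = fold(["h#", "s", "ix", "n", "h#"])
hyp = fold(["h#", "s", "ih", "m", "n", "h#"])
correct, accuracy = phone_scores(ref, hyp)
print(correct, accuracy)
```

Accuracy penalises insertions while the correct rate does not, so the two can differ noticeably for a recognizer that over-generates phones.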

Results (1)
• Model: OT CRF; Test: OD features
• Model: OT/DT CRF; Test: DD features

Results (2) System Fusion

System Fusion with CRF
• Input X (phone sequences): SPE sys., MV sys., GP sys., HMM sys. (…, xi-1, xi, xi+1, …)
• Output Y: combined results (phone) (…, yi-1, yi, …)
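One way to prepare such fusion data for CRF++ is sketched below. It relies on the CRF++ data convention of one token per line, whitespace-separated columns with the answer label last, and a blank line between sequences. Treating each frame as a token, and the four example label sequences, are assumptions made for illustration.

```python
def to_crfpp_rows(system_outputs, reference=None):
    """Format one utterance for CRF++.

    One frame per line, one column per system, reference phone (if any)
    in the last column, and a blank line to end the sequence.
    """
    n_frames = len(system_outputs[0])
    if any(len(s) != n_frames for s in system_outputs):
        raise ValueError("all systems must label the same number of frames")
    if reference is not None and len(reference) != n_frames:
        raise ValueError("reference length does not match the system outputs")
    rows = []
    for t in range(n_frames):
        cols = [s[t] for s in system_outputs]
        if reference is not None:
            cols.append(reference[t])
        rows.append("\t".join(cols))
    return "\n".join(rows) + "\n\n"


# Hypothetical frame-level phone outputs of the four systems.
spe = ["s", "s", "ih", "n"]
mv = ["s", "z", "ih", "n"]
gp = ["z", "s", "ih", "m"]
hmm = ["s", "s", "ih", "n"]
ref = ["s", "s", "ih", "n"]

block = to_crfpp_rows([spe, mv, gp, hmm], ref)
print(block, end="")
```

For test data the reference column can be omitted by leaving `reference` as `None`.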

Two Types of AFDT Imperfection
• Phone: h# n eh ow kcl k w eh ae eh s tcl t ix n
• AF tiers shown: AF(A), AF(A’)
• Imperfections: AF asynchrony, AFDT errors

CRF Training (1)
• Oracle Data Training: phone y → mapping table → AFs x
• Detected Data Training: AFDT → AFs x (with detected errors), paired with phone y

CRF Training (2)
• Aligned Data Training: AF sequence from AFDT → AFs x, aligned in time (t) with phone y

Results (3)
• Accuracy drops by 27.97% when AF asynchrony is introduced
• Detection errors cause a further 7.99% accuracy drop

AF Asynchrony Compensation
• AF asynchrony is caused by context variation
• We can reduce AF asynchrony by letting our systems learn context variation directly
– Long-term information
[Figure: 23-dim Mel, 310 ms → windows + DCTs → left-context MLP and right-context MLP (72 dim each) → 144 dim → MLP → 72 dim]
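The "windows + DCTs" step for the left and right context can be sketched roughly as follows. Several things are assumed because the slide does not state them: a 10 ms frame shift (so 310 ms is 31 frames), a shared centre frame between the two halves, a full Hamming window per half, and 6 DCT coefficients per band. The resulting dimensions are therefore illustrative and do not reproduce the 72-dim figures on the slide.

```python
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def dct2(x, n_coef):
    """First n_coef coefficients of the unnormalised DCT-II of x."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n_coef)
    ]


def split_context_features(bands, center, half=15, n_coef=6):
    """Encode the left and right temporal context of one centre frame.

    bands: list of frames, each a list of Mel band energies.
    Returns (left, right) vectors of len(bands[0]) * n_coef values each.
    """
    if center - half < 0 or center + half >= len(bands):
        raise ValueError("not enough context around the centre frame")
    left_frames = bands[center - half : center + 1]
    right_frames = bands[center : center + half + 1]
    win = hamming(half + 1)

    def encode(frames):
        out = []
        for b in range(len(frames[0])):
            # Temporal trajectory of one band, windowed then compressed.
            traj = [f[b] * w for f, w in zip(frames, win)]
            out.extend(dct2(traj, n_coef))
        return out

    return encode(left_frames), encode(right_frames)


# 31 dummy frames of 23 Mel band energies.
bands = [[1.0] * 23 for _ in range(31)]
left, right = split_context_features(bands, center=15)
print(len(left), len(right))
```

Each half would then feed its own context MLP, and the two MLP outputs would be concatenated as input to the merging MLP.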

Results (4)

Conclusions
• A well-designed phonological feature system is important
• AF asynchrony minimization training and AF-phone synchronization could also be investigated
• Oracle Trained CRF is able to retrieve more phonological information from speech
• High phone correction rate (but sensitive to detection error)
• Helpful for combination
• Detection-Based ASR is promising
• A front-end detector is a major issue

AF and Phone Alignment Using AFDT
[Figure: phone sequence and AF sequence aligned along the time axis t]
