Automatic Detection-based Phone Recognition on TIMIT
Based on Chen and Wang in ISCSLP’08 and Interspeech’09
Hung-Shin Lee (李鴻欣)
12 July, 2011 @ IIS, Academia Sinica
Detection-Based ASR
• Human SR: Knowledge Detection → Integration → Knowledge (Higher Level)
• DB ASR: Detectors → Integrator → Results
• Detectors: phonological attr., prosodic attr., acoustic attr., …
• Integrator: HMM, CRF, …
• Results: phone, syllable, word, sentence, semantic info, …
Phonological Feature Detection (1)
• Input: 13 MFCCs per frame over a 9-frame window (frames i-4 … i … i+4)
• Detectors: MLP (input layer, hidden layer), time-delay recurrent
• Output: posterior probability per feature, quantized to 0/1
• Feature sets: SPE_14, GP_11
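The input and output sides of the detectors on this slide can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: `stack_context` and `quantize` are hypothetical helper names, stacking the 9 frames into one vector is a simplification of a time-delay network, edge frames are padded by repetition, and the 0.5 threshold is an assumption (the slide only says the posteriors are quantized to 0/1).

```python
import numpy as np

def stack_context(mfcc, left=4, right=4):
    """Stack frames i-4 .. i+4 into one input vector per frame.

    mfcc: array of shape (T, 13). Edge frames are padded by repeating
    the first/last frame (an assumption; the slide does not say).
    Returns an array of shape (T, 13 * 9).
    """
    T = mfcc.shape[0]
    padded = np.pad(mfcc, ((left, right), (0, 0)), mode="edge")
    # Block k holds frame i + k - left for every frame i.
    return np.hstack([padded[k:k + T] for k in range(left + right + 1)])

def quantize(posteriors, threshold=0.5):
    """Turn detector posteriors into binary feature values (0 or 1)."""
    return (np.asarray(posteriors) >= threshold).astype(int)

mfcc = np.random.default_rng(0).normal(size=(20, 13))  # dummy MFCCs
x = stack_context(mfcc)
print(x.shape)  # (20, 117)
binary = quantize([[0.9, 0.2], [0.4, 0.5]])
```

Each row of `x` is what one frame's detector would see; `binary` is the kind of 0/1 vector that is passed on to the integrator.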
Phonological Feature Detection (2)
• Input: 13 MFCCs per frame over a 9-frame window (frames i-4 … i … i+4)
• Detectors: one time-delay MLP per MV feature, 6 MV features in total (e.g. Centrality, Front-Back, Roundness)
• Output: one value selected per feature (one-hot), giving the MV_29 feature set
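Assuming MV denotes multi-valued features, with one MLP per feature and one output unit per value, the concatenated one-hot vector can be built as sketched below. The helper name `one_hot_groups` and the group sizes (4, 3) in the usage lines are illustrative; the slide only gives the totals of 6 features and 29 values.

```python
import numpy as np

def one_hot_groups(group_posteriors):
    """Concatenate one-hot decisions from several multi-valued detectors.

    group_posteriors: list of arrays, each of shape (T, n_values) for
    one feature. For every frame the most probable value is set to 1.
    """
    parts = []
    for p in group_posteriors:
        p = np.asarray(p)
        hot = np.zeros(p.shape, dtype=int)
        hot[np.arange(p.shape[0]), p.argmax(axis=1)] = 1
        parts.append(hot)
    return np.hstack(parts)

# Two toy detectors with 4 and 3 values, one frame each.
centrality = [[0.1, 0.7, 0.1, 0.1]]
front_back = [[0.6, 0.3, 0.1]]
mv = one_hot_groups([centrality, front_back])
```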
Conditional Random Field (CRF) Integrator
• General Chain CRF
• λj, μk: feature function weight parameters
• Feature functions: state feature function, transition feature function
• Output (phone) Y: …, yi-1, yi, …
• Input (phonological features) X: …, xi-1, xi, xi+1, …
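For reference, a linear-chain CRF is usually written as below. The equation on the original slide did not survive extraction, so this is the standard textbook form rather than a copy of it. Pairing λj with the state feature functions (written s_j here) and μk with the transition feature functions (written t_k here) is an assumption based on the order in which the slide lists them.

```latex
P(Y \mid X) = \frac{1}{Z(X)}
  \exp\!\left( \sum_{i} \Big[ \sum_{j} \lambda_j \, s_j(y_i, X, i)
  + \sum_{k} \mu_k \, t_k(y_{i-1}, y_i, X, i) \Big] \right),
\qquad
Z(X) = \sum_{Y'} \exp\!\left( \sum_{i} \Big[ \sum_{j} \lambda_j \, s_j(y'_i, X, i)
  + \sum_{k} \mu_k \, t_k(y'_{i-1}, y'_i, X, i) \Big] \right)
```

Here Z(X) normalizes over all candidate phone sequences Y' for the same input X.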
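CRF Integrator – Training Issues
• Required labels for CRF training
  • Phone: y
  • Phonological features: x
• Oracle-data trained CRF (OT CRF): the phone labels of the training data are converted by a phone → phonological feature mapping, and the CRF is trained on the phone labels and the mapped features
• Detected-data trained CRF (DT CRF): the MLP detectors run on the speech and produce phonological features (with errors), and the CRF is trained on the phone labels and the detected features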
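The oracle side of this setup amounts to a table lookup. The sketch below uses a hypothetical three-feature, three-phone table purely for illustration; it is not the SPE_14, GP_11, or MV_29 mapping used in the talk.

```python
# Hypothetical toy mapping table: phone -> binary phonological features.
# Feature names and phone coverage are illustrative only.
FEATURES = ("vocalic", "nasal", "voiced")
PHONE_TO_FEATURES = {
    "aa": (1, 0, 1),
    "n":  (0, 1, 1),
    "s":  (0, 0, 0),
}

def oracle_features(phone_labels):
    """Map frame-level phone labels to error-free (oracle) features."""
    return [PHONE_TO_FEATURES[p] for p in phone_labels]

frames = ["s", "aa", "aa", "n"]                  # phone labels y
pairs = list(zip(oracle_features(frames), frames))  # (x, y) training pairs
```

For the detected-data variant, the x side of each pair would instead come from the MLP detectors, so it carries their errors into training.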
Experiments
• Corpus: TIMIT
  • No SA1, SA2
  • Training set (3296 utts), Dev set (400 utts)
  • Test set (1344 utts)
• Phone set: TIMIT61
• Evaluation: CMU/MIT 39
• Baseline
  • CI-HMM
• Toolkits
  • NICO Toolkit (for MLP), CRF++ (for CRF)
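Scoring on the 39-phone set means folding the 61 TIMIT labels before comparing reference and hypothesis. The sketch below shows the idea with a small subset of the commonly used folding table and a plain edit-distance accuracy, (N - S - D - I) / N. The function names are hypothetical, and the talk's actual scoring tool is not stated, so results from a real scorer may differ in alignment details.

```python
# Subset of the commonly used 61-to-39 folding; the full table is larger.
FOLD = {"ao": "aa", "ix": "ih", "ax": "ah", "zh": "sh", "ux": "uw",
        "tcl": "sil", "kcl": "sil", "h#": "sil"}

def fold(phones):
    """Fold labels to the reduced set; the glottal stop 'q' is dropped."""
    return [FOLD.get(p, p) for p in phones if p != "q"]

def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def phone_accuracy(ref, hyp):
    """Accuracy = (N - S - D - I) / N, with N the reference length."""
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)

ref = fold("h# n eh ow kcl k".split())
hyp = fold("h# n eh kcl k".split())   # one phone missing
acc = phone_accuracy(ref, hyp)
```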
Results (1)
• Model: OT CRF; Test: OD Features
• Model: OT/DT CRF; Test: DD Features
Results (2) System Fusion
System Fusion with CRF
• Input X (xi-1, xi, xi+1): phone sequences from the SPE, MV, GP, and HMM systems
• Output Y (yi-1, yi): combined results (phone)
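Since CRF++ reads whitespace-separated columns with the label in the last column and a blank line between sequences, the fusion input could be laid out as sketched below. The column order (SPE, MV, GP, HMM, reference phone), the helper name `to_crfpp`, and the assumption that the four systems' outputs are already aligned to the same time steps are all illustrative; the talk does not describe its actual data files.

```python
def to_crfpp(utterances):
    """Format fused system outputs as CRF++ training text.

    utterances: list of utterances, each a list of
    (spe, mv, gp, hmm, ref) phone tuples, one per time step.
    Sequences are separated by a blank line.
    """
    blocks = ["\n".join(" ".join(row) for row in utt) for utt in utterances]
    return "\n\n".join(blocks) + "\n"

utt1 = [("s", "s", "z", "s", "s"), ("ih", "ih", "ih", "eh", "ih")]
utt2 = [("n", "m", "n", "n", "n")]
text = to_crfpp([utt1, utt2])
```

Each row gives the CRF four observations for one time step, so it can learn which system to trust for which phone.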
Two Types of AFDT Imperfection
• AF asynchrony
• AFDT errors
[Figure: phone sequence "h# n eh ow kcl k w eh ae eh s tcl t ix n" shown with AF(A) and AF(A’) tracks]
CRF Training (1)
• Oracle Data Training: Phone y → Mapping Table → AFs x
• Detected Data Training: Phone y; AFs x from AFDT (with detected errors)
CRF Training (2)
• Aligned Data Training
[Diagram: Phone y over time t, AF Sequence, AFDT, AFs x]
Results (3)
• Accuracy drops by 27.97% when AF asynchrony is introduced
• Detection errors cause a further 7.99% drop in accuracy
AF Asynchrony Compensation
• AF asynchrony is caused by context variation
• We can reduce AF asynchrony by letting our systems learn context variation directly
  – Long-term information
[Diagram: 23-dim Mel features over 310 ms → Windows + DCTs → Left Context MLP and Right Context MLP → 144 dim → MLP; dimension labels in the figure: 72 dim (three times), 144 dim]
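The "Windows + DCTs" step feeding the left-context and right-context MLPs can be sketched as follows. Several details are assumptions not given on the slide: a 10 ms frame shift (so 310 ms is 31 frames), a Hamming window, 11 DCT coefficients per band, and sharing the centre frame between the two halves. The sketch does not try to reproduce the 72- and 144-dimensional vectors labelled in the figure, and the function names are hypothetical.

```python
import numpy as np

def dct_matrix(n_coef, n_points):
    """Unnormalized DCT-II basis of shape (n_coef, n_points)."""
    k = np.arange(n_coef)[:, None]
    n = np.arange(n_points)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_points))

def split_context(mel, centre, half=15, n_coef=11):
    """Return (left, right) long-term context vectors for one frame.

    mel: array of shape (T, n_bands); requires half <= centre < T - half.
    The 2*half+1 frame block is windowed, split at the centre frame,
    and each band trajectory is compressed with a DCT over time.
    """
    window = np.hamming(2 * half + 1)
    block = mel[centre - half:centre + half + 1] * window[:, None]
    left, right = block[:half + 1], block[half:]
    basis = dct_matrix(n_coef, half + 1)
    return (basis @ left).ravel(), (basis @ right).ravel()

mel = np.random.default_rng(0).normal(size=(40, 23))  # dummy Mel features
left_vec, right_vec = split_context(mel, centre=20)
print(left_vec.shape, right_vec.shape)  # (253,) (253,)
```

Each half would go to its own context MLP, and the two MLP outputs would then be concatenated for the final MLP, as in the diagram.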
Conclusions
• A well-designed phonological feature system is important
• AF asynchrony minimization training and AF-phone synchronization could also be investigated
• Oracle Trained CRF is able to retrieve more phonological information from speech
• High phone correction rate (but sensitive to detection error)
• Helpful for combination
• Detection-Based ASR is promising
• A front-end detector is a major issue
AF and Phone Alignment Using AFDT
[Figure: phone sequence and AF sequence aligned along time t]