Voicing Features • Horacio Franco, Martin Graciarena • Andreas Stolcke, Dimitra Vergyri, Jing Zheng • STAR Lab, SRI International
Phonetically Motivated Features • Problem: • Cepstral coefficients fail to capture many discriminative cues. • The front-end is optimized for traditional Mel cepstral features. • Front-end parameters are a compromise solution across all phones.
Phonetically Motivated Features • Proposal: • Enrich the Mel cepstral feature representation with phonetically motivated features from independent front-ends. • Optimize each specific front-end to improve discrimination. • Robust broad-class phonetic features provide “anchor points” in acoustic-phonetic decoding. • A general framework for multiple phonetic features; the first approach: voicing features.
Voicing Features • Voicing feature algorithms (a sketch in Python follows below): • Normalized peak autocorrelation (PA): for a time frame x with autocorrelation R(τ), PA = max_τ R(τ) / R(0), with the max computed over lags in the pitch region 80 Hz to 450 Hz. • Entropy of the high-order cepstrum (EC) and of the linear spectrum (ES): if Y is the cepstrum (or spectrum) restricted to the pitch region, normalized so that Σ_i y_i = 1, and H(Y) = -Σ_i y_i log y_i is its entropy, then H(Y) is the feature; voiced frames concentrate energy in a sharp pitch peak, giving low entropy. • Entropy computed over the pitch region 80 Hz to 450 Hz.
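As an illustration, here is a minimal per-frame sketch of the two features in Python (NumPy). The exact windowing, normalization, and entropy scaling are not given on the slide, so those details are assumptions; note that the structure (one FFT, then one IFFT for the autocorrelation and one for the cepstrum) matches the implementation described later.

    import numpy as np

    def voicing_features(frame, sr, f_lo=80.0, f_hi=450.0):
        """Per-frame voicing features: normalized peak autocorrelation (PA)
        and entropy of the high-order cepstrum (EC). Details beyond the
        slide's definitions are assumptions."""
        frame = frame * np.hamming(len(frame))
        n = len(frame)
        nfft = 2 * n  # zero-pad so the circular IFFT approximates linear autocorrelation

        spec = np.fft.rfft(frame, nfft)  # the one FFT per frame

        # Autocorrelation via Wiener-Khinchin: IFFT of the power spectrum.
        ac = np.fft.irfft(np.abs(spec) ** 2, nfft)[:n]

        # Pitch search region 80-450 Hz expressed as lags (samples).
        lag_lo, lag_hi = int(sr / f_hi), int(sr / f_lo)
        pa = np.max(ac[lag_lo:lag_hi + 1]) / (ac[0] + 1e-12)

        # Real cepstrum: IFFT of the log-magnitude spectrum.
        cep = np.fft.irfft(np.log(np.abs(spec) + 1e-12), nfft)[:n]

        # Normalize the pitch-region quefrencies to a distribution and take
        # its entropy: voiced frames have a sharp pitch peak, hence low entropy.
        y = np.abs(cep[lag_lo:lag_hi + 1])
        y /= y.sum() + 1e-12
        ec = -np.sum(y * np.log(y + 1e-12))

        return pa, ec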
Voicing Features • Correlation with template and DP alignment [Arcienega, ICSLP’02]: • Compute the Discrete Logarithm Fourier Transform (DLFT) of the speech signal over the pitch frequency band; on a logarithmic frequency axis, a change in pitch shifts the harmonic pattern instead of rescaling it. • The template is the DLFT of an impulse train. • The correlation for frame j is the correlation of that frame’s DLFT with the (shifted) template; a DP alignment across frames gives the optimal correlation. • Max computed over the pitch region 80 Hz to 450 Hz.
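A rough sketch of the idea, not Arcienega's exact formulation: sample the Fourier transform on a log-frequency grid, where a pitch change becomes a shift of the harmonic comb, and correlate against the DLFT of an impulse train. The grid, the reference pitch, and the shift search are assumptions, and the DP alignment across frames is omitted.

    import numpy as np

    def dlft_mag(x, sr, f_lo=80.0, f_hi=3000.0, n_bins=256):
        """|Fourier transform| of x sampled on a logarithmic frequency grid
        (explicit DTFT samples; slow but clear)."""
        freqs = np.geomspace(f_lo, f_hi, n_bins)
        n = np.arange(len(x))
        basis = np.exp(-2j * np.pi * np.outer(freqs, n) / sr)
        return np.abs(basis @ x)

    def dlft_voicing(frame, sr, f_ref=100.0):
        """Max normalized correlation of the frame's DLFT with the DLFT of
        an impulse train at reference pitch f_ref. In the actual feature,
        the shift search is restricted to shifts whose implied pitch lies
        in 80-450 Hz, and a DP pass smooths the per-frame maxima."""
        frame = frame * np.hamming(len(frame))
        sig = dlft_mag(frame, sr)
        train = np.zeros(len(frame))
        train[::int(round(sr / f_ref))] = 1.0  # impulse-train template
        tmpl = dlft_mag(train, sr)
        sig = (sig - sig.mean()) / (sig.std() + 1e-12)
        tmpl = (tmpl - tmpl.mean()) / (tmpl.std() + 1e-12)
        return np.correlate(sig, tmpl, mode="full").max() / len(sig)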
Voicing Features • Preliminary exploration of voicing features: • Best feature combination: peak autocorrelation + entropy of cepstrum. • The autocorrelation and entropy features behave complementarily at low and high pitch: • Low pitch: time periods are well separated, so the autocorrelation peak is well defined. • High pitch: harmonics are well separated, so the cepstrum is well defined.
Voicing Features • [Figure: voicing feature tracks over time, aligned with the phone sequence “w er k ay n d ax f s: aw th ax v dh ey ax r”]
Voicing Features • Integration of voicing features: • 1 - Juxtaposing voicing features: • Append the two voicing features to the traditional Mel cepstral feature vector (MFCC) plus delta and delta-delta features (MFCC+D+DD); see the sketch below. • Voicing feature front-end: use the same frame rate as the MFCC front-end and optimize the temporal window duration.
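A minimal sketch of the juxtaposition, assuming mfcc is a T x 39 MFCC+D+DD matrix and pa, ec are the two length-T voicing feature tracks computed at the same 10 ms frame rate:

    import numpy as np

    def juxtapose(mfcc, pa, ec):
        # Same frame rate by construction, so a column-wise append works.
        assert len(mfcc) == len(pa) == len(ec)
        return np.hstack([mfcc, pa[:, None], ec[:, None]])  # T x 41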
Voicing Features • Train on the small Switchboard database (64 hours); test on Dev 2001; WER reported for both sexes. • Features: MFCC+D+DD, 25.6 ms frames every 10 ms. • VTLN and per-speaker mean and variance normalization. Genone acoustic model: non-crossword, MLE-trained, gender-dependent. Bigram LM.
Voicing Features • 2 - Voiced/unvoiced posterior features: • Use the posterior voicing probability as a feature, computed from a 2-state HMM (a sketch follows below). The juxtaposed feature dimension is 40. • Similar setup as before; males-only results. • Soft V/UV transitions may not be captured, because the posterior feature behaves much like a binary feature.
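The slide does not spell out the posterior computation; below is a minimal sketch using standard forward-backward smoothing over a 2-state voiced/unvoiced HMM, assuming per-frame state log-likelihoods have already been derived from the raw voicing features. The sticky transition matrix and the state order are assumptions.

    import numpy as np

    def vuv_posteriors(loglik, trans=np.array([[0.95, 0.05],
                                               [0.05, 0.95]])):
        """Forward-backward posteriors for a 2-state HMM (state 0 = unvoiced,
        state 1 = voiced, by assumption). loglik: T x 2 per-frame state
        log-likelihoods. Returns the T x 1 posterior of the voiced state."""
        T = len(loglik)
        lik = np.exp(loglik - loglik.max(axis=1, keepdims=True))
        alpha = np.zeros((T, 2))
        beta = np.ones((T, 2))
        alpha[0] = 0.5 * lik[0]            # uniform initial state prior
        alpha[0] /= alpha[0].sum()
        for t in range(1, T):              # normalized forward pass
            alpha[t] = lik[t] * (alpha[t - 1] @ trans)
            alpha[t] /= alpha[t].sum()
        for t in range(T - 2, -1, -1):     # normalized backward pass
            beta[t] = trans @ (lik[t + 1] * beta[t + 1])
            beta[t] /= beta[t].sum()
        post = alpha * beta                # per-frame smoothed posterior
        post /= post.sum(axis=1, keepdims=True)
        return post[:, 1:2]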
Voicing Features • 3 - Window of voicing features + HLDA: • Juxtapose the MFCC features with a window of voicing features around the current frame (see the sketch below). • Apply dimensionality reduction with HLDA; the final feature has 39 dimensions. • Same setup as before, MFCC+D+DD+3rd diffs. Both sexes. • The baseline is 1.5% abs. better than before; voicing improves a further 1% abs.
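A sketch of the windowing, assuming a ±2 frame window over the 2-dim voicing features (2 × 5 = 10 dimensions, matching the “10 voicing features” of the best configuration below) and a precomputed HLDA projection matrix A; HLDA estimation itself is omitted.

    import numpy as np

    def window_voicing(v, k=2):
        """Stack a +/-k frame window of the T x 2 voicing features around
        each frame (edges padded by repetition): T x (2*(2k+1)) output."""
        T = len(v)
        idx = np.clip(np.arange(T)[:, None] + np.arange(-k, k + 1), 0, T - 1)
        return v[idx].reshape(T, -1)

    def hlda_features(mfcc, v, A, k=2):
        """Juxtapose MFCCs with the voicing window, then project with the
        (assumed precomputed) 39 x D HLDA matrix A down to 39 dims."""
        full = np.hstack([mfcc, window_voicing(v, k)])  # T x D
        return full @ A.T                               # T x 39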
Voicing Features • 4 - Delta of voicing features + HLDA: • Use delta and delta-delta features (see the delta sketch below) instead of a window of voicing features; apply HLDA to the juxtaposed feature. • Same setup as before, MFCC+D+DD+3rd diffs. Males only. • This variant did not help; the likely reason is that variability in the voicing features produces noisy deltas. • The HLDA weighting of the “window of voicing features” is similar to an average. • ---------------------------------------------------------------------------------- • The best overall configuration was MFCC+D+DD+3rd diffs. plus 10 voicing features + HLDA.
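For reference, the deltas here are the standard regression deltas; a minimal sketch, where the window size k is an assumption:

    import numpy as np

    def deltas(feat, k=2):
        """Regression deltas over a +/-k frame window, edges padded by
        repetition. Delta-deltas are deltas of the deltas."""
        T = len(feat)
        num = np.zeros_like(feat, dtype=float)
        denom = 2.0 * sum(i * i for i in range(1, k + 1))
        for i in range(1, k + 1):
            fwd = feat[np.clip(np.arange(T) + i, 0, T - 1)]
            bwd = feat[np.clip(np.arange(T) - i, 0, T - 1)]
            num += i * (fwd - bwd)
        return num / denom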
Voicing Features • Voicing features in the SRI CTS Eval Sept ’03 system: • Adaptation of MMIE cross-word models with/without voicing features, using the best voicing feature configuration. • Trained on the full SWBD+CTRANS data; tested on EVAL’02. • Features: MFCC+D+DD+3rd diffs.+HLDA. • Adaptation: 9 full-matrix MLLR transforms. • Adaptation hypotheses from an MLE non-cross-word model, PLP front-end with voicing features.
Voicing Features • Hypothesis Examples: • REF: OH REALLY WHAT WHAT KIND OF PAPER • HYP BASELINE: OH REALLY WHICH WAS KIND OF PAPER • HYP VOICING: OH REALLY WHAT WHAT KIND OF PAPER • REF: YOU KNOW HE S JUST SO UNHAPPY • HYP BASELINE: YOU KNOW YOU JUST I WANT HAPPY • HYP VOICING: YOU KNOW HE S JUST SO I WANT HAPPY
Voicing Features • Error analysis: • In one experiment, 54% of speakers got a WER reduction (some up to 4% abs.); the remaining 46% showed a small WER increase. • A more detailed study of speaker-dependent performance is still needed. • Implementation: • Implemented a voicing feature engine in the DECIPHER system. • Fast computation: one FFT and two IFFTs per frame yield both voicing features.
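As a usage sketch, the per-frame routine from the algorithm slide can be run at the experiments’ frame settings (25.6 ms frames every 10 ms); voicing_features is the earlier sketch, which already has the one-FFT/two-IFFT structure per frame.

    import numpy as np

    sr = 8000                                  # Switchboard-style 8 kHz audio
    flen, fstep = int(0.0256 * sr), int(0.010 * sr)
    x = np.random.randn(sr)                    # stand-in for 1 s of speech
    feats = np.array([voicing_features(x[i:i + flen], sr)
                      for i in range(0, len(x) - flen + 1, fstep)])
    print(feats.shape)                         # (n_frames, 2): PA and EC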
Voicing Features • Conclusions: • Explored how to represent and integrate the voicing features for best performance. • Achieved a 1% abs. (~2% rel.) gain in the first pass (small training set) and a >0.5% abs. (2% rel.) gain in higher rescoring passes of the DECIPHER LVCSR system (full training set). • Future work: • Further explore feature combination and selection. • Develop more reliable voicing features; the current features do not always reflect actual voicing activity. • Develop other phonetically derived features (vowels/consonants, occlusion, nasality, etc.).