Conditional Random Fields for Automatic Speech Recognition Jeremy Morris 05/12/2010
Motivation • What is the purpose of Automatic Speech Recognition? • Take an acoustic speech signal … • … and extract higher-level information (e.g. words) from it “speech”
Motivation • How do we extract this higher-level information from the speech signal? • First extract lower-level information • Use it to build models of phones and words “speech” / s p iy ch /
Motivation • State-of-the-art ASR takes a top-down approach to this problem • Extract acoustic features from the signal • Model a process that generates these features • Use these models to find the word sequence that best fits the features “speech” / s p iy ch /
Motivation • A bottom-up approach • Look for evidence of speech in the signal • Phones, phonological features • Combine this evidence to find the most probable sequence of words in the signal voicing? burst? frication? “speech” / s p iy ch /
Motivation • How can we combine this evidence? • Conditional Random Fields (CRFs) • Discriminative, probabilistic sequence model • Models the conditional probability of a sequence given evidence voicing? burst? frication? “speech” / s p iy ch /
Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions
CRF Models • Conditional Random Fields (CRFs) • Discriminative probabilistic sequence model • Directly defines a posterior probability P(Y|X) of a label sequence Y given evidence X
CRF Models • The structure of the evidence can be arbitrary • No assumptions of independence • States can be influenced by any evidence • Evidence can influence transitions between states
CRF Models • Evidence is incorporated via two kinds of feature functions: state feature functions and transition feature functions
CRF Models • The CRF takes the form of an exponential model over weighted state and transition feature functions • Weights are trained via gradient descent to maximize the conditional likelihood
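For reference, the standard linear-chain form (a sketch in the usual notation: state feature functions s_i with weights λ_i, transition feature functions f_j with weights μ_j, and Z(X) the normalizing constant):

```latex
P(Y \mid X) = \frac{1}{Z(X)} \exp\!\left( \sum_{t}\sum_{i} \lambda_i\, s_i(y_t, X, t) \;+\; \sum_{t}\sum_{j} \mu_j\, f_j(y_{t-1}, y_t, X, t) \right)
```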
Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions
Phone Recognition • What evidence do we have to combine? • MLP ANN trained to estimate frame-level posteriors for phonological features • MLP ANN trained to estimate frame-level posteriors for phone classes P(voicing|X) P(burst|X) P(frication|X) … P( /ah/ | X) P( /t/ | X) P( /n/ | X) …
Phone Recognition • Use these MLP outputs to build state feature functions
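As a concrete illustration (a sketch of the usual construction, not necessarily the exact feature set used here), a state feature function can simply pass an MLP posterior through when the label matches:

```latex
s_{/t/}(y_t, X, t) =
\begin{cases}
P(/t/ \mid x_t) & \text{if } y_t = /t/ \\
0 & \text{otherwise}
\end{cases}
```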
Phone Recognition • Pilot task – phone recognition on TIMIT • ICSI Quicknet MLPs trained on TIMIT, used as inputs to the CRF models • Compared to Tandem and a standard PLP HMM baseline model • Output of ICSI Quicknet MLPs as inputs • Phone class attributes (61 outputs) • Phonological feature attributes (44 outputs)
Phone Recognition *Significantly (p<0.05) better than comparable Tandem system (Morris & Fosler-Lussier 08)
Phone Recognition • Moving forward: How do we make use of CRF classification for word recognition? • Attempt to fit CRFs into current state-of-the-art models for speech recognition? • Attempt to use CRFs directly? • Each approach has its benefits • Fitting CRFs into a standard framework lets us reuse existing code and ideas • A model that uses CRFs directly opens up new directions for investigation • Requires some rethinking of the standard model for ASR
Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions
HMM-CRF Word Recognition • Inspired by Tandem HMM systems • Uses ANN outputs as input features to an HMM • HMM-CRF system (Crandem) • Use a CRF to generate input features for the HMM • See if improved phone accuracy helps the system • Problem: CRFs estimate the probability of the entire sequence, not of individual frames “speech” / s p iy ch / PCA
HMM-CRF Word Recognition • One solution: Forward-Backward Algorithm • Used during CRF training to maximize the conditional likelihood • Provides an estimate of the posterior probability of a phone label given the input
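A minimal sketch of that computation (assuming the weighted feature sums have already been collapsed into per-frame state scores and a transition score matrix; function and argument names are illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def frame_posteriors(state_scores, trans_scores):
    """Per-frame label posteriors gamma[t, k] = P(y_t = k | X).
    state_scores: (T, K) summed, weighted state-feature scores (log domain).
    trans_scores: (K, K) summed, weighted transition-feature scores."""
    T, K = state_scores.shape
    alpha = np.empty((T, K))   # forward log-scores
    beta = np.zeros((T, K))    # backward log-scores
    alpha[0] = state_scores[0]
    for t in range(1, T):      # forward pass
        alpha[t] = state_scores[t] + logsumexp(alpha[t-1][:, None] + trans_scores, axis=0)
    for t in range(T - 2, -1, -1):   # backward pass
        beta[t] = logsumexp(trans_scores + (state_scores[t+1] + beta[t+1])[None, :], axis=1)
    log_z = logsumexp(alpha[-1])     # log partition function Z(X)
    return np.exp(alpha + beta - log_z)
```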
HMM-CRF Word Recognition • Original Tandem system “speech” / s p iy ch / PCA
HMM-CRF Word Recognition • Modified Tandem system (Crandem) Local Feature Calc. PCA “speech” / s p iy ch /
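The downstream feature processing mirrors Tandem (a sketch assuming the standard Tandem recipe of log-posteriors followed by PCA decorrelation applies; helper names are hypothetical):

```python
import numpy as np

def crandem_features(crf_posteriors, pca_basis, n_components=39):
    """Turn per-frame CRF posteriors (T, K) into decorrelated HMM input features.
    crf_posteriors: e.g. the output of the forward-backward sketch above.
    pca_basis: (K, K) eigenvectors estimated on training-set log-posteriors."""
    log_post = np.log(np.clip(crf_posteriors, 1e-10, None))  # avoid log(0)
    centered = log_post - log_post.mean(axis=0)              # remove per-utterance mean
    return centered @ pca_basis[:, :n_components]            # keep leading components
```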
HMM-CRF Word Recognition • Pilot task – phone recognition on TIMIT • Same ICSI Quicknet MLP outputs used as inputs • Crandem compared to Tandem, a standard PLP HMM baseline model, and to the original CRF • Evidence on transitions • This work also examines the effect of using the same MLP outputs as transition features for the CRF
HMM-CRF Word Recognition • Pilot Results 1 (Fosler-Lussier & Morris 08) *Significant (p<0.05) improvement at 0.6% difference between models
HMM-CRF Word Recognition • Pilot Results 2 (Fosler-Lussier & Morris 08) *Significant (p<0.05) improvement at 0.6% difference between models
HMM-CRF Word Recognition • Extension – Word recognition on WSJ0 • New MLPs and CRFs trained on the WSJ0 corpus of read speech • No phone-level assignments, only word transcripts • Initial alignments from HMM forced alignment of MFCC features • Compare Crandem baseline to Tandem and original MFCC baselines • WSJ0 5K word recognition task • Same bigram language model used for all systems
HMM-CRF Word Recognition • Results (Morris & Fosler-Lussier 09) *Significant (p≤0.05) improvement at roughly 0.9% difference between models
HMM-CRF Word Recognition *Significant (p≤0.05) improvement at roughly 0.06% difference between models
HMM-CRF Word Recognition Comparison of MLP activation vs. CRF activation
HMM-CRF Word Recognition Ranked average per-frame activation MLP vs. CRF
HMM-CRF Word Recognition • Insights from these experiments • CRF posteriors very different in flavor from MLP posteriors • Overconfident in the local decisions being made • Higher phone accuracy did not translate to lower WER • Further experiment to test this idea • Transform the posteriors by taking a root and renormalizing • Brings classes closer together • Achieved results not significantly different from baseline; no longer degraded with further epochs of training (though no improvement either)
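The flattening transform itself is simple (a sketch; the root r is a tuning parameter, not a value reported here):

```python
import numpy as np

def flatten_posteriors(posteriors, r=2.0):
    """Soften overconfident posteriors: take the r-th root of each (T, K)
    frame of class posteriors, then renormalize so each frame sums to 1."""
    rooted = posteriors ** (1.0 / r)
    return rooted / rooted.sum(axis=1, keepdims=True)
```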
Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions
CRF Word Recognition • Instead of feeding CRF outputs into an HMM … • … why not decode words directly off the CRF? “speech” / s p iy ch /
CRF Word Recognition • The standard model of ASR uses a likelihood-based acoustic model, combined with a lexicon model and a language model • CRFs instead provide a conditional acoustic model P(Φ|X)
CRF Word Recognition • The CRF decoder combines the CRF acoustic model, a phone penalty model, the lexicon model, and the language model
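One way to write the resulting decoder, consistent with the components above (a sketch: the CRF supplies P(Φ|X), and Bayes' rule introduces a phone prior term that the phone penalty model stands in for):

```latex
W^* = \operatorname*{argmax}_W P(W \mid X)
    \approx \operatorname*{argmax}_{W,\,\Phi}
      \underbrace{P(\Phi \mid X)}_{\text{CRF}}
      \;\frac{\overbrace{P(\Phi \mid W)}^{\text{lexicon}}\;
              \overbrace{P(W)}^{\text{language model}}}
             {\underbrace{P(\Phi)}_{\text{phone penalty}}}
```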
CRF Word Recognition • Models implemented using OpenFST • Viterbi beam search to find the best word sequence • Word recognition on WSJ0 • WSJ0 5K word recognition task • Same bigram language model used for all systems • Same MLPs as used for the HMM-CRF (Crandem) experiments • CRFs trained using a 3-state phone model instead of a 1-state model • Compare to Tandem and original MFCC baselines
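As a sketch of the decoding setup (assuming the OpenFst Python bindings, pywrapfst; the file names are hypothetical, and the real system uses its own beam search rather than an exact shortest path):

```python
import pywrapfst as fst

# Load the component transducers, built with compatible symbol tables:
# CRF-scored phone lattice, phone penalty, lexicon, and bigram grammar.
crf_lattice = fst.Fst.read("utterance_crf.fst")
penalty = fst.Fst.read("phone_penalty.fst")
lexicon = fst.Fst.read("lexicon.fst")
grammar = fst.Fst.read("bigram_lm.fst")

for f in (penalty, lexicon, grammar):
    f.arcsort(sort_type="ilabel")   # composition needs sorted input arcs

# Compose the pipeline, then take the best path (exact Viterbi here,
# where the actual system would prune with a beam for efficiency).
graph = fst.compose(fst.compose(fst.compose(crf_lattice, penalty), lexicon), grammar)
best = fst.shortestpath(graph)
```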
CRF Word Recognition • Results – Phone Classes only *Significant (p≤0.05) improvement at roughly 0.9% difference between models
CRF Word Recognition • Results – Phone & Phonological features *Significant (p≤0.05) improvement at roughly 0.9% difference between models
Outline • Motivation • CRF Models • Phone Recognition • HMM-CRF Word Recognition • CRF Word Recognition • Conclusions
Conclusions & Future Work • Designed and developed software for CRF training for ASR • Developed a system for word-level ASR using CRFs • Meets the baseline performance of an MLE-trained HMM system • Platform for further exploration