Conditional Random Fields for ASR Jeremy Morris 11/23/2009
Outline • Background • Maximum Entropy models and CRFs • CRF Example • ASR experiments with CRFs
Background • Conditional Random Fields (CRFs) • Discriminative probabilistic sequence model • Used successfully in various domains such as part of speech tagging and named entity recognition • Directly defines a posterior probability of a label sequence Y given an input observation sequence X - P(Y|X)
Background – Discriminative Models • Directly model the association between the observed features and labels for those features • e.g. neural networks, maximum entropy models • Attempt to model boundaries between competing classes • Probabilistic discriminative models • Give conditional probabilities instead of hard class decisions • Find the class y that maximizes P(y|x) for observed features x
Background – Discriminative Models • Contrast with generative models • e.g. GMMs, HMMs • Find the best model of the distribution to generate the observed features • Find the label y that maximizes the joint probability P(y,x) for observed features x • More parameters to model than discriminative models • More assumptions about feature independence required
Background – Sequential Models • Used to classify sequences of data • HMMs the most common example • Find the most probable sequence of class labels • Class labels depend not only on observed features, but on surrounding labels as well • Must determine transitions as well as state labels
Background – Sequential Models • Sample Sequence Model - HMM
Conditional Random Fields • A probabilistic, discriminative classification model for sequences • Based on the idea of Maximum Entropy Models (Logistic Regression models) expanded to sequences
Maximum Entropy Models • Probabilistic, discriminative classifiers • Compute the conditional probability of a class y given an observation x – P(y|x) • Build up this conditional probability using the principle of maximum entropy • In the absence of evidence, assume a uniform probability for any given class • As we gain evidence (e.g. through training data), modify the model such that it supports the evidence we have seen but keeps a uniform probability for unseen hypotheses
Maximum Entropy Example • Suppose we have a bin of candies, each with an associated label (A, B, C, or D) • Each candy has multiple colors in its wrapper • Each candy is assigned a label randomly based on some distribution over wrapper colors * Example inspired by Adam Berger’s Tutorial on Maximum Entropy
Maximum Entropy Example • For any candy with red in its wrapper pulled from the bin: • P(A|red)+P(B|red)+P(C|red)+P(D|red) = 1 • An infinite number of distributions fit this constraint • The distribution that fits the idea of maximum entropy is: • P(A|red)=0.25 • P(B|red)=0.25 • P(C|red)=0.25 • P(D|red)=0.25
Maximum Entropy Example • Now suppose we add some evidence to our model • We note that 80% of all candies with red in their wrapper are labeled either A or B • P(A|red) + P(B|red) = 0.8 • The updated model that reflects this would be: • P(A|red) = 0.4 • P(B|red) = 0.4 • P(C|red) = 0.1 • P(D|red) = 0.1 • As we make more observations and find more constraints, the model becomes more complex
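A short worked check of those numbers (my notation, not from the original slides): the maximum entropy distribution subject to the observed constraint solves

\[
\max_{p}\; -\sum_{y \in \{A,B,C,D\}} p(y \mid \text{red}) \log p(y \mid \text{red})
\quad \text{subject to} \quad
p(A \mid \text{red}) + p(B \mid \text{red}) = 0.8, \qquad
\sum_{y} p(y \mid \text{red}) = 1 .
\]

Entropy is maximized by spreading probability uniformly within each constrained group, giving p(A|red) = p(B|red) = 0.8/2 = 0.4 and p(C|red) = p(D|red) = 0.2/2 = 0.1.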
Maximum Entropy Models • “Evidence” is given to the MaxEnt model through the use of feature functions • Feature functions provide a numerical value given an observation • Weights on these feature functions determine how much a particular feature contributes to a choice of label • In the candy example, feature functions might be built around the existence or non-existence of a particular color in the wrapper • In NLP applications, feature functions are often built around words or spelling features in the text
Maximum Entropy Models • The maxent model for k competing classes (a reconstruction of the model equation appears below) • Each feature function s(x,y) is defined in terms of the input observation (x) and the associated label (y) • Each feature function has an associated weight (λ)
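The equation on the original slide is not preserved in this text version; the standard maxent (multinomial logistic regression) form, using the s and λ notation above, is

\[
P(y \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Big(\sum_i \lambda_i\, s_i(x, y)\Big),
\qquad
Z(x) \;=\; \sum_{y'=1}^{k} \exp\!\Big(\sum_i \lambda_i\, s_i(x, y')\Big).
\]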
Maximum Entropy – Feature Funcs. • Feature functions for a maxent model associate a label and an observation • For the candy example, feature functions might be based on labels and wrapper colors • In an NLP application, feature functions might be based on labels (e.g. POS tags) and words in the text
Maximum Entropy – Feature Funcs. • Example: MaxEnt POS tagging • Associates a tag (NOUN) with a word in the text (“dog”) • This function evaluates to 1 only when both occur in combination • At training time, both tag and word are known • At evaluation time, we evaluate for all possible classes and find the class with highest probability
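The feature function shown on the original slide is not preserved; a typical binary feature of the kind described would be

\[
s_{\text{NOUN},\,\text{``dog''}}(x, y) \;=\;
\begin{cases}
1 & \text{if } y = \text{NOUN} \text{ and } x = \text{``dog''}\\
0 & \text{otherwise.}
\end{cases}
\]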
Maximum Entropy – Feature Funcs. • Two feature functions for different labels on the same word (e.g. one associating NOUN with “dog” and one associating VERB with “dog”) would never fire simultaneously • Each would have its own lambda weight for evaluation
Maximum Entropy – Feature Funcs. • MaxEnt models do not make assumptions about the independence of features • Depending on the application, feature functions can benefit from context
Maximum Entropy – Feature Funcs. • Other feature functions possible beyond simple word/tag association • Does the word have a particular prefix? • Does the word have a particular suffix? • Is the word capitalized? • Does the word contain punctuation? • Ability to integrate many complex but sparse observations is a strength of maxent models.
Conditional Random Fields • Extends the idea of maxent models to sequences • Label sequence Y has a Markov structure • Observed sequence X may have any structure • State functions help determine the identity of the state • Transition functions add associations between transitions from one label to another [Figure from the original slides: a linear chain of label nodes Y, each connected to an observation node X]
Conditional Random Fields • CRF extends the maxent model by adding weighted transition functions • Both types of functions can be defined to incorporate observed inputs
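The original slides give the model equation as an image; a standard linear-chain CRF form consistent with the description above (state functions s with weights λ, transition functions t with weights μ, i indexing positions in the sequence) is

\[
P(Y \mid X) \;=\; \frac{1}{Z(X)} \exp\!\Big(\sum_{i}\Big[\sum_j \lambda_j\, s_j(y_i, X, i) \;+\; \sum_k \mu_k\, t_k(y_{i-1}, y_i, X, i)\Big]\Big),
\]

where Z(X) normalizes over all possible label sequences.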
Conditional Random Fields • Feature functions defined as for maxent models • Label/observation pairs for state feature functions • Label/label/observation triples for transition feature functions • Often transition feature functions are left as “bias features” – label/label pairs that ignore the attributes of the observation
Conditional Random Fields • Example: CRF POS tagging • Associates a tag (NOUN) with a word in the text (“dog”) AND with a tag for the prior word (DET) • This function evaluates to 1 only when all three occur in combination • At training time, both tag and word are known • At evaluation time, we evaluate for all possible tag sequences and find the sequence with highest probability (Viterbi decoding)
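Again, the function on the original slide is not preserved; a transition feature of the kind described would look like

\[
t_{\text{DET},\,\text{NOUN},\,\text{``dog''}}(y_{i-1}, y_i, x_i) \;=\;
\begin{cases}
1 & \text{if } y_{i-1} = \text{DET},\; y_i = \text{NOUN},\; x_i = \text{``dog''}\\
0 & \text{otherwise.}
\end{cases}
\]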
SLaTe Experiments - Background • Goal: Integrate outputs of speech attribute detectors together for recognition • e.g. Phone classifiers, phonological feature classifiers • Attribute detector outputs highly correlated • Stop detector vs. phone classifier for /t/ or /d/ • Accounting for correlations in HMM • Ignore them (decreased performance) • Full covariance matrices (increased parameters) • Explicit decorrelation (e.g. PCA)
SLaTe Experiments - Background • Speech Attributes • Phonological feature attributes • Detector outputs describe phonetic features of a speech signal • Place, Manner, Voicing, Vowel Height, Backness, etc. • A phone is described with a vector of feature values • Phone class attributes • Detector outputs describe the phone label associated with a portion of the speech signal • /t/, /d/, /aa/, etc.
SLaTe Experiments - Background • CRFs for ASR • Phone Classification (Gunawardana et al., 2005) • Uses sufficient statistics to define feature functions • Different approach than NLP tasks using CRFs • Define binary feature functions to characterize observations • Our approach follows the latter method • Use neural networks to provide “soft binary” feature functions (e.g. posterior phone outputs)
SLaTe Experiments • Implemented CRF models on data from phonetic attribute detectors • Performed phone recognition • Compared results to Tandem/HMM system on same data • Experimental Data • TIMIT corpus of read speech
SLaTe Experiments - Attributes • Attribute Detectors • ICSI QuickNet Neural Networks • Two different types of attributes • Phonological feature detectors • Place, Manner, Voicing, Vowel Height, Backness, etc. • N-ary features in eight different classes • Posterior outputs -- P(Place=dental | X) • Phone detectors • Neural network outputs based on the phone labels • Trained using PLP 12+deltas
SLaTe Experiments - Setup • CRF code • Built on the Java CRF toolkit from Sourceforge • http://crf.sourceforge.net • Performs maximum log-likelihood training • Uses the Limited-Memory BFGS (L-BFGS) algorithm to minimize the negative log-likelihood
Experimental Setup • Feature functions built using the neural net output • Each attribute/label combination gives one feature function • Phone class: s/t/,/t/ or s/t/,/s/ • Feature class: s/t/,stop or s/t/,dental
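A minimal sketch of how such “soft binary” state feature functions can be built from detector posteriors. The label set, attribute names, and posterior values below are illustrative assumptions, not the actual experimental configuration:

```python
# Sketch only: one state feature function per label/attribute combination,
# returning the detector posterior when the hypothesized label matches.
PHONE_LABELS = ["/t/", "/s/", "/iy/"]        # hypothetical CRF state labels
ATTRIBUTES = ["stop", "dental", "voiced"]    # hypothetical detector outputs

def state_feature(label, attribute):
    """s(y, x_t): posterior P(attribute | x_t) if y == label, else 0."""
    def f(y, posteriors_t):
        return posteriors_t[attribute] if y == label else 0.0
    return f

feature_functions = {(y, a): state_feature(y, a)
                     for y in PHONE_LABELS for a in ATTRIBUTES}

# One frame of made-up detector posteriors
frame = {"stop": 0.92, "dental": 0.85, "voiced": 0.07}
print(feature_functions[("/t/", "stop")]("/t/", frame))  # -> 0.92
print(feature_functions[("/t/", "stop")]("/s/", frame))  # -> 0.0
```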
Experimental Setup • Baseline system for comparison • Tandem/HMM baseline (Hermansky et al., 2000) • Uses outputs from neural networks as inputs to a Gaussian-based HMM system • Built using the HTK HMM toolkit • Linear inputs • Better performance for Tandem with linear outputs from the neural network • Decorrelated using a Karhunen-Loève (KL) transform (PCA)
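A minimal numpy sketch of the KL/PCA decorrelation step described above, not the original HTK/QuickNet pipeline; the array shape and the random stand-in data are assumptions:

```python
import numpy as np

def kl_transform(nn_outputs):
    """Decorrelate network outputs with a Karhunen-Loeve transform (PCA)."""
    centered = nn_outputs - nn_outputs.mean(axis=0)   # remove the mean
    cov = np.cov(centered, rowvar=False)              # output covariance
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigen-decomposition
    order = np.argsort(eigvals)[::-1]                 # sort by decreasing variance
    return centered @ eigvecs[:, order]               # project onto eigenvectors

# Example: 1000 frames of 61 hypothetical linear phone outputs
decorrelated = kl_transform(np.random.randn(1000, 61))
```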
Feature Combinations • CRF model supposedly robust to highly correlated features • Makes no assumptions about feature independence • Tested this claim with combinations of correlated features • Phone class outputs + Phono. Feature outputs • Posterior outputs + transformed linear outputs • Also tested whether linear, decorrelated outputs improve CRF performance
Results (Morris & Fosler-Lussier ‘08) [Results table not preserved in this text version; its significance footnotes follow] * Significantly (p<0.05) better than comparable CRF monophone system * Significantly (p<0.05) better than comparable Tandem 4mix triphone system * Significantly (p<0.05) better than comparable Tandem 16mix triphone system
Conclusions • Using correlated features in the CRF model did not degrade performance • Extra features improved performance for the CRF model across the board • Viterbi realignment training significantly improved CRF results • Improvement did not occur when best HMM-aligned transcript was used for training
Extension – Word Decoding • Use the CRF model to generate features for an HMM • “Crandem” system (Morris & Fosler-Lussier, 09) • Performance similar to a comparably trained Tandem HMM system • Direct word decoding over CRF lattices • In progress – preliminary experiments over a restricted vocabulary (digits) match state-of-the-art performance • Currently working on extending to larger vocabularies
References • J. Lafferty et al., “Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data”, Proc. ICML, 2001 • A. Berger, “A Brief MaxEnt Tutorial”, http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html • R. Rosenfeld, “Adaptive statistical language modeling: a maximum entropy approach”, PhD thesis, CMU, 1994 • A. Gunawardana et al., “Hidden Conditional Random Fields for phone classification”, Proc. Interspeech, 2005 • J. Morris and E. Fosler-Lussier, “Conditional Random Fields for Integrating Local Discriminative Classifiers”, IEEE Transactions on Audio, Speech and Language Processing, 2008 • J. Morris and E. Fosler-Lussier, “Crandem: Conditional Random Fields for Word Recognition”, Proc. Interspeech, 2009
Initial Results (Morris & Fosler-Lussier, 06) [Results table not preserved in this text version; its significance footnotes follow] * Significantly (p<0.05) better than comparable Tandem monophone system * Significantly (p<0.05) better than comparable CRF monophone system
Feature Combinations - Results [Results table not preserved in this text version; its significance footnote follows] * Significantly (p<0.05) better than comparable posterior or linear KL systems
Viterbi Realignment • Hypothesis: CRF results suffer from training only on pre-defined label boundaries • HMM training allows “boundaries” to shift during training • The basic CRF training process does not • Modify training to allow for better boundaries: • Train the CRF with fixed boundaries • Force-align the training labels using the CRF • Adapt the CRF weights using the new boundaries
Future Work • Recently implemented stochastic gradient training for CRFs • Faster training, improved results • Work currently being done to extend the model to word recognition • Also examining the use of transition functions that use the observation data • The Crandem system does this, with improved results for phone recognition
Conditional Random Fields • Example – POS tagging (Lafferty, 2001) • State feature functions defined as word/label pairs • Transition feature functions defined as label/label pairs • Achieved results comparable to an HMM with the same features
Conditional Random Fields • Example – POS tagging (Lafferty, 2001) • Adding more complex and sparse features improved the CRF performance • Capitalization? • Suffixes? (-iy, -ing, -ogy, -ed, etc.) • Contains a hyphen?
Conditional Random Fields • Based on the framework of Markov Random Fields • A CRF is a model in which the graph of the label sequence is an MRF when conditioned on a set of input observations (Lafferty et al., 2001) • State functions help determine the identity of the state • Transition functions add associations between transitions from one label to another [Figure from the original slides: a chain of phone labels /k/ /k/ /iy/ /iy/ /iy/ over observation nodes X]
Conditional Random Fields • CRF defined by a weighted sum of state and transition functions • Both types of functions can be defined to incorporate observed inputs • Weights are trained by maximizing the conditional log-likelihood with gradient-based methods
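For reference, a sketch of that training objective in the notation used earlier: the gradient of the conditional log-likelihood with respect to a state-function weight is the observed feature count minus its expectation under the current model,

\[
\frac{\partial \log P(Y \mid X)}{\partial \lambda_j}
\;=\; \sum_{i} s_j(y_i, X, i)
\;-\; \sum_{i} \sum_{y'} P(y' \mid X, i)\, s_j(y', X, i),
\]

where the marginals P(y' | X, i) are computed with the forward-backward algorithm; the transition-function weights μ have an analogous form.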