HMM vs. Maximum Entropy for SU Detection Yang Liu 04/27/2004
Outline
• SU Detection Problem
• Two Modeling Approaches
• Experimental Results
• Conclusions & Future Work
SU Detection Problem
• Find the sentence-like unit (SU) boundaries given the word sequence (human transcripts or speech recognition output) and the speech signal
• Why?
• Easier for human comprehension
• Needed by downstream NLP modules
• May help speech recognition accuracy
SU Detection Using HMM
• Sequence decoding with the Viterbi algorithm
• For a classification task, it is better to find the most likely event at each interword boundary than to find the most likely event sequence (sketched below):
Êi = argmax P(Ei | W, F), computed with the forward-backward algorithm
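To make the forward-backward step concrete, here is a minimal sketch. It assumes a simplified two-state event chain (SU vs. no-SU) whose transition probabilities stand in for the hidden-event LM and whose emission likelihoods stand in for the prosody model; the real model interleaves words and events, and every name and number below is illustrative, not from the talk.

```python
import numpy as np

def forward_backward(trans, emit):
    """trans: (S, S) transition matrix P(E_t | E_{t-1});
    emit: (T, S) emission likelihoods P(F_t | E_t).
    Returns (T, S) posteriors P(E_t | all observations)."""
    T, S = emit.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = emit[0] / S                       # uniform initial distribution
    for t in range(1, T):                        # forward pass
        alpha[t] = emit[t] * (alpha[t - 1] @ trans)
        alpha[t] /= alpha[t].sum()               # rescale to avoid underflow
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = trans @ (emit[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta                          # combine and renormalize per t
    return post / post.sum(axis=1, keepdims=True)

trans = np.array([[0.8, 0.2], [0.6, 0.4]])       # illustrative event-chain LM
emit = np.array([[0.9, 0.1], [0.3, 0.7], [0.7, 0.3]])  # illustrative prosody scores
print(forward_backward(trans, emit))             # take argmax per boundary
```

Per-boundary classification then picks the argmax of each row, rather than running Viterbi over the whole sequence.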
Terms in the HMM Approach
• Transition probabilities
• Hidden event LM: P(W,E) = P(W1, E1, W2, E2, ...)
• Maximum likelihood parameter estimation
• Emission probabilities
• P(Fi|Ei) ∝ P(Ei|Fi) / P(Ei), by Bayes' rule (P(Fi) is constant across events); see the sketch below
• Decision trees estimate the posterior probabilities given the prosodic features
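As a sketch of that Bayes-rule conversion (the helper name and numbers are ours, for illustration only):

```python
def scaled_likelihood(posterior, prior):
    """Emission score P(F|E) up to a constant: P(E|F) / P(E)."""
    return posterior / prior

# e.g. the tree outputs P(SU | F_i) = 0.7 while the illustrative prior P(SU) = 0.2
print(scaled_likelihood(0.7, 0.2))  # 3.5
```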
SU Detection Using Maxent
• Represent SU detection as a classification task; each sample has an associated feature set: P(Ei|Oi), with Oi = (c(wi), Fi), the textual context of the boundary plus its prosodic features
• Parameter estimation:
• Maximize the joint likelihood P(E,O): ML estimation, as in the naïve-Bayes method
• Maximize the conditional likelihood P(E|O): i.e., maximum entropy
Maxent Introduction
• Model what is known; assume nothing about what is unknown. The estimate of p(y|x):
• Satisfies the constraints: the expected value of each feature under the empirical distribution equals its expected value under the model p(y|x)
• Has maximum entropy among all distributions satisfying those constraints
• Takes an exponential (log-linear) form, written out below
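Written out, the two properties above yield the standard log-linear form (a textbook result, not specific to this talk):

```latex
% Constraint: for every feature f_j, the model's expectation matches the
% empirical expectation:
%   \sum_{x,y} \tilde p(x)\, p(y \mid x)\, f_j(x,y)
%     = \sum_{x,y} \tilde p(x,y)\, f_j(x,y)
% The maximum-entropy solution subject to these constraints is log-linear:
p(y \mid x) = \frac{1}{Z_\lambda(x)}
              \exp\Big(\sum_j \lambda_j f_j(x,y)\Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big(\sum_j \lambda_j f_j(x,y')\Big)
```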
Features in Maxent
• Features are indicator functions, e.g. f(x,y) = 1 if x = 'uhhuh' and y = SU, 0 otherwise (sketched below)
• The weights (lambdas) are estimated iteratively
• We use a variety of features:
• Words (different n-grams, different positional information)
• POS tags
• Chunks
• Automatically induced word classes
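Since conditional maxent over binary indicator features is equivalent to multinomial logistic regression, a small scikit-learn sketch can stand in for a dedicated maxent toolkit; the feature names, samples, and labels below are invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

samples = [  # one sample per interword boundary; keys are indicator features
    {"word=uhhuh": 1, "pos=UH": 1, "pause_bin>=0.5": 1},
    {"word=the": 1, "pos=DT": 1},
    {"word=right": 1, "pos=RB": 1, "pause_bin>=0.5": 1},
]
labels = ["SU", "noSU", "SU"]

vec = DictVectorizer()
X = vec.fit_transform(samples)                     # sparse indicator matrix
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(vec.transform([{"word=uhhuh": 1}])))  # P(E_i | O_i)
```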
Features in Maxent (cont.)
• The prosody model's posterior P(Ei|Fi)
• It is convenient to use binary features in the maxent approach
• Encode the posterior probabilities from the decision trees as binary features, using cumulative binning (sketched below)
• Decisions from other classifiers (e.g. LMs)
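A sketch of what cumulative binning might look like; the exact thresholds and feature encoding here are our assumption, not taken from the talk.

```python
THRESHOLDS = [0.1, 0.3, 0.5, 0.7, 0.9]  # illustrative bin edges

def cumulative_bins(p, thresholds=THRESHOLDS):
    """Encode a posterior p as one binary feature per threshold it exceeds,
    so nearby probabilities share most of their active features."""
    return {f"tree_posterior>={t}": 1 for t in thresholds if p >= t}

print(cumulative_bins(0.65))
# {'tree_posterior>=0.1': 1, 'tree_posterior>=0.3': 1, 'tree_posterior>=0.5': 1}
```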
Differences between HMM and Maxent
• Both use word context, but maxent uses only Fi, while the HMM uses all of F via the forward-backward algorithm
• Maxent bins the posterior probabilities, losing some information
• Maxent maximizes the conditional likelihood P(E|O); the HMM maximizes the joint likelihood P(W,E)
• When combining LMs, the HMM linearly interpolates posterior probabilities, relying on an independence assumption; maxent integrates overlapping features more tightly (contrast sketched below)
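The contrast in the last bullet, with illustrative weights: the HMM-style combination averages the models' posteriors, while maxent simply pools their features into one model (as in the classifier sketch above).

```python
def interpolate(p_word_lm, p_pos_lm, lam=0.8):
    """Linear interpolation of two models' posteriors for the same boundary."""
    return lam * p_word_lm + (1 - lam) * p_pos_lm

print(interpolate(0.6, 0.2))  # 0.52
```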
BN & CTS SU Detection
• RT03 dev and eval sets
• Evaluation (computation sketched below):
• Error = (# missed SUs + # false alarms) / # reference SUs
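The metric as a one-line computation; the counts come from aligning system output to the reference, and the numbers here are illustrative.

```python
def su_error_rate(n_missed, n_false_alarms, n_ref_sus):
    """SU error rate: misses plus false alarms, per reference SU."""
    return (n_missed + n_false_alarms) / n_ref_sus

print(su_error_rate(120, 80, 1000))  # 0.20, i.e. 20% SU error
```

Note that because false alarms are also normalized by the number of reference SUs, this error rate can exceed 100%.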
Some Findings
• Errors increase in the face of recognition errors
• Maxent degrades more, possibly due to its heavier reliance on textual information and reduced dependence on prosody
• Maxent yields more gain on CTS than on BN
• More training data for the CTS task?
• Prosody more important for BN?
• Different genres
• It is easy to combine highly related knowledge sources in maxent; the HMM's interpolation makes an independence assumption
Error Types
• The two approaches have different error patterns: the HMM has more insertion errors (its prosody model tends to produce false alarms), so the two can be effectively combined
Findings
• Using textual information only, maxent performs much better than the HMM
• The HMM makes an independence assumption when combining multiple LMs
• Maxent better integrates different textual knowledge sources
• When prosody is included, the gain from maxent is lost:
• Posterior probabilities are encoded in a lossy way (binning)
• Maxent uses only Fi, while the HMM uses all of F
Conclusions
• A combination of the HMM and maxent achieves the best performance
• Both approaches make inaccurate assumptions, and each has its advantages:
• Optimization metric vs. performance measure: conditional likelihood matches better than joint likelihood, though still not perfectly
• Independence assumptions: loose interpolation in the HMM; maxent uses only Fi
• Maxent is more computationally demanding than the HMM
Future Work
• HMM:
• Maximize conditional likelihood
• Joint word and class LMs
• Maxent:
• Use numerical features, not just binary ones
• Preserve the prosodic probabilities
• Possibly use confidence measures from STT output
• MCE discriminative training