
HMM vs. Maximum Entropy for SU Detection



  1. HMM vs. Maximum Entropy for SU Detection Yang Liu 04/27/2004

  2. Outline • SU Detection Problem • Two Modeling Approaches • Experimental Results • Conclusions & Future Work

  3. SU Detection Problem • Find sentence-like unit (SU) boundaries given the word sequence (human transcripts or speech recognition output) and the speech signal • Why? • Easier for human comprehension • Needed by downstream NLP modules • May help speech recognition accuracy

  4. SU Detection Using HMM • Sequence decoding: the Viterbi algorithm finds the single most likely event sequence • For a classification task scored per boundary, it is better to find the most likely event at each interword boundary than to decode the most likely sequence: Ei = argmax_E P(Ei = E | W, F), computed with the forward-backward algorithm (sketch below)
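To make the per-boundary decision rule concrete, here is a minimal forward-backward sketch in Python (the two-event boundary model, the toy probability tables, and all variable names are illustrative assumptions, not the actual system):

    import numpy as np

    # Toy HMM over interword boundary events: state 0 = no-SU, state 1 = SU.
    # trans[i, j] = P(E_t = j | E_{t-1} = i), from the hidden-event LM;
    # emit[t, j]  = P(F_t | E_t = j), from the prosodic model (up to a constant).
    trans = np.array([[0.9, 0.1],
                      [0.6, 0.4]])
    emit = np.array([[0.8, 0.2],
                     [0.3, 0.7],
                     [0.5, 0.5]])   # 3 boundaries x 2 events
    init = np.array([0.5, 0.5])

    T, S = emit.shape
    alpha = np.zeros((T, S))        # forward probabilities
    beta = np.zeros((T, S))         # backward probabilities
    alpha[0] = init * emit[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[t + 1] * beta[t + 1])

    # Per-boundary posteriors P(E_t | all evidence); taking the argmax at each
    # boundary minimizes per-boundary error, unlike Viterbi, which returns the
    # one best sequence.
    post = alpha * beta
    post /= post.sum(axis=1, keepdims=True)
    print(post.argmax(axis=1))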

  5. Terms in the HMM Approach • Transition probabilities: a hidden-event LM over the joint sequence, P(W,E) = P(W1, E1, W2, E2, ...) • Maximum-likelihood parameter estimation • Emission probabilities: P(Fi|Ei) ∝ P(Ei|Fi) / P(Ei) by Bayes' rule • Decision trees estimate the posterior probabilities P(Ei|Fi) from the prosodic features
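As a rough illustration of the hidden-event LM, a sketch that scores P(W,E) by treating boundary events as tokens interleaved with the words (the bigram table and backoff value are invented placeholders):

    import math

    # Hidden-event LM: events are tokens interleaved with words, so
    # P(W, E) = P(w1) P(e1 | w1) P(w2 | e1) ...  Here, a toy bigram table.
    bigram = {
        ("<s>", "yeah"): 0.2,
        ("yeah", "<SU>"): 0.6,     # an SU boundary is likely after "yeah"
        ("<SU>", "right"): 0.3,
        ("right", "<SU>"): 0.5,
    }

    def joint_logprob(tokens, lm, backoff=1e-4):
        """Score an interleaved word/event token sequence with a bigram LM."""
        logp = 0.0
        for prev, cur in zip(["<s>"] + tokens[:-1], tokens):
            logp += math.log(lm.get((prev, cur), backoff))
        return logp

    print(joint_logprob(["yeah", "<SU>", "right", "<SU>"], bigram))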

  6. SU Detection Using Maxent • Represent SU detection as a classification task: each sample (interword boundary) has an associated feature set, and we model P(Ei|Oi) with Oi = (c(wi), Fi), the textual context and the prosodic features • Parameter estimation, two routes: • Maximize the joint likelihood P(E,O): ML estimation, the naïve-Bayes route • Maximize the conditional likelihood P(E|O): the maximum-entropy route (sketch below)
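A minimal sketch of the conditional-likelihood route, using logistic regression, which over binary indicator features is the same log-linear family as maxent (the feature values and labels below are invented for illustration):

    from sklearn.linear_model import LogisticRegression

    # Each row: binary indicator features extracted at one interword boundary,
    # e.g. [word == 'uhhuh', pause present, POS == UH]; label 1 = SU boundary.
    X = [[1, 1, 1],
         [0, 0, 0],
         [1, 0, 1],
         [0, 1, 0]]
    y = [1, 0, 1, 0]

    # Training maximizes the conditional likelihood P(E|O) (plus regularization),
    # rather than the joint likelihood P(E, O) that naïve Bayes would maximize.
    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[1, 1, 0]]))   # P(E_i | O_i) for a new boundary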

  7. Maxent Introduction • Model what is known; assume nothing about what is unknown. The estimate of p(y|x): • Satisfies the constraints: for each feature, its expected value under the model p(y|x) equals its value under the empirical distribution • Among all distributions satisfying the constraints, has maximum entropy • Takes an exponential (log-linear) form
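In symbols (the standard maximum-entropy formulation; the notation is mine, not from the slides):

    p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
    \qquad
    Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x, y') \Big)

subject to one constraint per feature f_i, matching the empirical and model expectations:

    \sum_{x, y} \tilde{p}(x, y) \, f_i(x, y) = \sum_{x} \tilde{p}(x) \sum_{y} p_\lambda(y \mid x) \, f_i(x, y)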

  8. Features in Maxent • Features are indicator functions, e.g. f(x,y) = 1 if x = ‘uhhuh’ and y = SU, 0 otherwise (sketch below) • The lambdas (feature weights) are estimated iteratively • We use a variety of features: • Word (different ngrams, different positional info) • POS • Chunk • Automatically induced classes
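A small sketch of indicator features written as functions (the feature templates below are simplified stand-ins for the word/POS/chunk/class features listed above):

    def f_uhhuh_su(x, y):
        """Fires when the word before the boundary is 'uhhuh' and the event is SU."""
        return 1 if x["word"] == "uhhuh" and y == "SU" else 0

    def f_pos_su(x, y):
        """Fires on a (POS tag, event) pair, e.g. an interjection before an SU."""
        return 1 if x["pos"] == "UH" and y == "SU" else 0

    x = {"word": "uhhuh", "pos": "UH"}
    print(f_uhhuh_su(x, "SU"), f_pos_su(x, "SU"))   # 1 1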

  9. Features in Maxent (cont.) • Prosody enters through P(Ei|Fi) • Binary features are convenient in the maxent approach • So the posterior probabilities from the decision trees are encoded as binary features using cumulative binning (sketch below) • Decisions from other classifiers (e.g., LMs) can be encoded the same way
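A minimal sketch of cumulative binning, turning one decision-tree posterior into overlapping binary features (the threshold values are assumptions for illustration):

    def cumulative_bins(p, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
        """Encode a posterior p as overlapping binary features, one per 'p >= t'.

        Cumulative (rather than disjoint) bins mean that nearby posteriors share
        most of their active features, which softens the quantization loss.
        """
        return [1 if p >= t else 0 for t in thresholds]

    print(cumulative_bins(0.65))   # [1, 1, 1, 0, 0]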

  10. Differences between HMM and Maxent • Both use word context; for prosody, maxent uses only Fi at the current boundary, while the HMM uses all of F via the forward-backward algorithm • Maxent bins the posterior probabilities, and thus loses some information • Maxent maximizes the conditional likelihood P(E|O); the HMM maximizes the joint likelihood P(W,E) • When combining LMs, the HMM linearly interpolates posterior probabilities under an independence assumption; maxent integrates overlapping features more tightly

  11. BN & CTS SU Detection • Broadcast news (BN) and conversational telephone speech (CTS), RT-03 dev and eval sets • Evaluation: • Error = (# missed SUs + # false alarms) / # reference SUs
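A small sketch of the metric; note that because it is normalized by the number of reference SUs rather than by all boundaries, it can exceed 100% (the counts below are hypothetical):

    def su_error_rate(n_missed, n_false_alarms, n_ref_sus):
        """SU error: (misses + false alarms) / number of reference SUs."""
        return (n_missed + n_false_alarms) / n_ref_sus

    # Hypothetical counts, for illustration only.
    print(su_error_rate(n_missed=120, n_false_alarms=80, n_ref_sus=500))  # 0.4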

  12. Some Findings • Errors increase in the face of recognition errors • Maxent degrades more, possibly due to its heavier reliance on textual information and reduced dependence on prosodic information • Maxent yields more gain on CTS than on BN • More training data for the CTS task? • Prosody more important for BN? • Different genres • It is easy to combine highly related knowledge sources in maxent • The HMM's interpolation makes an independence assumption

  13. Error Type • The two approaches show different error patterns: the HMM has more insertion errors (its prosody model tends to produce false alarms), so the two can be combined effectively (sketch below)
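One simple combination consistent with the slides' framing is to interpolate the two models' per-boundary SU posteriors; the interpolation weight below is an assumption, since the slides do not give the combination details:

    def combine(p_hmm, p_maxent, w=0.5):
        """Interpolate per-boundary SU posteriors from the two models."""
        return [w * a + (1 - w) * b for a, b in zip(p_hmm, p_maxent)]

    p_hmm = [0.8, 0.2, 0.6]      # hypothetical P(SU) from the HMM (forward-backward)
    p_maxent = [0.6, 0.1, 0.7]   # hypothetical P(SU) from the maxent classifier
    decisions = [p >= 0.5 for p in combine(p_hmm, p_maxent)]
    print(decisions)   # [True, False, True]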

  14. Effect on LM & Prosody

  15. Findings • Using textual information only, maxent performs much better than the HMM • The HMM relies on an independence assumption when combining multiple LMs • Maxent integrates the different textual knowledge sources better • When prosody is included, the gain from maxent is lost: • The posterior probabilities are encoded in a lossy way (binning) • Maxent uses only Fi, while the HMM uses all of F

  16. Conclusions • A combination of the HMM and maxent achieves the best performance • Both approaches make inaccurate assumptions, and each has advantages: • Match between optimization metric and performance measure: conditional likelihood is better than joint likelihood, though still not ideal • Independence assumptions: loose interpolation in the HMM; only Fi used in maxent • Maxent is more computationally demanding than the HMM

  17. Future Work • HMM • Maximize conditional likelihood • Joint word and class LMs • Maxent • Use numerical features, not just binary ones • Preserve the prosodic posterior probabilities • May be able to use confidence measures from the STT output • MCE discriminative training
