Towards Superhuman Speech Recognition

Mukund Padmanabhan and Michael Picheny, Human Language Technologies Group, IBM Thomas J. Watson Research Center. Special thanks to: Stan Chen, Satya Dharanipragada, Geoff Zweig and members of the Telephony Speech Algorithms Group.


Presentation Transcript


  1. Towards Superhuman Speech Recognition. Mukund Padmanabhan and Michael Picheny, Human Language Technologies Group, IBM Thomas J. Watson Research Center. Special thanks to: Stan Chen, Satya Dharanipragada, Geoff Zweig and members of the Telephony Speech Algorithms Group

  2. Common UI Folklore
     • “Except when interacting with video games, a user does not take very well to surprises” (Human-Computer Interaction; Dix, Finlay, Abowd and Beale)
     • “Golden Rule #3: Make the interface consistent” (The Elements of User Interface Design; Mandel)
     • “Computer users usually seek predictable responses and are discouraged if they must engage in clarification dialogs frequently” (Designing the User Interface; Shneiderman)

  3. Speech Recognition Progress

  4. Human Performance(Lippmann, 1997)

  5. Problem Categorization

  6. Domain Dependence

  7. Observations
     1. Spontaneous speech: largest effect on WER (Switchboard, Voicemail, Meetings, real-world speech)
     2. Multi-environment speech sources (16K, 8K, far-field microphone, noisy ...)
     3. Multi-domain speech sources (dictation, travel, call center, small vocab, broadcast news)
     4. Domain-dependence of performance
     Objective: Develop a speech recognition system that mimics human performance (independent of environment and domain; works as well for spontaneous as for carefully enunciated speech)
     Focus areas
     • Improve spontaneous speech models: 1. Articulatory modeling 2. Prosodic features 3. Segmental graphical models 4. Joint parameter estimation 5. Speaker separation for multi-speaker speech 6. Data collection for "meeting speech"
     • Multi-environment: 1. Non-linear feature space transformation 2. Hidden observations
     • Multi-domain: 1. Multistyle training 2. Domain independent LM

  8. 30% Improvement • No initial decoding

  9. ASR Workshop

  10. A Language Model that Works Well on Many Domains
     • Different (static) language models work best on different domains
     • Use dynamic adaptation to make a generic LM act like a domain-specific LM
     • Generic LM: linear interpolation of a collection of domain-specific LMs (SWB, BN, digit/date grammar, etc.)
     • Adapt by dynamically adjusting interpolation weights
     • Want to be able to adapt quickly: at the word/sentence level, not at the document level
     Example: “Um, yeah. Well, anyway, I’ll be arriving at four twenty two p.m. on flight fifty six. Say hi to mom. Oh, and don’t forget to buy IBM at one forty-four.”
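The linear interpolation described on this slide can be illustrated with a minimal Python sketch. It is only an illustration, not the authors' system: the domain names, the per-domain probabilities and both weight sets are made-up values, and `interpolate` is a hypothetical helper.

```python
def interpolate(weights, domain_probs):
    """Generic LM probability: P(w|h) = sum_k lambda_k * P_k(w|h)."""
    return sum(weights[d] * domain_probs[d] for d in weights)

# Hypothetical probabilities each domain LM assigns to the word "four"
# after the history "arriving at" (illustrative numbers only).
domain_probs = {"swb": 0.001, "bn": 0.002, "digits": 0.10}

# Static generic weights versus weights shifted toward the digit/date grammar.
generic_weights = {"swb": 0.4, "bn": 0.4, "digits": 0.2}
adapted_weights = {"swb": 0.1, "bn": 0.1, "digits": 0.8}

p_generic = interpolate(generic_weights, domain_probs)
p_adapted = interpolate(adapted_weights, domain_probs)
print(p_generic, p_adapted)
```

With these assumed numbers, moving weight toward the digit/date component raises the probability of the digit word, which is the effect dynamic adaptation aims for when the utterance drifts into a time or flight number.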

  11. Adapting Language Model Interpolation Weights
     • Simply re-estimate weights to maximize likelihood of adaptation data (like dynamic deleted interpolation)
     • Can be quite slow because a lot of evidence has to be accumulated
     • Add a hidden variable to the model that tracks which domain LM is currently being used (Bayesian adaptation)
     • Rate of adaptation can be fast, can depend on context, and can be trained on domain-labelled data
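The two adaptation strategies on this slide can be contrasted in a short Python sketch. The slide does not specify the models, so everything here is an assumption: `em_reestimate` uses standard EM for mixture weights as a stand-in for maximum-likelihood re-estimation, and `track_domain` uses a simple forward filter with a "sticky" switching probability (`stay`) as one possible reading of the hidden domain variable. The domain names and probabilities are invented.

```python
def em_reestimate(weights, word_probs, iterations=20):
    """Maximum-likelihood interpolation weights via EM.

    word_probs: one dict per adaptation word, mapping domain -> P_k(w|h).
    Assumes every word has non-zero probability under at least one domain.
    """
    w = dict(weights)
    for _ in range(iterations):
        counts = {d: 0.0 for d in w}
        for probs in word_probs:
            total = sum(w[d] * probs[d] for d in w)
            for d in w:
                # Posterior responsibility of domain d for this word.
                counts[d] += w[d] * probs[d] / total
        w = {d: counts[d] / len(word_probs) for d in w}
    return w

def track_domain(prior, word_probs, stay=0.9):
    """Forward filtering over a hidden 'current domain' variable.

    Assumes at least two domains; `stay` is a hypothetical probability of
    remaining in the same domain from one word to the next.
    """
    domains = list(prior)
    switch = (1.0 - stay) / (len(domains) - 1)
    belief = dict(prior)
    history = []
    for probs in word_probs:
        # Predict step: the domain may switch before the next word.
        predicted = {
            d: sum(belief[e] * (stay if e == d else switch) for e in domains)
            for d in domains
        }
        # Update step: reweight by how well each domain LM explains the word.
        joint = {d: predicted[d] * probs[d] for d in domains}
        norm = sum(joint.values())
        belief = {d: joint[d] / norm for d in domains}
        history.append(belief)
    return history

# Three conversational words followed by three digit words (made-up values).
observations = [{"swb": 0.01, "digits": 0.0001}] * 3 + \
               [{"swb": 0.0001, "digits": 0.05}] * 3

batch_weights = em_reestimate({"swb": 0.9, "digits": 0.1}, observations)
beliefs = track_domain({"swb": 0.5, "digits": 0.5}, observations)
print(batch_weights)
print(beliefs[2], beliefs[3])
```

In this toy setting the EM weights settle near an even split because they summarize the whole adaptation set, while the filtered belief moves to the digit domain after a single digit word. That mirrors the slide's contrast between slow evidence accumulation and fast, context-dependent adaptation.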

  12. Other Factors Driving Progress

  13. What Types of Data Do We Need?

  14. Some Concrete Suggestions
     Target: 5000 hours of transcribed spontaneous speech
     2000 hours/year; 50000 hours/year (25); 5000 hours of speech; cost ~ $1M
     Sources of new data:
     • Supergirl, by David Odell (script, revised screenplay, Word document)
     • Superman: The Motion Picture, by Mario Puzo (early draft script)
     • Superman: The Motion Picture, by Mario Puzo (shooting script)
     • Superman II, directed by Richard Donner (script, early version)
     • Superman II, directed by Richard Lester (script, later version)
     • Superman II (shooting script)
     • Superman IV: The Quest for Peace, by Christopher Reeve (script)
     • Superman: The Man Of Steel, by Alex Ford & J Ellison (script, unproduced)
     • Superman Lives, by Kevin Smith (script, unproduced)
     • Superman Lives, by Dan Gilroy (script synopsis, unproduced)
     Test data: mixture of current and new sources
     • Switchboard, Voicemail, BN, DC, OGI
     • SPEECON, Meetings

  15. Conclusions
     • Speech recognition performance is not adequate
     • Human performance figures suggest that we still have enormous room for improvement
     • Presented several new algorithms to attack the problem aggressively
     • Suggested training and test methodology to drive research
     • Communal participation is critical to push ahead
