
Introduction to Computer Speech Processing

Presentation Transcript


  1. Introduction to Computer Speech Processing Alex Acero Research Area Manager Microsoft Research

  2. Outline • Grand challenges in Speech and Language • Vision videos • Products today • Prototypes • The role of speech • Technology Introduction

  3. Outline • Grand challenges in Speech and Language • Vision videos • Products today • Prototypes • The role of speech • Technology Introduction

  4. User Expectations for Speech

  5. The Turing Test • Imitation Game: • Judge, man, and a woman • All chat via Email. • Man pretends to be a woman. • Man lies, woman tries to help judge. • Judge must identify man after 5 minutes. • Turing Test • Replace man or woman with a computer. • Fool judge 30% of the time. Thanks to Jim Gray for material

  6. What Turing Said “I believe that in about fifty years' time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning. The original question, "Can machines think?" I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.” Alan M. Turing, 1950, “Computing machinery and intelligence.” Mind, Vol. LIX, 433-460

  7. Prediction 59 Years Later • Turing’s technology forecast was great! • Gigabyte memory is common • Computer beat world chess champion • with some help from its programming staff! • Computers help design most things today

  8. Prediction 59 Years Later • Intelligence forecast was optimistic • Several internet sites offer Turing Test chatterbots. • None pass (yet) http://www.loebner.net/Prizef/loebner-prize.html • But I believe it will not be long: • less than 50 years, more than 10 years • Turing test still stands as a long-term challenge

  9. Challenges Implicit in the Turing Test • Read and understand as well as a human • Think and write as well as a human • Hear as well as a native speaker: • Speech Recognition (speech to text) • Speak as well as a native speaker: • Speech Synthesis (text to speech) • Remember what is heard and quickly return it on request.

  10. Moore’s law (1965) • Gordon Moore: “The number of transistors per chip will double every 18 months”: 100x per decade • Progress in next 18 months = ALL previous progress • New storage = sum of all old storage (ever) • New processing = sum of all old processing. 15 years ago
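
The arithmetic on this slide can be checked with a short sketch. It assumes idealized, exact doubling every 18 months; the function names `growth_factor` and `increments` are illustrative and do not come from the talk.

```python
DOUBLING_MONTHS = 18  # doubling period quoted on the slide


def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Multiplicative growth after `months` under exact periodic doubling."""
    return 2.0 ** (months / doubling_months)


def increments(periods):
    """Capacity added in each doubling period, starting from 1 unit."""
    added, total = [1], 1
    for _ in range(periods):
        added.append(total)  # doubling the total adds `total` new units
        total *= 2
    return added


per_decade = growth_factor(120)  # roughly 100x over ten years
history = increments(6)          # each entry equals the sum of all earlier ones
```

Each entry of `history` after the first equals the sum of everything before it, which is the sense in which the next 18 months match all previous progress.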

  11. Making Chips Smaller • Advances in Lithography: science of "drawing" circuits on chips • Impact of Moore’s law: • Short distances => smaller processing time • Smaller size => lower cost per transistor • Amount of memory is increased • But, it is not a law of physics: a mere self-fulfilling prophecy.

  12. Moore’s law not applicable to Machine Intelligence • Speech technology benefited from Moore’s Law in the 1990s. • In the 21st century, faster chips mean recognition errors appear faster • New algorithmic advances needed to pass the Turing Test • Error rate halves approx. every 7 years
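
A rough worked comparison of the two rates, assuming both trends are exactly exponential (an idealization). The symbols E_0 and N_0, for the starting error rate and transistor count, are illustrative and not from the slides.

```latex
E(t) = E_0 \, 2^{-t/7}, \qquad N(t) = N_0 \, 2^{t/1.5} \quad (t \text{ in years})

\frac{E(10)}{E_0} = 2^{-10/7} \approx 0.37, \qquad \frac{N(10)}{N_0} = 2^{10/1.5} \approx 100
```

Over a decade the error rate falls by a factor of about 2.7, while transistor counts grow about 100-fold.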

  13. Grand Challenges “Within 10 years speech will be in every device. Things like speech and ink are so natural, when they get the right quality level they will be in everything. As technical hurdles such as background noise and context are overcome, major adoption of speech technology will arrive. Soon, dictating to PCs and giving commands to cell phones will be basic modes of interacting with technology” Bill Gates, March 2004

  14. Outline • Grand challenges in Speech and Language • Vision videos • Products today • Prototypes • The role of speech • Technology Introduction

  15. Speech in Mobile devices

  16. Speech for Students

  17. Speech in cars

  18. Soccer Mom in car

  19. Insurance Agent driving

  20. Outline • Grand challenges in Speech and Language • Vision videos • Products today • Prototypes • The role of speech • Technology Introduction

  21. Japanese dictation

  22. Telephony: Response point

  23. Directory Assistance • Automatic generation of robust grammars: users say “Calabria” or “Calabria restaurant” • Nearby cities: Is “Calabria restaurant” in Redmond or Kirkland? • Some people say the address too: “Pizza hut on 3rd Avenue” in New York, New York • Automatic normalization: acronyms, compound words, homonyms, misspelled words

  24. Multimodal voice search

  25. Click-Driven Automated Feedback Acoustic Model Language Model

  26. Outline • Grand challenges in Speech and Language • Vision videos • Products today • Prototypes • The role of speech • Technology Introduction

  27. CommuteUX

  28. Speech in Education

  29. VerbalMath

  30. Virtual Receptionist

  31. Video Search (Frank Seide, MSRA)

  32. Browsing a Video (Milind Mahajan & Patrick Nguyen)

  33. Podcast authoring (Patrick Nguyen)

  34. Outline • Grand challenges in Speech and Language • Vision videos • Products today • Prototypes • The role of speech • Technology Introduction

  35. Role of Speech in Different Devices [Chart: devices (PC, Tablet PC, Internet TV, PDA, Screen Phone, Phone, Car) plotted by ease of GUI (screen/pointer) against ease of text input (keyboard/pen), each axis running from low to high]

  36. A Roadmap for Speech [Chart: the same devices (PC, Tablet PC, Internet TV, PDA, Screen Phone, Phone, Car) plotted by ease of GUI (screen/pointer) against ease of text input (keyboard/pen), overlaid with application regions: Dictation, Multimodal, Command/Control, Speech-Only Telephony]

  37. Speech Technology [Table: application areas (Desktop Command & Control, Desktop Dictation, Meeting / Voicemail Transcription, Accessibility, Mobile Devices / Cars, Telephony / Call Center) rated on Customer Need, Poor Alternative, Market Opportunity, and Technology Readiness]

  38. Outline • Grand challenges in Speech and Language • Vision videos • Products today • Prototypes • The role of speech • Technology Introduction

  39. Voice-enabled System Technology Components [Diagram: Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech, with Data and Rules at the center]

  40. Voice-enabled System Technology Components [Diagram: Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech, with Data and Rules at the center]

  41. Basic Formulation • Basic equation of speech recognition is $\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\,P(W)$ • X = X1, X2, …, Xn is the acoustic observation • W = w1, w2, …, wm is the word sequence • P(X|W) is the acoustic model • P(W) is the language model
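
A minimal sketch of this decision rule on two candidate word sequences. All probabilities are made up for illustration, and `decode` is a hypothetical helper, not part of any recognizer described in the talk.

```python
import math

# Made-up scores for one utterance X.
# acoustic[W] stands in for log P(X|W); language[W] stands in for log P(W).
acoustic = {
    "recognize speech": math.log(0.0020),
    "wreck a nice beach": math.log(0.0025),
}
language = {
    "recognize speech": math.log(0.0100),
    "wreck a nice beach": math.log(0.0001),
}


def decode(acoustic_logp, language_logp):
    """Return the W that maximizes log P(X|W) + log P(W)."""
    return max(acoustic_logp, key=lambda w: acoustic_logp[w] + language_logp[w])


best = decode(acoustic, language)
acoustic_only = max(acoustic, key=acoustic.get)
```

With these numbers the acoustic model alone slightly prefers "wreck a nice beach", and the language model prior tips the combined score toward "recognize speech".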

  42. Speech Recognition [Diagram: Input Speech → Feature Extraction → Pattern Classification (Decoding, Search) → Confidence Scoring → “Hello World” (0.9) (0.8); Pattern Classification draws on the Acoustic Model, Word Lexicon, and Language Model]

  43. Feature Extraction • Goal: Extract robust features (information) from the speech that are relevant for ASR. • Method: Spectral analysis through either a bank of filters or Linear Predictive Coding, followed by a non-linearity and normalization. • Result: Signal compression: for each window of speech samples, 30 or so features are extracted (64,000 b/s -> 5,200 b/s). • Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise and echo.
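
The bank-of-filters method can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the front end used in the talk: the 25 ms window, 10 ms hop, 8 equal-width bands, and the name `log_filterbank` are all choices made here, and the input must be at least one frame long.

```python
import cmath
import math


def log_filterbank(signal, rate=8000, frame_ms=25, hop_ms=10, n_bands=8):
    """Return one list of mean-normalized log band energies per frame."""
    frame_len = rate * frame_ms // 1000
    hop = rate * hop_ms // 1000
    n_bins = frame_len // 2 + 1
    # Hamming window reduces spectral leakage at the frame edges.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    # Equal-width rectangular bands (an assumption; mel spacing is common).
    edges = [b * n_bins // n_bands for b in range(n_bands + 1)]
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT power spectrum; an FFT would be used in practice.
        power = [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                         for n in range(frame_len))) ** 2
                 for k in range(n_bins)]
        # Non-linearity: log of the energy in each band.
        features.append([math.log(sum(power[edges[b]:edges[b + 1]]) + 1e-10)
                         for b in range(n_bands)])
    # Normalization: subtract each band's mean over the utterance.
    means = [sum(f[b] for f in features) / len(features) for b in range(n_bands)]
    return [[f[b] - means[b] for b in range(n_bands)] for f in features]


# 0.1 s of a 1 kHz tone sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(800)]
feats = log_filterbank(tone)
```

Each 200-sample window is reduced to 8 numbers here, which is the kind of compression the slide describes with its 30 or so features per window.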

  44. Acoustic Modeling [Diagram: HMM with states 0, 1, 2] • Goal: Model the probability of acoustic features for each phone model, i.e. p(X | /ae/) • Method: Hidden Markov Models (HMM) trained through maximum likelihood (EM) or discriminative methods • Challenges/variability: • Background noise: Cocktail Party Effect • Dialect/accent • Speaker • Phonetic context: “Italy” vs “Italian” (the /t/ is pronounced differently) • No spaces in speech: “Wreck a nice beach” vs “Recognize speech”
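
To make p(X | phone) concrete, here is a minimal sketch of the forward algorithm on a three-state left-to-right HMM. It assumes discrete observation symbols for brevity (real acoustic models typically use continuous densities), and every probability below is illustrative.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) by the forward algorithm, summing over all state paths."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
                 for s in range(n)]
    return sum(alpha)


# Hypothetical model for one phone: states 0 -> 1 -> 2 with self-loops.
start = [1.0, 0.0, 0.0]
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
emit = [{"a": 0.8, "b": 0.2},
        {"a": 0.3, "b": 0.7},
        {"a": 0.5, "b": 0.5}]

likelihood = forward_likelihood(["a", "b", "b"], start, trans, emit)
```

A recognizer would compute such a likelihood for each competing phone model and compare them. Training the transition and emission tables (for example with EM) is outside this sketch.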

  45. Word Lexicon • Goal: Map legal phone sequences into words according to phonotactic rules: David /d/ /ey/ /v/ /ih/ /d/ • Multiple pronunciations: Several words may have multiple pronunciations: Data /d/ /ae/ /t/ /ax/ or Data /d/ /ey/ /t/ /ax/ • Challenges: • How do you generate a word lexicon automatically? Letter-to-sound (LTS) rules can be automatically trained with decision trees (CART) with less than 8% errors, but proper nouns are hard! • How do you add new variant dialects and word pronunciations?
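
One simple way to hold such a lexicon is a dictionary from words to lists of pronunciations. The phone strings below are copied from the slide; the data layout and the helper names `words_for` and `add_pronunciation` are assumptions made for illustration.

```python
# Word -> list of pronunciations, each a list of phone symbols.
LEXICON = {
    "david": [["d", "ey", "v", "ih", "d"]],
    "data": [["d", "ae", "t", "ax"], ["d", "ey", "t", "ax"]],
}


def words_for(phones, lexicon=LEXICON):
    """Return every word that has `phones` as one of its pronunciations."""
    return sorted(w for w, prons in lexicon.items() if list(phones) in prons)


def add_pronunciation(word, phones, lexicon=LEXICON):
    """Register a new variant, skipping it if it is already listed."""
    variants = lexicon.setdefault(word, [])
    if list(phones) not in variants:
        variants.append(list(phones))


matches = words_for(["d", "ey", "t", "ax"])
```

Adding a dialect variant is then a call to `add_pronunciation`. Generating the entries automatically, which the slide raises as the hard part, is not covered here.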

  46. Pattern Classification • Goal: Find the “optimal” word sequence by combining information (probabilities) from the acoustic model, word lexicon, and language model • Method: Decoder searches through all possible recognition choices using a Viterbi decoding algorithm • Challenge: Searching a large network space is computationally expensive for large-vocabulary ASR; beam search and weighted finite-state transducers (WFST) are used to keep the search efficient
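
The search idea can be sketched with Viterbi decoding and a simple beam on a two-state toy model. This is not a large-vocabulary or WFST decoder. The states, the observation symbols, and all probabilities are invented for illustration.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit, beam=math.inf):
    """Most likely state path; hypotheses more than `beam` below the best are pruned."""
    best = {s: (log_start[s] + log_emit[s][obs[0]], [s]) for s in states}
    for symbol in obs[1:]:
        top = max(score for score, _ in best.values())
        # Beam pruning: keep only hypotheses close to the current best score.
        active = {s: v for s, v in best.items() if v[0] >= top - beam}
        best = {}
        for s in states:
            score, path = max(
                (prev_score + log_trans[p][s], prev_path)
                for p, (prev_score, prev_path) in active.items()
            )
            best[s] = (score + log_emit[s][symbol], path + [s])
    return max(best.values())[1]


def logs(table):
    """Convert a table of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in table.items()}


states = ["sil", "speech"]
log_start = logs({"sil": 0.8, "speech": 0.2})
log_trans = {"sil": logs({"sil": 0.7, "speech": 0.3}),
             "speech": logs({"sil": 0.2, "speech": 0.8})}
log_emit = {"sil": logs({"lo": 0.9, "hi": 0.1}),
            "speech": logs({"lo": 0.2, "hi": 0.8})}

obs = ["lo", "hi", "hi"]
full_path = viterbi(obs, states, log_start, log_trans, log_emit)
beam_path = viterbi(obs, states, log_start, log_trans, log_emit, beam=math.log(10))
```

On this input the beam drops the unlikely "speech" hypothesis at the first frame and still finds the same path. In general a tight beam trades accuracy for speed, since it can prune the hypothesis that would have won.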

  47. Confidence Scoring • Goal: Identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU and DM. • Method: A confidence score based on a hypothesis likelihood ratio test is associated with each recognized word: Label: credit please; Recognized: credit fees; Confidence: (0.9) (0.3) • Command-and-control: false rejection and false acceptance => ROC curves • Challenges: Rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
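
The trade-off between false rejection and false acceptance can be sketched by sweeping a rejection threshold over scored words. The sketch takes the confidence scores as given and does not compute the likelihood ratio itself. Only the first two pairs follow the slide's example; the rest are made up.

```python
def error_rates(scored, threshold):
    """Return (false rejection rate, false acceptance rate) at a threshold.

    `scored` is a list of (confidence, is_correct) pairs; a word whose
    confidence is below `threshold` is rejected.
    """
    correct = [c for c, ok in scored if ok]
    wrong = [c for c, ok in scored if not ok]
    false_rejection = sum(c < threshold for c in correct) / len(correct)
    false_acceptance = sum(c >= threshold for c in wrong) / len(wrong)
    return false_rejection, false_acceptance


# "credit" (0.9, correct) and "fees" (0.3, wrong) follow the slide.
scored = [(0.9, True), (0.3, False), (0.6, True),
          (0.7, False), (0.4, True), (0.2, False)]

# One (threshold, false rejection, false acceptance) point per threshold.
roc = [(t, *error_rates(scored, t)) for t in (0.25, 0.5, 0.75)]
```

Raising the threshold lowers false acceptance and raises false rejection; plotting these points over many thresholds gives the ROC curve the slide mentions.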

  48. Voice-enabled System Technology Components [Diagram: Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech, with Data and Rules at the center]

  49. Text-to-Speech Systems [Diagram: TTS Engine pipeline. Raw text or tagged text → Text Analysis (Document Structure Detection, Text Normalization, Linguistic Analysis) → tagged text → Phonetic Analysis (Homograph Disambiguation, Grapheme-to-Phoneme Conversion) → tagged phones → Prosodic Analysis (Pitch & Duration Attachment) → controls → Speech Synthesis (Voice Rendering) → Speech Audio Out]
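
As a small illustration of the text normalization stage only, the sketch below expands a few abbreviations and spells out digits. The abbreviation table, the digit-by-digit reading, and the name `normalize` are assumptions made here, not the method of any particular TTS engine.

```python
import re

# Hypothetical abbreviation table. "st." could also mean "saint", which is
# the kind of ambiguity a later disambiguation stage would have to resolve.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]


def normalize(text):
    """Lowercase, expand known abbreviations, and read digits one by one."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isascii() and token.isdigit():
            words.extend(ONES[int(d)] for d in token)
        else:
            # Strip punctuation, keeping letters and apostrophes.
            words.append(re.sub(r"[^a-z']", "", token))
    return [w for w in words if w]


spoken = normalize("Dr. Smith lives at 42 Main St.")
```

The resulting word list is what the later stages (grapheme-to-phoneme conversion, prosody, synthesis) would consume.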

  50. Multimedia Customer Care (Courtesy of AT&T)
