
Application of Speech Recognition, Synthesis, Dialog


Presentation Transcript


  1. Application of Speech Recognition, Synthesis, Dialog

  2. Speech for communication • The difference between speech and language • Speech recognition and speech understanding

  3. Speech recognition can only identify words • System does not know what you want • System does not know who you are

  4. Speech and Audio Processing • Signal processing: • Convert the audio wave into a sequence of feature vectors • Speech recognition: • Decode the sequence of feature vectors into a sequence of words • Semantic interpretation: • Determine the meaning of the recognized words • Dialog Management: • Correct errors and help get the task done • Response Generation • What words to use to maximize user understanding • Speech synthesis: • Generate synthetic speech from a ‘marked-up’ word string
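
The signal-processing step above (audio wave → sequence of feature vectors) can be sketched in Python with NumPy. This is a toy feature extractor, not a production front end: real recognizers use MFCC or filter-bank features, and the 25 ms frame / 10 ms hop sizes are common but illustrative choices.

```python
import numpy as np

def frame_features(signal, rate=16000, frame_ms=25, hop_ms=10, n_bands=13):
    """Convert an audio wave into a sequence of feature vectors:
    window the signal, take an FFT per frame, and compress each
    spectrum into a small log-energy vector (a stand-in for MFCCs)."""
    frame_len = int(rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(rate * hop_ms / 1000)           # 160 samples at 16 kHz
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame_len] * window))
        bands = np.array_split(spectrum, n_bands)
        feats.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    return np.array(feats)

# one second of a 440 Hz tone as a stand-in for real speech
t = np.linspace(0, 1, 16000, endpoint=False)
feats = frame_features(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 13): 98 frames, 13 coefficients each
```

The recognizer then decodes this (frames × coefficients) matrix into a word sequence.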

  5. Data Flow • Part I: Signal Processing → Speech Recognition → Semantic Interpretation • Part II: Discourse Interpretation → Dialog Management → Response Generation → Speech Synthesis

  6. Semantic Interpretation: Word Strings • Content is just words • System: What is your address? • User: My address is fourteen eleven main street • Need concept extraction / keyword(s) spotting • Applications • template filling • directory services • information retrieval
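
The address example above can be handled by simple keyword spotting. A minimal sketch in Python follows; the `NUMBER_WORDS` and `STREET_TYPES` tables and the `extract_address` helper are hypothetical names invented for illustration, not part of any real system.

```python
# toy concept extraction via keyword spotting
NUMBER_WORDS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
                "fourteen": 14, "fifteen": 15}
STREET_TYPES = {"street", "avenue", "road", "drive"}

def extract_address(utterance):
    tokens = utterance.lower().split()
    digits, street = [], []
    for i, tok in enumerate(tokens):
        if tok in NUMBER_WORDS:                 # spot number words anywhere
            digits.append(str(NUMBER_WORDS[tok]))
        elif tok in STREET_TYPES:               # street name = word before type
            street = tokens[i - 1:i + 1]
    return {"number": "".join(digits), "street": " ".join(street)}

print(extract_address("My address is fourteen eleven main street"))
# {'number': '1411', 'street': 'main street'}
```

Note that the system never parses the full sentence; it only spots the content-bearing keywords, which is exactly what makes this approach robust for template filling.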

  7. Semantic Interpretation: Pattern-Based • Simple (typically regular) patterns specify content • ATIS (Air Traffic Information System) Task: • System: What are your travel plans? • User: [On Monday], I’m going [from Boston] [to San Francisco]. • Content: [DATE=Monday, ORIGIN=Boston, DESTINATION=SFO]
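
The ATIS example can be sketched with regular-expression patterns, one per slot. The patterns below are illustrative stand-ins, not the actual ATIS grammar.

```python
import re

# simple regular patterns for ATIS-style slots (illustrative only)
PATTERNS = {
    "DATE": r"\bon (monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
    "ORIGIN": r"\bfrom ([a-z ]+?)(?= to\b|[.,]|$)",
    "DESTINATION": r"\bto ([a-z ]+?)(?=[.,]|$)",
}

def extract_slots(utterance):
    slots = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, utterance.lower())
        if match:
            slots[name] = match.group(1).strip()
    return slots

print(extract_slots("On Monday, I'm going from Boston to San Francisco."))
# {'DATE': 'monday', 'ORIGIN': 'boston', 'DESTINATION': 'san francisco'}
```

Because each pattern matches independently, the extractor degrades gracefully: an utterance that mentions only a destination still yields a partial content structure.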

  8. Robustness and Partial Success • Controlled Speech • limited task vocabulary; limited task grammar • Spontaneous Speech • Can have high out-of-vocabulary (OOV) rate • Includes restarts, word fragments, omissions, phrase fragments, disagreements, and other disfluencies • Contains much grammatical variation • Causes high word error-rate in recognizer • Interpretation is often partial, allowing: • omission • parsing fragments

  9. Speech Dialog Management

  10. Discourse & Dialog Processing • Discourse interpretation: • Understand what the user really intends by interpreting utterances in context • Dialog management: • Determine system goals in response to user utterances based on user intention • Response generation: • Generate natural language utterances to achieve the selected goals

  11. Discourse Interpretation • Goal: understand what the user really intends • Example: Can you move it? • What does “it” refer to? • Is the utterance intended as a simple yes-no query or a request to perform an action? • Issues addressed: • Reference resolution • Intention recognition • Interpret user utterances in context

  12. Reference Resolution
  U: Where is A Bug’s Life playing in Monroeville?
  S: A Bug’s Life is playing at the Carmike theater.
  U: When is it playing there?
  S: It’s playing at 2pm, 5pm, and 8pm.
  U: I’d like 1 adult and 2 children for the first show. How much would that be?
  • Knowledge sources: • Domain knowledge • Discourse knowledge • World knowledge

  13. Reference Resolution • Focus stacks: • Maintain recent objects in stack • Select objects that satisfy semantic/pragmatic constraints starting from top of stack • Take into account discourse structure • Rule-based filtering & ranking of objects for pronoun resolution
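
A focus stack as described above can be sketched in a few lines of Python; the `FocusStack` class and its attribute names are hypothetical, and real systems add discourse-structure tracking and rule-based ranking on top of this.

```python
# minimal focus-stack sketch for reference resolution
class FocusStack:
    def __init__(self):
        self.entities = []          # most recently mentioned object last

    def push(self, name, **attrs):
        self.entities.append({"name": name, **attrs})

    def resolve(self, **constraints):
        """Search from the top of the stack for the most recent entity
        that satisfies all semantic/pragmatic constraints."""
        for entity in reversed(self.entities):
            if all(entity.get(k) == v for k, v in constraints.items()):
                return entity["name"]
        return None

focus = FocusStack()
focus.push("A Bug's Life", type="movie")     # U mentions the movie
focus.push("Carmike", type="theater")        # S mentions the theater
print(focus.resolve(type="movie"))   # resolves "it" in "When is it playing there?"
print(focus.resolve(type="theater")) # resolves "there"
```

The semantic constraint (a movie can "play", a theater cannot) is what lets "it" skip over the more recently mentioned theater.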

  14. Dialog Management: Motivating Example
  S: Would you like movie showtime or theater playlist information?
  U: Movie showtime.
  S: What movie do you want showtime information about?
  U: Saving Private Ryan.
  S: At what theater do you want to see Saving Private Ryan?
  U: Carmike.
  S: Saving Private Ryan is not playing at the Carmike theater.

  15. Interacting with the user • Dialog manager: • Guide interaction through task • Map user inputs and system state into actions • Domain agents (possibly several): • Interact with back-end(s) • Interpret information using domain knowledge

  16. Dialog Management • Goal: determine what to accomplish in response to user utterances, e.g.: • Answer user question • Solicit further information • Confirm/Clarify user utterance • Notify invalid query • Notify invalid query and suggest alternative • Interface between user/language processing components and system knowledge base

  17. Graph-based systems • “Welcome to Bank ABC! Please say one of the following: Balance, Hours, Loan, ...” • “What type of loan are you interested in? Please say one of the following: Mortgage, Car, Personal, ...”
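
A graph-based system like the banking example above is essentially a finite-state machine: each state has a prompt, and edges are labeled with keywords. A minimal sketch, with made-up state and keyword names:

```python
# toy dialog graph: each state carries a prompt and keyword-labeled edges
DIALOG_GRAPH = {
    "start": {
        "prompt": "Welcome to Bank ABC! Please say one of: Balance, Hours, Loan.",
        "next": {"balance": "balance", "hours": "hours", "loan": "loan_type"},
    },
    "loan_type": {
        "prompt": "What type of loan? Please say one of: Mortgage, Car, Personal.",
        "next": {"mortgage": "mortgage", "car": "car", "personal": "personal"},
    },
}

def step(state, user_input):
    """Follow the edge whose keyword appears in the user's utterance."""
    for keyword, target in DIALOG_GRAPH[state]["next"].items():
        if keyword in user_input.lower():
            return target
    return state  # no keyword matched: stay in state and re-prompt

state = "start"
state = step(state, "I'd like a loan")     # -> "loan_type"
state = step(state, "a car loan, please")  # -> "car"
print(state)  # car
```

The graph fully fixes what can be said at each point, which is why such systems are robust but strictly system-initiative.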

  18. Frame-based systems • A network of frames, each a set of named slots to be filled • Transition from frame to frame on a keyword or phrase
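
A frame-based manager fills whichever slots the user's utterance mentions, in any order, and prompts for what is still missing. The frame and keyword tables below are hypothetical, chosen to match the movie example used elsewhere in these slides.

```python
# frame-based dialog sketch: fill slots from keywords, prompt for the rest
FRAME = {"movie": None, "theater": None, "showtime": None}

KEYWORDS = {
    "movie": ["saving private ryan", "a bug's life"],
    "theater": ["carmike", "chatham"],
    "showtime": ["2pm", "5pm", "8pm"],
}

def update_frame(frame, utterance):
    utterance = utterance.lower()
    for slot, values in KEYWORDS.items():
        for value in values:
            if value in utterance:
                frame[slot] = value
    return frame

def next_prompt(frame):
    for slot, value in frame.items():       # ask for the first empty slot
        if value is None:
            return f"What {slot} would you like?"
    return "All slots filled."

update_frame(FRAME, "Is Saving Private Ryan playing at the Carmike?")
print(next_prompt(FRAME))  # What showtime would you like?
```

Unlike the graph-based version, one utterance here filled two slots at once, so the system skips the questions it no longer needs to ask.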

  19. Application Task Complexity • Examples (simple → complex): Weather Information, ATIS, Call Routing, Automatic Banking, Travel Planning, University Course Advising • Directly affects: • Types and quantity of system knowledge • Complexity of system’s reasoning abilities

  20. Dialog Complexity • Determines what can be talked about: • The task only • Subdialog: e.g., clarification, confirmation • The dialog itself: meta-dialog • Could you hold on for a minute? • What was that click? Did you hear it? • Determines who can talk about them: • System only • User only • Both participants

  21. Dialogue Management: Process • Determines how the system will go about selecting among the possible goals • At the dialogue level, determined by system designer in terms of initiative strategies: • System-initiative: system always has control, user only responds to system questions • User-initiative: user always has control, system passively answers user questions • Mixed-initiative: control switches between system and user using fixed rules • Variable-initiative: control switches between system and user dynamically based on participant roles, dialogue history, etc.

  22. Response Generation U: Is Saving Private Ryan playing at the Chatham cinema?

  23. S provides elliptical response S: No, it’s not.

  24. S provides full response (which provides grounding information) S: No, Saving Private Ryan is not playing at the Chatham cinema.

  25. S provides full response and supporting evidence S: No, Saving Private Ryan is not playing at the Chatham cinema; the theater’s under renovation.

  26. Communicating with the user • Language Generator: • Decide what to say to user (and how to phrase it) • Speech synthesizer: • Construct sounds and intonation • Display Generator • Action Generator

  27. Response Generation • Goal: generate natural language utterances to achieve goal(s) selected by the dialogue manager • Issues: • Content selection: determining what to say • Surface realization: determining how to say it • Generation gap: discrepancy between the actual output of the content selection process and the expected input of the surface realization process

  28. Language generation • Template-based systems • Sentence templates with variables • “Linguistic” systems • Generate surface from meaning representation • Stochastic approaches • Statistical models of domain-expert speech
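
The template-based approach above is the simplest of the three: sentence templates with variables filled in at run time. A minimal sketch, whose template names and wording are invented (the `deny_showing` output deliberately mirrors the full response on slide 24):

```python
# template-based surface realization: sentence templates with variables
TEMPLATES = {
    "deny_showing": "No, {movie} is not playing at the {theater} cinema.",
    "list_times": "{movie} is playing at {times}.",
}

def realize(goal, **slots):
    return TEMPLATES[goal].format(**slots)

print(realize("deny_showing", movie="Saving Private Ryan", theater="Chatham"))
# No, Saving Private Ryan is not playing at the Chatham cinema.
```

Templates give fluent output cheaply but do not scale: every goal/phrasing combination must be authored by hand, which is what motivates the "linguistic" and stochastic alternatives.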

  29. Dialog Evaluation • Goal: determine how “well” a dialogue system performs • Main difficulties: • No strict right or wrong answers • Difficult to determine what features make a dialogue system better than another • Difficult to select metrics that contribute to the overall “goodness” of the system • Difficult to determine how the metrics compensate for one another • Expensive to collect new data for evaluating incremental improvement of systems

  30. Dialog Evaluation (Cont’d) • System-initiative, explicit confirmation: better task success rate, lower WER, longer dialogs, fewer recovery subdialogs, less natural • Mixed-initiative, no confirmation: lower task success rate, higher WER, shorter dialogs, more recovery subdialogs, more natural

  31. Speech Synthesis

  32. Speech Synthesis (Text-to-Speech, TTS) • Prior knowledge • Vocabulary from words to sounds; surface markup • Recorded prompts • Formant synthesis • Model vocal tract as source and filters • Concatenative synthesis • Record and segment expert’s voice • Splice appropriate units into full utterances • Intonation modeling

  33. Recorded Prompts • The simplest (and most common) solution is to record prompts spoken by a (trained) human • Produces human quality voice • Limited by number of prompts that can be recorded • Can be extended by limited cut-and-paste or template filling

  34. The Source-Filter Model of Formant Synthesis • Model of features to be extracted and fitted • Excitation or Voicing Source(s) to model sound source • standard wave of glottal pulses for voiced sounds • randomly varying noise for unvoiced sounds • modification of airflow due to lips, etc. • high frequency (F0 rate), quasi-periodic, choppy • modeled with vector of glottal waveform patterns in voiced regions • Acoustic Filter(s) • shapes the frequency character of vocal tract and radiation character at the lips • relatively slow (samples around 5ms suffice) and stationary • modeled with LPC (linear predictive coding)
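
The source-filter idea can be sketched numerically: a quasi-periodic impulse train at the F0 rate stands in for the glottal source, and a cascade of second-order resonators stands in for the vocal-tract filter. This is a toy illustration, assuming NumPy; the formant values and pole radius are made-up constants, and real formant synthesizers use richer glottal waveforms and LPC-derived filters.

```python
import numpy as np

def synthesize_vowel(f0=120, formants=(700, 1200), rate=16000, dur=0.5):
    """Toy source-filter synthesis: a glottal impulse train at the F0 rate
    (the source) passed through second-order resonators placed at the
    formant frequencies (the filter)."""
    n = int(rate * dur)
    source = np.zeros(n)
    source[::rate // f0] = 1.0           # quasi-periodic voiced excitation
    out = source
    for f in formants:                   # cascade one resonator per formant
        r = 0.97                         # pole radius sets formant bandwidth
        theta = 2 * np.pi * f / rate
        a1, a2 = -2 * r * np.cos(theta), r * r
        y = np.zeros(n)
        for i in range(n):               # y[i] = x[i] - a1*y[i-1] - a2*y[i-2]
            y[i] = out[i] - a1 * y[i - 1] - a2 * y[i - 2]
        out = y
    return out / np.max(np.abs(out))

wave = synthesize_vowel()
print(wave.shape)  # (8000,): half a second at 16 kHz
```

Swapping the impulse train for random noise gives unvoiced sounds, exactly the source distinction the slide describes.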

  35. Concatenative Synthesis • Record basic inventory of sounds • Retrieve appropriate sequence of units at run time • Concatenate and adjust durations and pitch • Synthesize waveform
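
The splicing step can be sketched as a crossfade at each unit boundary. A minimal sketch assuming NumPy; the 5 ms fade length is an illustrative choice, and real systems also adjust duration and pitch (see the signal-processing slide below).

```python
import numpy as np

def concatenate_units(units, rate=16000, fade_ms=5):
    """Splice recorded units end to end, crossfading a few milliseconds at
    each join to smooth the concatenation boundary."""
    fade = int(rate * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, fade)
    out = units[0].astype(float)
    for unit in units[1:]:
        out[-fade:] = out[-fade:] * (1 - ramp) + unit[:fade] * ramp
        out = np.concatenate([out, unit[fade:]])
    return out

# two dummy 1000-sample "units" standing in for recorded diphones
a, b = np.ones(1000), np.zeros(1000)
joined = concatenate_units([a, b])
print(len(joined))  # 1920 = 1000 + 1000 - 80 overlapped samples
```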

  36. Diphone and Polyphone Synthesis • Phone sequences capture co-articulation • Cut speech in positions that minimize context contamination • Need single phones, diphones and sometimes triphones • Reduce number collected by • phonotactic constraints • collapsing in cases of no co-articulation • Data Collection Methods • Collect data from a single (professional) speaker • Select text with maximal coverage (typically with greedy algorithm), or • Record minimal pairs in desired contexts (real words or nonsense)

  37. Signal Processing for Concatenative Synthesis • Diphones recorded in one context must be generated in other contexts • Features are extracted from recorded units • Signal processing manipulates features to smooth boundaries where units are concatenated • Signal processing modifies signal via ‘interpolation’ • intonation • duration

  38. Intonation in Bell Labs TTS • Generate a sequence of F0 targets for synthesis • Example: • We were away a year ago. • phones: w E w R & w A & y E r & g O source: Multilingual Text-to-Speech Synthesis, R. Sproat, ed., Kluwer, 1998
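
Turning a handful of discrete F0 targets into a continuous pitch contour can be sketched as simple interpolation. The target times and values below are made up for illustration; a real system derives them from the prosodic structure of the utterance.

```python
import numpy as np

# hypothetical (time in seconds, F0 in Hz) targets for a short utterance
targets = [(0.0, 130), (0.2, 180), (0.5, 110), (0.8, 90)]
times, f0s = zip(*targets)

# linearly interpolate a pitch contour sampled every 100 ms
t = np.linspace(0.0, 0.8, 9)
contour = np.interp(t, times, f0s)
print(contour)
```

The synthesizer then imposes this contour on the concatenated units when it adjusts pitch.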

  39. What you can do with Speech Recognition • Transcription • dictation, information retrieval • Command and control • data entry, device control, navigation, call routing • Information access • airline schedules, stock quotes, directory assistance • Problem solving • travel planning, logistics

  40. Human-machine interface is critical • Speech recognition is NOT the core function of most applications • Speech is a feature of applications that offers specific advantages • Errorful recognition is a fact of life

  41. Properties of Recognizers • Speaker Independent vs. Speaker Dependent • Large Vocabulary (2K-200K words) vs. Limited Vocabulary (2-200) • Continuous vs. Discrete • Speech Recognition vs. Speech Verification • Real Time vs. multiples of real time • Spontaneous Speech vs. Read Speech • Noisy Environment vs. Quiet Environment • High Resolution Microphone vs. Telephone vs. Cellphone • Push-and-hold vs. push-to-talk vs. always-listening • Adapt to speaker vs. non-adaptive • Low vs. High Latency • With online incremental results vs. final results • Dialog Management

  42. Speech Recognition vs. Touch Tone • Shorter calls • Choices mean something • Automate more tasks • Reduces annoying operations • Available

  43. Transcription and Dictation • Transcription is transforming a stream of human speech into computer-readable form • Medical reports, court proceedings, notes • Indexing (e.g., broadcasts) • Dictation is the interactive composition of text • Report, correspondence, etc.

  44. SpeechWear • Vehicle inspection task • USMC mechanics, fixed inspection form • Wearable computer (COTS components) • html-based task representation • film clip

  45. Speech recognition and understanding • Sphinx system • speaker-independent • continuous speech • large vocabulary • ATIS system • air travel information retrieval • context management • film clip (1994)

  46. Sample Market: Call Centers • Automate services, lower payroll • Shorten time on hold • Shorten agent and client call time • Reduce fraud • Improve customer service

  47. Interface guidelines • State transparency • Input control • Error recovery • Error detection • Error correction • Log performance • Application integration

  48. Applications related to Speech Recognition • Speech Recognition: figure out what a person is saying • Speaker Verification: authenticate that a person is who he/she claims to be (limited speech patterns) • Speaker Identification: assign an identity to the voice of an unknown person (arbitrary speech patterns)
