
Human/Computer Communications Using Speech


Presentation Transcript


  1. Human/Computer Communications Using Speech Ellis K. ‘Skip’ Cave InterVoice-Brite Inc. skip@intervoice.com

  2. Famous Human/Computer Communication - 1968

  3. InterVoice-Brite • Twenty years building speech applications • Largest provider of VUI applications and systems in the world • Turnkey Systems • Hardware, software, application design, managed services • 1,000s of installations worldwide • Banking, Travel, Stock Brokerage, Help Desk, etc. • Bank of America • American Express • E-Trade • Microsoft help desk

  4. Growth of Speech-Enabled Applications • Analysts estimate that 15% of IVR ports sold in 2000 were speech-enabled • By 2004, 48.5% of IVR ports sold will be speech-enabled • Source: Frost & Sullivan - U.S. IVR Systems Market, 2001 • IVB estimates that in 2002, 50% of IVB ports sold will be speech-enabled.

  5. Overview • Brief History of Speech Recognition • How ASR works • Directed Dialog & Applications • Standards & Trends • Natural Language & Applications

  6. History • Natural Language Processing • Computational Linguistics • Computer Science • Text understanding • Auto translation • Question/Answer • Web search • Speech Recognition • Electrical Engineering • Speech-to-text • Dictation • Control

  7. Turing Test • Alan M. Turing • Paper: “Computing Machinery and Intelligence” (Mind, 1950 - Vol. 59, No. 236, pp. 433-460) • First two sentences of the article: • “I propose to consider the question, ‘Can machines think?’ This should begin with definitions of the meaning of the terms ‘machine’ and ‘think.’” • To answer this question, Turing proposed the “Imitation Game,” later named the “Turing Test” • Requires an interrogator & 2 subjects

  8. Turing Test [Diagram: an Observer converses with Subject #1 and Subject #2 - which subject is a machine?]

  9. Turing Test • Turing assumed communications would be written (typed) • Assumed communications would be unrestricted as to subject • Predicted that the test would be “passed” in 50 years (2000) • The ability to communicate is equated to “thinking” and “intelligence”

  10. Turing Test - 50 Years Later • Today - NL systems still unable to fool the interrogator on unrestricted subjects • Speech input & output possible • Transactional dialogs in restricted subject areas possible • Question/Answer queries feasible on large text databases • May not fool the interrogator, but can provide useful functions • Travel Reservations, Stock Brokerages, Banking, etc.

  11. Speech Recognition

  12. Voice Input - The New Paradigm • Automatic Speech Recognition (ASR) • Tremendous technical advances in the last few years • From small to large vocabularies • 5,000 - 10,000 word vocabulary • Stock brokerage - E-Trade - Ameritrade • Travel - Travelocity, Delta Airlines • From isolated word to connected words • Modern ASR recognizes connected words • From speaker dependent to speaker independent • Modern ASR is fully speaker independent • Natural Language

  13. Signal Processing Front-End [Diagram: feature extraction - each analysis frame yields 13 parameters]

  14. Overlapping Sample Windows • 25 ms windows with 15 ms overlap (10 ms shift) → 100 frames/sec
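To make the framing step concrete, here is a minimal NumPy sketch (an illustration added for this transcript, not from the original deck), assuming a 16 kHz mono signal long enough for at least one window:

import numpy as np

def frame_signal(signal, rate=16000, win_ms=25, hop_ms=10):
    # Illustrative sketch: split a waveform into overlapping 25 ms windows
    # with a 10 ms hop, i.e. 15 ms overlap between frames -> 100 frames/sec.
    win = int(rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(rate * hop_ms / 1000)   # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)   # taper each window before the FFT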

  15. Cepstrum • Cepstrum is the inverse Fourier transform of the log spectrum
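In code, that definition is only a few lines; a sketch (illustrative, not from the deck) using NumPy:

import numpy as np

def real_cepstrum(frame):
    # Cepstrum = inverse Fourier transform of the log (magnitude) spectrum.
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # small floor avoids log(0)
    return np.fft.irfft(log_mag)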

  16. Mel Cepstral Coefficients • Construct the mel-frequency domain using triangularly-shaped weighting functions applied to mel-transformed log-magnitude spectral samples • Mel-Filtered Cepstral Coefficients (MFCCs) • Most common feature set for recognizers • Motivated by human auditory response characteristics

  17. Mel Cepstrum • After computing the DFT and the log magnitude spectrum (to obtain the real cepstrum), we compute the filterbank outputs, then use a discrete cosine transform to compute the mel-frequency cepstral coefficients • A 39-element feature vector (13 cepstral coefficients plus their delta and delta-delta terms) represents each 25 ms voice sample
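Putting slides 15-17 together, a compact sketch of the filterbank-plus-DCT step (illustrative only; real front-ends add pre-emphasis, liftering, and the delta computation that brings the vector to 39 elements):

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters=26, n_fft=512, rate=16000):
    # Triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(rate / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank

def mfcc(frame, fbank, n_ceps=13):
    # Power spectrum -> mel filterbank energies -> log -> DCT -> 13 coefficients.
    power = np.abs(np.fft.rfft(frame, 512)) ** 2
    log_mel = np.log(fbank @ power + 1e-10)
    return dct(log_mel, norm='ortho')[:n_ceps]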

  18. Cepstrum as Vector Space Features

  19. Feature Ambiguity • After the signal processing front-end • How to resolve overlap or ambiguity in Mel-Cepstrum features • Need to use context information • What precedes? What follows? • N-phones and N-grams • All probabilistic computations

  20. The Speech Recognition Problem • Find the most likely word sequence Ŵ among all possible sequences given acoustic evidence A: Ŵ = argmax_W P(W | A) • A tractable reformulation of the problem (via Bayes’ rule) is: Ŵ = argmax_W P(A | W) · P(W) • P(A | W) is the acoustic model, P(W) is the language model • A daunting search task
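As a toy illustration of that argmax (the hypotheses and scores below are invented for this transcript), combining log-domain acoustic and language model scores over an n-best list:

# Hypothetical n-best hypotheses: (words, log P(A|W), log P(W)).
hypotheses = [
    ("get me two movie tickets", -310.2, -14.1),
    ("i want to movie trips",    -309.8, -22.7),
    ("my car's too groovy",      -312.5, -19.3),
]

# W_hat = argmax_W P(A|W) * P(W), i.e. the largest sum of log scores.
best = max(hypotheses, key=lambda h: h[1] + h[2])
print(best[0])   # -> "get me two movie tickets"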

  21. ASR Resolution • Need to turn Mel Cepstrum features into probabilities • Acoustic Model (tri-phone probabilities) • Phonetic probabilities • Language Model (bi-gram probabilities) • Word probabilities • Apply Dynamic Programming techniques • Find the most likely sequence of phonemes & words • Viterbi Search

  22. Acoustic Models • Acoustic states represented by Hidden Markov Models (HMMs) • Probabilistic State Machines - the state sequence is unknown; only feature vector outputs are observed • Each state has an output symbol distribution • Each state has a transition probability distribution [Diagram: a 3-state left-to-right HMM s0 → s1 → s2 with start probability p(s0), transition probabilities t(sj | si), and output distributions q(i | s0), q(i | s1), q(i | s2)]
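A toy discrete-output HMM written out as plain matrices (invented probabilities, for illustration; real acoustic models use Gaussian mixture outputs over cepstral features). The Viterbi sketch after slide 31 searches this same model:

import numpy as np

# Toy 3-state model. trans[i, j] = t(s_j | s_i); emit[i, k] = q(symbol_k | s_i).
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
emit = np.array([[0.7, 0.2, 0.1],    # state s0 favors symbol 0
                 [0.1, 0.8, 0.1],    # state s1 favors symbol 1
                 [0.2, 0.1, 0.7]])   # state s2 favors symbol 2
start = np.array([1.0, 0.0, 0.0])    # p(s0) = 1: always begin in s0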

  23. Subword Models • Objective: Create a set of HMMs representing the basic sounds (phones) of a language • English has about 40 distinct phonemes • Need a “lexicon” for pronunciations • Letter-to-sound rules for unusual words • Problem - co-articulation effects must be modeled • “barter” vs. “bartender” • Solution - “tri-phones” - each phone modified by onset and trailing context phones

  24. Language Models • What is a language model? • Quantitative ordering of the likelihood of word sequences • Why use language models? • Not all word sequences equally likely • Search space optimization • Improved accuracy • Bridges the gap between acoustic ambiguities and ontology

  25. Finite State Grammars • Allowable word sequences are explicitly specified using a structured syntax • Creates a word network • Word sequences not in the grammar cannot be recognized • Application developer must construct the grammar • Excellent for directed dialog and closed prompting

  26. Finite-State Language Model • Narrow range of responses allowed • Only word sequences coded in grammar are recognized • Straightforward ASR engine. Follows grammar rules exactly • Easy to add words to grammar • Allows name lists • “I want to fly to $CITY” • “I want to buy $STOCK”
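A toy illustration of a finite-state grammar as a template match (the template format and city list are invented for this sketch; real systems use grammar formats such as GSL or SRGS):

# Toy grammar: "I want to fly to $CITY" as a rigid word template.
CITY = {"boston", "chicago", "dallas"}
TEMPLATE = ["i", "want", "to", "fly", "to", "$CITY"]

def in_grammar(words):
    # Only word sequences that follow the template can be recognized.
    if len(words) != len(TEMPLATE):
        return False
    for got, want in zip(words, TEMPLATE):
        if want == "$CITY":
            if got not in CITY:
                return False
        elif got != want:
            return False
    return True

print(in_grammar("i want to fly to boston".split()))   # True
print(in_grammar("fly me to the moon".split()))        # False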

  27. Statistical Language Models (Stochastic Context-Free Grammars) • Only specify word transition probabilities • N-gram language model • Required for open-ended prompts: “How may I direct your inquiry?” • Much more difficult to analyze possible results • Not for every interaction • Data, Data, Data: 10,000+ transcribed responses for each input task
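A minimal sketch of estimating bigram probabilities from transcribed caller responses (three toy transcriptions stand in for the 10,000+ a real SLM needs):

from collections import Counter

# Toy corpus of transcribed responses.
transcripts = [
    "i want my account balance",
    "i want to transfer funds",
    "check my account balance",
]

unigrams, bigrams = Counter(), Counter()
for line in transcripts:
    words = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def p_bigram(prev, word):
    # Maximum-likelihood P(word | prev); production systems smooth these counts.
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("my", "account"))   # 1.0 in this tiny corpus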

  28. Statistical State Machines

  29. Mixed Language Models • SLM statistics are unstable (useless) unless examples of each word in each context are presented • Consider a flight reservation tri-gram language model: I’d like to fly from Boston to Chicago on Monday Training sentences required for 100 cities: (100*100 + 100*7) = 10,700 • A better way is to consider classes of words: I’d like to fly from $(CITY) to $(CITY) on $(DATE) Only one transcription is needed to represent 70,000 variations
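A sketch of the class substitution step (city and date lists invented for illustration): map class members to class tokens before counting n-grams, so one transcription covers all member combinations:

# Toy class definitions for a flight-reservation domain.
CITIES = {"boston", "chicago", "dallas"}
DATES = {"monday", "tuesday", "wednesday"}

def to_classes(words):
    # Replace each class member with its class token, e.g. "boston" -> "$CITY".
    out = []
    for w in words:
        if w in CITIES:
            out.append("$CITY")
        elif w in DATES:
            out.append("$DATE")
        else:
            out.append(w)
    return out

sent = "i'd like to fly from boston to chicago on monday".split()
print(to_classes(sent))
# ["i'd", 'like', 'to', 'fly', 'from', '$CITY', 'to', '$CITY', 'on', '$DATE']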

  30. Viterbi • How do you determine the most probable utterance? • The Viterbi search uses dynamic programming to find the most likely path through the acoustic model and the language model (extensions of the search return the n-best paths)

  31. Dynamic Programming (Viterbi)
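A minimal log-space Viterbi decoder over the toy HMM from the slide-22 sketch (illustrative only; a real recognizer searches a far larger network composed of tri-phone and word states):

import numpy as np

# Toy model repeated from the slide-22 sketch.
trans = np.array([[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
emit  = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]])
start = np.array([1.0, 0.0, 0.0])

def viterbi(obs):
    # Most likely state path for a symbol sequence, computed in log space
    # so that long sequences do not underflow.
    logt, loge = np.log(trans + 1e-12), np.log(emit + 1e-12)
    score = np.log(start + 1e-12) + loge[:, obs[0]]
    back = []
    for o in obs[1:]:
        cand = score[:, None] + logt        # cand[i, j]: best score ending in j via i
        back.append(cand.argmax(axis=0))    # remember the best predecessor of j
        score = cand.max(axis=0) + loge[:, o]
    path = [int(score.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))

print(viterbi([0, 1, 1, 2]))   # -> [0, 1, 1, 2] for this toy model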

  32. N-Best Speech Results [Diagram: speech waveform + grammar → ASR → N-best results: N=1 “Get me two movie tickets…”, N=2 “I want to movie trips…”, N=3 “My car’s too groovy”] • ASR converts speech to text • Uses a “grammar” to guide recognition • Focus on “speaker independent” ASRs • Must allow for open context

  33. What does it all Mean? Text output is nice, but how do we represent meaning?
• Finite state grammars - constructs can be tagged with semantics:
  <item> get me the operator <tag>OPERATOR</tag> </item>
• SLM uses concept spotting:
  Itinerary:slm "flightinfo.pfsg" = FlightConcepts
  FlightConcepts [
    (from City:c) {<origin $c>}
    (to City:c) {<dest $c>}
    (on Date:d) {<date $d>}
  ]
• Concepts may also be trained statistically - but that requires even more data!
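In the same spirit as the FlightConcepts grammar above, a rough concept-spotting sketch in Python (the slot names mirror the slide; the regular expressions and word lists are illustrative only):

import re

# Illustrative word lists; a real system uses full grammars per concept.
CITY = r"(boston|chicago|dallas)"
patterns = {
    "origin": re.compile(rf"from {CITY}"),
    "dest":   re.compile(rf"to {CITY}"),
    "date":   re.compile(r"on (monday|tuesday|wednesday)"),
}

def spot_concepts(text):
    # Scan recognized text for each concept and fill the matching slot.
    slots = {}
    for slot, pat in patterns.items():
        match = pat.search(text)
        if match:
            slots[slot] = match.group(1)
    return slots

print(spot_concepts("i'd like to fly from boston to chicago on monday"))
# {'origin': 'boston', 'dest': 'chicago', 'date': 'monday'}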

  34. Directed Dialogs

  35. Directed Dialog • Finite-State Grammars - Currently most common method to implement speech-enabled applications • More flexible & user-friendly than key (Touch-Tone) input • Allows Spoken List selection • System: “What City are you leaving from?” • User: “Birmingham” • Keywords easier to remember than numeric codes • “Account balance” instead of “two” • Easy to skip ahead through menus • Tellme - “Sports, Basketball, Mavericks”

  36. Issues With Directed Dialog • Computer asks all the questions • Usually presented as a menu • “Do you want your account balance, cleared checks, or deposits?” • Computer always has the initiative • User just answers questions, never gets to ask any questions • All possible answers must be pre-defined by the application developer (grammars) • Will eventually get the job done, but can be tedious • Still much better than Touch-Tone menus

  37. Issues With Directed Dialog • Application developer must design scripts that never have the machine ask open-ended questions • “What can I do for you?” • Application developer’s job - design questions whose answers can be explicitly predicted • “Do you want to buy or sell stocks?” • Developer must explicitly define all possible responses • Buy, purchase, get some, acquire • Sell, dump, get rid of it
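A sketch of one directed-dialog step under these constraints (the phrase lists come from the slide; the mapping style is invented for illustration):

# Every acceptable answer is enumerated in advance and mapped to an intent.
INTENTS = {
    "buy":  {"buy", "purchase", "get some", "acquire"},
    "sell": {"sell", "dump", "get rid of it"},
}

def classify(answer):
    answer = answer.lower().strip()
    for intent, phrases in INTENTS.items():
        if answer in phrases:
            return intent
    return None   # out-of-grammar -> reprompt the caller

print(classify("purchase"))      # buy
print(classify("maybe later"))   # None -> reprompt: "Do you want to buy or sell stocks?"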

  38. Examples of Directed Dialog • Southwest Airlines • Pizza Inn • Brokerage

  39. Standards & Trends

  40. VoiceXML • VoiceXML - A web-oriented voice-application programming language • W3C Standard - www.w3.org • Version 1.0 released March 2000 • Version 2.0 ready to be approved • http://www.w3.org/TR/voicexml20/ • Voice dialogues scripted using XML structures • Other VoiceXML support • www.voicexml.org • voicexmlreview.org

  41. VoiceXML • Assumes the telephone as the user device • Voice or key input • Pre-recorded or Text-to-Speech output

  42. Why VoiceXML? • Provides environment similar to web for web developers to build speech applications • Applications are distributed on document servers similar to web • Leverages the investment companies have made in the development of a web presence. • Data from Web databases can be used in the call automation system. • Designed for distributed and/or hosted (ASP) environment.

  43. VoiceXML Architecture [Diagram: telephones on the telephone network and mobile devices reach a VoiceXML browser/gateway (voice server), which fetches VoiceXML documents from web servers over the Internet]

  44. VoiceXML Example
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Example 1 for VoiceXML Review -->
  <form>
    <block> Hello, World! </block>
  </form>
</vxml>

  45. VoiceXML Applications • Voice Portals • TellMe • 1-800-555-8355 (TELL) • http://www.tellme.com • BeVocal • 1-408-850-2255 (BVOCAL) • www.bevocal.com

  46. The VoiceXML Plan • Third party developers write VoiceXML scripts that they will publish on the web • Callers to the Voice Portals will access these voice applications like browsing the web • VoiceXML will use VUI with directed dialog • Voice output • Voice or key input • hands/eyes free or privacy

  47. Speech Application Language Tags (SALT) • Microsoft, Cisco Systems, Comverse Inc., Intel, Philips Speech Processing, and SpeechWorks • www.saltforum.org • Extension of existing Web standards such as HTML, xHTML and XML • Support multi-modal and telephone access to information, applications, and Web services, independently or concurrently.

  48. SALT - “Multi-modal” • Input might come from speech recognition, a keyboard or keypad, and/or a stylus or mouse • Output to screen or speaker (speech) • Embedded in HTML documents • Will require SALT-enabled browsers • Working Draft V1.9 • Public Release - March 2002 • Submit to IETF - midyear 2002

  49. SALT Code
<!-- Speech Application Language Tags -->
<salt:prompt id="askOriginCity"> Where would you like to leave from? </salt:prompt>
<salt:prompt id="askDestCity"> Where would you like to go to? </salt:prompt>
<salt:prompt id="sayDidntUnderstand" onComplete="runAsk()">
  Sorry, I didn't understand. </salt:prompt>
<salt:listen id="recoOriginCity"
    onReco="procOriginCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>
<salt:listen id="recoDestCity"
    onReco="procDestCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>

  50. Evolution of the Speech Interface • Touch-Tone Input • Directed Dialogue • Natural Language • Word spotting • Phrase spotting • Deep parsing
