Human/Computer Communications Using Speech

Human/Computer Communications Using Speech • Ellis K. ‘Skip’ Cave • InterVoice-Brite Inc. • skip@intervoice.com

Famous Human/Computer Communication - 1968

InterVoice-Brite • Twenty years building speech applications • Largest provider of VUI applications and systems in the world • Turnkey Systems • Hardware, software, application design, managed services • 1000’s of installations worldwide • Banking, Travel, Stock Brokerage, Help Desk, etc. • Bank of America • American Express • E-Trade • Microsoft help-desk

Growth of Speech-Enabled Applications • Analysts estimate that 15% of IVR ports sold in 2000 were speech enabled • By 2004, 48.5% of IVR ports sold will be speech-enabled • Source: Frost & Sullivan - U.S. IVR Systems Market, 2001 • IVB estimates that in 2002, 50% of IVB ports sold will be speech enabled.

Overview • Brief History of Speech Recognition • How ASR works • Directed Dialog & Applications • Standards & Trends • Natural Language & Applications

History • Natural Language Processing • Computational Linguistics • Computer Science • Text understanding • Auto translation • Question/Answer • Web search • Speech Recognition • Electrical Engineering • Speech-to-text • Dictation • Control

Turing Test • Alan M. Turing • Paper - “Computing Machinery and Intelligence” (Mind, 1950 - Vol. 59, No. 236, pp. 433-460) • First two sentences of the article: • I propose to consider the question, "Can machines think?" This should begin with definitions of the meaning of the terms "machine" and "think." • To answer this question, Turing proposed the “Imitation Game”, later named the “Turing Test” • Requires an Interrogator & 2 subjects

Turing Test (diagram) • An Observer communicates with Subject #1 and Subject #2 • Which subject is a machine?

Turing Test • Turing assumed communications would be written (typed) • Assumed communications would be unrestricted as to subject • Predicted that test would be “passed” in 50 years (2000) • The ability to communicate is equated to “Thinking” and “intelligence”

Turing Test - 50 Years Later • Today - NL systems still unable to fool interrogator on unrestricted subjects • Speech Input & Output possible • Transactional dialogs in restricted subject areas possible • Question/Answer queries feasible on large text databases • May not fool the interrogator, but can provide useful functions • Travel Reservations, Stock Brokerages, Banking, etc.

Speech Recognition

Voice Input - The New Paradigm • Automatic Speech Recognition (ASR) • Tremendous technical advances in the last few years • From small to large vocabularies • 5,000 - 10,000 word vocabulary • Stock brokerage - E-Trade - Ameritrade • Travel - Travelocity, Delta Airlines • From isolated word to connected words • Modern ASR recognizes connected words • From speaker dependent to speaker independent • Modern ASR is fully speaker independent • Natural Language

Signal Processing Front-End (diagram) • Feature Extraction • 13 Parameters (label repeated three times in the diagram)

Overlapping Sample Windows • 25 ms sample window • 15 ms overlap between adjacent windows (10 ms step) • 100 windows/sec.
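The windowing on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the front-end of any particular recognizer: the function name `frame_signal` and the 8 kHz sampling rate are assumptions chosen for the example, while the 25 ms window and 10 ms step (25 ms minus the 15 ms overlap) come from the slide.

```python
def frame_signal(samples, sample_rate, window_ms=25, shift_ms=10):
    """Split samples into overlapping windows.

    A 25 ms window with a 15 ms overlap advances 10 ms per window,
    which yields roughly 100 windows per second.
    """
    window = sample_rate * window_ms // 1000
    shift = sample_rate * shift_ms // 1000
    frames = []
    start = 0
    # Stop when a full window no longer fits in the remaining samples.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += shift
    return frames


# One second of silence at an assumed 8 kHz telephone sampling rate.
frames = frame_signal([0.0] * 8000, sample_rate=8000)
print(len(frames), len(frames[0]))
```

The count comes out slightly under 100 for exactly one second of audio because the final partial windows are dropped; padding the tail is another reasonable choice.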

Cepstrum • Cepstrum is the inverse Fourier transform of the log spectrum
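As a rough sketch of the definition above, the real cepstrum of one window can be computed with a naive DFT in pure Python. The helper names (`dft`, `inverse_dft`, `real_cepstrum`) and the small `floor` that guards against taking the log of zero are assumptions made for this example; a production front-end would use an FFT library instead.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(spectrum):
    """Naive O(n^2) inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def real_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log magnitude spectrum of one window."""
    # The floor keeps log() defined when a spectral bin is exactly zero.
    log_magnitude = [math.log(max(abs(c), floor)) for c in dft(frame)]
    return [c.real for c in inverse_dft(log_magnitude)]


# A short synthetic window: two cycles of a sine wave over 16 samples.
frame = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
cepstrum = real_cepstrum(frame)
```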

Mel Cepstral Coefficients • Construct the mel-frequency domain using a triangularly-shaped weighting function applied to mel-transformed log-magnitude spectral samples • Mel-Filtered Cepstral Coefficients • Most common feature set for recognizers • Motivated by human auditory response characteristics

Mel Cepstrum • After computing the DFT and the log magnitude spectrum (to obtain the real cepstrum), we compute the filterbank outputs, and then use a discrete cosine transform to compute the mel-frequency cepstrum coefficients • 39-element feature vector representing one 25 ms voice sample

Cepstrum as Vector Space Features

Feature Ambiguity • After the signal processing front-end • How to resolve overlap or ambiguity in Mel-Cepstrum features • Need to use context information • What precedes? What follows? • N-phones and N-grams • All probabilistic computations

The Speech Recognition Problem • Find the most likely word sequence Ŵ among all possible sequences given acoustic evidence A • A tractable reformulation of the problem has three parts: • Acoustic model • Language model • Daunting search task
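The equation itself is not present in this transcript. The standard Bayes-rule formulation that matches the slide's labels is given here as an assumption about what the slide showed, with W ranging over candidate word sequences:

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W}
          \underbrace{P(A \mid W)}_{\text{acoustic model}}\;
          \underbrace{P(W)}_{\text{language model}}
```

P(A) can be dropped because it does not depend on W, and the arg max over all word sequences is the search task.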

ASR Resolution • Need to convert Mel Cepstrum features into probabilities • Acoustic Model (tri-phone probabilities) • Phonetic probabilities • Language Model (bi-gram probabilities) • Word probabilities • Apply Dynamic Programming techniques • Find most-likely sequence of phonemes & words • Viterbi Search

Acoustic Models • Acoustic states represented by Hidden Markov Models (HMMs) • Probabilistic State Machines - state sequence unknown, only feature vector outputs observed • Each state has output symbol distribution • Each state has transition probability distribution • Diagram: three states s0, s1, s2 with initial probability p(s0), self-transitions t(s0|s0), t(s1|s1), t(s2|s2), forward transitions t(s1|s0), t(s2|s1), and output distributions q(i|s0), q(i|s1), q(i|s2)

Subword Models • Objective: Create a set of HMMs representing the basic sounds (phones) of a language • English has about 40 distinct phonemes • Need “lexicon” for pronunciations • Letter to sound rules for unusual words • Problem - co-articulation effects must be modeled • “barter” vs “bartender” • Solution - “tri-phones” - each phone modified by onset and trailing context phones

Language Models • What is a language model? • Quantitative ordering of the likelihood of word sequences • Why use language models? • Not all word sequences equally likely • Search space optimization • Improved accuracy • Bridges the gap between acoustic ambiguities and ontology

Finite State Grammars • Allowable word sequences are explicitly specified using a structured syntax • Creates a word network • Word sequences not enabled do not exist! • Application developer must construct grammar • Excellent for directed dialog and closed prompting

Finite-State Language Model • Narrow range of responses allowed • Only word sequences coded in grammar are recognized • Straightforward ASR engine. Follows grammar rules exactly • Easy to add words to grammar • Allows name lists • “I want to fly to $CITY” • “I want to buy $STOCK”
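To make the idea concrete, here is a toy Python sketch of a finite-state grammar with name-list slots. It works on already-recognized text rather than audio, and the city and stock lists, the `SLOTS` table, and the `parse` function are all hypothetical illustrations rather than any vendor's grammar format. Only the two carrier phrases come from the slide.

```python
import re

# Hypothetical name lists; a real application would load these from data.
SLOTS = {
    "CITY": ["boston", "chicago", "dallas"],
    "STOCK": ["microsoft", "intel"],
}

# Phrases taken from the slide; $NAME marks a slot filled from a name list.
GRAMMAR = ["i want to fly to $CITY", "i want to buy $STOCK"]


def compile_rule(rule):
    """Turn one grammar phrase into a regex with a named group per slot."""
    parts = []
    for token in rule.split():
        if token.startswith("$"):
            name = token[1:]
            choices = "|".join(re.escape(v) for v in SLOTS[name])
            parts.append(f"(?P<{name}>{choices})")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\s+".join(parts))


RULES = [compile_rule(r) for r in GRAMMAR]


def parse(utterance):
    """Return the filled slots, or None when the grammar does not cover the words."""
    text = utterance.lower().strip()
    for rule in RULES:
        match = rule.fullmatch(text)
        if match:
            return match.groupdict()
    return None


print(parse("I want to fly to Boston"))
print(parse("I want to fly to Paris"))
```

Adding a word means appending to a name list, which is why the slide calls grammar maintenance easy. A word sequence the grammar does not list, such as the Paris request, is simply rejected.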

Statistical Language Models • Stochastic Context-Free Grammars • Only specifies word transition probabilities • N-gram language model • Required for open ended prompts: “How may I direct your inquiry?” • Much more difficult to analyze possible results • Not for every interaction • Data, Data, Data: 10,000+ transcribed responses for each input task
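A bigram model is the simplest N-gram case and shows why so much transcribed data is needed. The sketch below estimates word transition probabilities by relative frequency, with no smoothing. The three-sentence corpus is invented for illustration, and the function names are assumptions for this example.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Estimate P(next word | previous word) by relative frequency."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return {
        prev: {w: c / sum(nexts.values()) for w, c in nexts.items()}
        for prev, nexts in counts.items()
    }


def sentence_probability(model, sentence):
    """Multiply bigram probabilities; unseen transitions score zero."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= model.get(prev, {}).get(word, 0.0)
    return prob


# Toy transcriptions, invented for illustration only.
corpus = [
    "i need my account balance",
    "i need to talk to someone",
    "i want my account balance",
]
model = train_bigrams(corpus)
print(sentence_probability(model, "i want my account balance"))
print(sentence_probability(model, "i want to talk to someone"))
```

The second sentence scores zero because "want to" never appears in the training data, even though every individual word does. Covering such gaps is what the large transcription counts on the slide are for.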

Statistical State Machines

Mixed Language Models • SLM statistics are unstable (useless) unless examples of each word in each context are presented • Consider a flight reservation tri-gram language model: “I’d like to fly from Boston to Chicago on Monday” • Training sentences required for 100 cities: (100*100 + 100*7) = 10,700 • A better way is to consider classes of words: “I’d like to fly from $(CITY) to $(CITY) on $(DATE)” • Only one transcription is needed to represent 70,000 variations (100 * 100 * 7)
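The counts on this slide can be checked with a short Python sketch that expands one class-based template into every sentence it stands for. The placeholder city names and the reading of $(DATE) as the seven days of the week are assumptions taken from the slide's arithmetic. Like the slide's figure, the count includes trips whose origin and destination are the same city.

```python
from itertools import product

# Placeholder word classes sized to match the slide (100 cities, 7 days).
CITIES = [f"city{i}" for i in range(100)]
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

TEMPLATE = "I'd like to fly from {origin} to {dest} on {day}"


def expand(template, cities, days):
    """Yield every sentence the single class-based template stands for."""
    for origin, dest, day in product(cities, cities, days):
        yield template.format(origin=origin, dest=dest, day=day)


variations = sum(1 for _ in expand(TEMPLATE, CITIES, DAYS))
print(variations)  # 100 * 100 * 7 = 70000

# Word-level training needs one example per city pair and per city/day pair.
word_level_sentences = len(CITIES) * len(CITIES) + len(CITIES) * len(DAYS)
print(word_level_sentences)  # 100*100 + 100*7 = 10700
```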

Viterbi • How do you determine the most probable utterance? • The Viterbi Search returns the n-best paths through the Acoustic model and the Language Model
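A minimal Python sketch of the Viterbi search over a small HMM follows. It returns only the single best path (the N=1 case), not the n-best list the slide mentions, and it works on discrete symbols rather than feature vectors. The three-state left-to-right model and all of its probabilities are invented for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the single best path."""
    # best[s] = (log prob of the best path ending in state s, that path)
    best = {}
    for s in states:
        p = start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0)
        best[s] = (math.log(p) if p > 0 else -math.inf, [s])

    for obs in observations[1:]:
        step = {}
        for s in states:
            e = emit_p[s].get(obs, 0.0)
            candidates = []
            for prev in states:
                t = trans_p[prev].get(s, 0.0)
                gain = math.log(t * e) if t * e > 0 else -math.inf
                candidates.append((best[prev][0] + gain, best[prev][1] + [s]))
            # Keep only the best way of arriving in state s at this step.
            step[s] = max(candidates, key=lambda c: c[0])
        best = step

    return max(best.values(), key=lambda c: c[0])


# Hypothetical left-to-right model: each state may repeat or move forward.
states = ["s0", "s1", "s2"]
start_p = {"s0": 1.0}
trans_p = {
    "s0": {"s0": 0.5, "s1": 0.5},
    "s1": {"s1": 0.5, "s2": 0.5},
    "s2": {"s2": 1.0},
}
emit_p = {
    "s0": {"a": 0.9, "b": 0.1},
    "s1": {"b": 0.8, "c": 0.2},
    "s2": {"c": 1.0},
}

log_prob, path = viterbi(["a", "a", "b", "c"], states, start_p, trans_p, emit_p)
print(path, math.exp(log_prob))
```

Working in log probabilities avoids numeric underflow on long utterances. Because each step keeps only the best path into each state, the cost grows linearly with the number of observations instead of exponentially.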

Dynamic Programming (Viterbi)

N-Best Speech Results • Diagram: Speech Waveform and Grammar feed the ASR, which outputs an N-Best Result • N=1: “Get me two movie tickets…” • N=2: “I want to movie trips…” • N=3: “My car’s too groovy” • ASR converts speech to text • Use “grammar” to guide recognition • Focus on “speaker independent” ASRs • Must allow for open context

What does it all Mean? • Text output is nice, but how do we represent meaning? • Finite state grammars - constructs can be tagged with semantics:
    <item> get me the operator <tag>OPERATOR</tag> </item>
• SLM uses concept spotting:
    Itinerary:slm "flightinfo.pfsg" = FlightConcepts
    FlightConcepts [
        (from City:c) {<origin $c>}
        (to City:c) {<dest $c>}
        (on Date:d) {<date $d>}
    ]
• Concepts may also be trained statistically • but that requires even more data!

Directed Dialogs

Directed Dialog • Finite-State Grammars - Currently most common method to implement speech-enabled applications • More flexible & user-friendly than key (Touch-Tone) input • Allows Spoken List selection • System: “What City are you leaving from?” • User: “Birmingham” • Keywords easier to remember than numeric codes • “Account balance” instead of “two” • Easy to skip ahead through menus • Tellme - “Sports, Basketball, Mavericks”

Issues With Directed Dialogue • Computer asks all the questions • Usually presented as a menu • “Do you want your account balance, cleared checks, or deposits?” • Computer always has the initiative • User just answers questions, never gets to ask any questions • All possible answers must be pre-defined by the application developer (grammars) • Will eventually get the job done, but can be tedious • Still much better than Touch-tone menus

Issues With Directed Dialogue • Application developer must design scripts that never have the machine ask open-ended questions • “What can I do for you?” • Application Developer’s job - design questions where answers can be explicitly predicted. • “Do you want to buy or sell stocks?” • Developer must explicitly define all possible responses • Buy, purchase, get some, acquire • Sell, dump, get rid of it
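The bookkeeping this slide describes can be sketched as a lookup table from each predicted response to the action it should trigger. The phrase lists come from the slide. The `BUY` and `SELL` labels and the `interpret` function are hypothetical names for this example, which works on recognizer text output rather than audio.

```python
# Phrases from the slide, grouped under the action each one should trigger.
RESPONSES = {
    "BUY": ["buy", "purchase", "get some", "acquire"],
    "SELL": ["sell", "dump", "get rid of it"],
}

# Invert the table so a recognized phrase looks up its action directly.
PHRASE_TO_ACTION = {
    phrase: action for action, phrases in RESPONSES.items() for phrase in phrases
}


def interpret(recognized_text):
    """Map recognizer text to an action, or None to trigger a re-prompt."""
    return PHRASE_TO_ACTION.get(recognized_text.lower().strip())


print(interpret("Purchase"))
print(interpret("hold"))
```

Any response the developer did not anticipate, such as "hold" here, falls through to None, so the dialog has to re-prompt.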

Examples of Directed Dialog • Southwest Airlines • Pizza Inn • Brokerage

Standards & Trends

VoiceXML • VoiceXML - A web-oriented voice-application programming language • W3C Standard - www.w3.org • Version 1.0 released March 2000 • Version 2.0 ready to be approved • http://www.w3.org/TR/voicexml20/ • Voice dialogues scripted using XML structures • Other VoiceXML support • www.voicexml.org • voicexmlreview.org

VoiceXML • Assume telephone as user device • Voice or key input • Pre-recorded or Text-to-Speech output

Why VoiceXML? • Provides environment similar to web for web developers to build speech applications • Applications are distributed on document servers similar to web • Leverages the investment companies have made in the development of a web presence. • Data from Web databases can be used in the call automation system. • Designed for distributed and/or hosted (ASP) environment.

VoiceXML Architecture (diagram) • Components shown: Telephone and Mobile Device (VUI), Telephone Network, VoiceXML Browser/Gateway (Voice Server), Internet, Web Server, VoiceXML Document

VoiceXML Example
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block> Hello, World! </block>
  </form>
</vxml>

VoiceXML Applications • Voice Portals • TellMe • 1-800-555-8355 (TELL) • http://www.tellme.com • BeVocal • 1-408-850-2255 (BVOCAL) • www.bevocal.com

The VoiceXML Plan • Third party developers write VoiceXML scripts that they will publish on the web • Callers to the Voice Portals will access these voice applications like browsing the web • VoiceXML will use VUI with directed dialog • Voice output • Voice or key input • hands/eyes free or privacy

Speech Application Language Tags (SALT) • Microsoft, Cisco Systems, Comverse Inc., Intel, Philips Speech Processing, and SpeechWorks • www.saltforum.org • Extension of existing Web standards such as HTML, xHTML and XML • Support multi-modal and telephone access to information, applications, and Web services, independently or concurrently.

SALT - “Multi-modal” • Input might come from speech recognition, a keyboard or keypad, and/or a stylus or mouse • Output to screen or speaker (speech) • Embedded in HTML documents • Will require SALT-enabled browsers • Working Draft V1.9 • Public Release - March 2002 • Submit to IETF - midyear 2002

SALT Code
<!-- Speech Application Language Tags -->
<salt:prompt id="askOriginCity"> Where would you like to leave from? </salt:prompt>
<salt:prompt id="askDestCity"> Where would you like to go to? </salt:prompt>
<salt:prompt id="sayDidntUnderstand" onComplete="runAsk()">
  Sorry, I didn't understand. </salt:prompt>
<salt:listen id="recoOriginCity"
    onReco="procOriginCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>
<salt:listen id="recoDestCity"
    onReco="procDestCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>

Evolution of the Speech Interface • Touch-Tone Input • Directed Dialogue • Natural Language • Word spotting • Phrase spotting • Deep parsing
