CS 260: Lecture 10 Professor John Canny
Speech: the Ultimate Interface? • In the early days of HCI, people assumed that speech/natural language would be the ultimate UI • Use of speech interfaces has grown, but speech is still rarely used in the office. Why?
Speech: the Ultimate Interface? • Why speech hasn’t succeeded in the office: • Affordances of text: • Visual scanning (for email or docs) • Unambiguity of text • Editing of text • Disadvantages of speech: • Noise – call center ambience • Lack of privacy
Computing is Moving • Where are computers these days? Intel’s breakdown (based on PC sales): Office, Home, Mobile (laptops), Medical. • And as we noted earlier, programmable smartphones will soon outnumber total PCs. Then there are game boxes, cable boxes, Smart TVs, etc.
What is a good interface for: • Mobile computing (walking or driving)? • Home computing? • Medical computing?
Where is the industry now? • After a big slump around 2002, the speech technology/voice interface industry seems to be growing briskly, about 30-40% per year. One current estimate puts it at about $2.5 billion. • It would probably be more visible, except several related industries have overtaken it: outsourced call centers and VOIP (Voice Over IP). • The biggest growth has been in the new markets: • Cell phones (as a local UI) • Medical (e.g. order entry) • Voice services over the phone
Industry movement In January this year, Yahoo acquired a large team of speech engineers from Nuance, the largest speech company (which owns Dragon NaturallySpeaking). Google already had some leading speech researchers. So there is much interest in speech for the portal market. Aside: there is a division of Nuance devoted to medical speech recognition, and one to call centers.
Industry movement HeyAnita: Voice-based email and messaging. BeVocal: Hosted IVR (Interactive Voice Response) for customers, e.g. MetroPCS. Tellme: Find a business service (including restaurants) using ASR.
Speech: Some background A speech recognizer consists of 3 stages: Raw sound → Acoustic Front End → Acoustic features → Acoustic Model → Phonetic features → Language/phonetic model → Words. A state-of-the-art recognizer requires 50-100 Mflops for continuous speech (no pauses between words). PC continuous speech recognizers appeared in the 1990s …and saved many victims of RSI.
Speech: Some background The first two stages are standard. The last is not, and has a big impact on performance. The last box encodes knowledge of what users might say, either as a grammar or as a statistical language model (LM). Grammars are suitable for small recognition tasks with well-known command languages.
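To make the grammar vs. language-model distinction concrete, here is a minimal Python sketch (the names and example phrases are invented for illustration, not taken from any particular recognizer): a small command grammar either accepts an utterance or rejects it outright, while a statistical bigram LM assigns a score to any word sequence, preferring likely ones.

import re
from collections import defaultdict

# --- Grammar-style model: accept only a small command language ---
COMMAND_GRAMMAR = re.compile(r"^(call|dial) (home|work|voicemail)$")

def grammar_accepts(utterance):
    """Return True if the utterance matches the command grammar."""
    return COMMAND_GRAMMAR.match(utterance) is not None

# --- Statistical bigram language model: score any word sequence ---
def train_bigrams(corpus):
    """Count bigram frequencies from a list of example sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts

def bigram_score(counts, utterance, smoothing=1e-6):
    """Relative likelihood of an utterance under the bigram counts."""
    words = ["<s>"] + utterance.split() + ["</s>"]
    score = 1.0
    for prev, cur in zip(words, words[1:]):
        total = sum(counts[prev].values()) or 1
        score *= (counts[prev][cur] + smoothing) / (total + smoothing)
    return score

counts = train_bigrams(["call home", "call work", "please call my office"])
print(grammar_accepts("call home"))                   # True
print(grammar_accepts("please call my office"))       # False: not in the grammar
print(bigram_score(counts, "please call my office"))  # non-zero: the LM still scores it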
Speech UIs • Most implement a finite-state machine. • At each state, the system can recognize various speech segments to take you to the next state(s). • A segment may be anything from a single word to a complete utterance. • The system can also make utterances of its own at various states. • You can specify them using regular expressions, or using VoiceXML.
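A minimal sketch of such a finite-state dialog in Python (the states, patterns and prompts are hypothetical; a real deployment would typically express this in VoiceXML): each state maps recognized phrases to a next state and a system utterance, and an unrecognized input leaves the machine in the same state.

import re

# Each state maps regular expressions over the recognized text to
# (next_state, system_prompt).  States and phrases are made up for illustration.
DIALOG = {
    "start": [
        (r"\b(check|track)\b.*\bpackage\b", ("get_number", "What is your tracking number?")),
        (r"\bagent\b|\boperator\b",          ("transfer",   "Connecting you to an agent.")),
    ],
    "get_number": [
        (r"^\d{4}$", ("confirm", "Thanks, looking that up now.")),
    ],
}

def step(state, recognized_text):
    """Advance the dialog: return (next_state, prompt), or reprompt in place."""
    for pattern, (next_state, prompt) in DIALOG.get(state, []):
        if re.search(pattern, recognized_text):
            return next_state, prompt
    return state, "Sorry, I didn't get that."   # stay in the same state on no match

print(step("start", "I'd like to track a package"))   # ('get_number', ...)
print(step("get_number", "4833"))                     # ('confirm', ...)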
Speech on phones Speech recognition is faster and more accurate if you limit the vocabulary to a few dozen words. Small-vocabulary speech recognition has been common on phones for the last few years: • Call a number • Call a name (from your contacts) What about large vocabulary, continuous speech?
This year’s smartphone (free with service contract): • 150-200 MHz ARM processor (roughly 200 MIPS) • 32 MB RAM • 2 GB flash (not included) • In effect, a Windows-98-era PC that boots quickly! Plus: • Camera • AGPS (Qualcomm/SnapTrack) • DSP cores, OpenGL GPU • EV-DO (300 kb/s), Bluetooth
Speech on phones This is just the right power for high-performance speech recognition. Large-vocabulary speech recognition (not continuous) appeared on phones last year: Samsung P207. LVCSR (Large-Vocabulary Continuous Speech Recognition) should be available this year.
Speech in the home Good speech recognition used to require careful microphone placement and a worn headset.
Speech in the home New microphones: array mics with built-in DSPs allow recognition at greater range (several feet). Users no longer have to wear microphones to use speech.
Speech in the home Apart from CPU and memory (which are shrinking), speech recognition requires only a microphone and perhaps a speaker. It is power- and size-efficient. In a few years, it will probably be possible to build speech recognition into Bluetooth microphones or other small devices. Compare with other interfaces…
Ten Guidelines for Speech Interfaces • You can’t design what you can’t define • Use user-centered design techniques • Use the right technology, and use technology right • Leverage the language instinct • Establish success criteria and test against them • Branding in VUI is more than just a pretty voice • How you say it is as important as what you say • Don’t block the exit • Take care with error handling • Establish a change process
1. You can’t design what you can’t define • Consider the task(s) that your users want to do, i.e. start with standard task analysis. • What conceptual model do they have (use contextual inquiry)? • What language do they use to refer to it? • Use recordings during contextual inquiry/task analysis.
2. Use user-centered design techniques • Great to see this advice in a trade publication. You know a lot about this: • Study the real use context – especially important for mobile devices, medical, home etc. • Perform a needs analysis – what kinds of service might the system provide, and how valuable are they? • Develop personae to guide your design • Once again, study users’ conceptual models
3. Use the right technology, and use technology right • In a speech interface, you have a choice between synthesized and recorded speech for output. • In designing the recognizer, language models will generally give better results for routing a broad range of user questions. • Using technology right: speech recognizers are fussy animals. They use many parameters to trade off performance and accuracy. You have to experiment with these in order to understand them.
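As a sketch of the kind of experiment this guideline calls for, the Python snippet below sweeps one hypothetical recognizer parameter (a beam width) and records the accuracy/latency trade-off on a held-out test set. The recognize function and the test set are placeholders for whatever engine and data you actually have.

import time

def evaluate(recognize, test_set, beam_width):
    """Run the recognizer over (audio, expected_text) pairs at one setting."""
    correct, start = 0, time.time()
    for audio, expected_text in test_set:
        if recognize(audio, beam_width=beam_width) == expected_text:
            correct += 1
    return correct / len(test_set), time.time() - start

def sweep(recognize, test_set, widths=(50, 100, 200, 400)):
    """Print the accuracy/time trade-off for each candidate beam width."""
    for w in widths:
        accuracy, seconds = evaluate(recognize, test_set, w)
        print(f"beam={w:4d}  accuracy={accuracy:.2%}  time={seconds:.1f}s")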
4. Leverage the Language Instinct Make a voice UI resemble natural speech: • Use familiar phrasing • Don’t mimic written language • Use conversational style (pronouns, acknowledgements, transition words) • Use realistic prosody (pitch etc.) in TTS • Enable callers to speak over and interrupt the TTS system
5. Establish Success Criteria and Test Against Them • Standard tests: recognition accuracy, speed, CPU usage • Dialog traversal tests: capture many conversations and plot the paths users took through your dialog hierarchy. • Usability testing • Early rapid prototyping: Wizard-of-Oz (WOZ) testing • Define “call success” in a sensible way, and track it!
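A dialog traversal test can be as simple as counting the state paths taken in logged calls. The Python sketch below (with invented state names) tallies how often each path occurs, which quickly shows the dominant routes and the points where callers bail out.

from collections import Counter

def traversal_report(call_logs):
    """call_logs: list of calls, each a list of dialog states visited in order."""
    paths = Counter(" > ".join(states) for states in call_logs)
    for path, count in paths.most_common():
        print(f"{count:5d}  {path}")

traversal_report([
    ["start", "get_number", "confirm", "done"],
    ["start", "get_number", "confirm", "done"],
    ["start", "agent"],                        # callers who opted out early
])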
6. Branding is more than a pretty voice • Users make strong attributions about a human speaker (personality, education, demographics). They do the same with speech interfaces (whether you intend it or not). • Design of a voice UI is as significant as design of an attractive web site. A “robot” voice UI is like a 12-point text-only web site. • The voice interface’s brand perception is a combination of prosody and language, just like a real speaker’s. Design both explicitly.
7. How you say it is as important as what you say This guideline is mostly about speech constructed from recorded voice. • For “natural” speech, you need to think about the context of each word in real speech. • Pronunciation actually changes when words are connected together (this is co-articulation). • Ideally, you would include appropriate context information in each recording (e.g. the number “one” followed by a “t” consonant).
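One way to sketch this in code: key each recording by the word and a crude class of the sound that follows it, and pick the matching take at prompt-assembly time. The file names and context classes below are invented for illustration (a real system would key on phonemes, not first letters).

# Recordings keyed by (word, class of the following sound), so "one" before a
# 't' consonant uses a different take than "one" at the end of a phrase.
RECORDINGS = {
    ("one", "t"):     "one_before_t.wav",
    ("one", "pause"): "one_final.wav",
}

def pick_recording(word, next_word):
    """Choose the recording whose context matches what comes next."""
    context = next_word[0] if next_word else "pause"
    return RECORDINGS.get((word, context), RECORDINGS.get((word, "pause")))

print(pick_recording("one", "to"))    # one_before_t.wav
print(pick_recording("one", None))    # one_final.wav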
8. Don’t block the exit • Make sure users can exit the automated system and reach a live person. • If you make it hard, they will get there anyway, and be angry when they do. • Providing feedback can help (e.g. the estimated time to reach a representative is…, do you wish to return to the automated system?). • Make sure you transfer user data from the automated system to the service person’s console – it looks really bad if you don’t.
9. Take Care with Error Handling • Most speech dialog systems have internal state (in a state machine) that the user can’t see except through what the system says. • You must treat errors (e.g. unrecognized utterances) very carefully. If you leave the current state, make sure users can understand the state you’ve gone into. • Large changes (e.g. backtracking up to the initial state) are extremely frustrating for users. • If you backtrack, take small steps, only as much as needed.
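A small sketch of graded error handling consistent with these points (prompts and thresholds are invented): stay in the current state, escalate the prompt wording on repeated failures, and only as a last resort leave the dialog by transferring to a person.

# Escalate the prompt gradually rather than jumping back to the start of the dialog.
ERROR_PROMPTS = [
    "Sorry, I didn't catch that.",
    "Sorry, I still didn't get that. You can say a four-digit delivery number.",
    "Let me transfer you to someone who can help.",
]

def handle_no_match(state, error_count):
    """Return (next_state, prompt) for the Nth consecutive recognition failure."""
    if error_count >= len(ERROR_PROMPTS) - 1:
        return "transfer", ERROR_PROMPTS[-1]      # last resort: leave gracefully
    return state, ERROR_PROMPTS[error_count]      # otherwise stay put and reprompt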
10. Establish a Change Process • Speech UIs are very complex, and very sensitive to some small changes (esp. in the recognizer). • Make sure you manage changes to the system – especially low level changes. They should be discouraged once the system is deployed. • Establish “regression tests” – representative speech segments that the system should always process successfully, and check them. • Always keep several working generations of the system.
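A regression test for a speech system can be a fixed list of recorded utterances paired with the output the deployed system must keep producing. In the Python sketch below, recognize stands in for the real engine and the audio paths are placeholders.

# Representative utterances the system must always process successfully.
REGRESSION_SET = [
    ("audio/track_package.wav", "track package"),
    ("audio/agent_please.wav",  "agent"),
]

def run_regression(recognize):
    """Return True only if every regression utterance is still recognized correctly."""
    ok = True
    for audio, expected in REGRESSION_SET:
        got = recognize(audio)
        if got != expected:
            print(f"FAIL {audio}: expected {expected!r}, got {got!r}")
            ok = False
    return ok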
The State of the Art A few services represent the state of the art: • United Airlines flight information at 1-800-864-8331, then 1,… • FedEx package rates 1-800-463-3339 • Tellme 1-800-555-1212, or 411 from Cingular wireless or Verizon landlines. • Wildfire: speech phone services, voice dialing, messages, etc. 1-800-WILDFIRE • Also: Schwab, Wachovia, E-trade, B-of-A, Fidelity,…
Conversational Speech • CLERK: Make of car? • DRIVER: Uh Mercedes • CLERK: Model year? • DRIVER: It's a 1970. • CLERK: Color? Compare with this version: • CLERK: What's the make of your car? • DRIVER: Uh Mercedes • CLERK: OK. And the model year? • DRIVER: It's a 1970. • CLERK: Got it. What's the color?
Conversational Speech • The second version is both more polite and more usable. • System status (i.e. that it understood the user’s responses) is always clear. • Design of speech “character” should include normal human styling (politeness) but not excessive anthropomorphism. • In particular, the system should never suggest capabilities it does not have.
Conversational Speech Example
System: This is the delivery tracking center. Tell me your four-digit delivery number or enter it on the keypad.
Caller: 4-8-3-3
System: 4-8-3-3. Is that right?
Caller: Yes.
System: OK, hold on… (logs into system) …What's your status? You can say arrived, departed or delayed.
Caller: I'll be delayed two days. There's a big storm.
System: Oh, sorry to hear that! Let me confirm. I have delivery number 4-8-3-3 delayed for 48 hours due to weather. Is that right?
Caller: Yes it is.
System: Great. Hold on… OK. It's in the system. Hopefully you'll be on your way soon. I'll talk to you when you arrive. Drive safely.
Conversational Speech • Very good usability is possible through clever design. • It does not all depend on raw recognizer accuracy. • Careful design includes appropriate personality, giving enough flexibility to the user, and responding to errors carefully.
What’s happening now • Over the last half-dozen years, speech interfaces have gotten a lot better. • Most of the improvement seems to be due to improvements in method, i.e. iterative design, and heuristic guidelines like the ones just presented. • The field is a lot more interdisciplinary than it used to be, including speech engineers, UI designers and linguists.
The Future: Context-Awareness • Speech interfaces are rather limited today because they either rely on tightly constrained utterances, or on coarse language models. • In many cases, especially for mobile phones, there is a lot of constraint on what users might do from the context of use (time, location, meta-data on the phone) • Current research is using context data to improve recognition all the way down. Instead of general language models in the recognizer, you can “push down” context information into it. The recognizer can still recognize anything, but it will do better with more likely utterances.
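One illustrative sketch of this idea in Python (the hypotheses, scores and priors are invented): re-weight the recognizer's n-best hypotheses by a prior derived from context such as time of day, location or the phone's contact list, so that likely utterances win ties without ruling anything out.

def rescore(nbest, context_prior):
    """nbest: list of (hypothesis, acoustic_score); return the best hypothesis
    after multiplying each score by a context-derived prior."""
    rescored = [(hyp, score * context_prior.get(hyp, 0.01)) for hyp, score in nbest]
    return max(rescored, key=lambda pair: pair[1])[0]

nbest = [("call jim", 0.40), ("call gym", 0.38)]
# At 7am near a fitness club, "gym" is more plausible than the contact "Jim".
context_prior = {"call gym": 0.7, "call jim": 0.3}
print(rescore(nbest, context_prior))   # call gym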
Summary • Speech seems like a very good option for future computing environments. • Small devices can support speech interfaces, and microphone technology is getting better. • Speech UI design requires many of the same principles as general UI design, especially: • Visibility of system status • User control and freedom • Helping users recognize and recover from errors • Application of these principles leads to highly usable designs.