Avoiding the Pitfalls of Speech Application Rollouts Through Testing and Production Management
Rob Edmondson, Senior Field Engineer, Empirix, Inc.
Overview
You are about to deploy a new call center, with a new PBX, a speech-enabled IVR deployed on a VoiceXML architecture, post-routing CTI, and 200 agent stations with IP phones…
• What is the customer-perceived latency for your IVR to respond to callers' speech inputs?
• What is the average host connection latency for the IVR?
• What percentage of callers' utterances are recognized the first time?
• What percentage of calls fail to complete in the IVR because of application errors?
• How many calls fail to be routed to the correct agent or skill group?
• What is the average time it takes for a screen pop to occur?
• What percentage of screen pops have missing or incorrect information?
• What percentage of screen pops never happen?
• What is the voice quality for the agent and the caller?
• What is the impact on other users of your CRM system?
…at 5 calls/minute? At 30 calls/minute? At maximum call load?
Business Goals Driving Self-Service… While Quality Strategies Focused on Agents
[Chart: percentage of calls handled by agents vs. handled by self-service, by industry – Utilities, Telecom, Mortgage, Credit Card, Stock/Mutual, Retail Banking, Health Insurance. Source: Enterprise Integration Group, 2004]
Speech Application Quality – Design and Delivery
Quality of customer experience evaluation matrix (design × delivery):
• Easy to Use / Behaves as Designed
• Easy to Use / Unpredictable Behavior
• Difficult to Use / Behaves as Designed
• Difficult to Use / Unpredictable Behavior
Common Questions When Deploying Speech
Design:
• Can I just 'speechify' my DTMF apps?
• Should I allow DTMF input?
• What voice should we use?
• How personal should the application be?
• Should I allow barge-in?
• Which utterances should I allow for a recognition state?
• How do I handle error conditions?
• When do I transfer to an agent?
Delivery:
• How do I test speech?
• Do I have enough speech/TTS resources?
• Do I need to test with different accents?
• How do I do usability testing?
• Will VoIP impact my speech recognition accuracy?
• How do I verify TTS quality?
• How do I make sure it's working after we go into production?
Speech Testing*
• Recognition Testing: evaluates recognizer performance. Callers generate utterances by talking to the application, following test scripts, with:
  • male and female speakers
  • different dialects
  • different noise conditions
  Accuracy is measured by comparing the recognition results to a transcription of the utterances. Barge-in, speaker verification, subscriber profiles, and dynamic grammars should also be tested for accuracy with a variety of speakers and calling conditions.
• Usability Testing: conducted early in the design process, and also helpful at this stage to validate the performance of an application against the metrics laid out in the requirements phase.
• Application Testing:
  • Dialog Traversal: creates and executes a series of test cases covering all possible paths through the dialog, to verify that the right prompts are played, that each state in the call flow is reached correctly, and that the universal, error, and help behaviors are operational (see the traversal sketch below).
  • System Load: simulates a high inbound call volume to ensure that the expected caller capacity can be handled and that proper load balancing occurs across the system.
• Tuning and Monitoring: ongoing analysis of real caller interactions, during pilot deployment (beta), post-deployment, and ongoing monitoring.
*Nuance Project Method; Introduction to the Nuance System, v8.5, p. 72
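To make the dialog-traversal idea concrete, here is a minimal sketch of what an automated traversal test case might look like. The CallSession class is a hypothetical stand-in for a real test harness that would drive live calls into the IVR and compare what it hears against the call-flow specification; the prompts and file names are invented for illustration, not a vendor API.

```python
# A minimal sketch of an automated dialog-traversal test case.
# CallSession is a fake harness that replays a scripted IVR so the
# example runs standalone; a real harness would place actual calls.

class CallSession:
    """Fake session that replays a scripted sequence of IVR prompts."""
    def __init__(self, script):
        self._prompts = iter(script)

    def expect_prompt(self, expected):
        heard = next(self._prompts)
        assert heard == expected, f"heard {heard!r}, expected {expected!r}"

    def play_utterance(self, wav_file):
        print(f"playing {wav_file}")   # a real harness streams recorded audio

def test_get_pizza_size():
    # Scripted behavior of the dialog state under test (hypothetical prompts).
    session = CallSession([
        "What size pizza would you like?",
        "You chose large.",
    ])
    session.expect_prompt("What size pizza would you like?")
    session.play_utterance("large.wav")        # known-good recorded input
    session.expect_prompt("You chose large.")  # verify the right prompt plays

test_get_pizza_size()
print("GetPizzaSize traversal passed")
```

A full traversal suite would contain one such case per path through each dialog state, including the universal, error, and help behaviors.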
Testing During the Lifecycle
• Requirements → Usability testing
• Design → Recognition testing
• Implementation → Application testing, Performance testing
• Deployment → Tuning
Usability Testing – A Key to Success
"Usability testing is sometimes confused with quality assurance (QA), but the two are very different. QA usually measures a product's performance against its specifications. For example, QA on an automobile would ensure that the components function as specified, that the gaps between the doors and the body are within tolerances, and so forth. QA testing would not determine whether a vehicle is easy for people to operate, but usability testing would. In a speech application, QA ensures that the appropriate prompts do in fact play at the right times in the right order. This kind of testing is important, because designers generally shouldn't assume that an application will work to 'spec'. QA testing can tell us a great deal about a system's functionality. But it can't tell us if the target population for the application can use it – or will like to use it."
– Blade Kotelly, The Art and Business of Speech Recognition, p. 122
"Usability testing is just as important for simple DTMF applications as it is for complex NL (natural language) applications. In general, the more control the user has over the application, the more testing will be required and the more valuable this testing will be…. The subject is a complex one, and both designers and developers are encouraged to develop formal, documented test plans early in the product life cycle."
– Bruce Balentine and David P. Morgan, How to Build a Speech Recognition Application, 2nd Edition, p. 294
Recognition Testing – Useful Metrics
• First-time recognition rate: for a known-good input, what percentage of the time is the expected prompt heard back?
• Timeout and rejection rates: for timeout and invalid-input tests, how often is the correct behavior observed?
• Barge-in detection rate: when barging in at an acceptable time, what percentage of the time is the speech detected?
• Menu response latency: how long after the end of the input utterance does it take for the next prompt to begin?
A sketch of computing these metrics from test-call records follows.
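The sketch below computes the first three kinds of metric from a list of per-attempt test records. The record layout is an assumption made for illustration; a real test harness would log equivalent fields for every scripted call.

```python
# Sketch of computing recognition-test metrics from per-attempt records.
# The 'attempts' data is fabricated sample data, not real test output.

from statistics import mean

attempts = [
    # expected: what the input should produce; observed: what the IVR did;
    # latency_ms: time from end of utterance to start of the next prompt.
    {"expected": "Medium",  "observed": "Medium",  "latency_ms": 420},
    {"expected": "Medium",  "observed": "Large",   "latency_ms": 510},
    {"expected": "reject",  "observed": "reject",  "latency_ms": 600},
    {"expected": "timeout", "observed": "timeout", "latency_ms": 0},
]

def rate(records, predicate):
    """Fraction of records satisfying the predicate."""
    return sum(1 for r in records if predicate(r)) / len(records) if records else 0.0

# First-time recognition rate: known-good inputs only.
recog = [a for a in attempts if a["expected"] not in ("reject", "timeout")]
first_time = rate(recog, lambda a: a["observed"] == a["expected"])

# Timeout and rejection rates: was the correct behavior observed?
handling = [a for a in attempts if a["expected"] in ("reject", "timeout")]
correct_handling = rate(handling, lambda a: a["observed"] == a["expected"])

# Menu response latency over the recognition attempts.
latency = mean(a["latency_ms"] for a in recog)

print(f"first-time recognition rate:      {first_time:.0%}")
print(f"timeout/reject handled correctly: {correct_handling:.0%}")
print(f"mean menu response latency:       {latency:.0f} ms")
```

The same per-attempt records, grouped by dialog state, are what feed a per-state dashboard like the one on the next slide.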
Dialog State Testing 'Dashboard'
Dialog State: GetPizzaSize
[Dashboard panels: Error Handling, First-Time Recognition Rate, Response Time Data, Barge-In Success Data]
Tester Comments:
• Dialog state performs very well
• Still need to test universal behaviors (Help, Main Menu)
• Used clip 'nothing.wav' for reject tests – around 10% of calls came up with Medium instead of the correct rejection
Performance Testing – System Overview
[Diagram: callers reach the telephony infrastructure over the PSTN (T1, E1, PRI, SIP, H.323): PBX, ACD, CTI, and agents with desktop app(s). The application infrastructure comprises the IVR/speech platform (VoiceXML, MRCP), telephony server(s), ASR/TTS server(s), web servers, application server(s), and backend applications (CRM, etc.)]
Example Configuration and Vendors
• ASR/TTS (MRCP): Nuance, Scansoft, IBM, Microsoft, Loquendo, …
• Web Server (VoiceXML 2.0): BEA, IBM, Sun, Oracle, Microsoft, open source
• VoiceXML Platform (CCXML, SIP): Nortel, Avaya, Genesys, IVB, Edify, IBM, Aspect, Syntellect, Nuance, VoiceGenie, …
• Call Control/Media Server (SIP, H.323): Excel, AudioCodes, Voxeo, IVB, Cisco, Genesys, Avaya, …
• Gateway (T1, E1, PRI, …): Cisco, Avaya, Nortel, VegaStream, …
• Network/PBX: Avaya, Nortel, Intertel, NEC, Cisco, Siemens, …
• CTI Server (JTAPI, …): Genesys, Avaya, Nortel, Cisco, Apropos, …
• ACD: Avaya, Nortel, Cisco, Apropos, II, Siemens, …
Performance Testing – Load Test
Objectives:
• Verify the application can handle the expected load
• Find system bottlenecks
• Find pre-failure indicators
• Understand recovery procedures
Considerations:
• 'Component' load tests to isolate specific pieces
• Test lab or production?
• Emulate real-world call patterns
• Iterative testing allows 'find and fix'
• Go beyond what you expect in production – compare recognition rates at increasing load levels (see the load-ramp sketch below)
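The load-ramp idea can be sketched as follows: run a batch of scripted calls at each call rate and compare first-time recognition rate and latency across steps. place_test_call() is a hypothetical stand-in for a bulk call generator; here it fakes results so the loop is runnable.

```python
# Sketch of a stepped load test; the call results are simulated.

import random

def place_test_call():
    """Pretend to drive one scripted call end to end; return
    (recognized_first_time, response_latency_ms)."""
    return random.random() < 0.95, random.gauss(450, 80)

def run_step(calls_per_minute, n_calls=50):
    # A real generator would pace n_calls at the target rate and run
    # many in parallel; pacing and concurrency are elided in this sketch.
    results = [place_test_call() for _ in range(n_calls)]
    rec_rate = sum(ok for ok, _ in results) / n_calls
    avg_latency = sum(ms for _, ms in results) / n_calls
    return rec_rate, avg_latency

# Iterate upward past the expected production load and watch for the
# point where recognition drops or latency climbs (pre-failure indicators).
for cpm in (5, 30, 60, 120):
    rec, lat = run_step(cpm)
    print(f"{cpm:>4} calls/min: first-time recognition {rec:.0%}, "
          f"mean latency {lat:.0f} ms")
```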
Performance Testing – Key Metrics
• Customer-perceived latency at each step: the time from the end of caller input to the beginning of the next response, which is 'dead air' to the caller
• Time to complete the call (call length)
• Transaction completion rate
• First-time recognition rate
• All of these metrics relative to call load
Why are these important?
• They are direct measures of the caller's quality of experience
• They have cost implications for the enterprise: the cost of variability, and self-service versus assisted help
• They quantify an otherwise subjective idea
A sketch of summarizing the latency metric as percentile bands follows.
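Latency is most useful when summarized per dialog step as percentile bands (min/max plus a 5th-95th percentile range, matching the latency chart later in the deck). The step names and sample latencies below are fabricated for illustration.

```python
# Sketch: per-step latency summary as min/max and 5th-95th percentiles.

from statistics import quantiles

step_latencies_ms = {
    "GetPizzaSize": [380, 420, 455, 470, 510, 640, 2100],
    "GetToppings":  [300, 310, 350, 360, 390, 410, 450],
}

for step, samples in step_latencies_ms.items():
    cuts = quantiles(samples, n=20)        # 19 cut points at 5% steps
    p5, p95 = cuts[0], cuts[-1]            # 5th and 95th percentiles
    print(f"{step}: min {min(samples)} ms, "
          f"5th-95th {p5:.0f}-{p95:.0f} ms, max {max(samples)} ms")
```

The percentile band is less sensitive to a single outlier call than min/max alone, which is why both are worth reporting.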
Production Management
• Tuning/monitoring with vendor tools
• Application monitoring: third-party tools for device/application monitoring
• Proactive call transactions (see the monitoring sketch below)
• Key metrics for customer experience: latencies, transactional errors, speech recognition success rates
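A proactive call transaction amounts to placing a scripted test call on a schedule and alerting when a customer-experience metric crosses a threshold. In the sketch below, the thresholds and canned results are hypothetical; a real monitor would drive live calls through the production system.

```python
# Sketch of proactive production monitoring with threshold alerts.

THRESHOLDS = {
    "latency_ms": 1500,        # maximum acceptable 'dead air'
    "recognition_rate": 0.90,  # minimum acceptable first-time recognition
}

CANNED_RESULTS = [  # stand-ins for three scheduled test transactions
    {"latency_ms": 620,  "recognition_rate": 0.96, "completed": True},
    {"latency_ms": 2400, "recognition_rate": 0.95, "completed": True},
    {"latency_ms": 700,  "recognition_rate": 0.82, "completed": False},
]

def check(result):
    """Return a list of alert messages for one transaction result."""
    alerts = []
    if not result["completed"]:
        alerts.append("transaction failed to complete")
    if result["latency_ms"] > THRESHOLDS["latency_ms"]:
        alerts.append(f"latency {result['latency_ms']} ms over threshold")
    if result["recognition_rate"] < THRESHOLDS["recognition_rate"]:
        alerts.append("recognition rate below threshold")
    return alerts

for result in CANNED_RESULTS:      # a real monitor runs on a timer, 24x7
    for alert in check(result):
        print("ALERT:", alert)     # e.g. page the on-call engineer
```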
Customer Perceived Latencies
[Chart: customer-perceived latency per dialog step, showing min/max range and the 5th-95th percentile band]
Transaction Failures by Time of Day
[Chart: transaction failure counts by time of day; excludes retry calls]
Review: Common Questions
Design:
• Can I just 'speechify' my DTMF apps?
• Should I allow DTMF input?
• What voice should we use?
• How personal should the application be?
• Should I allow barge-in?
• Which utterances should I allow for a recognition state?
• How do I handle error conditions?
• When do I transfer to an agent?
Delivery:
• How do I test speech?
• Do I have enough speech/TTS resources?
• Do I need to test with different accents?
• How do I do usability testing?
• Will VoIP impact my speech recognition accuracy?
• How do I verify TTS quality?
• How do I make sure it's working after we go into production?
Review: You are about to deploy a new call center, with a new PBX, a speech-enabled IVR deployed on a VoiceXML architecture, post-routing CTI, and 200 agent stations with IP phones…
• What is the customer-perceived latency for your IVR to respond to callers' speech inputs?
• What is the average host connection latency for the IVR?
• What percentage of callers' utterances are recognized the first time?
• What percentage of calls fail to complete in the IVR because of application errors?
• How many calls fail to be routed to the correct agent or skill group?
• What is the average time it takes for a screen pop to occur?
• What percentage of screen pops have missing or incorrect information?
• What percentage of screen pops never happen?
• What is the voice quality for the agent and the caller?
• What is the impact on other users of your CRM system?
…at 5 calls/minute? At 30 calls/minute? At maximum call load?
Rob Edmondson
Empirix, Inc.
redmondson@empirix.com
916-781-9873