Introduction to VoiceXML and Voice Web Architecture

Session Overview • Voice Web Architecture • Components of a Voice Web Application • Voice Standards • W3C Speech Interface Framework • VoiceXML • Language features • Execution model - Form Interpretation Algorithm (FIA) • Application Design Techniques • Static vs. dynamic VoiceXML • Performance Considerations • CCXML, VoiceXML and VoIP • Application Deployment Models • New Technologies • Speaker Biometrics, Video, Multimodal, VoiceXML 3.0 © 2007 Ken Rehor. All Rights Reserved.

Simplifying Voice Services programming • Web-based architecture for interactive speech services • Exploit web technologies to simplify voice service creation and deployment • Enable consolidation of voice and web services • Separate service logic from user interaction • High-level programming languages • Control speech and telephony resources in uniform manner • Shield application programmers from implementation details • No need to know ASR, TTS, telephony APIs • Create portable applications • Run on enterprise system or in telephone network • Run on a variety of platforms, ASR agnostic

Key Ideas • Standard/Common high-level language • Designed for the task • Leverage open, known technology • Web protocols, servers, networks, development tools, expertise • Phone number mapped to URL • Phone number associated with URL of voice service
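The "phone number mapped to URL" idea can be sketched as a lookup table. This is a minimal illustration, not how any particular platform stores its provisioning data: the numbers, the example.com URLs, and the names ROUTES, DEFAULT_URL, normalize, and url_for_call are all assumptions made for this sketch.

```python
# Hypothetical provisioning table: dialed number -> URL of the voice service.
ROUTES = {
    "+18005550100": "http://example.com/voice/main.vxml",
    "+18005550101": "http://example.com/voice/sales.vxml",
}

# Hypothetical fallback document for numbers with no assigned service.
DEFAULT_URL = "http://example.com/voice/unassigned.vxml"


def normalize(number: str) -> str:
    """Drop spaces, dashes, dots and parentheses; keep digits with a leading '+'."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return "+" + digits


def url_for_call(dialed_number: str) -> str:
    """Return the URL the VoiceXML browser should fetch for this call."""
    return ROUTES.get(normalize(dialed_number), DEFAULT_URL)
```

When a call arrives, the platform would fetch the returned URL over HTTP and start interpreting the document, much as a web browser loads a page for an address.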

Voice / Web Application Architecture (diagram) • Any phone reaches the VoiceXML browser over the PSTN or VoIP • The VoiceXML browser and the web browser both fetch content over HTTP, across the Internet or intranet, from the application (web) server • VoiceXML browser content: <vxml> documents, <grxml> grammars, .wav audio files, scripts • Web browser content: <html> pages, images, audio files, scripts • Application (web) server: application logic, content and data, transaction processing, database interface

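Voice Application Architecture and Components (diagram) • Caller reaches the VoiceXML platform over the PSTN • VoiceXML platform: telephony, audio, DTMF, ASR, TTS, VoiceXML interpreter, middleware, OA&M • The platform fetches <vxml>, <grxml>, and .wav content over HTTP, across the Internet or intranet, from the web server • Sample exchange: the platform says "Welcome to Acme products …" and the caller says "Customer service, please…"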

Application Backend Architecture (diagram) • <vxml> documents, grammars, audio files, and scripts are served over HTTP, across the Internet or intranet • Application (web) server: application logic, content and data, transaction processing, database interface • Back-end systems reached over the intranet or Internet: transaction server, database (content), web service

Components of a Voice Solution • Traditional phone, VoIP phone, mobile phone, or multimodal device • Telephone network • Circuit-switched PSTN or packet-switched VoIP • Connects caller's telephone with Telephony Server • Voice User Interface • Dialog structure / flow • Prompts – what the application says to the user • Speech grammars – what the user can say • Application logic that executes on an application server • Web "back-end" • Database, or database interface • VoiceXML Server that executes dialogs • Controls resources such as ASR, SIV, TTS, etc. • Data network to connect application server and VoiceXML server

Inbound or Outbound calls • VoiceXML application works the same for inbound and outbound calls • Additional call progress detection generally required for outbound • Simple protocol for initiating outbound calls • No firm standards, but most vendors follow similar techniques • HTTP, Web Services, etc.

Value of Open Standards • Non-proprietary interfaces between components • Allow choice of best components for the task • User interface languages • W3C Speech Interface Framework: VoiceXML, SRGS, SSML, SI • W3C: HTML, XHTML, SMIL, X+V • OMA: WAP • Communication protocols • W3C: CCXML for 3rd-party telephony call control • W3C: HTTP, HTTPS, SOAP, WSDL • IETF: SIP, MRCP, MSCP • 3GPP: IMS • ITU: T1, ISDN

Visual vs. Voice markup • Web app UI (HTML): Structure, Layout, Input declaration, Transitions, Images, Audio files / streams, Video, Text, Scripts • Voice web app UI (VoiceXML): Structure, Dialog flow, Input declaration, Transitions, Audio files, Video, Images, Text (for TTS), Scripts

Voice Standards Activities • Speech Interface Framework • Network protocols • SIP, MRCP v2, etc. • Platform Certification, Developer Certification, Speaker Biometrics, Architecture, Tools

Voice Application Standards (diagram; items marked ** on the slide are standards in progress) • Caller and phone network: T1 / E1, ISDN, SS7; SIP, RTP, RFC 2833; audio as G.711, WAV, .au, mp3, etc. • VoIP gateway: DTMF, audio • CCXML browser: runs the CCXML call control application, fetched over HTTP / HTTPS (CCXML, scripts) • VoiceXML browser: VoiceXML 2.0, VoiceXML 2.1, ECMAScript 262; runs the VoiceXML application, fetched over HTTP / HTTPS (VXML, GRXML, scripts, audio) • Telephony control interface: SIP, etc. • Dialog control interface: SIP, MSCP, etc. • Media control interface (conference / media server, media mixer): SIP Netann, MSCML, MOML / MSML, MSCP, DMSP, MGCP, etc. • MRCP client to ASR, TTS, and SIV servers: MRCP v1, MRCP v2, carrying GRXML and SSML • SOAP

Voice Application Components • Dialog – flow control of the inputs, outputs, next steps • Input grammars • Control input constraints for DTMF and speech recognition • Output formatting • Pronunciation, timing, sequencing

W3C Speech Interface Framework • VoiceXML • SRGS • SSML • Semantic Interpretation • Pronunciation Lexicon • Call Control For more information, see: W3C Voice Browser Working Group http://www.w3.org/Voice/

Voice User Interface - Dialog • W3C VoiceXML 2.0 • W3C Recommendation March 2004 • Widely implemented • Approximately 4 dozen platforms • Many service providers worldwide • VoiceXML Forum certification program • Nearly two dozen certified platforms, more coming • W3C VoiceXML 2.1 • Candidate Recommendation Sept 2006 • Test suite under development; Certification Program to follow • Many platform vendors are implementing • W3C VoiceXML 3.0 • Early stages of development • SCXML – state chart markup language designed as a controller for V3 and CCXML 2.0 ("Working Draft" Jan 2006)

User Interaction – Input / Output Control • Input grammars W3C SRGS 1.0 • W3C Recommendation • Widely implemented • Output formatting W3C SSML 1.0 • W3C Recommendation • Widely implemented, but with little real support (most TTS engines ignore the SSML instructions) • Semantic Interpretation for Speech Recognition W3C SISR 1.0 • Nearing Candidate Recommendation • Implementation gaining acceptance

W3C Speech Recognition Grammar Specification • Markup language to control input constraints • Finite-state speech recognition • DTMF recognition • Two variations • XML (GRXML) • ABNF • Version 1.0: W3C Recommendation – March 2004 • Implemented and supported by numerous vendors

GRXML ASR example
    <grammar type="application/srgs+xml" root="r2" version="1.0">
      <rule id="r2" scope="public">
        <one-of>
          <item>coffee</item>
          <item>tea</item>
          <item>milk</item>
          <item>nothing</item>
        </one-of>
      </rule>
    </grammar>

GRXML DTMF example
    <?xml version="1.0"?>
    <grammar mode="dtmf" version="1.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/06/grammar http://www.w3.org/TR/speech-grammar/grammar.xsd"
             xmlns="http://www.w3.org/2001/06/grammar">
      <rule id="digit">
        <one-of>
          <item> 0 </item>
          <item> 1 </item>
          <item> 2 </item>
          <item> 3 </item>
          <item> 4 </item>
          <item> 5 </item>
          <item> 6 </item>
          <item> 7 </item>
          <item> 8 </item>
          <item> 9 </item>
        </one-of>
      </rule>
      <rule id="pin" scope="public">
        <one-of>
          <item>
            <item repeat="4"><ruleref uri="#digit"/></item>
            #
          </item>
        </one-of>
      </rule>
    </grammar>
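The "pin" rule in the DTMF grammar accepts exactly four digits followed by the pound key. As a rough way to reason about what the grammar accepts, the same key sequences can be described with a regular expression. The function name accepts_pin is an assumption for this sketch, and a real platform matches against the grammar itself rather than a regex.

```python
import re

# Same language as the "pin" rule: four keys from 0-9, then the pound key.
PIN_PATTERN = re.compile(r"[0-9]{4}#")


def accepts_pin(keys: str) -> bool:
    """Return True if the full key sequence would match the 'pin' rule."""
    return PIN_PATTERN.fullmatch(keys) is not None
```

Three digits, five digits, or a missing pound key all fall outside the rule, which on a VoiceXML platform would surface as a nomatch event.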

W3C Speech Synthesis Markup Language • Markup language to control spoken and audio output • Version 1.0: W3C Recommendation – Sept 2004 • Implemented and supported by numerous vendors • Version 1.1: under development • Adds support for tonal languages • First public Working Draft published January 2007

SSML Functions • Audio output • <audio> • Text-to-Speech output • Contained within SSML constructs • Pronunciation controls • <say-as> • Interpret-as • Format • Detail • <emphasis> • Timing • <break>

SSML Functions (cont'd) • Spoken language • xml:lang • Prosody and Style – voice control • Voice • Gender • Age • Name • Prosody • <prosody> • Pitch • Contour • Range • Rate • Duration • Volume

VoiceXML Scope • Human-machine interaction provided by voice response systems: • Output • play audio files • produce synthesized speech • Input • record spoken input • recognize spoken input • collect character input • Control flow • Telephony • transfer a user to another destination, such as a live agent • disconnect a user

VoiceXML Goals • Separate user interaction from service logic • Creates new possible business models • Service developer can be separate from telephony platform provider • Enable service portability across implementation platforms • Assume common set of platform capabilities • Provide common language for: • Content providers, Tool providers, Platform providers • Safely handle shared network-based applications • deterministic behavior • Easy to build common types of applications • Features to build complex types of applications • Shield application authors from low-level platform-specific details • Promotes portability, ease of service creation

VoiceXML 2.0 Basic Functions • Input • <field>, <menu> recognition • <record> audio recording • Output • <prompt> container for TTS or prerecorded audio • <audio> prerecorded audio • Control Flow • <if>, <else>, <elseif> basic conditional logic • <script> complex scripts using ECMAScript • <goto> transition to a new document • <submit> submit data to a web application • Telephony • <disconnect> • <transfer>

VoiceXML Execution Model • Form Interpretation Algorithm <form> • Execution is synchronous (mostly) • Disconnect events are handled (somewhat) asynchronously • Audio is queued • Played only when encountering a waiting state • Processing is always in one of two states: • Waiting for input in an input item • such as <field>, <record>, or <transfer> • Transitioning between input items in response to an input • Event-driven • <catch>, <throw> generalized event mechanism • <nomatch>, <noinput> short-hand user-input event handling • <error> short-hand error event handling
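The two processing states described above can be sketched as a loop that alternates between waiting for input and transitioning to the next unfilled item. This is a heavily simplified illustration, not the Form Interpretation Algorithm from the specification: it ignores guard conditions, prompt counters, <filled> actions, and form-level grammars. The names run_form, Field, and max_events, and the idea of scripting caller input as a list, are assumptions made for this sketch.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# (field name, prompt text, set of utterances the field's grammar accepts)
Field = Tuple[str, str, Set[str]]


def run_form(fields: List[Field],
             caller_inputs: Iterable[Optional[str]],
             max_events: int = 3):
    """Visit fields in document order until all are filled or too many events occur."""
    inputs = iter(caller_inputs)
    values: Dict[str, Optional[str]] = {name: None for name, _, _ in fields}
    log: List[str] = []
    events = 0
    while True:
        # Transition state: pick the first item whose variable is still unset.
        pending = next((f for f in fields if values[f[0]] is None), None)
        if pending is None:
            break  # every field is filled, so the form exits
        name, prompt, grammar = pending
        # Waiting state: queued audio plays, then the platform listens.
        log.append(f"prompt:{prompt}")
        heard = next(inputs, None)  # None models silence (or no more input)
        if heard is None:
            log.append("event:noinput")
        elif heard not in grammar:
            log.append("event:nomatch")
        else:
            values[name] = heard
            continue
        events += 1
        if events >= max_events:
            log.append("exit:too many events")
            break
    return values, log
```

A short run with one silence, one out-of-grammar utterance, and two valid answers:

```python
fields = [
    ("dept", "Sales, repair, or order status?", {"sales", "repair", "order status"}),
    ("confirm", "Is that right?", {"yes", "no"}),
]
values, log = run_form(fields, [None, "pizza", "repair", "yes"])
```

The field is revisited after each event because its variable is still unset, which is the behavior the <noinput> and <nomatch> handlers rely on.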

Key Points • Architecture leverages all things "internet" • Languages, protocols, servers, developers, etc. • Separation of concerns • Application logic / database vs. telephony / speech resources • Enables new business models • Voice ASP • Prepackaged applications • URL (application) associated with phone number • Calling party or Called party • Share resources among many applications (Voice ASP) • High-level languages, specific to domain / task • Simplify development and maintenance

VoiceXML <form> and <field> • <form> • Dialog container • "Form Interpretation Algorithm" (FIA) specifies default behavior • <field> • Collect input from caller • <grammar> specifies input 'constraints' • <prompt> • Container for <audio> and text

Example (main.vxml)
    <?xml version="1.0"?>
    <vxml version="2.0">
      <form>
        <field name="main_menu">
          <prompt>
            <audio src="welcome.wav"> Welcome to Acme. You can choose sales, repair, or order status.</audio>
          </prompt>
          <grammar src="main_menu.grxml"/>
        </field>
        <block>
          <submit next="http://acme.com/route... " method="get"/>
        </block>
      </form>
    </vxml>
Note: Code simplified for demonstration purposes…
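A document like main.vxml can be served as a static file, or the application server can generate it per request (the "static vs. dynamic VoiceXML" distinction from the session overview). The sketch below builds the same structure with Python's standard xml.etree.ElementTree module. The function name build_menu_vxml and its parameters are assumptions for illustration, and the example.com URL in the usage stands in for the elided submit target.

```python
import xml.etree.ElementTree as ET


def build_menu_vxml(field_name: str, prompt_text: str, audio_src: str,
                    grammar_src: str, submit_url: str) -> str:
    """Build a one-field VoiceXML 2.0 document and return it as a string."""
    vxml = ET.Element("vxml", version="2.0")
    form = ET.SubElement(vxml, "form")
    field = ET.SubElement(form, "field", name=field_name)
    prompt = ET.SubElement(field, "prompt")
    # The text inside <audio> is the fallback spoken if the file is unavailable.
    audio = ET.SubElement(prompt, "audio", src=audio_src)
    audio.text = prompt_text
    ET.SubElement(field, "grammar", src=grammar_src)
    block = ET.SubElement(form, "block")
    ET.SubElement(block, "submit", next=submit_url, method="get")
    return '<?xml version="1.0"?>\n' + ET.tostring(vxml, encoding="unicode")


doc = build_menu_vxml(
    "main_menu",
    "Welcome to Acme. You can choose sales, repair, or order status.",
    "welcome.wav",
    "main_menu.grxml",
    "http://example.com/route",  # placeholder URL, not the one from the slide
)
```

Generating the markup this way lets the server vary prompts, grammars, or targets per caller, at the cost of losing the caching that static documents get for free.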

User Input - Grammars • Grammars can be speech or DTMF (touchtone) • Both types can be active simultaneously • Specified by SRGS • XML grammars are normative (aka GRXML) • ABNF grammars are more concise but more complex to author • Grammars may be specified inline or sourced externally • External grammars are referenced by URI • Multiple grammars may be active simultaneously.

Grammars can get very complicated: there are many ways to say the same thing… • Sales: "I'd like to place an order", "I need to talk to a salesman" • Repair: "repair department", "service", "service department", "customer service" • Order status: "where's my order?", "track my order", "track my shipment", "where the hell is my stuff?"

Basic GRXML grammar example (main_menu.grxml)
    <grammar …xml:lang="en-US" version="1.0">
      <rule id="dept" scope="public">
        <one-of>
          <item>sales</item>
          <item>repair</item>
          <item>order status</item>
        </one-of>
      </rule>
    </grammar>
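To make the role of such a grammar concrete, the sketch below reads the list of alternatives from a GRXML document and checks whether an utterance is one of them. It is only a rough model: it handles a flat <one-of> list and ignores rule references, repeats, weights, and semantic tags, all of which a real recognizer supports. The attributes elided with "…" on the slide are filled in here with the standard SRGS namespace and a root attribute as assumptions, and the names MAIN_MENU_GRXML, phrases, and matches are made up for this example.

```python
import xml.etree.ElementTree as ET

# SRGS namespace in ElementTree's {uri}tag notation.
SRGS_NS = "{http://www.w3.org/2001/06/grammar}"

# Completed copy of the slide's grammar; xmlns and root are assumed values.
MAIN_MENU_GRXML = """<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0" root="dept">
  <rule id="dept" scope="public">
    <one-of>
      <item>sales</item>
      <item>repair</item>
      <item>order status</item>
    </one-of>
  </rule>
</grammar>"""


def _squash(text: str) -> str:
    """Lowercase and collapse runs of whitespace so comparisons are lenient."""
    return " ".join(text.lower().split())


def phrases(grxml: str, rule_id: str) -> set:
    """Return the literal alternatives listed under the given rule."""
    root = ET.fromstring(grxml)
    for rule in root.iter(SRGS_NS + "rule"):
        if rule.get("id") == rule_id:
            return {_squash(item.text or "") for item in rule.iter(SRGS_NS + "item")}
    raise KeyError(rule_id)


def matches(utterance: str, grxml: str, rule_id: str = "dept") -> bool:
    """True if the utterance is one of the rule's alternatives."""
    return _squash(utterance) in phrases(grxml, rule_id)
```

An utterance outside the three alternatives, such as "billing", would not match, which on a VoiceXML platform leads to a nomatch event.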

VoiceXML example – next step (sales.vxml)
    <form>
      <field name="sales_menu">
        <prompt>
          <audio src="sales_menu.wav"> You've reached Acme's sales department. To place an order, say sales. To speak to an associate, say I'd like to speak to someone. </audio>
        </prompt>
        <grammar src="sales_menu.grxml"/>
      </field>
      <block>
        <submit next="http://acme.com/... " method="get"/>
      </block>
    </form>

VoiceXML example with error handling: <noinput> (newmain.vxml)
    <form>
      <field name="main_menu">
        <prompt>
          <audio src="welcome.wav"> Welcome to Acme. You can choose sales, repair, or order status.</audio>
        </prompt>
        <grammar src="main_menu.grxml"/>
      </field>
      <noinput> You must say something. </noinput>
      <block>
        <submit next="http://acme.com/route... " method="get"/>
      </block>
    </form>

VoiceXML example with error handling: <noinput> and <nomatch> (newmain.vxml)
    <form>
      <field name="main_menu">
        <prompt>
          <audio src="welcome.wav"> Welcome to Acme. You can choose sales, repair, or order status.</audio>
        </prompt>
        <grammar src="main_menu.grxml"/>
      </field>
      <noinput> You must say something. </noinput>
      <nomatch> I didn't understand you. Please try again. </nomatch>
      <block>
        <submit next="http://acme.com/route... " method="get"/>
      </block>
    </form>

VoiceXML example with error handling: <help>, <noinput>, and <nomatch> (newmain.vxml)
    <form>
      <field name="main_menu">
        <prompt>
          <audio src="welcome.wav"> Welcome to Acme. You can choose sales, repair, or order status.</audio>
        </prompt>
        <grammar src="main_menu.grxml"/>
      </field>
      <help> You can say sales, repair, or order status. </help>
      <noinput> You must say something. </noinput>
      <nomatch> I didn't understand you. Please try again. </nomatch>
      <block>
        <submit next="http://acme.com/route... " method="get"/>
      </block>
    </form>

Set platform features via <property>
• Input modes: type of input accepted from a caller
    DTMF-only:  <property name="inputmodes" value="dtmf"/>
    Voice-only: <property name="inputmodes" value="voice"/>
    Both:       <property name="inputmodes" value="dtmf voice"/>
• Timeouts
    <property name="timeout" value="1450ms"/>
    <property name="termtimeout" value="2500ms"/>
    ...

Call processing: <transfer> • Blind transfer
    <form id="xfer">
      <block>
        <prompt> Calling Riley. Please wait. </prompt>
      </block>
      <transfer name="mycall" dest="tel:+1-555-123-4567">
      </transfer>
    </form>

Call processing: <transfer> • Bridge transfer
    <form id="xfer">
      <block>
        <prompt> Calling Riley. Please wait. </prompt>
      </block>
      <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
      </transfer>
    </form>

Call processing: <transfer> • Bridge transfer with cancel feature
    <form id="xfer">
      <block>
        <prompt> Calling Riley. Please wait. </prompt>
      </block>
      <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
        <prompt> Say cancel at any time to disconnect this call.</prompt>
        <grammar src="cancel.grxml" type="application/srgs+xml"/>
      </transfer>
    </form>

Call processing: <transfer> • Bridge transfer with result handling
    <form id="xfer">
      <block>
        <prompt> Calling Riley. Please wait. </prompt>
      </block>
      <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
        <prompt> Say cancel at any time to disconnect this call.</prompt>
        <grammar src="cancel.grxml" type="application/srgs+xml"/>
        <filled>
          <assign name="mydur" expr="mycall$.duration"/>
          <if cond="mycall == 'busy'">
            <prompt> Riley's line is busy. Try again later. </prompt>
          <elseif cond="mycall == 'noanswer'"/>
            <prompt> Riley didn't answer the phone. Please call back another time. </prompt>
          </if>
        </filled>
      </transfer>
    </form>
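The <filled> logic in the last transfer example amounts to a lookup from the transfer outcome to a follow-up prompt, plus recording the call duration. The sketch below restates that logic outside VoiceXML, covering only the two outcomes the slide handles. The names OUTCOME_PROMPTS and after_transfer are assumptions, and returning None for other outcomes is a choice made for this sketch.

```python
from typing import Optional

# Only the two outcomes handled in the slide's <if>/<elseif>.
OUTCOME_PROMPTS = {
    "busy": "Riley's line is busy. Try again later.",
    "noanswer": "Riley didn't answer the phone. Please call back another time.",
}


def after_transfer(outcome: str, duration_seconds: float) -> dict:
    """Mirror the <filled> block: keep the duration, pick a prompt if one applies."""
    prompt: Optional[str] = OUTCOME_PROMPTS.get(outcome)  # None means say nothing
    return {"mydur": duration_seconds, "prompt": prompt}
```

As in the VoiceXML version, an outcome that is neither busy nor noanswer produces no prompt and the dialog simply continues.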
