Introduction to VoiceXML and Voice Web Architecture
Ken Rehor
© 2007 Ken Rehor. All Rights Reserved.
Session Overview
• Voice Web Architecture
  • Components of a Voice Web Application
• Voice Standards
  • W3C Speech Interface Framework
• VoiceXML
  • Language features
  • Execution model - Form Interpretation Algorithm (FIA)
• Application Design Techniques
  • Static vs. dynamic VoiceXML
  • Performance Considerations
• CCXML, VoiceXML and VoIP
• Application Deployment Models
• New Technologies
  • Speaker Biometrics, Video, Multimodal, VoiceXML 3.0
Simplifying Voice Services programming
• Web-based architecture for interactive speech services
  • Exploit web technologies to simplify voice service creation and deployment
  • Enable consolidation of voice and web services
  • Separate service logic from user interaction
• High-level programming languages
  • Control speech and telephony resources in a uniform manner
  • Shield application programmers from implementation details
    • No need to know ASR, TTS, telephony APIs
• Create portable applications
  • Run on enterprise system or in telephone network
  • Run on a variety of platforms, ASR agnostic
Voice Web Application Architecture
Key Ideas
• Standard/Common high-level language
  • Designed for the task
• Leverage open, known technology
  • Web protocols, servers, networks, development tools, expertise
• Phone number mapped to URL
  • Phone number associated with URL of voice service
Voice / Web Application Architecture
[Diagram: any phone connects over the PSTN or VoIP to a VoiceXML browser; a web browser and the VoiceXML browser both reach the application (web) server over HTTP across the Internet or an intranet. Documents shown in transit: <html>, <vxml>, <grxml>, .wav.]
• VoiceXML side
  • Grammars
  • Audio files
  • Scripts
• Application (web) server
  • Application logic
  • Content and data
  • Transaction processing
  • Database interface
• Web side
  • Images
  • Audio files
  • Scripts
Voice Application Architecture and Components
[Diagram: a caller on the PSTN reaches a VoiceXML platform, which fetches <vxml>, <grxml>, and .wav resources from a web server over HTTP across the Internet or an intranet. Platform components shown: VoiceXML interpreter, middleware, ASR, TTS, Audio, DTMF, Telephony, OA&M. Sample exchange: the system says "Welcome to Acme products …" and the caller says "Customer service, please…".]
Application Backend Architecture
[Diagram: the application (web) server delivers <vxml>, grammars, audio files, and scripts over HTTP across the Internet or an intranet, and connects to a transaction server, a database (content), and a web service.]
• Application (web) server
  • Application logic
  • Content and data
  • Transaction processing
  • Database interface
Components of a Voice Solution
• Traditional phone, VoIP phone, mobile phone, or multimodal device
• Telephone network
  • Circuit-switched PSTN or packet-switched VoIP
  • Connects caller's telephone with Telephony Server
• Voice User Interface
  • Dialog structure / flow
  • Prompts – what the application says to the user
  • Speech grammars – what the user can say
• Application logic that executes on an application server
  • Web "back-end"
  • Database, or database interface
• VoiceXML Server that executes dialogs
  • Controls resources such as ASR, SIV, TTS, etc.
• Data network to connect application server and VoiceXML server
Inbound or Outbound calls
• VoiceXML application works the same for inbound and outbound calls
  • Additional call progress detection generally required for outbound
• Simple protocol for initiating outbound calls
  • No firm standards, but most vendors follow similar techniques
  • HTTP, Web Services, etc.
Standards
Value of Open Standards
• Non-proprietary interfaces between components
• Allow choice of best components for the task
• User interface languages
  • W3C Speech Interface Framework: VoiceXML, SRGS, SSML, SI
  • W3C: HTML, XHTML, SMIL, X+V
  • OMA: WAP
• Communication protocols
  • W3C: CCXML for 3rd-party telephony call control
  • W3C: HTTP, HTTPS, SOAP, WSDL
  • IETF: SIP, MRCP, MSCP
  • 3GPP: IMS
  • ITU: T1, ISDN
Visual vs. Voice markup
• Web app UI: HTML
  • Structure
  • Layout
  • Input declaration
  • Transitions
  • Images
  • Audio files / streams
  • Video
  • Text
  • Scripts
• Voice Web app UI: VoiceXML
  • Structure
  • Dialog flow
  • Input declaration
  • Transitions
  • Audio files
  • Video, Images
  • Text (for TTS)
  • Scripts
Protocols
• Web applications
  • HTTP, HTTPS
  • RTP
  • SOAP
  • WSDL
  • …
• Voice Web applications
  • HTTP, HTTPS
  • RTP
  • SOAP
  • WSDL
  • SIP
  • …
Voice Standards Activities
• Speech Interface Framework
• Network protocols
  • SIP, MRCP v2, etc.
• Platform Certification, Developer Certification, Speaker Biometrics, Architecture, Tools
Voice Application Standards
[Diagram: standards used at each interface of a voice platform. Only the labels survive extraction; they are grouped here by the interface they appear next to.]
• Components: Caller, Phone Network, VoIP Gateway, CCXML Browser, VoiceXML Browser, Conference/Media Server, Media Mixer/Server, MRCP Client, ASR / TTS / SIV Servers, CCXML Call Control Application, VoiceXML Application
• Phone network: T1 / E1, ISDN, SS7
• Telephony Control Interface: SIP, etc.
• Dialog Control Interface: SIP, MSCP, etc.
• Media Control Interface and related control protocols: SIP, Netann, MSCML, MOML / MSML, MSCP, DMSP, MGCP, etc.
• Media: RTP, RFC 2833, DTMF, Audio; G.711, WAV, .au, mp3, etc.
• Speech resources: MRCP v1, MRCP v2, GRXML, SSML
• Application fetching: HTTP, HTTPS, SOAP; VXML, GRXML, CCXML, Scripts
• Languages: VoiceXML 2.0, VoiceXML 2.1, ECMAScript 262
• ** standards in progress **
W3C Speech Interface Framework
Voice Application Components
• Dialog – flow control of the inputs, outputs, next steps
• Input grammars
  • Control input constraints for DTMF and speech recognition
• Output formatting
  • Pronunciation, timing, sequencing
W3C Speech Interface Framework
• VoiceXML
• SRGS
• SSML
• Semantic Interpretation
• Pronunciation Lexicon
• Call Control
For more information, see: W3C Voice Browser Working Group, http://www.w3.org/Voice/
Voice User Interface - Dialog
• W3C VoiceXML 2.0
  • W3C Recommendation March 2004
  • Widely implemented
    • Approximately 4 dozen platforms
    • Many service providers worldwide
  • VoiceXML Forum certification program
    • Nearly two dozen certified platforms, more coming
• W3C VoiceXML 2.1
  • Candidate Recommendation Sept 2006
  • Test suite under development; Certification Program to follow
  • Many platform vendors are implementing
• W3C VoiceXML 3.0
  • Early stages of development
  • SCXML – state chart markup language designed as a controller for V3 and CCXML 2.0 ("Working Draft" Jan 2006)
User Interaction – Input / Output Control
• Input grammars: W3C SRGS 1.0
  • W3C Recommendation
  • Widely implemented
• Output formatting: W3C SSML 1.0
  • W3C Recommendation
  • Widely implemented on paper, but real support is limited (most TTS engines ignore the SSML instructions)
• Semantic Interpretation for Speech Recognition: W3C SISR 1.0
  • Nearing Candidate Recommendation
  • Implementation gaining acceptance
W3C Speech Interface Framework: Semantic Interpretation
W3C Speech Recognition Grammar Specification
• Markup language to control input constraints
  • Finite-state speech recognition
  • DTMF recognition
• Two variations
  • XML (GRXML)
  • ABNF
• Version 1.0: W3C Recommendation – March 2004
• Implemented and supported by numerous vendors
GRXML ASR example

    <grammar type="application/srgs+xml" root="r2" version="1.0">
      <rule id="r2" scope="public">
        <one-of>
          <item>coffee</item>
          <item>tea</item>
          <item>milk</item>
          <item>nothing</item>
        </one-of>
      </rule>
    </grammar>
GRXML DTMF example

    <?xml version="1.0"?>
    <!-- root="pin" added so the grammar can be referenced without naming a rule -->
    <grammar mode="dtmf" version="1.0" root="pin"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/06/grammar
                                 http://www.w3.org/TR/speech-grammar/grammar.xsd"
             xmlns="http://www.w3.org/2001/06/grammar">
      <!-- A single DTMF digit -->
      <rule id="digit">
        <one-of>
          <item> 0 </item>
          <item> 1 </item>
          <item> 2 </item>
          <item> 3 </item>
          <item> 4 </item>
          <item> 5 </item>
          <item> 6 </item>
          <item> 7 </item>
          <item> 8 </item>
          <item> 9 </item>
        </one-of>
      </rule>
      <!-- A PIN: exactly four digits followed by the # key -->
      <rule id="pin" scope="public">
        <one-of>
          <item>
            <item repeat="4"><ruleref uri="#digit"/></item>
            #
          </item>
        </one-of>
      </rule>
    </grammar>
W3C Speech Synthesis Markup Language
• Markup language to control spoken and audio output
• Version 1.0: W3C Recommendation – Sept 2004
  • Implemented and supported by numerous vendors
• Version 1.1: under development
  • Adds support for tonal languages
  • First public Working Draft published January 2007
SSML Functions
• Audio output
  • <audio>
• Text-to-Speech output
  • Contained within SSML constructs
• Pronunciation controls
  • <say-as>
    • Interpret-as
    • Format
    • Detail
  • <emphasis>
• Timing
  • <break>
SSML Functions (cont'd)
• Spoken language
  • xml:lang
• Prosody and Style – voice control
  • Voice
    • Gender
    • Age
    • Name
  • Prosody
    • <prosody>
      • Pitch
      • Contour
      • Range
      • Rate
      • Duration
      • Volume
SSML Functions (cont'd)
• Sentence structure
  • <p>
  • <s>
• <phoneme>
• Modify text
  • <sub>: substitute text
• Location identification
  • <mark>
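The SSML functions listed on these slides can be combined in a single document. The following is a minimal sketch, not taken from the original deck: the wording, the file name goodbye.wav, the mark name, and the say-as values are illustrative assumptions. SSML 1.0 does not itself standardize the values of interpret-as and format, so the date handling shown depends on the TTS engine.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Sentence structure: <p> and <s> -->
  <p>
    <!-- Pronunciation control: say-as values are engine-dependent (assumed here) -->
    <s>Your order ships on <say-as interpret-as="date" format="mdy">3/15/2007</say-as>.</s>
    <!-- Timing: half-second pause between sentences -->
    <break time="500ms"/>
    <s>The total is <emphasis level="strong">forty dollars</emphasis>.</s>
  </p>
  <!-- Voice and prosody control -->
  <voice gender="female">
    <prosody rate="slow" volume="loud">
      <!-- Modify text: speak the alias instead of the written form -->
      Thank you for calling <sub alias="Acme Incorporated">Acme Inc.</sub>
    </prosody>
  </voice>
  <!-- Location identification: lets the platform report how far playback got -->
  <mark name="end_of_message"/>
  <!-- Audio output: the text content is spoken only if the file cannot be played -->
  <audio src="goodbye.wav">Goodbye.</audio>
</speak>
```

Inside a VoiceXML document the same elements appear as children of <prompt> rather than <speak>.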
VoiceXML 2.x
VoiceXML Scope
• Human-machine interaction provided by voice response systems:
  • Output
    • play audio files
    • produce synthesized speech
  • Input
    • record spoken input
    • recognize spoken input
    • collect character input
  • Control flow
  • Telephony
    • transfer a user to another destination, such as a live agent
    • disconnect a user
VoiceXML Goals
• Separate user interaction from service logic
  • Creates new possible business models
  • Service developer can be separate from telephony platform provider
• Enable service portability across implementation platforms
  • Assume common set of platform capabilities
• Provide common language for:
  • Content providers, Tool providers, Platform providers
• Safely handle shared network-based applications
  • Deterministic behavior
• Easy to build common types of applications
• Features to build complex types of applications
• Shield application authors from low-level platform-specific details
  • Promotes portability, ease of service creation
VoiceXML 2.0 Basic Functions
• Input
  • <field>, <menu>: recognition
  • <record>: audio recording
• Output
  • <prompt>: container for TTS or prerecorded audio
  • <audio>: prerecorded audio
• Control Flow
  • <if>, <else>, <elseif>: basic conditional logic
  • <script>: complex scripts using ECMAScript
  • <goto>: transition to a new document
  • <submit>: submit data to a web application
• Telephony
  • <disconnect>
  • <transfer>
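Several of these basic functions can be seen working together in one short dialog. The sketch below is illustrative only: the form and field names and the URL http://example.com/save are hypothetical, and it assumes the platform supports the built-in boolean field type, which not every platform provides.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="feedback">
    <!-- Input: record up to 30 seconds of caller audio -->
    <record name="msg" beep="true" maxtime="30s">
      <prompt>Please leave your message after the beep.</prompt>
    </record>
    <!-- Input: yes/no recognition using the built-in boolean grammar (assumed available) -->
    <field name="keep" type="boolean">
      <prompt>Do you want to send this message?</prompt>
      <filled>
        <!-- Control flow: branch on the recognition result -->
        <if cond="keep">
          <!-- Post the recording to a hypothetical web application -->
          <submit next="http://example.com/save" method="post"
                  enctype="multipart/form-data" namelist="msg"/>
        <else/>
          <prompt>Message discarded. Goodbye.</prompt>
          <!-- Telephony: hang up -->
          <disconnect/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```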
VoiceXML Execution Model
• Form Interpretation Algorithm (<form>)
• Execution is synchronous (mostly)
  • Disconnect events are handled (somewhat) asynchronously
• Audio is queued
  • Played only when encountering a waiting state
• Processing is always in one of two states:
  • Waiting for input in an input item
    • such as <field>, <record>, or <transfer>
  • Transitioning between input items in response to an input
• Event-driven
  • <catch>, <throw>: generalized event mechanism
  • <nomatch>, <noinput>: short-hand user-input event handling
  • <error>: short-hand error event handling
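The execution model described above can be traced through a small form. This is a sketch under stated assumptions rather than an example from the original deck: the event name com.example.toomany, the grammar file drink.grxml, and the URL are hypothetical, and the built-in number field type is assumed to be available on the platform.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="order">
    <!-- Generalized event handling: catches an application-defined event (hypothetical name) -->
    <catch event="com.example.toomany">
      <prompt>Sorry, the limit is ten.</prompt>
      <!-- Clearing the variable makes the FIA select the quantity field again -->
      <clear namelist="quantity"/>
    </catch>

    <!-- Transitioning state: this prompt is only queued here -->
    <block>
      <prompt>Welcome to the order line.</prompt>
    </block>

    <!-- Waiting state: queued audio plays, then the platform listens -->
    <field name="drink">
      <prompt>Would you like coffee, tea, or milk?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
      <!-- Short-hand user-input event handlers -->
      <noinput>I did not hear you. <reprompt/></noinput>
      <nomatch>I did not understand. <reprompt/></nomatch>
    </field>

    <field name="quantity" type="number">
      <prompt>How many?</prompt>
      <filled>
        <if cond="quantity &gt; 10">
          <throw event="com.example.toomany"/>
        </if>
      </filled>
    </field>

    <!-- Reached once every field above has a value -->
    <block>
      <submit next="http://example.com/order" namelist="drink quantity"/>
    </block>
  </form>
</vxml>
```

On each pass the FIA picks the first form item whose variable is still undefined, which is why clearing quantity in the handler is enough to ask the question again.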
Key Points
• Architecture leverages all things "internet"
  • Languages, protocols, servers, developers, etc.
• Separation of concerns
  • Application logic / database vs. telephony / speech resources
  • Enables new business models
    • Voice ASP
    • Prepackaged applications
• URL (application) associated with phone number
  • Calling party or Called party
• Share resources among many applications (Voice ASP)
• High-level languages, specific to domain / task
  • Simplify development and maintenance
VoiceXML <form> and <field>
• <form>
  • Dialog container
  • "Form Interpretation Algorithm" (FIA) specifies default behavior
• <field>
  • Collect input from caller
  • <grammar> specifies input 'constraints'
• <prompt>
  • Container for <audio> and text
Example: main.vxml

    <?xml version="1.0"?>
    <vxml version="2.0">
      <form>
        <field name="main_menu">
          <prompt>
            <audio src="welcome.wav">
              Welcome to Acme. You can choose sales, repair, or order status.</audio>
          </prompt>
          <grammar src="main_menu.grxml"/>
        </field>
        <block>
          <submit next="http://acme.com/route... " method="get"/>
        </block>
      </form>
    </vxml>

Note: Code simplified for demonstration purposes…
User Input - Grammars
• Grammars can be speech or DTMF (touchtone)
  • Both types can be active simultaneously
• Specified by SRGS
  • XML grammars are normative (aka GRXML)
  • ABNF grammars are more concise but more complex to author
• Grammars may be specified inline or sourced externally
  • External grammars are referenced by URI
• Multiple grammars may be active simultaneously.
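A field can combine an external speech grammar with an inline DTMF grammar so that both are active at once. The sketch below assumes an external file named main_menu.grxml and uses hypothetical form, field, and rule names; the key assignments are made up for illustration.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="route">
    <field name="dept">
      <prompt>
        Say sales, repair, or order status.
        Or press 1 for sales, 2 for repair, or 3 for order status.
      </prompt>
      <!-- External speech grammar, referenced by URI -->
      <grammar src="main_menu.grxml" type="application/srgs+xml"/>
      <!-- Inline DTMF grammar, active at the same time as the speech grammar -->
      <grammar mode="dtmf" version="1.0" root="key">
        <rule id="key" scope="public">
          <one-of>
            <item>1</item>
            <item>2</item>
            <item>3</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <!-- Without semantic tags, dept holds the spoken phrase or the digit pressed -->
        <prompt>You chose <value expr="dept"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Because neither grammar carries semantic tags in this sketch, the application receives different values for the same choice depending on the input mode, and the server-side logic would need to accept both.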
Grammars can get very complicated: there are many ways to say the same thing…
• Sales
  • I'd like to place an order
  • I need to talk to a salesman
• Repair
  • repair department
  • service
  • service department
  • customer service
• Order status
  • where's my order?
  • track my order
  • track my shipment
  • where the hell is my stuff?
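One way to keep the application simple when callers phrase things differently is to let the grammar map every phrasing to a single value. The sketch below uses Semantic Interpretation tags in the semantics/1.0 script format; the returned values ("sales", "repair", "status") are illustrative assumptions, only some of the phrasings above are covered, and platform support for this tag format varies.

```xml
<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
         version="1.0" root="dept" tag-format="semantics/1.0">
  <rule id="dept" scope="public">
    <one-of>
      <!-- Every sales phrasing returns the same value -->
      <item>
        <one-of>
          <item>sales</item>
          <item>I'd like to place an order</item>
          <item>I need to talk to a salesman</item>
        </one-of>
        <tag>out = "sales";</tag>
      </item>
      <!-- repeat="0-1" makes the word "department" optional -->
      <item>
        <one-of>
          <item>repair <item repeat="0-1">department</item></item>
          <item>service <item repeat="0-1">department</item></item>
          <item>customer service</item>
        </one-of>
        <tag>out = "repair";</tag>
      </item>
      <item>
        <one-of>
          <item>order status</item>
          <item>where's my order</item>
          <item>track my <one-of><item>order</item><item>shipment</item></one-of></item>
        </one-of>
        <tag>out = "status";</tag>
      </item>
    </one-of>
  </rule>
</grammar>
```

With a grammar like this, the field variable receives one of three values no matter which phrase was spoken, so the routing logic on the server does not need to know the individual phrasings.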
Basic GRXML grammar example: main_menu.grxml

    <grammar …xml:lang="en-US" version="1.0">
      <rule id="dept" scope="public">
        <one-of>
          <item>sales</item>
          <item>repair</item>
          <item>order status</item>
        </one-of>
      </rule>
    </grammar>
VoiceXML example – next step: sales.vxml

    <form>
      <field name="sales_menu">
        <prompt>
          <audio src="sales_menu.wav">
            You've reached Acme's sales department.
            To place an order, say sales.
            To speak to an associate, say I'd like to speak to someone.
          </audio>
        </prompt>
        <grammar src="sales_menu.grxml"/>
      </field>
      <block>
        <submit next="http://acme.com/... " method="get"/>
      </block>
    </form>
VoiceXML example with error handling: newmain.vxml (adds <noinput>)

    <form>
      <field name="main_menu">
        <prompt>
          <audio src="welcome.wav">
            Welcome to Acme. You can choose sales, repair, or order status.</audio>
        </prompt>
        <grammar src="main_menu.grxml"/>
      </field>
      <noinput> You must say something. </noinput>
      <block>
        <submit next="http://acme.com/route... " method="get"/>
      </block>
    </form>
VoiceXML example with error handling: newmain.vxml (adds <nomatch>)

    <form>
      <field name="main_menu">
        <prompt>
          <audio src="welcome.wav">
            Welcome to Acme. You can choose sales, repair, or order status.</audio>
        </prompt>
        <grammar src="main_menu.grxml"/>
      </field>
      <noinput> You must say something. </noinput>
      <nomatch> I didn't understand you. Please try again. </nomatch>
      <block>
        <submit next="http://acme.com/route... " method="get"/>
      </block>
    </form>
VoiceXML example with error handling: newmain.vxml (adds <help>)

    <form>
      <field name="main_menu">
        <prompt>
          <audio src="welcome.wav">
            Welcome to Acme. You can choose sales, repair, or order status.</audio>
        </prompt>
        <grammar src="main_menu.grxml"/>
      </field>
      <help> You can say sales, repair, or order status. </help>
      <noinput> You must say something. </noinput>
      <nomatch> I didn't understand you. Please try again. </nomatch>
      <block>
        <submit next="http://acme.com/route... " method="get"/>
      </block>
    </form>
Set platform features via <property>
• Input modes: type of input from a caller
  • DTMF-only: <property name="inputmodes" value="dtmf"/>
  • Voice-only: <property name="inputmodes" value="voice"/>
  • Both: <property name="inputmodes" value="dtmf voice"/>
• Timeouts
  • <property name="timeout" value="1450ms"/>
  • <property name="termtimeout" value="2500ms"/>
  • ...
Call processing: <transfer>
• Blind
  • Go somewhere but don't return
• Bridge
  • Add on another party, resume execution when done talking
Call processing: <transfer>
• Blind transfer

    <form id="xfer">
      <block>
        <prompt> Calling Riley. Please wait. </prompt>
      </block>
      <transfer name="mycall" dest="tel:+1-555-123-4567">
      </transfer>
    </form>
Call processing: <transfer>
• Bridge transfer

    <form id="xfer">
      <block>
        <prompt> Calling Riley. Please wait. </prompt>
      </block>
      <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
      </transfer>
    </form>
Call processing: <transfer>
• Bridge transfer with cancel feature

    <form id="xfer">
      <block>
        <prompt> Calling Riley. Please wait. </prompt>
      </block>
      <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
        <prompt> Say cancel at any time to disconnect this call.</prompt>
        <grammar src="cancel.grxml" type="application/srgs+xml"/>
      </transfer>
    </form>
Call processing: <transfer>
• Bridge transfer with result handling

    <form id="xfer">
      <block>
        <prompt> Calling Riley. Please wait. </prompt>
      </block>
      <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
        <prompt> Say cancel at any time to disconnect this call.</prompt>
        <grammar src="cancel.grxml" type="application/srgs+xml"/>
        <filled>
          <assign name="mydur" expr="mycall$.duration"/>
          <if cond="mycall == 'busy'">
            <prompt> Riley's line is busy. Try again later. </prompt>
          <elseif cond="mycall == 'noanswer'"/>
            <prompt> Riley didn't answer the phone. Please call back another time. </prompt>
          </if>
        </filled>
      </transfer>
    </form>