Speech Technologies and VoiceXML

Chun-Feng Liao, NCCU Department of Computer Science, Intelligent Media Lab, g9104@cs.nccu.edu.tw

Presentation Agenda • Voice technology background • ASR/TTS • Voice browsing with VoiceXML • VoiceXML architecture • VoiceXML programming • Future of VoiceXML • Summary


Voice Technologies • In the mid- to late 1990s, personal computers became powerful enough to support automatic speech recognition (ASR). • The two key technologies behind these advances are speech recognition (ASR, also abbreviated SR) and text-to-speech synthesis (TTS).

Speech Recognition [Figure. Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)]

Speech Synthesis [Figure. Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)]

Pervasive Computing Model • E-business has moved from a client-server model to a web-centric model. • Once connected to the Internet, people can get any information they want, but they want more convenient ways to connect. • Lou Gerstner, CEO of IBM: the pervasive computing model is a billion people interacting with a million e-businesses through a trillion interconnected devices.

Voice Browsing • VoiceXML instead of HTML • A voice browser instead of an ordinary web browser • Phone instead of PC.

VoiceXML Key Design Issues • Speech Input: speech recognition and DTMF • Speech Output: pre-recorded audio and synthesized speech • Internet: XML, IP, HTTP, SSL, JavaScript • Telephony: call transfer, data passing

W3C Voice Browser Working Group • Founded May 1999 • 60 member companies • Mission: a standards group that prepares and reviews markup languages to enable Internet-based speech applications • http://www.w3.org/Voice

VoiceXML Forum • Industry Group to promote VoiceXML • 550+ member companies • Submitted VoiceXML 1.0 to W3C in May 2000 • http://www.voicexml.org

VoiceXML v1.0 (May 2000) • VoiceXML Forum • Specification submitted to the W3C • VoiceXML v2.0 • W3C Voice Browser Working Group • 50+ members collaborating • Addressed 400+ change requests

VoiceXML Overview • A language for specifying voice dialogs. • Voice dialogs use audio prompts and text-to-speech (TTS) for output; touch-tone keys (DTMF) and automatic speech recognition (ASR) for input. • Main input/output device (initially) is the phone. • Leverages the Internet for application development and delivery. • A standard language enables portability (VoiceXML unifies the dialog description language).

VoiceXML Platform Architecture

VoiceXML Platform Architecture-1 • Telephone and telephone network: connects the caller's telephone with the telephony server • VoiceXML gateway • Voice browser • Audio input: speech recognition (ASR), touch-tone (DTMF), audio recording • Audio output: audio playback, speech synthesis (TTS) • Interface, call controls

VoiceXML Platform Architecture-2 • VoiceXML documents • Dialog and flow control • Client-side scripting (ECMAScript) • Speech recognition grammars • Speech synthesis pronunciation control • Document servers (web servers) • Serve static VoiceXML documents or audio files • Application servers • Generate VoiceXML documents dynamically • Server-side application logic • Connect to a database or a database interface

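Example: VoiceXML and JSP • Components: VoiceXML browser, web server + Servlet/JSP engine, DB
 • Source on the server (weather.jsp):
    <% user.storePreference("try"); %>
    <form>
      <block>Today's temperature is <%= weather.getTemp() %> degrees.</block>
    </form>
 • Output delivered to the VoiceXML browser:
    <form>
      <block>Today's temperature is 25 degrees.</block>
    </form>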

Voice Gateway

Implementations of VoiceXML Gateways • In Taiwan: • Yes Mobile • Chunghwa Telecom Laboratories (second-generation voice platform) • eWings Technologies, Inc. • Free • IBM Voice Server SDK • Open source • CMU: OpenVXI

[DEMO]A Simple VoiceXML Application

 • A VoiceXML document defines one or more dialogs • The user is always in exactly one dialog at any time • Each dialog specifies the next dialog to transition to using a URL • [Diagram] Document doc1.vxml contains Dialog 1 (transition: #dialog2) and Dialog 2 (transition: http://xyz.com/doc2.vxml)
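
The transition pattern on this slide could be written roughly as follows. This is a minimal sketch, assuming VoiceXML 2.0 syntax: the form ids dialog1 and dialog2 and the prompt wording are illustrative, and only the target URL comes from the slide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Dialog 1 (hypothetical id): greets the caller, then moves to
       another dialog in the same document via a fragment URI -->
  <form id="dialog1">
    <block>
      <prompt>Welcome to the travel service.</prompt>
      <goto next="#dialog2"/>
    </block>
  </form>

  <!-- Dialog 2 (hypothetical id): hands over to a dialog in a
       different document, addressed by an absolute URL -->
  <form id="dialog2">
    <block>
      <prompt>Transferring you to the booking service.</prompt>
      <goto next="http://xyz.com/doc2.vxml"/>
    </block>
  </form>
</vxml>
```

Execution starts with the first dialog in document order, so the caller hears both prompts before the browser fetches doc2.vxml.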

Dialog • A Dialog describes an interaction between a user and the system • Two kinds of dialogs: form and menu

VoiceXML Document Structure.

Form • A form keeps collecting the information for its fields, as constrained by each field's grammar. • Callouts on the slide: input, output, eval
    <form>
      <field name="travellers">
        <grammar mode="voice" src="./number.grxml"/>
        <prompt>How many are travelling?</prompt>
        <filled>
          <submit next="http://travel.com/order"/>
        </filled>
      </field>
    </form>
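
The slide references ./number.grxml without showing its content. A grammar of that kind might look like the sketch below, assuming the W3C SRGS XML format; the rule name and the word list are hypothetical and are not taken from the original file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="number">
  <!-- Hypothetical root rule: accepts exactly one of the listed words -->
  <rule id="number" scope="public">
    <one-of>
      <item>one</item>
      <item>two</item>
      <item>three</item>
      <item>four</item>
    </one-of>
  </rule>
</grammar>
```

Anything the caller says outside this list would not match the grammar, which is what makes the recognizer raise a nomatch event.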

Menu
    <menu id="commands">
      <prompt>What service would you like?</prompt>
      <choice next="/cars">Car hire</choice>
      <choice next="/hotels">Hotel reservations</choice>
      <choice next="/news">Today's news</choice>
    </menu>
 • A menu is in effect a form without fields. • A menu is a flow-control construct: depending on the user's choice, it transfers to a different URL.
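
A menu can also accept touch-tone input, which ties it to the DTMF input mentioned earlier in the deck. The variant below is a sketch under the assumption that the platform implements the VoiceXML 2.0 dtmf attribute and the enumerate element; the wording of the generated choice list is platform-specific.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- dtmf="true" assigns the keys 1, 2, 3 to the choices in document order -->
  <menu id="commands" dtmf="true">
    <prompt>
      What service would you like?
      <!-- enumerate reads out the available choices -->
      <enumerate/>
    </prompt>
    <choice next="/cars">Car hire</choice>
    <choice next="/hotels">Hotel reservations</choice>
    <choice next="/news">Today's news</choice>
    <noinput>
      <prompt>Please say or key in one of the choices.</prompt>
      <reprompt/>
    </noinput>
  </menu>
</vxml>
```

The caller can then either say "Hotel reservations" or press 2 to reach /hotels.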

Submit • Typically used to send results from client to server • Syntax: <submit next="URI" namelist="var1 var2 ..."/> • namelist: the fields (variables) to pass to the next page.

Submit, Example • Field names become ECMAScript variables, so they cannot contain a hyphen.
    <form>
      <field name="dest_city">
        <prompt>Where do you want to go to?</prompt>
        <grammar mode="voice" src="./cities.grxml"/>
      </field>
      <field name="travellers">
        <prompt>How many are travelling to <value expr="dest_city"/>?</prompt>
        <grammar mode="voice" src="./number.grxml"/>
      </field>
      <filled>
        Thank you. Your order is now being processed.
        <submit next="http://travel.com/order" namelist="dest_city travellers"/>
      </filled>
    </form>

Variables • Variables can be manipulated and referenced • Declare: <field name="user2"> • Assign: <assign name="user1" expr="'peter'"/> • Clear: <clear namelist="user1 user2"/> • Reference: How many are travelling to <value expr="dest_city"/>? • No $ prefix is needed when referencing a variable
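
The four operations on this slide could be combined in one small document, sketched below under the assumption of VoiceXML 2.0 syntax. The form id, the block name, and the second value 'mary' are illustrative. The var element is used for the declaration here because the variable is not filled by the caller.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="greeting">
    <!-- Declare: a dialog-scoped variable with an initial value -->
    <var name="user1" expr="'peter'"/>
    <block name="say_hello">
      <!-- Reference: the plain variable name, no $ prefix -->
      <prompt>Hello, <value expr="user1"/>.</prompt>
      <!-- Assign: expr is an ECMAScript expression, hence the inner quotes -->
      <assign name="user1" expr="'mary'"/>
      <prompt>The name is now <value expr="user1"/>.</prompt>
      <!-- Clear: resets user1 to the ECMAScript undefined value -->
      <clear namelist="user1"/>
    </block>
  </form>
</vxml>
```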

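Variable Scope • Session variables are read-only variables provided by the interpreter context • The innermost (anonymous) scope is defined by the element containing executable content (<block>, <filled>, or an event handler) • A variable name is searched for from the innermost scope outward: anonymous, dialog, document, application, session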
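
The lookup order can be made visible by declaring the same name at several levels. The following is a sketch assuming VoiceXML 2.0 scoping rules; the variable name greeting and its values are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document scope -->
  <var name="greeting" expr="'hello from the document'"/>

  <form id="scopes">
    <!-- Dialog scope: hides the document-level variable of the same name -->
    <var name="greeting" expr="'hello from the dialog'"/>
    <block>
      <!-- Anonymous scope: exists only while this block executes -->
      <var name="greeting" expr="'hello from the block'"/>
      <!-- Unqualified name: resolved in the innermost scope first -->
      <prompt><value expr="greeting"/></prompt>
      <!-- Qualified names reach the outer scopes explicitly -->
      <prompt><value expr="dialog.greeting"/></prompt>
      <prompt><value expr="document.greeting"/></prompt>
    </block>
  </form>
</vxml>
```

Prefixing a name with its scope (dialog., document., application., session.) is the way to reach a variable that an inner declaration hides.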

Error Handling: Events • Events are used to signal "unexpected" situations • Events are caught by a <catch> event handler • <catch event="com.acme.mailreader">...</catch> • <catch event="nomatch noinput">...</catch> • Shortcut: <nomatch> is equivalent to <catch event="nomatch"> • Other shortcuts: <noinput>, <error>

Events, Example
    <field name="dest_city">
      <prompt>Where do you want to go to?</prompt>
      <grammar mode="voice" src="./cities.grxml"/>
      <nomatch>Please say the city you want to fly to.</nomatch>
    </field>
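
Handlers can also be placed at document level and made to fire only after repeated failures. The sketch below assumes the VoiceXML 2.0 count attribute and handler selection rules; the form id, the prompts, and the threshold of three are illustrative choices, not part of the original slides.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document-level handler: selected on the third occurrence of
       either event, because it has the highest count not above 3 -->
  <catch event="nomatch noinput" count="3">
    <prompt>Sorry, I am having trouble understanding you. Goodbye.</prompt>
    <exit/>
  </catch>

  <form id="trip">
    <field name="dest_city">
      <prompt>Where do you want to go to?</prompt>
      <grammar mode="voice" src="./cities.grxml"/>
      <!-- Field-level shortcuts handle the first and second occurrences -->
      <noinput>
        <prompt>I did not hear anything.</prompt>
        <reprompt/>
      </noinput>
      <nomatch>
        <prompt>Please say the city you want to fly to.</prompt>
        <reprompt/>
      </nomatch>
    </field>
  </form>
</vxml>
```

The nomatch and noinput counters are kept separately, so the exit happens after three occurrences of the same event, not three failures in total.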

Multimodal Web Browsing • XHTML + VoiceXML • SALT (Speech Application Language Tags)

[DEMO]Multimodal Browsing

Future of the "Voice" Web and VoiceXML [Timeline diagram] • Sun/SpeechWorks (1999): JSML, JSGF • VoiceXML Forum (2000): VoiceXML 1.0 • W3C (2003, in CR): VoiceXML 2.0, speech synthesis (SSML), speech recognition grammar, speech semantics • W3C, early work: pronunciation lexicon, call control, voice browser interoperation • Microsoft-led (2002): SALT (Speech Application Language Tags) • Open questions: W3C VoiceXML 3? NLP

Conclusion • Speech is the most natural way for humans to communicate, so it will become an important modality in human-computer interaction (HCI). • VoiceXML has changed how speech recognition and telephony applications are developed and deployed, by bringing the web development model to them.

Q & A

Backup

History of VoiceXML [Figure. Source: VoiceXML Forum (http://www.voicexml.org)]

Show : VoiceXML in Daily Life

Classification of Voice Applications • Basic interactive voice response (IVR) • Computer: “For stock quotes, press 1. For trading, press 2. …” • Human: (presses DTMF “1”) • Basic speech ASR • C: “Say the stock name for a price quote.” • H: “Lucent Technologies”

Classification of Voice Applications • Advanced speech ASR • C: “Stock Services, how may I help you?” • H: “Uh, what’s Lucent trading at?” • “Near-natural language” ASR • C: “How may I help you?” • H: “Um, yeah, I’d like to get the current price of Lucent Technologies” • C: “Lucent is up two at sixty eight and a half.” • H: “OK. I want to buy one hundred shares at market price.” • C: “…”

Speech Recognition • Capturing the (analog) speech signal • Digitizing the sound waves and converting them to basic language units, or phonemes • Constructing words from the phonemes • Analyzing the words in context to choose the correct spelling for words that sound alike (such as "write" and "right")

Speech Synthesis • Speech synthesis, or text-to-speech, is the process of converting text into spoken language. • Breaking down the words into phonemes • Analyzing the text for special handling, such as numbers and currency amounts • Generating the digital audio for playback

VoiceXML Gateway(detail)

Programming VoiceXML • Writing a VoiceXML application is programming. • Control constructs are procedural (if-else etc.) • VoiceXML platform iterates through a <form> until values for all field items have been collected
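
The procedural control constructs mentioned above could look like the following sketch. It assumes VoiceXML 2.0 syntax and a platform that supports the built-in number field type; the form id, the thresholds, and the prompts are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="booking">
    <!-- type="number" uses a built-in grammar (assumed to be available) -->
    <field name="travellers" type="number">
      <prompt>How many are travelling?</prompt>
      <filled>
        <!-- cond holds an ECMAScript expression; "<" must be escaped as &lt; -->
        <if cond="travellers == 1">
          <prompt>Booking a single ticket.</prompt>
        <elseif cond="travellers &lt;= 4"/>
          <prompt>Booking <value expr="travellers"/> tickets.</prompt>
        <else/>
          <prompt>For larger groups, please call our group desk.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Note that elseif and else are empty elements that divide the content of the enclosing if, rather than wrapping their own branch.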

VoiceXML System Components • VoiceXML servers serve as integrators of various hardware and software • [Diagram] The VoiceXML server sits at the center, connected to: speech synthesis (TTS), speech recognition (SR), speech grammars, voice biometrics, telecom boards, software utilities, PBX, call centre, CT integration

FIA - Form Interpretation Algorithm • The FIA has a main loop that repeatedly selects a form item and then visits it • The first (in document order) form item, whose field item variable is undefined, is selected • As a result, the user is prompted for each field item in turn

FIA – Form Example
    <form>
      <block>
        <prompt>Where do you want to go to and how many are travelling?</prompt>
      </block>
      <!-- Field item 1 -->
      <field name="dest_city">
        <prompt>Where do you want to go to?</prompt>
        <grammar mode="voice" src="./cities.grxml"/>
      </field>
      <!-- Field item 2 -->
      <field name="travellers">
        <prompt>How many are travelling to your destination?</prompt>
        <grammar mode="voice" src="./number.grxml"/>
      </field>
    </form>
