With thanks to Jim Larson
From Voice Browsers to Multimodal Systems
The W3C Speech Interface Framework
http://www.w3.org/Voice
Dave Raggett, W3C Lead for Voice/Multimodal
W3C & Openwave, dsr@w3.org
Voice – The Natural Interface, available from over a billion phones
• Personal assistant functions:
  • Name dialing and Search
  • Personal Information Management
  • Unified Messaging (mail, Fax & IM)
  • Call screening & call routing
• Voice Portals
  • Access to news, information, entertainment, customer service and V-commerce (e.g. Find a friend, Wine Tips, Flight info, Find a hotel room, Buy ringing tones, Track a shipment)
• Front-ends for Call Centers
  • 90% cost savings over human agents
  • Reduced call abandonment rates (IVR)
  • Increased customer satisfaction
(Portal Demo)
W3C Voice Browser Working Group
http://www.w3.org/Voice/Group
• Founded: May 1999 following workshop in October 1998
• Mission
  • Prepare and review markup languages to enable Internet-based speech applications
• Has published requirements and specifications for languages in the W3C Speech Interface Framework
• Is now due to be re-chartered with clarified IP policy
W3C Speech Interface Framework
[Diagram of the framework. Processing components: User, Telephone System, ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, World Wide Web, Media Planning, Language Generation, TTS, Prerecorded Audio Player. Markup languages and specifications: VoiceXML 2.0, Speech Recognition Grammar ML, N-gram Grammar ML, Natural Language Semantics ML, Speech Synthesis ML, Lexicon, Reusable Components, Call Control.]
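W3C Speech Interface Framework: Published Documents
Documents available at http://www.w3.org/Voice
[Chart: publication stage reached by each specification. Stages: REQ, WD, LCWD, CR, PR, REC. Specifications: Dialog, Speech Synthesis, Speech Grammar, N-gram, NL Semantics, Reusable Components, Lexicon, Call Control. Date labels as they appear in the chart: Soon, 1-01, 1-01, Soon, 12-99, 12-99, 12-99, 5-00, 12-99, 12-99, 12-99, 12-99, 12-99, 5-00, 2-01, 4-01.]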
Voice User Interfaces and VoiceXML
• Why use voice as a user interface?
  • Far more phones than PCs
  • More wireless phones than PCs
  • Hands and eyes free operation
• Why do we need a language for specifying voice dialogs?
  • High-level language simplifies application development
  • Separates Voice interface from Application server
  • Leverage existing Web application development tools
• What does VoiceXML describe?
  • Conversational dialogs: System and user take turns to speak
  • Dialogs based on form-filling metaphor plus events and links
• W3C is standardizing VoiceXML based upon VoiceXML 1.0 submission by AT&T, IBM, Lucent and Motorola
VoiceXML Architecture
Brings the power of the Web to Voice
[Diagram: Any Phone, PSTN or VoIP, VoiceXML Gateway (Carrier), Consumer or Corporate Web site (Corporation). Speech + DTMF travel on the phone side of the gateway; VoiceXML, Grammars, and Audio files travel on the Web side.]
Reaching Out to Multiple Channels
[Diagram: Applications and Database supply content (XML, Images, Audio, …) to a Content Adaptation stage, which delivers XHTML, VoiceXML, and WML/HDML.]
• Content Adaptation: adjust as needed for each device & user
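The adaptation step above can be sketched in Python. This is only an illustration under stated assumptions: the `render` function, the `CONTENT` record, and the channel names are hypothetical, and the templates are simplified fragments rather than complete, valid documents in any of the three markup languages.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical device-independent content, as it might come from the database.
CONTENT = {"title": "Flight info", "body": "Flight 12 departs at 9 & arrives at 11."}

def render(content: dict, channel: str) -> str:
    """Adapt one content record to a target markup (simplified templates)."""
    title = escape(content["title"])          # safe for element text
    title_attr = quoteattr(content["title"])  # safe for an attribute, quotes included
    body = escape(content["body"])
    if channel == "xhtml":
        return f"<html><head><title>{title}</title></head><body><p>{body}</p></body></html>"
    if channel == "voicexml":
        # A <block> is spoken to the caller when the form is entered.
        return f'<vxml version="2.0"><form><block>{title}. {body}</block></form></vxml>'
    if channel == "wml":
        return f"<wml><card title={title_attr}><p>{body}</p></card></wml>"
    raise ValueError(f"unknown channel: {channel}")

for channel in ("xhtml", "voicexml", "wml"):
    print(render(CONTENT, channel))
```

Keeping the content in one neutral record and choosing the template per request is what lets a single application serve a browser, a phone call, and a WAP handset.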
VoiceXML Features
• Menus, Forms, Sub-dialogs
  • <menu>, <form>, <subdialog>
• Inputs
  • Speech Recognition <grammar>
  • Recording <record>
  • Keypad <dtmf>
• Output
  • Audio files <audio>
  • Text-To-Speech
• Variables
  • <var>, <script>
• Events
  • <nomatch>, <noinput>, <help>, <catch>, <throw>
• Transition & submission
  • <goto>, <submit>
• Telephony
  • Call transfer
  • Telephony information
• Platform
  • Objects
• Performance
  • Fetch
Example VoiceXML
<menu>
  <prompt>
    <speak>
      Welcome to Ajax Travel. Do you want to fly to
      <emphasis> New York </emphasis> or <emphasis> Washington </emphasis>
    </speak>
  </prompt>
  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>
Example VoiceXML
<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>
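The form-filling behavior of the weather form can be sketched in Python as follows. This is a rough model, not the interpreter algorithm from the specification: `run_form`, the `WEATHER_FORM` structure, and the word sets standing in for `country.gram` and `city.gram` are all assumptions made for illustration, and the caller is simulated by a list of scripted answers.

```python
from typing import Callable, Dict, List

def run_form(fields: List[dict], listen: Callable[[str], str]) -> Dict[str, str]:
    """Visit each unfilled field in document order until every field has a value."""
    filled: Dict[str, str] = {}
    for field in fields:
        while field["name"] not in filled:
            answer = listen(field["prompt"]).strip().lower()
            if answer == "help":
                # Plays the role of <catch event="help">, then re-prompts.
                print(field["help"])
            elif answer in field["grammar"]:
                filled[field["name"]] = answer
            else:
                # Stands in for a nomatch event.
                print("Sorry, I did not understand.")
    return filled

# Hypothetical stand-ins for the two grammar files referenced by the form.
WEATHER_FORM = [
    {"name": "country", "prompt": "What country?",
     "grammar": {"france", "japan"},
     "help": "Please say the country for which you want the weather."},
    {"name": "city", "prompt": "What city?",
     "grammar": {"paris", "tokyo"},
     "help": "Please say the city for which you want the weather."},
]

answers = iter(["help", "France", "Paris"])   # scripted caller
result = run_form(WEATHER_FORM, lambda prompt: next(answers))
print(result)   # values that <submit namelist="city country"/> would send
```

The developer only declares fields, prompts, and grammars; the loop that decides what to ask next belongs to the platform, which is the point of the form-filling metaphor.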
VoiceXML Implementations
See http://www.w3.org/Voice
• BeVocal
• General Magic
• HeyAnita
• IBM
• Lucent
• Motorola
• Nuance
• PipeBeach
• SpeechWorks
• Telera
• Tellme
• Voice Genie
These are the companies that asked to be listed on the W3C Voice page.
Reusable Components
[Diagram: Voice Application Developer, Reusable Components, VoiceXML Scripts, Dialog Manager.]
Reusable Dialog Modules
• Express application at task level rather than interaction level
• Save development time by reusing tried and effective modules
• Increase consistency among applications
Examples include:
• Credit card number
• Date
• Name
• Address
• Telephone number
• Yes/No question
• Shopping cart
• Order status
• Weather
• Stock quotes
• Sport scores
• Word games
Speech Grammar ML
• Specifies the words and patterns of words for which a speaker-independent recognizer can listen
• May be specified
  • Inline as part of a VoiceXML page
  • Referenced and stored separately on Web servers
• Three variants: XML, ABNF, N-Gram
• Action Tags for “semantic processing”
Three forms of the Grammar ML
• XML
  • Modeled after Java Speech Grammar Format
  • Mandatory for Dialog ML interpreters
  • Manually specified by developer
  • Example: <rule id="state" scope="public"> <one-of> <item> Oregon </item> <item> Maine </item> </one-of> </rule>
• Augmented BNF syntax (ABNF)
  • Modeled after Java Speech Grammar Format
  • Optional for Dialog ML interpreters
  • May be mapped to and from XML grammars
  • Manually specified by developer
  • Example: public $state = Oregon | Maine
• N-grams
  • Optional for Dialog ML interpreters
  • Used for larger vocabularies
  • Generated statistically
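The mapping between the XML and ABNF forms can be sketched in Python for the state example. This is a minimal illustration that handles only a rule containing a single `<one-of>` list of plain-text items; the function name `rule_to_abnf` and the fallback scope of `private` are assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET

XML_RULE = """<rule id="state" scope="public">
  <one-of>
    <item> Oregon </item>
    <item> Maine </item>
  </one-of>
</rule>"""

def rule_to_abnf(xml_text: str) -> str:
    """Convert a one-of rule in the XML form to the equivalent ABNF line."""
    rule = ET.fromstring(xml_text)
    alternatives = [item.text.strip() for item in rule.findall("./one-of/item")]
    scope = rule.get("scope", "private")   # assumed default when scope is absent
    return f"{scope} ${rule.get('id')} = {' | '.join(alternatives)}"

abnf = rule_to_abnf(XML_RULE)
print(abnf)   # public $state = Oregon | Maine
```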
Action Tags
• Specify what VoiceXML variables to set when grammar rules are matched to user input
• Based upon subset of ECMAScript
Example:
$drink = coke | pepsi | coca cola {"coke"};
// medium is default if nothing said
$size = {"medium"} [small | medium | large | regular {"medium"}]
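The effect of the drink and size tags can be imitated in Python. This sketch is an assumption-laden stand-in: phrase lookup replaces a real recognizer, untagged alternatives are assumed to return the spoken word itself, and `interpret`, `DRINK`, and `SIZE` are hypothetical names.

```python
# Spoken phrase -> value assigned by the action tag.
DRINK = {"coke": "coke", "pepsi": "pepsi", "coca cola": "coke"}
SIZE = {"small": "small", "medium": "medium", "large": "large", "regular": "medium"}

def interpret(utterance: str) -> dict:
    """Return the variables a matching grammar rule would set."""
    padded = f" {utterance.lower().strip()} "
    result = {"size": "medium"}            # default if no size is said
    for phrase, value in DRINK.items():
        if f" {phrase} " in padded:        # whole-word phrase match
            result["drink"] = value
            break
    for phrase, value in SIZE.items():
        if f" {phrase} " in padded:
            result["size"] = value
            break
    return result

print(interpret("a regular coca cola"))
print(interpret("large pepsi"))
print(interpret("coke"))
```

Two different phrases ("coke", "coca cola") yield one value, and silence about size yields the default, so the application sees a normalized result regardless of wording.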
N-Gram Language Models
• Likelihood of a given word following certain others
• Used as a linguistic model to identify the most likely sequence of words that matches the spoken input
• N-Grams are computed automatically from a corpus of many inputs
• The N-Gram Markup Language is used as an interchange format for passing the automatic analysis of words and phrases to a dictation ASR engine
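How such likelihoods are computed from a corpus can be shown with a small bigram (N = 2) model in Python. The corpus below is invented for illustration, and the sketch uses plain relative frequencies without the smoothing a production recognizer would need.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def probability(counts, prev, word):
    """Estimate P(word | prev) as a relative frequency."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

# Hypothetical corpus of caller utterances.
CORPUS = [
    "what is the weather in boston",
    "what is the time",
    "what is the weather today",
]
model = train_bigrams(CORPUS)
print(probability(model, "the", "weather"))   # 2 of the 3 words after "the"
print(probability(model, "the", "time"))
```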
Speech synthesis process, modeled after Sun’s Java Speech Markup Language
Stages: Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production
IN: Dr. Jones lives at 175 Park Dr. He weighs 175 lb. He plays bass in a blues band. He also likes to fish; last week he caught a 20 lb. bass.
OUT: Doctor Jones lives at one seventy-five Park Drive. He weighs one hundred and seventy-five pounds. He plays base in a blues band. He likes to fish; last week he caught a twenty-pound bass.
(The spelling “base” in the output indicates the pronunciation chosen for the musical instrument, as opposed to the fish.)
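The text normalization stage for this passage can be approximated with a few regular expressions in Python. The rules are assumptions tailored to the example sentence (they cover only “Dr.”, “lb.”, and three-digit street numbers) and are no substitute for a real synthesizer front end; choosing a pronunciation for “bass” is left to the text-to-phoneme stage.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

def number_to_words(n: int) -> str:
    """Spell out 0..999 as a cardinal number."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + number_to_words(rest) if rest else "")

def _weight(match):
    words = number_to_words(int(match.group(1)))
    nxt = match.group(2)
    if nxt.isupper():                      # "lb." also ended the sentence
        return f"{words} pounds. {nxt}"
    return f"{words}-pound {nxt}"          # modifier use, as in "20 lb. bass"

def normalize(text: str) -> str:
    # Street type: number, capitalized name, then "Dr."
    text = re.sub(r"(\d+ [A-Z]\w* )Dr\.", r"\1Drive.", text)
    # Title: "Dr." directly before a capitalized name
    text = re.sub(r"\bDr\.\s+(?=[A-Z])", "Doctor ", text)
    # Weights such as "175 lb."
    text = re.sub(r"(\d+) lb\. (\w)", _weight, text)
    # Three-digit street numbers are read as "one seventy-five"
    text = re.sub(r"\b(\d)(\d\d)\b(?= [A-Z])",
                  lambda m: ONES[int(m.group(1))] + " " + number_to_words(int(m.group(2))),
                  text)
    return text

TEXT = ("Dr. Jones lives at 175 Park Dr. He weighs 175 lb. He plays bass in a "
        "blues band. He also likes to fish; last week he caught a 20 lb. bass.")
print(normalize(TEXT))
```

The same token (“Dr.”, “175”) expands differently depending on its neighbors, which is why normalization needs context rather than a simple lookup table.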
Speech Synthesis ML: Structure Analysis
• Non-markup behavior: infer structure by automated text analysis
• Markup support: paragraph, sentence
Example:
<paragraph>
  <sentence> This is the first sentence. </sentence>
  <sentence> This is the second sentence. </sentence>
</paragraph>
Speech Synthesis ML: Text Normalization
• Non-markup behavior: automatically identify and convert constructs
• Markup support: sayas for dates, times, etc.
Examples:
<sayas sub="World Wide Web Consortium"> W3C </sayas>
<sayas type="number:digits"> 175 </sayas>
Speech Synthesis ML: Text-to-Phoneme Conversion
• Non-markup behavior: look up in a pronunciation dictionary
• Markup support: phoneme, sayas
• Phonetic Alphabets
  • International Phonetic Alphabet (IPA), using character entities
  • Worldbet
  • X-SAMPA
Example:
<phoneme alphabet="ipa" ph="tɒmɑtoʊ"> tomato </phoneme>
Speech Synthesis ML: Prosody Analysis
• Non-markup behavior: automatically generates prosody through analysis of document structure and sentence syntax
• Markup support: emphasis, break, prosody
• Prosody element attributes
  • pitch: high, medium, low, default
  • contour
  • range: high, medium, low, default
  • rate: fast, medium, slow, default
  • volume: silent, soft, medium, loud, default
Examples:
<emphasis> Hi </emphasis>
<break time="3s"/>
<prosody rate="slow"/>
Speech Synthesis ML: Waveform Production
• Markup support: voice, audio
• Voice element attributes
  • gender: male, female, neutral
  • age: child, teenager, adult, elder, (integer)
  • variant: different, (integer)
  • name: default, (voice-name)
Examples:
<audio src="laughter.wav">[laughter]</audio>
<voice age="child"> Mary had a little lamb </voice>
LexiconML - Why?
• Accurate pronunciations are essential in EVERY speech application
• Platform default lexicons do not give 100% coverage of user speech
Example:
<lexicon>
  either /iy th r/
  either /ay th r/
</lexicon>
[Diagram: the Voice Application Developer supplies a Pronunciation Lexicon shared by ASR and TTS. Labels on the ASR side: /ay th r/, /iy th r/, either. Labels on the TTS side: either, /ay th r/.]
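A shared lexicon serving both recognition and synthesis can be sketched in Python. The data structure and function names are hypothetical, and the rule that synthesis uses the first listed pronunciation is an assumption made for this sketch; the slides do not say how a preferred pronunciation is chosen.

```python
from typing import Dict, List, Optional

# One entry per word, with every accepted pronunciation.
LEXICON: Dict[str, List[str]] = {
    "either": ["iy th r", "ay th r"],
}

def asr_words_for(phonemes: str) -> List[str]:
    """Recognition: any listed pronunciation maps back to the word."""
    return [word for word, prons in LEXICON.items() if phonemes in prons]

def tts_pronunciation(word: str) -> Optional[str]:
    """Synthesis: one pronunciation must be chosen (assumed: the first listed)."""
    prons = LEXICON.get(word.lower())
    return prons[0] if prons else None

print(asr_words_for("ay th r"))      # ['either']
print(asr_words_for("iy th r"))      # ['either']
print(tts_pronunciation("either"))   # iy th r
```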
LexiconML - Key Requirements
• Meets both synthesis and recognition requirements
• Pronunciations for any language (including tonal)
  • Reuse standard alphabets, support for suprasegmentals
• Multiple pronunciations per word
• Alternate orthographies
  • Spelling variations: “colour” and “color”
  • Alternative writing systems: Japanese Kanji and Kana
• Abbreviations and Acronyms, e.g. Dr., BT
• Homophones, e.g. “read” and “reed” (same sound)
• Homographs, e.g. “read” and “read” (same spelling)
Interaction Style
• Voice user interfaces needn't be dull
• Choose prompts to reflect an explicit choice of personality
• Introduce variety in prompts rather than always repeating the same thing
• Politeness, helpfulness and sense of humor
• Target different groups of users, e.g. Gen Y
• Allow users to select personality (skin)
(Personality Demo)
Call Control
[Diagram: Voice Application Developer, Dialog Manager, VoiceXML, Call Control, User.]
(Call Control Demo)
Call Control Requirements
• Call management: place outbound call, conditionally answer inbound call, outbound fax
• Call leg management: create, redirect, interact while on hold
• Conference management: create, join, exit
• Intersession communication: asynchronous events
• Interpreter context: invoke, terminate
Natural Language Semantics ML
[Diagram: Voice Application Developer, Grammar and semantic tags, ASR, Text, Language Understanding, NL Semantics, Context Interpretation.]
Natural Language Semantics ML
• Represent semantic interpretations of an utterance
  • Speech
  • Natural language text
  • Other forms (e.g., handwriting, OCR, DTMF)
• Used primarily as an interchange format among voice browser components
• Usually generated automatically and not authored directly by developers
• Goal is to use XForms as a data model
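NLSemantics ML structure
[Diagram: a Result contains one or more Interpretation elements (attributes shown: confidence, grammar, x-model, xmlns). Each Interpretation carries Incoming data and Meaning. Incoming data: Input (attributes mode, timestamp-start, timestamp-end, confidence), which may contain Text, Nomatch, Noinput, or nested Input. Meaning: xf:model (XForms definition) and xf:instance (application-specific elements defined by the XForms data model).]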
What toppings do you have?
<interpretation grammar="http://toppings" xmlns:xf="http://www.w3.org/xxx">
  <input mode="speech">what toppings do you have?</input>
  <xf:x-model>
    <xf:group xf:name="question">
      <xf:string xf:name="questioned_item"/>
      <xf:string xf:name="questioned_property"/>
    </xf:group>
  </xf:x-model>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>
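A consumer of such a result, for instance a dialog manager, might read it as in the Python sketch below. Several things here are assumptions: the `app` namespace URI is invented because the slide does not declare that prefix, the `xf` URI is the placeholder from the slide, the data-model section is omitted for brevity, and `read_interpretation` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XF = "http://www.w3.org/xxx"      # placeholder URI copied from the slide
APP = "http://example.com/app"    # hypothetical: not declared on the slide

RESULT = f"""<interpretation grammar="http://toppings"
    xmlns:xf="{XF}" xmlns:app="{APP}">
  <input mode="speech">what toppings do you have?</input>
  <xf:instance>
    <app:question>
      <app:questioned_item>toppings</app:questioned_item>
      <app:questioned_property>availability</app:questioned_property>
    </app:question>
  </xf:instance>
</interpretation>"""

def read_interpretation(xml_text: str) -> dict:
    """Extract the raw input and the application-specific meaning."""
    root = ET.fromstring(xml_text)
    user_input = root.find("input")
    question = root.find(f"{{{XF}}}instance/{{{APP}}}question")
    # Strip the "{namespace}" prefix that ElementTree puts on tag names.
    slots = {child.tag.split("}")[1]: child.text for child in question}
    return {"mode": user_input.get("mode"), "text": user_input.text, "slots": slots}

parsed = read_interpretation(RESULT)
print(parsed["slots"])
```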
Richer Natural Language
• Most current voice apps restrict users to keywords or short phrases
• The application does most of the talking
• Alternative is to use open grammars with word spotting and let the user do the talking
• Rules for figuring out what the user said and why, as the basis for asking the next question
(GM/AskJeeves Demo)
Multimodal = Voice + Displays
• Say which city you want weather for and see the information on your phone
  • Example: “What is the weather in San Francisco?”
• Say which bands/CDs you want to buy and confirm the choices visually
  • Example: “I want to place an order for ‘Hotshot’ by Shaggy.”
Multimodal Interaction
• Multimodal applications
  • Voice + Display + Key pad + Stylus etc.
  • User is free to switch between voice interaction and use of display/key pad/clicking/handwriting
• July 2000: Published Multimodal Requirements Draft
• Demonstrations of Multimodal prototypes at Paris face-to-face meeting of Voice Browser WG
• Joint W3C/WAP Forum workshop on Multimodal – Hong Kong, September 2000
• February 2001: W3C publishes Multimodal Request for Proposals
• Plan to set up Multimodal Working Group later this year, assuming we get appropriate submission(s)
Multimodal Interaction
• Primary market is mobile wireless
  • Cell phones, personal digital assistants and cars
• Timescale is driven by deployment of 3G networks
• Input modes:
  • Speech, keypads, pointing devices, and electronic ink
• Output modes:
  • Speech, audio, and bitmapped or character cell displays
• Architecture should allow for both local and remote speech processing
Some Ideas …
W3C is seeking detailed proposals with broad industry support as basis for chartering multimodal working group
• Speech enabling XHTML (and WML) without requiring changes to markup language
  • New ECMAScript Speech Object?
• Loose coupling of VoiceXML with externally defined pages written in XHTML, SMIL, etc.
  • Turn-driven synchronization protocol based on SIP?
• Distributed Speech Processing
  • Reduce load on wireless network and speech servers
  • Increase recognition accuracy in presence of noise
  • ETSI work on Aurora
• Using pen-based gestures to constrain ASR (click and speak)
VoiceXML IP Issues
• Technical work on VoiceXML 2.0 is proceeding well
• Publication of VoiceXML 2.0 working draft held up over IP issues (although internal version is accessible to W3C Members)
• Related specifications for grammar, speech synthesis, natural language semantics, lexicon, and call control have been or shortly will be published
• W3C and VoiceXML Forum Management are in the process of developing a formal Memorandum of Understanding
• W3C is convening a Patent Advisory Group to recommend IP Policy for re-chartering the Voice Browser Activity
• Draw inspiration from IETF, ECTF, ETSI and other bodies
  • E.g. require all WG members to license essential IP under openly specified RAND terms
  • Operational criteria for effective terms expressed as exit criteria for the Candidate Recommendation phase
  • No requirement for advance disclosure of IP