Speech Technology

HOT!

What are the big players in the area up to? • Google • http://googleblog.blogspot.com/2010/12/can-we-talk-better-speech-technology.html • Microsoft • http://gigaom.com/2010/12/06/microsoft-claims-its-place-in-a-voice-enabled-world/ • Apple • http://www.dailyfinance.com/story/company-news/apples-siri-purchase-heats-up-the-race-toward-a-voice-activated/19458344/ • IBM • http://www.ibm.com/news/in/en/2010/08/20/a896686u56875f96.html • Nuance • http://gigaom.com/2011/01/19/nuance-releases-mobile-sdk-to-speechify-apps/ • Voxeo

Apple, and the case of Siri • Siri: http://www.youtube.com/watch?v=MpjpVAB06O4 • Review of Siri: http://www.youtube.com/watch?v=AohzWSkAU7c&feature=watch_response

Types of dialog systems • by modality • text-based • spoken • graphical user interface • multi-modal • by device • telephone-based systems • PDA systems • in-car systems • robot systems • desktop/laptop systems • native • in-browser systems • in-virtual machine • in-virtual environment • robots • by style • command-based • menu-driven • natural language • by initiative • system initiative • user initiative • mixed initiative • by application • information service • command-and-control • entertainment • education/tutorial • edutainment • reminder systems • companion systems • healthcare • eldercare • assistive/access systems

More about application types • Information providing systems: • weather reports • stock quotes • timetables • ... • Transaction-based systems: • calendar functions • shopping • financial transactions • travel reservations • ...

Why Voice?

Why voice? • Wireless devices have small screens and limited input capabilities. • Telephone keypad can give users only a limited number of choices. • Speech technology is improving. • The exchange of information between a person and a computer is becoming more like a real conversation. • Users want hands-free or eyes-free use. • From a business viewpoint, voice applications open up a host of new revenue opportunities. • There exist many more telephones than computers with the potential to access the Internet.

Traditional Interactive Voice Response (IVR)

Speech versus Touch Tone

Architecture 1

Architecture 2

Today • Presentation of project ideas • TTS evaluation • Short intro to XML • Speech technology standards overview • Speech Synthesis Markup Language (SSML) • Presentation of home assignment 3: ASR evaluation

Project ideas?

Intro to XML

W3C Speech Standards Torbjörn Lager

VoiceXML – a part of the web
[Diagram: web servers deliver HTML to an HTML browser and VoiceXML to a VoiceXML browser (ASR, TTS, interpreter)]

The place of speech technology • … speech technology itself has a very long way to go. … the most important thing may turn out to be not the speech technology itself, but the way in which speech technology connects to all the other technologies. Tim Berners-Lee

The What and Why of Standards • Software standards include terminology, languages and protocols specified by committees of experts for widespread use in the software industry. Software standards have both advantages and disadvantages. • Advantages: • developers can create applications using the standard languages that are portable across a variety of platforms; • products from different vendors are able to interact with each other; • a community of experts evolves around the standard and is available to develop products and services based on the standard. • Disadvantages: • some developers feel that standards may inhibit creativity and stall the introduction of superior technology. • However, in the area of speech, vendors are enthusiastic about standards and frequently complain that standards are not developed fast enough. • Emerging speech-technology standards could give a boost to an industry hampered by proprietary software and hardware.

World Wide Web Consortium http://www.w3.org/

W3C Speech Standards • Speech Recognition Grammar Specification (SRGS) – • What the user can say • Semantic Interpretation for Speech Recognition (SISR) – • What the user means • Speech Synthesis Markup Language (SSML) – • What the user hears • VoiceXML – • Dialog management: What the system is to do

Speech Recognition Grammar Specification (SRGS) • Covers both speech and DTMF (Dual-Tone Multi-Frequency) input. (DTMF is valuable in noisy conditions or when the social context makes it awkward to speak.) • Grammars can be specified in either an XML or an equivalent augmented BNF (ABNF) syntax. • Speech recognition is an inherently uncertain process. Recognizers may report confidence values. • If the utterance has several possible parses, the recognizer may be able to report the most likely alternatives (N-best results). • What about statistical language models? Not covered by SRGS!

Semantic Interpretation for Speech Recognition (SISR)
<grammar root="answer">
  <rule id="answer" scope="public">
    <one-of>
      <item><ruleref uri="#yes"/></item>
      <item><ruleref uri="#no"/></item>
    </one-of>
  </rule>
  <rule id="yes">
    <one-of>
      <item>yes</item>
      <item>yeah<tag>yes</tag></item>
      <item><token>you bet</token><tag>yes</tag></item>
      <item xml:lang="fr-CA">oui<tag>yes</tag></item>
    </one-of>
  </rule>
  <rule id="no">
    <one-of>
      <item>no</item>
      <item>nope</item>
      <item>no way</item>
    </one-of>
    <tag>no</tag>
  </rule>
</grammar>
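The mapping from "what the user can say" to "what the user means" in the yes/no grammar can be sketched in Python. This is only an illustration under simplifying assumptions, not how a conforming SISR processor works: it handles literal string tags only, ignores rule references, and the helper name phrase_meanings is made up for this example. An item-level tag wins, otherwise the rule-level tag applies, otherwise the phrase stands for itself.

```python
import xml.etree.ElementTree as ET

# The yes/no grammar from the slide, unchanged.
GRAMMAR = """
<grammar root="answer">
  <rule id="answer" scope="public">
    <one-of>
      <item><ruleref uri="#yes"/></item>
      <item><ruleref uri="#no"/></item>
    </one-of>
  </rule>
  <rule id="yes">
    <one-of>
      <item>yes</item>
      <item>yeah<tag>yes</tag></item>
      <item><token>you bet</token><tag>yes</tag></item>
      <item xml:lang="fr-CA">oui<tag>yes</tag></item>
    </one-of>
  </rule>
  <rule id="no">
    <one-of>
      <item>no</item>
      <item>nope</item>
      <item>no way</item>
    </one-of>
    <tag>no</tag>
  </rule>
</grammar>
"""


def phrase_meanings(grammar_xml):
    """Map each literal phrase to its semantic value (hypothetical helper)."""
    root = ET.fromstring(grammar_xml)
    meanings = {}
    for rule in root.findall("rule"):
        one_of = rule.find("one-of")
        if one_of is None:
            continue
        rule_tag = rule.find("tag")  # tag shared by every alternative of the rule
        for item in one_of.findall("item"):
            if item.find("ruleref") is not None:
                continue  # rule references are not resolved in this sketch
            token = item.find("token")
            phrase = (token.text if token is not None else item.text or "").strip()
            item_tag = item.find("tag")
            if item_tag is not None:
                value = item_tag.text.strip()
            elif rule_tag is not None:
                value = rule_tag.text.strip()
            else:
                value = phrase  # no tag: the phrase is its own meaning
            meanings[phrase] = value
    return meanings


if __name__ == "__main__":
    for phrase, value in phrase_meanings(GRAMMAR).items():
        print(f"{phrase!r} means {value!r}")
```

Seven different phrases collapse into just two semantic values, which is all the dialog manager needs to branch on.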

Semantic Interpretation for Speech Recognition (SISR)
• I would like a coca cola and three large pizzas with pepperoni and mushrooms
{
  drink: { liquid: "coke", drinksize: "medium" },
  pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] }
}

<grammar root="order">
  <rule id="order">
    I would like a
    <ruleref uri="#drink"/>
    <tag>out.drink = new Object(); out.drink.liquid=rules.drink.type; out.drink.drinksize=rules.drink.drinksize;</tag>
    and
    <ruleref uri="#pizza"/>
    <tag>out.pizza=rules.pizza;</tag>
  </rule>
  <rule id="kindofdrink">
    <one-of>
      <item>coke</item>
      <item>pepsi</item>
      <item>coca cola<tag>out="coke";</tag></item>
    </one-of>
  </rule>
  <rule id="foodsize">
    <tag>out="medium";</tag>
    <item repeat="0-1">
      <one-of>
        <item>small<tag>out="small";</tag></item>
        <item>medium</item>
        <item>large<tag>out="large";</tag></item>
        <item>regular<tag>out="medium";</tag></item>
      </one-of>
    </item>
  </rule>
  <rule id="tops">
    <tag>out=new Array;</tag>
    <ruleref uri="#top"/>
    <tag>out.push(rules.top);</tag>
    <item repeat="1-">
      and
      <ruleref uri="#top"/>
      <tag>out.push(rules.top);</tag>
    </item>
  </rule>
  <rule id="top">
    <one-of>
      <item>anchovies</item>
      <item>pepperoni</item>
      <item>mushroom<tag>out="mushrooms";</tag></item>
      <item>mushrooms</item>
    </one-of>
  </rule>
  <rule id="drink">
    <ruleref uri="#foodsize"/>
    <ruleref uri="#kindofdrink"/>
    <tag>out.drinksize=rules.foodsize; out.type=rules.kindofdrink;</tag>
  </rule>
  <rule id="pizza">
    <ruleref uri="#number"/>
    <ruleref uri="#foodsize"/>
    <tag>out.pizzasize=rules.foodsize; out.number=rules.number;</tag>
    pizzas with
    <ruleref uri="#tops"/>
    <tag>out.topping=rules.tops;</tag>
  </rule>
  <rule id="number">
    <one-of>
      <item>
        <tag>out=1;</tag>
        <one-of>
          <item>a</item>
          <item>one</item>
        </one-of>
      </item>
      <item>two<tag>out=2;</tag></item>
      <item>three<tag>out=3;</tag></item>
    </one-of>
  </rule>
</grammar>

Input: I would like a coca cola and three large pizzas with pepperoni and mushrooms
Result:
{
  drink: { liquid: "coke", drinksize: "medium" },
  pizza: { number: 3, pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] }
}

Foundational • Grammar (CFG, PSG) • Automata theory (FSMs, FSTs, etc) • Logic • Phonetics • Linguistics • Computer science

Speech Synthesis Markup Language (SSML) • The key concepts of SSML are • interoperability, or interacting with other markup languages (VoiceXML, etc.); • consistency, or providing predictable control of voice output across platforms and across speech synthesis implementations; and • internationalization, or enabling speech output in a large number of languages within or across documents.

Speech Synthesis Markup Language (SSML) – An Example
<speak>
  <s xml:lang="en-US">
    <voice name="David" gender="male" age="25">
      For English, press <emphasis>one</emphasis>.
    </voice>
  </s>
  <s xml:lang="es-MX">
    <voice name="Miguel" gender="male" age="25">
      Para español, oprima el <emphasis>dos</emphasis>.
    </voice>
  </s>
</speak>

Text Structure: p and s Elements • A p element represents a paragraph. An s element represents a sentence.
<speak>
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here's another sentence.</s>
  </p>
</speak>

The phoneme Element • The phoneme element provides a phonemic/phonetic pronunciation for the contained text.
<speak>
  <phoneme alphabet="ipa" ph="təmei̥ɾou̥">tomato</phoneme>
</speak>

The sub Element • The sub element indicates that the text in its alias attribute replaces the contained text for pronunciation. This allows a document to contain both a spoken and a written form.
<?xml version="1.0"?>
<speak>
  <sub alias="World Wide Web Consortium">W3C</sub>
</speak>

The voice Element • The voice element is a production element that requests a change in speaking voice. A selection of attributes is:
• gender: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral".
• age: optional attribute indicating the preferred age in years (since birth) of the voice to speak the contained text.
• name: optional attribute indicating a processor-specific voice name to speak the contained text.
<?xml version="1.0"?>
<speak>
  <voice gender="female">Mary had a little lamb,</voice>
  <voice gender="female" age="7">Its fleece was white as snow.</voice>
  <voice name="Mike">I want to be like Mike.</voice>
</speak>

The emphasis Element • The emphasis element requests that the contained text be spoken with emphasis. <speak> That is a <emphasis> big </emphasis> car! That is a <emphasis level="strong"> huge </emphasis> bank account! </speak>

The break Element • The break element is an empty element that controls the pausing or other prosodic boundaries between words. <speak> Take a deep breath <break/> then continue. Press 1 or wait for the tone. <break time="3s"/> I didn't hear you! <break strength="weak"/> Please repeat. </speak>

The prosody Element • The prosody element permits control of the pitch, speaking rate and volume of the speech output. • The attributes, all optional, are:
• pitch: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch levels.
• contour: sets the actual pitch contour for the contained text. The format is specified under "Pitch contour" below.
• range: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch ranges.
• rate: a change in the speaking rate for the contained text. Legal values are: a relative change or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When a number is used to specify a relative change it acts as a multiplier of the default rate. For example, a value of 1 means no change in speaking rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice should be such that it is experienced as a normal speaking rate for the voice when reading text aloud. Since voices are processor-specific, the default rate will be as well.
• duration: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the time value format from the Cascading Style Sheet Level 2 Recommendation [CSS2], e.g. "250ms", "3s".
• volume: the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying "silent"). Legal values are: a number, a relative change or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The volume scale is linear amplitude. The default is 100.0. Labels "silent" through "x-loud" represent a sequence of monotonically non-decreasing volume levels.

The prosody Element (cont’d) • Pitch contour. The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. • The algorithm for interpolating between the targets is processor-specific. • In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value). <?xml version="1.0"?> <speak> <prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)"> good morning </prosody> </speak>
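To make the (time position,target) format concrete, here is a minimal Python sketch that splits a contour attribute value into its pairs. It assumes well-formed input and leaves each target as an uninterpreted string, since interpolation between targets is processor-specific; the names CONTOUR_PAIR and parse_contour are made up for this example.

```python
import re

# One "(time position,target)" pair: a percentage, a comma, then the target.
CONTOUR_PAIR = re.compile(r"\(\s*([0-9]*\.?[0-9]+)%\s*,\s*([^()\s]+)\s*\)")


def parse_contour(contour):
    """Return a list of (time position in percent, target string) tuples."""
    return [(float(position), target)
            for position, target in CONTOUR_PAIR.findall(contour)]


if __name__ == "__main__":
    # The contour from the slide's "good morning" example.
    for position, target in parse_contour("(0%,+20Hz) (10%,+30%) (40%,+10Hz)"):
        print(f"at {position}% of the text: pitch target {target}")
```

Note that the targets mix units: "+20Hz" is an absolute offset while "+30%" is a relative change, which is why the sketch does not try to convert them to numbers.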

Today • Project reminder • Presentation of the results of the TTS evaluation • Speech Synthesis Poetry Slam • Wrapping up TTS (stages of TTS) • Presentation of home assignment 3: ASR evaluation • Automatic speech recognition (ASR) • Natural language understanding (NLU) • Speech Recognition Grammar Specification (SRGS) • Semantic Interpretation for Speech Recognition (SISR) • Thursday's Lab session

Architecture 1

Wrapping up TTS • Stages of TTS: • Structure analysis (sentence splitting) • Text normalisation • Text to phoneme conversion • Prosody analysis • Waveform production • Speech Synthesis Markup Language • enables developers to override default behavior
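The first two stages can be illustrated with a toy Python sketch. It is not how a real TTS engine does it: the sentence splitter only looks at ., ! and ?, the normaliser only spells out single digits from a tiny made-up lexicon (NUMBER_WORDS), and the function names are hypothetical. The later stages (text to phoneme conversion, prosody analysis, waveform production) are left out.

```python
import re

# Tiny illustrative lexicon; a real normaliser covers numbers, dates, abbreviations...
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}


def split_sentences(text):
    """Structure analysis: split after sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def normalise(sentence):
    """Text normalisation: replace stand-alone digits found in the lexicon."""
    return re.sub(r"\b\d\b",
                  lambda m: NUMBER_WORDS.get(m.group(), m.group()),
                  sentence)


def front_end(text):
    """Run both stages and return one normalised string per sentence."""
    return [normalise(s) for s in split_sentences(text)]


if __name__ == "__main__":
    for sentence in front_end("Press 1 or wait for the tone. I didn't hear you!"):
        print(sentence)
```

SSML lets a developer bypass exactly these defaults, for instance by marking sentences with s elements instead of relying on the splitter.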

TTS stages and SSML elements

Prosody analysis • Prosody covers pitch (intonation or melody), timing (rhythm), pauses, speech rate, emphasis on words, and the relative timing of segments and pauses. • Most TTS engines have a prosody analysis algorithm responsible for producing the prosody of synthesized speech, which is often based on parts of speech. For example, nouns, verbs, and adjectives may be accented, whereas auxiliary verbs and prepositions may be unstressed. • Synthesized speech pauses at commas and is inflected according to whether the sentence is declarative, interrogative, or exclamatory. • Prosody rules and algorithms are not perfect and are a topic of ongoing research. Prosody rules may differ considerably between languages and between varieties of one language. For example, the prosody of American, British, Indian, and Jamaican English differs.

Speech Recognition (ASR)

Architecture 1

ASR Input and Output • A speech recognizer is a component with the following inputs and outputs: • Input • A grammar or multiple grammars as defined by the SRGS specification. These grammars inform the recognizer of the words and patterns of words to listen for. • An audio stream that may contain speech content that matches the grammar(s). • Parameters: timeouts, recognition thresholds, or N-best result counts. • Output • Descriptions of results that indicate details about the speech content detected by the speech recognizer. Recognizers will include at least a transcription of any detected words. • Errors and other performance information such as confidence
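How the parameters shape the output can be sketched in Python. This is a toy model, not a real recognizer API: the Hypothesis class, the select_results function and the default values are all assumptions made for this example. Hypotheses below the recognition threshold are rejected, and at most n_best of the remaining ones are reported, most confident first.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One recognition result: a transcription plus a confidence in [0, 1]."""
    transcription: str
    confidence: float


def select_results(hypotheses, confidence_threshold=0.5, n_best=3):
    """Apply a recognition threshold, then keep the N best hypotheses."""
    accepted = [h for h in hypotheses if h.confidence >= confidence_threshold]
    accepted.sort(key=lambda h: h.confidence, reverse=True)
    return accepted[:n_best]


if __name__ == "__main__":
    raw = [
        Hypothesis("no way", 0.30),
        Hypothesis("yes", 0.90),
        Hypothesis("yeah", 0.60),
        Hypothesis("nope", 0.55),
    ]
    for h in select_results(raw, confidence_threshold=0.5, n_best=2):
        print(h.transcription, h.confidence)
```

An empty result list corresponds to the "no match" situation that the dialog has to handle, for example by re-prompting.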

SRGS

SRGS
<grammar root="s">
  <rule id="s"> hello </rule>
</grammar>
s -> "hello"

SRGS
<grammar root="s">
  <rule id="s">
    <one-of>
      <item>hello</item>
      <item>goodbye</item>
    </one-of>
  </rule>
</grammar>
s -> "hello"
s -> "goodbye"
or, equivalently:
s -> "hello" | "goodbye"

SRGS
<grammar root="s">
  <rule id="s">
    hello
    <item repeat="0-1"> how are you </item>
  </rule>
</grammar>
s -> "hello" ("how are you")?

SRGS
<grammar root="s">
  <rule id="s">
    <item repeat="1-"> hello </item>
  </rule>
</grammar>
s -> "hello"
s -> "hello" s
or, equivalently:
s -> "hello"+
NOTE: The language is now infinite, so listing every sentence it contains is no longer possible.
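The correspondence between these grammars and the sentences they accept can be checked with a small Python sketch. It covers only the constructs used on the slides (literal text, one-of, item with a repeat attribute), does not resolve rule references, and caps open-ended repeats such as "1-" at an arbitrary MAX_REPEAT so that the infinite case stays listable; all names here are made up for this example.

```python
import itertools
import xml.etree.ElementTree as ET

MAX_REPEAT = 3  # assumption: arbitrary cap for open-ended repeats like "1-"


def parse_repeat(spec):
    """Turn a repeat attribute ("0-1", "1-", "2" or missing) into (low, high)."""
    if spec is None:
        return 1, 1
    if "-" in spec:
        low, high = spec.split("-")
        return int(low), int(high) if high else MAX_REPEAT
    return int(spec), int(spec)


def join(parts):
    """Join non-empty word sequences with single spaces."""
    return " ".join(p for p in parts if p)


def expand_sequence(elem):
    """Expand mixed content (text and child elements) read left to right."""
    parts = []
    if elem.text and elem.text.strip():
        parts.append([" ".join(elem.text.split())])
    for child in elem:
        parts.append(expand(child))
        if child.tail and child.tail.strip():
            parts.append([" ".join(child.tail.split())])
    return [join(combo) for combo in itertools.product(*parts)]


def expand(elem):
    """Return every word sequence matched by a one-of or item element."""
    if elem.tag == "one-of":
        return [s for item in elem.findall("item") for s in expand(item)]
    if elem.tag == "item":
        low, high = parse_repeat(elem.get("repeat"))
        once = expand_sequence(elem)
        return [join(combo)
                for n in range(low, high + 1)
                for combo in itertools.product(once, repeat=n)]
    raise ValueError(f"unsupported element in this sketch: {elem.tag}")


def sentences(grammar_xml):
    """List the sentences accepted by the root rule of a small grammar."""
    root = ET.fromstring(grammar_xml)
    rule = root.find(f"rule[@id='{root.get('root')}']")
    return expand_sequence(rule)


if __name__ == "__main__":
    optional = ('<grammar root="s"><rule id="s"> hello '
                '<item repeat="0-1"> how are you </item></rule></grammar>')
    print(sentences(optional))
```

With the cap in place, the repeat="1-" grammar yields one, two and three repetitions of "hello"; raising MAX_REPEAT simply makes the list longer, which is the point of the note above.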
