SSML 1.1 - The Internationalization of the W3C Speech Synthesis Markup Language SpeechTek 2007 – C102 – Daniel C. Burnett
Overview • SSML 1.0 • Why SSML 1.1? • SSML 1.1 scope • Selected features • Examples • voice/xml:lang • pronunciation alphabets • <w> element • For more info . . .
SSML 1.0 • W3C Recommendation in 2004 • Widely implemented – the primary authoring format for TTS engines • Many extensions
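For context, a minimal SSML 1.0 document looks like this (an illustrative snippet, not taken from the talk):

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Your order will ship <break time="300ms"/>
  <emphasis>today</emphasis>.
</speak>
```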
Why SSML 1.1? • 1.0 extensions are primarily to address language-related phenomena • Workshops in China, Greece, and India to understand motivations for these extensions • How to correct tones for East Asian languages? • How to handle transliteration for Indian languages? • How to indicate word boundaries for written languages that do not display them? • How to precisely control voice and language changes?
SSML 1.1 scope • Provide broadened language support • For Mandarin, Cantonese, Hindi*, Arabic*, Russian*, Korean*, and Japanese, we will identify and address language phenomena that must be addressed to enable support for the language. Where possible we will address these phenomena in a way that is most broadly useful across many languages. We have chosen these languages because of their economic impact and expected group expertise and contribution. • We will also consider phenomena of other languages for which there is both sufficient economic impact and group expertise and contribution. • Fix incompatibilities with other Voice Browser Working Group languages, including PLS, SRGS, and VoiceXML 2.0/2.1. • Out of scope: • VCR-like controls: fast-forward, rewind, pause, resume • New <say-as> values. Collecting requirements for future <say-as> work is okay * provided there is sufficient group expertise and contribution for these languages
SSML 1.1 scope – some workshop topics
In scope: • Token/word boundaries • Phonetic alphabets • Tones • Part-of-speech support • Text with multiple languages (separate control of xml:lang and voice) • Subword annotation (partial) • Syllable-level markup (partial)
Out of scope: • Providing number, case, gender info • Simplified/alternate/SMS text • Transliteration • Expressive (emotion) elements • Enhanced prosody rate control
Selected new features • SSML 1.1 is a Working Draft – everything from this point on is subject to change • Improved lexicon activation control • Better linkage with PLS lexicons • Clearer separation between xml:lang (document text content) and voice selection • Improved author control of behavior upon xml:lang/voice selection mismatch • Introduction of a Pronunciation Alphabet Registry to allow use of standardized pinyin, jyutping, and other language-specific pronunciation alphabets in addition to the IPA default • New <w> element for marking word boundaries
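As a sketch of the improved lexicon activation control, the working draft's <lexicon>/<lookup> mechanism might be used as below. Element and attribute names follow the draft and are subject to change, and the lexicon URI is a hypothetical placeholder:

```xml
<speak version="1.1" ...>
  <lexicon xml:id="names" uri="http://example.com/names.pls"/>
  <lookup ref="names">
    Text here is pronounced using the referenced PLS lexicon.
  </lookup>
  Text here uses only the processor's built-in lexicons.
</speak>
```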
Examples – voice/xml:lang • Next few examples demonstrate some of the new SSML 1.1 features that provide • Clearer separation between xml:lang (document text content) and voice selection • Improved author control of behavior upon xml:lang/voice selection mismatch
Simple example

<speak … xml:lang="en-US">
  <voice languages="en-US">
    I want
    <voice name="George">a big</voice>
    <voice gender="female">pepperoni</voice>
    pizza.
  </voice>
</speak>

• Will find voices that can read US English, each time.
• Voice changes are scoped, so the same voice is used for "I want" and "pizza."
• The "name" and "gender" values are requests only, and not required in order for voice selection to be successful.
"required" attribute

<speak … xml:lang="en-US">
  <voice languages="en-US">
    I want
    <voice name="George" required="name">a big</voice>
    <voice gender="female" required="gender">pepperoni</voice>
    pizza.
  </voice>
</speak>

• Now the name and gender attributes, respectively, are required rather than merely requested.
• The "required" attribute lists *all* required voice selection features, so the two inner voices might not be able to speak English.
• If one of the inner voices cannot read/speak English, the processor can decide what to do (skip the text, try to read it anyway, or change voice).
"onlangfailure" attribute

<speak … xml:lang="en-US" onlangfailure="ignoretext">
  <voice languages="en-US">
    I want
    <voice name="George" required="name">a big</voice>
    <voice gender="female" required="gender">pepperoni</voice>
    pizza.
  </voice>
</speak>

• Now, when any text is encountered that cannot be spoken by the currently selected voice, it will be skipped by the processor. The voice *will not* change.
• Other options are "processorchoice", "ignorelang", and "changevoice".
"onvoicefailure" attribute

<speak … xml:lang="en-US" onlangfailure="ignoretext">
  <voice languages="en-US" onvoicefailure="keepexisting">
    I want
    <voice name="George" required="name">a big</voice>
    <voice gender="female" required="gender">pepperoni</voice>
    pizza.
  </voice>
</speak>

• What if the processor can't find a voice that meets the required criteria? In the above example, the processor will keep the voice it had. This attribute is scoped as well.
• Other options are "priorityselect" and "processorchoice".
Language and accent

<speak … xml:lang="en-US" onlangfailure="ignoretext">
  <voice languages="zh-cmn:en-US en:en-US" onvoicefailure="keepexisting">
    <lang xml:lang="zh-cmn">我想要</lang> <!-- "I want" -->
    <voice name="George" required="name">a big</voice>
    <voice gender="female" required="gender">pepperoni</voice>
    pizza.
  </voice>
</speak>

• First request is for a voice that can speak both English and Mandarin Chinese with a US-English accent.
• If voice selection is successful, the voice will be able to speak both the Chinese text and the final "pizza."
• Note that the female voice need not speak either language (as written).
Examples – pronunciation alphabets

<speak version="1.1" ...>
  此<phoneme alphabet="pinyin" ph="chu4">处</phoneme>不准照相。
  <!-- "No photography allowed here."; the pinyin string is "chù" -->
</speak>

• Developing a new Pronunciation Alphabet Registry
• Experts can register pronunciation alphabets for their languages
• Can also register historically used alphabets such as ARPAbet and Worldbet
• First entries will likely be pinyin and jyutping
Examples – <w> element

<speak version="1.1" ...>
  <!-- The ambiguous sentence is 南京市长江大桥 -->
  <!-- Reading 1: the Nanjing Changjiang River Bridge -->
  <w>南京市</w><w>长江大桥</w>
  <!-- Reading 2: the mayor of Nanjing city, Jiang Daqiao -->
  南京市长<w>江大桥</w>
</speak>

• The <w> element helps resolve ambiguities for languages that may not visually separate words.
• Markup is allowed within <w> but does not cause word separation (unlike in the rest of SSML) => allows for sub-word <mark>, <prosody>, etc.
For more info . . . • Information about the Voice Browser Working Group can be found at http://www.w3.org/Voice/ • Current SSML drafts: • http://www.w3.org/TR/ssml11reqs/ • http://www.w3.org/TR/speech-synthesis11/