This paper presents proposals for extending the SSML 1.0 standard from the point of view of Hungarian TTS developers. The proposals include text structure elements, such as word and syllable, to improve text-to-phoneme conversion and prosody prediction. The paper also suggests additional attributes for the word element to provide information about syllable structure and part-of-speech. These proposals aim to make it easier for human editors to add structure to text and to enable easier adaptation of synthesis processors.
Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers Géza Németh, Géza Kiss, Bálint Tóth Laboratory of Speech Technology, Department of Telecommunications and Media Informatics Budapest University of Technology and Economics, Budapest, Hungary {nemeth,kgeza,toth.b}@tmit.bme.hu
Budapest University of Technology & Economics (BME), Dept. of Telecommunications & Media Informatics (TMIT)
Speech activities:
• Coordinator: Gordos Géza D.Sc.
• Speech Technology Lab (STL): Németh Géza PhD and Olaszy Gábor D.Sc.
• Telecommunications & Signal Processing Lab (TSP): Tatai Péter MSc
• Laboratory of Speech Acoustics (LSA): Vicsi Klára D.Sc.
In each lab:
• 4-6 PhD students
• Graduate students: 306 in the Speech Information Systems subject (2005)
Basic research
• Multi-lingual artificial speech generation (synthesis, STL)
  • limited vocabulary (e.g., numbers, date, address)
  • multi-lingual TTS (Hungarian, German, Polish, Spanish)
  • speech profiles (variability, individual features)
  • expression/emotion presentation (user’s manual <-> news)
• Speech recognition (TSP, LSA)
  • noise handling (telephone, in-car, ..., TSP)
  • dictation (good quality, continuous, LSA)
  • audio indexing (e.g. radio archives, broadcast news, TSP)
  • speech segmentation (TSP, LSA)
  • emotion detection (TSP)
• Speech understanding (TSP)
• Speech databases (LSA, TSP)
Applied Research • Fully proprietary components and solutions: • All parameters controlled, systems are tailor-made for the end-user, Integration of original research results, unique products • T-Mobile Hungary services: E-mail reader 1999-, name- and address reader in reverse directory, 2003 (Motto: Why is the human operator speaking, not the machine?!), Symbian SMS-reader 2002- (STL) • Others: SMS reader 2001-, bookreader 2002-, (STL) • Voice portals (Generali Hungary name dial-in 2004, Hungarian VoiceXML browser, 2003, TSP+STL) • Industrial information systems (STL, TSP) • Unified Messaging (STL) • Call Center (STL, TSP) • Audio user interfaces (especially portable/mobile devices, car information systems, wearable devices, STL, TSP) • Disability (1986-, speech, vision, Hungarian version of Jaws for Windows, notetaker for blind people, STL, TSP, LSA)
Contact information Tel: (+36 1) 463-38-83 Fax: (+36 1) 463-31-07 http://speechlab.tmit.bme.hu email: nemeth@tmit.bme.hu
Overview Text structure Text-to-phoneme conversion Text normalization Prosody prediction Prosody prescription Summary
Text structure
Text structure elements already contained in SSML 1.0:
• paragraph
• sentence
Suggested further structuring:
• word
• syllables
The suggested word and syllable structuring can be used
• to help text-to-phoneme conversion, prosody prediction and prescription, … by giving higher-level information, namely syllable structure and part-of-speech information (examples given later)
• to indicate words in languages that do not use spaces to separate words
Reasons to use text structure elements instead of e.g. phoneme, prosody, break, emphasis:
• They are easier for a human editor to add
• Replacing the synthesis processor may necessitate rewriting the phoneme specification and the prosody prescription, whereas structural markup can stay as it is
Suggested word element
<w [syllables="…-…"] [POS="…" [number="…" …]]> … </w>
E.g.
<w syllables="hosz-szú"> hosszú </w>
<w POS="noun" number="plural" case="accusative"> halászsasokat </w>
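The proposed markup is simple enough to generate programmatically. The following is a minimal Python sketch, not part of the proposal itself: the helper name `word` is hypothetical, and the rule that morphological attributes such as number or case require POS is an assumption read off the nested brackets in the syntax above.

```python
import xml.etree.ElementTree as ET


def word(text, syllables=None, pos=None, **features):
    """Build a proposed <w> element (hypothetical helper, not an SSML 1.0 API)."""
    attrs = {}
    if syllables is not None:
        attrs["syllables"] = syllables
    if pos is not None:
        attrs["POS"] = pos
        # Assumption: number, case, ... are only meaningful together with POS.
        attrs.update(features)
    elif features:
        raise ValueError("morphological features require POS")
    element = ET.Element("w", attrs)
    element.text = text
    return element


w1 = word("hosszú", syllables="hosz-szú")
w2 = word("halászsasokat", pos="noun", number="plural", case="accusative")

print(ET.tostring(w1, encoding="unicode"))
print(ET.tostring(w2, encoding="unicode"))
```

Building the element through an XML library rather than string concatenation keeps attribute values correctly escaped.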
Suggestion extended from other proposals
<w [syllables="…-…"][POS="…" [number="…" gender="…" case="…" …][morph="…+…"][tone="h+l+…"]]> … </w>
When not a word, but an expression is labeled:
<e [POS="…" [number="…" …]]> … </e>
E.g. "three kilos":
<e POS="cardinal" number="plural" gender="neuter" case="genitive"> 3 k. </e>
Text-to-phoneme conversion
When pronunciation cannot be determined, you can
• Add a lexicon element, BUT it is hard to add all entries
• Specify it using phoneme, BUT this is hard for humans to write and read
• Add a textual replacement using sub
• Provide higher-level information; currently the only means for this is say-as
Other types of higher-level information (easier, more natural):
• Syllable structure
• Part-of-speech information
• Language of included foreign text
We are going to give you some examples.
Syllable structure
Hungarian:
• highly agglutinative
• pronunciation inference rules are used
• rules can be tricked by some words
E.g. "egészség" ("health"): the letter combination "szs" might be read as "s+zs", [S]+[Z]→[Z], but it is in fact "sz+s", [s]+[S]→[S]
Syllable structure
It is enough to know the syllable structure. Instead of
<phoneme alphabet="ipa" ph="ɛgeːʃʃeːg"> egészség </phoneme>
you can write
<w syllables="e-gész-ség"> egészség </w>
(Note: here you could also write <sub alias="e-gész-ség"> egészség </sub>)
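To illustrate why syllable boundaries are enough, here is a small Python sketch that splits a Hungarian word into letters (graphemes). It is an illustration under stated assumptions, not the authors' text-to-phoneme rules: the multigraph list is the standard set of Hungarian multi-letter consonants, and the greedy left-to-right matching strategy and the function names are hypothetical.

```python
# Hungarian multi-letter graphemes, listed so that "dzs" is tried before "dz".
MULTIGRAPHS = ("dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")


def graphemes(chunk):
    """Greedy left-to-right split of a letter sequence into graphemes."""
    result, i = [], 0
    while i < len(chunk):
        for multigraph in MULTIGRAPHS:
            if chunk.startswith(multigraph, i):
                result.append(multigraph)
                i += len(multigraph)
                break
        else:
            # No multigraph starts here: take a single letter.
            result.append(chunk[i])
            i += 1
    return result


def graphemes_by_syllable(syllables):
    """Split each hyphen-separated syllable on its own, so that no
    multigraph can be matched across a syllable boundary."""
    return [graphemes(syllable) for syllable in syllables.split("-")]


print(graphemes("hosszú"))                  # ['h', 'o', 's', 'sz', 'ú']
print(graphemes_by_syllable("hosz-szú"))    # [['h', 'o', 'sz'], ['sz', 'ú']]
print(graphemes_by_syllable("e-gész-ség"))  # [['e'], ['g', 'é', 'sz'], ['s', 'é', 'g']]
```

Without boundaries the greedy split misreads "hosszú" as s + sz, while the syllables value from the word element example recovers the two "sz" letters; for "e-gész-ség" the boundary fixes the reading as sz + s.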
Part-of-speech
• Word forms may have several meanings/pronunciations
• Specifying the part of speech may help
E.g.
• I will <w POS="verb" tense="present"> read </w> the book
• I have <w POS="participle"> read </w> the book
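A synthesis processor could use these attributes to pick between homograph pronunciations. The Python sketch below assumes a hypothetical mini-lexicon whose entries are conditioned on attribute values; the lexicon layout, the function name `pronounce` and the IPA strings are illustrative assumptions, not part of the proposal.

```python
# Hypothetical lexicon: each entry pairs required attribute values with an
# IPA transcription (transcriptions are illustrative).
LEXICON = {
    "read": [
        ({"POS": "verb", "tense": "present"}, "riːd"),
        ({"POS": "participle"}, "rɛd"),
    ],
}


def pronounce(word, **attrs):
    """Return the transcription whose conditions all match the given
    attributes, or None if no entry matches or the result is ambiguous."""
    matches = {
        ipa
        for conditions, ipa in LEXICON.get(word, [])
        if all(attrs.get(key) == value for key, value in conditions.items())
    }
    return matches.pop() if len(matches) == 1 else None


print(pronounce("read", POS="verb", tense="present"))  # riːd
print(pronounce("read", POS="participle"))             # rɛd
print(pronounce("read"))                               # None
```

Returning None for the unannotated word mirrors the situation described above: without higher-level information the pronunciation cannot be determined from the word form alone.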
Language
• Foreign parts often occur in texts
• Using the same voice, currently you can
  • do nothing
  • specify the pronunciation using phoneme
• Another desirable approach: specify a lexicon for the language and specify the language of the text
Language
Instead of
…<speak … xml:lang="en-US">The title of the movie is: <phoneme alphabet="ipa" ph="ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə"> La vita è bella </phoneme> (Life is beautiful).
Language
you could write
…<speak … xml:lang="en-US">The title of the movie is: <phoneme lang="it"> La vita è bella </phoneme> (Life is beautiful).
Language
Suggested language attribute
<phoneme [lang="…" | "x-unknown"][ph="…" [alphabet="…"]]> … </phoneme>
• If both lang and ph are given, lang has priority
• If the language is "x-unknown", LID (language identification) is used
• We suggest that "x-unknown" can also be used with xml:lang
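The precedence rules for the suggested attribute can be sketched in a few lines of Python. This is only an illustration of the stated rules: the function name `resolve_phoneme`, the tuple return values and the `identify_language` callback (standing in for an LID component) are hypothetical.

```python
def resolve_phoneme(attrs, text, identify_language):
    """Decide how a proposed <phoneme> element should be rendered.

    attrs: dict of the element's attributes.
    identify_language: hypothetical LID callback, text -> language code.
    Returns ("lang", code) or ("ph", transcription, alphabet).
    """
    lang = attrs.get("lang")
    if lang is not None:
        # lang has priority over ph when both are given.
        if lang == "x-unknown":
            lang = identify_language(text)
        return ("lang", lang)
    if "ph" in attrs:
        return ("ph", attrs["ph"], attrs.get("alphabet"))
    raise ValueError("phoneme needs a lang or ph attribute")


def fake_lid(text):
    # Stand-in for a real language identifier; always answers Italian.
    return "it"


print(resolve_phoneme({"lang": "it"}, "La vita è bella", fake_lid))
print(resolve_phoneme({"lang": "x-unknown"}, "La vita è bella", fake_lid))
print(resolve_phoneme({"lang": "it", "ph": "…", "alphabet": "ipa"},
                      "La vita è bella", fake_lid))
```

All three calls resolve to the Italian lexicon route; the explicit transcription is consulted only when no lang attribute is present.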
Text normalization
• Text normalization is effectively assisted by the say-as element.
• The constructs we found appropriate in our practice include: date, time (including time intervals like opening hours), number, currency, name, address.
• We additionally suggest as standard values: acronym/abbreviation, web, e-mail, phone, program-code, table, equation.
Prosody prediction
• We speak differently in different situations (e.g. speaking with friends, giving a talk at a conference, reading news, reading stories to children); this is the speaking style
• Differences in prosody can be quantified
• Emotional speech is also in the focus of research
• Modern TTS systems are likely to be able to imitate these to some extent
Speaking style
Suggested speaking-style attribute
• Can be used wherever xml:lang can, i.e. on voice, speak, p, s, w
• Synthesis processors can define their own set of supported speaking styles
• They should support "spelling", which can be viewed as a special reading style
• They may support e.g. "syllabification", "causal", "news reading", "story telling"
Emotion
Suggested emotion attribute
• Mentioned here, although prosody is only one of its aspects
• Complementary to speaking-style, therefore a separate attribute is suggested
• Can be used wherever xml:lang can, i.e. on voice, speak, p, s, w
• Possible values: "happiness", "sadness", "anger", "surprise", "disgust", "fear"
Part-of-speech
• The part of speech (POS) of a word may affect emphasis and other aspects of prosody
• It is not always possible to determine it automatically
• It is more desirable to specify POS than to prescribe prosody (higher level, speaking style can override it)
Example in Hungarian (the word "hogy"):
• "Mondd, hogy vagy?" ("Tell me, how are you?"): interrogative adverb, strong (focus) emphasis
• "Igaz, hogy jól vagy?" ("Is it true that you are alright?"): conjunction, reduced emphasis
Prosody prescription
• Analytic languages (e.g. English, Chinese)
  • Words are usually short
  • They convey only one portion of the meaning
  • Individual words can be stressed
• Synthetic languages (e.g. Hungarian, Korean)
  • Words are often long
  • They are made up of several morphemes and have very complex meanings
  • Stress, pitch changes, etc. may need to be realized on certain morphemes (~syllables)
Example 1: contrastive sentences
• English: "The book is not in the box, but on the box."
  • The speaker can emphasize one word.
• Hungarian: "Nem a dobozon, hanem a dobozban van a könyv."
  • The speaker sometimes has to emphasize one syllable.
• Stress is expressed mainly by pitch; it may be aided by a short pause, slower rate, higher volume.
Example 2: pitch change on syllable
• "Elmentek." ("They are gone."): pitch is continuously falling
• "Elmentek?" ("Are they gone?"): pitch rises at the beginning of the second syllable and falls on the third syllable
Suggestion for extensions to prosody:
• Stress and prosody can be described on a per-syllable basis
• Extension to prosody: time can be a syllable position
  • decimal fractions can also be used
  • negative values indicate the nth position from the end
  • the special symbol syl_end indicates the end of the expression
E.g.: <prosody contour="(syl1,…) (syl1.5,…) (syl2,…) … (syl-1,…) (syl_end,…)">
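One way to interpret these syllable positions is to map each one to an offset measured in syllables from the start of the expression. The Python sketch below does this under assumptions that the slides do not spell out: sylN is taken to mean the beginning of the Nth syllable, a fraction such as syl1.5 the midpoint of the first syllable, syl-N the beginning of the Nth syllable from the end, and syl_end the end of the last syllable. The function name and the regular expression are hypothetical.

```python
import re

# Accepts "syl_end" or "syl" followed by a signed decimal number.
_POSITION = re.compile(r"syl(_end|-?\d+(?:\.\d+)?)$")


def syllable_offset(token, n_syllables):
    """Map a syllable position token to an offset in [0, n_syllables],
    measured in syllables from the start of the expression."""
    match = _POSITION.match(token)
    if match is None:
        raise ValueError(f"not a syllable position: {token!r}")
    value = match.group(1)
    if value == "_end":
        return float(n_syllables)
    number = float(value)
    if number < 0:
        offset = n_syllables + number  # counted from the end (assumption)
    elif number >= 1:
        offset = number - 1            # syl1 is the start of the expression
    else:
        raise ValueError(f"position out of range: {token!r}")
    if not 0 <= offset <= n_syllables:
        raise ValueError(f"position out of range: {token!r}")
    return offset


# "Elmentek" has three syllables: El-men-tek.
for token in ("syl1", "syl1.5", "syl2", "syl-1", "syl_end"):
    print(token, syllable_offset(token, 3))
```

With offsets in syllable units, a processor can scale the contour to whatever durations it assigns to the individual syllables.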
Suggestion for optional extensions: some synthesis processors may process
• pitch-contour (=contour), rate-contour, volume-contour
  • time positions: the same as in contour
  • rate / volume: described as in rate / volume
• emphasis and break extended with a position attribute; the value can be a syllable position. In this case break will not be an empty element.
Summary
Suggested extensions
• <w [syllables="…-…"] [POS="…" [number="…" …]]> … </w>
• <phoneme lang="…" | "x-unknown" [ph="…" [alphabet="…"]]> … </phoneme>
• <voice | speak | p | s | w [speaking-style="spelling" | "syllabification" | "causal" | "news reading" | "story telling" | …] [emotion="happiness" | "sadness" | "anger" | "surprise" | "disgust" | "fear"] [xml:lang="…" | "x-unknown"]> … </voice>
• <prosody contour="(syl1,…) (syl2,…) (syl2.5,…) … (syl-2,…) (syl-1,…) (syl_end,…)">; optionally: pitch-contour (=contour), rate-contour, volume-contour; break, emphasis