Proposals for enhancing SSML 1.0 for Hungarian speech synthesis, enabling better text structure analysis, phoneme conversion, and prosody prediction.
Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers Géza Németh, Géza Kiss, Bálint Tóth Laboratory of Speech Technology, Department of Telecommunications and Media Informatics Budapest University of Technology and Economics, Budapest, Hungary {nemeth,kgeza,toth.b}@tmit.bme.hu
Budapest University of Technology & Economics (BME)
Dept. of Telecommunications & Media Informatics (TMIT)
• Speech activities:
  • Coordinator: Gordos Géza, D.Sc.
  • Speech Technology Lab (STL): Németh Géza and Olaszy Gábor (PhD, D.Sc.)
  • Telecommunications & Signal Processing Lab (TSP): Tatai Péter, MSc
  • Laboratory of Speech Acoustics (LSA): Vicsi Klára, D.Sc.
• In each lab:
  • 4-6 PhD students
  • Graduate students
• 306 in Speech Information Systems subject (2005)
Basic research
• Multi-lingual artificial speech generation (synthesis, STL)
  • limited vocabulary (e.g., numbers, date, address)
  • multi-lingual TTS (Hungarian, German, Polish, Spanish)
  • speech profiles (variability, individual features)
  • expression/emotion presentation (user's manual <-> news)
• Speech recognition (TSP, LSA)
  • noise handling (telephone, in-car, ..., TSP)
  • dictation (good quality, continuous, LSA)
  • audio indexing (e.g. radio archives, broadcast news, TSP)
  • speech segmentation (TSP, LSA)
  • emotion detection (TSP)
• Speech understanding (TSP)
• Speech databases (LSA, TSP)
Applied Research • Fully proprietary components and solutions: • All parameters controlled, systems are tailor-made for the end-user, Integration of original research results, unique products • T-Mobile Hungary services: E-mail reader 1999-, name- and address reader in reverse directory, 2003 (Motto: Why is the human operator speaking, not the machine?!), Symbian SMS-reader 2002- (STL) • Others: SMS reader 2001-, bookreader 2002-, (STL) • Voice portals (Generali Hungary name dial-in 2004, Hungarian VoiceXML browser, 2003, TSP+STL) • Industrial information systems (STL, TSP) • Unified Messaging (STL) • Call Center (STL, TSP) • Audio user interfaces (especially portable/mobile devices, car information systems, wearable devices, STL, TSP) • Disability (1986-, speech, vision, Hungarian version of Jaws for Windows, notetaker for blind people, STL, TSP, LSA)
Contact information Tel: (+36 1) 463-38-83 Fax: (+36 1) 463-31-07 http://speechlab.tmit.bme.hu email: nemeth@tmit.bme.hu
Overview
• Text structure
• Text-to-phoneme conversion
• Text normalization
• Prosody prediction
• Prosody prescription
• Summary
Text structure
Text structure elements already contained in SSML 1.0:
• paragraph
• sentence
Suggested further structuring:
• word
• syllables
This can be used
• to help
  • text-to-phoneme conversion
  • prosody prediction and prescription
  • …
  by giving higher-level information, namely
  • syllable structure
  • part-of-speech information
  (examples are given later)
• to indicate words in languages that do not use spaces to separate words
Reasons to use text structure elements instead of e.g. phoneme, prosody, break, emphasis:
• Easier for a human editor to add
• Replacing the synthesis processor may necessitate rewriting
  • phoneme specifications
  • prosody prescriptions
Suggested word element
<w [syllables="…-…"] [POS="…" [number="…" …]]> … </w>
E.g.
<w syllables="hosz-szú"> hosszú </w>
<w POS="noun" number="plural" case="accusative"> halászsasokat </w>
Suggestion extended from other proposals
<w [syllables="…-…"] [POS="…" [number="…" gender="…" case="…" …] [morph="…+…"] [tone="h+l+…"]]> … </w>
When an expression rather than a single word is labeled:
<e [POS="…" [number="…" …]]> … </e>
E.g. "three kilos":
<e POS="cardinal" number="plural" gender="neuter" case="genitive"> 3 k. </e>
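As a rough illustration, markup in the proposed w syntax could be generated programmatically. The sketch below uses Python's standard xml.etree.ElementTree. The helper name word_element is hypothetical, and the rule that grammatical features are only accepted together with POS is an assumption read off the bracket nesting in the syntax above, not something the proposal states outright.

```python
import xml.etree.ElementTree as ET


def word_element(text, syllables=None, pos=None, **features):
    """Build a <w> element in the proposed (non-standard) syntax.

    `word_element` is a hypothetical helper; <w> is not part of SSML 1.0.
    """
    attrs = {}
    if syllables is not None:
        attrs["syllables"] = syllables
    if pos is not None:
        attrs["POS"] = pos
        attrs.update(features)  # e.g. number="plural", case="accusative"
    elif features:
        # Assumption: features such as number/case only make sense with POS.
        raise ValueError("grammatical features require POS")
    element = ET.Element("w", attrs)
    element.text = text
    return element


if __name__ == "__main__":
    w = word_element("halászsasokat", pos="noun", number="plural", case="accusative")
    print(ET.tostring(w, encoding="unicode"))
```

The syllables value is deliberately not checked against the spelling: in the slide's own example, "hosz-szú" does not reduce to "hosszú" when the hyphen is removed.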
Text-to-phoneme conversion
When pronunciation cannot be determined, you can
• Add a lexicon element, BUT it is hard to add every word
• Specify it using phoneme, BUT this is hard for humans to write and read
• Add a textual replacement using sub
• Provide higher-level information; currently the only means for this is say-as
Other types of higher-level information (easier, more natural):
• Syllable structure
• Part-of-speech information
• Language of included foreign text
Examples follow.
Syllable structure
Hungarian:
• highly agglutinative
• pronunciation inference rules are used
• rules can be tricked by some words
E.g. "egészség" ("health"): the letter combination could be read as "s+zs", [S]+[Z]→[Z], but it is in fact "sz+s", [s]+[S]→[S]
Syllable structure
It is enough to know the syllable structure. Instead of
<phoneme alphabet="ipa" ph="ɛgeː&#x283;&#x283;eːg"> egészség </phoneme>
you can write
<w syllables="e-gész-ség"> egészség </w>
(Note: here you could also write <sub alias="e-gész-ség"> egészség </sub>)
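To make the ambiguity concrete, here is a minimal sketch of why a syllable hint is sufficient. It assumes a simplified model in which a word is split into single letters and the Hungarian multigraphs (cs, dz, dzs, gy, ly, ny, sz, ty, zs), and in which no multigraph may cross a syllable boundary. Both function names are hypothetical and this is not the authors' rule system.

```python
# Hungarian multigraphs, longest first so that "dzs" is tried before "dz".
MULTIGRAPHS = ("dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")


def candidate_segmentations(word):
    """Enumerate every split of `word` into single letters and multigraphs."""
    if not word:
        return [[]]
    results = []
    units = [word[0]] + [m for m in MULTIGRAPHS if word.startswith(m)]
    for unit in units:
        for rest in candidate_segmentations(word[len(unit):]):
            results.append([unit] + rest)
    return results


def segment_with_syllables(hyphenated):
    """Greedy longest match inside each syllable; no unit crosses a hyphen."""
    units = []
    for syllable in hyphenated.split("-"):
        i = 0
        while i < len(syllable):
            match = next(
                (m for m in MULTIGRAPHS if syllable.startswith(m, i)),
                syllable[i],
            )
            units.append(match)
            i += len(match)
    return units


if __name__ == "__main__":
    for candidate in candidate_segmentations("egészség"):
        print("+".join(candidate))
    print(segment_with_syllables("e-gész-ség"))
```

Without the hint, both the "s+zs" and the "sz+s" reading appear among the candidates. With "e-gész-ség" the boundary between "gész" and "ség" rules out "zs", leaving only "sz" followed by "s".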
Part-of-speech
• Word forms may have several meanings/pronunciations
• Specifying part-of-speech may help
E.g.
• I will <w POS="verb" tense="present"> read </w> the book
• I have <w POS="participle"> read </w> the book
Language
• Foreign parts often occur in texts
• Using the same voice, currently you can
  • Do nothing
  • Specify the pronunciation using phoneme
• Another desirable approach:
  • Specify a lexicon for the language and specify the language of the text
Language
Instead of
…<speak … xml:lang="en-US">
The title of the movie is:
<phoneme alphabet="ipa" ph="ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə"> La vita è bella </phoneme>
(Life is beautiful).
Language
you could write
…<speak … xml:lang="en-US">
The title of the movie is:
<phoneme lang="it"> La vita è bella </phoneme>
(Life is beautiful).
Language: suggested language attribute
<phoneme [lang="…" | "x-unknown"] [ph="…" [alphabet="…"]]> … </phoneme>
• If both lang and ph are given, lang has priority.
• If the language is "x-unknown", LID (language identification) is used.
• We suggest that "x-unknown" can also be used with xml:lang.
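The priority rules above can be sketched as a small decision function. This is only one possible reading of the proposal: the function name and the strategy labels it returns are hypothetical, and the fallback when neither attribute is present is an assumption.

```python
def pronunciation_source(lang=None, ph=None):
    """Decide how the text inside a proposed <phoneme> element is pronounced.

    Returns a (strategy, detail) pair. The strategy labels are hypothetical.
    """
    if lang == "x-unknown":
        # The language must be guessed first (language identification, LID).
        return ("language-identification", None)
    if lang is not None:
        # lang has priority even when ph is also given.
        return ("lexicon", lang)
    if ph is not None:
        return ("phonemes", ph)
    # Assumption: with neither attribute, the processor's normal rules apply.
    return ("default", None)


if __name__ == "__main__":
    print(pronunciation_source(lang="it", ph="ˈlɑ ˈviːɾə"))
    print(pronunciation_source(lang="x-unknown"))
```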
Text normalization
• Text normalization is effectively assisted by the say-as element.
• The constructs we found appropriate in our practice include: date, time (including time intervals like opening hours), number, currency, name, address.
• We additionally suggest as standard values: acronym/abbreviation, web, e-mail, phone, program-code, table, equation.
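For illustration, a say-as element carrying one of these values might be produced as follows. The sketch relies on the interpret-as attribute of SSML 1.0; the helper name say_as is hypothetical, and treating "acronym/abbreviation" as two separate tokens is an assumption.

```python
import xml.etree.ElementTree as ET

# Values the slide reports as appropriate in the authors' practice.
ESTABLISHED = {"date", "time", "number", "currency", "name", "address"}
# Values the slide suggests standardizing. Splitting "acronym/abbreviation"
# into two tokens is an assumption made for this sketch.
SUGGESTED = {"acronym", "abbreviation", "web", "e-mail", "phone",
             "program-code", "table", "equation"}


def say_as(text, interpret_as):
    """Wrap `text` in a say-as element (hypothetical helper)."""
    if interpret_as not in ESTABLISHED | SUGGESTED:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    element = ET.Element("say-as", {"interpret-as": interpret_as})
    element.text = text
    return element


if __name__ == "__main__":
    print(ET.tostring(say_as("nemeth@tmit.bme.hu", "e-mail"), encoding="unicode"))
```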
Prosody prediction
• We speak differently in different situations (e.g. speaking with friends, giving a talk at a conference, reading news, reading stories to children): this is speaking style
• Differences in prosody can be quantified
• Emotional speech is also in the focus of research
• Modern TTS systems are likely to be able to imitate these to some extent
Speaking style: suggested speaking-style attribute
• Can be used wherever xml:lang can, i.e. on voice, speak, p, s, w
• Synthesis processors can define their own set of supported speaking styles
• They should support "spelling", which can be viewed as a special reading style
• They may support e.g. "syllabification", "casual", "news reading", "story telling"
Emotion: suggested emotion attribute
• Mentioned here, although prosody is only one of its aspects
• Complementary to speaking-style, therefore a separate attribute is suggested
• Can be used wherever xml:lang can, i.e. on voice, speak, p, s, w
• Possible values: "happiness", "sadness", "anger", "surprise", "disgust", "fear"
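A minimal sketch of how the two proposed attributes might be attached and checked when generating markup. The helper name styled is hypothetical. Speaking-style values are left unchecked because the proposal lets each synthesis processor define its own set, while the emotion values are limited to the six listed.

```python
import xml.etree.ElementTree as ET

# Elements on which the proposal allows the attributes (same as xml:lang).
CARRIERS = {"voice", "speak", "p", "s", "w"}
EMOTIONS = {"happiness", "sadness", "anger", "surprise", "disgust", "fear"}


def styled(tag, text, speaking_style=None, emotion=None):
    """Build an element with the proposed speaking-style/emotion attributes."""
    if tag not in CARRIERS:
        raise ValueError(f"attributes not allowed on <{tag}>")
    if emotion is not None and emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    attrs = {}
    if speaking_style is not None:
        attrs["speaking-style"] = speaking_style  # processor-defined value
    if emotion is not None:
        attrs["emotion"] = emotion
    element = ET.Element(tag, attrs)
    element.text = text
    return element


if __name__ == "__main__":
    s = styled("s", "Elmentek.", speaking_style="news reading", emotion="sadness")
    print(ET.tostring(s, encoding="unicode"))
```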
Part-of-speech
• The part-of-speech (POS) of a word may affect emphasis and other aspects of prosody
• It cannot always be determined automatically
• It is more desirable to specify POS than to prescribe prosody (higher level; speaking style can override it)
Example in Hungarian (the word "hogy"):
• "Mondd, hogy vagy?" ("Tell me, how are you?"): interrogative adverb, strong (focus) emphasis
• "Igaz, hogy jól vagy?" ("Is it true that you are alright?"): conjunction, reduced emphasis
Prosody prescription
• Analytic languages (e.g. English, Chinese)
  • Words are usually short
  • They convey only one portion of the meaning
  • Individual words can be stressed
• Synthetic languages (e.g. Hungarian, Korean)
  • Words are often long
  • They are made up of several morphemes and have very complex meanings
  • Stress, pitch changes, etc. may need to be realized on certain morphemes (~syllables)
Example 1: contrastive sentences
• English: "The book is not in the box, but on the box."
  • The speaker can emphasize one word.
• Hungarian: "Nem a dobozon, hanem a dobozban van a könyv."
  • The speaker sometimes has to emphasize one syllable.
• Stress is expressed mainly by pitch; it may be aided by a short pause, slower rate, higher volume.
Example 2: pitch change on syllable
• "Elmentek." ("They are gone."): pitch is continuously falling
• "Elmentek?" ("Are they gone?"): pitch rises at the beginning of the second syllable and falls on the third syllable
Suggestion for extensions to prosody:
• Stress and prosody can be described on a per-syllable basis
• Extension to the contour attribute of prosody: the time position can be a syllable position
  • decimal fractions can also be used
  • negative values indicate the nth position from the end
  • the special symbol syl_end indicates the end of the expression
E.g.: <prosody contour="(syl1,…) (syl1.5,…) (syl2,…) … (syl-1,…) (syl_end,…)">
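The position notation can be made concrete with a small resolver. The slide does not define the exact semantics, so this sketch assumes that syl1 is the start of the first syllable, that a fraction points into the syllable (syl1.5 is the middle of the first one), that syl-1 is the start of the last syllable, and that syl_end is the end of the expression. The function name is hypothetical.

```python
def resolve_syllable_position(token, syllable_count):
    """Map 'syl1', 'syl1.5', 'syl-1' or 'syl_end' to a 0-based offset,
    measured in syllables from the start of the expression."""
    if not token.startswith("syl"):
        raise ValueError(f"not a syllable position: {token!r}")
    rest = token[3:]
    if rest == "_end":
        return float(syllable_count)
    value = float(rest)  # raises ValueError on malformed input such as "syl"
    # Assumption: positive positions are 1-based, negative ones count
    # syllable starts backwards from the end of the expression.
    offset = syllable_count + value if value < 0 else value - 1.0
    if not 0.0 <= offset <= syllable_count:
        raise ValueError(f"position out of range: {token!r}")
    return offset


if __name__ == "__main__":
    # "Elmentek" has three syllables: El-men-tek.
    for token in ("syl1", "syl1.5", "syl2", "syl-1", "syl_end"):
        print(token, resolve_syllable_position(token, 3))
```

Under these assumptions, the rise in "Elmentek?" would be anchored at syl2 and the fall would span syl3 to syl_end.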
Suggestion for optional extensions: some synthesis processors may process
• pitch-contour (= contour), rate-contour, volume-contour
  • time positions: the same as in contour
  • rate / volume: described as in rate / volume
• emphasis and break extended with a position attribute; the value can be a syllable position. In this case break will not be an empty element.
Summary: suggested extensions
• <w [syllables="…-…"] [POS="…" [number="…" …]]> … </w>
• <phoneme [lang="…" | "x-unknown"] [ph="…" [alphabet="…"]]> … </phoneme>
• <voice | speak | p | s | w [speaking-style="spelling" | "syllabification" | "casual" | "news reading" | "story telling" | …] [emotion="happiness" | "sadness" | "anger" | "surprise" | "disgust" | "fear"] [xml:lang="…" | "x-unknown"]> … </voice>
• <prosody contour="(syl1,…) (syl2,…) (syl2.5,…) … (syl-2,…) (syl-1,…) (syl_end,…)">
  optionally: pitch-contour (= contour), rate-contour, volume-contour; break, emphasis