SSML Extension for Expressive Mandarin TTS • Shuang Li, Hongwu Yang, Lianhong Cai • Tsinghua University
Outline • Motivation • Expression of Speech • Proposed SSML extension • Conclusion
Motivation(1/3) • Sentences with the same text can be expressed with different styles, emotions and moods • Current TTS systems lack this variability
Motivation(2/3) • Current SSML cannot define speaking style, emotion and mood • Good news: 生日快乐 “Happy birthday”, expressed with happiness (emotion) • Bad news: 张总去世了 “Director Zhang passed away”, expressed with sadness (emotion) • Information provider: 飞往纽约的飞机将要起飞 “The flight to New York is about to take off”, expressed in a mild mood • Dialog: 是中国队赢了吗? “Did the Chinese team win?”, emphasizing “Chinese”, with interrogative mood • Current SSML can hardly express the differences among the expressions above
Motivation(3/3) • Expressive speech (diagram) has three components: • Characteristic: physiological/social characteristics; existing tag: voice • Style: expressing pattern (news, sports commentary, dialog, information providing, ……); no tag • Emotion: physiological reactions (positive, neutral, negative); no tag • Emotion, style and characteristic are relatively independent but cannot be separated • Characteristic and style: relatively stable, global features • Emotion: short-time, local feature • With different speaking styles • Representing the speaker’s attitude, purpose and emotion • More harmonious with the circumstances
Outline • Motivation • Expression of Speech • Proposed SSML extension • Conclusion
Expression of Speech • Style: speaking style (dialog, news, information providing…) • Mood: mood (request, acquisition, affirmation, apology…) • Emotion: emotional activities (neutral, negative, positive)
Hierarchical framework of Prosody • Break level • B0: no break • B1: syllable • B2: prosodic word • B3: prosodic phrase • B4: breath group • B5: prosodic group • Chiu-yu Tseng, et al. Fluent speech prosody: Framework and modeling. Speech Communication, 46 (2005) 284-399
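The break levels above can be written down as an ordered type. The following is a minimal Python sketch, not part of the original slides. The element names in `UNIT_TAG` are the tags proposed later in this deck; the helper `units_closed_by` is hypothetical and assumes that a break of a given level also ends every smaller unit.

```python
from enum import IntEnum


class BreakLevel(IntEnum):
    """Break levels B0-B5 as listed on the slide; larger value = larger unit."""
    B0 = 0  # no break
    B1 = 1  # syllable
    B2 = 2  # prosodic word
    B3 = 3  # prosodic phrase
    B4 = 4  # breath group
    B5 = 5  # prosodic group


# Element proposed in this deck for the unit that each break level delimits.
UNIT_TAG = {
    BreakLevel.B1: "syl",
    BreakLevel.B2: "pw",
    BreakLevel.B3: "pph",
    BreakLevel.B4: "bg",
    BreakLevel.B5: "utterance",
}


def units_closed_by(level):
    """List the units that end at a break of `level` (assumption: a larger
    break also closes all smaller units)."""
    return [UNIT_TAG[lv] for lv in BreakLevel if BreakLevel.B1 <= lv <= level]


print(units_closed_by(BreakLevel.B3))
```

Using an `IntEnum` keeps the levels comparable, so "is this boundary at least a prosodic phrase?" becomes `level >= BreakLevel.B3`.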
我永远忘不了<B3/25ms>一张对日抗战时的新闻照片,<B3/507ms>轰炸后的废墟焦土上,<B3/272ms>一个衣不蔽体、<B3/384ms>满身尘土灰烟的幼儿<B3/100ms>坐在地上<B3/75ms>无助的大哭着。<B5/1110ms>那是一再令我热泪盈眶的镜头。<B3/507ms>新闻摄影中的战争传真<B3/276ms>已不能只称是照片了。<B5/802ms> • English gloss: “I will never forget a news photograph from the War of Resistance against Japan: on the scorched ruins left by a bombing, a ragged toddler covered in dust and soot sits on the ground, crying helplessly. It is an image that has moved me to tears again and again. War reportage in news photography can no longer be called mere photographs.” • From Chiu-yu Tseng, report at Beijing University, Oct 11, 2005
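The annotated passage can be processed mechanically. The sketch below is illustrative only and assumes that a marker such as `<B3/507ms>` means "break of level 3 followed by 507 ms", which is how the notation reads on the slide. The function name `parse_breaks` is hypothetical.

```python
import re

# Matches markers such as <B3/25ms>: break level (0-5), then a duration in ms.
BREAK_RE = re.compile(r"<B([0-5])/(\d+)ms>")


def parse_breaks(annotated):
    """Split annotated text into (chunk, break_level, duration_ms) tuples.

    Each chunk is the text preceding a marker. Text after the last marker,
    if any, is returned with level and duration set to None.
    """
    result = []
    pos = 0
    for m in BREAK_RE.finditer(annotated):
        chunk = annotated[pos:m.start()].strip()
        result.append((chunk, int(m.group(1)), int(m.group(2))))
        pos = m.end()
    tail = annotated[pos:].strip()
    if tail:
        result.append((tail, None, None))
    return result


# Last sentence group of the passage on the slide.
sample = ("那是一再令我热泪盈眶的镜头。<B3/507ms>"
          "新闻摄影中的战争传真<B3/276ms>"
          "已不能只称是照片了。<B5/802ms>")

for chunk, level, ms in parse_breaks(sample):
    print(f"B{level} {ms:>5} ms  {chunk}")
```

The resulting tuples could then be mapped onto `pph` and `utterance` boundaries, with the durations carried over to `<break time="…"/>`.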
Outline • Motivation • Expression of Speech • Proposed SSML extension • Conclusion
Proposed tag(1/2) • Utterance: prosodic group, expressing a complete meaning • Attributes: • style: speaking style. Values: News, Reading, Information provider, dialog, etc. • emotion: speaking emotion. Values: Happy, Sad, Angry, Calm, Despair, etc.; +1 for positive, 0 for neutral, -1 for negative • mood: speaking mood. Values: given, request, acquisition, affirmation, apology, etc.
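A processor could check the three utterance attributes against the listed values before synthesis. The following Python sketch is an assumption-laden illustration: the slide ends each value list with "etc", so the sets below are examples rather than a closed vocabulary, and accepting "positive/neutral/negative" alongside "+1/0/-1" is a choice made here, not something the slides specify. `check_utterance` is a hypothetical helper.

```python
# Values listed on the slide (lower-cased). The lists are open-ended ("etc"),
# so a real processor would likely make these sets extensible.
ALLOWED = {
    "style": {"news", "reading", "information provider", "dialog"},
    "emotion": {"happy", "sad", "angry", "calm", "despair",
                "+1", "0", "-1",                      # polarity codes
                "positive", "neutral", "negative"},  # assumed named forms
    "mood": {"given", "request", "acquisition", "affirmation", "apology"},
}


def check_utterance(attrs):
    """Return a list of problems found in an utterance's attribute dict."""
    problems = []
    for name, value in attrs.items():
        if name not in ALLOWED:
            problems.append(f"unknown attribute: {name}")
        elif value.strip().lower() not in ALLOWED[name]:
            problems.append(f"unlisted {name} value: {value!r}")
    return problems


print(check_utterance({"style": "dialog", "emotion": "neutral",
                       "mood": "acquisition"}))
print(check_utterance({"style": "dialog", "emotion": "angery"}))
```

A check like this catches misspelled values early instead of letting the engine silently fall back to a default rendering.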
Proposed tag(2/2) • BG: breath group • Attribute: intonation. Values: indicative, interrogative, imperative • PPh: prosodic phrase • PW: prosodic word • Syl: syllable
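The five proposed elements form a hierarchy (utterance, bg, pph, pw, syl), and that nesting can be checked with a small tree walk. This Python sketch uses the standard `xml.etree.ElementTree` module. It assumes that levels may be skipped (for example a `pw` directly inside a `bg`) and that standard SSML elements such as `emphasis` are transparent to the hierarchy; the slides do not state either rule. The sample markup omits the SSML namespace for brevity, and `nesting_errors` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Rank of each proposed element, outermost first.
RANK = {"utterance": 0, "bg": 1, "pph": 2, "pw": 3, "syl": 4}


def nesting_errors(elem, parent_rank=-1, errors=None):
    """Collect prosodic elements that are not nested inside a larger unit."""
    if errors is None:
        errors = []
    rank = RANK.get(elem.tag)
    if rank is None:
        # Non-prosodic tags (emphasis, break, prosody, ...) are transparent.
        rank = parent_rank
    elif rank <= parent_rank:
        errors.append(
            f"<{elem.tag}> may not appear inside a unit of equal or smaller size"
        )
    for child in elem:
        nesting_errors(child, rank, errors)
    return errors


good = ET.fromstring(
    '<utterance style="dialog"><bg intonation="interrogative">'
    '<pph><pw><emphasis level="strong">张威</emphasis></pw></pph>'
    '</bg></utterance>'
)
bad = ET.fromstring('<utterance><pw><pph>x</pph></pw></utterance>')

print(nesting_errors(good))
print(nesting_errors(bad))
```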
Some examples(1/3)
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="zh-CN">
  <utterance style="information provider" emotion="-1" mood="apology">
    <bg intonation="indicative">
      <pph>1121次航班(Flight 1121)</pph>
      <pph>延误(has been delayed)
        <pw><emphasis level="strong">1小时(for an hour)</emphasis></pw></pph>
      <break strength="medium" time="215ms"/>
      <pph>请旅客们到(Please go to)</pph>
      <pw><emphasis level="moderate">G6</emphasis></pw>
      <pph>候机厅等候(the waiting room)</pph>
    </bg>
  </utterance>
</speak>
Some examples(2/3)
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="zh-CN">
  <utterance style="dialog" emotion="neutral" mood="acquisition">
    <bg intonation="interrogative">
      <pph><pw>
        <emphasis level="strong">张威(Zhang Wei)</emphasis>
      </pw></pph>
      <break strength="medium" time="75ms"/>
      <pph>担心肖荫开车发晕(is afraid of Xiao Yin being dizzy when driving)</pph>
    </bg>
  </utterance>
</speak>
Some examples(3/3)
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="zh-CN">
  <utterance style="dialog" emotion="angry">
    <bg intonation="interrogative">
      <prosody rate="x-fast">难道不是你的错吗?(Isn’t it your fault?)</prosody>
      <break strength="medium" time="520ms"/>
    </bg>
    <bg intonation="imperative">
      以后你小心一点(Be careful next time)
    </bg>
  </utterance>
</speak>
Outline • Motivation • Expression of Speech • Proposed SSML extension • Conclusion
Conclusion & questions • 5 elements for the hierarchical prosodic structure • utterance, bg, pph, pw, syl • 3 expressive attributes for utterance • style • emotion • mood • 1 attribute for bg • intonation