Lessons from Expressive Speech Processing Project

Explore insights and challenges faced during the ESP project for conversational speech synthesis, including analysis of non-verbal utterances and interactive speech features.

Presentation Transcript


  1. Towards Conversational Speech Synthesis: “Lessons Learned from the Expressive Speech Processing Project” • Nick Campbell, NiCT / ATR-SLC • National Institute of Information and Communications Technology & ATR Spoken Language Communication Research Labs, Keihanna Science City, Kyoto 619-0288, Japan • nick@nict.go.jp, nick@atr.jp

  2. The JST/CREST ‘ESP’ corpus • The ATR “Expressive Speech Processing” project (JST/CREST) ran from April 2000 to March 2005 and produced a corpus of 1,500 hours of natural conversational speech • All recordings were transcribed, and about 10% are annotated for speaking style, etc. • The corpus is divided into three sections: (i) esp_f, (ii) esp_c, and (iii) esp_m

  3. Transcription example • 匿名 (“anonymous”) • One “utterance” per line

  4. Sections of the ESP corpus • esp_f: one female speaker, head-mounted mic, 600 hours of daily spoken interactions, emotion/speech-act/etc. … • esp_c: 10 adult speakers (5 male, 5 female; 2 Chinese, 2 English); 30-minute telephone conversations × 10 weeks; all conversations in Japanese, free content • esp_m: multi-speaker, head-mounted microphones, variety of interaction settings (like esp_f but many more voices)

  5. Finding #1: The Function of Conversational Speech • To establish a rapport with the listener • To show interest and attention • To convey propositional content • Contrast “broadcast mode” (one-way) with “interactive mode” (two-way) speech • Speech synthesis can do broadcast mode, but conversational speech is two-way!

  6. The hundred most common utterances • Backchannels & affect bursts

  7. Non-Verbal Speech Sounds • Short, simple, repetitive noises • How they are spoken is usually more important than what is being said • Some examples: • The word ほんま • Means “really” • Used a lot in Osaka conversations …

  8. Synthesis of Non-Verbal Speech Sounds • The challenge now is how to synthesise these non-lexical speech sounds • the same speaker says the same word in many consistently different ways … • How should they best be (a) described (b) realised?

  9. Tap-to-talk • http://feast.atr.jp/imode

  10. Characteristics of Non-Verbal Utterances • Better described by icons? • Short, expressive sounds • Phonetically ambiguous • Prosodically marked • Not well specified by text input! • But frequent and textually ‘transparent’

  11. ‘Wrappers’ and ‘Fillings’ - Interaction Devices • Often used as “edge-markers” • At beginning and end of utterance chunks • Add expressivity to propositional content • Not just “fillers” – they ‘wrap’ the utterance, e.g., “erm, it’s very simple, you know”

  12. The Acoustic Features of Wrappers (and Fillers) • Prosodically very variable, in more than just pitch & duration … • PCA dimension reduction shows: • 3 components account for more than 50% of the variance • 7 components account for more than 80% • Voice quality comes up in the 1st component!
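The "3 components for more than 50%, 7 for more than 80%" reading on slide 12 is a statement about cumulative explained variance. The sketch below shows how such counts are read off a list of PCA eigenvalues. The function names and the eigenvalues are assumptions chosen purely for illustration; they are not measurements from the ESP corpus, and the slides do not say which acoustic features went into the analysis.

```python
from itertools import accumulate


def cumulative_explained_variance(eigenvalues):
    """Cumulative fraction of total variance covered by the leading components."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


def components_needed(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative share exceeds threshold."""
    shares = cumulative_explained_variance(eigenvalues)
    for count, share in enumerate(shares, start=1):
        if share > threshold:
            return count
    return None  # the threshold is never exceeded


# Illustrative eigenvalues only (hypothetical, NOT taken from the ESP corpus).
example_eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.5, 0.5, 0.5]

print(components_needed(example_eigenvalues, 0.5))  # 3
print(components_needed(example_eigenvalues, 0.8))  # 7
```

With these made-up values the first three components cover 8.0 of a total variance of 14.0, and the first seven cover 11.4, which reproduces the pattern described on the slide.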

  13. Voice Quality in Synthesis • Chakai – affect-based unit selection • using “whole-phrase” units • that vary according to expressivity • selected by their acoustics (principal components) • They show affective relationships • and serve a pragmatic (phatic) function
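Selecting whole-phrase units "by their acoustics" can be pictured as a nearest-neighbour lookup in principal-component space. The following is a minimal sketch under that assumption, not the Chakai implementation: the `PhraseUnit` class, the `select_unit` function, the unit identifiers, and the coordinates are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PhraseUnit:
    unit_id: str   # hypothetical corpus identifier
    text: str      # transcription of the whole phrase
    pcs: tuple     # position of this rendition in principal-component space


def select_unit(units, text, target_pcs):
    """Return the rendition of `text` whose acoustics lie closest to target_pcs."""
    candidates = [u for u in units if u.text == text]
    if not candidates:
        raise LookupError(f"no unit with text {text!r}")
    return min(candidates, key=lambda u: math.dist(u.pcs, target_pcs))


# Made-up inventory: the same word spoken in different ways, plus one other word.
inventory = [
    PhraseUnit("honma_001", "ほんま", (0.9, 0.1, -0.2)),
    PhraseUnit("honma_002", "ほんま", (-0.7, 0.4, 0.3)),
    PhraseUnit("honma_003", "ほんま", (0.1, -0.8, 0.5)),
    PhraseUnit("un_001", "うん", (0.8, 0.2, -0.1)),
]

chosen = select_unit(inventory, "ほんま", target_pcs=(0.8, 0.0, 0.0))
print(chosen.unit_id)  # honma_001
```

The text match narrows the search to renditions of the requested phrase; the distance in component space then picks the rendition whose expressivity is nearest to the requested target.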

  14. Chakai

  15. KeyTalk

  16. Touch-sensitive Selection • One big advantage of using a MIDI keyboard is touch-sensitivity: controlled sustain & attack, i.e., perfect for the natural input of prosody • with pitch-bend as well … • Another is that keys can be intuitively grouped into related sets of utterances

  17. Octave or sub-octave clusters … • 5 black & 7 white keys • Greetings, replies, opinion, calling, etc. …
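One way to picture the keyboard layout of slides 16 and 17 is a lookup from a MIDI note-on event to an utterance category (one per octave), a key colour, a slot within that colour, and an intensity derived from key velocity. This is only a sketch: the category-to-octave assignment, `BASE_NOTE`, and `describe_key` are assumptions for illustration, not the KeyTalk layout. The only MIDI facts relied on are that notes and velocities range over 0..127 and that each octave of 12 keys has 5 black and 7 white keys.

```python
BLACK_PITCH_CLASSES = {1, 3, 6, 8, 10}  # C#, D#, F#, G#, A# within an octave

# Hypothetical layout: one utterance category per octave, starting at MIDI note 36.
BASE_NOTE = 36
CATEGORY_BY_OCTAVE = {0: "greetings", 1: "replies", 2: "opinion", 3: "calling"}


def describe_key(note, velocity):
    """Map a MIDI note-on event to (category, colour, slot, intensity), or None."""
    if not 0 <= note <= 127 or not 0 <= velocity <= 127:
        raise ValueError("MIDI note and velocity must be in 0..127")
    octave, pitch_class = divmod(note - BASE_NOTE, 12)
    category = CATEGORY_BY_OCTAVE.get(octave)
    if category is None:
        return None  # key lies outside the mapped octaves
    is_black = pitch_class in BLACK_PITCH_CLASSES
    colour = "black" if is_black else "white"
    # Index of this key among the keys of the same colour in its octave.
    same_colour = [pc for pc in range(12) if (pc in BLACK_PITCH_CLASSES) == is_black]
    slot = same_colour.index(pitch_class)
    intensity = velocity / 127  # harder key press gives higher intensity
    return category, colour, slot, intensity


print(describe_key(36, 127))  # ('greetings', 'white', 0, 1.0)
print(describe_key(49, 64)[:3])  # ('replies', 'black', 0)
```

Splitting each octave into its 7 white and 5 black keys gives two sub-octave clusters per category, and the velocity value carries the touch-sensitive part of the input.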

  18. Grouping Related Utterances • It remains as future work to group related utterance types and plan a full keyboard for non-verbal speech sound synthesis • Demo software is provided on the CD-ROM proceedings – please let me know if you have any helpful ideas or suggestions :-)

  19. Summary • Non-verbal sounds offer a challenge for the synthesis of interactive speech • They are frequent and carry important affective and discourse-flow information • Segments can be selected and reused from a conversational speech corpus

  20. Conclusion • This paper has presented some examples of non-linguistic uses of speech prosody • Synthesis of expressive sounds is easy! • ‘Units’ can be whole phrases • But unit selection is difficult! • They carry subtle differences of meaning • That can be very hard to specify in text

  21. Listen: • Some examples of conversational speech • (a) taken from the corpus (natural) • (b) synthesised using current technology • (c) concatenated from a very-large corpus • Listen to the non-linguistic prosody!

  22. Dialogue (English gloss: Japanese original)
      Morning: もしもし
      Morning: もしもし
      Hello: こんにちは
      hi_there_: まいど
      Haha: ハハハ
      been_a_long_time: 久しぶりですねー
      came_straight_to_the_eighth_floor: もう直接、八階の方に、はい
      Really: あ、そうなん
      Really: あ、ほんま
      seventh_floor_today: 七階すか、今日
      yeah_yeah: うーん、そうそう
      Hahaha: ワーハハハー
      what_time_did_you_come: 何時頃来たんすか
      just_now: さっき
      about_now: さっきぐらいウアハハハハハハ、まじで
      bit_late: ちょっと遅なってんやん、アハ
      just_in_time: ぎりぎりー
      not_really: いや、そういうわけじゃないねんけど
      yeah_yeah_yeah: はーいあいはいはい
      Umm: うん
      Yeah: そっかそっか
      So: そう
      came_by_bike: あたし自転車やから
      from_Kyoubashi: 京橋のほうじゃなかったですっけ
      Really: そうや
      Yeah: でしょう
      Umm: うん
      from_Kyoubashi: 京橋から
      by_bike: チャリンコですぐですか、あーそんなもんで来れるんやー
      Yeah: そう
      Yeah: うん、だいたい
      Really: あそうなーんすか
      Yeah: うん
      NATR: next-generation advanced text rendering • The original dialogue • ditto - synthesised • CHATR (& original) • NATR – large-corpus • NATR – more lively

  23. Acknowledgements • This work is supported by the National Institute of Information and Communications Technology (NiCT), and includes contributions from the Japan Science & Technology Corporation (JST), and the Ministry of Public Management, Home Affairs, Posts and Telecommunications, Japan (SCOPE). • The author is especially grateful to the management of ATR Spoken Language Communication Research Labs for their continuing encouragement and support.

  24. Thank you • coming next: The Plone™

  25. Thank you
