Towards Conversational Speech Synthesis; “Lessons Learned from the Expressive Speech Processing Project”
Nick Campbell, NiCT / ATR-SLC
National Institute of Information and Communications Technology & ATR Spoken Language Communication Research Labs


Presentation Transcript


  1. Towards Conversational Speech Synthesis; “Lessons Learned from the Expressive Speech Processing Project” Nick Campbell, NiCT / ATR-SLC, National Institute of Information and Communications Technology & ATR Spoken Language Communication Research Labs, Keihanna Science City, Kyoto 619-0288, Japan. nick@nict.go.jp, nick@atr.jp

  2. The JST/CREST ‘ESP’ corpus • The ATR “Expressive Speech Processing” project (JST/CREST) ran from April 2000 to March 2005 and resulted in a corpus of 1,500 hours of natural conversational speech • All recordings were transcribed, and about 10% are annotated for speaking style, etc. • The corpus is divided into three sections: (i) esp_f, (ii) esp_c, and (iii) esp_m

  3. Transcription example (speaker names anonymized): one “utterance” per line

  4. Sections of the ESP corpus • esp_f • one female speaker, head-mounted microphone, 600 hours of daily spoken interactions, annotated for emotion, speech-act, etc. • esp_c • 10 adult speakers (5 male, 5 female; 2 Chinese, 2 English), 30-minute telephone conversations x 10 weeks, all conversations in Japanese, free content • esp_m • multi-speaker, head-mounted microphones, a variety of interaction settings (like esp_f but with many more voices)

  5. Finding #1: the Function of Conversational Speech • To establish a rapport with the listener • To show interest and attention • To convey propositional content • Contrast “broadcast mode” (one-way) with “interactive mode” (two-way) speech • Speech synthesis can do broadcast mode, but conversational speech is two-way!

  6. The hundred most common utterances: backchannels & affect bursts
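The ranking behind this slide can be reproduced with a simple frequency count over the transcriptions. A minimal sketch in Python, assuming the ESP convention of one transcribed utterance per line (the file name here is hypothetical):

```python
from collections import Counter

# Count utterance frequencies in a transcription file
# (one utterance per line, as in the ESP transcriptions).
with open("esp_f_transcripts.txt", encoding="utf-8") as f:
    counts = Counter(line.strip() for line in f if line.strip())

# The head of this list is dominated by backchannels and affect
# bursts (うん, はい, ほんま, laughter tokens, ...).
for utterance, n in counts.most_common(100):
    print(f"{n:6d}  {utterance}")
```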

  7. Non-Verbal Speech Sounds • Short, simple, repetitive noises • How they are spoken is usually more important than what is being said • An example: the word ほんま (honma) • means “really” • used a lot in Osaka conversations …

  8. Synthesis of Non-Verbal Speech Sounds • The challenge now is how to synthesise these non-lexical speech sounds • the same speaker says the same word in many consistently different ways … • How should they best be (a) described and (b) realised?

  9. Tap-to-talk demo: http://feast.atr.jp/imode

  10. Characteristics of Non-Verbal Utterances • Better described by icons? • Short, expressive sounds • Phonetically ambiguous • Prosodically marked • Not well specified by text input! • But frequent and textually ‘transparent’

  11. ‘Wrappers’ and ‘Fillings’: Interaction Devices • Often used as “edge-markers” at the beginning and end of utterance chunks • Add expressivity to propositional content • Not just “fillers”: they ‘wrap’ the utterance, e.g., “erm, it’s very simple, you know”
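One crude way to observe this ‘wrapping’ pattern in transcribed data is to test whether an utterance chunk begins or ends with one of these devices. A sketch, with the wrapper inventory invented here purely for illustration:

```python
# Hypothetical wrapper inventory; the real set would come from
# the corpus annotation (with Japanese equivalents).
WRAPPERS = ("erm", "um", "well", "you know")

def edge_markers(chunk: str) -> tuple[bool, bool]:
    """Report whether a chunk is 'wrapped', i.e. whether a device
    appears at the start and/or end of the utterance chunk."""
    s = chunk.lower().strip()
    starts = any(s.startswith(w + " ") or s.startswith(w + ",") for w in WRAPPERS)
    ends = any(s.endswith(" " + w) or s.endswith("," + w) for w in WRAPPERS)
    return starts, ends

print(edge_markers("erm, it's very simple, you know"))  # (True, True)
```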

  12. The Acoustic Features of Wrappers (and Fillers) • Prosodically very variable, in more than just pitch & duration … • PCA dimension reduction shows: • 3 components account for more than 50% of the variance • 7 components account for more than 80% • Voice quality comes up in the 1st component!
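The dimension-reduction result can be checked along these lines. A minimal sketch with scikit-learn, where a random matrix stands in for the real per-token acoustic features (pitch, duration, energy, voice-quality measures):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the real feature matrix: one row per wrapper token,
# one column per acoustic measure (pitch, duration, energy, and
# voice-quality parameters).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("3 components:", cumulative[2])  # >0.5 on the real data
print("7 components:", cumulative[6])  # >0.8 on the real data

# On the ESP data, the voice-quality measures carry the largest
# loadings in the first component.
print(pca.components_[0])
```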

  13. Voice Quality in Synthesis • Chakai: affect-based unit selection • using “whole-phrase” units • that vary according to expressivity • selected by their acoustics (principal components) • They show affective relationships • and serve a pragmatic (phatic) function
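A minimal sketch of what affect-based selection over whole-phrase units could look like: each stored phrase carries coordinates in the principal-component (“expressivity”) space, and the unit nearest a requested target point is chosen. The file names and the two-dimensional space are illustrative, not the actual Chakai internals:

```python
import numpy as np

# Whole-phrase units with illustrative coordinates in the
# principal-component space derived from the acoustic analysis.
units = {
    "honma_flat.wav":      np.array([0.1, -0.2]),
    "honma_surprised.wav": np.array([1.5,  0.8]),
    "honma_doubtful.wav":  np.array([-1.2, 0.4]),
}

def select_unit(target: np.ndarray) -> str:
    """Pick the phrase whose acoustics lie closest to the requested
    point in expressivity space."""
    return min(units, key=lambda name: np.linalg.norm(units[name] - target))

# Request a strongly 'surprised' rendering of ほんま:
print(select_unit(np.array([1.2, 0.7])))  # -> honma_surprised.wav
```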

  14. Chakai

  15. KeyTalk

  16. Touch-Sensitive Selection • One big advantage of using a MIDI keyboard is touch sensitivity: controlled sustain & attack (perfect for the natural input of prosody) • with pitch-bend as well … • Another is that keys can be intuitively grouped into related sets of utterances

  17. Octave or sub-octave clusters … 5 black & 7 white keys per octave; clusters for greetings, replies, opinions, calling, etc. …
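The two preceding slides suggest an event loop like the following sketch, using the mido MIDI library; the note-to-utterance map, cluster layout, and play() function are hypothetical stand-ins for the demo software:

```python
import mido  # pip install mido python-rtmidi

# Hypothetical key layout: octave clusters, one cluster per
# utterance family (greetings, replies, opinions, calling, ...).
NOTE_TO_UTTERANCE = {
    60: "greeting_moshimoshi.wav",  # C4 cluster: greetings
    62: "greeting_maido.wav",
    72: "reply_un.wav",             # C5 cluster: replies
    74: "reply_sousou.wav",
}

def play(wav: str, velocity: int, bend: int) -> None:
    """Placeholder playback: key velocity drives loudness/attack,
    pitch-bend drives an F0 modification of the selected unit."""
    print(f"play {wav} velocity={velocity} bend={bend}")

bend = 0
with mido.open_input() as port:        # default MIDI input port
    for msg in port:
        if msg.type == "pitchwheel":
            bend = msg.pitch           # -8192 .. 8191
        elif msg.type == "note_on" and msg.velocity > 0:
            wav = NOTE_TO_UTTERANCE.get(msg.note)
            if wav is not None:
                play(wav, msg.velocity, bend)
```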

  18. Grouping Related Utterances • It remains as future work to group related utterance types and plan a full keyboard for non-verbal speech-sound synthesis • Demo software is provided in the CD-ROM proceedings; please let me know if you have any helpful ideas or suggestions :-)

  19. Summary • Non-verbal sounds offer a challenge for the synthesis of interactive speech • They are frequent and carry important affective and discourse-flow information • Segments can be selected and reused from a conversational speech corpus

  20. Conclusion • This paper has presented some examples of non-linguistic uses of speech prosody • Synthesis of expressive sounds is easy! • ‘Units’ can be whole phrases • But unit selection is difficult! • They carry subtle differences of meaning • That can be very hard to specify in text

  21. Listen: • Some examples of conversational speech • (a) taken from the corpus (natural) • (b) synthesised using current technology • (c) concatenated from a very-large corpus • Listen to the non-linguistic prosody!

  22. Example dialogue (English gloss / original Japanese):
Morning もしもし
Morning もしもし
Hello こんにちは
hi_there まいど
Haha ハハハ
been_a_long_time 久しぶりですねー
came_straight_to_the_eighth_floor もう直接、八階の方に、はい
Really あ、そうなん
Really あ、ほんま
seventh_floor_today 七階すか、今日
yeah_yeah うーん、そうそう
Hahaha ワーハハハー
what_time_did_you_come 何時頃来たんすか
just_now さっき
about_now さっきぐらいウアハハハハハハ、まじで
bit_late ちょっと遅なってんやん、アハ
just_in_time ぎりぎりー
not_really いや、そういうわけじゃないねんけど
yeah_yeah_yeah はーいあいはいはい
Umm うん
Yeah そっかそっか
So そう
came_by_bike あたし自転車やから
from_Kyoubashi 京橋のほうじゃなかったですっけ
Really そうや
Yeah でしょう
Umm うん
from_Kyoubashi 京橋から
by_bike チャリンコですぐですか、あーそんなもんで来れるんやー
Yeah そう
Yeah うん、だいたい
Really あそうなーんすか
Yeah うん
NATR: next-generation advanced text rendering. Audio samples: • the original dialogue • ditto, synthesised • CHATR (& original) • NATR, large-corpus • NATR, more lively

  23. Acknowledgements • This work is supported by the National Institute of Information and Communications Technology (NiCT), and includes contributions from the Japan Science & Technology Corporation (JST) and the Ministry of Public Management, Home Affairs, Posts and Telecommunications, Japan (SCOPE). • The author is especially grateful to the management of ATR Spoken Language Communication Research Labs for their continuing encouragement and support.

  24. Thank you. Coming next: The Plone™

  25. Thank you
