Towards Conversational Speech Synthesis: "Lessons Learned from the Expressive Speech Processing Project"
Nick Campbell
NiCT / ATR-SLC, National Institute of Information and Communications Technology & ATR Spoken Language Communication Research Labs
Keihanna Science City, Kyoto 619-0288, Japan
nick@nict.go.jp, nick@atr.jp
The JST/CREST 'ESP' corpus • The ATR "Expressive Speech Processing" project (JST/CREST) ran from April 2000 to March 2005 and produced a corpus of 1,500 hours of natural conversational speech • All recordings were transcribed, and about 10% are annotated for speaking style, etc. • The corpus is divided into three sections: (i) esp_f, (ii) esp_c, and (iii) esp_m
Transcription example (speaker names anonymized); one "utterance" per line
Sections of the ESP corpus • esp_f • one female speaker, head-mounted mic, 600 hours of daily spoken interactions, annotated for emotion, speech-act, etc. • esp_c • 10 adult speakers (5 male, 5 female; 2 Chinese, 2 English) • 30-minute telephone conversations × 10 weeks • all conversations in Japanese, free content • esp_m • multi-speaker, head-mounted microphones, a variety of interaction settings (like esp_f but with many more voices)
Finding #1: the Function of Conversational Speech • To establish a rapport with the listener • To show interest and attention • To convey propositional content • Contrast "broadcast mode" (one-way) speech with "interactive mode" (two-way) speech • Speech synthesis can handle broadcast mode, but conversational speech is two-way!
The hundred most common utterances: backchannels & affect bursts
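Given transcripts with one utterance per line (as in the ESP transcription format above), the most frequent utterances can be tallied with a frequency count. This is a minimal sketch, not the project's actual tooling; the toy transcript below is invented for illustration.

```python
from collections import Counter

def most_common_utterances(lines, n=100):
    """Count utterance frequencies from a one-utterance-per-line transcript."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(n)

# Toy transcript: backchannels dominate conversational speech.
transcript = ["うん", "うん", "そうそう", "うん", "ほんま", "そうそう"]
top = most_common_utterances(transcript, n=3)
```

Run over the full corpus, a count like this is what surfaces the backchannels and affect bursts that top the frequency list.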
Non-Verbal Speech Sounds • Short, simple, repetitive noises • How they are spoken is usually more important than what is being said • Some examples: • The word ほんま • Means “really” • Used a lot in Osaka conversations …
Synthesis of Non-Verbal Speech Sounds • The challenge now is how to synthesise these non-lexical speech sounds • the same speaker says the same word in many consistently different ways … • How should they best be (a) described (b) realised?
Tap-to-talk http://feast.atr.jp/imode
Characteristics of Non-Verbal Utterances • Better described by icons? • Short, expressive sounds • Phonetically ambiguous • Prosodically marked • Not well specified by text input! • But frequent and textually ‘transparent’
'Wrappers' and 'Fillings' - Interaction Devices • Often used as "edge-markers" • At beginning and end of utterance chunks • Add expressivity to propositional content • Not just "fillers" – they 'wrap' the utterance, e.g., "erm, it's very simple, you know"
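The edge-marker idea can be sketched as a simple check for wrapper tokens at the start and end of an utterance chunk. The wrapper inventory below is illustrative only, not the ESP project's actual list.

```python
# Hypothetical filler/wrapper inventory -- illustrative, not the project's list.
WRAPPERS = {"erm", "um", "you know", "well", "like"}

def wrapper_edges(utterance):
    """Return (opening_wrapper, closing_wrapper) for an utterance that is
    'wrapped' by interaction devices at its edges; None where absent."""
    tokens = [t.strip(",.!?") for t in utterance.lower().split()]
    opening = tokens[0] if tokens and tokens[0] in WRAPPERS else None
    closing = tokens[-1] if tokens and tokens[-1] in WRAPPERS else None
    # Multi-word wrappers such as "you know" need a phrase-level check.
    text = utterance.lower().rstrip(",.!?")
    if text.endswith("you know"):
        closing = "you know"
    return opening, closing
```

On the slide's own example, `wrapper_edges("erm, it's very simple, you know")` picks out "erm" as the opening wrapper and "you know" as the closing one.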
The Acoustic Features of Wrappers (and Fillers) • Prosodically very variable, in more than just pitch & duration … • PCA dimension reduction shows • 3 components account for more than 50% of the variance • 7 components account for more than 80% • Voice quality comes up in the 1st component!
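The kind of PCA summary quoted above (cumulative variance per component) can be computed via SVD on a centred feature matrix. This sketch uses synthetic random data in place of the real prosodic/voice-quality measurements, so its ratios will not match the paper's figures.

```python
import numpy as np

def explained_variance_ratio(X):
    """PCA via SVD on a (samples x features) matrix; returns the fraction
    of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)
    # Singular values give per-component variance: var_i = s_i**2 / (n - 1)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()

# Synthetic stand-in for prosodic/voice-quality features (random data,
# NOT the ESP corpus measurements).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
ratios = explained_variance_ratio(X)
cumulative = np.cumsum(ratios)
```

On the real wrapper data, the slide reports roughly 50% of variance in 3 components and 80% in 7, with voice quality loading on the first.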
Voice Quality in Synthesis • Chakai – affect-based unit selection • using "whole-phrase" units • that vary according to expressivity • selected by their acoustics (principal components) • They show affective relationships • and serve a pragmatic (phatic) function
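Selecting a whole-phrase unit "by its acoustics" can be sketched as a nearest-neighbour lookup in principal-component space. Chakai's actual selection criteria are not detailed here; the unit inventory and coordinates below are invented for illustration.

```python
import numpy as np

def select_unit(target_coords, unit_coords, unit_ids):
    """Pick the stored whole-phrase unit whose principal-component
    coordinates lie closest (Euclidean) to the requested target."""
    d = np.linalg.norm(unit_coords - target_coords, axis=1)
    return unit_ids[int(np.argmin(d))]

# Toy inventory: three renditions of the same phrase, placed at invented
# points in a 2-D "expressivity" space.
units = np.array([[0.0, 0.0],     # neutral
                  [1.0, 0.5],     # interested
                  [-1.0, -0.5]])  # bored
ids = ["honma_neutral", "honma_interested", "honma_bored"]
```

A request near the "interested" region then retrieves the interested-sounding rendition of ほんま rather than the neutral one.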
Touch-sensitive Selection • One big advantage of using a MIDI keyboard is touch-sensitivity – controlled sustain & attack (perfect for the natural input of prosody) • with pitch-bend as well … • Another is that keys can be intuitively grouped into related sets of utterances
Octave or sub-octave clusters … 5 black & 7 white keys per octave: greetings, replies, opinions, calling, etc. …
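The keyboard idea above can be sketched as a mapping from MIDI key presses to utterance groups, with key velocity reinterpreted as vocal intensity. The note-to-group layout below is hypothetical, not the demo's actual key assignment.

```python
# Hypothetical mapping from MIDI note numbers to utterance groups --
# illustrative only, not the actual keyboard layout of the demo.
KEY_GROUPS = {
    60: "greeting",  # middle C
    62: "reply",
    64: "opinion",
    65: "calling",
}

def midi_to_utterance(note, velocity):
    """Map a MIDI key press to an utterance group plus an intensity
    value in [0, 1] derived from key velocity (0-127)."""
    group = KEY_GROUPS.get(note, "unknown")
    intensity = velocity / 127.0
    return group, round(intensity, 2)
```

Velocity then carries the "how it is said" dimension: a hard strike on the greeting key would select a louder, more emphatic rendition than a soft one.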
Grouping Related Utterances • It remains as future work to group related utterance types and plan a full keyboard for non-verbal speech sound synthesis • Demo software is provided on the CD-ROM proceedings – please let me know if you have any helpful ideas or suggestions :-)
Summary • Non-verbal sounds offer a challenge for the synthesis of interactive speech • They are frequent and carry important affective and discourse-flow information • Segments can be selected and reused from a conversational speech corpus
Conclusion • This paper has presented some examples of non-linguistic uses of speech prosody • Synthesis of expressive sounds is easy! • ‘Units’ can be whole phrases • But unit selection is difficult! • They carry subtle differences of meaning • That can be very hard to specify in text
Listen: • Some examples of conversational speech • (a) taken from the corpus (natural) • (b) synthesised using current technology • (c) concatenated from a very-large corpus • Listen to the non-linguistic prosody!
Sample dialogue (English gloss – Japanese original):
Morning – もしもし
Morning – もしもし
Hello – こんにちは
hi_there – まいど
Haha – ハハハ
been_a_long_time – 久しぶりですねー
came_straight_to_the_eighth_floor – もう直接、八階の方に、はい
Really – あ、そうなん
Really – あ、ほんま
seventh_floor_today – 七階すか、今日
yeah_yeah – うーん、そうそう
Hahaha – ワーハハハー
what_time_did_you_come – 何時頃来たんすか
just_now – さっき
about_now – さっきぐらいウアハハハハハハ、まじで
bit_late – ちょっと遅なってんやん、アハ
just_in_time – ぎりぎりー
not_really – いや、そういうわけじゃないねんけど
yeah_yeah_yeah – はーいあいはいはい
Umm – うん
Yeah – そっかそっか
So – そう
came_by_bike – あたし自転車やから
from_Kyoubashi – 京橋のほうじゃなかったですっけ
Really – そうや
Yeah – でしょう
Umm – うん
from_Kyoubashi – 京橋から
by_bike – チャリンコですぐですか、あーそんなもんで来れるんやー
Yeah – そう
Yeah – うん、だいたい
Really – あそうなーんすか
Yeah – うん
NATR (next-generation advanced text rendering) audio versions: • The original dialogue • ditto – synthesised • CHATR (& original) • NATR – large-corpus • NATR – more lively
7. Acknowledgements • This work is supported by the National Institute of Information and Communications Technology (NiCT), and includes contributions from the Japan Science & Technology Corporation (JST), and the Ministry of Public Management, Home Affairs, Posts and Telecommunications, Japan (SCOPE). • The author is especially grateful to the management of ATR Spoken Language Communication Research Labs for their continuing encouragement and support.
Thank you