Generating narrative speech for the Virtual Storyteller

Koen Meijs, Mariet Theune, Dirk Heylen* and others

Overview
• Background: The Virtual Storyteller
• Analysis of human storytellers
• Conversion rules and testing
• Implementation
• Evaluation
• Conclusions and future work

The Virtual Storyteller
Automatic story generation:
• Plot creation
• Natural language generation
• Storytelling

Plot creation
Characters in the story are (semi-)autonomous agents, which:
• Have their own personality, goals and emotions
• Can perform planned actions to reach their goals
• Are guided by a director agent

NLG and story presentation
• Language generation using simple sentence templates
• Story presentation by an embodied, speaking agent (using Microsoft Agents as a temporary solution)

Example story setting
NB: Visualisation is not part of the system yet!

Example story text
Diana walked to the forest. Brutus walked to the plains. Diana picked up the sword. Brutus walked to the desert. Diana walked to the desert. Brutus was afraid of Diana because Brutus saw that Diana had the sword. Brutus hit Diana. Diana was afraid of Brutus because Diana saw Brutus. Diana walked to the forest. Brutus was afraid of Diana because Brutus saw that Diana had the sword. Brutus walked to the forest. Diana stabbed the villain. And she lived happily ever after!!!

Storytellers’ speech
Human storytellers engage their audience through:
• A general “storytelling” speech style
• Different voices for characters
• Expression of emotions
• Different “sound effects”

Focus of this work
• General storytelling style
• Use of prosody to express suspense in stories

Analysis of human speakers
Global storytelling style, material from:
• newsreader (Onno Duyvené de Wit)
• children’s storyteller (Sacco van der Made)
• adult storyteller (Toon Tellegen)
Analysis (using PRAAT) mainly based on the children’s storyteller

Features
• Pitch
• Intensity
• Tempo (syllables per second)
• Pause duration
• Vowel length

Global storytelling style
Pitch / intensity:
• Averages are similar
• Standard deviation is much larger for the storyteller than for the newsreader
[Figure panels: newsreader; children’s storyteller]

Global storytelling style
• Tempo (syllables per second): the newsreader is much faster than both storytellers
• Pause duration: storyteller pauses are longer (esp. between sentences)
• Also: lengthening of certain adverbs/adjectives by the storyteller (“A long corridor that was s o low …”)
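The global-style measures listed above (average and standard deviation of pitch, tempo in syllables per second) could be computed along the following lines. This is only a sketch: the function names and the sample numbers are hypothetical and are not taken from the analysed recordings, and the pitch samples are assumed to have been exported from PRAAT as a plain list of values in Hz.

```python
from statistics import mean, pstdev


def pitch_stats(pitch_hz):
    """Return (average, standard deviation) of a list of pitch samples in Hz."""
    return mean(pitch_hz), pstdev(pitch_hz)


def tempo(n_syllables, speech_seconds):
    """Tempo in syllables per second."""
    return n_syllables / speech_seconds


# Hypothetical values, for illustration only.
average, deviation = pitch_stats([100, 120, 140])
rate = tempo(12, 4.0)
```

Two speakers with similar averages can then be compared on the standard deviation, which is where the slides report the main difference between newsreader and storyteller.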

Expressing suspense
• Sudden climax: an unexpected revelation. E.g., opening Bluebeard’s secret chamber: “She had to get used to the darkness, and then …”
• Increasing climax: building up expectation. E.g., finally finding Sleeping Beauty: “He opened the door and… there was the sleeping princess.”

Sudden climax
• “En toen…” / “And then…”
• Sudden rise in pitch and intensity on “then”
• Vowel lengthening in “then”

Increasing climax
• Two parts: (1) creating expectation, (2) revelation
• First part: increasing pitch and vowel duration
• Second part: more constant, lower pitch and intensity

Conversion rules
• Conversion from ‘neutral’ to ‘storytelling’ speech
• Rules based on analysis of human speakers
• Input: paired time-value data
• Output: new values for a given time domain

Example from storytelling style
• Pitch: increase the pitch of syllables carrying a sentence accent
• All pitch values inside the syllable’s time domain are multiplied by a certain factor (based on a sine function)
• Maximum increase of 40-90 Hz → best value to be determined experimentally

Determining constant values
• Material: speech produced by Fluency text-to-speech, manipulated using PRAAT scripts
• Five subjects compared 22 speech fragment pairs with different values for one constant
• Subjects had to indicate:
  • Which fragment sounded most natural, or
  • Which had the best expression of suspense

Results: storytelling style

Results: sudden climax
“Everybody waited in silence, and then ... there was a loud bang!”

Results: increasing climax
“Step by step he jumped from stone to stone, slipped on the last stone and… fell into the water.”
[Figures: pitch contour, neutral vs. manipulated]

Pilot test of conversion rules
• 16 speech fragments:
  • 8 ‘neutral’ (Fluency, with no manipulation)
  • 8 manipulated using PRAAT according to conversion rules, using best constant values
• Eight subjects rated storytelling quality, naturalness, and suspense on a five-point scale (subjects divided into two groups)

Pilot test results
Compared to neutral fragments:
• Storytelling quality of manipulated fragments was rated equal or better
• Naturalness of manipulated fragments was rated equal or less
• Manipulated fragments were rated as having more suspense, even if only the ‘global storytelling style’ was used

Implementation
annotated text input → partial synthesis (Fluency) → neutral prosodic information → application of conversion rules → narrative prosodic information → resynthesis (Fluency) → narrative speech
Prosodic information = list of phonemes with pitch and duration values (not possible to adjust intensity)

Example annotated text
Annotation: extension of SSML.
<speak>
  <style type="narrative"/>
  <s> The beard made him look <accent extend="yes"> so </accent> ugly that everybody ran away when they saw him. </s>
  <s> He wanted to turn around <climax type="sudden"> and then </climax> there was a loud bang. </s>
  <s> Bluebeard raised the big knife, <climax type="increasing"> he wanted to strike and <climax_top/> there was a knock on the door. </climax> </s>
</speak>
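As a rough illustration of how such annotated text might be read, the sketch below collects the accent and climax spans with Python's standard xml.etree.ElementTree parser. It assumes the attribute values are quoted so that the markup is well-formed XML; the function name find_marked_spans and the shortened sample text are hypothetical and are not part of the Virtual Storyteller implementation.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the slide, with quoted attribute values (assumption).
ANNOTATED = """<speak>
  <style type="narrative"/>
  <s>He wanted to turn around <climax type="sudden">and then</climax> there was a loud bang.</s>
  <s>The beard made him look <accent extend="yes">so</accent> ugly.</s>
</speak>"""


def find_marked_spans(xml_text):
    """Return (tag, attributes, text) for every accent or climax element."""
    root = ET.fromstring(xml_text)
    spans = []
    for elem in root.iter():  # document order
        if elem.tag in ("accent", "climax"):
            # Join nested text and normalise whitespace.
            text = " ".join("".join(elem.itertext()).split())
            spans.append((elem.tag, dict(elem.attrib), text))
    return spans


spans = find_marked_spans(ANNOTATED)
```

Each returned span identifies the words whose phonemes would then be looked up in the prosodic information.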

Example prosodic information
1: h 112
2: I: 151 50 75
3: R 75
4: l 75
5: @ 47 20 71 70 61
6: k 131
7: @ 55 80 70
8: _ 11 50 65
Each line gives:
• Phoneme
• Duration (ms)
• Pitch percentage (specifying at which point during the phoneme the pitch value should be applied)
• Pitch value
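A minimal sketch of reading one such line into a structure is given below. It assumes that the leading "1:", "2:" and so on are slide numbering rather than part of the data, and that any values after the duration come in (percentage, pitch) pairs; the Phoneme class and parse_phoneme_line function are hypothetical names, not Fluency's API.

```python
import re
from dataclasses import dataclass


@dataclass
class Phoneme:
    symbol: str
    duration_ms: int
    pitch_points: list  # (percentage, pitch value) pairs


def parse_phoneme_line(line):
    """Parse e.g. '5: @ 47 20 71 70 61' into a Phoneme."""
    # Drop an optional numeric index such as '5: ' (assumed slide numbering).
    line = re.sub(r"^\s*\d+:\s+", "", line)
    parts = line.split()
    symbol, duration = parts[0], int(parts[1])
    values = [int(v) for v in parts[2:]]
    if len(values) % 2:
        raise ValueError("pitch data must come in (percentage, value) pairs")
    return Phoneme(symbol, duration, list(zip(values[0::2], values[1::2])))


schwa = parse_phoneme_line("5: @ 47 20 71 70 61")
```

Note that a phoneme symbol such as "I:" itself ends in a colon, which is why only a purely numeric prefix is stripped.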

Conversion steps
• Parse XML
• Look up phonemes to be manipulated
• Apply function
  For example, pitch for global storytelling style:
  y(t) · (sin(((t − t1)/(t2 − t1)) · 0.5π + 0.25π) / n), where n = average pitch / 40
• Return adapted values
NB: intensity cannot be adapted in Fluency
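The pitch rule can be sketched as follows. The formula as transcribed gives only the sine term divided by n; this sketch assumes the multiplication factor is 1 plus that term, which matches the stated aim of raising the pitch by a bounded amount (about 40 Hz at average pitch). That reading is an assumption and is not confirmed by the slides. The function and parameter names are hypothetical; t1 and t2 are taken to be the start and end of the accented syllable.

```python
import math


def storytelling_pitch(points, t1, t2, avg_pitch, max_increase=40.0):
    """Raise pitch inside [t1, t2] along a sine-shaped curve.

    points: list of (time, pitch) pairs; values outside [t1, t2] are unchanged.
    """
    n = avg_pitch / max_increase  # n = average pitch / 40 on the slide
    adapted = []
    for t, y in points:
        if t1 <= t <= t2:
            # Phase runs from 0.25*pi at t1 to 0.75*pi at t2, peaking mid-syllable.
            phase = ((t - t1) / (t2 - t1)) * 0.5 * math.pi + 0.25 * math.pi
            y = y * (1 + math.sin(phase) / n)  # the "1 +" is an assumption
        adapted.append((t, y))
    return adapted


contour = [(0.0, 100.0), (0.5, 100.0), (1.0, 100.0), (2.0, 100.0)]
raised = storytelling_pitch(contour, t1=0.0, t2=1.0, avg_pitch=100.0)
```

Under this assumption, a value equal to the average pitch is raised by max_increase at the middle of the syllable and by roughly 0.71 times that amount at its edges.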

Evaluation of implementation
• Set-up similar to conversion rule pilot test
• 16 fragments (8 neutral / narrative pairs)
• 20 subjects, divided into two groups
• Rating storytelling quality, naturalness, and suspense on a five-point scale

Mean scores
Significant differences (p ≤ 0.05) are shown in bold face. Underlining indicates near significance.

Summing up the results
• Storytelling quality of manipulated fragments: rated above average, and better than neutral fragments (but hardly significant)
• Naturalness: ratings vary; some accents were seen as misplaced (though copied from the original fragment)
• Suspense of manipulated fragments: rated higher than neutral fragments (some significance)

Conclusions & future work
• Successful automatic conversion from standard text-to-speech to ‘storytelling prosody’
• Further improvement and larger-scale evaluation still needed
• Automatic derivation of features from text?
