Speech Synthesis in the SPACE Reading Tutor Closing Symposium of the SPACE Project 06 FEB 2009

Yuk On Kong, Lukas Latacz, Werner Verhelst, Laboratory for Digital Speech and Audio Processing, Vrije Universiteit Brussel

Introduction

To Record or Not to Record: That’s the question. • Pre-recorded speech in existing reading tutors • Advantages / disadvantages?

Application-specific TTS • Speaker / voice • Material in speech corpus • How to synthesize • Any extra mode necessary? • The child is too slow… • How to maximize quality

Speaker / Voice Speaker • Appealing to children • Female speaker • Standard Flemish pronunciation, no noticeable regional accent • Experienced speaker

Material in Speech Corpus Database (about 6 hours) • Material from stories for children • Words expected at 6 years of age • Diphones

How to synthesize • Based on the general unit selection paradigm. • Heterogeneous units: units could be of various sizes • Bases: • Use of longer chunks leads to quality improvement. • Used for synthesizing domain-specific utterances. Fig. Word “oma” to synthesize and its multi-tier segmentation: word (oma), syllables (o, ma) and segments (_-o, o-m, m-a)

How to synthesize • Basic algorithm: • Search top-down and select longest sequence of targets at each level and go to lower levels if no candidates are found. • Coarticulation: • Even across word boundaries • Level: diphone, syllable, word, phrase
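The top-down search can be illustrated with a minimal sketch. It is not the project's implementation: the `Target` structure, the `select_units` function and the toy database are hypothetical names chosen for illustration, and attaching each diphone to exactly one syllable is a simplifying assumption. A real system would return several candidates per unit and leave the final choice to a cost-based search.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Target:
    """One node of a multi-tier target description (hypothetical structure)."""
    level: str                      # e.g. "word", "syllable", "diphone"
    label: str                      # e.g. "oma", "ma", "o-m"
    children: List["Target"] = field(default_factory=list)


def select_units(target: Target, database: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Search top-down: use the longest unit found, else descend one level."""
    if target.label in database.get(target.level, set()):
        return [(target.level, target.label)]
    if not target.children:
        raise LookupError(f"no candidate for {target.level} '{target.label}'")
    units: List[Tuple[str, str]] = []
    for child in target.children:
        units.extend(select_units(child, database))
    return units


def oma_target() -> Target:
    """Target tree for the word "oma" (toy assumption: one syllable per diphone)."""
    return Target("word", "oma", [
        Target("syllable", "o", [Target("diphone", "_-o")]),
        Target("syllable", "ma", [Target("diphone", "o-m"), Target("diphone", "m-a")]),
    ])


if __name__ == "__main__":
    # The word "oma" is missing from this toy database, the syllable "o" is present.
    db = {"word": set(), "syllable": {"o"}, "diphone": {"_-o", "o-m", "m-a"}}
    print(select_units(oma_target(), db))
```

With this toy database the word is not found, so the search falls back to the syllable “o” and, for “ma”, to its two diphones.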

How to synthesize • Example input: “Als het flink vriest, kunnen we schaatsen.” • Front-end: Tokenisation, Text Normalisation, Phrase and Pause Prediction, Part of speech, Word Pronunciation, Silence Insertion, ToDI Intonation, Word Accent • Back-end: Unit Selection, Unit Concatenation, Speech DB

How to synthesize • Entries marked with a * are also calculated for the neighboring segments, syllables or words. • “Neighboring syllables” are restricted to the syllables of the current word. • For segments and words, three neighbors on the left and three on the right are taken into account. • Target prosody is described symbolically. • Best sequence of units is selected: • Weighted sum of target and join costs • Viterbi search • Joins: • Costs based on spectrum, pitch, energy, duration and adjacency • PSOLA-based algorithm with optimal coupling
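The selection of the best unit sequence can be sketched as a textbook Viterbi search over a weighted sum of target and join costs. This is only an illustration under stated assumptions: the function names, the tuple layout `(name, corpus_position, pitch)` and the two toy cost functions are hypothetical, and the actual costs (spectrum, pitch, energy, duration, adjacency) are reduced here to pitch and adjacency.

```python
def viterbi_select(candidates, target_cost, join_cost, w_target=1.0, w_join=1.0):
    """Return (best_path, total_cost) for one candidate list per target slot."""
    if not candidates or any(not slot for slot in candidates):
        raise ValueError("every target slot needs at least one candidate")
    # cost[i][k]: cheapest cost of any path ending in candidates[i][k]
    cost = [[w_target * target_cost(0, u) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row_cost, row_back = [], []
        for u in candidates[i]:
            # Accumulated cost of reaching u from each predecessor.
            options = [cost[i - 1][k] + w_join * join_cost(p, u)
                       for k, p in enumerate(candidates[i - 1])]
            best_k = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[best_k] + w_target * target_cost(i, u))
            row_back.append(best_k)
        cost.append(row_cost)
        back.append(row_back)
    # Backtrack from the cheapest final state.
    k = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][k]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    path.reverse()
    return path, total


TOY_TARGET_PITCH = [110.0, 110.0]   # hypothetical target pitch per slot (Hz)


def toy_target_cost(i, unit):
    """Toy target cost: pitch distance to the target, in units of 10 Hz."""
    return abs(unit[2] - TOY_TARGET_PITCH[i]) / 10.0


def toy_join_cost(prev, unit):
    """Toy join cost: free if the units were adjacent in the corpus."""
    if unit[1] == prev[1] + 1:
        return 0.0
    return 1.0 + abs(unit[2] - prev[2]) / 10.0


if __name__ == "__main__":
    slots = [[("a1", 10, 100.0), ("a2", 50, 120.0)],
             [("b1", 11, 110.0), ("b2", 70, 120.0)]]
    print(viterbi_select(slots, toy_target_cost, toy_join_cost))
```

In the toy data, `a1` and `b1` were adjacent in the corpus, so their join is free and the pair wins even though `a2` matches the target pitch equally well.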

Extra Modes? • Phoneme-by-phoneme mode • Stress • Syllable mode

Extra Modes? Demonstration: • Phoneme-by-phoneme • Stress on first phoneme • Syllable • Normal mode

The Child is Too Slow… Choosing the appropriate reading speed for the child • Uniform WSOLA time-scaling • Insertion of additional silences between neighboring words Reading along
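The effect of the two slow-down mechanisms on word timing can be sketched as follows. This is a simplified illustration, not the tutor's code: `slow_down`, `stretch` and `extra_pause` are hypothetical names, and the sketch only recomputes word start and end times (as a reading-along display might need) rather than processing audio.

```python
def slow_down(word_times, stretch=1.0, extra_pause=0.0):
    """Recompute word timings after uniform time-scaling and pause insertion.

    word_times:  list of (word, start_s, end_s) for the original synthesis
    stretch:     uniform time-scale factor (2.0 means twice as slow)
    extra_pause: seconds of silence inserted between neighboring words
    """
    if stretch <= 0 or extra_pause < 0:
        raise ValueError("stretch must be positive and extra_pause non-negative")
    result = []
    for index, (word, start, end) in enumerate(word_times):
        # Each word is delayed by one inserted pause per preceding word.
        shift = index * extra_pause
        result.append((word, start * stretch + shift, end * stretch + shift))
    return result


if __name__ == "__main__":
    original = [("dit", 0.0, 0.2), ("is", 0.2, 0.3), ("te", 0.3, 0.4)]
    for word, start, end in slow_down(original, stretch=2.0, extra_pause=0.5):
        print(f"{word}: {start:.2f} s to {end:.2f} s")
```

Keeping the time-scale factor and the inserted pauses as separate parameters mirrors the two mechanisms listed on the slide, so either can be used on its own.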

The Child is Too Slow… Fig. System diagram. Components: Synthesizer, Assessment, Error detection, Tracking, Teacher’s module, Synthesis module, Reading tutor, Playback module. Connections: Commands & Timing Info, Audio. Platforms: Cygwin, Windows XP.

How to Maximize Quality Major synthesis problems • Join artifacts • Inappropriate prosody Interactive tuning of synthesis • Assisted by quality management • User can make small changes to the input text

How to Maximize Quality Approach: • For each word, calculate average target and join costs • Predictor u_j: • Threshold based on max and min of cost c • u_j usually lies between 0 and 1 because of training settings. • Accept if u_j < 0.5 and reject otherwise. • Weights: linear regression • Best alphas found iteratively (maximizing F-score)
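The accept/reject rule and the F-score used as the tuning criterion can be sketched as below. The predictor formula itself is not reproduced here, so the sketch starts from already computed predictor values. The function names are hypothetical, and treating “reject” as the positive class for the F-score is an assumption, not something the slide states.

```python
def decide(u, threshold=0.5):
    """Accept a word if its predictor value is below the threshold."""
    return "accept" if u < threshold else "reject"


def f_score(predicted, actual, positive="reject"):
    """F1 score of `predicted` against `actual` for the chosen positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predictor_values = [0.1, 0.7, 0.4, 0.9]                   # made-up values
    human_labels = ["accept", "reject", "reject", "reject"]   # made-up labels
    decisions = [decide(u) for u in predictor_values]
    print(decisions, f_score(decisions, human_labels))
```

An iterative tuning loop would recompute the predictor values for each candidate set of alphas and keep the set with the highest F-score.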

Other Special Aspects • Phrase and Silence Prediction • Context-dependent Weight Training

Phrase and Silence Prediction Types of pauses: heavy, medium and light • Phrase breaks: both heavy and medium pauses Training • No manual labeling, but based on the pauses automatically labeled in the speech database • Iterative classification based on these pauses • Training of memory-based learner (features such as POS, punctuation, ...)

Context-dependent Weight Training Automatic adaptation (tuning) of weights Context-dependent weights • Context is described symbolically per phone Training: • Optimizing weights • Clustering optimized weights (decision trees)

Context-dependent Weight Training • 7 subjects • 4 conditions: • Randomly selected corpus; context-dependent weights • Randomly selected corpus; untrained weights • Corpus selected based on word frequency; context-dependent weights • Corpus selected based on word frequency; untrained weights • 25 test utterances, AVI 1-5 (5 utterances per level) • Results:

Demonstration Hierarchical unit selection: • AVI 1: “Dit is te gek, gilt ze.” • AVI 3: “Toch had hij liever de hond gehad.” • AVI 5: “Roel ligt nog een paar dagen in het ziekenhuis.” • AVI 7: “De kleine huizen staan dicht tegen elkaar aan.” • AVI 9: “Nou Henk, zie je nu wel dat je moeder hier fantastisch verzorgd wordt!”

WSOLA Fig. Illustration of the WSOLA strategy. Top: original signal. Bottom: WSOLA time-scaling.
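The WSOLA strategy can be sketched in a few lines of pure Python. This is a minimal illustration and not the project's implementation: the function name, the parameters `frame` and `tolerance`, the Hann window and the normalized cross-correlation used as the similarity measure are all assumptions made for the sketch. Each output frame is taken from the input near the time-warped position, at the offset within the tolerance that best matches the natural continuation of the previously copied frame, and the frames are then overlap-added at a fixed hop.

```python
import math


def wsola(x, rate, frame=256, tolerance=64):
    """Time-scale the sample list `x`; rate < 1 slows down, rate > 1 speeds up.

    `frame` should be even so that the 50% overlapping windows sum to one.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    hop = frame // 2                       # synthesis hop (50% overlap)
    last_start = len(x) - frame - hop      # last start with a full continuation
    if last_start < 0:
        raise ValueError("input too short for this frame size")
    # Periodic Hann window: copies shifted by `hop` add up to 1.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame) for n in range(frame)]
    n_frames = int(last_start / (hop * rate)) + 1
    out = [0.0] * ((n_frames - 1) * hop + frame)
    prev = 0                               # input start of the previous frame
    for m in range(n_frames):
        pos = 0
        if m > 0:
            # Ideal (time-warped) input position for this output frame.
            center = min(int(round(m * hop * rate)), last_start)
            # What would naturally follow the previously copied frame.
            template = x[prev + hop: prev + hop + frame]
            best_score = -math.inf
            lo = max(0, center - tolerance)
            hi = min(last_start, center + tolerance)
            for cand in range(lo, hi + 1):
                segment = x[cand:cand + frame]
                dot = sum(a * b for a, b in zip(template, segment))
                energy = sum(b * b for b in segment)
                score = dot / math.sqrt(energy + 1e-12)
                if score > best_score:
                    best_score, pos = score, cand
        for n in range(frame):
            out[m * hop + n] += window[n] * x[pos + n]
        prev = pos
    return out


if __name__ == "__main__":
    tone = [math.sin(2.0 * math.pi * n / 50.0) for n in range(2000)]
    slow = wsola(tone, 0.5, frame=128, tolerance=32)
    print(len(tone), "samples in,", len(slow), "samples out")
```

The exhaustive search over candidate offsets keeps the sketch short; a practical implementation would use a faster similarity search.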

Other Application • Audio-visual TTS • Example: “The sentence you hear is made out of many combinations of original sound and video, selected from the recordings of natural speech.” • Database containing about 20 minutes (LIPS Challenge ’08) • For better audio quality, the database should be much larger

Future Work • Optimizing synthesis • User feedback • Expressive speech synthesis • Automated prosodic annotations • Quality Management • Evaluation & optimization of the algorithm • Compare with the perceived quality of synthesized sentences (MOS)

Questions? • Thank you for your attention. • Acknowledgments: • Prof. Wivine Decoster (our speaker) • Jacques, Leen and other SPACE members • Wesley and other DSSP people • IWT

THE END
