Speech Synthesis in the SPACE Reading Tutor. Closing Symposium of the SPACE Project, 06 FEB 2009. Yuk On Kong, Lukas Latacz, Werner Verhelst. Laboratory for Digital Speech and Audio Processing, Vrije Universiteit Brussel.
To Record or Not to Record: That’s the question. • Pre-recorded speech in existing reading tutors • Advantages / disadvantages?
Application-specific TTS • Speaker / voice • Material in speech corpus • How to synthesize • Any extra mode necessary? • The child is too slow… • How to maximize quality
Speaker / Voice Speaker • Appealing to children • Female speaker • Standard Flemish pronunciation, no noticeable regional accent • Experienced speaker
Material in Speech Corpus Database (about 6 hours) • Material from stories for children • Words expected at 6 years of age • Diphones
How to synthesize • Based on the general unit selection paradigm. • Heterogeneous units: units can be of various sizes. • Rationale: • Use of longer chunks leads to quality improvement. • Used for synthesizing domain-specific utterances. Fig. The word “oma” to be synthesized and its multi-tier segmentation into word (oma), syllables (o, ma) and segments (_-o, o-m, m-a).
How to synthesize • Basic algorithm: • Search top-down: at each level, select the longest sequence of targets for which candidates exist, and go to the next lower level only where no candidates are found. • Coarticulation: • Even across word boundaries • Levels: diphone, syllable, word, phrase
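The top-down search can be illustrated with a small sketch. It is a simplification under stated assumptions, not the project's implementation: the `Target` class, the `select_units` function and the toy inventory are hypothetical, and each target is looked up as a single unit rather than as the longest matching sequence of targets.

```python
from dataclasses import dataclass, field

@dataclass
class Target:
    level: str                      # "word", "syllable" or "diphone"
    label: str
    children: list = field(default_factory=list)

def select_units(target, inventory):
    """Return (level, label) pairs, preferring the highest level that has a candidate."""
    if target.label in inventory.get(target.level, set()):
        return [(target.level, target.label)]
    if not target.children:
        raise LookupError(f"no candidate for {target.level} '{target.label}'")
    units = []
    for child in target.children:   # back off to the next lower level
        units.extend(select_units(child, inventory))
    return units

# Hypothetical toy inventory: the labels that have candidates in the speech database.
inventory = {
    "word": {"oma"},
    "syllable": {"o"},
    "diphone": {"_-o", "o-m", "m-a", "o-p", "p-a"},
}

def word(label, syllables):
    """Build a word target from (syllable label, diphone labels) pairs."""
    return Target("word", label, [
        Target("syllable", s, [Target("diphone", d) for d in ds])
        for s, ds in syllables
    ])

oma = word("oma", [("o", ["_-o"]), ("ma", ["o-m", "m-a"])])
opa = word("opa", [("o", ["_-o"]), ("pa", ["o-p", "p-a"])])

print(select_units(oma, inventory))  # the whole word is available
print(select_units(opa, inventory))  # falls back to a syllable and two diphones
```

In this toy setting "oma" is taken as one long chunk, while "opa" is assembled from smaller units, which is where additional joins (and potential join artifacts) come from.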
How to synthesize • Example input: “Als het flink vriest, kunnen we schaatsen.” (“If it freezes hard, we can go skating.”) • Front-end modules: Tokenisation, Text Normalisation, Phrase and Pause Prediction, Part of speech, Word Pronunciation, Silence Insertion, ToDI Intonation, Word Accent • Back-end modules: Unit Selection, Unit Concatenation; Speech DB
How to synthesize Features marked with a * (feature table not preserved) are also calculated for the neighboring segments, syllables or words. “Neighboring syllables” are restricted to the syllables of the current word. For segments and words, three neighbors on the left and three on the right are taken into account. • Target prosody is described symbolically. • The best sequence of units is selected: • Weighted sum of target and join costs • Viterbi search • Joins: • Costs based on spectrum, pitch, energy, duration and adjacency • PSOLA-based algorithm with optimal coupling
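The selection step (a weighted sum of target and join costs, minimised with a Viterbi search) can be sketched as follows. This is a generic dynamic-programming illustration, not the project's code: `viterbi_select`, the cost functions and the toy units (database position, pitch) are all assumptions made for the example.

```python
def viterbi_select(candidates, target_cost, join_cost, w_target=1.0, w_join=1.0):
    """candidates[i] is the non-empty list of candidate units for target position i.
    Returns (cheapest unit sequence, its total cost)."""
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j], backpointer)
    best = [[(w_target * target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + w_join * join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + w_target * target_cost(i, u), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total

# Toy units: (position in the database, pitch in Hz). Purely illustrative values.
candidates = [
    [(10, 100), (50, 120)],
    [(11, 130), (51, 118)],
    [(12, 110), (70, 121)],
]
target_pitch = [120, 120, 120]

def target_cost(i, unit):
    return abs(unit[1] - target_pitch[i]) / 10

def join_cost(prev, unit):
    # Units adjacent in the database join for free (the "adjacency" idea).
    return 0.0 if unit[0] == prev[0] + 1 else 1.0 + abs(prev[1] - unit[1]) / 10

path, total = viterbi_select(candidates, target_cost, join_cost)
print(path, round(total, 3))
```

The zero join cost for adjacent units is what makes longer contiguous chunks attractive: the search pays for a join only where it leaves a stretch of natural speech.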
Extra Modes? • Phoneme-by-phoneme mode • Stress • Syllable mode
Extra Modes? Demonstration: • Phoneme-by-phoneme • Stress on first phoneme • Syllable • Normal mode
The Child is Too Slow… Choosing the appropriate reading speed for the child • Uniform WSOLA time-scaling • Insertion of additional silences between neighboring words Reading along
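The effect of the two slowing strategies on word timing can be worked out with a few lines of arithmetic. The sketch below is an illustration only: `slow_down` and the example timings are hypothetical, and it models timing information, not the audio processing itself.

```python
def slow_down(words, scale=1.0, extra_pause=0.0):
    """words: list of (word, start_s, end_s) from the original synthesis.
    Returns new timings after uniform time-scaling by `scale` (2.0 = twice as slow)
    and insertion of `extra_pause` seconds of silence between neighboring words."""
    slowed = []
    for i, (word, start, end) in enumerate(words):
        shift = i * extra_pause          # silences inserted before this word
        slowed.append((word, start * scale + shift, end * scale + shift))
    return slowed

# Hypothetical timings for the first three words of an utterance.
words = [("als", 0.0, 0.25), ("het", 0.25, 0.5), ("flink", 0.5, 1.0)]
for entry in slow_down(words, scale=2.0, extra_pause=0.5):
    print(entry)
```

Timings of this kind are what a reading-along display would need in order to highlight each word while it is being played.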
The Child is Too Slow… [Figure: system block diagram. Components: Synthesizer, Assessment, Error detection, Tracking, Teacher’s module, Synthesis module, Reading tutor, Playback module. Connection labels: Commands & Timing Info, Audio. Platform labels: Cygwin, Windows XP.]
How to Maximize Quality Major synthesis problems • Join artifacts • Inappropriate prosody Interactive tuning of synthesis • Assisted by quality management • User can make small changes to the input text
How to Maximize Quality Approach: • For each word, calculate average target and join costs • Predictor u_j (formula not preserved): • Threshold based on max and min of cost c • u_j usually lies between 0 and 1 because of the training settings. • Accept if u_j < 0.5 and reject otherwise. • Weights: linear regression • Best alphas found iteratively (maximizing F-score)
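Since the predictor formula is not preserved, the following is only a guess at its general shape: a weighted sum of min-max-normalised per-word costs, accepted when below 0.5, together with the F-score used to compare settings. The function names, the normalisation, the weights and the cost ranges are all assumptions.

```python
def normalise(value, lo, hi):
    """Min-max normalisation of a cost to roughly [0, 1] (assumed form)."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0

def predict(word_costs, ranges, weights):
    """Return (u, accepted) for one word; a word is accepted if u < 0.5."""
    u = sum(weights[c] * normalise(word_costs[c], *ranges[c]) for c in weights)
    return u, u < 0.5

def f_score(predicted_bad, actual_bad):
    """F-score for detecting badly synthesised words."""
    tp = sum(p and a for p, a in zip(predicted_bad, actual_bad))
    fp = sum(p and not a for p, a in zip(predicted_bad, actual_bad))
    fn = sum(a and not p for p, a in zip(predicted_bad, actual_bad))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical cost ranges (min, max) and weights; in the talk the weights
# come from linear regression.
ranges = {"target": (0.0, 4.0), "join": (0.0, 10.0)}
weights = {"target": 0.5, "join": 0.5}

print(predict({"target": 1.0, "join": 2.0}, ranges, weights))  # low costs: accepted
print(predict({"target": 3.0, "join": 8.0}, ranges, weights))  # high costs: rejected
```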
Other Special Aspects • Phrase and Silence Prediction • Context-dependent Weight Training
Phrase and Silence Prediction Types of pauses: heavy, medium and light • Phrase breaks: both heavy and medium pauses Training • No manual labeling; training is based on the pauses automatically labeled in the speech database • Iterative classification based on these pauses • Training of a memory-based learner (features such as POS, punctuation, ...)
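A memory-based learner stores training instances and classifies a new instance by its nearest stored neighbors. The sketch below shows the idea with a plain k-nearest-neighbor classifier and an overlap distance over symbolic features. The feature layout (POS of the word, POS of the next word, following punctuation), the stored examples and the extra "none" label are hypothetical, and no particular toolkit is implied.

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions in which two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(instance, memory, k=1):
    """Majority label among the k stored examples closest to `instance`."""
    nearest = sorted(memory, key=lambda ex: overlap_distance(instance, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical memory: (POS, POS of next word, punctuation after word) -> pause type.
memory = [
    (("V", "PRON", ","), "medium"),
    (("N", "-", "."), "heavy"),
    (("DET", "N", ""), "none"),
    (("ADJ", "N", ""), "none"),
    (("V", "ADV", ","), "medium"),
]

print(classify(("N", "PRON", ","), memory))       # closest to a comma context
print(classify(("N", "-", "."), memory))          # exact match in memory
print(classify(("N", "PRON", ","), memory, k=3))  # vote among three neighbors
```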
Context-dependent Weight Training Automatic adaptation (tuning) of weights Context-dependent weights • Context is described symbolically per phone Training: • Optimizing weights • Clustering optimized weights (decision trees)
Context-dependent Weight Training Evaluation: 7 subjects, 4 conditions: • Randomly selected corpus; context-dependent weights • Randomly selected corpus; untrained weights • Corpus selected based on word frequency; context-dependent weights • Corpus selected based on word frequency; untrained weights Test material: 25 test utterances, AVI 1-5 (5 utterances per level) Results: [results figure not preserved]
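Demonstration Hierarchical unit selection: • AVI 1: “Dit is te gek, gilt ze.” (“This is crazy, she shrieks.”) • AVI 3: “Toch had hij liever de hond gehad.” (“Still, he would rather have had the dog.”) • AVI 5: “Roel ligt nog een paar dagen in het ziekenhuis.” (“Roel will be in hospital for a few more days.”) • AVI 7: “De kleine huizen staan dicht tegen elkaar aan.” (“The small houses stand close against one another.”) • AVI 9: “Nou Henk, zie je nu wel dat je moeder hier fantastisch verzorgd wordt!” (“Well Henk, now you can see that your mother is wonderfully cared for here!”)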
WSOLA [Figure: illustration of the WSOLA strategy. Top: original signal. Bottom: WSOLA time-scaling.]
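For readers who want to experiment, here is a minimal pure-Python sketch of the WSOLA idea: windows are overlap-added at a fixed output hop, and each input window is picked within a tolerance around its ideal position so that it best matches the natural continuation of the previously chosen window. The function name, the window and tolerance sizes and the plain cross-correlation similarity are assumptions for illustration, not the settings used in the reading tutor.

```python
import math

def wsola(x, alpha, win=256, tol=64):
    """Time-scale the sample list x by factor alpha (>1 slower, <1 faster)."""
    hop = win // 2
    # Periodic Hann window: copies overlapping by 50% sum to exactly 1.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win) for n in range(win)]
    n_out = int(len(x) * alpha)
    y = [0.0] * (n_out + win)
    prev, k = 0, 0
    while k * hop < n_out:
        out_start = k * hop
        if k == 0:
            start = 0
        else:
            ideal = int(out_start / alpha)      # where uniform scaling would read
            t0 = prev + hop                     # natural continuation of last window
            if t0 + win > len(x):
                break
            template = x[t0:t0 + win]
            start, best = None, -math.inf
            for c in range(max(0, ideal - tol), min(len(x) - win, ideal + tol) + 1):
                score = sum(a * b for a, b in zip(template, x[c:c + win]))
                if score > best:
                    start, best = c, score
            if start is None:                   # ran out of input
                break
        if start + win > len(x):
            break
        for n in range(win):
            y[out_start + n] += window[n] * x[start + n]
        prev = start
        k += 1
    return y[:n_out]                            # first half-window fades in

# A 100 Hz sine at an 8 kHz sampling rate, 0.25 s long.
signal = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(2000)]
slower = wsola(signal, 1.5)
faster = wsola(signal, 0.5)
print(len(signal), len(slower), len(faster))
```

Because the windows are chosen by waveform similarity rather than at fixed positions, the pitch of the speech is preserved while its duration changes. Any samples at the very end that the input cannot fill are left as silence in this sketch.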
Other Application • Audio-visual TTS • Example: “The sentence you hear is made out of many combinations of original sound and video, selected from the recordings of natural speech.” • Database containing about 20 minutes (LIPS Challenge ’08) • For better audio quality, the database should be much larger
Future Work • Optimizing synthesis • User feedback • Expressive speech synthesis • Automated prosodic annotations • Quality Management • Evaluation & optimization of the algorithm • Compare with the perceived quality of synthesized sentences (MOS)
Questions? • Thank you for your attention. • Acknowledgments: • Prof. Wivine Decoster (our speaker) • Jacques, Leen and other SPACE members • Wesley and other DSSP people • IWT