FLST: Text-to-Speech Synthesis Bernd Möbius moebius@coli.uni-saarland.de http://www.coli.uni-saarland.de/courses/FLST/2013/
Speech synthesis: Ambition and dilemma • Ambition of speech synthesis: • modeling the production side of the most complex human cognitive ability • Dilemma of speech synthesis: • emulate a human speaker or reader, without • world knowledge • language comprehension • speech organs • achieve optimal intelligibility and naturalness • Speech synthesis: an impossible task!?
Mechanical systems Wolfgang von Kempelen (1770)
Mechanical systems Wolfgang von Kempelen (1791): speaking machine http://www.youtube.com/watch?v=zYRVqrfY3tQ
Electrical systems Dudley (1939): the Voder
Formant synthesis Gunnar Fant (1953): OVE I, serial filters John Holmes (1973): parallel filters
Formant synthesis • Acoustic-parametric synthesis • modeling the acoustic properties of speech sounds
Formant synthesis Prof. Stephen Hawking and speech synthesizer (DECtalk DTC01) DecTalk Infovox http://www.youtube.com/watch?v=J-8a55jeR-A (1:13 – 1:32) http://www.youtube.com/watch?v=wlrOKpQ6UBI
Articulatory synthesis Vocal Tract Lab (2007) http://www.vocaltractlab.de/ IP Köln (1995) • Articulatory synthesis • modeling components of the speech production system • voice source, articulators, 3D vocal tract, etc.
Synthesis methods • Acoustic-parametric synthesis • Articulatory synthesis • Concatenative synthesis • uses segments of natural speech, concatenated and resequenced to synthesize the intended utterance • e.g. diphone synthesis, unit selection synthesis, statistical parametric (HMM-based) synthesis
Diphone synthesis: Hadifix, Festival, SVOX, Bell Labs
Concatenative synthesis • Data-based, concatenative synthesis • offline: extraction of units from recordings of natural speech • online: selection and sequential concatenation of units • Which units are appropriate? • phones? [Ger: 45] • diphones? [Ger: 2025] • triphones? [Ger: 91,125] • syllables? [Ger: 12,500+]
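The bracketed counts for phones, diphones and triphones are consistent with powers of a 45-phone inventory (45, 45², 45³). The short sketch below reproduces that arithmetic; the function name is hypothetical. The counts are upper bounds that ignore phonotactic restrictions, and the syllable figure on the slide is an empirical estimate that this formula does not derive.

```python
def inventory_upper_bound(n_phones: int, unit_length: int) -> int:
    """Upper bound on the number of distinct units made of `unit_length` phones.

    Assumes every phone sequence is possible, i.e. ignores phonotactics.
    """
    return n_phones ** unit_length


GERMAN_PHONES = 45  # phone count quoted on the slide

for name, length in [("phones", 1), ("diphones", 2), ("triphones", 3)]:
    print(f"{name}: {inventory_upper_bound(GERMAN_PHONES, length):,}")
```

The cubic growth from diphones to triphones is one way to see why a fixed inventory must compromise between coverage and size.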
Concatenative, diphone synthesis • Synthesis by re-sequencing and concatenating selected units of natural speech (typically: diphones) + units comprise dynamic phone-to-phone transitions + units cover local coarticulatory effects − longer-range coarticulation not covered − signal processing required, at least for smoothing concatenations − signal processing required for prosodic modifications − compromise between coverage and inventory size • Standard synthesis technique in the 1990s • suboptimal naturalness • stable, predictable quality
Unit selection synthesis • Dynamic selection of units at synthesis run-time • "The best solution to the synthesizer problem is to avoid it." [Carlson & Granström, 1991] • overcome restrictions imposed by a fixed unit inventory • unit inventory: large corpus of recorded natural speech • select the smallest number of the longest units covering the target phone sequence • variable unit size (segments, syllables, words, ...) • reduce the perceived lack of naturalness caused by the number of concatenations and by signal processing
Unit selection synthesis • Inventory construction off-line and run-time unit selection • preserve natural speech as much as possible • ideal world: target utterance available in corpus • unfortunately: ideal case is extremely improbable, due to complexity/combinatorics of language and speech • however, longer units may be available in corpus • most extreme strategy (CHATR, Black & Taylor 1994 …) • no modification by signal processing • listener will tolerate occasional glitches, if overall synthesis quality approaches natural speech
Unit selection based on cost functions • Minimize two cost functions, simultaneously and globally (viz. for the entire utterance) • target cost (unit distortion): how suitable is the candidate? • concatenation cost (join cost, continuity distortion): how smooth is the concatenation with adjacent units?
Selection algorithm [Hunt & Black 1996] • sequence of target units • lattice of candidate units • target costs Ct, concatenation costs Cc • minimize Ct and Cc
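The global minimization over the candidate lattice can be sketched as a Viterbi-style dynamic program. This is a minimal illustration under stated assumptions, not the implementation of any particular system: the function `select_units`, the unit tuples `(phone, f0, corpus_index)` and both toy cost functions are hypothetical, and every target is assumed to have at least one candidate.

```python
from typing import Callable, List, Sequence, Tuple


def select_units(targets: Sequence, candidates: Sequence[Sequence],
                 target_cost: Callable, concat_cost: Callable) -> Tuple[float, List]:
    """Return (total cost, unit sequence) minimizing the sum of Ct and Cc."""
    if not targets:
        return 0.0, []
    # best[i][j] = (cost of cheapest path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            row.append(min(
                (best[i - 1][k][0] + concat_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            ))
        best.append(row)
    # Backtrace from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return total, path[::-1]


# Toy data (hypothetical): targets are (phone, desired F0),
# candidates are (phone, F0, position in corpus).
targets = [("a", 100), ("b", 120), ("c", 110)]
candidates = [
    [("a", 100, 0), ("a", 105, 10)],
    [("b", 140, 1), ("b", 120, 20)],
    [("c", 110, 30), ("c", 112, 21)],
]


def toy_target_cost(t, u):
    return abs(t[1] - u[1]) / 10  # F0 mismatch


def toy_concat_cost(prev, u):
    return 0.0 if u[2] == prev[2] + 1 else 1.0  # adjacent in corpus: free join


total, path = select_units(targets, candidates, toy_target_cost, toy_concat_cost)
print(total, path)
```

In this toy lattice the search prefers a final unit that is slightly off target but adjacent in the corpus to its predecessor, which is the trade-off between target cost and concatenation cost that the two functions are meant to capture.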
Example: Word-based unit selection • Target utterance: I have time on Monday. • Step 1: tabulate all candidate words for target utterance • [Figure: lattice of candidate words for each word of the target utterance, between start node S and end node E; axis: direction of nodes (time)]
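As a rough illustration of word-based selection, the sketch below covers the target utterance with the fewest contiguous word chunks found in a corpus, so that the number of joins is minimal. It is a simplification: the only cost is the number of chunks, the function names are hypothetical, and the three corpus sentences are invented for this example and are not taken from the slides.

```python
from typing import List, Optional


def occurs(chunk: List[str], sentence: List[str]) -> bool:
    """True if `chunk` appears as a contiguous word sequence in `sentence`."""
    n = len(chunk)
    return any(sentence[k:k + n] == chunk for k in range(len(sentence) - n + 1))


def fewest_chunks(target: str, corpus: List[str]) -> Optional[List[str]]:
    """Cover `target` with as few contiguous corpus chunks as possible."""
    words = target.split()
    sentences = [s.split() for s in corpus]
    inf = float("inf")
    best = [0] + [inf] * len(words)   # best[i]: min chunks covering words[:i]
    back = [0] * (len(words) + 1)     # back[i]: start index of the last chunk
    for i in range(1, len(words) + 1):
        for j in range(i):
            if best[j] + 1 < best[i] and any(occurs(words[j:i], s) for s in sentences):
                best[i], back[i] = best[j] + 1, j
    if best[-1] == inf:
        return None  # some target word is missing from the corpus
    chunks, i = [], len(words)
    while i > 0:
        chunks.append(" ".join(words[back[i]:i]))
        i = back[i]
    return chunks[::-1]


# Invented toy corpus (assumption, for illustration only).
toy_corpus = [
    "I have to leave now",
    "do you have time on Friday",
    "see you on Monday",
]
print(fewest_chunks("I have time on Monday", toy_corpus))
```

A real system would weigh each candidate with target and concatenation costs rather than counting joins, but the lattice search has the same shape.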
Speech corpus design and size • Quality of speech corpus (recordings, annotation, coverage) has a tremendous effect on synthesis quality • Corpus size is the single most important quality factor • Some data points: • IBM/Cambridge: ~60 min. (ASR corpora) • CHATR: phonetically balanced sentences, radio news, isolated sentences: 40 min. (Eng.), 20 min. (Jap.) • "bring a novel … of their own choice" [Campbell 1999] • AT&T 1999: news stories and system prompts, ~2 hrs. • SmartKom: open+closed domains, 160 min. • typical corpus size today: 10++ hrs.
Unit Selection: demos • example speech output from several systems: • CHATR (1996) • AT&T (2001) • Festival (2004) • SmartKom (2005) • Loquendo (2010) • BOSS (pol., 2009)
Unit selection synthesis: Summary • Synthesis by re-sequencing and concatenating units selected at run-time from corpus of natural speech + facilitates long units without concatenation + reduces need for signal processing + preserves natural speech waveforms − tends to produce unstable, unpredictable quality − inflexible w.r.t. speaking style and speaker voice • Standard synthesis technique in the 2000s • in competition with HMM-based synthesis (statistical parametric speech synthesis, HMM = Hidden Markov model)
Unit selection vs. HMM-based synthesis • Unit selection approach • high-quality speech synthesized by concatenation of natural waveforms • building several voices requires large amount of speech data • HMM-based approach • probabilistic formulation of corpus-based synthesis • generate speech from a model • speech parameters generated from statistics • change of voice quality or speaker ID by transforming HMM parameters based on small amount of data
Statistical parametric synthesis: Summary • HMM-based synthesis system + trainable and flexible + small footprint + smooth and stable speech generation (too smooth?) − vocoder-based, buzzy "voice" quality • Research questions • how to parameterize speech waveforms? • how to model extracted parameter trajectories? • how to recover speech parameter trajectories? • how to improve voice source modeling?
TTS: System components • text → linguistic text analysis → prosody control → speech synthesis → synthetic speech
TTS: Processing tasks • English: Will this course on TTS end on 02-09-2014 at 5:45pm? • German: Endet dieser Kurs, TTS, am 9.2.1999 um 17.45 Uhr? ("Does this course, TTS, end on 9 February 1999 at 5:45 pm?") • properties of text • properties of voice
TTS: Processing tasks • input text: Will this course, on TTS, end on 02-09-2014 at 5:45pm? • text normalization: Will this course [comma] on TTS [comma] end on the ninth of February two thousand and fourteen at five forty-five p m [question mark] • pronunciation: _ wIl DIs kors On ti: ti: Es End On D@ naInT @v fEbru@ri: At faIv forti: faIv pi: Em _ • phrasing: ((_ wIl DIs kors) (?On ti: ti: Es) (?End On D@ naInT @v fEbru@ri:) (?At faIv forti: faIv pi: Em _))
TTS: Processing tasks • phrasing: ((_ wIl DIs kors) (?On ti: ti: Es) (?End On D@ naInT @v fEbru@ri:) (?At faIv forti: faIv pi: Em _)) • accents and boundary tones: * H- * H- * * * H- * * * H% • [Figure: F0 contour]
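The text normalization step for the example sentence can be sketched with a few regular expressions. This is a minimal illustration, not a description of any actual TTS front end: all function names are hypothetical, dates are assumed to be MM-DD-YYYY (as the slide's reading "ninth of February" implies), years are handled only up to 9999, and input values are assumed to be valid.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
IRREGULAR_ORDINALS = {1: "first", 2: "second", 3: "third", 5: "fifth",
                      8: "eighth", 9: "ninth", 12: "twelfth"}
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def cardinal(n: int) -> str:
    """Spell out 0..9999 in British style ('two thousand and fourteen')."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" and " + cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    if rest == 0:
        return cardinal(thousands) + " thousand"
    return cardinal(thousands) + " thousand" + (" and " if rest < 100 else " ") + cardinal(rest)


def ordinal(n: int) -> str:
    """Spell out day-of-month ordinals 1..31."""
    if n in IRREGULAR_ORDINALS:
        return IRREGULAR_ORDINALS[n]
    if n < 20:
        return ONES[n] + "th"
    tens, ones = divmod(n, 10)
    if ones == 0:
        return TENS[tens][:-1] + "ieth"  # twenty -> twentieth
    return TENS[tens] + "-" + ordinal(ones)


def normalize(text: str) -> str:
    def date(m):
        month, day, year = (int(g) for g in m.groups())  # assumes MM-DD-YYYY
        return f"the {ordinal(day)} of {MONTHS[month - 1]} {cardinal(year)}"

    def clock(m):
        hour, minute = int(m.group(1)), int(m.group(2))
        spoken = "" if minute == 0 else " " + ("oh " if minute < 10 else "") + cardinal(minute)
        return f"{cardinal(hour)}{spoken} {m.group(3)} m"

    text = re.sub(r"\b(\d{2})-(\d{2})-(\d{4})\b", date, text)
    text = re.sub(r"\b(\d{1,2}):(\d{2})\s*([ap])m\b", clock, text)
    return text.replace(",", " [comma]").replace("?", " [question mark]")


print(normalize("Will this course, on TTS, end on 02-09-2014 at 5:45pm?"))
```

The hard part in practice is the ambiguity that this sketch assumes away: the same digit string would be read as the second of September under a day-first convention, which is why normalization depends on properties of the text and its language.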
TTS: Linguistic text analysis • text normalization • lexical & morphological analysis • lexicon lookup • morphological analysis • syntactic analysis • prosodic analysis • phrasing • accenting • phonological analysis • pronunciation • syllabification
Morphology: Derivation and Compounding • Problem for TTS: unknown words (i.e. words not explicitly listed in the system's dictionary) • unlimited vocabulary • practically unlimited lists of (e.g.) names • productive word formation processes • productive compounding (e.g. German) • Donaudampfschiffahrtsgesellschaftskapitän • Unerfindlichkeitsunterstellung • Oberweserdampfschiffahrtsgesellschaftskapitänsmützenberatungsteekränzchen • Morphological analysis of compounds and other "unknown" words is indispensable in TTS
Morphological word model: WFST • Example: decomposing Unerfindlichkeitsunterstellung • (correct) morphological decomposition: un[pref] + er[pref] + f'ind[root] + lich[suff] + keit[suff] + s[fuge] + unter[pref] + st'ell[root] + ung[suff] [#] <3.2> • WFST:
Segment of finite-state grammar for decomposing morphologically complex words in German
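The idea of picking the cheapest analysis among competing decompositions can be sketched without a WFST toolkit, using a dynamic program over a small morph lexicon. This is only a stand-in for a shortest-path search through a weighted transducer: the lexicon, the costs and the function name are hypothetical (the costs are not the weights behind the <3.2> on the slide), and unlike a real morphological grammar the sketch does not check that prefixes, roots and suffixes occur in a legal order.

```python
from typing import Optional, Tuple

# Hypothetical toy lexicon: morph -> (category, cost). "erfind" is included
# as a competing, more expensive analysis of the same substring.
MORPHS = {
    "un": ("pref", 0.3), "er": ("pref", 0.3), "unter": ("pref", 0.3),
    "find": ("root", 0.5), "stell": ("root", 0.5), "erfind": ("root", 1.0),
    "lich": ("suff", 0.2), "keit": ("suff", 0.2), "ung": ("suff", 0.2),
    "s": ("fuge", 0.4),
}


def decompose(word: str) -> Optional[Tuple[float, str]]:
    """Return (cost, analysis) of the cheapest segmentation, or None."""
    n = len(word)
    inf = float("inf")
    # best[i] = (cheapest cost of segmenting word[:i], start of the last morph)
    best = [(0.0, 0)] + [(inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in MORPHS and best[start][0] < inf:
                cost = best[start][0] + MORPHS[piece][1]
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[n][0] == inf:
        return None  # no segmentation with the known morphs
    parts, end = [], n
    while end > 0:
        start = best[end][1]
        piece = word[start:end]
        parts.append(f"{piece}[{MORPHS[piece][0]}]")
        end = start
    return best[n][0], " + ".join(reversed(parts))


print(decompose("Unerfindlichkeitsunterstellung".lower()))
```

With these toy costs the analysis un + er + find is cheaper than un + erfind, so the output mirrors the decomposition shown on the slide, minus the stress marks.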
Syllable model • Approx. 12,500 distinct syllables in English, German (some say >40k) • despite phonotactic restrictions on phone combinations • most syllables are lexically accounted for (names!) • Implementation of syllable model as finite-state grammar (Bell Labs TTS) • syllabification of phone sequences in phonological component • syllable model as part of morphological word model, operating on annotated orthography • (application: hyphenation of orthographic words)
TTS: Prosody control • duration modeling • segmental durations • syllable durations • pause durations • local speaking rate • intonation modeling • phrasing • accenting • amplitude modeling
TTS: Synthesis • concatenative synthesis • unit selection • unit concatenation • or rule-based synthesis • acoustic trajectories • articulatory trajectories • signal generation synthetic speech signal
Required and suggested reading • TTS overview paper: • Robert Clark, Korin Richmond, Simon King (2007): "Multisyn: Open-domain unit selection for the Festival speech synthesis system". Speech Communication 49, 317-330. • TTS textbook (not required for this class): • Paul Taylor (2009): Text-to-Speech Synthesis. Cambridge University Press.
Exercises (to prepare for Dec 13) TTS systems: overview and quality assessment Look for demo pages of commercial and non-commercial TTS systems on the Web, in particular systems offering interactive demos. Try to assess the overall quality of these TTS systems. Select two systems for a side-by-side comparison of their performance. Alternatively, select two or three languages rendered by the same TTS system. Try to perform a diagnostic evaluation of TTS system components. Design test sentences to test the performance on different tasks, such as: resolution of complex alphanumeric expressions (e.g. dates, times, currency), pronunciation of names, pronunciation of complex words (e.g. compounds), prosodic phrasing and accenting, sentence mode detection, etc. Take note of the strengths and weaknesses of the systems and try to determine which system component is responsible for which mistakes.