Explore the evolution of speech synthesis from early speech synthesizers to modern TTS systems. Learn about different synthesis methods and the components of a modern TTS system. Discover the challenges and advancements in producing natural sounding speech.
Speech Synthesis: Then and Now Julia Hirschberg CS 4706
Today • Early speech synthesizers • Articulatory synthesis • Formant (acoustic) synthesis • Concatenative synthesis • Components of a Modern TTS System
Synthesizer Components • Front end: From input to control parameters • From acoustic/phonetic representations • From naturally occurring text • From constrained mark-up language • From semantic/conceptual representations • Back end: From control parameters to waveform • Articulatory synthesis • Formant/acoustic synthesis • Concatenative synthesis
The First ‘Speaking Machine’ • Wolfgang von Kempelen, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, 1791 (machine still in the Deutsches Museum and playable) • First to produce whole words, phrases – in many languages
Constructed 1835 w/pedal and keyboard control • Whispered and ordinary speech • Model of tongue, pharyngeal cavity with changeable shape • Singing too “God Save the Queen” • Modern Articulatory Synthesis: Dennis Klatt (1987)
World’s Fair in NY, 1939 • Requires much training to ‘play’ • Purpose: reduce bandwidth needed to transmit speech, so many phone calls can be sent over single line
Answers: • These days a chicken leg is a rare dish. • It’s easy to tell the depth of a well. • Four hours of steady work faced us. • ‘Automatic’ synthesis from spectrogram – but can also use hand-painted spectrograms as input • Purpose: understand perceptual effect of spectral details
Formant/Resonance/Acoustic Synthesis • Parametric or resonance synthesis • Specify minimal parameters, e.g. f0 and first 3 formants • Pass electronic source signal thru filter • Harmonic tone for voiced sounds • Aperiodic noise for unvoiced • Filter simulates the different resonances of the vocal tract • E.g. • Walter Lawrence’s Parametric Artificial Talker (1953) for vowels and consonants • Gunnar Fant’s Orator Verbis Electris (1953) for vowels • Formant synthesis download (demo)
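The source-filter idea on this slide can be sketched in a few lines. This is a minimal illustration, not a reconstruction of any of the historical systems named above: it drives a cascade of standard second-order digital resonators (one per formant) with either an impulse train (voiced) or seeded random noise (unvoiced). The function names, the formant frequencies and bandwidths in `VOWEL_A`, and the 16 kHz sample rate are all illustrative assumptions.

```python
import math
import random

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator: one vocal-tract resonance (formant)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, n_samples, sample_rate):
    """Harmonic source for voiced sounds: one pulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def noise(n_samples, seed=0):
    """Aperiodic source for unvoiced sounds."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def synthesize(formants, f0_hz=None, duration_s=0.3, sample_rate=16000):
    """Pass a source through one resonator per (frequency, bandwidth) pair."""
    n = round(duration_s * sample_rate)
    signal = impulse_train(f0_hz, n, sample_rate) if f0_hz else noise(n)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal

# Illustrative (assumed) values for an /a/-like vowel: f0 plus three formants.
VOWEL_A = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
samples = synthesize(VOWEL_A, f0_hz=120.0)
```

Changing only `f0_hz` changes the pitch while the vowel quality stays put, and changing only the formant list changes the vowel, which is the separation of source and filter the slide describes.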
Synthesis by Computer • Beginnings ~1960; dominant from 1970 onward
Concatenative Synthesis • Most common type today • First practical application in 1936: British Phone company’s Talking Clock • Optical storage for words, part-words, phrases • Concatenated to tell time • E.g. • And a ‘similar’ example • Bell Labs TTS (1977) (1985)
Variants of Concatenative Synthesis • Inventory units • Diphone synthesis (e.g. Festival) • Microsegment synthesis • “Unit Selection” – large, variable units • Issues • How well do units fit together? • What is the perceived acoustic quality of the concatenated units? • Is post-processing on the output possible, to improve quality?
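As a rough illustration of the "how well do units fit together?" issue, the sketch below turns a phoneme string into diphone names and joins stored units with a linear crossfade, one simple form of post-processing at the joins. The helper names, the `sil` padding symbol, and the toy sample lists are assumptions for illustration; they are not taken from Festival or any other system.

```python
def to_diphones(phonemes):
    """Name the diphones (phone-to-phone transitions) covering a phoneme sequence."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

def crossfade_concat(units, overlap):
    """Concatenate sample lists, blending `overlap` samples at each join."""
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight ramps toward the incoming unit
            out[start + i] = (1.0 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out

names = to_diphones(["k", "ae", "t"])
joined = crossfade_concat([[1.0] * 4, [0.0] * 4], overlap=2)
```

A crossfade only hides amplitude discontinuities; mismatches in pitch or spectrum at the join still need either better unit choice or signal modification.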
TTS Production Levels: Back End and Front End • Orthographic input: The children read to Dr. Smith • World Knowledge text normalization • Semantics • Syntax word pronunciation • Lexical Intonation assignment • Phonology intonation realization • F0, amplitude, duration • Acoustics synthesis
Text Normalization • Reading is what W. hates most. • Reading is what Wilde hated most. • Have the students read the questions. • In 1996 she sold 1995 shares and deposited $42 in her 401(k). • The duck dove supply.
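The number example above shows why normalization needs context: 1996 and 1995 look alike but are read differently. A minimal sketch, assuming one crude rule (a four-digit number right after "in" is a year) and hypothetical helper names `cardinal`, `year_reading`, and `normalize`:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def cardinal(n):
    """Spell out an integer in the range 0..9999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + (" " + cardinal(rest) if rest else "")

def year_reading(n):
    """Read a four-digit year as two pairs, e.g. 1996 as 'nineteen ninety-six'."""
    if 2000 <= n < 2010:
        return cardinal(n)
    high, low = divmod(n, 100)
    if low == 0:
        return cardinal(high) + " hundred"
    if low < 10:
        return cardinal(high) + " oh " + ONES[low]
    return cardinal(high) + " " + cardinal(low)

def normalize(text):
    """Expand dollar amounts and bare numbers; leave everything else as-is."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        core = tok.rstrip(".,;:!?")
        trail = tok[len(core):]
        prev = tokens[i - 1].lower() if i > 0 else ""
        if re.fullmatch(r"\$\d{1,4}", core):
            n = int(core[1:])
            out.append(cardinal(n) + (" dollar" if n == 1 else " dollars") + trail)
        elif re.fullmatch(r"\d{4}", core) and prev == "in":
            out.append(year_reading(int(core)) + trail)  # assumed rule: "in" + 4 digits
        elif re.fullmatch(r"\d{1,4}", core):
            out.append(cardinal(int(core)) + trail)
        else:
            out.append(tok)
    return " ".join(out)

spoken = normalize("In 1996 she sold 1995 shares and deposited $42 in her 401(k).")
```

The sketch leaves `401(k)` untouched and does nothing for the homograph cases (Reading, read, dove), which need part-of-speech or wider context rather than token patterns.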
Intonation Assignment: Phrasing • Traditional: hand-built rules • Punctuation 234-5682 • Context/function word: no breaks after function word He went to dinner • Parse? She favors the nuts and bolts approach • Current: statistical analysis of large labeled corpus • Punctuation, POS window, utterance length, …
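A hand-built phrasing rule set of the traditional kind might look like the sketch below. It breaks after punctuation and, in a long unpunctuated stretch, allows a break only before a function word, never after one. The word list, the `max_len` threshold, and the function name are illustrative assumptions, not rules from any particular system.

```python
FUNCTION_WORDS = {"a", "an", "the", "to", "of", "in", "on", "at", "for", "with",
                  "and", "or", "but", "he", "she", "it"}

def assign_phrase_breaks(words, max_len=5):
    """Return the indices i such that a phrase break follows words[i]."""
    breaks = []
    since_break = 0
    for i, word in enumerate(words):
        since_break += 1
        is_last = i == len(words) - 1
        if word[-1] in ",.;:?!":
            # Rule 1: punctuation always marks a boundary.
            breaks.append(i)
            since_break = 0
        elif (not is_last
              and since_break >= max_len
              and word.lower() not in FUNCTION_WORDS
              and words[i + 1].lower() in FUNCTION_WORDS):
            # Rule 2: in a long stretch, break before a function word only.
            breaks.append(i)
            since_break = 0
    return breaks

sentence = "He went to dinner, and she stayed late at the office with the children."
breaks = assign_phrase_breaks(sentence.split())
```

Rules like these have no way to handle "the nuts and bolts approach", where "and" sits inside a modifier; that is the case the "Parse?" bullet raises.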
Intonation Assignment: Accent • Hand-built rules • Function/content distinction He went out the back door/He threw out the trash • Complex nominals: • Main Street/Park Avenue • city hall parking lot • Statistical procedures trained on large corpora • Contrastive stress, given/new distinction?
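The function/content rule, plus a crude stand-in for the given/new distinction, can be sketched as follows. Treating an exact repeated word as "given" is an assumption made for illustration (no lemmatization, no coreference), as are the word list and the function name.

```python
FUNCTION_WORDS = {"a", "an", "the", "to", "of", "in", "on", "at", "out", "for",
                  "and", "or", "but", "i", "he", "she", "it", "my", "me"}

def assign_accents(words):
    """Accent content words on first mention; deaccent function and repeated words."""
    seen = set()
    result = []
    for word in words:
        key = word.lower().strip(".,;:?!")
        is_content = key not in FUNCTION_WORDS
        accented = is_content and key not in seen  # repeated content word = "given"
        if is_content:
            seen.add(key)
        result.append((word, accented))
    return result

trash = assign_accents("He threw out the trash".split())
cat = assign_accents("I own a cat. My cat sleeps.".split())
```

The sketch gives "out" the same treatment in "went out the back door" and "threw out the trash", so a word-list rule cannot separate the two sentences in the slide's example pair.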
Intonation Assignment: Contours • Simple rules • ‘.’ = declarative contour • ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence • Well, how did he do it? And what do you know?
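The simple rules above translate almost directly into code. In this sketch the contour labels, the wh-word list, and the `window` size used for "at/near front of sentence" are assumptions chosen for illustration.

```python
WH_WORDS = {"who", "whom", "whose", "what", "when", "where", "why", "which", "how"}

def assign_contour(sentence, window=3):
    """Pick a contour label from final punctuation and sentence-initial wh-words."""
    text = sentence.strip()
    if not text.endswith("?"):
        return "declarative"
    words = [w.strip(".,;:?!\"'").lower() for w in text.split()]
    if any(w in WH_WORDS for w in words[:window]):
        return "wh-question"
    return "yes-no-question"

labels = [assign_contour(s) for s in (
    "I own a cat.",
    "Do you want to fly first class?",
    "Well, how did he do it?",
    "And what do you know?",
)]
```

The window is what lets "Well, how did he do it?" and "And what do you know?" be caught even though the wh-word is not the first token.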
Phonological and Acoustic Realization • Task: • Produce a phonological representation from phonemes and intonational assignment • Pitch contour aligned with text • Durations, intensity • Select best concatenative units from inventory • Post-process if needed/possible to smooth joins, modify pitch, duration, intensity, rate from original units • Produce acoustic waveform as output
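Selecting the "best" units is commonly framed as a search that trades off how well each candidate matches its target against how smoothly neighbors join. The sketch below is one such formulation under stated assumptions: units are toy dictionaries carrying only start and end F0, the two cost functions are invented for illustration, and the search is a plain dynamic-programming (Viterbi-style) pass.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (best unit sequence, total cost); candidates[i] lists units for slot i."""
    # best[i][j] = (cheapest cost of any path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], unit), back))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total

def f0_target_cost(target_f0, unit):
    """Assumed cost: distance between desired F0 and the unit's mean F0."""
    return abs(target_f0 - (unit["f0_start"] + unit["f0_end"]) / 2)

def f0_join_cost(prev, unit):
    """Assumed cost: F0 jump across the join."""
    return abs(prev["f0_end"] - unit["f0_start"])

targets = [120, 130]  # desired F0 (Hz) per slot, illustrative
candidates = [
    [{"id": "a1", "f0_start": 118, "f0_end": 122},
     {"id": "a2", "f0_start": 100, "f0_end": 160}],
    [{"id": "b1", "f0_start": 160, "f0_end": 100},
     {"id": "b2", "f0_start": 125, "f0_end": 135}],
]
path, total = select_units(targets, candidates, f0_target_cost, f0_join_cost)
chosen = [u["id"] for u in path]
```

In the toy data, `b1` and `b2` match the second target equally well; the join cost is what makes the search prefer `a1` followed by `b2`. Whatever mismatch remains after selection is what the post-processing step then has to smooth.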
TTS: Where are we now? • Natural sounding speech for some utterances • Where good match between input and database • Still…hard to vary prosodic features and retain naturalness • Yes-no questions: Do you want to fly first class? • Context-dependent variation still hard to infer from text and hard to realize naturally:
Appropriate contours from text • Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me. • Variation in pitch range, rate, pausal duration to convey topic structure • Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative…. • How to mimic real voices? • ScanSoft/Nuance demo
Next Week • Pronunciation Modeling for speech synthesis • Hwk 2 due