
Speech Synthesis: Then and Now

Explore the evolution of speech synthesis from early speech synthesizers to modern TTS systems. Learn about different synthesis methods and the components of a modern TTS system. Discover the challenges and advancements in producing natural-sounding speech.


Presentation Transcript


  1. Speech Synthesis: Then and Now Julia Hirschberg CS 4706

  2. Today • Early speech synthesizers • Articulatory synthesis • Formant (acoustic) synthesis • Concatenative synthesis • Components of a Modern TTS System

  3. Synthesizer Components • Front end: From input to control parameters • From acoustic/phonetic representations • From naturally occurring text • From constrained mark-up language • From semantic/conceptual representations • Back end: From control parameters to waveform • Articulatory synthesis • Formant/acoustic synthesis • Concatenative synthesis

  4. The First ‘Speaking Machine’ • Wolfgang von Kempelen, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, 1791 (in Deutsches Museum still and playable) • First to produce whole words, phrases – in many languages

  5. Joseph Faber’s Euphonia, 1846

  6. Constructed 1835 w/pedal and keyboard control • Whispered and ordinary speech • Model of tongue, pharyngeal cavity with changeable shape • Singing too “God Save the Queen” • Modern Articulatory Synthesis: Dennis Klatt (1987)

  7. World’s Fair in NY, 1939 • Requires much training to ‘play’ • Purpose: reduce bandwidth needed to transmit speech, so many phone calls can be sent over single line

  8. Answers: • These days a chicken leg is a rare dish. • It’s easy to tell the depth of a well. • Four hours of steady work faced us. • ‘Automatic’ synthesis from spectrogram – but can also use hand-painted spectrograms as input • Purpose: understand perceptual effect of spectral details

  9. Formant/Resonance/Acoustic Synthesis • Parametric or resonance synthesis • Specify minimal parameters, e.g. f0 and first 3 formants • Pass electronic source signal thru filter • Harmonic tone for voiced sounds • Aperiodic noise for unvoiced • Filter simulates the different resonances of the vocal tract • E.g. • Walter Lawrence’s Parametric Artificial Talker (1953) for vowels and consonants • Gunnar Fant’s Orator Verbis Electris (1953) for vowels • Formant synthesis download (demo)
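The source-filter idea on this slide can be sketched in a few lines. This is a minimal illustration, not a reconstruction of any of the historical synthesizers named above: an impulse train stands in for the harmonic voiced source, and each formant is a standard two-pole digital resonator applied in cascade. The function names and the formant frequencies and bandwidths are illustrative assumptions (roughly an /a/-like vowel), not values from the lecture.

```python
import math


def impulse_train(f0_hz, duration_s, sample_rate):
    """Voiced source: one unit impulse per pitch period (harmonic-rich)."""
    period = int(round(sample_rate / f0_hz))
    n_samples = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # unity gain at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, duration_s=0.3, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]  # normalize to the range [-1, 1]


# Assumed, illustrative parameters: f0 plus the first three formants.
samples = synthesize_vowel(120.0, [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)])
```

An unvoiced sound would replace the impulse train with aperiodic noise (for example, values drawn from `random.uniform(-1, 1)`) and keep the same filter stage.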

  10. Synthesis by Computer • Beginnings ~1960; dominant from 1970 onward

  11. Concatenative Synthesis • Most common type today • First practical application in 1936: British Phone company’s Talking Clock • Optical storage for words, part-words, phrases • Concatenated to tell time • E.g. • And a ‘similar’ example • Bell Labs TTS (1977) (1985)

  12. Variants of Concatenative Synthesis • Inventory units • Diphone synthesis (e.g. Festival) • Microsegment synthesis • “Unit Selection” – large, variable units • Issues • How well do units fit together? • What is the perceived acoustic quality of the concatenated units? • Is post-processing on the output possible, to improve quality?
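As a small illustration of the diphone inventory idea, the sketch below turns a phoneme sequence into the diphone units a concatenative back end would look up. A diphone runs from the middle of one phone to the middle of the next, so each unit spans one transition. The function name, the `#` silence marker, and the phoneme symbols are assumptions made for this example; they do not reflect Festival's actual interface.

```python
def to_diphones(phonemes, boundary="#"):
    """Return diphone labels for a phoneme sequence, padded with silence."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# "cat" as three phonemes (symbols are illustrative).
units = to_diphones(["k", "ae", "t"])
```

A sequence of N phonemes yields N + 1 diphones, which is why the inventory has to cover phone-to-phone transitions rather than the phones themselves.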

  13. TTS Production Levels: Back End and Front End • Orthographic input: The children read to Dr. Smith • Knowledge levels: world knowledge, semantics, syntax, lexical, phonology, acoustics • Processing stages: text normalization, word pronunciation, intonation assignment, intonation realization (F0, amplitude, duration), synthesis

  14. Text Normalization • Reading is what W. hates most. • Reading is what Wilde hated most. • Have the students read the questions. • In 1996 she sold 1995 shares and deposited $42 in her 401(k). • The duck dove supply.
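A rough sketch of what rule-based normalization might look like for the easier cases on this slide: a title abbreviation, a currency amount, a year, and a plain count. All rules, names, and word lists here are assumptions for illustration. The homograph cases ("Reading", "read", "dove") need part-of-speech or wider context and are deliberately not handled, and "401(k)" is left untouched because its reading is idiomatic.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])


def cardinal(n):
    """Spell out 0..9999 as a count."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def year(n):
    """Spell out 1100..1999 as a year, read as two pairs of digits."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def dollars(match):
    amount = int(match.group(1))
    return cardinal(amount) + (" dollar" if amount == 1 else " dollars")


def normalize(text):
    # "Dr." before a capitalized word is treated as the title, not "Drive".
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    text = re.sub(r"\$(\d{1,4})\b", dollars, text)
    # A four-digit number right after "in" is read as a year.
    text = re.sub(r"\b([Ii]n) (1[1-9]\d\d)\b",
                  lambda m: m.group(1) + " " + year(int(m.group(2))), text)
    # Remaining standalone numbers are counts; skip forms like "401(k)".
    text = re.sub(r"\b\d{1,4}\b(?!\()", lambda m: cardinal(int(m.group(0))), text)
    return text


normalized = normalize("In 1996 she sold 1995 shares and deposited $42 in her 401(k).")
```

The order of the rules matters: currency and years are rewritten before the generic count rule, so that "1996" and "1995" in the same sentence end up with different readings.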

  15. Pronunciation in Context

  16. Intonation Assignment: Phrasing • Traditional: hand-built rules • Punctuation 234-5682 • Context/function word: no breaks after function word He went to dinner • Parse? She favors the nuts and bolts approach • Current: statistical analysis of large labeled corpus • Punctuation, POS window, utterance length, …
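The hand-built rules above might be sketched as follows. This is only an illustration under stated assumptions: the function-word list is a tiny stand-in, and the length-based rule (split a long phrase after a content word when the next word is a function word) is a hypothetical heuristic added to show how "no breaks after function word" can constrain where a break goes.

```python
PUNCTUATION = ".,;:?!"
FUNCTION_WORDS = frozenset({
    "a", "an", "the", "to", "of", "in", "on", "at", "for", "with",
    "and", "or", "but", "he", "she", "it", "they", "we", "i", "you",
})


def bare(token):
    """Lower-case a token and strip surrounding punctuation."""
    return token.strip(PUNCTUATION).lower()


def assign_breaks(text, max_words=6):
    """Split text into intonational phrases using two simple rules."""
    tokens = text.split()
    phrases, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        is_last = i == len(tokens) - 1
        at_punctuation = token[-1] in PUNCTUATION
        # Never break after a function word; prefer a break before one.
        after_content = bare(token) not in FUNCTION_WORDS
        before_function = not is_last and bare(tokens[i + 1]) in FUNCTION_WORDS
        long_enough = len(current) >= max_words
        if is_last or at_punctuation or (long_enough and after_content and before_function):
            phrases.append(" ".join(current))
            current = []
    return phrases


short = assign_breaks("He went to dinner")
with_comma = assign_breaks("When she arrived, he went to dinner.")
```

The statistical approach on the slide replaces these fixed rules with a classifier trained on labeled boundaries, using the same kinds of cues as features.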

  17. Intonation Assignment: Accent • Hand-built rules • Function/content distinction He went out the back door/He threw out the trash • Complex nominals: • Main Street/Park Avenue • city hall parking lot • Statistical procedures trained on large corpora • Contrastive stress, given/new distinction?
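A minimal sketch of the function/content rule: accent every word that is not on a function-word list. The list and function name are assumptions for illustration. The sketch also shows why the rule is not enough on its own: it has no way to tell the preposition "out" in the first example sentence from the particle "out" in the second, which is presumably the point of the pair.

```python
FUNCTION_WORDS = frozenset({
    "a", "an", "the", "to", "of", "in", "on", "at", "out", "for", "with",
    "and", "or", "but", "he", "she", "it", "they", "we", "i", "you",
})


def assign_accents(sentence):
    """Return (token, accented) pairs; content words get a pitch accent."""
    result = []
    for token in sentence.split():
        word = token.strip(".,;:?!").lower()
        result.append((token, word not in FUNCTION_WORDS))
    return result


door = assign_accents("He went out the back door")
trash = assign_accents("He threw out the trash")
```

Both calls leave "out" unaccented, because the lookup sees only the word form. Telling the two uses apart requires syntactic information.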

  18. Intonation Assignment: Contours • Simple rules • ‘.’ = declarative contour • ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence • Well, how did he do it? And what do you know?
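The simple rules on this slide translate almost directly into code. In the sketch below, the three-word window standing in for "at/near front of sentence", the wh-word list, and the returned labels are all assumptions. The slide says only that wh-questions do not get the yes-no contour, so they are returned under their own label rather than mapped to a specific contour.

```python
WH_WORDS = frozenset({
    "who", "whom", "whose", "what", "which", "when", "where", "why", "how",
})


def choose_contour(sentence, window=3):
    """Pick a contour label from final punctuation and leading wh-words."""
    stripped = sentence.strip()
    if not stripped.endswith("?"):
        return "declarative"
    leading = [w.strip(".,;:?!\"'").lower() for w in stripped.split()[:window]]
    if any(word in WH_WORDS for word in leading):
        return "wh-question"
    return "yes-no-question"


labels = [
    choose_contour("I own a cat."),
    choose_contour("Do you want to fly first class?"),
    choose_contour("Well, how did he do it?"),
    choose_contour("And what do you know?"),
]
```

The window is what lets "Well, how ..." and "And what ..." be recognized even though the wh-word is not the first token.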

  19. Phonological and Acoustic Realization • Task: • Produce a phonological representation from phonemes and intonational assignment • Pitch contour aligned with text • Durations, intensity • Select best concatenative units from inventory • Post-process if needed/possible to smooth joins, modify pitch, duration, intensity, rate from original units • Produce acoustic waveform as output
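The step "select best concatenative units from inventory" is commonly framed as a shortest-path search over candidate units. The sketch below is one such formulation, under loudly stated assumptions: the `Unit` fields, the pitch-only target cost, and the pitch-discontinuity join cost are simplifications chosen for illustration, not the costs used by any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str           # unit label, e.g. a diphone
    pitch_start: float  # F0 in Hz at the left edge
    pitch_end: float    # F0 in Hz at the right edge
    source: str         # which recording the unit came from


def target_cost(unit, target_pitch):
    """How far the unit's mean pitch is from the requested pitch."""
    return abs((unit.pitch_start + unit.pitch_end) / 2.0 - target_pitch)


def join_cost(left, right):
    """Pitch discontinuity at the boundary between two units."""
    return abs(left.pitch_end - right.pitch_start)


def select_units(targets, inventory):
    """Dynamic-programming search for the cheapest unit sequence.

    targets: list of (unit name, target pitch); inventory: name -> list of Unit.
    """
    if not targets:
        return []
    candidates = [inventory[name] for name, _ in targets]
    # best[pos][j] = (cheapest total cost ending in candidate j, backpointer)
    best = [[(target_cost(u, targets[0][1]), None) for u in candidates[0]]]
    for pos in range(1, len(targets)):
        column = []
        for unit in candidates[pos]:
            cost, back = min(
                (best[pos - 1][j][0] + join_cost(prev, unit), j)
                for j, prev in enumerate(candidates[pos - 1])
            )
            column.append((cost + target_cost(unit, targets[pos][1]), back))
        best.append(column)
    # Trace back from the cheapest final candidate.
    idx = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for pos in range(len(targets) - 1, -1, -1):
        path.append(candidates[pos][idx])
        idx = best[pos][idx][1]
    path.reverse()
    return path


inventory = {
    "#-k": [Unit("#-k", 100.0, 110.0, "a"), Unit("#-k", 150.0, 160.0, "b")],
    "k-ae": [Unit("k-ae", 112.0, 120.0, "a"), Unit("k-ae", 158.0, 170.0, "b")],
}
chosen = select_units([("#-k", 105.0), ("k-ae", 165.0)], inventory)
```

In this toy inventory the search accepts a large pitch jump at the join because both units match their targets closely. The post-processing step on the slide exists to smooth exactly that kind of discontinuity.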

  20. TTS: Where are we now? • Natural-sounding speech for some utterances • Where good match between input and database • Still hard to vary prosodic features and retain naturalness • Yes-no questions: Do you want to fly first class? • Context-dependent variation still hard to infer from text and hard to realize naturally:

  21. Appropriate contours from text • Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me. • Variation in pitch range, rate, pausal duration to convey topic structure • Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative…. • How to mimic real voices? • ScanSoft/Nuance demo

  22. Next Week • Pronunciation Modeling for speech synthesis • Hwk 2 due
