Sonorant Grab Bag. March 27, 2014. Speech Synthesis: A Basic Overview. Speech synthesis is the generation of speech by machine. The reasons for studying synthetic speech have evolved over the years: Novelty To control acoustic cues in perceptual studies
Sonorant Grab Bag March 27, 2014
Speech Synthesis: A Basic Overview • Speech synthesis is the generation of speech by machine. • The reasons for studying synthetic speech have evolved over the years: • Novelty • To control acoustic cues in perceptual studies • To understand the human articulatory system • “Analysis by Synthesis” • Practical applications • Reading machines for the blind, navigation systems
Speech Synthesis: A Basic Overview • There are four basic types of synthetic speech: • Mechanical synthesis • Formant synthesis • Based on Source/Filter theory • Concatenative synthesis • = stringing bits and pieces of natural speech together • Articulatory synthesis • = generating speech from a model of the vocal tract.
1. Mechanical Synthesis • The very first attempts to produce synthetic speech were made without electricity. • = mechanical synthesis • In the late 1700s, models were produced which used: • reeds as a voicing source • differently shaped tubes for different vowels
Mechanical Synthesis, part II • Later, Wolfgang von Kempelen and Charles Wheatstone created a more sophisticated mechanical speech device… • with independently manipulable source and filter mechanisms.
Mechanical Synthesis, part III • An interesting historical footnote: • Alexander Graham Bell and his “questionable” experiments with his dog. • Mechanical synthesis has largely gone out of style ever since. • …but check out Mike Brady’s talking robot.
The Voder • The next big step in speech synthesis was to generate speech electronically. • This was most famously demonstrated at the New York World’s Fair in 1939 with the Voder. • The Voder was a manually controlled speech synthesizer. • (operated by highly trained young women)
Voder Principles • The Voder basically operated like a vocoder. • Voicing and fricative source sounds were filtered by 10 different resonators… • each controlled by an individual finger! • Only about 1 in 10 had the ability to learn how to play the Voder.
Overtone Singing • F0 stays the same (on a “drone”), while the singer shapes the vocal tract so that individual harmonics (“overtones”) resonate. • What kind of voice quality would be conducive to this?
Vowels and Sonorants • So far, we’ve talked a lot about the acoustics of vowels: • Source: periodic openings and closings of the vocal folds. • Filter: characteristic resonant frequencies of the vocal tract (above the glottis) • Today, we’ll talk about the acoustics of sonorants: • Nasals • Laterals • Approximants • The source/filter characteristics of sonorants are similar to vowels… with a few interesting complications.
Damping • One interesting acoustic property exhibited by (some) sonorants is damping. • Recall that resonance occurs when: • a sound wave travels through an object • that sound wave is reflected... • ...and reinforced, on a periodic basis • The periodic reinforcement sets up alternating patterns of high and low air pressure • = a standing wave
Resonance in a closed tube • [Figure: resonance in a closed tube, shown over time]
Damping, schematized • In a closed tube: • With only one pressure pulse from the loudspeaker, the wave will eventually dampen and die out. • Why? • The walls of the tube absorb some of the acoustic energy, with each reflection of the standing wave.
Damping Comparison • A heavily damped wave will die out more quickly than a lightly damped wave.
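A rough sketch of this comparison in Python, assuming a damped wave can be modeled as a sinusoid under an exponentially decaying envelope. The decay constants (20 and 200 per second) and the function names are illustrative choices, not values from the slides.

```python
import math

def damped_wave(t, freq_hz, alpha):
    """Sinusoid at freq_hz whose amplitude decays as exp(-alpha * t)."""
    return math.exp(-alpha * t) * math.sin(2 * math.pi * freq_hz * t)

def time_to_fraction(alpha, fraction=0.1):
    """Seconds until the envelope exp(-alpha * t) falls to `fraction`."""
    return math.log(1 / fraction) / alpha

light_alpha = 20.0    # assumed light damping (1/s)
heavy_alpha = 200.0   # assumed heavy damping (1/s)

light_time = time_to_fraction(light_alpha)
heavy_time = time_to_fraction(heavy_alpha)
print(f"light damping: 10% of starting amplitude after {light_time:.3f} s")
print(f"heavy damping: 10% of starting amplitude after {heavy_time:.3f} s")

# Sample both waves at t = 2.5 ms, where an undamped 100 Hz sine peaks.
t_peak = 0.0025
light_sample = damped_wave(t_peak, 100, light_alpha)
heavy_sample = damped_wave(t_peak, 100, heavy_alpha)
print(f"at 2.5 ms: light = {light_sample:.3f}, heavy = {heavy_sample:.3f}")
```

With these assumed constants, the heavily damped wave reaches a tenth of its starting amplitude ten times sooner than the lightly damped one.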
Damping Factors • The amount of damping in a tube is a function of: • The volume of the tube • The surface area of the tube • The material of which the tube is made • More volume, more surface area = more damping • Think about the resonant characteristics of: • a Home Depot • a post-modern restaurant • a movie theater • an anechoic chamber
Resonance and Recording • Remember: any room will reverberate at its characteristic resonant frequencies • Hence: high quality sound recordings need to be made in specially designed rooms which damp any reverberation • Examples: • Classroom recording (29 dB signal-to-noise ratio) • “Soundproof” booth (44 dB SNR) • Anechoic chamber (90 dB SNR)
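To get a feel for these signal-to-noise ratios, a decibel value can be converted back into a plain amplitude ratio. This is a minimal sketch, assuming the SNR figures are amplitude decibels (20 times the base-10 log of the ratio). The function and variable names are illustrative.

```python
def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to a signal/noise amplitude ratio."""
    return 10 ** (db / 20)

# SNR values quoted on the slide.
rooms = {"classroom": 29, "soundproof booth": 44, "anechoic chamber": 90}

ratios = {name: db_to_amplitude_ratio(db) for name, db in rooms.items()}
for name, ratio in ratios.items():
    print(f"{name}: signal amplitude is about {ratio:,.0f} times the noise")
```

Under that assumption, the classroom signal is roughly 28 times the noise amplitude, the booth roughly 158 times, and the anechoic chamber over 30,000 times.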
Spectrograms • [Figure: spectrograms of the classroom and “soundproof” booth recordings]
Spectrograms • [Figure: spectrogram of the anechoic chamber recording]
Inside Your Nose • In nasals, air flows through the nasal cavities. • The resonating “filter” of nasal sounds therefore has: • increased volume • increased surface area • increased damping • Note: • the exact size and shape of the nasal cavities vary wildly from speaker to speaker.
Nasal Variability • Measurements based on MRI data (Dang et al., 1994)
Damping Effects, part 1 • Damping by the nasal cavities decreases the overall amplitude of the sound coming out through the nose. [m] [m]
Damping Effects, part 2 • How might the power spectrum of an undamped wave: • Compare to that of a damped wave? • A: Undamped waves have only one component; • Damped waves have a broader range of components.
Here’s Why • 100 Hz sinewave + 90 Hz sinewave + 110 Hz sinewave
The Result • 90 Hz + 100 Hz + 110 Hz • If the 90 Hz and 110 Hz components have less amplitude than the 100 Hz wave, there will be less damping.
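The effect of adding the 90 Hz and 110 Hz components can be checked numerically. The sketch below assumes all three components are cosines that start in phase, and it samples the sum at successive peaks of the 100 Hz wave. The side-component amplitudes (0.5 and 0.25) are illustrative.

```python
import math

def summed_wave(t, side_amp):
    """100 Hz cosine plus 90 Hz and 110 Hz cosines of amplitude side_amp."""
    return (math.cos(2 * math.pi * 100 * t)
            + side_amp * math.cos(2 * math.pi * 90 * t)
            + side_amp * math.cos(2 * math.pi * 110 * t))

def values_at_peaks(side_amp, n_peaks=6):
    """Sum sampled at t = 0, 10, 20, ... ms, where the 100 Hz cosine peaks."""
    return [summed_wave(k / 100, side_amp) for k in range(n_peaks)]

strong = values_at_peaks(0.5)    # large side components
weak = values_at_peaks(0.25)     # small side components

print("side amplitude 0.50:", [round(v, 3) for v in strong])
print("side amplitude 0.25:", [round(v, 3) for v in weak])
```

Over the first 50 ms, the peaks fall from 2.0 to 0 with the larger side components, but only from 1.5 to 0.5 with the smaller ones. Three components alone produce a repeating beat rather than a true decay, so this only approximates a damped wave over that first stretch.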
Damping Spectra • [Figure: spectra of lightly and moderately damped waves]
Damping Spectra • [Figure: spectrum of a heavily damped wave] • Damping increases the bandwidth of the resonating filter. • Bandwidth = the range of frequencies over which a filter responds with at least 0.707 of its maximum output (the half-power, or -3 dB, points). • Nasal formants will have a larger bandwidth than vowel formants.
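The link between damping and bandwidth can be sketched numerically. The code below assumes a single resonance whose impulse response decays as exp(-alpha * t), and it uses a single-pole approximation of the response near the resonant frequency. The decay constants and function names are illustrative, not measurements of any nasal.

```python
import math

def response(f, f0, alpha):
    """Approximate magnitude response near resonance f0 (Hz) for a
    resonator whose impulse response decays as exp(-alpha * t)."""
    return 1.0 / math.sqrt(alpha ** 2 + (2 * math.pi * (f - f0)) ** 2)

def bandwidth(f0, alpha, step=0.01, span=200.0):
    """Width (Hz) of the band where the response is >= 0.707 of its peak."""
    threshold = response(f0, f0, alpha) / math.sqrt(2)
    n = int(span / step)
    freqs = [f0 - span / 2 + i * step for i in range(n + 1)]
    passing = [f for f in freqs if response(f, f0, alpha) >= threshold]
    return passing[-1] - passing[0]

light_bw = bandwidth(100.0, 50.0)    # assumed light damping
heavy_bw = bandwidth(100.0, 200.0)   # assumed heavy damping
print(f"light damping: bandwidth of about {light_bw:.1f} Hz")
print(f"heavy damping: bandwidth of about {heavy_bw:.1f} Hz")
```

Under this approximation the bandwidth works out to alpha / pi, so quadrupling the damping constant quadruples the bandwidth.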
Bandwidth in Spectrograms • The formants in nasals have increased bandwidth, in comparison to the formants in vowels. • [Figure: spectrograms comparing F3 in a vowel with F3 in [m]]
Nasal Formants • The values of formant frequencies for nasal stops can be calculated with the same formula that we used to calculate formant frequencies for an open tube (closed at one end, open at the other). • $f_n = \frac{(2n - 1)\,c}{4L}$ • The simplest case: uvular nasal [ɴ]. • The length of the tube is a combination of: • distance from glottis to uvula (9 cm) • distance from uvula to nares (12.5 cm) • An average tube length (for adult males): 21.5 cm
The Math • $f_n = \frac{(2n - 1)\,c}{4L}$ • L = 9 cm + 12.5 cm = 21.5 cm • c = 35000 cm/sec • F1 = 35000 / 86 = 407 Hz • F2 = 1221 Hz • F3 = 2035 Hz
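The same arithmetic as a short Python sketch, using the tube lengths and speed of sound given on the slide. The function name is illustrative.

```python
def tube_resonances(length_cm, count=3, c=35000.0):
    """Resonances (Hz) of a tube closed at one end and open at the other:
    f_n = (2n - 1) * c / (4 * L), with c in cm/sec and L in cm."""
    return [(2 * n - 1) * c / (4 * length_cm) for n in range(1, count + 1)]

# Glottis to uvula (9 cm) plus uvula to nares (12.5 cm).
formants = tube_resonances(9 + 12.5)
print([round(f) for f in formants])  # [407, 1221, 2035]
```

Each higher formant is an odd multiple of F1, which is why the spacing between adjacent formants is a constant 2 × F1.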
The Real Thing • Check out Peter’s production of a uvular nasal in Praat. • And also Dustin’s neutral vowel! • Note: the higher formants are low in amplitude • Some reasons why: • Overall damping • “Nostril-rounding” reduces intensity • Resonance is lost in the side passages of the sinuses. • Nasal stops with fronter places of articulation also have anti-formants.
Anti-Formants • For nasal stops, the occlusion in the mouth creates a side cavity. • This side cavity resonates at particular frequencies. • These resonances absorb acoustic energy in the system. • They form anti-formants
Anti-Formant Math • Anti-formant resonances are based on the length of the side-cavity tube (the closed-off mouth). • For [m], this length is about 8 cm. • $f_n = \frac{(2n - 1)\,c}{4L}$ • L = 8 cm • AF1 = 35000 / (4 × 8) = 1094 Hz • AF2 = 3281 Hz • etc.
Spectral Signatures • In a spectrogram, acoustic energy drops, or disappears completely, at the anti-formant frequencies. • [Figure: spectrogram with anti-formants marked]
Nasal Place Cues • At more posterior places of articulation, the “anti-resonating” tube is shorter. • anti-formant frequencies will be higher. • for [n], L = 5.5 cm • AF1 = 1600 Hz • AF2 = 4800 Hz • for , L = 3.3 cm • AF1 = 2650 Hz • for , L = 2.3 cm • AF1 = 3700 Hz
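The pattern across places of articulation can be reproduced with the same quarter-wavelength formula. This is a sketch using the side-cavity lengths quoted on the slides and c = 35000 cm/sec. The function name is illustrative.

```python
def antiformants(length_cm, count=2, c=35000.0):
    """Anti-formant frequencies (Hz) for a side cavity of the given length,
    using f_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * c / (4 * length_cm) for n in range(1, count + 1)]

# Side-cavity lengths from the slides, longest (most anterior) first.
lengths_cm = [8.0, 5.5, 3.3, 2.3]
first_antiformants = [round(antiformants(length)[0]) for length in lengths_cm]

for length, af1 in zip(lengths_cm, first_antiformants):
    print(f"L = {length} cm -> AF1 of about {af1} Hz")
```

The computed AF1 values rise as the cavity shortens. The slide figures appear to be rounded (5.5 cm gives roughly 1591 Hz rather than 1600 Hz), and with this value of c the 2.3 cm cavity comes out nearer 3800 Hz than the quoted 3700 Hz.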
[m] vs. [n] • [Figure: spectrogram of [meno], with AF1 marked for [m] and for [n]] • Production of [meno] by a speaker of Tsonga • Tsonga is spoken in South Africa and Mozambique
Nasal Stop Acoustics: Summary • Here’s the general pattern of what to look for in a spectrogram for nasals: • Periodic voicing. • Overall amplitude lower than in vowels. • Formants (resonance). • Formants have broad bandwidths. • Low frequency first formant. • Less space between formants. • Higher formants have low amplitude.
Perceiving Nasal Place • Nasal “murmurs” do not provide particularly strong cues to place of articulation. • Can you identify the following as [m], [n] or ? • Repp (1986) found that listeners can only distinguish between [n] and [m] 72% of the time. • Transitions provide important place cues for nasals. • Repp (1986): 95% of nasals identified correctly when presented with the first 10 msec of the following vowel. • Can you identify these nasal + transition combos?