Speech production mechanisms
Spch Prod Speech Production Organs
(diagram labels: brain, nasal cavity, hard palate, velum, uvula, teeth, lips, mouth cavity, tongue, pharynx, larynx, esophagus, trachea, lungs)
Spch Prod Speech Production Organs - cont. • Air from the lungs is exhaled into the trachea (windpipe) • The vocal cords (folds) in the larynx can produce periodic pulses of air by opening and closing the glottis • The throat (pharynx), mouth, tongue and nasal cavity modify the air flow • The teeth and lips can introduce turbulence • The epiglottis separates the esophagus (food pipe) from the trachea
Spch Prod Voiced vs. Unvoiced Speech • When the vocal cords are held open, air flows through unimpeded • When the laryngeal muscles stretch them, the glottal flow comes in bursts • When the glottal flow is periodic, the speech is called voiced • The basic interval/frequency is called the pitch • The pitch period is usually between 2.5 and 20 milliseconds, i.e. a pitch frequency between 50 and 400 Hz; you can feel the vibration of the larynx • Vowels are always voiced (unless whispered) • Consonants come in voiced/unvoiced pairs, for example: B/P, G/K, D/T, V/F, J/CH, TH (as in "that") / th (as in "think"), W/WH, Z/S, ZH/SH
Spch Prod Excitation spectra • Voiced speech: the pulse train is not sinusoidal, so its spectrum is harmonic rich • Unvoiced speech: the common assumption is white noise, i.e. a flat spectrum
Spch Prod Effect of vocal tract • Mouth and nasal cavities have resonances • Resonant frequencies depend on geometry
Spch Prod Effect of vocal tract - cont. • Sound energy at these resonant frequencies is amplified • The frequencies of peak amplification are called formants
(figure labels: frequency response vs. frequency, F0, F1 F2 F3 F4, voiced speech, unvoiced speech)
Spch Prod Formant frequencies • Peterson - Barney data (note the “vowel triangle”)
Spch Prod Sonograms
Spch Prod Cylinder model(s)
A rough model of the throat and mouth cavity: voice excitation drives an open cylinder. With the nasal cavity included, the excitation drives two branches, one open and one open/closed.
Spch Prod Phonemes • The smallest acoustic unit that can change meaning • Different languages have different phoneme sets • Types: (notations: phonetic, CVC, ARPABET) • Vowels • front (heed, hid, head, hat) • mid (hot, heard, hut, thought) • back (boot, book, boat) • diphthongs (buy, boy, down, date) • Semivowels • liquids (w, l) • glides (r, y)
Spch Prod Phonemes - cont. • Consonants • nasals (murmurs) (n, m, ng) • stops (plosives) • voiced (b,d,g) • unvoiced (p, t, k) • fricatives • voiced (v, that, z, zh) • unvoiced (f, think, s, sh) • affricatives (j, ch) • whispers (h, what) • gutturals ( ח,ע) • clicks, etc. etc. etc.
Spch Prod Basic LPC Model
(block diagram: a pulse generator and a white noise generator feed a U/V switch, whose output drives the LPC synthesis filter)
Spch Prod Basic LPC Model - cont. • Pulse generator produces a harmonic rich periodic impulse train (with pitch period and gain) • White noise generator produces a random signal (with gain) • U/V switch chooses between voiced and unvoiced speech • LPC filter amplifies formant frequencies (all-pole or AR IIR filter) • The output will resemble true speech to within residual error
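The basic LPC model above can be sketched in a few lines of code. The following is a minimal illustration, not the deck's own implementation: `lpc_synthesize` is a hypothetical helper, and the coefficients, gain, and pitch period in the example are made up.

```python
import random

def lpc_synthesize(a, gain, n_samples, pitch_period=None, seed=0):
    """Drive an all-pole (AR) filter with either a periodic impulse
    train (voiced) or white noise (unvoiced).

    a: AR coefficients [a1, a2, ...] in s[n] = e[n] + sum_k a[k] s[n-k]
    pitch_period: impulse spacing in samples for voiced speech,
                  or None for unvoiced (noise) excitation.
    """
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        # Excitation: impulse train (voiced) or white noise (unvoiced)
        if pitch_period is not None:
            e = gain if n % pitch_period == 0 else 0.0
        else:
            e = gain * rng.gauss(0.0, 1.0)
        # All-pole filter: feed back past outputs
        s = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                s += ak * out[n - k]
        out.append(s)
    return out

# Voiced example: one pole at 0.5, pitch period of 4 samples
voiced = lpc_synthesize([0.5], 1.0, 8, pitch_period=4)
```

Each impulse rings through the filter and decays until the next pitch pulse arrives, which is exactly the source-filter picture of the slide.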
Spch Prod Cepstrum
Another way of thinking about the LPC model. The speech spectrum is obtained by multiplication: the spectrum of the (pitch) pulse train times the vocal tract (formant) frequency response. So the log of this spectrum is obtained by addition: the log spectrum of the pitch pulse train plus the log of the vocal tract frequency response. Consider this log spectrum to be the spectrum of some new signal, called the cepstrum. The cepstrum is then the sum of two components: excitation plus vocal tract.
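The construction above (log of the magnitude spectrum, treated as a spectrum itself) can be written directly. This is a sketch of the real cepstrum in plain Python, using a naive DFT for self-containment; a real implementation would use an FFT, and `floor` is an assumed guard against log of zero.

```python
import cmath
import math

def dft(x):
    """Naive DFT, O(N^2); fine for an illustration."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def real_cepstrum(x, floor=1e-12):
    """Real cepstrum: inverse DFT of the log magnitude spectrum.
    Note the result is a signal in the time (quefrency) domain."""
    N = len(x)
    log_mag = [math.log(max(abs(X), floor)) for X in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)).real / N
            for n in range(N)]
```

For voiced speech, the slowly varying vocal tract contribution lands at low quefrencies and the pitch harmonics show up as a peak at the pitch period, which is why the two can be separated by liftering.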
Spch Prod Cepstrum - cont. Cepstral processing has its own language • Cepstrum (note that this is really a signal in the time domain) • Quefrency (its units are seconds) • Liftering (filtering) • Alanysis • Saphe Several variants: • complex cepstrum • power cepstrum • LPC cepstrum
Spch Prod Do we know enough? Standard speech model (LPC) (used by most speech processing/compression/recognition systems) is a model of speech production Unfortunately, speech production and speech perception systems are not matched So next we’ll look at the biology of the hearing (auditory) system and some psychophysics (perception)
Speech Hearing & Perception Mechanisms
Spch Perc Hearing Organs
Spch Perc Hearing Organs - cont. • Sound waves impinging on the outer ear enter the auditory canal • The amplified waves cause the eardrum to vibrate • The eardrum separates the outer ear from the middle ear • The Eustachian tube equalizes the air pressure of the middle ear • The ossicles (hammer, anvil, stirrup) amplify the vibrations • The oval window separates the middle ear from the inner ear • The stirrup excites the oval window, which excites the liquid in the cochlea • The cochlea is curled up like a snail • The basilar membrane runs along the middle of the cochlea • The organ of Corti transduces vibrations to electric pulses • The pulses are carried by the auditory nerve to the brain
Spch Perc Function of Cochlea • The cochlea has 2 1/2 to 3 turns; if it were straightened out it would be 3 cm in length • The basilar membrane runs down the center of the cochlea, as does the organ of Corti • 15,000 cilia (hairs) contact the vibrating basilar membrane and release neurotransmitter, stimulating 30,000 auditory neurons • The cochlea is wide (1/2 cm) near the oval window and tapers towards the apex; it is stiff near the oval window and flexible near the apex • Hence high frequencies cause the section near the oval window to vibrate, while low frequencies cause the section near the apex to vibrate • The result is a frequency decomposition by an overlapping bank of filters
Spch Perc Psychophysics - Weber's law
Ernst Weber, professor of physiology at Leipzig in the early 1800s.
Just Noticeable Difference (JND): the minimal stimulus change that can be detected by the senses.
Discovery: ΔI = K I
Example, tactile sense: place coins in each hand; a subject could discriminate between 10 coins and 11, but not between 20 and 21, yet could between 20 and 22!
Similarly for vision (lengths of lines), taste (saltiness), and sound (frequency).
Spch Perc Weber's law - cont.
This makes a lot of sense: one dollar more is noticeable to most of us, but not to Bill Gates.
Spch Perc Psychophysics - Fechner's law
Weber's law is not a true psychophysical law: it relates a stimulus threshold to the stimulus (both physical entities), not an internal representation (feelings) to a physical entity.
Gustav Theodor Fechner, student of Weber (medicine, physics, philosophy).
Simplest assumption: one JND corresponds to a single internal unit.
Using Weber's law we find: Y = A log I + B
Fechner Day (October 22, 1850)
Spch Perc Fechner's law - cont.
The log is very compressive; Fechner's law explains the fantastic ranges of our senses.
Sight: from a single photon to direct sunlight, a range of 10^15
Hearing: from an eardrum motion of one hydrogen atom to a jet plane, a range of 10^12
The Bel is defined to be log10 of a power ratio; the decibel (dB) is one tenth of a Bel:
d(dB) = 10 log10 ( P1 / P2 )
Spch Perc Fechner's law - sound amplitudes
Companding: an adaptation of the logarithm to positive/negative signals.
μ-law and A-law are piecewise linear approximations.
Equivalent to linear sampling at 12-14 bits (8 bit linear sampling is significantly noisier).
Spch Perc Fechner's law - sound frequencies
Octaves: in the well-tempered scale each octave is divided into 12 steps, each a frequency ratio of 2^(1/12).
Melody (mel) scale: 1 kHz = 1000 mel, with JNDs afterwards, giving M ≈ 1000 log2 ( 1 + f_kHz )
Critical bands (Barkhausen): tones within a critical band excite the same basilar membrane region and cannot be separately heard; the bandwidth is B ≈ 25 + 75 ( 1 + 1.4 f_kHz^2 )^0.69
Both are warpings of the physical frequency axis f.
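The two warping formulas quoted on this slide are easy to code up. This sketch uses the slide's own approximations (other mel formulas exist); the function names are mine, not from the deck.

```python
import math

def mel_from_hz(f_hz):
    """Mel scale per the slide's approximation: 1 kHz maps to 1000 mel,
    M = 1000 * log2(1 + f_kHz)."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)

def critical_bandwidth_hz(f_hz):
    """Critical bandwidth per the slide's Barkhausen approximation:
    B = 25 + 75 * (1 + 1.4 * f_kHz^2)^0.69, in Hz."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

# The warping is compressive: equal mel steps cover ever-wider Hz ranges
low  = mel_from_hz(500.0)
high = mel_from_hz(4000.0)
```

Note how the critical bandwidth starts near 100 Hz at low frequencies and grows with frequency, matching the wide-to-narrow taper of the basilar membrane described earlier.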
Spch Perc Psychophysics - changes
Our senses respond to changes (figure: inverse E filter)
Spch Perc Psychophysics - masking
Masking: strong tones block weaker ones at nearby frequencies; narrowband noise blocks tones (up to a critical band away).
Some Speech DSP Simplest processing • Gain • AGC • VAD More complex processing • pitch tracking • U/V decision • computing LPC • other features
Spch DSP Gain (volume) Control
In analog processing (electronics), gain requires an amplifier, and great care must be taken to ensure linearity! In digital processing (DSP), gain requires only a multiplication: y = G x (but we need enough bits!)
Spch DSP Automatic Gain Control (AGC)
Can we set the gain automatically? Yes, based on the signal's energy:
E = ∫ x^2(t) dt = Σ x_n^2
All we have to do is apply gain until we attain the desired energy. Assume we want the energy to be Y. Then y = √(Y/E) x = G x has exactly this energy.
Spch DSP AGC - cont.
What if the input isn't stationary (gets stronger and weaker over time)? The energy is defined over all times -∞ < t < ∞, so it can't help! So we define the "energy in a window" E(t) and continuously vary the gain G(t). This is adaptive gain control. We don't want the gain to jump from window to window, so we smooth the instantaneous gain with an IIR filter:
G(t) ← α G(t) + (1-α) √( Y / E(t) )
Spch DSP AGC - cont.
The α coefficient determines how fast G(t) can change. In more complex implementations we may separately control the integration time, attack time and release time.
What is involved in the computation of G(t)?
• Squaring of the input value
• Accumulation
• Square root (or Pythagorean sum)
• Inversion (division)
Square root and inversion are hard for a DSP processor, but algorithmic improvements are possible (and often needed).
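The windowed AGC described above can be sketched as follows. This is a minimal floating-point illustration under assumed parameter names (`target_energy`, `window`, `alpha`); a DSP implementation would also need the fixed-point square-root and division tricks the slide alludes to.

```python
import math

def agc(x, target_energy, window=160, alpha=0.9, eps=1e-12):
    """Adaptive gain control: compute energy per window, derive the
    instantaneous gain sqrt(Y / E), and smooth it with a one-pole IIR
    so the gain does not jump between windows."""
    out = []
    gain = 1.0
    for start in range(0, len(x), window):
        frame = x[start:start + window]
        energy = sum(v * v for v in frame)
        desired = math.sqrt(target_energy / max(energy, eps))  # sqrt(Y/E)
        gain = alpha * gain + (1.0 - alpha) * desired          # IIR smoothing
        out.extend(gain * v for v in frame)
    return out
```

With alpha near 1 the gain adapts slowly (long integration time); alpha = 0 reproduces the unsmoothed per-window gain.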
Spch DSP Simple VAD
Sometimes it is useful to know whether someone is talking (or not), e.g. to
• Save bandwidth
• Suppress echo
• Segment utterances
We might be able to get away with an "energy VOX" (normally we need a Noise Riding Threshold / Signal Riding Threshold). However, there are problems with energy VOX, since it doesn't differentiate between speech and noise. What we really want is a speech-specific activity detector: a Voice Activity Detector (VAD).
Spch DSP Simple VAD - cont. VADs operate by recognizing that speech is different from noise • Speech is low-pass while noise is white • Speech is mostly voiced and so has pitch in a given range • Average noise amplitude is relatively constant A simple VAD may use: • zero crossings • zero crossing “derivative” • spectral tilt filter • energy contours • combinations of the above
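A toy version of the simple VAD above, combining two of the listed features (energy contour and zero-crossing rate). The thresholds are made-up illustrative values, not ones from the deck; a practical VAD would adapt them to the noise floor.

```python
def frame_features(frame):
    """Energy and zero-crossing count of one frame."""
    energy = sum(v * v for v in frame)
    zero_crossings = sum(1 for a, b in zip(frame, frame[1:])
                         if (a >= 0) != (b >= 0))
    return energy, zero_crossings

def simple_vad(frame, energy_threshold, max_zc_ratio=0.25):
    """Declare speech when the frame is loud AND low-pass-like:
    speech energy sits at low frequencies, while white noise changes
    sign roughly every other sample (high zero-crossing rate)."""
    energy, zc = frame_features(frame)
    zc_ratio = zc / max(len(frame) - 1, 1)
    return energy > energy_threshold and zc_ratio < max_zc_ratio
```

A loud low-frequency frame passes both tests; silence fails the energy test, and loud wideband noise fails the zero-crossing test.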
Spch DSP Other “simple” processes Simple = not significantly dependent on details of speech signal • Speed change of recorded signal • Speed change with pitch compensation • Pitch change with speed compensation • Sample rate conversion • Tone generation • Tone detection • Dual tone generation • Dual tone detection (need high reliability)
Complex Speech DSP
Spch DSP Correlation
One major difference between simple and complex processing is the computation of correlations (related to the LPC model). Correlation is a measure of similarity. Shouldn't we use the squared difference to measure similarity?
D^2 = < ( x(t) - y(t) )^2 >
No, since the squared difference is sensitive to
• gain
• time shifts
Spch DSP Correlation - cont.
D^2 = < ( x(t) - y(t) )^2 > = < x^2 > + < y^2 > - 2 < x(t) y(t) >
So when D^2 is minimal, C(0) = < x(t) y(t) > is maximal, and arbitrary gains don't change this. To take time shifts into account, define
C(τ) = < x(t) y(t+τ) >
and look for the τ with maximal correlation! We can even find out how much a signal resembles itself.
Spch DSP Autocorrelation
Crosscorrelation: C_xy(τ) = < x(t) y(t+τ) >
Autocorrelation: C_x(τ) = < x(t) x(t+τ) >
C_x(0) is the energy!
Autocorrelation helps find hidden periodicities, much more effectively than looking in the time representation.
Wiener-Khinchine theorem: the autocorrelation C(τ) and the power spectrum S(f) are a Fourier transform pair. So the autocorrelation contains the same information as the power spectrum … and can itself be computed by FFT.
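The definition above translates directly to code. A minimal sketch (direct O(N·L) computation; as the slide notes, an FFT route is faster for long signals):

```python
def autocorr(x, max_lag):
    """Autocorrelation C(tau) = sum_n x[n] * x[n + tau]
    for tau = 0 .. max_lag (biased estimate: fewer terms at larger lag)."""
    return [sum(x[n] * x[n + tau] for n in range(len(x) - tau))
            for tau in range(max_lag + 1)]

# A hidden periodicity shows up as a peak at the period:
pulse_train = [1.0, 0.0, 0.0, 0.0] * 5   # period 4
c = autocorr(pulse_train, 8)
```

C(0) equals the energy, and for the period-4 pulse train the lags 4 and 8 stand out above all non-multiple lags.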
Spch DSP Pitch tracking
How can we measure (and track) the pitch? We can look for it in the spectrum
• but it may be very weak
• it may not even be there (filtered out)
• we need high resolution spectral estimation
Correlation based methods: the pitch periodicity should be seen in the autocorrelation! Sometimes computationally simpler is the Absolute Magnitude Difference Function:
AMDF(τ) = < | x(t) - x(t+τ) | >
Spch DSP Pitch tracking - cont.
Sondhi's algorithm for autocorrelation-based pitch tracking:
• obtain a window of speech
• determine if the segment is voiced (see the U/V decision below)
• low-pass filter and center-clip to reduce formant-induced correlations
• compute the autocorrelation at lags corresponding to valid pitch intervals
• find the lag with maximum correlation, OR find the lag with maximal accumulated correlation over all its multiples
Post processing: pitch trackers rarely make small errors (the usual error is pitch doubling), so correct outliers based on neighboring values.
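The center-clip-then-correlate steps can be sketched as below. This is an illustration of the idea, not Sondhi's published algorithm: the clipping fraction is an assumed parameter, and the low-pass filtering and voicing check are omitted.

```python
def center_clip(x, fraction=0.3):
    """Zero out samples whose magnitude is below a fraction of the peak,
    to suppress formant-induced correlation structure."""
    thresh = fraction * max(abs(v) for v in x)
    return [v - thresh if v > thresh
            else (v + thresh if v < -thresh else 0.0)
            for v in x]

def pitch_lag(x, min_lag, max_lag):
    """Return the lag (in samples) with maximum autocorrelation of the
    center-clipped signal, searched over the valid pitch interval."""
    y = center_clip(x)
    def c(tau):
        return sum(y[n] * y[n + tau] for n in range(len(y) - tau))
    return max(range(min_lag, max_lag + 1), key=c)
```

The lag search range implements the "valid pitch intervals" step: for 8 kHz speech, pitch periods of 2.5 to 20 ms correspond to lags of 20 to 160 samples.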
Spch DSP Other Pitch Trackers
Miller's data-reduction and Gold & Rabiner's parallel processing methods: zero-crossings, energy, and extrema of the waveform.
Noll's cepstrum-based pitch tracker: the pitch and formant contributions are separated in the cepstral domain; most accurate for clean speech, but not robust in noise.
Methods based on the LPC error signal: the LPC technique breaks down at pitch pulse onset; find the periodicity of the error by autocorrelation.
Inverse filtering method: remove the formant filtering by low-order LPC analysis, then find the periodicity of the excitation by autocorrelation.
Sondhi-like methods are the best for noisy speech.
Spch DSP U/V decision Between VAD and pitch tracking • Simplest U/V decision is based on energy and zero crossings • More complex methods are combined with pitch tracking • Methods based on pattern recognition Is voicing well defined? • Degree of voicing (buzz) • Voicing per frequency band (interference) • Degree of voicing per frequency band
Spch DSP LPC Coefficients
How do we find the vocal tract filter coefficients? This is a system identification problem: an unknown filter with known input and known output.
• All-pole (AR) filter
• Connection to prediction:
s_n = G e_n + Σ_m a_m s_(n-m)
We can find G from the energy (so let's ignore it).
Spch DSP LPC Coefficients
For simplicity let's assume three a coefficients:
s_n = e_n + a_1 s_(n-1) + a_2 s_(n-2) + a_3 s_(n-3)
We need three equations:
s_n     = e_n     + a_1 s_(n-1) + a_2 s_(n-2) + a_3 s_(n-3)
s_(n+1) = e_(n+1) + a_1 s_n     + a_2 s_(n-1) + a_3 s_(n-2)
s_(n+2) = e_(n+2) + a_1 s_(n+1) + a_2 s_n     + a_3 s_(n-1)
In matrix form:
| s_n     |   | e_n     |   | s_(n-1)  s_(n-2)  s_(n-3) | | a_1 |
| s_(n+1) | = | e_(n+1) | + | s_n      s_(n-1)  s_(n-2) | | a_2 |
| s_(n+2) |   | e_(n+2) |   | s_(n+1)  s_n      s_(n-1) | | a_3 |
i.e.  s = e + S a
Spch DSP LPC Coefficients - cont.
s = e + S a, so by simple algebra a = S^(-1) ( s - e ), and we have reduced the problem to matrix inversion. S is a Toeplitz matrix, so the inversion is easy (Levinson-Durbin algorithm). Unfortunately, noise makes this attempt break down! Move to the next time instant and the answer will be different. We need to somehow average the answers, and the proper averaging is done before the equation solving (the autocorrelation method vs. the covariance method).
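The Levinson-Durbin recursion mentioned above exploits the Toeplitz structure to solve the normal equations in O(p^2) instead of O(p^3). A sketch for the autocorrelation method, using the slide's sign convention s[n] = e[n] + Σ a_k s[n-k] (the input `r` is the averaged autocorrelation sequence r[0..order]):

```python
def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the LPC coefficients.

    r: autocorrelation values r[0], r[1], ..., r[order]
    Returns ([a_1, ..., a_order], prediction error energy)."""
    a = [0.0] * (order + 1)   # a[0] is a placeholder; a[i] are predictors
    err = r[0]                # prediction error of the order-0 model
    for i in range(1, order + 1):
        # Reflection coefficient for stepping up from order i-1 to i
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        # Update all coefficients for the new order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k    # error shrinks (or stays) at each order
    return a[1:], err

# AR(1)-like correlation r[k] = 0.5^k: the order-2 model should find
# a_1 = 0.5, a_2 = 0
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

As a bonus, the recursion yields the reflection coefficients k, which correspond to the sections of the cylinder model of the vocal tract discussed earlier.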