
Human Speech Communication


alicia

Presentation Transcript


  1. Human Speech Communication (figure: message -> linguistic code (~50 b/s) -> motor control -> speech production -> SPEECH SIGNAL (~50 kb/s) -> speech perception -> cognitive processes -> linguistic code (~50 b/s) -> message)

  2. PCM (Pulse Code Modulation) • Transmit the value of each speech sample • dynamic range of speech is about 50-60 dB • 11 bits/sample • maximum frequency in telephone speech is 3.4 kHz • sampling frequency 8 kHz • 8000 x 11 = 88 kb/s • Simple and universal but not very efficient
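The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the rule of roughly 6 dB per bit for a uniform quantizer is a standard approximation that the slide does not state explicitly.

```python
import math

def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of plain PCM in bits per second."""
    return sample_rate_hz * bits_per_sample

def uniform_quantizer_range_db(bits_per_sample):
    """Ratio of full scale to one quantization step, in dB (about 6.02 dB per bit)."""
    return 20.0 * math.log10(2 ** bits_per_sample)

# Telephone speech: 3.4 kHz bandwidth needs more than 6.8 kHz sampling, so 8 kHz is used.
print(pcm_bit_rate(8000, 11))                    # 88000 b/s, i.e. 88 kb/s
print(round(uniform_quantizer_range_db(11), 1))  # 66.2 dB, covers the 50-60 dB range
```

Under that approximation, 11 bits give a little more than the 50-60 dB quoted for speech, which is consistent with the slide's choice.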

  3. Better quantization? (figure: quantizer characteristic, IN vs. OUT) • Less quantization noise for weaker signals

  4. Logarithmic PCM (μ-law, A-law) (figure: μ-law and A-law characteristics) • Finer quantization for each individual small-amplitude sample • how about small signal samples surrounded by large ones? • it is the instantaneous signal energy which should determine the step
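A short sketch of μ-law companding shows why small samples are quantized more finely. The value μ = 255 and the 8-bit word length are assumptions (they are common in telephony but are not given on the slide), and all function names are illustrative. Samples are assumed to lie in [-1, 1].

```python
import math

MU = 255.0  # assumed value; not stated on the slide

def mu_compress(x, mu=MU):
    """Logarithmic compression of a sample in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize_uniform(x, bits):
    """Uniform quantizer with 2**(bits-1) steps on each side of zero."""
    levels = 2 ** (bits - 1)
    return round(x * levels) / levels

def quantize_mu_law(x, bits):
    """Compress, quantize uniformly, then expand."""
    return mu_expand(quantize_uniform(mu_compress(x), bits))

weak_sample = 0.003
print(abs(quantize_uniform(weak_sample, 8) - weak_sample))  # error of plain 8-bit PCM
print(abs(quantize_mu_law(weak_sample, 8) - weak_sample))   # error of 8-bit mu-law
```

The step size here still depends only on the individual sample, which is exactly the limitation the last two bullets of the slide point at.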

  5. ?

  6. Differential coding (figure: current sample on a time axis) • For many natural signals, the difference between successive samples quantizes better than the samples themselves • Even better, predict the current sample from the past ones and transmit the error of the prediction

  7. Differential predictive coding • DPCM • a single predictor reflecting global predictability of speech • predictor order up to 4-5 • delta modulation: coarse quantization of prediction error into 1 bit (typically requires oversampling well above the Nyquist rate) • adaptive DPCM • new predictor for every new speech block • predictor needs to be transmitted together with the prediction error
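The DPCM idea can be sketched as a closed-loop coder in which the encoder predicts from its own reconstruction, so that encoder and decoder stay in step. This is a simplified illustration, not the scheme from the slides: it assumes a first-order predictor with a fixed, hypothetical coefficient of 0.9 (the slide mentions orders up to 4-5) and a uniform quantizer for the prediction error.

```python
import math

def dpcm_encode(samples, step, coeff=0.9):
    """Quantize the prediction error; predict from the reconstructed signal."""
    codes, reconstructed = [], 0.0
    for sample in samples:
        prediction = coeff * reconstructed
        code = round((sample - prediction) / step)  # integer sent to the receiver
        codes.append(code)
        reconstructed = prediction + code * step    # what the decoder will also compute
    return codes

def dpcm_decode(codes, step, coeff=0.9):
    """Rebuild the signal from the quantized prediction errors."""
    output, reconstructed = [], 0.0
    for code in codes:
        reconstructed = coeff * reconstructed + code * step
        output.append(reconstructed)
    return output

# A 200 Hz tone sampled at 8 kHz stands in for a slowly varying speech segment.
samples = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
step = 0.01
codes = dpcm_encode(samples, step)
decoded = dpcm_decode(codes, step)

print(max(abs(c) for c in codes))                  # range needed for prediction errors
print(max(abs(round(s / step)) for s in samples))  # range needed for the raw samples
```

With the same step size, the prediction errors span a much smaller range of integers than the raw samples, so fewer bits are needed per sample.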

  8. Speech Coders

  9. Linear model of speech production

  10. A.G. Bell got it almost right

  11. linear model of speech (figure: source -> filter -> speech) • changes slowly

  12. Short-term and long-term prediction (figure: current sample on a time axis, with short-term prediction and long-term prediction) • short-term: resonance of vocal tract • long-term: periodicity of voiced speech (vocal cord vibration)
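Short-term prediction is usually obtained by linear prediction. The sketch below uses the autocorrelation method with the Levinson-Durbin recursion, which is one standard way to do this; the slides do not say which method is used. The test signal, a synthetic second-order resonance driven by noise, and all names are illustrative. Long-term (pitch) prediction is not covered here.

```python
import random

def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] in x[n] ~ sum_k a[k] * x[n-k].

    Returns (coefficients, residual energy)."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error

# Synthetic "vocal tract": one resonance, excited by white noise.
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(4000):
    x.append(1.6 * x[-1] - 0.8 * x[-2] + rng.gauss(0.0, 1.0))

coeffs, residual_energy = levinson_durbin(autocorrelation(x, 2), 2)
print(coeffs)  # should be close to the true values 1.6 and -0.8
```

The residual energy is much smaller than the signal energy, which is why transmitting the prediction error is cheaper than transmitting the samples.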

  13. LPC vocoder • The same principle as in H. Dudley’s Vocoder • Used by US Government (LPC-10): 2.4 kb/s

  14. Residual Excited LPC (RELP) • Transmitter: • Simplify prediction error (low-pass filter and down-sample) • Receiver: • re-introduce high frequencies in the simplified residual (nonlinear distortion)

  15. Analysis-by-synthesis • Identical synthesizer in coder and in decoder • change parameters in coder • use for synthesizing speech • compare synthesized speech with real speech • when “close enough”, send parameters to the receiver
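The loop described on this slide can be sketched with a toy synthesizer. Everything here is hypothetical: the synthesizer is a single sinusoid described by (frequency, amplitude), and the coder simply tries every candidate and keeps the one with the smallest squared error rather than stopping when the match is "close enough".

```python
import math

def synthesize(params, n_samples, sample_rate_hz=8000):
    """Toy synthesizer shared by coder and decoder: params = (frequency in Hz, amplitude)."""
    freq_hz, amplitude = params
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

def analysis_by_synthesis(target, candidates):
    """Try each parameter set, synthesize, compare with the real signal, keep the best."""
    best_params, best_error = None, float("inf")
    for params in candidates:
        trial = synthesize(params, len(target))
        error = sum((t - s) ** 2 for t, s in zip(target, trial))
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

# "Real" speech is replaced by a signal the toy synthesizer can reproduce exactly.
target = synthesize((200, 0.5), 160)
candidates = [(f, a) for f in (100, 150, 200, 250) for a in (0.25, 0.5, 1.0)]
best_params, best_error = analysis_by_synthesis(target, candidates)

# Only best_params would be transmitted; the decoder runs the same synthesizer.
decoded = synthesize(best_params, len(target))
print(best_params, best_error)
```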

  16. Future in speech coding? • No need to transmit what we do not hear • study human hearing, especially masking • No need to transmit what is predictable • speech production mechanism • speaker characteristics • linguistic code (recognition-synthesis) • thought-to-speech

  17. Automatic recognition of speech (figure: electric signal (more than 50 kb/s) -> knowledge: prior knowledge (textbook), acquired knowledge (data) -> phoneme string (below 50 b/s) -> linguistic message; reduce information = decrease entropy) • Automatic speech recognition (ASR) • derive proper response from speech stimulus • Auditory perception • how do biological systems respond to acoustic stimuli • Knowledge of auditory perception?

  18. Principle of stochastic ASR • Using a model of the speech production process, generate all possible acoustic sequences w_i for all legal linguistic messages • Compare all generated sequences with the unknown acoustic input x to find which one is the most similar • What is the model M(w_i)? • Form of the data x?

  19. One (simple) model (figure: "hello world" as the state sequence h e l o w o r l d u) • Two dominant sources of variability in speech • people say the same thing at different speeds (temporal variability) • different people sound different, communication environments differ (feature variability) • “Doubly stochastic” process (Hidden Markov Model) • Speech as a sequence of hidden states: recover the state sequence • never know for sure in which state we are • never know for sure which data can be generated from a given state

  20. Hidden Markov Model (figure: ten utterances of "hi" with f0 = 160 Hz, 170 Hz, 160 Hz, 170 Hz, 200 Hz, 110 Hz, 140 Hz, 240 Hz, 170 Hz, 190 Hz; labels m f m m m m m f m m; the model: states m and f, p1m, transition probabilities pm-f and pf-m, emission distributions P(sound|gender), pm and pf, over f0) • sequence of male and female groups?
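Recovering the hidden male/female sequence from the f0 values is a Viterbi decoding problem. The sketch below uses the ten f0 observations from the slide, but every model parameter (Gaussian means and standard deviations, start and transition probabilities) is a hypothetical value chosen for illustration; the slide gives none of them.

```python
import math

STATES = ("m", "f")
# Hypothetical parameters, not taken from the slides.
F0_MEAN_HZ = {"m": 120.0, "f": 210.0}
F0_STD_HZ = {"m": 30.0, "f": 30.0}
LOG_START = {"m": math.log(0.5), "f": math.log(0.5)}
LOG_TRANS = {
    "m": {"m": math.log(0.7), "f": math.log(0.3)},
    "f": {"m": math.log(0.3), "f": math.log(0.7)},
}

def log_emission(state, f0_hz):
    """Log of the Gaussian density P(sound | gender)."""
    z = (f0_hz - F0_MEAN_HZ[state]) / F0_STD_HZ[state]
    return -0.5 * z * z - math.log(F0_STD_HZ[state] * math.sqrt(2.0 * math.pi))

def viterbi(observations):
    """Most likely hidden state sequence for the observed f0 values."""
    score = {s: LOG_START[s] + log_emission(s, observations[0]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        previous, score, pointer = score, {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda q: previous[q] + LOG_TRANS[q][s])
            score[s] = previous[best_prev] + LOG_TRANS[best_prev][s] + log_emission(s, obs)
            pointer[s] = best_prev
        backpointers.append(pointer)
    path = [max(STATES, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]

f0_values = [160, 170, 160, 170, 200, 110, 140, 240, 170, 190]
print(viterbi(f0_values))
```

Both sources of uncertainty named on the previous slide appear here: the state is never observed directly, and a value such as 160 Hz could plausibly come from either state.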

  21. What should the x be? (figure: observations 160 170 160 170 200 110 140 240 170 190; hidden states f m m f m; x; units of speech (phonemes))

  22. Speech signal? • Reflects changes in acoustic pressure • its original purpose is reconstruction of speech • does carry relevant information • always also carries some irrelevant information • additional processing is necessary to alleviate it

  23. speech signal (figure panels: histogram, correlations)

  24. Where Is The Message? • it is in the spectrum!! (figure: Isaac Newton, λ/4, beer, /u/ /o/ /a/ /e/ /iy/; averaged FFT spectra of some vowels from 3 hours of fluent speech: /uw/ /ao/ /ah/ /eh/ /ih/ /iy/)

  25. Inertia in engineering (figure: Steam Engine (1769), Internal Combustion Engine (2003))

  26. Short-term Spectrum • get spectral components • 10-20 ms (figure: signal over time and its time-frequency representation, with labels /j/ /u/ /ar/ /j/ /o/ /j/ /o/)

  27. short-term speech spectral envelope (figure panels: histogram, correlations)

  28. logarithmic short-term speech spectral envelope (figure panels: histogram, correlations)

  29. cosine transform of logarithmic short-term speech spectral envelope (cepstrum) (figure panels: histogram, correlations)
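The chain of the last three slides (spectral envelope, its logarithm, then a cosine transform) can be sketched as follows. The unnormalized DCT-II used here is one common choice of cosine transform and is an assumption, as are the function name and the example values, which are made up for illustration.

```python
import math

def cepstrum_from_log_spectrum(log_spectrum, n_coeffs):
    """Cosine transform (unnormalized DCT-II) of a log spectral envelope."""
    n = len(log_spectrum)
    return [
        sum(value * math.cos(math.pi * k * (i + 0.5) / n)
            for i, value in enumerate(log_spectrum))
        for k in range(n_coeffs)
    ]

# Made-up spectral envelope (energies in 8 bands), falling off with frequency.
envelope = [8.0, 6.0, 4.0, 3.0, 2.0, 1.5, 1.2, 1.0]
log_envelope = [math.log(e) for e in envelope]
print(cepstrum_from_log_spectrum(log_envelope, 4))
```

Coefficient 0 reflects the overall level and the following coefficients describe progressively finer detail of the envelope shape.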

  30. What Is Wrong With the Short-term Spectrum? 1) inconsistent (same message, different representation) (figure: short-term spectrum over frequency -> auditory-like modifications -> “auditory-like” spectrum)

  31. Frequency resolution of the human ear decreases with frequency (figure: pitch of the tone, Mel scale)
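One widely used analytic approximation of the Mel scale is m = 2595 log10(1 + f/700). The slide shows only the curve, so this particular formula and the helper names below are assumptions. The sketch places band centres at equal Mel intervals and shows that their spacing in Hz grows with frequency.

```python
import math

def hz_to_mel(freq_hz):
    """Common analytic approximation of the Mel scale (an assumption, see above)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_frequencies(low_hz, high_hz, count):
    """Frequencies equally spaced on the Mel scale between low_hz and high_hz."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    return [mel_to_hz(low_mel + (high_mel - low_mel) * i / (count - 1))
            for i in range(count)]

centres = mel_spaced_frequencies(100.0, 4000.0, 10)
spacing = [b - a for a, b in zip(centres, centres[1:])]
print([round(c) for c in centres])
print([round(s) for s in spacing])  # spacing in Hz widens toward high frequencies
```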

  32. Emulating frequency resolution of the human ear with FFT (figure: signal over t -> FFT -> spectrum over f -> Σ -> “critical-band energy”)

  33. Equal Loudness Curves

  34. Perceptual Linear Prediction (PLP) [Hermansky 1990] • Auditory-like modifications of short-term speech spectrum prior to its approximation by all-pole autoregressive model • critical-band spectral resolution • equal-loudness sensitivity • intensity-loudness nonlinearity • Today applied in virtually all state-of-the-art experimental ASR systems

  35. Spectral Basis from LDA • LDA gives basis for projection of spectral space (figure: time-frequency representation with labels /j/ /u/ /ar/ /j/ /o/ /j/ /o/)

  36. LDA vectors from Fourier Spectrum (figure labels: 16 %, 63 %, 2 %, 12 %) • Spectral resolution of LDA-derived spectral basis is higher at low frequencies • Critical bands of human hearing are narrower at lower frequencies

  37. Sensitivity to Spectral Change (Malayath 1999) (figure panels: Cosine basis, LDA-derived bases, Critical-band filterbank)

  38. Shannon, Communication in the Presence of Noise (1949): Combination of channel and signal spectrum should be as flat (as random-like) as possible. • if signal could be controlled (e.g. in communication) • put more signal where there is less noise • sensory signal optimized for a given communication channel • if the receiver could be controlled • put more resources (introduce less noise) where there is more signal • biological system optimized for information extraction from sensory signals (figures: energy of the signal and level of noise in the channel over resource space)

  39. What Is Wrong With the Short-term Spectral Envelope? 2) Fragile (easily corrupted by minor disturbances) (figures: spectrum over f with additive band-limited noise and with linear (high-pass) filtering) • ignore the noisy parts of the spectrum • remove means from parts of the spectrum

  40. Simultaneous Masking (figure: tone at f in band-pass filtered noise centered at f; threshold of perception of the tone vs. noise bandwidth; critical bandwidth) • Nonlinear frequency resolution of hearing • Critical bands • up to ~600 Hz constant bandwidth • above 1 kHz constant Q

  41. More Important Outcome of Masking Experiments • What happens outside the critical band does not affect detection of events within the band!!! • Independent processing of parts of the spectrum? (Hermansky, Sharma and Pavel 1996; Bourlard and Dupont 1996) • Replace spectral vector S(frequency) by a matrix of posterior probabilities of acoustic events {p(f)} (figure: pf1 pf2 pf3 pf4 pf5 pf6 over frequency)

  42. What Is Wrong With the Short-term Spectral Envelope? 3) Coarticulation (inertia of organs of speech production) (figure: h e l o w o r l d u; coarticulation; human auditory perception)

  43. Masking in Time (figure: masker, signal at delay t; increase in threshold vs. time t from 0 to 200 ms; stronger masker) • suggests ~200 ms buffer in auditory system • also seen in perception of loudness, detection of short stimuli, gaps in tones, auditory afterimages, binaural release from masking, ... • what happens outside this buffer does not affect detection of signal within the buffer

  44. Short-term Features? (figure: ~10 ms of signal over time -> processing -> data x; longer time span? (~250 ms?))

  45. Cortical Receptive Fields • time-frequency distribution of the linear component of the most efficient stimulus that excites the given auditory neuron • Average of the first two principal components (83% of variance) along temporal axis from about 180 cortical receptive fields (from D. Klein 2004, unpublished)

  46. Data for Deriving Posterior Probabilities of Speech Events (figure: time-frequency patch of 250-1000 ms by 1-3 critical bands; axes FREQUENCY, TIME [s])

  47. How to Get Estimates of Temporal Evolution of Spectral Energy? With M. Athineos, D. Ellis (Columbia Univ), and P. Fousek (CTU Prague) (figure: 10-20 ms frames over time vs. data x from 200-1000 ms by 1-3 Bark; all-pole model of part of time-frequency plane)
