
Introduction to speech processing

Introduction to speech processing. Dr Christina Orphanidou christina.orphanidou@eng.ox.ac.uk. Speech. Speech is the fundamental (and most desirable) means of communication between humans. The speech signal carries many kinds of information: the content (or message)


Presentation Transcript


  1. Introduction to speech processing Dr Christina Orphanidou christina.orphanidou@eng.ox.ac.uk

  2. Speech
     • Speech is the fundamental (and most desirable) means of communication between humans.
     • The speech signal carries many kinds of information:
       • the content (or message)
       • the identity of the speaker
       • the intent of the speaker
       • the emotional state of the speaker
       • and much more
     • The question that we need to answer is: does the speech signal carry, in an identifiable way, pathological information about the speaker? Can we discover features from the speech signal which change in a consistent way when a certain pathology is present?
     • “You don’t sound well”...

  3. Where does speech come from? Speech is produced, perceived and understood by the most complex of all machines.

  4. The speech signal Speech can be defined as waves of air pressure created by airflow pressed out of the lungs and leaving through the mouth and nasal cavities. On its way from the lungs through the vocal tract, the air passes through the vocal folds (cords), making them vibrate at different frequencies.

  5. Speech production The vocal folds are thin, lip-like muscles located in the larynx. When they are closed, they are pressed against each other and block the airflow. Air under pressure from the lungs can force this block open. Once the air has passed, the pressure declines and the folds close again. This process repeats, vibrating the vocal folds to give a voiced sound. Men have longer vocal folds, which is why their frequency of vibration is lower and their voice deeper. Unvoiced sounds are formed when a constriction in the vocal tract causes air turbulence, which produces random noise. The resonant frequencies of the vocal tract tube are called the formant frequencies.

  6. Speech signal representation The four sounds of the word “test”. What can we notice?

  7. Fundamental speech model The basic acoustic properties of speech production are most often described using the source-filter model. This model assumes that a source or excitation waveform is input into a time-varying filter. The excitation waveform approximates the airflow produced by the lungs and the filter approximates the effect of the vocal tract shape on the excitation waveform.

  8. Source-filter model of speech production When a voiced sound is produced, the filter is excited by a periodic impulse train. When an unvoiced sound is produced, the filter is excited by random noise. The filter is described by an impulse response h(t), which is convolved with the excitation signal x(t) to produce the output signal y(t): $y(t) = x(t) * h(t) = \int_{-\infty}^{\infty} x(\tau)\, h(t - \tau)\, d\tau$. Ways of deconvolving this have been designed over the years in order to get representations of the excitation signal and the filter, both of which are relevant to speaker identity. The most popular model is that of Linear Prediction Coding.
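The convolution on slide 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling rate, the 100 Hz pitch, and the single damped 500 Hz resonance standing in for the vocal tract filter are all made-up values, not taken from the slides.

```python
import numpy as np

def synthesize(excitation, h):
    """Discrete source-filter output y = x * h, truncated to len(x)."""
    return np.convolve(excitation, h)[: len(excitation)]

fs = 8000                      # assumed sampling rate in Hz
num_samples = 800

# Voiced excitation: impulse train with an assumed 100 Hz pitch
# (one pulse every fs // 100 = 80 samples).
voiced_x = np.zeros(num_samples)
voiced_x[:: fs // 100] = 1.0

# Unvoiced excitation: white noise (seeded so the run is repeatable).
rng = np.random.default_rng(0)
unvoiced_x = rng.standard_normal(num_samples)

# Toy vocal tract filter: one damped resonance at 500 Hz
# (a hypothetical single "formant", 200 samples long).
t = np.arange(200) / fs
h = np.exp(-300.0 * t) * np.sin(2.0 * np.pi * 500.0 * t)

voiced_y = synthesize(voiced_x, h)
unvoiced_y = synthesize(unvoiced_x, h)
```

Once the start-up transient has passed, `voiced_y` repeats every 80 samples, while `unvoiced_y` stays noise-like but is shaped by the same resonance.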

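  9. Linear Prediction Coding The linear prediction coefficients are useful for storing speech and as features for the development of speech processing applications. The main idea of linear prediction coding is that a speech sample can be closely approximated by an autoregressive (AR) model, i.e. a linear combination of the previous samples: $x[n] = \sum_{k=1}^{p} a_k\, x[n-k] + e[n]$. Here x[n-k]: previous speech samples; p: order of the model; a_k: prediction coefficients; e[n]: prediction error. The predicted signal is: $\hat{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k]$, so that the prediction error is $e[n] = x[n] - \hat{x}[n]$. To find a_k we minimize the total squared prediction error: $E = \sum_{n} e^2[n] = \sum_{n} \Big( x[n] - \sum_{k=1}^{p} a_k\, x[n-k] \Big)^2$.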
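One common way to carry out this minimization is the autocorrelation method, which reduces the problem to a small linear system. The sketch below assumes that method (the slides do not say which solver was used) and checks it on a synthetic AR(2) signal whose coefficients are invented for the demonstration. The function name `lpc_autocorr` is hypothetical.

```python
import numpy as np

def lpc_autocorr(x, p):
    """Estimate a_1..a_p that minimise the summed squared prediction error,
    using the autocorrelation method (normal equations R a = r)."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation values r[0], ..., r[p].
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(p + 1)])
    # Toeplitz matrix with entries R[i, j] = r[|i - j|].
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# Demo on a synthetic AR(2) process (coefficients chosen for illustration):
# x[n] = 1.3 x[n-1] - 0.6 x[n-2] + e[n]
rng = np.random.default_rng(0)
true_a = np.array([1.3, -0.6])
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(2, len(x)):
    x[n] = true_a[0] * x[n - 1] + true_a[1] * x[n - 2] + e[n]

est_a = lpc_autocorr(x, p=2)   # should be close to true_a
```

In practice the Toeplitz system is usually solved with the Levinson-Durbin recursion rather than a general solver; `np.linalg.solve` is used here only to keep the sketch short.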

  10. The cepstrum-I Recall that in the source-filter model, the output signal s is expressed as the convolution of a rapidly varying excitation e and a slowly varying filter h: $s[n] = e[n] * h[n]$. The DTFT is then: $S(\omega) = E(\omega)\, H(\omega)$. The aim is to deconvolve the signal in order to obtain representations of the transfer function and the excitation. Cepstrum is an anagram of “spectrum” and it reflects the idea that it is a means of turning the spectrum “inside out”.

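  11. The cepstrum-II One way is to take the absolute value of the DFT representation and then take the logarithm, which turns the product into a sum: $\log|S(\omega)| = \log|E(\omega)| + \log|H(\omega)|$. The cepstrum of a sequence s(n) of finite length N is defined to be the inverse Fourier transform of its log spectrum: $c(n) = \frac{1}{N} \sum_{k=0}^{N-1} \log|S(k)|\, e^{j 2\pi k n / N}$. (Using the log magnitude as here gives the real cepstrum; the complex cepstrum uses the complex logarithm and so also retains the phase.) Since the index n represents the time at which the sample was taken, the unit of the cepstrum is time rather than frequency. The quantity measured is referred to as “quefrency”.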
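A minimal sketch of the real cepstrum, using NumPy's FFT, shows the key property: convolution in time becomes addition in the cepstral domain. The test signals are arbitrary illustrations, and the small constant `eps` is an assumption added only to avoid taking the logarithm of zero.

```python
import numpy as np

def real_cepstrum(s, eps=1e-12):
    """Inverse DFT of the log magnitude spectrum of s."""
    log_mag = np.log(np.abs(np.fft.fft(s)) + eps)
    return np.fft.ifft(log_mag).real

# Illustrative signals: a noise-like excitation and a decaying filter response.
rng = np.random.default_rng(0)
excitation = rng.standard_normal(256)
h = 0.9 ** np.arange(256)

# Circular convolution via the DFT, so that S(k) = E(k) H(k) holds exactly.
s = np.fft.ifft(np.fft.fft(excitation) * np.fft.fft(h)).real

c_s = real_cepstrum(s)
c_sum = real_cepstrum(excitation) + real_cepstrum(h)   # matches c_s
```

Because the two contributions add, the slowly varying filter (low quefrency) and the rapidly varying excitation (high quefrency) can be separated by keeping different parts of the cepstrum.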

  12. LPC Cepstral coefficients They are based upon the vocal tract approximation provided by linear prediction. Conveniently, there exists an algorithm which allows us to calculate these coefficients directly from the LPC parameters. The transfer function which describes the filter can be rewritten as: $H(z) = \dfrac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}$, where G is a gain term. The logarithm of the transfer function can be written as a Laurent expansion: $\log H(z) = \sum_{n=0}^{\infty} c_n z^{-n}$, with $c_0 = \log G$.

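  13. LPC Cepstral coefficients We write $u = z^{-1}$, so that $\log G - \log\big(1 - \sum_{k=1}^{p} a_k u^{k}\big) = \sum_{n=0}^{\infty} c_n u^{n}$, and differentiate both sides with respect to u: $\dfrac{\sum_{k=1}^{p} k\, a_k u^{k-1}}{1 - \sum_{k=1}^{p} a_k u^{k}} = \sum_{n=1}^{\infty} n\, c_n u^{n-1}$. Let m be a positive integer. After multiplying through by the denominator, we can compare coefficients of $u^{m-1}$ to get: $m\, a_m = m\, c_m - \sum_{k=1}^{m-1} k\, c_k\, a_{m-k}$, where $a_j = 0$ for $j > p$.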

  14. LPC Cepstral coefficients If we divide by m and then rearrange, we get: $c_m = a_m + \sum_{k=1}^{m-1} \dfrac{k}{m}\, c_k\, a_{m-k}$ (with $a_j = 0$ for $j > p$), which are the LPC cepstral coefficients. The LPC cepstral coefficients are good approximations to perceptually relevant characteristics of speech and are thus used extensively as speech features.
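The recursion can be written out directly. The sketch below assumes the sign convention $H(z) = G / (1 - \sum_k a_k z^{-k})$ and returns $c_1, \dots, c_M$ only (the gain term $c_0$ is left out). The function name `lpc_to_cepstrum` is hypothetical.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients to LPC cepstral coefficients.

    a[k - 1] holds a_k for k = 1..p; returns [c_1, ..., c_n_ceps] using
    c_m = a_m + sum_{k=1}^{m-1} (k / m) * c_k * a_{m-k}, with a_j = 0 for j > p.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)          # c[0] is unused here (it would be log G)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if m - k <= p:            # a_{m-k} is zero beyond the model order
                acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]

# Sanity check with a first-order model: for H = 1 / (1 - a z^{-1}),
# log H = sum_n (a^n / n) z^{-n}, so c_n should equal a^n / n.
ceps = lpc_to_cepstrum([0.5], 4)
```

The first-order case is a useful check because the series for $-\log(1 - a u)$ is known in closed form.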

  15. Estimating fundamental frequency Since we are performing Fourier analysis upon the log spectrum of the signal, we expect that there will be a peak in the cepstrum which corresponds to the regularly spaced harmonics in the spectrum. These harmonics in speech occur at integer multiples of the fundamental frequency of the signal. The peak therefore appears at a quefrency equal to the pitch period, and the reciprocal of that quefrency is a good approximation of the fundamental frequency, provided the peak is searched for within the typical range. Men: 70-250 Hz. Women: 150-400 Hz.
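The peak-picking idea can be sketched as follows. This is a simplified illustration, not a production pitch tracker: the Hamming window, the frame length, the sampling rate and the synthetic 200 Hz impulse-train "voice" are all assumptions made for the demonstration, and the search range simply spans the male and female ranges quoted above. The function name `cepstral_f0` is hypothetical.

```python
import numpy as np

def cepstral_f0(frame, fs, fmin=70.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame from its cepstral peak."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(windowed)) + 1e-10)
    cepstrum = np.fft.ifft(log_mag).real
    # Convert the frequency search range into a quefrency range (in samples).
    q_min = int(fs / fmax)
    q_max = int(fs / fmin)
    peak = q_min + int(np.argmax(cepstrum[q_min : q_max + 1]))
    return fs / peak              # pitch period in samples -> frequency in Hz

# Synthetic voiced frame: impulse train with a period of 80 samples,
# i.e. 200 Hz at the assumed 16 kHz sampling rate.
fs = 16000
frame = np.zeros(2048)
frame[::80] = 1.0

f0 = cepstral_f0(frame, fs)      # expected to land near 200 Hz
```

Restricting the search to the quefrencies between `fs / fmax` and `fs / fmin` keeps the low-quefrency vocal tract contribution from being mistaken for the pitch peak.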

  16. Understanding variations in speech
     • Linear prediction coefficients (and their variants) are the most popular way of modelling speech, since the coefficients approximate the formant frequencies of the vocal tract. A 128-sample frame of speech can be effectively approximated with 12 coefficients, which can model both the speech and the voice.
     • Speaker recognition technology has shown that, in addition to the formant characteristics, speaker identity is also correlated with prosody: loudness, speed, intonation and accent.
     • These characteristics can be understood by looking at two different speech spectra.

  17. Biomedical Speech Processing?
