Speech Processing: Speech Coding
Speech Coding • Definition: • Speech Coding is a process that leads to the representation of analog waveforms with sequences of binary digits. • Even though the availability of high-bandwidth communication channels has increased, speech coding for bit-rate reduction has retained its importance. • Reduced bit-rate transmission is required for • Cellular networks • Voice over IP • Coded speech • Is less sensitive than analog signals to transmission noise • Is easier to: • Protect against (bit) errors • Encrypt • Multiplex, and • Packetize • Typical scenario depicted in next slide (Figure 12.1) Veton Këpuska
Digital Telephone Communication System
Categorization of Speech Coders • Waveform Coders: • Quantize speech samples directly and operate at high bit rates, in the range of 16-64 kbps (kbps – kilobits per second). • Hybrid Coders: • Are partly waveform coders and partly speech model-based coders, and operate in the mid bit-rate range of 2.4-16 kbps. • Vocoders: • Largely model-based and operate in a low bit-rate range of 1.2-4.8 kbps. • Tend to be of lower quality than waveform and hybrid coders.
Quality Measurements • Quality of coding can be viewed as the closeness of the processed speech to the original speech or some other desired speech waveform. • Naturalness • Degree of background artifacts • Intelligibility • Speaker identifiability • Etc.
Quality Measurements • Subjective Measurement: • Diagnostic Rhyme Test (DRT) measures intelligibility. • Diagnostic Acceptability Measure and Mean Opinion Score (MOS) tests provide a more complete quality judgment. • Objective Measurement: • Segmental Signal-to-Noise Ratio (SNR) – average SNR over short-time segments • Articulation Index – relies on an average SNR across frequency bands.
Quality Measurements • A more complete list and definition of subjective and objective measures can be found in: • J.R. Deller, J.G. Proakis, and J.H.L. Hansen, "Discrete-Time Processing of Speech Signals", Macmillan Publishing Co., New York, NY, 1993. • S.R. Quackenbush, T.P. Barnwell, and M.A. Clements, "Objective Measures of Speech Quality", Prentice Hall, Englewood Cliffs, NJ, 1988.
Statistical Models • The speech waveform is viewed as a random process. • Various estimates are important from this statistical perspective: • Probability density • Mean, variance, and autocorrelation • One approach to estimating the probability density function (pdf) of x[n] is through a histogram: • Count the number of occurrences of speech-sample values in each of a set of amplitude ranges, for many speech samples over a long time duration. • Normalize the area of the resulting curve to unity.
Statistical Models • The histogram of speech (Davenport, Paez & Glisson) was shown to be approximated by a gamma density: • px(x) = (√3 / (8π·σx·|x|))^(1/2) · exp(−√3·|x| / (2σx)) • where σx is the standard deviation of the pdf. • A simpler approximation is given by the Laplacian pdf of the form: • px(x) = (1/(√2·σx)) · exp(−√2·|x| / σx)
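The Laplacian approximation above can be checked numerically. A minimal sketch (the function name and the integration grid are illustrative choices, not from the slides) verifying that the density integrates to one and has variance σx²:

```python
import numpy as np

def laplacian_pdf(x, sigma):
    """Laplacian approximation to the speech-amplitude pdf, with std sigma."""
    return np.exp(-np.sqrt(2) * np.abs(x) / sigma) / (np.sqrt(2) * sigma)

# Sanity checks via a Riemann sum on a wide grid:
sigma = 1.0
x = np.linspace(-20, 20, 400001)
dx = x[1] - x[0]
p = laplacian_pdf(x, sigma)
area = np.sum(p) * dx           # should be ~1
var = np.sum(x**2 * p) * dx     # should be ~sigma**2
```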
PDF of Speech
PDF Models of Speech
Scalar Quantization • Assume that a sequence x[n] was obtained from a speech waveform that has been lowpass-filtered and sampled at a suitable rate, with infinite amplitude precision. • The samples x[n] are quantized to a finite set of amplitudes denoted by x̂[n]. • Associated with the quantizer is a quantization step size Δ. • Quantization allows the amplitudes to be represented by a finite set of bit patterns – symbols. • Encoding: • Mapping of x̂[n] to a finite set of symbols. • This mapping yields a sequence of codewords denoted by c[n] (Figure 12.3a). • Decoding – the inverse process, whereby the transmitted sequence of codewords c'[n] is transformed back to a sequence of quantized samples (Figure 12.3b).
Scalar Quantization
Fundamentals • Assume a signal amplitude is quantized into M levels. • The quantizer operator is denoted by Q(x); thus • x̂[n] = Q(x[n]) = x̂i, if xi-1 < x[n] ≤ xi • where x̂i denotes the M possible reconstruction levels – quantization levels, with 1 ≤ i ≤ M • and xi denotes the M+1 possible decision levels, with 0 ≤ i ≤ M • If xi-1 < x[n] ≤ xi, then x[n] is quantized to the reconstruction level x̂i • x̂[n] is the quantized sample of x[n].
Fundamentals • Scalar Quantization Example: • Assume there are M=4 reconstruction levels. • The amplitude of the input signal x[n] falls in the range [0,1] • Decision levels and reconstruction levels are equally spaced: • Decision levels are [0, 1/4, 1/2, 3/4, 1] • Reconstruction levels are [1/8, 3/8, 5/8, 7/8] • Figure 12.4 in the next slide.
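The M=4 example above can be sketched as a small uniform quantizer; `uniform_quantize` is a hypothetical helper, not from the text:

```python
import numpy as np

def uniform_quantize(x, M=4, lo=0.0, hi=1.0):
    """Uniform quantizer: M cells of width delta on [lo, hi], with
    reconstruction levels at the cell midpoints."""
    delta = (hi - lo) / M
    i = np.clip(np.floor((x - lo) / delta), 0, M - 1).astype(int)  # cell index
    return lo + (i + 0.5) * delta

# The 2-bit example: decision levels [0, 1/4, 1/2, 3/4, 1],
# reconstruction levels [1/8, 3/8, 5/8, 7/8].
print(uniform_quantize(np.array([0.1, 0.3, 0.6, 0.99])))  # -> [0.125 0.375 0.625 0.875]
```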
Example of Uniform 2-bit Quantizer
Uniform Quantizer • A uniform quantizer is one whose decision and reconstruction levels are uniformly spaced. Specifically: • xi − xi-1 = Δ, 1 ≤ i ≤ M • x̂i − x̂i-1 = Δ, 2 ≤ i ≤ M • Δ is the step size, equal to the spacing between two consecutive decision levels, which is the same as the spacing between two consecutive reconstruction levels (Exercise 12.1). • Each reconstruction level is attached a symbol – the codeword. Binary numbers are typically used to represent the quantized samples (Figure 12.4).
Uniform Quantizer • Codebook: collection of codewords. • In general, with a B-bit binary codebook there are 2^B different quantization (or reconstruction) levels. • Bit rate is defined as the number of bits B per sample multiplied by the sample rate fs: I = B·fs • The decoder inverts the coder operation, taking the codeword back to a quantized amplitude value (e.g., codeword 01 back to its reconstruction level). • Often the goal of speech coding/decoding is to keep the bit rate as low as possible while maintaining a required level of quality. • Because the sampling rate is fixed for most applications, this goal implies that the bit rate be reduced by decreasing the number of bits per sample.
Uniform Quantizer • Designing a uniform scalar quantizer requires knowledge of the maximum value of the sequence. • Typically the range of the speech signal is expressed in terms of the standard deviation of the signal. • Specifically, it is often assumed that: −4σx ≤ x[n] ≤ 4σx, where σx is the signal's standard deviation. • Under the assumption that speech samples obey a Laplacian pdf, approximately 0.35% of the speech samples fall outside the range −4σx ≤ x[n] ≤ 4σx. • Assume a B-bit binary codebook ⇒ 2^B levels. • Maximum signal value xmax = 4σx.
Uniform Quantizer • For the uniform quantization step size we get: • Δ = 2xmax/2^B = 8σx/2^B • The quantization step size relates directly to the notion of quantization noise.
Quantization Noise • Two classes of quantization noise: • Granular Distortion • Overload Distortion • Granular Distortion • e[n] = x[n] − x̂[n], where x[n] is the unquantized signal and e[n] is the quantization noise. • For a given step size Δ, the magnitude of the quantization noise e[n] can be no greater than Δ/2, that is: • |e[n]| ≤ Δ/2 • Figure 12.5 depicts this property, where e[n] = x[n] − x̂[n].
Quantization Noise
Quantization Noise • Overload Distortion • Maximum-value constraint: • xmax = 4σx (−4σx ≤ x[n] ≤ 4σx) • For a Laplacian pdf, 0.35% of the speech samples fall outside the range of the quantizer. • Clipped samples incur a quantization error in excess of Δ/2. • Due to the small number of clipped samples, it is common to neglect these infrequent large errors in theoretical calculations.
Quantization Noise • Statistical Model of Quantization Noise • Desired approach in analyzing the quantization error in numerous applications. • The quantization error is considered an ergodic white-noise random process. • The autocorrelation function of such a process is expressed as: • re[m] = E(e[n]e[n+m]) = σe^2·δ[m]
Quantization Error • The previous expression states that the process is uncorrelated. • Furthermore, it is also assumed that the quantization noise and the input signal are uncorrelated, i.e., • E(x[n]e[n+m]) = 0, ∀m. • The final assumption is that the pdf of the quantization noise is uniform over the quantization interval: • pe(e) = 1/Δ for −Δ/2 ≤ e ≤ Δ/2, and 0 otherwise.
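These three assumptions can be checked empirically. A sketch assuming a Laplacian input and a 10-bit mid-tread uniform quantizer (both illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_x = 1.0
delta = 8 * sigma_x / 2**10                # step size of a 10-bit uniform quantizer
x = rng.laplace(scale=sigma_x / np.sqrt(2), size=200_000)  # Laplacian, std sigma_x
e = x - delta * np.round(x / delta)        # mid-tread quantization error

# Model checks: |e| <= delta/2, var(e) ~ delta**2/12, e ~ uncorrelated with x.
max_err = np.max(np.abs(e))
var_ratio = np.var(e) / (delta**2 / 12)
corr = np.corrcoef(x, e)[0, 1]
```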
Quantization Error • The stated assumptions are not always valid. • Consider a slowly varying (linearly varying) signal ⇒ then e[n] is also changing linearly and is signal-dependent (see Figure 12.5 in the previous slide). • Correlated quantization noise can be annoying. • When the quantization step is small, the assumptions that the noise is uncorrelated with itself and with the signal are roughly valid when the signal fluctuates rapidly among all quantization levels. • The quantization error then approaches a white-noise process with an impulsive autocorrelation and flat spectrum. • One can force e[n] to be white noise and uncorrelated with x[n] by adding white noise to x[n] prior to quantization.
Quantization Error • The process of adding white noise is known as dithering. • This decorrelation technique was shown to be useful not only in improving the perceptual quality of the quantization noise but also with image signals. • Signal-to-Noise Ratio (SNR) • A measure to quantify the severity of the quantization noise. • Relates the strength of the signal to the strength of the quantization noise.
Quantization Error • SNR is defined as: • SNR = σx^2/σe^2 • Given the assumptions for the • Quantizer range: 2xmax, and • Quantization interval: Δ = 2xmax/2^B, for a B-bit quantizer • Uniform pdf, it can be shown that (see Exercise 12.2): • σe^2 = Δ^2/12 = xmax^2/(3·2^(2B))
Quantization Error • Thus SNR can be expressed as: • SNR = σx^2/σe^2 = 3·2^(2B)·(σx/xmax)^2 • Or in decibels (dB) as: • SNR(dB) = 10·log10(σx^2/σe^2) ≈ 6.02·B + 4.77 − 20·log10(xmax/σx) • Because xmax = 4σx, SNR(dB) ≈ 6B − 7.2
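The 6B − 7.2 dB rule can be verified against simulation. The sketch below keeps only in-range (granular) samples, since the formula neglects overload distortion; names and parameter values are illustrative:

```python
import numpy as np

def pcm_snr_db(B):
    """Theoretical SNR of B-bit uniform PCM with xmax = 4*sigma_x (~ 6B - 7.2 dB)."""
    return 6.02 * B + 4.77 - 20 * np.log10(4.0)

rng = np.random.default_rng(1)
B, sigma_x = 12, 1.0
xmax = 4 * sigma_x
delta = 2 * xmax / 2**B
x = rng.laplace(scale=sigma_x / np.sqrt(2), size=500_000)  # Laplacian, std sigma_x
x = x[np.abs(x) <= xmax]             # keep the granular region; formula ignores overload
xq = delta * np.round(x / delta)     # mid-tread uniform quantizer
snr_db = 10 * np.log10(np.var(x) / np.var(x - xq))
```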
Quantization Error • The presented quantization scheme is called pulse code modulation (PCM). • B bits per sample are transmitted as a codeword. • Advantages of this scheme: • It is instantaneous (no coding delay) • It is independent of the signal content (voice, music, etc.) • Disadvantages: • It requires a minimum of 11 bits per sample to achieve "toll quality" (equivalent to typical telephone quality) • For a 10000 Hz sampling rate, the required bit rate is: • I = (11 bits/sample) × (10000 samples/sec) = 110,000 bps = 110 kbps • For a CD-quality signal with a sample rate of 20000 Hz and 16 bits/sample, SNR(dB) = 96 − 7.2 = 88.8 dB and the bit rate is 320 kbps.
Nonuniform Quantization • Uniform quantization may not be optimal (the quantization error may not be as small as possible for a given number of decision and reconstruction levels). • Consider, for example, a speech signal for which x[n] is much more likely to be in one particular region than in another (low values occurring much more often than high values). • This implies that decision and reconstruction levels are not being utilized effectively with uniform intervals over xmax. • A nonuniform quantization that is optimal (in a least-squared-error sense) for a particular pdf is referred to as the Max quantizer. • An example of a nonuniform quantizer is given in the figure on the next slide.
Nonuniform Quantization
Nonuniform Quantization • Max Quantizer • Problem Definition: For a random variable x with a known pdf, find the set of M quantizer levels that minimizes the quantization error. • Therefore, find the decision and reconstruction levels xi and x̂i, respectively, that minimize the mean-squared error (MSE) distortion measure: • D = E[(x − x̂)^2] • E denotes expected value and x̂ is the quantized version of x. • It turns out that the optimal decision level xk is the midpoint of the adjacent reconstruction levels: • xk = (x̂k + x̂k+1)/2, 1 ≤ k ≤ M−1
Nonuniform Quantization • Max Quantizer (cont.) • The optimal reconstruction level x̂k is the centroid of px(x) over the interval xk-1 ≤ x ≤ xk: • x̂k = ∫[xk-1, xk] x·px(x) dx / ∫[xk-1, xk] px(x) dx = ∫[xk-1, xk] x·p̃(x) dx • It is interpreted as the mean value of x over the interval xk-1 ≤ x ≤ xk for the normalized pdf p̃(x). • Solving the last two equations for xk and x̂k is a nonlinear problem in these two variables. • An iterative solution is used, which requires obtaining the pdf (which can be difficult).
Nonuniform Quantization
Companding • An alternative to the nonuniform quantizer is companding. • It is based on the fact that a uniform quantizer is optimal for a uniform pdf. • Thus, if a nonlinearity is applied to the waveform x[n] to form a new sequence g[n] whose pdf is uniform, then • a uniform quantizer can be applied to g[n] to obtain ĝ[n], as depicted in Figure 12.10 in the next slide.
Companding
Companding • A number of nonlinear approximations of the nonlinear transformation that achieves a uniform density are used in practice; these do not require a pdf measurement. • Specifically, A-law and μ-law companding. • μ-law coding is given by: • T(x[n]) = xmax · (log(1 + μ·|x[n]|/xmax) / log(1 + μ)) · sign(x[n]) • The CCITT international standard coder at 64 kbps is an example application of μ-law coding: • μ-law transformation followed by 7-bit uniform quantization, giving toll-quality speech. • Equivalent quality with straight uniform quantization would require 11 bits.
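The μ-law transformation and its inverse can be sketched directly from the formula above. μ = 255 is a common choice (an assumption here, not stated on the slide), and the function names are illustrative:

```python
import numpy as np

def mu_law_compress(x, mu=255.0, xmax=1.0):
    """Compressor: T(x) = xmax * log(1 + mu*|x|/xmax) / log(1 + mu) * sign(x)."""
    return xmax * np.log1p(mu * np.abs(x) / xmax) / np.log1p(mu) * np.sign(x)

def mu_law_expand(y, mu=255.0, xmax=1.0):
    """Expander: inverse of the compressor."""
    return (xmax / mu) * np.expm1(np.abs(y) * np.log1p(mu) / xmax) * np.sign(y)

# Low-level inputs are boosted before uniform quantization; the round trip
# through compressor and expander is the identity.
x = np.array([-0.5, -0.01, 0.0, 0.01, 0.5])
y = mu_law_compress(x)
```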
Adaptive Coding • Nonuniform quantizers are optimal for the long-term pdf of the speech signal. • However, considering that speech is a highly time-varying signal, one has to question whether a single pdf derived from a long-time speech waveform is a reasonable assumption. • Changes in the speech waveform: • Temporal and spectral variations due to transitions from unvoiced to voiced speech, • Rapid volume changes. • Approach: • Estimate a short-time pdf derived over 20-40 msec intervals. • Short-time pdf estimates are more accurately described by a Gaussian pdf, regardless of the speech class.
Adaptive Coding • A pdf derived from a short-time speech segment more accurately represents the speech nonstationarity. • One approach is to assume a pdf of a specific shape, in particular a Gaussian with unknown variance σ^2. • Measure the local variance, then adapt a nonuniform quantizer to the resulting local pdf. • This approach is referred to as adaptive quantization. • For a Gaussian we have: • px(x) = (1/√(2π·σx^2)) · exp(−x^2/(2σx^2))
Adaptive Coding • Measure the variance σx^2 of a sequence x[n] and use the resulting pdf to design an optimal Max quantizer. • Note that a change in the variance simply scales the time signal: • If E(x^2[n]) = σx^2, then E[(β·x[n])^2] = β^2·σx^2 • One therefore needs to design only one nonuniform quantizer with unity variance and scale the decision and reconstruction levels according to a particular variance. • Alternatively, fix the quantizer and apply a time-varying gain to the signal according to the estimated variance (scale the signal to match the quantizer).
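The second alternative (fixed quantizer plus time-varying gain) can be sketched as follows; the frame length, bit allocation, and a uniform rather than Max quantizer are illustrative assumptions:

```python
import numpy as np

def adaptive_quantize(x, B=4, frame=160):
    """Feed-forward adaptive quantization sketch: per-frame gain + fixed quantizer.

    The gain (local std) is estimated from the input itself and would be sent
    to the decoder as side information along with the codewords.
    """
    delta = 8.0 / 2**B                  # fixed quantizer designed for unit variance
    y = np.empty_like(x)
    for s in range(0, len(x), frame):
        seg = x[s:s + frame]
        g = np.std(seg) + 1e-12         # local gain estimate (avoid divide-by-zero)
        q = np.clip(delta * np.round(seg / g / delta), -4.0, 4.0)  # normalized quantize
        y[s:s + frame] = g * q          # decoder rescales with the transmitted gain
    return y

# Hypothetical input whose level jumps by 40 dB halfway through.
rng = np.random.default_rng(3)
x = np.concatenate([0.01 * rng.standard_normal(400), rng.standard_normal(400)])
y = adaptive_quantize(x)
```

Because the quantizer range tracks the local variance, the quiet first half is coded about as accurately (in relative terms) as the loud second half.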
Adaptive Coding
Adaptive Coding • There are two possible approaches for estimating a time-varying variance σ^2[n]: • Feed-forward method (shown in Figure 12.11), where the variance (or gain) estimate is obtained from the input. • Feed-back method, where the estimate is obtained from the quantizer output. • Advantage – no need to transmit extra side information (quantized variance). • Disadvantage – additional sensitivity to transmission errors in codewords. • Adaptive quantizers can achieve higher SNR than μ-law companding. • μ-law companding is generally preferred for high-rate waveform coding because of its lower background noise when the transmission channel is idle. • Adaptive quantization is useful in a variety of other coding schemes.
Differential and Residual Quantization • The methods presented so far are examples of instantaneous quantization. • Those approaches do not take advantage of the fact that speech, music, etc. are highly correlated signals: • Short-time (10-15 samples), as well as • Long-time (over a pitch period). • In this section, methods that exploit short-time correlation are investigated.
Differential and Residual Quantization • Short-time Correlation: • Neighboring samples are "self-similar", that is, not changing too rapidly from one another. • The difference of adjacent samples should have a lower variance than the variance of the signal itself. • This difference thus makes more effective use of the quantization levels: • Higher SNR for a fixed number of quantization levels. • Predict the next sample from previous ones (finding the best prediction coefficients to yield a minimum mean-squared prediction error – the same methodology as in LPC of Chapter 5). Two approaches: • Use a fixed prediction filter that reflects the average local correlation of the signal. • Allow the predictor to adapt short-time to the signal's local correlation. • The latter requires transmission of the quantized prediction coefficients as well as the prediction error.
Differential and Residual Quantization • An illustration of a particular error-encoding scheme is presented in Figure 12.12 of the next slide. • In this scheme the following sequences are required: • x̃[n] – prediction of the input sample x[n]; this is the output of the predictor P(z) whose input is the quantized version of the input signal, i.e., x̂[n] • r[n] – prediction error signal; the residual • r̂[n] – quantized prediction error signal. • This approach is sometimes referred to as residual coding.
Differential and Residual Quantization
Differential and Residual Quantization • The quantizer in the previous scheme can be of any type: • Fixed • Adaptive • Uniform • Nonuniform • Whatever the case, the parameters of the quantizer are determined so as to match the variance of r[n]. • Differential quantization can also be applied to: • Speech, music, … signals • Parameters that represent speech, music, …: • LPC – linear prediction coefficients • Cepstral coefficients obtained from homomorphic filtering • Sinewave parameters, etc.
Differential and Residual Quantization • Consider the quantization error of the quantized residual: • er[n] = r̂[n] − r[n] • From Figure 12.12 we express the quantized input x̂[n] as: • x̂[n] = x̃[n] + r̂[n] = x̃[n] + r[n] + er[n] = x[n] + er[n]
Differential and Residual Quantization • The quantized signal samples differ from the input only by the quantization error er[n]. • Since er[n] is the quantization error of the residual: • ⇒ If the prediction of the signal is accurate, then the variance of r[n] will be smaller than the variance of x[n] • ⇒ A quantizer with a given number of levels can be adjusted to give a smaller quantization error than would be possible when quantizing the signal directly.
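The whole differential scheme can be sketched as a first-order loop; the predictor coefficient, step size, and test signal below are illustrative assumptions, not values from the text:

```python
import numpy as np

def dpcm_encode_decode(x, a=0.9, delta=0.05):
    """First-order differential quantization sketch: fixed predictor P(z) = a*z^-1
    and a uniform residual quantizer. The predictor runs on the quantized signal
    x^[n] so that encoder and decoder stay synchronized."""
    xq = np.zeros_like(x)      # quantized signal x^[n]
    codes = np.zeros_like(x)   # quantized residuals r^[n] (what would be transmitted)
    pred = 0.0                 # prediction x~[n]
    for n in range(len(x)):
        r = x[n] - pred                     # residual r[n] = x[n] - x~[n]
        rq = delta * np.round(r / delta)    # quantized residual r^[n]
        codes[n] = rq
        xq[n] = pred + rq                   # x^[n] = x~[n] + r^[n] = x[n] + e_r[n]
        pred = a * xq[n]                    # next prediction from the quantized signal
    return xq, codes

# Slowly varying input: the residual variance is far below the signal variance,
# and the reconstruction error is bounded by delta/2 at every sample.
t = np.arange(400)
x = np.sin(2 * np.pi * t / 100)
xq, codes = dpcm_encode_decode(x)
```

Note the design choice the slides describe: feeding the predictor with x̂[n] rather than x[n] means the decoder, which only sees r̂[n], can form exactly the same predictions.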