Pitch-synchronous overlap add (TD-PSOLA)
Purpose: modify the pitch or timing of a signal
• PSOLA is a time domain algorithm
• Pseudo code
• Find the pitch points (epochs) of the signal
• Apply a Hanning window centered on each pitch point and extending to the next and previous pitch points
• Overlap and add the windowed waveforms back together
• To slow down speech, duplicate frames; to speed it up, remove frames
• Hanning windowing preserves the signal energy
• The result is undetectable if the epochs are accurately found. Why? We are not altering the vocal tract filter, only the spacing of the signal
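Because the overlap-add bookkeeping is the easy part to get wrong, here is a minimal NumPy sketch of the time-scaling case. It assumes the epoch array has already been found (for example by a glottal-closure detector); the function name and the alpha parameter are illustrative, not from the slides.

import numpy as np

def tdpsola_time_scale(x, epochs, alpha):
    # alpha > 1 slows speech down (duplicates frames); alpha < 1 speeds it up (drops frames).
    epochs = np.asarray(epochs)
    y = np.zeros(int(len(x) * alpha) + 1)
    t_syn = float(epochs[0]) * alpha                 # first synthesis pitch mark
    while t_syn < len(y) - 1:
        # pick the analysis epoch whose time-scaled position is closest to t_syn
        k = int(np.argmin(np.abs(epochs * alpha - t_syn)))
        if 0 < k < len(epochs) - 1:
            period = int((epochs[k + 1] - epochs[k - 1]) // 2)   # local pitch period
        else:
            period = int(np.median(np.diff(epochs)))
        lo, hi = epochs[k] - period, epochs[k] + period
        if lo < 0 or hi > len(x):
            t_syn += max(period, 1)
            continue
        seg = x[lo:hi] * np.hanning(hi - lo)         # two-period Hanning-windowed frame
        start = int(t_syn) - period
        if 0 <= start and start + len(seg) <= len(y):
            y[start:start + len(seg)] += seg         # overlap-add at the synthesis mark
        t_syn += max(period, 1)                      # advance by one local pitch period
    return y

Because the synthesis marks stay one local pitch period apart, the pitch is unchanged; only the number of frames (and hence the duration) changes.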
TD-PSOLA illustrations (figures): pitch modification (window and add); duration modification (insert or remove frames)
TD-PSOLA Pitch Points (Epochs)
• TD-PSOLA requires an exact marking of the pitch points in the time domain signal
• Pitch marks
• Marking any point within a pitch period is acceptable, as long as the algorithm marks the same point in every period
• The most common marking point is the instant of glottal closure, which shows up as a sharp descent in the time domain signal
• Collect the sample numbers of the marks into an analysis epoch sequence P = {p1, p2, …, pn}
• Estimate the local pitch period at epoch k as (pk+1 – pk-1)/2
TD-PSOLA Evaluation
• Advantages
• As a time domain algorithm, it is unlikely that any other approach will be more efficient (O(N))
• Listeners cannot perceive pitch or timing alterations of up to about 50%
• Disadvantages
• Epoch marking must be exact
• Only pitch and timing changes are possible; the spectral content of the voice cannot be modified
Time Domain Pitch Detection
• Auto correlation
• Correlate a window of speech with a previous window
• Find the lag with the best match
• Issue: too many false peaks
• Peak and center clipping
• An algorithm to reduce false peaks
• Clip the top and bottom of the signal
• Center the remainder around 0
• Other alternatives
• Researchers have proposed many other pitch detection algorithms
• There is much debate as to which is best
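As a concrete illustration of the clipping step, here is one common form of center clipping as a small NumPy sketch; the 30% threshold fraction is an assumption, not from the slides.

import numpy as np

def center_clip(frame, fraction=0.3):
    # Zero out samples whose magnitude is below a threshold and shift the rest
    # toward zero; this flattens formant ripples that cause false autocorrelation peaks.
    frame = np.asarray(frame, dtype=float)
    c = fraction * np.max(np.abs(frame))      # clipping level (assumed 30% of the peak)
    out = np.zeros_like(frame)
    out[frame > c] = frame[frame > c] - c     # keep only the portion above +c
    out[frame < -c] = frame[frame < -c] + c   # and below -c, re-centered around 0
    return out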
Auto Correlation
• Auto correlation: R(k) = (1/M) Σ n=0..M-1 xn xn-k, with xn-k = 0 if n-k < 0
  Find the k that maximizes the sum
• Difference function: D(k) = (1/M) Σ n=1..M-1 |xn – xn-k|, with xn-k = 0 if n-k < 0
  Find the k that minimizes the sum
• Considerations
• The difference approach is faster
• Both can produce false positives
• The YIN algorithm combines both techniques
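A minimal sketch of autocorrelation-based pitch estimation over a plausible search range (the 60-400 Hz limits are assumptions):

import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    # Return a pitch estimate in Hz by maximizing R(k) over lags k
    # corresponding to fmin..fmax.
    frame = np.asarray(frame, dtype=float)
    M = len(frame)
    lo, hi = int(fs / fmax), int(fs / fmin)
    best_k, best_r = lo, -np.inf
    for k in range(lo, min(hi, M)):
        r = np.dot(frame[k:], frame[:M - k]) / M   # (1/M) * sum of x[n] * x[n-k]
        if r > best_r:
            best_r, best_k = r, k
    return fs / best_k

The difference-function version is identical except that it accumulates |x[n] - x[n-k]| and returns the minimizing lag.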
Harmonic Product Spectrum
Pseudo code
• Divide the signal into frames (20-30 ms long)
• Perform an FFT on each frame
• Downsample the FFT magnitude spectrum by factors of 2, 3, 4 (taking every 2nd, 3rd, 4th value)
• Combine the original and downsampled spectra (the classic harmonic product spectrum multiplies them; adding gives the harmonic sum variant)
• The pitch harmonics will line up (the combined spectrum will "spike" at the pitch value)
• Find the spike: return fsample / fftSize * index
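A sketch of the product form for a single frame (the summing variant only changes the *= line to +=):

import numpy as np

def hps_pitch(frame, fs, max_downsample=4):
    # Harmonic product spectrum pitch estimate for one 20-30 ms frame.
    frame = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(frame))
    hps = spectrum.copy()
    for factor in range(2, max_downsample + 1):
        down = spectrum[::factor]                    # every 2nd, 3rd, 4th bin
        hps[:len(down)] *= down                      # harmonics line up at the pitch bin
    # ignore bin 0 (DC) and search only where all downsampled copies overlap
    index = int(np.argmax(hps[1:len(spectrum) // max_downsample])) + 1
    return fs / len(frame) * index                   # fsample / fftSize * index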
Background Noise
• Definition: an unwanted sound, or an unwanted perturbation of a wanted signal
• Examples:
• Clicks from microphone synchronization
• Ambient noise level: background noise
• Roadway noise
• Machinery
• Additional speakers
• Background activities: TV, radio, dog barks, etc.
• Classifications
• Stationary: does not change with time (e.g., a fan)
• Non-stationary: changes with time (e.g., a door closing, a TV)
Noise Spectra
Power measured as a function of frequency f
• White: constant over the range of f
• Pink: decreases by 3 dB per octave; perceived as equal loudness across f
• Brown(ian): decreases proportional to 1/f2 (6 dB per octave)
• Red: decreases with f (either pink or brown)
• Blue: increases proportional to f
• Violet: increases proportional to f2
• Gray: follows a psycho-acoustical (equal loudness) curve
• Orange: bands of zero power around musical notes
• Green: the noise of the world; pink with a bump near 500 Hz
• Black: zero power almost everywhere except for isolated spikes; also used for 1/fβ noise with β > 2
• Colored: any noise that is not white
Audio samples: http://en.wikipedia.org/wiki/Colors_of_noise
Signal Processing Information Base: http://spib.rice.edu/spib.html
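For reference, the first few colors can be generated by shaping white noise in the frequency domain; a minimal sketch (the function name and sample rate are illustrative):

import numpy as np

def colored_noise(n, exponent, fs=16000, seed=0):
    # Noise whose power spectrum falls off as 1/f**exponent:
    # exponent 0 = white, 1 = pink (-3 dB/octave), 2 = brown (-6 dB/octave).
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    freqs[0] = freqs[1]                              # avoid dividing by zero at DC
    spectrum *= freqs ** (-exponent / 2.0)           # amplitude ~ f^(-exponent/2)
    return np.fft.irfft(spectrum, n)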
Applications
• ASR: prevent significant degradation in noisy environments
  Goal: minimize recognition degradation when noise is present
• Sound editing and archival: improve the intelligibility of audio recordings
  Goals: eliminate perceptible noise; recover audio from wax recordings
• Mobile telephony: transmit audio in high-noise environments
  Goal: reduce transmission requirements
• Comparing audio signals: a variety of digital signal processing applications
  Goal: normalize audio signals for ease of comparison
Signal to Noise Ratio (SNR)
• Definition: the power ratio between a signal and the noise that interferes with it
• Standard equation in decibels: SNRdB = 10 log10 (Asignal/Anoise)2 = 20 log10 (Asignal/Anoise)
• For a frame of digitized speech: SNRf = 10 log10 ( Σ n=0..N-1 sf(n)2 / Σ n=0..N-1 nf(n)2 )
• sf is an array holding the samples of the frame
• nf is an array of noise samples
• Note: if sf(n) = nf(n) for every n, then SNRf = 0 dB
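The frame-level formula in code, as a small sketch (the function name is illustrative):

import numpy as np

def frame_snr_db(signal_frame, noise_frame):
    # SNR of one frame in dB: 10 * log10(signal power / noise power).
    s = np.asarray(signal_frame, dtype=float)
    n = np.asarray(noise_frame, dtype=float)
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum(n ** 2))

If the two frames are identical, the power ratio is 1 and the result is 0 dB, matching the note above.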
Stationary Noise Suppression
• Requirements
• Maximize the amount of noise removed
• Minimize signal distortion
• An efficient algorithm with low big-O complexity
• Problems
• There is a tradeoff between removing noise and distorting the signal
• More aggressive noise removal tends to distort the signal
• Popular approaches
• Time domain: moving average filter (distorts the frequency domain)
• Frequency domain: spectral subtraction
• Time domain: Wiener filter (using LPC)
Autoregressive Noise Removal
• Definition: an autoregressive process is one in which each value can be determined by a linear combination of previous values
• Formula: Xt = c + Σ i=1..P ai Xt-i + nt, where c is a constant, nt is the noise, and the summation is the pure signal
• This is none other than linear prediction; the noise is the residual
• Applying the LPC filter to the signal separates the noise from the signal (Wiener filter)
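A least-squares sketch of the idea: fit the AR coefficients and keep the prediction residual. A Wiener/LPC filter would typically use the Levinson-Durbin recursion instead; the order of 12 is an assumption.

import numpy as np

def lpc_residual(x, order=12):
    # Fit AR (LPC) coefficients by least squares and return the residual,
    # i.e., the part of the signal the linear predictor cannot explain.
    x = np.asarray(x, dtype=float)
    rows = len(x) - order
    # Row t holds the previous `order` samples x[t-1] ... x[t-order]
    X = np.column_stack([x[order - i - 1: order - i - 1 + rows] for i in range(order)])
    y = x[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)        # AR coefficients
    return y - X @ a                                 # residual = signal minus prediction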
Spectral Subtraction
Assumption: noisy signal yt = st + nt, where st is the clean signal and nt is additive noise
Pseudo code
• Perform an FFT on each windowed frame
• IF speech is not present, update the noise spectrum estimate: Nt = σ |Yt| + (1 – σ) Nt-1, where 0 ≤ σ ≤ 1
• ELSE subtract the estimated noise spectrum from the frame's magnitude spectrum
• Perform an inverse FFT using the noisy frame's phase
S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-27, Apr. 1979.
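A compact sketch of the whole loop, assuming a per-frame VAD decision is available (sigma and the hard zero floor are illustrative choices):

import numpy as np

def spectral_subtract(frames, is_speech, sigma=0.1):
    # Basic magnitude-domain spectral subtraction over a list of windowed frames.
    noise_est = None
    cleaned = []
    for frame, speech in zip(frames, is_speech):
        spectrum = np.fft.rfft(frame)
        magnitude, phase = np.abs(spectrum), np.angle(spectrum)
        if not speech:
            # update the running noise spectrum estimate during silence
            noise_est = magnitude if noise_est is None else \
                sigma * magnitude + (1.0 - sigma) * noise_est
        if noise_est is not None:
            # subtract, flooring negative bins at zero (the source of musical noise)
            magnitude = np.maximum(magnitude - noise_est, 0.0)
        # resynthesize using the noisy frame's phase
        cleaned.append(np.fft.irfft(magnitude * np.exp(1j * phase), len(frame)))
    return cleaned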
Implementation Issues
• Question: How do we estimate the noise?
  Answer: use the frequency distribution during times when no voice is present
• Question: How do we know when voice is present?
  Answer: use a Voice Activity Detection (VAD) algorithm
• Question: Even if we know the noise amplitudes, what about phase differences between the clean and noisy signals?
  Answer: human hearing largely ignores phase differences
• Question: Is the noise independent of the signal?
  Answer: we assume the noise is additive (linear) and does not interact with the signal
• Question: Are noise distributions really stationary?
  Answer: we assume they are
Phase Distortions
• Problem: we do not know how much of the phase in an FFT bin comes from the noise and how much from the speech
• Assumption: the algorithm assumes both share the same phase (that of the noisy signal)
• Result: as the SNR approaches 0 dB, the audio takes on a hoarse-sounding quality
• Why? The phase assumption means the expected noise magnitude is calculated incorrectly
• Conclusion: there is a limit to the utility of spectral subtraction when the SNR is close to zero
Evaluation
• Advantage: easy to understand and implement
• Disadvantages
• The noise estimate is not exact
• When it is too high, portions of the speech are lost
• When it is too low, some noise remains
• When the estimated noise magnitude in a bin exceeds the noisy signal's magnitude, the subtraction goes negative, which causes musical tone artifacts
• Non-linear or interacting noise
• Negligible with large SNR values
• Significant impact when the SNR is small
Musical Noise
Definition: random, isolated tone bursts scattered across the frequency range
Why? Most implementations set a frequency bin's magnitude to zero whenever noise subtraction would make it negative; the scattered bins that survive are heard as brief tones
Illustration legend: green dashes = noisy signal; solid line = noise estimate; black dots = projected clean signal
Spectral Subtraction Enhancements
• Eliminate negative frequency bins
• Reduce the noise estimate by some factor
• Vary the noise estimate factor in different frequency bands
• Larger in regions outside the human speech range
• Apply psycho-acoustical methods
• Only attempt to remove perceived noise, not all noise
• Human hearing masks sounds at adjacent frequencies
• A loud sound masks other sounds even after it ceases
• Adaptive noise estimation: Nt(f) = λF Gt(f) + (1 – λF) Nt-1(f), where Gt(f) is the current frame's spectrum
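A sketch combining two of these ideas, a frequency-dependent subtraction factor plus a small spectral floor instead of hard zeroing (the band edges, factors, and floor value are assumptions):

import numpy as np

def enhanced_subtract(magnitude, noise_est, freqs, floor=0.02):
    # Subtract a scaled noise estimate, using a larger factor outside the main
    # speech band (assumed 300-3400 Hz), and keep a small spectral floor
    # rather than clamping bins to zero, which reduces musical noise.
    alpha = np.where((freqs >= 300) & (freqs <= 3400), 1.0, 2.0)
    cleaned = magnitude - alpha * noise_est
    return np.maximum(cleaned, floor * magnitude)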
Acoustical Effects
• Characteristic Frequency (CF): the frequency that causes the maximum response at a given point on the cochlea's basilar membrane
• Neurons exhibit a maximum response for about 20 ms and then decrease to a steady state shortly after the stimulus is removed
• Masking effects can be simultaneous or temporal
• Simultaneous: one signal drowns out another
• Temporal: one signal masks another that is close to it in time
• Forward: the masking is still audible after the masker is removed (5 ms – 150 ms)
• Backward: a weak signal is masked by a strong one that follows it (about 5 ms)
Voice Activity Detector (VAD)
• Many VAD algorithms exist
• Possible approaches to consider
• Energy above the background noise level
• Low zero-crossing rate
• Determine whether pitch is present
• Low fractal dimension compared to pure noise
• Low LPC residual
• General principle: it is better to misclassify noise as speech than to misclassify speech as noise
• Standard algorithms exist for telephone and cell phone environments
Possible VAD Algorithm
Note: the energy and zero-crossing statistics of the noise are estimated from the initial ¼ second
boolean vad(double[] frame)   // returns true if speech is present
  IF frame energy < low noise threshold (standard deviation units) RETURN false
  IF frame energy > high noise threshold RETURN true
  FOR each forward frame
    IF forward frame energy < low noise threshold RETURN false
    IF forward frame energy > high noise threshold
      COUNT the frames in the previous ¼ second having a large zero-crossing rate
      IF count > zero-crossing threshold (standard deviation units)
        AND this frame's index > the index of the first frame whose zero-crossing rate exceeds the threshold
          RETURN true
  RETURN false
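A simplified NumPy rendering of the energy-plus-zero-crossing idea (it drops the slide's forward-frame look-ahead; the threshold multipliers are assumptions, and the ¼-second noise estimate is carried over from the slide):

import numpy as np

def zero_crossing_rate(frame):
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0

def simple_vad(frames, fs, frame_len):
    # Label each frame True (speech) or False (noise); the first 1/4 second
    # is assumed to be noise and sets the energy and zero-crossing thresholds.
    energies = np.array([np.sum(np.asarray(f, dtype=float) ** 2) for f in frames])
    zcrs = np.array([zero_crossing_rate(np.asarray(f, dtype=float)) for f in frames])
    n_noise = max(1, int(0.25 * fs / frame_len))           # frames in the first 1/4 s
    e_mu, e_sd = energies[:n_noise].mean(), energies[:n_noise].std() + 1e-12
    z_mu, z_sd = zcrs[:n_noise].mean(), zcrs[:n_noise].std() + 1e-12
    low, high = e_mu + 2 * e_sd, e_mu + 5 * e_sd           # assumed thresholds
    decisions = []
    for e, z in zip(energies, zcrs):
        if e > high:
            decisions.append(True)                         # clearly above the noise floor
        elif e > low and z > z_mu + 2 * z_sd:
            decisions.append(True)                         # moderate energy plus high ZCR
        else:
            decisions.append(False)
    return decisions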