Acoustics of Speech

Acoustics of Speech Lecture 4 Spoken Language Processing Prof. Andrew Rosenberg

Overview • What is in a speech signal? • Defining cues to phonetic segments and intonation. • Techniques to extract these cues.

Phone Recognition • Goal: Distinguishing One Phoneme from Another…Automatically • ASR: Did the caller say “I want to fly to Newark” or “I want to fly to New York”? • Forensic Linguistics: Did that person say “Kill him” or “Bill him”? • What evidence is available in the speech signal? • How accurately and reliably can we extract it? • What makes this difficult? What makes it easy?

Prosody and Intonation • How things are said is sometimes critical and often useful for understanding • Forensic Linguistics: “Kill him!” vs. “Kill him?” • TTS: “Travelling from Boston?” vs. “Travelling from Boston.” • What information do we need to extract from/generate in the speech signal? • What tools do we have to do this?

Speech Features • What cues are important? • Spectral Features • Fundamental Frequency (pitch) • Amplitude/energy (loudness) • Timing (pauses, rate) • Voice Quality • How do we extract these? • Digital Signal Processing • Tools and Algorithms • Praat • Wavesurfer • Xwaves

Sound Production • Pressure fluctuations in the air caused by a voice, a musical instrument, a car horn, etc. • Sound waves propagate through a medium: usually air, but also solids, liquids, etc. • They cause the eardrum (tympanum) to vibrate • The auditory system translates this into neural impulses • The brain interprets these as sound • We represent sounds as change in pressure over time

How “loud” are sounds?

Voiced Sounds are (mostly) Periodic • Simple Periodic Waves (sine waves) defined by • Frequency: how often the pattern repeats per unit of time (usually per second) • Cycle: one repetition • Period: duration of one cycle • Frequency in Hertz (Hz): cycles per second, or 1 / period • E.g. 400 Hz = 1 / 0.0025 s (a cycle has a period of 0.0025 seconds; 400 cycles complete in a second) • Zero crossing: where the waveform crosses the x-axis
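
The relation between period, frequency and zero crossings can be checked numerically. The following is a minimal sketch, not part of the original slides: the 16 kHz sample rate, the small phase offset and the helper names are all assumptions chosen for illustration.

```python
import math

def sine_wave(freq_hz, duration_s, sample_rate_hz, amplitude=1.0, phase_rad=0.0):
    """Return samples of amplitude * sin(2*pi*f*t + phase)."""
    n_samples = int(round(duration_s * sample_rate_hz))
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz + phase_rad)
        for i in range(n_samples)
    ]

def count_zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if (prev < 0) != (cur < 0))

freq_hz = 400.0
period_s = 1.0 / freq_hz  # duration of one cycle, in seconds

# A small phase offset (an assumption) keeps samples from landing exactly on zero.
wave = sine_wave(freq_hz, duration_s=1.0, sample_rate_hz=16000, phase_rad=0.1)

# A sine wave crosses zero twice per cycle, so crossings / 2 approximates Hz.
estimated_freq_hz = count_zero_crossings(wave) / 2.0
print(period_s, estimated_freq_hz)
```

Counting zero crossings only works this cleanly for a simple sine wave; complex waveforms can cross zero several times per cycle.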

Voiced Sounds are (mostly) Periodic • Simple Periodic Waves (sine waves) defined by • Amplitude: peak deviation of pressure from normal atmospheric pressure • Phase: timing of a waveform relative to a reference point

Phase Differences
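
A small numerical illustration of why phase matters when waves are added: two sine waves of the same frequency reinforce each other when in phase and cancel when half a cycle apart. This is a sketch under assumed values (100 Hz tones, 8 kHz sample rate); the helper names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 8000  # assumed sample rate

def sine_wave(freq_hz, phase_rad, n_samples):
    """Unit-amplitude sine wave with a given starting phase."""
    return [
        math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ + phase_rad)
        for i in range(n_samples)
    ]

def peak_of_sum(wave_a, wave_b):
    """Largest absolute value of the sample-by-sample sum."""
    return max(abs(a + b) for a, b in zip(wave_a, wave_b))

n = 800
reference = sine_wave(100, 0.0, n)
in_phase = sine_wave(100, 0.0, n)            # same timing as the reference
opposite_phase = sine_wave(100, math.pi, n)  # shifted by half a cycle

peak_in_phase = peak_of_sum(reference, in_phase)        # amplitudes add
peak_opposite = peak_of_sum(reference, opposite_phase)  # amplitudes cancel
print(peak_in_phase, peak_opposite)
```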

Complex Periodic Waves • Cyclic but composed of multiple sine waves • Fundamental Frequency (F0): rate at which the largest pattern repeats • Also the GCD of the component frequencies • Harmonics: rates of the shorter patterns, at integer multiples of F0 • Any complex waveform can be analyzed into its component sine waves with their frequencies, amplitudes and phases (Fourier theorem, in 2 lectures)

2 sine waves -> 1 complex wave

4 sine waves -> 1 complex wave
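
The claim that F0 is the GCD of the component frequencies can be tried out on a synthetic example. The sketch below is illustrative only: the component frequencies (200, 300 and 400 Hz) and the 8 kHz sample rate are assumptions, not values from the slides.

```python
import math
from functools import reduce

SAMPLE_RATE_HZ = 8000                # assumed sample rate
component_freqs_hz = [200, 300, 400]  # assumed component sine waves

def complex_wave(freqs_hz, n_samples):
    """Sum unit-amplitude sine waves into one complex periodic wave."""
    return [
        sum(math.sin(2 * math.pi * f * i / SAMPLE_RATE_HZ) for f in freqs_hz)
        for i in range(n_samples)
    ]

def max_mismatch(wave, shift):
    """Largest difference between the wave and a copy shifted by `shift` samples."""
    return max(abs(wave[i] - wave[i + shift]) for i in range(len(wave) - shift))

f0_hz = reduce(math.gcd, component_freqs_hz)  # greatest common divisor
period_samples = SAMPLE_RATE_HZ // f0_hz      # samples in one F0 cycle

wave = complex_wave(component_freqs_hz, 4 * period_samples)
mismatch_full_period = max_mismatch(wave, period_samples)       # pattern repeats
mismatch_half_period = max_mismatch(wave, period_samples // 2)  # pattern does not
print(f0_hz, mismatch_full_period, mismatch_half_period)
```

Note that no component has a frequency of 100 Hz here, yet the summed pattern still repeats 100 times per second.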

Power Spectra and Spectrograms • Frequency components of a complex waveform are represented in the power spectrum. • Plots frequency and amplitude of each component sine wave • Adding a temporal dimension -> Spectrogram • Obtained via Fast Fourier Transform (FFT), Linear Predictive Coding (LPC)… • Useful for analysis, coding and synthesis.
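
As a rough illustration of a power spectrum, the sketch below computes a plain discrete Fourier transform (DFT) of a two-component wave and picks out the strongest bins. It is an assumption-laden toy: an FFT computes the same values faster, and the signal, sample rate and function names are invented for the example.

```python
import cmath
import math

def dft(samples):
    """Direct O(N^2) discrete Fourier transform (an FFT gives the same result)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def power_spectrum(samples, sample_rate_hz):
    """Return (frequency_hz, power) pairs from 0 Hz up to half the sample rate."""
    n = len(samples)
    spectrum = dft(samples)
    return [(k * sample_rate_hz / n, abs(spectrum[k]) ** 2 / n) for k in range(n // 2 + 1)]

sample_rate_hz = 1600  # assumed
n_samples = 160        # 0.1 s of signal, so bins are 10 Hz apart
signal = [
    math.sin(2 * math.pi * 100 * i / sample_rate_hz)
    + 0.5 * math.sin(2 * math.pi * 300 * i / sample_rate_hz)
    for i in range(n_samples)
]

spectrum = power_spectrum(signal, sample_rate_hz)
strongest = sorted(spectrum, key=lambda pair: pair[1], reverse=True)[:2]
peak_freqs_hz = sorted(freq for freq, _ in strongest)
print(peak_freqs_hz)
```

The two strongest bins fall at the frequencies of the two component sine waves, and the louder component carries more power.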

Example Power Spectrum • Australian male /i:/ from “heed” • FFT analysis window: 12.8 ms • Source: http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html

Example Spectrogram

Terms • Spectral Slice: plots the amplitude at each frequency • Spectrograms: plots amplitude and frequency over time • Harmonics: components of a complex waveform that are multiples of the fundamental frequency (F0) • Formants: frequency bands that are most amplified in speech.
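
To make the difference between a spectral slice and a spectrogram concrete, here is a minimal sketch that cuts a signal into frames and computes one slice per frame. Everything specific is an assumption for illustration: the test signal switches from 200 Hz to 400 Hz halfway through, frames do not overlap, and no window function is applied (real tools such as Praat use overlapping, tapered windows).

```python
import cmath
import math

def spectral_slice(frame):
    """Magnitude at each frequency bin from 0 Hz to half the sample rate."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(samples, frame_length):
    """One spectral slice per non-overlapping frame: amplitude by frequency by time."""
    starts = range(0, len(samples) - frame_length + 1, frame_length)
    return [spectral_slice(samples[s:s + frame_length]) for s in starts]

sample_rate_hz = 1600  # assumed
frame_length = 80      # 50 ms frames, so bins are 20 Hz apart
signal = [
    math.sin(2 * math.pi * (200 if i < 160 else 400) * i / sample_rate_hz)
    for i in range(320)
]

slices = spectrogram(signal, frame_length)
dominant_freqs_hz = [
    max(range(len(s)), key=s.__getitem__) * sample_rate_hz / frame_length
    for s in slices
]
print(dominant_freqs_hz)
```

Each entry of `dominant_freqs_hz` is the strongest frequency in one frame, so the list traces how the spectrum changes over time.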

Aperiodic Waveforms • Waveforms with random or non-repeating patterns • Random aperiodic waveforms: white noise • Flat spectrum: equal amplitude for all frequency components. • Transients: sudden bursts of pressure (clicks, pops, lip smacks, door slams, etc.) • A single impulse has a flat spectrum • Voiceless consonants
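
The flat spectrum of a transient can be verified directly for the idealized case of a single-sample impulse. This is a sketch with assumed values (64 samples, impulse at index 10); white noise is only flat on average, so it is not shown here.

```python
import cmath

def magnitude_spectrum(samples):
    """Magnitude of each DFT bin from 0 Hz to half the sample rate."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

# An idealized click: silence everywhere except one sample.
impulse = [0.0] * 64
impulse[10] = 1.0

magnitudes = magnitude_spectrum(impulse)
spread = max(magnitudes) - min(magnitudes)  # near zero when the spectrum is flat
print(spread)
```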

Speech Waveforms • Airflow from the lungs plus vocal fold vibration is filtered by the resonances of the vocal tract to produce complex, periodic waveforms. • Pitch range, mean, max: cycles per second of the lowest-frequency periodic component of a signal = “Fundamental frequency (F0)” • Loudness • RMS amplitude • Intensity: in dB relative to P0, a reference pressure
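
The two loudness measures can be sketched as follows. The formulas used here are the common textbook ones and are an assumption about what the slide showed: RMS = sqrt((1/N) * sum of x_i^2), and intensity in dB = 20 * log10(RMS / P0), which equals 10 * log10 of the power ratio. P0 is taken to be 20 micropascals, the usual reference for sound pressure level in air.

```python
import math

REFERENCE_PRESSURE_PA = 20e-6  # assumed P0: 20 micropascals

def rms_amplitude(samples):
    """Root mean square: square each sample, average, take the square root."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def intensity_db(samples, reference=REFERENCE_PRESSURE_PA):
    """Level in dB of the RMS pressure relative to the reference pressure."""
    return 20.0 * math.log10(rms_amplitude(samples) / reference)

# One second of a 100 Hz tone with a peak pressure of 1 Pa (hypothetical values).
tone = [math.sin(2 * math.pi * 100 * i / 8000) for i in range(8000)]
louder_tone = [2.0 * x for x in tone]

tone_rms = rms_amplitude(tone)
tone_db = intensity_db(tone)
gain_db = intensity_db(louder_tone) - tone_db  # effect of doubling the amplitude
print(tone_rms, tone_db, gain_db)
```

Doubling the amplitude raises the level by about 6 dB, which is why dB is convenient for comparing sounds across a very wide pressure range.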

Collecting speech for analysis? • Recording conditions • A quiet office, a sound booth, an anechoic chamber • Microphones convert sound into electrical current • oscillations of air pressure are converted to oscillations of current • Analog devices (e.g. tape recorders) store these as a continuous signal • Digital devices (e.g. DAT, computers) convert to a digital signal (digitizing)

Digital Sound Representation • A microphone is a mechanical eardrum, capable of measuring change in air pressure over time. • Digital recording converts analog (smoothly continuous) changes in air pressure over time to a digital signal. • The digital representation: • measures the pressure at a fixed time interval (the sampling rate) • represents pressure as an integer value (the bit depth) • The analog to digital conversion results in a loss of information.

Waveform – “Name”

Analog to Digital Conversion • “Quantization” or “Discretization”

Analog to Digital Conversion • Bit depth impact • 16bit sound – CD Quality • 8bit sound • Sampling rate impact • 44.1kHz • 16kHz • 8kHz • 4kHz
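
The effect of bit depth can be sketched by rounding samples to a limited set of integer levels and measuring the error. This is one simple signed-integer scheme chosen for illustration; real converters and file formats differ in details, and the function names are hypothetical.

```python
import math

def quantize(sample, bit_depth):
    """Map a sample in [-1.0, 1.0] to a signed integer of the given bit depth."""
    levels = 2 ** (bit_depth - 1)  # e.g. 32768 for 16-bit audio
    value = round(sample * (levels - 1))
    return max(-levels, min(levels - 1, value))

def dequantize(value, bit_depth):
    """Map the stored integer back to the [-1.0, 1.0] range."""
    return value / (2 ** (bit_depth - 1) - 1)

def max_quantization_error(samples, bit_depth):
    """Worst-case difference between a sample and its quantized version."""
    return max(abs(x - dequantize(quantize(x, bit_depth), bit_depth)) for x in samples)

# A 440 Hz tone sampled at 8 kHz (assumed values).
tone = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(800)]

error_8bit = max_quantization_error(tone, 8)
error_16bit = max_quantization_error(tone, 16)
print(error_8bit, error_16bit)
```

Each extra bit halves the step between levels, so 16-bit audio has a much smaller rounding error than 8-bit audio.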

Nyquist Rate • At least 2 samples per cycle are necessary to capture the periodicity of a waveform at a given frequency • 100Hz needs 200 samples per sec • Nyquist Rate: the minimum sampling rate for a given frequency (twice that frequency) • Nyquist Frequency: the highest frequency that can be captured with a given sampling rate (half the sampling rate) • 8kHz sampling rate (Telephone speech) can capture frequencies up to 4kHz
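
What happens above the Nyquist frequency can be shown with a short sketch: at an 8 kHz sampling rate, a 5 kHz tone produces exactly the same samples as an inverted 3 kHz tone, so the two cannot be told apart after sampling (aliasing). The tone frequencies and helper names are assumptions for illustration.

```python
import math

def nyquist_frequency_hz(sample_rate_hz):
    """Highest frequency representable at this sampling rate."""
    return sample_rate_hz / 2.0

def sample_tone(freq_hz, sample_rate_hz, n_samples):
    """Sample a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n_samples)]

sample_rate_hz = 8000  # telephone speech
limit_hz = nyquist_frequency_hz(sample_rate_hz)

tone_5k = sample_tone(5000, sample_rate_hz, 64)  # above the limit
tone_3k = sample_tone(3000, sample_rate_hz, 64)  # below the limit

# The 5 kHz samples equal the negated 3 kHz samples, so their sum vanishes.
max_difference = max(abs(a + b) for a, b in zip(tone_5k, tone_3k))
print(limit_hz, max_difference)
```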

Sampling/storage trade off • Human hearing: ~20kHz top frequency • Should we store 40kHz samples? • Telephone speech: 300Hz-4kHz (8kHz sampling) • But some speech sounds (e.g., fricatives, stops) have energy above 4kHz • Peter, Teeter, Dieter • 44.1kHz (CD quality) vs. 16-22kHz • 16-22kHz is usually good enough to study speech: amplitude, duration, pitch, etc. • Golden Ears.
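
The storage side of the trade-off is simple arithmetic: bytes = sampling rate x bytes per sample x channels x seconds. The sketch below assumes uncompressed mono audio; pairing 8 kHz with 8-bit samples for telephone speech is an assumption for the example.

```python
def bytes_per_minute(sample_rate_hz, bit_depth, channels=1):
    """Uncompressed storage for one minute of audio."""
    bytes_per_sample = bit_depth // 8
    return sample_rate_hz * bytes_per_sample * channels * 60

cd_mono = bytes_per_minute(44100, 16)     # CD-quality rate and bit depth, one channel
speech_16k = bytes_per_minute(16000, 16)  # a common rate for speech analysis
telephone = bytes_per_minute(8000, 8)     # assumed 8-bit telephone speech

print(cd_mono, speech_16k, telephone)
```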

Filtering • Acoustic filters block out certain frequencies of sounds • Low-pass filter blocks high frequency components • High-pass filter blocks low frequencies • Band-pass filter blocks both high and low frequencies, passing only a band in between • Reject band (what to block) vs. pass band (what to let through) • What if the frequencies of two sounds overlap? • Source Separation
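
A moving average is one of the crudest low-pass filters, and it is used here only as an illustrative stand-in (the slides do not name a filter design). Averaging over 8 samples at 8 kHz spans exactly one cycle of a 1000 Hz tone, so that tone is blocked, while a 50 Hz tone passes almost unchanged. All values and names are assumptions.

```python
import math

def moving_average_low_pass(samples, window):
    """Average each run of `window` consecutive samples (output is window - 1 shorter)."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(len(samples) - window + 1)
    ]

def rms(samples):
    """Root mean square amplitude."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

sample_rate_hz = 8000
low_tone = [math.sin(2 * math.pi * 50 * i / sample_rate_hz) for i in range(1600)]
high_tone = [math.sin(2 * math.pi * 1000 * i / sample_rate_hz) for i in range(1600)]

window = 8  # one full cycle of the 1000 Hz tone
low_rms_after = rms(moving_average_low_pass(low_tone, window))    # mostly preserved
high_rms_after = rms(moving_average_low_pass(high_tone, window))  # removed
print(low_rms_after, high_rms_after)
```

Subtracting the low-pass output from the original signal would leave the high-frequency part, which is one simple way to think about a high-pass filter.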

Estimating pitch • Pitch Tracking: Estimate F0 over time as a function of vocal fold vibration • How? Autocorrelation approach • A periodic waveform is correlated with itself, since one period looks like another • Find the period by finding the “lag” (offset) between two windows of the signal where the correlation of the windows is highest • Lag duration, T, is one period of the waveform • F0 is the inverse: 1/T
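
The autocorrelation idea can be sketched in a few lines: slide one window of the signal against another, keep the lag with the highest correlation, and invert it. This is a minimal illustration on one synthetic frame, not a production pitch tracker; the 75-500 Hz search range, the window length and the function names are assumptions.

```python
import math

def correlation_at_lag(samples, lag, window):
    """Correlate the first `window` samples with the window starting `lag` samples later."""
    return sum(samples[i] * samples[i + lag] for i in range(window))

def estimate_f0_hz(samples, sample_rate_hz, min_f0_hz, max_f0_hz, window):
    """Return 1/T, where T is the lag (in seconds) with the highest correlation."""
    min_lag = int(sample_rate_hz // max_f0_hz)  # shortest period considered
    max_lag = int(sample_rate_hz // min_f0_hz)  # longest period considered
    best_lag = max(
        range(min_lag, max_lag + 1),
        key=lambda lag: correlation_at_lag(samples, lag, window),
    )
    return sample_rate_hz / best_lag

sample_rate_hz = 8000
# Synthetic voiced frame: a 100 Hz fundamental plus its second harmonic.
frame = [
    math.sin(2 * math.pi * 100 * i / sample_rate_hz)
    + 0.5 * math.sin(2 * math.pi * 200 * i / sample_rate_hz)
    for i in range(400)
]

f0_hz = estimate_f0_hz(frame, sample_rate_hz, min_f0_hz=75, max_f0_hz=500, window=240)
print(f0_hz)
```

A real tracker repeats this frame by frame to get F0 over time, and must also decide which frames are voiced at all.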

Pitch Issues • Microprosody effects of consonants (e.g. /v/) • Creaky voice -> no pitch track, or noisy estimate • Errors to watch for: • Halving: the lag found is too long, by one or more cycles. • Since the estimated lag is too long, the pitch estimate is too low (underestimation) • Doubling: the lag found is too short, because the second half of the cycle is similar to the first • A short lag counts too many cycles per second, so the pitch estimate is too high (overestimation)
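
Halving and doubling can be reproduced on synthetic signals. In the sketch below each error is provoked by a deliberately mis-set lag search range, which is only one hypothetical way such errors arise; the signals, ranges and function names are assumptions for illustration.

```python
import math

SAMPLE_RATE_HZ = 8000

def best_lag(samples, min_lag, max_lag, window):
    """Lag (in samples) with the highest correlation between two windows."""
    def correlation(lag):
        return sum(samples[i] * samples[i + lag] for i in range(window))
    return max(range(min_lag, max_lag + 1), key=correlation)

def voiced_frame(f0_hz, fundamental_amp, second_harmonic_amp, n_samples):
    """Fundamental plus second harmonic, with adjustable strengths."""
    return [
        fundamental_amp * math.sin(2 * math.pi * f0_hz * i / SAMPLE_RATE_HZ)
        + second_harmonic_amp * math.sin(2 * math.pi * 2 * f0_hz * i / SAMPLE_RATE_HZ)
        for i in range(n_samples)
    ]

true_f0_hz = 200  # one cycle lasts 40 samples

# Halving: the search starts beyond one period, so two periods (lag 80) win.
plain = voiced_frame(true_f0_hz, 1.0, 0.5, 400)
halved_f0_hz = SAMPLE_RATE_HZ / best_lag(plain, min_lag=50, max_lag=106, window=200)

# Doubling: a strong second harmonic makes the second half of each cycle look
# like the first half, and the search stops short of the true period.
strong_harmonic = voiced_frame(true_f0_hz, 0.2, 1.0, 400)
doubled_f0_hz = SAMPLE_RATE_HZ / best_lag(strong_harmonic, min_lag=16, max_lag=30, window=200)

print(halved_f0_hz, doubled_f0_hz)
```

The halved estimate is half the true F0 and the doubled estimate is twice it, matching the under- and overestimation described above.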

Pitch Doubling and Halving • [Figure panels: Doubling Error, Halving Error]

Next Class • Speech Recognition Overview • Reading: J&M 9.1, 9.2, 5.5
