Analyzing the Speech Signal

Julia Hirschberg, CS 6998

Basic Acoustics • What is sound? • Pressure fluctuations in the air caused by a musical instrument, a car horn, a voice • Cause eardrum to move • Auditory system translates into neural impulses • Brain interprets as sound • How does it travel? • Via a sound wave of air molecules that ‘travels’ through the air

Molecules don’t travel but pressure fluctuations do • But sound waves lose energy as they travel: it takes energy to move those molecules • Molecules also move for reasons other than, e.g., the sound of my voice: noise • Ratio of speech-generated molecular motion to other motion: signal-to-noise ratio

Types of Sound: Periodic Waves • Simple periodic waves (sine waves) defined by • Frequency: how often the pattern repeats per time unit • Cycle: one repetition • Period: duration of one cycle • Frequency = number of cycles per time unit, e.g. frequency in Hz = 1 / period in seconds • Time runs along the horizontal axis of the waveform • Amplitude: peak deviation of pressure from normal atmospheric pressure

Phase: timing of waveform relative to a reference point • Complex periodic waves (eg) • Cyclic but composed of two or more sine waves • Fundamental frequency (F0): rate at which largest pattern repeats (also GCD of component freqs) • Components not always easily identifiable: power spectrum graphs amplitude vs. frequency
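As a rough illustration of the definitions above, the sketch below builds a complex periodic wave from two sine components and finds F0 as the GCD of the component frequencies. The component values (200 Hz and 300 Hz), the 8000 Hz sampling rate, and all function names are assumptions chosen for the example, not taken from the slides.

```python
import math
from functools import reduce

def sine_wave(freq_hz, amplitude, phase_rad, sample_rate, n_samples):
    """Sample a simple periodic wave: amplitude * sin(2*pi*f*t + phase)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate + phase_rad)
            for n in range(n_samples)]

def complex_wave(components, sample_rate, n_samples):
    """Sum sine components, each given as (freq_hz, amplitude, phase_rad)."""
    waves = [sine_wave(f, a, p, sample_rate, n_samples) for f, a, p in components]
    return [sum(samples) for samples in zip(*waves)]

def fundamental_frequency(freqs_hz):
    """F0 as the greatest common divisor of integer component frequencies."""
    return reduce(math.gcd, freqs_hz)

# Illustrative values (assumed): two components at 200 Hz and 300 Hz.
components = [(200, 1.0, 0.0), (300, 0.5, 0.0)]
sample_rate = 8000
wave = complex_wave(components, sample_rate, 160)

f0 = fundamental_frequency([f for f, _, _ in components])
period_s = 1 / f0                                  # period = 1 / frequency
samples_per_period = round(period_s * sample_rate)
print(f0, period_s, samples_per_period)
```

With these values the largest pattern repeats every 80 samples (100 Hz), even though neither component is itself at 100 Hz.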

Fourier’s Theorem • Any complex waveform can be analyzed into a set of sine waves with their own frequencies, amplitudes, and phases • Fourier analysis produces power spectrum from complex periodic wave • Potential problems: • Assumes infinite waveform when we have only a small window for analysis • Waveform itself may be inaccurately represented
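A minimal sketch of Fourier analysis, assuming a direct (slow) discrete Fourier transform written out by hand rather than any particular tool's implementation. The test signal (100 Hz plus 250 Hz), the 1000 Hz sampling rate, and the function name are illustrative assumptions.

```python
import cmath
import math

def power_spectrum(samples, sample_rate):
    """Return (frequency_hz, power) pairs for DFT bins 0..N/2."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        spectrum.append((k * sample_rate / n, abs(coeff) ** 2 / n))
    return spectrum

# Assumed test signal: 100 Hz (amplitude 1.0) plus 250 Hz (amplitude 0.5).
sample_rate = 1000
n_samples = 200  # 0.2 s, a whole number of cycles of both components
signal = [math.sin(2 * math.pi * 100 * i / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 250 * i / sample_rate)
          for i in range(n_samples)]

spectrum = power_spectrum(signal, sample_rate)
two_largest = sorted(spectrum, key=lambda fp: fp[1], reverse=True)[:2]
peak_freqs = sorted(freq for freq, _ in two_largest)
print(peak_freqs)
```

The window here was chosen to hold a whole number of cycles of each component, so the peaks are clean. With a window that cuts a cycle short, energy spreads into neighboring bins, which is one form of the finite-window problem noted above.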

Types of Sound: Aperiodic Waves • Waveforms with random or non-repeating patterns (eg) • Random aperiodic waveforms: white noise • Flat spectrum: equal amplitude for all frequency components • Transients: sudden bursts of pressure (clicks, pops, door slams) • Waveform shows a single impulse • Fourier analysis shows a flat spectrum
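The claim that a transient has a flat spectrum can be checked with a small sketch: a single impulse in an otherwise silent window, analyzed with a hand-written DFT. The window length (64) and impulse position are arbitrary assumptions.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Magnitudes of DFT bins 0..N/2 for a list of samples."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples)))
            for k in range(n // 2 + 1)]

# A single click: one non-zero sample in a silent window (assumed values).
impulse = [0.0] * 64
impulse[10] = 1.0

magnitudes = magnitude_spectrum(impulse)
print(min(magnitudes), max(magnitudes))
```

Every frequency bin comes out with the same magnitude; only the phases differ with the position of the impulse.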

Sample Analyses • Wavesurfer • Download from http://www.speech.kth.se/wavesurfer/download.html

Filters • Acoustic filters block out certain frequencies of sounds • Low-pass filter blocks high frequency components of a waveform • High-pass filter blocks low frequencies • Rejectband (what to block) vs. passband (what to let through)
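One very crude low-pass filter is a moving average, sketched below. This is an assumption made for illustration, not a filter named in the slides; the signal frequencies (50 Hz and 2000 Hz), the sampling rate, and the filter width are likewise illustrative.

```python
import math

def moving_average(samples, width):
    """Average each run of `width` consecutive samples (a crude low-pass filter)."""
    return [sum(samples[i:i + width]) / width
            for i in range(len(samples) - width + 1)]

sample_rate = 8000
n_samples = 400
low = [math.sin(2 * math.pi * 50 * n / sample_rate) for n in range(n_samples)]
high = [0.5 * math.sin(2 * math.pi * 2000 * n / sample_rate) for n in range(n_samples)]
mixed = [l + h for l, h in zip(low, high)]

# A width of 4 samples spans exactly one cycle of the 2000 Hz component,
# so that component averages to zero while the 50 Hz component passes.
filtered = moving_average(mixed, 4)
low_only = moving_average(low, 4)
residual_high = max(abs(f - l) for f, l in zip(filtered, low_only))
print(residual_high, max(filtered))
```

Here 2000 Hz falls in the rejectband and 50 Hz in the passband; subtracting the low-pass output from a suitably delayed copy of the input would give a correspondingly crude high-pass filter.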

Production of Speech • Voiced and voiceless sounds • Vocal fold vibration produces a complex periodic waveform • Cycles per second of the lowest-frequency component of the signal = fundamental frequency (F0) • Fourier analysis yields a power spectrum with component frequencies and amplitudes • F0 is the first (lowest-frequency) peak • Harmonics are components of the vocal fold vibration at integer multiples of F0 • The vocal tract filters the voicing waveform, shaping the spectrum of the resulting speech wave

Digital Signal Processing • Analog devices store and analyze continuous air pressure variations (speech) as a continuous signal • Digital devices (e.g. computers) first convert continuous signals into discrete signals (A-to-D conversion) • Sampling: how many time points in the signal to consider? • Quantization: how accurately do we want to measure amplitude at sampling points?

Sampling • Sampling rate: how often do we need to sample? • At least 2 samples per cycle to capture periodicity of a waveform component at a given frequency • 100 Hz waveform needs 200 samples per sec • Nyquist frequency: highest-frequency component captured with a given sampling rate (half the sampling rate)

Sampling/storage tradeoff • Human hearing: top frequency about 20 kHz • But do we really need to store 40K samples per second of speech? • Telephone speech: 300 Hz to 4 kHz (8 kHz sampling) • But fricatives have energy above 4 kHz • 16-22 kHz sampling usually good enough
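The tradeoff can be made concrete with a little arithmetic. The sketch below assumes uncompressed, single-channel audio at 16 bits per sample and a one-minute recording; those choices and the function names are assumptions for illustration.

```python
def nyquist_frequency(sample_rate_hz):
    """Highest frequency captured at a given sampling rate (half the rate)."""
    return sample_rate_hz / 2

def storage_bytes(sample_rate_hz, bits_per_sample, seconds):
    """Uncompressed, single-channel storage needed for a recording."""
    return sample_rate_hz * bits_per_sample // 8 * seconds

# Telephone rate, a typical speech rate, and the full-hearing-range rate.
for rate in (8000, 16000, 40000):
    print(rate, nyquist_frequency(rate), storage_bytes(rate, 16, 60))
```

Under these assumptions, one minute of speech takes 960,000 bytes at 8 kHz and 4,800,000 bytes at 40 kHz, a fivefold difference.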

Sampling Errors • Aliasing: • Signal’s frequency is higher than half the sampling rate • Solutions: • Increase the sampling rate • Filter out frequencies above half the sampling rate (anti-aliasing filter)
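A small sketch of aliasing, assuming a 7000 Hz sine sampled at 8000 Hz (both values chosen for illustration). The sampled points turn out to be indistinguishable from those of a 1000 Hz sine with inverted sign, which is why the high-frequency component has to be filtered out before sampling.

```python
import math

def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a sampled sine, folded into the range 0..Nyquist."""
    return abs(freq_hz - sample_rate_hz * round(freq_hz / sample_rate_hz))

sample_rate = 8000
high = [math.sin(2 * math.pi * 7000 * n / sample_rate) for n in range(32)]
low = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(32)]

# The 7000 Hz samples equal the negated 1000 Hz samples, so their sum is ~0.
max_difference = max(abs(h + l) for h, l in zip(high, low))
print(alias_frequency(7000, sample_rate), max_difference)
```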

Quantization • Measuring the amplitude at sampling points: what resolution to choose? • Integer representation • 8, 12 or 16 bits per sample • Noise due to quantization steps is reduced by higher resolution, but higher resolution requires more storage • Choice depends on what kind of analysis is to be done

But clipping occurs when the input volume exceeds the range representable in the digitized waveform, producing transients
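Quantization and clipping can be sketched as follows, assuming input samples scaled to the range -1.0 to 1.0 and a signed integer representation. The scaling convention, the test sine, and the function names are assumptions for illustration.

```python
import math

def quantize(sample, bits):
    """Map a sample in [-1.0, 1.0) to a signed integer, clipping values out of range."""
    levels = 2 ** (bits - 1)
    value = round(sample * levels)
    return max(-levels, min(levels - 1, value))

def max_quantization_error(samples, bits):
    """Largest difference between the samples and their quantized versions."""
    levels = 2 ** (bits - 1)
    return max(abs(quantize(s, bits) / levels - s) for s in samples)

sine = [0.9 * math.sin(2 * math.pi * n / 100) for n in range(100)]
error_8bit = max_quantization_error(sine, 8)
error_16bit = max_quantization_error(sine, 16)

# An input 50% louder than full scale is clipped to the largest representable value.
clipped = quantize(1.5, 16)
print(error_8bit, error_16bit, clipped)
```

The 16-bit error is far smaller than the 8-bit error, at twice the storage per sample. Clipped samples all sit at the maximum value, which flattens the waveform peaks.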

Perception of Pitch • Auditory system’s perception of pitch is non-linear • Sounds at lower frequencies with same difference in absolute frequency sound more different than those at higher frequencies • Bark scale (Zwicker) models perceived difference
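A sketch of the non-linearity, assuming the commonly cited Zwicker and Terhardt approximation of the Bark scale; the slides name the scale but not a formula, so the choice of formula is an assumption. The comparison frequencies are illustrative.

```python
import math

def hz_to_bark(freq_hz):
    """Approximate Bark value (Zwicker and Terhardt formula, assumed here)."""
    return (13 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500) ** 2))

# The same 100 Hz step at a low and at a high frequency.
low_step = hz_to_bark(200) - hz_to_bark(100)
high_step = hz_to_bark(5100) - hz_to_bark(5000)
print(low_step, high_step)
```

The 100 Hz step between 100 and 200 Hz spans several times more of the Bark scale than the same step between 5000 and 5100 Hz.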

Pitch Tracking: Autocorrelation Techniques • Goal: estimate F0 over time as a function of vocal fold vibration • A periodic waveform is correlated with itself • One period looks much like another (eg) • Find the period by finding the ‘lag’ (offset) between two windows on the signal for which the correlation of the windows is highest • Lag duration (T) is 1 period of the waveform • Its inverse is F0 (1/T)

Errors: • Halving: shortest lag calculated is too long (underestimate pitch) • Doubling: shortest lag too short (overestimate pitch)
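The autocorrelation idea can be sketched as below. This is a minimal illustration on a synthetic frame, not the algorithm of any particular pitch tracker; the search range (75 to 400 Hz), the normalization, and the test signal are assumptions.

```python
import math

def autocorrelation(frame, lag):
    """Normalized correlation between the frame and itself shifted by `lag` samples."""
    a = frame[:-lag]
    b = frame[lag:]
    numerator = sum(x * y for x, y in zip(a, b))
    denominator = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return numerator / denominator if denominator > 0 else 0.0

def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Return (F0 estimate, correlation score) for the best lag in the search range."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return sample_rate / best_lag, autocorrelation(frame, best_lag)

# Synthetic voiced frame (assumed): 125 Hz fundamental plus its second harmonic.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 125 * n / sample_rate)
         + 0.5 * math.sin(2 * math.pi * 250 * n / sample_rate)
         for n in range(512)]

f0, score = estimate_f0(frame, sample_rate)
print(f0, score)
```

Limiting the lag search to a plausible F0 range is one simple guard against the errors listed above: a lag of two periods correlates just as well as one period, so a search range that admits it can yield a halved pitch estimate.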

Pitch Track Headers
version      1
type_code    4
frequency    12000.000000
samples      160768
start_time   0.000000
end_time     13.397333
bandwidth    6000.000000
dimensions   1
maximum      9660.000000
minimum      -17384.000000
time         Sat Nov 2 15:55:50 1991
operation record: padding xxxxxxxxxxxx

Pitch Track Data
F0        Pvoicing   Energy    A/C Score
147.896   1          2154.07   0.902643
140.894   1          1544.93   0.967008
138.05    1          1080.55   0.92588
130.399   1          745.262   0.595265
0         0          567.153   0.504029
0         0          638.037   0.222939
0         0          670.936   0.370024
0         0          790.751   0.357141
141.215   1          1281.1    0.904345
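A sketch of how pitch track rows like these might be read and summarized. It assumes whitespace-separated columns in the order shown (F0, Pvoicing, Energy, A/C Score) and treats a Pvoicing value of 1 as a voiced frame; the dictionary keys and function name are hypothetical.

```python
PITCH_TRACK = """\
147.896 1 2154.07 0.902643
140.894 1 1544.93 0.967008
138.05 1 1080.55 0.92588
130.399 1 745.262 0.595265
0 0 567.153 0.504029
0 0 638.037 0.222939
0 0 670.936 0.370024
0 0 790.751 0.357141
141.215 1 1281.1 0.904345
"""

def parse_frames(text):
    """Parse one frame per line into a dict of named values."""
    frames = []
    for line in text.strip().splitlines():
        f0, voiced, energy, score = line.split()
        frames.append({"f0": float(f0), "voiced": voiced == "1",
                       "energy": float(energy), "ac_score": float(score)})
    return frames

frames = parse_frames(PITCH_TRACK)
# Unvoiced frames carry F0 = 0 and must be excluded from pitch statistics.
voiced_f0 = [frame["f0"] for frame in frames if frame["voiced"]]
mean_f0 = sum(voiced_f0) / len(voiced_f0)
print(len(frames), len(voiced_f0), mean_f0)
```

Averaging over all nine frames, zeros included, would pull the mean well below any F0 actually produced.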

RMS Amplitude • Energy closely correlated experimentally with perceived loudness • For each window, square the amplitude values of the samples, take their mean, and take the root of that mean • What size window? • Longer windows produce smoother amplitude traces but miss sudden acoustic events
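The RMS recipe above, and the effect of window size, can be sketched as follows. Non-overlapping windows, the 100 Hz test tone, and the single-sample click are assumptions for illustration.

```python
import math

def rms_trace(samples, window_size):
    """RMS amplitude of each non-overlapping window: square, mean, then root."""
    trace = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        trace.append(math.sqrt(sum(x * x for x in window) / window_size))
    return trace

sample_rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(800)]
tone_rms = rms_trace(tone, 160)  # each window holds two full cycles

# A single click in silence: a sudden acoustic event.
click = [0.0] * 800
click[405] = 1.0
short_peak = max(rms_trace(click, 10))
long_peak = max(rms_trace(click, 400))
print(tone_rms[0], short_peak, long_peak)
```

The steady tone gives the same RMS value in every window, while the click stands out clearly in the 10-sample trace and is nearly averaged away in the 400-sample trace.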

Perception of Loudness • Non-linear: described in sones or decibels (dB) • Differences among soft sounds are more salient than among loud ones • Intensity is proportional to the square of amplitude, so the intensity of a sound with pressure x relative to a reference sound with pressure r is $x^2/r^2$ • bel: base-10 log of that ratio • decibel: one tenth of a bel • $\mathrm{dB} = 10\log_{10}(x^2/r^2) = 20\log_{10}(x/r)$ • Reference: absolute (20 µPa, lowest audible pressure fluctuation of a 1000 Hz tone) or typical threshold level for a tone at the given frequency

Pressure of Common Sounds
Event          Pressure (µPa)   dB
Absolute       20               0
Whisper        200              20
Quiet office   2K               40
Conversation   20K              60
Bus            200K             80
Subway         2M               100
Thunder        20M              120
*DAMAGE*       200M             140
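The table values can be reproduced from the decibel definition, as in the sketch below. It assumes pressures in micropascals with the absolute reference of 20 µPa; the function and variable names are hypothetical.

```python
import math

REFERENCE_PRESSURE_UPA = 20.0  # absolute reference, in micropascals

def pressure_to_db(pressure_upa, reference_upa=REFERENCE_PRESSURE_UPA):
    """Level in dB: 10 * log10 of the squared pressure ratio."""
    return 10 * math.log10(pressure_upa ** 2 / reference_upa ** 2)

pressures_upa = {
    "Absolute": 20, "Whisper": 200, "Quiet office": 2_000,
    "Conversation": 20_000, "Bus": 200_000, "Subway": 2_000_000,
    "Thunder": 20_000_000, "DAMAGE": 200_000_000,
}
levels_db = {event: round(pressure_to_db(p)) for event, p in pressures_upa.items()}
print(levels_db)
```

Each tenfold increase in pressure is a hundredfold increase in intensity and adds 20 dB.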

Speech Analysis Gives us Information • About variation in • Loudness • Pitch (contours, accent, phrasing, range) • Timing (rate, pauses) • Style (articulation, disfluencies) • This can be correlated with other features • Syntax, semantics, discourse context, words

Now and Next Week • Now: turn in discussion questions and project ideas • Read HLT96 (Ch. 5) • Try out some TTS systems; exercises • Bring 3 discussion questions to class • Decide which week you would like to help with class

Vocal fold vibration [UCLA Phonetics Lab demo]

Places of articulation: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal (figure source: http://www.chass.utoronto.ca/~danhall/phonetics/sammy.html)
