This talk explores the location of speech information in acoustic phonetics and psychophysics, and its relevance to speech recognition. It discusses the typology of acoustic landmarks and the encoding of phonemes using distinctive features. Infograms and two-point infograms are introduced as tools for analyzing speech information density. The talk concludes with an overview of the landmark-synchronous Baum-Welch algorithm and its potential advantages.
Speech Information at Acoustic Landmarks
Mark Hasegawa-Johnson, Electrical and Computer Engineering, UIUC
Outline of this Talk • Where is Speech Information? • A Typology of Acoustic Landmarks • Phoneme Encoding w/Distinctive Features • Infograms • Two-point Infograms • Entropy and Average Classification • Landmark-Synchronous Baum-Welch
Where is Speech Information? • According to Acoustic Phonetics • Vowel information is in the STEADY STATE • Consonant information is near TRANSITIONS • According to Speech Psychophysics • Speech w/o steady state is intelligible • Speech w/o transitions is unintelligible • According to Speech Recognition • Transition-anchored observations outperform segment-anchored observations.
Typology of Acoustic Landmarks • ACOUSTIC LANDMARK = a perceptually salient acoustic event near which phonetic information is dense • Examples • Consonant Release: ga, ma, sa • Consonant Closure: egg, em, ess • Manner Change: agfa, anfa, asfa, asna • Counter-Examples: Non-Landmarks • Place Change: agda, aftha, amna, this ship
Information about a GLIDE is most dense at the point of maximum constriction: • Middle of an intersyllabic glide: aya, "a letter" • Start of a syllable-initial glide: tra • End of a syllable-final glide: art
Information about a VOWEL is dense near: • 1. The on-glide and off-glide (GLIDE landmarks!) • 2. A VOWEL LANDMARK: pick a reference time near the center of the steady state
Acoustic Landmarks Chosen for the Infogram Experiments • Releases, Closures, Manner Change: as marked in TIMIT • Glide, Flap, or /h/ Pivot Landmark: • Syllable-initial: at START segment boundary • Syllable-final: at END segment boundary • Intersyllabic: halfway through • Vowel Pivot Landmark: • 33% of the distance from START to END
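A minimal sketch of these placement rules; the function name, the `kind` labels, and the choice of time units are illustrative assumptions, not the original code:

```python
def pivot_landmark_time(start, end, kind):
    """Place a pivot landmark inside a TIMIT segment [start, end)
    following the rules above. Labels and units (here: seconds)
    are assumptions for illustration."""
    if kind == "syllable_initial":   # glide/flap//h/: at START boundary
        return start
    if kind == "syllable_final":     # glide/flap//h/: at END boundary
        return end
    if kind == "intersyllabic":      # glide/flap//h/: halfway through
        return start + 0.5 * (end - start)
    if kind == "vowel":              # 33% of the distance from START to END
        return start + 0.33 * (end - start)
    raise ValueError(f"unknown landmark kind: {kind}")

# Example: a vowel segment from 0.120 s to 0.300 s
print(pivot_landmark_time(0.120, 0.300, "vowel"))  # 0.1794
```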
Phoneme Encoding for Infogram Experiments • Encode w/ binary distinctive features • /s/ = [+consonantal, -sonorant, +continuant, +strident, +blade, +anterior, -distributed, +stiff vocal folds, -grave, +fricative] • Feature hierarchy: the infogram is based only on syllables for which the feature is salient • Redundant features: a partial solution to the problem of context-dependent acoustic implementation
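For concreteness, here is the /s/ example above as a lookup table; the ±1 value convention comes from the infogram slide below, and the table layout is an assumption:

```python
# Binary distinctive-feature encoding, with +1/-1 values as in the
# infogram experiments (each feature d takes value -1 or +1).
# The /s/ entry mirrors the slide's example; other phones would be
# filled in from a full feature table.
FEATURES = ["consonantal", "sonorant", "continuant", "strident",
            "blade", "anterior", "distributed", "stiff_vocal_folds",
            "grave", "fricative"]

PHONE_TO_FEATURES = {
    "s": {"consonantal": +1, "sonorant": -1, "continuant": +1,
          "strident": +1, "blade": +1, "anterior": +1,
          "distributed": -1, "stiff_vocal_folds": +1,
          "grave": -1, "fricative": +1},
}

def encode(phone):
    """Return the +1/-1 feature vector for a phone, in FEATURES order."""
    return [PHONE_TO_FEATURES[phone][f] for f in FEATURES]

print(encode("s"))  # [1, -1, 1, 1, 1, 1, -1, 1, -1, 1]
```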
Distinctive Features in TIMIT • Features of a segment are determined from the left, center, and right phones • Once determined, features are attached to any landmarks caused by the center phoneme • Example feature values: [-sonorant], [-continuant], [+strident]
Infograms • Joint probability p(x,d) estimated from the TIMIT TRAIN corpus • Feature takes value d (-1 or +1) • X(t,f), with t=0 at the landmark, takes value x, quantized to 23 levels • The infogram is the mutual information:
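The equation itself did not survive the slide export; reconstructing it from the definitions above, the infogram is the mutual information between the feature value d and the quantized spectral value x:

```latex
I_D(t,f) \;=\; \sum_{d \in \{-1,+1\}} \; \sum_{x=1}^{23}
  p(x,d)\,\log_2 \frac{p(x,d)}{p(x)\,p(d)}
```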
Two-Point Infogram • Maximize I_D(t1,f1) • Find the joint PMF of d, x, and X(t2,f2)=y • Calculate the two-point infogram:
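This equation was also lost in export. One plausible reconstruction, consistent with the two-measurement conditional entropy reported two slides below, is the mutual information between d and the measurement pair:

```latex
I_D(t_1,f_1;\,t_2,f_2) \;=\; \sum_{d}\sum_{x}\sum_{y}
  p(x,y,d)\,\log_2 \frac{p(x,y,d)}{p(x,y)\,p(d)}
  \;=\; H_D \;-\; H_{D \,|\, X(t_1,f_1),\,X(t_2,f_2)}
```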
Conditional Entropy and Average Classification • Infogram: I_D(t,f) = H_D − H_{D|X(t,f)} • A priori entropy H_D: min_d p(d) = f(H_D) is the classification error probability given NO OBSERVATIONS • Conditional entropy H_{D|X(t,f)}: f(H_{D|X(t,f)}) is similar to the log-average error probability given ONE OBSERVATION at time t, frequency f
Average Classification Error vs. Entropy • [strident] a priori: 0.89 bits, p(error) = 0.31 • [strident] given one measurement: 0.29 bits, p(error) = 0.05 • [strident] given two measurements: 0.22 bits, p(error) = 0.04
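These numbers are consistent with f being the inverse of the binary entropy function; a quick sanity check in Python (exact for the a priori row; for the conditional entropies the slide only claims an approximate, log-average relation):

```python
import math

def binary_entropy(p):
    """H(p) in bits for a two-class distribution (p, 1-p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def error_from_entropy(h, tol=1e-9):
    """Invert binary entropy on [0, 0.5] by bisection: returns the
    error probability f(H) whose entropy equals h."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(error_from_entropy(0.89), 2))  # ~0.31, matching the a priori row
print(round(error_from_entropy(0.29), 2))  # ~0.05 (approximate: conditional H)
```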
From Classification to Recognition • Histogram gives a classification probability b_{t_i}(d_i) = p(d_i | L_i, t_i, X_i(τ,f)) • d_i = vector of feature values, e.g. d_i = [+consonantal, -sonorant, +continuant, ...] • L_i = landmark type, e.g. release, closure, pivot • t_i = landmark time • X_i(τ,f) = spectrogram, with τ = t − t_i (time measured relative to the landmark) • Duration probabilities define a transition probability a(t_i, t_k) = p(L_k at t_k | L_i at t_i)
Landmark-Synchronous Baum-Welch Algorithm • Probability of a transcription [L_1, d_1, L_2, d_2, …] given an observation matrix X:
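The formula did not survive the export. A sketch, assuming the first-order structure implied by the previous slide (transition probabilities a and landmark observation probabilities b), with unobserved landmark times summed out and a(t_0, t_1) read as an initial-time probability:

```latex
p(L_1, d_1, \ldots, L_N, d_N \mid X) \;\propto\;
  \sum_{t_1 < t_2 < \cdots < t_N} \;\prod_{i=1}^{N}
  a(t_{i-1}, t_i)\; b_{t_i}(d_i)
```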
What's Different About the LM-Synchronous Baum-Welch? • Traditional: • Time is the independent variable: t = 1, 2, 3, ... • Phonetic state is the dependent variable, governed by transition probabilities a_{ik} • Landmark-synchronous: • Phonetic landmark is the independent variable: L_i = L_1, L_2, L_3, ... • Time is the dependent variable, governed by transition probabilities a(t_i, t_k)
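A minimal sketch of the forward pass this implies, with the landmark index as the independent variable and time summed out at each step; all names are illustrative assumptions, not the original system:

```python
def landmark_forward(landmarks, times, a, b):
    """Forward pass for a landmark-synchronous recognizer (sketch).

    landmarks : hypothesized sequence of (L_i, d_i) pairs
    times     : candidate landmark times to sum over
    a(ti, tk) : transition probability p(L_k at tk | L_i at ti)
    b(L, t, d): observation probability b_t(d) = p(d | L, t, X)

    The landmark index i is the independent variable; the landmark
    time is the dependent variable, summed out at every step.
    """
    L0, d0 = landmarks[0]
    # alpha[t] = probability that the first landmark occurs at time t
    alpha = {t: b(L0, t, d0) for t in times}
    for Li, di in landmarks[1:]:
        alpha = {tk: b(Li, tk, di) *
                     sum(alpha[ti] * a(ti, tk) for ti in times if ti < tk)
                 for tk in times}
    return sum(alpha.values())  # total probability of the transcription
```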
Potential Advantages of LM-Synchronous Baum-Welch • Detailed modeling of the spectrogram near landmarks • Possibly better ACOUSTIC MODELING • Explicit timing models • Explicit, efficient, fully integrated recognition of PROSODY • b_t(d) estimated from a long window • Possible use of asynchronous cues, e.g. AUDIO-VISUAL integration
Conclusions • Speech information is dense near landmarks • The infogram displays the distribution of information: • one-point spectral information • two-point information found with a greedy algorithm • Recognition may be possible using the landmark-synchronous Baum-Welch algorithm