This talk explores the location of speech information in acoustic phonetics and psychophysics, and its relevance to speech recognition. It discusses the typology of acoustic landmarks and the encoding of phonemes using distinctive features. Infograms and two-point infograms are introduced as tools for analyzing speech information density. The talk concludes with an overview of the landmark-synchronous Baum-Welch algorithm and its potential advantages.
Speech Information at Acoustic Landmarks
Mark Hasegawa-Johnson, Electrical and Computer Engineering, UIUC
Outline of this Talk • Where is Speech Information? • A Typology of Acoustic Landmarks • Phoneme Encoding w/Distinctive Features • Infograms • Two-point Infograms • Entropy and Average Classification • Landmark-Synchronous Baum-Welch
Where is Speech Information? • According to Acoustic Phonetics • Vowel information is in the STEADY STATE • Consonant information is near TRANSITIONS • According to Speech Psychophysics • Speech w/o steady state is intelligible • Speech w/o transitions is unintelligible • According to Speech Recognition • Transition-anchored observations outperform segment-anchored observations.
Typology of Acoustic Landmarks • ACOUSTIC LANDMARK = a perceptually salient acoustic event near which phonetic information is dense • Examples • Consonant Release: ga, ma, sa • Consonant Closure: egg, em, ess • Manner Change: agfa, anfa, asfa, asna • Counter-Examples: Non-Landmarks • Place Change: agda, aftha, amna, this ship
Information about a GLIDE is most dense at the point of maximum constriction: • Middle of an intersyllabic glide: aya, "a letter" • Start of a syllable-initial glide: tra • End of a syllable-final glide: art
Information about a VOWEL is dense near: • 1. The on-glide and off-glide (GLIDE landmarks!) • 2. A VOWEL LANDMARK: pick a reference time near the center of the steady state
Acoustic Landmarks Chosen for the Infogram Experiments • Releases, Closures, Manner Change: as marked in TIMIT • Glide, Flap, or /h/ Pivot Landmark: • Syllable-initial: at START segment boundary • Syllable-final: at END segment boundary • Intersyllabic: halfway through • Vowel Pivot Landmark: • 33% of the distance from START to END
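A minimal sketch of these placement rules; the function name, the `kind` labels, and the choice of time units are illustrative assumptions, not the original code:

```python
def pivot_landmark_time(start, end, kind):
    """Place a pivot landmark inside a TIMIT segment [start, end)
    following the rules above. Labels and units (here: seconds)
    are assumptions for illustration."""
    if kind == "syllable_initial":   # glide/flap//h/: at START boundary
        return start
    if kind == "syllable_final":     # glide/flap//h/: at END boundary
        return end
    if kind == "intersyllabic":      # glide/flap//h/: halfway through
        return start + 0.5 * (end - start)
    if kind == "vowel":              # 33% of the distance from START to END
        return start + 0.33 * (end - start)
    raise ValueError(f"unknown landmark kind: {kind}")

# Example: a vowel segment from 0.120 s to 0.300 s
print(pivot_landmark_time(0.120, 0.300, "vowel"))  # 0.1794
```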
Phoneme Encoding for Infogram Experiments • Encode w/ binary distinctive features • /s/ = [+consonantal, -sonorant, +continuant, +strident, +blade, +anterior, -distributed, +stiff vocal folds, -grave, +fricative] • Feature hierarchy: the infogram is based only on syllables for which the feature is salient • Redundant features: a partial solution to the problem of context-dependent acoustic implementation
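For concreteness, here is the /s/ example above as a lookup table; the ±1 value convention comes from the infogram slide below, and the table layout is an assumption:

```python
# Binary distinctive-feature encoding, with +1/-1 values as in the
# infogram experiments (each feature d takes value -1 or +1).
# The /s/ entry mirrors the slide's example; other phones would be
# filled in from a full feature table.
FEATURES = ["consonantal", "sonorant", "continuant", "strident",
            "blade", "anterior", "distributed", "stiff_vocal_folds",
            "grave", "fricative"]

PHONE_TO_FEATURES = {
    "s": {"consonantal": +1, "sonorant": -1, "continuant": +1,
          "strident": +1, "blade": +1, "anterior": +1,
          "distributed": -1, "stiff_vocal_folds": +1,
          "grave": -1, "fricative": +1},
}

def encode(phone):
    """Return the +1/-1 feature vector for a phone, in FEATURES order."""
    return [PHONE_TO_FEATURES[phone][f] for f in FEATURES]

print(encode("s"))  # [1, -1, 1, 1, 1, 1, -1, 1, -1, 1]
```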
Distinctive Features in TIMIT • Features of a segment are determined from the left, center, and right phones • Once determined, features are attached to any landmarks caused by the center phoneme • Example feature values: [-sonorant], [-continuant], [+strident]
Infograms • Joint probability p(x,d) estimated from the TIMIT TRAIN corpus • Feature takes value d (-1 or +1) • X(t,f), with t=0 at the landmark, takes value x, quantized to 23 levels • The infogram is the mutual information:
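The equation itself did not survive the slide export; reconstructing it from the definitions above, the infogram is the mutual information between the feature value d and the quantized spectral value x:

```latex
I_D(t,f) \;=\; \sum_{d \in \{-1,+1\}} \; \sum_{x=1}^{23}
  p(x,d)\,\log_2 \frac{p(x,d)}{p(x)\,p(d)}
```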
Two-Point Infogram • Maximize I_D(t1,f1) • Find the joint PMF of d, x, and X(t2,f2)=y • Calculate the two-point infogram:
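This equation was also lost in export. One plausible reconstruction, consistent with the two-measurement conditional entropy reported two slides below, is the mutual information between d and the measurement pair:

```latex
I_D(t_1,f_1;\,t_2,f_2) \;=\; \sum_{d}\sum_{x}\sum_{y}
  p(x,y,d)\,\log_2 \frac{p(x,y,d)}{p(x,y)\,p(d)}
  \;=\; H_D \;-\; H_{D \,|\, X(t_1,f_1),\,X(t_2,f_2)}
```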
Conditional Entropy and Average Classification • Infogram: I_D(t,f) = H_D − H_{D|X(t,f)} • A priori entropy H_D: min_d p(d) = f(H_D) is the classification error probability given NO OBSERVATIONS • Conditional entropy H_{D|X(t,f)}: f(H_{D|X(t,f)}) is similar to the log-average error probability given ONE OBSERVATION at time t, frequency f
Average Classification Error vs. Entropy • [strident] a priori: 0.89 bits, p(error) = 0.31 • [strident] given one measurement: 0.29 bits, p(error) = 0.05 • [strident] given two measurements: 0.22 bits, p(error) = 0.04
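These numbers are consistent with f being the inverse of the binary entropy function; a quick sanity check in Python (exact for the a priori row; for the conditional entropies the slide only claims an approximate, log-average relation):

```python
import math

def binary_entropy(p):
    """H(p) in bits for a two-class distribution (p, 1-p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def error_from_entropy(h, tol=1e-9):
    """Invert binary entropy on [0, 0.5] by bisection: returns the
    error probability f(H) whose entropy equals h."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(error_from_entropy(0.89), 2))  # ~0.31, matching the a priori row
print(round(error_from_entropy(0.29), 2))  # ~0.05 (approximate: conditional H)
```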
From Classification to Recognition • Histogram gives a classification probability b_{t_i}(d_i) = p(d_i | L_i, t_i, X_i(τ,f)) • d_i = vector of feature values, e.g. d_i = [+consonantal, -sonorant, +continuant, ...] • L_i = landmark type, e.g. release, closure, pivot • t_i = landmark time • X_i(τ,f) = spectrogram, with τ = t − t_i (time measured relative to the landmark) • Duration probabilities define a transition probability a(t_i, t_k) = p(L_k at t_k | L_i at t_i)
Landmark-Synchronous Baum-Welch Algorithm • Probability of a transcription [L_1, d_1, L_2, d_2, …] given an observation matrix X:
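The formula did not survive the export. A sketch, assuming the first-order structure implied by the previous slide (transition probabilities a and landmark observation probabilities b), with unobserved landmark times summed out and a(t_0, t_1) read as an initial-time probability:

```latex
p(L_1, d_1, \ldots, L_N, d_N \mid X) \;\propto\;
  \sum_{t_1 < t_2 < \cdots < t_N} \;\prod_{i=1}^{N}
  a(t_{i-1}, t_i)\; b_{t_i}(d_i)
```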
What's Different About the LM-Synchronous Baum-Welch? • Traditional: • Time is the independent variable: t = 1, 2, 3, ... • Phonetic state is the dependent variable, governed by transition probabilities a_{ik} • Landmark-synchronous: • Phonetic landmark is the independent variable: L_i = L_1, L_2, L_3, ... • Time is the dependent variable, governed by transition probabilities a(t_i, t_k)
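A minimal sketch of the forward pass this implies, with the landmark index as the independent variable and time summed out at each step; all names are illustrative assumptions, not the original system:

```python
def landmark_forward(landmarks, times, a, b):
    """Forward pass for a landmark-synchronous recognizer (sketch).

    landmarks : hypothesized sequence of (L_i, d_i) pairs
    times     : candidate landmark times to sum over
    a(ti, tk) : transition probability p(L_k at tk | L_i at ti)
    b(L, t, d): observation probability b_t(d) = p(d | L, t, X)

    The landmark index i is the independent variable; the landmark
    time is the dependent variable, summed out at every step.
    """
    L0, d0 = landmarks[0]
    # alpha[t] = probability that the first landmark occurs at time t
    alpha = {t: b(L0, t, d0) for t in times}
    for Li, di in landmarks[1:]:
        alpha = {tk: b(Li, tk, di) *
                     sum(alpha[ti] * a(ti, tk) for ti in times if ti < tk)
                 for tk in times}
    return sum(alpha.values())  # total probability of the transcription
```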
Potential Advantages of LM-Synchronous Baum-Welch • Detailed modeling of the spectrogram near landmarks • Possibly better ACOUSTIC MODELING • Explicit timing models • Explicit, efficient, fully integrated recognition of PROSODY • b_t(d) estimated from a long window • Possible use of asynchronous cues, e.g. AUDIO-VISUAL integration
Conclusions • Speech information is dense near landmarks • The infogram displays the distribution of information: • one-point spectral information • two-point information found with a greedy algorithm • Recognition may be possible using the landmark-synchronous Baum-Welch algorithm