Landmark-Based Speech Recognition
Mark Hasegawa-Johnson, Carol Espy-Wilson, Jim Glass, Steve Greenberg, Katrin Kirchhoff, Mark Liberman, Partha Niyogi, Ken Stevens
What are Landmarks?
• Instants of perceptual importance: human speech recognition accuracy drops if a 50 ms segment is deleted. (Reviewer note: not according to Miller and Licklider, 1950, or Huggins, 1975; consider dropping the argument about short instants in time and focusing instead on landmarks of variable duration, 30–150 ms.)
• Instants of high mutual information between the phone label and the signal: maxima of I(q; X(t,f)).
• Potentially universal acoustic events: cross-linguistic and speaking-style transfer; noise robustness.
Where do landmarks occur?
• Syllable onset ≈ consonant release
• Syllable nucleus ≈ vowel center
• Syllable coda ≈ consonant closure
Perceptual experiments: Strange, 1989. I(q; X(t,f)) experiment: Hasegawa-Johnson, 2000.
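As a toy illustration of the second criterion, the sketch below computes a plug-in estimate of the mutual information I(q; X(t,f)) between a discrete phone label q and a quantized signal value x. The data are synthetic stand-ins; this is not the estimator of Hasegawa-Johnson (2000).

```python
# Minimal sketch: plug-in estimate of I(q; x) from paired samples, where q is a
# discrete phone label and x is a quantized spectral value taken at a fixed
# (time, frequency) offset from a candidate landmark. Illustration only.
import numpy as np
from collections import Counter

def mutual_information(labels, values):
    """I(q; x) in bits from two aligned sequences of discrete symbols."""
    n = len(labels)
    p_q = Counter(labels)
    p_x = Counter(values)
    p_qx = Counter(zip(labels, values))
    mi = 0.0
    for (q, x), c in p_qx.items():
        p_joint = c / n
        mi += p_joint * np.log2(p_joint / ((p_q[q] / n) * (p_x[x] / n)))
    return mi

# Toy example: 3 phone classes, 8 quantization bins for the signal value.
rng = np.random.default_rng(0)
q = rng.integers(0, 3, size=5000)
x = np.clip(q * 2 + rng.integers(0, 4, size=5000), 0, 7)  # x depends on q
print(f"I(q; x) = {mutual_information(q, x):.3f} bits")
```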
Landmark-Based Speech Recognition
[Figure: the MAP transcription "… backed up …" is selected from a search space of competing hypotheses ("… buck up …", "… big dope …", "… backed up …", "… bagged up …", "… big doowop …", …), each parsed into syllable structure: ONSET, NUCLEUS, CODA.]
Stop Detection Using Support Vector Machines
False acceptance vs. false rejection errors per 10 ms frame, for four types of stop detectors (Niyogi & Burges, 1999, 2002). (Presenter note: add a "take-home message" to this slide.)
(1) Delta-energy ("Deriv"): equal error rate (EER) = 0.2%
(2) HMM (*): false rejection error = 0.3%
(3) Linear SVM: EER = 0.15%
(4) Kernel SVM: EER = 0.13%
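For reference, the sketch below shows how a per-frame equal error rate of the kind quoted above can be computed from detector scores by sweeping a decision threshold. The scores and labels are synthetic; this is not the detectors or data of Niyogi & Burges.

```python
# Minimal sketch: equal error rate (EER) of a per-frame stop detector,
# computed by sweeping a threshold over detector scores. Toy data only.
import numpy as np

def equal_error_rate(scores, labels):
    """labels: 1 = stop frame, 0 = non-stop frame; higher score = more stop-like."""
    order = np.argsort(-scores)
    labels = labels[order]
    n_pos, n_neg = labels.sum(), len(labels) - labels.sum()
    tp = np.cumsum(labels)          # true accepts after accepting the top-k frames
    fp = np.cumsum(1 - labels)      # false accepts after accepting the top-k frames
    far = fp / n_neg                # false acceptance rate
    frr = (n_pos - tp) / n_pos      # false rejection rate
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

rng = np.random.default_rng(1)
labels = (rng.random(20000) < 0.05).astype(int)   # stop frames are rare
scores = rng.normal(labels * 2.0, 1.0)            # separable but noisy scores
print(f"EER = {100 * equal_error_rate(scores, labels):.2f}% per frame")
```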
Manner Class Recognition Accuracy (Juneja and Espy-Wilson, 2003)
Small-Vocabulary Word Recognition Using Landmarks: Results on TIDIGITS
TIDIGITS recognition, using SVMs trained on TIMIT:
• Manner-class HMM: 53% word recognition accuracy (WRA)
• SVM landmark detectors: 76% WRA
(Juneja and Espy-Wilson, 2003)
Lexical Notation: What are "Distinctive Features"?
MANNER FEATURES:
• +sonorant, +continuant = vowel, glide
• +sonorant, –continuant = nasal, /l/
• –sonorant, +continuant = fricative
• –sonorant, –continuant = stop
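A minimal sketch of how these manner-feature bundles map onto broad classes might look as follows in code; the dictionary form is an illustration, not a fragment of any existing system.

```python
# Minimal sketch: mapping the two binary manner features [sonorant, continuant]
# to the broad manner classes listed above. Feature values follow the slide.
MANNER_CLASSES = {
    ("+sonorant", "+continuant"): "vowel / glide",
    ("+sonorant", "-continuant"): "nasal, /l/",
    ("-sonorant", "+continuant"): "fricative",
    ("-sonorant", "-continuant"): "stop",
}

def manner_class(sonorant: bool, continuant: bool) -> str:
    key = ("+sonorant" if sonorant else "-sonorant",
           "+continuant" if continuant else "-continuant")
    return MANNER_CLASSES[key]

print(manner_class(sonorant=False, continuant=False))  # -> "stop"
```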
Distinctive Feature Lexicon
• Based on ICSI train-ws97 Switchboard transcriptions
• Compiled to a lexicon using Fosler-Lussier's babylex lexical compiler
• Converted to landmarks using Hasegawa-Johnson's perl transcription tools
Example entries for "ago" (landmarks in blue, place and voicing features in green):
AGO (0.441765):
+syllabic +reduced +back (syllable nucleus)
↓continuant ↓sonorant +velar +voiced (stop closure)
↑continuant ↑sonorant +velar +voiced (stop release)
+syllabic –low –high +back +round +tense (syllable nucleus)
AGO (0.294118):
+syllabic +reduced –back (syllable nucleus)
↓continuant ↓sonorant +velar +voiced (stop closure)
↑continuant ↑sonorant +velar +voiced (stop release)
+syllabic –low –high +back +round +tense (syllable nucleus)
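One hypothetical way to represent such a lexical entry in code is sketched below: a word maps to weighted pronunciation variants, each a sequence of landmarks carrying feature bundles. The structure and names (Landmark, AGO) are illustrative; the feature values are copied from the entry above.

```python
# Hypothetical data structure for one landmark-based lexical entry ("ago").
from dataclasses import dataclass

@dataclass
class Landmark:
    kind: str            # e.g. "syllable nucleus", "stop closure", "stop release"
    features: frozenset  # distinctive features attached to this landmark

AGO = [
    # (pronunciation probability, landmark sequence) for each variant
    (0.441765, [
        Landmark("syllable nucleus", frozenset({"+syllabic", "+reduced", "+back"})),
        Landmark("stop closure",     frozenset({"↓continuant", "↓sonorant", "+velar", "+voiced"})),
        Landmark("stop release",     frozenset({"↑continuant", "↑sonorant", "+velar", "+voiced"})),
        Landmark("syllable nucleus", frozenset({"+syllabic", "-low", "-high", "+back", "+round", "+tense"})),
    ]),
    (0.294118, [
        Landmark("syllable nucleus", frozenset({"+syllabic", "+reduced", "-back"})),
        Landmark("stop closure",     frozenset({"↓continuant", "↓sonorant", "+velar", "+voiced"})),
        Landmark("stop release",     frozenset({"↑continuant", "↑sonorant", "+velar", "+voiced"})),
        Landmark("syllable nucleus", frozenset({"+syllabic", "-low", "-high", "+back", "+round", "+tense"})),
    ]),
]

print(f"'ago' has {len(AGO)} listed variants; the first has {len(AGO[0][1])} landmarks")
```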
Noise Robustness of MLP-Based Distinctive Feature Detectors
• Each distinctive feature relies on different acoustic observations.
• Acoustic diversity can improve word recognition accuracy in noise: 10% WRA improvement at 0 dB SNR.
(Kirchhoff, 1999)
Noise Robustness of Distinctive Features: Pink Noise
Articulatory feature classification is more robust than phone classification at low SNRs (Chang, Shastri and Greenberg, 2001).
Noise Robustness of Distinctive Features: White Noise (Chang, Shastri and Greenberg, 2001)
Research Goals: Summer 2004 • Switchboard: • Train landmark detectors • Test: manner-class recognition • Word Lattice Rescoring using landmark detection probabilities • Noise • Manner class recognition, babble noise, 0dB • Word lattice rescoring with noisy observations
Experiment #1: Training and Manner Class Recognition on Switchboard • Currently Existing Infrastructure (11/2003): • SVM training code – libsvm • Forced-alignment of landmarks to phonetically untranscribed data – Espy-Wilson • Landmark-based dictionaries for Switchboard – Hasegawa-Johnson • TIMIT-trained SVMs – Espy-Wilson, Hasegawa-Johnson • Phonetic transcriptions of WS97 test data – Greenberg • Interactive code for viewing transcriptions and observations – xwaves, matlab • Infrastructure Prepared Prior to Summer 2004: • Diverse acoustic observations for all of Switchboard, including MFCC-based, broadband spectral energies, and sub-band periodicity. • Experiment Schedule, Summer 2004: • Week 1: Test TIMIT-trained MFCC-based SVMs on WS97 data. Retrain and re-test. Error analysis. • Week 2: Train and test using alternative acoustic observations.
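As a rough illustration of the per-frame SVM landmark detectors this plan calls for, the sketch below trains an RBF-kernel SVM via scikit-learn's SVC (a libsvm wrapper) on stand-in MFCC frames with binary landmark labels. The real experiments train on TIMIT and Switchboard alignments, which are not reproduced here.

```python
# Minimal sketch of a per-frame SVM landmark detector. The data are random
# stand-ins for MFCC frames with binary landmark labels.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_frames, n_mfcc = 4000, 39                    # e.g. 13 MFCCs + deltas + delta-deltas
X = rng.normal(size=(n_frames, n_mfcc))
y = (rng.random(n_frames) < 0.1).astype(int)   # 1 = frame contains a landmark
X[y == 1] += 0.75                              # make the toy classes separable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
svm = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)  # RBF-kernel SVM
svm.fit(X_tr, y_tr)

# Per-frame landmark detection probabilities, usable for manner-class decoding
# or as arc scores in lattice rescoring.
p_landmark = svm.predict_proba(X_te)[:, 1]
print(f"frame accuracy: {svm.score(X_te, y_te):.3f}")
```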
Experiment #2: Lattice Rescoring • Infrastructure Prepared Prior to Summer 2004: • Word Recognition Lattices for Switchboard test corpus (Byrne and Makhoul have both tentatively offered lattices) • “Pinched” lattices (time-aligned to ML transcription) • Code to learn SVM-MLP landmark detection probabilities • Efficient code for lattice rescoring • Code for lattice error analysis: locate phoneme and landmark differences between ML path and correct path, and tabulate by syllable position and manner features. • Experiment Schedule, Summer 2004: • Weeks 1-2: Train landmark detection probabilities, both MFCC-based and acoustically-diverse landmark detectors. Refine landmark-based dictionaries; retrain if necessary. • Weeks 3-4: Test lattice rescoring results as function of (1) acoustic observations, (2) dictionary type. • Weeks 5-6: Test lattice rescoring results using landmarks computed from noisy observations.
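A minimal sketch of the rescoring idea, under the assumption that each lattice arc already carries an acoustic-plus-language-model score and a precomputed landmark log-probability: the two are interpolated and the best path is found by dynamic programming over the DAG. The toy lattice and the interpolation weight are invented for illustration.

```python
# Minimal sketch of word-lattice rescoring with landmark detection probabilities.
from collections import defaultdict
import math

# arc: (start_node, end_node, word, original_score, landmark_logprob)
arcs = [
    (0, 1, "backed", -12.1, math.log(0.80)),
    (0, 1, "bagged", -11.9, math.log(0.30)),
    (0, 1, "buck",   -11.5, math.log(0.10)),
    (1, 2, "up",     -4.0,  math.log(0.90)),
]
LANDMARK_WEIGHT = 5.0   # interpolation weight; would be tuned on held-out data

def best_path(arcs, start=0, end=2, weight=LANDMARK_WEIGHT):
    outgoing = defaultdict(list)
    for a in arcs:
        outgoing[a[0]].append(a)
    best = {start: (0.0, [])}
    for node in sorted(outgoing):                  # node order is topological in this toy
        if node not in best:
            continue
        score, words = best[node]
        for (_, nxt, word, orig, lm_lp) in outgoing[node]:
            total = score + orig + weight * lm_lp  # original score + landmark term
            if nxt not in best or total > best[nxt][0]:
                best[nxt] = (total, words + [word])
    return best[end]

score, words = best_path(arcs)
print(" ".join(words), f"(score {score:.2f})")
```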
Lattice Rescoring – Oracle Experiment
Suppose all landmarks in the ICSI transcription were correctly recognized by the SVM. How much could the lattices be improved?
N-best lattices, misc-ws97, 3-mixture 3-state HTK monophones, 19.8% WRA, 23000 arcs/second.
Result: WRA not improved (19.5%). Example (WRA = 1/10 both before and after):
REF: HOW DID THIS WORK A MALE RAT HAD BEEN BOUGHT
BEFORE: HIGH KIDS ARE TERM YOU'RE AT A DO BY
AFTER: HIGH TO HIS WORK IN YOUR AT A DO BY
Possible resolution: preprocess the lattices using Byrne's method to reduce ambiguity to a level that can be addressed using manner features.
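The oracle measurement itself amounts to taking the minimum word edit distance between the reference and any path the lattice allows. A toy sketch over the two hypothesis strings above (not over a full lattice) is shown below.

```python
# Minimal sketch: oracle error count as the minimum word edit distance over an
# N-best list. The two hypotheses are the "before" and "after" strings above.
def edit_distance(ref, hyp):
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1][-1]

REF = "HOW DID THIS WORK A MALE RAT HAD BEEN BOUGHT".split()
HYPS = ["HIGH KIDS ARE TERM YOU'RE AT A DO BY".split(),
        "HIGH TO HIS WORK IN YOUR AT A DO BY".split()]
oracle = min(edit_distance(REF, h) for h in HYPS)
print(f"oracle errors: {oracle} / {len(REF)} reference words")
```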
Experiment #3: Noise • Infrastructure Prepared Prior to Summer 2004: • Switchboard waveform data, in babble, 10 dB and 0 dB SNR. • Acoustic observation files (MFCC and diverse observations) created from all noisy waveform files. • Experiment Schedule, Summer 2004: • Weeks 3-4: Train landmark detectors using noisy speech data. Test landmark detectors in the task of manner class recognition. • Weeks 5-6: Test lattice rescoring results using landmarks computed from noisy observations.
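Building the noisy training and test sets requires mixing babble noise into clean waveforms at a target SNR. A minimal array-level sketch is shown below; the waveform and noise here are synthetic stand-ins, and real data preparation would read and write audio files.

```python
# Minimal sketch of mixing noise into a clean waveform at a target SNR.
import numpy as np

def add_noise(clean, noise, snr_db):
    """Scale `noise` so that 10*log10(P_clean / P_noise) == snr_db, then add it."""
    noise = np.resize(noise, clean.shape)           # loop/truncate noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 300 * np.arange(16000) / 8000)   # 2 s toy "speech"
babble = rng.normal(size=8000)                              # stand-in for babble noise
noisy_0db = add_noise(clean, babble, snr_db=0.0)
snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy_0db - clean) ** 2))
print(f"measured SNR: {snr:.1f} dB")
```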
Summary • Landmarks: a somewhat different view of the speech signal. • Integration with existing systems via lattice rescoring. • Probable benefits: • Low parameter count • High manner-class recognition accuracy • Acoustic diversity → noise robustness • Costs: Novel theory, e.g., • Label sequence SVM: convergence not yet guaranteed. • Landmark detection probabilities are discriminant; pronunciation model is a likelihood. • The costs are also benefits: a successful workshop could spawn important research.
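On the last cost listed above, one standard way to reconcile a discriminant detector with a likelihood-based pronunciation model is the hybrid "scaled likelihood" conversion, p(x|q) ∝ P(q|x)/P(q). The sketch below is offered only as background, not as the solution the workshop would adopt; the numbers are invented.

```python
# Background sketch: converting detector posteriors into scaled likelihoods.
import numpy as np

posteriors = np.array([0.70, 0.20, 0.10])   # P(q | x) from a discriminant detector
priors     = np.array([0.50, 0.30, 0.20])   # P(q) estimated from training labels
scaled_likelihoods = posteriors / priors    # proportional to p(x | q)
print(np.log(scaled_likelihoods))           # log scores usable in a likelihood decoder
```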
Citations
S. Chang, L. Shastri, and S. Greenberg, "Robust phonetic feature extraction under a wide range of noise backgrounds and signal-to-noise ratios," Workshop on Consistent and Reliable Acoustic Cues for Sound Analysis, Aalborg, Denmark, 2001.
M. Hasegawa-Johnson, "Time-Frequency Distribution of Partial Phonetic Information Measured Using Mutual Information," ICSLP, 2000.
A. Juneja, Speech recognition using acoustic landmarks and binary phonetic features classifiers, PhD thesis proposal, University of Maryland, August 2003.
A. Juneja and C. Espy-Wilson, "Speech segmentation using probabilistic phonetic feature hierarchy and support vector machines," International Joint Conference on Neural Networks, 2003.
K. Kirchhoff, G. Fink, and G. Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition," Speech Communication, May 2002.
K. Kirchhoff, Robust Speech Recognition Using Articulatory Information, PhD thesis, University of Bielefeld, Germany, July 1999.
P. Niyogi, C. Burges, and P. Ramesh, "Distinctive Feature Detection Using Support Vector Machines," ICASSP, 1999.
P. Niyogi and C. Burges, Detecting and Interpreting Acoustic Features by Support Vector Machines, University of Chicago Technical Report TR-2002-02, http://www.cs.uchicago.edu/research/publications/techreports/TR-2002-02
K. N. Stevens, S. Y. Manuel, S. Shattuck-Hufnagel, and S. Liu, "Implementation of a Model for Lexical Access Based on Features," ICSLP, 1992.
W. Strange, J. J. Jenkins, and T. L. Johnson, "Dynamic Specification of Coarticulated Vowels," Journal of the Acoustical Society of America 74(3):695–705, 1983.