Speech Recognition

Mital Gandhi and Brian Romanowski

Objective - Speech Recognition • Isolated Word Recognition • Portable and Fast

System Block Diagram

Recognition – Conceptually • Data Acquisition • Training Hidden Markov Models for word set • Recognition & Analysis

Theory – Hidden Markov Models • Used to model quasi-stationary random processes, such as speech • Example: the word “cat” modeled as the sound sequence /k a t/

Viterbi-based Recognition • Calculates the maximum log-likelihood of a series of observations given a particular HMM. • “Which model did this set of data most likely come from?” • Saves time by calculating only a subset of the possible paths through the HMM network. • At each new frame, only the most likely transition/observation state pairs are kept. • Concepts similar to Dynamic Time Warping
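The scoring idea on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the project's DSP code: it assumes discrete observation symbols instead of Gaussian densities over MFCC vectors, and the dictionary layout, `viterbi_log_score`, `recognize`, and the two toy models are hypothetical names invented for the example. It keeps the best predecessor for every state at each frame; the additional pruning mentioned on the slide is not implemented.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi_log_score(model, observations):
    """Log-probability of the single best state path through one HMM.

    `model` is a hypothetical dict (an assumption of this sketch):
      model["init"][j]     = P(start in state j)
      model["trans"][i][j] = P(move from state i to state j)
      model["emit"][j][o]  = P(symbol o | state j)
    """
    init, trans, emit = model["init"], model["trans"], model["emit"]
    states = range(len(init))
    # Best score of any path that ends in state j after the first frame.
    score = [safe_log(init[j]) + safe_log(emit[j][observations[0]])
             for j in states]
    for obs in observations[1:]:
        # For each state keep only its best predecessor (the Viterbi max).
        score = [
            max(score[i] + safe_log(trans[i][j]) for i in states)
            + safe_log(emit[j][obs])
            for j in states
        ]
    return max(score)


def recognize(models, observations):
    """Return the word whose HMM gives the highest Viterbi score."""
    return max(models,
               key=lambda word: viterbi_log_score(models[word], observations))


# Two toy 2-state left-to-right word models over symbols {0, 1}.
WORD_A = {"init": [1.0, 0.0],
          "trans": [[0.5, 0.5], [0.0, 1.0]],
          "emit": [[0.9, 0.1], [0.1, 0.9]]}
WORD_B = {"init": [1.0, 0.0],
          "trans": [[0.5, 0.5], [0.0, 1.0]],
          "emit": [[0.1, 0.9], [0.9, 0.1]]}

print(recognize({"A": WORD_A, "B": WORD_B}, [0, 0, 1, 1]))  # prints A
```

Working in the log domain turns products of small probabilities into sums, which is one common way to address the kind of numerical concerns a fixed-length frame sequence raises.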

System Components I Volume Box • Sound Input • Amplifier • Reference Voltage • Resistor network (Voltage Dividers) • Voltage followers • Comparator • Microphone voltage vs. Reference • Output • LED bargraph

System Components II Hidden Markov Model Toolkit (HTK) • Data Acquisition • Data Preparation • Parameter Enhancements • Recognition & Analysis

System Components II (cont.) HTK: Data Acquisition & Preparation • Data Acquisition • Recording using HSLab • Live audio input using HVite • Data Preparation • External files: dictionary, config, word lists • Initialization of prototype models (HCompV)

System Components II (cont.) HTK: Sample External Files • Config • Prototype Model

System Components II (cont.) HTK: Training & Recognition • HERest – parameter re-estimation and enhancement tool • Uses information from the energy, delta, & acceleration features in the cepstral domain • HVite for Recognition • Recognition of pre-recorded files or live audio input • A host of external files to support the recognition • Analysis tool HResults to compute accuracy & correctness results
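The delta and acceleration features mentioned above are time derivatives of the cepstral (and energy) coefficients. A minimal sketch of the usual regression formula, d_t = Σ θ·(c_{t+θ} − c_{t−θ}) / (2·Σ θ²), follows. The function name `deltas`, the window of 2 frames, and the edge handling (repeating the first and last frame) are assumptions of this example, chosen to resemble HTK's documented defaults; they are not taken from the project's configuration.

```python
def deltas(frames, window=2):
    """Regression-based time derivatives of a sequence of feature vectors.

    frames : list of equal-length lists, one feature vector per frame
    window : number of frames used on each side (assumed value)
    Frames beyond either end are replaced by the first or last frame.
    """
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    denom = 2.0 * sum(th * th for th in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(n_coeffs):
            num = sum(
                th * (frames[min(t + th, n_frames - 1)][k]
                      - frames[max(t - th, 0)][k])
                for th in range(1, window + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out


# Acceleration coefficients are the deltas of the deltas.
cepstra = [[0.0], [1.0], [2.0], [3.0], [4.0]]
delta = deltas(cepstra)
accel = deltas(delta)
```

Appending `delta` and `accel` to each static vector triples the observation size while giving the HMM some information about how the spectrum is changing.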

System Components II (cont.) HTK: Results & Analysis • HResults • Computes % values for recognition accuracy and correctness • Results Analysis • N = total number of reference labels; H = number correctly recognized • %Corr = H / N = percentage of reference labels correctly recognized • Acc = (H − I) / N • Correctness does not penalize insertion errors; accuracy does
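As a worked example, the two percentages can be recomputed from the bracketed counts that HResults prints: H (hits), D (deletions), S (substitutions), I (insertions), and N (reference labels). The helper name `hresults_percentages` is hypothetical; the counts used below are the ones reported on the final HTK results slide of this deck.

```python
def hresults_percentages(h, d, s, i, n):
    """Return (%Correct, %Accuracy) from HResults-style counts.

    %Correct  = 100 * H / N        (ignores insertions)
    %Accuracy = 100 * (H - I) / N  (penalizes insertions)
    """
    if h + d + s != n:
        raise ValueError("H + D + S must equal N")
    percent_correct = 100.0 * h / n
    percent_accuracy = 100.0 * (h - i) / n
    return percent_correct, percent_accuracy


# Word-level counts from the final pre-recorded test in this deck.
corr, acc = hresults_percentages(h=286, d=0, s=5, i=0, n=291)
print(f"WORD: %Corr={corr:.2f}, Acc={acc:.2f}")

# Sentence-level lines carry only H, S, and N.
sent_corr, _ = hresults_percentages(h=92, d=0, s=5, i=0, n=97)
print(f"SENT: %Correct={sent_corr:.2f}")
```

With no insertions the two figures coincide, which is why the slides show identical %Corr and Acc values for isolated-word tests.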

System Components II (cont.) HTK: Preliminary Results
====================== HTK Results Analysis ======================
  Date: Mon Sep 30 16:50:59 2002
  Ref : 4word_word.mlf
  Rec : recout.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=25.00 [H=1, S=3, N=4]
WORD: %Corr=25.00, Acc=25.00 [H=9, D=0, S=3, I=0, N=12]
======================

System Components II (cont.) HTK: Techniques, Solutions • Input File Specifications • Config • Cepstral mean subtraction, energy normalization • Prototype model • Number of states per word model • “Optimality” in transition probability assignments (matrix) • Data • “Noise-free” data • As many tokens/samples of each word as possible for training
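To make the prototype-model choices concrete, here is a small sketch of an initial left-to-right transition matrix for one word model. It assumes the HTK convention of a non-emitting entry state and a non-emitting exit state around the emitting states; the function name `left_to_right_transitions` and the self-loop probability of 0.6 are illustrative assumptions, not the values used in the project.

```python
def left_to_right_transitions(n_emitting, stay=0.6):
    """Build an initial transition matrix for a left-to-right word HMM.

    n_emitting : number of emitting states (a tuning choice per word)
    stay       : assumed initial self-loop probability for each state
    The matrix has n_emitting + 2 rows: row 0 is the entry state, the
    last row is the exit state and is left as all zeros.
    """
    n = n_emitting + 2
    a = [[0.0] * n for _ in range(n)]
    a[0][1] = 1.0                      # entry state always moves on
    for i in range(1, n - 1):
        a[i][i] = stay                 # remain in the same state
        a[i][i + 1] = 1.0 - stay       # advance to the next state
    return a


for row in left_to_right_transitions(3):
    print(" ".join(f"{p:.1f}" for p in row))
```

A self-loop probability p implies an expected stay of 1 / (1 − p) frames in each state, so the number of states and the initial p together set the word duration the untrained model expects. Re-estimation then adjusts the nonzero entries, while entries initialized to zero stay zero.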

DSP – System Overview • Initialization • Threshold/Recording • MFCC • Viterbi • Output

DSP - Matlab • Prototype of all important algorithms • Pre-calculated data • Run-time altering of data (debugging) • Downloading and visualization of data • MFCCs

DSP – Recording/Thresholding • Speech Input • Process • Poll A/D for input data (TI-provided code used) • Take only one channel as input • Downsample • Save samples only when signal threshold has been crossed • Lead buffer • Tail buffer • PROBLEMS • Sample transfer modes, single channel selection, threshold values, external microphones • TESTING • Visual and audio inspection in Matlab
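The threshold-triggered capture with lead and tail buffers can be sketched as follows. This is an offline Python illustration over a list of samples, not the polling code that ran on the DSP; `capture_utterance` and its parameters are hypothetical, and the stopping rule (a run of `tail` consecutive quiet samples) is an assumption about how the tail buffer was used.

```python
from collections import deque


def capture_utterance(samples, threshold, lead, tail):
    """Return the samples of the first utterance found in `samples`.

    threshold : recording starts when |sample| exceeds this value
    lead      : number of samples kept from just before the trigger
    tail      : recording stops after this many consecutive quiet samples
    """
    lead_buf = deque(maxlen=lead)   # rolling history before the trigger
    recorded = []
    recording = False
    quiet_run = 0
    for x in samples:
        loud = abs(x) > threshold
        if not recording:
            if loud:
                recording = True
                recorded.extend(lead_buf)   # keep the soft word onset
                recorded.append(x)
            else:
                lead_buf.append(x)
        else:
            recorded.append(x)              # tail keeps the soft word ending
            quiet_run = 0 if loud else quiet_run + 1
            if quiet_run >= tail:
                break
    return recorded


signal = [0, 1, 0, 1, 9, 8, 0, 9, 0, 0, 0, 0, 7]
print(capture_utterance(signal, threshold=5, lead=2, tail=3))
# prints [0, 1, 9, 8, 0, 9, 0, 0, 0]
```

The lead buffer preserves low-energy onsets that occur before the threshold is crossed, and the tail requirement prevents a short pause inside a word from ending the recording early.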

DSP – MFCC calculation (1) • Thanks to Takuya Ooura for his Public Domain FFT code. • MFCCs provide a small, largely decorrelated set of observation vectors for the HMMs • Process: • Remove DC offset • Pre-emphasize • Hamming window • FFT magnitude • Mel-filter bank • DCT • Lifter
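The processing chain listed on this slide can be followed end to end for a single frame in the sketch below. It is a plain-Python illustration and not the DSP implementation: a naive DFT stands in for Ooura's FFT, and every constant (pre-emphasis 0.97, 20 filters, 12 cepstra, lifter 22, the mel formula, the log floor) is an assumed, commonly used value, not a setting taken from the project. The function names are hypothetical.

```python
import math


def mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def dft_magnitude(x):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) stand-in for an FFT)."""
    n = len(x)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x[t] * math.cos(2.0 * math.pi * k * t / n) for t in range(n))
        im = -sum(x[t] * math.sin(2.0 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags


def mel_filterbank(mags, sample_rate, n_filters):
    """Apply triangular filters spaced evenly on the mel scale."""
    hz_per_bin = sample_rate / (2.0 * (len(mags) - 1))
    top = mel(sample_rate / 2.0)
    edges = [mel_inv(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    energies = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        total = 0.0
        for k, mag in enumerate(mags):
            f = k * hz_per_bin
            if lo < f <= mid:
                total += mag * (f - lo) / (mid - lo)    # rising slope
            elif mid < f < hi:
                total += mag * (hi - f) / (hi - mid)    # falling slope
        energies.append(total)
    return energies


def mfcc_frame(frame, sample_rate, n_filters=20, n_ceps=12,
               preemph=0.97, lifter=22):
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]                          # remove DC offset
    x = [x[0]] + [x[t] - preemph * x[t - 1] for t in range(1, n)]  # pre-emphasis
    x = [x[t] * (0.54 - 0.46 * math.cos(2.0 * math.pi * t / (n - 1)))
         for t in range(n)]                                # Hamming window
    mags = dft_magnitude(x)                                # FFT magnitude
    energies = mel_filterbank(mags, sample_rate, n_filters)
    log_e = [math.log(max(e, 1e-10)) for e in energies]    # floor avoids log(0)
    ceps = []
    for i in range(1, n_ceps + 1):
        c = math.sqrt(2.0 / n_filters) * sum(              # DCT of log energies
            log_e[j] * math.cos(math.pi * i * (j + 0.5) / n_filters)
            for j in range(n_filters))
        ceps.append(c * (1.0 + lifter / 2.0 * math.sin(math.pi * i / lifter)))
    return ceps


tone = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(128)]
coeffs = mfcc_frame(tone, sample_rate=8000)
```

The DCT step is what decorrelates the overlapping filter-bank outputs, and the lifter rescales the higher cepstra so that all coefficients have comparable magnitudes.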

DSP – MFCC calculation (2) • PROBLEMS: • An incorrectly coded pre-emphasis filter • TESTING: • Graphically compared DSP-generated MFCCs to: • Matlab MFCCs, to expose DSP numerical issues • HTK MFCCs, as the reference implementation

DSP – Viterbi/Recognition • Uses HTK-derived HMMs whose data is contained in a Matlab-generated #include file • PROBLEMS • Numerical concerns • Errors in deriving and coding the formulas

Final Component Results I: HTK • Pre-recorded Files:
====================== HTK Results Analysis ======================
  Date: Mon Dec 02 11:37:46 2002
  Ref : testwords.mlf
  Rec : testwordsoutput.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=94.85 [H=92, S=5, N=97]
WORD: %Corr=98.28, Acc=98.28 [H=286, D=0, S=5, I=0, N=291]
======================
• Live Audio Input: ~83% • DSP MFCC Files: ~65%

Final Component Results II: DSP • 95% recognition accuracy over 90 trials • 4 words • Trained speaker • Speaker Independence • Some recognition for non-modeled speakers, but not much • Speech => Decision takes approximately 0.88 seconds

Challenges • Speed • Complex project • System integration • Microphone input • Volume Box • HTK • MATLAB & DSP

Recommendations • HTK and DSP • Larger training corpus • Multiple Gaussian mixtures • Channel independence • Continuous Recognition • Real-time MFCC transmission from DSP to HTK • DSP • Code style fixes • Better user interface

Thank You • Dan Block – For use of his lab and equipment

