Automatic speech recognition using an echo state network. Mark D. Skowronski, Computational Neuro-Engineering Lab, Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA. May 10, 2006.
CNEL Seminar History, 2000-2006 • Ratio spectrum, Oct. 2000 • HFCC, Sept. 2002 • Bats, Dec. 2004 • Electrohysterography, Aug. 2005 • Echo state network, May 2006
Overview • ASR motivations • Intro to echo state network • Multiple readout filters • ASR experiments • Conclusions
ASR Motivations • Speech is the most natural form of communication among humans. • Human-machine interaction lags behind, still relying on tactile interfaces. • The bottleneck in machine understanding is signal-to-symbol translation. • Human speech is a “tough” signal: • Nonstationary • Non-Gaussian • Nonlinear systems for production/perception. How to handle the “non”-ness of speech?
ASR State of the Art • Feature extraction: HFCC • Bio-inspired frequency analysis • Tailored for statistical models • Acoustic pattern recognition: HMM • Piecewise-stationary stochastic model • Efficient training/testing algorithms • …but several simplistic assumptions • Language models • Use knowledge of language, grammar • HMM implementations • Machine language understanding still elusive (spam blockers)
Hidden Markov Model. Premier stochastic model of non-stationary time series used for decision making. Assumptions: 1) Speech is a piecewise-stationary process. 2) Features are independent. 3) State duration is exponentially distributed. 4) State transition probability is a function of the previous and next states only. Can we devise a better pattern recognition model?
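Assumption 3 is a consequence of the Markov property, not something measured from speech. As a short worked step (the symbol a_ii for the self-transition probability of state i is notation assumed here, not taken from the slides), the probability of staying in state i for exactly d frames is:

```latex
P_i(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad \mathrm{E}[d] = \frac{1}{1 - a_{ii}}
```

This geometric distribution is the discrete-time counterpart of the exponential: the shortest duration is always the most probable one, which is often cited as a poor match for observed speech segment durations.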
Echo State Network • Partially trained recurrent neural network, Herbert Jaeger, 2001 • Unique characteristics: • Recurrent “reservoir” of processing elements, interconnected with random untrained weights. • Linear readout weights trained with simple regression provide closed-form, stable, unique solution.
ESN Matrices • Win: untrained, M x Min matrix • Zero mean, unity variance normally distributed • Scaled by rin • W: untrained, M x M matrix • Zero mean, unity variance normally distributed • Scaled such that spectral radius r < 1 • Wout: trained, linear regression, Mout x M matrix • Regression closed-form, stable, unique solution • O(M^2) per data point complexity
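The construction of the two untrained matrices can be sketched in NumPy as follows. This is a minimal illustration under assumptions: the function name `make_esn_matrices`, the `seed` argument, and the demo sizes are hypothetical, and only the distribution and scaling rules come from the slide.

```python
import numpy as np

def make_esn_matrices(M, M_in, r, r_in, seed=0):
    """Draw untrained ESN weights: W_in (M x M_in) and W (M x M)."""
    rng = np.random.default_rng(seed)
    # Input weights: zero-mean, unit-variance normal entries, scaled by r_in.
    W_in = r_in * rng.standard_normal((M, M_in))
    # Reservoir weights: same distribution, rescaled so that the
    # spectral radius (largest eigenvalue magnitude) equals r.
    W = rng.standard_normal((M, M))
    rho = np.max(np.abs(np.linalg.eigvals(W)))
    W *= r / rho
    return W_in, W

# Demo with the sizes quoted for the Mackey-Glass example (scalar input assumed).
W_in, W = make_esn_matrices(M=60, M_in=1, r=0.9, r_in=0.3)
spectral_radius = np.max(np.abs(np.linalg.eigvals(W)))
```

Only Wout is trained; both matrices above stay fixed after they are drawn.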
Echo State Conditions • The network has echo states if x(n) is uniquely determined by the left-infinite input sequence …,u(n-1),u(n). • x(n) is an “echo” of all previous inputs. • If f is the tanh activation function: • σmax(W)=||W||<1 guarantees echo states • r=|λmax(W)|>1 guarantees no echo states
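The two tanh conditions can be checked numerically for a given W. The sketch below is illustrative only: the function name `echo_state_check` and its return strings are hypothetical, and the middle case (largest singular value at least 1 but spectral radius at most 1) is reported as inconclusive because neither condition on the slide applies to it.

```python
import numpy as np

def echo_state_check(W):
    """Apply the sufficient and necessary conditions for a tanh reservoir."""
    sigma_max = np.linalg.norm(W, 2)               # largest singular value, ||W||
    r = np.max(np.abs(np.linalg.eigvals(W)))       # spectral radius
    if sigma_max < 1:
        return "echo states guaranteed"
    if r > 1:
        return "no echo states"
    return "inconclusive"

contracting = 0.5 * np.eye(3)
expanding = 2.0 * np.eye(3)
# Spectral radius 0.5 but a large singular value: neither condition applies.
in_between = np.array([[0.5, 2.0], [0.0, 0.5]])
results = [echo_state_check(A) for A in (contracting, expanding, in_between)]
```

The gap between the two conditions is why the spectral radius is usually treated as a tuning parameter instead of a hard guarantee.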
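ESN Training • Minimize mean-squared error between y(n) and desired signal d(n). Wiener solution: Wout = (R^-1 P)^T, where R = E[x(n) x(n)^T] is the autocorrelation matrix of the reservoir states and P = E[x(n) d(n)^T] is the cross-correlation between states and desired signal.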
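A minimal sketch of readout training, assuming the common state update x(n) = tanh(W x(n-1) + Win u(n)) and a linear readout y(n) = Wout x(n); the slides do not spell out either equation, so both are assumptions here. The helper names, the noise input, and the one-step-delay target are hypothetical choices for the demo.

```python
import numpy as np

def run_reservoir(W_in, W, U):
    """Drive the reservoir with inputs U (N x M_in); return states X (N x M)."""
    x = np.zeros(W.shape[0])
    X = np.zeros((U.shape[0], W.shape[0]))
    for n in range(U.shape[0]):
        x = np.tanh(W @ x + W_in @ U[n])   # assumed update equation
        X[n] = x
    return X

def wiener_readout(X, D):
    """Least-squares readout: solve R W_out^T = P with sample correlations."""
    R = X.T @ X / len(X)                   # autocorrelation of states, M x M
    P = X.T @ D / len(X)                   # cross-correlation, M x M_out
    return np.linalg.solve(R, P).T         # W_out, M_out x M

rng = np.random.default_rng(0)
M, N = 20, 500
W = rng.standard_normal((M, M))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = 0.3 * rng.standard_normal((M, 1))
U = rng.uniform(-1.0, 1.0, size=(N, 1))
X = run_reservoir(W_in, W, U)
# Hypothetical target: recall the previous input, d(n) = u(n-1).
W_out = wiener_readout(X[1:], U[:-1])
mse = np.mean((X[1:] @ W_out.T - U[:-1]) ** 2)
```

Because the readout is linear in the states, the solution is closed-form; no gradient descent through the recurrent weights is needed.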
ESN Example: Mackey-Glass • M=60 PEs • r=0.9 • rin=0.3 • u(n): MG, 10000 samples • d(n)=u(n+1) • Prediction gain (var(u)/var(e)): Input: 16.3 dB; Wiener: 45.1 dB; ESN: 62.6 dB
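The prediction gain figure of merit is the ratio var(u)/var(e) expressed in decibels. A small sketch, with the function name `prediction_gain_db` and the synthetic signals chosen here purely for illustration (they are not the Mackey-Glass data):

```python
import numpy as np

def prediction_gain_db(u, e):
    """Prediction gain in dB: 10 * log10(var(u) / var(e))."""
    return 10.0 * np.log10(np.var(u) / np.var(e))

# Synthetic check: an error signal with one tenth the amplitude of u
# has one hundredth of its variance, i.e. a gain of about 20 dB.
u = np.sin(0.1 * np.arange(1000))
e = 0.1 * u
gain = prediction_gain_db(u, e)
```

On this scale, every additional 10 dB corresponds to a tenfold reduction in error variance.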
Multiple Readout Filters • Wout projects reservoir space to output space. • Question: how to divide reservoir space and use multiple readout filters? • Answer: competitive network of filters • Question: how to train/test competitive network of K filters? • Answer: mimic HMM.
Segmental K-means: Init. For each input xi(n) and desired di(n) for sequence i: divide x, d into equal-sized chunks Xη, Dη (one per state); for each n, select k(n) ∈ [1,K] uniformly at random. After initialization with all sequences: Wiener solution for each readout filter.
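The initialization step can be sketched as below. This is an illustration under assumptions: the function name `init_segmentation` and the `seed` argument are hypothetical, states are indexed from 0 for array convenience, and filters keep the 1..K indexing of the slide.

```python
import numpy as np

def init_segmentation(N, n_states, K, seed=0):
    """Assign each of N frames a state (equal chunks) and a random filter."""
    rng = np.random.default_rng(seed)
    # Equal-sized chunks: frame n goes to state floor(n * n_states / N).
    state = (np.arange(N) * n_states) // N
    # Filter index k(n) drawn uniformly from 1..K (upper bound is exclusive).
    k = rng.integers(1, K + 1, size=N)
    return state, k

state, k = init_segmentation(N=10, n_states=5, K=3)
```

The frames assigned to each (state, filter) pair would then feed that filter's correlation matrices before the first Wiener solve.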
Segmental K-means: Training • For each utterance: • Produce MSE for each readout filter. • Find Viterbi path through MSE matrix. • Use features from each state to update auto- and cross-correlation matrices. • After all utterances: Wiener solution • Guaranteed to converge to local minimum in MSE over training set.
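The Viterbi step over the MSE matrix can be sketched as a minimum-cost path search. The slides do not state the state topology, so this sketch assumes a left-to-right model with no skips (each frame either stays in its state or advances by one), and it assumes the matrix holds one cost per frame and state, for example the smallest squared error among that state's filters. The function name `viterbi_min_cost` is hypothetical.

```python
import numpy as np

def viterbi_min_cost(E):
    """Minimum-cost left-to-right path through E (N frames x S states)."""
    N, S = E.shape
    cost = np.full((N, S), np.inf)
    back = np.zeros((N, S), dtype=int)
    cost[0, 0] = E[0, 0]                       # path must start in state 0
    for n in range(1, N):
        for s in range(S):
            stay = cost[n - 1, s]
            move = cost[n - 1, s - 1] if s > 0 else np.inf
            if move < stay:
                cost[n, s], back[n, s] = move + E[n, s], s - 1
            else:
                cost[n, s], back[n, s] = stay + E[n, s], s
    # Backtrack from the final state, which the path is required to reach.
    path = np.zeros(N, dtype=int)
    path[-1] = S - 1
    for n in range(N - 1, 0, -1):
        path[n - 1] = back[n, path[n]]
    return path, cost[-1, S - 1]

E = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
path, total = viterbi_min_cost(E)
```

The resulting path gives the frame-to-state segmentation used to update each filter's auto- and cross-correlation matrices.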
ASR Example 1 • Isolated English digits “zero” - “nine” from TI46 corpus: 8 male, 8 female, 26 utterances each, 12.5 kHz sampling rate. • ESN: M=60 PEs, r=2.0, rin=0.1, 10 word models, various #states and #filters per state. • Features: 13 HFCC, 100 fps, Hamming window, pre-emphasis (α=0.95), CMS, Δ+ΔΔ (±4 frames) • Pre-processing: zero-mean and whitening transform • M1/F1: testing; M2/F2: validation; M3-M8/F3-F8 training • Two to six training epochs for all models • Desired: next frame of 39-dimension features • Test: corrupted by additive noise from “real” sources (subway, babble, car, exhibition hall, restaurant, street, airport terminal, train station) • Baseline: HMM with identical input features
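The pre-processing step (zero-mean and whitening) can be illustrated as follows. The slides do not say which whitening transform was used; this sketch assumes PCA whitening estimated from training features, and the names `fit_whitening` and `apply_whitening` as well as the synthetic data are hypothetical.

```python
import numpy as np

def fit_whitening(F):
    """Estimate mean and PCA-whitening matrix from features F (frames x dims)."""
    mu = F.mean(axis=0)
    C = np.cov(F, rowvar=False)                # sample covariance, dims x dims
    vals, vecs = np.linalg.eigh(C)             # C is symmetric
    T = vecs / np.sqrt(vals)                   # scale column j by 1/sqrt(vals[j])
    return mu, T

def apply_whitening(F, mu, T):
    """Remove the mean, then decorrelate and normalize each dimension."""
    return (F - mu) @ T

rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 2.0]])
F_train = rng.standard_normal((500, 3)) @ mixing + 4.0   # correlated, offset
mu, T = fit_whitening(F_train)
Z = apply_whitening(F_train, mu, T)
```

In an experiment like the one described, the transform would be estimated on the training speakers and then applied unchanged to the validation and test speakers.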
ASR Results, noise free Number of classification errors out of 518 (smaller is better).
ASR Results, noisy Average accuracy (%), all noise sources, 0-20 dB SNR (larger is better):
ASR Results, noisy Single mixture per state (K=1): ESN classifier
ASR Results, noisy Single mixture per state (K=1): HMM baseline
ASR Example 2 • Same experiment setup as Example 1. • ESN: M=600 PEs, 10 states, 1 filter per state, rin=0.1, various r. • Desired: one-of-many encoding of class, ±1, tanh output activation function AFTER linear readout filter. • Test: corrupted by additive speech-shaped noise • Baseline: HMM with identical input features
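The one-of-many target encoding and the tanh applied after the linear readout can be sketched as below. The encoding follows the slide; the decision rule (averaging outputs over the frames of an utterance and picking the largest) is an assumption made only for this illustration, and the function names are hypothetical.

```python
import numpy as np

def one_of_many(labels, n_classes):
    """Targets of +1 for the true class and -1 for every other class."""
    D = -np.ones((len(labels), n_classes))
    D[np.arange(len(labels)), labels] = 1.0
    return D

def classify(W_out, X):
    """Linear readout, then tanh, then an assumed average-and-argmax decision."""
    Y = np.tanh(X @ W_out.T)                   # tanh AFTER the linear readout
    return int(np.argmax(Y.mean(axis=0)))

targets = one_of_many([0, 2], n_classes=3)
# Toy readout and states: the identity readout passes the states straight through.
decision = classify(np.eye(3), np.array([[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]))
```

Squashing the readout with tanh keeps outputs within the ±1 range of the targets.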
Discussion • What gives the ESN classifier its noise-robust characteristics? • Theory: ESN reservoir provides context of noisy input, allowing reservoir to reduce effects of noise by averaging. • Theory: Non-linearity and high-dimensionality of network increases linear separability of classes in reservoir space.
Future Work • Replace winner-take-all with mixture-of-experts. • Replace segmental K-means with Baum-Welch-type training algorithm. • “Grow” network during training. • Consider nonlinear activation functions (e.g., tanh, softmax) AFTER linear readout filter.
Conclusions • ESN classifier using inspiration from HMM: • Multiple readout filters per state, multiple states. • Trained as competitive network of filters. • Segmental K-means guaranteed to converge to local minimum of total MSE from training set. • ESN classifier noise robust compared to HMM: • Ave. over all sources, 0-20 dB SNR: +21 percentage points • Ave. over all sources: +9 dB SNR