Keyword Spotting: Dynamic Time Warping • Ali Akbar Jabini • Alexandre Mercier-Dalphond • Spring 2006
Introduction • Speech recognition: • Computers can interpret speech • Needs an input device to digitize sound • Microphone • People can speak faster than they type • Commercial systems available since the 1990s • People prefer physical interaction • Keyboard/mouse, on/off switch • Low accuracy (50%) for large vocabularies in noisy conditions
Introduction • Speech recognition is increasingly used with small vocabularies • Credit card systems • Simple switching commands • Directory assistance • Cheap to implement • High accuracy • Can verify their interpretation • Idea: speech recognition for household appliances
OUTLINE • Area of investigation • Concrete task/Goal • Schematic • Feature extraction • DTW • Training • Evaluation metrics • Conclusion
Area of Investigation • Keyword spotting: • Subfield of speech recognition • Grammar constrained • Keyword spotting in isolated word recognition • Keyword utterances • Keywords separated by silence • Main technique is dynamic time warping (DTW)
Concrete task/Goal • Goal: develop a robust, speaker-independent keyword spotting scheme to operate household appliances • Concrete tasks • Digitize the sound inputs • Implementation in MATLAB • Train the model with the grammar • Analyze the performance of our scheme
Schematic • Microphone -> A/D -> Feature extraction -> DTW -> Output • Grammar -> DTW
Feature extraction • Pre-emphasis • Flattens the spectrum of the signal • Blocking into frames • Frame length sets the length of the Fourier transform • Windowing • Window applied to each frame (possibly Hamming) • Mel-frequency cepstral coefficients (MFCC) • More reliable than LPC coefficients • These will be the input to the DTW algorithm
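The first three steps of the slide above (pre-emphasis, framing, windowing) can be sketched as follows. This is a minimal illustration in plain Python rather than the MATLAB implementation the authors planned. The pre-emphasis coefficient 0.97, the frame length, the hop size, and all function names are assumptions chosen for the example, and the MFCC step itself is not shown.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; boosts high frequencies to flatten the spectrum.
    alpha=0.97 is a common choice, assumed here (not given in the slides)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def split_into_frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames.
    Trailing samples that do not fill a whole frame are dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame_len):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), for N > 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]

def windowed_frames(signal, frame_len=8, hop=4, alpha=0.97):
    """Pre-emphasize, frame, and window a signal (hypothetical helper)."""
    window = hamming(frame_len)
    emphasized = pre_emphasize(signal, alpha)
    return [[s * w for s, w in zip(frame, window)]
            for frame in split_into_frames(emphasized, frame_len, hop)]

# Toy signal of 20 samples, purely illustrative.
frames = windowed_frames([float(n % 5) for n in range(20)])
```

Each windowed frame would then go through a Fourier transform and a mel filter bank to produce the cepstral coefficients that feed the DTW stage.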
DTW • Idea: find the smallest distance between an input and the templates in the training bank • Distance computed on cepstral features • Dynamic programming: the time axis is warped nonlinearly to account for variation in how utterances are spoken • t0 -> t0+5 • t1 -> t1-2
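The dynamic-programming idea can be illustrated with a textbook DTW distance. This is a sketch, not the authors' code: the Euclidean local cost, the unconstrained three-way step pattern, and the toy one-dimensional "feature" frames are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Local cost between two feature vectors (assumed; any frame distance would do)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Cumulative cost of the cheapest monotonic alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_b frame repeated
                                 cost[i][j - 1],      # seq_a frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Toy sequences: "slow" is a time-stretched copy of "template".
template = [[0.0], [1.0], [2.0], [1.0], [0.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [1.0], [0.0]]
other = [[2.0], [2.0], [0.0], [0.0], [2.0]]

same_word = dtw_distance(template, slow)    # 0.0: stretching costs nothing
different = dtw_distance(template, other)   # strictly positive
```

Because repeated frames are free under this step pattern, a slowly spoken keyword matches its template exactly, which is the nonlinear time axis the slide refers to.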
Training • Need to create our own grammar • On: Onnn, Honnn, open, opeeenn • Off: Hooofff, Hoff, offfff, close • As many potential utterances as possible • Use this data with DTW
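One way to use such a bank with DTW is nearest-template classification: the input is compared with every recorded utterance and takes the label of the closest one. The sketch below assumes that scheme, which the slides do not spell out. The numeric sequences stand in for real cepstral features and are invented for illustration, and `dtw` and `classify` are hypothetical names.

```python
def dtw(a, b):
    """Compact DTW over 1-D sequences, absolute difference as local cost.
    Keeps only the previous row of the cost table."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[-1]

def classify(utterance, bank):
    """Return (label, distance) of the closest template in the bank."""
    best_label, best_dist = None, float("inf")
    for label, templates in bank.items():
        for template in templates:
            d = dtw(utterance, template)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist

# Invented toy templates: several variants per keyword, as the slide suggests.
bank = {
    "on": [[1, 3, 3, 1], [1, 1, 3, 3, 3, 1]],
    "off": [[3, 1, 1, 3], [3, 3, 1, 1, 1, 3]],
}

label, dist = classify([1, 3, 3, 3, 3, 1, 1], bank)
```

Adding more variants per keyword increases the chance that some template lies close to a new speaker's utterance, at the cost of one extra DTW comparison per template.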
Evaluation metrics • Accuracy • High noise • Low noise • Speaker not in the training data • Speaker in the training data • Target: 80% accuracy or more
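Accuracy per test condition could be tallied as below. The trial records are made up purely to show the calculation and are not results from this project; the function name and the record layout are likewise assumptions.

```python
from collections import defaultdict

def accuracy_by_condition(results):
    """results: iterable of (condition, true_label, predicted_label) tuples.
    Returns the fraction of correct predictions for each condition."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, truth, predicted in results:
        total[condition] += 1
        correct[condition] += int(truth == predicted)
    return {c: correct[c] / total[c] for c in total}

# Made-up trials for illustration only.
results = [
    ("low noise", "on", "on"), ("low noise", "off", "off"),
    ("low noise", "on", "on"), ("low noise", "off", "on"),
    ("high noise", "on", "off"), ("high noise", "off", "off"),
]

scores = accuracy_by_condition(results)
# Compare each condition with the 80% target from the slide.
meets_target = {c: acc >= 0.80 for c, acc in scores.items()}
```

The same tally works for the speaker split by using "training speaker" and "independent speaker" as the condition labels.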
Conclusion • Early stage • No code implemented yet • Many challenges ahead • Our methodology may change slightly • There is a large potential market for such a technique -> influence on everyday life.