Wake-up Word Detector
Douglas Rauscher
ECE5525
April 30, 2008
Introduction
• The purpose of this project is to generate feature vectors and Hidden Markov Models (HMMs) for a single word.
• Data is processed using CMU Sphinx and Matlab.
• The Wake-up Word chosen is “Help”.
Corpus
• The corpus used is the original WUW_Corpus, provided on the ECE5526 server: ftp://163.118.203.219/CORPORA/WUW_Corpora/WUW_Corpus/
• This corpus was chosen because single-word utterances of “Help” are frequent in the data set.
• Data is in µ-law format.
File lists & Transcriptions
• Before processing in Sphinx, “transcription” and “fileids” files need to be created:
  • wuw_corpus_train.fileids
  • wuw_corpus_train.transcription
  • wuw_corpus_test.fileids
  • wuw_corpus_test.transcription
• These were created in Matlab by searching the “|”-delimited corpus transcription file (wuw.trans) for “Help” utterances.
• 80% of the “Help” utterances were used in the training list; the remaining 20% were used in the test list.
• All utterances that did not contain “Help” were included in the test set to test for false alarms.
• A handful of utterances in the original .trans file were manually removed from the list because:
  • they had no data bytes in the file,
  • Sphinx had trouble with the sound quality, or
  • the utterance was cut off in a way that caused Sphinx to throw an error.
dcr_extract.m
```matlab
% Build the Sphinx fileids/transcription lists from the corpus transcription file
close all; clear all; clc;
% Read every "|"-delimited field of wuw.trans into a cell array of strings
A = textread('C:\CMUtutorial\WUW_Corpus\wuw.trans', '%s', 'delimiter', '|');
% Locate each record by its gender field; the other fields follow it in order
idx = 1:length(A);
idx = idx((strcmp(A,'Male') + strcmp(A,'Female')) > 0);
gender     = A(idx);
dialect    = A(idx+1);
phone_type = A(idx+2);
filename   = A(idx+3);
CallNO     = A(idx+4);
UttNO      = A(idx+5);
Ortho      = A(idx+6);          % orthography (the words spoken)
% Split utterance indices into "Help" and everything else
AllIdx     = 1:length(Ortho);
HelpIdx    = AllIdx(strcmp(Ortho,'Help'));
NotHelpIdx = AllIdx(~strcmp(Ortho,'Help'));
N = floor(length(HelpIdx)*0.8); % first 80% of "Help" utterances go to training

% Training lists: "Help" utterances 1..N
fout = fopen('C:\CMUtutorial\WUW_Corpus\etc\wuw_corpus_train.fileids', 'w');
ftsn = fopen('C:\CMUtutorial\WUW_Corpus\etc\wuw_corpus_train.transcription', 'w');
for k = 1:N
    fprintf(fout, 'calls/%s/WUW%s_%s\n', char(filename(HelpIdx(k))), ...
        char(filename(HelpIdx(k))), char(CallNO(HelpIdx(k))));
    fprintf(ftsn, '<s> %s </s> (WUW%s_%s)\n', upper(char(Ortho(HelpIdx(k)))), ...
        char(filename(HelpIdx(k))), char(CallNO(HelpIdx(k))));
end
fclose(fout); fclose(ftsn);

% Test lists
fout = fopen('C:\CMUtutorial\WUW_Corpus\etc\wuw_corpus_test.fileids', 'w');
ftsn = fopen('C:\CMUtutorial\WUW_Corpus\etc\wuw_corpus_test.transcription', 'w');
% Remaining 20% of the "Help" utterances
for k = (N+1):length(HelpIdx)
    fprintf(fout, 'calls/%s/WUW%s_%s\n', char(filename(HelpIdx(k))), ...
        char(filename(HelpIdx(k))), char(CallNO(HelpIdx(k))));
    fprintf(ftsn, '%s (WUW%s_%s)\n', upper(char(Ortho(HelpIdx(k)))), ...
        char(filename(HelpIdx(k))), char(CallNO(HelpIdx(k))));
end
% All other utterances, used to measure false alarms
for k = 1:length(NotHelpIdx)
    fprintf(fout, 'calls/%s/WUW%s_%s\n', char(filename(NotHelpIdx(k))), ...
        char(filename(NotHelpIdx(k))), char(CallNO(NotHelpIdx(k))));
    fprintf(ftsn, '%s (WUW%s_%s)\n', upper(char(Ortho(NotHelpIdx(k)))), ...
        char(filename(NotHelpIdx(k))), char(CallNO(NotHelpIdx(k))));
end
fclose(fout); fclose(ftsn);
```
Data preparation
• Corpus data was originally:
  • file extension .ulaw
  • 8-bit µ-law format
  • 8 kHz sample rate
• This data must be converted because Sphinx cannot read .ulaw files.
• Format chosen for the conversion:
  • file extension .raw
  • 16-bit linear quantization
  • 16 kHz (linearly interpolated)
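For comparison with the conversion described above, here is a minimal Python sketch of the standard ITU-T G.711 µ-law expansion (bit-inverted code, sign/exponent/mantissa fields, bias of 132), followed by the same 2× linear interpolation used in ulaw2raw.m. This is not the author's method: ulaw2raw.m uses a continuous companding formula rather than the G.711 segment decoding. The function names are hypothetical, and little-endian output is assumed, which matches what MATLAB's `fwrite` produces on a PC.

```python
import struct

def ulaw_to_linear(code):
    """Expand one 8-bit G.711 mu-law code (0-255) to a 16-bit linear sample."""
    b = ~code & 0xFF                      # mu-law codes are stored bit-inverted
    sign = b & 0x80
    exponent = (b >> 4) & 0x07
    mantissa = b & 0x0F
    # 0x84 (132) is the bias added before encoding; remove it here
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

def upsample2x(samples):
    """8 kHz -> 16 kHz by linear interpolation: after each sample insert the
    mean of it and the next one (the last sample is averaged with 0, as
    ulaw2raw.m does)."""
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else 0
        out.append(s)
        out.append((s + nxt) // 2)        # integer mean (floor division)
    return out

def ulaw_bytes_to_raw(data):
    """8 kHz mu-law bytes -> 16 kHz, 16-bit little-endian headerless PCM."""
    pcm = upsample2x([ulaw_to_linear(c) for c in data])
    return struct.pack("<%dh" % len(pcm), *pcm)
```

A file could then be converted with `open(out_name, "wb").write(ulaw_bytes_to_raw(open(in_name, "rb").read()))`, where `in_name` and `out_name` are placeholders.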
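ulaw2raw.m
```matlab
% Driver loop (run separately, e.g. from the command window):
% convert every call directory 00000 through 00252
for k = 0:252
    ulaw2raw(sprintf('C:\\CMUtutorial\\WUW_Corpus\\calls\\%05d\\', k), 0);
end
```
```matlab
function ulaw2raw(filepath, playflag)
% ULAW2RAW Convert every .ulaw file in FILEPATH to a 16-bit, 16 kHz .raw file.
%   ulaw2raw('C:\CMUtutorial\WUW_Corpus\calls\00000\', 0);
%   Set PLAYFLAG to 1 to play each converted file.
if nargin < 2
    playflag = 0;
end
cd_save = cd;
cd(filepath);
files = dir('*.ulaw');      % only .ulaw files, so no '.'/'..' entries to skip
u = 255;                    % US standard mu-law coefficient
for k = 1:length(files)
    disp(files(k).name);
    fin = fopen(files(k).name, 'r');
    A = fread(fin, 'int8'); % one signed byte per 8 kHz sample
    fclose(fin);
    % move data to proper sign
    A1 = A.*(A<=0) + (127-A).*(A>0);
    % remove mu-law companding: linear samples in [-1, 1]
    B1 = sign(A1).*(1/u).*(((1+u).^abs(A1/128)) - 1);
    % upsample to 16 kHz: follow each sample with the mean of it and the next
    B2 = reshape([B1, (B1 + [B1(2:end); 0])./2].', 1, []);
    if playflag
        sound(B2, 16000);
        pause(length(B2)/16000);
    end
    generateRawWav(files(k).name(1:end-5), B2);
end
cd(cd_save);
end

function generateRawWav(filename, data)
% Write the samples as headerless 16-bit integers (.raw)
fout = fopen(strcat(filename, '.raw'), 'w');
dataq = round(32768.*data./128);
fwrite(fout, dataq, 'int16');
fclose(fout);
end
```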
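The arithmetic in ulaw2raw.m can be written out explicitly, which makes the continuous µ-law expansion, the interpolation, and the scaling in generateRawWav easier to check. Here A[n] is the signed input byte, n = 0, …, N−1 is the 8 kHz sample index, and m is the 16 kHz output index:

```latex
x[n] = \frac{1}{128}
\begin{cases}
  A[n],       & A[n] \le 0 \\
  127 - A[n], & A[n] > 0
\end{cases}
\qquad
y[n] = \operatorname{sgn}(x[n])\,\frac{(1+\mu)^{|x[n]|} - 1}{\mu}, \quad \mu = 255

z[2n] = y[n], \qquad z[2n+1] = \tfrac{1}{2}\bigl(y[n] + y[n+1]\bigr), \quad y[N] \equiv 0

q[m] = \operatorname{round}\!\left(\frac{32768\, z[m]}{128}\right) = \operatorname{round}\bigl(256\, z[m]\bigr)
```

Because |x[n]| ≤ 1, each y[n] and z[m] lies in [−1, 1], so the int16 values written to the .raw file lie in [−256, 256].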
Language model creation
• A Wake-up Word recognizer only needs to decide whether a single word was spoken, so a language model adds little to the detection.
• Sphinx allows the user to weight the language model in its calculations, but does not appear to allow the language model to be disabled altogether.
• Therefore, to avoid errors, a custom language model had to be created.
• The CMU lmtool generator was used to convert a text file containing only the word “Help” into a .lm file: http://www.speech.cs.cmu.edu/tools/lmtool.html
• The lm3g2dmp tool was used to convert the .lm file to .lm.DMP format, from a Windows command prompt (cmd):
```
cd C:\CMUtutorial\lm3g2dmp\Debug
lm3g2dmp 7092.lm ./
```
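To show what the .lm text file contains, here is a minimal Python sketch that writes an ARPA-format unigram language model whose only word is “Help”. It is a hypothetical stand-in, not lmtool's output: lmtool may also emit bigrams and trigrams. It is also an assumption that lm3g2dmp accepts a unigram-only file. The two predictable tokens (HELP and `</s>`) split the probability mass equally, and `<s>` gets the conventional log10 probability of −99 because it is never predicted.

```python
import math

def single_word_arpa(word):
    """Return a minimal ARPA-format unigram LM containing only `word`.

    Hypothetical sketch; lmtool's real output for the same input may differ.
    """
    lp = math.log10(0.5)   # HELP and </s> are equally likely
    lines = [
        "\\data\\",
        "ngram 1=3",
        "",
        "\\1-grams:",
        "-99.0000 <s> 0.0000",            # start token: never predicted
        "%.4f </s>" % lp,
        "%.4f %s 0.0000" % (lp, word.upper()),
        "",
        "\\end\\",
    ]
    return "\n".join(lines) + "\n"
```

Writing `single_word_arpa("Help")` to a file such as `help.lm` (a placeholder name) would give a text model that could then be passed to lm3g2dmp in the same way as 7092.lm.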
Training the Model
• The SphinxTrain configuration file (etc/sphinx_train.cfg) was edited to use the proper input files.
• The maximum number of Gaussians per state was set to 8.
• The number of HMM states was increased from 3 to 5, without significant improvement.
• Sphinx commands:
```
cd c:/CMUtutorial/WUW_Corpus/
perl scripts_pl/make_feats.pl -ctl etc/wuw_corpus_train.fileids -cfg etc/sphinx_train.cfg -param etc/feat.params
perl scripts_pl/RunAll.pl
```
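With the number of Gaussians set to 8, each HMM state's output density is a mixture of 8 Gaussians. The per-frame acoustic score for a state is the log of that mixture density. A minimal Python sketch, assuming diagonal covariances; the function name and parameters are hypothetical, and Sphinx computes the equivalent quantity internally in its own log units.

```python
import math

def gmm_frame_loglik(x, weights, means, variances):
    """Log-likelihood of one feature frame x under a diagonal-covariance GMM.

    weights: mixture weights summing to 1 (e.g. 8 of them, one per Gaussian)
    means, variances: one vector per component, each the same length as x
    """
    comp = []
    for w, mu, var in zip(weights, means, variances):
        # log N(x; mu, diag(var)), summed over feature dimensions
        ll = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                 for xi, m, v in zip(x, mu, var))
        comp.append(math.log(w) + ll)
    # log-sum-exp over components, shifted by the max for numerical stability
    top = max(comp)
    return top + math.log(sum(math.exp(c - top) for c in comp))
```

Working in the log domain with a log-sum-exp keeps the score finite even when each component's likelihood would underflow for a 39-dimensional frame.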
Testing the Model
• The Sphinx decoding configuration file (etc/sphinx_decode.cfg) was edited to use the proper input files.
• The language model weight was set to 1 (the lowest allowable setting).
• The number of Gaussians was set to 8 to match the training configuration.
• Sphinx commands:
```
perl scripts_pl/make_feats.pl -ctl etc/wuw_corpus_test.fileids -cfg etc/sphinx_decode.cfg -param etc/feat.params
perl scripts_pl/decode/slave.pl
```
Sphinx Output
• Sphinx was used only to compute acoustic scores, not to apply a detection threshold.
• The resulting scores were parsed in Matlab, and PDF/CDF plots were generated.
• See the attached output document for the raw Cygwin output.
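An acoustic score of this kind is essentially the log-likelihood of the best state path through the word's HMM. The sketch below shows that Viterbi recursion for a left-to-right topology without skip transitions, in plain natural-log probabilities. The topology is an assumption, since Sphinx topologies can include skips, and so are the function and argument names. Sphinx reports its scores in its own log units, so the numbers are not directly comparable.

```python
import math

NEG_INF = float("-inf")

def viterbi_left_to_right(log_emis, log_stay, log_next):
    """Best-path log score of a left-to-right HMM with no skip transitions.

    log_emis[t][s]: log-likelihood of frame t in state s
    log_stay[s], log_next[s]: log self-loop and log forward-transition probs
    The path must start in state 0 at frame 0 and end in the last state.
    """
    n_states = len(log_emis[0])
    score = [NEG_INF] * n_states
    score[0] = log_emis[0][0]
    for frame in log_emis[1:]:
        new = [NEG_INF] * n_states
        for s in range(n_states):
            stay = score[s] + log_stay[s]
            move = score[s - 1] + log_next[s - 1] if s > 0 else NEG_INF
            new[s] = max(stay, move) + frame[s]
        score = new
    return score[-1]
```

In a detector, `log_emis[t][s]` would come from each state's Gaussian mixture evaluated on frame t. An utterance too short to reach the final state scores negative infinity.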
plotDistributions.m
```matlab
% plotDistributions: PDF/CDF of Sphinx acoustic scores, "Help" vs. other utterances
clear all; clc; close all;
fn = 'C:\CMUtutorial\WUW_Corpus\logdir\decode\wuw_corpus-1-1.log';
RawText = textread(fn, '%s');
% Keep the 8 tokens starting at each "fv:" entry whose hypothesis is HELP
idx = [];
for k = 1:(length(RawText)-7)       % k+7 must stay inside RawText
    if ~isempty(strfind(char(RawText(k)), 'fv:')) && ...
            strcmp(char(RawText(k+1)), 'HELP')
        idx = [idx; k:k+7];
    end
end
RawText = reshape(RawText(idx), size(idx));  % one row of 8 tokens per hit

% Fetch acoustic scores (5th token) for true and false "Help" detections
HelpAScr = [];
FalsAScr = [];
for k = 1:size(RawText,1)
    if ~isempty(strfind(char(RawText(k,1)), '_008>'))  % true HELP
        HelpAScr = [HelpAScr str2double(char(RawText(k,5)))];
    else                                                % not a HELP
        FalsAScr = [FalsAScr str2double(char(RawText(k,5)))];
    end
end

% Normalized histograms over 101 shared bin centers
mn = min(min(HelpAScr), min(FalsAScr));
mx = max(max(HelpAScr), max(FalsAScr));
vals = mn:((mx-mn)/100):mx;
HelpAScrHist = hist(HelpAScr, vals);
HelpAScrHist = HelpAScrHist ./ sum(HelpAScrHist);
FalsAScrHist = hist(FalsAScr, vals);
FalsAScrHist = FalsAScrHist ./ sum(FalsAScrHist);
% CDF of true scores (from below), complementary CDF of false scores (from above)
HelpAScrCDF = cumsum(HelpAScrHist);
FalsAScrCDF = fliplr(cumsum(fliplr(FalsAScrHist)));

figure;
subplot(2,1,1);
plot(vals, HelpAScrHist, 'b', vals, FalsAScrHist, 'r');
title('Probability Density Function')
legend('Help', 'Other Utterances')
axis([mn, mx, 0, 1.1*max(max(HelpAScrHist), max(FalsAScrHist))]);
subplot(2,1,2);
plot(vals, HelpAScrCDF, 'b', vals, FalsAScrCDF, 'r');
title('Cumulative Distribution Function')
axis([mn, mx, 0, 1.1]);
```
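plotDistributions.m output
[Figure: “Probability Density Function” and “Cumulative Distribution Function” of the acoustic scores, Help vs. Other Utterances]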
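The two lower curves in plotDistributions.m correspond to the two error rates of a score threshold. The blue CDF is the fraction of true “Help” scores at or below a threshold, which are false rejections if low-scoring utterances are rejected. The red curve is the fraction of other utterances at or above it, which are false acceptances. Their crossing is the equal error rate (EER). A minimal Python sketch, assuming higher scores mean “more Help-like” (the orientation the plotted curves imply); it scans the exact observed scores rather than the 101 histogram bins, and the function names are hypothetical.

```python
def error_rates(true_scores, false_scores, threshold):
    """Accept an utterance as "Help" when its score >= threshold.

    Returns (false rejection rate, false acceptance rate).
    """
    frr = sum(s < threshold for s in true_scores) / len(true_scores)
    far = sum(s >= threshold for s in false_scores) / len(false_scores)
    return frr, far

def equal_error_point(true_scores, false_scores):
    """Scan every observed score as a candidate threshold and return the one
    where FRR and FAR are closest, together with the mean of the two rates."""
    best = None
    for t in sorted(set(true_scores) | set(false_scores)):
        frr, far = error_rates(true_scores, false_scores, t)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, t, (frr + far) / 2)
    _, threshold, eer = best
    return threshold, eer
```

Because Sphinx was used only for scoring, a threshold chosen this way on held-out data would supply the detection decision the decoder itself did not make.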
Conclusions
• Sphinx had problems correctly detecting the word “Help” in this test, although a decent model was clearly created.
• The test set was constrained and small; it would benefit from a much larger sample of “Help” utterances.
• Sphinx features that would have been helpful:
  • Native .ulaw file input
  • A simpler mechanism for specifying the input sample rate
  • Native text-file input for the language model, by integrating the .lm generator and .lm.DMP converter into Sphinx
  • Better handling of utterance fragments