Acoustic modeling on telephony speech corpora for directory assistance systems applications
Børge Lindberg, Center for PersonKommunikation (CPK), Aalborg University, Denmark [lindberg@cpk.auc.dk]
Outline
Part 1 - Acoustic modeling
• Reference recogniser (COST 249)
Part 2 - Directory assistance
• NaNu - Names & Numbers (Tele Danmark)
• Acoustic model optimisation
• Project and system details
COST 249
The COST 249 SpeechDat Multilingual Reference Recogniser
http://www.telenor.no/fou/prosjekter/taletek/refrec
F.T. Johansen, N. Warakagoda (Telenor, Kjeller, Norway), B. Lindberg (CPK, Aalborg, Denmark), G. Lehtinen (ETH, Zürich, Switzerland), Z. Kacic, B. Imperl, A. Zgank (UMB, Maribor, Slovenia), B. Milner, D. Chaplin (British Telecom, Ipswich, UK), K. Elenius, G. Salvi (KTH, Stockholm, Sweden), E. Sanders, F. de Wet (KUN, Nijmegen, The Netherlands)
What is the reference recogniser?
• Phoneme-based recogniser design procedure
• Language-independent
• Fully automatic: one script works straight from the CDs
• Standardised database format: SpeechDat(II), available in many languages worldwide
• Oriented towards telephone applications
• Commonly available recogniser toolkit: HTK
Motivation
• A fast start for recognition research in new languages
• Share experience, avoid repeating the same mistakes
• Improve the state of the art
• Share research efforts
• Provide a benchmark for recogniser performance comparison across tasks and languages
• Facilitate true multilingual recognition research
Related Work
• COST 232: assumed a TIMIT-like segmented database
• Reference verification systems: CAVE, PICASSO, COST 250
• GlobalPhone (Schultz & Waibel, ICSLP 98): dictation-type multilingual databases; language-independent and language-adaptive recognition
SpeechDat(II) databases
• 20 FDBs (fixed network), 5 MDBs (mobile networks)
• 500-5000 speakers, 4-8 minute recording sessions
• Designed for telephone information and transaction services
• Compatible databases:
  - SpeechDat(E): 5 Central and Eastern European languages
  - SALA: 8 dialect zones in Latin America
  - SpeechDat-Car: 9 languages, parallel GSM and in-car recordings
  - SpeechDat Australian English
Core Utterance Types in SpeechDat(II)

number  type                              corpus code
1       isolated digit items              I
5       digit/number strings              B, C
1+      natural numbers                   N
1       money amounts                     M
2       yes/no questions                  Q
3+      dates                             D
2       times                             T
3       application keywords/keyphrases   A
1       word spotting phrase              E
5       directory assistance names        O
3       spellings                         L
4+      phonetically rich words           W
9       phonetically rich sentences       S
40+     in total
Recogniser design - version 0.95
• Standard HTK tutorial features (39-dimensional MFCC_0_D_A), no normalisation
• Word-internal triphone HMMs, 3 states per model
• Decision-tree state clustering
• Trained from flat start using only orthographic transcriptions and a SpeechDat lexicon
• "Difficult" utterances removed from the training set
• 1, 2, 4, 8, 16 and 32 diagonal-covariance Gaussian mixtures
• Re-training on re-segmented material
MFCC_0_D_A - feature set

Pre-emphasis            0.97
Frame shift             10 ms
Analysis window         Hamming
Window length           25 ms
Spectrum type           FFT magnitude
Filterbank type         Mel-scale
Filter shape            triangular
Filterbank channels     26
Cepstral coefficients   12
Cepstral liftering      22
Energy feature          C0
Deltas                  13
Delta-deltas            13
Total features          39
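The 39-dimensional vector stacks 13 static features (12 MFCCs plus C0) with their deltas and delta-deltas. A minimal sketch of the delta-regression step, following the standard HTK formula (the window half-width `theta` and the random "feature" data are illustrative assumptions):

```python
import numpy as np

def deltas(feats, theta=2):
    """HTK-style regression deltas over a (T, D) feature matrix:
    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    with edge frames padded by repetition."""
    T, D = feats.shape
    denom = 2 * sum(k * k for k in range(1, theta + 1))
    padded = np.concatenate([feats[:1].repeat(theta, axis=0),
                             feats,
                             feats[-1:].repeat(theta, axis=0)])
    out = np.zeros_like(feats)
    for k in range(1, theta + 1):
        out += k * (padded[theta + k: theta + k + T] -
                    padded[theta - k: theta - k + T])
    return out / denom

# Stack statics, deltas and delta-deltas into the 39-dim vector
static = np.random.randn(100, 13)          # placeholder: 12 MFCCs + C0 per frame
d = deltas(static)
dd = deltas(d)
mfcc_0_d_a = np.hstack([static, d, dd])    # shape (100, 39)
```

On a linearly increasing feature the regression returns exactly the slope for interior frames, which is the sanity check usually applied to such an implementation.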
Test design
Common test suite on SpeechDat:
• I-test: isolated digit recognition (SVIP)
• Q-test: yes/no recognition (SVIP)
• A-test: recognition of 30 isolated application words (SVIP)
• BC-test: unknown-length connected digit string recognition (SVWL)
• O-test: city name recognition (MVIP)
• W-test: recognition of phonetically rich words (MVIP)
Test procedures used:
• SVIP: Small Vocabulary Isolated Phrase
• MVIP: Medium Vocabulary Isolated Phrase
• SVWL: Small Vocabulary Word Loop, NIST alignment
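The SVWL digit-string test scores hypotheses against references via NIST-style string alignment. A minimal sketch of that scoring step, computing word error rate as (substitutions + deletions + insertions) / reference length via edit-distance alignment:

```python
def wer(ref, hyp):
    """Word error rate: (S + D + I) / len(ref) via edit-distance alignment."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits turning r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                       # deletions only
    for j in range(len(h) + 1):
        dp[0][j] = j                       # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(r)
```

For example, `wer("one two three", "one three")` counts one deletion against three reference words, i.e. 1/3. The full NIST tooling additionally reports the alignment itself and per-speaker statistics.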
Results
• Six labs have completed the training procedure on the SpeechDat(II) databases
• KUN has converted the Dutch Polyphone to SpeechDat(II) format: training only on phonetically rich sentences, testing only on digit strings
• More details available on the web
Training Statistics
[training statistics table not reproduced]
* External information available (either session list, pronunciation lexicon or a phoneme mapping - see web site)
** Results are for Refrec v. 0.93
Word error rates
[word error rate table not reproduced; includes the average number of phonemes in the test vocabularies]
* Results are for Refrec v. 0.93
Language-independent considerations
Performance probably below state-of-the-art systems:
• No whole-word modelling, no cross-word context (especially needed for connected digits)
• A lot of noisy training data has been removed
• No speaker-noise or filled-pause model
• Feature analysis not robust enough
Language differences
• Mobile database has 3-5 times the error rate of the FDBs: more robust modelling needed
• Slovenian: high noise level on recordings
Conclusion - part 1
• Practical/logistic problems mostly solved
• Future work:
  - Improve language and database coverage
  - More speakers: Swedish 5000
  - More challenging tests, larger vocabularies
  - More analyses
  - Improved training procedure and clustering
Directory assistance NaNu
Børge Lindberg, Bo Nygaard Bai, Tom Brøndsted, Jesper Ø. Olsen
• Recognition of 'Names & Numbers'
• In collaboration with Tele Danmark
• Auto attendant/directory assistance applications
• Large vocabulary - for the first time in Danish
• Exploiting the SpeechDat(II) database
Acoustic modeling - Decision trees (Ref: HTK Book)
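Decision-tree state clustering, as described in the HTK Book, ties triphone states by repeatedly asking yes/no phonetic questions about the context and choosing the split with the largest log-likelihood gain. A toy sketch of one such split decision, using hypothetical 1-dimensional state statistics and an invented two-question set (real systems use full multi-dimensional state occupancy statistics and hundreds of questions):

```python
import math

# Toy triphone states: (name, occupancy count, mean, variance) of a
# 1-dim Gaussian; names encode left-phone context before the "-".
states = [("a-t+a", 50, 1.0, 0.2), ("o-t+o", 40, 1.1, 0.2),
          ("i-t+i", 30, 3.0, 0.3), ("e-t+e", 35, 3.1, 0.3)]

# Hypothetical phonetic questions: name -> left contexts answering "yes"
questions = {"L_Back_Vowel": {"a", "o"}, "L_Front_Vowel": {"i", "e"}}

def loglik(group):
    """Log-likelihood of pooling a group of states into one Gaussian."""
    n = sum(c for _, c, _, _ in group)
    mean = sum(c * m for _, c, m, _ in group) / n
    var = sum(c * (v + (m - mean) ** 2) for _, c, m, v in group) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(group):
    """Greedy step: the question whose yes/no split gains most likelihood."""
    base = loglik(group)
    best = None
    for q, members in questions.items():
        yes = [s for s in group if s[0].split("-")[0] in members]
        no = [s for s in group if s[0].split("-")[0] not in members]
        if yes and no:
            gain = loglik(yes) + loglik(no) - base
            if best is None or gain > best[1]:
                best = (q, gain, yes, no)
    return best

q, gain, yes, no = best_split(states)
```

Splitting recurses on each branch until the gain falls below a threshold; unseen triphones can then be synthesised by walking the tree, which is what makes the clustered models usable on a new vocabulary.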
NaNu
Acoustic models
• SpeechDat - COST 249
• 20k+ tied-mixture triphones, 6554 clusters
• 16-mixture models - 100k+ mixture components
Database
• ¼ million subscribers (Århus and Næstved areas)
Vocabulary extracted from the database, for entries where:
• there is a minimum of two occurrences
• a transcription exists (Onomastica)
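The two vocabulary-selection criteria above reduce to a simple filter over the subscriber entries. A minimal sketch, with an invented miniature name list and lexicon standing in for the real subscriber database and the Onomastica transcriptions:

```python
from collections import Counter

def build_vocabulary(entries, lexicon):
    """Keep names occurring at least twice that have a pronunciation.

    entries: iterable of name strings from the subscriber database
    lexicon: dict name -> phonetic transcription (e.g. from Onomastica)
    """
    counts = Counter(entries)
    return sorted(name for name, n in counts.items()
                  if n >= 2 and name in lexicon)

# Illustrative data: "Qvortrup" is dropped (single occurrence, no transcription)
entries = ["Jensen", "Jensen", "Nielsen", "Nielsen", "Qvortrup"]
lexicon = {"Jensen": "j E n s @ n", "Nielsen": "n i l s @ n"}
vocab = build_vocabulary(entries, lexicon)
```

The minimum-occurrence threshold trades coverage of rare names against vocabulary size, which is exactly the trade-off the coverage curve on the next slide illustrates.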
Vocabulary and Coverage
[chart not reproduced: NaNu vocabulary coverage vs. number of unique database entries, Denmark (source: Tele Danmark)]
SLANG Recogniser - Spoken LANGuage
• Speech recognition research platform
• For dialogue system execution
• Modular design and implementation (C++)
• Frame-synchronous operation
• Dynamic tree-structured decoder
• Optimised towards large vocabulary recognition (Gaussian mixture selection)
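Gaussian mixture selection speeds up large-vocabulary decoding by fully evaluating only a shortlist of mixture components per frame and flooring the rest. A toy sketch of the idea (the shortlist criterion here is a plain squared distance to the component means and the floor value is an assumption; production systems typically pre-cluster the Gaussians instead):

```python
import numpy as np

def gmm_loglik_selected(x, means, vars_, weights, top_k=2, floor=-1e3):
    """Gaussian selection sketch: fully evaluate only the top_k components
    closest to the observation; floor the rest instead of computing them."""
    d2 = ((means - x) ** 2).sum(axis=1)        # cheap squared distance
    short = np.argsort(d2)[:top_k]             # shortlist of components
    ll = np.full(len(means), floor)
    for i in short:                            # exact diagonal-Gaussian score
        ll[i] = (np.log(weights[i])
                 - 0.5 * np.sum(np.log(2 * np.pi * vars_[i]))
                 - 0.5 * np.sum((x - means[i]) ** 2 / vars_[i]))
    return np.logaddexp.reduce(ll)             # log-sum over components
```

With 16-component mixtures and 100k+ components in total, evaluating a handful of components per state per frame is where most of the likelihood-computation saving comes from; the approximation is tight because distant components contribute almost nothing to the log-sum.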
NLP
• N-best lists are parsed into semantic frames, and SQL queries are generated according to the following strategy:
  1. simple 1-best match
  2. full search in all N-best lists
  3. underspecified (street name and last name required to be contained in the N-best list)
• Output is "converted" to synthetic speech.
Dialogue System
• Java implementation of dialogue system and telephony server
• Uses the SLANG speech recognition library (C++)
• Connects to a public-domain SQL database (MySQL)
• System-directed dialogue
• One word per turn - high perplexity
• Dynamic, parallel allocation of recognisers
Performance
Lack of test data - SpeechDat data were used (!)
Person names task:
• First name, optional middle name, last name
• 434 test utterances (speaker independent)
Results from predecessor configuration (10646 last names, 2777 first/middle names):
• 1-best recognition accuracy: 39.1 %
Conclusion - Part 2
• A real system probably needs application-specific data - not to mention the dialogue aspect!
• The effect of further acoustic model optimisation (on SpeechDat) may be marginal when N-best lists are used
• Limited number of pronunciation variants available
• Immediate steps:
  - test data!
  - acoustic validation of retrieved candidates
• Mixed-initiative dialogue - CPK's incentive to work on NaNu!