
Acoustic modeling on telephony speech corpora for directory assistance systems applications

Acoustic modeling on telephony speech corpora for directory assistance systems applications Børge Lindberg, Center for PersonKommunikation (CPK), Aalborg University Denmark [lindberg@cpk.auc.dk]. Outline. Part 1 - Acoustic modeling Reference recogniser (COST 249)


Presentation Transcript


  1. Acoustic modeling on telephony speech corpora for directory assistance systems applications Børge Lindberg, Center for PersonKommunikation (CPK), Aalborg University Denmark [lindberg@cpk.auc.dk]

  2. Outline
     Part 1 - Acoustic modeling
     • Reference recogniser (COST 249)
     Part 2 - Directory assistance
     • NaNu - Names & Numbers (Tele Danmark)
     • Acoustic model optimisation
     • Project and system details

  3. COST 249
     The COST 249 SpeechDat Multilingual Reference Recogniser
     http://www.telenor.no/fou/prosjekter/taletek/refrec
     • F.T. Johansen, N. Warakagoda (Telenor, Kjeller, Norway)
     • B. Lindberg (CPK, Aalborg, Denmark)
     • G. Lehtinen (ETH, Zürich, Switzerland)
     • Z. Kacic, B. Imperl, A. Zgank (UMB, Maribor, Slovenia)
     • B. Milner, D. Chaplin (British Telecom, Ipswich, UK)
     • K. Elenius, G. Salvi (KTH, Stockholm, Sweden)
     • E. Sanders, F. de Wet (KUN, Nijmegen, The Netherlands)

  4. What is the reference recogniser?
     • Phoneme based recogniser design procedure
     • Language-independent
     • Fully automatic, one script works straight from CDs
     • Standardised database format: SpeechDat(II)
     • Available in many languages worldwide
     • Oriented towards telephone applications
     • Commonly available recogniser toolkit: HTK

  5. Motivation
     • A fast start for recognition research in new languages
     • Share experience, avoid repeating the same mistakes
     • Improve state-of-the-art
     • Share research efforts
     • Provide a benchmark for recogniser performance comparison across tasks and languages
     • Facilitate true multilingual recognition research

  6. Related Work
     • COST 232: assumed TIMIT-like segmented database
     • Reference verification systems: CAVE, PICASSO, COST 250
     • GlobalPhone (Schultz & Waibel, ICSLP 98): dictation type multilingual databases; language-independent and language-adaptive recognition

  7. SpeechDat(II) databases
     • 20 FDBs (fixed network), 5 MDBs (mobile networks)
     • 500-5000 speakers, recording sessions of 4-8 minutes
     • Telephone information and transaction services
     Compatible databases:
     • SpeechDat(E): 5 Central and Eastern European languages
     • SALA: 8 dialect zones in Latin America
     • SpeechDat-Car: 9 languages, parallel GSM and in-car
     • SpeechDat Australian English

  8. Core Utterance Types in SpeechDat(II)

     Number  Type                              Corpus code
     1       isolated digit items              I
     5       digit/number strings              B, C
     1+      natural numbers                   N
     1       money amounts                     M
     2       yes/no questions                  Q
     3+      dates                             D
     2       times                             T
     3       application keywords/keyphrases   A
     1       word spotting phrase              E
     5       directory assistance names        O
     3       spellings                         L
     4+      phonetically rich words           W
     9       phonetically rich sentences       S
     40+     in total

  9. Recogniser design - version 0.95
     • Standard HTK tutorial features (39-dimensional MFCC_0_D_A), no normalisation
     • Word internal triphone HMMs, 3 states per model
     • Decision-tree state clustering
     • Trained from flat-start using only orthographic transcriptions and a SpeechDat lexicon
     • Remove “difficult” utterances from the training set
     • 1, 2, 4, 8, 16 and 32 diagonal covariance Gaussian mixtures
     • Re-training on re-segmented material

  10. MFCC_0_D_A - feature set

     Pre-emphasis            0.97
     Frame shift             10 ms
     Analysis window         Hamming
     Window length           25 ms
     Spectrum type           FFT-magnitude
     Filterbank type         Mel-scale
     Filter shape            Triangular
     Filterbank channels     26
     Cepstral coefficients   12
     Cepstral liftering      22
     Energy feature          C0
     Deltas                  13
     Delta-deltas            13
     Total features          39
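
The 39 features are 13 static coefficients (12 cepstra plus C0) with their deltas and delta-deltas appended. The following is a minimal sketch of that stacking, using the regression formula for deltas described in the HTK Book. The window of ±2 frames and the replication of edge frames are assumptions; the slide does not state them. The function names are illustrative and not part of the reference recogniser scripts.

```python
def deltas(frames, window=2):
    """Regression deltas over a list of equal-length feature vectors.

    d_t = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2), k = 1..window.
    The window size (2) is an assumed default, not taken from the slides.
    """
    norm = 2 * sum(k * k for k in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        vec = []
        for i in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, last)][i]  # replicate the last frame at the end
                prv = frames[max(t - k, 0)][i]     # replicate the first frame at the start
                acc += k * (nxt - prv)
            vec.append(acc / norm)
        out.append(vec)
    return out


def stack_mfcc_0_d_a(static):
    """Append deltas and delta-deltas to each static vector."""
    d = deltas(static)
    a = deltas(d)
    return [s + dd + aa for s, dd, aa in zip(static, d, a)]


# Toy input: 5 frames of 13 static coefficients forming a linear ramp.
static = [[float(t)] * 13 for t in range(5)]
features = stack_mfcc_0_d_a(static)
print(len(features[0]))  # 39
```

On a linear ramp the delta of an interior frame equals the slope, which is a quick sanity check for the formula.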

  11. Test design
     Common test suite on SpeechDat
     • I-test: Isolated digit recognition (SVIP)
     • Q-test: Yes/no recognition (SVIP)
     • A-test: Recognition of 30 isolated application words (SVIP)
     • BC-test: Unknown length connected digit string recognition (SVWL)
     • O-test: City name recognition (MVIP)
     • W-test: Recognition of phonetically rich words (MVIP)
     Two test procedures used
     • SVIP: Small Vocabulary Isolated Phrase
     • MVIP: Medium Vocabulary Isolated Phrase
     • SVWL: Small Vocabulary Word Loop, NIST alignment
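
For the word loop test, the error rate is computed from an alignment between reference and recognised word strings. The sketch below counts substitutions, deletions and insertions with a plain dynamic-programming alignment. It assumes unit cost for every error type; the NIST scoring tools may weight errors differently, so the counts can differ in tie cases. The function names are illustrative.

```python
def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (total, subs, dels, ins) for ref[:i] against hyp[:j]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # all reference words deleted
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # all hypothesis words inserted
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)          # match, no cost
            else:
                diag = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = cost[i - 1][j]
            dele = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            ins = (t + 1, s, d, n + 1)
            cost[i][j] = min(diag, dele, ins, key=lambda c: c[0])
    return cost[-1][-1][1:]


def word_error_rate(pairs):
    """Percentage WER over (reference, hypothesis) string pairs."""
    errors = words = 0
    for ref, hyp in pairs:
        ref_words = ref.split()
        s, d, i = edit_counts(ref_words, hyp.split())
        errors += s + d + i
        words += len(ref_words)
    return 100.0 * errors / words


pairs = [
    ("one two three four", "one three three"),  # two errors
    ("five six", "five six seven"),             # one insertion
]
print(word_error_rate(pairs))  # 50.0
```

For the isolated phrase tests (SVIP, MVIP) each utterance is a single item, so the error rate reduces to the fraction of misrecognised utterances.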

  12. Results
     • Six labs have completed the training procedure on the SpeechDat(II) databases
     • KUN has converted the Dutch Polyphone to SpeechDat(II) format:
       - trains only on phonetically rich sentences
       - tests only on digit strings
     • More details available on the web

  13. Training Statistics
     * External information available (either session list, pronunciation lexicon or a phoneme mapping - see web-site)
     ** Results are for Refrec. v. 0.93

  14. A typical training curve

  15. Word error rates
     * Results are for Refrec. v. 0.93
     Average number of phonemes in test vocabularies

  16. Word error rates - cont.

  17. Word error rates - cont.

  18. Word error rates - cont.

  19. Language independent considerations
     Performance probably below state-of-the-art systems:
     • No whole-word modelling, no cross-word context (especially needed for connected digits)
     • A lot of training data with noise has been removed
     • No speaker noise or filled pause model
     • Feature analyser not robust enough

  20. Language differences
     • Mobile database has 3-5 times the error rate of FDBs: more robust modeling needed
     • Slovenian: high noise level on recordings

  21. Conclusion - part 1
     Practical/logistic problems mostly solved
     Future work:
     • Improve language and database coverage
     • More speakers: Swedish 5000
     • More challenging tests, large vocabularies
     • More analyses
     • Improved training procedure, clustering

  22. Directory assistance NaNu
     Børge Lindberg, Bo Nygaard Bai, Tom Brøndsted, Jesper Ø. Olsen
     • Recognition of ‘Names & Numbers’
     • In collaboration with Tele Danmark
     • Auto attendant/directory assistance applications
     • Large vocabulary - for the first time in Danish
     • Exploiting the SpeechDat(II) database

  23. Acoustic modeling - Decision trees (Ref: HTK Book)

  24. Acoustic modeling of Danish diphthongs

  25. Acoustic modeling - CMN

  26. Acoustic modeling - decision trees

  27. NaNu
     Acoustic models
     • SpeechDat - COST 249
     • 20k+ tied-mixture triphones, 6554 clusters
     • 16 mixture models - 100k+ mixture components
     Database
     • ¼ million subscribers (Århus and Næstved areas)
     Vocabulary extracted from database, for which:
     • there is a minimum of two occurrences
     • transcription exists (Onomastica)
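
The component count on this slide can be checked with simple arithmetic, under the assumption that each of the 6554 clusters carries its own 16-component mixture. The parameter estimate further assumes 39-dimensional features with diagonal covariances, as in the reference recogniser; the slides do not state the NaNu feature dimension.

```python
clusters = 6554
mixtures_per_cluster = 16

# One Gaussian per mixture component per cluster.
components = clusters * mixtures_per_cluster
print(components)  # 104864, consistent with "100k+ mixture components"

# Assumed: 39-dimensional features, diagonal covariance.
# Each component then has 39 means, 39 variances and 1 mixture weight.
feature_dim = 39
params_per_component = 2 * feature_dim + 1
total_params = components * params_per_component
print(total_params)
```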

  28. Vocabulary and Coverage
     [Figure: NaNu vocabulary versus number of unique database entries, Denmark (source: Tele Danmark)]

  29. SLANG Recogniser - Spoken LANGuage
     • Speech Recognition Research Platform
     • For Dialogue Systems execution
     • Modular design and implementation (C++)
     • Frame synchronous operation
     • Dynamic Tree Structured Decoder
     • Optimised towards large vocabulary recognition (Gaussian mixture selection)

  30. NLP
     • N-best lists are parsed into semantic frames and SQL queries are generated according to the following strategy:
       1. simple 1-best match
       2. full search in all N-best lists
       3. underspecified (street name and last name required to be contained in the N-best list)
     • Output is “converted” to synthetic speech.
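
The three-step query strategy can be sketched as a fallback chain. This is an illustration only: the table layout, slot names and sample subscribers are hypothetical, an in-memory SQLite database stands in for the mySQL database used by the system, and step 3 is read here as dropping the first name while still requiring a last name and street name from the N-best lists.

```python
import sqlite3
from itertools import product

SLOTS = ("first", "last", "street")  # hypothetical slot/column names


def build_db():
    """Create a tiny in-memory subscriber table with made-up entries."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE subscribers (first TEXT, last TEXT, street TEXT, phone TEXT)")
    db.executemany(
        "INSERT INTO subscribers VALUES (?, ?, ?, ?)",
        [
            ("Anna", "Jensen", "Nygade", "00000001"),
            ("Bo", "Hansen", "Algade", "00000002"),
            ("Carl", "Jensen", "Algade", "00000003"),
        ],
    )
    return db


def query(db, **fields):
    """Generate and run a SELECT matching all given slot values."""
    # Column names come from SLOTS only; values are bound as parameters.
    where = " AND ".join(f"{name} = ?" for name in fields)
    sql = f"SELECT first, last, street, phone FROM subscribers WHERE {where}"
    return db.execute(sql, tuple(fields.values())).fetchall()


def lookup(db, nbest):
    """nbest maps each slot to its hypotheses, best first."""
    # 1. Simple 1-best match.
    rows = query(db, **{s: nbest[s][0] for s in SLOTS})
    if rows:
        return "1-best", rows
    # 2. Full search over all combinations of N-best entries.
    rows = []
    for combo in product(*(nbest[s] for s in SLOTS)):
        rows += query(db, **dict(zip(SLOTS, combo)))
    if rows:
        return "n-best", rows
    # 3. Underspecified: ignore the first name, require last name and street.
    rows = []
    for last, street in product(nbest["last"], nbest["street"]):
        rows += query(db, last=last, street=street)
    return "underspecified", rows


db = build_db()
print(lookup(db, {"first": ["Hanna", "Anna"], "last": ["Jessen", "Jensen"], "street": ["Nygade"]}))
```

In the example call the 1-best combination matches nothing, so the search falls through to step 2 and finds the entry via lower-ranked hypotheses.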

  31. Dialogue System
     • Java implementation of dialogue system and telephony server
     • uses SLANG speech recognition library in C++
     • connects to public domain SQL database (mySQL)
     • system directed dialogue
     • one word per turn - high perplexity
     • dynamic, parallel allocation of recognisers

  32. Performance
     Lack of test data - SpeechDat data were used (!)
     Person names task:
     • First name, optional middle name, last name
     • 434 test utterances (speaker independent)
     Results from predecessor configuration (10646 last names, 2777 first/middle names):
     • Recognition accuracy 1-best: 39.1 %

  33. Conclusion - Part 2
     • Real system probably needs application specific data, not to mention the dialogue aspect!
     • Effect of further acoustic model optimisation (on SpeechDat) may be marginal when N-best lists are used
     • Limited number of pronunciation variants available
     • Immediate steps are:
       - test data!
       - acoustic validation of retrieved candidates
     • Mixed initiative dialogue - CPK’s incentive to work on NaNu!
