Acoustic modeling on telephony speech corpora for directory assistance systems applications
Børge Lindberg, Center for PersonKommunikation (CPK), Aalborg University, Denmark [lindberg@cpk.auc.dk]
Outline
Part 1 - Acoustic modeling
• Reference recogniser (COST 249)
Part 2 - Directory assistance
• NaNu - Names & Numbers (Tele Danmark)
• Acoustic model optimisation
• Project and system details
COST 249
The COST 249 SpeechDat Multilingual Reference Recogniser
http://www.telenor.no/fou/prosjekter/taletek/refrec
F.T. Johansen, N. Warakagoda (Telenor, Kjeller, Norway), B. Lindberg (CPK, Aalborg, Denmark), G. Lehtinen (ETH, Zürich, Switzerland), Z. Kacic, B. Imperl, A. Zgank (UMB, Maribor, Slovenia), B. Milner, D. Chaplin (British Telecom, Ipswich, UK), K. Elenius, G. Salvi (KTH, Stockholm, Sweden), E. Sanders, F. de Wet (KUN, Nijmegen, The Netherlands)
What is the reference recogniser?
• Phoneme-based recogniser design procedure
• Language-independent
• Fully automatic: one script works straight from the CDs
• Standardised database format: SpeechDat(II), available in many languages worldwide
• Oriented towards telephone applications
• Commonly available recogniser toolkit: HTK
Motivation
• A fast start for recognition research in new languages
• Share experience, avoid repeating the same mistakes
• Improve the state of the art
• Share research efforts
• Provide a benchmark for recogniser performance comparison across tasks and languages
• Facilitate true multilingual recognition research
Related Work
• COST 232: assumed a TIMIT-like segmented database
• Reference verification systems: CAVE, PICASSO, COST 250
• GlobalPhone (Schultz & Waibel, ICSLP 98): dictation-type multilingual databases; language-independent and language-adaptive recognition
SpeechDat(II) databases
• 20 FDBs (fixed network), 5 MDBs (mobile networks)
• 500-5000 speakers, 4-8 minute recording sessions
• Designed for telephone information and transaction services
• Compatible databases:
  - SpeechDat(E): 5 Central and Eastern European languages
  - SALA: 8 dialect zones in Latin America
  - SpeechDat-Car: 9 languages, parallel GSM and in-car recordings
  - SpeechDat Australian English
Core Utterance Types in SpeechDat(II)

number  type                              corpus code
1       isolated digit items              I
5       digit/number strings              B, C
1+      natural numbers                   N
1       money amounts                     M
2       yes/no questions                  Q
3+      dates                             D
2       times                             T
3       application keywords/keyphrases   A
1       word spotting phrase              E
5       directory assistance names        O
3       spellings                         L
4+      phonetically rich words           W
9       phonetically rich sentences       S
40+     in total
Recogniser design - version 0.95
• Standard HTK tutorial features (39-dimensional MFCC_0_D_A), no normalisation
• Word-internal triphone HMMs, 3 states per model
• Decision-tree state clustering
• Trained from flat start using only orthographic transcriptions and a SpeechDat lexicon
• "Difficult" utterances removed from the training set
• 1, 2, 4, 8, 16 and 32 diagonal-covariance Gaussian mixtures
• Re-training on re-segmented material
MFCC_0_D_A - feature set

Pre-emphasis            0.97
Frame shift             10 ms
Analysis window         Hamming
Window length           25 ms
Spectrum type           FFT magnitude
Filterbank type         Mel-scale
Filter shape            triangular
Filterbank channels     26
Cepstral coefficients   12
Cepstral liftering      22
Energy feature          C0
Deltas                  13
Delta-deltas            13
Total features          39
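The 39-dimensional vector stacks 13 static features (12 MFCCs plus C0) with their deltas and delta-deltas. A minimal sketch of the delta-regression step, following the standard HTK formula (the window half-width `theta` and the random "feature" data are illustrative assumptions):

```python
import numpy as np

def deltas(feats, theta=2):
    """HTK-style regression deltas over a (T, D) feature matrix:
    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    with edge frames padded by repetition."""
    T, D = feats.shape
    denom = 2 * sum(k * k for k in range(1, theta + 1))
    padded = np.concatenate([feats[:1].repeat(theta, axis=0),
                             feats,
                             feats[-1:].repeat(theta, axis=0)])
    out = np.zeros_like(feats)
    for k in range(1, theta + 1):
        out += k * (padded[theta + k: theta + k + T] -
                    padded[theta - k: theta - k + T])
    return out / denom

# Stack statics, deltas and delta-deltas into the 39-dim vector
static = np.random.randn(100, 13)          # placeholder: 12 MFCCs + C0 per frame
d = deltas(static)
dd = deltas(d)
mfcc_0_d_a = np.hstack([static, d, dd])    # shape (100, 39)
```

On a linearly increasing feature the regression returns exactly the slope for interior frames, which is the sanity check usually applied to such an implementation.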
Test design
Common test suite on SpeechDat:
• I-test: isolated digit recognition (SVIP)
• Q-test: yes/no recognition (SVIP)
• A-test: recognition of 30 isolated application words (SVIP)
• BC-test: unknown-length connected digit string recognition (SVWL)
• O-test: city name recognition (MVIP)
• W-test: recognition of phonetically rich words (MVIP)
Test procedures used:
• SVIP: Small Vocabulary Isolated Phrase
• MVIP: Medium Vocabulary Isolated Phrase
• SVWL: Small Vocabulary Word Loop, NIST alignment
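The SVWL digit-string test scores hypotheses against references via NIST-style string alignment. A minimal sketch of that scoring step, computing word error rate as (substitutions + deletions + insertions) / reference length via edit-distance alignment:

```python
def wer(ref, hyp):
    """Word error rate: (S + D + I) / len(ref) via edit-distance alignment."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits turning r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                       # deletions only
    for j in range(len(h) + 1):
        dp[0][j] = j                       # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(r)
```

For example, `wer("one two three", "one three")` counts one deletion against three reference words, i.e. 1/3. The full NIST tooling additionally reports the alignment itself and per-speaker statistics.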
Results
• Six labs have completed the training procedure on the SpeechDat(II) databases
• KUN has converted the Dutch Polyphone to SpeechDat(II) format: training only on phonetically rich sentences, testing only on digit strings
• More details available on the web
Training Statistics
[training statistics table not reproduced]
* External information available (either session list, pronunciation lexicon or a phoneme mapping - see web site)
** Results are for Refrec v. 0.93
Word error rates
[word error rate table not reproduced; includes the average number of phonemes in the test vocabularies]
* Results are for Refrec v. 0.93
Language-independent considerations
Performance probably below state-of-the-art systems:
• No whole-word modelling, no cross-word context (especially needed for connected digits)
• A lot of noisy training data has been removed
• No speaker-noise or filled-pause model
• Feature analysis not robust enough
Language differences
• Mobile database has 3-5 times the error rate of the FDBs: more robust modelling needed
• Slovenian: high noise level on recordings
Conclusion - part 1
• Practical/logistic problems mostly solved
• Future work:
  - Improve language and database coverage
  - More speakers: Swedish 5000
  - More challenging tests, larger vocabularies
  - More analyses
  - Improved training procedure and clustering
Directory assistance NaNu
Børge Lindberg, Bo Nygaard Bai, Tom Brøndsted, Jesper Ø. Olsen
• Recognition of 'Names & Numbers'
• In collaboration with Tele Danmark
• Auto attendant/directory assistance applications
• Large vocabulary - for the first time in Danish
• Exploiting the SpeechDat(II) database
Acoustic modeling - Decision trees (Ref: HTK Book)
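Decision-tree state clustering, as described in the HTK Book, ties triphone states by repeatedly asking yes/no phonetic questions about the context and choosing the split with the largest log-likelihood gain. A toy sketch of one such split decision, using hypothetical 1-dimensional state statistics and an invented two-question set (real systems use full multi-dimensional state occupancy statistics and hundreds of questions):

```python
import math

# Toy triphone states: (name, occupancy count, mean, variance) of a
# 1-dim Gaussian; names encode left-phone context before the "-".
states = [("a-t+a", 50, 1.0, 0.2), ("o-t+o", 40, 1.1, 0.2),
          ("i-t+i", 30, 3.0, 0.3), ("e-t+e", 35, 3.1, 0.3)]

# Hypothetical phonetic questions: name -> left contexts answering "yes"
questions = {"L_Back_Vowel": {"a", "o"}, "L_Front_Vowel": {"i", "e"}}

def loglik(group):
    """Log-likelihood of pooling a group of states into one Gaussian."""
    n = sum(c for _, c, _, _ in group)
    mean = sum(c * m for _, c, m, _ in group) / n
    var = sum(c * (v + (m - mean) ** 2) for _, c, m, v in group) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(group):
    """Greedy step: the question whose yes/no split gains most likelihood."""
    base = loglik(group)
    best = None
    for q, members in questions.items():
        yes = [s for s in group if s[0].split("-")[0] in members]
        no = [s for s in group if s[0].split("-")[0] not in members]
        if yes and no:
            gain = loglik(yes) + loglik(no) - base
            if best is None or gain > best[1]:
                best = (q, gain, yes, no)
    return best

q, gain, yes, no = best_split(states)
```

Splitting recurses on each branch until the gain falls below a threshold; unseen triphones can then be synthesised by walking the tree, which is what makes the clustered models usable on a new vocabulary.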
NaNu
Acoustic models
• SpeechDat - COST 249
• 20k+ tied-mixture triphones, 6554 clusters
• 16-mixture models - 100k+ mixture components
Database
• ¼ million subscribers (Århus and Næstved areas)
Vocabulary extracted from the database, for entries where:
• there is a minimum of two occurrences
• a transcription exists (Onomastica)
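The two vocabulary-selection criteria above reduce to a simple filter over the subscriber entries. A minimal sketch, with an invented miniature name list and lexicon standing in for the real subscriber database and the Onomastica transcriptions:

```python
from collections import Counter

def build_vocabulary(entries, lexicon):
    """Keep names occurring at least twice that have a pronunciation.

    entries: iterable of name strings from the subscriber database
    lexicon: dict name -> phonetic transcription (e.g. from Onomastica)
    """
    counts = Counter(entries)
    return sorted(name for name, n in counts.items()
                  if n >= 2 and name in lexicon)

# Illustrative data: "Qvortrup" is dropped (single occurrence, no transcription)
entries = ["Jensen", "Jensen", "Nielsen", "Nielsen", "Qvortrup"]
lexicon = {"Jensen": "j E n s @ n", "Nielsen": "n i l s @ n"}
vocab = build_vocabulary(entries, lexicon)
```

The minimum-occurrence threshold trades coverage of rare names against vocabulary size, which is exactly the trade-off the coverage curve on the next slide illustrates.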
Vocabulary and Coverage
[chart not reproduced: NaNu vocabulary coverage vs. number of unique database entries, Denmark (source: Tele Danmark)]
SLANG Recogniser - Spoken LANGuage
• Speech recognition research platform
• For dialogue system execution
• Modular design and implementation (C++)
• Frame-synchronous operation
• Dynamic tree-structured decoder
• Optimised towards large vocabulary recognition (Gaussian mixture selection)
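Gaussian mixture selection speeds up large-vocabulary decoding by fully evaluating only a shortlist of mixture components per frame and flooring the rest. A toy sketch of the idea (the shortlist criterion here is a plain squared distance to the component means and the floor value is an assumption; production systems typically pre-cluster the Gaussians instead):

```python
import numpy as np

def gmm_loglik_selected(x, means, vars_, weights, top_k=2, floor=-1e3):
    """Gaussian selection sketch: fully evaluate only the top_k components
    closest to the observation; floor the rest instead of computing them."""
    d2 = ((means - x) ** 2).sum(axis=1)        # cheap squared distance
    short = np.argsort(d2)[:top_k]             # shortlist of components
    ll = np.full(len(means), floor)
    for i in short:                            # exact diagonal-Gaussian score
        ll[i] = (np.log(weights[i])
                 - 0.5 * np.sum(np.log(2 * np.pi * vars_[i]))
                 - 0.5 * np.sum((x - means[i]) ** 2 / vars_[i]))
    return np.logaddexp.reduce(ll)             # log-sum over components
```

With 16-component mixtures and 100k+ components in total, evaluating a handful of components per state per frame is where most of the likelihood-computation saving comes from; the approximation is tight because distant components contribute almost nothing to the log-sum.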
NLP
• N-best lists are parsed into semantic frames, and SQL queries are generated according to the following strategy:
  1. simple 1-best match
  2. full search in all N-best lists
  3. underspecified (street name and last name required to be contained in the N-best list)
• Output is "converted" to synthetic speech.
Dialogue System
• Java implementation of dialogue system and telephony server
• Uses the SLANG speech recognition library (C++)
• Connects to a public-domain SQL database (MySQL)
• System-directed dialogue
• One word per turn - high perplexity
• Dynamic, parallel allocation of recognisers
Performance
Lack of test data - SpeechDat data were used (!)
Person names task:
• First name, optional middle name, last name
• 434 test utterances (speaker independent)
Results from predecessor configuration (10646 last names, 2777 first/middle names):
• 1-best recognition accuracy: 39.1 %
Conclusion - Part 2
• A real system probably needs application-specific data - not to mention the dialogue aspect!
• The effect of further acoustic model optimisation (on SpeechDat) may be marginal when N-best lists are used
• Limited number of pronunciation variants available
• Immediate steps:
  - test data!
  - acoustic validation of retrieved candidates
• Mixed-initiative dialogue - CPK's incentive to work on NaNu!