Spoken Language Identification Using the Speechdat-M Corpus. Diamantino Caseiro - Isabel Trancoso INESC/IST. Language Identification. Best systems use multiple large vocabulary continuous speech recognisers.
Spoken Language Identification Using the Speechdat-M Corpus
Diamantino Caseiro - Isabel Trancoso
INESC/IST
Language Identification
• The best systems use multiple large vocabulary continuous speech recognisers.
• But they are hard to extend to new languages because they require large amounts of hard-to-get linguistic data (such as transcribed speech).
• Phonotactic approaches
• Published systems still require some linguistic data
Phonotactic Approaches
• PRLM-P (phonetic recognition followed by language modelling, parallel): multiple language-specific phone recognisers
Phonotactic Approaches
• DBD (double bigram decoding): one language-independent phone recogniser
Corpus
• Speechdat-M
  • Multilingual, 6 languages: English, Spanish, German, Portuguese, Italian, French
  • Telephone speech
  • Includes: numbers, digits, hours, dates, money, commands, phonetically rich sentences, etc.
  • Orthographic transcriptions
• Subset used: phonetically rich sentences
  • 6 languages x 1000 speakers x 9 utterances
  • The same sentence is read by more than one speaker
  • Utterances average 5 seconds in duration
Corpus - Train/test selection
• Criteria
  • Speakers: 70% train, 30% test
  • Sentences: 70% train, 30% test
• Random selection
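The selection criteria above can be sketched as follows. This is only an illustration: the slide does not say how the speaker split and the sentence split are combined, so the sketch assumes that an utterance is kept for training only if both its speaker and its sentence are in the training partitions (and likewise for test), and that mixed utterances are discarded. The names `split_70_30` and `select_utterances` and the `(speaker_id, sentence_id)` representation are hypothetical.

```python
import random

def split_70_30(items, seed=0):
    """Randomly split a collection of ids into 70% train / 30% test."""
    items = sorted(items)               # fixed order so the seed is reproducible
    rng = random.Random(seed)
    rng.shuffle(items)
    cut = round(0.7 * len(items))
    return set(items[:cut]), set(items[cut:])

def select_utterances(utterances, seed=0):
    """utterances: list of (speaker_id, sentence_id) pairs.

    Assumption: utterances that mix a train speaker with a test sentence
    (or the reverse) are dropped, so train and test share neither
    speakers nor sentences.
    """
    speakers = {spk for spk, _ in utterances}
    sentences = {sen for _, sen in utterances}
    train_spk, test_spk = split_70_30(speakers, seed)
    train_sen, test_sen = split_70_30(sentences, seed + 1)
    train = [u for u in utterances if u[0] in train_spk and u[1] in train_sen]
    test = [u for u in utterances if u[0] in test_spk and u[1] in test_sen]
    return train, test
```

Under this assumption a test utterance is never spoken by a training speaker and never repeats a training sentence, which matters here because the same sentence is read by more than one speaker.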
Baseline System
• Objective: creation of high-performance modules.
• PRLM architecture (Phone Recognition followed by Language Modelling)
Baseline System - Modules
• Parameter extraction
  • MFCC: 12 cepstral coefficients + 12 delta cepstral + energy + delta energy
  • Mean cepstral subtraction
• Acoustic units
  • 80 units = 39 Portuguese phones x 2 sexes + silence + pause
• Phone recogniser
  • Continuous HMMs with 8 mixtures
• Language models
  • Interpolated phone bigrams
• Classifier
  • Maximum likelihood
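The last two modules (interpolated phone bigrams and a maximum-likelihood classifier) can be illustrated with a minimal sketch. It is not the authors' implementation: the interpolation weight `lam=0.7`, the add-one smoothing of the unigram term, and the names `InterpolatedBigram` and `identify` are assumptions made for the example. The input is the phone sequence produced by the phone recogniser.

```python
import math
from collections import Counter

class InterpolatedBigram:
    """Phone bigram model linearly interpolated with a unigram model."""

    def __init__(self, sequences, lam=0.7):
        self.lam = lam                  # hypothetical interpolation weight
        self.uni = Counter()            # phone counts
        self.bi = Counter()             # (previous phone, phone) counts
        self.ctx = Counter()            # counts of phones used as context
        for seq in sequences:
            self.uni.update(seq)
            for a, b in zip(seq, seq[1:]):
                self.bi[(a, b)] += 1
                self.ctx[a] += 1
        self.total = sum(self.uni.values())
        self.vocab = len(self.uni)

    def prob(self, a, b):
        # Add-one smoothed unigram (assumption) keeps unseen phones non-zero.
        p_uni = (self.uni[b] + 1) / (self.total + self.vocab + 1)
        p_bi = self.bi[(a, b)] / self.ctx[a] if self.ctx[a] else 0.0
        return self.lam * p_bi + (1 - self.lam) * p_uni

    def log_likelihood(self, seq):
        return sum(math.log(self.prob(a, b)) for a, b in zip(seq, seq[1:]))

def identify(seq, models):
    """Maximum likelihood: pick the language whose model scores seq highest."""
    return max(models, key=lambda lang: models[lang].log_likelihood(seq))
```

In use, one model would be trained per language on recognised phone strings from that language, for example `models = {"pt": InterpolatedBigram(pt_sequences), "es": InterpolatedBigram(es_sequences)}`, and `identify(phones, models)` returns the best-scoring language label.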
Baseline System - Modules - Recogniser
• Continuous HMMs with 8 mixtures
• Training
  • Used only Portuguese speech and orthographic transcriptions
  • Flat start with embedded Baum-Welch
• Recogniser: Viterbi
  • Recognises only all-male or all-female phone sequences
• Phone recognition performance:

                    Correctness   Accuracy
  Train utterances  55.5%         52.5%
  Test utterances   54.1%         50.5%
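The slide does not define correctness and accuracy. Assuming the usual HTK-style definitions, correctness ignores insertion errors while accuracy penalises them, which would explain why accuracy is the lower figure in both rows. The sketch below uses made-up toy counts, not counts from this work.

```python
def correctness(n_ref, deletions, substitutions):
    """Percent correct: (N - D - S) / N, insertions are ignored."""
    return 100.0 * (n_ref - deletions - substitutions) / n_ref

def accuracy(n_ref, deletions, substitutions, insertions):
    """Percent accuracy: (N - D - S - I) / N, insertions are penalised."""
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref

# Toy counts (invented for illustration): 200 reference phones,
# 20 deletions, 60 substitutions, 10 insertions.
toy_corr = correctness(200, 20, 60)      # 60.0
toy_acc = accuracy(200, 20, 60, 10)      # 55.0
```

Under these definitions the gap between the two figures equals the insertion rate relative to the number of reference phones.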
Baseline System - Results
• Global identification rate: 71.1%
• Language proximity revealed
• Portuguese better identified
Proposed System - Bootstrapped double bigram decoding
Proposed System - Results
• Identification rate increased to 83.5%.
• Utterance duration is an important factor
  • 86.1% for utterances lasting from 7 up to (but not including) 8 seconds, i.e. the interval [7,8[
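A breakdown of the identification rate by utterance duration, as in the [7,8[ figure above, could be computed along these lines. The one-second bin width is inferred from that interval, and the function name `rate_by_duration` and the `(duration, is_correct)` input format are assumptions for the sketch.

```python
import math
from collections import defaultdict

def rate_by_duration(results):
    """results: iterable of (duration_seconds, is_correct) pairs.

    Returns {lower_bound: identification rate in %} for one-second
    bins [lower_bound, lower_bound + 1[.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for duration, ok in results:
        lo = math.floor(duration)       # 7.0 and 7.9 both fall in [7, 8[
        totals[lo] += 1
        hits[lo] += bool(ok)
    return {lo: 100.0 * hits[lo] / totals[lo] for lo in sorted(totals)}
```

The half-open bins mean an utterance of exactly 8.0 seconds is counted in [8,9[, not in [7,8[.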
Conclusions
• A language identification system that is easy to extend to new languages
• Language proximity hurts identification