Spoken Language Identification Using the Speechdat-M Corpus

Diamantino Caseiro, Isabel Trancoso (INESC/IST)

Language Identification
• The best systems use multiple large-vocabulary continuous speech recognisers.
• However, they are hard to extend to new languages because they require large amounts of hard-to-get linguistic data (such as transcribed speech).
• Phonotactic approaches
  • Published systems still require some linguistic data

Phonotactic Approaches - PRLM-P
• PRLM-P: parallel phone recognition followed by language modelling
• Multiple language-specific phone recognisers

Phonotactic Approaches - DBD
• DBD: double bigram decoding
• One language-independent phone recogniser

Corpus - Speechdat-M
• Multilingual, 6 languages: English, Spanish, German, Portuguese, Italian, French
• Telephone speech
• Includes:
  • Numbers, digits, hours, dates, money amounts, commands
  • Phonetically rich sentences
  • Etc.
• Orthographic transcriptions
• Subset used: phonetically rich sentences
  • 6 languages x 1000 speakers x 9 utterances
  • The same sentence is read by more than one speaker
  • Utterances average 5 seconds in duration

Corpus - Train/test selection
• Criteria
  • Speakers: 70% train, 30% test
  • Sentences: 70% train, 30% test
• Random selection
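The selection criteria above can be sketched in Python. This is a minimal illustration, not the authors' procedure: the slide does not say how utterances whose speaker and sentence fall in different partitions were handled, so the sketch assumes they are discarded. The function names and the `(speaker_id, sentence_id)` representation are hypothetical.

```python
import random


def split_70_30(items, seed=0):
    """Randomly partition items into a 70% train set and a 30% test set."""
    items = sorted(items)          # fixed order so the seed is reproducible
    rng = random.Random(seed)
    rng.shuffle(items)
    cut = round(0.7 * len(items))
    return set(items[:cut]), set(items[cut:])


def select_utterances(utterances, seed=0):
    """Split (speaker_id, sentence_id) pairs into train and test lists.

    Assumption (not stated on the slide): an utterance is kept for training
    only if both its speaker and its sentence are in the train partitions,
    and for testing only if both are in the test partitions. Mixed cases
    are dropped.
    """
    utterances = list(utterances)
    train_spk, test_spk = split_70_30({s for s, _ in utterances}, seed)
    train_sen, test_sen = split_70_30({t for _, t in utterances}, seed + 1)
    train = [(s, t) for s, t in utterances if s in train_spk and t in train_sen]
    test = [(s, t) for s, t in utterances if s in test_spk and t in test_sen]
    return train, test
```

With this assumption, no speaker and no sentence appears on both sides, at the cost of discarding part of the data.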

Baseline System
• Objective: creation of high-performance modules
• PRLM architecture (Phone Recognition followed by Language Modelling)

Baseline System - Modules
• Parameter extraction
  • MFCC: 12 cepstral coefficients + 12 delta cepstral coefficients + energy + delta energy
  • Cepstral mean subtraction
• Acoustic units
  • 80 units = 39 Portuguese phones x 2 sexes + silence + pause
• Phone recogniser
  • Continuous HMMs with 8 mixtures
• Language models
  • Interpolated phone bigrams
• Classifier
  • Maximum likelihood
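The last two modules (interpolated phone bigrams and a maximum-likelihood classifier) can be sketched as follows. The slide does not give the interpolation scheme, so this sketch assumes linear interpolation between bigram and unigram estimates with a fixed weight. The class name, the weight `lam`, and the probability `floor` are illustrative choices, not values from the presentation.

```python
import math
from collections import Counter


class InterpolatedBigram:
    """Phone bigram model, linearly interpolated with unigrams (assumed scheme)."""

    def __init__(self, sequences, lam=0.7, floor=1e-6):
        self.lam = lam        # hypothetical interpolation weight
        self.floor = floor    # hypothetical floor to avoid log(0)
        self.uni = Counter()  # phone counts
        self.bi = Counter()   # (previous phone, phone) counts
        self.ctx = Counter()  # counts of phones seen as a bigram context
        for seq in sequences:
            self.uni.update(seq)
            for a, b in zip(seq, seq[1:]):
                self.bi[(a, b)] += 1
                self.ctx[a] += 1
        self.total = sum(self.uni.values())

    def prob(self, prev, cur):
        p_uni = self.uni[cur] / self.total if self.total else 0.0
        p_bi = self.bi[(prev, cur)] / self.ctx[prev] if self.ctx[prev] else 0.0
        return max(self.lam * p_bi + (1 - self.lam) * p_uni, self.floor)

    def log_likelihood(self, seq):
        return sum(math.log(self.prob(a, b)) for a, b in zip(seq, seq[1:]))


def identify(phones, models):
    """Maximum-likelihood decision: pick the language whose model scores highest."""
    return max(models, key=lambda lang: models[lang].log_likelihood(phones))
```

In a PRLM setup, `phones` would be the phone string produced by the recogniser for one utterance, and `models` would hold one bigram model per language, each trained on recogniser output for that language.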

Baseline System - Modules - Recogniser
• Continuous HMMs with 8 mixtures
• Training
  • Used only Portuguese speech and orthographic transcriptions
  • Flat start with embedded Baum-Welch
• Recogniser: Viterbi
  • Recognises only all-male or all-female phone sequences
• Phone recognition performance:

                     Correctness   Accuracy
  Train utterances   55.5%         52.5%
  Test utterances    54.1%         50.5%
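The slide does not define correctness and accuracy. Assuming the usual definitions from phone recognition scoring (correctness ignores insertions, accuracy penalises them), the two figures relate as in the sketch below. The function name is hypothetical and the counts in the usage note are made up for illustration.

```python
def phone_scores(n_ref, deletions, substitutions, insertions):
    """Return (correctness, accuracy) in percent.

    Assumed definitions, not taken from the slides:
      correctness = (N - D - S) / N
      accuracy    = (N - D - S - I) / N
    where N is the number of reference phones.
    """
    hits = n_ref - deletions - substitutions
    correctness = 100.0 * hits / n_ref
    accuracy = 100.0 * (hits - insertions) / n_ref
    return correctness, accuracy
```

For example, `phone_scores(200, 30, 60, 8)` gives a correctness of 55.0 and an accuracy of 51.0. Under these definitions, the gap between the two columns in the table is the insertion rate.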

Baseline System - Results
• Global identification rate: 71.1%
• Language proximity revealed
• Portuguese is better identified

Proposed System - Bootstrapped double bigram decoding

Proposed System - Results
• Identification rate increased to 83.5%
• Utterance duration is an important factor
  • 86.1% for utterances of 7 to 8 seconds ([7, 8[)

Conclusions
• A language identification system that is easy to extend to new languages
• Language proximity hurts identification
