Recent work on Language Identification
Pietro Laface, POLITECNICO di TORINO
Brno, 28-06-2009
Team
POLITECNICO di TORINO: Pietro Laface (Professor), Fabio Castaldo (Post-doc), Sandro Cumani (PhD student), Ivano Dalmasso (Thesis student)
LOQUENDO: Claudio Vair (Senior Researcher), Daniele Colibro (Researcher), Emanuele Dalmasso (Post-doc)
Our technology progress
• Inter-speaker compensation in feature space
• GLDS/SVM models and GMMs (ICASSP 2007)
• SVM using GMM super-vectors (GMM-SVM), introduced by MIT-LL for speaker recognition
• Fast discriminative training of GMMs: an alternative to MMIE, exploiting the GMM-SVM separation hyperplanes
• MIT discriminative GMMs
• Language factors
GMM super-vectors
Appending the mean values of all the Gaussians in a single stream we get a super-vector. We use GMM super-vectors, with Kullback–Leibler normalization, for:
• Training GMM-SVM models
• Training discriminative GMMs
• Inter-speaker/channel variation compensation
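The super-vector construction above can be sketched in a few lines. This is a minimal illustration, not the production implementation: it assumes diagonal-covariance Gaussians and applies the usual Kullback–Leibler normalization, scaling each mean by sqrt(w_m)/sigma_m so a plain linear kernel between super-vectors approximates the KL divergence between the GMMs.

```python
import math

def kl_normalized_supervector(means, weights, stds):
    """Stack the mean vectors of all Gaussians into one super-vector.
    Each mean component is scaled by sqrt(w_m)/sigma_m (KL
    normalization), so a linear kernel between two super-vectors
    approximates the KL divergence between the corresponding GMMs."""
    sv = []
    for mu, w, sigma in zip(means, weights, stds):
        sv.extend(math.sqrt(w) * x / s for x, s in zip(mu, sigma))
    return sv

# toy GMM: 2 Gaussians with 3-dimensional means -> 6-dimensional super-vector
means = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weights = [0.25, 0.75]
stds = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
sv = kl_normalized_supervector(means, weights, stds)
```

With 2048 Gaussians and typical acoustic feature dimensions, the same stacking yields the high-dimensional super-vectors used throughout the deck.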
Using a UBM in LID
• The frame-based inter-speaker variation compensation approach estimates the inter-speaker compensation factors using the UBM
• In the GMM-SVM approach all language GMMs share the weights and variances of the UBM
• The UBM is used for fast selection of Gaussians
Speaker/channel compensation in feature space
• U is a low-rank matrix (estimated offline) projecting the speaker/channel factor subspace into the super-vector domain
• x(i) is a low-dimensional vector, estimated using the UBM, holding the speaker/channel factors for the current utterance i
• γm(t) is the occupation probability of the m-th Gaussian
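The feature-domain compensation described above can be sketched as follows. This is an illustrative toy, assuming the standard form x̂(t) = x(t) − Σm γm(t)·Um·x(i), where γm(t) are UBM occupation probabilities; the function and argument names are this sketch's own, not the authors' code.

```python
def compensate_frame(frame, gamma, U_blocks, factors):
    """Feature-domain compensation of a single frame:
        x_hat(t) = x(t) - sum_m gamma_m(t) * U_m * x(i)
    gamma[m] is the occupation probability of Gaussian m for this
    frame (computed with the UBM), U_blocks[m] is the block of the
    low-rank matrix U belonging to Gaussian m, and factors is x(i),
    the speaker/channel factor vector of the current utterance."""
    comp = list(frame)
    for g, U_m in zip(gamma, U_blocks):
        # offset of Gaussian m, mapped from factor space to feature space
        offset = [sum(u * f for u, f in zip(row, factors)) for row in U_m]
        comp = [c - g * o for c, o in zip(comp, offset)]
    return comp
```

Because the correction is applied frame by frame in feature space, the same compensated features feed every downstream model (GMM-SVM, discriminative GMMs, phonetic systems).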
Estimating the U matrix
• Speaker recognition: estimating the U matrix with a large set of differences between models generated from different utterances of the same speaker, we compensate the distortions due to inter-session variability
• Language recognition: estimating the U matrix with a large set of differences between models generated from utterances of different speakers of the same language, we compensate the distortions due to inter-speaker/channel variability within the same language
GMM-SVM weakness
• GMM-SVM models perform very well with rather long test utterances
• It is difficult to estimate a robust GMM from a short test utterance
• Exploit the discriminative information given by the GMM-SVM for fast estimation of discriminative GMMs
SVM discriminative directions w: normal vector to the class‑separation hyperplane
GMM discriminative training
Shift each Gaussian of a language model along its discriminative direction, given by the vector normal to the class-separation hyperplane in the KL space.
[Figure: language and utterance GMMs mapped between feature space and KL space]
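The "pushing" step above can be sketched like this. It is a simplified illustration, assuming diagonal covariances: each Gaussian's block of the SVM normal vector w (which lives in the KL-normalized super-vector space) is mapped back to feature space by sigma_m/sqrt(w_m) and added to the mean; the step size alpha is this sketch's own tuning parameter, not a value from the slides.

```python
import math

def push_means(means, weights, stds, w_svm, alpha=1.0):
    """Shift each Gaussian mean of a language model along its
    discriminative direction: its block of the SVM normal vector w.
    w is expressed in the KL-normalized super-vector space, so each
    component is mapped back to feature space by sigma_m/sqrt(w_m).
    alpha is a step-size parameter (assumed here, to be tuned)."""
    d = len(means[0])
    pushed = []
    for m, (mu, w, sigma) in enumerate(zip(means, weights, stds)):
        block = w_svm[m * d:(m + 1) * d]
        pushed.append([x + alpha * (s / math.sqrt(w)) * b
                       for x, s, b in zip(mu, sigma, block)])
    return pushed
```

This gives a generative GMM that keeps the discriminative information of the GMM-SVM while remaining usable with short test utterances.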
Experiments with 2048 GMMs Pooled EER(%) of Discriminative 2048 GMMs, and GMM-SVM on the NIST LRE tasks. In parentheses, the average of the EERs of each language.
Language Factors
• Eigenvoice modeling, and the use of speaker factors as input features to SVMs, has recently been shown to give good results for speaker recognition compared with the standard GMM-SVM approach (Dehak et al., ICASSP 2009).
Analogy
• Estimate an eigen-language space, and use the language factors as input features to SVM classifiers (Castaldo et al., submitted to Interspeech 2009).
Language Factors: advantages
Language factors are low-dimensional vectors:
• Training and evaluating SVMs with different kernels is easy and fast: it requires only the dot product of normalized language factors
• Using a very large number of training examples is feasible
• Small models give good performance
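The "easy and fast" claim follows because, with low-dimensional factors, a linear SVM decision reduces to one dot product. A minimal sketch, assuming length-normalized factors and a trained weight vector w and bias b (hypothetical values below, not from the slides):

```python
import math

def normalize(v):
    """Length-normalize a language-factor vector, so the linear
    kernel between two vectors reduces to a cosine similarity."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def svm_score(w, b, factors):
    """Decision value of a linear one-vs-rest language SVM: a single
    dot product with the normalized language-factor vector, which is
    what makes training and scoring on many examples cheap."""
    x = normalize(factors)
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

Scoring one utterance against N target languages is then N dot products of small vectors, regardless of the size of the underlying GMMs.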
Toward an eigen-language space
• After compensating the nuisances of a GMM adapted from the UBM using a single utterance, residual information about the channel and the speaker remains.
• Most of the undesired variation, however, is removed, as demonstrated by the improvements obtained with this technique.
Speaker compensated eigenvoices First approach • Estimating the principal directions of the GMM supervectors of all the training segments before inter-speaker nuisance compensation would produce a set of language independent, “universal” eigenvoices. • After nuisance removal, however, the speaker contribution to the principal components is reduced to the benefit of language discrimination.
Eigen-language space
Second approach
• Computing the differences between the GMM super-vectors obtained from utterances of a polyglot speaker would compensate the speaker characteristics and enhance the acoustic components of one language with respect to the others.
• We do not have labeled databases including polyglot speakers, so we compute and collect the differences between GMM super-vectors produced by utterances of speakers of two different languages, irrespective of speaker identity, already compensated in the feature domain.
Eigen-language space
• The number of these differences would grow with the square of the number of utterances in the training set.
• Instead, perform Principal Component Analysis on the set of differences between the super-vectors of a language and the average super-vector of every other language.
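The PCA step above can be illustrated with plain power iteration on the scatter matrix of the difference vectors. This is a didactic sketch only: a real system would keep the top-k eigenvectors (e.g. via an SVD) rather than just the first direction shown here.

```python
def top_principal_direction(diffs, iters=200):
    """Power iteration on the scatter matrix of super-vector
    differences (a language's super-vectors minus the average
    super-vector of every other language). The dominant directions
    found this way span the eigen-language space; full PCA would
    keep the top-k eigenvectors instead of only the first."""
    d = len(diffs[0])
    v = [1.0] * d
    for _ in range(iters):
        # one multiplication by sum_k x_k x_k^T, then renormalize
        proj = [sum(x[j] * v[j] for j in range(d)) for x in diffs]
        v = [sum(p * x[j] for p, x in zip(proj, diffs)) for j in range(d)]
        norm = sum(x * x for x in v) ** 0.5 or 1.0
        v = [x / norm for x in v]
    return v
```

Projecting an utterance's compensated super-vector onto these directions yields the low-dimensional language factors used as SVM inputs.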
Training corpora
The same used for the LRE07 evaluation:
• All data of the 12 languages in the CallFriend corpus
• Half of the NIST LRE07 development corpus
• Half of the OHSU corpus provided by NIST for LRE05
• The Russian through Switched Telephone Network corpus
• Automatic segmentation
LRE07 30s closed-set test
The language factors' minDCF is always better and more stable.
Pushed eigen-language GMMs
The same approach is used to obtain discriminative GMMs from the language factors.
Loquendo-Polito LRE09 System: Model Training
[Diagram: acoustic features feed Pushed GMMs, SVM-GMMs, and MMIE GMMs; three phonetic transcribers each produce n-gram counts feeding TFLLR SVMs]
Phone transcribers (ASR recognizer)
• 12 phone transcribers for French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English
• The statistics of the n-gram phone occurrences are collected from the best decoded string of each conversation segment
• Phone-loop grammar with diphone transition constraints
Phone transcribers (ANN models)
• 10 phone transcribers for Catalan, French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, UK and US English
• The statistics of the n-gram phone occurrences are collected from the expected counts over a lattice for each conversation segment
• Same phone-loop grammar, different engine
Multigrams
Two different TFLLR kernels:
• trigrams
• pruned multigrams
Multigrams can provide useful information about the language by capturing “word parts” within the string sequences.
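The TFLLR kernel named above can be sketched as a feature weighting. A minimal illustration, assuming the standard TFLLR form: each n-gram's relative frequency in the segment is scaled by 1/sqrt(p_i), where p_i is that n-gram's probability over the whole training set, so frequent n-grams do not dominate the linear kernel.

```python
import math

def tfllr_features(counts, background):
    """TFLLR weighting of phone n-gram counts: the relative
    frequency f_i of each n-gram in the segment is scaled by
    1/sqrt(p_i), with p_i the n-gram's background probability
    estimated over the whole training set."""
    total = sum(counts.values()) or 1
    return {g: (c / total) / math.sqrt(background[g])
            for g, c in counts.items()}
```

The SVM input vector is then the weighted frequencies over the n-gram (or pruned multigram) inventory, with a plain dot-product kernel on top.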
Scoring
The total number of models used for scoring an unknown segment is 34:
• 11 channel-dependent models (11 × 2)
• 12 single-channel models (2 telephone-only and 10 broadcast-only)
• 23 × 2 = 46 for the MMIE GMMs (channel-independent, but gender-dependent M/F)
Calibration and fusion
[Diagram: Pushed GMMs, MMIE GMMs, 1-best 3-gram SVMs, 1-best n-gram SVMs, and lattice n-gram SVMs each feed a Gaussian back-end; 34 scores per subsystem (46 for the MMIE GMMs) are reduced taking the max of the channel-dependent scores, fused with multi-class FoCal, and passed as LLRs to lre_detection]
Language pair recognition
• For the language-pair evaluation only the back-ends have been re-trained, keeping the models of all the sub-systems unchanged.
Telephone development corpora
• CALLFRIEND: conversations split into slices of 150s
• NIST 2003 and NIST 2005
• LRE07 development corpus
• Cantonese and Portuguese data in the 22-language OGI corpus
• RuSTeN: the Russian through Switched Telephone Network corpus
“Broadcast” development corpora
• Incrementally created to include, as far as possible, the variability within a language due to channel, gender and speaker differences
• The development data, further split into training, calibration and test subsets, should cover the mentioned variability
Problems with LRE09 dev data
• Many segments come from the same speaker
• Scarcity of segments for some languages after filtering same-speaker segments
• Genders are not balanced
• Excluding French, the segments of each language are either telephone-only or broadcast-only
• No audited data available for Hindi, Russian, Spanish and Urdu on VOA3; only automatic segmentation was provided
• No segmentation was provided in the first release of the development data for Cantonese, Korean, Mandarin, and Vietnamese
• For these 8 missing languages only the language hypotheses provided by BUT were available for the VOA2 data
Additional “audited” data
• For the 8 languages lacking broadcast data, segments have been generated by accessing the VOA site and retrieving the original MP3 files
• Goal: collect ~300 broadcast segments per language, processed to detect narrowband fragments
• The candidates were checked to eliminate segments including music, bad channel distortions, and fragments of other languages
Development data for bootstrap models
Telephone and audited/checked broadcast data, split into:
• Training (50%)
• Development (25%)
• Test (25%)
The segments were distributed so that same-speaker segments fall in the same set. A set of acoustic (pushed GMM) bootstrap models has been trained.
Additional non-audited data from VOA3
• Preliminary tests with the bootstrap models indicated the need for additional data
• Selected from VOA3 to include new speakers in the training, calibration and test sets
• Assuming that the file labels correctly identify the corresponding language
Speaker selection
• Performed by means of a speaker recognizer
• The audited segments are processed before the others
• A new speaker model is added to the current set of speaker models whenever the best recognition score obtained by a segment is below a threshold
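The selection rule above amounts to a greedy, threshold-based clustering. A minimal sketch, where `scores_fn` stands in for the actual speaker recognizer (an assumption of this sketch) and the threshold value would come from the recognizer's calibration:

```python
def assign_speakers(scores_fn, segments, threshold):
    """Greedy speaker labelling: score each segment against the
    current speaker models; if the best score falls below the
    threshold, start a new speaker model for that segment.
    scores_fn(model, segment) is a placeholder for the real
    speaker-recognizer scoring function."""
    models = []          # one representative segment per speaker
    labels = []
    for seg in segments:
        best, best_id = None, None
        for i, mdl in enumerate(models):
            s = scores_fn(mdl, seg)
            if best is None or s > best:
                best, best_id = s, i
        if best is None or best < threshold:
            models.append(seg)        # unseen speaker: new model
            labels.append(len(models) - 1)
        else:
            labels.append(best_id)    # matches an existing speaker
    return labels
```

Processing the audited segments first seeds the model set with trusted speakers before the noisier automatic data arrives.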
Additional non-audited data from VOA2
Enriching the training set:
• Language recognition has been performed with a system combining the acoustic bootstrap models and a phonetic system
• A segment was selected only if:
• the 1-best language hypothesis of our system had a score greater than a given (rather high) threshold
• it matched the 1-best hypothesis provided by the BUT system
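The two selection conditions above reduce to a small predicate. A sketch with illustrative names (the score scale and threshold are hypothetical, not values from the slides):

```python
def select_segment(our_hyps, but_best, threshold):
    """Keep a non-audited VOA2 segment only when our top language
    hypothesis is confident (score above a rather high threshold)
    AND agrees with the 1-best hypothesis of the BUT system.
    our_hyps maps language -> score for our combined system."""
    lang, score = max(our_hyps.items(), key=lambda kv: kv[1])
    return score > threshold and lang == but_best
```

Requiring agreement between two independent systems trades recall for precision, which is the right trade-off when harvesting unlabeled training data.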
Total number of segments for this evaluation
Suffixes: A = audited, C = checked, S = automatic segmentation
ftp://8475.ftp.storage.akadns.net/mp3/voa/
Results on the development set Average minDCFx100 on 30s test segments
Korean: score cumulative distribution
[Figure: cumulative score distributions for the four channel conditions t-t, t-b, b-t, b-b]