LID/SID - Research Stay at BUT: Last Presentation • Luis Fernando D’Haro, Polytechnical University of Madrid • Granted by “José Castillejo” fellowship, Education Ministry - Spanish Government • February 20th, 2012
Outline • Research stay goals • Work on phonotactic LID • Discriminative n-grams • New phonotactic system • Using i-vectors and multinomial subspace model • Work on LID-RATS • VAD and LID • Future work
Research Stay Goals • Work with the most recent techniques for LID, such as: • i-Vectors, sGMM, WCCN, score calibration/fusion • Test our ranking templates and discriminative n-gram selection approach with the acoustic i-Vector system for the LID task • Ideas: • Fusion of scores • Selection of discriminative n-grams • Collaborate on current BUT campaigns • RATS, LRE, SRE • Publications
Work on Phonotactic LID • LID based on ranking positions and distance • Original idea: [Cavnar and Trenkle, 1994]
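A minimal sketch of the ranking-and-distance idea from [Cavnar and Trenkle, 1994], applied here to phone sequences instead of text. The function names, the toy phone strings and the `top_k` cutoff are assumptions for illustration, not the system described in these slides.

```python
from collections import Counter


def ngram_profile(tokens, n, top_k=300):
    """Return the n-grams of a token sequence, most frequent first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ordered[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get the maximum penalty."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def identify(tokens, lang_profiles, n=2):
    """Pick the language whose ranking is closest to the test ranking."""
    doc = ngram_profile(tokens, n)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Toy phone strings (hypothetical data).
training = {
    "A": "a b a b a b c".split(),
    "B": "c d c d c d a".split(),
}
profiles = {lang: ngram_profile(toks, 2) for lang, toks in training.items()}
test_doc = ngram_profile("a b a b".split(), 2)
best = identify("a b a b".split(), profiles)
```

Here `best` is the language with the smallest out-of-place distance; a lower distance means the test ranking is closer to that language's template.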
Improvements to the Ranking approach • One ranking for each n-gram order • Golf position • All n-grams with the same number of occurrences share the same position in the ranking • Discriminative positions in the ranking • Place the most relevant n-grams for each language in the top positions of the ranking • i.e. very frequent in one language but not in the others • A new formula inspired by tf-idf providing normalized scores (between -1 and 1) • Advantages: high-order n-grams (up to 5-grams) • More details in [Caraballo et al, 2010]
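The "golf position" rule can be sketched as follows. This assumes a dense ranking (the position after a tie is the next integer); the slides do not say whether positions are skipped after a tie, and the function name and counts are hypothetical.

```python
from collections import Counter


def golf_ranking(counts):
    """Map each n-gram to a position; equal counts share the same position."""
    ranks = {}
    position = 0
    previous = None
    # Most frequent first; alphabetical order only makes iteration deterministic.
    for gram, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count != previous:
            position += 1
            previous = count
        ranks[gram] = position
    return ranks


toy_counts = Counter({"a b": 5, "b c": 3, "c d": 3, "d e": 1})
toy_ranks = golf_ranking(toy_counts)
```

With these toy counts, "b c" and "c d" end up in the same position because they occur the same number of times.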
Experiments on LRE09 • Baseline: phonotactic PCA [Mikolov et al, 2010] • Uses n-gram soft-counts from different phone recognizers • Our system uses only the normalized score generated by the system, not the classifier • Our baseline classifier based on distance among languages did not work well • Approaches: • Comparison/fusion with the PCA system • Fusion with acoustic iVectors system (400 iVectors, 2048 Gaussians) • Selection of discriminative n-grams • Goal: reduce the input vector of n-gram soft-counts • Database: • Train: 9763 segments (345 hours, ~500 utt. per language) • Dev: 38134 segments from the 23 languages of LRE09 • Test: 41545 segments
Comparison with phonotactic PCA • Baseline approach: • Feature vector: expected n-gram phoneme counts estimated from lattices • For all possible trigrams and the most frequent four-grams, e.g. for the Hungarian phone-ASR: • 3-grams: 33^3 = 35,937 • 4-grams: 33^4 = 1,185,921 • Then, apply PCA to reduce the vector size (baseline: 1000) • Discriminative approach • Original templates (up to 4-grams) • Engl: 45_2025_100K_200K • Russ: 47_2209_100K_200K • Hung: 33_1089_35K_200K
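The vector sizes above follow from raising the phone-set size to the n-gram order. A small sketch of that arithmetic, taking the phone-set sizes 45, 47 and 33 from the first field of the template names (reading the template names this way is an assumption):

```python
def ngram_space_size(num_phones, order):
    """Number of distinct n-grams of a given order over a phone set."""
    return num_phones ** order


# Phone-set sizes read from the template names (assumed interpretation).
phone_sets = {"English": 45, "Russian": 47, "Hungarian": 33}

sizes = {
    lang: [ngram_space_size(p, order) for order in (1, 2, 3, 4)]
    for lang, p in phone_sets.items()
}

for lang, row in sizes.items():
    print(lang, row)
```

The four-gram space is already above one million entries for the smallest phone set, which is why only the most frequent four-grams are kept before PCA.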
Results* • *Problems reproducing the results reported in the paper • Poor results in almost all cases • Large gap with respect to the baseline using only 3-grams and PCA
Selection of discriminative n-grams • Goal: help PCA reduce the size of the feature vector by first selecting the most discriminative n-grams and then applying PCA • Reducing from 35K to approx. 8K for 3-grams • Using 16K 4-grams instead of the 80K most frequent ones [Mikolov et al, 2010] and concatenating them with the 8K trigrams • Selection based on the discriminability among all languages • We also tried using probabilities instead of vectors of counts • Fusion with acoustic i-Vector systems • 600 iVectors + 2048 Gaussians • Cavg for baseline iVectors: • 30s: 2.40% • 10s: 4.93% • 3s: 14.04%
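One way to read "probabilities instead of counts" is to divide each soft-count by the total for the utterance, so that files of different length become comparable. A minimal sketch under that assumption (function name and toy counts are hypothetical):

```python
import math


def to_probabilities(soft_counts):
    """Normalize n-gram soft-counts so that they sum to one."""
    total = sum(soft_counts.values())
    if total == 0:
        return {gram: 0.0 for gram in soft_counts}
    return {gram: count / total for gram, count in soft_counts.items()}


# Same n-gram distribution, but the second utterance is ten times longer.
short_utt = {"a b c": 2.0, "b c d": 6.0}
long_utt = {"a b c": 20.0, "b c d": 60.0}

p_short = to_probabilities(short_utt)
p_long = to_probabilities(long_utt)
same = all(math.isclose(p_short[g], p_long[g]) for g in p_short)
```

The raw count vectors differ by a factor of ten, while the probability vectors coincide.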
Results – Disc. Phonotactic System [Figure; systems compared: 3g-4gCounts_1KPCA, 3g-4gProbs_1KPCA, BASE3G_1KPCA, 3gCounts_1KPCA, 3gProbs_1KPCA, Base+3gProbs_1KPCA]
Results – Disc. Phonotactic System + iVectors [Figure; systems compared: 3g-4gCounts_1KPCA, 3g-4gProbs_1KPCA, BASE3G_1KPCA, 3gCounts_1KPCA, 3gProbs_1KPCA, Base+3gProbs_1KPCA, iVectors]
Conclusions phonotactic • For the LID system based on templates, we need to find better solutions for score normalization • Discriminative n-gram selection helps both the phonotactic PCA system and the iVector system • Better results using probabilities instead of counts, because counts are affected by the different lengths of the files • ToDo: test length normalization • Find a better approach to the selection of high-order n-grams • ToDo: use clusters of scores in the discriminative approach to handle high-order n-grams (currently implemented, but not tried this time)
New Phonotactic system • Baseline: [Soufifar et al, 2011] • Uses n-gram soft-counts from lattices • Uses subspace multinomial distributions for estimating iVectors • Uses iVectors for classification with logistic regression (LIBLINEAR) • Differences • Instead of n-gram soft-counts we use posterior-gram conditional counts • Use original features, or iVectors, or PCA on original features • Use multiclass logistic regression + length normalization • Results on bigrams and trigrams (no time for fine-tuning) • Same training, test and dev sets as for LRE09 • Fusion with the acoustic iVector system
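A rough sketch of the last classification step, assuming "length normalization" means scaling each input vector to unit Euclidean length and that the multiclass logistic regression produces softmax posteriors. The weights and biases are hypothetical placeholders, not trained values, and this does not use LIBLINEAR.

```python
import math


def length_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(vec, weights, biases):
    """Return one posterior per language for a length-normalized input vector."""
    x = length_normalize(vec)
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical two-language model over two-dimensional vectors.
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
toy_biases = [0.0, 0.0]
posteriors = classify([3.0, 4.0], toy_weights, toy_biases)
```

Length normalization removes the dependence on vector magnitude, so only the direction of the vector influences the scores.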
Work on LID-RATS & VAD-RATS • Goals: • Test different noise reduction and speech enhancement algorithms • Test different robust features • Test different BUT VADs • Combine with iVectors • Database • Eight noise conditions + clean data • Experiments on the 2-minute condition and short list • Train: 3458 files (115 h) • Dev: 7331 files (244 h)
Work on LID-RATS • Noise tools and algorithms • CtuCopy, developed at SpeechLab (FEE CTU - Prague) • Extended spectral subtraction [Sovka and Pollák, 1996] • Spectral subtraction with full-wave rectification • Using internal and external VAD (i.e. BUT-VAD) • Wiener filter [Zavarehei, 2005] • QIO Aurora Front-end from OGI [QIO, 2009] • Internal NN_VAD + CMN/CVN + RASTA-LDA + Wiener Filter • ETSI: Advanced Front End [ETSI, 2007] • 2-pass adaptive Wiener filter + internal VAD (uses energy info from the whole spectrum and F0 regions) • Kalman filter [Murphy, 1998]
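A textbook-style sketch of magnitude spectral subtraction with full-wave rectification, shown next to the half-wave variant for contrast. It operates on one frame of magnitude values with a fixed noise estimate; the function name, the `alpha` factor and the toy values are assumptions, and it does not attempt to reproduce CtuCopy or the extended algorithm of [Sovka and Pollák, 1996].

```python
def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, rectification="full"):
    """Subtract a noise magnitude estimate from a noisy magnitude spectrum.

    rectification="full": negative differences are replaced by their absolute value.
    rectification="half": negative differences are set to zero.
    """
    cleaned = []
    for y, n in zip(noisy_mag, noise_mag):
        diff = y - alpha * n
        if rectification == "full":
            cleaned.append(abs(diff))
        elif rectification == "half":
            cleaned.append(max(diff, 0.0))
        else:
            raise ValueError("rectification must be 'full' or 'half'")
    return cleaned


# One toy frame with three frequency bins (hypothetical values).
noisy = [1.0, 0.5, 2.0]
noise = [0.5, 1.0, 0.5]
full_wave = spectral_subtraction(noisy, noise, rectification="full")
half_wave = spectral_subtraction(noisy, noise, rectification="half")
```

The two variants differ only in bins where the noise estimate exceeds the noisy magnitude (the second bin in this toy frame).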
Work on LID-RATS • Common and new features • MFCC/PLP + Delta and Delta-Delta • PNCC: proposed by [Kim and Stern, 2010] at CMU • Spectral Delta-Delta: proposed by [Kumar et al, 2011] at CMU • SDC: Shifted Delta Cepstra • RPLP: proposed by [Rajnoha and Pollák, 2011] at SpeechLab, FEE CTU Prague • Hybrid between MFCC and PLP • Tests with/without RASTA, VTLN, CMN/CVN • Test new positions of the filterbank • After studying the spectrogram and noise reduction effects • Without NR: 300-3200 Hz, with NR: 500-3000 Hz
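A sketch of Shifted Delta Cepstra in its usual definition: for each frame t, k delta vectors c(t + iP + d) - c(t + iP - d), i = 0..k-1, are stacked. The defaults d=1, P=3, k=7 are a configuration often seen in the literature and are an assumption here; the slides do not state which values were used.

```python
def shifted_delta_cepstra(frames, d=1, P=3, k=7):
    """Stack k shifted delta vectors per frame.

    frames: list of cepstral vectors (lists of floats), one per time step.
    Only frames for which every required index exists are returned.
    """
    num_frames = len(frames)
    output = []
    for t in range(d, num_frames - (k - 1) * P - d):
        stacked = []
        for i in range(k):
            ahead = frames[t + i * P + d]
            behind = frames[t + i * P - d]
            stacked.extend(a - b for a, b in zip(ahead, behind))
        output.append(stacked)
    return output


# Toy one-dimensional "cepstra" that grow by 1.0 per frame (hypothetical data).
toy_frames = [[float(t)] for t in range(10)]
toy_sdc = shifted_delta_cepstra(toy_frames, d=1, P=2, k=3)
```

For an input with N coefficients per frame, each output vector has N * k values, which is how SDC captures longer temporal context than plain deltas.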
Conclusions RATS-LID • No improvement when using de-noising techniques • QIO toolkit provided the best result • Important improvements from the correct selection of the low and high frequency band limits • RPLP: new robust features for LID • PNCC: promising features for LID, but training time is high • Spectral Delta-Delta slightly better than traditional delta-deltas, but not better than SDC • RASTA and CMN/CVN are essential for high performance • Short-term CMN/CVN did not provide better results
Future work • Discriminative n-grams • New techniques for working with higher n-gram orders • Better combination of information from parallel phoneme recognizers • Write a joint paper based on the LRE09 experiments • Phonotactic iVector: promising results • Check combination of parallel phone recognizers • Incorporation of discriminative information • LRE/SRE • Try collaborations on upcoming NIST competitions
Questions? Thank you!! Thank you very much for your attention and for the good experience of working in this group!!
Bibliography I
Caraballo, M. A. et al. 2010. “A Discriminative Text Categorization Technique for Language Identification built into a PPRLM System”. FALA, pp. 193-196.
Cavnar, W. B. and Trenkle, J. M. 1994. “N-Gram-Based Text Categorization”. SDAIR-94, pp. 161-175.
ETSI: Advanced Front End V1.1.5. 2007. Available at http://www.etsi.org/WebSite/Technologies/DistributedSpeechRecognition.aspx
Kim, C. and Stern, R. M. 2010. “Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring”. ICASSP, pp. 4574-4577.
Mikolov et al. 2010. “PCA-based feature extraction for phonotactic language recognition”. Odyssey, pp. 251-255.
Murphy, K. 1998. “Kalman filter toolbox for Matlab”. Available at http://www.cs.ubc.ca/~murphyk/Software/Kalman/kalman.html
Bibliography II
Qualcomm-ICSI-OGI (QIO) Aurora frontend. 2009. Available at ftp://ftp.icsi.berkeley.edu/pub/speech/papers/qio/aurora-front-end/
Rajnoha, J. and Pollák, P. 2011. “ASR systems in Noisy Environment: Analysis and Solutions for Increasing Noise Robustness”. Radioengineering, Vol. 20, No. 1, April 2011, pp. 74-84.
Soufifar, M. et al. 2011. “iVector approach to phonotactic language recognition”. Interspeech, pp. 2913-2916.
Sovka, P. and Pollák, P. 1996. “Extended spectral subtraction”. Eurospeech, pp. 963-966.
Zavarehei, E. 2005. Wiener filter implementation in Matlab. Available at http://www.mathworks.com/matlabcentral/fileexchange/7673-wiener-filter/content/WienerScalart96.m