Secure-Access System via Fixed and Mobile Telephone Networks using Voice Biometrics

Authors: Anastasis Kounoudes, Anixi Antonakoudi, Vasilis Kekatos

Introduction
We propose a double-digit (DD) voice biometric system for secure access to telephone services. The system combines text-dependent speaker authentication with text validation.
Main system characteristics:
• Feature extraction based on Perceptual Linear Prediction (PLP) coefficients and Mel Frequency Cepstral Coefficients (MFCC).
• Concatenated phoneme HMMs for both speech recognition and user authentication.
• Operation in a sound-prompted mode.
• Speech recognition and speaker verification performance was evaluated against:
  - the length of the training data,
  - the number of embedded re-estimations and Gaussian mixtures used in training the HMMs,
  - the use of world models and bootstrapping,
  - user-dependent thresholds.

System Overview
• The user is voice-prompted for utterances to create speech samples.
• A front-end feature extractor calculates the voice features.
• Input speech is validated against the prompted utterance.
• Successful validation leads to verification.
• During the verification phase, the system verifies that the captured speech matches the models of the enrolled user.
  - The accumulated log-likelihood of the input speech frames against the registered user's model is compared with a threshold to decide whether to accept or reject the speaker.
• The system accepts or rejects the speaker.
The enrolment procedure is used by the system to create speaker-specific phoneme HMMs for each user.
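
The accept/reject step described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names, the per-frame scores, and the choice of `>=` for acceptance are all assumptions made for the example.

```python
from typing import Sequence


def accumulated_log_likelihood(frame_log_likelihoods: Sequence[float]) -> float:
    """Sum the per-frame log-likelihoods of an utterance under the claimed user's model."""
    return sum(frame_log_likelihoods)


def verify_speaker(frame_log_likelihoods: Sequence[float],
                   threshold: float,
                   text_validated: bool) -> bool:
    """Accept only if the prompted text was validated and the score clears the threshold."""
    if not text_validated:
        # Verification is only reached after successful text validation.
        return False
    # Assumption: a score equal to the threshold counts as an acceptance.
    return accumulated_log_likelihood(frame_log_likelihoods) >= threshold


# Hypothetical per-frame scores for one utterance.
scores = [-4.2, -3.9, -5.1]
print(verify_speaker(scores, threshold=-15.0, text_validated=True))
```

Keeping text validation as a separate gate mirrors the slide's ordering: an utterance that fails validation never reaches the threshold comparison.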

System Architecture
[Figure: system architecture diagram]

Data Collection
• In-house database:
  - Comprises data collected over a period of four months over the GSM and PSTN networks.
  - Contains speech samples from 23 speakers, categorized for enrolment and verification purposes.
• YOHO-PSTN database:
  - Replica of the YOHO corpus recorded over the PSTN network (using an analogue modem).
• YOHO-GSM database:
  - Replica of the YOHO corpus recorded over the GSM network (using an analogue modem).
• The YOHO databases were used for initial training of the HMMs.

Text Validation (Speech Recognition)
• Aim: evaluate the performance of text validation over the two telephone channels.
• Text validation performance is evaluated against:
  - the number of embedded re-estimations used in training,
  - the use of bootstrapping in training,
  - the number of Gaussian mixtures in the HMMs,
  - the use of PLP and MFCC coefficients.

Embedded Re-estimations
Evaluation of DD recognition performance against the number of embedded re-estimations of the Baum-Welch algorithm.
Models used:
• 12 MFCC + normalized energy + delta + delta-delta coefficients
• Continuous-density, single-Gaussian monophone HMMs (18)
• 3 left-to-right states
Results:
• 4 embedded re-estimations suffice.
• Performance converges asymptotically to the maximum for this 1 GM system.
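
The feature set listed above (12 MFCC plus normalized energy, with delta and delta-delta coefficients) can be illustrated with a short sketch. The slides do not say how the dynamic coefficients were computed; the code below assumes the widely used regression formula with a window of 2 frames and edge replication, and assumes deltas are taken over all 13 static values, which gives 39 values per frame. The function names are hypothetical.

```python
from typing import List

Frames = List[List[float]]  # one list of coefficients per frame


def deltas(frames: Frames, window: int = 2) -> Frames:
    """Regression deltas: d_t = sum_n n*(c[t+n] - c[t-n]) / (2 * sum_n n^2)."""
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[0])):
            acc = 0.0
            for n in range(1, window + 1):
                # Replicate the first/last frame at the utterance edges.
                nxt = frames[min(t + n, num_frames - 1)][k]
                prv = frames[max(t - n, 0)][k]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamic_features(static: Frames) -> Frames:
    """Append delta and delta-delta coefficients to each static frame."""
    d = deltas(static)
    dd = deltas(d)
    return [s + a + b for s, a, b in zip(static, d, dd)]


# Hypothetical input: 5 frames of 13 static values (12 MFCC + energy).
static = [[float(t)] * 13 for t in range(5)]
print(len(add_dynamic_features(static)[0]))
```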

Gaussian Mixtures
Evaluation against the number of Gaussian mixtures (GMs) per HMM state, keeping the number of embedded re-estimations at 4.
Results:
• Recognition performance increases with the number of GMs.
• The computational complexity rises sharply with the number of GMs.
• The increase in performance from 4 to 8 GMs does not compensate for the computational complexity, which almost doubles.

Use of YOHO-trained HMMs
Evaluation:
• Whether pre-trained HMM prototypes result in better performance.
• Whether additional training adapts the models to the Greek accent and pronunciation of the speakers in the In-house database.
Experiment setup:
• YOHO-PSTN-trained HMMs used for bootstrapping.
• Additional training using the enrolment files of the In-house database.
• Testing using the verification files of the database.
Results:
• Recognition performance increased by 2-4%.

PLP and MFCC Coefficients
Experiment setup:
• The YOHO-GSM and YOHO-PSTN databases were used to bootstrap additional HMM training on 80% of the In-house database.
• The remaining 20% of the database was used for testing.
Results:
• PLP coefficients outperform MFCCs (2-3% increase in performance).
• Cepstral Mean Subtraction (CMS) improves performance by approximately 2%.
• An 8 GM DD recognizer with PLP+CMS (10 embedded re-estimations) achieves 98.4% sentence recognition performance.
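
As a brief illustration of CMS: a stationary channel acts as an approximately constant offset in the cepstral domain, so subtracting the mean of each coefficient removes much of the channel effect. The sketch below assumes the mean is taken over a single utterance; the slides do not state the averaging span, and the function name is hypothetical.

```python
from typing import List

Frames = List[List[float]]  # one list of cepstral coefficients per frame


def cepstral_mean_subtraction(frames: Frames) -> Frames:
    """Subtract the per-coefficient mean, computed over the whole utterance."""
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[k] for f in frames) / num_frames for k in range(dim)]
    return [[f[k] - means[k] for k in range(dim)] for f in frames]


# Hypothetical two-frame utterance, and the same utterance with a constant
# channel offset of 10.0 added to every coefficient.
clean = [[1.0, 2.0], [3.0, 6.0]]
shifted = [[c + 10.0 for c in frame] for frame in clean]

# After CMS both versions yield the same features.
print(cepstral_mean_subtraction(clean) == cepstral_mean_subtraction(shifted))
```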

Speaker Verification
Evaluation of the speaker verification performance of the system against various parameters:
• The use of MFCC and PLP coefficients.
• The number of utterances used for training speaker-specific HMMs.
• The selection of the speaker authentication decision threshold.
• The normalization of HMM scores through the use of a world model.

EER: MFCC and PLP Coefficients
Experiment setup:
• Single-GM HMMs were trained for each speaker using the five enrolment sessions (each session contains 10 DD utterances) of the In-house database.
• Each speaker is authenticated against all 23 HMM speaker models using his/her 150 DD authentication utterances.
Figure axes:
• X axis: impostor speakers attacking each model.
• Y axis: speaker-dependent HMMs.
• Z axis: averaged HMM scores.
• Horizontal plane: threshold for which FAR=FRR.
• Main diagonal: represents speaker identification for the 23 speaker sets of the In-house database.

SA using PSTN and GSM Enrolment Data
Using 30 authentication sessions from each speaker, tests were performed to evaluate speaker authentication (SA) performance in terms of False Acceptance Rate (FAR), False Rejection Rate (FRR) and Equal Error Rate (EER).
• CMS can improve speaker authentication performance when applied to either the MFCC or the PLP feature set.
• PLP coefficients were found to improve speaker verification performance by 1-4% compared with MFCCs.
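
For readers unfamiliar with the three metrics, the sketch below shows one simple way to compute FAR, FRR and an EER estimate from lists of genuine and impostor scores. It is an illustrative procedure, not the evaluation code behind these results: the function names are hypothetical, the threshold sweep over observed scores is one of several possible choices, and the scores are made up.

```python
from typing import List, Tuple


def far_frr(genuine: List[float], impostor: List[float],
            threshold: float) -> Tuple[float, float]:
    """FAR: fraction of impostor scores accepted. FRR: fraction of genuine scores rejected."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine: List[float],
                     impostor: List[float]) -> Tuple[float, float]:
    """Sweep the observed scores as thresholds; return (EER estimate, threshold)
    at the point where FAR and FRR are closest."""
    best = None
    for thr in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, thr)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Hypothetical accumulated log-likelihood scores.
genuine = [-10.0, -12.0, -11.0, -30.0]
impostor = [-40.0, -35.0, -20.0, -11.5]
print(equal_error_rate(genuine, impostor))
```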

HMM SA Decision Threshold
• Applying the threshold that corresponds to FAR=FRR, the individual FAR and FRR of each speaker can be estimated.
• Observation: FAR is not equal to FRR for each speaker, and in some cases the deviation is considerable.
• Repeating the tests using PLPs and CMS, but calculating the EER as the mean of the individual EERs of the speakers, showed that:
  - The EER drops significantly, from 3.52% to 1.14%.
  - The decision threshold estimated as the average of the individual thresholds produces a much better EER than the one estimated by averaging the utterance scores.
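
The observation that a single shared threshold yields unequal FAR and FRR for individual speakers can be illustrated with a toy example. Everything below is hypothetical: the speaker labels, scores and thresholds are invented to make the effect visible and are not taken from the In-house database.

```python
from typing import Dict, List, Tuple

# speaker -> (genuine scores, impostor scores against that speaker's model)
Scores = Dict[str, Tuple[List[float], List[float]]]


def rates_at(genuine: List[float], impostor: List[float],
             threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) for one speaker model at the given threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def per_speaker_rates(scores: Scores,
                      thresholds: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Evaluate every speaker at the threshold assigned to that speaker."""
    return {spk: rates_at(g, i, thresholds[spk]) for spk, (g, i) in scores.items()}


# Two speakers whose score ranges differ widely (made-up numbers).
scores: Scores = {
    "spk_a": ([-10.0, -11.0, -12.0], [-20.0, -21.0, -13.0]),
    "spk_b": ([-30.0, -31.0, -32.0], [-40.0, -41.0, -45.0]),
}

shared = {spk: -25.0 for spk in scores}          # one threshold for everyone
individual = {"spk_a": -12.5, "spk_b": -35.0}    # one threshold per speaker

print(per_speaker_rates(scores, shared))
print(per_speaker_rates(scores, individual))
```

In this toy case the shared threshold accepts every impostor for one speaker and rejects every genuine attempt for the other, while the individual thresholds separate both cleanly.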

Normalization using World Model
• We investigated whether using a world model to normalize the HMM scores of each individual improves overall SA performance.
• The world model is a universal speaker model built from a pool of speech utterances produced by various speakers.
• The present evaluations pre-trained a world model using all the speakers in the enrolment data of the In-house and the two YOHO databases, and evaluated speaker authentication using the verification part of the In-house database.
• Tests showed that:
  - The EER calculated over the individual EERs of all speakers was 0.094% with a world model, against 1.14% for identical tests without a world model.
  - Using the world model to normalize verification scores can significantly improve speaker authentication performance.
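
The slides do not give the exact normalization formula. A common form, assumed here purely for illustration, is the log-likelihood ratio: the utterance's log-likelihood under the world model is subtracted from its log-likelihood under the claimed speaker's model. Function names, scores and the default threshold are hypothetical.

```python
def normalized_score(speaker_log_likelihood: float,
                     world_log_likelihood: float) -> float:
    """Log-likelihood ratio between the claimed speaker's model and the world model."""
    return speaker_log_likelihood - world_log_likelihood


def accept(speaker_log_likelihood: float,
           world_log_likelihood: float,
           threshold: float = 0.0) -> bool:
    """Accept when the claimed speaker's model explains the utterance
    better than the world model by at least `threshold`."""
    return normalized_score(speaker_log_likelihood, world_log_likelihood) >= threshold


# Hypothetical scores for the same claim on a clean and on a degraded call.
# The degradation lowers both raw scores, but the normalized score is unchanged.
print(normalized_score(-100.0, -120.0))
print(normalized_score(-150.0, -170.0))
```

The design point is that factors which lower the raw score for every model alike, such as channel or utterance length, largely cancel in the difference, which makes a single threshold more meaningful across calls.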

Conclusions
• Voice biometric system:
  - Text-dependent, concatenated-phoneme HMM-based speaker verifier
  - Concatenated-phoneme HMM-based speech recognizer
  - Sound-prompted operation over PSTN and GSM
  - Evaluation using a custom In-house database and two versions of YOHO.
• Text validation evaluation:
  - 4 GMs are a good trade-off between accuracy and complexity.
  - DD speech recognition performance converges asymptotically after 4 embedded re-estimations of the Baum-Welch algorithm.
  - Bootstrapping the initial HMM training improves performance.
  - CMS improves double-digit recognition performance by approximately 2%.
  - PLP coefficients outperform MFCCs when speech is recorded via different channels.
• Speaker verification evaluation:
  - CMS increases HMM speaker authentication performance (MFCC and PLP).
  - PLP coefficients give approximately 2% better performance than MFCCs.
  - Speaker-dependent thresholds and the use of a world model further improve speaker verification performance, resulting in EER=0.094%.
