BioSec Multimodal Biometric Database in Text-Dependent Speaker Recognition
D. T. Toledano, D. Hernández-López, C. Esteve-Elizalde, J. Fiérrez, J. Ortega-García, D. Ramos and J. Gonzalez-Rodriguez
ATVS, Universidad Autonoma de Madrid, Spain
LREC 2008, Marrakech, Morocco, 28–30 May 2008
Outline
• 1. Introduction and Goals
• 2. Databases for Text-Dependent Speaker Recognition
• 3. BioSec and Related Databases
• 4. Experiments with YOHO and BioSec Baseline
  • 4.1. Text-Dependent SR Based on Phonetic HMMs
  • 4.2. YOHO and BioSec Experimental Protocols
  • 4.3. Results with YOHO and BioSec Baseline
• 5. Conclusions
1. Introduction and Goals
• Text-Independent Speaker Recognition
  • Unknown lexical content
  • Research driven by yearly NIST SRE evaluations and databases
• Text-Dependent Speaker Recognition
  • Lexical content of the test utterance is known by the system
  • Password set by the user or text prompted by the system
  • No competitive evaluations by NIST
  • Less research and fewer standard benchmarks
  • YOHO is probably the best-known benchmark
  • Newer databases are available, but results are difficult to compare
• Goals
  • Study BioSec as a benchmark for text-dependent speaker recognition
  • Compare results on BioSec and YOHO with the same method
2. Databases for Text-Dependent Speaker Recognition
• YOHO (Campbell & Higgins, 1994): speech
  – Clean microphone speech, 138 speakers, 24 utt. x 4 sessions for enrollment, 4 utt. x 10 sessions for test ("12-34-56")
  – Best-known benchmark
• XM2VTS (Messer et al. 1999): speech, face
  – Clean microphone speech, 295 subjects, 4 sessions
• BIOMET (Garcia-Salicetti et al. 2003): speech, face, fingerprint, hand, signature
  – Clean microphone speech, 130 subjects, 3 sessions
• BANCA (Bailly-Baillière et al. 2003): speech, face
  – Clean and noisy microphone speech, 208 subjects, 12 sessions
• MyIDea (Dumas et al. 2005): speech, face, fingerprint, signature, hand geometry, handwriting
  – BIOMET + BANCA contents for speech, 104 subjects, 3 sessions
• MIT Mobile Device Speaker Verification (Park and Hazen, 2006): speech
  – Mobile devices, realistic noisy conditions
• M3 (Meng et al. 2006): speech, face, fingerprint
  – Microphone speech (3 devices), 39 subjects, 3 sessions (+108 single-session)
• MBioID (Dessimoz et al. 2007): speech, face, iris, fingerprint, signature
  – Clean microphone speech, 120 subjects, 2 sessions
3. BioSec and Related Databases (i)
• BioSec (Fiérrez-Aguilar et al. 2007)
  • Acquired under the EU FP6 BioSec Integrated Project
  • Sites involved: UPM, UPC, TID, MIFIN, UCOL, UTA, KULRD
  • Speech, fingerprint (3 sensors), face and iris
  • 250 subjects, 4 sessions
  • Speech is recorded using two microphones:
    • Head-mounted close-talking microphone
    • Distant webcam microphone
  • 4 utterances of the user's PIN + 3 utterances of other users' PINs (PIN = 8-digit number)
    • Allows simulation of informed forgeries
  • Both in English and Spanish
    • Most subjects are native Spanish speakers
• BioSec Baseline
  • Subset of BioSec comprising 200 subjects and 2 sessions
3. BioSec and Related Databases (ii)
• BioSecurID
  • Speech, iris, face, handwriting, fingerprint, hand geometry and keystroke dynamics
  • Microphone speech in a realistic office-like scenario
  • 400 subjects
• BioSecure
  • Three scenarios: Internet, office-like and mobile
  • 1000 subjects (Internet), 700 subjects (office-like and mobile)
• BioSecure and BioSecurID share subjects with BioSec, which allows long-term studies
• BioSec has several other advantages over YOHO:
  • Multimodal, multilingual (Spanish/English), multichannel (close-talking and webcam)
  • Same lexical content for target trials, which allows simulation of informed forgeries
• But it also has a clear disadvantage:
  • It is harder to compare results on BioSec
4. Experiments with YOHO and BioSec
• Goals:
  • Study BioSec Baseline as a benchmark for text-dependent speaker recognition
  • Compare the difficulty of YOHO and BioSec Baseline
• Goals achieved through:
  • A common text-dependent speaker recognition method
  • Clear evaluation protocols
  • Analysis of results for different conditions
4.1. Text-Dependent SR Based on Phonetic HMMs: Enrollment Phase
• Speech parameterization (common to enrollment and test; sketched below)
  • 25 ms Hamming windows with 10 ms window shift
  • 13 MFCCs + deltas + double deltas = 39 coefficients
• Speaker-independent, context-independent phonetic HMMs used as base models
  • English: 39 phones trained on TIMIT, 3 states, left-to-right
  • Spanish: 23 phones trained on Albayzin, 3 states, left-to-right
• Speaker-dependent phonetic HMMs obtained from transcribed enrollment audio
[Block diagram: parameterized enrollment utterances and their phonetic transcriptions (with optional silence) are composed with the speaker-independent phonetic HMMs into speaker-independent models of the utterances (λI), which are then retrained via Baum-Welch or adapted via MLLR to yield the speaker-dependent phonetic HMMs, i.e. the speaker model.]
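As a concrete illustration of this front-end, a minimal sketch of the parameterization step in Python, assuming the librosa library (not the toolkit used for the paper):

```python
import numpy as np
import librosa

def parameterize(wav_path, sr=16000):
    """25 ms Hamming windows, 10 ms shift: 13 MFCCs + deltas + double deltas."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr),        # 25 ms analysis window
        hop_length=int(0.010 * sr),   # 10 ms window shift
        window="hamming",
    )
    delta = librosa.feature.delta(mfcc)            # first derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)  # second derivatives
    return np.vstack([mfcc, delta, delta2])        # shape: (39, n_frames)
```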
4.1. Text-Dependent SR Based on Phonetic HMMs: Verification Phase
• The parameterized audio to verify is Viterbi-aligned, using the phonetic transcription (with optional silence), against both the speaker-dependent and speaker-independent models of the utterance (λD and λI), yielding two acoustic scores
• Verification score: log-likelihood ratio between the two acoustic scores, computed after removing silences (see the sketch below):

  s = log p(O | λD) − log p(O | λI)

  where O denotes the non-silence frames of the utterance
[Block diagram: parameterized audio to verify → Viterbi alignment against λD and λI → speaker-dependent and speaker-independent acoustic scores → verification score.]
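A sketch of this score computation; the per-frame Viterbi log-likelihoods and silence mask are assumed to come from an existing aligner (these inputs are hypothetical, not an API from the paper), and the length normalization is a common choice assumed here rather than stated on the slide:

```python
import numpy as np

def verification_score(ll_dep, ll_indep, is_silence):
    """Log-likelihood ratio between the speaker-dependent (lambda_D) and
    speaker-independent (lambda_I) acoustic scores, silences removed.
    Inputs are per-frame Viterbi log-likelihoods and a boolean mask."""
    ll_dep, ll_indep = np.asarray(ll_dep), np.asarray(ll_indep)
    speech = ~np.asarray(is_silence)           # keep non-silence frames only
    llr = ll_dep[speech].sum() - ll_indep[speech].sum()
    return llr / speech.sum()                  # normalize by frame count
```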
4.2. YOHO Experimental Protocol
• YOHO database
  • 138 speakers (106 male, 32 female)
  • Enrollment data: 4 sessions x 24 utterances = 96 utterances
  • Test data: 10 sessions x 4 utterances = 40 utterances
  • Utterance = 3 digit pairs (e.g. "twelve thirty-four fifty-six")
• Usage of YOHO in this work
  • Enrollment: 3 different conditions
    • 6 utterances from the 1st enrollment session
    • 24 utterances from the 1st enrollment session
    • 96 utterances from the 4 enrollment sessions
  • Test: always with a single utterance
  • Target trials: 40 test utterances per speaker (138 x 40 = 5,520)
  • Non-target trials: 137 test utterances per speaker (138 x 137 = 18,906)
    • One random utterance from the test data of each of the other speakers
  • Text-prompted simulation: the utterance spoken is always the utterance expected by the system
4.2. BioSec Experimental Protocol
• BioSec Baseline database
  • 200 speakers, 2 sessions
  • Session = 4 utterances of the user's own PIN and 3 of other users' PINs
  • PIN = 8 digits (e.g. "one two three four five six seven eight")
• Usage of BioSec in this work
  • Following the BioSec Baseline core protocol (Fiérrez et al. 2005)
    • This protocol limits the number of subjects to only 150
  • Target trials: 150 x 4 x 4 = 2,400
    • Enrollment: one of the 4 utterances of the user's PIN from the 1st session
    • Test: one of the 4 utterances of the user's PIN from the 2nd session
  • Non-target trials: 150 x 149 / 2 = 11,175 (see the count check below)
    • Enrollment: 1st utterance of the user's PIN from the 1st session
    • Test: 1st PIN utterance from the 1st session of each of the other users, avoiding symmetric matches
  • Enrollment and test always with a single utterance
  • Lexical content is the same in enrollment and test for target trials, but usually different for non-target trials
  • Text-prompted simulation: the utterance spoken is always the utterance expected by the system
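A quick sanity check of the trial counts under both protocols (pure combinatorics, no audio involved):

```python
from itertools import combinations

# YOHO: 138 speakers, 40 test utterances each
yoho_target = 138 * 40        # 5,520 target trials
yoho_nontarget = 138 * 137    # 18,906 (one utterance per impostor speaker)

# BioSec Baseline core protocol: 150 speakers
biosec_target = 150 * 4 * 4   # 2,400 (4 enrollment x 4 test utterances)
# one non-target trial per unordered speaker pair (symmetric matches avoided)
biosec_nontarget = sum(1 for _ in combinations(range(150), 2))  # 11,175

print(yoho_target, yoho_nontarget, biosec_target, biosec_nontarget)
```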
4.3. Results with YOHO
• DET curves and EERs (%) comparing
  • Baum-Welch re-estimation vs. MLLR adaptation
  • Different amounts of enrollment material: 6, 24 or 96 utterances
• MLLR adaptation provides better performance for all conditions
• With 96 enrollment utterances the EER is below 1%; with 24 it is about 2%; with 6 it is below 5% (EER computation sketched below)
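For reference, a minimal sketch of how the quoted EERs can be computed from raw target and non-target score lists (not the evaluation code used for the paper):

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false-acceptance
    rate (impostors accepted) equals the false-rejection rate."""
    target_scores = np.asarray(target_scores)
    nontarget_scores = np.asarray(nontarget_scores)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))         # threshold where FAR and FRR cross
    return 100.0 * (far[i] + frr[i]) / 2.0   # EER in %
```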
4.3. Results with BioSec Baseline (i)
• Spanish, head-mounted close-talking microphone, MLLR adaptation
• Enrolling with a single utterance in BioSec gives better results (1.7% EER) than enrolling with 24 utterances in YOHO (2.1%)
  • Probably due to the lexical match between enrollment and verification in target trials
4.3. Results with BioSec Baseline (ii)
• English, webcam distant microphone, MLLR adaptation
• Results for English with the distant webcam microphone are almost an order of magnitude worse!
• Possible causes:
  • Channel variation
  • Non-native speakers
4.3. Results with BioSec Baseline (iii)
• New results, not in the paper
• English, head-mounted close-talking microphone, MLLR adaptation
• Results for English with the close-talking microphone (2.2% EER) are only slightly worse than for Spanish (1.7%)
  • Possibly due to non-native speakers
• Hence the main reason for the poor results with the distant microphone is the channel
4.3. Results with BioSec Baseline (iv)
• New results, not in the paper
• Spanish, webcam distant microphone, MLLR adaptation
• For Spanish with the distant microphone, results are again much worse than with the close-talking microphone
• Huge impact of the channel on speaker recognition performance
  • No channel robustness techniques were used, only CMN (sketched below)
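CMN itself is a one-line operation on the feature matrix; a sketch matching the (39, n_frames) layout assumed in the parameterization sketch earlier:

```python
import numpy as np

def cmn(features):
    """Cepstral mean normalization: subtract each coefficient's mean over
    the utterance, compensating stationary convolutive channel effects."""
    return features - features.mean(axis=1, keepdims=True)
```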
5. Conclusions
• We have studied BioSec Baseline as a benchmark for text-dependent speaker recognition
• We have tried to facilitate comparison of results between YOHO and BioSec Baseline by evaluating the same method on both corpora
• For the close-talking microphone, results on BioSec are much better than results on YOHO
  • Probably due to the lexical match between enrollment and verification
• For the distant webcam microphone, results on BioSec are much worse than results on YOHO
  • Due to channel variation
  • No channel robustness techniques were used (only CMN)