Automatic Speaker Recognition In Forensic Environment
International Organization on Computer Evidence Conference
Hirotaka Nakasone, Ph.D.
Federal Bureau of Investigation
OUTLINE
• BACKGROUND
• DESCRIPTION OF FASR
• FASR SYSTEM CAPABILITIES
• SUMMARY: Current Status and Future Plans
BACKGROUND
Forensic ASR Problems
• Every month, the FBI receives numerous criminal cases involving recorded voice samples.
• Most voice samples are recorded in uncontrolled environments, and there are many unknown sources of variability.
• Four primary sources of voice sample variation of interest to the forensic community:
  - Speaker source characteristics
  - Phonetic characteristics
  - Transmission channel characteristics
  - Equipment characteristics
BACKGROUND
• FASR Prototype System developed jointly by a team including:
  - U.S. Air Force Research Laboratory, Rome, NY
  - BAE Systems, Austin, TX
  - Massachusetts Institute of Technology Lincoln Laboratory
  - Federal Bureau of Investigation
• History of the FASR development effort at FAVIAU:
  - 1997: Completed the FBI's voice database for ASR technology assessment
  - 1998: Launched the ASR technology assessment
  - 1999: Completed the technology assessment; formed a multiagency team project
  - 2000: Took delivery of the prototype FASR
  - 2001: Prototype performance refinement continued
  - 2002: Prototype performance enhancement effort in progress
BACKGROUND
Minimum Core Technologies Sought
• Text-Independence
• Channel-Independence
• Language-Independence
• Decisions with Confidence Measures
• Known Error Rates
Text-Independence
[Figure: spectrogram of "Combattype.wav" (time 0 to 2.4 s, frequency 0 to 4 kHz) with the formants F1, F2, and F3 marked.]
SPEAKER RECOGNITION BY SPECTROGRAMS
Text-Dependent Approach
[Figure: paired spectrograms (panels 0m/0t, 1m/1t, 3m/3t, 4m/4t) of matching utterances, compared visually for the text-dependent approach.]
Text-Independent Scenario: Spectrographic Approach Not Feasible
[Figure: annotated spectrograms of two different utterances: "There is a bomb in Centennial ..." and "I spotted a bag lying on the ground under the bench."]
Periodogram of Simultaneously Recorded Voice Samples
Transmission Modes
• Body Transmitter: electret microphone plus AM transmitter; nominal bandwidth 300 Hz - 3600 Hz.
• Microphone: B&K Model 4155; nominal bandwidth 20 Hz - 8000 Hz.
• In-House Telephone: nominal bandwidth 300 Hz - 3600 Hz.
• Remote Telephone (no periodogram): nominal bandwidth 300 Hz - 3600 Hz.
BACKGROUND
[Diagram: for each level, the samples from two speakers are processed and compared to reach a decision. Text-dependent levels compare the same utterance ("I want you to pay me now." from both speakers); text-independent levels compare different utterances ("I want you to pay me now." versus "I never said that to him. Never.").]
• Level I: Text Independent, Transmission Independent
• Level II: Text Dependent, Transmission Independent
• Level III: Text Independent, Transmission Dependent
• Level IV: Text Dependent, Transmission Dependent
DESCRIPTION OF FASR
FASR Prototype System Implemented at FBI/FAVIAU
• A prototype developed jointly by the U.S. Air Force Research Laboratory, Rome, NY, and FBI/FAVIAU, with technical inputs from MIT Lincoln Laboratory and BAE Systems
• FASR uses robust speaker recognition algorithms:
  - Mel cepstral coefficients with delta and delta-delta features
  - Cepstral mean subtraction or RASTA filtering
  - Gaussian Mixture Models with Universal Background Models
• FASR is a PC-based workstation on a LAN with an efficient graphical user interface supporting:
  - Data acquisition and playback
  - Signal and spectrographic display
  - Speech enhancement
  - Speech segmentation and labeling
  - Tone detection and removal
  - Speech quality measures (SNR, duration, bandwidth)
  - Speaker identification and verification
  - Universal Background Model (UBM) generation
  - Automated computation of confidence measures for each UBM
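The cepstral mean subtraction step listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the FASR implementation; the function names and the simple first-difference delta are illustrative choices.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance mean of each cepstral coefficient.

    cepstra: array of shape (num_frames, num_coeffs).
    A fixed (convolutional) channel appears as an additive constant in the
    cepstral domain, so subtracting the utterance mean suppresses it.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def append_deltas(cepstra):
    """Append first-order frame-to-frame differences (delta cepstra)."""
    deltas = np.diff(cepstra, axis=0, prepend=cepstra[:1])
    return np.hstack([cepstra, deltas])
```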
FASR System Description
• Input Files: analog or digital form, usually in a PCM/WAV format
• Speech Data: digitized at 8 kHz sampling rate or higher, 16-bit resolution
• Processing:
  - Automatic detection/removal of tones, clicks, and pops
  - Speech quality check: S/N > 10 dB, signal duration > 10 s, usable frequency bandwidth > 3 kHz
• Feature Extraction:
  - Mel-scaled cepstral coefficients and delta cepstra, computed after silence is removed
  - Channel normalization: cepstral mean subtraction is used to reduce the effects of transmission
• Decision:
  - GMM with Universal Background Model
  - Confidence measure for each decision, based on the true and false probability density functions
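The speech quality check can be read as a simple gate applied before enrollment or testing. A minimal sketch follows, assuming the SNR and usable-bandwidth estimates come from separate estimators (not shown); the thresholds are the ones quoted on the slide, and the function name is hypothetical.

```python
# Thresholds quoted on the slide: S/N > 10 dB, duration > 10 s, usable bandwidth > 3 kHz.
MIN_SNR_DB = 10.0
MIN_DURATION_S = 10.0
MIN_BANDWIDTH_HZ = 3000.0

def passes_quality_check(signal, sample_rate, snr_db, usable_bandwidth_hz):
    """Return True if a sample meets the minimum quality criteria.

    snr_db and usable_bandwidth_hz are assumed to be estimated elsewhere;
    only the gating logic is sketched here.
    """
    duration_s = len(signal) / float(sample_rate)
    return (snr_db > MIN_SNR_DB
            and duration_s > MIN_DURATION_S
            and usable_bandwidth_hz > MIN_BANDWIDTH_HZ)
```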
• Input speech is a continuous evolution of the vocal tract; the waveform must be transformed
• Transformed into a 3-D (time-frequency-amplitude) spectrogram using a 20 ms window slid in 10 ms steps
• FFT of each windowed frame
• Perceptually constructed filterbank on the mel scale: filters linearly spaced up to 1 kHz, logarithmically spaced above 1 kHz
• Cosine transform of the log filterbank outputs yields the cepstral feature vectors
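The chain above (20 ms window, 10 ms slide, FFT, mel-scaled filterbank, cosine transform) is essentially a mel-cepstral front end. A compact sketch is shown below; the FFT size, filter count, and coefficient count are illustrative assumptions, not FASR's actual settings.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(num_filters, nfft, sample_rate):
    """Triangular filters spaced on the mel scale (roughly linear below
    1 kHz, logarithmic above), mapped onto FFT bins."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=8000, num_filters=24, num_ceps=13, nfft=512):
    """20 ms window, 10 ms hop -> power spectrum -> mel filterbank -> log -> DCT."""
    win, hop = int(0.020 * sample_rate), int(0.010 * sample_rate)
    frames = [signal[i:i + win] * np.hamming(win)
              for i in range(0, len(signal) - win, hop)]
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2
    log_energy = np.log(power @ mel_filterbank(num_filters, nfft, sample_rate).T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :num_ceps]
```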
FASR System Description
• Enrollment: speech samples from each known speaker (with their channel effects) go through feature extraction and the GMM classifier to build speaker models.
• Testing: the speech sample from the unknown speaker (with its channel effects) goes through the same feature extraction and GMM classifier and is compared with the speaker models and with the background models.
• A log likelihood ratio test on the two comparisons produces the GMM score.
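A sketch of this enrollment/scoring flow, using scikit-learn's GaussianMixture as a stand-in for the FASR GMM implementation. The mixture count is arbitrary, and speaker models are trained directly here rather than adapted from the UBM, which is a simplification.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, num_mixtures=64):
    """Fit a diagonal-covariance GMM to a matrix of feature vectors
    (num_frames x num_coeffs)."""
    gmm = GaussianMixture(n_components=num_mixtures, covariance_type="diag",
                          max_iter=200, random_state=0)
    gmm.fit(features)
    return gmm

# Enrollment (illustrative): a Universal Background Model from pooled
# background speech, and one model per known speaker.
# ubm = train_gmm(np.vstack(background_feature_sets))
# speaker_models = {name: train_gmm(feats) for name, feats in enrollment.items()}

def llr_score(test_features, speaker_gmm, ubm):
    """Average per-frame log likelihood ratio: speaker model vs. UBM.
    GaussianMixture.score returns the mean log-likelihood per frame."""
    return speaker_gmm.score(test_features) - ubm.score(test_features)
```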
Speaker Identification: ranking by the LLRT/GMM scores
Speaker Verification: binary decision by LLRT/GMM scores
[Diagram: the known speakers (John, Peter, Mike, Jack, Bruce) are scored against the unknown voice and ranked to answer "Whose voice is closest to the unknown speaker?" For verification, Peter's GMM score against the unknown voice is 0.900; since 0.900 exceeds the preset threshold q = 0.500, it is determined that the unknown voice belongs to Peter.]
A remaining question: "How certain are we in our conclusion that the two voices are the same?"
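Both decision modes on this slide reduce to a few lines once the LLRT/GMM scores are available. The per-speaker scores below are illustrative only (apart from Peter's 0.900 and the 0.500 threshold, which are the slide's example); the exact values in the diagram are not fully recoverable.

```python
# Illustrative LLRT/GMM scores of the known speakers against the unknown sample.
scores = {"Peter": 0.900, "Bruce": 0.800, "John": 0.005, "Jack": 0.000, "Mike": -0.105}

# Speaker identification: rank known speakers; the highest score is closest.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print("Closest to the unknown speaker:", ranked[0][0])

# Speaker verification: binary decision against a preset threshold.
threshold = 0.500
claimed_speaker = "Peter"
accept = scores[claimed_speaker] > threshold
print("Accept that the unknown voice is %s: %s" % (claimed_speaker, accept))
```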
Confidence Measures
For a given GMM score, find the confidence in a "True" decision based on sample true/false score populations.
[Figure, top: probability density functions of the true and false scores, with the preset threshold q = 0.500 and the test score (GMM score = 0.900) marked. Figure, bottom: the confidence curve P(Ht|x) over GMM output scores; at the test score the confidence value is 92%.]
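One way to turn a GMM score into the confidence value described here is to model the sample true-score and false-score populations and compute a posterior. The sketch below assumes Gaussian fits and equal priors, which the slide does not specify; it will not reproduce the 92% figure exactly.

```python
import numpy as np
from scipy.stats import norm

def confidence_of_true(test_score, true_scores, false_scores, prior_true=0.5):
    """Posterior P(true-speaker hypothesis | score), with each score
    population modeled by a fitted Gaussian density (a simplifying
    assumption; the slide only says the measure is based on the true and
    false probability density functions)."""
    p_true = norm.pdf(test_score, loc=np.mean(true_scores), scale=np.std(true_scores))
    p_false = norm.pdf(test_score, loc=np.mean(false_scores), scale=np.std(false_scores))
    numerator = prior_true * p_true
    return numerator / (numerator + (1.0 - prior_true) * p_false)

# Toy usage with synthetic score populations and the slide's test score of 0.900.
rng = np.random.default_rng(0)
true_scores = rng.normal(0.8, 0.3, size=1000)
false_scores = rng.normal(0.0, 0.3, size=1000)
print(confidence_of_true(0.900, true_scores, false_scores))
```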
In-house Studies on the Effects of Signal Quality on Confidence Measures
• Speech/Recording Quality (in progress)
• Speech Duration (completed)
• Signal-to-Noise Ratio (in progress)
• Speech Frequency Bandwidth (TBD)
Studies on the effects of laughs, distortions, tones, and multiple speakers
• Voice database: 1999 NIST, 225 male speakers
• Tones: detected automatically; removed either automatically or manually by the experimenter
• Laughs: perceptually determined and removed by the experimenter
• Multiple speakers: all voices other than the target voice manually marked and deleted
• Distortions: non-linear distortion (overdriven signal) and loud transient noise, manually removed when possible
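The slide does not say how the automatic tone detection works; the sketch below shows one plausible approach (finding a dominant narrowband peak in the spectrum and notching it out), purely as an illustration of the kind of preprocessing involved.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

def remove_dominant_tone(signal, sample_rate, peak_ratio=20.0, q_factor=30.0):
    """Detect one strong narrowband component and remove it with a notch
    filter; return the signal unchanged if no dominant tone is found."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    peak_bin = int(np.argmax(spectrum[1:])) + 1           # skip the DC bin
    if spectrum[peak_bin] < peak_ratio * np.median(spectrum):
        return signal                                     # nothing tone-like found
    b, a = iirnotch(freqs[peak_bin], Q=q_factor, fs=sample_rate)
    return filtfilt(b, a, signal)
```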
Exploratory Studies with a Bilingual Database
- Toward Language-Independent Speaker Recognition -
Description of the database:
• Languages: English and Spanish
• Bilingual speakers: 43 male, 30 female
• Mode of speech: reading and spontaneous (fill-in)
• Recording sessions: one session only
• Recording channel: high-fidelity microphone, digitized at 44.1 kHz / 16-bit
Acknowledgement: this bilingual database was furnished by AFRL, Rome, NY, in the fall of 2001.
ID Performance in % as a Function of Duration
(columns: training set duration in seconds; rows: test set duration in seconds)

Test\Train    16    14    12    10     8     6     4     2     1   0.5
16           100   100  99.2  98.8  99.2  97.7  91.9  74.4  43.8  20.5
14           100   100  99.2  99.2  97.7  96.1  90.7  76.0  42.6  19.8
12           100   100  98.8  96.5  95.0  95.0  87.6  70.5  37.2  17.8
10          99.2  98.8  98.1  97.7  93.8  91.5  84.9  69.0  39.5  16.7
8           96.1  94.6  93.4  91.9  88.8  86.1  77.5  61.2  36.1  18.2
6           96.1  93.4  91.1  89.9  86.8  83.3  75.2  60.1  31.0  14.3
4           87.6  86.8  83.7  81.8  78.7  75.2  68.6  52.3  25.2  12.4
2           75.2  72.9  70.9  68.6  64.3  64.3  58.5  47.7  20.5  12.0
1           50.4  47.7  47.3  46.5  45.0  45.4  43.0  43.4  12.8   5.4
0.5         25.6  24.0  24.8  24.0  23.6  22.9  21.3  22.1   7.0   3.9
Table 3 – EER Performance (%) as a Function of Duration
(columns: training set duration in seconds; rows: test set duration in seconds)

Test\Train    16    14    12    10     8     6     4     2     1   0.5
16           3.9   3.9   4.2   5.1   5.4   7.6  10.1  18.6  28.7  39.1
14           5.2   5.7   5.5   6.6   7.8   9.6  11.6  19.6  29.4  40.3
12           8.3   8.1   9.1   8.7  10.0  13.1  14.0  20.5  31.8  42.8
10           9.3   9.3  10.3  10.2  12.7  14.2  15.6  23.8  32.6  43.8
8           14.3  15.1  16.1  16.5  17.8  19.2  22.6  26.2  35.3  44.7
6           19.8  19.4  20.6  21.3  22.5  23.3  23.6  27.3  37.2  46.1
4           27.4  28.3  27.9  29.8  29.6  29.8  29.5  30.6  39.8  46.6
2           18.4  38.6  38.4  36.4  37.1  37.0  35.3  33.6  42.6  48.5
1           44.6  45.1  45.3  44.1  45.0  43.7  42.2  39.3  46.1  49.6
0.5         48.5  48.7  48.4  48.6  48.1  47.3  47.8  45.1  50.0  50.8
[Figure: score distributions for variable test, fixed training durations: a) TRN=16 s, TST=16 s; b) TRN=16 s, TST=4 s; c) TRN=16 s, TST=2 s; d) TRN=16 s, TST=1 s.]
[Figure: score distributions for variable training, fixed test durations: a) TRN=16 s, TST=16 s; b) TRN=6 s, TST=16 s; c) TRN=4 s, TST=16 s; d) TRN=1 s, TST=16 s.]
[Figure: two sets of panels: a) TST=16 s, b) TST=4 s, c) TST=2 s, d) TST=1 s at TRN=16 s; and a) TRN=16 s, b) TRN=6 s, c) TRN=4 s, d) TRN=1 s at TST=16 s.]
Detection Error Tradeoff (DET) Curve
[Figure: probability of miss (%) versus probability of false acceptance (%); curves toward the upper right indicate poorer performance, curves toward the lower left better performance.]
Detection Error Tradeoff (DET) Curve
[Figure: DET curves (probability of miss (%) versus probability of false acceptance (%)) for test set durations of 16, 14, 12, 10, 8, 6, 4, 2, 1, and 0.5 s at a fixed training duration of 16 s; the 10 s, 6 s, and 4 s curves are labeled.]
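The DET curves and the EER figures in the next slide's criteria come from sweeping a decision threshold over same-speaker (target) and different-speaker (impostor) score sets. A minimal sketch, assuming the two score sets are already available as arrays:

```python
import numpy as np

def det_points(target_scores, impostor_scores, num_thresholds=200):
    """Sweep a threshold and return (false acceptance %, miss %) pairs,
    i.e. the points of a DET curve."""
    target = np.asarray(target_scores)
    impostor = np.asarray(impostor_scores)
    thresholds = np.linspace(impostor.min(), target.max(), num_thresholds)
    miss = np.array([np.mean(target < t) * 100.0 for t in thresholds])
    false_accept = np.array([np.mean(impostor >= t) * 100.0 for t in thresholds])
    return false_accept, miss

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the operating point where the miss rate and the
    false acceptance rate are (closest to) equal."""
    false_accept, miss = det_points(target_scores, impostor_scores)
    i = int(np.argmin(np.abs(false_accept - miss)))
    return (false_accept[i] + miss[i]) / 2.0
```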
Acceptable FASR Performance - 12/2000
Open-Set Speaker Verification
• Level I: EER <= 2.0%, T/M, s/s, 30 sec
• Level II: EER <= 2.0%, T/M, p/p, 3 sec
• Level III: EER <= 1.0%, T/T, s/s, 30 sec
• Level IV: EER <= 0.5%, T/T, p/p, 3 sec
Legend:
• EER: Equal Error Rate, the point where the false acceptance (false identification) rate equals the false rejection rate
• T: recording by telephone systems
• M: recording by hi-fi microphone systems
• S: spontaneous speech samples
• P: prescribed speech samples
Summary
Current Status
• The FBI is using a PC-based forensic automatic speaker recognition (FASR) system. Turnaround time is better than with the traditional spectrographic method, but the system does not yet operate in real time; it is used for forensic post-processing only.
• FASR has been extensively tested on the NIST single-speaker and FBI forensic voice databases.
• Confidence measures are computed from a single feature space.
• Language types are limited primarily to English, plus a small set of Farsi.
Future Plans
• Collect larger databases, including non-English languages and bi- or multilingual speakers; a jointly funded effort with the Technology Support Working Group is in progress.
• Improve existing channel normalization techniques and find new approaches.
• Integrate automatic or manual pre-screening procedures based upon quantifiable signal quality measures.
• Compute confidence measures from a multi-feature space.
• Provide a no-decision rule for cases where the signal quality does not meet predefined conditions, as a safeguard against potential abuse and misuse of the technology.
• Address procedures for creating and selecting an optimum Universal Background Model.