
Automatic Speaker Recognition In Forensic Environment


Presentation Transcript


1. Automatic Speaker Recognition In Forensic Environment. International Organization on Computer Evidence Conference. Hirotaka Nakasone, Ph.D., Federal Bureau of Investigation, FBI-ERF, Quantico

2. OUTLINE • BACKGROUND • DESCRIPTION OF FASR • FASR SYSTEM CAPABILITIES • SUMMARY: Current Status and Future Plans

3. BACKGROUND: Forensic ASR Problems
• Every month, the FBI receives numerous criminal cases involving recorded voice samples.
• Most voice samples are recorded in uncontrolled environments, and there are many unknown sources of variability.
• Four primary sources of voice sample variation of interest to the forensic community are: speaker source characteristics, phonetic characteristics, transmission channel characteristics, and equipment characteristics.

4. BACKGROUND
• FASR Prototype System developed jointly by a team including: U.S. Air Force Research Laboratories, Rome, NY; BAE Systems, Austin, TX; Massachusetts Institute of Technology Lincoln Laboratory; Federal Bureau of Investigation
• History of FASR Development Effort at FAVIAU:
- 1997: Completed FBI's Voice Database for ASR Technology Assessment
- 1998: Launched ASR Technology Assessment
- 1999: Completed technology assessment; formed a multi-agency team project
- 2000: Took delivery of the prototype FASR
- 2001: Prototype performance refinement continued
- 2002: Prototype performance enhancement effort in progress

5. BACKGROUND: Minimum Core Technologies Sought
• Text-Independence
• Channel-Independence
• Language-Independence
• Decisions with Confidence Measures
• Known Error Rates

6. Text-Independence [Spectrogram of Combattype.wav: frequency (kHz, 0 to 4) versus time (sec, 0 to 2.4), with formants F1, F2, and F3 marked]

7. SPEAKER RECOGNITION BY SPECTROGRAMS: Text-Dependent Approach [Grid of paired spectrogram panels labeled 0m/0t, 1m/1t, 3m/3t, and 4m/4t]

8. Text-Independent Scenario: the spectrographic approach is not feasible [Spectrograms of two different utterances: "There is a bomb in Centennial" and "I spotted a bag lying on the ground under the bench"]

9. Periodogram of Simultaneously Recorded Voice Samples. Transmission Modes:
• Body Transmitter: electret microphone plus AM transmitter; nominal bandwidth is 300 Hz - 3600 Hz
• Microphone: B&K Model 4155; nominal bandwidth is 20 Hz - 8000 Hz
• In-House Telephone: nominal bandwidth is 300 Hz - 3600 Hz
• Remote Telephone (no periodogram shown): nominal bandwidth is 300 Hz - 3600 Hz
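The transmission modes above are characterized by their nominal bandwidths, which is what the periodograms on this slide visualize. As a hedged illustration only (not the FBI's tooling), the sketch below computes a periodogram with SciPy and estimates the usable bandwidth from it; the file name "sample.wav" and the 99% cumulative-power criterion are assumptions made for the example.

```python
# Sketch: periodogram and rough usable-bandwidth estimate for a recorded voice sample.
# Assumptions: a mono 16-bit WAV file ("sample.wav") and a 99% cumulative-power
# criterion for the bandwidth estimate; neither value comes from the presentation.
import numpy as np
from scipy.io import wavfile
from scipy.signal import periodogram

def usable_bandwidth(wav_path, power_fraction=0.99):
    fs, x = wavfile.read(wav_path)            # sampling rate (Hz) and samples
    x = x.astype(np.float64)
    if x.ndim > 1:                            # collapse stereo to mono if needed
        x = x.mean(axis=1)
    f, pxx = periodogram(x, fs=fs)            # power spectral density estimate
    cumulative = np.cumsum(pxx) / np.sum(pxx)
    f_low = f[np.searchsorted(cumulative, 1.0 - power_fraction)]
    f_high = f[np.searchsorted(cumulative, power_fraction)]
    return f_low, f_high

if __name__ == "__main__":
    lo, hi = usable_bandwidth("sample.wav")   # hypothetical file name
    print(f"Approximate usable bandwidth: {lo:.0f} Hz - {hi:.0f} Hz")
```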

10. BACKGROUND [Diagram of four test conditions, each processing utterances from Speaker 1 and Speaker 2 to a decision: Level I: Text Independent, Transmission Independent; Level II: Text Dependent, Transmission Independent; Level III: Text Independent, Transmission Dependent; Level IV: Text Dependent, Transmission Dependent]

11. DESCRIPTION OF FASR
• FASR Prototype System implemented at FBI/FAVIAU: a prototype developed jointly by U.S. Air Force Research Laboratories, Rome, NY, and FBI/FAVIAU, with technical inputs from MIT Lincoln Laboratory and BAE SYSTEMS
• FASR uses robust speaker recognition algorithms:
- Mel cepstral coefficients plus delta and delta-delta coefficients
- Cepstral mean subtraction or RASTA filtering (see the sketch after this slide)
- Gaussian Mixture Models with Universal Background Models
• FASR is a PC-based workstation on a LAN with an efficient Graphical User Interface supporting:
- Data acquisition and playback
- Signal and spectrographic display
- Speech enhancement
- Speech segmentation and labeling
- Tone detection and removal
- Speech quality measures (SNR, duration, bandwidth)
- Speaker identification and verification
- Universal Background Model (UBM) generation
- Automated computation of confidence measures for each UBM
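The slide lists cepstral mean subtraction (CMS) among the channel-compensation techniques. The following is a minimal sketch of CMS, assuming the features are arranged as a frames-by-coefficients array; it illustrates the general technique, not the FASR implementation.

```python
# Sketch: cepstral mean subtraction (CMS) for channel normalization.
# Assumption: "features" is a 2-D array of shape (num_frames, num_coefficients),
# e.g., the mel cepstral vectors produced later in the pipeline.
import numpy as np

def cepstral_mean_subtraction(features):
    # A stationary (convolutional) channel adds a near-constant offset to every
    # cepstral frame, so subtracting the per-coefficient mean over the utterance
    # removes much of the channel effect.
    mean = features.mean(axis=0, keepdims=True)
    return features - mean
```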

12. FASR System Description
• Input Files: analog form or digital form, usually in a PCM/WAV format
• Speech Data: digitized at 8 kHz sampling rate or higher, 16-bit resolution
• Speech Quality Check: S/N > 10 dB, signal duration > 10 s, usable frequency bandwidth > 3 kHz (a quality-gate sketch follows this slide)
• Processing: automatic detection/removal of tones, clicks, and pops; feature extraction is performed after silence is removed
• Feature Extraction: Mel-scaled cepstral coefficients and delta cepstra; the mean cepstral subtraction technique is used for channel normalization to reduce the effects of transmission
• Decision: GMM with Universal Background Model; a confidence measure of each decision, based on true and false score probability density functions
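Slide 12 gives three acceptance thresholds for the speech quality check (S/N above 10 dB, more than 10 seconds of signal, usable bandwidth above 3 kHz). The sketch below shows how such a gate might look; the function names and the assumption that SNR, duration, and bandwidth are measured upstream are illustrative, not the FASR code.

```python
# Sketch: quality gate applying the thresholds listed on slide 12.
# Assumption: snr_db, duration_s, and usable_bandwidth_hz are measured upstream
# (e.g., by the enhancement / segmentation tools the slides mention).
from dataclasses import dataclass

@dataclass
class QualityReport:
    snr_db: float
    duration_s: float
    usable_bandwidth_hz: float

def passes_quality_check(r: QualityReport,
                         min_snr_db: float = 10.0,
                         min_duration_s: float = 10.0,
                         min_bandwidth_hz: float = 3000.0) -> bool:
    return (r.snr_db > min_snr_db
            and r.duration_s > min_duration_s
            and r.usable_bandwidth_hz > min_bandwidth_hz)

# Example: a clean 30 s sample recorded over a full telephone band.
print(passes_quality_check(QualityReport(snr_db=18.0, duration_s=30.0,
                                         usable_bandwidth_hz=3300.0)))  # True
```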

13. Feature Extraction
• Input speech is a continuous evolution of the vocal tract; the 2-D (time-amplitude) waveform must be transformed into a continuous evolution of the spectrum, a 3-D (time-frequency-amplitude) spectrogram
• Use a 20 ms analysis window, and slide it 10 ms at a time, computing an FFT for each frame
• Apply a perceptually constructed filterbank on the Mel scale: linearly spaced up to 1 kHz, logarithmically spaced above 1 kHz
• A cosine transform of the log filterbank outputs yields the cepstral feature vectors (an extraction sketch follows this slide)
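The feature-extraction recipe on slide 13 (20 ms window, 10 ms slide, FFT, mel-scaled filterbank, cosine transform) corresponds to standard mel-cepstral analysis. The sketch below is one plausible rendering of that recipe; the filter count, FFT size, and number of retained coefficients are assumptions chosen for the example, and the mel warping uses the common 2595*log10(1+f/700) formula, which is approximately linear below 1 kHz and logarithmic above it.

```python
# Sketch: mel-cepstral feature extraction following the recipe on slide 13
# (20 ms window, 10 ms slide, FFT, mel filterbank, cosine transform).
# The filter count, FFT size, and coefficient count are assumptions.
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, nfft, fs):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((num_filters, nfft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_cepstra(signal, fs, num_filters=24, num_ceps=13):
    frame_len = int(0.020 * fs)           # 20 ms analysis window
    hop = int(0.010 * fs)                 # 10 ms slide
    nfft = 512
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame, nfft)) ** 2)   # power spectrum
    fb = mel_filterbank(num_filters, nfft, fs)
    energies = np.dot(np.array(frames), fb.T)
    log_energies = np.log(np.maximum(energies, 1e-10))
    # Cosine transform of the log filterbank outputs -> cepstral feature vectors.
    return dct(log_energies, type=2, axis=1, norm="ortho")[:, :num_ceps]

if __name__ == "__main__":
    fs = 8000
    t = np.arange(fs) / fs
    demo = np.sin(2 * np.pi * 440 * t)    # 1 s synthetic tone as a stand-in
    print(mel_cepstra(demo, fs).shape)    # (num_frames, num_ceps)
```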

14. FASR System Description
• Enrollment: speech samples from each known speaker pass through channel effects, feature extraction, and a GMM classifier to build the speaker models
• Testing: the speech sample from the unknown speaker passes through channel effects and feature extraction, and is compared with the speaker models and with the background models
• A log likelihood ratio test of the two comparisons produces the GMM score (a scoring sketch follows this slide)
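Slide 14 scores the unknown sample against a speaker model and a background model and combines the two with a log likelihood ratio test. The sketch below illustrates that idea using scikit-learn's GaussianMixture as a stand-in for the FASR GMM classifier; the mixture size is an assumption, and training the speaker model independently (rather than adapting it from the UBM, as GMM-UBM systems commonly do) is a simplification to keep the sketch short.

```python
# Sketch: GMM-UBM scoring by log likelihood ratio, as described on slide 14.
# Assumptions: features are 2-D arrays (frames x coefficients), and
# scikit-learn's GaussianMixture stands in for the FASR GMM classifier.
from sklearn.mixture import GaussianMixture

def train_gmm(features, num_components=64):
    gmm = GaussianMixture(n_components=num_components,
                          covariance_type="diag", max_iter=200)
    gmm.fit(features)
    return gmm

def llr_score(test_features, speaker_gmm, ubm):
    # score() returns the average per-frame log likelihood, so the difference
    # is the average log likelihood ratio of "speaker" against "background".
    return speaker_gmm.score(test_features) - ubm.score(test_features)

# Usage (with hypothetical feature matrices):
# ubm = train_gmm(background_features)             # pooled non-target speech
# speaker_model = train_gmm(enrollment_features)   # known speaker's speech
# score = llr_score(unknown_features, speaker_model, ubm)
```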

15. Speaker Identification and Speaker Verification
• Speaker Identification: the known speakers (e.g., John, Peter, Mike, Bruce, Jack) are ranked by their LLRT/GMM scores against the unknown voice to answer "Whose voice is closest to the unknown speaker?"
• Speaker Verification: a binary decision is made by comparing the LLRT/GMM score to a preset threshold q = 0.500; in the example, Peter's GMM score of 0.900 exceeds q, so it is determined that the unknown voice belongs to Peter
• A question is: "How certain are we in our conclusion that the two voices are the same?" (a sketch follows this slide)
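The identification/verification logic on slide 15 reduces to ranking scores and comparing one score to a threshold. The sketch below mirrors that example; only Peter's score of 0.900 and the threshold of 0.500 come from the slide, and the remaining scores are made-up placeholders.

```python
# Sketch: speaker identification (ranking) and verification (thresholding)
# over LLRT/GMM scores, mirroring slide 15. Only Peter's 0.900 and the
# threshold 0.500 appear on the slide; the other scores are placeholders.
def identify(scores):
    # scores: dict of known speaker name -> GMM score against the unknown voice
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def verify(score, threshold=0.500):
    return score > threshold

scores = {"Peter": 0.900, "John": 0.12, "Mike": -0.30, "Bruce": 0.05, "Jack": -0.41}
print(identify(scores))          # Peter ranks first: closest to the unknown voice
print(verify(scores["Peter"]))   # True: 0.900 > 0.500, so attributed to Peter
```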

16. Confidence Measures
• A question is: "How certain are we in our conclusion that the two voices are the same?"
• For a given GMM score, find the confidence in a "True" decision based on a sample True/False score population
• [Upper plot: probability density of the True scores, False scores, and the test score along the GMM score axis (-1 to 2), with the preset threshold q = 0.500 and the unknown-versus-Peter test score of 0.900]
• [Lower plot: confidence curve P(Ht|x) versus GMM output score; at the test score, the confidence value is 92%]
(a confidence-measure sketch follows this slide)
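Slide 16 turns a GMM score into a confidence value using sample True and False score populations. A minimal sketch of that idea follows, assuming each population is summarized by a fitted normal density with equal priors; the FASR may estimate the densities differently, and the score samples in the example are synthetic placeholders.

```python
# Sketch: confidence measure P(Ht | x) from sample True/False score populations,
# following the idea on slide 16. Assumptions: each population is modeled with a
# normal density, equal priors are used, and the sample scores are synthetic.
import numpy as np
from scipy.stats import norm

def confidence(true_scores, false_scores, x):
    p_true = norm.pdf(x, loc=np.mean(true_scores), scale=np.std(true_scores))
    p_false = norm.pdf(x, loc=np.mean(false_scores), scale=np.std(false_scores))
    # Posterior probability of the "True" hypothesis with equal priors.
    return p_true / (p_true + p_false)

rng = np.random.default_rng(0)
true_scores = rng.normal(0.8, 0.3, 500)     # placeholder same-speaker scores
false_scores = rng.normal(-0.2, 0.3, 500)   # placeholder different-speaker scores
print(confidence(true_scores, false_scores, x=0.900))  # confidence at the test score
```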

17. In-house Studies on the Effects of Signal Quality on Confidence Measures
• Speech/Recording Quality (in progress)
• Speech Duration (completed)
• Signal to Noise Ratio (in progress)
• Speech Frequency Bandwidth (TBD)

18. Studies on the effects of laughs, distortions, tones, and multiple speakers
• Voice Database: 1999 NIST, 225 male speakers
• Tones: detected automatically; removed either automatically or manually by the experimenter (a tone-removal sketch follows this slide)
• Laughs: perceptually determined and removed by the experimenter
• Multiple Speakers: all voices other than the target voice manually marked and deleted
• Distortions: non-linear distortion (overdriven signal) and loud transient noise, manually removed when possible
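The slide notes that tones were detected automatically and then removed. One plausible way to do this, shown below as a hedged sketch rather than the FASR procedure, is to look for a dominant narrowband peak in the periodogram and notch it out; the 20 dB prominence margin and the notch Q of 30 are arbitrary assumptions.

```python
# Sketch: detect a dominant narrowband tone and remove it with a notch filter.
# Assumptions: the tone dominates the periodogram by a large margin (the 20 dB
# criterion is arbitrary) and a Q of 30 gives a reasonable notch width; neither
# value comes from the presentation.
import numpy as np
from scipy.signal import periodogram, iirnotch, filtfilt

def remove_dominant_tone(x, fs, margin_db=20.0, q=30.0):
    f, pxx = periodogram(x, fs=fs)
    peak = np.argmax(pxx[1:]) + 1                  # strongest bin, skipping DC
    median_power = np.median(pxx[1:])
    peak_db_over_median = 10.0 * np.log10(pxx[peak] / max(median_power, 1e-20))
    if peak_db_over_median < margin_db:
        return x                                   # no clear tone found
    b, a = iirnotch(w0=f[peak], Q=q, fs=fs)        # notch at the tone frequency
    return filtfilt(b, a, x)

if __name__ == "__main__":
    fs = 8000
    t = np.arange(2 * fs) / fs
    noisy = np.random.default_rng(1).normal(0, 0.1, t.size) + np.sin(2 * np.pi * 1000 * t)
    cleaned = remove_dominant_tone(noisy, fs)      # removes the 1 kHz tone
    print(cleaned.shape)
```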

19. Studies on the effects of laughs, distortions, tones, and multiple speakers

20. Exploratory Studies with Bilingual Database: Toward Language-Independent Speaker Recognition
• Description of Database:
- Languages: English and Spanish
- Bilingual Speakers: 43 male speakers, 30 female speakers
- Mode of Speech: reading and spontaneous (fill-in)
- Recording Session: only one session
- Recording Channel: high-fidelity microphone, digitized at 44.1 kHz / 16 bit
• Acknowledgement: this bilingual database was furnished by AFRL, Rome, NY, in the Fall of 2001.

21. ID Performance in % as a Function of Duration (columns: training set duration in seconds; rows: test set duration in seconds)

Test\Train     16     14     12     10      8      6      4      2      1    0.5
16            100    100   99.2   98.8   99.2   97.7   91.9   74.4   43.8   20.5
14            100    100   99.2   99.2   97.7   96.1   90.7   76     42.6   19.8
12            100    100   98.8   96.5   95     95     87.6   70.5   37.2   17.8
10           99.2   98.8   98.1   97.7   93.8   91.5   84.9   69     39.5   16.7
8            96.1   94.6   93.4   91.9   88.8   86.1   77.5   61.2   36.1   18.2
6            96.1   93.4   91.1   89.9   86.8   83.3   75.2   60.1   31     14.3
4            87.6   86.8   83.7   81.8   78.7   75.2   68.6   52.3   25.2   12.4
2            75.2   72.9   70.9   68.6   64.3   64.3   58.5   47.7   20.5   12
1            50.4   47.7   47.3   46.5   45     45.4   43     43.4   12.8    5.4
0.5          25.6   24     24.8   24     23.6   22.9   21.3   22.1    7      3.9

22. Table 3 – EER Performance (%) as a Function of Duration (columns: training set duration in seconds; rows: test set duration in seconds)

Test\Train     16     14     12     10      8      6      4      2      1    0.5
16            3.9    3.9    4.2    5.1    5.4    7.6   10.1   18.6   28.7   39.1
14            5.2    5.7    5.5    6.6    7.8    9.6   11.6   19.6   29.4   40.3
12            8.3    8.1    9.1    8.7   10     13.1   14     20.5   31.8   42.8
10            9.3    9.3   10.3   10.2   12.7   14.2   15.6   23.8   32.6   43.8
8            14.3   15.1   16.1   16.5   17.8   19.2   22.6   26.2   35.3   44.7
6            19.8   19.4   20.6   21.3   22.5   23.3   23.6   27.3   37.2   46.1
4            27.4   28.3   27.9   29.8   29.6   29.8   29.5   30.6   39.8   46.6
2            18.4   38.6   38.4   36.4   37.1   37     35.3   33.6   42.6   48.5
1            44.6   45.1   45.3   44.1   45     43.7   42.2   39.3   46.1   49.6
0.5          48.5   48.7   48.4   48.6   48.1   47.3   47.8   45.1   50     50.8

23. Score Distributions for Variable Test, Fixed Training Durations [panels: a) TRN=16, TST=16 seconds; b) TRN=16, TST=4 seconds; c) TRN=16, TST=2 seconds; d) TRN=16, TST=1 second]

24. Score Distributions for Variable Training, Fixed Test Durations [panels: a) TRN=16, TST=16 seconds; b) TRN=6, TST=16 seconds; c) TRN=4, TST=16 seconds; d) TRN=1, TST=16 seconds]

25. [Panels for a fixed training duration of TRN=16 sec with test durations a) TST=16 sec, b) TST=4 sec, c) TST=2 sec, d) TST=1 sec, and for a fixed test duration of TST=16 sec with training durations a) TRN=16, b) TRN=6, c) TRN=4, d) TRN=1]

26. Detection Error Tradeoff (DET) Curve [plot: Probability of Miss (%), 0 to 40, on the vertical axis versus Probability of False Acceptance (%), 0 to 60, on the horizontal axis; curves toward the upper right indicate poorer performance, curves toward the lower left indicate better performance] (a DET computation sketch follows this slide)
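The DET curves on slides 26 and 27 plot the probability of miss against the probability of false acceptance as the decision threshold is swept. The sketch below computes those two rates from arrays of same-speaker (target) and different-speaker (impostor) scores; the score arrays in the example are synthetic placeholders, not FASR results.

```python
# Sketch: miss and false-acceptance rates over a threshold sweep, the quantities
# plotted on the DET curves of slides 26-27. The target/impostor score arrays
# below are synthetic placeholders, not FASR results.
import numpy as np

def det_points(target_scores, impostor_scores, num_thresholds=200):
    thresholds = np.linspace(min(impostor_scores.min(), target_scores.min()),
                             max(impostor_scores.max(), target_scores.max()),
                             num_thresholds)
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])          # misses
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])         # false accepts
    return thresholds, p_miss * 100.0, p_fa * 100.0

rng = np.random.default_rng(2)
targets = rng.normal(1.0, 0.5, 1000)      # placeholder same-speaker scores
impostors = rng.normal(-0.5, 0.5, 1000)   # placeholder different-speaker scores
thr, miss_pct, fa_pct = det_points(targets, impostors)
print(miss_pct[:3], fa_pct[:3])
```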

27. Detection Error Tradeoff (DET) Curve [plot of Probability of Miss (%) versus Probability of False Acceptance (%) for test set durations of 16, 14, 12, 10, 8, 6, 4, 2, 1, and 0.5 s at a fixed training duration of 16 s; the curves for shorter test durations (e.g., 4 s) sit higher than those for longer test durations (e.g., 10 s)]

28. Acceptable FASR Performance - 12/2000, Open Set Speaker Verification
• Level I: EER <= 2.0%, T/M, s/s, 30 sec
• Level II: EER <= 2.0%, T/M, p/p, 3 sec
• Level III: EER <= 1.0%, T/T, s/s, 30 sec
• Level IV: EER <= 0.5%, T/T, p/p, 3 sec
Key: EER = Equal Error Rate for false identification and false rejection; T = recording by telephone systems; M = recording by hi-fi microphone systems; S = spontaneous speech samples; P = prescribed speech samples
(an EER computation sketch follows this slide)
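The acceptance criteria on slide 28 are stated as equal error rates, the operating point where the miss rate equals the false-acceptance rate. The sketch below estimates the EER from target and impostor score arrays by sweeping the threshold, as in the DET sketch above; the scores are again synthetic placeholders.

```python
# Sketch: estimate the Equal Error Rate (EER), the point where P(miss) equals
# P(false acceptance), from target and impostor score arrays (synthetic here).
import numpy as np

def equal_error_rate(target_scores, impostor_scores, num_thresholds=1000):
    thresholds = np.linspace(min(impostor_scores.min(), target_scores.min()),
                             max(impostor_scores.max(), target_scores.max()),
                             num_thresholds)
    best_gap, eer = np.inf, None
    for t in thresholds:
        p_miss = (target_scores < t).mean()
        p_fa = (impostor_scores >= t).mean()
        gap = abs(p_miss - p_fa)
        if gap < best_gap:                 # keep the threshold where the rates cross
            best_gap, eer = gap, (p_miss + p_fa) / 2.0
    return eer * 100.0                     # percent

rng = np.random.default_rng(3)
targets = rng.normal(1.0, 0.5, 1000)      # placeholder same-speaker scores
impostors = rng.normal(-0.5, 0.5, 1000)   # placeholder different-speaker scores
print(f"EER approx {equal_error_rate(targets, impostors):.1f}%")
```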

29. Summary
Current Status
• The FBI is using a PC-based forensic automatic speaker recognition (FASR) system. Turnaround time is better than the traditional spectrographic method, but the system does not operate in real time at this time; it is used for forensic post-processing only.
• FASR has been extensively tested on the NIST single-speaker and FBI Forensic Voice databases.
• Confidence measures are computed from a single feature space.
• Language types are limited primarily to English, with a small set of Farsi.
Future Plans
• Collect larger databases including non-English languages and bi- or multilingual speakers; a jointly funded effort is in progress with the Technology Support Working Group.
• Improve on existing channel normalization techniques and find new approaches.
• Integrate automatic or manual pre-screening procedures based upon quantifiable signal quality measures.
• Compute confidence measures from a multi-feature space.
• Provide for a no-decision rule when the signal quality does not meet predefined conditions, as a safeguard against potential abuse and misuse of the technology.
• Address the issue of procedures for creating/selecting an optimum Universal Background Model.
