Speaker Recognition Research in Joensuu

Speech and Image Processing Unit (SIPU), http://cs.joensuu.fi/sipu/
Puheteknologian talviseminaari (Speech Technology Winter Seminar)
Pasi Fränti
Joensuu, 10.3.2006

Goals for PUMS season 3 (1/2)
• Usability of automatic speaker identification in forensic applications
• Compatibility with large databases
• Automation of LTAS computation and fusion with MFCC
• Voice activity detection

Goals for PUMS season 3 (2/2)
• Speaker verification in a real (noisy) environment
• Prototype for access control
• Solving the technical requirements for the elevator prototype
• Usability for detecting sound sources in general
• Keyword search (using HTK or Lingsoft Recognizer)

Research Group PUMS personnel Pasi Fränti Professor Ilja Sidoroff Marko Tuononen, BSc Rosa Gonzalez-Hautamäki, MSc Doctoral researchers Collaborators Juhani Saastamoinen, PhLic Ismo Kärkkäinen, MSc Ville Hautamäki, MSc Tomi Kinnunen, PhD (Singapore) Victoria Yanulevskaya Evgeny Karpov, MSc (NRC)

1. Applicability to forensic applications
• A study of automatic speaker recognition has been completed.
• Results are not reported separately; follow-up actions are taken within tasks 3 and 4.
• Material can be found in Kinnunen’s PhD thesis [4] and Niemi-Laitinen’s presentation.

2. Support for large databases - Not yet done -

3. LTAS and other features
• Automatic calculation of the long-term average spectrum (LTAS) is done. Integration into WinSprofiler and reporting are in progress.
• The benefit of LTAS is only its speed and ease of use: it has no difficult control parameters.
• No additional benefit to recognition accuracy: MFCC includes the same information.
• Could be used for preliminary pruning of large datasets.
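As a rough illustration of the LTAS idea and of the pruning use mentioned above, the following sketch averages frame-wise magnitude spectra and ranks enrolled speakers by LTAS distance. It is a minimal sketch under stated assumptions, not the SIPU implementation: the frame length, the Hamming window, the dB-scale Euclidean distance, and all function names are hypothetical choices, and a naive DFT is used so that only the standard library is needed.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of a naive DFT (bins 0..N/2) of a Hamming-windowed frame."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
                for k, s in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * b * t / n)
                    for t, x in enumerate(windowed)))
            for b in range(n // 2 + 1)]


def ltas(samples, frame_len=64, hop=32):
    """Long-term average spectrum: mean magnitude spectrum over all frames."""
    starts = range(0, len(samples) - frame_len + 1, hop)
    spectra = [magnitude_spectrum(samples[i:i + frame_len]) for i in starts]
    if not spectra:
        raise ValueError("signal is shorter than one frame")
    return [sum(column) / len(spectra) for column in zip(*spectra)]


def ltas_distance(a, b, eps=1e-12):
    """Euclidean distance between two LTAS vectors on a dB scale (assumed metric)."""
    return math.sqrt(sum((20 * math.log10(x + eps) - 20 * math.log10(y + eps)) ** 2
                         for x, y in zip(a, b)))


def prune_candidates(query_ltas, speaker_ltas, keep):
    """Return the names of the `keep` speakers whose LTAS is closest to the query."""
    ranked = sorted(speaker_ltas,
                    key=lambda name: ltas_distance(query_ltas, speaker_ltas[name]))
    return ranked[:keep]


def tone(dft_bin, length=640, frame_len=64):
    """Synthetic test signal: a sine centred on one DFT bin of the frame."""
    return [math.sin(2 * math.pi * dft_bin * t / frame_len) for t in range(length)]


if __name__ == "__main__":
    database = {"low": ltas(tone(8)), "high": ltas(tone(20))}
    query = ltas([0.5 * s for s in tone(8)])  # same "speaker", quieter recording
    print(prune_candidates(query, database, keep=1))
```

Only the shortlisted speakers would then be scored with the slower MFCC-based matcher, which is where the speed advantage of LTAS would pay off.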

Noise robustness of F0 feature
• Results reported in [3, 5].

4. Voice activity detection
• Software for speech segmentation (VoiceGrep).
• Command-line version for Linux.
• Windows version in WinSprofiler.
• Testing done in the SIPU laboratory.
• Labtec® pc mic 333, 44.1 kHz.
• Recordings were amplified by 24 dB with the Audacity audio editor.
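The slides do not describe how VoiceGrep decides what is speech. As a generic point of reference only, a simple energy-threshold detector can be sketched as follows; the threshold, frame length, minimum run length, and function names are all hypothetical. The helper `db_to_gain` shows what a 24 dB amplification means as a linear factor.

```python
import math


def db_to_gain(db):
    """Convert an amplitude gain in dB to a linear factor (24 dB is about 15.8x)."""
    return 10 ** (db / 20)


def frame_energies_db(samples, frame_len):
    """Mean power of each non-overlapping frame, in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        energies.append(10 * math.log10(power + 1e-12))
    return energies


def detect_segments(samples, frame_len=160, threshold_db=-30.0, min_frames=3):
    """Return (start_sample, end_sample) pairs for runs of loud frames."""
    segments = []
    run_start = None
    energies = frame_energies_db(samples, frame_len)
    # The trailing -inf acts as a sentinel that closes a run reaching the end.
    for idx, energy in enumerate(energies + [float("-inf")]):
        if energy > threshold_db:
            if run_start is None:
                run_start = idx
        elif run_start is not None:
            if idx - run_start >= min_frames:
                segments.append((run_start * frame_len, idx * frame_len))
            run_start = None
    return segments


if __name__ == "__main__":
    silence = [0.0] * 800
    burst = [0.5 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(1600)]
    print(detect_segments(silence + burst + silence))
    print(round(db_to_gain(24), 2))
```

A detector of this kind reacts to any loud sound, so doors, footsteps, or running water would pass the threshold just as speech does.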

4a. Test material and results
• Material:
  • 4 hours in total.
  • Poor-quality recordings: 11 bits of data, of which 4-5 bits are information and the rest noise.
• VoiceGrep made 168 detections:
  • 56 speech (33%)
  • 112 non-speech (67%)
• Material included 71 real speech segments:
  • Average segment length 16 s.
  • VoiceGrep found 25 of these (35%).
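The percentages on this slide follow directly from the reported counts. The short calculation below reproduces them; the terms "detection precision" and "segment recall" are labels chosen here for illustration and do not appear in the slides.

```python
def rate(part, whole):
    """Fraction of `whole` accounted for by `part`."""
    return part / whole


detections = 168          # all VoiceGrep detections
speech_detections = 56    # detections that were actually speech
true_segments = 71        # real speech segments in the material
found_segments = 25       # real segments that VoiceGrep found

detection_precision = rate(speech_detections, detections)
false_detection_rate = rate(detections - speech_detections, detections)
segment_recall = rate(found_segments, true_segments)

print(f"detection precision:  {detection_precision:.1%}")   # 33.3%
print(f"false detection rate: {false_detection_rate:.1%}")  # 66.7%
print(f"segment recall:       {segment_recall:.1%}")        # 35.2%
```

The two rates are counted over different units (detections versus reference segments), which is why 56 speech detections correspond to only 25 found segments.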

4b. VoiceGrep overall results

4c. VoiceGrep example (correct detection)
• Start of the speech is detected correctly.
• End of the speech is missed.

4d. VoiceGrep example (false detections)
• Door opening
• Running water
• Walking
• Door

4e. VoiceGrep example (missed speech segment)
• Door
• Door
• Speech and walking

4f. Entire data set (4 hours)
[Figure: data, speech segments, and result of VoiceGrep]

5. Speaker verification in noisy environment
• Systematic testing of the parameters that affect accuracy has been reported in [1].
• Applicability of speaker verification in a real environment has been reported in [2] and in Kinnunen’s PhD thesis [4].
• Additional testing will be done if time permits.

5a. Text-dependent verification in access control
• Utilizing time series information improves recognition.
• Best result if everyone has their own password.
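The slide does not say how the time series information was used. One common way to compare a spoken password against an enrolled template while respecting the order of frames is dynamic time warping (DTW). The sketch below is a generic example of that technique, not the method evaluated in [2]; the feature vectors, the threshold, and the function names are hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best monotonic alignment of two sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def verify_password(utterance, template, threshold):
    """Accept when the length-normalized DTW cost is within the threshold."""
    normalized = dtw_distance(utterance, template) / (len(utterance) + len(template))
    return normalized <= threshold


if __name__ == "__main__":
    template = [[0.0], [1.0], [2.0]]
    slower = [[0.0], [1.0], [1.0], [2.0]]    # same content, spoken more slowly
    reversed_order = [[2.0], [1.0], [0.0]]   # same frames, wrong order
    print(verify_password(slower, template, threshold=0.1))
    print(verify_password(reversed_order, template, threshold=0.1))
```

An order-free comparison would treat the reversed sequence as identical to the template, which illustrates why sequence information can help when each user has a personal password.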

6. Prototype for access control
• Emergency button
• Microphone
• Motion detector

7. Calling elevator (technical requirements)
• Communication with the OPC server:
  • Implemented with the Matrikon server.
• Program logic for the elevator implemented:
  • Reads variables from the OPC server.
  • Interprets and shows elevator status.
  • Includes recording logic.
• Speaker and voice related functionality:
  • Not yet implemented.
  • Main window does not show anything yet.

8. Usability for detecting sound sources in general - Not yet done -

9. Keyword search - Not yet done -

Publications (season 3)
[1] J. Saastamoinen, Z. Fiedler, T. Kinnunen and P. Fränti, "On factors affecting MFCC-based speaker recognition accuracy", Int. Conf. on Speech and Computer (SPECOM'05), Patras, Greece, 503-506, October 2005.
[2] H. Gupta, V. Hautamäki, T. Kinnunen and P. Fränti, "Field evaluation of text-dependent speaker recognition in an access control application", Int. Conf. on Speech and Computer (SPECOM'05), Patras, Greece, 551-554, October 2005.
[3] T. Kinnunen and R. Gonzalez-Hautamäki, "Long-Term F0 Modeling for Text-Independent Speaker Recognition", Int. Conf. on Speech and Computer (SPECOM'05), Patras, Greece, 567-570, October 2005.

Theses (season 3)
[4] T. Kinnunen, "Optimizing Spectral Feature Based Text-Independent Speaker Recognition", PhD thesis, University of Joensuu, June 2005.
[5] R. Gonzalez-Hautamäki, "Fundamental Frequency Estimation and Modeling for Speaker Recognition", MSc thesis, University of Joensuu, July 2005.

Application scenarios
• Speaker recognition covers two tasks: speaker identification and speaker verification.
• Identification: "Whose voice is this?"
• Verification: "Is this Bob’s voice?" A voice sample plus an identity claim is either accepted or rejected as an impostor.
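The difference between the two tasks can be shown with a toy example: identification picks the closest enrolled model, while verification compares the sample only against the claimed model and applies a threshold. This is a minimal sketch; the two-dimensional "voice models", the Euclidean distance, and the threshold value are hypothetical stand-ins for real speaker models and scores.

```python
import math


def distance(a, b):
    """Euclidean distance between a sample and a speaker model (toy score)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(sample, models):
    """Identification: whose voice is this? Returns the closest enrolled name."""
    return min(models, key=lambda name: distance(sample, models[name]))


def verify(sample, claimed_name, models, threshold):
    """Verification: is this the claimed speaker? True = accept, False = impostor."""
    if claimed_name not in models:
        return False
    return distance(sample, models[claimed_name]) <= threshold


if __name__ == "__main__":
    models = {"Bob": [1.0, 0.0], "Alice": [0.0, 1.0]}
    sample = [0.9, 0.1]
    print(identify(sample, models))              # Bob
    print(verify(sample, "Bob", models, 0.5))    # True
    print(verify(sample, "Alice", models, 0.5))  # False
```

Identification cost grows with the number of enrolled speakers, whereas verification needs only one comparison per claim.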

Software 1: Console program

Software 2: WinSprofiler

Software 3: Symbian
• Port to Symbian OS with the Series 60 UI platform

Software 4: Door SProfiler
• Opening the laboratory door by speaking

Software 5: Lift SProfiler (to appear in season 4, perhaps...)

Future development (1): Software integration
[Diagram labels: keyword search; WinSprofiler, Windows (JoY); Mobile, Series 60 (JoY); DB support; SRLIB: VAD, MSE, F0 extraction, fusion by weighted MSE, VQ, GMM, MFCC, LTAS]
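The diagram mentions VQ, MSE, and fusion by weighted MSE without further detail. One plausible reading, sketched below, is that each feature set is scored by the mean squared error between the test vectors and a speaker's VQ codebook, and the per-feature scores are then combined with weights. The function names, the weights, and the toy data are assumptions made for illustration and are not taken from SRLIB.

```python
def vq_mse(vectors, codebook):
    """Mean squared error between each vector and its nearest codeword."""
    total = 0.0
    for v in vectors:
        total += min(sum((x - c) ** 2 for x, c in zip(v, codeword))
                     for codeword in codebook)
    return total / len(vectors)


def fuse_scores(scores, weights):
    """Weighted average of per-feature distortion scores (lower is better)."""
    weight_sum = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / weight_sum


if __name__ == "__main__":
    # Toy feature vectors and codebooks for two hypothetical feature sets.
    mfcc_score = vq_mse([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 0.0]])
    f0_score = vq_mse([[100.0], [102.0]], [[101.0]])
    print(mfcc_score, f0_score)
    print(fuse_scores({"mfcc": mfcc_score, "f0": f0_score},
                      {"mfcc": 3.0, "f0": 1.0}))
```

In practice the scores of different features live on different scales, so some normalization before weighting would be needed; the sketch leaves that out.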

Future development (2): Applications
[Diagram labels: applications (call center, forensic applications, calling elevator, speech analyzer tool, access control); common speaker recognition application interface; verification, classifier fusion, segmentation, keyword search; srlib, VAD, DB]

Future development (3): Technical development
• Implement and integrate F0, and possibly also formants (F1, F2).
• Automatic voiced/unvoiced segmentation.
• User enrollment.
• Use of sequence information (triplets).
• Develop WinSprofiler towards a voice profiler and speech analyzer tool.

Future development (4): Elevator prototype
[Diagram labels: machine room; lift car & hardware; CAN; GW box; Ethernet TCP/IP; display; microphone; approach detection; our PC; OPC server; OPC client; DCOM; SRLIB 3.0; LiftCaller]

Vision 1: Teleconferencing
[Diagram: participants Alice, Bob, Paul and Minna join over a VPN; speaker recognition labels each voice. Alice is verified and allowed; Minna is not registered and is shown as unknown.]

Vision 2: Call center
• Speech is the main tool for people working in a call center.
• Voice login for personnel.
• Removes the need for manual entry.

Vision 3: Language recognition
• A problem related to speaker recognition: the same research groups usually study both.
• Not trivial to solve.
• Widely studied for Asian languages, even for rare languages that have no written form.

Vision 4: Medical applications
• Doctors use voice to record summaries of patient meetings.
• Access by keyword search.
• Annotation.
• Authentication of the speaker.

Thank you for your patience! Questions?
