Speech and Image Processing Unit (SIPU), http://cs.joensuu.fi/sipu/. Speech Technology Winter Seminar (Puheteknologian talviseminaari). Speaker Recognition Research in Joensuu. Pasi Fränti. Joensuu, 10.3.2006.
Goals for PUMS season 3 (1/2) • Usability of automatic speaker identification in forensic applications • Compatibility with large databases • Automation of LTAS computation and fusion with MFCC • Voice activity detection
Goals for PUMS season 3 (2/2) • Speaker verification in a real (noisy) environment • Prototype for access control • Solving the technical requirements for a prototype in an elevator • Usability for detecting sound sources in general • Keyword search (using HTK or Lingsoft Recognizer)
Research Group PUMS personnel Pasi Fränti Professor Ilja Sidoroff Marko Tuononen, BSc Rosa Gonzalez-Hautamäki, MSc Doctoral researchers Collaborators Juhani Saastamoinen, PhLic Ismo Kärkkäinen, MSc Ville Hautamäki, MSc Tomi Kinnunen, PhD (Singapore) Victoria Yanulevskaya Evgeny Karpov, MSc (NRC)
1. Applicability to forensic applications • A study of automatic speaker recognition has been completed. • The results have not been reported separately, but actions were taken within tasks 3 and 4. • Material can be found in Kinnunen's PhD thesis [4] and Niemi-Laitinen's presentation.
2. Support for large databases - Not yet done -
3. LTAS and other features • Automatic calculation of LTAS is done. Integration into WinSprofiler is in progress. Reporting is in progress. • The only benefit of LTAS is its speed and ease of use: it has no difficult control parameters. • It gives no additional benefit to recognition accuracy, because MFCC includes the same information. • It could be used for preliminary pruning with large datasets.
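The LTAS idea above can be illustrated with a minimal sketch: the long-term average spectrum is taken here as the mean of per-frame magnitude spectra over the whole signal. The function names, the naive DFT, the non-overlapping framing and the toy signal are all assumptions made for illustration, not the SRLIB or WinSprofiler implementation.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of one frame (illustrative only)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum

def ltas(samples, frame_len):
    """Average the per-frame magnitude spectra over non-overlapping frames."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    spectra = [magnitude_spectrum(f) for f in frames]
    return [sum(column) / len(spectra) for column in zip(*spectra)]

# Toy signal (hypothetical): a sinusoid with exactly 2 cycles per 8-sample frame.
signal = [math.sin(2 * math.pi * 2 * t / 8) for t in range(32)]
profile = ltas(signal, frame_len=8)
peak_bin = max(range(len(profile)), key=profile.__getitem__)
print(peak_bin)
```

Because the result is a single fixed-length vector per recording with only the frame length to choose, it is cheap to compare against many enrolled speakers, which fits the pruning use mentioned above.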
Noise robustness of F0 feature • Results reported in [3, 5]
4. Voice activity detection • Software for speech segmentation (VoiceGrep). • Command line version for Linux. • Windows version in WinSprofiler. • Testing done in the SIPU laboratory. • Labtec® pc mic 333, 44.1 kHz • Recordings were amplified by 24 dB using the Audacity audio editor
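As a rough illustration of what a voice activity detector does, the sketch below marks frames as active when their energy exceeds a fixed threshold. This is a generic energy-based approach, not VoiceGrep's actual method, which the slides do not describe; the frame length, the threshold and the sample values are hypothetical.

```python
import math

def frame_energies_db(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's energy in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        # Small constant avoids log10(0) on silent frames.
        energies.append(10 * math.log10(mean_square + 1e-12))
    return energies

def detect_active_frames(samples, frame_len=4, threshold_db=-20.0):
    """Mark a frame as active when its energy exceeds the (hypothetical) threshold."""
    return [e > threshold_db for e in frame_energies_db(samples, frame_len)]

# Toy data (hypothetical): quiet frame, loud frame, quiet frame.
quiet = [0.001, -0.001, 0.002, -0.002]
loud = [0.5, -0.4, 0.6, -0.5]
flags = detect_active_frames(quiet + loud + quiet)
print(flags)
```

A detector that relies on energy alone reacts to any loud sound, speech or not, so a practical system needs additional cues to separate speech from other sound sources.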
4a. Test material and results • Material • 4 hours in total. • Poor-quality recordings: 11-bit data, of which 4-5 bits carry information and the rest is noise. • VoiceGrep made 168 detections: • 56 speech (33%) • 112 non-speech (67%) • Material included 71 real speech segments: • Average segment length 16 s. • VoiceGrep found 25 of these (35%)
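The percentages on this slide follow directly from the reported counts. The short check below recomputes them; the variable names are chosen here for illustration, and the counts are taken from the slide.

```python
# Counts taken from the slide above.
detections = 168        # segments flagged by VoiceGrep
speech_detections = 56  # flagged segments that contained speech
true_segments = 71      # real speech segments in the material
found_segments = 25     # real speech segments that VoiceGrep found

def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

speech_share = percent(speech_detections, detections)
non_speech_share = percent(detections - speech_detections, detections)
found_share = percent(found_segments, true_segments)

print(speech_share, non_speech_share, found_share)
```

In detection terms, the first figure is the share of detections that were correct and the last is the share of real speech segments that were found.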
4c. VoiceGrep example (correct detection) • Start of the speech is detected correctly • End of the speech is missed • Audio sample #1
4d. VoiceGrep example (false detections) • Door opening • Running water • Walking • Door • Audio samples #2 and #3
4e. VoiceGrep example (missed speech segment) • Door • Door • Speech and walking • Audio sample #4
4f. Entire data set (4 hours) [Figure: data, speech segments, and result of VoiceGrep]
5. Speaker verification in noisy environment • Systematic testing of the parameters that affect accuracy has been reported in [1]. • Applicability of speaker verification in a real environment has been reported in [2] and in Kinnunen's PhD thesis [4]. • Additional testing will be done if time permits.
5a. Text-dependent verification in access control • Utilizing time series information improves recognition. • Best result if everyone has their own password.
6. Prototype for access control • Emergency button • Microphone • Motion detector
7. Calling elevator (technical requirements) • Communication with OPC server: • Implemented with Matrikon server. • Program logic for the elevator implemented: • Reads variables from the OPC server. • Interprets and shows elevator status. • Includes recording logic. • Speaker- and voice-related functionality: • Not yet implemented. • Main window does not show anything yet.
8. Usability for detecting sound sources in general - Not yet done -
9. Keyword search - Not yet done -
Publications (season 3) [1] J. Saastamoinen, Z. Fiedler, T. Kinnunen and P. Fränti, "On factors affecting MFCC-based speaker recognition accuracy", Int. Conf. on Speech and Computer (SPECOM'05), Patras, Greece, 503-506, October 2005. [2] H. Gupta, V. Hautamäki, T. Kinnunen and P. Fränti, "Field evaluation of text-dependent speaker recognition in an access control application", Int. Conf. on Speech and Computer (SPECOM'05), Patras, Greece, 551-554, October 2005. [3] T. Kinnunen and R. Gonzalez-Hautamäki, "Long-Term F0 Modeling for Text-Independent Speaker Recognition", Int. Conf. on Speech and Computer (SPECOM'05), Patras, Greece, 567-570, October 2005.
Theses (season 3) [4] T. Kinnunen, "Optimizing Spectral Feature Based Text-Independent Speaker Recognition", PhD thesis, University of Joensuu, June 2005. [5] R. Gonzalez-Hautamäki, "Fundamental Frequency Estimation and Modeling for Speaker Recognition", MSc thesis, University of Joensuu, July 2005.
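Application scenarios • Speaker recognition covers speaker verification and speaker identification. • Verification: "Is this Bob's voice?" The input is a voice sample plus an identity claim, and a false claim is rejected as an impostor. • Identification: "Whose voice is this?" The input is a voice sample only.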
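The difference between the two scenarios can be shown with a toy sketch: identification searches all enrolled speakers for the closest model, while verification compares the sample only against the claimed speaker and applies a threshold. The two-dimensional feature vectors, the Euclidean distance and the threshold value are hypothetical and stand in for real speaker models.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(sample, models):
    """Identification: return the enrolled speaker whose model is closest."""
    return min(models, key=lambda name: distance(sample, models[name]))

def verify(sample, claimed, models, threshold):
    """Verification: accept the claim only if the sample is close to the claimed model."""
    return distance(sample, models[claimed]) <= threshold

# Hypothetical enrolled models and test sample.
models = {"Alice": [1.0, 0.0], "Bob": [0.0, 1.0]}
sample = [0.1, 0.9]

who = identify(sample, models)                              # "Whose voice is this?"
bob_ok = verify(sample, "Bob", models, threshold=0.5)       # "Is this Bob's voice?"
alice_ok = verify(sample, "Alice", models, threshold=0.5)   # false claim
print(who, bob_ok, alice_ok)
```

Identification cost grows with the number of enrolled speakers, whereas verification needs one comparison but depends on a well-chosen threshold.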
Software 3: Symbian • Port to Symbian OS with the Series 60 UI platform
Software 4: Door SProfiler • Opening the laboratory door by speaking
Future development (1) Software integration • Keyword search • WinSprofiler: Windows (JoY) • Mobile: Series 60 (JoY) • DB support • SRLIB: VAD, MSE, F0 extraction, fusion by weighted MSE, VQ, GMM, MFCC, LTAS
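One plausible reading of the fusion item above is a weighted combination of per-feature match scores. The sketch below assumes each feature (MFCC, F0, LTAS) yields a distance where lower is better and combines them with fixed weights; the weights, the distance values and the speaker names are hypothetical and not taken from SRLIB.

```python
def fuse_scores(distances, weights):
    """Weighted average of per-feature distances (lower means a better match)."""
    total_weight = sum(weights[name] for name in distances)
    return sum(weights[name] * d for name, d in distances.items()) / total_weight

# Hypothetical feature weights and per-speaker distances.
weights = {"mfcc": 0.7, "f0": 0.2, "ltas": 0.1}
candidates = {
    "speaker_a": {"mfcc": 0.30, "f0": 0.50, "ltas": 0.40},
    "speaker_b": {"mfcc": 0.45, "f0": 0.20, "ltas": 0.35},
}

fused = {name: fuse_scores(d, weights) for name, d in candidates.items()}
best = min(fused, key=fused.get)
print(best)
```

In practice the weights would be tuned on held-out data, and the per-feature distances would need to be normalized to comparable ranges before they are combined.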
Future development (2) Applications [Figure: layered architecture. Applications: call center, forensic applications, calling elevator, speech analyzer tool, access control. Shared layer: common speaker recognition application interface. Components: verification, classifier fusion, segmentation, keyword search, srlib, VAD, DB.]
Future development (3) Technical development • Implement and integrate F0, possibly also formants (F1, F2). • Automatic voiced/unvoiced segmentation. • User enrollment. • Use of sequence information (triplets). • Develop the WinSprofiler software toward a voice profiler and speech analyzer tool.
Future development (4) Elevator prototype [Figure: system diagram. Labels: machine room, lift car & hardware, CAN, GW box, Ethernet TCP/IP, display, microphone, our PC, approach detection, OPC server, SRLIB 3.0, DCOM, OPC client, LiftCaller.]
Vision 1: Teleconferencing [Figure: participants Alice, Bob, Paul and Minna connect over a VPN, and each voice passes through speaker recognition. Alice is verified and allowed; Minna is not registered and is labeled unknown.]
Vision 2: Call center • Speech is the main tool for people working in a call center • Voice login of personnel • Removes the need for manual entry
Vision 3: Language recognition • A problem related to speaker recognition: the same research groups usually study both. • Not trivial to solve. • Studied a lot for Asian languages, even for rare languages that do not have any "written form".
Vision 4: Medical applications • Doctors use voice to record summaries of patient meetings. • Access by keyword search. • Annotation. • Authentication of speaker.
Thank you for your patience! Questions?