Speaker Identification by Combining MFCC and Phase Information

Longbiao Wang (Nagaoka University of Technology, Japan), Seiichi Nakagawa (Toyohashi University of Technology, Japan)

Background • The importance of phase in human speech recognition has been reported. • In conventional speaker recognition methods based on mel-frequency cepstral coefficients (MFCCs), phase information has hitherto been ignored.

Purpose and method • We aim to use phase information for speaker recognition. • We propose a phase information extraction method that normalizes the variation in phase caused by the clipping position of the input speech, and we combine the phase information with MFCCs.

Investigating the effect of phase Conventional MFCCs, which capture vocal tract information, cannot distinguish speaker characteristics that arise from the vocal source. The phase is strongly influenced by vocal source characteristics. We generated speech waves with different vocal sources and pitches and a fixed vocal tract shape corresponding to the vowel /a/.

Phase information extraction • The short-term spectrum S(ω, t) for the t-th frame of a signal is obtained by the DFT of an input speech signal sequence: S(ω, t) = X(ω, t) + jY(ω, t) = √(X²(ω, t) + Y²(ω, t)) · e^{jθ(ω, t)}, where θ(ω, t) is the phase. • For conventional MFCCs, the power spectrum is used but the phase information is ignored. In this paper, the phase is also extracted as one of the feature parameters for speaker recognition.
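The extraction step can be illustrated with a minimal sketch. It assumes a 16 kHz sampling rate, a 160-sample rectangular frame, and a toy cosine input; none of these values come from the slides, and the helper names `frame_spectrum` and `power_and_phase` are hypothetical.

```python
import cmath
import math

def frame_spectrum(signal, start, frame_len):
    """Direct DFT of one frame cut from `signal` at sample `start`."""
    frame = signal[start:start + frame_len]
    spectrum = []
    for k in range(frame_len // 2 + 1):
        acc = 0j
        for n, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * n / frame_len)
        spectrum.append(acc)
    return spectrum

def power_and_phase(spectrum):
    """Split a complex spectrum into the power used by MFCCs and the phase."""
    power = [abs(s) ** 2 for s in spectrum]
    phase = [cmath.phase(s) for s in spectrum]  # wrapped to [-pi, pi]
    return power, phase

# Toy input (assumption): a 1 kHz cosine with initial phase 0.5 rad at 16 kHz.
fs = 16000
signal = [math.cos(2 * math.pi * 1000 * n / fs + 0.5) for n in range(400)]

spec = frame_spectrum(signal, 0, 160)   # 160 samples gives 100 Hz bin spacing
power, phase = power_and_phase(spec)
peak_bin = max(range(len(power)), key=power.__getitem__)
print(peak_bin, phase[peak_bin])
```

An MFCC front end would keep only `power`; the approach in these slides additionally keeps `phase`.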

Problem of unnormalized phase • However, the phase changes with the clipping position of the input speech, even at the same frequency ω. As a result, the unnormalized wrapped phases of two windows cut at different positions differ considerably. Figure: example of the effect of clipping position on phase for the Japanese vowel /a/.
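The dependence on clipping position can be reproduced on a toy signal. The sketch below assumes a 16 kHz sampling rate and a pure 1 kHz cosine (both illustrative choices, not taken from the slides) and cuts the same signal at two positions three samples apart; `bin_phase` is a hypothetical helper.

```python
import cmath
import math

def bin_phase(signal, start, frame_len, k):
    """Wrapped phase of DFT bin `k` for the frame starting at sample `start`."""
    acc = sum(
        signal[start + n] * cmath.exp(-2j * math.pi * k * n / frame_len)
        for n in range(frame_len)
    )
    return cmath.phase(acc)

fs = 16000          # assumed sampling rate (Hz)
frame_len = 160     # assumed frame length (samples)
f0 = 1000           # tone frequency (Hz)
signal = [math.cos(2 * math.pi * f0 * n / fs) for n in range(400)]

k = f0 * frame_len // fs                        # DFT bin holding the tone
phase_a = bin_phase(signal, 0, frame_len, k)    # frame cut at sample 0
phase_b = bin_phase(signal, 3, frame_len, k)    # same signal, cut 3 samples later

# A delay of d samples changes the phase at frequency f0 by 2*pi*f0*d/fs.
expected_shift = 2 * math.pi * f0 * 3 / fs
print(phase_a, phase_b, expected_shift)
```

The two frames contain the same tone, yet their phases at 1 kHz differ by the amount predicted from the three-sample offset alone.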

Phase normalization (1/2) • To overcome this problem, the phase at a certain basis radian frequency ω is set to a constant in all frames, and the phase at every other frequency is estimated relative to it. In the experiments discussed in this paper, the basis radian frequency ω is set to 2π × 1000 Hz. • For example, setting the phase at the basis radian frequency to π/4, we have S′(ω, t) = |S(ω, t)| · e^{jθ(ω, t)} · e^{j(π/4 − θ(ω, t))}, where θ(ω, t) is the unnormalized phase of the spectrum S(ω, t).

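Phase normalization (2/2) • The difference between the unnormalized wrapped phase θ(ω, t) at the basis frequency ω = 2πf and the normalized wrapped phase is π/4 − θ(ω, t). At another frequency ω′ = 2πf′, the difference becomes (ω′/ω)(π/4 − θ(ω, t)). Thus, the spectrum at frequency ω′ becomes S′(ω′, t) = |S(ω′, t)| · e^{jθ(ω′, t)} · e^{j(ω′/ω)(π/4 − θ(ω, t))}, and the phase information is normalized as θ′(ω′, t) = θ(ω′, t) + (ω′/ω)(π/4 − θ(ω, t)).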
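A minimal sketch of this normalization follows, assuming the normalized phase at a frequency ω′ is the raw phase plus (ω′/ω)(π/4 − θ(ω, t)) with the 1000 Hz basis. The sampling rate, frame length, toy signal, and helper names are assumptions for illustration. The toy signal uses only integer multiples of the basis frequency, so wrapping of the basis phase introduces no ambiguity in this example.

```python
import cmath
import math

FS = 16000              # assumed sampling rate (Hz)
FRAME_LEN = 160         # assumed frame length (samples), 100 Hz bin spacing
BASIS_HZ = 1000         # basis frequency used in the slides
TARGET = math.pi / 4    # constant phase assigned to the basis frequency

def wrap(angle):
    """Wrap an angle to [-pi, pi]."""
    return cmath.phase(cmath.exp(1j * angle))

def bin_phase(signal, start, freq_hz):
    """Unnormalized wrapped phase at `freq_hz` for the frame at `start`."""
    k = freq_hz * FRAME_LEN // FS
    acc = sum(
        signal[start + n] * cmath.exp(-2j * math.pi * k * n / FRAME_LEN)
        for n in range(FRAME_LEN)
    )
    return cmath.phase(acc)

def normalized_phase(signal, start, freq_hz):
    """Phase at `freq_hz` after rotating the basis phase to TARGET."""
    theta_basis = bin_phase(signal, start, BASIS_HZ)
    theta = bin_phase(signal, start, freq_hz)
    return wrap(theta + (freq_hz / BASIS_HZ) * (TARGET - theta_basis))

# Toy signal (assumption): three harmonics with arbitrary initial phases.
components = [(1000, 0.3), (2000, 1.1), (3000, -0.7)]
signal = [
    sum(math.cos(2 * math.pi * f * n / FS + phi) for f, phi in components)
    for n in range(400)
]

freqs = [f for f, _ in components]
raw_a = [bin_phase(signal, 0, f) for f in freqs]          # cut at sample 0
raw_b = [bin_phase(signal, 5, f) for f in freqs]          # cut at sample 5
norm_a = [normalized_phase(signal, 0, f) for f in freqs]
norm_b = [normalized_phase(signal, 5, f) for f in freqs]
print(raw_a, raw_b)
print(norm_a, norm_b)
```

In this toy case the raw phases of the two frames disagree at every frequency, while the normalized phases agree and the basis frequency always sits at π/4.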

Comparison of unnormalized phase and normalized phase After normalizing the wrapped phase, the phase values become very similar. Example of the effect of clipping position on phase for Japanese vowel /a/

From phase θ to phase {cosθ, sinθ} • Comparing two wrapped phase values directly is problematic. For example, for the two values π − θ₁ and θ₂ = −π + θ₁, the difference is 2π − 2θ₁. If θ₁ ≈ 0, then the difference is ≈ 2π despite the two phases being very similar to one another. • Therefore, for this research, we changed the phase into coordinates on a unit circle, that is, θ → {cosθ, sinθ}.
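The effect of the unit-circle mapping can be checked numerically. This is a small sketch with two illustrative phase values just inside +π and −π; the function name `phase_to_unit_circle` and the chosen values are assumptions.

```python
import math

def phase_to_unit_circle(theta):
    """Map a wrapped phase to its (cos, sin) coordinates on the unit circle."""
    return (math.cos(theta), math.sin(theta))

# Two phases that are nearly identical as angles but sit on opposite
# sides of the wrapping boundary.
theta_1 = math.pi - 0.01
theta_2 = -math.pi + 0.01

raw_gap = abs(theta_1 - theta_2)    # close to 2*pi

p1 = phase_to_unit_circle(theta_1)
p2 = phase_to_unit_circle(theta_2)
circle_gap = math.hypot(p1[0] - p2[0], p1[1] - p2[1])   # small chord length

print(raw_gap, circle_gap)
```

The raw difference is almost 2π, whereas the distance between the two mapped points is small, which is the behavior a distance-based model such as a GMM needs. The cost is that each phase value becomes two feature dimensions.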

How to synchronize the splitting section

Combination method • The GMM based on MFCCs is combined with the GMM based on phase information. • The likelihood of MODEL 1 is linearly coupled with that of MODEL 2 to produce a new score L_comb^n given by L_comb^n = (1 − α) L_MFCC^n + α L_phase^n, where L_MFCC^n and L_phase^n are the likelihoods produced by the n-th speaker model based on MFCC and the n-th speaker model based on phase, α is the weighting coefficient, and n = 1, 2, …, N, with N being the number of registered speakers.
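As a sketch of the score-level combination, the following assumes the linear coupling takes the weighted form (1 − α)·L_MFCC + α·L_phase. The weight `alpha`, the function names, and the per-speaker scores are all hypothetical values chosen only to show the mechanics; they are not results from the experiments.

```python
def combine_scores(mfcc_scores, phase_scores, alpha):
    """Linearly couple per-speaker scores from the two models."""
    return [
        (1.0 - alpha) * m + alpha * p
        for m, p in zip(mfcc_scores, phase_scores)
    ]

def identify(scores):
    """Return the index of the registered speaker with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up log-likelihood scores for N = 3 registered speakers.
mfcc_scores = [-41.2, -40.9, -43.0]
phase_scores = [-30.1, -31.5, -29.0]

combined = combine_scores(mfcc_scores, phase_scores, alpha=0.3)
print(identify(mfcc_scores), identify(combined))
```

With these made-up numbers the MFCC model alone picks speaker 1, while the combined score picks speaker 0, which shows how the second model can change a close decision. In practice the weight would be tuned on held-out data.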

Experimental setup (1/3)
• NTT database
  • # speakers: 35 (22 males and 13 females)
  • # sessions: 5 (1990.8, 1990.9, 1990.12, 1991.3, 1991.6)
  • # training utterances: 5 (1990.8)
  • # test utterances: 1 (about 4 seconds), 35×4×5=700 trials
• JNAS database
  • # speakers: 270 (135 males and 135 females)
  • # training utterances: 5 (about 2 seconds / sentence)
  • # test utterances: 1 (about 5.5 seconds), about 95 sentences / person, 270×95=25650 trials

Experimental setup (2/3)
• Noise
  • Stationary noise (in a computer room)
  • Non-stationary noise (in an exhibition hall)
• Noisy speech
  • Noise was added to clean speech at average SN ratios of 20 dB and 10 dB.
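Mixing noise into clean speech at a target SN ratio can be sketched as below. The sketch assumes the ratio is defined on the mean power over the whole utterance, which may differ from the averaging used in the experiments; the sine "speech", the Gaussian "noise", and the helper names are stand-ins for illustration.

```python
import math
import random

def mean_power(samples):
    """Average power of a sample sequence."""
    return sum(v * v for v in samples) / len(samples)

def add_noise(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then mix."""
    target_noise_power = mean_power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [c + gain * n for c, n in zip(clean, noise)]

# Stand-in signals (assumptions): a 440 Hz tone and seeded Gaussian noise.
fs = 16000
clean = [math.sin(2 * math.pi * 440 * n / fs) for n in range(1600)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

noisy = add_noise(clean, noise, snr_db=10)

# Check the achieved ratio from the residual that was actually added.
residual = [y - c for y, c in zip(noisy, clean)]
measured_snr_db = 10 * math.log10(mean_power(clean) / mean_power(residual))
print(measured_snr_db)
```

Calling `add_noise` with `snr_db=20` gives the milder of the two conditions listed above.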

Experimental setup (3/3)

Speaker identification using clean speech

Speaker identification result on NTT database (1/2) Speaker identification results using the combination of MFCC-based GMM and the original phase {θ}

Speaker identification result on NTT database (2/2) Speaker identification results using the combination of MFCC-based GMM and the modified phase {cosθ, sinθ}

Speaker identification result on JNAS database Speaker identification results using the combination of MFCC-based GMM and the modified phase {cosθ, sinθ}

Speaker identification under stationary/non-stationary noisy conditions

Speaker identification results under noisy conditions (1/2) Speaker identification rate (%) NTT database

Speaker identification results under noisy conditions (2/2) Speaker identification rate (%) JNAS database

Conclusion • We proposed a phase information extraction method that normalizes the variation in phase caused by the clipping position of the input speech and integrates the phase information with MFCC. • The experimental results showed that combining phase information with MFCC remarkably improved speaker recognition performance compared with the MFCC-based method alone.

Thank you for your attention!
