
AUDIO-VISUAL SPEAKER IDENTIFICATION USING THE CUAVE DATABASE

David Dean, Patrick Lucey and Sridha Sridharan
Speech, Audio, Image and Video Research Laboratory, QUT


ABSTRACT

The freely available nature of the CUAVE database makes it a valuable platform for forming benchmarks and comparing research. This paper shows that the CUAVE database can successfully be used to test speaker identification systems, with performance comparable to existing systems implemented on other databases. Additionally, this research shows that the optimal configuration for decision fusion of an audio-visual speaker identification system relies heavily on the video modality in all but clean speech conditions.

System overview (poster block diagram): lip location and tracking feeds visual (PCA) feature extraction and the visual speaker models; acoustic (MFCC) feature extraction feeds the acoustic speaker models; the scores of the two streams are combined by decision fusion to produce the speaker decision.

LIP LOCATION AND TRACKING

MODELLING AND FUSION

For these experiments, text-dependent speaker modelling was performed, meaning that each speaker says the same utterance for both training and testing. Ten experiments were run, one for each digit in the CUAVE database. Using phoneme transcriptions obtained in earlier research on the CUAVE database, speaker-independent HMMs were trained for each phoneme separately in both the audio and visual modalities. The structure of these HMMs was identical for both modalities, with 3 hidden states for the phone models and 1 hidden state for the short-pause model. These speaker-independent models were then adapted into speaker-dependent models using MLLR adaptation. For each word under test in the CUAVE database, the scores of the ten most likely speakers in either modality are combined using weighted-sum late fusion of the normalised scores from each modality. The fusion experiments were conducted over a range of audio-weight (α) values to discover the best fusion parameters at each noise level; a sketch of this weighted-sum fusion is given after the feature extraction section.

RESULTS

Figure A shows the response of the fused system to speech-babble audio corruption over a range of signal-to-noise ratios (SNRs). The dashed line indicates the best possible performance at each noise level. The performance of this system is comparable to existing research on XM2VTS and other audio-visual databases. From Figure A, it would appear that the best non-adaptive fusion performance is obtained for relatively low audio weights. This is confirmed by Figure B, which shows that the optimal audio weight for best performance at each noise level favours the visual modality (α < 0.5) for all but clean acoustic speech.

FEATURE EXTRACTION

Acoustic features were represented by the first 15 Mel frequency cepstral coefficients (MFCCs), a normalised energy coefficient, and the first and second derivatives of these 16 coefficients. Acoustic features were captured at 100 samples/second. Visual features were represented by a 50-dimensional vector consisting of each tracked-lip image projected into a PCA space built from 1,000 randomly selected tracked-lip images. Visual features were captured at 29.97 samples/second. A sketch of both feature front ends follows.
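As an illustration of the feature front ends described above, the following is a minimal sketch in Python. It is not the authors' implementation: librosa and scikit-learn are assumed for the MFCC/delta and PCA computations, librosa's c0 coefficient stands in for the normalised energy term, and the 25 ms analysis window is a typical choice rather than one stated on the poster.

```python
# Sketch of the acoustic (MFCC + deltas) and visual (PCA lip-space) features.
# Assumptions not taken from the poster: librosa front end, scikit-learn PCA,
# 25 ms analysis window, c0 used in place of the normalised energy coefficient.
import numpy as np
import librosa
from sklearn.decomposition import PCA

def acoustic_features(wav_path, sr=16000):
    """48-dim acoustic vectors at 100 frames/s:
    16 static coefficients plus first and second derivatives."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = sr // 100                                   # 10 ms hop -> 100 frames/s
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16,
                                n_fft=int(0.025 * sr), hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)         # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)         # second derivatives
    return np.vstack([mfcc, d1, d2]).T                # (frames, 48)

def fit_lip_pca(training_lip_images, n_components=50):
    """PCA lip-space built from ~1,000 randomly selected tracked-lip images,
    each image flattened to a vector."""
    X = np.array([img.ravel() for img in training_lip_images])
    return PCA(n_components=n_components).fit(X)

def visual_features(lip_images, pca):
    """50-dim visual vectors at the video rate (29.97 frames/s):
    each tracked-lip image projected into the PCA space."""
    X = np.array([img.ravel() for img in lip_images])
    return pca.transform(X)                           # (frames, 50)
```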
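The weighted-sum late fusion and the sweep over audio weights can be sketched as below. This is an illustrative reconstruction rather than the authors' code: the score normalisation is shown as min-max scaling (the poster does not specify the method), and the identified speaker is taken as the one with the highest fused score.

```python
# Sketch of weighted-sum late (decision) fusion of normalised modality scores,
# and of the search over the audio weight alpha at a given noise level.
# Assumption: per-speaker scores from each modality are already computed.
import numpy as np

def normalise(scores):
    """Map a vector of per-speaker scores onto [0, 1] (min-max scaling)."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def fuse(audio_scores, visual_scores, alpha):
    """Weighted sum: alpha weights the audio stream, (1 - alpha) the video stream."""
    return alpha * normalise(audio_scores) + (1 - alpha) * normalise(visual_scores)

def identify(audio_scores, visual_scores, alpha):
    """Identified speaker = index of the highest fused score."""
    return int(np.argmax(fuse(audio_scores, visual_scores, alpha)))

def sweep_alpha(trials, alphas=np.linspace(0.0, 1.0, 21)):
    """Identification accuracy for each audio weight, mirroring the search for
    the best fusion parameter at each noise level.
    `trials` is a list of (audio_scores, visual_scores, true_speaker_index)."""
    accuracy = {}
    for a in alphas:
        correct = sum(identify(au, vi, truth_idx) == truth_idx
                      for au, vi, truth_idx in trials) if False else \
                  sum(identify(au, vi, a) == truth_idx
                      for au, vi, truth_idx in trials)
        accuracy[round(float(a), 2)] = correct / len(trials)
    return accuracy
```

With an alpha grid in steps of 0.05, the dictionary returned by `sweep_alpha` gives the accuracy-versus-audio-weight curve for one noise condition; repeating it per SNR reproduces the kind of comparison summarised in Figures A and B.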
