ALISP based improvement of GMM’s for Text-independent Speaker Verification
Dijana Petrovska-Delacrétaz (1), Asmaa el Hannani (1), Gérard Chollet (2)
1: DIVA Group, University of Fribourg
2: GET-ENST, CNRS-LTCI, Paris
3-4 December 2003, Biometrics Tutorials, Uni. Fribourg
Overview
1. Why segmental speaker verification systems?
2. Speech segmentation problems
3. Proposed segmental system based on DTW distance measure
4. Experimental setup
5. Results
6. Conclusions and perspectives
1 Why segmental speaker verification systems?
• Current reference speaker verification systems are based on Gaussian Mixture Models (each speech frame is treated independently)
• Speech is composed of different sounds
• Phonemes have different discriminant characteristics for speaker verification (see Eatock et al. ’94, J. Olsen ’97, Petrovska et al. ’98, 2000…)
  • nasals and vowels convey more speaker characteristics than other speech classes
  • we would like to exploit this fact
• We need an automatic speech segmentation tool!
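The frame-independence assumption of the GMM baseline can be made concrete with a small sketch. This is not the authors' implementation: it assumes diagonal covariances, and the function names (`log_gaussian_diag`, `gmm_frame_loglik`, `gmm_avg_loglik`) are hypothetical. It only illustrates that the utterance score is an average of per-frame log-likelihoods, so the order of the frames plays no role.

```python
import math

def log_gaussian_diag(frame, mean, var):
    """Log density of one diagonal-covariance Gaussian at one feature frame."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )

def gmm_frame_loglik(frame, weights, means, variances):
    """Log-likelihood of a single frame under the mixture (log-sum-exp over components)."""
    logs = [
        math.log(w) + log_gaussian_diag(frame, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood: every frame is scored independently."""
    total = sum(gmm_frame_loglik(f, weights, means, variances) for f in frames)
    return total / len(frames)

# Toy one-dimensional model with two components (illustrative values only).
weights = [0.5, 0.5]
means = [[0.0], [2.0]]
variances = [[1.0], [1.0]]
frames = [[0.1], [1.9], [0.3]]

# Shuffling the frames leaves the score unchanged, which is the limitation
# that a segmental system tries to address.
score = gmm_avg_loglik(frames, weights, means, variances)
score_reordered = gmm_avg_loglik(list(reversed(frames)), weights, means, variances)
```

In a verification setting, the same average would typically be computed under a client model and a world model and the two compared; that step is left out here.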
1.1 Advantages and disadvantages of speech segmentation
• Problems:
  • Need for a speech segmentation tool
  • Speaker modeling per speech class => more data needed
  • More complicated systems
• Advantages:
  • Possibility to use it in combination with dialogue-based systems, for which a speech segmentation is already done
  • Possibility to introduce text-prompted speaker verification, designed to include a maximum number of speaker-specific units
2 Speech Segmentation
• Large Vocabulary Continuous Speech Recognition (LVCSR) system
  • good results for a small set of languages
  • needs a huge amount of annotated speech data
  • language (and task) dependent
  • we do not have such a system for American English
2.1 ALISP Speech Segmentation
• Data-driven speech segmentation
  • not yet usable for speech recognition purposes
  • no annotated databases needed
  • language and task independent
  • we could use it to segment the speech data for a text-independent speaker verification task
• We will use the data-driven speech segmentation method ALISP (Automatic Language Independent Speech Processing)
3 Proposed speaker verification system: ALISP segments and DTW
3.1 Segmentation problem
• Segmentation of the speech data with N ALISP HMM models
  • N = 64 speech classes
• Need for (untranscribed) speech data to train the 64 ALISP HMM models
• With so many speech classes we should change the speaker modeling method: there is not enough data for GMM adaptation
  => Use of Dynamic Time Warping (DTW)
3.2 DTW distance measure for speaker verification
• Dynamic Time Warping (DTW) was already used for speaker verification, in a text-dependent mode (Rosenberg ’76, Rabiner Schafer ’76, Furui ’81, Pandit and Kittler ’98…)
• The DTW distance measure between two speech segments conveys speaker-specific characteristics
• Originality: DTW is used in text-independent mode
  • We first segment the speech data into ALISP classes
  • We then measure the “distance” between speaker and non-speaker segments
• Speaker-specific information is extracted from the client data:
  • ALISP-based speech segments => Client Dictionary
• Non-speaker (world speakers) data:
  • ALISP-based speech segments => World Dictionary
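As an illustration of the distance measure discussed above, here is a minimal DTW sketch. It is a textbook formulation, not the system's actual code: the Euclidean local distance, the symmetric step pattern and the normalization by the summed segment lengths are all assumptions, and `dtw_distance` is a hypothetical name.

```python
import math

def euclidean(a, b):
    """Local distance between two feature vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seg_a, seg_b):
    """Length-normalized DTW distance between two segments.

    Each segment is a list of feature vectors (one per frame).
    """
    n, m = len(seg_a), len(seg_b)
    if n == 0 or m == 0:
        raise ValueError("both segments must contain at least one frame")
    inf = float("inf")
    # cost[i][j] = cheapest accumulated distance aligning the first i frames
    # of seg_a with the first j frames of seg_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = euclidean(seg_a[i - 1], seg_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seg_b frame j-1 is reused
                cost[i][j - 1],      # seg_a frame i-1 is reused
                cost[i - 1][j - 1],  # both sequences advance
            )
    # Normalize so that long and short segments remain comparable.
    return cost[n][m] / (n + m)

# Two renditions of the same toy pattern at different speeds align perfectly.
slow = [[0.0], [0.0], [1.0]]
fast = [[0.0], [1.0]]
same_pattern = dtw_distance(slow, fast)
```

Because the warping absorbs differences in duration, what remains in the distance is mostly spectral mismatch, which is why it can carry speaker-specific information when both segments belong to the same ALISP class.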
3.3 Searching in the client and world speech dictionaries for speaker verification purposes
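The slide's diagram is not reproduced here, so the following is only a guess at how such a dictionary search could be organized. It assumes each dictionary maps an ALISP class label to a list of stored segments, that the nearest stored segment is found by exhaustive search, and that the final score is the mean difference between the world and client distances. The names `nearest_distance` and `verification_score` are hypothetical, and the distance function is passed in (a DTW distance in practice, a trivial scalar distance in the toy example).

```python
def nearest_distance(segment, candidates, distance):
    """Smallest distance between a test segment and any stored segment."""
    return min(distance(segment, candidate) for candidate in candidates)

def verification_score(test_segments, client_dict, world_dict, distance):
    """Mean of (world distance - client distance) over the test segments.

    test_segments: list of (alisp_label, segment) pairs.
    client_dict / world_dict: ALISP label -> list of stored segments.
    A higher score means the test speech is closer to the client than to
    the world speakers (an assumed scoring rule, not taken from the slides).
    """
    diffs = []
    for label, segment in test_segments:
        client_candidates = client_dict.get(label)
        world_candidates = world_dict.get(label)
        if not client_candidates or not world_candidates:
            continue  # class unseen in one dictionary: no comparison possible
        d_client = nearest_distance(segment, client_candidates, distance)
        d_world = nearest_distance(segment, world_candidates, distance)
        diffs.append(d_world - d_client)
    if not diffs:
        raise ValueError("no test segment has a class present in both dictionaries")
    return sum(diffs) / len(diffs)

# Toy example: scalars stand in for segments, absolute difference for DTW.
client_dict = {"Ha": [1.0, 2.0], "Hf": [5.0]}
world_dict = {"Ha": [4.0], "Hf": [9.0]}
test_segments = [("Ha", 1.5), ("Hf", 6.0), ("HZ", 0.0)]
toy_score = verification_score(
    test_segments, client_dict, world_dict, lambda a, b: abs(a - b)
)
```

The decision would then compare this score with a threshold tuned on development data.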
4 Evaluation of the proposed system: experimental setup
• Development data: one subset from NIST 2002 cellular data (American English)
  • world speakers (60 female + 59 male):
    • used to train the ALISP speech segmenter
    • and to model the non-speakers (world speakers)
• Evaluated on:
  • another subset from NIST 2002 (111 + 79 male speakers)
4.1 Speech segmentation example
• Two other occurrences of the English phone: ay
• the corresponding ALISP sequences: HX - Hf and (HM) - Hf - Ha
• previous slide: (Hf) - Ha or (HM) - HZ - Ha
4.5 Applying only GMM scores to the segments => segmental GMM system
5. Conclusions
• State-of-the-art NIST 2002 results for EER: best 8% to worst 28%
• Fusion of the classical system with a segmental system: big improvements
• Why: the higher-level information present in the segmental system usefully complements the short-term frequency information present in the GMM system
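The slides do not say how the two systems were fused or how the EER was computed. The sketch below assumes the simplest option, a weighted sum of the two scores, together with a threshold-sweep estimate of the equal error rate. The names `fuse` and `equal_error_rate` are hypothetical, and the numbers are contrived toy values chosen to show complementarity, not experimental results.

```python
def fuse(gmm_score, segmental_score, weight=0.5):
    """Weighted-sum fusion of two scores (an assumed fusion rule)."""
    return weight * gmm_score + (1.0 - weight) * segmental_score

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: sweep every observed score as a threshold and return
    the mean of false-acceptance and false-rejection rates where they are closest."""
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Contrived scores: each system confuses a different trial.
gmm_targets, gmm_impostors = [1.0, 3.0], [0.0, 2.0]
seg_targets, seg_impostors = [3.0, 1.0], [2.0, 0.0]

fused_targets = [fuse(g, s) for g, s in zip(gmm_targets, seg_targets)]
fused_impostors = [fuse(g, s) for g, s in zip(gmm_impostors, seg_impostors)]

eer_gmm = equal_error_rate(gmm_targets, gmm_impostors)
eer_seg = equal_error_rate(seg_targets, seg_impostors)
eer_fused = equal_error_rate(fused_targets, fused_impostors)
```

On these toy values each system alone reaches an EER of 0.5 while the fused scores separate perfectly. Real systems would not show such a clean effect, and in practice the scores are usually normalized to a common range before being combined.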