

  1. The NIST Speaker Recognition Evaluations • Alvin F Martin • alvinfmartin@gmail.com • Odyssey 2012 @ Singapore • 27 June 2012

  2. Outline • Some Early History • Evaluation Organization • Performance Factors • Metrics • Progress • Future Odyssey 2012 @ Singapore

  3. Some Early History • Success of speech recognition evaluation • Showed benefits of independent evaluation on common data sets • Collection of early corpora, including TIMIT, KING, YOHO, and especially Switchboard • Multi-purpose corpus collected (~1991) with speaker recognition in mind • Followed by Switchboard-2 and similar collections • Linguistic Data Consortium created in 1992 to support further speech (and text) collections in US • The first “Odyssey” – Martigny 1994, followed by Avignon, Crete, Toledo, etc. • Earlier NIST speaker evaluations in ‘92, ’95 • ‘92 evaluation had several sites as part of DARPA program • ‘95 evaluation with 6 sites used some Switchboard-1 data • Emphasis was on speaker id rather than open set recognition Odyssey 2012 @ Singapore

  4. Odyssey 2012 @ Singapore

  5. Martigny 1994 • Varying corpora and performance measures made meaningful comparisons difficult Odyssey 2012 @ Singapore

  6. Avignon 1998 • Workshop announcement of 19 February 1998: RLA2C, “la Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques” (Speaker Recognition and its Commercial and Forensic Applications), Avignon, 20-23 April 1998, sponsored by GFCP, SFA, ESCA, IEEE • TIMIT was the preferred corpus • Sometimes bitter debate over forensic capabilities Odyssey 2012 @ Singapore

  7. Avignon Papers Odyssey 2012 @ Singapore

  8. Crete 2001 • First official “Odyssey” • More emphasis on evaluation • 2001: A Speaker Odyssey - The Speaker Recognition Workshop, June 18-22, 2001, Crete, Greece Odyssey 2012 @ Singapore

  9. Odyssey 2012 @ Singapore

  10. Odyssey 2012 @ Singapore

  11. Toledo 2004 • First Odyssey held in conjunction with the NIST SRE Workshop at the same location • First to include language recognition • Two notable keynotes on forensic recognition • Well attended • Odyssey held every two years since 2004 Odyssey 2012 @ Singapore

  12. Odyssey 2012 @ Singapore

  13. Etc. – Odyssey 2006, 2008, 2010, 2012, … Odyssey 2012 @ Singapore

  14. Organizing Evaluations • Which task(s)? • Key principles • Milestones • Participants Odyssey 2012 @ Singapore

  15. Which Speaker Recognition Problem? • Access Control? • Text independent or dependent? • Prior probability of target high • Forensic? • Prior not clear • Person Spotting? • Prior probability of target low • Text independent • NIST evaluations concentrated on the speaker spotting task, emphasizing the low false alarm region of performance curve Odyssey 2012 @ Singapore

  16. Some Basic Evaluation Principles • Speaker spotting primary task • Research system oriented • Pooling across target speakers • Emphasis on low false alarm rate operating point with scores and decisions (calibration matters) Odyssey 2012 @ Singapore

  17. Organization Basics • Open to all willing participants • Research-oriented • Commercialized competition discouraged • Written evaluation plans • Specified rules of participation • Workshops limited to participants • Each site/team must be represented • Evaluation data sets subsequently published by the LDC Odyssey 2012 @ Singapore

  18. Odyssey 2012 @ Singapore

  19. 1996 Evaluation Plan (cont’d) Odyssey 2012 @ Singapore

  20. 1996 Evaluation Plan (cont’d) • Note: PROC plots are ROC curves with the miss and false alarm error probabilities plotted on normal-deviate (normal probability) scales Odyssey 2012 @ Singapore
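
For illustration, here is a minimal sketch (not NIST's scoring software) of how such a plot can be produced: the miss and false alarm rates at each threshold are mapped through the inverse normal CDF, which is what makes roughly Gaussian score distributions appear as near-straight lines on a DET plot. The function name and score arrays are assumptions for this example.

```python
# Minimal sketch (assumed helper, not NIST scoring code): DET coordinates are
# ROC error rates mapped through the inverse normal CDF ("normal deviate" scale).
import numpy as np
from scipy.stats import norm

def det_coordinates(target_scores, nontarget_scores, eps=1e-6):
    """Return (probit(P_fa), probit(P_miss)) for a sweep over all score thresholds."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])   # targets rejected
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds]) # non-targets accepted
    # Clip to avoid infinities at 0% or 100% error before the probit transform.
    p_miss = np.clip(p_miss, eps, 1.0 - eps)
    p_fa = np.clip(p_fa, eps, 1.0 - eps)
    return norm.ppf(p_fa), norm.ppf(p_miss)
```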

  21. DET Curve Paper – Eurospeech ‘97 Odyssey 2012 @ Singapore

  22. Wikipedia DET Page Odyssey 2012 @ Singapore

  23. Some Milestones • 1992 – DARPA program limited speaker identification evaluation • 1995 – Small identification evaluation • 1996 – First SRE in current series • 2000 – AHUMADA Spanish data, first non-English speech • 2001 – Cellular data • 2001 – ASR transcripts provided • 2002 – FBI “forensic” database • 2002 – SuperSID Workshop following SRE • 2005 – Multiple languages with bilingual speakers Odyssey 2012 @ Singapore

  24. Some Milestones (cont’d) • 2005 – Room mic recordings, cross-channel trials • 2008 – Interview data • 2010 – New decision cost function metric stressing even lower FA rate region • 2010 – High and low vocal effort, aging • 2010 – HASR (Human-Assisted Speaker Recognition) Evaluation • 2011 – BEST Evaluation, broad range of test conditions, included added noise and reverb • 2012 – Target Speakers Defined Beforehand Odyssey 2012 @ Singapore

  25. Participation • Grew from fewer than a dozen to 58 sites in 2010 • MIT (Doug) provided workshop notebook covers listing participants • Big increase in participants after 2001 • Handling scores of participating sites becomes a management problem Odyssey 2012 @ Singapore

  26. NIST 2004 Speaker Recognition Workshop Taller de Reconocimiento de Locutor 3-4 June 2004 Toledo, Spain Odyssey 2012 @ Singapore

  27. NIST 2006 Speaker Recognition Workshop San Juan Puerto Rico 26-27 June 2006 Odyssey 2012 @ Singapore

  28. Participating Sites (# = incomplete participation, * = not in the SRE series) Odyssey 2012 @ Singapore

  29. This slide is from 2001: A Speaker Odyssey in Crete Odyssey 2012 @ Singapore

  30. NIST Evaluation Data Set (cont’d) Odyssey 2012 @ Singapore

  31. NIST Evaluation Data Set (cont’d) Odyssey 2012 @ Singapore

  32. Performance Factors • Intrinsic • Extrinsic • Parametric Odyssey 2012 @ Singapore

  33. Intrinsic Factors Relate to the speaker • Demographic factors • Sex • Age • Education • Mean pitch • Speaking style • Conversational telephone • Interview • Read text • Vocal effort • Some questions about definition and how to collect • Aging • Hard to collect sizable amounts of data with years of time separation Odyssey 2012 @ Singapore

  34. Extrinsic Factors Relate to the collection environment • Microphone or telephone channel • Telephone channel type • Landline, cellular, VOIP • In earlier times, carbon vs. electret • Telephone handset type • Handheld, headset, earbud, speakerphone • Microphone type – matched, mismatched • Placement of microphone relative to speaker • Background noise • Room reverberation Odyssey 2012 @ Singapore

  35. “Parametric” Factors • Train/test speech duration • Have tested 10 s up to ~half hour, more in ‘12 • Number of training sessions • Have tested 1 to 8, more in ‘12 • Language • English has been predominant, but a variety of others included in some evaluations • Is better performance for English due to familiarity and quantity of development data? • Cross-language trials a separate challenge Odyssey 2012 @ Singapore

  36. Metrics • Equal Error Rate • Easy to understand • Not operating point of interest • Calibration matters • Decision Cost Function • CLLR • FA rate at fixed miss rate • E.g. 10% (lower for some conditions) Odyssey 2012 @ Singapore
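
As a minimal sketch of the first of these metrics (assuming two arrays of detection scores, with higher values meaning more target-like), the EER is the error rate at the threshold where the miss and false alarm rates cross; the function below is illustrative, not part of any NIST tool.

```python
# Minimal sketch of the equal error rate (EER); score arrays are assumed inputs,
# with higher scores indicating the target speaker.
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(p_miss - p_fa))  # threshold where the two error rates are closest
    return 0.5 * (p_miss[i] + p_fa[i])
```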

  37. Decision Cost Function CDet • CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget) • Weighted sum of miss and false alarm error probabilities • Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, PTarget • Normalize by the best possible cost of a system doing no processing (the minimum of the cost of always deciding “yes” and of always deciding “no”) Odyssey 2012 @ Singapore
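
A minimal sketch of this cost and of the normalization described above; the parameter defaults here are placeholders for illustration, not the official NIST settings.

```python
# Minimal sketch of C_Det and its normalized form. Parameter defaults are
# placeholders, not the official NIST evaluation parameters.
def c_det(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.01):
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def c_det_norm(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.01):
    # A system that always says "yes" has P_miss = 0, P_fa = 1; always "no" is the reverse.
    # Normalize by the cheaper of those two "no processing" systems.
    c_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return c_det(p_miss, p_fa, c_miss, c_fa, p_target) / c_default
```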

  38. Decision Cost Function CDet(cont’d) • Parameters 1996-2008 • Parameters 2010 • Change in 2010 (for core and extended tests) met with some skepticism, but outcome appeared satisfactory Odyssey 2012 @ Singapore

  39. CLLR • Cllr = 1/(2 log 2) × [ (Σ log(1 + 1/s)) / NTT + (Σ log(1 + s)) / NNT ], where the first summation is over target trials, the second is over non-target trials, NTT and NNT are the numbers of target and non-target trials, respectively, and s represents a trial’s likelihood ratio • Information-theoretic measure made popular in this community by Niko • Covers a broad range of performance operating points • George has suggested limiting its range to low FA rates Odyssey 2012 @ Singapore
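
A minimal sketch of the formula above, assuming the inputs are likelihood ratios (not log likelihood ratios); writing it with base-2 logs absorbs the 1/(log 2) factor.

```python
# Minimal sketch of Cllr, assuming the s values are positive likelihood ratios.
# Using log base 2 absorbs the 1/(log 2) factor of the formula above.
import numpy as np

def cllr(target_lrs, nontarget_lrs):
    target_lrs = np.asarray(target_lrs, dtype=float)
    nontarget_lrs = np.asarray(nontarget_lrs, dtype=float)
    target_term = np.mean(np.log2(1.0 + 1.0 / target_lrs))   # small LRs on target trials are penalized
    nontarget_term = np.mean(np.log2(1.0 + nontarget_lrs))   # large LRs on non-target trials are penalized
    return 0.5 * (target_term + nontarget_term)
```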

  40. Fixed Miss Rate • Suggested in ‘96; was the primary metric in the BEST evaluation: the FA rate corresponding to a 10% miss rate • Easy to understand • Practical for applications of interest • May be viewed as the cost of listening to false alarms • For easier conditions, a 1% miss rate is now more appropriate Odyssey 2012 @ Singapore
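
A minimal sketch under the same assumed score-array inputs as the earlier examples: sweep the threshold until the miss rate reaches the chosen level and report the false alarm rate at that point.

```python
# Minimal sketch: false alarm rate at the threshold where the miss rate first
# reaches a chosen level (e.g. 10%, or 1% for easier conditions).
import numpy as np

def fa_at_fixed_miss(target_scores, nontarget_scores, miss_level=0.10):
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    for t in np.sort(np.concatenate([target_scores, nontarget_scores])):
        if (target_scores < t).mean() >= miss_level:   # raising the threshold increases misses
            return (nontarget_scores >= t).mean()      # report the FA rate at that threshold
    return 0.0  # the miss rate never reached the requested level
```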

  41. Recording Progress • Difficult to assure test set comparability • Participants encouraged to run prior systems on new data • Technology changes • In ‘96 landline phones predominated, with carbon button or electret microphones • Need to explore VOIP • With progress, want to make the test harder • Always want to add new evaluation conditions, new bells and whistles • More channel types, more speaking styles, languages, etc. • Externally added noise and reverb explored in 2011 with BEST • Doug’s history slide - updated Odyssey 2012 @ Singapore

  42. History Slide Odyssey 2012 @ Singapore

  43. Future • SRE12 • Beyond Odyssey 2012 @ Singapore

  44. SRE12 Plans • Target speakers specified in advance • Speakers in recent past evaluations (in the thousands) • All prior speech data available for training • Some new targets with training provided at evaluation time • Test segments will include non-target speakers • New interview speech provided in 16-bit linear pcm • Some test calls collected in noisy environments • Artificial noise added to some test segment data • Will this be an effectively easier id task? • Will the provided set of known targets change system approaches? • Optional conditions include • Assume test speaker is one of the known targets • Use no information about targets other than that of the trial Odyssey 2012 @ Singapore

  45. SRE12 Metric • Log-likelihood ratios will now be required • Therefore, no hard decisions are asked for • Primary metric will be an average of two detection cost functions, one using the SRE10 parameters, the other a target prior an order of magnitude greater (details on next slide) • Adds to stability of cost measure • Emphasizes need for good score calibration over wide range of log likelihoods • Alternative metrics will be Cllr and Cllr-M10, where the latter is Cllr limited to trials for which PMiss > 10% Odyssey 2012 @ Singapore

  46. SRE12 Primary Cost Function • Niko noted that estimated LLRs that make good decisions at a single operating point may not be effective at other operating points; therefore an average over two operating points is used • Writing DCF as PMiss + β × PFA, where β = (CFA/CMiss) × (1 – PTarget) / PTarget • We take as cost function (DCF1 + DCF2)/2, where PTarget-1 = 0.01, PTarget-2 = 0.001, with always CMiss = CFA = 1 Odyssey 2012 @ Singapore
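
A minimal sketch of this averaged cost, under the assumption (consistent with the LLR requirement described above) that decisions are taken by thresholding the submitted log likelihood ratios at the Bayes threshold log β for each prior; the exact scoring procedure is the one defined in the SRE12 evaluation plan, and the function names here are illustrative.

```python
# Minimal sketch of an SRE12-style primary cost: the average of two DCFs of the
# form P_miss + beta * P_fa with C_miss = C_fa = 1 and target priors 0.01 and
# 0.001, with decisions assumed to be made at the Bayes threshold log(beta).
import numpy as np

def dcf_from_llrs(llrs, is_target, p_target):
    llrs = np.asarray(llrs, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)   # True for target trials
    beta = (1.0 - p_target) / p_target              # (C_fa/C_miss) * (1 - P_target)/P_target, unit costs
    threshold = np.log(beta)                        # Bayes threshold for log likelihood ratio scores
    p_miss = np.mean(llrs[is_target] < threshold)
    p_fa = np.mean(llrs[~is_target] >= threshold)
    return p_miss + beta * p_fa

def primary_cost(llrs, is_target):
    return 0.5 * (dcf_from_llrs(llrs, is_target, 0.01) + dcf_from_llrs(llrs, is_target, 0.001))
```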

  47. Future Possibilities • SRE12 outcome will determine whether pre-specified targets will be further explored • Does this make the problem too easy? • Artificially added noise and reverb may continue • HASR12 will indicate whether human-in-the-loop evaluation gains traction • SRE’s have become bigger undertakings • Fifty or more participating sites • Data volume approaching terabytes (as in BEST) • Tens or hundreds of millions of trials • Schedule could move to every three years Odyssey 2012 @ Singapore
