The NIST Speaker Recognition Evaluations
Alvin F Martin, alvinfmartin@gmail.com
Odyssey 2012 @ Singapore, 27 June 2012
Outline
• Some Early History
• Evaluation Organization
• Performance Factors
• Metrics
• Progress
• Future
Some Early History
• Success of speech recognition evaluation
  • Showed benefits of independent evaluation on common data sets
• Collection of early corpora, including TIMIT, KING, YOHO, and especially Switchboard
  • Multi-purpose corpus collected (~1991) with speaker recognition in mind
  • Followed by Switchboard-2 and similar collections
• Linguistic Data Consortium created in 1992 to support further speech (and text) collections in US
• The first “Odyssey” – Martigny 1994, followed by Avignon, Crete, Toledo, etc.
• Earlier NIST speaker evaluations in ’92, ’95
  • ’92 evaluation had several sites as part of DARPA program
  • ’95 evaluation with 6 sites used some Switchboard-1 data
  • Emphasis was on speaker id rather than open set recognition
Martigny 1994
• Varying corpora and performance measures made meaningful comparisons difficult
Avignon 1998
• 19th February 1998: WORKSHOP RLA2C - Speaker Recognition
• RLA2C: la Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (Speaker Recognition and its Commercial and Forensic Applications)
• Avignon, 20-23 April 1998
• Sponsored by GFCP - SFA - ESCA - IEEE
• TIMIT was preferred corpus
• Sometimes bitter debate over forensic capabilities
Avignon Papers
Crete 2001
• First official “Odyssey”
• More emphasis on evaluation
• 2001: A Speaker Odyssey - The Speaker Recognition Workshop, June 18-22, 2001, Crete, Greece
Toledo 2004
• First Odyssey with NIST SRE Workshop held in conjunction at same location
• First to include language recognition
• Two notable keynotes on forensic recognition
• Well attended
• Odyssey held biennially since 2004
Etc. – Odyssey 2006, 2008, 2010, 2012, …
Organizing Evaluations
• Which task(s)?
• Key principles
• Milestones
• Participants
Which Speaker Recognition Problem?
• Access Control?
  • Text independent or dependent?
  • Prior probability of target high
• Forensic?
  • Prior not clear
• Person Spotting?
  • Prior probability of target low
  • Text independent
• NIST evaluations concentrated on the speaker spotting task, emphasizing the low false alarm region of the performance curve
Some Basic Evaluation Principles
• Speaker spotting primary task
• Research system oriented
• Pooling across target speakers
• Emphasis on low false alarm rate operating point with scores and decisions (calibration matters)
Organization Basics
• Open to all willing participants
• Research-oriented
  • Commercialized competition discouraged
• Written evaluation plans
  • Specified rules of participation
• Workshops limited to participants
  • Each site/team must be represented
• Evaluation data sets subsequently published by the LDC
1996 Evaluation Plan (cont’d)
1996 Evaluation Plan (cont’d)
• Footnote 1: PROC plots are ROCs plotted on normal probability error (miss versus false alarm) plots
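The footnote's idea of plotting miss versus false alarm rates on normal probability scales can be sketched as follows. This is a minimal illustration, not code from the evaluation plan: the function names, the "accept when score >= threshold" convention, and the clipping constant `eps` are all assumptions made for the example.

```python
from statistics import NormalDist


def det_points(target_scores, nontarget_scores):
    """Return (p_fa, p_miss) pairs, one per candidate threshold.

    Assumed convention: a trial is accepted when score >= threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        points.append((p_fa, p_miss))
    return points


def probit(p, eps=1e-6):
    """Map a probability to a normal deviate.

    Probabilities are clipped away from 0 and 1 (eps is an arbitrary
    illustrative choice) because the inverse CDF is undefined there.
    """
    p = min(max(p, eps), 1.0 - eps)
    return NormalDist().inv_cdf(p)


def det_coordinates(target_scores, nontarget_scores):
    """Error-rate pairs warped onto normal-deviate axes for plotting."""
    return [(probit(fa), probit(miss))
            for fa, miss in det_points(target_scores, nontarget_scores)]
```

On these warped axes, a system whose target and non-target scores are both normally distributed traces a straight line, which is one reason this presentation is easier to read than a conventional ROC.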
DET Curve Paper – Eurospeech ’97
Wikipedia DET Page
Some Milestones
• 1992 – DARPA program limited speaker identification evaluation
• 1995 – Small identification evaluation
• 1996 – First SRE in current series
• 2000 – AHUMADA Spanish data, first non-English speech
• 2001 – Cellular data
• 2001 – ASR transcripts provided
• 2002 – FBI “forensic” database
• 2002 – SuperSID Workshop following SRE
• 2005 – Multiple languages with bilingual speakers
Some Milestones (cont’d)
• 2005 – Room mic recordings, cross-channel trials
• 2008 – Interview data
• 2010 – New decision cost function metric stressing even lower FA rate region
• 2010 – High and low vocal effort, aging
• 2010 – HASR (Human-Assisted Speaker Recognition) Evaluation
• 2011 – BEST Evaluation, broad range of test conditions, included added noise and reverb
• 2012 – Target speakers defined beforehand
Participation
• Grew from fewer than a dozen to 58 sites in 2010
• MIT (Doug) provided workshop notebook covers listing participants
• Big increase in participants after 2001
• Handling scores of participating sites becomes a management problem
NIST 2004 Speaker Recognition Workshop (Taller de Reconocimiento de Locutor), 3-4 June 2004, Toledo, Spain
NIST 2006 Speaker Recognition Workshop, San Juan, Puerto Rico, 26-27 June 2006
Participating Sites (# Incomplete, * Not in SRE series)
This slide is from 2001: A Speaker Odyssey in Crete
NIST Evaluation Data Set (cont’d)
Performance Factors
• Intrinsic
• Extrinsic
• Parametric
Intrinsic Factors
Relate to the speaker
• Demographic factors
  • Sex
  • Age
  • Education
• Mean pitch
• Speaking style
  • Conversational telephone
  • Interview
  • Read text
• Vocal effort
  • Some questions about definition and how to collect
• Aging
  • Hard to collect sizable amounts of data with years of time separation
Extrinsic Factors
Relate to the collection environment
• Microphone or telephone channel
• Telephone channel type
  • Landline, cellular, VOIP
  • In earlier times, carbon vs. electret
• Telephone handset type
  • Handheld, headset, earbud, speakerphone
• Microphone type – matched, mismatched
• Placement of microphone relative to speaker
• Background noise
• Room reverberation
“Parametric” Factors
• Train/test speech duration
  • Have tested 10 s up to ~half hour, more in ’12
• Number of training sessions
  • Have tested 1 to 8, more in ’12
• Language
  • English has been predominant, but a variety of others included in some evaluations
  • Is better performance for English due to familiarity and quantity of development data?
  • Cross-language trials a separate challenge
Metrics
• Equal Error Rate
• Easy to understand
• Not operating point of interest
• Calibration matters
• Decision Cost Function
• CLLR
• FA rate at fixed miss rate
• E.g. 10% (lower for some conditions)
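As a rough illustration of the first metric in the list, the equal error rate can be approximated by sweeping a threshold over the observed scores and taking the point where miss and false alarm rates are closest. This is a coarse sketch under assumed conventions (accept when score >= threshold, no interpolation between thresholds); the function names are hypothetical.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Miss and false alarm rates when accepting scores >= threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: average of P_miss and P_fa where they are closest.

    No interpolation is done, so the result is only as fine-grained
    as the set of observed scores allows.
    """
    best_gap, best_eer = None, None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, t)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (p_miss + p_fa) / 2.0
    return best_eer
```

Because the EER depends only on how the scores rank trials, it is unaffected by any monotonic rescaling of the scores, which is why it says nothing about calibration.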
Decision Cost Function $C_{Det}$
$$C_{Det} = C_{Miss} \times P_{Miss|Target} \times P_{Target} + C_{FalseAlarm} \times P_{FalseAlarm|NonTarget} \times (1 - P_{Target})$$
• Weighted sum of miss and false alarm error probabilities
• Parameters are the relative costs of detection errors, $C_{Miss}$ and $C_{FalseAlarm}$, and the a priori probability of the specified target speaker, $P_{Target}$
• Normalize by best possible cost of system doing no processing (minimum of cost of always deciding “yes” or always deciding “no”)
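The cost function and its normalization can be written out directly. The sketch below follows the formula on the slide; the function names are hypothetical, and the parameter values used in the usage note are illustrative choices, not values taken from this slide.

```python
def detection_cost(p_miss, p_fa, c_miss, c_fa, p_target):
    """C_Det: weighted sum of miss and false alarm probabilities."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def normalized_detection_cost(p_miss, p_fa, c_miss, c_fa, p_target):
    """C_Det divided by the cost of the best system that does no processing.

    Always deciding "no" gives P_miss = 1, P_fa = 0, cost C_miss * P_target.
    Always deciding "yes" gives P_miss = 0, P_fa = 1, cost C_fa * (1 - P_target).
    """
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return detection_cost(p_miss, p_fa, c_miss, c_fa, p_target) / default_cost
```

For example, with the illustrative settings `c_miss=10`, `c_fa=1`, `p_target=0.01`, a system with a 10% miss rate and a 1% false alarm rate has a raw cost of about 0.0199 and a normalized cost of about 0.199. A normalized cost of 1.0 or more means the system is no better than always giving the same answer.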
Decision Cost Function $C_{Det}$ (cont’d)
• Parameters 1996-2008
• Parameters 2010
• Change in 2010 (for core and extended tests) met with some skepticism, but outcome appeared satisfactory
CLLR
$$C_{llr} = \frac{1}{2 \log 2} \left( \frac{1}{N_{TT}} \sum_{\text{target}} \log\left(1 + \frac{1}{s}\right) + \frac{1}{N_{NT}} \sum_{\text{non-target}} \log(1 + s) \right)$$
where the first summation is over target trials, the second is over non-target trials, $N_{TT}$ and $N_{NT}$ are the numbers of target and non-target trials, respectively, and $s$ represents a trial’s likelihood ratio
• Information theoretic measure made popular in this community by Niko
• Covers broad range of performance operating points
• George has suggested limiting range to low FA rates
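The Cllr formula translates almost line for line into code. This is a minimal sketch assuming the inputs are likelihood ratios (not log-likelihood ratios) and are strictly positive; the function name is hypothetical.

```python
import math


def cllr(target_lrs, nontarget_lrs):
    """Cllr from likelihood ratios s of target and non-target trials.

    Natural logs are used, and the division by 2 * log(2) expresses
    the result in bits, as in the formula on the slide.
    """
    target_term = sum(math.log(1.0 + 1.0 / s) for s in target_lrs) / len(target_lrs)
    nontarget_term = sum(math.log(1.0 + s) for s in nontarget_lrs) / len(nontarget_lrs)
    return (target_term + nontarget_term) / (2.0 * math.log(2.0))
```

A system that outputs a likelihood ratio of 1 for every trial conveys no information and scores exactly 1.0; confident and correct ratios (large for targets, small for non-targets) push the value toward 0, while confident but wrong ratios are penalized heavily.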
Fixed Miss Rate
• Suggested in ’96, was primary metric in BEST 2012: FA rate corresponding to 10% miss rate
• Easy to understand
• Practical for applications of interest
• May be viewed as cost of listening to false alarms
• For easier conditions, a 1% miss rate now more appropriate
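One way to compute the false alarm rate at a fixed miss rate is to pick the threshold from the sorted target scores and then count the non-target scores that pass it. The sketch below is one plausible reading, not the evaluation's scoring code: it assumes trials are accepted when score >= threshold and chooses the threshold so that the miss rate never exceeds the requested value.

```python
import math


def fa_rate_at_miss_rate(target_scores, nontarget_scores, miss_rate=0.10):
    """False alarm rate at the threshold giving at most `miss_rate` misses."""
    ordered = sorted(target_scores)
    # Number of target trials we may reject; the small constant guards
    # against floating-point products such as 0.29 * 100 = 28.999...
    allowed_misses = math.floor(miss_rate * len(ordered) + 1e-9)
    allowed_misses = min(allowed_misses, len(ordered) - 1)
    # Everything scoring below this target score is rejected.
    threshold = ordered[allowed_misses]
    accepted = sum(s >= threshold for s in nontarget_scores)
    return accepted / len(nontarget_scores)
```

With ten target scores 1 through 10 and a 10% miss rate, the threshold lands on the second-lowest target score, so exactly one target trial is missed and the returned value is the fraction of non-target trials scoring 2 or higher.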
Recording Progress
• Difficult to assure test set comparability
  • Participants encouraged to run prior systems on new data
• Technology changes
  • In ’96 landline phones predominated, with carbon button or electret microphones
  • Need to explore VOIP
• With progress, want to make the test harder
  • Always want to add new evaluation conditions, new bells and whistles
  • More channel types, more speaking styles, languages, etc.
  • Externally added noise and reverb explored in 2011 with BEST
• Doug’s history slide - updated
History Slide
Future
• SRE12
• Beyond
SRE12 Plans
• Target speakers specified in advance
  • Speakers in recent past evaluations (in the thousands)
  • All prior speech data available for training
  • Some new targets with training provided at evaluation time
• Test segments will include non-target speakers
• New interview speech provided in 16-bit linear PCM
• Some test calls collected in noisy environments
• Artificial noise added to some test segment data
• Will this be an effectively easier id task?
• Will the provided set of known targets change system approaches?
• Optional conditions include
  • Assume test speaker is one of the known targets
  • Use no information about targets other than that of the trial
SRE12 Metric
• Log-likelihood ratios will now be required
  • Therefore, no hard decisions are asked for
• Primary metric will be an average of two detection cost functions, one using the SRE10 parameters, the other a target prior an order of magnitude greater (details on next slide)
  • Adds to stability of cost measure
  • Emphasizes need for good score calibration over wide range of log likelihoods
• Alternative metrics will be Cllr and Cllr-M10, where the latter is Cllr limited to trials for which $P_{Miss}$ > 10%
SRE12 Primary Cost Function
• Niko noted that estimated llr’s making good decisions at a single operating point may not be effective at other operating points; therefore an average of two points is used
• Writing DCF as $P_{Miss} + \beta \times P_{FA}$, where $\beta = (C_{FA} / C_{Miss}) \times (1 - P_{Target}) / P_{Target}$
• We take as cost function $(DCF_1 + DCF_2) / 2$, where $P_{Target\text{-}1} = 0.01$ and $P_{Target\text{-}2} = 0.001$, with always $C_{Miss} = C_{FA} = 1$
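The two-point average can be sketched as below. The formula for β and the two target priors come from the slide; the decision rule is an assumption added for the example, namely that a trial is accepted when its log-likelihood ratio is at least log β (the Bayes threshold for that operating point). Function names are hypothetical.

```python
import math


def beta(p_target, c_miss=1.0, c_fa=1.0):
    """beta = (C_FA / C_Miss) * (1 - P_Target) / P_Target."""
    return (c_fa / c_miss) * (1.0 - p_target) / p_target


def dcf_at_prior(target_llrs, nontarget_llrs, p_target):
    """P_Miss + beta * P_FA, with decisions made at threshold log(beta).

    Assumption: thresholding the submitted LLRs at log(beta) stands in
    for the hard decisions that are no longer requested.
    """
    b = beta(p_target)
    threshold = math.log(b)
    p_miss = sum(llr < threshold for llr in target_llrs) / len(target_llrs)
    p_fa = sum(llr >= threshold for llr in nontarget_llrs) / len(nontarget_llrs)
    return p_miss + b * p_fa


def primary_cost(target_llrs, nontarget_llrs):
    """Average of the DCFs at P_Target = 0.01 and P_Target = 0.001."""
    dcf1 = dcf_at_prior(target_llrs, nontarget_llrs, 0.01)
    dcf2 = dcf_at_prior(target_llrs, nontarget_llrs, 0.001)
    return (dcf1 + dcf2) / 2.0
```

Because β is roughly 99 at the first operating point and 999 at the second, a single false alarm outweighs many misses, and a score scale that is well calibrated at one threshold but not the other is penalized at the second.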
Future Possibilities
• SRE12 outcome will determine whether pre-specified targets will be further explored
  • Does this make the problem too easy?
• Artificially added noise and reverb may continue
• HASR12 will indicate whether human-in-the-loop evaluation gains traction
• SREs have become bigger undertakings
  • Fifty or more participating sites
  • Data volume approaching terabytes (as in BEST)
  • Tens or hundreds of millions of trials
  • Schedule could move to every three years