Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005
Christina Bennett, Language Technologies Institute, Carnegie Mellon University
Student Research Seminar, September 23, 2005
What is corpus-based speech synthesis?
[Slide diagram: voice talent speech + transcript = corpus; a speech synthesizer built from that corpus plus new text = new speech]
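In rough terms, and purely as a toy sketch rather than any participant's actual system, a corpus-based (concatenative) synthesizer stores recorded speech units indexed by the transcript and renders new text by selecting and concatenating those units:

```python
# Toy illustration of corpus-based (concatenative) synthesis.
# The "corpus" pairs transcript words with recorded audio units;
# new speech is produced by looking up and concatenating those units.
# Real systems work on sub-word units with join costs and prosody models.

corpus = {
    "hello":  b"<audio:hello>",   # stand-ins for waveform snippets
    "new":    b"<audio:new>",
    "speech": b"<audio:speech>",
}

def synthesize(text: str) -> bytes:
    """Concatenate corpus units for each word of the new text."""
    units = []
    for word in text.lower().split():
        if word not in corpus:
            raise KeyError(f"no unit recorded for '{word}'")
        units.append(corpus[word])
    return b"".join(units)

print(synthesize("hello new speech"))  # b'<audio:hello><audio:new><audio:speech>'
```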
Need for Speech Synthesis Evaluation [Motivation]
• Determine effectiveness of our "improvements"
• Closer comparison of various corpus-based techniques
• Learn about users' preferences
• Healthy competition promotes progress and brings attention to the field
Blizzard Challenge Goals [Motivation]
• Compare methods across systems
• Remove the effect of differing data by providing the same data and requiring its use
• Establish a standard for repeatable evaluations in the field
• [My goal:] Bring the need for improved speech synthesis evaluation to the forefront of the community (positioning CMU as a leader in this regard)
Blizzard Challenge: Overview [Challenge]
• Released first voices and solicited participation in 2004
• Additional voices and test sentences released Jan. 2005
• 1-2 weeks allowed to build voices & synthesize sentences
• 1000 samples from each system (50 sentences x 5 tests x 4 voices)
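The sample count follows directly from the figures above:

```python
# Samples each system submitted, from the figures on the slide above.
sentences_per_test = 50
tests = 5    # 3 MOS genres + MRT + SUS
voices = 4   # bdl, slt, rms, clb
print(sentences_per_test * tests * voices)  # 1000
```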
Evaluation Methods [Challenge]
• Mean Opinion Score (MOS)
  • Evaluate a sample on a numerical scale
• Modified Rhyme Test (MRT)
  • Intelligibility test with the tested word within a carrier phrase
• Semantically Unpredictable Sentences (SUS)
  • Intelligibility test preventing listeners from using knowledge to predict words
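As a rough illustration of how the two kinds of scores are derived from listener responses (the data and function names below are hypothetical, not the actual Blizzard scoring scripts): MOS averages the numerical ratings, while the intelligibility tests score the typed-in response against the reference sentence, e.g. as word error rate.

```python
from statistics import mean

def mos_score(ratings):
    """Mean Opinion Score: average of the 1-5 ratings given to a system."""
    return mean(ratings)

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Score a typed-in MRT/SUS response via word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(mos_score([4, 3, 5, 4]))                       # 4.0
print(word_error_rate("now we will say bat again",
                      "now we will say mat again"))  # 1 error in 6 words ~ 0.167
```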
Challenge setup: Tests [Challenge]
• 5 tests from 5 genres
• 3 MOS tests (1 to 5 scale)
  • News, prose, conversation
• 2 "type what you hear" tests
  • MRT – "Now we will say ___ again"
  • SUS – 'det-adj-noun-verb-det-adj-noun'
• 50 sentences collected from each system, 20 selected for use in testing
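For example, an SUS item can be generated by filling that det-adj-noun-verb-det-adj-noun template from word lists, so sentence context gives no help in guessing a word. The word lists below are made up for illustration; they are not the Blizzard material.

```python
import random

# Hypothetical word lists; the actual Blizzard SUS material differed.
dets  = ["the", "a"]
adjs  = ["green", "loud", "soft", "tall"]
nouns = ["chair", "river", "cloud", "spoon"]
verbs = ["paints", "follows", "lifts", "hears"]

def make_sus(rng=random):
    """Fill the det-adj-noun-verb-det-adj-noun template with random words."""
    return " ".join([
        rng.choice(dets), rng.choice(adjs), rng.choice(nouns),
        rng.choice(verbs),
        rng.choice(dets), rng.choice(adjs), rng.choice(nouns),
    ])

print(make_sus())  # e.g. "the soft cloud lifts a tall spoon"
```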
Challenge setup: Systems [Challenge]
• 6 systems (random ID A-F):
  • CMU
  • Delaware
  • Edinburgh (UK)
  • IBM
  • MIT
  • Nitech (Japan)
• Plus 1: "Team Recording Booth" (ID X)
  • Natural examples from the 4 voice talents
Challenge setup: Voices [Challenge]
• CMU ARCTIC databases
  • American English; 2 male, 2 female
• 2 from initial release
  • bdl (m)
  • slt (f)
• 2 new DBs released for quick build
  • rms (m)
  • clb (f)
Challenge setup: Listeners [Challenge]
• Three listener groups:
  • S – speech synthesis experts (50)
    • 10 requested from each participating site
  • V – volunteers (60; 97 registered*)
    • Anyone online
  • U – native US English speaking undergraduates (58; 67 registered*)
    • Solicited and paid for participation
*as of 4/14/05
Challenge setup: Interface [Challenge]
• Entirely online
  • http://www.speech.cs.cmu.edu/blizzard/register-R.html
  • http://www.speech.cs.cmu.edu/blizzard/login.html
• Register/login with email address
• Keeps track of progress through tests
  • Can stop and return to tests later
• Feedback questionnaire at end of tests
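A minimal sketch of the kind of resumable progress tracking such an online test interface needs, assuming a simple JSON file keyed by the listener's email address (hypothetical; not the actual Blizzard site code):

```python
import json
from pathlib import Path

PROGRESS_FILE = Path("progress.json")  # hypothetical storage location

def load_progress() -> dict:
    """Map listener email -> list of completed test IDs."""
    if PROGRESS_FILE.exists():
        return json.loads(PROGRESS_FILE.read_text())
    return {}

def mark_done(email: str, test_id: str) -> None:
    """Record a finished test so the listener can stop and resume later."""
    progress = load_progress()
    done = set(progress.get(email, []))
    done.add(test_id)
    progress[email] = sorted(done)
    PROGRESS_FILE.write_text(json.dumps(progress, indent=2))

mark_done("listener@example.com", "MOS-news-slt")  # hypothetical test ID
print(load_progress())
```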
Voice results: Listener preference [Results]
• slt is most liked, followed by rms
  • Type S: slt – 43.48% of votes cast; rms – 36.96%
  • Type V: slt – 50% of votes cast; rms – 28.26%
  • Type U: slt – 47.27% of votes cast; rms – 34.55%
• But preference does not necessarily match test performance…
Voice results: Test performance [Results]
[Charts: per-system test performance for the female voices (slt, clb) and the male voices (rms, bdl)]
Voice results: Natural examples [Results]
• What makes natural rms different?
Voice results: By system [Results]
• Only system B consistent across listener types (slt best MOS, rms best WER)
• Most others showed group trends (with the exception of B above and F*):
  • S: rms always best WER, often best MOS
  • V: slt usually best MOS, clb usually best WER
  • U: clb usually best MOS and always best WER
• Again, people clearly don't prefer the voices they most easily understand
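The pattern above can be checked mechanically: for each listener group, the voice with the highest MOS need not be the voice with the lowest WER. The numbers below are dummy placeholders purely to illustrate the comparison, not the actual Blizzard results.

```python
# Dummy per-group summaries (NOT the actual Blizzard numbers), only to
# show that the best-MOS voice and the best-WER voice can differ.
results = {
    "S": {"rms": {"mos": 3.4, "wer": 0.18}, "slt": {"mos": 3.6, "wer": 0.22}},
    "V": {"slt": {"mos": 3.7, "wer": 0.27}, "clb": {"mos": 3.3, "wer": 0.21}},
    "U": {"clb": {"mos": 3.8, "wer": 0.15}, "slt": {"mos": 3.5, "wer": 0.20}},
}

for group, voices in results.items():
    best_mos = max(voices, key=lambda v: voices[v]["mos"])
    best_wer = min(voices, key=lambda v: voices[v]["wer"])  # lower is better
    print(f"{group}: best MOS = {best_mos}, best WER = {best_wer}")
```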
Lessons learned: Listeners [Lessons]
• Reasons to exclude listener data:
  • Incomplete test, failure to follow directions, inability to respond (type-in), unusable responses
• Type-in tests very hard to process automatically:
  • Homophones, misspellings/typos, dialectal differences, "smart" listeners
• Group differences:
  • V most variable, U most controlled, S least problematic but not representative
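A sketch of the kind of response normalization that type-in answers need before automatic scoring; the homophone and typo tables here are illustrative stand-ins, not the lists actually used.

```python
# Illustrative normalization of type-in responses before WER scoring.
# The homophone and typo tables below are made-up examples.
HOMOPHONES   = {"they're": "their", "there": "their", "too": "to", "two": "to"}
COMMON_TYPOS = {"teh": "the", "recieve": "receive"}

def normalize(response: str) -> list:
    """Lowercase, strip simple punctuation, and fold typos/homophones."""
    words = response.lower().replace(",", " ").replace(".", " ").split()
    fixed = []
    for w in words:
        w = COMMON_TYPOS.get(w, w)
        w = HOMOPHONES.get(w, w)
        fixed.append(w)
    return fixed

print(normalize("Teh dog is there."))  # ['the', 'dog', 'is', 'their']
```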
Lessons learned: Test design [Lessons]
• Feedback on the tests:
  • MOS: give examples to calibrate the scale (ordering schema); use multiple scales (for lay-people?)
  • Type-in: warn about SUS; SUS are hard to remember; words too unusual/hard to spell
• Uncontrollable user test setup
• Pros & cons of having natural examples in the mix:
  • Analyzing user response (+), differences in delivery style (-), availability of voice talent (?)
Goals Revisited [Lessons]
• One methodology clearly outshone the rest
• All systems used the same data, allowing for actual comparison of systems
• A standard for repeatable evaluations in the field was established
• [My goal:] Brought attention to the need for better speech synthesis evaluation (while positioning CMU as the experts)
For the Future [Future]
• (Bi-)Annual Blizzard Challenge
  • Introduced at Interspeech 2005 special session
• Improve design of tests for easier analysis post-evaluation
• Encourage more sites to submit their systems!
• More data resources (problematic for the commercial entities)
• Expand types of systems accepted (& therefore test types)
  • e.g. voice conversion