
Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005


Presentation Transcript


  1. Large Scale Evaluation of Corpus-based Synthesizers: The Blizzard Challenge 2005. Christina Bennett, Language Technologies Institute, Carnegie Mellon University. Student Research Seminar, September 23, 2005

  2. What is corpus-based speech synthesis? [Slide diagram] Transcript + voice talent speech = corpus; the speech synthesizer combines the corpus with new text to produce new speech.
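
A minimal sketch of the unit-selection idea behind corpus-based synthesis, to make the diagram above concrete: the aligned transcript and voice-talent recordings are cut into a labeled unit inventory (the corpus), and new text is rendered by concatenating matching units. The names and the single-best-unit choice here are illustrative assumptions, not any participant's actual system.

```python
# Illustrative sketch only: real corpus-based systems use richer linguistic
# features, target/join costs, and signal smoothing at concatenation points.
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str        # phonetic label from the transcript/audio alignment
    samples: list     # audio samples cut from the voice-talent recording

def build_corpus(aligned_segments):
    """Transcript + voice talent speech -> corpus: index recorded units by phone label."""
    corpus = {}
    for phone, samples in aligned_segments:
        corpus.setdefault(phone, []).append(Unit(phone, samples))
    return corpus

def synthesize(corpus, target_phones):
    """Corpus + new text (given as a phone sequence) -> new speech."""
    audio = []
    for phone in target_phones:
        candidates = corpus.get(phone, [])
        if not candidates:
            continue                         # a real system backs off to similar units
        audio.extend(candidates[0].samples)  # naive pick; real systems minimize costs
    return audio
```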

  3. Need for Speech Synthesis Evaluation Motivation • Determine effectiveness of our “improvements” • Closer comparison of various corpus-based techniques • Learn about users' preferences • Healthy competition promotes progress and brings attention to the field

  4. Blizzard Challenge Goals Motivation • Compare methods across systems • Remove effects of different data by providing & requiring the same data to be used • Establish a standard for repeatable evaluations in the field • [My goal:] Bring the need for improved speech synthesis evaluation to the forefront of the community (positioning CMU as a leader in this regard)

  5. Blizzard Challenge: Overview Challenge • Released first voices and solicited participation in 2004 • Additional voices and test sentences released Jan. 2005 • 1 - 2 weeks allowed to build voices & synthesize sentences • 1000 samples from each system (50 sentences x 5 tests x 4 voices)

  6. Evaluation Methods Challenge • Mean Opinion Score (MOS) • Evaluate sample on a numerical scale • Modified Rhyme Test (MRT) • Intelligibility test with tested word within a carrier phrase • Semantically Unpredictable Sentences (SUS) • Intelligibility test preventing listeners from using knowledge to predict words
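
As a concrete illustration of the MOS method, the sketch below averages 1-to-5 listening-test ratings into a per-system score. The response format is a hypothetical tuple layout, not the Challenge's actual data files or analysis scripts.

```python
# Hypothetical MOS aggregation: mean of valid 1-5 ratings per system.
from collections import defaultdict
from statistics import mean

def mos_by_system(responses):
    """responses: iterable of (system_id, rating) pairs, rating on a 1-5 scale."""
    ratings = defaultdict(list)
    for system_id, rating in responses:
        if 1 <= rating <= 5:          # drop out-of-range / unusable responses
            ratings[system_id].append(rating)
    return {system: mean(scores) for system, scores in ratings.items()}

print(mos_by_system([("A", 4), ("A", 5), ("B", 2), ("B", 3)]))  # {'A': 4.5, 'B': 2.5}
```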

  7. Challenge setup: Tests Challenge • 5 tests from 5 genres • 3 MOS tests (1 to 5 scale) • News, prose, conversation • 2 “type what you hear” tests • MRT – “Now we will say ___ again” • SUS – ‘det-adj-noun-verb-det-adj-noun’ • 50 sentences collected from each system, 20 selected for use in testing
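
To show how the SUS test works, here is a toy generator that fills the det-adj-noun-verb-det-adj-noun pattern from small word lists. The word lists are invented for illustration and are not the ones used in the Challenge.

```python
# Toy semantically-unpredictable-sentence (SUS) generator for the pattern above.
import random

WORDS = {
    "det":  ["the", "a"],
    "adj":  ["green", "loud", "narrow"],
    "noun": ["table", "river", "window"],
    "verb": ["follows", "paints", "lifts"],
}
PATTERN = ["det", "adj", "noun", "verb", "det", "adj", "noun"]

def make_sus(rng=random):
    """Return one grammatical but semantically unpredictable sentence."""
    return " ".join(rng.choice(WORDS[slot]) for slot in PATTERN) + "."

print(make_sus())  # e.g. "the loud river paints a narrow window."
```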

  8. Challenge setup: Systems Challenge • 6 systems (random ID A-F): • CMU • Delaware • Edinburgh (UK) • IBM • MIT • Nitech (Japan) • Plus 1: “Team Recording Booth” (ID X) • Natural examples from the 4 voice talents

  9. Challenge setup: Voices Challenge • CMU ARCTIC databases • American English; 2 male, 2 female • 2 from initial release • bdl (m) • slt (f) • 2 new DBs released for quick build • rms (m) • clb (f)

  10. Challenge setup: Listeners Challenge • Three listener groups: • S – speech synthesis experts (50) • 10 requested from each participating site • V – volunteers (60, 97 registered*) • Anyone online • U – native US English speaking undergraduates (58, 67 registered*) • Solicited and paid for participation *as of 4/14/05

  11. Challenge setup: Interface Challenge • Entirely online http://www.speech.cs.cmu.edu/blizzard/register-R.html http://www.speech.cs.cmu.edu/blizzard/login.html • Register/login with email address • Keeps track of progress through tests • Can stop and return to tests later • Feedback questionnaire at end of tests

  12. Fortunately, Team X is the clear “winner” Results

  13. Team D consistently outperforms others Results

  14. Speech experts are biased toward “optimistic” ratings Results

  15. Speech experts are, in fact, better experts Results

  16. Voice results: Listener preference Results • slt is most liked, followed by rms • Type S: • slt - 43.48% of votes cast; rms - 36.96% • Type V: • slt - 50% of votes cast; rms - 28.26% • Type U: • slt - 47.27% of votes cast; rms - 34.55% • But, preference does not necessarily match test performance…

  17. Voice results: Test performance Results Female voices - slt

  18. Voice results: Test performance Results Female voices - clb

  19. Voice results: Test performance Results Male voices - rms

  20. Voice results: Test performance Results Male voices - bdl

  21. Voice results: Natural examples Results What makes natural rms different?

  22. Voice results: By system Results • Only system B consistent across listener types: (slt best MOS, rms best WER) • Most others showed group trends, i.e. (with the exception of B above and F*) • S: rms always best WER, often best MOS • V: slt usually best MOS, clb usually best WER • U: clb usually best MOS and always best WER → Again, people clearly don’t prefer the voices they most easily understand
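
The WER figures above refer to word error rate on the type-in (MRT/SUS) tests. For reference, the sketch below computes the textbook WER by word-level edit distance; it is not the Challenge's actual scoring script.

```python
# Word error rate: (substitutions + deletions + insertions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("now we will say ring again", "now we will say rain again"))  # ~0.167
```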

  23. Lessons learned: Listeners Lessons • Reasons to exclude listener data: • Incomplete test, failure to follow directions, inability to respond (type-in), unusable responses • Type-in tests very hard to process automatically: • Homophones, misspellings/typos, dialectal differences, “smart” listeners • Group differences: • V most variable, U most controlled, S least problematic but not representative
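
Because of the homophone, typo, and dialect issues listed above, type-in responses typically need normalization before WER scoring. The sketch below shows one possible cleanup step; the homophone table is a tiny invented example, not the mapping actually used for the Challenge.

```python
# Hypothetical normalization of a typed listener response before scoring.
import re

HOMOPHONES = {"they're": "their", "there": "their", "two": "to", "too": "to"}

def normalize(response: str) -> list:
    """Lowercase, strip punctuation, and collapse a few known homophones."""
    words = re.sub(r"[^a-z'\s]", " ", response.lower()).split()
    return [HOMOPHONES.get(w, w) for w in words]

print(normalize("They're TWO kids!"))  # ['their', 'to', 'kids']
```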

  24. Lessons learned: Test design Lessons • Feedback re tests: • MOS: Give examples to calibrate scale (ordering schema); use multiple scales (lay-people?) • Type-in: Warn about SUS; hard to remember SUS; words too unusual/hard to spell • Uncontrollable user test setup • Pros & Cons to having natural examples in the mix • Analyzing user response (+), differences in delivery style (-), availability of voice talent (?)

  25. Goals Revisited Lessons • One methodology clearly outshone the rest • All systems used the same data, allowing for actual comparison of systems • A standard for repeatable evaluations in the field was established • [My goal:] Brought attention to the need for better speech synthesis evaluation (while positioning CMU as the experts)

  26. For the Future Future • (Bi-)Annual Blizzard Challenge • Introduced at Interspeech 2005 special session • Improve design of tests for easier analysis post-evaluation • Encourage more sites to submit their systems! • More data resources (problematic for the commercial entities) • Expand types of systems accepted (& therefore test types) • e.g. voice conversion
