Evaluation of Corpus-based Synthesizers
The Blizzard Challenge – 2005: Evaluating corpus-based speech synthesis on common datasets (Alan W. Black and Keiichi Tokuda)
Large Scale Evaluation of Corpus-based Synthesizers: Results and Lessons from the Blizzard Challenge 2005 (Christina L. Bennett)
Presented by: Rohit Kumar
What are they Evaluating? • Corpus-based speech synthesis systems • Two primary elements of any such system • Corpus (high-quality speech data) • Approach used to build a Text-to-Speech system • The quality of the Text-to-Speech system built from the corpus is heavily tied to the quality of the speech corpus • How do we evaluate the approach, then? • Use a common corpus (database)
What are they Evaluating? • Quality of the approach • Not considering how good the corpus itself is • Capability to quickly build systems given the corpus • TTS development has evolved from being a science to being a toolkit • Again, not considering the time to create the corpus • Tug of war between the time taken to create a high-quality corpus, manual fine-tuning of the system, and the merit of the approach itself • Reliability of each particular evaluation method (Black & Tokuda) • Reliability of each listener group for evaluation (Black & Tokuda) (Bennett)
Alright. How to Evaluate? • Create common databases • Issues with common databases • Design parameters, size of databases, etc. • Non-technical logistics: cost of creating databases • Using the CMU-ARCTIC databases
Alright. How to Evaluate? • Evaluate different quality measures • Quality is a really broad term • Intelligibility, naturalness, etc. • 5 tests of 3 types • 3 Mean Opinion Score (MOS) tests over different domains • Novels (in-domain), news, conversation • DRT/MRT: phonetically confusable words embedded in sentences • Semantically Unpredictable Sentences (SUS) • Create common databases
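To make the MOS component concrete, here is a minimal sketch (not taken from the papers) of how raw 1–5 listener ratings could be aggregated into per-system, per-domain MOS figures. The record layout and the system/domain names are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rating records: (listener_id, system_id, domain, score on a 1-5 scale).
# The layout is an assumption for illustration, not the challenge's actual data format.
ratings = [
    ("L01", "SysA", "novel", 4),
    ("L01", "SysB", "novel", 3),
    ("L02", "SysA", "news", 5),
    ("L02", "SysB", "news", 2),
]

def mos_by_system_and_domain(ratings):
    """Average the 1-5 opinion scores within each (system, domain) cell."""
    buckets = defaultdict(list)
    for _listener, system, domain, score in ratings:
        buckets[(system, domain)].append(score)
    return {cell: mean(scores) for cell, scores in buckets.items()}

print(mos_by_system_and_domain(ratings))
# e.g. {('SysA', 'novel'): 4, ('SysB', 'novel'): 3, ('SysA', 'news'): 5, ('SysB', 'news'): 2}
```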
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • A 7th system added: real human speech
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2 databases released in Phase 1 to develop approaches (practice databases) • Another 2 databases released in Phase 2 (with time-bounded submission)
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2-phase challenge • Web-based evaluation • Participants choose a test and complete it • Can do the whole set of tests across multiple sessions • Evaluates 100 sentences per participant (one way such an assignment could be balanced is sketched below)
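The papers do not spell out the exact sentence-to-system assignment used in the web interface; the following is a hypothetical sketch of one simple way to balance a listener's 100 sentences across the 7 systems (6 submissions plus real speech). Everything beyond those two counts, which come from the slides, is an assumption.

```python
import random

NUM_SYSTEMS = 7              # 6 submitted systems + real human speech (from the slides)
SENTENCES_PER_LISTENER = 100 # stated in the slides

def assign_sentences(sentence_ids, seed=0):
    """Hypothetical balanced assignment: shuffle the listener's sentences and
    rotate through the systems so each system is heard roughly 100/7 times."""
    rng = random.Random(seed)
    order = list(sentence_ids)
    rng.shuffle(order)
    return {sent: idx % NUM_SYSTEMS for idx, sent in enumerate(order)}

assignment = assign_sentences(range(SENTENCES_PER_LISTENER), seed=42)
# Each value 0..6 indexes the system whose synthesis of that sentence the listener rates.
```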
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2-phase challenge • Web-based evaluation • Different types of listeners • Speech experts, volunteers, US undergrads • Special incentive to take the test a 2nd time
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2-phase challenge • Web-based evaluation • Different types of listeners • Any questions about the evaluation setup?
Fine. So what did they get? • Evaluation of 6 systems + 1 real speech • Observations: • Real speech consistently best • A lot of inconsistency across tests, but agreement on the best system • Listener groups V & U very similar for the MOS tests (a sketch of how such agreement could be quantified follows)
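One way to make "listener groups V & U very similar" quantitative is a rank correlation between the two groups' per-system MOS scores. This is a minimal sketch under invented example numbers, not the statistic reported in the papers.

```python
def rankdata(values):
    """Rank values from 1..n, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-system MOS from listener groups V and U (systems in the same order).
mos_group_v = [3.1, 2.4, 3.8, 2.9, 3.3, 2.7, 4.6]
mos_group_u = [3.0, 2.5, 3.7, 3.0, 3.2, 2.6, 4.7]
print(spearman(mos_group_v, mos_group_u))  # close to 1.0 -> the groups rank systems alike
```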
Additional Agenda • Comparing voices • Exit poll • Votes for voices • Inconsistencies between votes and scores • Consistency of voice votes across listener groups
Discussion • Numbers given are all averages • No variance figures • Consistency of scores for each system? • Ordering of tests: participant's choice • Measuring speed of development? • Nothing in the evaluation method itself measures speed of development • Some of the participants who submitted papers about their systems in this challenge did give those figures • Also, no control over number of man-hours or computational power • The approach is tested only on quality of speech • Issues such as how much computational effort it takes are not looked at • Web-based evaluation (Black & Tokuda) • Uncontrolled random variables: participant's environment, network connectivity • Ensuring usage of the common database (and no additional corpus) • Voice conversion: similarity tests (Black & Tokuda) • Word Error Rate calculation for phonetically confusable pairs? (a standard WER computation is sketched below) • Non-native participants' effect on word error rates (Bennett) • Homophone words (bean/been) (Bennett) • Looking back at what they were evaluating
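The WER question above concerns scoring transcribed listener responses; a standard way to compute word error rate is Levenshtein (edit) distance over word tokens. This is a generic sketch, not the scoring actually used in the challenge. Note that plain string matching marks a homophone such as bean/been as an error even though it sounds identical, which is the kind of case the Bennett discussion points at.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed with Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# Hypothetical SUS-style example: plain string matching penalizes the homophone.
print(word_error_rate("the bean was cold", "the been was cold"))  # 0.25
```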