Evaluation of Corpus-based Synthesizers
The Blizzard Challenge – 2005: Evaluating corpus-based speech synthesis on common datasets (Alan W. Black and Keiichi Tokuda)
Large Scale Evaluation of Corpus-based Synthesizers: Results and Lessons from the Blizzard Challenge 2005 (Christina L. Bennett)
Presented by: Rohit Kumar
What are they Evaluating? • Corpus-based speech synthesis systems • Two primary elements of any such system • Corpus (high-quality speech data) • Approach used to build a Text-to-Speech system • The quality of the Text-to-Speech system built from the corpus is heavily tied to the quality of the speech corpus • How do we evaluate the approach, then? • Use a common corpus (database)
What are they Evaluating? • Quality of the approach • Not considering how good the corpus itself is • Capability to quickly build systems given the corpus • TTS development has evolved from being a science to being a toolkit • Again, not considering the time to create the corpus • Tug of war between the time taken to create a high-quality corpus, manual fine-tuning of the system, and the merit of the approach itself • Reliability of each particular evaluation method (Black & Tokuda) • Reliability of each listener group for evaluation (Black & Tokuda) (Bennett)
Alright. How to Evaluate? • Create common databases • Issues with common databases • Design parameters, size of databases, etc. • Non-technical logistics: cost of creating databases • Using the CMU-ARCTIC databases
Alright. How to Evaluate? • Evaluate different quality measures • Quality is a really broad term • Intelligibility, naturalness, etc. • 5 tests of 3 types • 3 Mean Opinion Score (MOS) tests over different domains • Novels (in-domain), news, conversation • DRT/MRT: phonetically confusable words embedded in sentences • Semantically Unpredictable Sentences (SUS) • Create common databases
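To make the MOS component concrete, here is a minimal sketch (not taken from the papers) of how raw 1–5 listener ratings could be aggregated into per-system, per-domain MOS figures. The record layout and the system/domain names are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rating records: (listener_id, system_id, domain, score on a 1-5 scale).
# The layout is an assumption for illustration, not the challenge's actual data format.
ratings = [
    ("L01", "SysA", "novel", 4),
    ("L01", "SysB", "novel", 3),
    ("L02", "SysA", "news", 5),
    ("L02", "SysB", "news", 2),
]

def mos_by_system_and_domain(ratings):
    """Average the 1-5 opinion scores within each (system, domain) cell."""
    buckets = defaultdict(list)
    for _listener, system, domain, score in ratings:
        buckets[(system, domain)].append(score)
    return {cell: mean(scores) for cell, scores in buckets.items()}

print(mos_by_system_and_domain(ratings))
# e.g. {('SysA', 'novel'): 4, ('SysB', 'novel'): 3, ('SysA', 'news'): 5, ('SysB', 'news'): 2}
```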
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • A 7th system added: real human speech
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2 databases released in Phase 1 to develop approaches (practice databases) • Another 2 databases released in Phase 2 (with time-bounded submission)
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2-phase challenge • Web-based evaluation • Participants choose a test and complete it • Can do the whole set of tests across multiple sessions • Evaluates 100 sentences per participant (one way such an assignment could be balanced is sketched below)
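The papers do not spell out the exact sentence-to-system assignment used in the web interface; the following is a hypothetical sketch of one simple way to balance a listener's 100 sentences across the 7 systems (6 submissions plus real speech). Everything beyond those two counts, which come from the slides, is an assumption.

```python
import random

NUM_SYSTEMS = 7              # 6 submitted systems + real human speech (from the slides)
SENTENCES_PER_LISTENER = 100 # stated in the slides

def assign_sentences(sentence_ids, seed=0):
    """Hypothetical balanced assignment: shuffle the listener's sentences and
    rotate through the systems so each system is heard roughly 100/7 times."""
    rng = random.Random(seed)
    order = list(sentence_ids)
    rng.shuffle(order)
    return {sent: idx % NUM_SYSTEMS for idx, sent in enumerate(order)}

assignment = assign_sentences(range(SENTENCES_PER_LISTENER), seed=42)
# Each value 0..6 indexes the system whose synthesis of that sentence the listener rates.
```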
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2-phase challenge • Web-based evaluation • Different types of listeners • Speech experts, volunteers, US undergrads • Special incentive to take the test a 2nd time
Alright. How to Evaluate? • Evaluate different quality measures • 5 tests of 3 types • Create common databases • 6 teams = 6 systems: different approaches • 2-phase challenge • Web-based evaluation • Different types of listeners • Any questions about the evaluation setup?
Fine. So what did they get? • Evaluation of 6 systems + 1 real speech • Observations: • Real speech consistently best • A lot of inconsistency across tests, but agreement on the best system • Listener groups V & U very similar for the MOS tests (a sketch of how such agreement could be quantified follows)
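One way to make "listener groups V & U very similar" quantitative is a rank correlation between the two groups' per-system MOS scores. This is a minimal sketch under invented example numbers, not the statistic reported in the papers.

```python
def rankdata(values):
    """Rank values from 1..n, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-system MOS from listener groups V and U (systems in the same order).
mos_group_v = [3.1, 2.4, 3.8, 2.9, 3.3, 2.7, 4.6]
mos_group_u = [3.0, 2.5, 3.7, 3.0, 3.2, 2.6, 4.7]
print(spearman(mos_group_v, mos_group_u))  # close to 1.0 -> the groups rank systems alike
```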
Additional Agenda • Comparing voices • Exit poll • Votes for voices • Inconsistencies between votes and scores • Consistency of voice votes across listener groups
Discussion • Numbers given are all averages • No variance figures • Consistency of scores for each system? • Ordering of tests: participant's choice • Measuring speed of development? • Nothing in the evaluation method itself measures speed of development • Some of the participants who submitted papers about their systems in this challenge did give those figures • Also, no control over number of man-hours or computational power • The approach is tested only on quality of speech • Issues such as how much computational effort it takes are not looked at • Web-based evaluation (Black & Tokuda) • Uncontrolled random variables: participant's environment, network connectivity • Ensuring usage of the common database (and no additional corpus) • Voice conversion: similarity tests (Black & Tokuda) • Word Error Rate calculation for phonetically confusable pairs? (a standard WER computation is sketched below) • Non-native participants' effect on word error rates (Bennett) • Homophone words (bean/been) (Bennett) • Looking back at what they were evaluating
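The WER question above concerns scoring transcribed listener responses; a standard way to compute word error rate is Levenshtein (edit) distance over word tokens. This is a generic sketch, not the scoring actually used in the challenge. Note that plain string matching marks a homophone such as bean/been as an error even though it sounds identical, which is the kind of case the Bennett discussion points at.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed with Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# Hypothetical SUS-style example: plain string matching penalizes the homophone.
print(word_error_rate("the bean was cold", "the been was cold"))  # 0.25
```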