Enhancing Understandable Production of Massive Speech Synthesis Content: Analysis and Improvements

Understandable Production of Massive Synthesis • Brian Langner, Alan W Black • Language Technologies Institute, Carnegie Mellon University

Background • Applications pushing the limits of speech synthesis becoming more common • … • Issues with perceived quality and understandability arise more frequently with more challenging uses • Successful speech applications require understandable output • What can we do to make speech synthesis more understandable, even in challenging applications?

Massive Synthesis • Synthesis of such a large amount of content that typical evaluation methods are impractical • Frequently characterized by continuously generating new content • Examples: • Error reports • Business case summaries • News readers • Weblogs • Often able to simplify task, though not always

Example Content • Weblog content is ideal • Copious amounts available, continually generated • Easy to collect from many weblogs • ... but is it representative? • Use an existing collection: TREC Blog06 corpus • 11 weeks of over 100,000 RSS/Atom feeds from 2005-06 • Consists of homepage + all new permalink pages weekly • Over 750,000 total collected feeds • Includes spam content for realism • Multilingual, though only concerned with English for now

Analyzing Massive Synthesis Content • Preprocess to remove tags and meta content • Resulting corpus is 14GB of “blog” text • Identify word frequency differences from typical English • Try to find “blog-frequent” words unlikely to synthesize well • Flag and target for improvement strategies • Most words fairly normal for English • Frequency differs, but words are not unusual • Most frequent atypical words: “html”, “blog” (27th/28th) • High occurrence of acronyms • “FAQ”, “mp3s”
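The frequency comparison on this slide can be illustrated with a minimal sketch. It assumes two plain-text inputs (blog text and a reference sample of typical English) and ranks words by the ratio of their relative frequencies. The function names, the tokenizing regex, and the `floor` smoothing value are illustrative assumptions, not the authors' method.

```python
import re
from collections import Counter


def relative_freqs(text):
    """Map each lowercased word to its share of all words in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def blog_frequent(blog_text, reference_text, floor=1e-6, top=5):
    """Rank words by how much more frequent they are in the blog text.

    `floor` (an assumed smoothing value) stands in for the reference
    frequency of words that never occur in the reference text.
    """
    blog = relative_freqs(blog_text)
    ref = relative_freqs(reference_text)
    ratios = {w: f / max(ref.get(w, 0.0), floor) for w, f in blog.items()}
    return sorted(ratios, key=ratios.get, reverse=True)[:top]


if __name__ == "__main__":
    blog = "the blog has html and the blog has mp3s"
    reference = "the cat has the hat and the dog"
    print(blog_frequent(blog, reference, top=3))
```

Words ranked highest by such a ratio would be the candidates to flag and check for synthesis problems.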

Common Problems • Prevalence and variety of non-standard words • Technical jargon • Typos / Spelling errors • “the-teh”, “lose-loose”, “voila-viola”, etc. • l33t5p33k • Expressive spelling • “soooooo…..” • Usernames/handles • Must be rendered understandably to be useful • “leet” rather than “el-three-three-tee” • Can group NSW into classes to deal with them
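Grouping non-standard words (NSW) into classes could look roughly like the sketch below. The class names and regular expressions are assumptions chosen for illustration. The slide only says that NSW can be grouped, not which classes were used.

```python
import re


def classify_nsw(token):
    """Assign a token to a rough, hypothetical non-standard-word class."""
    if re.fullmatch(r"[A-Z]{2,}s?", token):
        return "acronym"        # e.g. "FAQ"
    if re.search(r"(.)\1{2,}", token):
        return "expressive"     # a character repeated three or more times
    if re.search(r"[A-Za-z]", token) and re.search(r"[0-9]", token):
        return "digit_mix"      # letters mixed with digits, e.g. "l33t5p33k"
    return "standard"


def collapse_expressive(token):
    """Reduce runs of three or more identical characters to one character."""
    return re.sub(r"(.)\1{2,}", r"\1", token)


if __name__ == "__main__":
    for tok in ["FAQ", "soooooo", "l33t5p33k", "blog"]:
        print(tok, classify_nsw(tok))
    print(collapse_expressive("soooooo"))
```

A pattern this coarse also puts "mp3s" in the `digit_mix` class, so a real system would need exceptions or a lexicon lookup before applying class-specific rules.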

General Improvements • Use formatting and structure to guide synthesis • Title, articles, comments, ads, links, … • Emphasized in text → emphasized in spoken output • Expressive spelling • Handle/ignore formatting problems • Missing HTML tags common • Improperly rendered HTML entities • Don’t say “ampersand hash eight two one two semicolon”
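As a hedged illustration of the formatting cleanup, the sketch below strips tags (including unbalanced ones) and decodes HTML entities with Python's standard `html.unescape`, so that a sequence such as `&#8212;` reaches the synthesizer as a single dash character instead of being spelled out. The helper name `clean_markup` is hypothetical.

```python
import html
import re


def clean_markup(fragment):
    """Remove tags and decode entities before text is sent to synthesis."""
    # Drop anything that looks like a tag, whether or not it is closed later.
    text = re.sub(r"<[^>]*>", " ", fragment)
    # Decode named and numeric entities (e.g. "&amp;", "&#8212;").
    text = html.unescape(text)
    # Collapse the whitespace left behind by removed tags.
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_markup("<p>Kids &amp; blogs &#8212; <b>really"))
```

Tags are removed before entities are decoded so that escaped angle brackets in the text are not mistaken for markup.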

General Improvements • Content summarization • How to present very long content? • Several ways to summarize • Summarize articles and note existence of comments • Summarize articles and comments • Identify number of new articles and comments • More abstract ideas • Subsetting • Speak enough of the content to allow the user to decide to hear more or continue to the next item • Appropriate choice likely depends on user preferences
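The subsetting idea can be sketched as follows: speak only the first few sentences of an article and report how much is left, so the listener can choose to hear more or move on. The sentence splitter is deliberately naive, and the function name, return format, and `max_sentences` default are assumptions.

```python
import re


def preview(article, comments, max_sentences=2):
    """Return the opening sentences plus counts of what was left unspoken."""
    # Naive split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return {
        "spoken": " ".join(sentences[:max_sentences]),
        "remaining_sentences": max(len(sentences) - max_sentences, 0),
        "comment_count": len(comments),
    }


if __name__ == "__main__":
    print(preview("One. Two. Three.", ["first comment", "second comment"]))
```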

General Improvements • Phrase boundaries and prosody • Improved phrase breaks → more understandable synthesis • Effect amplified with informal writing • “word soup” • Multiple voices, non-speech output • Use different voices to segment content • “narrator”, “male commenter”, “female commenter”, etc. • Single voice with multiple styles may work as well • Use non-speech sounds to render some tokens • Laughing for “LOL” rather than trying to pronounce it
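One possible way to plan multiple voices and non-speech output is sketched below. It turns a list of (role, text) items into an ordered rendering plan, choosing a voice per role and replacing selected tokens with a sound event. The role names, voice names, and the `"lol"` to `"laugh"` mapping are hypothetical. No particular synthesizer API is assumed.

```python
# Hypothetical mappings; a real application would configure these.
VOICE_FOR_ROLE = {"article": "narrator", "comment": "commenter"}
SOUND_FOR_TOKEN = {"lol": "laugh"}


def plan_output(items):
    """Build a list of ("speech", voice, text) and ("sound", name, "") steps."""
    plan = []
    for role, text in items:
        voice = VOICE_FOR_ROLE.get(role, "narrator")
        buffer = []
        for token in text.split():
            sound = SOUND_FOR_TOKEN.get(token.strip(".,!?").lower())
            if sound is None:
                buffer.append(token)
                continue
            # Flush pending words, then emit the non-speech sound.
            if buffer:
                plan.append(("speech", voice, " ".join(buffer)))
                buffer = []
            plan.append(("sound", sound, ""))
        if buffer:
            plan.append(("speech", voice, " ".join(buffer)))
    return plan


if __name__ == "__main__":
    for step in plan_output([("article", "New post today."),
                             ("comment", "LOL that is great")]):
        print(step)
```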

Evaluation • Synthesis evaluation is challenging • Typically evaluate independent of domain • Requires human listeners • Slow, expensive • Massive synthesis even harder • Too much content for listeners to evaluate • Evaluating some content likely to help • Especially if it’s chosen based on likelihood to have errors • Key to find as many errors as possible without listeners • Prioritize error correction
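Choosing content by its likelihood of containing errors could be approximated, under strong assumptions, by ranking items on their out-of-lexicon rate. The sketch below uses a plain Python set as a stand-in for the synthesizer's lexicon. The scoring function is an illustrative proxy, not the evaluation metric from the talk.

```python
import re


def oov_rate(text, lexicon):
    """Fraction of tokens in the text that are not in the lexicon."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in lexicon)
    return unknown / len(tokens)


def rank_for_review(texts, lexicon):
    """Order texts so those most likely to contain errors come first."""
    return sorted(texts, key=lambda t: oov_rate(t, lexicon), reverse=True)


if __name__ == "__main__":
    lexicon = {"i", "believe", "this", "happens", "we", "tried", "to"}
    items = ["I believe this happens", "we tried to pwn eachothers"]
    print(rank_for_review(items, lexicon))
```

Listeners would then hear only the top of such a ranking, which keeps the human effort small relative to the amount of content.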

Simple Study • Implement several modifications for “weblog text” • “number-to-letter” rules • Syllable boundaries marked by case – “iTunes” • Lexical entries for common neologisms – “pwn” • Synthesize typical massive synthesis content • Entries from blogs, random Wikipedia article, Blog06 data • Subjects listen to 6 examples, one from each source • Asked to identify which version they prefer, and by how much on a scale of 1-5 • Subjects all speech synthesis experts
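The three modifications listed on this slide can be sketched as a small token normalizer. The digit-to-letter table, the respellings in `LEXICON`, and the order in which the rules apply are all assumptions for illustration. The slides do not give the actual rules used in the study.

```python
import re

# Assumed "number-to-letter" table: 0->o, 1->l, 3->e, 4->a, 5->s, 7->t.
LEET = str.maketrans("013457", "oleast")

# Hypothetical lexical entries (respellings) for common neologisms.
LEXICON = {"pwn": "pone", "mp3s": "em pee threes"}


def normalize_token(token):
    """Apply lexicon lookup, then number-to-letter rules, then case splitting."""
    key = token.lower()
    if key in LEXICON:
        return LEXICON[key]
    # Only rewrite digits in tokens that also contain letters ("l33t", not "2006").
    if re.search(r"[A-Za-z]", token) and re.search(r"[0-9]", token):
        return token.translate(LEET)
    # Treat a lowercase-to-uppercase change as a syllable boundary ("iTunes").
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)


def normalize_text(text):
    """Normalize every whitespace-separated token in the text."""
    return " ".join(normalize_token(token) for token in text.split())


if __name__ == "__main__":
    print(normalize_text("we tried to pwn iTunes with l33t skills"))
```

The lexicon is consulted first so that tokens such as "mp3s" are not mangled by the digit rules.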

Study Examples • STFU Newb • March 14, 2006 8:02 AM • Cyberbullying Report. It's a Microsoft sponsored report talking about intimidation and bullying online. There's a digested version of the survey [PDF]. And don't forget your dose of Cyber Wellness, too. posted by gsb (13 comments total) • Does anyone else belive this just isn't happening? I mean back when I was a kid we tried to pwn eachothers IRC channels, but that was about it. posted by delmoi at 8:13 AM on March 14 • I absolutely believe this happens. Kids are f%*#ing mean. Girls are viscious to each other. I would be more surprised if kids weren't using technology to expand those behaviours than if they are. posted by raedyn at 8:24 AM on March 14 • [PDF] belive pwn eachothers posted by delmoi at 8:13 AM on March 14 f%*#ing posted by raedyn at 8:24 AM on March 14

Results • All subjects always preferred the modified examples • Less consistent agreement in degree of preference • Generally low preference scores • Implies only small improvements over baseline • Average preference score around 2 or 3 • Strong preferences rare • Sample size too small

Discussion • Some fairly simple modifications result in speech perceived as at least slightly better • More changes might show more obvious preferences • Need more detailed information about how the speech was perceived • Anecdotal feedback suggests improved prosody will help significantly • Humans give hour-long lectures that people can understand; how can synthesizers do that?

Future Directions • Implement more understandability improvements • Time constraints, content structure, etc. • Perform a more complete evaluation • Not enough examples/listeners, but encouraging results • Need a more formalized evaluation metric • User feedback within an application with interested users • Hard to find sufficient users who would participate • Design an application to get users? • Web browser that renders content as speech: automatic podcast generator

Questions?
