Understandable Production of Massive Synthesis Brian Langner, Alan W Black Language Technologies Institute Carnegie Mellon University
Background • Applications pushing the limits of speech synthesis becoming more common • … • Issues with perceived quality and understandability arise more frequently with more challenging uses • Successful speech applications require understandable output • What can we do to make speech synthesis more understandable, even in challenging applications?
Massive Synthesis • Synthesis of such a large amount of content that typical evaluation methods are impractical • Frequently characterized by continuously generating new content • Examples: • Error reports • Business case summaries • News readers • Weblogs • Often able to simplify task, though not always
Example Content • Weblog content is ideal • Copious amounts available, continually generated • Easy to collect from many weblogs • ... but is it representative? • Use an existing collection: TREC Blog06 corpus • 11 weeks of over 100,000 RSS/Atom feeds from 2005-06 • Consists of homepage + all new permalink pages weekly • Over 750,000 total collected feeds • Includes spam content for realism • Multilingual, though only concerned with English for now
Analyzing Massive Synthesis Content • Preprocess to remove tags and meta content • Resulting corpus is 14GB of “blog” text • Identify word frequency differences from typical English • Try to find “blog-frequent” words unlikely to synthesize well • Flag and target for improvement strategies • Most words fairly normal for English • Frequencies differ, but the words are not unusual • Most frequent atypical words: “html”, “blog” (27th/28th) • High occurrence of acronyms • “FAQ”, “mp3s”
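One way to identify word frequency differences from typical English is to compare each word's relative frequency in the blog text against a reference sample. The sketch below is an illustration only, not the authors' method: the function names, the `min_ratio` threshold, and the `floor` used for words missing from the reference are all assumptions.

```python
import re
from collections import Counter

def relative_freqs(text):
    """Return each lowercase word's share of all word tokens in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def blog_frequent(blog_text, reference_text, min_ratio=2.0, floor=1e-6):
    """Words at least min_ratio times more frequent in blog_text than in
    reference_text, most over-represented first.

    min_ratio and floor are illustrative values, not tuned ones.
    """
    blog = relative_freqs(blog_text)
    ref = relative_freqs(reference_text)
    # floor avoids division by zero for words absent from the reference.
    ratios = {w: f / max(ref.get(w, 0.0), floor) for w, f in blog.items()}
    flagged = [w for w, r in ratios.items() if r >= min_ratio]
    return sorted(flagged, key=lambda w: ratios[w], reverse=True)
```

Words returned by `blog_frequent` would then be candidates to flag and check for synthesis problems.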
Common Problems • Prevalence and variety of non-standard words • Technical jargon • Typos / Spelling errors • “the-teh”, “lose-loose”, “voila-viola”, etc. • l33t5p33k • Expressive spelling • “soooooo…..” • Usernames/handles • Must be rendered understandably to be useful • “leet” rather than “el-three-three-tee” • Can group NSW into classes to deal with them
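Grouping non-standard words into classes could start from a few surface patterns. The classes and regular expressions below are hypothetical and deliberately coarse; the slides do not specify which classes were used.

```python
import re

def classify_token(token):
    """Assign a token to a coarse non-standard-word class.

    The class names are illustrative assumptions, not the authors' taxonomy.
    """
    if re.fullmatch(r"[A-Z]{2,}", token):
        return "acronym"        # e.g. "FAQ"
    if re.search(r"([A-Za-z])\1{2,}", token):
        return "expressive"     # a letter repeated three or more times
    if re.search(r"[A-Za-z]", token) and re.search(r"\d", token):
        return "alphanumeric"   # letters mixed with digits, e.g. leetspeak
    return "standard"
```

Each class can then be routed to its own handling rule, so that a leetspeak token and an acronym are not pronounced with the same strategy.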
General Improvements • Use formatting and structure to guide synthesis • Title, articles, comments, ads, links, … • Emphasized in text → emphasized in spoken output • Expressive spelling • Handle/ignore formatting problems • Missing HTML tags common • Improperly rendered HTML entities • Don’t say “ampersand hash eight two one two semicolon”
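Improperly rendered entities and stray tags can be cleaned before the text reaches the synthesizer. This is a minimal sketch using Python's standard `html.unescape`; the regex-based tag stripping is a crude assumption that tolerates missing closing tags but is not a full HTML parser, and `clean_markup` is a name chosen for illustration.

```python
import html
import re

def clean_markup(raw):
    """Strip tags and decode entities so markup is never read aloud."""
    # Remove tags first, so escaped text such as "&lt;b&gt;" survives as literal text.
    no_tags = re.sub(r"<[^>]*>", " ", raw)
    # Decode named and numeric entities, e.g. "&amp;" -> "&", "&#8212;" -> U+2014.
    decoded = html.unescape(no_tags)
    # Collapse the whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", decoded).strip()
```

With this step, the entity `&#8212;` reaches the synthesizer as a single dash character rather than as seven spoken symbols.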
General Improvements • Content summarization • How to present very long content? • Several ways to summarize • Summarize articles and note existence of comments • Summarize articles and comments • Identify number of new articles and comments • More abstract ideas • Subsetting • Speak enough of the content to allow the user to decide to hear more or continue to the next item • Appropriate choice likely depends on user preferences
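Subsetting can be as simple as speaking the opening sentences of an item and reporting how much remains. The sketch below assumes a naive punctuation-based sentence splitter and a hypothetical `max_sentences` setting; a real system would likely need a more robust splitter for informal text.

```python
import re

def lead_in(text, max_sentences=2):
    """Return the first few sentences and the number left unspoken."""
    # Naive split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    spoken = " ".join(sentences[:max_sentences])
    remaining = max(len(sentences) - max_sentences, 0)
    return spoken, remaining
```

The remaining count could drive a prompt that lets the user choose between hearing more and moving to the next item.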
General Improvements • Phrase boundaries and prosody • Improved phrase breaks → more understandable synthesis • Effect amplified with informal writing • “word soup” • Multiple voices, non-speech output • Use different voices to segment content • “narrator”, “male commenter”, “female commenter”, etc. • Single voice with multiple styles may work as well • Use non-speech sounds to render some tokens • Laughing for “LOL” rather than trying to pronounce it
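The voice and non-speech ideas can be sketched as a planning step that runs before synthesis. Everything named here is an assumption for illustration: the role labels, the voice names, the `laugh` sound event, and the event tuple format do not correspond to any particular synthesizer's API.

```python
VOICES = {"article": "narrator", "comment": "commenter"}  # hypothetical voice names
SOUNDS = {"lol": "laugh"}                                 # hypothetical sound events

def render_plan(segments):
    """Turn (role, text) pairs into ("say", voice, text) and ("play", sound) events."""
    plan = []
    for role, text in segments:
        voice = VOICES.get(role, "narrator")
        buffer = []
        for word in text.split():
            key = word.strip(".,!?").lower()
            if key in SOUNDS:
                # Flush pending speech, then emit the non-speech sound.
                if buffer:
                    plan.append(("say", voice, " ".join(buffer)))
                    buffer = []
                plan.append(("play", SOUNDS[key]))
            else:
                buffer.append(word)
        if buffer:
            plan.append(("say", voice, " ".join(buffer)))
    return plan
```

A downstream component would map each `say` event to a synthesis call and each `play` event to a recorded sound.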
Evaluation • Synthesis evaluation is challenging • Typically evaluate independent of domain • Requires human listeners • Slow, expensive • Massive synthesis even harder • Too much content for listeners to evaluate • Evaluating some content likely to help • Especially if it’s chosen based on likelihood to have errors • Key to find as many errors as possible without listeners • Prioritize error correction
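Choosing content by its likelihood of containing errors needs some automatic score. One possible proxy, assumed here rather than taken from the slides, is the share of tokens missing from the synthesizer's lexicon; `oov_rate` and `rank_for_review` are hypothetical names.

```python
import re

def oov_rate(text, lexicon):
    """Fraction of word tokens not found in the lexicon."""
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t not in lexicon for t in tokens) / len(tokens)

def rank_for_review(items, lexicon):
    """Order items so those most likely to contain synthesis errors come first."""
    return sorted(items, key=lambda text: oov_rate(text, lexicon), reverse=True)
```

Listeners would then hear only the top of the ranked list, which concentrates limited evaluation time on the riskiest content.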
Simple Study • Implement several modifications for “weblog text” • “number-to-letter” rules • Syllable boundaries marked by case – “iTunes” • Lexical entries for common neologisms – “pwn” • Synthesize typical massive synthesis content • Entries from blogs, random Wikipedia article, Blog06 data • Subjects listen to 6 examples, one from each source • Asked to identify which version they prefer, and by how much on a scale of 1-5 • Subjects all speech synthesis experts
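The three modifications listed for weblog text might look roughly like the sketch below. The digit-to-letter table, the respelling of "pwn" as "pone", and the function name are assumptions for illustration; the slides do not give the actual rules or lexical entries.

```python
import re

# Assumed "number-to-letter" mapping for leetspeak digits.
LEET = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"})

# Assumed respelling standing in for a lexical entry.
LEXICON = {"pwn": "pone"}

def normalize_token(token):
    """Rewrite one token into a form a synthesizer is more likely to pronounce well."""
    lower = token.lower()
    if lower in LEXICON:
        return LEXICON[lower]
    # Number-to-letter rule: only for tokens mixing letters and digits.
    if re.search(r"[A-Za-z]", token) and re.search(r"\d", token):
        return token.translate(LEET)
    # Case-marked boundary: split where a lowercase letter meets an uppercase one.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)
```

A rule this simple misfires on tokens such as "mp3s", where the digit should be read as a number, so such tokens would need their own lexical entries or a class check before the mapping is applied.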
Study Examples STFU Newb March 14, 2006 8:02 AM Cyberbullying Report. It's a Microsoft sponsored report talking about intimidation and bullying online. There's a digested version of the survey [PDF]. And don't forget your dose of Cyber Wellness, too. posted by gsb (13 comments total) Does anyone else belive this just isn't happening? I mean back when I was a kid we tried to pwn eachothers IRC channels, but that was about it. posted by delmoi at 8:13 AM on March 14 I absolutely believe this happens. Kids are f%*#ing mean. Girls are viscious to each other. I would be more surprised if kids weren't using technology to expand those behaviours than if they are. posted by raedyn at 8:24 AM on March 14 [PDF] belive pwn eachothers posted by delmoi at 8:13 AM on March 14 f%*#ing posted by raedyn at 8:24 AM on March 14
Results • All subjects always preferred the modified examples • Less consistent agreement in degree of preference • Generally low preference scores • Implies only small improvements over baseline • Average preference score around 2 or 3 • Strong preferences rare • Sample size too small
Discussion • Some fairly simple modifications result in speech perceived as at least slightly better • More changes might show more obvious preferences • Need more detailed information about how the speech was perceived • Anecdotal feedback suggests improved prosody will help significantly • Humans give hour-long lectures that people can understand; how can synthesizers do the same?
Future Directions • Implement more understandability improvements • Time constraints, content structure, etc. • Perform a more complete evaluation • Not enough examples/listeners, but encouraging results • Need a more formalized evaluation metric • User feedback within an application with interested users • Hard to find sufficient users who would participate • Design an application to get users? • Web browser that renders content as speech: automatic podcast generator