Understandable Production of Massive Synthesis

Brian Langner, Alan W Black. Language Technologies Institute, Carnegie Mellon University.

Presentation Transcript


  1. Understandable Production of Massive Synthesis • Brian Langner, Alan W Black • Language Technologies Institute, Carnegie Mellon University

  2. Background • Applications pushing the limits of speech synthesis becoming more common • … • Issues with perceived quality and understandability arise more frequently with more challenging uses • Successful speech applications require understandable output • What can we do to make speech synthesis more understandable, even in challenging applications?

  3. Massive Synthesis • Synthesis of such a large amount of content that typical evaluation methods are impractical • Frequently characterized by continuously generating new content • Examples: • Error reports • Business case summaries • News readers • Weblogs • Often able to simplify task, though not always

  4. Example Content • Weblog content is ideal • Copious amounts available, continually generated • Easy to collect from many weblogs • ... but is it representative? • Use an existing collection: TREC Blog06 corpus • 11 weeks of over 100,000 RSS/Atom feeds from 2005-06 • Consists of homepage + all new permalink pages weekly • Over 750,000 total collected feeds • Includes spam content for realism • Multilingual, though only concerned with English for now

  5. Analyzing Massive Synthesis Content • Preprocess to remove tags and meta content • Resulting corpus is 14GB of “blog” text • Identify word frequency differences from typical English • Try to find “blog-frequent” words unlikely to synthesize well • Flag and target for improvement strategies • Most words fairly normal for English • Frequency differs, but words are not unusual • Most frequent atypical words: “html”, “blog” (27th/28th) • High occurrence of acronyms • “FAQ”, “mp3s”
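
The frequency comparison on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the function names, the `min_ratio` threshold, the `floor` smoothing value, and the toy sample texts are all assumptions made for the example.

```python
from collections import Counter
import re


def relative_freqs(text):
    """Return each lowercased word's share of all word tokens in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def domain_frequent_words(domain_text, reference_text, min_ratio=2.0, floor=1e-6):
    """List words at least min_ratio times more frequent in the domain text.

    floor (an assumed smoothing value) stands in for the reference frequency
    of words that never occur in the reference text.
    """
    domain = relative_freqs(domain_text)
    reference = relative_freqs(reference_text)
    ratios = {w: f / max(reference.get(w, 0.0), floor) for w, f in domain.items()}
    flagged = [w for w, r in ratios.items() if r >= min_ratio]
    return sorted(flagged, key=lambda w: ratios[w], reverse=True)


# Toy stand-ins for the blog corpus and a "typical English" reference corpus.
blog_sample = "my blog has a new html faq and the blog links to mp3s"
reference_sample = "the cat sat on the mat and the dog has a new bone to chew"
flagged = domain_frequent_words(blog_sample, reference_sample)
```

On real data the flagged list would then be filtered for words a synthesizer is unlikely to pronounce well, which this sketch does not attempt.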

  6. Common Problems • Prevalence and variety of non-standard words • Technical jargon • Typos / Spelling errors • “the-teh”, “lose-loose”, “voila-viola”, etc. • l33t5p33k • Expressive spelling • “soooooo…..” • Usernames/handles • Must be rendered understandably to be useful • “leet” rather than “el-three-three-tee” • Can group NSW into classes to deal with them
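
Grouping non-standard words (NSW) into classes could look something like the sketch below. The class names, the tiny `KNOWN_WORDS` lexicon, and the heuristics are assumptions chosen for illustration; the slides do not specify the actual classes or rules.

```python
import re

# Toy lexicon of "standard" words; a real system would use a full dictionary.
KNOWN_WORDS = {"so", "the", "lose", "loose", "leet"}


def classify_token(token, lexicon=KNOWN_WORDS):
    """Assign a token to a coarse NSW class using simple surface heuristics."""
    lower = token.lower()
    if lower in lexicon:
        return "standard"
    # Letters mixed with digits: covers l33t5p33k, but also tokens like "mp3s".
    if re.search(r"[a-z]", lower) and re.search(r"\d", lower):
        return "alphanumeric"
    # Any character repeated three or more times in a row: "soooooo".
    if re.search(r"(.)\1{2,}", lower):
        return "expressive"
    # Short all-caps alphabetic tokens: "FAQ".
    if token.isalpha() and token.isupper() and 2 <= len(token) <= 5:
        return "acronym"
    # Typos, usernames, and jargon all land here in this sketch.
    return "unknown"
```

For example, `classify_token("l33t5p33k")` falls into the letter-digit class and `classify_token("teh")` into the catch-all class; telling typos apart from usernames or jargon would need more evidence than the token's surface form.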

  7. General Improvements • Use formatting and structure to guide synthesis • Title, articles, comments, ads, links, … • Emphasized in text → emphasized in spoken output • Expressive spelling • Handle/ignore formatting problems • Missing HTML tags common • Improperly rendered HTML entities • Don’t say “ampersand hash eight two one two semicolon”
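
As one possible way to handle the formatting problems listed on this slide, the sketch below strips tags (including unclosed ones) and decodes HTML entities before text reaches the synthesizer. The function name and the regex-based tag stripping are assumptions for illustration; `html.unescape` is the Python standard-library call for entity decoding.

```python
import html
import re


def clean_for_synthesis(fragment):
    """Strip markup and decode entities so neither is read aloud literally."""
    # Remove anything that looks like a tag, whether or not it is ever closed.
    no_tags = re.sub(r"<[^>]*>", " ", fragment)
    # Turn entities such as "&amp;" or "&#8212;" into the characters they encode.
    decoded = html.unescape(no_tags)
    # Collapse the whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", decoded).strip()


cleaned = clean_for_synthesis("<p>Tom &amp; Jerry &#8212; <b>new episode")
```

Here the numeric entity `&#8212;` becomes a single dash character (U+2014) that a text normalizer can treat as a pause, rather than being spelled out symbol by symbol.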

  8. General Improvements • Content summarization • How to present very long content? • Several ways to summarize • Summarize articles and note existence of comments • Summarize articles and comments • Identify number of new articles and comments • More abstract ideas • Subsetting • Speak enough of the content to allow the user to decide to hear more or continue to the next item • Appropriate choice likely depends on user preferences
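
The subsetting idea can be illustrated with a very small sketch: speak only the first few sentences and report how many remain, so the user can choose to hear more or move on. The `preview` name, the `max_sentences` default, and the naive punctuation-based sentence splitter are assumptions, not something the slides prescribe.

```python
import re


def preview(text, max_sentences=2):
    """Return the leading sentences of text and the count of sentences left."""
    # Naive splitter: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    head = sentences[:max_sentences]
    remaining = len(sentences) - len(head)
    return " ".join(head), remaining


spoken, left = preview("One. Two! Three? Four.", max_sentences=2)
```

Informal blog text often lacks reliable sentence punctuation, so a real system would likely need a more robust segmenter than this one.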

  9. General Improvements • Phrase boundaries and prosody • Improved phrase breaks → more understandable synthesis • Effect amplified with informal writing • “word soup” • Multiple voices, non-speech output • Use different voices to segment content • “narrator”, “male commenter”, “female commenter”, etc. • Single voice with multiple styles may work as well • Use non-speech sounds to render some tokens • Laughing for “LOL” rather than trying to pronounce it
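
A rough sketch of how voice switching and non-speech rendering might be planned before synthesis is shown below. The voice names, the `"laugh"` sound label, and the event tuple layout are hypothetical; no particular synthesizer API is assumed.

```python
# Hypothetical mapping from tokens to non-speech sound labels.
NON_SPEECH = {"lol": "laugh"}

# Hypothetical mapping from content type to voice name.
VOICES = {"article": "narrator", "comment": "commenter"}


def render_plan(segments):
    """Convert (kind, text) segments into (action, voice, payload) events."""
    events = []
    for kind, text in segments:
        voice = VOICES.get(kind, "narrator")
        buffer = []
        for token in text.split():
            sound = NON_SPEECH.get(token.strip(".,!?").lower())
            if sound is None:
                buffer.append(token)
                continue
            # Flush any pending words before emitting the sound event.
            if buffer:
                events.append(("speak", voice, " ".join(buffer)))
                buffer = []
            events.append(("sound", None, sound))
        if buffer:
            events.append(("speak", voice, " ".join(buffer)))
    return events


plan = render_plan([
    ("article", "New post today."),
    ("comment", "LOL that is great"),
])
```

A downstream component would then hand each `speak` event to the named voice and play a recorded sound for each `sound` event.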

  10. Evaluation • Synthesis evaluation is challenging • Typically evaluate independent of domain • Requires human listeners • Slow, expensive • Massive synthesis even harder • Too much content for listeners to evaluate • Evaluating some content likely to help • Especially if it’s chosen based on likelihood to have errors • Key to find as many errors as possible without listeners • Prioritize error correction
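
One way to choose content by likelihood of errors, sketched under assumptions: rank items by the fraction of tokens missing from the synthesizer's lexicon and send only the top few to listeners. The out-of-vocabulary rate is an assumed proxy for error likelihood, and the function names and `budget` parameter are illustrative.

```python
import re


def oov_rate(text, lexicon):
    """Fraction of tokens in text that are not in the lexicon."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token.lower() not in lexicon)
    return unknown / len(tokens)


def rank_for_review(items, lexicon, budget):
    """Pick the budget items most likely to contain synthesis errors."""
    ranked = sorted(items, key=lambda text: oov_rate(text, lexicon), reverse=True)
    return ranked[:budget]


toy_lexicon = {"this", "is", "a", "post", "we", "tried", "to", "channels"}
items = ["this is a post", "we tried to pwn teh channels", "omg l33t pwn"]
to_review = rank_for_review(items, toy_lexicon, budget=2)
```

Other signals, such as low-confidence letter-to-sound predictions, could be folded into the same ranking score.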

  11. Simple Study • Implement several modifications for “weblog text” • “number-to-letter” rules • Syllable boundaries marked by case – “iTunes” • Lexical entries for common neologisms – “pwn” • Synthesize typical massive synthesis content • Entries from blogs, random Wikipedia article, Blog06 data • Subjects listen to 6 examples, one from each source • Asked to identify which version they prefer, and by how much on a scale of 1-5 • Subjects all speech synthesis experts
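
The three modifications named on this slide might be approximated as in the sketch below. The digit-to-letter table, the respelling chosen for "pwn", and the order in which the rules are applied are assumptions for illustration; the slides do not give the actual rules used in the study.

```python
import re

# Assumed "number-to-letter" table for leetspeak digits.
LEET_DIGITS = str.maketrans({"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"})

# Hypothetical lexical entries mapping neologisms to pronounceable respellings.
LEXICON = {"pwn": "pone"}


def normalize_token(token):
    """Rewrite a weblog-style token into something a synthesizer can say."""
    lower = token.lower()
    # 1. Explicit lexical entries take priority.
    if lower in LEXICON:
        return LEXICON[lower]
    # 2. Letters mixed with digits: map digits back to letters ("l33t" -> "leet").
    if re.search(r"[a-z]", lower) and re.search(r"\d", lower):
        return lower.translate(LEET_DIGITS)
    # 3. A lowercase-to-uppercase change marks a boundary ("iTunes" -> "i Tunes").
    if re.fullmatch(r"[a-z]+[A-Z][a-z]+", token):
        return re.sub(r"([a-z])([A-Z])", r"\1 \2", token)
    return token
```

Note that rule 2 as written would also rewrite tokens such as "mp3s", so a practical version would need exceptions, for instance through additional lexical entries.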

  12. Study Examples STFU Newb March 14, 2006 8:02 AM Cyberbullying Report. It's a Microsoft sponsored report talking about intimidation and bullying online. There's a digested version of the survey [PDF]. And don't forget your dose of Cyber Wellness, too. posted by gsb (13 comments total) Does anyone else belive this just isn't happening? I mean back when I was a kid we tried to pwn eachothers IRC channels, but that was about it. posted by delmoi at 8:13 AM on March 14 I absolutely believe this happens. Kids are f%*#ing mean. Girls are viscious to each other. I would be more surprised if kids weren't using technology to expand those behaviours than if they are. posted by raedyn at 8:24 AM on March 14 [PDF] belive pwn eachothers posted by delmoi at 8:13 AM on March 14 f%*#ing posted by raedyn at 8:24 AM on March 14

  13. Results • All subjects always preferred the modified examples • Less consistent agreement in degree of preference • Generally low preference scores • Implies only small improvements over baseline • Average preference score around 2 or 3 • Strong preferences rare • Sample size too small

  14. Discussion • Some fairly simple modifications result in speech perceived as at least slightly better • More changes might show more obvious preferences • Need more detailed information about how the speech was perceived • Anecdotal feedback suggests improved prosody will help significantly • Humans give hour-long lectures that people can understand; how can synthesizers do that?

  15. Future Directions • Implement more understandability improvements • Time constraints, content structure, etc. • Perform a more complete evaluation • Not enough examples/listeners, but encouraging results • Need a more formalized evaluation metric • User feedback within an application with interested users • Hard to find sufficient users who would participate • Design an application to get users? • Web browser that renders content as speech: automatic podcast generator

  16. Questions?
