1 / 31

Analysis and Synthesis of Shouted Speech

Analysis and Synthesis of Shouted Speech. Tuomo Raitio Jouni Pohjalainen Manu Airaksinen Paavo Alku Antti Suni Martti Vainio. Shout. Shout is the loudest mode of vocal communication It is used for increasing the signal-to-noise ratio ( SNR) when communicating over an interfering noise

ilana
Download Presentation

Analysis and Synthesis of Shouted Speech

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Analysis and Synthesis of Shouted Speech Tuomo Raitio Jouni Pohjalainen Manu Airaksinen Paavo Alku Antti Suni Martti Vainio

  2. Shout • Shout is the loudest mode of vocal communication • It is used for increasing the signal-to-noise ratio (SNR) when communicating • over an interfering noise • over a distance • Shouting is also used for expressing emotions or intentions

  3. Properties of shout • Shout is produced by raising the subglottal pressure and increasing the vocal fold tension • In effect, shout is characterized by • Increased sound pressure level (SPL) • Increased fundamental frequency (f0) • Increased amplitudes in mid-frequencies (1—4 kHz) • Increased duration and energy of vowels • Decreased duration and energy of consonants • Less accurate articulation

  4. Why perform shout synthesis? • Fortunately, shouting is used rarely, but it is an essential part of human vocal communication • Shout synthesis may be required e.g. for creating speech with emotional content, and it can be used in human-computer interaction or in creating virtual worlds and characters

  5. In this study… • In this study • Normal and shouted speech was recorded • Properties of normal and shouted speech were analyzed • Methods for producing natural sounding HMM-based synthetic shout are investigated

  6. Recording of normal and shouted speech • Normal and shouted speech was recorded in an anechoid chamber • 22 Finnish speakers • 24 sentences of speech and shout from each speaker • A total of 1056 sentences • Subjects were asked to use very loud voice in shouting • In addition, a larger shouting corpus of 100 sentences was recorded from one male and one female for TTS purposes

  7. Acoustic analysis of shout • The following acoustic properties were analyzed from the recorded shouted and normal speech: • sound pressure level (SPL) • duration • fundamental frequency (f0) • spectrum • properties of the voice source: • shape of the glottal pulse • H1-H2 parameter • NAQ parameter

  8. Acoustic analysis of shout – Results • On average (speech  shout) • SPL increased 21 dB for females and 22 dB for males • Sentence duration increased 20% for females and 24% for males • f0 increased 71% for females and 152% for males • Spectrum was emphasized in the 1–4 kHz area

  9. Female Male Overall Voiced Unvoiced

  10. Problems… • Differences between normal speech and shout are large • This induces problems in many speech processing algorithms: • Due to high f0, the accurate estimation of speech spectrum is difficult • This is due to the biasing effect of the sparse harmonic structure of the shouted voice source • Especially linear prediction (LP) is prone to this type of bias

  11. Spectrum estimation of shout • The biasing effect of the harmonics must be reduced • For this purpose, e.g. weighted linear prediction (WLP)can be used • In WLP, the effect of the excitation to spectrum is reduced • This is done by weighting the squared residual with a specific function

  12. LP vs. weighted linear prediction (WLP) Conventional LP: Weighted LP:

  13. Spectrum estimation of shout • Following spectrum estimation methods were compared for normal speech and shout: • Conventional linear prediction (LP) • WLP with STE weight (STE-WLP) • WLP with AME weight (AME-WLP) • STE – short time energy • AME – attenuation of the main excitation

  14. LP vs. WLP in resynthesis • Subjective listening tests indicate that • WLP-AME performs best with normal speech • WLP-STE performs best with shout LP WLP-STE WLP-AME

  15. LP vs. WLP in HMM-based speech synthesis • Subjective listening tests indicate that WLP-STE is preferred in the synthesis of shout (by adaptation) Female Male

  16. Synthesis of shout (1) • HMM-based synthesis is a very flexible means to produce different speaking styles, such as shout Text Speech data Synthetic speech Synthesis Training Statistical model

  17. Synthesis of shout (2) • It is difficult to obtain large amounts of shout data, enough for constructing a TTS voice Shout data

  18. Synthesis of shout (3) • Statistical adaptation of the normal speech model was used to generate synthetic shouted speech Synthetic shout Text Speech data Synthesis Training Statistical model Adaptation Shout data

  19. Synthesis of shout (4) • Alternatively, using simple voice conversion technique, the synthetic speech can be converted into shouted speech Synthetic shout Text Speech data Synthesis Training Statistical model Voice conversion Shout data

  20. Evaluation (1) • The following speech types were selected for the test: • Natural normal speech • Natural shout • Synthetic normal speech • Synthetic shout (adapted) • Synthetic shout (voice conversion)

  21. Evaluation (2) • MOS style listening test: the following properties were rated: • How would you rate the quality of the speech sample? • How much the sample resembles shouting? • How much effort did speaker use for producing speech? • Scale from 1 to 5 with verbal anchors • Loudness of the speech samples was normalized so that the ratings are based on other aspects than SPL • 11 test subjects evaluated 50 samples each

  22. Results – Naturalness • Shout synthesis is rated lower in quality compared to normal speech synthesis (as expected) Normal synthesis Shout synthesis 26

  23. Results – Impression of shouting • The impression of shouting is, however, fairly well preserved Natural shout Synthetic shout 27

  24. Results – Vocal effort • Adaptation produces better impression of the used vocal effort compared to voice conversion method Adapted shout Voice conversion shout 28

  25. Summary (1) • Synthesis of shout is challenging for many reasons: • It is difficult to obtain large amounts of shout data with consistent quality • Differences between normal speech and shout are large, which induces problems in many speech processing algorithms • In this work, the biasing effect of high-pitched shout was reduced by using weighted linear predictive (WLP) methods • Subjective listening tests show the that WLP models work better with shout than conventional LP

  26. Summary (2) • In this study, synthetic shout was produced with two different techniques: • Adaptation • Voice conversion of the synthetic normal speech • Methods were rated equal in quality • Impression of shouting and the use of vocal effort were better preserved in the adapted shout

  27. Samples Male Female Thank you!

More Related