
Hideki Kawahara Wakayama University ATR-HIS

Exemplar-based Voice Quality Analysis and Control using a High Quality Auditory Morphing Procedure based on STRAIGHT. Hideki Kawahara Wakayama University ATR-HIS. Why high quality? Humans are very good at using voice quality in communicating non-linguistic and para-linguistic information.


Presentation Transcript


  1. Exemplar-based Voice Quality Analysis and Control using a High Quality Auditory Morphing Procedure based on STRAIGHT. Hideki Kawahara Wakayama University ATR-HIS

  2. Why high quality? • Humans are very good at using voice quality in communicating non-linguistic and para-linguistic information. • -> We can discriminate voice quality very well. • -> But… only around natural speech sounds • -> Highly nonlinear systems need to be tested around their normal operating range. • -> Voice quality has to be tested using real voice. • -> It is crucial to provide means to control physical parameters of “real” voice in a well defined manner. • -> We need a very high quality analysis, modification and synthesis system.

  3. Why exemplar based? • Rule based approach • For example… • How to modify formant frequencies when modifying F0 so that the modified speech sounds natural? • Desirable but virtually impossible • “Curse of dimensionality” • Exemplar based approach • Finding permissible trajectories in a parametric space that span real voice examples. • Rule is represented as an approximating function that can generate permissible trajectories.

  4. Why exemplar based? • Rule first approach • For example… • How to modify formant frequencies when modifying F0 so that the modified speech sounds natural? • Desirable but virtually impossible • “Curse of dimensionality” • Example first approach • Finding permissible trajectories in a parametric space that span real voice examples. • Rule is represented as an approximating function that can generate permissible trajectories.

  5. Rule-first approach: example original

  6. Rule-first approach: example original

  7. How to improve the rule? • Need to test perceptual effects for all combinations of ΔF1, ΔF2, ΔF3, ΔF4, … N levels for each Δ --> P^N • Need to check spectral tilt, harmonic-to-noise ratio, … • ----> Combinatorial explosion: curse of dimensionality
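
The growth behind this slide can be made concrete with a short count. This is only a sketch, assuming a full factorial listening test in which every parameter is tested at the same number of levels; the function name and the example figures (5 levels, up to 8 parameters) are illustrative and do not come from the presentation.

```python
def num_test_conditions(levels_per_parameter: int, num_parameters: int) -> int:
    """Count the stimuli in a full factorial test (hypothetical helper).

    Every combination of levels across all parameters is one stimulus,
    so the count is levels_per_parameter raised to num_parameters.
    """
    if levels_per_parameter < 1 or num_parameters < 0:
        raise ValueError("need at least one level and a non-negative parameter count")
    return levels_per_parameter ** num_parameters


# Illustrative only: 5 levels per parameter, growing number of parameters.
for k in (1, 2, 4, 8):
    print(k, "parameters ->", num_test_conditions(5, k), "stimuli")
```

Adding spectral tilt or harmonic-to-noise ratio as further parameters multiplies the count again, which is why exhaustive rule testing quickly becomes impractical.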

  8. Example-first approach [Figure labels: Surprise, Happiness, Neutral, Fear, Anger, Sadness. Utterance: /koNnitiwa/ (hello)]

  9. How does morphing look and sound? /hai/ (yes)

  10. Morphed speech Neutral-Anger [Figure: perceived naturalness (0 to 5) plotted against morphing rate (-0.25 to 1.25), with the permissible trajectory marked; * = real speech]

  11. Morphed speech Neutral-Anger [Figure: perceived naturalness (0 to 5) plotted against morphing rate (-0.25 to 1.25), with the permissible trajectory marked; * = real speech] • Interpolating morphing provides a permissible trajectory under the current implementation
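
One way to read the morphing-rate axis is as a blend weight between two exemplars: values between 0 and 1 interpolate, and values outside that range (such as -0.25 or 1.25) extrapolate. The sketch below is a minimal illustration of that idea, assuming the two exemplars have already been time-aligned and reduced to parameter tracks of equal length, with rate 0 taken to mean the first exemplar and rate 1 the second. The function name and the sample values are hypothetical, and this is not the STRAIGHT morphing implementation.

```python
def morph_tracks(track_a, track_b, rate):
    """Blend two equal-length parameter tracks (hypothetical helper).

    rate = 0 returns track_a, rate = 1 returns track_b,
    0 < rate < 1 interpolates, and other values extrapolate.
    """
    if len(track_a) != len(track_b):
        raise ValueError("tracks must be time-aligned to the same length")
    return [(1.0 - rate) * a + rate * b for a, b in zip(track_a, track_b)]


# Made-up tracks standing in for one parameter of two exemplars.
first = [1.0, 2.0, 3.0]
second = [3.0, 6.0, 5.0]
for rate in (-0.25, 0.0, 0.5, 1.0, 1.25):
    print(rate, morph_tracks(first, second, rate))
```

Whether a straight blend is appropriate for a given parameter (for example, whether F0 should be blended on a logarithmic axis) is a separate design choice that the slide does not settle.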

  12. Parameters that were morphed • F0 • Instantaneous frequency based method • Energy distribution on a time-frequency coordinate • Extended pitch synchronous analysis • Periodicity index on a time-frequency coordinate • Harmonic-to-noise ratio in each ERB band • Time-frequency coordinate • (Fine temporal structure) visualization
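
The slide does not say how the ERB bands are laid out. As a rough sketch, the widely used Glasberg and Moore approximation maps frequency f (in Hz) to an ERB-rate number of 21.4 log10(4.37 f / 1000 + 1), and bands one ERB wide can then be placed at unit steps on that scale. The function names and the 100 Hz to 8000 Hz range below are assumptions for illustration, not details taken from the presentation.

```python
import math


def hz_to_erb_rate(f_hz):
    """Glasberg and Moore approximation of the ERB-rate scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) * 1000.0 / 4.37


def erb_band_edges(f_low_hz, f_high_hz):
    """Edges (Hz) of bands one ERB wide, starting at f_low_hz.

    Only whole bands are returned, so the last edge may fall
    slightly below f_high_hz.
    """
    lo = hz_to_erb_rate(f_low_hz)
    hi = hz_to_erb_rate(f_high_hz)
    n_bands = max(1, int(math.floor(hi - lo)))
    return [erb_rate_to_hz(lo + i) for i in range(n_bands + 1)]


# Assumed analysis range; the presentation does not state one.
edges = erb_band_edges(100.0, 8000.0)
print(len(edges) - 1, "bands; first edges:", [round(e, 1) for e in edges[:4]])
```

A per-band harmonic-to-noise ratio would then be computed from the spectrum between each pair of adjacent edges; how that ratio is estimated is not covered here.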

  13. How does it work for voice quality? • Morphing examples including extrapolation • Normal speech and shouting speech • Falsetto and normal speech • Normal speech and singing in forte

  14. Concluding remarks • It is possible to use the same language based on this exemplar based approach, if we can share a common voice quality corpus like the VOQUAL database. • It is possible to accumulate scientific and practical knowledge as a growing set of approximating functions. • STRAIGHT has to be improved to enable precise reproduction of varieties of voice quality. <-- This is my duty/responsibility.

  15. Naturalness: partial morphing [Figure: naturalness for the conditions All, Co, Int, Co+F0 and Int+F0, shown for Happiness, Sadness and Anger] • All: all parameters • Co: coordinate alignment only • Int: intensity only • Co+F0: coordinate and F0 were morphed • Int+F0: intensity and F0 were morphed
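
The partial-morphing conditions in the legend can be thought of as giving each parameter group its own morphing rate: groups named in the condition follow the requested rate, and the rest stay at 0 (the source exemplar). The sketch below encodes that reading. Reducing "all parameters" to three groups (coordinate, intensity, F0) is a simplifying assumption, and the names are hypothetical.

```python
# Parameter groups morphed under each condition from the legend.
# Treating "All" as exactly these three groups is an assumption.
CONDITIONS = {
    "All": {"coordinate", "intensity", "f0"},
    "Co": {"coordinate"},
    "Int": {"intensity"},
    "Co+F0": {"coordinate", "f0"},
    "Int+F0": {"intensity", "f0"},
}


def partial_morph_rates(condition, rate):
    """Per-group morphing rates for one condition (hypothetical helper).

    Groups listed for the condition get the requested rate;
    all other groups stay at 0.0, i.e. unmorphed.
    """
    morphed = CONDITIONS[condition]
    return {name: (rate if name in morphed else 0.0)
            for name in sorted(CONDITIONS["All"])}


for condition in CONDITIONS:
    print(condition, partial_morph_rates(condition, 0.5))
```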
