Hideki Kawahara Wakayama University ATR-HIS

Exemplar-based Voice Quality Analysis and Control using a High Quality Auditory Morphing Procedure based on STRAIGHT

Why high quality?
• Humans are very good at using voice quality in communicating non-linguistic and para-linguistic information.
• -> We can discriminate voice quality very well.
• -> But… only around natural speech sounds.
• -> Highly nonlinear systems need to be tested around their normal operating range.
• -> Voice quality has to be tested using real voice.
• -> It is crucial to provide means to control physical parameters of “real” voice in a well-defined manner.
• -> We need a very high quality analysis, modification and synthesis system.

Why exemplar based?
• Rule-based (rule-first) approach
  • For example: how should formant frequencies be modified when F0 is modified, so that the modified speech sounds natural?
  • Desirable but virtually impossible
  • “Curse of dimensionality”
• Exemplar-based (example-first) approach
  • Finding permissible trajectories in a parametric space that span real voice examples.
  • A rule is represented as an approximating function that can generate permissible trajectories.

Rule-first approach: example original

How to improve the rule?
• Need to test perceptual effects for all combinations of ΔF1, ΔF2, ΔF3, ΔF4, …
• N levels for each of P parameters Δ --> N^P test conditions
• Need to check spectral tilt, harmonic-to-noise ratio, …
• ----> Combinatorial explosion: the curse of dimensionality
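To make the scale of the problem concrete, the minimal sketch below counts the stimuli in a full factorial listening test with N levels for each of P parameters. The function name and the example numbers (5 levels, up to 8 parameters) are illustrative assumptions, not figures from the talk.

```python
def factorial_conditions(levels_per_parameter: int, n_parameters: int) -> int:
    """Number of stimuli in a full factorial design: N ** P."""
    if levels_per_parameter < 1 or n_parameters < 0:
        raise ValueError("need at least one level and a non-negative parameter count")
    return levels_per_parameter ** n_parameters


if __name__ == "__main__":
    # Hypothetical example: 5 levels per parameter, growing parameter count.
    for p in (1, 2, 4, 6, 8):
        print(f"P = {p}: {factorial_conditions(5, p)} conditions")
```

Each added parameter multiplies the number of conditions by N, which is why an exhaustive perceptual test quickly becomes impractical.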

Example-first approach
• Recorded examples of /koNnitiwa/ (hello): Surprise, Happiness, Neutral, Fear, Anger, Sadness

How does morphing look and sound? /hai/ (yes)

Morphed speech: Neutral-Anger
[Figure: perceived naturalness (0 to 5) versus morphing rate (-0.25 to 1.25), showing the permissible trajectory; ＊ marks real speech]
• Interpolating morphing provides a permissible trajectory under the current implementation.
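The role of the morphing rate can be sketched as a blend between two exemplars. The code below is a minimal illustration, not the STRAIGHT implementation: it assumes the two parameter tracks are already time-aligned frame by frame, uses a plain linear blend, and optionally blends in the log domain (a common choice for F0, assumed here). The function name `morph` and the F0 values are hypothetical.

```python
import numpy as np


def morph(param_a, param_b, rate, log_domain=False):
    """Blend two time-aligned parameter tracks.

    rate = 0 returns exemplar A, rate = 1 returns exemplar B.
    Rates inside [0, 1] interpolate; rates outside extrapolate.
    """
    a = np.asarray(param_a, dtype=float)
    b = np.asarray(param_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("tracks must be time-aligned and equal in shape")
    if log_domain:
        # Assumption: positive-valued parameters such as F0 are blended in log space.
        return np.exp((1.0 - rate) * np.log(a) + rate * np.log(b))
    return (1.0 - rate) * a + rate * b


if __name__ == "__main__":
    f0_neutral = np.array([120.0, 125.0, 130.0])  # hypothetical F0 track in Hz
    f0_anger = np.array([180.0, 200.0, 190.0])    # hypothetical F0 track in Hz
    for rate in (-0.25, 0.0, 0.5, 1.0, 1.25):
        print(rate, morph(f0_neutral, f0_anger, rate, log_domain=True))
```

Rates of -0.25 and 1.25 correspond to the extrapolated ends of the morphing-rate axis, where naturalness is not guaranteed by the exemplars themselves.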

Parameters that were morphed
• F0
  • Instantaneous-frequency-based method
• Energy distribution on a time-frequency coordinate
  • Extended pitch-synchronous analysis
• Periodicity index on a time-frequency coordinate
  • Harmonic-to-noise ratio in each ERB band
• Time-frequency coordinate
• (Fine temporal structure) visualization
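For readers unfamiliar with ERB bands, the sketch below uses the widely cited Glasberg and Moore (1990) approximation for the equivalent rectangular bandwidth and the ERB-number scale, plus a plain decibel ratio for harmonic-to-noise ratio. Whether the system described here uses exactly these formulas is an assumption; the function names are hypothetical.

```python
import math


def erb_bandwidth_hz(f_hz: float) -> float:
    """Equivalent rectangular bandwidth in Hz at centre frequency f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


def erb_number(f_hz: float) -> float:
    """Position of f_hz on the ERB-number scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def hnr_db(harmonic_power: float, noise_power: float) -> float:
    """Harmonic-to-noise ratio in dB for one band (both powers must be positive)."""
    return 10.0 * math.log10(harmonic_power / noise_power)


if __name__ == "__main__":
    for f in (100.0, 1000.0, 4000.0):
        print(f, erb_bandwidth_hz(f), erb_number(f))
```

Bands spaced evenly on the ERB-number scale are narrow at low frequencies and wide at high frequencies, which roughly follows auditory frequency resolution.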

How does it work for voice quality?
• Morphing examples including extrapolation
  • Normal speech and shouting speech
  • Falsetto and normal speech
  • Normal speech and singing in forte

Concluding remarks
• With this exemplar-based approach, we can use the same language if we share a common voice quality corpus such as the VOQUAL database.
• It is possible to accumulate scientific and practical knowledge as a growing set of approximating functions.
• STRAIGHT has to be improved to enable precise reproduction of a variety of voice qualities. <-- This is my duty/responsibility.

Naturalness: partial morphing
[Figure: naturalness of partially morphed speech for Happiness, Sadness, and Anger under the conditions All, Co, Int, Co+F0, and Int+F0]
• All: all parameters
• Co: coordinate alignment only
• Int: intensity only
• Co+F0: coordinate and F0 were morphed
• Int+F0: intensity and F0 were morphed
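Partial morphing can be pictured as giving each parameter its own morphing rate, so that only a chosen subset moves toward the target. The sketch below is an illustration under stated assumptions: the parameter names (`coordinate`, `intensity`, `f0`), the linear blend, and the example values are all hypothetical and are not taken from the STRAIGHT implementation.

```python
def partial_morph(source, target, rates):
    """Morph only selected parameters.

    source, target: dicts mapping a parameter name to a list of floats
                    (assumed to be time-aligned and equal in length).
    rates: dict mapping a parameter name to its morphing rate;
           parameters not listed keep the source value (rate 0).
    """
    morphed = {}
    for name, src_track in source.items():
        rate = rates.get(name, 0.0)
        tgt_track = target[name]
        morphed[name] = [(1.0 - rate) * s + rate * t
                         for s, t in zip(src_track, tgt_track)]
    return morphed


# Hypothetical encodings of the five conditions in the legend.
CONDITIONS = {
    "All": {"coordinate": 1.0, "intensity": 1.0, "f0": 1.0},
    "Co": {"coordinate": 1.0},
    "Int": {"intensity": 1.0},
    "Co+F0": {"coordinate": 1.0, "f0": 1.0},
    "Int+F0": {"intensity": 1.0, "f0": 1.0},
}

if __name__ == "__main__":
    neutral = {"coordinate": [0.0, 0.1], "intensity": [60.0, 62.0], "f0": [120.0, 125.0]}
    anger = {"coordinate": [0.0, 0.08], "intensity": [70.0, 75.0], "f0": [180.0, 200.0]}
    for label, rates in CONDITIONS.items():
        print(label, partial_morph(neutral, anger, rates))
```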
