Multi-layer model for expressive speech perception and its application to expressive speech synthesis Aug. 14, 2009 NCMMSC2009 Masato AKAGI (赤木 正人), Professor School of Information Science, Japan Advanced Institute of Science and Technology
BASIC CONCEPTS
• Speech production and perception are human activities.
• We study speech production and perception as human activities and construct useful models for advanced sound-processing systems.
AISL (Masato Akagi); IIPL (Jianwu DANG)
Motivation: Global and Universal Communication
• Speech is the most natural and important means of human-human communication in our daily lives.
• Even without understanding a language, we can still judge the expressive content of a voice, such as its emotions.
• Our study aims at:
• constructing universal communication environments that cross languages, nations, and cultures, based on non-linguistic information, and
• globalizing and universalizing human-human communication, so that we can communicate with elders, infants, handicapped persons, etc., and with machines, as well as with people of different languages, nations, and cultures.
Problems
• To make communication possible beyond languages, nations, and cultures, some biological features of speech production and perception must be common to all humans, independent of language, nation, and culture, namely:
• common organ movements for production,
• common acoustic features produced by those common movements,
• common impressions and brain activities evoked by presenting those common acoustic features, and
• common behaviors among communicators.
• We have to:
• discuss what is essential in the speech production and perception of non-linguistic information within the speech chain,
• find biological features common among humans that do not depend on language, nation, or culture, and
• apply these common features to human-machine communication as well as human-human communication.
Modeling of emotional speech perception Target: Emotional Speech
Basic Concept: How do we define an "angry" voice?
• "A voice where the power of components in the high-frequency region is increased by 10 dB over the neutral one" → right, but…
• "A loud voice, a shrill voice, etc." → the usual description of emotional speech
[Figure: layered diagram of the basic concept — acoustic features of the speech voice give rise to primitive impressions ("it sounds bright / heavy / strong"), which in turn give rise to perceived emotions ("he seems happy / sad / angry"); the vagueness of human nature lies between the layers]
Multi-layer model of auditory impression
[Figure: three layers — emotional speech categories (Neutral, Sad, Joy) at the top, semantic primitives (Bright, Heavy, Strong) in the middle, and acoustic features (F0, spectrum, duration) at the bottom; the model is constructed top-down and verified bottom-up]
• Concept:
• high-level psychological features such as emotions (Neutral, Sad, Joy, etc.) are explained by semantic primitives described by relevant adjectives,
• each semantic primitive is conveyed by certain physical acoustic features, and
• each high-level psychological feature is thus related to certain semantic primitives and, through them, to acoustic features.
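As a rough illustration of this layered structure, the sketch below (Python) encodes the two mappings as plain dictionaries and chains them to ask which acoustic features matter for an emotion. All feature names and weights are hypothetical placeholders, not values from the study.

```python
# A minimal sketch of the three-layer structure as plain data; the weights
# and feature names are placeholders, not measured values.

# Top layer -> middle layer: each emotion is explained by semantic primitives.
emotion_to_primitives = {
    "Joy":     {"bright": +0.8, "heavy": -0.6},
    "Sadness": {"bright": -0.7, "heavy": +0.5},
}

# Middle layer -> bottom layer: each primitive is conveyed by acoustic features.
primitive_to_acoustics = {
    "bright": {"F0_mean": +0.9, "spectral_tilt": +0.4},
    "heavy":  {"F0_mean": -0.5, "duration": +0.6},
}

def acoustic_relevance(emotion: str) -> dict:
    """Chain the two layers: which acoustic features matter for an emotion."""
    relevance = {}
    for primitive, w_ep in emotion_to_primitives[emotion].items():
        for feature, w_pa in primitive_to_acoustics[primitive].items():
            relevance[feature] = relevance.get(feature, 0.0) + w_ep * w_pa
    return relevance

# -> {'F0_mean': 1.02, 'spectral_tilt': 0.32, 'duration': -0.36} (approx.)
print(acoustic_relevance("Joy"))
```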
Development of the model: For emotions
• 5 emotions: Neutral, Joy, Sad, Cold Anger, and Hot Anger, selected from the database produced by the Fujitsu Laboratory and recorded by a professional actress
• Experiment 1: examine utterances in terms of emotion
• Experiment 2: construct a perceptual space of utterances in different categories using multidimensional scaling (MDS)
• Experiment 3: determine suitable primitive features for the perceptual model
Exp. 1 Examine subjects’ perception of expressive utterances
Exp. 2 Construct a psychological distance model, and Exp. 3 Determine suitable primitive features
[Figure: perceptual space from MDS in which utterances cluster by category — Neutral (n1, n2, n3), Joy (j1, j2, j3), Hot Anger (ha1, ha2, ha3), Cold Anger (ca1, ca2, ca3), and Sadness (s1, s2, s3)]
Appropriately describe the vagueness of human nature with fuzzy logic
[Figure: the same judgment ("this time he is joyful") graded as "slightly joyful", "joyful", or "very joyful", illustrated with heights of 20 cm, 40 cm, and 60 cm]
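A minimal sketch of how such graded judgments can be encoded with fuzzy membership functions, assuming triangular memberships and hypothetical breakpoints; the actual FIS used in the study is not reproduced here.

```python
import numpy as np

def triangular(x, a, b, c):
    """Triangular membership function peaking at b on the support [a, c]."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Hypothetical membership functions for the degree of "joyful" on a 0-1 scale.
x = 0.55  # a perceived degree of joy
memberships = {
    "slightly joyful": triangular(x, 0.0, 0.25, 0.5),
    "joyful":          triangular(x, 0.25, 0.5, 0.75),
    "very joyful":     triangular(x, 0.5, 0.75, 1.0),
}
print(memberships)  # x=0.55 is mostly "joyful", slightly "very joyful"
```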
Calculate a regression line to fit the output of the FIS; the slope of the regression line indicates the relationship:
• the higher the absolute value of the slope, the more closely related the two quantities are, and
• a positive slope indicates a positive correlation, and vice versa.
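A minimal sketch of this step: fit a line to hypothetical FIS outputs with numpy and read the direction and strength of the relationship off the slope.

```python
import numpy as np

# Hypothetical data: FIS output for one semantic primitive (y) as one
# normalized acoustic feature is varied (x).
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # normalized acoustic feature
y = np.array([0.1, 0.3, 0.55, 0.7, 0.9])    # FIS output (degree of primitive)

slope, intercept = np.polyfit(x, y, 1)       # least-squares regression line
direction = "positive" if slope > 0 else "negative"
print(f"slope = {slope:.2f} ({direction} correlate; |slope| = strength)")
```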
Results are compatible with the way humans respond when they perceive expressive speech.
Analyze acoustic features to build the relationship
• 27 acoustic features were measured
• Conduct a correlation analysis
• Select the acoustic features whose correlation coefficients exceed 0.6
• 16 acoustic features are most related to the semantic primitives
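A sketch of the selection step, assuming Pearson correlation and interpreting "over 0.6" as an absolute-value threshold; the ratings and feature values below are random stand-ins, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: ratings of one semantic primitive for 20 utterances, and a set
# of 27 measured acoustic features (random here, so few if any will pass).
ratings = rng.normal(size=20)
features = {f"feature_{i}": rng.normal(size=20) for i in range(27)}

selected = {}
for name, values in features.items():
    r = np.corrcoef(values, ratings)[0, 1]   # Pearson correlation coefficient
    if abs(r) > 0.6:                         # keep only strongly related features
        selected[name] = r

print(f"{len(selected)} features retained:", selected)
```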
Understand the vagueness of human nature through the resulting relationships: Joy
Synthesis of Emotional Speech Target: Verify the three-layered model (From bottom to top)
Bottom-Up Approach
• The approach is to verify the constructed model for expressive speech perception from the bottom up.
[Figure: the three layers — acoustic features (F0, spectrum, duration), semantic primitives (Bright, Heavy, Strong), and expressive speech categories (Neutral, Sad, Joy) — verified upward from the acoustic features]
Implementation: Flow of the Main Function
[Figure: processing flow from the original utterance to the morphed utterance]
Implementation: Flow of the Morphing Process
[Figure: control parameters (F1, F2, F3, SPTL, SB, RS, HP, AP, RS1st, PRAP, PWR, PRS1st, RHT, TL, CL, RCV) applied to the F0 contour, spectrum, power envelope, and time duration]
Modifying Acoustic Features
• Decompose the speech signal into acoustic features so that they can be modified independently
[Figure: STRAIGHT analyzes the speech wave into F0, aperiodicity (AP), and the spectrum sequence; temporal decomposition with polynomial fitting yields the F0 and AP targets and event functions, while a spectral GMM yields the spectrum target (formant frequency, width, and gain) and its event function]
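A sketch of such a decomposition using the WORLD vocoder (via the pyworld package) as a stand-in for STRAIGHT; the file name is hypothetical, and the temporal-decomposition and spectral-GMM stages are not reproduced here.

```python
import numpy as np
import pyworld as pw    # WORLD vocoder, used here as a stand-in for STRAIGHT
import soundfile as sf

# Decompose a speech wave into independently modifiable streams:
# F0 contour, smoothed spectral envelope, and aperiodicity (AP).
x, fs = sf.read("utterance.wav")          # hypothetical mono input file
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.harvest(x, fs)                 # F0 contour and frame times
sp = pw.cheaptrick(x, f0, t, fs)          # smoothed spectral envelope
ap = pw.d4c(x, f0, t, fs)                 # aperiodicity

# Each stream can now be modified separately and the wave resynthesized,
# e.g. raising the whole F0 contour by 20%:
y = pw.synthesize(f0 * 1.2, sp, ap, fs)
sf.write("morphed.wav", y, fs)
```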
Temporal decomposition (TD)
• Temporal decomposition (Atal, 1983):
$$\hat{y}(n) = \sum_{k=1}^{K} a_k \,\phi_k(n), \qquad 1 \le n \le N$$
where $\{a_k\}$ are the event targets, $\{\phi_k(n)\}$ are the event functions, $N$ is the number of frames, and $K$ is the number of events ($K \ll N$).
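A minimal numeric sketch of the reconstruction implied by the TD equation, with illustrative shapes and random placeholder values:

```python
import numpy as np

# Reconstruct a spectral-parameter sequence from K event targets and event
# functions, following y_hat(n) = sum_k a_k * phi_k(n).
K, N, P = 5, 100, 12                    # events, frames, parameter dimension
rng = np.random.default_rng(0)
a = rng.normal(size=(K, P))             # event targets {a_k}, one vector per event
phi = np.abs(rng.normal(size=(K, N)))   # event functions {phi_k(n)}, K << N

y_hat = phi.T @ a                       # (N, P): each frame is a weighted sum of targets
print(y_hat.shape)                      # (100, 12)
```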
Characteristics of TD
• Original algorithm (Atal, 1983):
• event functions model temporal evolution
• event targets are "ideal" articulatory targets
• high computation cost
• sensitive to the number and locations of events
• MRTD (Nguyen and Akagi, 2003):
• "well-shapedness" property: an efficient model of temporal evolution
• event targets carry the speaker's identity
• Merits in spectral modification:
• flexibility in modifying speech signals
• ensures the smoothness of the modified speech
• models temporal evolution
Modeling of event functions
• Identification of event locations: based on phonemes
• Modeling of event functions: polynomial fitting in the time domain
• flexible time-scale modification
• ensures the smoothness of the modified speech
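A minimal sketch of this idea: fit a polynomial to a (hypothetical) event function on a normalized time axis, then evaluate it on a stretched axis, so that the time-scale modification stays smooth.

```python
import numpy as np

# Fit a low-order polynomial to one event function, then resample the fitted
# curve on a stretched time axis: the modification stays smooth because the
# polynomial, not the raw samples, is stretched.
n = np.arange(50)
event_fn = np.exp(-((n - 25) / 8.0) ** 2)        # hypothetical bell-shaped event

coeffs = np.polyfit(n / n[-1], event_fn, deg=6)  # fit on normalized time [0, 1]

stretch = 1.5                                    # lengthen duration by 50%
n_new = np.arange(int(len(n) * stretch))
event_fn_stretched = np.polyval(coeffs, n_new / n_new[-1])
```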
Spectrum modeling
• Spectrum modeling using a GMM (Zolfaghari et al., 1996): fit a GMM to a smoothed spectrum
• Gaussian mixture model with parameters $\mu_m$ (mean), $\sigma_m$ (standard deviation), and $w_m$ (mixture weight) for each of the $M$ components
[Figure: flowchart for estimating the spectral-GMM parameters]
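One way to sketch fitting a GMM to a smoothed spectrum is to treat the normalized envelope as a density over frequency, sample from it, and run EM with scikit-learn; this is an illustrative stand-in, not the estimator used in the study, and the envelope below is synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical smoothed envelope with three formant-like peaks.
freqs = np.linspace(0, 8000, 512)
spectrum = (np.exp(-((freqs - 700) / 150) ** 2)
            + 0.6 * np.exp(-((freqs - 1800) / 200) ** 2)
            + 0.3 * np.exp(-((freqs - 2900) / 250) ** 2))

# Treat the normalized envelope as a density over frequency and sample it.
p = spectrum / spectrum.sum()
rng = np.random.default_rng(0)
samples = rng.choice(freqs, size=5000, p=p).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(samples)
for mu, var, w in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), gmm.weights_):
    print(f"mean={mu:7.1f} Hz  std={np.sqrt(var):6.1f} Hz  weight={w:.2f}")
```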
Characteristics of the spectral GMM
• Advantages:
• fits a GMM to a smoothed spectrum
• the spectral GMM carries formant information
• Gaussians have a locality property
• Drawbacks:
• no constraints to control the Gaussian positions (two Gaussians may end up modeling the same peak)
→ how can the spectral-GMM parameters be controlled?
Solution to the problem
• What constraints on the Gaussian positions prevent two Gaussians from modeling the same peak?
[Figure: without constraints, two Gaussians model the same spectral peak]
Spectral modification algorithm
• Aims:
• spectrum modification that respects the role of formants
• spectrum modeling with spectral GMMs
• control of the spectral GMMs in accordance with the formants
• Problems:
• the spectral GMM is related to formant information, but not to the real formants
• no constraints on the spectral-GMM parameters
→ an empirical rule
Spectral modification algorithm
• What are the correspondences between spectral peaks and formant ranges?
• What are the relations between spectral peaks and Gaussians?
• Change the mean values of the Gaussians in accordance with the formant shift factors
Example of spectral modification
• Formant shift factors: F1 = 30%, F2 = −10%, F3 = 20%, and F4 = 15%
[Figure: spectra modified by the LP-based method (left) and our proposed spectral-GMM method (right)]
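A minimal sketch of the mean-shifting step using the factors from this example, assuming the Gaussian means have already been assigned to the formant ranges F1–F4 (the empirical-rule step); the mean values themselves are hypothetical.

```python
import numpy as np

# Gaussian means of a spectral GMM, already matched to formants F1..F4.
gaussian_means = np.array([700.0, 1800.0, 2900.0, 3800.0])  # Hz, hypothetical
shift_factors = np.array([0.30, -0.10, 0.20, 0.15])          # F1..F4 from the slide

# Shift each mean in proportion to its formant shift factor.
shifted_means = gaussian_means * (1.0 + shift_factors)
print(shifted_means)  # [ 910. 1620. 3480. 4370.]
```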
Develop base and intensity rules for semantic-primitive perception
1. Base Rule
• Purpose: verify the combination of acoustic features — will a "bright" voice really be affected by HP, AP, and F2?
• Development: use the analyzed results as the parameters of the rules
2. Intensity Rule
• Purpose: verify how strongly the semantic primitives are affected by the acoustic features, i.e., the width of the lines between them — does the voice sound brighter?
• Development: adjust the parameters of the base rules
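A toy sketch of the two rule types for the primitive "bright", assuming the HP/AP/F2 dependence named above; the weights and the linear form are placeholders, not the rules derived in the study.

```python
# Base rule: a weighted combination of normalized acoustic features
# (weights here are placeholders, not the analyzed parameters).
BASE_RULE = {"HP": 0.5, "AP": 0.3, "F2": 0.2}

def brightness(features: dict, intensity: float = 1.0) -> float:
    """Base rule: weighted sum of features.
    Intensity rule: scale the rule's effect to produce graded stimuli."""
    base = sum(w * features[name] for name, w in BASE_RULE.items())
    return intensity * base

features = {"HP": 0.8, "AP": 0.6, "F2": 0.4}
for level, k in [("SU1", 0.5), ("SU2", 1.0), ("SU3", 1.5)]:
    print(level, round(brightness(features, k), 2))
```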
Results show that utterances morphed by the SR intensity rules were perceived according to the intended intensity levels: N (Neutral) < SU1 < SU2 < SU3 (1)
Results show that utterances morphed by the SR intensity rules were perceived according to the intended intensity levels: N (Neutral) < SU1 < SU2 < SU3 (2)
Develop base and intensity rules for expressive-category perception
1. Base Rule
• Purpose: verify the combination of semantic primitives — will a "Joy" voice really sound bright, unstable, and clear, but not quiet and weak?
• Development: use the FIS outputs as the parameters of the rules
2. Intensity Rule
• Purpose: verify how an expressive speech category is affected by the semantic primitives, i.e., the width and style of the lines between them — does a brighter voice sound more joyful?
• Development: adjust the parameters of the base rules
Results show that utterances morphed by the ER intensity rules were perceived according to the intended intensity levels: N (Neutral) < EU1 < EU2 < EU3. We are applying this model to emotion recognition in speech.
Comparison between Japanese and Mandarin listeners
[Figure: semantic primitives related to each expressive speech category, as selected by Mandarin (a) and Japanese (b) listeners]
1. The first 10 semantic primitives shared by the Mandarin and Japanese listeners have the same valence (i.e., positive or negative correlation).
2. Six of the 10 semantic primitives are associated with the same two acoustic features, those with the highest correlations.
3. Bright, dark, low, heavy, and clear are associated with average pitch (AP) and highest pitch (HP), while strong is associated with power range (PWR) and the mean power range in the accentual phrase (PRAP).
Appendix: Development of the model: For the singing voice
• The same method can be applied to singing-voice synthesis.
Demonstration (1)
• Speaking voice (input): (male) (female)
• Synthesized singing voice: (male) (female) (chorus)
We took first place in the SINGING SYNTHESIS CHALLENGE held at Interspeech 2007.
Demonstration (2) • Another register …: Falsetto
Summary
• We introduced some activities in the perception part of our ongoing research project.
• The contents we presented are the multi-layer model for expressive speech perception and its application to expressive speech synthesis, namely:
• emotional voice synthesis, and
• singing-voice synthesis.
• We plan to illustrate the effectiveness of the model with many more examples of applications in the future.