Emotions in Hindi - Recognition and Conversion
S.S. Agrawal, CDAC, Noida & KIIT, Gurgaon
email: ssagrawal@cdacnoida.in, ss_agrawal@hotmail.com
Contents • Intonation patterns with sentence type categories • A relationship between F0 values in vowels and emotions – analytical study • Recognition and perception of emotions based on spectral and prosodic values obtained from vowels • F0 pattern analysis of emotion sentences in Hindi • Emotion conversion using the intonation database from sentences and words • Comparison of machine and perception experiments
Intonation Patterns of Hindi • Hindi speech possesses distinct pitch patterns depending on sentence meaning, structure, and type. • Intonation also decides the meaning of certain words, depending on the type of sentence or phrase in which they occur. • In Hindi we observe three levels of intonation, which can be classified as ‘normal’, ‘high’, and ‘low’. • In exceptional cases, VH (very high) and EH (extremely high) levels are also felt, though they rarely occur. • For observing intonation patterns due to sentence type, we may classify sentences into the following eight categories: Affirmative, Negative, Interrogative, Imperative, Doubtful, Desiderative, Conditional, and Exclamatory.
Intonation Patterns of Hindi • Affirmative (MHL pitch pattern) • Negative (MHL pitch pattern) • Imperative (ML pitch pattern) • Doubtful
Intonation Patterns of Hindi (contd.) • Desiderative • Exclamatory (MHM pitch pattern)
Application to Emotional Behavior • Recognition of Emotion • Conversion of Emotion
Emotion Recognition For natural human-machine interaction, machines require a degree of emotional intelligence. To respond satisfactorily to human emotions, computer systems need accurate emotion recognition. Emotion recognition can also be used to monitor the physiological state of individuals in several demanding work environments and to augment automated medical or forensic data-analysis systems.
METHOD Material Speakers: Six male graduate students (from the drama club, Aligarh Muslim University, Aligarh), native speakers of Hindi, age group 20-23 years. Sentences: 5 short neutral Hindi sentences. Emotions: neutral, happiness, anger, sadness, and fear. Repetitions: 4. In this way there were 600 (6 × 5 × 5 × 4) sentences.
Recording Electret microphone, partially sound-treated room, “PRAAT” software, sampling rate 16 kHz / 16 bit. The distance between the mouth and the microphone was kept at approximately 30 cm.
Listening test The above 600 sentences were first randomized across sentences and speakers, and then presented to 20 naive listeners, who evaluated the emotions within five categories: neutral, happiness, anger, sadness, and fear. Only those sentences whose emotions were identified by at least 80% of the listeners were selected for this study. After selection, we were left with 400 sentences for the study.
Acoustic Analysis Prosody-related features: mean pitch (F0), duration, rms value of sound pressure, and speech power. Spectral features: 15 mel-frequency cepstral coefficients (MFCCs).
Prosody features For the present study, the central 60 ms portion of the vowel /a/ occurring at different positions in all the sentences (underlined in the sentences given in the Appendix) was used to measure all the features. In total there were 13 /a/ vowels (3 in the first sentence, 3 in the second, 2 in the third, 4 in the fourth, and 1 in the fifth). After extracting the 60 ms segments of all the /a/ vowels, the values were averaged over the vowels of each sentence. Besides F0, speech power and sound pressure were also calculated.
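As an illustration of this measurement, a minimal Python sketch using parselmouth (a Python interface to Praat) is given below; the original analysis was done in Praat itself, and the file name and vowel times here are placeholders, not data from the study.

# Sketch (not the original script): mean F0 of the central 60 ms of a vowel,
# measured with parselmouth. File name and vowel boundaries are illustrative.
import parselmouth
from parselmouth.praat import call

def mean_f0_of_vowel(wav_path, vowel_start, vowel_end,
                     window=0.060, floor=75, ceiling=500):
    """Return mean F0 (Hz) of the central `window` seconds of a vowel."""
    sound = parselmouth.Sound(wav_path)
    centre = 0.5 * (vowel_start + vowel_end)
    part = sound.extract_part(from_time=centre - window / 2,
                              to_time=centre + window / 2)
    pitch = part.to_pitch(pitch_floor=floor, pitch_ceiling=ceiling)
    return call(pitch, "Get mean", 0, 0, "Hertz")

# Example: an /a/ located between 0.42 s and 0.55 s in one recording (made-up times)
# print(mean_f0_of_vowel("sentence1_anger.wav", 0.42, 0.55))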
Feature extraction method Praat software was used to measure all the prosody features. The figure shows the waveform (upper) and spectrogram (lower, with pitch as a blue line) for the word /sItar/: (a) anger, (b) fear, (c) happiness, (d) neutral, and (e) sadness, as obtained in “PRAAT”.
Table 1: F0 values (Hz) for vowels

Emotion     A      E      I      i      Avg
Anger       237    218    222.2  244    234.6
Sadness     107    110.4  106.0  111    110.1
Neutral     134.5  131.0  132.5  146.9  136.3
Happiness   194.5  190.5  189.1  189.0  191.8
Fear        160.9  163.7  162.3  191.2  173.2
Spectral features MFCC coefficients were calculated using MATLAB. The frame duration was 16 ms, with 9 ms overlap between frames. From each frame, 3 MFCCs were calculated; with five frames this gave 15 MFCCs for each sample. Together with the four prosody features, there were 19 parameters in total. All 19 measured parameters of the sentences of each emotion were then normalized with respect to the parameters of the neutral sentences of the given speaker.
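The original MFCCs were computed in MATLAB; the sketch below shows one way the same 15-value vector (3 coefficients × 5 frames, 16 ms frames with 9 ms overlap) could be approximated in Python with librosa. The function name and the synthetic segment are illustrative assumptions, not part of the study.

# Sketch (an approximation, not the original MATLAB code): 3 MFCCs per 16 ms
# frame with 9 ms overlap (hop = 7 ms); the first five frames of a 60 ms vowel
# segment give the 15 spectral parameters described above.
import numpy as np
import librosa

def vowel_mfcc_vector(segment, sr=16000):
    n_fft = int(0.016 * sr)         # 16 ms frame -> 256 samples at 16 kHz
    hop = int(0.007 * sr)           # 16 ms - 9 ms overlap -> 7 ms hop
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=3,
                                n_fft=n_fft, hop_length=hop)
    return mfcc[:, :5].T.flatten()  # 5 frames x 3 coefficients = 15 values

# Example with a synthetic 60 ms segment (placeholder for a real vowel cut)
# seg = np.random.randn(int(0.060 * 16000)).astype(np.float32)
# print(vowel_mfcc_vector(seg).shape)   # (15,)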
Recognition of emotion Independent variables: measured acoustic parameters. Dependent variables: emotional categories. Recognition was done by human listeners as well as by a neural network classifier. By people The selected 400 sentences were randomized sentence-wise and speaker-wise. These randomized sentences were presented to 20 native listeners of Hindi, who identified the emotions within five categories: neutral, happiness, anger, sadness, and fear. All listeners had a Hindi-medium educational background and were aged 18 to 28 years.
By neural network classifier (using PRAAT software) 70% of the data was used for training and 30% for the classification test. As the parameters were normalized with respect to the neutral category, only four emotions (anger, fear, happiness, and sadness) were recognized by the classifier.
Contd…. • In the present study a 3-layered (two hidden layers and one output layer) feed-forward neural network was used, in which both hidden layers had 10 nodes each. • There were 19 input units, representing the acoustic parameters used. • The output layer had 4 units, representing the output categories (4 emotions in the present case). • Results were obtained with the neural network classifier using 2000 training epochs and 1 run for each data set.
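The study itself used Praat's built-in feed-forward network; the sketch below shows a comparable setup in Python with scikit-learn (a stand-in for illustration, not the original classifier): 19 normalized inputs, two hidden layers of 10 nodes, 4 output classes, a 70/30 train-test split, and up to 2000 training epochs.

# Sketch of a comparable classifier (not the Praat network used in the study).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def train_emotion_classifier(X, y):
    """X: (n_samples, 19) normalized features; y: labels in
    {'anger', 'fear', 'happiness', 'sadness'}."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                              stratify=y, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000,
                        random_state=0)
    clf.fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)

# Example with placeholder data (real features come from the 400 selected sentences)
# X = np.random.randn(400, 19)
# y = np.random.choice(['anger', 'fear', 'happiness', 'sadness'], size=400)
# clf, acc = train_emotion_classifier(X, y)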
RESULT AND DISCUSSION Recognition of emotion By people Most recognizable emotion: Anger (82.3%) Least recognizable emotion: Fear (75.8%). Average recognition of emotion: 78.3 %. Recognition of emotion was in the order: anger > sadness > neutral > happy > fear
Table 1. Confusion matrix of emotion recognition by people
By neural network classifier (NNC) The confusion matrix obtained by the NNC is shown in Table 2. Most recognizable emotions: anger (90%) and sadness (90%). Least recognizable emotion: fear (60%). Average recognition of emotion: 80%. The recognition of emotion was in the order: anger = sadness > happy > fear. Figure 2 shows a histogram comparing the percentage of correct emotion recognition by people and by the NNC.
Figure2 Comparison of percentage correct recognition of emotion by people and NNC
Intonation based Emotional Database • Six native speakers, 20 Hindi sentences, five expressive styles: • Neutral • Sadness • Anger • Surprise • Happy
Happiness • F0 curve of utterances • rise and fall pattern at the beginning of the sentences • hold pattern at the end of the sentences
Anger • F0 curve of utterances • rise & fall in the beginning of the sentences. • fall towards the end of the sentences
Sadness • F0 contour of utterances of sadness • fall or hold at the end of sentences • fall & rise in the beginning of the sentences • fall-fall pattern throughout the contour
Normal • F0 curve of utterances • falls at the end of the utterances • rise & fall in the beginning of the sentences • In most cases we observed a fall in sentence-final position, irrespective of the speaker
Surprise • F0 curve of utterances • rise & fall pattern for sentence-initial position • rise pattern for sentence-final position • most utterances of the surprise emotion take the form of a question-based surprise state
Emotion Conversion • Storing all utterances of all the expressive styles is a difficult and time-consuming task. • It also consumes huge memory space. • There should be an approach that minimizes the time and memory space needed for an emotion-rich database. • Taking this into consideration, the authors have proposed an algorithm for emotion conversion.
Contd… • This algorithm requires storing only neutral utterances in the database. • Utterances in the other expressive styles are produced from the neutral emotion. • The proposed algorithm is based on a linear modification model (LMM), in which the fundamental frequency (F0) is one of the factors used to convert emotions.
Intonation based Emotional Database • Another database, directly associated with the main module of emotion conversion. • This database keeps the pitch point values (see the table below) for the utterances already present in the speech database. • The number of pitch points is based on the number of syllables present in the sentence and the resolution frequency (fr). • The resolution frequency is the minimum amount by which every remaining pitch point must lie above or below the line that connects its two neighbouring pitch points.
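To make the criterion concrete, the sketch below implements one reading of this stylization rule in Python: a pitch point counts as excess while it lies within fr of the straight line joining its neighbours. This is an assumption-based illustration, not the authors' code.

# Sketch illustrating the resolution-frequency criterion (not the authors' code).
def distance_from_neighbour_line(points, i):
    """points: list of (time, f0); distance of point i from the line
    connecting points i-1 and i+1, measured along the frequency axis."""
    (t0, f0), (t1, f1), (t2, f2) = points[i - 1], points[i], points[i + 1]
    f_on_line = f0 + (f2 - f0) * (t1 - t0) / (t2 - t0)
    return abs(f1 - f_on_line)

def stylize(points, fr):
    """Greedily remove the interior point closest to its neighbour line
    until every remaining point lies more than fr away from that line."""
    pts = list(points)
    while len(pts) > 2:
        dists = [distance_from_neighbour_line(pts, i) for i in range(1, len(pts) - 1)]
        i_min = min(range(len(dists)), key=dists.__getitem__) + 1
        if dists[i_min - 1] > fr:
            break           # even the closest point exceeds fr: stylization done
        del pts[i_min]      # remove the excess pitch point and repeat
    return pts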
Table: Pitch point table (neutral emotion, recorded by one of the speakers).
Sentence  Pt1 (Hz)  Pt2    Pt3    Pt4    Pt5    Pt6    Pt7    Pt8    Pt9    Pt10   Pt11
1         200.4     232.2  391.4  236.5  185.7  208.1  496.8  211.6  179.2  -      -
2         244.2     213.8  262.6  -      159.6  210.0  172.6  177.4  -      -      -
3         200.9     262.3  231    259    219.7  175.8  87.7   88.8   207.6  201.9  -
4         200.1     231.1  183.8  230.1  234.6  188.7  173.2  152.1  246.8  233.7  -
5         227.3     255.3  220.1  249    189.9  231.7  166.7  221.5  187.4  170.5  203.4
6         232.9     252.7  197.7  237.5  205.5  258.3  206.8  246.3  201.9  193.5  -
7         205.7     237.6  203.4  228.2  165.9  202.1  -      -      -      -      -
8         260.9     230.1  251.8  211.6  238.3  200.3  98.4   94.2   202.3  182    -
9         258.5     215.5  202.3  233.7  175.8  144.3  83     197.5  181.9  -      -
10        229       203.9  316.8  229.4  207.2  79     256.8  192.8  202.4  148.3  193
11        208.5     201.8  235    203.5  216.7  507.9  489    216.2  168.4  85.3   96.7
12        253.1     223.5  251.4  221.6  249.9  189.7  172.7  85.6   89.2   203.3  186.8
13        229.4     204    273.6  198.3  240.7  200.3  234    161    198.3  -      -
14        244.6     265.6  224.4  280.5  198.4  265.6  165.7  191.7  -      -      -
15        259.6     209.7  308.8  235.6  224.7  252.5  205.4  177.4  -      -      -
16        210.6     223.2  181.3  91.3   93.6   -      -      -      -      -      -
17        277       225.1  107.3  105.2  229.7  110.9  108.8  211.1  198.4  98.1   93.4
18        273       234.4  262.9  204.4  228    506.0  257.6  180.5  185.7  -      -
19        264.8     219.6  254.5  195.9  225.1  184.2  192.7  97.5   189.9  209.8  178
20        242.5     207.9  257.7  179    201.3  162.8  191.2  -      -      -      -
F0 Based Emotion Conversion • Emotion conversion at Sentence level • Emotion conversion at Word level
F0 Based Emotion Conversion • In these methods, the pitch points (Pi) were studied for the desired source emotion (neutral) and the target emotion (surprise), and the differences between corresponding pitch points were evaluated after normalization. • This serves as an indicator of the values by which the pitch points of the source utterance must be increased or decreased to convert it into the target utterance. • For pitch analysis, the step length is taken as 0.01 second and the minimum and maximum pitch as 75 Hz and 500 Hz. A stylization process is then performed to remove the excess pitch points, and the valid number of pitch points is noted.
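A hedged Python sketch of this analysis step is given below, using parselmouth with the stated settings (0.01 s step, 75-500 Hz); the file name is a placeholder, and the stylize() helper from the earlier sketch can be applied to the returned points. This is an assumed workflow, not the original script.

# Sketch: extract voiced pitch points of an utterance with a 0.01 s step and
# a 75-500 Hz pitch range; stylization can then remove the excess points.
import parselmouth

def pitch_points(wav_path, step=0.01, floor=75, ceiling=500):
    sound = parselmouth.Sound(wav_path)
    pitch = sound.to_pitch(time_step=step, pitch_floor=floor,
                           pitch_ceiling=ceiling)
    f0 = pitch.selected_array['frequency']
    times = pitch.xs()
    return [(t, f) for t, f in zip(times, f0) if f > 0]  # keep voiced frames only

# points = stylize(pitch_points("neutral_utt01.wav"), fr=2.0)  # placeholder file and fr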
F0 Based Emotion Conversion… • On comparing the source and target emotion training sets, the pitch points are divided into four groups whose initial frequencies are set as x1, x2, x3, and x4 respectively. On the basis of observations from the training set, y1, y2, y3, and y4 are added to the corresponding x values. • In some cases, the pitch point number also matters and is taken into account to decide the transformed F0 value. • The xi and yi values were arrived at after a rigorous analysis of the pitch patterns of neutral and emotional utterances.
Pitch point 1:  Range difference y (Hz):   +40    +100   +150
                Utterance frequency:        82%    10%    8%
Pitch point 2:  Range difference y (Hz):   -40    +40    >+100
                Utterance frequency:        25%    70%    5%
Pitch point 3:  Range difference y (Hz):   -100   +25    >+80
                Utterance frequency:        10%    73%    17%
Pitch point 4:  Range difference y (Hz):   -10    +40    >+80
                Utterance frequency:        17%    55%    28%
Sentence Based Emotion Conversion - Algorithm
• Select the desired sound waveform
• Convert the speech waveform into a pitch tier
• // Stylization
• For all Pi:
  • Select the Pi that lies closest to the straight line joining its neighbours and compare its distance with the resolution frequency (fr)
  • If the distance between Pi and the straight line > fr, stop the process
  • else remove Pi and repeat for the remaining Pis
• Divide the pitch points into four groups
• For each group:
  • group[1] = x1 + y1 or x1 - y1
  • group[2] = x2 + y2 + 2 × (pitch point number)
  • group[3] = x3 + y3
  • group[4] = x4 + y4 + 3 × (pitch point number)
• Remove the existing pitch points
• Add the newly calculated pitch points in place of the old ones
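The sketch below gives one possible reading of the group-wise modification step in Python: the stylized pitch points are split into four groups and shifted by group-specific offsets, with an extra term proportional to the pitch point number for groups 2 and 4. The offsets used here are illustrative only (loosely echoing the most frequent range differences in the table above) and are not the trained x/y values from the study.

# Sketch of the group-wise F0 shift (one reading of the algorithm above,
# with illustrative offsets, not a definitive implementation).
import numpy as np

GROUP_OFFSETS_HZ = {1: 40, 2: 40, 3: 25, 4: 40}   # y1..y4 (illustrative values)

def convert_pitch_points(points, offsets=GROUP_OFFSETS_HZ):
    """points: stylized [(time, f0), ...] of a neutral utterance.
    Returns shifted pitch points approximating the target (surprise) contour."""
    groups = np.array_split(np.arange(len(points)), 4)    # four groups of points
    new_points = []
    for g, idx in enumerate(groups, start=1):
        for n, i in enumerate(idx, start=1):               # n = pitch point number within group
            t, f = points[i]
            f_new = f + offsets[g]
            if g == 2:
                f_new += 2 * n                             # group 2: + 2 * pitch point number
            elif g == 4:
                f_new += 3 * n                             # group 4: + 3 * pitch point number
            new_points.append((t, f_new))
    return new_points

# surprise_points = convert_pitch_points(points)  # `points` from the earlier sketches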
Figure 1. Pitch points for natural neutral emotion Figure 2. Pitch points for natural surprise emotion
Experimental Results • For this process, the sentence “कल तुम्हें फाँसी हो जाएगी।” (“Kal tumhein phansi ho jayegi” – “You will be hanged tomorrow.”) was considered, and the results are given in Figure 5 and Table 5. • In the figure, the upper panel shows the natural surprise utterance and the lower panel shows the utterance transformed from neutral to surprise. • Table 5 gives an idea of how the conversion algorithm works, pitch point by pitch point.
Figure 5 Natural and transformed Surprise emotion utterance.