

Determining Which Acoustic Features Contribute Most to Speech Intelligibility. John-Paul Hosom, Alexander Kain, Akiko Kusumoto. hosom@cslu.ogi.edu. Center for Spoken Language Understanding (CSLU), OGI School of Science & Engineering, Oregon Health & Science University (OHSU).


Presentation Transcript


  1. Determining Which Acoustic Features Contribute Most to Speech Intelligibility John-Paul Hosom, Alexander Kain, Akiko Kusumoto hosom@cslu.ogi.edu Center for Spoken Language Understanding (CSLU), OGI School of Science & Engineering, Oregon Health & Science University (OHSU)

  2. image from http://www.ph.tn.tudelft.nl/~vanwijk/Athens2004/niceones/images.html

  3. Outline • Introduction • Background: Speaking Styles • Background: Acoustic Features • Background: Prior Work on Clear Speech • Objectives of Current Study • Methods • Results • Conclusion

  4. 1. Introduction Motivation #1: • Difficult to understand speech in noise, especially with a hearing impairment. • When people speak clearly, speech becomes more intelligible. • Automatic enhancement of speech could be used in next-generation hearing aids. • Attempts to modify speech by computer to improve intelligibility have not yet been very successful. • Need to understand which parts of the signal should be modified and how to modify them.

  5. 1. Introduction Motivation #2: • Even best current model for computer speech recognition does not provide sufficiently accurate results. • Current research applies new mathematical techniques to this model, but techniques are generally not motivated by studies of human speech perception. • A better understanding of how acoustic features contribute to speech intelligibility could guide research on improving computer speech recognition.

  6. 1. Introduction Research Objective: To identify the relative contribution of acoustic features to intelligibility by examining conversational and clear speech. Long-Term Goals: • Accurately predict speech intelligibility from acoustic features, • Integrate most effective features into computer speech-recognition models, • Develop novel signal-processing algorithms for hearing aids.

  7. 2. Speaking Styles: Production “Conversational speech” and “clear speech” easily produced with simple instructions to speakers. • Conversational (CNV) speech: “read text conversationally as in daily communication.” • Clear (CLR) speech: “read text clearly as if talking to a hearing-impaired listener.”

  8. 2. Speaking Styles: Perception To compare CNV and CLR speech intelligibility, the same sentences are read in both styles, then played to a group of listeners. Intelligibility is measured as the percentage of sentences that are correctly recognized by the listener. CLR speech increases intelligibility for a variety of: • Listeners (young listeners, elderly listeners) • Speech materials (meaningful sentences, nonsense syllables) • Noise conditions (white noise, multi-talker babble noise)

  9. Outline • Introduction • Background: Speaking Styles • Background: Acoustic Features • Background: Prior Work on Clear Speech • Objectives of Current Study • Methods • Results • Conclusion

  10. 3. Acoustic Features: Representations • Acoustic Features • Duration (length of each distinct sound) • Energy • Pitch • Spectrum (spectrogram) • Formants • Residual (power spectrum without formants)

  11. 3. Acoustic Features: Waveform • Time-Domain Waveform • “Sound originates from the motion or vibration of an object. This motion is impressed upon the [air] as a pattern of changes in pressure.” [Moore, p. 2] [Figure: time-domain waveform of the word “two”, amplitude vs. time (msec)]

  12. 3. Acoustic Features: Energy • Energy • Energy is proportional to the square of the pressure variation. A log scale is used to reflect human perception: energy = 10 · log10( (1/N) · Σ_{n=1..N} x_n² ), where x_n = waveform sample x at time point n and N = number of time samples. [Figure: waveform and its energy contour]
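The log-energy computation above can be sketched in Python (a minimal sketch; the function name and frame values are illustrative, not from the study):

```python
import math

def log_energy(samples):
    """Log-scale energy of a frame: 10*log10 of the mean squared sample value."""
    e = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(e) if e > 0 else float("-inf")

# Two synthetic 20-ms frames at 8 kHz; the louder one has higher log energy.
fs = 8000
quiet = [0.01 * math.sin(2 * math.pi * 100 * n / fs) for n in range(160)]
loud = [0.5 * math.sin(2 * math.pi * 100 * n / fs) for n in range(160)]
print(log_energy(loud) - log_energy(quiet))  # ~34 dB (an amplitude ratio of 50)
```

Because both frames have the same shape, the difference is exactly 10·log10(50²) ≈ 33.98 dB, illustrating why a log scale compresses large amplitude ratios into perceptually reasonable steps.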

  13. 3. Acoustic Features: Pitch • Pitch • Pitch (F0) is the rate of vibration of the vocal folds. [Figure: Speech Production Apparatus, showing the nasal tract, vocal tract, tongue, and vocal folds (larynx); inset: airflow through the vocal folds, amplitude vs. time (from Olive, p. 23)]

  14. 3. Acoustic Features: Pitch • Pitch • Pitch (F0) is the rate of vibration of the vocal folds. [Figure: two waveforms, amplitude vs. time (msec), one with pitch = 117 Hz and one with pitch = 83 Hz]
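One common way to estimate F0 from a waveform like the ones pictured is autocorrelation: find the lag at which the signal best matches a delayed copy of itself. A minimal sketch (function name and search range are illustrative; real pitch trackers add voicing decisions and smoothing):

```python
import math

def estimate_f0(samples, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 as fs / lag, where lag maximizes the autocorrelation
    over the range of plausible pitch periods."""
    lo, hi = int(fs / fmax), int(fs / fmin)
    best_lag, best_r = lo, float("-inf")
    for lag in range(lo, hi + 1):
        r = sum(samples[n] * samples[n - lag] for n in range(lag, len(samples)))
        if r > best_r:
            best_r, best_lag = r, lag
    return fs / best_lag

# A 100-ms synthetic tone at 117 Hz (the pitch shown on the slide).
fs = 8000
tone = [math.sin(2 * math.pi * 117 * n / fs) for n in range(800)]
print(estimate_f0(tone, fs))  # close to 117 Hz (quantized to integer lags)
```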

  15. 3. Acoustic Features: Spectrum • Phoneme: • Abstract representation of the basic unit of speech (“cat”: /k æ t/). • Spectrum: • What makes one phoneme, /e/, sound different from another phoneme, /i/? • Different shapes of the vocal tract: /e/ is produced with the tongue low and in the back of the mouth; • /i/ with the tongue high and toward the front.

  16. 3. Acoustic Features: Spectrum • Source of speech is pulses of air from vocal folds. • This source is filtered by vocal tract “tube”. • Speech waveform is result of filtered source signal. • Different shapes of tube create different filters, different resonant frequencies, different phonemes. /e/ /i/ (from Ladefoged, p. 58-59)

  17. 3. Acoustic Features: Spectrum Resonant frequencies are identified by frequency analysis of the speech signal. The Fourier Transform expresses a signal in terms of signal strength at different frequencies: X(k) = Σ_{n=0..N−1} x(n) · e^{−j2πkn/N}
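The frequency analysis described above can be sketched directly from the transform definition. This is a naive O(N²) discrete Fourier transform for clarity (real systems use an FFT); the signal and function names are illustrative:

```python
import cmath
import math

def power_spectrum(samples):
    """Naive DFT: signal power at each analysis frequency bin (up to Nyquist)."""
    N = len(samples)
    spectrum = []
    for k in range(N // 2):
        X = sum(samples[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        spectrum.append(abs(X) ** 2)
    return spectrum

# A 200-Hz tone at 8 kHz; with N = 400, each bin spans fs/N = 20 Hz.
fs, N = 8000, 400
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(N)]
spec = power_spectrum(tone)
peak_bin = max(range(len(spec)), key=spec.__getitem__)
print(peak_bin * fs / N)  # 200.0: the strongest bin matches the tone frequency
```

For a voiced speech frame instead of a pure tone, the same analysis shows harmonics at multiples of F0 under a spectral envelope, as in the plots on the following slides.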

  18. 3. Acoustic Features: Spectrum The time-domain waveform and power spectrum can be plotted like this (/e/): [Figure: time-domain waveform (amplitude vs. time) and power spectrum (spectral power 10 to 90 dB, frequency 0 Hz to 4000 Hz)]

  19. 3. Acoustic Features: Spectrum The time-domain waveform and power spectrum can be plotted like this (/e/): [Figure: time-domain waveform and power spectrum for /e/, with the harmonic spacing F0 = 95 Hz indicated]

  20. 3. Acoustic Features: Spectrum The resonant frequencies, or formants, are clearly different for vowels /e/ and /i/. The spectral envelope is important for phoneme identity (envelope = general spectral shape, no harmonics). [Figure: power spectra of /e/ and /i/ with spectral envelopes overlaid, 0 to 4 kHz]

  21. 3. Acoustic Features: Formants Formants (dependent on vocal-tract shape) are independent of pitch (rate of vocal-fold vibration). [Figure: spectra of /e/ at F0 = 80 Hz and at F0 = 160 Hz, 0 to 4 kHz]

  22. 3. Acoustic Features: Formants • Formants are specified by frequency and numbered in order of increasing frequency. For /e/, F1 = 710 Hz, F2 = 1100 Hz. • F1, F2, and sometimes F3 are often sufficient for identifying vowels. • For vowels, the sound source is air pushed through the vibrating vocal folds. The source waveform is filtered by the vocal-tract shape. Formants correspond to these filters. • A digital model of a formant can be implemented using an infinite-impulse-response (IIR) filter.
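A sketch of the IIR formant model mentioned above, using the standard two-pole resonator design (pole radius r = e^{−πB/fs}, pole angle 2πF/fs). The 710 Hz center frequency is F1 of /e/ from this slide; the 100 Hz bandwidth and the function names are assumptions for illustration:

```python
import math

def resonator_coeffs(freq, bandwidth, fs):
    """Two-pole IIR resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth / fs)        # pole radius from bandwidth
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    a2 = -r * r
    return a1, a2

def apply_resonator(x, a1, a2):
    """Run the difference equation over an input sequence."""
    y = [0.0, 0.0]
    for xn in x:
        y.append(xn + a1 * y[-1] + a2 * y[-2])
    return y[2:]

# The impulse response rings (decaying sinusoid) at the formant frequency.
fs = 8000
a1, a2 = resonator_coeffs(710.0, 100.0, fs)   # F1 of /e/; bandwidth assumed
impulse = [1.0] + [0.0] * 799
ring = apply_resonator(impulse, a1, a2)
```

A formant synthesizer cascades several such resonators (F1, F2, F3, ...) and drives them with a glottal-pulse source, mirroring the source-filter description in the bullet above.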

  23. 3. Acoustic Features: Formants Formant frequencies (averages for English): (from Ladefoged, p. 193)

  24. 3. Acoustic Features: Coarticulation [Figure: stylized formant tracks, frequency vs. time, for “you are”: /j u e r/]

  25. 3. Acoustic Features: Coarticulation [Figure: formant tracks, frequency vs. time, for “you are”: /j u e r/, showing transitions between adjacent phonemes]

  26. 3. Acoustic Features: Coarticulation [Figure: spectrogram, frequency vs. time]

  27. 3. Acoustic Features: Vowel Neutralization When speech is uttered quickly, or is not clearly enunciated, formants shift toward neutral vowel: (from van Bergem 1993 p. 8)

  28. Outline • Introduction • Background: Speaking Styles • Background: Acoustic Features • Background: Prior Work on Clear Speech • Objectives of Current Study • Methods • Results • Conclusion

  29. 4. Prior Work: Acoustics of Clear Speech • Pitch (F0): more variation, higher average. • Energy: Consonant-vowel (CV) energy ratio increases for stops (/p/, /t/, /k/, /b/, /d/, /g/). • Pauses: Longer in duration and more frequent. • Phoneme and sentence duration: longer. • However, correlation between a characteristic of an acoustic feature and intelligibility does not mean the characteristic causes increased intelligibility. • For example, fast speech can be just as intelligible as slow speech; longer sentence duration not a cause of increased intelligibility.

  30. 4. Prior Work: Speech Modification • Lengthen phoneme durations [e.g. Uchanski 1996]. • Insert pauses at phrase boundaries or word boundaries [e.g. Gordon-Salant 1997; Liu 2006]. • Amplify consonant energy in consonant-vowel (CV) contexts [Gordon-Salant, 1986; Hazan, 1998]. Positive results at the sentence level reported in only one case, using extreme modification (Hazan 1998, 4.2% improvement).

  31. 5. Objectives: Background • Summary of Current State: • CLR speech intelligibility higher than CNV speech. • Speech has acoustic features that interact in complex ways. • Correlation between acoustic features and intelligibility has been shown, but causation not demonstrated. • Signal modification of CNV speech shows little or no intelligibility improvement. • Reason for inability to dramatically improve CNV speech intelligibility not known.

  32. 5. Objectives of Current Study • Objectives of Current Study: 1. To validate that CLR speech is more intelligible than CNV speech for our speech material. 2. To process CNV speech so that intelligibility is significantly closer to CLR speech: we propose a hybridization algorithm that creates “hybrid” (HYB) speech using features from both CNV and CLR speech. 3. To determine the acoustic features of CLR speech that cause increased intelligibility.

  33. Outline • Introduction • Background: Speaking Styles • Background: Acoustic Features • Background: Prior Work on Clear Speech • Objectives of Current Study • Methods • Results • Conclusion

  34. 6. Methods: Hybridization Algorithm • Hybridization: • Input: parallel recordings of a sentence spoken in both CNV and CLR styles. • Signal processing replaces certain acoustic features from CNV speech with those of CLR speech. • Output: synthetic speech signal. • Uses Pitch-Synchronous Overlap Add (PSOLA) for pitch and/or duration modification. [Moulines and Charpentier, 1990].

  35. 6. Methods: Hybridization with PSOLA [Figure: PSOLA modification of glottal-pulse sequences; labels: original signal 33 ms, modified signal 66 ms, 25 ms. Original CNV speech. Duration modification: duplicate or eliminate glottal pulses; scale 2.0 lengthens duration. Pitch modification: alter the distance between glottal pulses; scale 2.0 raises pitch]
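The duration side of this figure can be sketched in code. This is a deliberately simplified stand-in for PSOLA: it repeats (or drops) whole pitch periods between glottal-pulse marks instead of overlap-adding windowed segments, which is what PSOLA actually does. All names and the 33-sample period are illustrative:

```python
def stretch_periods(samples, pulse_marks, scale=2.0):
    """Simplified duration modification in the spirit of PSOLA:
    tile whole pitch periods (segments between consecutive glottal-pulse
    marks) so total duration is scaled by `scale`. Real PSOLA overlap-adds
    windowed, pitch-synchronous segments instead of hard-copying them."""
    out = []
    acc = 0.0
    for start, end in zip(pulse_marks[:-1], pulse_marks[1:]):
        acc += scale
        while acc >= 1.0:          # emit this period as many times as owed
            out.extend(samples[start:end])
            acc -= 1.0
    return out

# Four 33-sample "periods"; scale 2.0 doubles the duration, as on the slide.
marks = [0, 33, 66, 99, 132]
sig = list(range(132))
out = stretch_periods(sig, marks, 2.0)
print(len(out))  # 264
```

Pitch modification works analogously but changes the spacing at which the windowed periods are re-placed, rather than how many times each is used.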

  36. 6. Methods: Hybridization Algorithm [Flowchart of the algorithm. Stage 1, Database Preparation: for both CNV and CLR speech, phoneme labelling, voicing, pitch marking, and placement of auxiliary marks; phoneme alignment and parallelization between CLR and CNV speech (features P, N). Stages 2 and 3, Feature Analysis and Selection: a hybrid configuration selects F0 (F), long-term energy (E), phoneme duration (D), and spectrum (S) from either CNV or CLR. Stage 4, Waveform Synthesis: Pitch-Synchronous Overlap-Add (PSOLA). Output: HYB speech stimuli, e.g. CLR-D]

  37. 6. Methods: Hybridization Algorithm [Flowchart detail: Stage 1, Database Preparation] • For each sentence (CLR and CNV recordings): • Manually label phoneme identity and locations. • Match phonemes in CLR and CNV recordings. • Identify the location of each glottal pulse.

  38. 6. Methods: Hybridization Algorithm [Flowchart detail: Stages 2 and 3, Feature Analysis and Selection] • Extract acoustic features: Spectrum (S), F0 (F), Energy (E), Duration (D). • For each feature, select from CNV or CLR for generating the speech waveform.

  39. 6. Methods: Hybridization Algorithm [Flowchart detail: Stage 4, Waveform Synthesis (PSOLA); stimulus: CLR-D] • Use PSOLA to generate the waveform using the selected features, with the spectrum at each glottal pulse. • Output is HYB speech, named according to the features taken from CLR speech, e.g. CLR-D.

  40. 6. Methods: Speech Corpus • Public database of sentences, syntactically and semantically valid. • Ex: His shirt was clean but one button was gone. • 5 keywords (underlined) for measuring intelligibility. • Long enough to test effects of prosodic features (combination of duration, energy, pitch). • Short enough to minimize memory effects. • One male speaker read the text material in both CNV and CLR speaking styles.

  41. 6. Methods: Perceptual Test • For each listener: • Audiometric test (to ensure normal hearing), • Find optimal noise level for this listener, • Measure intelligibility of CLR, CNV, and HYB speech. • For finding optimal noise levels and measuring intelligibility, the listener’s task is to repeat the sentence aloud.

  42. 6. Methods: Finding Optimal Noise Level • Total energy of each sentence normalized (65 dBA). • To avoid a “ceiling effect,” sentences are played with background noise (12-speaker babble noise). • To normalize performance differences between listeners, the noise is set to a specific level for each listener. • Noise level set so that each listener correctly identifies CNV sentences 50% of the time. [Figure: noise level decreasing across trials]
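Setting a per-listener noise level amounts to mixing speech and noise at a controlled signal-to-noise ratio. A minimal sketch, assuming white Gaussian noise as a stand-in for the 12-speaker babble (function names are illustrative; the −0.24 dB value is only an example level):

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise energy ratio equals snr_db, then add."""
    e_s = sum(x * x for x in speech)
    e_n = sum(x * x for x in noise)
    gain = math.sqrt(e_s / (e_n * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

# One second of a synthetic "sentence" at 8 kHz, mixed near 0 dB SNR.
random.seed(0)
speech = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(8000)]
noise = [random.gauss(0.0, 1.0) for _ in range(8000)]
noisy = mix_at_snr(speech, noise, -0.24)  # example SNR; babble assumed elsewhere
```

An adaptive procedure would then raise or lower `snr_db` per trial until the listener scores 50% on CNV sentences.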

  43. 6. Methods: Measuring Intelligibility Intelligibility (%) = (# of sentences correctly identified / # of sentences presented) × 100 • 48 sentences per subject • Correct response for a sentence when at least 4 of 5 keywords are correctly repeated by the listener.
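The scoring rule above is simple enough to state as code (function name and the example keyword counts are illustrative, not the study's data):

```python
def intelligibility(keyword_counts, min_correct=4):
    """Percent of sentences scored correct: a sentence counts as correct
    when at least min_correct of its 5 keywords were repeated."""
    correct = sum(1 for k in keyword_counts if k >= min_correct)
    return 100.0 * correct / len(keyword_counts)

# 48 sentences per subject; made-up per-sentence keyword counts for one listener.
counts = [5, 4, 3, 5] * 12
print(intelligibility(counts))  # 75.0
```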

  44. 6. Methods: Listeners • Subjects: • 12 listeners with normal hearing • age 19 – 40 (mean 29.17) • Average noise level -0.24 dB SNR • Significance Testing • Paired t-test with p < 0.05
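The paired t-test mentioned above compares each listener's score under two conditions. A minimal sketch of the test statistic (the listener scores are invented illustration data; 2.201 is the standard two-sided critical value for df = 11 at p = 0.05, matching the 12 listeners here):

```python
import statistics

def paired_t_statistic(a, b):
    """t statistic for a paired t-test: mean of the per-subject differences
    divided by the standard error of that mean."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / n ** 0.5)

# 12 listeners, as in the study; these scores are made up for illustration.
cnv = [50, 48, 52, 55, 49, 51, 47, 53, 50, 52, 46, 54]
clr = [75, 72, 78, 80, 70, 77, 74, 79, 76, 73, 71, 78]
t = paired_t_statistic(clr, cnv)
print(abs(t) > 2.201)  # True: the difference is significant at p < 0.05
```

Pairing matters here because each listener hears sentences at their own noise level, so between-listener variation is removed before testing.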

  45. 6. Methods: Features • Energy and pitch always taken from CNV speech. • Test the importance of the other two acoustic features: • spectrum (for phoneme identity) • duration (for syntactic parsing) • Test the co-dependence of spectrum and duration. [Diagram: speech waveform decomposed into spectrum (formants, residual) and prosody (duration, energy, pitch)]

  46. 6. Methods: Stimuli • Conditions: • CNV Original • HYB Speech, CLR-Dur • HYB Speech, CLR-Spec • HYB Speech, CLR-DurSpec • CLR Original [Diagram: speech waveform decomposed into spectrum (formants, residual) and prosody (duration, energy, pitch)]

  47. Outline • Introduction • Background: Speaking Styles • Background: Acoustic Features • Background: Prior Work on Clear Speech • Objectives of Current Study • Methods • Results • Conclusion

  48. 7. Results [Bar chart: mean intelligibility (%): CNV 64, CLR-Dur 74, CLR-Spec 75, CLR-DurSpec 82, CLR 89; * = significant difference compared with CNV (CLR-Spec, CLR-DurSpec, and CLR)] • 10% difference between CNV and CLR-Dur • 11% difference between CNV and CLR-Spec • 18% difference between CNV and CLR-DurSpec • 25% difference between CNV and CLR

  49. 8. Conclusion Results of Objectives: 1. To validate that CLR speech is more intelligible than CNV speech. Confirmed: 25% absolute difference (significant). 2. To process CNV speech so that intelligibility is significantly closer to CLR speech. Confirmed: 18% absolute improvement (significant). 3. To determine the acoustic features of CLR speech that cause increased intelligibility. Spectrum, and the combination of Spectrum and Duration, are effective. Duration alone is almost significant.

  50. 8. Conclusion Conclusions: 1. The single acoustic feature that yields greatest intelligibility improvement is the spectrum, but it contributes less than half of possible improvement. 2. Duration alone yields improvements almost as good as spectrum alone. (Prior work indicates, however, that total sentence duration and pause patterns are not important for intelligibility.) 3. The combination of duration and spectrum does not quite yield the intelligibility of CLR speech; further work to determine if difference due to (a) pitch, (b) energy, (c) signal-processing artifacts.
