
  1. Data-Driven Formant Target Estimation John-Paul Hosom Alexander Kain Akiko Amano-Kusumoto Brian O. Bush Center for Spoken Language Understanding (CSLU), Department of Biomedical Engineering (BME), Oregon Health & Science University (OHSU)

  2. Outline • Introduction (or, “What’s the big picture?”) • Background: Speaking Styles • Background: Features of Speech • Background: Characteristics of Clear Speech • Background: Formant Targets and Locus Theory • Objectives of Current Study • Corpus • Model • Methods • Results, Conclusions, & Future Work

  3. image from http://www.ph.tn.tudelft.nl/~vanwijk/Athens2004/niceones/images.html

  4. 1. Introduction Motivation #1: • Difficult to understand speech in noise, especially with a hearing impairment. • When people speak clearly, speech becomes more intelligible. • Automatic enhancement of speech could be used in next-generation hearing aids. • Attempts to modify speech by computer to improve intelligibility have not yet been very successful. • Need to understand which parts of the signal should be modified and how to modify them.

  5. 1. Introduction Motivation #2: • Even best current model for computer speech recognition does not provide sufficiently accurate results. • Speech synthesis requires huge amount of data to sound natural. • A better understanding of how acoustic features contribute to speech intelligibility could guide research on improving computer speech recognition and speech synthesis.

  6. 1. Introduction Research Objective: To identify and quantitatively model the relative contribution of acoustic features to intelligibility by examining conversational and clear speech. Long-Term Goals: • Accurately predict speech intelligibility from acoustic features (e.g. for diagnosis of dysarthria), • Integrate most effective features into computer speech recognition and synthesis models, • Develop novel signal-processing algorithms for hearing aids.

  7. 2. Speaking Styles: Production “Conversational speech” and “clear speech” are easily produced with simple instructions to speakers. • Conversational (CNV) speech: “read text conversationally as in daily communication.” • Clear (CLR) speech: “read text clearly as if talking to a hearing-impaired listener.” (Note: The naturalness of CLR and CNV speech obtained by reading is a matter of current debate.)

  8. 2. Speaking Styles: Perception To compare CNV and CLR speech intelligibility, the same sentences are read in both styles and then played to a group of listeners. Intelligibility is measured as the percentage of sentences correctly recognized by the listener. CLR speech increases intelligibility for a variety of: • Listeners (young listeners, elderly listeners) • Speech materials (meaningful sentences, nonsense syllables) • Noise conditions (white noise, multi-talker babble noise)

  9. 3. Features of Speech • Acoustic Features • Duration (length of distinct sounds) • Energy • Pitch • Spectrum (spectrogram) • Formants • Residual (spectrum without formants) • Coarticulation (how formants change over time) • Phonetic Features

  10. 3. Features of Speech: Formants The resonant frequencies, or formants, are clearly different for the vowels /ɑ/ and /i/. The spectral envelope is important for phoneme identity (envelope = general spectral shape, without harmonics). [Figure: spectra of /ɑ/ and /i/ with their spectral envelopes; frequency axis 0–4 kHz]

  11. 3. Features of Speech: Formants • Formants are specified by frequency, and numbered in order of increasing frequency. For /ɑ/, average F1 = 710 Hz, F2 = 1100 Hz. • F1, F2, and sometimes F3 are often sufficient for identifying vowels.
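To make the formant discussion concrete, the following is a minimal sketch of estimating formant candidates from a steady vowel segment using linear predictive coding (LPC). This is a standard textbook approach, not necessarily the formant tracker used for the corpus in this work; the function name, LPC order, and thresholds are illustrative assumptions.

```python
import numpy as np
import librosa

def formant_candidates(signal, sr, order=12):
    """Return formant-candidate frequencies (Hz) from an LPC fit to a vowel segment."""
    # Pre-emphasis flattens the spectral tilt so the LPC poles follow the formants.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    a = librosa.lpc(emphasized, order=order)      # LPC polynomial coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]             # keep one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)    # pole angle -> frequency in Hz
    bandwidths = -(sr / np.pi) * np.log(np.abs(roots))
    # Keep sharp, non-DC poles; the lowest surviving frequencies approximate F1, F2, ...
    keep = (freqs > 90) & (bandwidths < 400)
    return np.sort(freqs[keep])

# For a steady /ɑ/ segment, the two lowest candidates should fall near
# F1 ≈ 710 Hz and F2 ≈ 1100 Hz, as on the slide above.
```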

  12. 3. Features of Speech: Formants Formant frequencies (averages for American English, adult male speakers): [from Ladefoged, p. 193]

  13. 3. Features of Speech: Coarticulation “you are”: /j u ɑ r/ [Figure: schematic spectrogram (frequency vs. time) showing formant movement through /j/, /u/, /ɑ/, /r/]

  14. 3. Features of Speech: Coarticulation “you are”: /j u ɑ r/ [Figure: the same schematic spectrogram, with the formant trajectories between phonemes highlighted]

  15. 3. Features of Speech: Vowel Neutralization When speech is uttered quickly, or is not clearly enunciated, formants shift toward a neutral vowel: [from van Bergem 1993 p. 8]

  16. Outline • Introduction (or, “What’s the big picture?”) • Background: Speaking Styles • Background: Features of Speech • Background: Characteristics of Clear Speech • Background: Formant Targets and Locus Theory • Objectives of Current Study • Corpus • Model • Methods • Results & Conclusion

  17. 4. Characteristics of Clear Speech Many acoustic and phonetic differences between “clear” and “conversational” speech: Prosodic features [e.g. Picheny 1986, Krause 2004] • Fundamental frequency (F0): More variation, higher average. • Consonant energy: Increases for stops (e.g. /p/, /k/) and fricatives (e.g. /s/, /v/). • Phoneme duration: Longer, especially for tense vowels (e.g. /i:/, /ɑ/). • Pauses: Longer in duration and more frequent.

  18. 4. Characteristics of Clear Speech Many acoustic and phonetic differences: Phonological phenomena [Picheny 1986, Krause 2004] • Burst elimination (e.g. cat), degemination (e.g. fast train): Occur less often • Alveolar flap (e.g. beauty): Occurs less often • Sound insertion: Neutral vowel /ə/ inserted at the end of words

  19. 4. Characteristics of Clear Speech Many acoustic and phonetic differences: Spectral features • Vowel space: Expanded, especially for lax vowels (e.g. /ɪ/, /ɛ/) (i.e. less vowel neutralization) [Picheny 1986, Ferguson 2002, 2007, Moon 1994, Bradlow 2003]. • Formant undershoot: Less formant displacement in the context of /w/-/l/ (i.e. less vowel neutralization) [Moon 1994]. • Long-term average spectra: Increased energy in the 1000–3150 Hz range [Krause 2004].

  20. 4. Characteristics of Clear Speech [Figure: vowel-space plots of observed vowel midpoints for clear speech vs. conversational speech]

  21. 4. Characteristics of Clear Speech • Using a “hybridization” algorithm that combines features of CLR and CNV speech, together with perceptual testing, we have shown over several experiments that the most relevant features for intelligibility are the combination of spectrum and duration. [Kain, Amano-Kusumoto, and Hosom (2008); Kusumoto, Kain, Hosom, and van Santen (2007); Amano-Kusumoto and Hosom (2009)] • This has led us to study a model of coarticulation, to quantitatively describe how formants change over time.

  22. Outline • Introduction (or, “What’s the big picture?”) • Background: Speaking Styles • Background: Features of Speech • Background: Characteristics of Clear Speech • Background: Formant Targets and Locus Theory • Objectives of Current Study • Corpus • Model • Methods • Results & Conclusion

  23. 5. Formant Targets and Locus Theory Locus Theory [Delattre, Liberman, and Cooper, 1955]: “There are, for each consonant, characteristic frequency positions … at which the formant transitions begin, or to which they may be assumed to point. On this basis, the transitions may be regarded … as movements of the formants from their respective loci to the frequency levels appropriate for the next phone. … The spectrographic patterns …, which produce /d/ before /iy/, /aa/, and /ow/, show how … these transitions seem to be pointing to a [F2] locus in the vicinity of 1800 [Hz].”

  24. 5. Formant Targets and Locus Theory [From Klatt 1987, p. 753]

  25. 5. Formant Targets and Locus Theory • Consonants and vowels both have “targets” of articulator positions, and therefore formant frequency locations. • Given sufficient duration of a syllable, all phonemes reach their targets. • The slope of formants during a transition from a consonant to a vowel is relatively constant until reaching the target. • If the syllable duration doesn’t allow enough time for the formants to reach their targets, “target undershoot” occurs, and the formants change direction before fully realizing the intended vowel.

  26. 5. Formant Targets and Locus Theory “you are”: /j u ɑ r/ (clear speech) [Figure: schematic spectrogram (frequency vs. time) showing the formants reaching their targets in clear speech]

  27. 5. Formant Targets and Locus Theory • Most consonants (all except /j/, /l/, /r/, /w/) do not have visible formants. • They have “virtual” formants identified by coarticulation in the vowel. [Figure: schematic /d u/ formant transitions (frequency vs. time)]

  28. 5. Formant Targets and Locus Theory [from Delattre et al., 1955 as reported in Johnson, 1997, p. 135]

  29. 5. Formant Targets and Locus Theory • Original CLR speech: intelligible. Original CNV speech: less intelligible. • Modified (synthetic) speech with only duration or formants modified: intelligibility between CNV and CLR. [Figure: schematic formant trajectories for /w i: l/ comparing formant targets reached in CLR speech with formant undershoot in CNV speech]

  30. 5. Formant Targets and Locus Theory Summary 1: • Vowels and consonants have formant targets;most consonants have “virtual” formants. • Coarticulation yields smooth change between targets when formants are visible. • If duration is too short, formants do not reachtheir targets, yielding undershoot. • Both the targets and the rate of change are important for intelligibility.

  31. 5. Formant Targets and Locus Theory Summary 2: • Vowels may not reach their targets. Target estimation is difficult for vowels unless a vowel of sufficient duration in a specific CVC context is analyzed. • Because many consonants are produced without vibration of the vocal folds, they do not have formants. Target estimation is difficult for consonants because there is no data in the consonant region with which to estimate target values. • Without knowing targets, it is difficult to quantitatively model the dynamics of speech.

  32. Outline • Introduction (or, “What’s the big picture?”) • Background: Speaking Styles • Background: Features of Speech • Background: Characteristics of Clear Speech • Background: Formant Targets and Locus Theory • Objectives of Current Study • Corpus • Model • Methods • Results & Conclusion

  33. 6. Objectives of Current Study Objectives of Current Study: • Develop a quantitative model of formant dynamics with a good fit to observed data (both “clear” and “conversational” speech). • Reliably estimate the parameters of this model. • Use estimated target values to improve automatic speech recognition. (Estimated values could also be used to improve text-to-speech synthesis, diagnosis of dysarthria, or other applications.)

  34. 7. Corpus Corpus: • A male and a female speaker. • Sentences contain a neutral carrier phrase (5 total) followed by a target word (242 total). • Target words are common English CVC words with 23 initial and final consonants and 8 vowels. • All sentences spoken in both clear and conversational styles. • Two recordings per style of each sentence. • Formants and phoneme boundaries automatically estimated, then manually corrected and verified.

  35. 8. Model Formant trajectory model: • F̂(t) is the estimated formant trajectory over time t. • TC1, TV, TC2 are target formant values for C1, V, C2. • aC1, aC2 are the degree of articulation of C1 or C2. • σ(t) is a sigmoid function over time t. • s is the maximum slope of σ(t); p is the position (in time) of s.
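Because the slide's equation does not survive in this transcript, the sketch below shows one plausible reading of the model from the bullets above: two sigmoid transitions, each with maximum slope s and position p, blend the consonant targets TC1 and TC2 with the vowel target TV, scaled by degrees of articulation a1 and a2. The exact functional form is an assumption, not the formula from the presentation.

```python
import numpy as np

def sigmoid(t, s, p):
    """Sigmoid over time t with maximum slope s occurring at position p."""
    # d/dt [1/(1+exp(-k(t-p)))] at t=p equals k/4, so k = 4s gives maximum slope s.
    return 1.0 / (1.0 + np.exp(-4.0 * s * (t - p)))

def formant_trajectory(t, T_C1, T_V, T_C2, a1, a2, s1, p1, s2, p2):
    """Estimated formant trajectory F_hat(t) for a C1-V-C2 token (assumed form)."""
    onset  = a1 * (T_C1 - T_V) * (1.0 - sigmoid(t, s1, p1))  # pull toward C1 early
    offset = a2 * (T_C2 - T_V) * sigmoid(t, s2, p2)          # pull toward C2 late
    return T_V + onset + offset

# Example: with a short token or shallow slopes, the trajectory never levels
# off at T_V -- the "target undershoot" described earlier.
t = np.linspace(0.0, 0.12, 120)   # a 120-ms token
f2 = formant_trajectory(t, T_C1=1800, T_V=1100, T_C2=1800,
                        a1=1.0, a2=1.0, s1=60, p1=0.03, s2=60, p2=0.09)
```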

  36. 8. Model Formant trajectory model:

  37. 8. Model Formant trajectory model:

  38. Outline • Introduction (or, “What’s the big picture?”) • Background: Speaking Styles • Background: Features of Speech • Background: Characteristics of Clear Speech • Background: Formant Targets and Locus Theory • Objectives of Current Study • Corpus • Model • Methods • Results & Conclusion

  39. 9. Methods Estimating Model Parameters: • Two sets of parameters to estimate: (a) s1, s2, p1, p2 estimated on a per-token basis; (b) TC1, TV, TC2 estimated on a per-token basis (independent of speaking style) and then averaged. • For one token, the error is computed between the estimated model trajectory and the observed formant trajectory. • Genetic algorithm used to find best estimates. Fitness function is the error summed over all tokens.
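The transcript does not preserve the per-token error equation; the sketch below assumes a root-mean-square error between the modeled and observed formant track, reusing the formant_trajectory sketch above.

```python
import numpy as np

def token_error(params, t, observed_track):
    """Assumed per-token error: RMS difference between model and observed formants."""
    predicted = formant_trajectory(t, *params)   # model sketch defined earlier
    return np.sqrt(np.mean((predicted - observed_track) ** 2))
```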

  40. 9. Methods Estimating Model Parameters: • Genetic algorithm employs mutation, crossover (exchange of group of parameters), and elitism (best solutions retained in next generation). • Data partitioned into 20 folds for n-fold validation. • Performed 60 randomly-initialized starts in parameter estimation to get 60 points per phoneme.
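A minimal genetic-algorithm skeleton with the three operators named on the slide (mutation, crossover, elitism). The encoding, population size, rates, and scalar parameter bounds are illustrative assumptions rather than the settings used in the study; the fitness function passed in would be the error summed over all tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def genetic_search(fitness, n_params, bounds, pop_size=100, n_gen=200,
                   n_elite=5, mut_rate=0.1, mut_scale=0.05):
    """Minimize fitness(individual) over real-valued parameter vectors."""
    lo, hi = bounds
    pop = rng.uniform(lo, hi, size=(pop_size, n_params))
    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])    # lower is better
        order = np.argsort(scores)
        elite = pop[order[:n_elite]]                         # elitism: keep best solutions
        children = []
        while len(children) < pop_size - n_elite:
            pa, pb = pop[rng.choice(order[:pop_size // 2], size=2)]
            mask = rng.random(n_params) < 0.5                # uniform crossover
            child = np.where(mask, pa, pb)
            mutate = rng.random(n_params) < mut_rate         # mutation on a few genes
            child = child + mutate * rng.normal(0, mut_scale * (hi - lo), n_params)
            children.append(np.clip(child, lo, hi))
        pop = np.vstack([elite, np.array(children)])
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmin(scores)]
```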

  41. 10. Results Target Estimation Results: Estimated targets for vowels (60 points per phoneme)

  42. 10. Results Target Estimation Results: Estimated targets for C1 (60 points per phoneme)

  43. 10. Results Target Estimation Results: For one fold, mean and standard deviation for each phoneme

  44. 10. Results Target Estimation Results: • Targets cluster well, even with random initialization • Targets often fit our expectations based on knowledge of acoustic-phonetics:▸ Bilabials consistently clustered around 1200 Hz▸ F2 for alveolars /t/ and /d/ consistently around 1700 Hz.▸ Voiced and unvoiced phonemes located in same regions of F2, close in F1 ▸ Approximants and vowels tightly clustered at expected locations

  45. 10. Results Target Estimation Results: • However, /s/, /z/, /n/ lower in F2 than expected at 1500 Hz. • Variance in targets of individual tokens, but little overlap between vowels. • In general, clustering of targets and locations of targets indicate the usefulness of this method.

  46. 10. Results Coarticulation Parameter Results: • Error surface between estimated model and observed data as a function of s and p: • A low error can be obtained for many values of s.
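A sketch of how such an error surface could be computed, reusing the formant_trajectory and token_error sketches above, with the targets held fixed and the synthetic trajectory from the earlier example standing in for observed data; the grid ranges are illustrative assumptions.

```python
import numpy as np

s_grid = np.linspace(10, 200, 40)
p_grid = np.linspace(0.0, 0.06, 40)
surface = np.empty((len(s_grid), len(p_grid)))
for i, s1 in enumerate(s_grid):
    for j, p1 in enumerate(p_grid):
        params = (1800, 1100, 1800, 1.0, 1.0, s1, p1, 60, 0.09)
        surface[i, j] = token_error(params, t, f2)   # t, f2 from the model sketch
# A broad, flat valley along the s axis illustrates the slide's point:
# many different s values yield nearly the same error.
```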

  47. 10. Results Coarticulation Parameter Results: • As a result, s shows differences in mean for different consonants, but high variance. • Therefore, s values do not cluster well, and cannot be reliably extracted for a single CVC token. [Figure: mean and standard deviation of second-formant s1 values for 10 phonemes]

  48. 11. Conclusions Conclusions: • Estimation of consonant and vowel targets can be performed reliably when estimating over a large number of CVCs. • Estimation of the coarticulation parameter s cannot (yet) be performed reliably for a single CVC. • Therefore, formant targets cannot (yet) be reliably estimated for a single token, which is necessary to apply this work to automatic speech recognition.

  49. 11. Conclusions Future Work: • Determine and apply constraints on s, so that coarticulation parameters and formant targets can be reliably estimated for a single token. • Given estimated targets for a CVC, estimate the probability of these targets given each phoneme:p(target | phoneme) • Use these probabilities instead of the probabilities currently used in speech recognition,p(observed data at 10-msec frame | phoneme) • Expand to recognize arbitrary length phoneme sequence and use non-formant features.
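A sketch of how the proposed p(target | phoneme) could be scored: fit a Gaussian over (F1, F2) target estimates per phoneme from training tokens, then evaluate the density of a new token's estimated targets. The Gaussian form is an assumption; the slide only names the probability itself.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_target_models(targets_by_phoneme):
    """targets_by_phoneme: dict mapping phoneme -> array of shape (n_tokens, 2) of (F1, F2)."""
    models = {}
    for phoneme, targets in targets_by_phoneme.items():
        mean = targets.mean(axis=0)
        cov = np.cov(targets, rowvar=False)
        models[phoneme] = multivariate_normal(mean=mean, cov=cov)
    return models

def score_target(models, target):
    """Return p(target | phoneme) for each phoneme; the highest density wins."""
    return {ph: m.pdf(target) for ph, m in models.items()}
```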

  50. Acknowledgements This material is based upon work supported by the National Science Foundation under Grant IIS-091575. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
