1 / 11

Creating Acoustic/Articulatory/Prosodic Feature Templates for Words

This project aims to create phonetic word templates that capture critical acoustic, articulatory, and prosodic features for accurate word recognition. By identifying key features and their representation in syllable structures, we seek to improve current recognition systems.

Download Presentation

Creating Acoustic/Articulatory/Prosodic Feature Templates for Words

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Creating Acoustic/Articulatory/Prosodic Feature Templates for Words Vidya Mohan Center for Speech and Language Engineering The Johns Hopkins University

  2. Introduction Motivation • A significant portion (40-50%) of the WER on Switchboard, even in the present day recognition systems, can be attributed to substitution errors (Greenberg,Chang ICSA 2000) • “The relatively low tolerance for phone misclassification implies that the pronunciation and language models possess only a limited capability to compensate for errors at the phonetic segment level” • “There are approximately three times the number of articulatory-feature (AF) errors (per phone) when the word is misrecognized, regardless of the position of the segment within the word.” • Building better acoustic models can possibly help to decrease these errors as they can model fine-grained changes in the speech signal better than phones.

  3. Introduction (cont’d) • In order to ascertain this, what are the dynamics of a word w.r.t. to its acoustic/articulatory/prosodic features? • What is a sufficient feature-set representations that can facilitate for its recognition? Problem Statement • Create phonetic word templates for words This would consist of key phonetic features that define some characteristics of the word • Identify which phonetic features are critical in the recognition of the word • Use this template to identify a word’s occurrence in an utterance.

  4. Features • We are interested in features that preserve as much information of the speech waveform as possible. • Primary features of interest are: • Acoustic • Articulatory • Prosodic • Articulatory Features • Manner - Fricative, spirant, stop, nasal, flap, lateral, rhotic, glide, vowel, diphthongs • Place - anterior, central, posterior, back, front, tense • Voicing • Lip rounding • Prosodic Features • Prosody and stress accent - some syllables are more stressed than others

  5. Features (cont’d) • Acoustic Features • Knowledge Based (formant) Acoustic Parameters • Neural Firing Rate Features (rate scale) • Energy level and modulation • Duration - of words, syllables and constituents • Other Features • Number of syllables • Sensitivity to context Summary • Which sub space of the multi-dimensional feature space acoustic/articulatory/prosodic space does a particular word occupy? • Which of the above features are most representative of that word?

  6. Representation • To represent a word in terms of its acoustic/articulatory/prosodic features, a syllable structure along with its constituent components, is an efficient representation as it captures the dynamics of the speech signal well. • “…since a triphone unit spans an extremely short time-interval, such a unit is not is not suitable for integration of spectral and temporal dependencies.” (Ganapathiraju, Goel et.al. IEEE ASRU ‘97)

  7. Representation (cont’d) Summary • Thus a typical representation of a word, say, center, would be,

  8. Methodology • Choosing the word • Common confusable words from Switchboard • Broadly Three Step Processs • Feature Detection • Word Template Creation • Evaluation • Feature Detection • Use the phonetic classifiers to determine the various features present in a particular signal • Segment the signal using a syllable classifier • Stress Accent Detector for Vowels (accuracy of 76% on manually transcribed corpus of Switchboard) • Word Template Creation • The function should capture the information content of a particular feature of a word • The greater the information content, the more it is characteristic of that word • Mutual Information of a word and a feature captures the notion of information content quantitatively • I(W, F) = H(W) - H(W/F = ∑ ∑ p(w, f) log[p(w/f)/p(w)]

  9. Methodology (cont’d) • Evaluation - Word Identification • Use this template to find the temporal bounds of a particular word in an utterance • Parameter of Evaluation: Number of utterances where the word was accurately detected • Utterances are chosen from the TIMIT and Switchboard corpus • Ultimately hope to develop detectors that could be used to classify word confusion pairs that exist in a lattice. Summary (Side Note - am planning to put a picture of a spectogram, show syllable segmentation and what the objective would be)

  10. Summary Collaborators Sanjeev Khudanpur, JHU Mark Hasegawa Johnson, University of Illinois Steve Greenberg, University of California, Berkeley Summary • The objective of this proposal is to obtain a minimal set representation of a word wrt to its acoustic/articulatory/prosodic features using entropy to choose the features. • Phonetic classifiers developed/tuned this summer will be used for this purpose. • Initial word identification experiments will be used to test the proof of concept. • Given time and efficacy of the method, integrate it with the current LVSR to discriminate between confusable pairs of words

More Related