Creating Acoustic/Articulatory/Prosodic Feature Templates for Words

Vidya Mohan
Center for Speech and Language Engineering
The Johns Hopkins University

Introduction
Motivation
• A significant portion (40-50%) of the word error rate (WER) on Switchboard, even in present-day recognition systems, can be attributed to substitution errors (Greenberg, Chang ICSA 2000).
• “The relatively low tolerance for phone misclassification implies that the pronunciation and language models possess only a limited capability to compensate for errors at the phonetic segment level”
• “There are approximately three times the number of articulatory-feature (AF) errors (per phone) when the word is misrecognized, regardless of the position of the segment within the word.”
• Building better acoustic models may help to reduce these errors, since they can model fine-grained changes in the speech signal better than phones can.

Introduction (cont’d)
• To ascertain this: what are the dynamics of a word with respect to its acoustic/articulatory/prosodic features?
• What is a sufficient feature-set representation that can facilitate its recognition?
Problem Statement
• Create phonetic word templates for words. Each template would consist of key phonetic features that define some characteristics of the word.
• Identify which phonetic features are critical to the recognition of the word.
• Use this template to identify a word’s occurrence in an utterance.

Features
• We are interested in features that preserve as much information of the speech waveform as possible.
• Primary features of interest are:
  • Acoustic
  • Articulatory
  • Prosodic
• Articulatory Features
  • Manner - fricative, spirant, stop, nasal, flap, lateral, rhotic, glide, vowel, diphthong
  • Place - anterior, central, posterior, back, front, tense
  • Voicing
  • Lip rounding
• Prosodic Features
  • Prosody and stress accent - some syllables are more stressed than others

Features (cont’d)
• Acoustic Features
  • Knowledge-based (formant) acoustic parameters
  • Neural firing rate features (rate scale)
  • Energy level and modulation
  • Duration - of words, syllables and constituents
• Other Features
  • Number of syllables
  • Sensitivity to context
Summary
• Which subspace of the multi-dimensional acoustic/articulatory/prosodic feature space does a particular word occupy?
• Which of the above features are most representative of that word?

Representation
• To represent a word in terms of its acoustic/articulatory/prosodic features, a syllable structure, along with its constituent components, is an efficient representation because it captures the dynamics of the speech signal well.
• “…since a triphone unit spans an extremely short time-interval, such a unit is not suitable for integration of spectral and temporal dependencies.” (Ganapathiraju, Goel et al., IEEE ASRU ’97)

Representation (cont’d)
Summary
• Thus a typical representation of a word, say, “center”, would be:

Methodology
• Choosing the word
  • Common confusable words from Switchboard
• Broadly a three-step process
  • Feature Detection
  • Word Template Creation
  • Evaluation
• Feature Detection
  • Use the phonetic classifiers to determine the various features present in a particular signal
  • Segment the signal using a syllable classifier
  • Stress accent detector for vowels (accuracy of 76% on a manually transcribed corpus of Switchboard)
• Word Template Creation
  • The function should capture the information content of a particular feature of a word
  • The greater the information content of a feature, the more characteristic it is of that word
  • The mutual information between a word and a feature captures the notion of information content quantitatively
  • $I(W; F) = H(W) - H(W \mid F) = \sum_{w} \sum_{f} p(w, f) \log \frac{p(w \mid f)}{p(w)}$
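As a rough illustration of the mutual information criterion above, the following sketch estimates I(W; F) from paired observations of a word label and a discrete feature value, using relative frequencies as probability estimates. The function name `mutual_information`, the toy word pair, and the feature values are hypothetical and not part of the proposal; a real system would take its feature values from the phonetic classifiers.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Estimate I(W; F) in bits from (word, feature_value) observations.

    Probabilities are plain relative frequencies (an assumption of this
    sketch; no smoothing is applied).
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)                    # counts of (w, f)
    word_counts = Counter(w for w, _ in pairs)
    feat_counts = Counter(f for _, f in pairs)

    mi = 0.0
    for (w, f), c in joint.items():
        p_wf = c / n
        # p(w|f) / p(w) = p(w, f) / (p(w) p(f)) = c * n / (count(w) * count(f))
        ratio = c * n / (word_counts[w] * feat_counts[f])
        mi += p_wf * math.log2(ratio)
    return mi


# Toy data (made up): a feature that separates the two words perfectly,
# and one that is independent of the word identity.
informative = [("center", "nasal")] * 2 + [("sender", "stop")] * 2
uninformative = [("center", "voiced"), ("center", "unvoiced"),
                 ("sender", "voiced"), ("sender", "unvoiced")]

print(mutual_information(informative))
print(mutual_information(uninformative))
```

Under this reading, features could be ranked by their estimated mutual information with the word, keeping the highest-scoring ones in the template; how many to keep is left open here.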

Methodology (cont’d)
• Evaluation - Word Identification
  • Use this template to find the temporal bounds of a particular word in an utterance
  • Parameter of evaluation: number of utterances in which the word was accurately detected
  • Utterances are chosen from the TIMIT and Switchboard corpora
  • Ultimately, we hope to develop detectors that could be used to classify word confusion pairs that exist in a lattice.
Summary
(Side note: I am planning to put in a picture of a spectrogram, showing syllable segmentation and what the objective would be.)
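The evaluation parameter described above (the number of utterances in which the word was accurately detected) can be sketched as follows. The slides do not define "accurately"; this sketch assumes a detection counts when the predicted temporal bounds overlap the reference bounds with an intersection-over-union of at least `min_iou`. The function names, the threshold, and the utterance IDs and times are all hypothetical.

```python
def interval_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    (a_start, a_end), (b_start, b_end) = a, b
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


def count_accurate_detections(references, hypotheses, min_iou=0.5):
    """Count utterances whose hypothesized word bounds match the reference.

    references: dict mapping utterance ID -> (start, end) of the target word
    hypotheses: dict mapping utterance ID -> (start, end) from the detector;
                utterances with no detection are simply absent
    """
    hits = 0
    for utt_id, ref in references.items():
        hyp = hypotheses.get(utt_id)
        if hyp is not None and interval_iou(ref, hyp) >= min_iou:
            hits += 1
    return hits


# Made-up example: one good detection, one poorly aligned, one missed.
references = {"utt1": (0.50, 0.90), "utt2": (1.00, 1.40), "utt3": (2.00, 2.50)}
hypotheses = {"utt1": (0.52, 0.88), "utt2": (1.30, 1.80)}

print(count_accurate_detections(references, hypotheses))
```

A boundary-tolerance criterion (start and end each within some number of milliseconds of the reference) would be an equally plausible definition; the overlap ratio is used here only because it needs a single threshold.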

Summary
Collaborators
• Sanjeev Khudanpur, JHU
• Mark Hasegawa-Johnson, University of Illinois
• Steve Greenberg, University of California, Berkeley
Summary
• The objective of this proposal is to obtain a minimal-set representation of a word with respect to its acoustic/articulatory/prosodic features, using entropy to choose the features.
• Phonetic classifiers developed/tuned this summer will be used for this purpose.
• Initial word identification experiments will serve as a proof of concept.
• Given time and efficacy of the method, integrate it with the current LVSR to discriminate between confusable pairs of words.
