Models of speech dynamics for ASR, using intermediate linear representations

Models of speech dynamics for ASR, using intermediate linear representations Philip Jackson, Boon-Hooi Lo and Martin Russell Electronic Electrical and Computer Engineering http://web.bham.ac.uk/p.jackson/balthasar/

INTRODUCTION Abstract

INTRODUCTION Speech dynamics into ASR • dynamics of speech production to constrain recognizer • noisy environments • conversational speech • speaker adaptation • efficient, complete and trainable models • for recognition • for analysis • for synthesis

INTRODUCTION Articulatory trajectories from West (2000)

INTRODUCTION Articulatory-trajectory model

INTRODUCTION Articulatory-trajectory model Level surface source dependent intermediate finite-state

INTRODUCTION Multi-level Segmental HMM • segmental finite-state process • intermediate “articulatory” layer • linear trajectories • mapping required • linear transformation • radial basis function network

INTRODUCTION Linear-trajectory model acoustic layer articulatory-to-acoustic mapping intermediate layer segmental HMM 1 2 3 4 5

THEORY Linear-trajectory equations Defined as where Segment probability:

THEORY Linear mapping Objective function with matched sequences and

THEORY Trajectory parameters Utterance probability, and, for the optimal (ML) state sequence

THEORY Non-linear (RBF) mapping acoustic layer . . . . . . . . . formant trajectories

THEORY Trajectory parameters With the RBF, the least-squares solution is sought by gradient descent:

METHOD Tests on TIMIT • N. American English, at 8kHz • MFCC13 acoustic features (incl. zero’th) • F1-3: formants F1, F2 and F3, estimated by Holmes formant tracker • F1-3+BE5: five band energies added • PFS12: synthesiser control parameters

RESULTS TIMIT baseline performance • Constant-trajectory SHMM (ID_0) • Linear-trajectory SHMM (ID_1)

RESULTS Performance across feature sets

METHOD Phone categorisation

METHOD Discrete articulatory regions

RESULTS Performance across groupings

RESULTS Results across groupings

METHOD Tests on MOCHA • S. British English, at 16kHz • MFCC13 acoustic features (incl. zero’th) • articulatory x- & y-coords from 7 EMA coils • PCA9+Lx: first nine articulatory modes plus the laryngograph log energy

RESULTS MOCHA baseline performance

RESULTS Performance across mappings

DISCUSSION Model visualisation Original acoustic data Constant- trajectory model Linear- trajectory model, (F) PFS12 (c)

SUMMARY Conclusions • Theory of Multi-level Segmental HMMs • Benefits of linear trajectories • Results show near optimal performance with linear mappings • Progress towards unified models of the speech production process • What next? • unsupervised (embedded) training, to derive pseudo-articulatory representations • implement non-linear mapping (i.e., RBF) • include biphone language model, and segment duration models

Models of speech dynamics for ASR, using intermediate linear representations

Models of speech dynamics for ASR, using intermediate linear representations

Presentation Transcript

Speech Dynamics

Intermediate Representations

Representations / Models

Automatic Speech Recognition (ASR)

2.4 Using Linear Models

SP1. Using Representations and Models

Representations of speech

Representations, Models, Diagrams…

Chap. 10, Intermediate Representations

Intermediate Representations (Chapter 4)

Intermediate Representations

2.4 Using Linear Models

MODELS OF INTERMEDIATE CARE

Structurally Discriminative Graphical Models for ASR

Speech Enhancement for ASR

2-5 Using linear models

Describing the regional dynamics of plant populations using models of intermediate complexity

2.5 Using Linear Models

Intermediate Code Representations

Using Linear Models

Lecture 12: Intermediate Representations

Speech Signal Representations