250 likes | 384 Views
Models of speech dynamics for ASR, using intermediate linear representations. Philip Jackson, Boon-Hooi Lo and Martin Russell. Electronic Electrical and Computer Engineering. http://web.bham.ac.uk/p.jackson/balthasar/. INTRODUCTION. Abstract. INTRODUCTION. Speech dynamics into ASR.
E N D
Models of speech dynamics for ASR, using intermediate linear representations Philip Jackson, Boon-Hooi Lo and Martin Russell Electronic Electrical and Computer Engineering http://web.bham.ac.uk/p.jackson/balthasar/
INTRODUCTION Abstract
INTRODUCTION Speech dynamics into ASR • dynamics of speech production to constrain recognizer • noisy environments • conversational speech • speaker adaptation • efficient, complete and trainable models • for recognition • for analysis • for synthesis
INTRODUCTION Articulatory trajectories from West (2000)
INTRODUCTION Articulatory-trajectory model
INTRODUCTION Articulatory-trajectory model Level surface source dependent intermediate finite-state
INTRODUCTION Multi-level Segmental HMM • segmental finite-state process • intermediate “articulatory” layer • linear trajectories • mapping required • linear transformation • radial basis function network
INTRODUCTION Linear-trajectory model acoustic layer articulatory-to-acoustic mapping intermediate layer segmental HMM 1 2 3 4 5
THEORY Linear-trajectory equations Defined as where Segment probability:
THEORY Linear mapping Objective function with matched sequences and
THEORY Trajectory parameters Utterance probability, and, for the optimal (ML) state sequence
THEORY Non-linear (RBF) mapping acoustic layer . . . . . . . . . formant trajectories
THEORY Trajectory parameters With the RBF, the least-squares solution is sought by gradient descent:
METHOD Tests on TIMIT • N. American English, at 8kHz • MFCC13 acoustic features (incl. zero’th) • F1-3: formants F1, F2 and F3, estimated by Holmes formant tracker • F1-3+BE5: five band energies added • PFS12: synthesiser control parameters
RESULTS TIMIT baseline performance • Constant-trajectory SHMM (ID_0) • Linear-trajectory SHMM (ID_1)
RESULTS Performance across feature sets
METHOD Phone categorisation
METHOD Discrete articulatory regions
RESULTS Performance across groupings
RESULTS Results across groupings
METHOD Tests on MOCHA • S. British English, at 16kHz • MFCC13 acoustic features (incl. zero’th) • articulatory x- & y-coords from 7 EMA coils • PCA9+Lx: first nine articulatory modes plus the laryngograph log energy
RESULTS MOCHA baseline performance
RESULTS Performance across mappings
DISCUSSION Model visualisation Original acoustic data Constant- trajectory model Linear- trajectory model, (F) PFS12 (c)
SUMMARY Conclusions • Theory of Multi-level Segmental HMMs • Benefits of linear trajectories • Results show near optimal performance with linear mappings • Progress towards unified models of the speech production process • What next? • unsupervised (embedded) training, to derive pseudo-articulatory representations • implement non-linear mapping (i.e., RBF) • include biphone language model, and segment duration models