Segmental HMMs: Modelling Dynamics and Underlying Structure for Automatic Speech Recognition
Wendy Holmes
20/20 Speech Limited, UK
A DERA/NXT Joint Venture
Overview
• Hidden Markov models (HMMs): advantages and limitations
• Overcoming limitations with segment-based HMMs
• Modelling trajectories of acoustic features
• Theory of trajectory-based segmental HMMs
• Experimental investigations: comparing performance of different segmental HMMs
• Choice of parameters for trajectory modelling: recognition using formant trajectories
• A "unified" model for both recognition and synthesis
• Challenges and further issues
Typical speech spectral characteristics
[Spectrogram with phone labels: s i k s / th r ee / o ne ("six three one")]
• Each sound has particular spectral characteristics.
• Characteristics change continuously with time.
• Patterns of change give cues to phone identity.
• The spectrum also includes speaker identity information.
Useful properties of HMMs
1. Appropriate general structure
• The underlying Markov process allows for the time-varying nature of utterances.
• Probability distributions associated with states represent short-term spectral variability.
• Speech knowledge can be incorporated, e.g. context-dependent models, choice of features.
2. Tractable mathematical framework
• Algorithms for automatically training model parameters from natural speech data.
• Straightforward recognition algorithms.
Modelling observations with an HMM
[Figure: an HMM generating one observation per frame at times t, t+1, t+2]
Conventional HMM assumptions
• Piece-wise stationarity: speech is assumed to be produced by a piece-wise stationary process with instantaneous transitions between stationary states.
• Independence: the probability of an acoustic vector given a model state depends ONLY on that vector and that state; no dependency between observations is assumed, other than through the state sequence.
• Duration model: state duration conforms to a geometric pdf, P(d) = a^(d-1)(1 - a) for self-loop transition probability a (illustrated below).
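As a minimal illustration of the duration assumption (the self-loop probability here is just an example value), the geometric pdf is monotonically decreasing, so a one-frame stay is always the most likely duration:

```python
import numpy as np

a = 0.8                              # illustrative self-loop probability
d = np.arange(1, 11)                 # durations of 1..10 frames
p = (1 - a) * a ** (d - 1)           # P(duration = d): geometric pdf
print(p)                             # decreases with d, so a 1-frame
                                     # stay is always the most likely
```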
Limitations of HMM assumptions
• Speech production is not a piece-wise stationary process, but a continuous one.
• Changes are mostly smoothly time-varying.
• The constraints of articulation are such that any one frame of speech is highly correlated with the previous and following frames.
• Time derivatives capture this correlation to some extent, but not within the model itself.
• Long-term correlations (e.g. speaker identity) are also not captured.
• Speech sounds have a typical duration, with shorter and longer durations being less likely, and limitations on maximum duration.
Addressing HMM limitations
AIMS WERE TO:
• retain the advantages of HMMs:
- automatic and tractable algorithms for training on quantities of speech data;
- manageable recognition algorithms (principle of dynamic programming);
• improve the underlying model structure to address HMM shortcomings as models of speech.
ACHIEVING THE AIMS:
• Associate states with sequences of feature vectors => SEGMENTAL HMMs
Modelling observations with Segmental HMMs
[Figure: states generating variable-duration segments of observations, e.g. d=3 frames from time t, d=2 from time t+3, d=5 from time t+5]
Segmental HMMs
• Associate states with sequences of feature vectors, where these sequences can vary in duration.
• Each state is associated with a meaningful acoustic-phonetic event (a phone or part of a phone).
• Can easily incorporate a realistic duration model.
• Enable the relationship between the frames comprising a segment to be modelled explicitly.
• Characterize dynamic behaviour during a segment.
Recognition calculations with HMMs
[Figure: trellis of model states against frame times 1 to 7]
• Compute the most likely path through the model (or sequence of models).
• Evaluate efficiently using dynamic programming (the Viterbi algorithm); see the sketch below.
• To compute the probability of emitting the observations up to a given frame time, any one state need only consider the states that could be occupied at the previous frame.
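A minimal sketch of this recursion in Python, in the log domain (the array layout and names are assumptions of this sketch, not the talk's notation):

```python
import numpy as np

def viterbi_loglik(log_emit, log_trans, log_init):
    """Score of the most likely state path through an HMM.
    log_emit[t, s]  : log P(frame-t observation | state s)
    log_trans[i, j] : log P(next state j | current state i)
    log_init[s]     : log P(starting in state s)
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    for t in range(1, T):
        # for each state: best predecessor score plus this frame's emission
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_emit[t]
    return np.max(delta)
```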
Segmental HMM recognition calculation
[Figure: trellis of states against frame times 1 to 7, with segments of varying duration]
• The principle of dynamic programming still applies.
• BUT the calculation is more complex and computationally intensive.
• For the probability of being in any one state at any given frame time t (see the sketch below):
- assume that frame t represents the last frame of a segment;
- consider all possible segment durations from 1 up to some maximum D;
- therefore, consider all possible previous states at all possible previous frame times from t-1 back to t-D.
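The same principle with the extra duration loop, as a hedged sketch; seg_logprob is a hypothetical precomputed table of segment log-probabilities, introduced only for this illustration:

```python
import numpy as np

def segmental_viterbi_loglik(seg_logprob, log_trans, D):
    """Best-path score for a segmental HMM.
    seg_logprob[s, t, d-1] : log P of the segment ending at frame t with
                             duration d being emitted by state s
    log_trans[i, j]        : log P(next state j | current state i)
    D                      : maximum segment duration (frames)
    """
    S, T, _ = seg_logprob.shape
    delta = np.full((T, S), -np.inf)
    for t in range(T):
        for s in range(S):
            # treat frame t as the last frame of a segment of duration d
            for d in range(1, min(D, t + 1) + 1):
                emit = seg_logprob[s, t, d - 1]
                if t - d < 0:                  # segment starts the utterance
                    score = emit
                else:                          # best predecessor ends at t-d
                    score = np.max(delta[t - d] + log_trans[:, s]) + emit
                delta[t, s] = max(delta[t, s], score)
    return np.max(delta[-1])
```

Note the inner duration loop: the search is roughly a factor of D more expensive than the standard Viterbi recursion.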
Trajectory-based segmental HMMs
[Figure: feature value against time t, with a trajectory drawn through the frames of a segment]
• Approximate the relation between successive feature vectors by some trajectory through feature space.
• Simple trajectory-based segmental HMM: associate a state with a single mean trajectory, in place of the (static) single mean value used for a standard HMM.
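In this simple case the segment score is just the observations evaluated against the mean trajectory. A sketch, using the mid-point/slope line that appears later in the talk (the parameter names are this sketch's, not the talk's):

```python
import numpy as np
from scipy.stats import norm

def fixed_trajectory_loglik(y, c, m, var):
    """Segment log-likelihood under a single linear mean trajectory:
    the state mean at centred time t is c + m*t, replacing the static
    state mean of a standard HMM."""
    t = np.arange(len(y)) - (len(y) - 1) / 2.0   # times centred on mid-point
    return norm.logpdf(y, loc=c + m * t, scale=np.sqrt(var)).sum()
```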
Segmental HMM probability calculations
• Generate observations independently, but conditioned on the trajectory.
• The aim is to provide a constraining model of dynamics without requiring a complex model of correlations.
• BUT the trajectory may be different for different utterances of the same sound.
• So, if a single trajectory is used to represent all examples of a given model unit, it will not be a very accurate representation of any one example.
• One possible solution is a mixture of trajectories, but this needs many components to capture all the different trajectories.
Intra- and extra-segmental variability
[Figure: feature value against time t, showing a family of trajectories]
• Model feature dynamics across all segment examples by, in effect, a continuous mixture of trajectories.
• This is achieved by modelling separately:
- extra-segmental variation (of the underlying trajectory);
- intra-segmental variation (about the trajectory).
=> Probabilistic-trajectory segmental HMMs
Comparing different models
[Figure: generating a sequence of 5 observations from HMM states with a standard HMM, a segmental HMM and a probabilistic-trajectory segmental HMM]
Probabilistic-trajectory segmental HMMs
[Figure: target trajectory over times t = 1..D, with extra-segmental variability around the target and intra-segmental variability about the trajectory]
• Parametric trajectory model with Gaussian distributions.
• Simple linear trajectory, characterized by a mid-point c and a slope m.
• (The illustration shows the case slope = 0.)
PTSHMM probability (general)
• A segment of observations is y = y0, ..., yT.
• The probability of y and a trajectory f given state S is
P(y, f | S) = P(f | S) · Π(t=0..T) P(yt | f, S)
where P(f | S) is the extra-segmental term and the product over frames is the intra-segmental term.
• Alternative segmental models:
1. Define the trajectory, and model the variation in the trajectory.
2. Fix the trajectory and model the observations; the standard HMM is the limiting case.
Linear Gaussian PTSHMM
[Figure: Gaussian distributions for the slope, the mid-point and the intra-segment variation]
• Linear trajectory: slope m and mid-point c, giving trajectory value c + m·t at (centred) time t.
• Gaussian distributions for the slope, the mid-point and the intra-segment variance.
• The joint probability of y and a linear trajectory is
P(y, m, c | S) = N(m; μm, σm²) · N(c; μc, σc²) · Π(t) N(yt; c + m·t, σ²)
• To use the model in recognition, we need to compute P(y | S), but the values of the trajectory parameters m and c are not known; they are "hidden" from the observer.
Hidden-trajectory probability calculation
• One possibility: estimate the location of the trajectory, and compute the probability for that trajectory.
• This approach was used in early work, but it suffers from the difficulty of making an unbiased trajectory estimate.
• A better alternative is to allow for all possible locations of the trajectory by integrating out the unknown parameters.
• In the case of the linear model, the calculation is
P(y | S) = ∫∫ P(y, m, c | S) dm dc
which has a closed form because all the distributions are Gaussian (see the sketch below).
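Because the trajectory parameters enter linearly with Gaussian priors, integrating them out leaves a multivariate Gaussian whose covariance picks up rank-one terms for the shared mid-point and slope. A sketch under that formulation (one standard way to evaluate the integral, not necessarily the derivation used in the original work):

```python
import numpy as np
from scipy.stats import multivariate_normal

def linear_ptshmm_loglik(y, mu_c, var_c, mu_m, var_m, var_e):
    """log P(y | S) for a linear Gaussian PTSHMM state, with the hidden
    mid-point c ~ N(mu_c, var_c) and slope m ~ N(mu_m, var_m) integrated
    out analytically; var_e is the intra-segment variance."""
    n = len(y)
    t = np.arange(n) - (n - 1) / 2.0           # centred time axis
    mean = mu_c + mu_m * t                     # E[y_t] = mu_c + mu_m * t
    cov = (var_e * np.eye(n)                   # intra-segmental variation
           + var_c * np.ones((n, n))           # shared mid-point offset
           + var_m * np.outer(t, t))           # shared slope
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)
```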
Parameters of the linear PTSHMM
• The linear PTSHMM has five model parameters: mid-point mean and variance, slope mean and variance, and intra-segment variance.
• Simpler models arise as special cases, obtained by fixing various parameters (illustrated below):
- if the trajectory slope is set to zero => "static" PTSHMM;
- if variability in the trajectory is prevented => "fixed-trajectory" SHMM;
- a fixed-trajectory static SHMM = a standard HMM with an explicit duration model.
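In terms of the sketch above, each special case just zeroes particular parameters (all numbers here are made-up illustrative values):

```python
y = np.array([1.0, 1.2, 1.1, 1.4, 1.3])   # a hypothetical 5-frame segment

full       = linear_ptshmm_loglik(y, 1.2, 0.1, 0.05, 0.01, 0.05)  # flexible slope
static     = linear_ptshmm_loglik(y, 1.2, 0.1, 0.0,  0.0,  0.05)  # slope fixed at zero
fixed_traj = linear_ptshmm_loglik(y, 1.2, 0.0, 0.05, 0.0,  0.05)  # no trajectory variability
hmm_like   = linear_ptshmm_loglik(y, 1.2, 0.0, 0.0,  0.0,  0.05)  # HMM + duration model
```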
Digit recognition experiments
• Speaker-independent connected-digit recognition.
• 8 mel cepstrum features + overall energy.
• Three-state monophone models.
• Segmental HMM maximum segment duration of 10 frames (=> maximum phone duration = 300 ms).
• Compared probabilistic-trajectory SHMMs with fixed-trajectory SHMMs and with standard HMMs.
• All SHMMs initialised from segmented training data (using HMM Viterbi alignment).
• Interested in acoustic-modelling aspects, so all transition and duration probabilities were fixed to be equal.
• 5 training iterations.
Digit recognition results: simple SHMMs

                          %Sub.  %Del.  %Ins.  %Err.
Standard HMM               6.2    1.5    0.9    8.6
Add duration constraint    5.2    0.7    0.7    6.6
Linear fixed trajectory    3.8    0.5    0.6    4.9

• Some benefit from simply imposing duration constraints by introducing the segmental structure (prevents "silly" segmentations).
• Further benefit from representing dynamics by incorporating a linear trajectory (one trajectory per model state).
Digit recognition results: static PTSHMMs

                            %Sub.  %Del.  %Ins.  %Err.
Static fixed SHMM            5.2    0.2    0.7    6.6
Static probabilistic SHMM    5.2    2.2    0.1    7.5

• For static models, there is no advantage from distinguishing between extra- and intra-segmental variability.
Digit recognition results: linear SHMMs

                                  %Sub.  %Del.  %Ins.  %Err.
Static fixed SHMM                  5.2    0.2    0.7    6.6
Linear fixed trajectory            3.8    0.5    0.6    4.9
Linear PTSHMM (slope var. = 0)     2.0    0.8    0.1    2.9
Linear PTSHMM (flexible slope)     4.9    4.0    0.1    9.0

• Some advantage for the linear trajectory.
• Considerable further benefit from modelling variability in the mid-point.
• But modelling variability in both mid-point and slope is detrimental to recognition performance.
Conclusions from digit experiments
The best trajectory model gives nearly a 70% reduction in error rate (2.9%) compared with standard HMMs (8.6% error rate).
=> There are advantages from a trajectory-based segmental HMM which also incorporates the distinction between intra- and extra-segmental variability, but:
• The trajectory assumption must be reasonably accurate (an advantage for linear but not for static models).
• It is not beneficial to model variability in the slope parameter; the slope is possibly too variable between speakers, or too difficult to estimate reliably for short segments.
Phonetic classification: TIMIT
• Training and recognition with given segment boundaries.
• Train on the complete training set (male speakers), with classification on the core test set.
• 12 mel cepstrum features + overall energy.
• Evaluated (constrained) linear PTSHMMs.
• Compared performance with standard-HMM performance for:
- context-dependent (biphone) versus context-independent (monophone) models;
- a feature set using only the mel cepstrum features versus one which also included time-derivative features.
TIMIT classification results
• The improvement with the linear PTSHMM is greatest for the more accurate (context-dependent) models.
=> more benefit from modelling trajectories when different phonetic events are not combined in one model.
• Most advantage when not using delta features.
=> most benefit from modelling dynamics when not attempting to represent dynamics in the front-end.
Benefit of PTSHMMs for some different phone classes

                                    no. examples  HMM %error  PTSHMM %error  %improvement
Fricatives (f v th dh s z sh hh)        710          41.7         38.9            6.8
Vowels (iy ih eh ae ah uw uh er)       1178          53.8         48.9            9.1
Semivowels and glides (l r y w)          97          39.2         33.2           15.4
Diphthongs (ey ay oy aw ow)             376          48.9         41.2           15.8
Stops (p t dx k b d g)                  566          56.7         54.8            3.4

The most benefit from the linear PTSHMM is for sounds characterised by continuous, smoothly changing dynamics.
Summary of findings
• Probabilistic-trajectory segmental HMMs can outperform standard HMMs and fixed-trajectory segmental HMMs.
• Separately modelling variability within and between segments is a powerful approach, provided that:
- the trajectory assumptions are appropriate (linear trajectory);
- variability in the parameter can be usefully modelled (not useful to model variability in the slope parameter with the current approach).
• The models have been shown to give useful performance gains.
Issues of modelling speech dynamics
Compare error rates on the TIMIT task:
• HMMs with time derivatives: 29.8%
• best segmental HMM result WITHOUT time derivatives: 38.2%
=> Time derivatives capture some aspects of dynamics not modelled in segmental HMMs:
• time-derivative features provide some measure of dynamics for every frame;
• current segmental HMMs only model dynamics within a segment.
Modelling issues and questions (1)
• Choice of model unit (e.g. phone, diphone).
• How to model dynamics and continuity effects across segment boundaries, to represent dynamics throughout an utterance.
• How to model context effects (e.g. trajectories could be defined according to the previous and following sounds, but this complicates the search).
• How to define trajectories (e.g. linear or higher-order polynomial, versus a dynamical-system type model with filtered output of hidden states).
Modelling issues and questions (2)
• Incorporating a realistic duration model.
• How to model any systematic effects of duration on trajectory realisation; this should reduce the remaining variability in trajectories.
• How to model speaker-dependent effects and speaker continuity.
• How to deal with other systematic influences, e.g. speaker stress, speaking rate.
• Dealing with external influences, e.g. noise.
• Choice of features for trajectory modelling.
Spectral representations (1)
• Typical wideband spectrogram: for display, the spectrum is computed at frequent time intervals (e.g. 2 ms).
[Wideband spectrogram with phone labels: th r ee / s I x / s I x]
• Typical features for ASR: mel cepstra (MFCCs) computed from FFTs of 25 ms windows at 10 ms intervals (a sketch of this front-end follows):
[Corresponding mel-cepstrum-based spectral display]
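A sketch of such a front-end using librosa (the file name, sample rate and number of coefficients are example values, not those of the original experiments):

```python
import librosa

# Mel cepstra from 25 ms analysis windows at 10 ms intervals.
signal, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),        # 25 ms window
                            hop_length=int(0.010 * sr))   # 10 ms frame step
print(mfcc.shape)                                         # (13, n_frames)
```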
Spectral representations (2)
• Using long windows at fixed positions blurs rapid events: stop bursts and rapid formant transitions.
• An alternative: use a shorter window "excitation-synchronously":
[Excitation-synchronous spectrogram with phone labels: th r ee / s I x / s I x]
• Compare with the long fixed-window analysis above.
Standard HMM digit recognition experiments
• Compared excitation-synchronous analysis with fixed analysis for different window lengths.
• In all cases an FFT was computed, followed by the mel cepstrum.
• A shorter window gives lower frequency resolution, but the effect is not so great on the mel scale.
• Best fixed-window condition, 20 or 25 ms: 2.1% error (increased to 4.6% for a 5 ms window).
• Best synchronous-window condition, 10 ms: 1.9% error, and this only increased to 2.1% for a 5 ms window.
=> Some advantage to capturing rapid events. But note that a short window may be a disadvantage for fricatives; maybe combine different analyses?
Moving beyond cepstrum trajectories
• Start with a spectral analysis: this must preserve all relevant information.
• But is it appropriate to then model trajectories directly in the spectral/cepstral domain?
• The motivation for modelling dynamics comes from the nature of articulation, and its acoustic consequences.
=> we should be modelling in a domain closer to articulation.
• One possibility is an articulatory description.
• Another option is formants: closely related to articulation, but also to acoustics.
Problems with formant analysis
• Unambiguous formant labelling may not be possible from a single spectral cross-section, e.g. close formants may merge to give a single spectral peak.
• A formant may not be apparent in the spectrum, e.g. when it is weakly excited (F1 in unvoiced sounds).
• Formants are NOT useful for certain distinctions where low amplitude is the main feature, e.g. identifying silence or weak fricatives.
=> It is difficult to identify formants independently of the recognition process, so they are not generally used as features for automatic speech recognition.
Estimating formant trajectories
[Spectrogram with phone labels: s i k s / th r ee / o ne ("six three one") and estimated formant tracks]
• Where a clear formant structure is visible, F1, F2 and F3 can be identified.
• In voiceless fricatives, higher-formant movements are usually continuous with those in the adjacent vowels.
• For F1, connect arbitrarily between adjacent vowels.
Formant analysis method: John Holmes (Proc. EUROSPEECH'97)
• Aims to emulate human abilities:
- the ability to label single spectrum cross-sections;
- heavy reliance on continuity over time;
- sometimes needing knowledge of what is being said to disambiguate alternatives.
• Two fundamental features of the method:
- it outputs alternatives when uncertain ("delayed decisions");
- it maintains a notion of "confidence" in each formant measurement: when formants cannot be estimated (e.g. during silence), confidence is low and the estimate is not useful for recognition => rely on other features (general spectrum shape).
Example of formant analyser output
[Formant analyser output for "four seven"]
• Up to two sets of formants for each frame.
• Alternatives are in terms of sets: F1, F2, F3.
• Alternatives are specified frame by frame, but usually correspond to alternative trajectories.
Segmental HMM experiments
• Each segment model is associated with a linear trajectory.
• Each phone is modelled by a sequence of one or more segments, e.g.:
- monophthongal vowels, fricatives: 1 segment;
- diphthongs: a sequence of 2 segments;
- aspirated voiceless stops: a sequence of 3 segments.
• Allowed minimum and maximum segment durations are set dependent on the identity of the phone segment (a loose constraint).
• The confidence estimate is incorporated (as a variance) in the recognition calculations.
• Formant alternatives are resolved based on probability.
• Features: formants + low-order cepstrum features.
Some connected-digit recognition results

Word error rates                               8 cep.   5 cep. + 3 for.
Standard-HMM baseline (3 states per phone)      3.5 %        2.5 %
Standard HMMs with variable state allocation    6.4 %        5.9 %
Introduce segment structure                     3.2 %        2.9 %
Introduce linear trajectory                     2.6 %        2.3 %

• Performance drops when the new state allocation is introduced (the total number of states is about half that of the baseline).
• The segment structure is needed for good performance.
• Some advantage from the linear trajectory.
• Formants show a small, but consistent, advantage.
Formant modelling
• Expressing a model in terms of formant dynamics offers:
- the potential for modelling systematic effects in a meaningful way, e.g. speaker identity, speaker stress, speaking rate;
- the potential for a constrained model of speech, which should be more robust to noise (assuming the noise is also modelled).
• BUT: analysis of formants separately from hypotheses about what is being said will always be prone to errors.
• FUTURE AIM: integrate formant analysis within the recognition scheme; provided the speech model is accurate, this should overcome any formant-tracking errors.
• A good model of speech should be appropriate for synthesis as well as for recognition: a trajectory-based formant model offers this possibility.
A simple coding scheme
• Demonstrates the principles of coding using the same model for both recognition and synthesis.
• The model represents linear formant trajectories.
• Recognition: linear-trajectory segmental HMMs of formant features.
• Synthesis: the JSRU parallel-formant synthesizer.
• Coding is applied to the analysed formant trajectories => a relatively high bit rate (up to about 1000 bits/s).
• Recognition is used mainly to identify segment boundaries, but also to guide the coding of the trajectories.
Speech coding results
[Audio examples: digits from Speakers 1-3 and an ARM report from Speaker 1, coded at about 600 bits/s, alongside the natural versions]
Achievements of the study:
• Established the principle of using a formant trajectory model for both recognition and synthesis, including using information from recognition to assist in the coding.
Future work:
• Better-quality coding should be possible by further integrating formant analysis, recognition and synthesis within a common framework.