Modeling Speech using POMDPs • In this work we apply a new model, the POMDP, in place of the traditional HMM to acoustically model the speech signal. • We use state-of-the-art techniques to build and decode our new model. • We demonstrate improved recognition results on a small data set.
Description of a POMDP • A Markov Decision Process (MDP) is a mathematical formalization of problems in which a decision maker, an agent, must choose the actions that maximize its expected reward as it interacts with its environment. • MDPs have been used to model an agent's behavior in • planning problems • robot navigation problems • In a fully observable MDP the agent always knows precisely which state it is in.
If an agent cannot determine its state, its world is said to be partially observable. • In such a situation we use a generalization of the MDP, called a Partially Observable Markov Decision Process (POMDP). • POMDP vs HMM (see the sketch below) • differs from an HMM • multiple transitions between two states, representing actions • a reward added to each state • as with an HMM • you do not know which state you are in
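As a rough, self-contained sketch (the container and field names here are ours, not the authors'), this is how a POMDP extends the HMM's ingredients with actions and per-state rewards:

```python
from dataclasses import dataclass, field

# Illustrative POMDP container, showing how it extends an HMM:
# transitions are indexed by (state, action) instead of just state,
# and each state carries a reward. All names are illustrative.
@dataclass
class POMDP:
    states: list                      # hidden states (as in an HMM)
    actions: list                     # NEW vs. HMM: labeled transitions
    # P(next_state | state, action) -- an HMM has only P(next_state | state)
    transitions: dict = field(default_factory=dict)
    # P(observation | state), shared with the HMM formulation
    emissions: dict = field(default_factory=dict)
    # NEW vs. HMM: a reward attached to each state
    rewards: dict = field(default_factory=dict)
```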
POMDP in Speech • As with HMMs • left-to-right topology with 3 to 5 states • states represent pronunciation stages: beginning, middle, end of the phoneme • observed acoustic features are associated with each state • Randomness in state transitions still accounts for time stretching in the phoneme: short, long, hurried pronunciations • Randomness in the observations still accounts for the variability in pronunciations (see the topology sketch below)
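A minimal sketch of this left-to-right topology; the probabilities are made-up placeholders (the real values come from training):

```python
import numpy as np

# 3-state left-to-right topology: each state may loop on itself
# (time stretching) or advance to the next state; no backward moves.
n_states = 3  # beginning, middle, end of the phoneme
A = np.zeros((n_states, n_states))
for i in range(n_states):
    A[i, i] = 0.6                # self-loop: long / stretched pronunciations
    if i + 1 < n_states:
        A[i, i + 1] = 0.4        # advance: short / hurried pronunciations
A[-1, -1] = 1.0                  # final state absorbs until the model exits
```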
Differs from HMMs • In theory • model all possible context classes (an infinite number) • model all contexts of a particular context class • In practice • model three context classes: triphone, biphone, monophone • model all contexts of a particular context class • Use the actions of our model to represent context • [Diagram: beginning, middle, and end states of the model]
Training a POMDP • We train each context class independently on the same training data • each context model is treated as an HMM and trained using standard EM • We then collect all context models for each phoneme over the four different context classes and combine them into a single, unified POMDP model (see the sketch below) • we label each action with both the context and the context class of the particular HMM model it came from
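A hypothetical sketch of the combination step; the HMM/POMDP data structures and key names are assumptions, but the action labeling follows the description above:

```python
# After standard EM training, the per-context HMMs for one phoneme are
# merged into a single POMDP whose actions are labeled with both the
# context and the context class. Data structures here are assumed.
def combine_context_models(phoneme, trained_hmms):
    """trained_hmms: dict mapping (context_class, context) -> trained HMM."""
    pomdp = {"phoneme": phoneme, "actions": {}}
    for (context_class, context), hmm in trained_hmms.items():
        # each action name records which HMM (class + context) it came from
        action = (context_class, context)
        pomdp["actions"][action] = {
            "transitions": hmm["transitions"],
            "emissions": hmm["emissions"],
        }
    return pomdp
```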
Decoding a POMDP • We look at 3 decoding strategies based on Viterbi: • Uniform Mixed Model (UMM) Viterbi • Weighted Mixed Model (WMM) Viterbi • Cross-Context Mixed Model (CMM) Viterbi
UMM Viterbi • From the Viterbi point of view • add all context classes to the mix and allow Viterbi to choose the best path through the entire search space • relax context rules by matching up all partial-context phonemes • wild-card all monophones to match up with all biphones and triphones sharing the same center phone • wild-card all biphones to match up with triphones whose other context they share • add a class weight, Wc, to each context class c • applied to each model as we enter it • From the POMDP point of view • the model constrains actions • add the constraint that we must leave a state with the same action we entered it with • ensures the model's context, as in an HMM
relax the constraint so that different context classes may be chosen within the model • differs from an HMM • the class weight is a reward given at the start state for entering the model • [Figure: Viterbi expansion of "tomato", which has two spellings, "t-ow-m-ey-t-ow" and "t-ow-m-aa-t-ow", under (a) standard Viterbi and (b) UMM Viterbi] • a wild-carding sketch follows below
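To make the wild-carding rule concrete, here is an illustrative helper (the representation and names are ours): a model is a (left, center, right) triple, with None standing for an unspecified context:

```python
# A monophone matches any biphone/triphone with the same center phone,
# and a biphone matches any triphone sharing the context it specifies.
def contexts_match(model_a, model_b):
    """Each model is (left, center, right); None acts as a wild card."""
    for a, b in zip(model_a, model_b):
        if a is not None and b is not None and a != b:
            return False
    return model_a[1] == model_b[1]   # center phones must always agree

# e.g. monophone "ow" matches the triphone t-ow+m, and the biphone
# m+ey matches ow-m+ey but not ow-m+aa
assert contexts_match((None, "ow", None), ("t", "ow", "m"))
assert contexts_match((None, "m", "ey"), ("ow", "m", "ey"))
assert not contexts_match((None, "m", "ey"), ("ow", "m", "aa"))
```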
WMM Viterbi • Similar to UMM Viterbi, except now we weight each context model of each context class individually, based on frequency counts of its occurrence in the training data (see the sketch below): w_cm = L_c + min(f_cm / K_c, 1) * (W_c - L_c) • f_cm – frequency count for model m of context class c • L_c – lower bound for context class c • W_c – upper bound for context class c • K_c – frequency-count cutoff threshold for context class c
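The weight formula transcribes directly into code; the function name is ours, and the parameter values would come from tuning:

```python
# Direct transcription of the WMM weight formula above; all arguments are
# per-context-class tuning parameters, not measured numbers.
def wmm_weight(f_cm, L_c, W_c, K_c):
    """Weight for model m of context class c, interpolating between the
    class lower bound L_c and upper bound W_c by frequency count f_cm."""
    return L_c + min(f_cm / K_c, 1.0) * (W_c - L_c)

# Rarely seen contexts stay near L_c; contexts seen at least K_c times get W_c.
```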
CMM Viterbi • Similar to WMM Viterbi, except now our POMDP model relaxes the constraint on actions • allows cross-model jumps • jumps are now weighted by the model weight w_cm • the constraint is relaxed to a sub-class of context models as follows: • models can jump between a triphone and the associated biphone and monophone whose partial context they share
[Diagram: cross-model jumps among the triphone t-ow+m, the biphone ow+m, and the monophone ow] • Various strategies for relaxing the cross-model jump constraints (a restricted-jump sketch follows below) • Maximum cross context • for each cross-context model jump, add the weight to the likelihood score and choose the jump that yields the highest score • Expanded cross context • choose all context model jumps at every state, adding the weight to the likelihood score of each jump • Restricted forms of both Maximum and Expanded • add the constraint that once we choose a lower-order context class model, we cannot go back to a higher-order context class model; we stay within our own class or a lower one • the idea is to abandon higher-order models that perform poorly
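A minimal sketch of the restricted-jump rule; the numeric encoding of the class ordering is our assumption:

```python
# Once decoding drops to a lower-order context class, it may not return
# to a higher-order one; it stays within its own class or a lower one.
CLASS_ORDER = {"triphone": 2, "biphone": 1, "monophone": 0}

def jump_allowed(current_class, target_class):
    """Permit a cross-model jump only to the same or a lower-order class."""
    return CLASS_ORDER[target_class] <= CLASS_ORDER[current_class]

# e.g. triphone -> biphone is allowed, biphone -> triphone is not
assert jump_allowed("triphone", "biphone")
assert not jump_allowed("biphone", "triphone")
```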
Experiments • Tested our model on the TIMIT data set: • TIMIT – read English sentences • 45 phonemes, ~8000-word dictionary • 3 hours of training data: 3869 utterances by 387 speakers • 6 minutes of decoding data: 110 utterances by 11 speakers • independent of the training data • trigram language model built from the training data and outside sources (OGI: Stories and NatCell)
Baseline • Found the best system configuration for the corpus • created 16-mixture SCTM models for each HMM context class using the ISIP prototype system (v5.10) • ran the baseline for all 3 HMM models
Results • Results for all three modified Viterbi algorithms are similar to those on the development set • the POMDP model shows robustness to different test sets • it is not tuned to the data
Future Work • Apply the new model to a larger data set • Find a better method to generate the individual context model weights • e.g., the linear interpolation and backoff techniques used in language modeling • Find a better method for adjusting the overall POMDP model's context class weights for the various decoding strategies • the current method of experimentation is inefficient • For CMM Viterbi, find better ways to constrain cross-model jumps outside of partial context classes • e.g., use the same kind of linguistic information used in tying mixtures at the state level