This study explores adaptive estimation and Bayesian adaptation of continuous density hidden Markov model (CDHMM) parameters for speaker adaptation.
A Study on Speaker Adaptation of Continuous Density HMM Parameters
By Chin-Hui Lee, Chih-Heng Lin, and Biing-Hwang Juang
1990 ICASSP/IEEE
Presented by: 陳亮宇
Outline • Introduction • Adaptive estimation of CDHMM parameters • Bayesian adaptation of Gaussian parameters • Experimental setup and recognition results • Summary
Introduction • Adaptive learning • Adapting reference speech patterns or models to handle situations unseen in the training phase • For example: • Varying channel characteristics • Changing environmental noise • Varying transducers
Introduction – MAP • Maximum a posteriori (MAP) estimation • Also called Bayesian adaptation • Given a prior distribution over the parameters, MAP estimation maximizes the posterior distribution of the parameters given the adaptation data
Adaptive estimation of CDHMM parameters • Sequence of speaker-dependent (SD) observations $Y = \{y_1, y_2, \dots, y_T\}$ • $\lambda$ = parameter set of the distribution function • Given training data $Y$, we want to estimate $\lambda$ • If $\lambda$ is assumed random with a prior distribution function $P_0(\lambda)$, the MAP estimate for $\lambda$ is obtained by solving $\hat{\lambda}_{MAP} = \arg\max_{\lambda} p(\lambda \mid Y) = \arg\max_{\lambda} f(Y \mid \lambda)\, P_0(\lambda)$, where $f(Y \mid \lambda)$ is the likelihood function, $P_0(\lambda)$ the prior distribution, and $p(\lambda \mid Y)$ the posterior distribution; with a flat prior this reduces to the MLE
Adaptive segmental k-means algorithm • Maximizes the state likelihood of the observation sequences in an iterative manner using the segmental k-means training algorithm • $s$ = state sequence • For a given model $\hat{\lambda}$, find the optimal state sequence $\hat{s}$ • Based on the state sequence $\hat{s}$, find the MAP estimate $\hat{\lambda}$ • Iterate until convergence (a minimal sketch follows)
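The loop below is a minimal sketch of this alternation, assuming a single-Gaussian-per-state, left-to-right HMM over 1-D observations with a known shared variance; `viterbi_align`, `map_mean`, and all constants are illustrative stand-ins rather than the paper's implementation, and only the means are adapted here (using the MAP rule from the next section).

```python
import numpy as np

def viterbi_align(obs, means, var):
    """Left-to-right forced alignment of 1-D frames to states by dynamic
    programming over Gaussian log-likelihoods (shared variance)."""
    T, S = len(obs), len(means)
    loglik = -0.5 * (obs[:, None] - means[None, :]) ** 2 / var
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0, 0] = loglik[0, 0]
    for t in range(1, T):
        for s in range(S):
            # allowed moves: stay in state s, or advance from state s - 1
            if s > 0 and dp[t - 1, s - 1] > dp[t - 1, s]:
                dp[t, s] = dp[t - 1, s - 1] + loglik[t, s]
                back[t, s] = s - 1
            else:
                dp[t, s] = dp[t - 1, s] + loglik[t, s]
                back[t, s] = s
    states = np.empty(T, dtype=int)
    states[-1] = S - 1                      # force ending in the last state
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states

def map_mean(y, mu0, tau2, sigma2):
    """MAP estimate of a Gaussian mean under a N(mu0, tau2) prior."""
    n = len(y)
    return mu0 if n == 0 else (n * tau2 * y.mean() + sigma2 * mu0) / (n * tau2 + sigma2)

# Alternate (1) optimal state sequence and (2) MAP re-estimation.
rng = np.random.default_rng(0)
si_means = np.array([0.0, 2.0, 4.0])              # speaker-independent means
obs = np.concatenate([rng.normal(m + 0.5, 0.3, 20) for m in si_means])
means, sigma2, tau2 = si_means.copy(), 0.3 ** 2, 1.0
for _ in range(5):
    states = viterbi_align(obs, means, sigma2)
    means = np.array([map_mean(obs[states == s], si_means[s], tau2, sigma2)
                      for s in range(len(means))])
print(means)    # means shift from the SI values toward the new speaker's data
```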
The choice of prior distribution • Non-informative prior • Parameters are fixed but unknown and are to be estimated from the data • No preference as to what the values of the parameters should be • MAP estimation reduces to MLE • Informative prior • Knowledge about the parameters to be estimated is available • The choice of prior distribution depends on the acoustic models used to characterize the data
Conjugate prior • The prior and posterior belong to the same distribution family • Analytical forms of some conjugate priors are available (a standard example follows)
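As a standard illustration (not from the slides): for a Gaussian likelihood with known variance, a Gaussian prior on the mean is conjugate, since the posterior is again Gaussian and available in closed form:

```latex
% Data y_1,...,y_n ~ N(mu, sigma^2) with sigma^2 known; prior mu ~ N(mu_0, tau^2).
% The posterior is Gaussian again, so the prior is conjugate:
\mu \mid y_1,\dots,y_n \;\sim\; \mathcal{N}\!\left(
    \frac{n\tau^2\,\bar{y} + \sigma^2\mu_0}{n\tau^2 + \sigma^2},\;
    \frac{\sigma^2\tau^2}{n\tau^2 + \sigma^2}
\right)
```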
Bayesian adaptation of Gaussian parameters • 3 implementations of Bayesian adaptation • Gaussian mean • Gaussian variance • Gaussian mean and precision • $\mu$ = mean and $\sigma^2$ = variance of one component of a state observation distribution • Precision $\theta = 1/\sigma^2$
Bayesian adaptation of the Gaussian mean • Observations $y_1, \dots, y_n$ drawn from $N(\mu, \sigma^2)$ • $\mu$ is random, with prior $N(\mu_0, \tau^2)$ • $\sigma^2$ is fixed and known • The MAP estimate for the parameter $\mu$ is $\hat{\mu} = \frac{n\tau^2\,\bar{y} + \sigma^2\mu_0}{n\tau^2 + \sigma^2}$ • where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the sample mean of the adaptation data
Bayesian adaptation of the Gaussian mean (cont.) • MAP converges to the MLE (the sample mean $\bar{y}$) when: • a large number of samples is used for adaptation, or • a relatively large prior variance $\tau^2$ is chosen ($\tau^2 \gg \sigma^2 / n$), i.e. a non-informative prior (a quick numeric check follows)
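A quick numeric check of the two limits, using the MAP rule above; the constants are arbitrary illustrative values:

```python
# mu_hat = (n*tau2*ybar + sigma2*mu0) / (n*tau2 + sigma2), with MLE = ybar.
mu0, sigma2, ybar = 0.0, 1.0, 5.0
for n, tau2 in [(2, 0.1), (2, 100.0), (10000, 0.1)]:
    mu_hat = (n * tau2 * ybar + sigma2 * mu0) / (n * tau2 + sigma2)
    print(f"n={n:5d}  tau2={tau2:6.1f}  mu_hat={mu_hat:.3f}")
# Output approaches the MLE ybar = 5.0 as n grows or as tau2 >> sigma2/n.
```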
Bayesian adaptation of the Gaussian variance • The mean $\mu$ is estimated by the sample mean • The variance $\sigma^2$ is given an informative prior • $\sigma_{min}^2$ is estimated from a large collection of speech data
Bayesian adaptation of the Gaussian variance (cont.) • The adapted variance combines the sample variance $S_y^2$ with the prior bound $\sigma_{min}^2$ • Effective when an insufficient amount of sample data is available (a hedged sketch follows)
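A hedged sketch of the idea: with few adaptation samples the sample variance collapses toward zero, so it is kept from falling below the corpus-derived floor $\sigma_{min}^2$. The flooring form below is an assumption for illustration; the paper's prior-based estimator (eq. 3.5) may differ in detail.

```python
import numpy as np

def adapt_variance(y, sigma_min2):
    """Assumed flooring form: never let the estimate drop below sigma_min2."""
    s_y2 = float(np.var(y))          # sample variance S_y^2
    return max(s_y2, sigma_min2)     # floor guards against sparse data

y = np.array([1.00, 1.02, 0.98])     # 3 samples: S_y^2 is badly underestimated
print(adapt_variance(y, sigma_min2=0.05))   # -> 0.05, the floor
```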
Bayesian adaptation of both Gaussian mean and precision • Both the mean and precision parameters are random • The joint conjugate prior $P_0(\mu, \theta)$ is a normal-gamma distribution: the mean follows a normal distribution conditioned on the precision, and the precision follows a gamma distribution
Bayesian adaptation of both Gaussian mean and precision (cont.) • MAP estimates of $\mu$ and $\sigma^2$ can be derived in closed form from the posterior normal-gamma parameters (eq. 3.7) • The prior parameters can be estimated from the SI models and training data (eqs. 3.8-3.9); a hedged sketch follows
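The sketch below uses the textbook normal-gamma prior $NG(\mu, \theta) = N(\mu \mid \mu_0, (\kappa\theta)^{-1})\,\mathrm{Gamma}(\theta \mid \alpha, \beta)$ and its standard posterior updates; the paper's eqs. (3.7)-(3.9) use their own parameterization, so treat the names and hyperparameter values here as assumptions.

```python
import numpy as np

def map_mean_precision(y, mu0, kappa, alpha, beta):
    """Joint MAP of (mean, precision) under a normal-gamma prior."""
    n, ybar = len(y), float(np.mean(y))
    ss = float(np.sum((y - ybar) ** 2))
    # standard posterior updates for the normal-gamma conjugate pair
    mu_n = (kappa * mu0 + n * ybar) / (kappa + n)
    alpha_n = alpha + 0.5 * n
    beta_n = beta + 0.5 * ss + 0.5 * kappa * n * (ybar - mu0) ** 2 / (kappa + n)
    theta_map = (alpha_n - 0.5) / beta_n    # joint posterior mode over (mu, theta)
    return mu_n, 1.0 / theta_map            # adapted mean and variance

y = np.random.default_rng(1).normal(2.0, 0.5, 8)
print(map_mean_precision(y, mu0=0.0, kappa=1.0, alpha=2.0, beta=0.5))
```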
Experimental setup • 2 sets of speech data • SI data for the SI models: 100 speakers (50F/50M) • SD data for adaptation: 4 speakers (2F/2M) • SD adaptation data: 5 training utterances per word for each male speaker and 7 for each female speaker • SD testing data: 10 utterances per word per speaker • Recorded over local dialed-up telephone lines • Sampling rate = 6.67 kHz • 39-word vocabulary • 26 English letters • 10 digits • 3 command words (stop, error, repeat)
Experimental setup (cont.) • Models are obtained by using the segmental k-means training procedure • Maximum number of mixture components per state = 9 • Diagonal covariance matrices • 5-state HMMs • 2 sets of SI models • 1st set: as described above • 2nd set: single Gaussian distribution per state
Experimental results 1 • Baseline recognition rate: • SD: 2 Gaussian mixtures per state per word
Experimental results 2 • 5 adaptation experiments (SA = speaker-adaptive estimate): • EXP1: SD mean and SD variance (regular MLE) • EXP2: SD mean and a fixed variance estimate • EXP3: SA mean (3.1) with prior parameters (3.2-3.3) • EXP4: SD mean and SA variance (3.5) • EXP5: SA estimates (3.7) with prior parameters (3.8-3.9)
Experimental results 4 • [Figure: recognition-rate comparison of the five estimators] • SD mean, SD variance (MLE) • SD mean, fixed variance • SA mean (method 1) • SD mean, SA variance (method 2) • SA mean and precision (method 3)
Conclusions • Average recognition rate with all adaptation tokens incorporated = 96.1% • Performance improves when • more adaptation data are used • both mean and precision are adapted