LORIA: Irina Illina, Dominique Fohr. Chania Meeting, May 9-10, 2007
Missing Data: previous approach • Hypothesis: some coefficients of the feature vector are masked by noise • Marginalization: replace p(Y|M) by an integration over the masked coefficients • Approach presented before: y = x + n (additive case, since we work in the spectral domain) • Two cases: • If SNR > 0: x > n, so y/2 < x < y • If SNR < 0: x < n, so 0 < x < y/2 • (Figure: the interval [0, y] split at y/2, showing where x lies in each case)
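For reference, a hedged sketch of how these bounds enter the decoder as bounded marginalization; the split into reliable/unreliable coefficients and the local SNR estimate are notational assumptions, not taken verbatim from the slides.

```latex
p(Y \mid M) \;\approx\;
\prod_{i \in \mathcal{R}} p(y_i \mid M)
\prod_{i \in \mathcal{U}} \int_{x_i^{\mathrm{low}}}^{x_i^{\mathrm{high}}} p(x_i \mid M)\, dx_i ,
\qquad
[\,x_i^{\mathrm{low}},\, x_i^{\mathrm{high}}\,] =
\begin{cases}
[\,y_i/2,\; y_i\,] & \text{if } \widehat{\mathrm{SNR}}_i > 0 ,\\
[\,0,\; y_i/2\,]   & \text{if } \widehat{\mathrm{SNR}}_i < 0 .
\end{cases}
```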
WP1: Missing Data • Modification of the approach presented before • Better approximation of the marginalization interval
Missing Data: new approach • Choose the integration limits as a function of the estimated mask • The marginalization interval becomes smaller
Proposed masks (figure: noisy speech spectrum Y and clean speech spectrum X) • Each time-frequency unit is a scalar in [0;1] giving the relative contribution of speech energy in the observed signal. • This differs from an SNR-based mask, where each unit gives the probability that the corresponding pixel is missing.
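A minimal sketch of computing such a soft mask from parallel clean/noisy spectra (oracle condition); the function and variable names are illustrative, not from the HIWIRE code.

```python
import numpy as np

def soft_masks(clean_spec, noisy_spec, eps=1e-10):
    """Illustrative oracle mask: relative contribution of speech energy
    in each time-frequency unit, clipped to [0, 1].

    clean_spec, noisy_spec: arrays of shape (n_frames, n_bands) holding
    the clean (X) and noisy (Y) spectral energies.
    """
    mask = clean_spec / np.maximum(noisy_spec, eps)
    return np.clip(mask, 0.0, 1.0)
```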
Proposed masks (figure: four mask prototypes, Cluster 1 to Cluster 4) • Each cluster k is represented by: • a mean vector μk = (μ1,k, …, μN,k) • a diagonal covariance matrix Σk = diag(σ1,k, …, σN,k) • Clusters can be seen as pdfs of the contribution of speech energy in the noisy observed signal. • We propose to consider these clusters as potential missing-data masks for any noisy input frame
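The slides do not say which clustering algorithm produces these prototypes; the sketch below assumes a simple k-means over oracle-mask frames and then summarizes each cluster by a mean vector and diagonal variances, as described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_mask_clusters(mask_frames, n_clusters=4, seed=0):
    """Cluster oracle-mask frames (n_frames x n_bands, values in [0, 1])
    into K prototype masks; each cluster is summarized by a mean vector
    and a diagonal covariance (here plain per-band variances)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(mask_frames)
    labels = km.labels_
    means, variances = [], []
    for k in range(n_clusters):
        frames_k = mask_frames[labels == k]
        means.append(frames_k.mean(axis=0))
        variances.append(frames_k.var(axis=0) + 1e-6)  # diagonal covariance
    return np.array(means), np.array(variances), labels
```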
Missing data: training • For each mask k, a GMM is trained on the noisy frames Y aligned with Mk • An ergodic HMM is built from these GMMs (one state per mask)
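A hedged sketch of this training step, assuming the frame-to-mask alignment comes from the clustering above and using scikit-learn GMMs; the flat ergodic transition matrix is a simplification of whatever transition estimation the real system uses.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_mask_gmms(noisy_frames, mask_labels, n_clusters=4, n_mix=8):
    """For each mask cluster k, fit a GMM on the noisy frames Y that were
    aligned with that mask during training."""
    gmms = []
    for k in range(n_clusters):
        gmm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                              random_state=0)
        gmm.fit(noisy_frames[mask_labels == k])
        gmms.append(gmm)
    # Ergodic HMM over the K mask states: a uniform (flat) transition
    # matrix is assumed here as a placeholder.
    trans = np.full((n_clusters, n_clusters), 1.0 / n_clusters)
    return gmms, trans
```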
Missing data: recognition • Use the ergodic HMM to find the mask k for each frame • Each frame y(t) -> one state -> mask Mk • Use μi,k and σi,k of Mk to define the marginalization interval: [μi,k - 2σi,k, μi,k + 2σi,k] • Marginalization: integrate the state pdf over this interval (a hedged sketch follows below)
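A sketch of the recognition-side computation under two explicit assumptions: mask selection is done per frame by maximum GMM likelihood (instead of full Viterbi decoding of the ergodic HMM), and the mask interval [μ - 2σ, μ + 2σ], which lives in the [0, 1] energy-ratio domain, is mapped to bounds on the clean coefficient by multiplying by the observed y.

```python
import numpy as np
from scipy.stats import norm

def select_mask(frame_y, gmms):
    """Per-frame mask selection: pick the mask GMM with highest likelihood
    (a simpler approximation of the ergodic-HMM decoding on the slide)."""
    scores = [g.score(frame_y.reshape(1, -1)) for g in gmms]
    return int(np.argmax(scores))

def marginalized_loglik(frame_y, mask_mean, mask_std, state_mean, state_std):
    """Bounded marginalization for one HMM state with a diagonal Gaussian.
    Assumption: the mask interval (a fraction of the observed energy) is
    turned into bounds on the clean coefficient by multiplying by y."""
    lo = np.clip(mask_mean - 2 * mask_std, 0.0, 1.0) * frame_y
    hi = np.clip(mask_mean + 2 * mask_std, 0.0, 1.0) * frame_y
    prob = norm.cdf(hi, state_mean, state_std) - norm.cdf(lo, state_mean, state_std)
    return np.sum(np.log(np.maximum(prob, 1e-12)))
```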
Missing Data: experiments • Parameterization: spectral domain, 12 Mel bands + Δ + ΔΔ • Training: HMM models on clean Aurora4 + adaptation with the first 50 HIWIRE clean sentences • Mk: trained on noisy HIWIRE (first 50 sentences), LN+MN+HN+clean • Test: noisy HIWIRE (last 50 sentences)
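As a side note, a minimal front-end sketch for such Mel-band features, assuming a librosa-based pipeline; the FFT size, hop length, and exact delta configuration are guesses rather than the HIWIRE settings.

```python
import numpy as np
import librosa

def mel_band_features(wav_path, n_mels=12):
    """Illustrative front-end: 12 log Mel filterbank energies per frame
    plus delta coefficients; frame/hop sizes are assumptions."""
    audio, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=512,
                                         hop_length=160, n_mels=n_mels)
    feats = np.log(mel + 1e-10).T              # (n_frames, n_mels)
    delta = librosa.feature.delta(feats.T).T   # delta coefficients
    return np.hstack([feats, delta])
```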
Visualisation of the marginalisation intervals on an example (figure: one spectral coefficient for the word « standby »; panels: new method vs. previous method, clean and LN conditions)
Visualisation of the marginalisation intervals on an example (figure: new method vs. previous method, MN and HN conditions)
WER evaluation (figure: WER of the new method vs. the previous method)
WER-based evaluation • Comparison with ETSI AFE (figure: WER of the new method vs. ETSI AFE)
Results (figure: WER %, previous vs. new method, including an oracle condition) • Oracle: X/Y -> Mk -> marginalisation
New method: high-noise problem • The true value falls outside the marginalization interval
Conclusion • A better approximation of the marginalization interval gives better recognition results, especially in the LN and MN conditions • But mask estimation must be improved in the MN and HN conditions
WP2: Non-native speech recognition • Previous work • 2 sets of models: • TIMIT HMM models • Native (Fr, It, Gr, Sp) HMM models • Confusion rules • Integration of the rules in HMM • New study: • Different sets of models
Different sets of models • TIMIT models (canonical English models) • Native models, L = {Fr, It, Sp, Gr} • MLLR-adapted models: TIMIT HMM adapted on HIWIRE_L • MAP-adapted models: TIMIT HMM adapted on HIWIRE_L • Re-estimated models: TIMIT HMM + Baum-Welch iterations using HIWIRE_L
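For reference, a sketch of the standard MAP mean update that this kind of adaptation typically relies on; the exact HIWIRE/HTK recipe and the value of the prior weight τ are not given in the slides.

```python
import numpy as np

def map_adapt_means(prior_means, frames, posteriors, tau=10.0):
    """Standard MAP update of Gaussian means (sketch, not the exact recipe):
    mu_hat = (tau * mu_prior + sum_t gamma_t * x_t) / (tau + sum_t gamma_t).

    prior_means: (n_gauss, dim) means of the TIMIT models
    frames:      (n_frames, dim) adaptation data (HIWIRE_L)
    posteriors:  (n_frames, n_gauss) occupation probabilities gamma_t
    tau:         prior weight controlling how far the means can move
    """
    occ = posteriors.sum(axis=0)          # (n_gauss,)
    acc = posteriors.T @ frames           # (n_gauss, dim)
    return (tau * prior_means + acc) / (tau + occ)[:, None]
```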
Experimental conditions • Adaptation and re-estimation: • Cross-validation (leave one out): • All speakers except one for adaptation or re-estimation • The remaining speaker for testing
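A small illustrative helper for the leave-one-out splits over speakers (names are hypothetical).

```python
def leave_one_speaker_out(speakers):
    """Cross-validation splits: for each test speaker, all other speakers
    are used for adaptation / re-estimation."""
    for test_spk in speakers:
        train_spks = [s for s in speakers if s != test_spk]
        yield train_spks, test_spk

# Usage sketch (adapt, evaluate, data_for are hypothetical helpers):
# for train_spks, test_spk in leave_one_speaker_out(hiwire_speakers):
#     model = adapt(timit_hmm, data_for(train_spks))
#     evaluate(model, data_for(test_spk))
```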
Results (figure: WER for HMM TIMIT, TIMIT + native, MLLR adaptation with HIWIRE, MAP adaptation with HIWIRE, and retraining on HIWIRE, under both the HIWIRE grammar and a word-loop grammar)
Results with confusion rules integrated in HMM (HIWIRE grammar), WER % / SER % per model set:
• Baseline: 7.2 / 14.6
• 5.3 / 10.2
• 5.8 / 11.8
• 4.8 / 10.9
• 3.5 / 8.1
• 2.8 / 6.4
• 2.8 / 6.5
• 2.1 / 5.0
• Best result (2.1 / 5.0) obtained with TIMIT HMM models (canonical English) + retrained models
Results with speaker adaptation • Using the best system of the previous slide (confusion rules integrated in TIMIT HMM + re-estimation), we add a speaker adaptation step: • first 50 sentences per speaker for adaptation • MAP adaptation • HIWIRE grammar • WER: 1.4% • SER: 3.2%
Conclusion • Different sets of models have been tested • Baseline results: WER 7.2%, SER 14.6% • Best result obtained with confusion rules integrated in TIMIT HMM + re-estimation + MAP speaker adaptation: WER 1.4%, SER 3.2%
Extracted rules and modified HMM structure (figure: example of acoustic model modification for English phone /t/; extracted confusion rules map English phones to French phones, and the English /t/ model is placed in parallel with the corresponding French models)
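A purely hypothetical sketch of how extracted confusion rules could be stored and used to pick the models placed in parallel for a phone; the real rules and the HMM structure modification are performed on the acoustic model files, and the phone entries below are examples, not the rules extracted in this work.

```python
# Hypothetical rule table: English phone -> French phone models it is
# confused with (example entries only).
confusion_rules = {
    "en:/t/": ["fr:/t/", "fr:/k/"],
}

def parallel_models(english_phone, rules):
    """Models to place in parallel in the modified HMM: the canonical
    English model plus the French models given by the confusion rules."""
    return [english_phone] + rules.get(english_phone, [])

# parallel_models("en:/t/", confusion_rules) -> ["en:/t/", "fr:/t/", "fr:/k/"]
```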