Topics
• Recognition results on Aurora noisy speech database
• Proposal of robust formant estimation from MFCCs
• Availability of real in-car speech databases
• Contact from Pi Research
b.milner@uea.ac.uk

Robust Formant Prediction from MFCCs
• One of the aims of this integrated project is to use the speech recogniser to provide clean speech information for the speech enhancement component
• The proposal is to use the speech recogniser to provide robust formant information from noisy speech
• Review previous work on predicting pitch from MFCC vectors
• Extension to the proposed prediction of formants

Pitch Prediction from MFCCs
• In speech recognition the most commonly extracted feature is the mel-frequency cepstral coefficient (MFCC)
• This is designed for class discrimination and contains spectral envelope information
• Excitation information (pitch) is lost through the smoothing processes
• A project at UEA aimed at reconstructing speech from MFCC vectors, and therefore needed an additional pitch estimate or a prediction of pitch

MFCC Extraction
• Mel-frequency cepstral coefficients (MFCCs)
• designed for speech recognisers
• simulate human perceptual ability
• currently give the best recognition performance
• extract information about the vocal tract
• ignore most speaker information, such as pitch
Processing chain: speech → framing, pre-emphasis and windowing → FFT and magnitude spectrum → mel filterbank → log( ) → DCT → truncation → 13-D MFCCs
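
The processing chain on this slide can be sketched for a single frame as follows. This is a minimal illustration in plain Python, not the front-end used in the project: the 8 kHz sample rate, 23 mel channels, 0.97 pre-emphasis factor, Hamming window and the function names are all assumptions made for the example. Only the order of the stages and the truncation to 13 coefficients come from the slide.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale (one weight row per filter)."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    edges = [f * n_fft / sample_rate for f in edges_hz]  # edges in fractional FFT bins
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_frame(frame, sample_rate, n_filters=23, n_ceps=13, preemph=0.97):
    """MFCCs for one frame; all default parameter values are assumptions."""
    n = len(frame)
    # Pre-emphasis, then Hamming window
    emphasised = [frame[0]] + [frame[i] - preemph * frame[i - 1] for i in range(1, n)]
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(emphasised)]
    # Magnitude spectrum (naive DFT, adequate for a short illustrative frame)
    mag = [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
           for k in range(n // 2 + 1)]
    # Mel filterbank followed by log (floored to avoid log(0))
    bank = mel_filterbank(n_filters, n, sample_rate)
    log_energy = [math.log(max(sum(w * m for w, m in zip(row, mag)), 1e-10))
                  for row in bank]
    # DCT-II, truncated to the first n_ceps coefficients
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energy))
            for c in range(n_ceps)]


# Synthetic test frame: two sinusoids at 440 Hz and 1200 Hz, 256 samples at 8 kHz
frame = [math.sin(2 * math.pi * 440 * t / 8000) + 0.5 * math.sin(2 * math.pi * 1200 * t / 8000)
         for t in range(256)]
ceps = mfcc_frame(frame, sample_rate=8000)
print(len(ceps))  # 13
```

The filterbank smoothing and the truncation of the DCT output are the stages that discard fine harmonic structure, which is why pitch is not directly recoverable from the 13-D vector.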

Pitch Prediction from MFCC vectors
• There is clearly no global correlation between pitch frequency and spectral envelope (or MFCC vector)
• There does exist a class-dependent correlation - the classes being different speech sounds
• If this class-based correlation can be modelled then prediction of pitch from the spectral envelope, or MFCC vector, should be possible
• Investigate two methods for modelling this correlation: GMM and HMM

Class-based GMM Pitch Prediction
Training phase
• Introduce augmented feature vector y = [x, f], where x is the MFCC vector and f is the pitch
• Model the joint distribution by clustering to form a GMM - tested from 64 to 128 clusters
Pitch prediction
• During the prediction stage only the MFCC component x is available
• Pitch is predicted using a MAP algorithm from the means and covariances of the clusters
• Does not fully exploit the class-based correlation between the MFCC vector and pitch
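
The slide does not give the prediction formula, so the sketch below shows one common way to predict f from x once a joint GMM over y = [x, f] has been trained: weight each cluster's conditional mean of f given x by the cluster's posterior probability given x. Treat it as an illustration under assumptions, not the project's algorithm. The diagonal form of the MFCC covariance, the dictionary layout, the function names and the two toy clusters are all hypothetical, and unvoiced frames are not handled.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def predict_pitch(x, clusters):
    """Posterior-weighted conditional mean of pitch f given MFCC vector x.

    Each cluster holds: weight, mu_x, var_x (diagonal of the MFCC covariance,
    an assumed simplification), mu_f, and cov_fx (cross-covariance of f with x).
    """
    log_post = [math.log(c["weight"]) + log_gauss_diag(x, c["mu_x"], c["var_x"])
                for c in clusters]
    peak = max(log_post)                      # subtract the max for numerical stability
    post = [math.exp(lp - peak) for lp in log_post]
    total = sum(post)
    estimate = 0.0
    for h, c in zip(post, clusters):
        # Conditional mean for this cluster: mu_f + cov_fx * inv(cov_xx) * (x - mu_x)
        cond = c["mu_f"] + sum(s / v * (xi - m)
                               for s, v, xi, m in zip(c["cov_fx"], c["var_x"], x, c["mu_x"]))
        estimate += (h / total) * cond
    return estimate


# Two toy clusters over a 2-D "MFCC" vector (made-up numbers, for illustration only)
clusters = [
    {"weight": 0.5, "mu_x": [0.0, 0.0], "var_x": [1.0, 1.0], "mu_f": 100.0, "cov_fx": [5.0, 0.0]},
    {"weight": 0.5, "mu_x": [10.0, 10.0], "var_x": [1.0, 1.0], "mu_f": 200.0, "cov_fx": [0.0, -5.0]},
]
print(predict_pitch([1.0, 0.0], clusters))
```

Because the clusters are found without supervision, nothing ties a cluster to a particular speech sound, which is the limitation noted in the last bullet of the slide.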

HMM Pitch Prediction
• GMM does not model the temporal correlation of pitch
• GMM clusters are trained unsupervised - it may be better to use supervised training
Training phase
• Model the joint distribution of pitch and MFCC using a series of HMMs
Pitch prediction
• Perform standard Viterbi decoding of the MFCC stream in the HMMs
• Use the model and state sequence information to locate the mapping for each MFCC vector and then use MAP to predict pitch
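
The two prediction steps above can be sketched as follows: decode the state sequence with Viterbi, then apply the mapping that belongs to each decoded state. This is a toy illustration under assumptions, not the project's system. It uses a single two-state left-to-right HMM, a scalar observation in place of the MFCC vector, one Gaussian per state, and made-up parameters; the function and key names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_gauss(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi(log_init, log_trans, log_emit):
    """Most likely state path; log_emit[t][s] is log p(x_t | state s)."""
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[t][s])
        back.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def predict_pitch_track(xs, states, path):
    """Per-frame conditional mean of pitch, using the Gaussian of the decoded state."""
    return [states[s]["mu_f"] + states[s]["cov_fx"] / states[s]["var_x"] * (x - states[s]["mu_x"])
            for x, s in zip(xs, path)]


# Toy two-state model over scalar observations (all numbers are illustrative)
states = [
    {"mu_x": 0.0, "var_x": 1.0, "mu_f": 110.0, "cov_fx": 2.0},
    {"mu_x": 5.0, "var_x": 1.0, "mu_f": 180.0, "cov_fx": -3.0},
]
log_init = [0.0, NEG_INF]                                   # always start in state 0
log_trans = [[math.log(0.8), math.log(0.2)], [NEG_INF, 0.0]]  # left-to-right topology
xs = [0.2, -0.1, 4.8, 5.3]
log_emit = [[log_gauss(x, st["mu_x"], st["var_x"]) for st in states] for x in xs]

path = viterbi(log_init, log_trans, log_emit)
track = predict_pitch_track(xs, states, path)
print(path, track)
```

Compared with the GMM approach, the transition structure constrains which mapping can follow which, so the choice of mapping for a frame depends on its neighbours and not on that frame alone.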

Pitch Prediction Results
• Aurora database - 200 utterances for training (50 speakers), 90 utterances for testing (23 speakers)
• 42,902 frames in total

Reconstructed Speech
Examples compared: original; MFCC + reference pitch; MFCC + HMM-based pitch

Extension to Formant Prediction
• Prediction of formants may also be possible from MFCC vectors using a similar strategy of modelling the joint distribution y = [x, f1, f2, f3, f4, …]
• Potentially stronger correlation between formants and MFCCs than between pitch and MFCCs
• Use the Brunel formant estimator to provide the frequency, bandwidth and amplitude of the formants

Why Predict Formants?
• Formant estimation from noisy speech is a difficult task and prone to errors
• Predicting them from MFCCs may be more robust
• Noise compensation methods (spectral subtraction/Wiener filtering) can be applied to the MFCCs before prediction
• Alternatively, model the joint distribution of noisy MFCCs and formants
• In effect, utilise the correlation information available inside the speech models themselves
• Formant predictions provide the clean speech information necessary for the speech enhancement component of the project
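
As an illustration of the noise compensation step mentioned above, the following is a minimal sketch of magnitude spectral subtraction with a spectral floor, applied per frequency bin (or per filterbank channel) before the log and DCT stages. The slides do not say which variant is intended, so the over-subtraction factor `alpha`, the `floor` value, the function names and the idea of estimating noise from leading noise-only frames are all assumptions.

```python
def estimate_noise(noise_frames):
    """Average magnitude per bin over frames assumed to contain noise only."""
    n_frames = len(noise_frames)
    return [sum(frame[k] for frame in noise_frames) / n_frames
            for k in range(len(noise_frames[0]))]


def spectral_subtract(noisy, noise_estimate, alpha=1.0, floor=0.01):
    """Subtract the noise estimate from each bin, never going below floor * noisy."""
    return [max(y - alpha * n, floor * y) for y, n in zip(noisy, noise_estimate)]


# Made-up magnitudes for three bins
noise = estimate_noise([[0.5, 2.0, 6.0], [1.5, 4.0, 4.0]])   # mean per bin: 1.0, 3.0, 5.0
cleaned = spectral_subtract([10.0, 2.0, 5.0], noise)
print(cleaned)
```

The floor keeps every bin strictly positive so that the subsequent logarithm stays defined when the noise estimate exceeds the observed magnitude.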

Noisy Speech Databases
• Two more noisy speech databases available
• SpeechDat-Car - Danish
• SpeechDat-Car - Spanish
• Connected digit strings recorded in a moving car under different driving conditions
• Both hands-free and close-talking microphones
• Available through the SIG in COST278 - will request that they be made available to other partners

Pi Research
• Pi Research in Cambridge specialise in data communication in Formula 1 racing
• Made an approach regarding the possibility of reducing noise on driver-to-pit crew communication
• Example - SNRs down to -30 dB

End