Transformation of Short-Term Spectral Envelope of Speech Signal Using Multivariate Polynomial Modeling P. K. Lehana P. C. Pandey {lehana, pcpandey}@ee.iitb.ac.in EE Dept, IIT Bombay 30th January, 2011 1/15
PRESENTATION OUTLINE 1. Introduction 2. Multivariate Polynomial Modeling 3. Methodology 4. Results 5. Conclusion 2/15
1. INTRODUCTION Speaker transformation Modification of the speech signal of the source speaker to make it perceptually similar to that of the target speaker. Processing steps in transformation Estimation of mapping ▫ Estimation of the source and the target parameters ▫ Alignment of the parameters ▫ Estimation of the source-to-target transformation function(s) Transformation of source speech ▫ Estimation of the source parameters ▫ Application of the transformation function(s) on the source parameters ▫ Generation of the transformed speech 3/15
Spectral parameters for transformation • Formant frequencies • Line spectral frequencies (LSFs) • Cepstral coefficients • Mel-frequency cepstral coefficients (MFCCs): robust to noise, coefficients uncorrelated with each other and hence suitable for interpolation. Transformation methods • Vector quantization (Shikano, 86): degradation in the output speech quality due to discretization of the acoustic space. • Statistical and ANN-based methods (Narendranath, 98; Stylianou, 98; Ye, 06): large set of training data and computation needed. • Frequency warping and interpolation (Rinscheid, 96; Hashimoto, 96; Jian, 07; Masuda, 07; Valbret, 92): different transformation functions needed for different acoustic classes. 4/15
Research objective • Modification of spectral characteristics by modeling the source-target relationship with a single mapping applicable to all acoustic classes, using • modeling of each parameter of the target speech as a multivariate polynomial function of all the parameters of the source speech, • harmonic plus noise model (HNM) based analysis-synthesis. 5/15
2. MULTIVARIATE POLYNOMIAL MODELING Modeling Approximation of an m-dimensional function g, known at q points (w_n), by a multivariate polynomial with terms Φ_k and error ε_n: $g(w_n) = \sum_k c_k \Phi_k(w_n) + \varepsilon_n$ at each of the q points. Coefficients c_k obtained by minimizing the sum of squared errors. Application ▫ Relationship between the parameters of the corresponding source and target frames obtained by modeling each parameter of the target speech as a multivariate polynomial function of all the parameters of the source speech. ▫ Each parameter of a target frame obtained as the corresponding function of all the parameters of the corresponding source frame. 6/15
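The least-squares fit described on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names (`poly_terms`, `fit_mapping`, `apply_mapping`), the choice of plain monomials as the terms Φ_k, and the use of NumPy's `lstsq` solver are all assumptions made for the example.

```python
import itertools
import numpy as np

def poly_terms(W, degree):
    """Evaluate every monomial of total degree <= `degree` on each row of W.

    W has shape (q, m): q aligned frames, m source parameters per frame.
    The columns of the returned (q, p) matrix play the role of the terms Phi_k.
    Plain monomials are an assumption; the slides do not list the terms.
    """
    q, m = W.shape
    cols = [np.ones(q)]  # constant term
    for d in range(1, degree + 1):
        for idx in itertools.combinations_with_replacement(range(m), d):
            # Repeated indices give powers, e.g. (0, 0) -> w_0 squared.
            cols.append(np.prod(W[:, list(idx)], axis=1))
    return np.column_stack(cols)

def fit_mapping(src, tgt, degree=2):
    """Least-squares coefficients c_k, one column per target parameter."""
    Phi = poly_terms(src, degree)
    C, *_ = np.linalg.lstsq(Phi, tgt, rcond=None)  # minimizes squared error
    return C

def apply_mapping(src, C, degree=2):
    """Predict all target parameters for each source frame."""
    return poly_terms(src, degree) @ C
```

Here `src` and `tgt` stand for time-aligned source and target parameter matrices with one frame per row. With `degree=1` the same code gives a multivariate linear mapping, and with `degree=2` a multivariate quadratic one.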
3. METHODOLOGY Processing • HNM based analysis-synthesis as platform for transformation ▫ Harmonic band parameters: voicing, pitch, maximum voiced frequency, harmonic magnitudes and phases. ▫ Noise band parameters: LP coefficients and energy. • Modification of parameters ▫ Harmonic magnitudes converted to MFCCs (20), transformed, and converted back to magnitudes; phases estimated by minimum-phase approximation. ▫ LP coefficients (20) converted to LSFs, transformed, and converted back to LP coefficients. Different transformation functions for the voiced and the unvoiced frames. ▫ Linear transformation for time and pitch scaling. 7/15
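The LP-to-LSF round trip used for the noise band can be illustrated with the textbook sum/difference polynomial construction. The sketch below is an assumption about how such a conversion might be done, not the code used in the presentation. The function names are hypothetical, and it handles only even LP orders (such as the order 20 mentioned above) with a minimum-phase A(z).

```python
import numpy as np

def lpc_to_lsf(a):
    """A(z) = 1 + a1 z^-1 + ... + ap z^-p (p even) -> p sorted LSFs in (0, pi)."""
    a = np.asarray(a, dtype=float)
    p = len(a) - 1
    assert p % 2 == 0 and np.isclose(a[0], 1.0)
    a_ext = np.concatenate([a, [0.0]])
    P = a_ext + a_ext[::-1]  # symmetric polynomial, trivial root at z = -1
    Q = a_ext - a_ext[::-1]  # antisymmetric polynomial, trivial root at z = +1
    ang = np.concatenate([np.angle(np.roots(P)), np.angle(np.roots(Q))])
    eps = 1e-6  # excludes the trivial roots at angles 0 and pi
    return np.sort(ang[(ang > eps) & (ang < np.pi - eps)])

def lsf_to_lpc(lsf):
    """Inverse of lpc_to_lsf for a sorted, even-length LSF vector."""
    lsf = np.asarray(lsf, dtype=float)
    P = np.array([1.0, 1.0])   # factor (1 + z^-1)
    Q = np.array([1.0, -1.0])  # factor (1 - z^-1)
    for w in lsf[0::2]:        # 1st, 3rd, ... LSFs are roots of P
        P = np.convolve(P, [1.0, -2.0 * np.cos(w), 1.0])
    for w in lsf[1::2]:        # 2nd, 4th, ... LSFs are roots of Q
        Q = np.convolve(Q, [1.0, -2.0 * np.cos(w), 1.0])
    return (0.5 * (P + Q))[:-1]  # A(z) = (P + Q) / 2; last coefficient is zero
```

In a transformation pipeline the mapping would be applied to the LSF vector between the two calls. The transformed LSFs must remain sorted and inside (0, π) for the reconstructed LP filter to stay stable.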
Estimation of spectral transformation functions Transformation of source speech Transformation functions investigated ▫ Univariate linear (UL) ▫ Multivariate linear (ML) ▫ Multivariate quadratic (MQ) 8/15
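To give a feel for how the three function types differ in size, the short sketch below counts the coefficients needed per target parameter. It assumes that UL uses a slope and an intercept on the same-index source parameter, ML uses all m source parameters plus a constant, and MQ adds all squares and pairwise products. These structural details are assumptions, since the slides name the models but do not spell out their terms.

```python
from math import comb

def n_coeffs(model, m):
    """Coefficients per target parameter for m source parameters (assumed forms)."""
    if model == "UL":   # c0 + c1 * w_i
        return 2
    if model == "ML":   # c0 + sum_i c_i * w_i
        return m + 1
    if model == "MQ":   # all monomials of total degree <= 2
        return comb(m + 2, 2)
    raise ValueError(f"unknown model: {model}")

# With the 20 MFCCs (or 20 LSFs) per frame mentioned in the methodology:
for model in ("UL", "ML", "MQ"):
    print(model, n_coeffs(model, 20))  # UL 2, ML 21, MQ 231
```

Under these assumptions the quadratic model has 231 coefficients per target parameter, roughly eleven times as many as the linear one.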
Evaluation Material • A Hindi story with 80 sentences (10 kHz, 16 bits) from 5 speakers (2 M, 3 F). • 77 sentences used for training, 3 for testing. Preliminary evaluation ▫ Unity transformation (same speaker as the source and the target): identity not disturbed, a small degradation in quality. ▫ Pitch modification: target identity not achieved, quality degradation similar to the unity transformation. ▫ Spectral modification: source identity changed towards target for the same-gender transformation, slightly higher degradation in quality. ▫ Spectral modification along with pitch and time scaling: source identity close to the target for all the speaker pairs, quality same as in spectral modification. 9/15
Example: “Vah padne likhane men bahut achchha tha” [Figure: panels S (source), T (target), Tr_UL, Tr_ML, Tr_MQ for the speaker pairs F1-F2, F1-M2, M1-M2, M1-F2] 10/15
Objective evaluation • Mahalanobis distance between two sets of MFCC feature vectors (P, Q), where P corresponds to the target speech and Q corresponds to the source or the transformed speech. Subjective evaluation • XAB and MOS tests (automated administration) ▫ Source, target, or modified randomly presented as X. Source or target randomly presented as A or B. ▫ No. of subjects: 6 ▫ Material: 2 sentences for each of the 4 speaker pairs ▫ No. of presentations for each stimulus: 3 11/15
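One common way to compute a Mahalanobis distance between two sets of feature vectors is to compare their mean vectors under a pooled covariance. The sketch below uses that form purely as an assumption: the exact formula on the slide was not recovered, so the definition used by the authors may differ. The function name is hypothetical.

```python
import numpy as np

def mahalanobis_between_sets(P, Q):
    """Distance between the means of P and Q under their pooled covariance.

    P, Q: arrays of shape (frames, n_mfcc). The pooled-covariance form is an
    assumption, not a formula taken from the presentation.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    diff = P.mean(axis=0) - Q.mean(axis=0)
    n_p, n_q = len(P), len(Q)
    pooled = ((n_p - 1) * np.cov(P, rowvar=False)
              + (n_q - 1) * np.cov(Q, rowvar=False)) / (n_p + n_q - 2)
    # Solve instead of inverting the covariance explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))
```

With P taken from the target speech and Q from the source or the transformed speech, a smaller value indicates that Q lies closer to the target in the MFCC space.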
4. RESULTS
• Mahalanobis distance from the target MFCCs
  Transformation   F1-F2   F1-M1   M1-F2   M1-M2
  Source           0.51    0.65    0.64    0.53
  Tr_UL            0.68    0.65    0.61    0.64
  Tr_ML            0.45    0.47    0.44    0.43
  Tr_MQ            0.38    0.39    0.38    0.33
  Highest reduction in the target-transformed distance for MQ based transformation
• XAB score (2 sentences × 3 presentations × 6 listeners, averaged across the 4 speaker pairs)
  Transformed: Tr_MQ along with pitch modification and time scaling
  Identification errors
  Source        6 %
  Target        4 %
  Transformed   8 %
• MOS score (2 sentences × 3 presentations × 6 listeners, averaged across the 4 speaker pairs)
  Transformation   UL    ML    MQ
  Score            1.7   2.8   3.1
12/15
Demo S: source, T: target, PM: pitch modified, SM: spectrum modified, TS: time scaled UL: univariate linear, ML: multivariate linear, MQ: multivariate quadratic 13/15
5. CONCLUSION • Modification of spectral characteristics feasible by modeling the source-target relationship using multivariate polynomial functions, giving a single mapping applicable to all acoustic classes, without extensive training or labeling. • Methods investigated for transformation function: UL, ML, MQ. MQ resulted in satisfactory identity transformation and fair quality. Further work ▫ Listening tests involving a larger number of speaker pairs and listeners. ▫ Comparison with other transformation techniques. 14/15
Thank you 15/15