
Transformation of Short-Term Spectral Envelope of Speech Signal

Transformation of Short-Term Spectral Envelope of Speech Signal Using Multivariate Polynomial Modeling. P. K. Lehana, P. C. Pandey ({lehana, pcpandey}@ee.iitb.ac.in), EE Dept, IIT Bombay. 30th January, 2011.





Presentation Transcript


  1. Transformation of Short-Term Spectral Envelope of Speech Signal Using Multivariate Polynomial Modeling
     P. K. Lehana, P. C. Pandey
     {lehana, pcpandey}@ee.iitb.ac.in
     EE Dept, IIT Bombay
     30th January, 2011

  2. PRESENTATION OUTLINE
     1. Introduction
     2. Multivariate Polynomial Modeling
     3. Methodology
     4. Results
     5. Conclusion

  3. 1. INTRODUCTION
     Speaker transformation
     • Modification of the speech signal of the source speaker to make it perceptually similar to that of the target speaker.
     Processing steps in transformation
     • Estimation of mapping
       ▫ Estimation of the source and the target parameters
       ▫ Alignment of the parameters
       ▫ Estimation of the source-to-target transformation function(s)
     • Transformation of source speech
       ▫ Estimation of the source parameters
       ▫ Application of the transformation function(s) on the source parameters
       ▫ Generation of the transformed speech

  4. Spectral parameters for transformation
     • Formant frequencies
     • Line spectral frequencies (LSFs)
     • Cepstral coefficients
     • Mel frequency cepstrum coefficients (MFCCs): robust to noise; coefficients uncorrelated with each other and hence suitable for interpolation.
     Transformation methods
     • Vector quantization (Shikano, 86): degradation in the output speech quality due to discretization of the acoustic space.
     • Statistical and ANN (Narendranath, 98; Stylianou, 98; Ye, 06): large set of training data and computation needed.
     • Frequency warping and interpolation (Rinscheid, 96; Hashimoto, 96; Jian, 07; Masuda, 07; Valbret, 92): different transformation functions needed for different acoustic classes.

  5. Research objective
     Modification of spectral characteristics by
     • modeling the source-target relationship using a single mapping applicable to all acoustic classes,
     • modeling each parameter of the target speech as a multivariate polynomial function of all the parameters of the source speech,
     • using harmonic plus noise model (HNM) based analysis-synthesis.

  6. 2. MULTIVARIATE POLYNOMIAL MODELING
     Modeling
     • Approximation of an m-dimensional function g, known at q points w_n, by a multivariate polynomial with terms Φ_k and error ε_n: $g(w_n) = \sum_k c_k \Phi_k(w_n) + \epsilon_n$
     • Coefficients c_k obtained by minimizing the sum of squared errors.
     Application
     ▫ Relationship between the parameters of the corresponding source and target frames obtained by modeling each parameter of the target speech as a multivariate polynomial function of all the parameters of the source speech.
     ▫ Each parameter of a target frame obtained as the corresponding function of all the parameters of the corresponding source frame.
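The least-squares step described on this slide can be written out as a short worked derivation. This is a sketch under assumptions: the number of polynomial terms p, the matrix A, and the vectors g and c are notation introduced here for illustration and do not appear on the slide.

```latex
% Model at each of the q known points (p terms assumed)
g(w_n) = \sum_{k=0}^{p-1} c_k \, \Phi_k(w_n) + \epsilon_n , \qquad n = 0, \dots, q-1

% Stack the q equations: A_{nk} = \Phi_k(w_n), \; \mathbf{g} = [g(w_0), \dots, g(w_{q-1})]^T
E = \sum_{n=0}^{q-1} \epsilon_n^2 = \lVert \mathbf{g} - \mathbf{A}\mathbf{c} \rVert^2

% Setting \partial E / \partial \mathbf{c} = 0 gives the normal equations
\mathbf{A}^T \mathbf{A} \, \hat{\mathbf{c}} = \mathbf{A}^T \mathbf{g}
\quad \Rightarrow \quad
\hat{\mathbf{c}} = (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T \mathbf{g}
```

The closed form holds when A has full column rank, which requires at least as many points as terms (q ≥ p). For a quadratic polynomial in m = 2 variables, for example, the terms would be 1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, so p = 6.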

  7. 3. METHODOLOGY
     Processing
     • HNM based analysis-synthesis as platform for transformation
       ▫ Harmonic band parameters: voicing, pitch, maximum voiced frequency, harmonic magnitudes and phases.
       ▫ Noise band parameters: LP coefficients and energy.
     • Modification of parameters
       ▫ Harmonic magnitudes converted to MFCCs (20), transformed, and converted back to magnitudes; phases estimated by minimum-phase approximation.
       ▫ LP coefficients (20) converted to LSFs, transformed, and converted back to LP coefficients. Different transformation functions for the voiced and the unvoiced frames.
       ▫ Linear transformation for time and pitch scaling.

  8. • Estimation of spectral transformation functions
     • Transformation of source speech
     • Transformation functions investigated
       ▫ Univariate linear (UL)
       ▫ Multivariate linear (ML)
       ▫ Multivariate quadratic (MQ)
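The three function families can be compared with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names `design_matrix`, `fit`, and `transform` are hypothetical, UL is taken to mean that each target parameter is a linear function of the same-index source parameter only, and MQ is taken to include all squares and cross products of the source parameters.

```python
import numpy as np

def design_matrix(X, kind):
    """Polynomial terms for aligned source frames X of shape (q, m)."""
    q, m = X.shape
    ones = np.ones((q, 1))
    if kind == "ML":
        # constant + every source parameter
        return np.hstack([ones, X])
    if kind == "MQ":
        # constant + linear terms + all products x_i * x_j with i <= j (assumed)
        i, j = np.triu_indices(m)
        return np.hstack([ones, X, X[:, i] * X[:, j]])
    raise ValueError(f"unknown kind: {kind}")

def fit(X, Y, kind):
    """Least-squares fit from source frames X (q, m) to target frames Y (q, m)."""
    if kind == "UL":
        # assumed: target parameter d depends only on source parameter d
        return np.array([np.polyfit(X[:, d], Y[:, d], 1) for d in range(X.shape[1])])
    A = design_matrix(X, kind)
    C, *_ = np.linalg.lstsq(A, Y, rcond=None)  # one coefficient column per target parameter
    return C

def transform(X, model, kind):
    """Apply a fitted model to source frames X."""
    if kind == "UL":
        return X * model[:, 0] + model[:, 1]  # slope and intercept per parameter
    return design_matrix(X, kind) @ model
```

Under these assumptions, 20 source parameters give 1 + 20 + 210 = 231 MQ terms per target parameter, so the training set needs comfortably more aligned frames than that for the fit to be well conditioned.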

  9. Evaluation
     Material
     • A Hindi story with 80 sentences (10 kHz, 16 bits) from 5 speakers (2 M, 3 F).
     • 77 sentences used for training, 3 for testing.
     Preliminary evaluation
     ▫ Unity transformation (same speaker as the source and the target)
       • Identity not disturbed, a small degradation in quality.
     ▫ Pitch modification
       • Target identity not achieved, quality degradation similar to the unity transformation.
     ▫ Spectral modification
       • Source identity changed towards target for the same gender transformation, slightly higher degradation in quality.
     ▫ Spectral modification along with pitch and time scaling
       • Source identity close to the target for all the speaker pairs, quality same as in spectral modification.

  10. Example: “Vah padne likhane men bahut achchha tha”
      [Figure panels: S, T, Tr_UL, Tr_ML, Tr_MQ for the speaker pairs F1-F2, F1-M2, M1-M2, M1-F2]

  11. Objective evaluation
      • Mahalanobis distance between two sets of MFCC feature vectors (P, Q), where P corresponds to the target speech and Q corresponds to the source or the transformed speech.
      Subjective evaluation
      • XAB and MOS tests (automated administration)
        ▫ Source, target, or modified randomly presented as X. Source or target randomly presented as A or B.
        ▫ No. of subjects: 6
        ▫ Material: 2 sentences for each of the 4 speaker pairs
        ▫ No. of presentations for each stimulus: 3
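The distance formula from this slide did not survive in the transcript. The sketch below shows one common way to compute a Mahalanobis distance between two sets of feature vectors: the distance between the set means under a pooled covariance. Both that definition and the function name `mahalanobis_between_sets` are assumptions and may differ from what the authors used.

```python
import numpy as np

def mahalanobis_between_sets(P, Q):
    """Mahalanobis distance between the means of two feature sets.

    P, Q: arrays of shape (frames, coefficients), e.g. MFCC vectors.
    Assumed definition: mean difference weighted by the inverse of the
    average of the two sample covariance matrices.
    """
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    pooled = 0.5 * (np.cov(P, rowvar=False) + np.cov(Q, rowvar=False))
    diff = mu_p - mu_q
    # solve() avoids forming the explicit inverse of the covariance matrix
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))
```

With P taken from the target speech and Q from the source or the transformed speech, a smaller value would indicate that Q lies closer to the target in this feature space.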

  12. 4. RESULTS
      • Mahalanobis distance from the target MFCCs

        Transformation   F1-F2   F1-M1   M1-F2   M1-M2
        Source           0.51    0.65    0.64    0.53
        Tr_UL            0.68    0.65    0.61    0.64
        Tr_ML            0.45    0.47    0.44    0.43
        Tr_MQ            0.38    0.39    0.38    0.33

        Highest reduction in the target-transformed distance for MQ based transformation.
      • XAB score (2 sentences × 3 presentations × 6 listeners, averaged across the 4 speaker pairs)
        Transformed: Tr_MQ along with pitch modification and time scaling

        Identification errors
        Source        6 %
        Target        4 %
        Transformed   8 %

      • MOS score (2 sentences × 3 presentations × 6 listeners, averaged across the 4 speaker pairs)

        Transformation   UL    ML    MQ
        Score            1.7   2.8   3.1
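The size of each reduction can be read off the distance table with a little arithmetic. The sketch below computes the relative change in target distance for each method against the untransformed source, using only the values from the table; the names `distances` and `relative_reduction` are illustrative.

```python
# Mahalanobis distances to the target, copied from the results table
pairs = ["F1-F2", "F1-M1", "M1-F2", "M1-M2"]
distances = {
    "Source": [0.51, 0.65, 0.64, 0.53],
    "Tr_UL": [0.68, 0.65, 0.61, 0.64],
    "Tr_ML": [0.45, 0.47, 0.44, 0.43],
    "Tr_MQ": [0.38, 0.39, 0.38, 0.33],
}

def relative_reduction(method):
    """Fractional reduction in distance versus the source, per speaker pair."""
    return [(s - t) / s for s, t in zip(distances["Source"], distances[method])]

for method in ("Tr_UL", "Tr_ML", "Tr_MQ"):
    row = ", ".join(f"{p}: {r:+.0%}" for p, r in zip(pairs, relative_reduction(method)))
    print(f"{method}  {row}")
```

For F1-F2, for instance, MQ lowers the distance from 0.51 to 0.38, roughly a quarter, whereas UL raises it from 0.51 to 0.68.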

  13. Demo
      S: source, T: target, PM: pitch modified, SM: spectrum modified, TS: time scaled
      UL: univariate linear, ML: multivariate linear, MQ: multivariate quadratic

  14. 5. CONCLUSION
      • Modification of spectral characteristics is feasible by modeling the source-target relationship using multivariate polynomial functions, giving a single mapping applicable to all acoustic classes, without extensive training or labeling.
      • Methods investigated for the transformation function: UL, ML, MQ. MQ resulted in satisfactory identity transformation and fair quality.
      • Further work
        ▫ Listening tests involving a larger number of speaker pairs and listeners.
        ▫ Comparison with other transformation techniques.

  15. Thank you
