
Kumi Ohta, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano

Regression Approaches to Voice Quality Control Based on One-to-Many Eigenvoice Conversion. Kumi Ohta, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano. Nara Institute of Science and Technology (NAIST), Japan. August 23rd, 2007.


Presentation Transcript


  1. Regression Approaches to Voice Quality Control Based on One-to-Many Eigenvoice Conversion Kumi Ohta, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano Nara Institute of Science and Technology (NAIST), Japan August 23rd, 2007

  2. Voice Quality Control: a technique for converting a user's voice quality into another one. Applications: • Amusement devices • Speech enhancement devices for a speaking aid system (recovering a disabled person's voice) • Speech enhancement devices for a hearing aid system (making speech sounds more intelligible). Development of voice quality control with high quality and high controllability is desired!

  3. Contents 1. Conventional voice quality control methods 2. Proposed voice quality control methods 3. Experimental verification 4. Conclusions

  4. One-to-Many Eigenvoice Conversion (EVC) [Toda et al., 2006] • A source speaker's voice is statistically converted into an arbitrary speaker's one. [Figure: in training, an eigenvoice GMM (EV-GMM) is built from parallel data between the source speaker and multiple pre-stored target speakers; in conversion, voices of arbitrary speakers are obtained by manual setting.]

  5. Eigenvoice GMM (EV-GMM) • Parameters of the i-th mixture: weight, mean vector (source mean vector and target mean vector), and covariance matrix. • The target mean vector is the eigenvectors (for eigenvoices) applied to the weights for eigenvoices, plus the bias vector (for the average voice). • The weights for eigenvoices are free parameters; the other parameters are speaker independent. • Converted voice quality is controlled by the weights for eigenvectors. Problem: eigenvoices do NOT represent a specific physical meaning (such as a masculine voice or a clear voice). Intuitive control of the converted voice quality is difficult!
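The way the eigenvoice weights move the target mean vector of one mixture can be sketched in a few lines. This is only an illustration, assuming the target mean is the eigenvector matrix times the weight vector plus the bias vector; the function name `target_mean` and all numbers are hypothetical, not taken from the slides.

```python
def target_mean(eigenvectors, weights, bias):
    """Return B w + b for one mixture component.

    eigenvectors: D rows, each holding J values (one column per eigenvoice)
    weights:      J eigenvoice weights (the free parameters)
    bias:         D values (the average-voice mean)
    """
    return [
        sum(b_dj * w_j for b_dj, w_j in zip(row, weights)) + b_d
        for row, b_d in zip(eigenvectors, bias)
    ]


# Hypothetical toy values: D = 3 feature dimensions, J = 2 eigenvoices.
B = [[1.0, 0.0],
     [0.5, -1.0],
     [0.0, 2.0]]
b0 = [0.1, 0.2, 0.3]

average = target_mean(B, [0.0, 0.0], b0)   # zero weights give the average voice
shifted = target_mean(B, [1.0, 0.5], b0)   # non-zero weights move the mean
```

Setting all weights to zero leaves only the bias vector, which is why it corresponds to the average voice. Nothing in the weights themselves says which perceptual quality changes, which is the problem the slide points out.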

  6. Contents 1. Conventional voice quality control methods 2. Proposed voice quality control methods 3. Experimental verification 4. Conclusions

  7. Proposed Framework We would like to intuitively control the converted voice quality! We propose multiple regression approaches to one-to-many EVC. Converted voice quality is controlled with the voice quality control vector. * Similar approaches have been proposed in HMM-based speech synthesis [Tachibana et al., 2006].

  8. Process of Proposed Framework 1. Preparing multiple parallel data sets 2. Setting the voice quality control vector for every pre-stored target speaker 3. Modeling the target mean vectors with voice quality control vector

  9. Setting Voice Quality Control Vector • We manually assign scores for expression word pairs to each pre-stored target speaker. • Assigned scores are used as components of the voice quality control vector. • Score scale: -3 (very), -2 (quite), -1 (somewhat), 0 (no preference), 1 (somewhat), 2 (quite), 3 (very). • Expression word pairs: Masculine / Feminine, Hoarse / Clear, Elderly / Youthful, Thin / Deep, Lax / Tense. • Example: the scores assigned to speaker A are 2, 1, 1, -2, -1, so the voice quality control vector for speaker A is (2, 1, 1, -2, -1).
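As a small illustration of how assigned scores become a control vector, the sketch below validates one score per word pair and stacks them in a fixed order. The names `WORD_PAIRS` and `control_vector` are hypothetical, and pairing speaker A's scores with the word pairs in the order listed is an assumption, since the slide layout is not preserved in the transcript.

```python
# Expression word pairs from the slide, in the order they are listed.
WORD_PAIRS = [
    ("masculine", "feminine"),
    ("hoarse", "clear"),
    ("elderly", "youthful"),
    ("thin", "deep"),
    ("lax", "tense"),
]


def control_vector(scores):
    """Turn a {word_pair: score} mapping into a voice quality control vector.

    Each score must be an integer on the 7-point scale from -3 to 3.
    """
    vector = []
    for pair in WORD_PAIRS:
        score = scores[pair]
        if not isinstance(score, int) or not -3 <= score <= 3:
            raise ValueError(f"score for {pair} must be an integer in [-3, 3]")
        vector.append(score)
    return vector


# Scores shown for speaker A; the pair-to-score alignment is assumed.
speaker_a = dict(zip(WORD_PAIRS, [2, 1, 1, -2, -1]))
vec_a = control_vector(speaker_a)
```

Keeping the pair order fixed matters because every pre-stored target speaker must use the same component layout for the regression step that follows.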

  10. Process of Proposed Framework 1. Preparing multiple parallel data sets 2. Setting the voice quality control vector for every pre-stored target speaker 3. Modeling the target mean vectors with voice quality control vector We propose 3 regression methods.

  11. Proposed Method A: least-squares (LS) estimation of regression parameters converting the voice quality control vector into principal components • The principal components for the s-th target speaker are modeled from the voice quality control vector for the s-th target speaker through the regression parameters. • The regression parameters minimize the total error over all pre-stored target speakers, where the error for the s-th pre-stored target speaker is the error of its principal components.
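A least-squares fit of this kind can be sketched without any external library. The code below is a minimal illustration, not the authors' implementation: it assumes an affine map (regression matrix plus a bias term, which the transcript does not confirm for Method A), solves the normal equations by Gauss-Jordan elimination, and uses made-up data. The names `solve`, `fit_least_squares`, and `predict` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b (b may have several columns) by Gauss-Jordan elimination."""
    n = len(a)
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: control vectors are not varied enough")
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def fit_least_squares(control_vectors, components):
    """Fit components ~ R z + r0 over all speakers; return (R, r0)."""
    xs = [list(z) + [1.0] for z in control_vectors]  # append 1 for the bias
    k1, d = len(xs[0]), len(components[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k1)] for i in range(k1)]
    xty = [[sum(x[i] * y[j] for x, y in zip(xs, components)) for j in range(d)]
           for i in range(k1)]
    coef = solve(xtx, xty)  # (K + 1) rows, D columns
    R = [[coef[i][j] for i in range(k1 - 1)] for j in range(d)]
    r0 = coef[k1 - 1]
    return R, r0


def predict(R, r0, z):
    """Map a control vector z to predicted principal components."""
    return [sum(r * zi for r, zi in zip(row, z)) + b for row, b in zip(R, r0)]


# Hypothetical toy data: 4 speakers, 2-dim control vectors, 2 principal components.
controls = [[0, 0], [1, 0], [0, 1], [2, -1]]
principal = [[0.5, -1.0], [1.5, -0.5], [-1.5, -1.0], [4.5, 0.0]]
R, r0 = fit_least_squares(controls, principal)
new_components = predict(R, r0, [1, 1])
```

Once fitted, `predict` turns any control vector into eigenvoice weights, which is how an intuitive setting would replace manual weight tuning under this method.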

  12. Resulting EV-GMM in Method A • Parameters of the i-th mixture: weight, mean vector, and covariance matrix. • The target mean vector is built from the eigenvectors, the regression parameters applied to the voice quality control vector, and the bias vector. • The regression parameters are the training parameters; the other parameters are the speaker-independent EV-GMM parameters. Problem: the desired voice characteristics might not be represented as a linear combination of eigenvectors. Changing the eigenvectors themselves is necessary!

  13. Proposed Method B: LS estimation of regression parameters converting the voice quality control vector into the target mean vectors • The target mean vector for the s-th target speaker is modeled from the voice quality control vector for the s-th target speaker through the regression parameters. • The regression parameters minimize the total error over all pre-stored target speakers, where the error for the s-th pre-stored target speaker is the error of its target mean vectors.
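The equations on this slide are not preserved in the transcript. One plausible way to write the model and the error function it describes is given below; the symbols are assumptions introduced here ($z^{(s)}$ for the voice quality control vector of speaker $s$, $\mu_i^{(s)}$ for the target mean vector of mixture $i$, $R_i$ and $r_i$ for the regression matrix and bias of mixture $i$, $S$ speakers, $M$ mixtures), and the presence of a bias term and the sum over mixtures are likewise assumed.

```latex
% Assumed form of the Method B regression model (per mixture i, speaker s)
\hat{\mu}_i^{(s)} = R_i \, z^{(s)} + r_i

% Assumed least-squares objective: per-speaker errors summed over all speakers
\epsilon = \sum_{s=1}^{S} \epsilon^{(s)}, \qquad
\epsilon^{(s)} = \sum_{i=1}^{M} \left\| \mu_i^{(s)} - R_i \, z^{(s)} - r_i \right\|^{2}
```

Compared with Method A, the regression output here is the target mean vector itself rather than the principal components, so the result is no longer restricted to the span of the eigenvectors.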

  14. Resulting EV-GMM in Method B • Parameters of the i-th mixture: weight, mean vector, and covariance matrix. • The target mean vector is modeled from the voice quality control vector through the regression parameters. • The regression parameters are the training parameters; the other parameters are the speaker-independent EV-GMM parameters. Problem: the desired voice quality might not be obtained because the converted voice quality is affected by all EV-GMM parameters.

  15. Proposed Method C: maximum likelihood (ML) estimation of all EV-GMM parameters while fixing the voice quality control vector • The target mean vector for the s-th target speaker is modeled from the voice quality control vector for the s-th target speaker through the regression parameters. • The parameters maximize the total likelihood over all pre-stored target speakers, which combines the likelihood of the adapted EV-GMM for each pre-stored target speaker. * This process is considered as speaker adaptive training (SAT) of EV-GMM [Ohtani et al., Interspeech 2007].

  16. Resulting EV-GMM in Method C • Parameters of the i-th mixture: weight, mean vector, and covariance matrix. • The target mean vector is modeled from the voice quality control vector through the regression parameters. • All of these parameters are training parameters.

  17. Comparison of Proposed Methods

  18. Contents 1. Conventional voice quality control methods 2. Proposed voice quality control methods 3. Experimental verification 4. Conclusions

  19. Verification of Proposed Methods • Objective verification • Subjective verification Experimental conditions

  20. Objective Verification: is the correspondence between the voice quality control vector and the converted voice quality appropriately modeled? • For each pre-stored target speaker in the training data, the following two voice quality control vectors were compared: 1. The manually assigned one. 2. The one adjusted* on the trained EV-GMM so that the converted voice quality becomes similar to the target. * Approximately determined by maximum likelihood eigen-decomposition for EV-GMM [Toda et al., 2006] using two sentences. • The Euclidean distance and the correlation coefficient between those two vectors were calculated as objective measures.
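The two objective measures are simple to compute. The sketch below assumes the correlation coefficient is the Pearson correlation (the slide does not name the variant); the function names are hypothetical, and the "adjusted" vector is invented purely for illustration.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (norm_u * norm_v)


# Manually assigned vector for one speaker, and a hypothetical adjusted vector.
assigned = [2, 1, 1, -2, -1]
adjusted = [1.5, 0.5, 1.0, -1.0, -1.5]

distance = euclidean_distance(assigned, adjusted)
corr = correlation(assigned, adjusted)
```

A smaller distance and a correlation closer to 1 both indicate that the control vector recovered from the trained model agrees with the manually assigned one.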

  21. Results of Objective Verification [Figure: objective measures for the proposed methods and for reassigned* scores, with the better and worse directions marked; one result is annotated "Too consistent compared with human judgment?"] * Reassigned: scores assigned by the same listener a second time on a different day. 1. The method A does not work at all. 2. The method B works but not so well. 3. The method C works reasonably well.

  22. Subjective Verification: which is better, the method B or the method C? • A preference test on the converted speech quality was conducted. • The average voices* produced by the trained EV-GMMs were compared; they have very similar speaker individuality in both methods B and C. * Converted voices when setting every component of the voice quality control vector to zero. [Table: experimental conditions]

  23. Result of Subjective Verification • The method B outperforms the method C. Possible explanation: • The EV-GMM parameters trained with the EM algorithm converged to local optima because an inappropriate initial model (i.e., the target-independent GMM) was used.

  24. Contents 1. Conventional voice quality control methods 2. Proposed voice quality control methods 3. Experimental verification 4. Conclusions

  25. Conclusions • Proposal of regression approaches to the voice quality control based on one-to-many eigenvoice conversion (EVC) • Based on a statistical conversion framework • Allowing intuitive control of converted voice quality with voice quality control vector • Experimental verification • Showing the possibility that voice quality control with high quality and high controllability is realized.
