Simulating Masan Dialect with Seoul Dialect: Prosody Transfer Study

Explore simulating Masan dialect with Seoul dialect through prosody transfer using the PSOLA algorithm for multi-dialectal TTS systems. Test the viability of simulating dialects combining speech segments and prosody elements.

Presentation Transcript


  1. Dialect Simulation through Prosody Transfer: A preliminary study on simulating Masan dialect with Seoul dialect • Kyuchul Yoon, Division of English, Kyungnam University • The Autumn Conference of The Association of Modern British & American Language & Literature, University of Ulsan, 2006. 11. 04

  2. Table of Contents • Background & motivation • Goal of the current work • Prosody transfer (PSOLA algorithm) • Preparation of stimuli • Listening test & evaluation • Future work

  3. Background & motivation • Differences among dialects • Segmental differences • Fricative differences in the time domain (Lee, 2002) • Busan fricatives have shorter frication/aspiration intervals than Seoul fricatives • Fricative differences in the frequency domain (Kim et al., 2002) • The low cutoff frequency of Kyungsang fricatives was higher than that of Cholla fricatives (> 1,000 Hz) • Non-segmental or prosodic differences • Intonation or fundamental frequency (F0) contour difference • Intensity contour difference • Segment durational difference • Voice quality difference

  4. Background & motivation • Concatenative text-to-speech (TTS) synthesizers • Concatenation-based • Concatenation units: e.g. diphones • Concatenation units from pre-recorded utterances of a particular dialect • No need for modeling segmental properties (cf. formant-based synthesizers) • Strength/Weakness • Usually single dialect

  5. Background & motivation • To build a multi-dialectal TTS synthesizer • Concatenation units: Multiple dialects • User-selectable dialects • Question: • Scenario A: A multi-dialectal TTS system containing multiple concatenation units from all the dialects involved • Scenario B: Use the concatenation units from a single dialect and simulate the other dialects

  6. Background & motivation • The answer has implications for the cost and the complexity of building multi-dialect TTS systems. • Scenario B • Simpler & cheaper • Need for simulating the segmental/non-segmental aspects of the other dialects involved. • Scenario A may be the ultimate solution • Concatenative TTS systems • Since modeling the segmental aspects of the concatenation units in the frequency domain can be difficult, the non-segmental or prosodic aspects should be manipulated instead.

  7. Background & motivation • The imaginary TTS system (Scenario B) • [Diagram: concatenation units from dialect 1 → simulate prosodic aspects → dialect 2, dialect 3, dialect 4]

  8. Background & motivation • The questions are: Would the simulated dialects be good enough? In other words, would the segmental effects be negligible in perceiving the simulated dialects as authentic?

  9. Goal of the current work • The goal is to test the viability of this scenario with an imaginary system: • Simulate Masan dialect with Seoul dialect • The simulated Masan dialect will have • the speech segments of Seoul dialect • the prosody of Masan dialect (F0, intensity, duration) • the voice source of Masan dialect (not tested)

  10. Goal of the current work • The imaginary system would • have the concatenation units from Seoul dialect, • have a ‘near-perfect’ prosody-generating module, and • have to simulate the other dialects, e.g. Masan dialect • The imaginary TTS system will be implemented with • the recorded utterances of Seoul dialect • the Masan prosody (F0, intensity, duration) from recorded Masan utterances • the voice source of recorded Masan utterances (not tested)

  11. Prosody transfer (PSOLA algorithm) • Three aspects of the prosody • Fundamental frequency (F0) contour • Intensity contour • Segmental durations • Pitch-Synchronous OverLap and Add (PSOLA) algorithm (Moulines & Charpentier, 1990) • Implemented in Praat (Boersma, 2005) • Use of a script for semi-automatic segment-by-segment manipulation (Yoon, 2006)

  12. Prosody transfer (PSOLA algorithm) • PSOLA algorithm • Windowing pitch periods of the original signal • Rearranging windowed pitch periods to • Stretch/shrink the signal (involves adding/deleting windowed pitch periods) • Change, i.e. increase/decrease, the F0 of the signal (involves adding/deleting windowed pitch periods)

  13. Prosody transfer (PSOLA algorithm) • [Figure: original waveform; windowed waveform with pitch periods numbered 1 to 19; shortened waveform built from periods 1, 4, 7, 10, 13, 16, 19; waveform with lower F0 built from periods 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
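The pitch-period rearrangement described on the two PSOLA slides can be sketched in a few lines. This is a toy time-domain overlap-add under strong assumptions (a synthetic signal with constant F0 and pitch marks given in advance), not the Praat implementation used in the study; the function name `psola` and its parameters `duration_factor` and `f0_factor` are hypothetical.

```python
import numpy as np

def psola(signal, marks, duration_factor=1.0, f0_factor=1.0):
    """Toy TD-PSOLA: returns (output signal, indices of the pitch periods used)."""
    signal = np.asarray(signal, dtype=float)
    marks = np.asarray(marks, dtype=int)
    period = int(np.median(np.diff(marks)))              # assumes near-constant F0
    new_period = max(1, int(round(period / f0_factor)))  # spacing of synthesis marks
    out_len = int(round(len(signal) * duration_factor))
    window = np.hanning(2 * period + 1)                  # two-period Hann window
    out = np.zeros(out_len + 2 * period)                 # one period of padding per side
    used = []
    s = int(round(marks[0] * duration_factor))           # first synthesis mark
    while s < out_len:
        # Pick the analysis pitch period closest to the matching input time.
        k = int(np.argmin(np.abs(marks - s / duration_factor)))
        m = marks[k]
        # Overlap-add the windowed pitch period at the synthesis mark.
        out[s:s + 2 * period + 1] += window * signal[m - period:m + period + 1]
        used.append(k)
        s += new_period
    return out[period:period + out_len], used

# Synthetic "voiced" signal: a sine with a pitch period of 80 samples.
period = 80
signal = np.sin(2 * np.pi * np.arange(1600) / period)
marks = np.arange(period, len(signal) - period, period)  # hypothetical pitch marks

shortened, kept_short = psola(signal, marks, duration_factor=0.5)  # half as long
lowered, kept_low = psola(signal, marks, f0_factor=0.5)            # F0 one octave lower
```

In both calls roughly every second windowed pitch period is kept. Shortening places the kept periods at the original spacing, whereas lowering F0 places them twice as far apart so that the overall duration is preserved.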

  14. Prosody transfer (PSOLA algorithm) • Prosody transfer using the PSOLA algorithm • Align segments between Masan and Seoul utterances • Make the segment durations of the two identical • Make the two F0 contours identical • Make the two intensity contours identical
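The alignment and duration steps listed above can be illustrated with a short sketch. The segment boundaries below are made-up values for the five segments of 바람, not measurements from the study, and the helper names `duration_factors` and `warp_time` are hypothetical.

```python
import numpy as np

# Hypothetical segment boundaries in seconds for ㅂ ㅏ ㄹ ㅏ ㅁ (6 boundaries, 5 segments).
masan_bounds = np.array([0.00, 0.05, 0.17, 0.22, 0.36, 0.45])  # prosody donor
seoul_bounds = np.array([0.00, 0.06, 0.15, 0.21, 0.30, 0.42])  # prosody recipient

def duration_factors(donor_bounds, recipient_bounds):
    """Per-segment factor by which each recipient segment must be stretched (>1) or shrunk (<1)."""
    return np.diff(donor_bounds) / np.diff(recipient_bounds)

def warp_time(t, recipient_bounds, donor_bounds):
    """Map a time in the recipient utterance to the aligned time in the donor utterance."""
    return np.interp(t, recipient_bounds, donor_bounds)

factors = duration_factors(masan_bounds, seoul_bounds)
new_seoul_durations = np.diff(seoul_bounds) * factors  # now equal to the Masan durations
```

Once the durations match, the two utterances share one time axis, so the donor's F0 and intensity values can be read off at the same time points and imposed on the recipient.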

  15. Prosody transfer (PSOLA algorithm) • Align segments between Masan and Seoul utterances • Make the segment durations of the two utterances identical • [Figure: segments ㅂ ㅏ ㄹ ㅏ ㅁ of “…바람…” in the Masan and Seoul utterances, with stretch/shrink arrows between matching segments]

  16. Prosody transfer (PSOLA algorithm) • Make the two F0 contours identical • [Figure: Masan F0 and Seoul F0 contours over the segments ㅂ ㅏ ㄹ ㅏ ㅁ of the Masan and Seoul utterances]

  17. Prosody transfer (PSOLA algorithm) • Make the two intensity contours identical • [Figure: Masan intensity and Seoul intensity contours over the segments ㅂ ㅏ ㄹ ㅏ ㅁ of the Masan and Seoul utterances]
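One simple way to make two intensity contours identical is a frame-wise gain, sketched below. It assumes the two signals have already been given equal durations; the function name `match_intensity`, the frame length, and the random test signals are all hypothetical, and the Praat script used in the study may work differently.

```python
import numpy as np

def match_intensity(recipient, donor, frame=160):
    """Scale each frame of `recipient` so that its RMS matches the same frame of `donor`."""
    n = min(len(recipient), len(donor)) // frame * frame  # whole frames only
    r = np.asarray(recipient[:n], dtype=float).reshape(-1, frame)
    d = np.asarray(donor[:n], dtype=float).reshape(-1, frame)
    eps = 1e-12                                           # avoids division by zero in silence
    r_rms = np.sqrt(np.mean(r ** 2, axis=1)) + eps
    d_rms = np.sqrt(np.mean(d ** 2, axis=1)) + eps
    gain = d_rms / r_rms                                  # equals 10 ** (dB difference / 20)
    return (r * gain[:, None]).reshape(-1)

rng = np.random.default_rng(0)
seoul = rng.normal(scale=0.1, size=1600)   # stand-in for the duration-matched Seoul signal
masan = rng.normal(scale=0.3, size=1600)   # stand-in for the Masan signal
seoul_matched = match_intensity(seoul, masan)
```

A constant gain per frame introduces small steps at frame edges; interpolating the gain between frame centres would be smoother, at the cost of a slightly longer sketch.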

  18. Preparation of test stimuli

  19. Preparation of control stimuli

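  20. Preparation of experiment stimuli • Masan dialect: prosody-donor (A), prosody-recipient (B) • Seoul dialect: prosody-recipient (C), prosody-recipient (D) • Sentences • 바다에 보물섬이 없다 (“There is no treasure island in the sea”) • 교수님 가시는 길이 구미로… (“The road the professor is taking, to Gumi…”) • 동대구에 볼 일이 없습니다 (“I have no business in Dongdaegu”) • 쌀 사고 난 후에 와라 (“Come after you have bought rice”) • 바람이 불어서 먼지가 많다 (“The wind is blowing, so there is a lot of dust”) • 싸기는 해 보여도, 비싸기는 … (“It may look cheap, but as for being expensive …”) • 서울에 사는 삼촌이 왔다 (“My uncle who lives in Seoul has come”) • Stimulus sets: 7 control stimuli (used), 7 test stimuli (used), test stimuli (not used)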

  21. Listening test & evaluation • 14 test/control stimuli normalized & randomized • Presented to 4 Masan listeners for magnitude estimation • On a scale of 1 (bad) to 10 (best) • Qualitatively assessed • Used the Praat ExperimentMFC object • Repetition of each stimulus: up to 10 times (the listener can press the “replay” button)
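The randomization of the 14 stimuli can be sketched as follows. The stimulus labels and the helper `presentation_order` are hypothetical placeholders; in the study itself the presentation was handled by Praat's ExperimentMFC object.

```python
import random

# Hypothetical labels for the 7 control and 7 test stimuli.
control = [f"control_{i}" for i in range(1, 8)]
test = [f"test_{i}" for i in range(1, 8)]

def presentation_order(stimuli, seed):
    """Return a shuffled copy of the stimulus list; a fixed seed makes the order reproducible."""
    order = list(stimuli)
    random.Random(seed).shuffle(order)
    return order

order = presentation_order(control + test, seed=0)
```

Using a different seed per listener would give each of them a different order while keeping every session reproducible.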

  22. Listening test & evaluation

  23. Listening test & evaluation

  24. Future work • Carefully control the phonological, morphological, and syntactic aspects of the test sentences • Try the voice source (as opposed to the filter) of Masan utterances

  25. Future work • Compare spectra between Masan and Seoul /i/ • Window length: 50 ms • [Figure: spectra from 바람이, with H1 & H2 labeled]

  26. Original Masan dialect • Original Seoul dialect • Simulated Masan dialect: Seoul segments + Masan prosody • Simulated Masan dialect: Seoul segments + Masan prosody + Masan voice source

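  27. Appendix • Seoul dialect: prosody-donor (A), prosody-recipient (B) • Masan dialect: prosody-recipient (C), prosody-recipient (D) • Sentences • 바다에 보물섬이 없다 (“There is no treasure island in the sea”) • 교수님 가시는 길이 구미로… (“The road the professor is taking, to Gumi…”) • 동대구에 볼 일이 없습니다 (“I have no business in Dongdaegu”) • 쌀 사고 난 후에 와라 (“Come after you have bought rice”) • 바람이 불어서 먼지가 많다 (“The wind is blowing, so there is a lot of dust”) • 싸기는 해 보여도, 비싸기는 … (“It may look cheap, but as for being expensive …”) • 서울에 사는 삼촌이 왔다 (“My uncle who lives in Seoul has come”) • Stimulus sets: control stimuli, test stimuli, test stimuli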

  28. References [1] Kyung-Hee Lee, “Comparison of acoustic characteristics between Seoul and Busan dialect on fricatives”, Speech Sciences, Vol.9/3, pp.223-235, 2002. [2] Hyun-Gi Kim, Eun-Young Lee, and Ki-Hwan Hong, “Experimental phonetic study of Kyungsang and Cholla dialect using power spectrum and laryngeal fiberscope”, Speech Sciences, Vol.9/2, pp.25-47, 2002. [3] Kyuchul Yoon, “Swapping native and non-native speakers' prosody using PSOLA algorithm”, Proceedings of the Korean Society of Phonetic Sciences and Speech Technology, Spring Conference, pp.77-81, 2006. [4] E. Moulines and F. Charpentier, “Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones”, Speech Communication, Vol.9/5-6, 1990. [5] P. Boersma, “Praat, a system for doing phonetics by computer”, Glot International, Vol.5, 9/10, pp.341-345, 2005.
