Explore simulating Masan dialect with Seoul dialect through prosody transfer using the PSOLA algorithm for multi-dialectal TTS systems. Test the viability of simulating dialects combining speech segments and prosody elements.
Dialect Simulation through Prosody Transfer: A preliminary study on simulating Masan dialect with Seoul dialect • Kyuchul Yoon, Division of English, Kyungnam University • The Autumn Conference of The Association of Modern British & American Language & Literature, University of Ulsan, 2006. 11. 04
Table of Contents • Background & motivation • Goal of the current work • Prosody transfer (PSOLA algorithm) • Preparation of stimuli • Listening test & evaluation • Future work
Background & motivation • Differences among dialects • Segmental differences • Fricative differences in the time domain (Lee, 2002) • Busan fricatives have shorter frication/aspiration intervals than Seoul fricatives • Fricative differences in the frequency domain (Kim et al., 2002) • The low cutoff frequency of Kyungsang fricatives was higher than that of Cholla fricatives (> 1,000 Hz) • Non-segmental or prosodic differences • Intonation or fundamental frequency (F0) contour difference • Intensity contour difference • Segment durational difference • Voice quality difference
Background & motivation • Concatenative text-to-speech (TTS) synthesizers • Concatenation-based • Concatenation units: e.g. diphones • Concatenation units from pre-recorded utterances of a particular dialect • No need for modeling segmental properties (cf. formant-based synthesizers) • Strength/Weakness • Usually single dialect
Background & motivation • To build a multi-dialectal TTS synthesizer • Concatenation units: Multiple dialects • User-selectable dialects • Question: • Scenario A: A multi-dialectal TTS system containing multiple concatenation units from all the dialects involved • Scenario B: Use the concatenation units from a single dialect and simulate the other dialects
Background & motivation • The answer has implications for the cost and complexity of building multi-dialect TTS systems. • Scenario B • Simpler & cheaper • Requires simulating the segmental/non-segmental aspects of the other dialects involved. • Scenario A may be the ultimate solution • Concatenative TTS systems • Since modeling the segmental aspects of the concatenation units in the frequency domain can be difficult, the non-segmental or prosodic aspects should be manipulated.
Background & motivation • The imaginary TTS system (Scenario B) [Figure: concatenation units from dialect 1; their prosodic aspects are simulated to produce Dialect 2, Dialect 3, Dialect 4]
Background & motivation • The questions are: • Would the simulated dialects be good enough? • In other words, would the segmental effects be negligible in perceiving the simulated dialects as authentic?
Goal of the current work • The goal is to test the viability of this scenario with an imaginary system: • Simulate Masan dialect with Seoul dialect • The simulated Masan dialect will have • the speech segments of Seoul dialect • the prosody of Masan dialect (F0, intensity, duration) • the voice source of Masan dialect (not tested)
Goal of the current work • The imaginary system would have • the concatenation units from Seoul dialect and • the ‘near-perfect’ prosody-generating module and • have to simulate the other dialects, e.g. Masan dialect • The imaginary TTS system will be implemented with • the recorded utterances of Seoul dialect • the Masan prosody (F0, intensity, duration) from recorded Masan utterances • the voice source of recorded Masan utterances (not tested)
Prosody transfer (PSOLA algorithm) • Three aspects of the prosody • Fundamental frequency (F0) contour • Intensity contour • Segmental durations • Pitch-Synchronous OverLap and Add (PSOLA) algorithm (Moulines & Charpentier, 1990) • Implemented in Praat (Boersma, 2005) • Use of a script for semi-automatic segment-by-segment manipulation (Yoon, 2006)
Prosody transfer (PSOLA algorithm) • PSOLA algorithm • Windowing pitch periods of the original signal • Rearranging windowed pitch periods to • Stretch/shrink the signal (involves adding/deleting windowed pitch periods) • Change, i.e. increase/decrease, the F0 of the signal (involves adding/deleting windowed pitch periods)
Prosody transfer (PSOLA algorithm) [Figure: original waveform • windowed waveform, pitch periods 1 to 19 • shortened waveform, keeping periods 1, 4, 7, 10, 13, 16, 19 • waveform with lower F0, keeping periods 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
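The windowing and rearranging steps can be illustrated with a toy time-domain sketch. It is not Praat's implementation: it assumes a constant pitch period, evenly spaced pitch marks, and a synthetic sine wave, and every name in it (`extract_grains`, `overlap_add`, `resynthesize`) is hypothetical. Keeping every third windowed period at the original spacing shortens the signal; keeping every second one at twice the spacing lowers F0 by an octave while preserving the duration, which mirrors the 19-period illustration.

```python
import math

def hann(n):
    """Symmetric Hann window with n samples (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def extract_grains(signal, marks, period):
    """Window two pitch periods around each pitch mark."""
    win = hann(2 * period + 1)
    grains = []
    for m in marks:
        grain = []
        for k in range(-period, period + 1):
            idx = m + k
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            grain.append(sample * win[k + period])
        grains.append(grain)
    return grains

def overlap_add(grains, positions, period, length):
    """Sum the windowed grains, each centred on its new position."""
    out = [0.0] * length
    for grain, pos in zip(grains, positions):
        for k, value in enumerate(grain):
            idx = pos - period + k
            if 0 <= idx < length:
                out[idx] += value
    return out

def resynthesize(signal, marks, period, step, new_period):
    """Keep every `step`-th grain and place the kept grains `new_period` samples apart."""
    grains = extract_grains(signal, marks, period)[::step]
    positions = [period + i * new_period for i in range(len(grains))]
    length = positions[-1] + period + 1
    return overlap_add(grains, positions, period, length), positions

# Toy input (assumption): a sine with a 40-sample period and 19 pitch marks.
PERIOD = 40
marks = [PERIOD * (i + 1) for i in range(19)]
signal = [math.sin(2 * math.pi * n / PERIOD) for n in range(20 * PERIOD + 1)]

# Unchanged spacing and no deletion: the interior of the signal is recovered.
identity, _ = resynthesize(signal, marks, PERIOD, step=1, new_period=PERIOD)
# Periods 1, 4, 7, ... at the original spacing: a shorter signal, same F0.
shortened, short_pos = resynthesize(signal, marks, PERIOD, step=3, new_period=PERIOD)
# Periods 1, 3, 5, ... at double spacing: same duration, F0 one octave lower.
lowered, low_pos = resynthesize(signal, marks, PERIOD, step=2, new_period=2 * PERIOD)
```

Real speech has a varying pitch period, so a practical implementation would detect pitch marks from the signal and choose, for each output position, the nearest analysis grain rather than a fixed stride.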
Prosody transfer (PSOLA algorithm) • Prosody transfer using the PSOLA algorithm • Align segments between Masan and Seoul utterances • Make the segment durations of the two identical • Make the two F0 contours identical • Make the two intensity contours identical
Prosody transfer (PSOLA algorithm) • Align segments between Masan and Seoul utterances • Make the segment durations of the two utterances identical [Figure: segments ㅂ ㅏ ㄹ ㅏ ㅁ of “…바람…” ("wind") in the Masan and Seoul utterances; Seoul segments are stretched or shrunk to match]
Prosody transfer (PSOLA algorithm) • Make the two F0 contours identical [Figure: Masan F0 and Seoul F0 contours over the aligned segments ㅂ ㅏ ㄹ ㅏ ㅁ]
Prosody transfer (PSOLA algorithm) • Make the two intensity contours identical [Figure: Masan intensity and Seoul intensity contours over the aligned segments ㅂ ㅏ ㄹ ㅏ ㅁ]
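The three transfer steps can be sketched numerically as follows. This is a minimal illustration, not the Praat script used in the study: the segment durations and dB values are made-up numbers, and the function names (`duration_factors`, `warp_time`, `intensity_gain`) are hypothetical. Once the recipient's segments have been stretched or shrunk by the duration factors, the donor's F0 contour can be read off at the warped times, and the intensity difference becomes a linear gain.

```python
def duration_factors(donor_durs, recipient_durs):
    """Factor by which each recipient segment is stretched (>1) or shrunk (<1)."""
    if len(donor_durs) != len(recipient_durs):
        raise ValueError("segments must be aligned one-to-one")
    return [d / r for d, r in zip(donor_durs, recipient_durs)]

def warp_time(t, donor_durs, recipient_durs):
    """Map a time in the recipient utterance to the matching time in the donor."""
    donor_start = recipient_start = 0.0
    for d, r in zip(donor_durs, recipient_durs):
        if t <= recipient_start + r:
            return donor_start + (t - recipient_start) * d / r
        donor_start += d
        recipient_start += r
    return donor_start  # t lies past the last segment

def intensity_gain(donor_db, recipient_db):
    """Linear amplitude factor that raises the recipient to the donor's level."""
    return 10 ** ((donor_db - recipient_db) / 20)

# Hypothetical segment durations in seconds for ㅂ ㅏ ㄹ ㅏ ㅁ (not measured data).
masan_durs = [0.06, 0.12, 0.05, 0.15, 0.08]  # prosody donor
seoul_durs = [0.05, 0.10, 0.05, 0.10, 0.10]  # prosody recipient

factors = duration_factors(masan_durs, seoul_durs)
new_seoul_durs = [r * f for r, f in zip(seoul_durs, factors)]
donor_time = warp_time(0.10, masan_durs, seoul_durs)
gain = intensity_gain(donor_db=66.0, recipient_db=60.0)
```

Working segment by segment keeps the alignment explicit, which matches the semi-automatic segment-by-segment manipulation described earlier.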
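Preparation of experiment stimuli • Masan dialect: prosody-donor (A), prosody-recipient (B) • Seoul dialect: prosody-recipient (C), prosody-recipient (D) • Sentences • 바다에 보물섬이 없다 ("There is no treasure island in the sea") • 교수님 가시는 길이 구미로… ("The road the professor is taking, to Gumi…") • 동대구에 볼 일이 없습니다 ("I have no business in Dongdaegu") • 쌀 사고 난 후에 와라 ("Come after you have bought rice") • 바람이 불어서 먼지가 많다 ("The wind is blowing, so there is a lot of dust") • 싸기는 해 보여도, 비싸기는 … ("It does look cheap, but as for being expensive …") • 서울에 사는 삼촌이 왔다 ("My uncle who lives in Seoul has come") • Stimuli: 7 control stimuli (used), 7 test stimuli (used), test stimuli (not used)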
Listening test & evaluation • 14 test/control stimuli normalized & randomized • Presented to 4 Masan listeners for magnitude estimation • On a scale of 1 (bad) to 10 (best) • Qualitatively assessed • Used the Praat ExperimentMFC object • Repetition of each stimulus: up to 10 times (the listener can press the “replay” button)
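The preparation of the 14 stimuli might look like the sketch below. The slides do not say which normalization was applied, so peak normalization is an assumption here, and the stimulus labels, the seed, and the function names (`peak_normalize`, `presentation_order`) are all hypothetical; in the study itself the presentation was handled by Praat.

```python
import random

def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silent input: nothing to scale
    return [s * target_peak / peak for s in samples]

def presentation_order(stimuli, seed):
    """Return a reproducible random ordering of the stimuli."""
    order = list(stimuli)
    random.Random(seed).shuffle(order)
    return order

# Hypothetical labels for the 7 control and 7 test stimuli.
stimuli = [f"control_{i}" for i in range(1, 8)] + [f"test_{i}" for i in range(1, 8)]
order = presentation_order(stimuli, seed=2006)
normalized = peak_normalize([0.1, -0.5, 0.25])
```

A fixed seed makes the order repeatable, which helps when the same list has to be presented to several listeners or re-run later.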
Future work • Carefully control the phonological, morphological, and syntactic aspects of the test sentences • Try the voice source (as opposed to the filter) of Masan utterances
Future work • Compare spectra between Masan and Seoul /i/ • Window length: 50 ms [Figure: spectra of /i/ in 바람이, with H1 and H2 marked]
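If H1 and H2 here denote the amplitudes of the first and second harmonics (an assumption; the slide does not define them), their difference in dB is a common rough indicator of voice-source quality and can be measured over a 50 ms window as sketched below. The sampling rate, F0, and harmonic amplitudes are synthetic values chosen for illustration, and the function names are hypothetical.

```python
import math

def harmonic_level_db(samples, fs, freq):
    """Level in dB of the sinusoidal component at `freq` (rectangular window)."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq * i / fs) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq * i / fs) for i, s in enumerate(samples))
    amplitude = 2 * math.hypot(re, im) / n
    return 20 * math.log10(amplitude)

def h1_minus_h2(samples, fs, f0):
    """Difference in dB between the first and second harmonic levels."""
    return harmonic_level_db(samples, fs, f0) - harmonic_level_db(samples, fs, 2 * f0)

# Synthetic 50 ms window (assumption): F0 = 200 Hz, second harmonic at half amplitude.
FS = 16000
F0 = 200.0
window = [
    1.0 * math.sin(2 * math.pi * F0 * i / FS) + 0.5 * math.sin(2 * math.pi * 2 * F0 * i / FS)
    for i in range(int(0.050 * FS))
]
difference_db = h1_minus_h2(window, FS, F0)
```

The synthetic window holds a whole number of periods, so the rectangular window is exact here; on recorded vowels F0 would first have to be estimated and a tapered window would normally be used.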
• Original Masan dialect • Original Seoul dialect • Simulated Masan dialect: Seoul segments + Masan prosody • Simulated Masan dialect: Seoul segments + Masan prosody + Masan voice source
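Appendix • Seoul dialect: prosody-donor (A), prosody-recipient (B) • Masan dialect: prosody-recipient (C), prosody-recipient (D) • Sentences • 바다에 보물섬이 없다 ("There is no treasure island in the sea") • 교수님 가시는 길이 구미로… ("The road the professor is taking, to Gumi…") • 동대구에 볼 일이 없습니다 ("I have no business in Dongdaegu") • 쌀 사고 난 후에 와라 ("Come after you have bought rice") • 바람이 불어서 먼지가 많다 ("The wind is blowing, so there is a lot of dust") • 싸기는 해 보여도, 비싸기는 … ("It does look cheap, but as for being expensive …") • 서울에 사는 삼촌이 왔다 ("My uncle who lives in Seoul has come") • Stimuli: control stimuli, test stimuli, test stimuli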
References
[1] Kyung-Hee Lee, “Comparison of acoustic characteristics between Seoul and Busan dialect on fricatives”, Speech Sciences, Vol.9/3, pp.223-235, 2002.
[2] Hyun-Gi Kim, Eun-Young Lee, and Ki-Hwan Hong, “Experimental phonetic study of Kyungsang and Cholla dialect using power spectrum and laryngeal fiberscope”, Speech Sciences, Vol.9/2, pp.25-47, 2002.
[3] Kyuchul Yoon, “Swapping native and non-native speakers' prosody using PSOLA algorithm”, Proceedings of the Korean Society of Phonetic Sciences and Speech Technology, Spring Conference, pp.77-81, 2006.
[4] E. Moulines and F. Charpentier, “Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones”, Speech Communication, Vol.9/5-6, 1990.
[5] P. Boersma, “Praat, a system for doing phonetics by computer”, Glot International, Vol.5, 9/10, pp.341-345, 2005.