Design and Implementation of Voice Conversion Application (VOCAL)
Elizabeth Kwan (26406025)
Supervised by: Ms. Liliana, M.Eng., and Mr. Resmana Lim, M.Eng.
DEFINITION: What is Voice Conversion?
A method of transforming an input speech signal so that the output signal is perceived as having been produced by another speaker.
BACKGROUND: Why Voice Conversion?
• Speech technology is developing rapidly
• Speech recognition and text-to-speech have been the research priorities for improving human-machine (computer) interaction
• Improving the naturalness of human-machine (computer) interaction
• Voice conversion is used to personify speech-enabled systems
SCOPE & LIMITATION: Scope and limitations of the project
GENERAL:
• Format: wave file (.wav), single channel (mono)
INPUT:
• Source speaker and target speaker, who speak the same utterances
• Home recording
• One person with minimal noise (no background sound)
• Speech only
SCOPE & LIMITATION: Scope and limitations of the project
PROCESS:
• Not real-time; pre-recorded speech is needed
• Text-dependent
OUTPUT:
• The output signal is perceived as produced by another speaker, as judged subjectively by human auditory perception
• Dialect is not included
SCOPE & LIMITATION: Scope and limitations of the project
• Tested using the Mean Opinion Score (MOS)
• Developed in the .NET environment (C# .NET, Visual Studio 2005)
VOICE CONVERSION METHOD: Brief explanation of voice conversion
Different conversion systems use different methods. A general system consists of:
• A method to represent the speaker-specific characteristics of the speech waveform
• A method to map the source and target acoustic spaces
• A method to modify the characteristics of the source speech using the mapping obtained in the previous step
VOICE CONVERSION METHOD: [diagram; see page 33]
VOICE CONVERSION METHOD: Main Process (flow chart: see page 30)
Segmentation → Analysis or Modeling → Transformation → Synthesis
WHY IS IT DIFFICULT? External Problems
Complexity of human language
• Speech is more than a sequence of phones that form words and sentences
• It carries information (rhythm, intonation, word stress, etc.)
• This information varies from one person to another
• This infinite variety increases the complexity of the application, especially in segmentation
WHY IS IT DIFFICULT? External Problems
Speaker Variability
• Every voice is unique
• Speech produced by one person may also vary:
- Realization
- Speaking style
- Sex of speaker
- Anatomy of vocal tract
- Speed of speech
- Dialects
WHY IS IT DIFFICULT? Internal Problems
• The digital form contains only amplitude information for each sampling period
• Amplitude cannot be used directly to determine the speech parameters (a problem for the analysis process)
• Manipulating (adding or deleting) part of the sound affects the whole sound
SEGMENTATION (flow chart: see page 34)
It is difficult to process an entire phrase, because tone, pitch, and other characteristics may vary across the whole signal.
• Split based on syllables
• Use end-point detection, combining volume (two volume thresholds) and zero-crossing rate (ZCR)
SEGMENTATION (flow chart: see page 34)
• Volume: loudness of the audio signal
• Zero-Crossing Rate (ZCR): rate at which the signal changes from positive to negative, and vice versa
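
The two measures above, and the way they can be combined for end-point detection, can be sketched as follows. This is a minimal illustration in Python, not the application's C# code. The frame size, hop size, threshold ratios, ZCR threshold, and all function names are assumptions chosen for the example. Volume is taken here as the sum of absolute amplitudes in a frame, which is one common definition.

```python
import math

def frame_signal(samples, frame_size, hop):
    """Split a list of samples into overlapping frames."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, hop)]

def volume(frame):
    """Volume as the sum of absolute amplitudes (an assumed definition)."""
    return sum(abs(s) for s in frame)

def zero_crossing_rate(frame):
    """Count sign changes between consecutive samples in the frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def detect_segments(samples, frame_size=256, hop=128,
                    high_ratio=0.10, low_ratio=0.02, zcr_threshold=40):
    """Return (first_frame, last_frame) pairs for detected sound segments.

    All default values are illustrative assumptions, not tuned settings.
    """
    frames = frame_signal(samples, frame_size, hop)
    vols = [volume(f) for f in frames]
    zcrs = [zero_crossing_rate(f) for f in frames]
    if not vols or max(vols) == 0:
        return []
    high = high_ratio * max(vols)   # upper volume threshold
    low = low_ratio * max(vols)     # lower volume threshold
    segments, i, n = [], 0, len(frames)
    while i < n:
        if vols[i] <= high:
            i += 1
            continue
        start = end = i
        # Grow the segment while the volume stays above the lower threshold.
        while start > 0 and vols[start - 1] > low:
            start -= 1
        while end < n - 1 and vols[end + 1] > low:
            end += 1
        # Extend over quiet frames with a high ZCR (likely unvoiced sounds).
        while start > 0 and zcrs[start - 1] > zcr_threshold:
            start -= 1
        while end < n - 1 and zcrs[end + 1] > zcr_threshold:
            end += 1
        if segments and start <= segments[-1][1]:
            segments[-1] = (segments[-1][0], end)  # merge overlapping segments
        else:
            segments.append((start, end))
        i = end + 1
    return segments
```

Frame indices can be converted back to sample positions by multiplying by the hop size. A real recording would need thresholds tuned to its noise level.
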
ANALYSIS OR MODELING: Main Process (flow chart: see page 36)
• Linear Predictive Coding
• Pitch Period Computation
ANALYSIS OR MODELING: Modeling Vocal Tract
• Source: signal x(t) [excitation signal]
• Filter: linear time-invariant filter h(t) [impulse response; its transform is the transfer function]
• Speech: convolution of source and filter, y(t) = x(t) * h(t)
• De-convolution is needed to separate the source from the filter
• LPC methods are used: a sample of the speech signal is predicted from several previous samples
ANALYSIS OR MODELING: Linear Predictive Coding
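
As an illustration of predicting a sample from several previous samples, the sketch below estimates LPC coefficients with the autocorrelation method and the Levinson-Durbin recursion. This is one standard way of computing LPC and is an assumption here; the slides do not state which variant the application uses. The function names and the sign convention s[n] ≈ a[1]·s[n-1] + ... + a[p]·s[n-p] are also chosen for the example.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc(frame, order):
    """Estimate predictor coefficients with the Levinson-Durbin recursion.

    Returns (coeffs, error), where coeffs[k-1] multiplies s[n-k] in the
    prediction of s[n], and error is the residual prediction energy.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    if err == 0:
        return a[1:], 0.0            # silent frame: nothing to predict
    for i in range(1, order + 1):
        # Reflection coefficient for this recursion step.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err
```

In practice each frame is usually windowed (for example with a Hamming window) before the autocorrelation is taken; that step is omitted here to keep the sketch short.
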
VOICE CONVERSION METHOD: Main Process (flow chart: see page 36)
Pitch Period Computation
• Pitch Analysis
• Glottal Pulse Computation
• Pitch Tier Computation
ANALYSIS OR MODELING: Pitch Period Computation
Pitch Analysis
• Based on the autocorrelation method (Boersma 1993)
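
The basic idea behind autocorrelation pitch analysis can be sketched as below: a voiced frame correlates strongly with a copy of itself shifted by one pitch period. This is a deliberately simplified illustration and not the full Boersma (1993) algorithm, which adds window correction, interpolation, and path finding across frames. The function name, the search range of 75 to 500 Hz, and the voicing threshold are assumptions for the example.

```python
def estimate_pitch(frame, sample_rate, min_f0=75.0, max_f0=500.0,
                   voicing_threshold=0.3):
    """Return an F0 estimate in Hz for one frame, or None if unvoiced.

    Picks the lag with the highest normalized autocorrelation inside the
    lag range implied by min_f0 and max_f0 (assumed search limits).
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]            # remove any DC offset
    energy = sum(v * v for v in x)
    if energy == 0:
        return None                          # silent frame
    min_lag = max(1, int(sample_rate / max_f0))
    max_lag = min(int(sample_rate / min_f0), n - 1)
    best_lag, best_corr = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    if best_lag is None or best_corr < voicing_threshold:
        return None                          # treated as unvoiced
    return sample_rate / best_lag
```

Running this on successive frames gives a rough pitch contour; the resolution is limited to whole-sample lags, which is one of the limitations that more complete methods address.
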
ANALYSIS OR MODELING: Pitch Period Computation
Glottal Pulse Computation
• Repeated pattern of voiced sound
• τ: glottal pulse
ANALYSIS OR MODELING: Pitch Period Computation
Pitch Tier Computation
• The total number of points is calculated from the total number of voiced frames in the pitch contour obtained in the previous step
TRANSFORMATION: Transform the speech parameters obtained
SYNTHESIS: Main Process (flow chart: see page 30)
Segmentation → Analysis or Modeling → Transformation → Synthesis
SYNTHESIS (flow chart: see page 46)
• An LPC filter is used to reconstruct the transformed speech
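
A minimal sketch of LPC synthesis is given below: an excitation signal is passed through the all-pole filter defined by the predictor coefficients, so that each output sample is the excitation plus a weighted sum of previous outputs. The impulse-train excitation, the function names, and the gain parameter are assumptions for illustration; the slides do not describe how the application builds its excitation.

```python
def impulse_train(length, period):
    """Excitation with a unit pulse every `period` samples (assumed model)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

def lpc_synthesize(excitation, coeffs, gain=1.0):
    """Run the excitation through the all-pole LPC filter.

    Implements y[n] = gain * e[n] + sum_k coeffs[k-1] * y[n-k],
    where coeffs are predictor coefficients for lags 1..p.
    """
    out = []
    for n, e in enumerate(excitation):
        y = gain * e
        for k, ak in enumerate(coeffs, start=1):
            if n - k >= 0:
                y += ak * out[n - k]
        out.append(y)
    return out
```

For example, `lpc_synthesize(impulse_train(400, 40), coeffs)` would produce a buzz-like signal with a pitch period of 40 samples, shaped by whatever `coeffs` were estimated or transformed earlier.
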
EXPERIMENTAL RESULTS
TESTING: Effect of the choice of recording hardware
TESTING: Test on segmentation
Speech: “Hai” from 4 (four) different speakers
Percentage result:
• Speech with only 1 (one) syllable: 100% success
TESTING: Test on segmentation
Speech: “Saya” from 4 (four) different speakers
Percentage result:
• Speech with 2 (two) syllables without a pause: 0% success (all detected as 1 (one) syllable only)
• However, it works well in the application: 100% success
TESTING: Test on segmentation
Speech: “Sistem Cerdas” from 4 (four) different speakers
Percentage result:
• Speech with more complex forms: 50% success
• Related to speaker variability
TESTING: Test on pitch modification
• Average percentage result: 98.67%
TESTING: Subjectivity Test
Similarity (based on human auditory perception)
• Tested on 20 people, 5 utterances
• Overall result: 3.71 out of 5.0
TESTING: Subjectivity Test
Based on gender
• Tested on 22 people, 2 utterances
• 4 gender combinations for each utterance
TESTING: Subjectivity Test
Similarity of speaker characteristics
• Tested on 22 people, 5 utterances
• Overall result: 3.64 out of 5.0