Voice Conversion Dr. Elizabeth Godoy Speech Processing Guest Lecture December 11, 2012
E.Godoy Biography
• United States (Rhode Island): Native
• Hometown: Middletown, RI
• Undergrad & Masters at MIT (Boston area)
• France (Lannion): 2007-2011
• Worked on my PhD at Orange Labs
• Iraklio: Current
• Work at FORTH with Prof. Stylianou
E.Godoy, Voice Conversion
Professional Background
• B.S. & M.Eng. from MIT, Electrical Engineering
• Specialty in Signal Processing
• Underwater acoustics: target physics, environmental modeling, torpedo homing
• Antenna beamforming (Masters): wireless networks
• PhD in Signal Processing at Orange Labs
• Speech Processing: Voice Conversion
• Speech Synthesis Team (Text-to-Speech)
• Focus on Spectral Envelope Transformation
• Post-Doctoral Research at FORTH
• LISTA: Speech in Noise & Intelligibility
• Analyses of Human Speaking Styles (e.g. Lombard, Clear)
• Speech Modifications to Improve Intelligibility
Today’s Lecture: Voice Conversion
• Introduction to Voice Conversion
• Speech Synthesis Context (TTS)
• Overview of Voice Conversion
• Spectral Envelope Transformation in VC
• Standard: Gaussian Mixture Model
• Proposed: Dynamic Frequency Warping + Amplitude Scaling
• Conversion Results
• Objective Metrics & Subjective Evaluations
• Sound Samples
• Summary & Conclusions
Today’s Lecture: Voice Conversion
• Introduction to Voice Conversion
• Speech Synthesis Context (TTS)
• Overview of Voice Conversion
Voice Conversion (VC)
• Transform the speech of a (source) speaker so that it sounds like the speech of a different (target) speaker.
• [Illustration: the source speaker says "This is awesome!"; after Voice Conversion, the same sentence sounds like the target speaker: "He sounds like me!"]
Context: Speech Synthesis
• Increase in applications using speech technologies
• Cell phones, GPS, video gaming, customer service apps…
• Information communicated through speech!
• Text-to-Speech (TTS) Synthesis
• Generate speech from a given text
• [Illustration: example TTS utterances, e.g. "Insert your card.", "Turn left!", "This is Abraham Lincoln speaking…", "Next Stop: Lannion"]
Text-to-Speech (TTS) Systems
• TTS Approaches
• Concatenative: speech synthesized from recorded segments
• Unit-Selection: parts of speech chosen from corpora & strung together
• High-quality synthesis, but need to record & process corpora
• Parametric: speech generated from model parameters
• HMM-based: speaker models built from speech using linguistic info
• Limited quality due to simplified speech modeling & statistical averaging
• Concatenative or Parametric?
Text-to-Speech (TTS) Example
• [Audio: example TTS voice]
Voice Conversion: TTS Motivation
• Concatenative speech synthesis
• High-quality speech
• But, need to record & process a large corpus for each voice
• Voice Conversion
• Create different voices by speech-to-speech transformation
• Focus on the acoustics of voices
What gives a voice an identity?
• "Voice" carries a notion of identity (voice rather than speech)
• Characterize speech at different levels
• Segmental
• Pitch: fundamental frequency
• Timbre: distinguishes between different types of sounds
• Supra-Segmental
• Prosody: intonation & rhythm of speech
Goals of Voice Conversion
• Synthesize High-Quality Speech
• Maintain the quality of the source speech (limit degradations)
• Capture Target Speaker Identity
• Requires learning mappings between source & target features
• Difficult task! Significant modifications of the source speech are needed, which risk severely degrading speech quality…
Stages of Voice Conversion
• Key Parameter: the spectral envelope (related to timbre)
• Three stages: 1) Analysis, 2) Learning, 3) Transformation
• ANALYSIS: extract parameters from the source & target speech corpora
• LEARNING: generate acoustic feature spaces & establish mappings between source & target parameters
• TRANSFORMATION: classify source parameters in the feature space, apply the transformation function & synthesize the converted speech
Today’s Lecture: Voice Conversion
• Introduction to Voice Conversion
• Speech Synthesis Context (TTS)
• Overview of Voice Conversion
• Spectral Envelope Transformation in VC
• Standard: Gaussian Mixture Model
• Proposed: Dynamic Frequency Warping + Amplitude Scaling
The Spectral Envelope
• Spectral Envelope: curve approximating the DFT magnitude
• Related to voice timbre, plays a key role in many speech applications:
• Coding, Recognition, Synthesis, Voice transformation/conversion
• Voice Conversion: important for both speech quality and voice identity
Spectral Envelope Parameterization
• Two common methods
1) Cepstrum
- Discrete Cepstral Coefficients
- Mel-Frequency Cepstral Coefficients (MFCC): change the frequency scale to reflect bands of human hearing
2) Linear Prediction (LP)
- Line Spectral Frequencies (LSF)
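As a concrete illustration of the cepstral option above, here is a minimal numpy sketch (not from the lecture) that estimates a spectral envelope by keeping only the low-quefrency part of the real cepstrum; the frame length, FFT size, and cepstral order are illustrative choices:

```python
import numpy as np

def cepstral_envelope(frame, order=40, nfft=1024):
    """Approximate the spectral envelope of one windowed speech frame
    by keeping only the first `order` real-cepstrum coefficients
    (low-quefrency liftering)."""
    spectrum = np.abs(np.fft.rfft(frame, nfft)) + 1e-12   # avoid log(0)
    log_mag = np.log(spectrum)
    cepstrum = np.fft.irfft(log_mag, nfft)                # real cepstrum
    lifter = np.zeros(nfft)
    lifter[:order] = 1.0            # keep low quefrencies...
    lifter[-(order - 1):] = 1.0     # ...and their symmetric counterpart
    # back to the frequency domain: a smooth log-magnitude envelope
    return np.fft.rfft(cepstrum * lifter, nfft).real
```

The returned envelope has `nfft // 2 + 1` log-magnitude points; a larger `order` preserves more spectral detail, a smaller one gives a smoother curve.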
Standard Voice Conversion
• Parallel corpora: source & target utter the same sentences
• Parameters are spectral features (e.g. vectors of cepstral coefficients)
• Alignment of speech frames in time
• Standard approach: Gaussian Mixture Model
• LEARNING: extract source & target parameters from the parallel corpora & align source & target frames
• TRANSFORMATION: convert the source parameters & synthesize the converted speech
• Focus: Learning & Transforming the Spectral Envelope
Today’s Lecture: Voice Conversion
• Introduction to Voice Conversion
• Speech Synthesis Context (TTS)
• Overview of Voice Conversion
• Spectral Envelope Transformation in VC
• Standard: Gaussian Mixture Model
• Formulation
• Limitations
• Acoustic mappings between source & target parameters
• Over-smoothing of the spectral envelope
• Proposed: Dynamic Frequency Warping + Amplitude Scaling
Gaussian Mixture Model for VC
• Origins:
• Evolved from "fuzzy" Vector Quantization (i.e. VQ with "soft" classification)
• Originally proposed by [Stylianou et al; 98]
• Joint learning of the GMM (most common) by [Kain et al; 98]
• Underlying principle:
• Exploit the joint statistics exhibited by aligned source & target frames
• Methodology:
• Represent distributions of spectral feature vectors as a mixture of Q Gaussians
• Transformation function then based on the MMSE criterion
GMM-based Spectral Transformation
1) Align N source & target spectral feature vectors in time (discrete cepstral coefficients), forming joint vectors z_n = [x_n; y_n].
2) Represent the PDF of the joint vectors as a mixture of Q multivariate Gaussians, p(z) = Σ_q α_q N(z; μ_q, Σ_q), learned by Expectation Maximization (EM) on Z.
3) Transform source vectors using a weighted mixture of the Maximum Likelihood (ML) estimator for each component:
F(x) = Σ_q p(q|x) [μ_y^q + Σ_yx^q (Σ_xx^q)^(-1) (x − μ_x^q)]
where p(q|x) is the probability that the source frame belongs to the acoustic class described by component q (calculated in Decoding).
GMM-Transformation Steps
1) For a source frame x_n, we want to estimate the target vector y_n.
2) Classify: calculate p(q|x_n), the probability that the source frame belongs to the acoustic class described by component q (Decoding step).
3) Apply the transformation function, a weighted sum of the ML estimator for each class q:
ŷ_n = Σ_q p(q|x_n) [μ_y^q + Σ_yx^q (Σ_xx^q)^(-1) (x_n − μ_x^q)]
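The steps above can be sketched in numpy. This is a minimal illustration, assuming the joint-GMM parameters (weights, per-class source/target means, and the S_xx / S_yx covariance blocks) have already been learned on aligned frames; it is not the lecture's actual implementation:

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, S_xx, S_yx):
    """Standard GMM transformation of one source vector x:
    a posterior-weighted sum of per-class MMSE estimators."""
    Q = len(weights)
    # Decoding step: p(q | x), posterior of each class given the source frame
    log_post = np.empty(Q)
    for q in range(Q):
        d = x - mu_x[q]
        _, logdet = np.linalg.slogdet(S_xx[q])
        log_post[q] = (np.log(weights[q]) - 0.5 * logdet
                       - 0.5 * d @ np.linalg.solve(S_xx[q], d))
    log_post -= log_post.max()          # for numerical stability
    post = np.exp(log_post)
    post /= post.sum()
    # Weighted sum of the per-class conditional means E[y | x, q]
    y = np.zeros_like(mu_y[0])
    for q in range(Q):
        y += post[q] * (mu_y[q]
                        + S_yx[q] @ np.linalg.solve(S_xx[q], x - mu_x[q]))
    return y
```

With a single class, zero means, and identity covariance blocks the function reduces to the identity, which is a quick sanity check.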
Acoustically Aligning Source & Target Speech
• First question: are the acoustic events from the source & target speech appropriately associated? Time alignment vs. acoustic alignment?
• One-to-Many Problem
• Occurs when a single acoustic event of the source is aligned to multiple acoustic events of the target [Mouchtaris et al; 07]
• The problem is that the distinct target events cannot be distinguished given only the source information
• Example: one source event As aligns to two distinct target events Bt and Ct, yielding the joint frames [As; Bt] and [As; Ct] in the acoustic space
Cluster by Phoneme (“Phonetic GMM”)
• Motivation: eliminate mixtures to alleviate the one-to-many problem!
• Introduce contextual information
• Phoneme (unit of speech describing a linguistic sound): /a/, /e/, /n/, etc.
• Formulation: cluster frames according to phoneme label; each Gaussian class q then corresponds to a phoneme
• Outcome: error indeed decreases using classification by phoneme
• Specifically, errors from one-to-many mappings are reduced!
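Phoneme-based clustering as described above amounts to grouping aligned frames by their phoneme label and fitting one Gaussian per group. A minimal sketch (function and data layout are illustrative, not the lecture's code), using diagonal covariances as in the evaluations later in the lecture:

```python
import numpy as np
from collections import defaultdict

def phonetic_classes(frames, labels):
    """Cluster aligned joint frames by phoneme label: one acoustic
    class per phoneme, described by its mean and diagonal variance."""
    groups = defaultdict(list)
    for z, ph in zip(frames, labels):
        groups[ph].append(z)
    classes = {}
    for ph, vecs in groups.items():
        V = np.vstack(vecs)
        classes[ph] = (V.mean(axis=0), V.var(axis=0))
    return classes
```

Because class membership is given directly by the phoneme label, the soft Decoding step of the GMM is replaced by a hard, context-driven assignment.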
Still GMM Limitations for VC
• Unfortunately, converted speech quality is poor [audio: original vs. converted]
• What about the joint statistics in the GMM?
• Difference between the GMM and a VQ-type (no joint statistics) transformation? No significant difference! (originally shown in [Chen et al; 03])
"Over-Smoothing" in GMM-based VC • The Over-Smoothing Problem: (Explains poor quality!) • Transformed spectral envelopes using GMM are "overly-smooth" • Loss of spectral details converted speech "muffled", "loss of presence" • Cause:(Goes back to those joint statistics…) • Low inter-speaker parameter correlation (weak statistical link exhibited by aligned source & target frames) class mean Frame-level alignment of source & target parameters not effective! E.Godoy, Voice Conversion
“Spectrogram” Example (GMM over-smoothing)
• [Figure: spectral envelopes across a sentence]
Today’s Lecture: Voice Conversion
• Introduction to Voice Conversion
• Speech Synthesis Context (TTS)
• Overview of Voice Conversion
• Spectral Envelope Transformation in VC
• Standard: Gaussian Mixture Model
• Proposed: Dynamic Frequency Warping + Amplitude Scaling
• Related work
• DFWA description
Dynamic Frequency Warping (DFW)
• Dynamic Frequency Warping [Valbret; 92]
• Warp the source spectral envelope in frequency to resemble that of the target
• Alternative to the GMM: no transformation with joint statistics!
• Maintains spectral details: higher-quality speech
• Spectral amplitude not adjusted explicitly: poor identity transformation
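Applying a piecewise-linear warping function to an envelope can be sketched in a few lines of numpy. This is an illustrative sketch, assuming the warping is given as anchor pairs (source frequency → target frequency); estimating those anchors is the subject of the DFW estimation slides later:

```python
import numpy as np

def apply_dfw(envelope, freqs, anchors_src, anchors_tgt):
    """Warp a source log-spectral envelope in frequency using a
    piecewise-linear warping function defined by anchor pairs."""
    # Invert the warp: for each output frequency, find the source
    # frequency it maps from, then resample the envelope there.
    src_freqs = np.interp(freqs, anchors_tgt, anchors_src)
    return np.interp(src_freqs, freqs, envelope)
```

With identical source and target anchors the warp is the identity; shifting an anchor moves the corresponding spectral events (e.g. formant peaks) up or down in frequency without changing their amplitudes.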
Spectral Envelopes of a Frame: DFW
• [Figure]
Hybrid DFW-GMM Approaches
• Goal: adjust the spectral amplitude after DFW
• Combine DFW- and GMM-transformed envelopes
• Rely on arbitrary smoothing factors [Toda; 01], [Erro; 10]
• Impose a compromise between
• maintaining spectral details (via DFW)
• respecting average trends (via GMM)
Proposed Approach: DFWA
• Observation: the GMM is not needed to adjust amplitude!
• Alternatively, a simple amplitude correction term can be used
• Dynamic Frequency Warping with Amplitude Scaling (DFWA)
Levels of Source-Target Parameter Mappings
• (1) Frame-level: Standard (GMM)
• (3) Class-level: Global (DFWA)
Recall Standard Voice Conversion
• Parallel corpora: source & target utter the same sentences
• Alignment of individual source & target frames in time
• Mappings between acoustic spaces determined on the frame level
• LEARNING: extract source & target parameters from the parallel corpora & align frames
• TRANSFORMATION: convert source parameters & synthesize the converted speech
DFW with Amplitude Scaling (DFWA)
• Associate source & target parameters based on class statistics
• Unlike typical VC, no frame alignment is required in DFWA!
• Three steps: 1) Define acoustic classes (cluster source & target frames), 2) DFW, 3) Amplitude Scaling
• The converted speech is synthesized from the warped & amplitude-scaled source envelopes
Defining Acoustic Classes
• Acoustic classes are built using clustering, which serves 2 purposes:
• Classify individual source or target frames in the acoustic space
• Associate source & target classes
• Choice of clustering approach depends on the available information:
• Acoustic information:
• With aligned frames: joint clustering (e.g. joint GMM or VQ)
• Without aligned frames: independent clustering & associating classes (e.g. with closest means)
• Contextual information: use symbolic information (e.g. phoneme labels)
• Outcome: q = 1, 2, …, Q acoustic classes in a one-to-one correspondence
DFW Estimation
• Compares distributions of observed source & target spectral envelope peak frequencies (e.g. formants)
• Global vision of spectral peak behavior (for each class)
• Only peak locations (and not amplitudes) are considered
• DFW function: piecewise linear
• Intervals defined by aligning maxima in the spectral peak frequency distributions
• Dijkstra algorithm: minimizes the sum of absolute differences between the target & warped source distributions
DFW Estimation
• Clear global trends (peak locations)
• Avoids sporadic frame-to-frame comparisons
• The Dijkstra algorithm selects the pairs
• DFW statistically aligns the most probable spectral events (peak locations)
• Amplitude scaling then adjusts differences between the target & warped source spectral envelopes
Spectral Envelopes of a Frame
• [Figure]
Amplitude Scaling
• Goal of amplitude scaling: adapt the frequency-warped source envelopes to better resemble those of the target
• Estimation uses statistics of the target and warped source data
• For class q, the amplitude scaling function is defined by the difference in average log spectra
• Comparison between source & target data is strictly on the acoustic class level
• No arbitrary weighting or smoothing!
DFWA Transformation
• Transformation (for source frame n in class q), in the discrete cepstral domain (log spectrum parameters): converted envelope = frequency-warped source envelope + class amplitude correction
• Average target envelope respected ("unbiased estimator"): captures timbre!
• Spectral details maintained: the difference between the frame realization & the class average is preserved: ensures quality!
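The class-level correction and the per-frame DFWA transformation above can be sketched as follows; this is a minimal numpy illustration of the idea (frames as rows of log-spectral parameters), not the lecture's implementation:

```python
import numpy as np

def amplitude_scaling(warped_src_frames, tgt_frames):
    """Class-level amplitude correction for one acoustic class q:
    the difference between the average target and average
    frequency-warped source log spectra."""
    return tgt_frames.mean(axis=0) - warped_src_frames.mean(axis=0)

def dfwa_transform(warped_frame, correction):
    """DFWA conversion of one frame: the frequency-warped envelope
    plus the class's amplitude-scaling term (all in the log domain)."""
    return warped_frame + correction
```

Adding a constant (per class) shifts every frame by the same amount, so the converted class mean matches the target class mean while each frame keeps its own deviation from that mean, which is exactly the "unbiased estimator / details maintained" property claimed above.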
Spectral Envelopes of a Frame
• [Figure]
Spectral Envelopes across a Sentence
• [Figure, with zoomed detail]
Spectral Envelopes across a Sound
• DFWA looks like natural speech
• Can see the warping appropriately shifting source events
• Overall, DFWA more closely resembles the target
• Important to note that examining the spectral envelope evolution in time is very informative!
• Can see the poor quality with the GMM right away (less evident within a single frame)
Today’s Lecture: Voice Conversion
• Introduction to Voice Conversion
• Speech Synthesis Context (TTS)
• Overview of Voice Conversion
• Spectral Envelope Transformation in VC
• Standard: Gaussian Mixture Model
• Proposed: Dynamic Frequency Warping + Amplitude Scaling
• Conversion Results
• Objective Metrics & Subjective Evaluations
• Sound Samples
Formal Evaluations
• Speech Corpora & Analysis
• CMU ARCTIC, US English speakers (2 male, 2 female)
• Parallel annotated corpora, speech sampled at 16 kHz
• 200 sentences for learning, 100 sentences for testing
• Speech analysis & synthesis: Harmonic model
• All spectral envelopes represented by discrete cepstral coefficients (order 40)
• Spectral envelope transformation methods:
• Phonetic & Traditional GMMs (diagonal covariance matrices)
• DFWA (classification by phoneme)
• DFWE (Hybrid DFW-GMM, E = energy correction)
• Source (no transformation)
Objective Metrics for Evaluation
• Mean Squared Error (standard) alone is not sufficient!
• Does not adequately indicate converted speech quality
• Does not indicate variance in the transformed data
• Variance Ratio (VR)
• Average ratio of the transformed-to-target data variance
• More global indication of behavior: Is the transformed data varying from the mean? Does the transformed data behave like the target data?
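The two metrics above are straightforward to compute; a minimal numpy sketch (frames as rows), under the assumption that the VR is averaged over cepstral dimensions:

```python
import numpy as np

def mse(converted, target):
    """Mean squared error between converted and target frames."""
    return np.mean((converted - target) ** 2)

def variance_ratio(converted, target):
    """Average per-dimension ratio of converted-to-target variance.
    Values near 1 mean the converted data varies like the target;
    values near 0 indicate over-smoothing (collapse to the mean)."""
    return np.mean(converted.var(axis=0) / target.var(axis=0))
```

A GMM that maps every frame to its class mean can score a low MSE yet a VR near 0, which is the complementarity argued for above.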
Objective Results
• GMM-based methods yield the lowest MSE but no variance
• Confirms over-smoothing
• DFWA yields higher MSE but more natural variations in speech
• The VR of DFWA comes directly from variations in the warped source envelopes (not model adaptations!)
• Hybrid (DFWE) falls in between DFWA & GMM (as expected)
• Compared to DFWA, the VR is notably decreased after the energy correction filter
• Different indications from MSE & VR: both important, complementary
DFWA for VC with Nonparallel Corpora
• Parallel Corpora: a big constraint!
• Limits the data that can be used in VC
• DFWA for Nonparallel corpora
• Nonparallel Learning Data: 200 disjoint sentences
• DFWA is equivalent for parallel or nonparallel corpora!
Implementation in a VC System
• Converted speech synthesis
• Pitch modification using TD-PSOLA (Time-Domain Pitch-Synchronous Overlap-Add)
• Pitch modification factor determined by the classic linear transformation
• Frame synthesis
• Voiced frames: harmonic amplitudes from the transformed spectral envelopes; harmonic phases from nearest-neighbor source sampling
• Unvoiced frames: white noise through an AR filter
• Pipeline: source speech → Speech Analysis (Harmonic) → Spectral Envelope Transformation → Speech Frame Synthesis → TD-PSOLA (pitch modification) → converted speech
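The "classic linear transformation" for pitch is commonly applied to log-f0 statistics; here is a minimal sketch under that assumption (the means/stds of source and target log-f0 are assumed already estimated from the corpora; this is illustrative, not the lecture's exact formula):

```python
import numpy as np

def pitch_factor(f0_src, mu_s, sd_s, mu_t, sd_t):
    """Per-frame pitch modification factor from a linear transformation
    of log-f0: match the source frame's (normalized) log-f0 deviation
    to the target speaker's log-f0 statistics."""
    log_f0_conv = mu_t + (sd_t / sd_s) * (np.log(f0_src) - mu_s)
    # ratio of converted to source f0: the factor passed to TD-PSOLA
    return np.exp(log_f0_conv) / f0_src
```

A source frame exactly at the source mean with matched statistics yields a factor of 1 (no modification); raising the target mean raises the factor proportionally.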
Subjective Testing
• Listeners evaluate
• Quality of speech
• Similarity of voice to the target speaker
• Mean Opinion Score
• MOS scale from 1-5