Toward a high-quality singing synthesizer with vocal texture control
Hui-Ling Lu
Center for Computer Research in Music and Acoustics (CCRMA)
Stanford University, Stanford, CA 94305, USA
Score-to-Singing system
[Block diagram labels: Score, Lyrics, Singing style; Rule system; Parametric Database; Phoneme, F0, Sound level, Duration, Vibrato; Sound synthesis; Singing voice]
• Acoustic rendering
• Co-articulation rules
• Lyrics-to-phoneme
• Musical rules
General sound synthesis approaches
Physical Modeling
• Pros: flexible/intuitive control; expressive; co-articulation easy
• Cons: analysis/re-synthesis difficult; invasive measurements
Source-filter Model, Spectral Modeling
• Pros: analysis/re-synthesis easy
• Cons: less expressive; co-articulation difficult
Contributions
A pseudo-physical model for singing voice synthesis which
• is an approximate physical model.
• can generate high-quality non-nasal singing voice.
• has analysis/re-synthesis ability.
• is computationally affordable.
• provides flexible control of vocal textures.
An automatic analysis procedure for analysis/re-synthesis
A parametric model for vocal texture control
Outline
• Human voice production system
• Synthesis model
• Analysis procedure
• Vocal texture parametric model
• Vocal texture control demo
• Contributions and future directions
The human voice production system
[Figure labels: Lungs, Muscle force, Vocal folds, Pharyngeal cavity, Tongue hump, Oral cavity, Velum, Nasal cavity, Oral sound output, Nasal sound output]
Oscillation pattern of the vocal folds
[Figure labels: Opening period, Closing period, Closed phase, Open phase]
• The oscillation results from the balance of the subglottal pressure, the Bernoulli pressure, and the elastic restoring force of the vocal folds.
• Prephonatory position: the initial configuration of the vocal folds before oscillation begins.
Variation of vocal textures
• Pressed
• Normal
• Breathy
Simplified human voice production model
[Block diagram labels: Glottal Source, Aspiration noise, Vocal Tract Filter, Radiation]
• Source-tract interaction: the glottal waveform in general depends on the vocal tract configuration.
• The source-tract interaction is neglected, since the glottal impedance is very high most of the time.
Source-filter type synthesis model
[Block diagram labels: Glottal Source, Aspiration noise, Vocal Tract Filter, Radiation]
[Block diagram labels: Glottal excitation (Derivative Glottal Wave, Aspiration noise); Filter (Vocal Tract Filter); Voice output]
Overview of the proposed synthesis model
[Block diagram labels: Glottal excitation (Derivative glottal wave from the Transformed Liljencrants-Fant Model; High-passed aspiration noise from the Noise Residual Model); Filter (All Pole Filter); Voice output]
Transformed Liljencrants-Fant (LF) model
[Figure: derivative glottal wave from the LF model for pressed, normal, and breathy phonation; amplitude (-0.05 to 0.05) versus time index (0 to 1400)]
• The transformed LF model controls the wave shape of the derivative glottal wave via a single parameter, Rd (the wave-shape control parameter).
Transformed Liljencrants-Fant (LF) model
• The transformed LF model is an extension of the LF model. It provides a control interface that makes it easy to change the wave shape of the derivative glottal wave.
• Synthesis: Rd (wave-shape control parameter) -> Mapping -> direct synthesis timing parameters -> LF model -> derivative glottal wave
• Analysis: estimated derivative glottal wave -> LF fitting -> direct synthesis timing parameters -> Mapping^-1 -> Rd
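The Rd-to-timing mapping itself is not spelled out on the slide. As a minimal sketch, the snippet below uses the regression that is commonly cited for the transformed LF model (Ra, Rk, Rg predicted from Rd, usually quoted as valid for Rd between about 0.3 and 2.7). Whether the presented system uses exactly these constants is an assumption, and the function names `rd_to_lf_ratios` and `lf_timing` are hypothetical.

```python
def rd_to_lf_ratios(rd):
    """Predict the LF ratio parameters (Ra, Rk, Rg) from Rd.

    Assumption: commonly cited transformed-LF regression constants;
    verify against the reference before relying on them.
    """
    if not 0.3 <= rd <= 2.7:
        raise ValueError("Rd outside the range usually quoted for this mapping")
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    # Rg follows from Rd = (1/0.11) * (0.5 + 1.2*Rk) * (Rk/(4*Rg) + Ra)
    rg = rk / (4.0 * (0.11 * rd / (0.5 + 1.2 * rk) - ra))
    return ra, rk, rg


def lf_timing(rd, f0):
    """Convert Rd and F0 (Hz) into LF timing parameters in seconds."""
    ra, rk, rg = rd_to_lf_ratios(rd)
    t0 = 1.0 / f0            # fundamental period
    tp = t0 / (2.0 * rg)     # instant of peak glottal flow (Rg = T0 / (2*Tp))
    te = tp * (1.0 + rk)     # instant of main excitation (Rk = (Te - Tp) / Tp)
    ta = ra * t0             # effective return-phase duration (Ra = Ta / T0)
    return tp, te, ta
```

For example, `lf_timing(1.0, 100.0)` returns the three timing values for a 100 Hz cycle; larger Rd values move toward breathier wave shapes, smaller ones toward pressed phonation.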
Noise residual model
[Block diagram labels: Gaussian Noise Generator; Amplitude Modulation (parameters An, Bn (noise floor), L, GCI); Noise residual]
Vocal tract filter
• An all-pole filter.
• The vocal tract is assumed to be a series of concatenated uniform lossless cylindrical acoustic tubes.
• Sound waves are assumed to propagate as plane waves along the axis of the vocal tract.
[Figure labels: tube areas A1, A2, ..., AN, Alip from the glottis to the lip end; flows Ug and Ulip; branch gains 1-kN, -kN, -1]
Vocal tract filter
Kelly-Lochbaum junction:
[Figure labels: forward and backward flows Um+, Um-, Um+1+, Um+1-; branch gains 1+km, 1-km, km, -km; scattering coefficient km defined from the adjacent tube areas Am and Am+1]
• τ: the propagation time for a sound wave to travel one acoustic tube. N: the number of acoustic tubes, excluding the glottis and the lip end.
• If the sampling period T = 2τ, the transfer function of the vocal tract acoustic tubes can be shown to be an Nth-order all-pole filter.
• The autoregressive coefficients of the vocal tract filter can be converted to scattering coefficients by Durbin's method.
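The conversion in the last bullet can be sketched with the step-down (backward Levinson-Durbin) recursion. This is a generic illustration, not the author's code: the function names are hypothetical, the denominator is assumed to be A(z) = 1 + a1 z^-1 + ... + aN z^-N, and the sign convention relating these reflection coefficients to the scattering coefficients km of the tube model is an assumption that depends on how the junction is defined.

```python
import numpy as np


def ar_to_reflection(a):
    """Step-down recursion: AR polynomial [1, a1, ..., aN] -> [k1, ..., kN]."""
    a = np.asarray(a, dtype=float)
    cur = (a / a[0])[1:].copy()        # coefficients a1..aN at the current order
    n = len(cur)
    k = np.zeros(n)
    for m in range(n, 0, -1):
        km = cur[m - 1]                # the last coefficient is the order-m reflection
        if abs(km) >= 1.0:
            raise ValueError("filter is not strictly stable")
        k[m - 1] = km
        prev = cur[:m - 1]
        # Remove order m: a_i^(m-1) = (a_i^(m) - km * a_(m-i)^(m)) / (1 - km^2)
        cur = (prev - km * prev[::-1]) / (1.0 - km * km)
    return k


def reflection_to_ar(k):
    """Step-up recursion (inverse of ar_to_reflection), handy as a check."""
    a = np.array([1.0])
    for km in k:
        ext = np.concatenate([a, [0.0]])
        a = ext + km * ext[::-1]
    return a
```

A round trip such as `ar_to_reflection(reflection_to_ar([0.5, -0.3, 0.2]))` should reproduce the input coefficients, and every |km| < 1 corresponds to a stable all-pole filter.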
Overall synthesis model implementation
[Block diagram labels: inputs Degree of breathiness, Glottal excitation strength Ee, Fundamental frequency F0; Vocal texture model; Rd; Transformed LF model (Ee, F0); Noise residual model; 0.8; summing junction (+); Output voice (No noise input)]
Analysis procedure
[Block diagram labels: Desired voice recording; Source-filter de-convolution; Inverse filtered glottal excitation; De-noising by Wavelet Packet Analysis; High-passed aspiration noise; Fitting the estimated derivative glottal wave via LF model; LF model coefficients]
Source-filter de-convolution
• Synthesis model for analysis
[Block diagram labels: Basic Voicing Waveform (a, b, OQ); Low-pass filter; KLGLOTT88 (KL) derivative glottal wave; Nth-order all-pole vocal tract filter]
[Block diagram labels: Basic Voicing Waveform (a, b, OQ); (N+1)th-order all-pole filter]
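For orientation, the basic voicing waveform can be sketched as below. This assumes the usual KLGLOTT88 polynomial form, with glottal flow a n^2 - b n^3 during the open phase and zero afterwards, and assumes b = a / (OQ * period) so that the flow returns to zero exactly at closure. The function name is hypothetical and time is measured in samples.

```python
import numpy as np


def klglott88_cycle(a, oq, period):
    """One glottal cycle of the assumed KLGLOTT88 basic voicing waveform.

    a      : amplitude coefficient of the quadratic term
    oq     : open quotient, 0 < oq <= 1
    period : cycle length in samples
    Returns (flow, derivative), each of length `period`.
    """
    n_open = oq * period               # open-phase duration in samples
    b = a / n_open                     # assumed closure constraint
    n = np.arange(period, dtype=float)
    is_open = n <= n_open
    flow = np.where(is_open, a * n**2 - b * n**3, 0.0)
    deriv = np.where(is_open, 2.0 * a * n - 3.0 * b * n**2, 0.0)
    return flow, deriv
```

With this form the derivative reaches its most negative value at the closure instant, which is the feature the analysis stage looks for when locating glottal closure.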
Source-filter deconvolution estimation flowchart
[Flowchart, in order; side labels: Phase I, Phase II]
• Input: voice signal after removing the low-frequency drift
• GCI detection, yielding one glottal period signal at a time
• Loop for each period: loop over different OQ values, estimating the vocal tract filter and glottal source via SUMT; select and store the 5 best estimates
• Loop for each period: enforce continuity constraints via Dynamic Programming
• Smooth the vocal tract area by time averaging and linear interpolation
• Output: estimated model parameter sequence
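The dynamic-programming step can be illustrated with a Viterbi-style search over the stored candidates. This is a sketch under stated assumptions: each period keeps K candidate parameter vectors with a fitting error, continuity is scored as the squared distance between consecutive vectors, and `weight` balances the two terms. The actual cost used in the presented system is not given on the slide, and the function name is hypothetical.

```python
import numpy as np


def select_smooth_path(candidates, fit_errors, weight=1.0):
    """Pick one candidate per period, trading fit error against continuity.

    candidates : array (P, K, D), K candidate parameter vectors per period
    fit_errors : array (P, K), fitting error of each candidate
    Returns a list of P chosen candidate indices.
    """
    cand = np.asarray(candidates, dtype=float)
    err = np.asarray(fit_errors, dtype=float)
    n_periods, n_cand, _ = cand.shape
    cost = err[0].copy()                           # best cost ending at each candidate
    back = np.zeros((n_periods, n_cand), dtype=int)
    for p in range(1, n_periods):
        # trans[i, j]: penalty for moving from candidate i (p-1) to candidate j (p)
        diff = cand[p - 1][:, None, :] - cand[p][None, :, :]
        trans = weight * np.sum(diff**2, axis=2)
        total = cost[:, None] + trans
        back[p] = np.argmin(total, axis=0)
        cost = total[back[p], np.arange(n_cand)] + err[p]
    # Trace the cheapest path backwards from the last period.
    path = [int(np.argmin(cost))]
    for p in range(n_periods - 1, 0, -1):
        path.append(int(back[p][path[-1]]))
    return path[::-1]
```

The point of the design is that a candidate with a slightly worse fit can still be chosen when it keeps the parameter track continuous across neighbouring periods.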
Convex optimization formulation
[Block diagram labels: Basic Voicing Waveform (a, b, OQ); (N+1)th-order all-pole filter; Inverse filter]
• The parameters are estimated by minimizing the error between the basic voicing waveform and its estimate.
Convex optimization formulation
• Error for one glottal cycle in vector form: [equation not preserved]
• A convex optimization problem: Minimize [objective not preserved] Subject to [constraints not preserved]
• The L2 norm is used.
• The above problem can be solved by SUMT (sequential unconstrained minimization technique).
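Since the objective and constraints are not reproduced in the text, the following only illustrates the SUMT idea on a generic problem: minimize ||A x - b||^2 subject to G x <= h, using a logarithmic barrier whose weight is relaxed over successive unconstrained Newton solves. The problem form, function name, and default settings are assumptions for illustration, not the formulation from the talk.

```python
import numpy as np


def sumt_least_squares(A, b, G, h, x0, t0=1.0, mu=10.0, outer=8, inner=50):
    """Log-barrier SUMT for min ||A x - b||^2 subject to G x <= h.

    x0 must be strictly feasible (G x0 < h). Each outer pass minimizes
    t * ||A x - b||^2 - sum(log(h - G x)) by damped Newton steps, then
    increases t so the barrier matters less.
    """
    x = np.asarray(x0, dtype=float).copy()
    if np.any(G @ x >= h):
        raise ValueError("x0 must be strictly feasible")
    t = t0
    for _ in range(outer):
        def phi(z):
            s = h - G @ z
            if np.any(s <= 0):
                return np.inf                      # outside the feasible region
            r = A @ z - b
            return t * (r @ r) - np.sum(np.log(s))

        for _ in range(inner):
            s = h - G @ x
            r = A @ x - b
            grad = 2.0 * t * (A.T @ r) + G.T @ (1.0 / s)
            hess = 2.0 * t * (A.T @ A) + G.T @ ((1.0 / s**2)[:, None] * G)
            step = -np.linalg.solve(hess, grad)
            decrement = -(grad @ step)             # squared Newton decrement
            if decrement < 1e-12:
                break
            alpha, f0 = 1.0, phi(x)
            # Backtracking keeps the iterate feasible and decreases phi.
            while phi(x + alpha * step) > f0 - 0.25 * alpha * decrement:
                alpha *= 0.5
            x = x + alpha * step
        t *= mu
    return x
```

On a one-dimensional toy case, minimizing (x - 2)^2 subject to x <= 1 from the start point 0, the iterates approach the constrained optimum x = 1 from inside the feasible region.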
Effective analysis/re-synthesis
[Block diagram labels: Basic Voicing Waveform (a, b, OQ); Low-pass filter; KLGLOTT88 (KL) derivative glottal wave; Nth-order all-pole vocal tract filter]
Baritone examples:
• Normal phonation: original, KLGLOTT88
• Pressed phonation: original, KLGLOTT88
De-noising by Wavelet Packet Analysis
De-noising by best basis thresholding:
• A noisy data record: X = f + W
• Transform the noisy data to another basis via Wavelet Packet Analysis: XB = fB + WB
• Threshold out the smaller coefficients of XB, assuming that f can be compactly represented in the new basis by a few large coefficients.
• Select the wavelet filter by an energy compactness criterion: 1/(number of coefficients needed to accumulate 0.9 of the total energy).
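The compactness criterion and the thresholding step can be sketched directly on an array of transform coefficients. The wavelet packet transform itself is left out here; the coefficients are assumed to come from whatever transform is being evaluated. Hard thresholding is an assumption (the slide does not say whether hard or soft thresholding is used), and both function names are hypothetical.

```python
import numpy as np


def energy_compactness(coeffs, fraction=0.9):
    """Return 1 / (number of largest coefficients holding `fraction` of the energy)."""
    energy = np.sort(np.asarray(coeffs, dtype=float).ravel() ** 2)[::-1]
    total = energy.sum()
    if total == 0.0:
        raise ValueError("all coefficients are zero")
    cumulative = np.cumsum(energy)
    # Index of the first coefficient at which the running energy reaches the target.
    needed = int(np.searchsorted(cumulative, fraction * total)) + 1
    return 1.0 / min(needed, energy.size)


def hard_threshold(coeffs, threshold):
    """Zero every coefficient whose magnitude is below `threshold`."""
    c = np.asarray(coeffs, dtype=float)
    return np.where(np.abs(c) >= threshold, c, 0.0)
```

A higher score means fewer coefficients carry most of the energy, so the candidate wavelet filter with the largest score would be preferred under this criterion.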
Effective analysis/re-synthesis
Baritone examples:
• Normal phonation: original, LF
• Pressed phonation: original, LF
Vocal texture control
• The parametric vocal texture control model determines the parameterization of the glottal excitation needed to achieve the desired vocal texture.
• Control complexity is reduced by exploiting the correlations between the model parameters.
[Block diagram labels: Desired vocal texture; Non-breathy mode: Glottal excitation strength Ee, wave-shape control parameter Rd, Transformed LF model; Breathy mode: Rd, Noise residual model; "?" marks the mappings to be determined]
Vocal texture control (non-breathy mode)
Pressed and normal modes
• The wave-shape control parameter Rd and the normalized glottal excitation strength Ee are highly correlated.
Vocal texture control (non-breathy mode)
[Block diagram labels: Degree of pressness; interpolation between (apress, bpress, cpress) and (anormal, bnormal, cnormal); coefficients (a, b, c); Glottal excitation strength Ee; wave-shape control parameter Rd; Transformed LF model; Glottal excitation]
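One way to read this diagram is sketched below. Several things here are assumptions rather than facts from the slide: that the coefficient triples are blended linearly, that a pressness of 0 means normal and 1 means pressed, and that (a, b, c) parameterize a quadratic curve giving Rd as a function of Ee. The coefficient values are made up purely for illustration.

```python
# Hypothetical coefficient triples; real values would come from analysis data.
PRESSED = (0.2, -0.9, 1.0)
NORMAL = (0.4, -1.5, 1.8)


def interpolate_coeffs(pressed, normal, pressness):
    """Linearly blend (a, b, c); pressness 0 -> normal, 1 -> pressed (assumed)."""
    if not 0.0 <= pressness <= 1.0:
        raise ValueError("pressness must lie in [0, 1]")
    return tuple((1.0 - pressness) * n + pressness * p
                 for p, n in zip(pressed, normal))


def rd_from_ee(ee, coeffs):
    """Assumed quadratic relation Rd = a*Ee^2 + b*Ee + c."""
    a, b, c = coeffs
    return a * ee * ee + b * ee + c
```

Under these assumptions, a single "degree of pressness" knob selects the curve, and Ee then fixes Rd, so the user controls two quantities instead of the full LF parameter set.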
Vocal texture control (breathy mode)
• NHR = (high-passed noise energy) / (glottal excitation strength Ee), computed per glottal cycle.
• NHR is an indicator of the degree of breathiness.
• The contour of the noise strength is adjusted by NHR.
[Block diagram labels: Desired vocal texture; NHR; Rd; Ee; Transformed LF model; Noise residual model with gain, duty cycle, window lag, Bn = 1, An = 2.4138 * Bn + 0.213; summing junction (+); Glottal excitation]
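A rough sketch of a pitch-synchronous noise residual follows. Only the relation An = 2.4138 * Bn + 0.213 and the parameter names (noise floor Bn, duty cycle, window lag, GCI) come from the slides. The envelope shape (a Hann window added on top of the floor), the way lag and duty cycle position the window, and the function name are all assumptions.

```python
import numpy as np


def noise_residual(gcis, n_samples, bn=1.0, duty=0.5, lag=0, seed=0):
    """Gaussian noise, amplitude-modulated once per glottal cycle.

    gcis : increasing sample indices of glottal closure instants
    duty : fraction of each cycle covered by the modulation window
    lag  : window start offset (samples) relative to each GCI
    Returns (noise, envelope).
    """
    rng = np.random.default_rng(seed)
    an = 2.4138 * bn + 0.213                     # relation given on the slide
    envelope = np.full(n_samples, float(bn))     # Bn acts as the noise floor
    for start, end in zip(gcis[:-1], gcis[1:]):
        width = max(int(round(duty * (end - start))), 1)
        window = np.hanning(width + 2)[1:-1]     # assumed window shape
        idx = start + lag + np.arange(width)
        keep = (idx >= 0) & (idx < n_samples)
        envelope[idx[keep]] += an * window[keep]
    return envelope * rng.standard_normal(n_samples), envelope
```

In a full system the result would be scaled by a gain derived from NHR and added to the LF derivative glottal wave; that scaling is not shown here.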
Overall synthesis model implementation
[Block diagram labels: inputs Degree of breathiness, Glottal excitation strength Ee, Fundamental frequency F0; Vocal texture model; Rd; Transformed LF model (Ee, F0); Noise residual model; 0.8; summing junction (+); Glottal excitation; Output voice]
Contributions
A pseudo-physical model for singing voice synthesis which
• is an approximate physical model.
• can generate high-quality non-nasal singing voice.
• has analysis/re-synthesis ability.
• is computationally affordable.
• provides flexible control of vocal textures.
An automatic analysis procedure for analysis/re-synthesis
A parametric model for vocal texture control
Future research
• Build a complete score-to-singing system using the proposed synthesis model. Its associated analysis procedure will be used to construct the parametric database.
• Investigate the potential use of the source-filter deconvolution algorithm for low-bit-rate, high-quality speech coding.
• Explore the application of the analysis procedure to sound transformation of vocal textures.