260 likes | 765 Views
SIPL – Technion IIT. December 1 2002. Voice Transformation : Speech Morphing. Gidon Porat and Yizhar Lavner. Project Goals. Gradually change a source speaker ’ s voice, to sound like the voice of a target speaker. The inputs : two reference voice signals, one for each speaker.
E N D
SIPL – Technion IIT December 1 2002 Voice Transformation : Speech Morphing Gidon Porat and Yizhar Lavner
Project Goals • Gradually change a source speaker’s voice, to sound like the voice of a target speaker. • The inputs : two reference voice signals, one for each speaker. • The output : N voice signals that gradually change from source to target. source interp target
Applications ● Multimedia and video entertainment: voice Morphing , just like it’s “face” counterpart. While seeing a face gradually changing from one person to another’s (like often done in video clips) ,we could simultaneously hear his voice changing as well. ● Forensic voice identification by synthesis: Identifying a suspect voice by creating voice-bank of different pitch, rate and timbre. Similar method was developed for face recognition to replace police sketch artists.
The Challenges • The source and target voice signals will never be of the same length. A time varying method is needed. • It is needed to change the source voice’s characteristics to those of the target speaker : pitch,duration,spectral parameters. • Produce a natural sounding speech signal. source target
The Challenges cont. Here are two identical words (“shade”) form source and target The target speaker’s word lasts longer then the source speaker’s and it’s “shape” is quite different.
Speech Modeling Sound transmission in the vocal tract can be modeled as sound passing through concatenated loss less acoustic tubes. A known mathematical relation between the areas of these tubes and the vocal tract’s filter, will help in the implementation of our algorithm A1 A2 A3 Ak
Speech Modeling - Synthesis The basic synthesis of digital speech is done by the discrete-time system model.
Prototype Waveform Interpolation sample PWI is a speech coding method, which is based on the presentation of a speech signal, or its residual error function, by a 3-D surface. The creation of such a surface is described by 3 main steps. align interpolate
The Solution Use 3D surfaces that will capture each vocal phoneme’s residual error signal characteristics and interpolate between the two speakers. Unvoiced phonemes are not dealt with due to complexity and the fact that they carry little information about the speaker.
The Algorithm Once the surfaces for both source and target speakers are created (for each phoneme) an interpolated surface is created. Speaker A Intermediate surface + Speaker B The new error function, that will be reconstructed from that surface, will then be the input of a new Vocal Tract Filter.
The Waveform – Intermediate After creating a surface for each speaker’s phoneme, an intermediate surface is created.
The Waveform - Reconstruction The new, intermediate error signal can be evaluated from the new surface by the equation : Assuming the pitch cycle changes slowly in time : And that :
Algorithm – Cont. The areas of the new tube model will be an interpolation between the source’s and the target’s. Once the new Areas are computed the LPC parameters and V(z) can be calculated, and the signal can be synthesized.
The transfer function The factor could be invariant, and then the voice produced will be an intermediate between the two speakers, or it could vary in time, (from at t=0 to at t=T), yielding a gradual change from one voice to the other. In order for one to hear a “linear” change between the source’s and the target’s voices, the coefficient, (the relative part of ) has to vary in time nonlinearly with a fast transition to a = 0.5 and a slow transitions around it.
voiced voiced 2 3 4 1 unvoiced silent The Algorithm – cont. The final and new speech signal is created by concatenating the new vocal phonemes, in order, along with the source’s/target’s unvoiced phonemes and silent periods.
Conclusion The utterances produced by the algorithm were shown to consist of intermediate features of the two speakers. Both intermediate sounds: And gradually changing sounds were produced.
Algorithm’s limitations Although most of the speaker’s individuality is concentrated in the voiced speech segments, degradation could be noticed, when interpolating between two speech signals that differ greatly in their unvoiced speech segments, such as heavy breathing, long periods of silence, etc.
What has been done in the second part of this project? • Basic implementation of reconstruction algorithm. • Interpolation between two surfaces. • Final implementation of the algorithm. • Fixes in the algorithm such as: • Maximization of the crosscoralation between surfaces. • Modifications to the morphing factor.
Future work proposals • The effect of the unvoiced/silent segments of speech on voice individuality. • The effect of the Vocal Tract’s formants shift around the unit cycle on the human’s ear, in order to find a robust mapping of the transform function alpha.