A System for Hybridizing Vocal Performance

A System for Hybridizing Vocal Performance By Kim Hang Lau

Parameters of the singing voice • Parameters of the singing voice can be loosely classified as: • Timbre • Pitch contour • Time contour (rhythm) • Amplitude envelope (projections)

Vocal Modification • Vocal modification refers to the signal processing of live or recorded singing to achieve a different inflection and/or timbre • Commercially available units include • Intonation corrector • Pitch/formant processor • Harmonizer • Vocoder

Objectives • Prototype a system for vocal modification • Modify a source vocal sample to match the time evolution, pitch contour and amplitude envelope of a similarly sung target vocal sample • This simulates a transfer of singing technique from the target vocalist to the source vocalist, hence a hybridized vocal performance

Order of Presentation • System Overview • Individual components • System evaluation • System limitations • Conclusions and recommendations

System Overview • Three components • Pitch-marking • Time-alignment • Time/pitch/amplitude modification engine • Inspired by Verhelst’s prototype system for the post-synchronization of speech utterances

Targeted System Specifications

Component No. 1: Pitch-marking

Pitch-marking and Glottal Closure Instants (GCIs) • [Figure: pitch-marks P and P’ on a waveform, 5 ms markings] • Information generated from pitch-marking • Pitch period • Amplitude envelope • Voiced/unvoiced segment boundaries
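As a rough illustration of how the three items above could be read off a list of pitch-marks, here is a minimal sketch. It is not the presented system's code: the function name `pitch_mark_features`, the use of the peak absolute value as the per-period amplitude, and the `max_period_s` threshold used to flag unvoiced gaps are all assumptions made for this example.

```python
import numpy as np

def pitch_mark_features(signal, marks, fs, max_period_s=0.02):
    """Derive per-period features from pitch-marks (sample indices of GCIs).

    max_period_s is a hypothetical threshold: a gap between consecutive
    marks longer than this is treated as an unvoiced stretch.
    """
    signal = np.asarray(signal, dtype=float)
    marks = np.asarray(marks, dtype=int)
    # One pitch period (in seconds) per pair of consecutive marks.
    periods = np.diff(marks) / fs
    f0 = 1.0 / periods
    # Amplitude envelope: peak absolute value inside each period.
    envelope = np.array([
        np.max(np.abs(signal[a:b])) for a, b in zip(marks[:-1], marks[1:])
    ])
    # Voiced/unvoiced flag per interval, based on the gap length only.
    voiced = periods <= max_period_s
    return periods, f0, envelope, voiced
```

For a 100 Hz tone sampled at 8 kHz with a mark every 80 samples, each interval yields a period of 10 ms and an F0 of 100 Hz.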

Pitch-marking using the Dyadic Wavelet Transform (DyWT) • Kadambe adapted Mallat’s algorithm for edge detection in images to the detection of GCIs in speech • The adaptation assumes that GCIs in a speech signal behave like edges in an image signal • Computing the DyWT for dyadic scales 2^3 to 2^5 was found sufficient for pitch-marking • If a peak detected in the DyWT coincides across two consecutive scales, starting from the lower scale, that time instant is taken as a GCI

[Figure: DyWT outputs for the original signal at scales 2^1 to 2^5 and the base-band, with labels for Mallat and Kadambe]

The proposed pitch-marking scheme • Detection principle • Detection of the scale that contains the fundamental period • Starting from a higher scale (of lower frequency), there is a considerable jump in frame power when this scale is encountered • Features • 4X decimation to support high sampling rates • Frame based processing and error correction for possible quasi-real-time detection
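The detection principle above can be sketched as a search over per-scale frame powers. This is only an illustration under stated assumptions: the function name `select_fundamental_scale`, the dictionary input format, and the `jump_ratio` threshold are hypothetical, and the slides do not say how large a "considerable jump" is.

```python
import numpy as np

def select_fundamental_scale(frame_by_scale, jump_ratio=4.0):
    """Pick the scale at which frame power jumps, scanning from high to low scale.

    frame_by_scale: dict mapping a scale exponent (e.g. 5 for 2^5) to the
    1-D wavelet output of that scale for one frame.
    jump_ratio: assumed power ratio that counts as a considerable jump.
    Returns the scale exponent, or None if no jump is found.
    """
    scales = sorted(frame_by_scale, reverse=True)   # highest scale first
    powers = [float(np.mean(np.square(frame_by_scale[s]))) for s in scales]
    eps = 1e-12                                     # avoids division by zero
    for prev_power, scale, power in zip(powers[:-1], scales[1:], powers[1:]):
        if power / (prev_power + eps) >= jump_ratio:
            return scale
    return None
```

Returning `None` leaves the no-jump case (for example, when the highest scale already contains the fundamental) to the caller, which is where frame-based error correction could take over.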

The proposed pitch-marking system

Comparison of results with Auto-Tune • [Figure panels: proposed system; Auto-Tune]

Component No. 2: The Modification Engine

(n) (n) (n) D(n) Time/pitch/amplitude modification engine (n): time-modification factor (n): pitch-modification factor (n): amplitude modification factor D(n): time-warping function

TD-PSOLA (Time-Domain Pitch-Synchronous Overlap-Add) • Time-domain splicing overlap-add method • Used in prosodic modification of speech
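To make the overlap-add idea concrete, here is a minimal TD-PSOLA sketch. It is a simplification and not the engine described in this talk: it assumes a fully voiced input, constant modification factors rather than time-varying ones, and two-period Hann-windowed grains. The name `td_psola` and the conventions that `time_factor > 1` lengthens the output and `pitch_factor > 1` raises the pitch are assumptions.

```python
import numpy as np

def td_psola(x, marks, time_factor=1.0, pitch_factor=1.0):
    """Simplified TD-PSOLA for a fully voiced signal with constant factors.

    x: 1-D signal; marks: increasing pitch-mark sample indices (at least 2).
    """
    x = np.asarray(x, dtype=float)
    marks = np.asarray(marks, dtype=int)
    periods = np.diff(marks)                    # local pitch periods in samples
    out_len = int(round(len(x) * time_factor))
    y = np.zeros(out_len)
    t_syn = float(marks[0]) * time_factor       # first synthesis pitch-mark
    while t_syn < out_len:
        # Map the synthesis instant back to the input and take the nearest mark.
        t_ana = t_syn / time_factor
        k = int(np.argmin(np.abs(marks - t_ana)))
        p = int(periods[min(k, len(periods) - 1)])
        centre = int(marks[k])
        # Two-period Hann-windowed grain centred on the analysis mark.
        lo, hi = centre - p, centre + p + 1
        win = np.hanning(2 * p + 1)
        a, b = max(lo, 0), min(hi, len(x))      # clip to the input bounds
        grain = x[a:b] * win[a - lo:b - lo]
        # Overlap-add the grain centred on the synthesis mark.
        start = int(round(t_syn)) - p + (a - lo)
        s0, s1 = max(start, 0), min(start + len(grain), out_len)
        if s1 > s0:
            y[s0:s1] += grain[s0 - start:s1 - start]
        # Synthesis marks are spaced by the modified pitch period.
        t_syn += p / pitch_factor
    return y
```

With both factors set to 1 and evenly spaced marks, the Hann grains overlap by one period and sum to unity, so the interior of the output reproduces the input. Note that any error in the pitch-marks feeds straight into the grain positions.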

Evaluation of the modification engine • [Figure panels: original; TD-PSOLA; Auto-Tune]

Component No. 3: Time-alignment

Time-alignment • Based on Verhelst’s prototype system, which applies Dynamic Time Warping (DTW) • Verhelst reported that the basic local constraint produces the most accurate time-warping path • Computation increases sharply as the length of comparison increases (the DTW cost grid grows with the product of the two sequence lengths) • Accuracy deteriorates as the length of comparison increases

Adaptations from Verhelst’s method • Time-alignment is performed on a voiced/unvoiced segmental basis • DTW for voiced segments • Linear Time Warping (LTW) for unvoiced segments • Global constraints are introduced to further reduce computation • Voiced/unvoiced segments must be synchronized between the two samples; this is done by manual editing in the current implementation
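The segment-wise scheme above can be sketched as follows. This is an illustration, not the prototype's code: `dtw_path` uses the basic local constraint (steps from the left, lower, and diagonal neighbours), and the global constraint is assumed here to be a simple band of half-width `band` around the diagonal, since the slides do not specify its form. `ltw_path` is a hypothetical helper for the linear warp of unvoiced segments.

```python
import numpy as np

def dtw_path(a, b, band=None):
    """DTW with the basic local constraint; returns (cost, path).

    a, b: feature sequences, one frame per row (1-D input means scalar features).
    band: assumed global constraint, a half-width around the diagonal.
    """
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        if band is None:
            j_lo, j_hi = 1, m
        else:
            centre = i * m / n                  # diagonal position for row i
            j_lo = max(1, int(np.floor(centre - band)))
            j_hi = min(m, int(np.ceil(centre + band)))
        for j in range(j_lo, j_hi + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    if not np.isfinite(D[n, m]):
        raise ValueError("band too narrow: no admissible warping path")
    # Backtrack from the end point to the start point.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: D[s])
        path.append((i - 1, j - 1))
    return float(D[n, m]), path[::-1]

def ltw_path(n, m):
    """Linear time warping: map n source frames proportionally onto m target frames."""
    idx = np.round(np.linspace(0, m - 1, n)).astype(int)
    return list(zip(range(n), idx.tolist()))
```

Restricting the inner loop to the band is what cuts the computation; running DTW per voiced segment keeps each cost grid small.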

Manipulation of modification parameters • The modification factors are smoothed with linear-phase FIR low-pass filters before being fed to the modification engine
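A minimal sketch of this kind of smoothing is shown below. The filter design is an assumption: a normalized Hamming window is used as the FIR kernel because it is symmetric (hence linear-phase) and low-pass, and the tap count of 31 is arbitrary. The slides do not state the actual filter length or cutoff.

```python
import numpy as np

def smooth_track(track, num_taps=31):
    """Smooth a parameter track with a symmetric (linear-phase) FIR low-pass filter."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd so the filter delay is a whole number of samples")
    taps = np.hamming(num_taps)
    taps /= taps.sum()                      # unity gain at DC
    half = num_taps // 2
    # Edge padding keeps the output the same length and removes the filter delay.
    padded = np.pad(np.asarray(track, dtype=float), half, mode="edge")
    return np.convolve(padded, taps, mode="valid")
```

Because the kernel is symmetric and the output is re-centred, the smoothed track stays time-aligned with the original, which matters when the factors are applied pitch-synchronously.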

The Prototype System

System Evaluation: case 1

System Evaluation: case 2

System Limitations • Segmentation • No reliable technique is available for voiced/unvoiced segmentation • Segmentation and classification of different vocal sounds is key to devising rules for modification • Modification engine • Lacks the capability to handle pitch transitions • Depends entirely on the pitch-marking stage

System Limitations • Pitch-marking • The proposed system lacks robustness • Despite the desirable time response of the wavelet filter bank, its frequency response cannot isolate harmonics effectively and efficiently • Time-alignment • The DTW basic local constraint allows unlimited time expansion and compression • This freedom often causes distortion in the synthesized vocal sample

Conclusions and Recommendations • The current system works well for slow and continuous singing • Further improvements to the individual components are recommended to handle greater dynamic changes in the vocal signal, thereby extending the current good results to a wider range of singing styles

Questions & Answers

Wavelet filter bank

Dyadic Spline Wavelet

Wide-band analysis

DTW local constraints

Calculation of pitch-marks

DyWT
