HMM-based Singing Voice Synthesis in Mandarin: A Comprehensive Overview

Explore the synthesis unit, question set definition, and evaluation of an HMM-based singing voice synthesis system for Mandarin, with discussion of the synthetic voices and future work.

rdoxey

Presentation Transcript


  1. Synthesis Unit and Question Set Definition for Mandarin HMM-based Singing Voice Synthesis

  2. Outline • Introduction • Background • Motivation • Related work • Singing voice synthesis system • Evaluation • Discussion • Conclusion • Future work

  3. Introduction- Background • Speech and singing are both important ways to communicate and express emotion • Speech synthesizers can already generate fluent, natural speech, even with personal characteristics • Singing voice synthesis has recently become a popular, emerging research topic • It enables computers to sing any song without requiring a human to actually sing it

  4. Introduction- Background • There are two main methods in the corpus-based singing synthesis area • Sample-based approach: unit selection • Appropriate sub-word units are selected from large speech databases • Pros: high-quality speech at the waveform level • Cons: requires a huge amount of recorded data; discontinuities, unstable quality, fixed voice characteristics • [Figure: unit-selection pipeline with blocks for lyrics, notes, score editor, synthesis score, sample selection, concatenation, singer library, and synthesis output]

  5. Introduction- Background • Sample-based approach: unit selection • Units are chosen from a singing voice corpus using the lyrics of the song and the corresponding MIDI file [Zhou, 2008] • Vocaloid • A singing synthesizer developed by Yamaha Corporation, initially released in January 2004 • Uses pitch conversion and timbre manipulation to concatenate samples smoothly

  6. Introduction- Background • There are two main methods in the corpus-based singing synthesis area • Statistical approach: HMM-based • Acoustic parameters are modeled with context-dependent HMMs, and waveforms are generated from the HMMs • Pros: relatively little training data, smooth and stable quality, flexible control of voice characteristics • Cons: vocoder sound, over-smoothing • [Figure: HMM-based pipeline with blocks for singing waveform, labels, parameter extraction, singing parameters, acoustic model training, acoustic model parameters, parameter generation, waveform generation, and synthesis output]

  7. Introduction- Background • Statistical approach: HMM-based • Sinsy • A free online singing voice synthesis service that provides Japanese and English versions • Users obtain synthesized singing voices by uploading musical scores represented in MusicXML

  8. Introduction- Background • Other methods for singing voice synthesis • HNM (Harmonic plus Noise Model) • HNM parameters of a source syllable are used to synthesize sung syllables of diverse pitches and durations [Gu, 2008] • Speech-to-singing • A singing voice is synthesized by a parameter control model from the lyrics of a song and its musical score [Akagi, 2007] • Lyrics are converted into speech by TTS; a melody control model then converts the speech signal into a singing voice by modifying the acoustic parameters [Cai, 2011]

  9. Introduction- Motivation • To synthesize a smooth and continuous singing voice, we chose the HMM-based method to build our singing voice synthesis system • HMMs can model the temporal sequence of a singing voice • Parameters are generated from an HMM composed by concatenating phoneme HMMs • [Figure: HMM state sequence, state durations, and generated spectral and lf0 parameters]

  10. Introduction- Improvement in Sinsy • A series of papers by the team behind Sinsy • [An HMM-based Singing Voice Synthesis System, 2006] • The first paper on an HMM-based singing voice synthesis system • [HMM-based Singing Voice Synthesis System using Pitch-shifted Pseudo Training Data, 2010] • To increase the amount of F0 training data, pitch-shifted pseudo data are prepared by shifting F0 up or down in semitone (half-tone) steps • [Recent Development of the HMM-based Singing Voice Synthesis System – Sinsy, 2010] • Introduces the free online singing voice synthesis service • [Pitch Adaptive Training for HMM-based Singing Voice Synthesis, 2012] • Model-level normalization of pitch

  11. Singing voice synthesis system- feature extraction • STRAIGHT [H. Kawahara, 1997] • A high-quality analysis-synthesis method that offers high flexibility in parameter manipulation with no further degradation • Extracts parameters with relatively good performance even in non-professional recording environments • Features: pitch (F0), smoothed spectrum, aperiodicity factors • [Figure: STRAIGHT analysis (waveform, fixed-point analysis, F0 extraction, yielding F0, smoothed spectrum, and aperiodicity factors) and synthesis (mixed excitation with phase manipulation, yielding the synthetic waveform)]

  12. Singing voice synthesis system- Proposed method for Mandarin singing • Speech vs. Singing • Pitch contour • Database, Model definition, question set

  13. Singing voice synthesis system- Proposed method for Mandarin singing • Speech vs. Singing • Music score • Score elements: pitch, duration, key, tempo, beat

  14. Singing voice synthesis system- Proposed method for Mandarin singing • Different from Sinsy • Language: from Japanese to Mandarin • Database, model definition, question sets • Refinement • Japanese syllabary (hiragana) • Japanese syllables are basically of the form "consonant + vowel" • Only five vowels • Bopomofo • 37 symbols (21 initials, 16 finals)

  15. Singing voice synthesis system- Proposed method for Mandarin singing • [Figure: acoustic parameters, model, and question sets built from linguistic info, note info, and cue info of the singing database. Annotations on the figure: "Different from Sinsy", "Different from TTS", "Only for Mandarin", "Specially for singing"]

  16. Singing voice synthesis system- system structure • Training phase: singing voice database -> excitation, spectral, and aperiodicity parameter extraction, together with labels and the question set -> training of HMMs with CART-based state tying -> context-dependent HMMs & duration models • Synthesis phase: musical score -> conversion to labels -> state selection by CART -> parameter generation from HMM (excitation, spectral, and aperiodicity generation) -> synthesis filter -> synthesized singing voice

  17. Singing voice synthesis system- Proposed method for Mandarin singing • Singing Voice Database Construction • Building a singing voice database for training and synthesis • MHMC Singing Voice Database • Mandarin singing Model definition • Initial and final modification • Medial modification • Long duration models • Question sets definition of decision trees • Modification for Mandarin • Refinements • Pitch coverage by pitch-shift pseudo data • Vibrato

  18. Singing voice synthesis system- singing voice database construction • Singing Voice Database Construction • Singing corpus design process • [Figure: corpus design flow with blocks for music score corpus, song selection, selected scores, phonetic transcription, singing signal, segmentation by phoneme, and singing database]

  19. Singing voice synthesis system- singing voice database construction • Singing Voice Database Construction • Songs selection • Selecting scores • Sources: music books and internet versions • Selection criteria and specialization • Simple songs that do not require advanced singing skills • Phone coverage • Digitizing data • Format: MusicXML • Transposition to an appropriate pitch range

  20. Singing voice synthesis system- Model definition • MusicXML file • [Figure: a sheet music score is keyed in and converted to MusicXML format] • MusicXML is an XML-based file format for representing Western musical notation. The format is proprietary, but fully and openly documented.

  21. Singing voice synthesis system- singing voice database construction • Singing Voice Database Construction • Singer selection and data processing • Finding candidates to record demos • 4 candidates • Choosing the singer • Accuracy of pitch • Timbre • Checking recorded data • Noise is not allowed • Recordings must meet the recording criteria • Segmentation and normalization • Segmentation by phoneme • Reduce the energy of the singing voice data • This avoids the singing voice suddenly becoming loud • Too wide a pitch range leads to poor synthesis

  22. Singing voice synthesis system- singing voice database • NCKU Singing Voice Database • We chose 74 songs whose lyrics together cover all Mandarin phonemes

  23. Singing voice synthesis system- Model definition • [Figure: label generation flow. Inputs: text, MusicXML, wav, transcription. Score information extracted: song settings, note absolute pitch, note type, measure. Processing steps: word to phone, initial and final processing, long duration processing, riffs and runs processing, pause processing, user-defined phrase units, note calculation. Outputs combined into the label: linguistic information, note information (note pitch, note duration, song structure), and cue information]

  25. Singing voice synthesis system- Model definition • Initial and final processing • Tone • In singing, the main pitch of the note is more significant than the original lexical tone of the word • e.g. 不: speech -> bu wuH wuL; singing -> bu wu • Vowel • We define the phonemes by phonology • The medial is grouped with the rime rather than with the initial • When yi (ㄧ), wu (ㄨ), or yu (ㄩ) is the medial, the medial and rime together are treated as one kind of final

  26. Singing voice synthesis system- Model definition • Initial and final processing • Single initial • A syllable that has only an initial and no final • It is followed by an empty rime "帀" so that it can be pronounced • Retroflex initials (ㄓ ㄔ ㄕ ㄖ) + zr; flat-tongue initials (ㄗ ㄘ ㄙ) + sr • Total number of phonemes: 59 (speech: 66)

  27. Singing voice synthesis system- singing voice database • phonetic coverage • final • initial • final contains medial

  28. Singing voice synthesis system- Model definition • Long duration model • Long notes are important for expressive singing • Short notes are over quickly, with no special effects; long notes are different because they provide more room for expression • Simply lengthening a short-note model cannot fully represent a long note • Half or whole note -> final + "L" • Example lyrics: 一起飛; 飛就飛叫就叫
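The long-duration rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `final_label` is hypothetical, and the note type strings are assumed to follow the MusicXML `<type>` values ("whole", "half", "quarter", ...).

```python
# Note types treated as "long", taken from the slide ("half or whole note").
# The strings are assumed to match MusicXML <type> values.
LONG_NOTE_TYPES = {"half", "whole"}


def final_label(final: str, note_type: str) -> str:
    """Return the model name for a final, appending "L" for long notes.

    Hypothetical helper: the slide only states the rule "Final + L".
    """
    if note_type in LONG_NOTE_TYPES:
        return final + "L"
    return final


print(final_label("ei", "whole"))    # eiL
print(final_label("ei", "quarter"))  # ei
```

Keeping a separate model for the long variant lets the decision trees cluster sustained finals independently of short ones.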

  30. Singing voice synthesis system- Model definition • Riffs and runs processing • One syllable corresponds to multiple notes • The last tonal segment is repeated • Pause processing • Represents the breathing pauses or segment pauses of human singing • A pause is marked when the singer stops for longer than a threshold (> 0.3 seconds) • or when the score contains a rest
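The pause rule can be illustrated with a short sketch. It assumes the sung segments are already available as `(label, start, end)` tuples in seconds, and it uses a hypothetical pause label `"pau"`; only the 0.3-second threshold comes from the slide.

```python
PAUSE_THRESHOLD_S = 0.3  # threshold stated on the slide


def insert_pauses(segments, threshold=PAUSE_THRESHOLD_S):
    """Insert a pause segment wherever the silent gap exceeds the threshold.

    segments: list of (label, start_s, end_s), sorted by start time.
    The "pau" label is an assumption made for this example.
    """
    out = []
    prev_end = None
    for label, start, end in segments:
        if prev_end is not None and start - prev_end > threshold:
            out.append(("pau", prev_end, start))
        out.append((label, start, end))
        prev_end = end
    return out
```

For example, a gap of 0.5 seconds between two syllables yields an extra `"pau"` entry, while a gap of 0.1 seconds leaves the sequence unchanged.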

  31. Singing voice synthesis system- Model definition • Linguistic information • Phoneme • Current phoneme and the two {preceding, succeeding} phonemes • Syllable • # of phonemes in the {preceding, current, succeeding} syllable • Phrase • # of phonemes/syllables in the {preceding, current, succeeding} phrase • Song • Average # of phonemes/syllables per measure in this song • # of phrases in this song • Riffs and runs
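The phoneme-level context (current phoneme plus two on each side) can be sketched as below. The padding symbol `"x"` for positions outside the sequence and the function name are assumptions for illustration.

```python
def quinphone_contexts(phonemes, pad="x"):
    """Return one 5-tuple per phoneme: (prev2, prev1, current, next1, next2).

    Positions before the start or after the end are filled with `pad`,
    a placeholder symbol chosen for this sketch.
    """
    phonemes = list(phonemes)
    padded = [pad, pad] + phonemes + [pad, pad]
    return [tuple(padded[i:i + 5]) for i in range(len(phonemes))]


# The sung form of 不 from the earlier slide: "bu wu"
print(quinphone_contexts(["bu", "wu"]))
# [('x', 'x', 'bu', 'wu', 'x'), ('x', 'bu', 'wu', 'x', 'x')]
```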

  32. Singing voice synthesis system- Model definition • Singing is the act of producing musical sounds with the voice; it augments regular speech through the use of both tonality and rhythm • Note pitch • Pitches are compared as "higher" and "lower" in the sense associated with musical melodies • Note duration • An amount of time or a particular time interval. It is the length of a note and one of the bases of rhythm • Song structure • The overall musical form or structure that the song adopts • The order of a music score

  35. Singing voice synthesis system- Model definition • User-defined phrase units • Phrasing may be necessary for the singer to catch a breath or to achieve a certain style • In music, a phrase is defined as "a short passage or segment, often consisting of four measures or forming part of a smaller/larger unit" • We define the phrase unit depending on the song structure • Used in outside-test labels to represent breathing pauses • Examples: 4 measures / phrase, 2 measures / phrase

  36. Singing voice synthesis system- Model definition • Note calculation • The basic score information is not enough to describe a note completely • Relative pitch • The difference between the key note and the current note • The key note depends on the number of sharps or flats • Note position • Notes at different positions in the measure or phrase may be expressed differently because of breathing • Units: note, 0.1 second, thirty-second note, % • Note length • 0.1 second (absolute length), thirty-second note (relative length)
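A minimal sketch of the relative-pitch calculation, under stated assumptions: pitches are MIDI note numbers, the key signature is given as a signed count of sharps (positive) or flats (negative) as in the MusicXML `<fifths>` element, and the key is major. The slides do not say how the key note is derived in the authors' system, so the circle-of-fifths formula here is only one plausible choice.

```python
def key_tonic_pitch_class(fifths: int) -> int:
    """Pitch class (0 = C, ..., 11 = B) of the major-key tonic.

    fifths > 0 counts sharps, fifths < 0 counts flats. Each step around
    the circle of fifths moves the tonic by 7 semitones.
    """
    return (7 * fifths) % 12


def relative_pitch(midi_note: int, fifths: int) -> int:
    """Semitone distance (0-11) of a note above the key note."""
    return (midi_note - key_tonic_pitch_class(fifths)) % 12


print(relative_pitch(64, 1))  # E4 in G major (one sharp): 9
print(relative_pitch(60, 0))  # C4 in C major: 0
```

The result always falls in the 0-11 range used for relative pitch in the note information.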

  38. Singing voice synthesis system- Model definition • Note information • Note pitch • Absolute pitch (C0-G9), relative pitch (0-11), pitch difference between previous & current / current & next notes • Note duration • Length of note by syllable, thirty-second note, 0.1 second • Song structure • Beat: 2/4, 3/4, 4/4 • Tempo: 90, 100, 120 • Key • Position • Counted by note, 0.1 second, thirty-second note, percentage in the measure/phrase • Number of phrases
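The two duration units listed above (thirty-second notes and 0.1-second steps) can be computed from a note's length and the tempo. The sketch below assumes the length is given in quarter notes and that the tempo counts quarter-note beats per minute; both are assumptions, as is the function name.

```python
def note_lengths(quarter_length: float, tempo_bpm: float):
    """Return (length in thirty-second notes, length in 0.1-second units).

    quarter_length: note length in quarter notes (e.g. 2.0 for a half note).
    tempo_bpm: quarter-note beats per minute (assumed beat unit).
    """
    thirty_seconds = round(quarter_length * 8)   # 8 thirty-second notes per quarter
    seconds = quarter_length * 60.0 / tempo_bpm  # absolute length in seconds
    tenths = round(seconds * 10)                 # quantized to 0.1 s
    return thirty_seconds, tenths


print(note_lengths(2.0, 120))  # half note at tempo 120: (16, 10)
```

The thirty-second-note count is independent of tempo (relative length), while the 0.1-second count changes with tempo (absolute length).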

  39. Singing voice synthesis system- Question sets definition • Question sets definition for singing model clustering • (1) Phoneme (current and the two {preceding, succeeding} phonemes) • Final • With or without medial • Initial • Initial pronunciation category • Final pronunciation category • (2) Note • Pitch • Tempo • Beat • Duration • Position • (3) Phrase • # of phonemes/syllables • Preceding, current, succeeding phrase • (4) Song • # of phonemes/syllables • # of phrases
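As an illustration of how such questions are commonly written for decision-tree clustering, the sketch below generates lines in the `QS "name" {pattern,...}` form used by HTK/HTS question files. The label layout assumed by the pattern `*-p+*` (current phoneme between `-` and `+`), the category name, and the pinyin phoneme symbols are all assumptions; the slides do not show the authors' actual label format.

```python
def make_question(name, patterns):
    """Format one question line in the HTK/HTS 'QS' style."""
    return 'QS "{}" {{{}}}'.format(name, ",".join(patterns))


def current_phone_question(category, phones):
    """Question asking whether the current phoneme belongs to a category.

    Assumes labels in which the current phoneme sits between '-' and '+'.
    """
    return make_question("C-" + category, ["*-{}+*".format(p) for p in phones])


# Retroflex initials, written with hypothetical pinyin-style symbols.
print(current_phone_question("Retroflex", ["zh", "ch", "sh", "r"]))
# QS "C-Retroflex" {*-zh+*,*-ch+*,*-sh+*,*-r+*}
```

Questions for the note, phrase, and song categories would follow the same pattern, matching other fields of the label.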

  40. Singing voice synthesis system- Refinement • Pitch-shift pseudo data • Pitch coverage • Use nearby notes from other songs and shift them to the corresponding frequency (Hz)
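Shifting an F0 contour by a number of semitones is a multiplication by 2^(n/12) in equal temperament. The sketch below shows that step only; it assumes the contour is a list of F0 values in Hz with 0 marking unvoiced frames, which is a common convention but not stated in the slides.

```python
def shift_f0(f0_hz, semitones):
    """Shift an F0 contour by `semitones` (positive = up).

    Unvoiced frames, assumed to be encoded as 0, are left untouched.
    """
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


print(shift_f0([440.0, 0.0], 12))  # one octave up: [880.0, 0.0]
```

Shifting by one semitone up or down from a recorded note gives pseudo training data for neighbouring pitches that the corpus does not cover.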

  41. Singing voice synthesis system- Refinement

  42. Evaluation • Experimental Conditions • Database condition • Mel-cepstral analysis condition

  43. Evaluation • Experiment settings • Baseline • RQ: reduced question sets (duplicate, indirect, and relative questions removed) • PS: pitch-shift pseudo data • VP: vibrato post-processing

  44. Evaluation- Subjective evaluation • Pitch contour • Synthesized (baseline) vs. Music score • Synthesized (baseline) vs. Original singing

  45. Evaluation- Subjective evaluation • Mean Opinion Score (MOS) • 10 synthesized songs • 12 subjects • Quality and intelligibility evaluation • ABX test • A subject is presented with two known samples (A, the reference, and B, the alternative) and an unknown sample X that is randomly selected from A and B; the subject identifies X as being either A or B
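The two listening-test measures can be summarized with very little code. The sketch below is generic bookkeeping, not the authors' evaluation script: the function names are hypothetical, a MOS is taken as the plain mean of the collected ratings, and an ABX trial is recorded as a pair of the true identity of X and the listener's answer.

```python
import random
from statistics import mean


def mean_opinion_score(ratings):
    """Average of all listener ratings collected for one system."""
    return mean(ratings)


def draw_x(rng=random):
    """Randomly decide whether X is a copy of sample A or sample B."""
    return rng.choice(["A", "B"])


def abx_accuracy(trials):
    """Fraction of trials in which the listener identified X correctly.

    trials: iterable of (true_identity, listener_answer), each "A" or "B".
    """
    trials = list(trials)
    correct = sum(1 for truth, answer in trials if truth == answer)
    return correct / len(trials)
```

An ABX accuracy near 0.5 means listeners cannot tell the two samples apart, while values near 1.0 mean the difference is clearly audible.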

  46. Evaluation- Subjective • Quality evaluation Intelligibility evaluation

  47. Demo • Outside test • Samples: baseline+RQ, baseline, baseline+RQ+PS • Lyrics: 娃娃哭了 叫媽媽; 推你摔下 你又站起來

  48. Evaluation- Subjective • With the reduced question sets, the quality and intelligibility scores are lower than the baseline • The questions we removed included information that is important for classification • Too few questions remain • 5364 -> 1257 • A better version of the reduced question sets needs to be found

  49. Preference test • Naturalness: testing vibrato • Different pitches and situations call for different vibrato settings • Vibrato is not essential in children's songs • Samples compared: original, vibrato
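One simple way to add vibrato as a post-processing step is to modulate the generated F0 contour with a sinusoid. The slides do not describe the authors' vibrato model, so the sketch below is only an assumption-laden illustration: the rate (5.5 Hz), depth (50 cents), and frame period (5 ms) are placeholder values, and unvoiced frames are assumed to be encoded as 0.

```python
import math


def add_vibrato(f0_hz, rate_hz=5.5, depth_cents=50.0, frame_period_s=0.005):
    """Apply sinusoidal vibrato to an F0 contour (one value per frame).

    All default parameter values are illustrative assumptions.
    Unvoiced frames (0) are passed through unchanged.
    """
    out = []
    for i, f in enumerate(f0_hz):
        if f <= 0:
            out.append(0.0)
            continue
        t = i * frame_period_s
        # Deviation in cents, converted to a frequency ratio (1200 cents = 1 octave).
        cents = depth_cents * math.sin(2.0 * math.pi * rate_hz * t)
        out.append(f * 2.0 ** (cents / 1200.0))
    return out
```

Because the slide notes that different pitches and situations need different settings, the rate and depth are exposed as parameters rather than fixed.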

  50. Discussion • Singing corpus quality • Recording in a professional environment • Singer's timbre • Context factor coverage • Too blurred • Not enough training data • Modeled with priority given to singing characteristics
