Advanced Prosody Modeling for Text-To-Speech Synthesis

Sub-Project I: Prosody, Tones and Text-To-Speech Synthesis • Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang

Outline • Members • Theme of Sub-project I • Research Roadmap • Current Achievements • Research Infrastructure • Future Direction

Members • Sin-Horng Chen, Professor (PI), NCTU • Chiu-yu Tseng, Professor & Research Fellow (Co-PI), Academia Sinica • Yih-Ru Wang, Associate Professor (Co-PI), NCTU • Yuan-Fu Liao, Assistant Professor (Co-PI), NTUT • Lin-shan Lee, Professor, NTU • Hsin-min Wang, Associate Research Fellow, Academia Sinica

Theme of Sub-Project I • Prosody Analysis and Modeling • Hierarchical modeling of fluent prosody • Latent factor-based pitch contour model: mean model, shape model • Tone Behavior and Modeling • Tone sandhi • Prosodic model-based tone recognizer • Applications in Text-to-speech Synthesis • High performance TTS • Applications in Speech/Speaker Recognition • Speaker recognition • [Figure: speakers in prosody space; fast speakers with fewer breaks vs. slow speakers with more breaks]

Research Focus • How to analyze and model fluent speech prosody • Approach 1: Hierarchical modeling of fluent speech prosody • Develop a hierarchical prosody framework of fluent speech • Construct modular acoustic models for: (1) F0 contours, (2) duration patterns, (3) intensity distribution and (4) boundary breaks • Approach 2: Latent factor analysis-based modeling • Assume that a set of latent affecting factors shapes the observed prosody • Latent factor analysis for syllable duration, pitch contour, energy and inter-syllable coarticulation • Explore the relation between latent factors and syntactic information • How to integrate these two approaches and apply them to • Text-to-speech synthesis • Speech/tone/speaker recognition

Research Roadmap (Current Achievements and Future Direction) • Investigation in relation to prosody organization: F0 range and reset, naturalness and measurement, voice quality • Hierarchical modeling of fluent speech prosody • COSPRO corpus/Toolkits • Latent factor analysis: duration, pitch mean, shape, inter-syllable coarticulation • RNN/VQ-based prosodic modeling • Automatic prosodic labeling • Prosodic phrase analysis • High performance TTS • Mandarin, Min-south, Hakka • Model-based TTS • Corpus-based TTS • Tone modeling and recognition, MLP/RNN • HMM • Model-based tone recognizer • Eigen prosody analysis-based speaker recognition • Prosodic model-based speaker recognition • Prosodic cues-dependent LM • Language model+pause, PM

Hierarchical Prosody Framework of Fluent Speech (1/4) • Hierarchical framework of fluent speech prosody for multi-phrase speech paragraphs • Hierarchical cross-phrase patterns and contributions are found in all 4 acoustic dimensions. • Acoustic templates are derived for each prosody level • F0 template • Syllable duration templates and temporal allocation patterns • Intensity distribution patterns • Boundary break patterns

Hierarchical Prosody Framework of Fluent Speech (2/4) • The Prosody Hierarchy with Prosodic Boundaries • [Figure: prosody hierarchy tree. Prosodic Group (boundary B5) > Breath Group (B4) > Prosodic Phrase (initial, middle, final; B3) > Prosodic Word (PW; B2)]
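
The hierarchy on this slide can be read as a segmentation rule: a boundary with a higher break index closes every unit at or below its level. The following is a minimal Python sketch of that reading, not the COSPRO toolkit's method. It assumes each syllable carries the break index that follows it, with B1 (not listed on the slide) standing for an ordinary syllable boundary; the names `BREAK_OF_LEVEL` and `segment` and the toy data are hypothetical.

```python
# Assumed mapping from prosodic unit to the break index that closes it,
# following the tree on the slide: B2 ends a prosodic word (PW), B3 a
# prosodic phrase (PPh), B4 a breath group (BG), B5 a prosodic group (PG).
BREAK_OF_LEVEL = {"PW": 2, "PPh": 3, "BG": 4, "PG": 5}


def segment(syllables, breaks, level):
    """Split syllables into units of the requested level.

    breaks[i] is the break index after syllables[i]; any break whose
    index is at least the level's index closes the current unit.
    """
    threshold = BREAK_OF_LEVEL[level]
    units, current = [], []
    for syllable, brk in zip(syllables, breaks):
        current.append(syllable)
        if brk >= threshold:
            units.append(current)
            current = []
    if current:  # trailing syllables with no closing break
        units.append(current)
    return units


# Toy input invented for illustration: six syllables and their breaks.
syllables = ["s1", "s2", "s3", "s4", "s5", "s6"]
breaks = [1, 2, 1, 3, 2, 5]
prosodic_words = segment(syllables, breaks, "PW")
prosodic_phrases = segment(syllables, breaks, "PPh")
```

Because every level is derived from the same break sequence, the units nest automatically: each prosodic phrase is a concatenation of whole prosodic words.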

Hierarchical Prosody Framework of Fluent Speech (3/4) • F0 cadence of multi-phrase PG (Prosodic Phrase Group): tide over wave and ripple • Syllable duration cadence of multi-phrase PG • [Figure: cadence patterns for PG-initial, PG-medial and PG-final PPh, shown at the PW level and the PPh level]

Hierarchical Prosody Framework of Fluent Speech (4/4) • Cross-speaker synthesis: manipulating Speaker A's duration parameters with Speaker B's • [Audio samples: duration re-synthesis (F054C) and F0 re-synthesis (F054C), original and modified versions]

Latent Factor Analysis-based Prosody Modeling (1/3) • Syllable Duration Model • Multiplicative model • Additive model • Relations between Prosodic State CFs of Initial/Final and Syllable Duration Models • mean: 42.3 frames → 43.9 frames • variance: 180 frame² → 2.52 frame² • RMSE: 1.93 frames (5 ms/frame)
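
The difference between the two duration models can be illustrated with a small Python sketch. It assumes the usual reading of the terms: in the multiplicative model an observed syllable duration is a base duration scaled by one factor per affecting factor, and in the additive model it is a base duration plus one offset per affecting factor. The factor values below are invented and taken as given; estimating them from data is outside the sketch, and the function names are hypothetical.

```python
import math
from statistics import pvariance


def normalize_multiplicative(duration, factors):
    """Undo a multiplicative model: divide by the product of the factors."""
    return duration / math.prod(factors)


def normalize_additive(duration, offsets):
    """Undo an additive model: subtract the sum of the offsets."""
    return duration - sum(offsets)


# Toy data (invented): observed durations in frames, each paired with two
# hypothetical factors, e.g. one for tone and one for prosodic state.
observed = [21.0, 39.0, 62.0, 30.0]
factors = [(0.5, 1.0), (1.0, 1.0), (1.0, 1.5), (0.5, 1.5)]

normalized = [
    normalize_multiplicative(d, f) for d, f in zip(observed, factors)
]

# Removing the factor effects should leave a much tighter distribution.
variance_before = pvariance(observed)
variance_after = pvariance(normalized)
```

In this toy case the variance drops sharply after normalization, which is the kind of effect the mean and variance figures on the slide summarize.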

Latent Factor Analysis-based Prosody Modeling (2/3) • Syllable Pitch Contour Model • Mean model • Shape model • The patterns of x-3-3 • Reconstructed pitch mean

Latent Factor Analysis-based Prosody Modeling (3/3) • Inter-syllable coarticulation pitch contour model • The relationship of syllable pitch contours and affecting factors • Reconstructed pitch contour

Mandarin/Taiwanese TTS • Block diagram of TTS system • TTS samples

Tone Behavior Modeling and Recognition with Inter-Syllabic Features • Gabor-IFAS-based pitch detection • Four inter-syllabic features • Ratio of duration of adjacent syllables • Averaged pitch value over a syllable • Maximum pitch difference within a syllable • Averaged slope of the pitch contour over a syllable • Context-dependent tone behavior modeling
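
The four features listed above can be computed from a per-frame pitch track. The sketch below is one plausible reading, not the authors' exact definitions: it assumes the duration ratio is current over previous syllable, the maximum pitch difference is the maximum minus the minimum within the syllable, and the averaged slope is the mean frame-to-frame pitch change. The function name and input layout are hypothetical.

```python
def syllable_features(f0, previous_duration):
    """Compute four tone features for one syllable.

    f0: pitch values (e.g. in Hz), one per voiced frame of the syllable;
        its length is taken as the syllable duration in frames.
    previous_duration: duration in frames of the preceding syllable.
    """
    if len(f0) < 2:
        raise ValueError("need at least two pitch frames")
    duration = len(f0)
    # Mean of the frame-to-frame differences; this telescopes to
    # (last - first) / (number of frames - 1).
    steps = [b - a for a, b in zip(f0, f0[1:])]
    return {
        "duration_ratio": duration / previous_duration,
        "mean_pitch": sum(f0) / duration,
        "max_pitch_difference": max(f0) - min(f0),
        "mean_slope": sum(steps) / len(steps),
    }


# Toy rising contour, invented for illustration.
features = syllable_features([100.0, 110.0, 120.0, 130.0], previous_duration=8)
```

Only the duration ratio looks across the syllable boundary here; a context-dependent tone model would additionally condition on the neighboring tones.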

Eigen-Prosody Analysis-based Robust Speaker Recognition • Use latent semantic analysis (LSA) to efficiently extract speaker cues that resist handset mismatch from limited training/test data • Step 1: Automatic prosodic state labeling and speaker-keyword statistics • Step 2: Eigen-prosody space construction using latent semantic analysis • Experimental results on HTIMIT corpus • Ten different handsets • 302 speakers • 7/3 utterances for training/test respectively • [Figure: processing pipeline. Prosodic features → VQ-based prosody modeling → prosody state labeling → sequences of prosody states → prosody keyword parsing with a prosody keyword dictionary → speaker-keyword co-occurrence matrix A → eigen-prosody analysis (SVD, A = U S V^T), mapping the high-dimensional prosody space to the eigen-prosody space, in which fast speakers with fewer breaks separate from slow speakers with more breaks]
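
Step 2 can be sketched in a few lines of Python. The slide names SVD; to stay dependency-free, this sketch instead obtains the leading left singular vectors of the co-occurrence matrix by power iteration on A·Aᵀ with deflation, which yields the same subspace. The matrix values, the function names and the choice of two dimensions are all invented for illustration, and no handset effect is modeled.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]


def top_eigenvectors(sym, k, iters=500, seed=0):
    """Leading k eigenvectors of a symmetric PSD matrix (power iteration)."""
    rng = random.Random(seed)
    n = len(sym)
    work = [row[:] for row in sym]
    basis = []
    for _ in range(k):
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = matvec(work, v)
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = dot(v, matvec(work, v))
        basis.append(v)
        # Deflate: remove the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return basis


def eigen_prosody_basis(A, k):
    """A[i][j] = count of prosody keyword i for speaker j."""
    gram = [[dot(ri, rj) for rj in A] for ri in A]  # A times A transposed
    return top_eigenvectors(gram, k)


def project(basis, speaker_vector):
    return [dot(u, speaker_vector) for u in basis]


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


# Toy keyword-by-speaker counts: speakers 0 and 1 favor keywords 0 and 1,
# speakers 2 and 3 favor keywords 2 and 3.
A = [[5, 4, 0, 1],
     [4, 5, 1, 0],
     [0, 1, 5, 4],
     [1, 0, 4, 5]]
basis = eigen_prosody_basis(A, k=2)
points = [project(basis, list(column)) for column in zip(*A)]
same_group = cosine(points[0], points[1])
cross_group = cosine(points[0], points[2])
```

In this toy space the two speakers with similar keyword usage end up pointing in the same direction, while speakers from different groups do not; a recognizer could score a test utterance by projecting its keyword counts the same way and comparing cosines.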

Research Infrastructure (1/2) • Sinica COSPRO and Toolkits: http://www.myet.com/COSPRO/ • 9 sets of Mandarin Chinese fluent speech corpora collected • Platform developed • Each corpus was designed to bring out different prosody features involved in fluent speech. • Annotation processes include labeling and tagging perceived units and boundaries in fluent speech, especially the ultimate unit, the multi-phrase speech paragraph. • Framework constructed to bring out speech paragraphs and the cross-phrase prosodic relationships characteristic of narrative or discourse organization.

Research Infrastructure (2/2) • Tree-Bank Speech Database • Uttered by a single female speaker • Short paragraphs, 110,000 syllables • Sentence-based syntactic tree annotated manually • Pitch contour and syllable segmentation corrected manually

Future Direction (1/5) • Automatic prosodic labeling of Mandarin speech corpus • Analysis of prosodic phrase structure • Model-based tone recognition • High performance TTS • Speech recognition/language modeling using prosodic cues • Prosodic modeling-based robust speaker recognition

Future Direction (2/5) • Automatic prosodic labeling of Mandarin speech corpus • Goal: To construct a prosody-syntax model by exploiting the relationship between prosodic features and linguistic features, and to use it for automatic labeling of various acoustic cues: • Prosodic phrase boundary detection • Inter-syllable/inter-word coarticulation classification • Full/half/sandhi tone labeling for Tone 3 • Syllable pronunciation clustering • Homograph determination • The grouping of monosyllabic words with their neighboring words

Future Direction (3/5) • Analysis of prosodic phrase structure • 4-level prosody hierarchy: PW, PPh, BG, PG • Issues to be studied • Detection and classification of prosodic phrases • Relation between syntactic phrase structure and prosodic phrase structure • Other affecting factors: speaking rate, speaking style, emotion type, spontaneity of speech • Model-based tone recognition • Current approach • Acoustic feature normalization • Context-dependent tone modeling • Main idea: Use the above statistics-based prosody models to compensate for the effects of various affecting factors on syllable pitch contour, duration, and energy contour

Future Direction (4/5) • High performance TTS • Applying the sophisticated prosody models • Modular model of fluent speech prosody • Latent factor analysis-based modeling • Main idea: with important prosodic cues properly labeled, the search for an optimal synthesis unit sequence in a large database can be more efficient. • Consider both linguistic information and acoustic cues • Give special treatment to monosyllabic words • Use the above prosody-syntax models to assist in the generation of prosodic information

Future Direction (5/5) • Speech recognition/language modeling using prosodic cues • Automatic prosodic state labeling • Prosodic state-dependent acoustic modeling • Prosodic state-dependent language modeling • Prosodic modeling-based robust speaker recognition • Automatic prosodic cue labeling • N-gram language model to learn the prosodic behavior of speakers • Applying principal component analysis (PCA) to the N-gram statistics to find a compact prosodic speaker space
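
The N-gram idea in the last bullets can be illustrated with a bigram model over prosodic state labels. This is a minimal Python sketch under stated assumptions: states are small integers produced by some automatic labeler (not shown), add-one smoothing is used for simplicity, and the PCA step is left out. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sequences, num_states):
    """Return a smoothed bigram probability function P(next | current)."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            pair_counts[(current, nxt)] += 1
            context_counts[current] += 1

    def prob(current, nxt):
        # Add-one smoothing keeps unseen transitions from scoring zero.
        return (pair_counts[(current, nxt)] + 1) / (
            context_counts[current] + num_states
        )

    return prob


def log_likelihood(prob, seq):
    """Log-probability of a state sequence under a bigram model."""
    return sum(math.log(prob(a, b)) for a, b in zip(seq, seq[1:]))


# Toy training data, invented: state 0 = no break, state 1 = break.
fast_model = train_bigram([[0, 0, 0, 0, 1, 0, 0, 0]], num_states=2)
slow_model = train_bigram([[0, 1, 0, 1, 1, 0, 1, 0]], num_states=2)

test_sequence = [0, 0, 0, 1, 0, 0]
score_fast = log_likelihood(fast_model, test_sequence)
score_slow = log_likelihood(slow_model, test_sequence)
```

A test sequence with few breaks scores higher under the model trained on the speaker who rarely pauses, so the per-speaker likelihoods can serve as recognition scores.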
