Stages in “text-to-speech” synthesis

EE2F1 Multimedia (1): Speech & Audio Technology • Lecture 7: Speech Synthesis (1) • Martin Russell • Electronic, Electrical & Computer Engineering • School of Engineering • The University of Birmingham

Stages in “text-to-speech” synthesis • Text normalisation • Text-to-phone conversion • Linguistic analysis • Semantic analysis • Conversion of phone-sequence to sequence of synthesiser control parameters • Synthesis of acoustic speech signal

Approaches to synthesis • Final stage is to convert ‘phone’ or word sequence into a sequence of synthesiser control parameters • Two main approaches: • Waveform concatenation • Model-based speech synthesis (includes articulatory synthesis)

Waveform Concatenation • Join together, or concatenate, stored sections of real speech • Sections may correspond to whole words or sub-word units • Early systems based on whole words • e.g., Speaking clock - UK telephone system, 1936 • Storage and access are major issues • Speech quality requires data-rates of 16,000 to 32,000 bits per second (bps)
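To make the storage issue concrete, the short sketch below converts the data-rates quoted above into bytes of storage. The vocabulary size and mean word duration are assumed figures chosen only for illustration; they are not taken from the lecture.

```python
def storage_bytes(duration_s: float, bitrate_bps: int) -> float:
    """Bytes needed to store duration_s seconds of speech at bitrate_bps."""
    return duration_s * bitrate_bps / 8  # 8 bits per byte


# Hypothetical whole-word vocabulary (assumed figures, not from the slides).
n_words = 100
mean_word_s = 0.5

for bitrate in (16_000, 32_000):
    total = storage_bytes(n_words * mean_word_s, bitrate)
    print(f"{bitrate} bps: {total / 1000:.0f} kB for {n_words} words")
```

Storage grows linearly with both vocabulary size and data-rate, which is why recording several variants of each word quickly becomes costly.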

1936 “Speaking Clock” From John Holmes, “Speech synthesis and recognition”, courtesy of British Telecommunications plc

Whole word concatenation (1) • Whole word concatenation can give good quality speech (as in speaking clock), but has many disadvantages: • pronunciation of a word influenced by neighbouring words (co-articulation) • prosodic effects like intonation, rate-of-speaking and amplitude also influenced by context. • interpretation of a sentence will be strongly influenced by details of individual words used (“Mary didn’t buy Sam a pizza”)

Whole word concatenation (2) • Disadvantages (continued): • words must be extracted from the right sort of sentence • most suitable for applications where structure of the sentence is constrained, e.g., announcements, lists… • may need to record more than one example of each word, e.g., raised pitch at end of a list, pre-pause lengthening…

Example – original recording The next train to arrive at platform 2 will call at Bromsgrove, Droitwich Spa, Worcester Foregate Street and Malvern Link

Example – trivial concatenative synthesis The next train to arrive at platform 2 will call at Malvern Link, Worcester Foregate Street, Droitwich Spa and Bromsgrove
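A minimal sketch of this kind of trivial whole-word (here, whole-phrase) concatenation is given below. The stored "recordings" are stand-in lists of samples and the unit names are hypothetical; a real system would load recorded waveforms. Each unit is played back exactly as recorded, whatever its new position in the list, which is the source of the unnatural prosody in the example.

```python
from typing import Dict, List


def concatenate(units: Dict[str, List[float]], sequence: List[str],
                gap_samples: int = 0) -> List[float]:
    """Join stored units end to end, with an optional silent gap between them."""
    out: List[float] = []
    for i, name in enumerate(sequence):
        if i > 0:
            out.extend([0.0] * gap_samples)
        out.extend(units[name])
    return out


def announcement(stations: List[str]) -> List[str]:
    """Unit sequence for a carrier phrase followed by a list of stations."""
    if len(stations) == 1:
        return ["carrier"] + stations
    return ["carrier"] + stations[:-1] + ["and", stations[-1]]


# Stand-in waveforms (assumed data); lengths are arbitrary.
units = {
    "carrier": [0.1] * 8,  # "The next train to arrive at platform 2 will call at"
    "and": [0.2] * 1,
    "Bromsgrove": [0.3] * 4,
    "Droitwich Spa": [0.4] * 5,
    "Worcester Foregate Street": [0.5] * 7,
    "Malvern Link": [0.6] * 4,
}

reordered = ["Malvern Link", "Worcester Foregate Street",
             "Droitwich Spa", "Bromsgrove"]
waveform = concatenate(units, announcement(reordered), gap_samples=2)
print(len(waveform), "samples")
```

Nothing in `concatenate` adapts pitch or duration to context, so a station recorded in list-final position keeps its final-position intonation when moved to the start of the list.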

Example repeated • Original recording • ‘Concatenative synthesis’

Whole word concatenation (3) • Disadvantages (continued): • to add new words the original speaker must be found, or all words must be re-recorded • even with specialist facilities, selection and extraction of suitable words is labour intensive and time consuming

Sub-word concatenation (1) • Limitations of word-based methods suggest concatenative speech synthesis based on sub-word units • Need well-annotated, phonetically-balanced corpus of speech recordings • Extract fragments from waveforms in the corpus which represent ‘basic units’ of speech, and can be concatenated and used for speech synthesis

Sub-word concatenation (2) • Difficulties include: • identification of a set of suitable units • careful annotation of large amounts of data • derivation of a good method for concatenation

Sub-word concatenation (3) • Sub-word concatenation overcomes difficulties with adding new words to the application vocabulary, but other problems are exacerbated. • In particular, coarticulation and pitch continuity problems occur within, as well as between, words. • Necessary to use several examples of each phone (corresponding roughly to different allophones).

Sub-word concatenation (4) • Natural to select fragments that characterise the phone target values, but modelling transitions between these targets is a significant problem

Example: sub-word concatenation “stack” (original) “task” sub-word concatenative synthesis

Transitional units (1) • Central regions of many speech sounds are approximately stationary and less susceptible to coarticulation effects. • Hence select fragments which characterise transitions between phones, rather than phone targets. • e.g., diphone - transition between two phones.
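As an illustration of diphone units, the sketch below turns a phone sequence into the sequence of diphones needed to synthesise it. The phone symbols are informal and the `#` silence marker is an assumed convention, not one taken from the lecture.

```python
from typing import List


def to_diphones(phones: List[str], boundary: str = "#") -> List[str]:
    """Return the diphone names spanning a phone sequence.

    The sequence is padded with a boundary (silence) symbol so that the
    transitions into the first phone and out of the last are included.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Informal phone symbols for "task".
print(to_diphones(["t", "a", "s", "k"]))
```

A sequence of N phones needs N + 1 diphones, each cut from the approximately stationary centre of one phone to the centre of the next.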

Transitional units (2) • There are contextually-induced differences between instantiations of the central region of a phone, which cause discontinuities if they are not attended to. • Possible solutions are: • use several different examples of each diphone • store short transition regions, and • interpolate between end values

Transitional units (3) • Coping with coarticulation effects by modelling transitions, and by: • (a) using multiple examples to cope with variation in the instantiation of the phone centres • (b) interpolating between short transition regions
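Option (b) can be sketched as linear interpolation between the parameter values at the end of one stored transition region and the start of the next. The parameters here are assumed to be a pair of formant frequencies in Hz, purely for illustration; the lecture does not specify the parameterisation.

```python
from typing import List


def interpolate_join(left_end: List[float], right_start: List[float],
                     n_frames: int) -> List[List[float]]:
    """Return n_frames parameter vectors bridging two unit boundaries.

    The frames move linearly from left_end towards right_start; the two
    end points themselves are not repeated.
    """
    frames = []
    for k in range(1, n_frames + 1):
        w = k / (n_frames + 1)  # interpolation weight, strictly between 0 and 1
        frames.append([(1 - w) * a + w * b
                       for a, b in zip(left_end, right_start)])
    return frames


# Assumed example values: (F1, F2) in Hz at the two boundaries.
for frame in interpolate_join([500.0, 1500.0], [700.0, 1100.0], n_frames=3):
    print(frame)
```

Interpolation removes the step change at the join, at the cost of replacing part of the natural phone centre with a synthetic straight-line trajectory.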

More on prosody • Discontinuity in the fundamental frequency is exacerbated for sub-word methods. • Can use source-filter model to separate the excitation signal from the vocal-tract shape. • Vocal-tract shape descriptions can then be concatenated and an appropriately smooth fundamental frequency pattern can be added separately.
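The idea of adding a smooth fundamental frequency pattern separately can be sketched on the source side alone. The code below places excitation pulses along a linearly gliding F0 contour, independently of any concatenated vocal-tract descriptions. The linear contour, the 8 kHz sample rate and the function name are assumptions made for this sketch, and the vocal-tract filter itself is not shown.

```python
from typing import List


def pulse_positions(f0_start: float, f0_end: float, n_samples: int,
                    fs: int = 8000) -> List[int]:
    """Sample indices of excitation pulses for an F0 that glides linearly
    from f0_start to f0_end (in Hz) over n_samples samples."""
    positions = []
    pos = 0
    while pos < n_samples:
        positions.append(pos)
        # F0 at the current position on the linear contour.
        f0 = f0_start + (f0_end - f0_start) * pos / n_samples
        # Next pulse is one pitch period later (at least one sample).
        pos += max(1, round(fs / f0))
    return positions


# A gently falling contour across a whole utterance of 0.5 s.
pulses = pulse_positions(120.0, 80.0, n_samples=4000)
print(len(pulses), "pulses; first few:", pulses[:5])
```

Because the pulse spacing follows one contour for the whole utterance, there is no pitch jump at unit boundaries, whatever F0 the individual units were recorded with.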

PSOLA: Pitch Synchronous Overlap and Add • PSOLA (Charpentier, 1986) • Most successful current approach to concatenative synthesis • In PSOLA, the end regions of windowed waveform samples are overlapped pitch-synchronously and added • BT’s Laureate is an example
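The overlap-and-add step can be sketched as follows. This is a simplified illustration rather than Charpentier's published algorithm: it assumes a constant pitch period and pitch marks that are already known, whereas a real system must detect them. Each segment is two pitch periods long, Hann-windowed and centred on a pitch mark, so that neighbouring windows sum to one where they overlap.

```python
import math
from typing import List


def hann(length: int) -> List[float]:
    """Symmetric Hann window with zero end points."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def extract_segments(signal: List[float], marks: List[int],
                     period: int) -> List[List[float]]:
    """Cut a windowed, two-period segment centred on each pitch mark."""
    window = hann(2 * period + 1)
    segments = []
    for m in marks:
        seg = []
        for k in range(-period, period + 1):
            idx = m + k
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            seg.append(sample * window[k + period])
        segments.append(seg)
    return segments


def overlap_add(segments: List[List[float]], marks: List[int],
                period: int, n_samples: int) -> List[float]:
    """Add each segment back into the output, centred on its mark."""
    out = [0.0] * n_samples
    for seg, m in zip(segments, marks):
        for k, value in enumerate(seg):
            idx = m - period + k
            if 0 <= idx < n_samples:
                out[idx] += value
    return out


# Assumed test signal: a sinusoid with a 40-sample period, marks every period.
period = 40
signal = [math.sin(2 * math.pi * n / period) for n in range(400)]
marks = list(range(0, len(signal), period))
segments = extract_segments(signal, marks, period)
rebuilt = overlap_add(segments, marks, period, len(signal))
error = max(abs(a - b) for a, b in zip(signal[:361], rebuilt[:361]))
print("max reconstruction error between first and last mark:", error)
```

With unchanged mark positions the signal is recovered between the first and last marks; synthesis from stored units uses the same operation, but with segments drawn from different recordings.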

PSOLA From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001

Speech modification using PSOLA • In addition to speech synthesis from segments, there are two other common applications of PSOLA: • Pitch modification • Duration modification
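Both modifications can be sketched with one routine that re-spaces the windowed segments. It is a simplified sketch under strong assumptions: a constant analysis pitch period, pitch marks at exact multiples of that period, and no amplitude normalisation. The function name and its parameters are hypothetical.

```python
import math
from typing import List


def psola_modify(signal: List[float], period: int,
                 pitch_factor: float = 1.0,
                 duration_factor: float = 1.0) -> List[float]:
    """Change pitch and/or duration by re-spacing two-period segments.

    pitch_factor > 1 places segments closer together (higher pitch);
    duration_factor > 1 repeats segments (longer output).
    """
    # Hann window spanning two analysis pitch periods.
    window = [0.5 - 0.5 * math.cos(math.pi * n / period)
              for n in range(2 * period + 1)]
    marks = list(range(0, len(signal), period))      # analysis pitch marks
    new_period = max(1, round(period / pitch_factor))
    n_out = round(len(signal) * duration_factor)
    out = [0.0] * n_out
    t = 0                                            # synthesis pitch mark
    while t < n_out:
        # Use the analysis segment nearest to the matching original time.
        src_time = t / duration_factor
        m = min(marks, key=lambda mark: abs(mark - src_time))
        for k in range(-period, period + 1):
            src, dst = m + k, t + k
            if 0 <= src < len(signal) and 0 <= dst < n_out:
                out[dst] += signal[src] * window[k + period]
        t += new_period
    return out


period = 40
signal = [math.sin(2 * math.pi * n / period) for n in range(400)]
higher = psola_modify(signal, period, pitch_factor=1.25)
longer = psola_modify(signal, period, duration_factor=2.0)
print(len(signal), len(higher), len(longer))
```

Raising the pitch increases the overlap between segments and lowering it can leave gaps, so the output level changes with the pitch factor in this sketch; a practical implementation would compensate for that.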

Increasing pitch using PSOLA From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001

Decreasing pitch using PSOLA From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001

The ‘Laureate’ System • The BT “Laureate” system is a modern, PSOLA-based synthesiser • See Edington et al. (1996a), also look at the web site • Demonstration

PSOLA strengths and weaknesses • Strengths • Produces good quality speech • Weaknesses • Large, annotated corpus needed for each ‘voice’ • Requires accurate pitch peak detection • Inflexible – new voices can only be produced by recording and labelling significant speech corpora from new speakers • Automatic annotation of corpora using techniques from speech recognition

Summary • Concatenative speech synthesis • Whole word concatenation • Importance of prosody • Sub-word concatenation • Choice of sub-word units • PSOLA
