1. Text analysis for speech synthesisStatistical Speech Synthesis Seminar, Cambridge, UK13 April 2011Sabine BuchholzToshiba Research Europe Ltd., Cambridge Research Lab, Cambridge, UK
2. What is meant by text analysis? Various names for related concepts, various meanings for these names
Text analysis/processing/decoding, TTS front-end, linguistic processing
Here: From text to context-dependent label sequence
See 2 of Heiga’s slides
Plus set of potential questions
Ideal: Should provide all information relevant to synthesis of this text
Phone sequence, including pauses
Boundaries of higher level units: syllables, words, utterances etc.
Additional information about units: Vowel? Stressed? Emphasized? Content word? Subject? Wh-question? Direct speech? Angry? etc.
1-best solution? Phone lattice? Probability e.g. for emphasis?
4. Composition of sentence HMM for given text
5. Main focus of this presentation, overview Text analysis at synthesis time (not training)
But will discuss training/synthesis mismatch
Text analysis for European languages
Others might be mentioned in passing
Overview of all aspects of text analysis rather than in-depth study of certain sub-problems
Views
Naïve view
Linguistic view
Code component view
Machine learning view
6. Naïve view What is text?
What information is available?
What decision do humans make?
What information do humans use?
7. Example: Text of an audiobook (HTML) <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
…
<body>
<h2> A TRAMP ABROAD BY MARK TWAIN, Part 1 </h2>
<a href="p2.htm">Next Part</a>
<a href="119-h.htm">Main Index</a>
<h1> A TRAMP ABROAD, Part 1</h1>
<h2> By Mark Twain </h2>
<a name="ch1"></a><center><h2> CHAPTER I </h2>
<h3> [The Knighted Knave of Bergen] </h3></center>
<p> One day it occurred to me that it had been many years since the world …
Text might contain mark-up (sometimes looks like text, e.g. ‘>’ in email)
some potentially relevant
e.g. <h1>, <p>, <em> also: SSML, SAPI
some should be translated to text
e.g. <a href=…>
some should be ignored
e.g. <center>
Need mark-up specific preprocessing
8. Example: Text of an audiobook (plain text) The Project Gutenberg EBook of A Tramp Abroad, by Mark Twain
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Title: A Tramp Abroad
Complete
Author: Mark Twain (Samuel Clemens)
Release Date: June 3, 2009 [EBook #119]
Language: English
Character set encoding: ASCII
*** START OF THIS PROJECT GUTENBERG EBOOK A TRAMP ABROAD ***
9. Example: Text of an audiobook Produced by Anonymous Volunteers, John Greenman and David Widger
A TRAMP ABROAD, Part 1
By Mark Twain
(Samuel L. Clemens)
First published in 1880
Illustrations taken from an 1880 First Edition
* * * * * *
ILLUSTRATIONS:
1. PORTRAIT OF THE AUTHOR
2. TITIAN'S MOSES
…
10. Example: Text of an audiobook CONTENTS
CHAPTER I A Tramp over Europe--On the
Holsatia--Hamburg--Frankfort-on-the- Main--How it Won its Name--A Lesson
in Political Economy--Neatness in Dress--Rhine Legends--"The Knave
of Bergen" The Famous Ball--The Strange Knight--Dancing with the
Queen--Removal of the Masks--The Disclosure--Wrath of the Emperor--The
Ending
…
CHAPTER I
[The Knighted Knave of Bergen]
11. Example: Text of an audiobook One day it occurred to me that it had been many years since the world
had been afforded the spectacle of a man adventurous enough to undertake
a journey through Europe on foot. After much thought, I decided that
I was a person fitted to furnish to mankind this spectacle. So I
determined to do it. This was in March, 1878.
I looked about me for the right sort of person to accompany me in the
capacity of agent, and finally hired a Mr. Harris for this service.
It was also my purpose to study art while in Europe. Mr. Harris was in
sympathy with me in this. He was as much of an enthusiast in art as
I was, and not less anxious to learn to paint. I desired to learn the
German language; so did Harris.
Toward the middle of April we sailed in the HOLSATIA, Captain Brandt,
and had a very pleasant trip, indeed.
12. How do humans make these decisions? End of utterance?
… and finally hired a Mr. Harris for this service.
… he had not thrown himself away--he had gained a woman of 10,000 l. or thereabouts; …
On Sun. John W. Doersch in Town Centerville, received the news that his son Otto Doersch had died ...
Known for her salty language and accordion playing, Yahoo! CEO Carol Bartz isn't your typical Internet executive.
To Terence's demand, "She seems to be better?" he replied, looking at him in an odd way, "She has a chance of life."
Roman numerals
CHAPTER I
Symbols
#119
13. How do humans make these decisions? Numbers
“June 3, 2009”
“in 1880”
“an 1880 First Edition”
Lists
“Police received 137 999 calls”
Capitalization
Nice is the fifth most populous city in France, ...
IT DEPARTMENT HEAD/MANAGER
lets go to it department!!!
Charlemagne, while chasing the Saxons (as HE said), or being chased by them (as THEY said), …
14. How do humans make these decisions? Words
One day it occurred to me
Not as easy in some other languages, e.g. Japanese, Chinese
Similar problem in URLs, e.g.
www.abeautifulmind.com/ www.iwantcandy.com/
Sometimes more needed to identify word
polishin’
2morrow, tmrrw
Latter similar to Hebrew/Arabic writing
café/cafe
Much bigger problem for e.g. French: passé/passe have different pronunciation
German: “ä”, “ö”, “ü”, “ß” can be replaced by “ae”, “oe”, “ue”, “ss”
Frankfort-on-the- Main
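The German fallback spellings mentioned above go through a trivial mapping in the umlaut-to-ASCII direction; a minimal sketch (the function and table names are invented here), noting that the reverse direction (e.g. "ae" → "ä") is ambiguous and needs context:

```python
# Map German umlauts/ß to their ASCII fallback spellings, so that e.g.
# "Müller" and "Mueller" can be matched against the same lexicon entry.
# The reverse mapping (ae -> ä) is ambiguous and is not attempted here.
ASCII_FALLBACK = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
                  "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def ascii_fallback(text: str) -> str:
    return "".join(ASCII_FALLBACK.get(ch, ch) for ch in text)
```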
15. How do humans make these decisions? Pronunciations
Most words
He lives/their lives; bass guitar/bass fishing
“neatness” “leafbuds”
Very common in some languages, e.g. German compounding
Known allowed patterns
e.g. {name|noun|adj|verb}-noun but not {…}-name
“HOLSATIA”
“Brandt”
Acronyms: ASCII , JUD
Chunks/pauses
One day it occurred to me that it had been many years since the world had been afforded the spectacle of a man adventurous enough to undertake a journey through Europe on foot.
16. How do humans make these decisions? Emotions
“…” he shouted angrily.
I wish I had something for Joe.
"I have told you, have I not, that you are too late? There is another, and if I have not promised to marry him at once, at least I can promise no one else."
Emphasis/prominence
After much thought, I decided that I was a person fitted to furnish to mankind this spectacle. So I determined to do it.
Style/expression
One day it occurred to me that it had been many years since the world had been afforded the spectacle of a man adventurous enough to undertake a journey through Europe on foot. After much thought, I decided that I was a person fitted to furnish to mankind this spectacle.
17. How do humans make these decisions? Summary
Many different sources of information
Prior knowledge: known names, abbreviations, expressions, etc.
Form of token itself, form of immediate context
Words in immediate/sentence/next lines/whole text context
Also classes of words (e.g. month name)
Expressions in context
Position in text
Which interpretation results in more likely sentence
Text type, domain
Experience/understanding
18. Linguistic view: Text normalization Replace “non-standard words” (NSW) by standard ones
Abbreviations, numbers, symbols, sometimes acronyms
Several parts
Identification of potential candidates, ambiguity class
Not always trivial: e.g. not all abbreviations have periods
Disambiguation (if necessary)
Expansion
Can be very simple: Mr. → Mister
Or more complex: “$40.4m” → “forty point four million dollars”
Might need more information to get case/gender/number/etc. right
E.g. German
Der 1. Versuch (the first trial) → erste
Ein 1. Versuch (a first trial) → erster
Ein 1. Experiment (a first experiment) → erstes
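The identification/expansion split above can be sketched as a toy expander for a couple of the slide's examples; a minimal illustration only, not a real TN module (the abbreviation table, the regex, and the number-word coverage are invented for the example and handle only small round numbers):

```python
import re

# Toy text-normalization expansion (tables and coverage are illustrative).
ABBREV = {"Mr.": "Mister", "Dr.": "Doctor"}
ONES = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
        "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
TENS = {"2": "twenty", "3": "thirty", "4": "forty", "5": "fifty",
        "6": "sixty", "7": "seventy", "8": "eighty", "9": "ninety"}

def spell_number(s: str) -> str:
    """Spell out a one/two-digit number or a decimal like '40.4'.
    (Teens and larger numbers are out of scope for this sketch.)"""
    if "." in s:
        whole, frac = s.split(".", 1)
        return spell_number(whole) + " point " + " ".join(ONES[d] for d in frac)
    if len(s) == 2 and s[1] == "0":
        return TENS[s[0]]
    if len(s) == 2:
        return TENS[s[0]] + "-" + ONES[s[1]]
    return ONES[s]

def expand(token: str) -> str:
    if token in ABBREV:                      # simple table-driven expansion
        return ABBREV[token]
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)m", token)
    if m:                                    # pattern-driven expansion
        return spell_number(m.group(1)) + " million dollars"
    return token
```

A real system would also need the German-style agreement information shown above, which a token-local expander like this cannot supply.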
19. Linguistic view Homographs (heteronyms)
same spelling but different meanings (with different pronunciations)
Can have different part-of-speech (POS), e.g. “live” (verb/adj), “lives” (verb/noun)
Normally ambiguity between: noun, verb, adjective
Exceptions: I, A, read, …
Resolved by POS tagging
Or not: row, bow, bass, axes, …
Resolved by Word Sense Disambiguation
20. Linguistic view: Lexical information Information about words
Orthography
Including capitalization: it, IT
Can include non-letters: Yahoo!, 2morrow
Pronunciation(s)
Phone sequence (canonical pronunciation), syllabification
If appropriate for language
Lexical stress (primary/secondary): ˌfundaˈmental
Lexical tone (e.g. in Chinese)
Part(s)-of-speech; if necessary linked to different pronunciations
Lexicon defines ambiguity classes
“Nice” (/niːs/, name) and “nice” (/naɪs/, adjective)
Ideally with frequency information (“nice” generally more frequent)
Ideally domain-specific, if TTS domain known
Other: is-abbreviation, gender, frequency, senses, association with emotions, language of origin, semantic class, …
21. Linguistic view: Pronunciation prediction Letter-to-sound (LTS)/Grapheme-to-phoneme (G2P)
Map sequence of letters into sequence of phones
Phones: elements of phone set
Much agreement, minor variations
details not only guided by phonetics/phonology but also practical considerations
Can humans/systems reliably find boundary/make distinction?
Size of phone set
Examples of differences: diphthongs (aɪ, ɔɪ) as one or two, rhotic (r-colored) vowels as one or two, xenophones (phones only used in foreign words)
Use info about morphology, stress, syllabification, (foreign) language
Some languages easier (more phonemic) than others
English not very phonemic…
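For a fairly phonemic orthography, LTS can be approximated by greedy longest-match rules; the sketch below uses a few invented Spanish-like rules purely to show the mechanism. Real English LTS needs learned models plus the morphology/stress/syllabification information mentioned above:

```python
# Toy longest-match grapheme-to-phoneme conversion for a roughly phonemic
# orthography (rule table is a made-up, Spanish-like fragment).
RULES = {"ch": ["tʃ"], "ll": ["ʎ"], "qu": ["k"],
         "a": ["a"], "e": ["e"], "i": ["i"], "o": ["o"], "u": ["u"],
         "b": ["b"], "c": ["k"], "d": ["d"], "k": ["k"], "l": ["l"],
         "m": ["m"], "n": ["n"], "p": ["p"], "r": ["r"], "s": ["s"], "t": ["t"]}

def g2p(word):
    phones, i = [], 0
    while i < len(word):
        # Prefer the longest grapheme that matches at this position.
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phones.extend(RULES[chunk])
                i += length
                break
        else:
            i += 1  # skip letters not covered by the toy rule set
    return phones
```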
22. Linguistic view: Post-lexical effects Change canonical pronunciation into natural one
Depends on setting/context, i.e. not lexical
Speaker, e.g. dialect influence
Style, e.g. formal/colloquial
Sentence context (phrasing/pausing)
Adjacent phone of previous/next word; blocked by pause
Can be deletion, substitution or insertion
E.g. schwa deletion, vowel reduction, glottal stop insertion
Sometimes use/introduce additional elements in phone set
Silent phones e.g. ‘h’ in French: special symbol to prevent liaison
Les hommes /lez ɔm/ vs. les héros /le (h)ero/
Glottal stop (but is regular phone in some languages, e.g. German)
Syllabic consonants, e.g. even [ˈiːvn̩]
Unreleased plosives, e.g. stopped [stɒp̚t]
Liaison also depends on POS (and maybe syntax)
23. Linguistic view: parts-of-speech (POS) POS tag describes morphosyntactic properties of word
I see the tree
You spot a dog
*He *sleep
*sees
Including more/fewer properties increases/decreases tag set size
WSJ: 36+12 punctuation, BNC: 58 or 138
POS tagging: Assign each word a unique tag from tag set
Two stages
Determine possible POS (list, or frequency distribution) for each word
Lexicon
Prediction using
Capitalization
Full morphology or simple string prefixes/suffixes
Disambiguate whole sequence (i.e. through context)
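The two stages above can be sketched with a toy lexicon plus form-based guessing; stage 2 is reduced here to "pick the most frequent tag", where a real tagger would disambiguate the whole sequence with an HMM or CRF (lexicon entries and heuristics are invented for the example):

```python
# Sketch of two-stage POS tagging (toy lexicon; the real second stage
# would disambiguate the whole sequence in context, not word by word).
LEXICON = {"the": {"DET": 100}, "lives": {"VBZ": 30, "NNS": 20},
           "dog": {"NN": 50}, "saw": {"VBD": 40, "NN": 5}}

def possible_tags(word):
    """Stage 1: ambiguity class from the lexicon, else guess from the form."""
    if word.lower() in LEXICON:
        return LEXICON[word.lower()]
    if word[:1].isupper():
        return {"NNP": 1}       # capitalized unknown word: likely a name
    if word.endswith("ing"):
        return {"VBG": 1}       # simple string-suffix heuristic
    return {"NN": 1}

def tag(words):
    """Stage 2 placeholder: choose each word's most frequent tag."""
    return [max(possible_tags(w), key=possible_tags(w).get) for w in words]
```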
24. Linguistic view: shallow parsing/chunking Easier/more efficient than full parsing
Can be used as first step
Chunking: divide utterance into non-overlapping continuous word sequences
Precise definitions of desired sequences differ
But related to content/function word distinction
Function/closed class words: prepositions, conjunctions, articles, auxiliary verbs, …
Content/open class words: nouns, main verbs, adjectives, …
Simple: “Chinks ’n chunks algorithm” (Liberman&Church,1992)
Chinks: function words (+ tensed verb forms)
Chunks: content words (+ object pronouns)
Divide greedily into sequences consisting of {chink* chunk*}
Relies on POS tagger (or lists of closed class words)
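The greedy division into {chink* chunk*} sequences can be sketched as follows; the chink word list stands in for the POS tagger / closed-class lists a real implementation would use, and is only a small illustrative fragment:

```python
# Sketch of the chinks 'n chunks division (Liberman & Church, 1992 style):
# greedily group words into maximal {chink* chunk*} sequences.
# The chink list below is a toy stand-in for real closed-class word lists.
CHINKS = {"the", "a", "an", "of", "to", "in", "and", "it", "that"}

def chunk(words):
    groups, current, seen_chunk = [], [], False
    for w in words:
        is_chink = w.lower() in CHINKS
        # A chink arriving after a chunk word closes the current group.
        if is_chink and seen_chunk:
            groups.append(current)
            current, seen_chunk = [], False
        current.append(w)
        if not is_chink:
            seen_chunk = True
    if current:
        groups.append(current)
    return groups
```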
25. Linguistic view: syntactic structure Parsing: Assign syntactic structure to whole utterance
Phrase structure
I slept
John will sleep
The woman saw the tree
A very big dog with shaggy fur gave him a book
waited for Mary
quickly left
thought that this was not a good idea
noun phrase (NP) verb phrase (VP)
Head: determines syntactic properties of phrase
Syntactic/grammatical function/role
E.g. NP above is subject of sentence, VP is predicate
26. Linguistic view: syntactic structure Phrase structure
S
VP
NP NP NP
The woman gave him a book
Dependency structure
The woman gave him a book
27. Linguistic view: syntax and TN, pronunciation TN disambiguation
“the 1880 edition”: number cannot be quantifier if noun is singular
TN expansion
$5 → five dollars, $5 bill → five dollar bill: latter is modifier
POS-based homograph disambiguation
PLE
Reducing function word “that”, but only if it is conjunction, not demonstrative
E.g. “I thought that that is the case”
u-u, r-r, u-r, r-u (u = unreduced, r = reduced “that”)
French liaison
28. Linguistic view: syntax and prosody Prominence/pitch accent
Function words typically deaccented, content words often accented
In sequences of content words, more fine-grained distinctions
E.g. noun-noun: “music lessons”, name-name: “John Doe”, adjective-noun: “weekly lessons”; but exceptions…
Accented syllables louder, longer, more F0 movement
Different types of pitch accents sometimes distinguished
Prosodic phrasing/chunking, boundary tones
Phrase boundary (type) has effect on duration (pre-boundary lengthening) and F0 (boundary tone, e.g. rising/falling), might evoke pause
Pausing, e.g. common before conjunctions/sub-clauses
Also pausing for emphasis/effect…
Example
One day | it occurred to me |¦ that it had been many years |¦ since the world | had been afforded | the spectacle | of a man adventurous enough |¦ to undertake | a journey | through Europe | on foot.
29. Linguistic view: Semantics, pragmatics Disambiguation of same-POS homographs
Prediction of some types of prominence
Emphasis, contrast, given/new; FRENCH teachers vs. French TEACHERS
Prediction of some types of pauses
Prediction of emotion/style from text
30. System architecture view: components Paragraph/utterance/token splitting
Text normalization (numbers, symbols, abbreviations)
Lexicon lookup (and if necessary disambiguation) or
LTS, syllabification, stress prediction, morphological analysis; acronym code
Note: lexicon is component in its own right:
Good lexicon coverage prevents prediction errors
Syntactic analysis
POS tagging
Parsing
Sentence type (e.g. yes/no question vs. wh-question)
Prosodic analysis
Prosodic phrasing
Pause assignment
Pitch accent assignment
Post-lexical effects
Special, e.g. dialogue act, identify foreign language inclusions, style/emotion, direct speech, character
31. System architecture view: components, order Not all components in all systems
Very phonemic languages might not need phone level/LTS
Some languages don’t have lexical stress (French), some mark it in orthography (Spanish)
Some languages might not need PLE
HMM itself could model some PLE (depends on set-up)
Can do approximations of morphology, POS tagging, phrasing, prominence
Parser rare
Essential(?): unit splitting, TN, pause, some sort of lexicon
No general semantics
Typical architecture: long pipeline of components
But interactions to consider:
End of utterance detection and abbreviation expansion
POS tagging and pronunciation lookup (homographs)
Syntax and TN
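The long pipeline can be sketched as a chain of stages over a shared utterance structure; stage names echo the component list above, the dict representation and stub implementations are invented for illustration (real front-ends pass much richer structures, and the interactions just listed are exactly what a strict pipeline makes awkward):

```python
# Minimal pipeline sketch: each stage reads and extends a shared
# "utterance" dict (stage bodies are stubs for illustration).
def split_tokens(utt):
    utt["tokens"] = utt["text"].split()
    return utt

def normalize(utt):
    # Toy TN stage: expand one abbreviation.
    utt["tokens"] = ["Mister" if t == "Mr." else t for t in utt["tokens"]]
    return utt

PIPELINE = [split_tokens, normalize]  # ... POS tagging, phrasing, LTS, PLE

def analyse(text):
    utt = {"text": text}
    for stage in PIPELINE:
        utt = stage(utt)
    return utt
```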
32. Machine learning view Machine learning can be used for all components
But manual rules often still used for TN, pausing, PLE…
Usual choice of supervised, semi-supervised, unsupervised machine learning approaches
Supervised if annotated data available
Very few annotated corpora for TN training
No annotated data for some languages
Lots of manual feature engineering
To accommodate various information sources
33. Machine learning view Supervised machine learning
Classification (Decision trees, SVMs etc.)
Prediction
POS ambiguity class, SMS speak, emotion
Disambiguation
End of utterance, number disambiguation, (semantic) homographs
Sequence
prediction/disambiguation (Classifiers using left/right contexts+previous decisions, HMMs, CRFs etc.)
LTS
need to align data first: one letter to 0, 1 or more phones (EM)
Features: letter context, POS, stress, syllabification, language
POS disambiguation
34. Machine learning view Supervised machine learning (continued)
Classification
Sequence
prediction/disambiguation
…
prosodic boundaries, pauses
Most are “no boundary/pause” → uninformative context → use feature: distance from last boundary/pause (Schmid & Atterer 2004)
Issues: Optional pauses, wrong pause worse than missed
Transformation (e.g. Transformation-based learning: TBL)
TN expansion, e.g. “$40.4m” → “forty point four million dollars”
PLE using TBL (Webster et al. 2005)
Trees: Parsing
Reduce to classification problem, e.g. shift-reduce parsing
Or use dedicated algorithm, e.g. find maximum spanning tree in directed graph (McDonald et al. 2005)
35. Machine learning view Semi-supervised
Because many components domain-dependent
TN disambiguation, POS tagging, parsing
E.g. adjective “live” much more likely in TV/music domain
But annotated data only available for selected domains
Tagging/parsing general approach
Get unannotated data from new domain
Process with tagger/parser trained on old domain
Add resulting data to training data
Or for HMM tagging if only POS ambiguity classes (i.e. lexicon) given
Use maximum likelihood estimation, or fully Bayesian approach, to get probabilities (e.g. Dufour&Favre 2010)
Speaker-specific training/adaptation of e.g. prominence, pause, phrase prediction???
36. Machine learning view Unsupervised
Examples
For end-of-utterance detection: Build list of possible abbreviations from large corpus by recording which letter strings occur only/not only followed by period
For abbreviation expansion: Find unabbreviated form that starts with abbreviation string and occurs in similar contexts
Syllabification: Learn probabilities of syllable beginnings/endings from beginnings/endings of whole words, then find most likely division into syllables
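The first of these unsupervised ideas, collecting abbreviation candidates, fits in a few lines; a sketch under the assumption that the corpus is already whitespace-tokenized with periods kept attached to words:

```python
# Sketch of the unsupervised abbreviation heuristic described above:
# strings seen in a large corpus only with a trailing period become
# abbreviation candidates; strings also seen bare (like "Harris") do not.
def abbreviation_candidates(tokens):
    with_period, without_period = set(), set()
    for tok in tokens:
        if tok.endswith(".") and len(tok) > 1:
            with_period.add(tok[:-1])
        else:
            without_period.add(tok)
    return with_period - without_period
```

On real text this overgenerates at true sentence ends (any sentence-final word looks like a candidate), so frequencies rather than raw set membership would be needed.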
37. Machine learning view: train/test material Text analysis …
based only on text at synthesis time
but for annotation of training data 2-3 possibilities
From text only, automatically
From text and audio, automatically
For phone, pause, chunk, prominence
From text (and audio if applicable), by hand
Trade-off
What is best (given circumstances)?
2. and 3. allow training of speaker-specific prediction
38. Machine learning view: Evidence Webster et al., Interspeech 2010
Comparing 1. (automatic) with 3. (manual) for annotation of phone sequence in training data
Phone boundaries in both cases through forced alignment (i.e. 2.)
As pilot study found no difference to use of manual boundaries
Phone sequence at synthesis time predicted automatically (i.e. 1.)
Automatic prediction through application of (manually written) PLE rules to canonical pronunciations
No significant difference between preference for automatic vs. manual phone sequence system (41% vs. 43%, 15 listeners, p=0.42)
Only one, quite large (>2000 sentences) corpus used
39. Machine learning view: “higher” information Some linguistic analysis needed by other text analysis components, e.g. POS for homograph disambiguation
But pass output to back-end (duration, F0, spectrum) as well
Useful? Confusing? Overfitting?
Depends on mechanism to prevent overfitting?
Others only for use by back-end (phrasing, prominence)
Typically, the “higher” the linguistic processing, the harder → lower accuracy/agreement (for humans as well as machines!)
Is it worth doing?
Depends on annotation approach for training data
Depends on mechanism to deal with rich, weak contexts?
40. Machine learning view: Evidence Watts et al., Interspeech 2010
Trained 12 HMM-TTS systems, combinations of
Annotation of training/test data ‘gold’ (3.) or ‘auto’ (1.): g-g, g-a, a-a
Context features (questions) used: lexical, +POS, +phrase, +ToBI
Showed how removing “higher level” features results in more “surrogate” questions about “lower level” features used in trees
Evaluated differences in synthesis quality by preference tests
Levels only compared for gold-gold system (unrealistic)
Only 8 listeners
Only one preference significant: lexical-only vs. all features
But trend: fewer higher level features = lower preference
But hypothesize that effect for “auto” even smaller
Trend: ‘gold’ better than ‘auto’
41. Machine learning view From output of text analysis to context features
Still some freedom
E.g. “use POS”: of current word only, or also preceding/following?
Also: influence of decision tree questions
E.g. “POS feature”: identity question only, or clustering?
42. Conclusion Text analysis for speech synthesis
Complex decisions using variety of information
Rich field for application of machine learning
Importance of lexical coverage and/or domain adaptation
Future
Much information used by humans still not available
Diversity of components and methods: streamline?
Further research needed into what works best in which exact circumstances
Corpus size, language, domain, style, sentence length/complexity, phone/tag set, manual or automatic annotation, its accuracy, other features used, …
43. Linguistic view: punctuation, syntax and prosody