1. Text analysis for speech synthesisStatistical Speech Synthesis Seminar, Cambridge, UK13 April 2011Sabine BuchholzToshiba Research Europe Ltd., Cambridge Research Lab, Cambridge, UK
2. What is meant by text analysis? Various names for related concepts, various meanings for these names
Text analysis/processing/decoding, TTS front-end, linguistic processing
Here: From text to context-dependent label sequence
See 2 of Heiga’s slides
Plus set of potential questions
Ideal: Should provide all information relevant to synthesis of this text
Phone sequence, including pauses
Boundaries of higher level units: syllables, words, utterances etc.
Additional information about units: Vowel? Stressed? Emphasized? Content word? Subject? Wh-question? Direct speech? Angry? etc.
1-best solution? Phone lattice? Probability e.g. for emphasis?
4. Composition of sentence HMM for given text
5. Main focus of this presentation, overview Text analysis at synthesis time (not training)
But will discuss training/synthesis mismatch
Text analysis for European languages
Others might be mentioned in passing
Overview of all aspects of text analysis rather than in-depth study of certain sub-problems
Views
Naïve view
Linguistic view
Code component view
Machine learning view
6. Naïve view What is text?
What information is available?
What decision do humans make?
What information do humans use?
7. Example: Text of an audiobook (HTML) <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
…
<body>
<h2> A TRAMP ABROAD BY MARK TWAIN, Part 1 </h2>
<a href="p2.htm">Next Part</a>
<a href="119-h.htm">Main Index</a>
<h1> A TRAMP ABROAD, Part 1</h1>
<h2> By Mark Twain </h2>
<a name="ch1"></a><center><h2> CHAPTER I </h2>
<h3> [The Knighted Knave of Bergen] </h3></center>
<p> One day it occurred to me that it had been many years since the world …
Text might contain mark-up (sometimes looks like text, e.g. ‘>’ in email)
some potentially relevant
e.g. <h1>, <p>, <em> also: SSML, SAPI
some should be translated to text
e.g. <a href=…>
some should be ignored
e.g. <center>
Need mark-up specific preprocessing
8. Example: Text of an audiobook (plain text) The Project Gutenberg EBook of A Tramp Abroad, by Mark Twain
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Title: A Tramp Abroad
Complete
Author: Mark Twain (Samuel Clemens)
Release Date: June 3, 2009 [EBook #119]
Language: English
Character set encoding: ASCII
*** START OF THIS PROJECT GUTENBERG EBOOK A TRAMP ABROAD ***
9. Example: Text of an audiobook Produced by Anonymous Volunteers, John Greenman and David Widger
A TRAMP ABROAD, Part 1
By Mark Twain
(Samuel L. Clemens)
First published in 1880
Illustrations taken from an 1880 First Edition
* * * * * *
ILLUSTRATIONS:
1. PORTRAIT OF THE AUTHOR
2. TITIAN'S MOSES
…
10. Example: Text of an audiobook CONTENTS
CHAPTER I A Tramp over Europe--On the
Holsatia--Hamburg--Frankfort-on-the- Main--How it Won its Name--A Lesson
in Political Economy--Neatness in Dress--Rhine Legends--"The Knave
of Bergen" The Famous Ball--The Strange Knight--Dancing with the
Queen--Removal of the Masks--The Disclosure--Wrath of the Emperor--The
Ending
…
CHAPTER I
[The Knighted Knave of Bergen]
11. Example: Text of an audiobook One day it occurred to me that it had been many years since the world
had been afforded the spectacle of a man adventurous enough to undertake
a journey through Europe on foot. After much thought, I decided that
I was a person fitted to furnish to mankind this spectacle. So I
determined to do it. This was in March, 1878.
I looked about me for the right sort of person to accompany me in the
capacity of agent, and finally hired a Mr. Harris for this service.
It was also my purpose to study art while in Europe. Mr. Harris was in
sympathy with me in this. He was as much of an enthusiast in art as
I was, and not less anxious to learn to paint. I desired to learn the
German language; so did Harris.
Toward the middle of April we sailed in the HOLSATIA, Captain Brandt,
and had a very pleasant trip, indeed.
12. How do humans make these decisions? End of utterance?
… and finally hired a Mr. Harris for this service.
… he had not thrown himself away--he had gained a woman of 10,000 l. or thereabouts; …
On Sun. John W. Doersch in Town Centerville, received the news that his son Otto Doersch had died ...
Known for her salty language and accordion playing, Yahoo! CEO Carol Bartz isn't your typical Internet executive.
To Terence's demand, "She seems to be better?" he replied, looking at him in an odd way, "She has a chance of life."
Roman numerals
CHAPTER I
Symbols
#119
13. How do humans make these decisions? Numbers
“June 3, 2009”
“in 1880”
“an 1880 First Edition”
Lists
“Police received 137 999 calls”
Capitalization
Nice is the fifth most populous city in France, ...
IT DEPARTMENT HEAD/MANAGER
lets go to it department!!!
Charlemagne, while chasing the Saxons (as HE said), or being chased by them (as THEY said), …
14. How do humans make these decisions? Words
One day it occurred to me
Not as easy in some other languages, e.g. Japanese, Chinese
Similar problem in URLs, e.g.
www.abeautifulmind.com/ www.iwantcandy.com/
Sometimes more needed to identify word
polishin’
2morrow, tmrrw
Latter similar to Hebrew/Arabic writing
café/cafe
Much bigger problem for e.g. French: passé/passe have different pronunciation
German: “ä”, “ö”, “ü”, “ß” can be replaced by “ae”, “oe”, “ue”, “ss”
Frankfort-on-the- Main
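The German fallback spellings mentioned above go through a trivial mapping in the umlaut-to-ASCII direction; a minimal sketch (the function and table names are invented here), noting that the reverse direction (e.g. "ae" → "ä") is ambiguous and needs context:

```python
# Map German umlauts/ß to their ASCII fallback spellings, so that e.g.
# "Müller" and "Mueller" can be matched against the same lexicon entry.
# The reverse mapping (ae -> ä) is ambiguous and is not attempted here.
ASCII_FALLBACK = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
                  "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def ascii_fallback(text: str) -> str:
    return "".join(ASCII_FALLBACK.get(ch, ch) for ch in text)
```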
15. How do humans make these decisions? Pronunciations
Most words
He lives/their lives; bass guitar/bass fishing
“neatness” “leafbuds”
Very common in some languages, e.g. German compounding
Known allowed patterns
e.g. {name|noun|adj|verb}-noun but not {…}-name
“HOLSATIA”
“Brandt”
Acronyms: ASCII , JUD
Chunks/pauses
One day it occurred to me that it had been many years since the world had been afforded the spectacle of a man adventurous enough to undertake a journey through Europe on foot.
16. How do humans make these decisions? Emotions
“…” he shouted angrily.
I wish I had something for Joe.
"I have told you, have I not, that you are too late? There is another, and if I have not promised to marry him at once, at least I can promise no one else."
Emphasis/prominence
After much thought, I decided that I was a person fitted to furnish to mankind this spectacle. So I determined to do it.
Style/expression
One day it occurred to me that it had been many years since the world had been afforded the spectacle of a man adventurous enough to undertake a journey through Europe on foot. After much thought, I decided that I was a person fitted to furnish to mankind this spectacle.
17. How do humans make these decisions? Summary
Many different sources of information
Prior knowledge: known names, abbreviations, expressions, etc.
Form of token itself, form of immediate context
Words in immediate/sentence/next lines/whole text context
Also classes of words (e.g. month name)
Expressions in context
Position in text
Which interpretation results in more likely sentence
Text type, domain
Experience/understanding
18. Linguistic view: Text normalization Replace “non-standard words” (NSW) by standard ones
Abbreviations, numbers, symbols, sometimes acronyms
Several parts
Identification of potential candidates, ambiguity class
Not always trivial: e.g. not all abbreviations have periods
Disambiguation (if necessary)
Expansion
Can be very simple: Mr. → Mister
Or more complex: “$40.4m” → “forty point four million dollars”
Might need more information to get case/gender/number/etc. right
E.g. German
Der 1. Versuch (the first trial) → erste
Ein 1. Versuch (a first trial) → erster
Ein 1. Experiment (a first experiment) → erstes
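The identification/expansion split above can be sketched as a toy expander for a couple of the slide's examples; a minimal illustration only, not a real TN module (the abbreviation table, the regex, and the number-word coverage are invented for the example and handle only small round numbers):

```python
import re

# Toy text-normalization expansion (tables and coverage are illustrative).
ABBREV = {"Mr.": "Mister", "Dr.": "Doctor"}
ONES = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
        "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
TENS = {"2": "twenty", "3": "thirty", "4": "forty", "5": "fifty",
        "6": "sixty", "7": "seventy", "8": "eighty", "9": "ninety"}

def spell_number(s: str) -> str:
    """Spell out a one/two-digit number or a decimal like '40.4'.
    (Teens and larger numbers are out of scope for this sketch.)"""
    if "." in s:
        whole, frac = s.split(".", 1)
        return spell_number(whole) + " point " + " ".join(ONES[d] for d in frac)
    if len(s) == 2 and s[1] == "0":
        return TENS[s[0]]
    if len(s) == 2:
        return TENS[s[0]] + "-" + ONES[s[1]]
    return ONES[s]

def expand(token: str) -> str:
    if token in ABBREV:                      # simple table-driven expansion
        return ABBREV[token]
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)m", token)
    if m:                                    # pattern-driven expansion
        return spell_number(m.group(1)) + " million dollars"
    return token
```

A real system would also need the German-style agreement information shown above, which a token-local expander like this cannot supply.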
19. Linguistic view Homographs (heteronyms)
same spelling but different meanings (with different pronunciations)
Can have different part-of-speech (POS), e.g. “live” (verb/adj), “lives” (verb/noun)
Normally ambiguity between: noun, verb, adjective
Exceptions: I, A, read, …
Resolved by POS tagging
Or not: row, bow, bass, axes, …
Resolved by Word Sense Disambiguation
20. Linguistic view: Lexical information Information about words
Orthography
Including capitalization: it, IT
Can include non-letters: Yahoo!, 2morrow
Pronunciation(s)
Phone sequence (canonical pronunciation), syllabification
If appropriate for language
Lexical stress (primary/secondary): ˌfundaˈmental
Lexical tone (e.g. in Chinese)
Part(s)-of-speech; if necessary linked to different pronunciations
Lexicon defines ambiguity classes
“Nice” (/niːs/, name) and “nice” (/naɪs/, adjective)
Ideally with frequency information (“nice” generally more frequent)
Ideally domain-specific, if TTS domain known
Other: is-abbreviation, gender, frequency, senses, association with emotions, language of origin, semantic class, …
21. Linguistic view: Pronunciation prediction Letter-to-sound (LTS)/Grapheme-to-phoneme (G2P)
Map sequence of letters into sequence of phones
Phones: elements of phone set
Much agreement, minor variations
details not only guided by phonetics/phonology but also practical considerations
Can humans/systems reliably find boundary/make distinction?
Size of phone set
Examples of differences: diphthongs (aɪ, ɔɪ) as one or two, rhotic (r-colored) vowels as one or two, xenophones (phones only used in foreign words)
Use info about morphology, stress, syllabification, (foreign) language
Some languages easier (more phonemic) than others
English not very phonemic…
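For a fairly phonemic orthography, LTS can be approximated by greedy longest-match rules; the sketch below uses a few invented Spanish-like rules purely to show the mechanism. Real English LTS needs learned models plus the morphology/stress/syllabification information mentioned above:

```python
# Toy longest-match grapheme-to-phoneme conversion for a roughly phonemic
# orthography (rule table is a made-up, Spanish-like fragment).
RULES = {"ch": ["tʃ"], "ll": ["ʎ"], "qu": ["k"],
         "a": ["a"], "e": ["e"], "i": ["i"], "o": ["o"], "u": ["u"],
         "b": ["b"], "c": ["k"], "d": ["d"], "k": ["k"], "l": ["l"],
         "m": ["m"], "n": ["n"], "p": ["p"], "r": ["r"], "s": ["s"], "t": ["t"]}

def g2p(word):
    phones, i = [], 0
    while i < len(word):
        # Prefer the longest grapheme that matches at this position.
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phones.extend(RULES[chunk])
                i += length
                break
        else:
            i += 1  # skip letters not covered by the toy rule set
    return phones
```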
22. Linguistic view: Post-lexical effects Change canonical pronunciation into natural one
Depends on setting/context, i.e. not lexical
Speaker, e.g. dialect influence
Style, e.g. formal/colloquial
Sentence context (phrasing/pausing)
Adjacent phone of previous/next word; blocked by pause
Can be deletion, substitution or insertion
E.g. schwa deletion, vowel reduction, glottal stop insertion
Sometimes use/introduce additional elements in phone set
Silent phones e.g. ‘h’ in French: special symbol to prevent liaison
Les hommes /lez ɔm/ vs. les héros /le (h)ero/
Glottal stop (but is regular phone in some languages, e.g. German)
Syllabic consonants, e.g. even [ˈiːvn̩]
Unreleased plosives, e.g. stopped [stɒp̚t]
Liaison also depends on POS (and maybe syntax)
23. Linguistic view: parts-of-speech (POS) POS tag describes morphosyntactic properties of word
I see the tree
You spot a dog
*He *sleep
*sees
Including more/fewer properties increases/decreases tag set size
WSJ: 36+12 punctuation, BNC: 58 or 138
POS tagging: Assign each word a unique tag from tag set
Two stages
Determine possible POS (list, or frequency distribution) for each word
Lexicon
Prediction using
Capitalization
Full morphology or simple string prefixes/suffixes
Disambiguate whole sequence (i.e. through context)
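The two stages above can be sketched with a toy lexicon plus form-based guessing; stage 2 is reduced here to "pick the most frequent tag", where a real tagger would disambiguate the whole sequence with an HMM or CRF (lexicon entries and heuristics are invented for the example):

```python
# Sketch of two-stage POS tagging (toy lexicon; the real second stage
# would disambiguate the whole sequence in context, not word by word).
LEXICON = {"the": {"DET": 100}, "lives": {"VBZ": 30, "NNS": 20},
           "dog": {"NN": 50}, "saw": {"VBD": 40, "NN": 5}}

def possible_tags(word):
    """Stage 1: ambiguity class from the lexicon, else guess from the form."""
    if word.lower() in LEXICON:
        return LEXICON[word.lower()]
    if word[:1].isupper():
        return {"NNP": 1}       # capitalized unknown word: likely a name
    if word.endswith("ing"):
        return {"VBG": 1}       # simple string-suffix heuristic
    return {"NN": 1}

def tag(words):
    """Stage 2 placeholder: choose each word's most frequent tag."""
    return [max(possible_tags(w), key=possible_tags(w).get) for w in words]
```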
24. Linguistic view: shallow parsing/chunking Easier/more efficient than full parsing
Can be used as first step
Chunking: divide utterance into non-overlapping continuous word sequences
Precise definitions of desired sequences differ
But related to content/function word distinction
Function/closed class words: prepositions, conjunctions, articles, auxiliary verbs, …
Content/open class words: nouns, main verbs, adjectives, …
Simple: “Chinks ’n chunks algorithm” (Liberman&Church,1992)
Chinks: function words (+ tensed verb forms)
Chunks: content words (+ object pronouns)
Divide greedily into sequences consisting of {chink* chunk*}
Relies on POS tagger (or lists of closed class words)
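The greedy division into {chink* chunk*} sequences can be sketched as follows; the chink word list stands in for the POS tagger / closed-class lists a real implementation would use, and is only a small illustrative fragment:

```python
# Sketch of the chinks 'n chunks division (Liberman & Church, 1992 style):
# greedily group words into maximal {chink* chunk*} sequences.
# The chink list below is a toy stand-in for real closed-class word lists.
CHINKS = {"the", "a", "an", "of", "to", "in", "and", "it", "that"}

def chunk(words):
    groups, current, seen_chunk = [], [], False
    for w in words:
        is_chink = w.lower() in CHINKS
        # A chink arriving after a chunk word closes the current group.
        if is_chink and seen_chunk:
            groups.append(current)
            current, seen_chunk = [], False
        current.append(w)
        if not is_chink:
            seen_chunk = True
    if current:
        groups.append(current)
    return groups
```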
25. Linguistic view: syntactic structure Parsing: Assign syntactic structure to whole utterance
Phrase structure
I slept
John will sleep
The woman saw the tree
A very big dog with shaggy fur gave him a book
waited for Mary
quickly left
thought that this was not a good idea
noun phrase (NP) verb phrase (VP)
Head: determines syntactic properties of phrase
Syntactic/grammatical function/role
E.g. NP above is subject of sentence, VP is predicate
26. Linguistic view: syntactic structure Phrase structure
S
VP
NP NP NP
The woman gave him a book
Dependency structure
The woman gave him a book
27. Linguistic view: syntax and TN, pronunciation TN disambiguation
“the 1880 edition”: number cannot be quantifier if noun is singular
TN expansion
$5 → five dollars, $5 bill → five dollar bill: latter is modifier
POS-based homograph disambiguation
PLE
Reducing function word “that”, but only if it is conjunction, not demonstrative
E.g. “I thought that that is the case”
u-u, r-r, u-r, r-u (u = unreduced, r = reduced “that”)
French liaison
28. Linguistic view: syntax and prosody Prominence/pitch accent
Function words typically deaccented, content words often accented
In sequences of content words, more fine-grained distinctions
E.g. noun-noun: “music lessons”, name-name: “John Doe”, adjective-noun: “weekly lessons”; but exceptions…
Accented syllables louder, longer, more F0 movement
Different types of pitch accents sometimes distinguished
Prosodic phrasing/chunking, boundary tones
Phrase boundary (type) has effect on duration (pre-boundary lengthening) and F0 (boundary tone, e.g. rising/falling), might evoke pause
Pausing, e.g. common before conjunctions/sub-clauses
Also pausing for emphasis/effect…
Example
One day | it occurred to me |¦ that it had been many years |¦ since the world | had been afforded | the spectacle | of a man adventurous enough |¦ to undertake | a journey | through Europe | on foot.
29. Linguistic view: Semantics, pragmatics Disambiguation of same-POS homographs
Prediction of some types of prominence
Emphasis, contrast, given/new; FRENCH teachers vs. French TEACHERS
Prediction of some types of pauses
Prediction of emotion/style from text
30. System architecture view: components Paragraph/utterance/token splitting
Text normalization (numbers, symbols, abbreviations)
Lexicon lookup (and if necessary disambiguation) or
LTS, syllabification, stress prediction, morphological analysis; acronym code
Note: lexicon is component in its own right:
Good lexicon coverage prevents prediction errors
Syntactic analysis
POS tagging
Parsing
Sentence type (e.g. yes/no question vs. wh-question)
Prosodic analysis
Prosodic phrasing
Pause assignment
Pitch accent assignment
Post-lexical effects
Special, e.g. dialogue act, identify foreign language inclusions, style/emotion, direct speech, character
31. System architecture view: components, order Not all components in all systems
Very phonemic languages might not need phone level/LTS
Some languages don’t have lexical stress (French), some mark it in orthography (Spanish)
Some languages might not need PLE
HMM itself could model some PLE (depends on set-up)
Can do approximations of morphology, POS tagging, phrasing, prominence
Parser rare
Essential(?): unit splitting, TN, pause, some sort of lexicon
No general semantics
Typical architecture: long pipeline of components
But interactions to consider:
End of utterance detection and abbreviation expansion
POS tagging and pronunciation lookup (homographs)
Syntax and TN
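The long pipeline can be sketched as a chain of stages over a shared utterance structure; stage names echo the component list above, the dict representation and stub implementations are invented for illustration (real front-ends pass much richer structures, and the interactions just listed are exactly what a strict pipeline makes awkward):

```python
# Minimal pipeline sketch: each stage reads and extends a shared
# "utterance" dict (stage bodies are stubs for illustration).
def split_tokens(utt):
    utt["tokens"] = utt["text"].split()
    return utt

def normalize(utt):
    # Toy TN stage: expand one abbreviation.
    utt["tokens"] = ["Mister" if t == "Mr." else t for t in utt["tokens"]]
    return utt

PIPELINE = [split_tokens, normalize]  # ... POS tagging, phrasing, LTS, PLE

def analyse(text):
    utt = {"text": text}
    for stage in PIPELINE:
        utt = stage(utt)
    return utt
```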
32. Machine learning view Machine learning can be used for all components
But manual rules often still used for TN, pausing, PLE…
Usual choice of supervised, semi-supervised, unsupervised machine learning approaches
Supervised if annotated data available
Very few annotated corpora for TN training
No annotated data for some languages
Lots of manual feature engineering
To accommodate various information sources
33. Machine learning view Supervised machine learning
Classification (Decision trees, SVMs etc.)
Prediction
POS ambiguity class, SMS speak, emotion
Disambiguation
End of utterance, number disambiguation, (semantic) homographs
Sequence
prediction/disambiguation (Classifiers using left/right contexts+previous decisions, HMMs, CRFs etc.)
LTS
need to align data first: one letter to 0, 1 or more phones (EM)
Features: letter context, POS, stress, syllabification, language
POS disambiguation
34. Machine learning view Supervised machine learning (continued)
Classification
Sequence
prediction/disambiguation
…
prosodic boundaries, pauses
Most are “no boundary/pause” → uninformative context → use feature: distance from last boundary/pause (Schmid & Atterer 2004)
Issues: Optional pauses, wrong pause worse than missed
Transformation (e.g. Transformation-based learning: TBL)
TN expansion, e.g. “$40.4m” → “forty point four million dollars”
PLE using TBL (Webster et al. 2005)
Trees: Parsing
Reduce to classification problem, e.g. shift-reduce parsing
Or use dedicated algorithm, e.g. find maximum spanning tree in directed graph (McDonald et al. 2005)
35. Machine learning view Semi-supervised
Because many components domain-dependent
TN disambiguation, POS tagging, parsing
E.g. adjective “live” much more likely in TV/music domain
But annotated data only available for selected domains
Tagging/parsing general approach
Get unannotated data from new domain
Process with tagger/parser trained on old domain
Add resulting data to training data
Or for HMM tagging if only POS ambiguity classes (i.e. lexicon) given
Use maximum likelihood estimation, or fully Bayesian approach, to get probabilities (e.g. Dufour&Favre 2010)
Speaker-specific training/adaptation of e.g. prominence, pause, phrase prediction???
36. Machine learning view Unsupervised
Examples
For end-of-utterance detection: Build list of possible abbreviations from large corpus by recording which letter strings occur only/not only followed by period
For abbreviation expansion: Find unabbreviated form that starts with abbreviation string and occurs in similar contexts
Syllabification: Learn probabilities of syllable beginnings/endings from beginnings/endings of whole words, then find most likely division into syllables
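The first of these unsupervised ideas, collecting abbreviation candidates, fits in a few lines; a sketch under the assumption that the corpus is already whitespace-tokenized with periods kept attached to words:

```python
# Sketch of the unsupervised abbreviation heuristic described above:
# strings seen in a large corpus only with a trailing period become
# abbreviation candidates; strings also seen bare (like "Harris") do not.
def abbreviation_candidates(tokens):
    with_period, without_period = set(), set()
    for tok in tokens:
        if tok.endswith(".") and len(tok) > 1:
            with_period.add(tok[:-1])
        else:
            without_period.add(tok)
    return with_period - without_period
```

On real text this overgenerates at true sentence ends (any sentence-final word looks like a candidate), so frequencies rather than raw set membership would be needed.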
37. Machine learning view: train/test material Text analysis …
based only on text at synthesis time
but for annotation of training data 2-3 possibilities
From text only, automatically
From text and audio, automatically
For phone, pause, chunk, prominence
From text (and audio if applicable), by hand
Trade-off
What is best (given circumstances)?
2. and 3. allow training of speaker-specific prediction
38. Machine learning view: Evidence Webster et al., Interspeech 2010
Comparing 1. (automatic) with 3. (manual) for annotation of phone sequence in training data
Phone boundaries in both cases through forced alignment (i.e. 2.)
As pilot study found no difference to use of manual boundaries
Phone sequence at synthesis time predicted automatically (i.e. 1.)
Automatic prediction through application of (manually written) PLE rules to canonical pronunciations
No significant difference between preference for automatic vs. manual phone sequence system (41% vs. 43%, 15 listeners, p=0.42)
Only one, quite large (>2000 sentences) corpus used
39. Machine learning view: “higher” information Some linguistic analysis needed by other text analysis components, e.g. POS for homograph disambiguation
But pass output to back-end (duration, F0, spectrum) as well
Useful? Confusing? Overfitting?
Depends on mechanism to prevent overfitting?
Others only for use by back-end (phrasing, prominence)
Typically, the “higher” the linguistic processing, the harder → lower accuracy/agreement (for humans as well as machines!)
Is it worth doing?
Depends on annotation approach for training data
Depends on mechanism to deal with rich, weak contexts?
40. Machine learning view: Evidence Watts et al., Interspeech 2010
Trained 12 HMM-TTS systems, combinations of
Annotation of training/test data ‘gold’ (3.) or ‘auto’ (1.): g-g, g-a, a-a
Context features (questions) used: lexical, +POS, +phrase, +ToBI
Showed how removing “higher level” features results in more “surrogate” questions about “lower level” features used in trees
Evaluated differences in synthesis quality by preference tests
Levels only compared for gold-gold system (unrealistic)
Only 8 listeners
Only one preference significant: lexical-only vs. all features
But trend: fewer higher level features = lower preference
But hypothesize that effect for “auto” even smaller
Trend: ‘gold’ better than ‘auto’
41. Machine learning view From output of text analysis to context features
Still some freedom
E.g. “use POS”: of current word only, or also preceding/following?
Also: influence of decision tree questions
E.g. “POS feature”: identity question only, or clustering?
42. Conclusion Text analysis for speech synthesis
Complex decisions using variety of information
Rich field for application of machine learning
Importance of lexical coverage and/or domain adaptation
Future
Much information used by humans still not available
Diversity of components and methods: streamline?
Further research needed into what works best in which exact circumstances
Corpus size, language, domain, style, sentence length/complexity, phone/tag set, manual or automatic annotation, its accuracy, other features used, …
43. Linguistic view: punctuation, syntax and prosody