Meaningful Intonational Variation
Today • Assigning variation for TTS, CTS • Contours • Accent • Phrasing • Pitch Range • Amplitude and timing
TTS Production Pipeline • Orthographic input: Dr. Smith lives on Elm Dr. • Text normalization: abbreviation expansion… • Pronunciation modeling: POS id, WS disambiguation • Intonation assignment: parsing, POS id, robust semantics… • Phonetic/phonological realization: phonological parsing, phonetic analysis • Unit selection: acoustic analysis
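The abbreviation-expansion step can be illustrated with a minimal sketch for the slide's example sentence. The heuristic (a following capitalized word signals the title "Doctor", anything else signals "Drive") and the name `expand_dr` are assumptions made for illustration, not the rule used by any particular TTS system. The heuristic fails, for instance, when "Dr." meaning "Drive" is followed by a new sentence.

```python
import re


def expand_dr(text: str) -> str:
    """Expand 'Dr.' to 'Doctor' before a capitalized word, otherwise to 'Drive'.

    Illustrative heuristic only (an assumption, not a production rule).
    """
    def repl(match: re.Match) -> str:
        following = text[match.end():].lstrip()
        if following[:1].isupper():
            return "Doctor"
        # At the very end of the text the period also closes the sentence,
        # so it is kept after the expansion.
        return "Drive." if not following else "Drive"

    return re.sub(r"\bDr\.", repl, text)


print(expand_dr("Dr. Smith lives on Elm Dr."))
```

The same period serves as abbreviation marker and sentence terminator, which is one reason normalization needs context rather than a plain lookup table.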
Intonation Assignment: Phrasing • Traditional: hand-built rules • Punctuation 234-5682 • Context/function word: no breaks after function word He went to dinner • Parse? She favors the nuts and bolts approach • Current: statistical analysis of large labeled corpus • Punctuation, pos window, utt length,…
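A hand-built phrasing rule set of the traditional kind might look like the sketch below. It assumes whitespace-tokenized input, a small hypothetical function-word list, and a hypothetical `max_len` threshold that proposes a break once a phrase grows long. Only the two rules named on the slide are modeled: break at punctuation, and never break after a function word.

```python
# Hypothetical, deliberately tiny function-word list for illustration.
FUNCTION_WORDS = {"a", "an", "the", "to", "of", "in", "on", "and", "but",
                  "he", "she", "it"}
PUNCTUATION = ",;:.?!"


def predict_breaks(tokens, max_len=5):
    """Return indices i such that a phrase break follows tokens[i]."""
    breaks = []
    since_break = 0
    # The boundary after the last token is the utterance end, so skip it.
    for i, tok in enumerate(tokens[:-1]):
        since_break += 1
        word = tok.rstrip(PUNCTUATION)
        has_punct = word != tok
        is_function = word.lower() in FUNCTION_WORDS
        # Rule 1: punctuation forces a break.
        # Rule 2: a long phrase may break, but never after a function word.
        if has_punct or (since_break >= max_len and not is_function):
            breaks.append(i)
            since_break = 0
    return breaks


print(predict_breaks("He went to dinner".split(), max_len=3))
print(predict_breaks("Well, how did he do it?".split()))
```

With `max_len=3` the length rule would fire after "to" in "He went to dinner", but the function-word rule blocks it, so no break is proposed.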
Functions of Phrasing • Disambiguates syntactic constructions, e.g. PP attachment: • S: You should buy the ticket with the discount coupon. • Disambiguates scope ambiguities, e.g. Negation: • S: You aren’t booked through Rome because of the fare. • Or modifier scope: • S: This fare is restricted to retired politicians and civil servants.
Intonation Assignment: Accent • Hand-built rules • Function/content distinction He went out the back door/He threw out the trash • Complex nominals: • Main Street/Park Avenue • city hall parking lot • Statistical procedures trained on large corpora • Contrastive stress, given/new distinction?
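The function/content rule can be sketched over part-of-speech tagged input. The sketch assumes Penn Treebank tags supplied by an upstream tagger, and it assumes the slide's example pair contrasts "out" as a preposition (IN, unaccented) with "out" as a verb particle (RP, accented). The tag sets and the name `assign_accents` are illustrative choices.

```python
# Tag prefixes treated as content words: nouns, verbs, adjectives, adverbs.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")
# Verb particles are accented here even though they look like prepositions.
ACCENTED_TAGS = {"RP"}


def assign_accents(tagged):
    """Map (word, tag) pairs to (word, accented?) pairs."""
    return [
        (word, tag.startswith(CONTENT_TAG_PREFIXES) or tag in ACCENTED_TAGS)
        for word, tag in tagged
    ]


preposition = [("He", "PRP"), ("went", "VBD"), ("out", "IN"),
               ("the", "DT"), ("back", "JJ"), ("door", "NN")]
particle = [("He", "PRP"), ("threw", "VBD"), ("out", "RP"),
            ("the", "DT"), ("trash", "NN")]

print([w for w, accented in assign_accents(preposition) if accented])
print([w for w, accented in assign_accents(particle) if accented])
```

The rule depends entirely on the tagger separating IN from RP. It does nothing for complex nominals, contrastive stress, or given/new status.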
Functions of Pitch Accent • Given/new information • S: Do you need a return ticket? • U: No, thanks, I don’t need a return. • Contrast (narrow focus) • U: No, thanks, I don’t need a RETURN…. (I need a time schedule, receipt,…) • Disambiguation of discourse markers • S: Now let me get you the train information. • U: Okay (thanks) vs. Okay….(but I really want…)
Intonation Assignment: Contours • Simple rules • ‘.’ = declarative contour • ‘?’ = yes-no-question contour unless wh-word present at/near front of sentence • Well, how did he do it? And what do you know? • What else might we do?
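The simple contour rules above can be sketched as follows. The `window` of three words standing in for "at/near front", the wh-word list, and the label names are assumptions. The slide does not say which contour a wh-question receives, so the sketch returns a separate label instead of guessing.

```python
WH_WORDS = {"who", "whom", "whose", "what", "when", "where", "why",
            "which", "how"}


def choose_contour(sentence, window=3):
    """Pick a contour label from final punctuation and early wh-words."""
    words = [w.strip(",.?!;:").lower() for w in sentence.split()]
    if sentence.rstrip().endswith("?"):
        # A wh-word at or near the front blocks the yes-no-question contour.
        if any(w in WH_WORDS for w in words[:window]):
            return "wh-question"
        return "yes-no-question"
    return "declarative"


for s in ["Well, how did he do it?", "And what do you know?",
          "Do you want to fly first class?", "He went to dinner."]:
    print(s, "->", choose_contour(s))
```

Both slide examples ("Well, how…" and "And what…") need the window, because the wh-word is not sentence-initial.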
Contours: Accent + Phrasing • What do intonational contours ‘mean’ (Ladd ‘80, Bolinger ‘89)? • Speech acts (statements, questions, requests) S: That’ll be credit card? (L* H- H%) • Propositional attitude (uncertainty, incredulity) S: You’d like an evening flight.(L*+H L- H%) • Speaker affect (anger, happiness, love) U: I said four SEVEN one! (L+H* L- L%) • “Personality” S: Welcome to the Sunshine Travel System.
Propositional attitude (uncertainty) Did you feed the animals? I fed the L*+H goldfish L-H% • Distinguish direct/indirect speech acts • Can you open the door?
The TTS Front End Today • Corpus-based statistical methods instead of hand-built rule-sets • Dictionaries instead of rules (but fall-back to rules) • Modest attempts to infer contrast, given/new • Text analysis tools: pos tagger, morphological analyzer, little parsing
TTS: Where are we now? • Natural sounding speech for some utterances • Where good match between input and database • Still…hard to vary prosodic features and retain naturalness • Yes-no questions: Do you want to fly first class? • Context-dependent variation still hard to infer from text and hard to realize naturally:
Appropriate contours from text • Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me. • Variation in pitch range, rate, pausal duration to convey topic structure • Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative…. • How to mimic real voices?
TTS vs. CTS • Decisions in Text-to-Speech (TTS) depend on syntax, information status, topic structure,… information explicitly available to NLG • Concept-to-Speech (CTS) systems should be able to specify “better” prosody: the system knows what it wants to say and can specify how • But… generating prosody for CTS isn’t so easy
To(nes and) B(reak) I(ndices) • Developed by prosody researchers in four meetings over 1991-94 • Goals: • devise common labeling scheme for Standard American English that is robust and reliable • promote collection of large, prosodically labeled, shareable corpora • ToBI standards also proposed for Japanese, German, Italian, Spanish, British and Australian English,…
Minimal ToBI transcription: • recording of speech • f0 contour • ToBI tiers: • orthographic tier: words • break-index tier: degrees of junction (Price et al ‘89) • tonal tier: pitch accents, phrase accents, boundary tones (Pierrehumbert ‘80) • miscellaneous tier: disfluencies, non-speech sounds, etc.
Online training material, available at: • http://www.ling.ohio-state.edu/phonetics/ToBI/ • Evaluation • Good inter-labeler reliability for expert and naive labelers: 88% agreement on presence/absence of tonal category, 81% agreement on category label, 91% agreement on break indices to within 1 level (Silverman et al. ‘92, Pitrelli et al. ‘94)
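Agreement figures of this kind can be computed pairwise over labelers. The sketch below is one common way to do it and is not necessarily the exact procedure of the cited studies. The break-index values are invented for illustration, and `tolerance=1` corresponds to "within 1 level".

```python
from itertools import combinations


def pairwise_agreement(labels_by_labeler, tolerance=0):
    """Fraction of (labeler pair, word) comparisons that agree.

    labels_by_labeler: one equal-length list of integer break indices per
    labeler. Two labels agree when they differ by at most `tolerance`.
    """
    agree = total = 0
    for a, b in combinations(labels_by_labeler, 2):
        for x, y in zip(a, b):
            total += 1
            agree += abs(x - y) <= tolerance
    return agree / total


# Invented break indices from three labelers for a four-word utterance.
labels = [[1, 3, 4, 1],
          [1, 4, 4, 1],
          [1, 3, 4, 2]]
print(pairwise_agreement(labels))
print(pairwise_agreement(labels, tolerance=1))
```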
Pitch Accent/Prominence in ToBI • Which items are made intonationally prominent and how? • Accent type: • H* simple high (declarative) • L* simple low (ynq) • L*+H scooped, late rise (uncertainty/ incredulity) • L+H* early rise to stress (contrastive focus) • H+!H* fall onto stress (implied familiarity)
Downstepped accents: • !H*, • L+!H*, • L*+!H • Degree of prominence: • within a phrase: HiF0 • across phrases
Prosodic Phrasing in ToBI • ‘Levels’ of phrasing: • intermediate phrase: one or more pitch accents plus a phrase accent (H- or L- ) • intonational phrase: 1 or more intermediate phrases + boundary tone (H% or L% ) • ToBI break-index tier • 0 no word boundary • 1 word boundary • 2 strong juncture with no tonal markings • 3 intermediate phrase boundary • 4 intonational phrase boundary
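The break-index tier lends itself to a simple data representation: one integer per word, giving the juncture after that word. The sketch below groups words into intermediate phrases (index 3 or higher) or intonational phrases (index 4). The break values in the example are hypothetical, not taken from a labeled recording, and `split_phrases` is an illustrative name.

```python
def split_phrases(words, breaks, level):
    """Group words into phrases that end wherever the break index >= level."""
    phrases, current = [], []
    for word, brk in zip(words, breaks):
        current.append(word)
        if brk >= level:
            phrases.append(current)
            current = []
    if current:  # trailing words with no closing boundary
        phrases.append(current)
    return phrases


words = "This fare is restricted to retired politicians and civil servants".split()
# Hypothetical labeling: a 3 after "politicians", a 4 at the utterance end.
breaks = [1, 1, 1, 1, 1, 1, 3, 1, 1, 4]

print(split_phrases(words, breaks, level=4))  # intonational phrases
print(split_phrases(words, breaks, level=3))  # intermediate phrases
```

With this labeling the utterance is one intonational phrase containing two intermediate phrases, the second being "and civil servants".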
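[Figure: example contours pairing the pitch accents H*, L*, L*+H with the phrase accent and boundary tone combinations L-L%, L-H%, H-L%, H-H%]
[Figure: example contours pairing the pitch accents L+H*, H+!H*, H* !H* with the phrase accent and boundary tone combinations L-L%, L-H%, H-L%, H-H%]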
Contour Examples • http://www.cs.columbia.edu/~julia/cs6998/cards/examples.html
And Other Things Contribute: Pitch Range and Timing (Rate, Pause) • Level of speaker engagement Hello vs. HELLO • Contour interpretation Rise/fall/rise (L*+H L-H%): Elephantiasis isn’t incurable • Discourse/topic structure: paratones
Corpus-Based Research • Predicting accent, phrasing, contours from large ToBI-labeled corpora • Features: • Word position, p.o.s. window, word co-occurrence, punctuation, capitalization, sentence length, paragraph position, … • Results: • ~80-85% correct accent prediction • ~92-96% correct phrase boundary prediction • Contours???? • Reality…
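A few of the listed features can be extracted per word boundary as in the sketch below, ready to feed any statistical classifier. The feature names, the padding symbol, the window size, and the Penn Treebank tags in the example are assumptions. Word co-occurrence and paragraph position are omitted.

```python
def boundary_features(tagged, i, window=2):
    """Features for the boundary after tagged[i].

    tagged: list of (word, pos_tag) pairs for one sentence, words non-empty.
    """
    n = len(tagged)
    word, _ = tagged[i]
    feats = {
        "position": i,
        "rel_position": (i + 1) / n,
        "sentence_length": n,
        "punct_after": word[-1] in ",;:.?!",
        "capitalized_next": i + 1 < n and tagged[i + 1][0][:1].isupper(),
    }
    # POS window around the boundary: offsets -1, 0, +1, +2 for window=2.
    for off in range(-window + 1, window + 1):
        j = i + off
        feats[f"pos[{off:+d}]"] = tagged[j][1] if 0 <= j < n else "PAD"
    return feats


tagged = [("I", "PRP"), ("don't", "VBP"), ("want", "VB"), ("cereal", "NN"),
          ("I", "PRP"), ("want", "VBP"), ("toast.", "NN")]
print(boundary_features(tagged, 3))
```

For the unpunctuated "I don’t want cereal I want toast", the boundary after "cereal" has no punctuation cue, so a classifier would have to rely on the POS window and the capitalized "I".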
This is my version of a rather long sentence which ideally should be broken into several phrases automatically by a smart system but we don't know if this will actually happen do we? • Is a yes-no question uttered with falling intonation? Does that sound delightful? Mellifluous? • I don’t want cereal I want toast. • ….
Next: • Story analysis and generation (readings will be available later this week – we’ll send mail)