Prosody dependent language modeling based on the correlation between prosody and syntax
Ken Chen and Mark Hasegawa-Johnson
IEEE ASRU 2003, 12/03/2003
A Bayesian network view of a speech utterance
[Figure: Bayesian network linking M, S, (W, P), (Q, H), and (X, Y) across the word, segmental, and frame levels]
• X: acoustic-phonetic observations
• Y: acoustic-prosodic observations
• Q: allophone sequence
• H: phone-level prosody sequence
• W: word sequence
• P: prosody sequence
• S: syntax
• M: meaning (including all high-level information)
Prosody dependent speech recognition framework
[Figure: recognition chain linking (X, Y), (Q, H), (W, P), S, and M]
• Advantages:
  • A natural extension of prosody-independent ASR (PI-ASR)
  • Allows convenient integration of useful linguistic knowledge at different levels
  • Flexible
Prosody modeled in our system
• Two types (ToBI labeled):
  • The pitch accent
  • The intonational phrase boundary (IPB)
• Both are highly correlated with acoustics and syntax:
  • Pitch accents: pitch excursions (H*, L*); encode syntactic information (e.g., the content/function word distinction)
  • IPBs: preboundary lengthening, boundary tones, pauses, etc.; highly correlated with syntactic phrase boundaries
Prosody tagged word transcription
• Prosody independent word transcription: well what is next
• Prosody dependent word transcription (obtained by tagging the prosody independent transcription with the corresponding ToBI labels): well_af what_am is_um next_af
• "a/u": accented/unaccented; "m/f": IP-medial/IP-final
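A minimal sketch of this tagging step, assuming parallel word, accent, and boundary label sequences; the helper and its data layout are hypothetical, not the authors' tooling:

```python
# Hypothetical tagging helper: combine each word with its ToBI-derived
# accent label ("a" = accented, "u" = unaccented) and boundary label
# ("m" = IP-medial, "f" = IP-final) to form prosody-tagged tokens.

def tag_words(words, accents, boundaries):
    """Produce prosody-tagged tokens such as 'what_am'."""
    return [f"{w}_{a}{b}" for w, a, b in zip(words, accents, boundaries)]

print(tag_words(["well", "what", "is", "next"],
                ["a", "a", "u", "a"],
                ["f", "m", "m", "f"]))
# ['well_af', 'what_am', 'is_um', 'next_af']
```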
Prosody dependent language model
• A prosody dependent language model p(w_j, p_j | w_1, p_1, ..., w_{j-1}, p_{j-1}) models the probability of the current prosody-tagged word given its word and prosody history.
• The primary reason for building prosody dependent language models: our experiments have shown that the interaction of a prosody dependent language model with a prosody dependent acoustic model p(O|W,P) is the key to improving word recognition accuracy.
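As a toy sketch, the bigram case of this model can be estimated directly by maximum likelihood over prosody-tagged tokens; the corpus and function names below are illustrative, not the paper's implementation:

```python
from collections import Counter

# Toy corpus of prosody-tagged tokens; a stand-in, not the Radio News Corpus.
corpus = [["<s>", "well_af", "what_am", "is_um", "next_af", "</s>"]]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent[:-1])            # history counts
    bigrams.update(zip(sent, sent[1:]))   # (history, current) counts

def p_bigram(curr, prev):
    """MLE estimate of p(w_j, p_j | w_{j-1}, p_{j-1}) over tagged tokens."""
    return bigrams[(prev, curr)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("what_am", "well_af"))  # 1.0 in this toy corpus
```

The next two slides explain why this direct estimate is too sparse in practice.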
The problem of data sparseness in estimating p(W,P)
• Modeling prosody tagged word tokens multiplies the size of the vocabulary by |p| (|p|: the number of word-level prosody labels).
[Figure: each word w in the prosody-independent LM maps to the four prosody-tagged tokens w_um, w_uf, w_am, w_af in the prosody-dependent LM]
Data sparseness problem in estimating p(W,P)
• With traditional estimation methods, prosody dependent N-gram models cannot be estimated as robustly from a limited data set as prosody independent N-gram models, and the number of unseen prosody dependent bigrams increases.
Factorial prosodic language model
• Motivation: prosody can be predicted from part-of-speech (POS) tags with high accuracy:
  • 91% for phrasal stress prediction [Arnfield 94]
  • 84% for pitch accent [Hirschberg 93]
  • 84% for pitch accent and 90% for IPB on RNC
• POS can be inferred from word transcriptions with very high accuracy using automatic syntactic parsers.
• Solution: bridge word and prosody using syntax.
The algorithm
• Conduct syntactic analysis using automatic syntactic parsers (Charniak's parser, Roth's parser, etc.)
• Estimate the syntactic-prosodic models: p(p_j | c_i, c_j, p_i), p(p_j | c_i, c_j), p(c_i, c_j | w_i, w_j)
• Compute prosody dependent N-gram probabilities p(w_j, p_j | w_i, p_i) from the prosody independent N-gram probabilities p(w_j | w_i) or p(w_j | w_i, p_i) and the syntactic-prosodic models
• Smoothing
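A minimal sketch of the middle two steps, under one plausible reading of the factorization; the decomposition below and every probability table in it are illustrative stand-ins, not the paper's estimates:

```python
# Assumed factorization (a plausible reading of the algorithm above):
#   p(w_j, p_j | w_i, p_i) ~= p(w_j | w_i) *
#       sum_{c_i, c_j} p(p_j | c_i, c_j, p_i) * p(c_i, c_j | w_i, w_j)

p_word = {("what", "well"): 0.2}                 # p(w_j | w_i), from a PI N-gram
p_pos = {(("UH", "WP"), ("well", "what")): 0.9}  # p(c_i, c_j | w_i, w_j), from a parser
p_pros = {("am", ("UH", "WP"), "af"): 0.6}       # p(p_j | c_i, c_j, p_i)

def p_factorial(wj, pj, wi, pi):
    """Prosody-dependent bigram assembled from the factor tables."""
    total = 0.0
    for (ci_cj, wpair), prob_pos in p_pos.items():
        if wpair == (wi, wj):
            total += p_pros.get((pj, ci_cj, pi), 0.0) * prob_pos
    return p_word.get((wj, wi), 0.0) * total

print(p_factorial("what", "am", "well", "af"))  # 0.2 * 0.6 * 0.9 = 0.108
```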
Factorial prosodic language model
• Both POS and prosody have small label inventories (around 30 tags in the Penn Treebank POS set).
• Hence the syntactic-prosodic models p(p_j | c_i, c_j, p_i), p(p_j | c_i, c_j), and p(c_i, c_j | w_i, w_j) can be robustly estimated from a small corpus.
PDLM smoothing
• Katz backoff
• Linear interpolation
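A minimal sketch of the linear interpolation case: mix the sparse, directly estimated prosody-dependent bigram with a more robust backoff estimate such as the factorial model. The weight below is illustrative; in practice it would be tuned on the development-test set.

```python
def interpolate(p_direct, p_backoff, lam=0.7):
    """Linearly interpolated probability; lam is illustrative, not tuned."""
    return lam * p_direct + (1.0 - lam) * p_backoff

# e.g. an unseen PD bigram (direct estimate 0.0) still receives mass
# from the factorial estimate computed in the sketch above:
print(interpolate(0.0, 0.108))  # 0.0324
```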
The corpus
• The Boston University Radio News Corpus (RNC)
• Stories read by 7 professional radio announcers
• 5k vocabulary
• 25k word tokens
• 3 hours of clean speech
• No disfluencies
• Expressive and well-behaved prosody
• 85% of the utterances were selected randomly for training, 5% for development testing, and the remaining 10% for testing
• Small, but the largest prosodically transcribed English corpus
Reduction of perplexity
• Joint perplexity: 2^H(W,P)
• Word perplexity: 2^H(W)
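A small sketch of how these quantities are computed from per-token model probabilities; all probability values below are illustrative, not corpus results:

```python
import math

def perplexity(token_probs):
    """2**H, where H is the average negative log2 probability per token."""
    H = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2.0 ** H

joint = perplexity([0.108, 0.05, 0.2])  # over (word, prosody) tokens: 2^H(W,P)
word = perplexity([0.21, 0.12, 0.4])    # prosody summed out: 2^H(W)
print(joint, word)
```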
Prosody dependent speech recognition experiments on RNC
• API: prosody independent allophone set (SPHINX monophone models)
  • 3-state left-to-right HMMs
  • 3-mixture Gaussians per state
  • 32-dimensional MFCC_E_D_Z features
• APD: prosody dependent allophone set (able to capture prosody-induced pitch and durational variation)
  • State transition matrices or duration PDFs depend on prosody
  • One-dimensional single-Gaussian acoustic-prosodic observation PDFs over nonlinearly transformed pitch features (using ANNs)
Prosody dependent acoustic modeling
[Figure: prosody-dependent allophone HMM with acoustic observations X_q and prosodic observations Y_q conditioned on allophone q and phone-level prosody h]
• Prosody dependent allophone models Λ(q) => Λ(q,h):
  • Acoustic-phonetic observation PDF: b(X|q) => b(X|q,h)
  • Duration PDF: d(q) => d(q,h)
  • Acoustic-prosodic observation PDF: f(Y|q,h)
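A structural sketch of Λ(q,h), assuming each (allophone, phone-level prosody) pair simply indexes its own set of PDFs; the class layout and placeholder densities are hypothetical, not the trained HMM components:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AllophoneModel:
    b: Callable  # acoustic-phonetic observation PDF b(X | q, h)
    d: Callable  # duration PDF d(q, h)
    f: Callable  # acoustic-prosodic observation PDF f(Y | q, h)

# One model per (allophone q, phone-level prosody h) combination:
models = {
    ("ah", "accented"): AllophoneModel(
        b=lambda X: 1.0,    # placeholder density over MFCC frames
        d=lambda dur: 1.0,  # placeholder duration density
        f=lambda Y: 1.0,    # placeholder density over pitch features
    ),
}
print(models[("ah", "accented")].b([0.0] * 32))  # 1.0
```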
Prosody dependent pronunciation modeling
[Figure: pronunciation model linking word w, word-level prosody p, allophone sequence Q_w, and phone-level prosody H_w]
• p(Q_w|w) => p(Q_w|w,p) => p(Q_w,H_w|w,p)
• Model lexical stress: above: ax b! ah! v!
• Model phrasal pitch accent and phrase boundaries through prosody-dependent allophone models:
  • above: ax b ah v
  • above!: ax b! ah! v!
  • above%: ax b ah% v%
  • above!%: ax b! ah!% v!%
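A hedged sketch of expanding one lexicon entry into the four prosodic variants shown above, assuming "!" marks phones of the lexically stressed syllable and "%" marks final-rhyme phones; the helper and its index-set arguments are hypothetical:

```python
def pd_pronunciations(word, phones, stressed, final_rhyme):
    """Generate the four prosody-dependent variants of a lexicon entry."""
    def mark(accent, boundary):
        out = []
        for i, ph in enumerate(phones):
            ph += "!" if accent and i in stressed else ""
            ph += "%" if boundary and i in final_rhyme else ""
            out.append(ph)
        return out
    return {
        word: mark(False, False),
        word + "!": mark(True, False),
        word + "%": mark(False, True),
        word + "!%": mark(True, True),
    }

print(pd_pronunciations("above", ["ax", "b", "ah", "v"],
                        stressed={1, 2, 3}, final_rhyme={2, 3}))
# {'above': ['ax', 'b', 'ah', 'v'], 'above!': ['ax', 'b!', 'ah!', 'v!'],
#  'above%': ['ax', 'b', 'ah%', 'v%'], 'above!%': ['ax', 'b!', 'ah!%', 'v!%']}
```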
Prosody dependent speech recognition experiments on RNC
• Word and prosody recognition: the approach proposed in this paper improves word recognition accuracy (WRA) by 1%. The WRA of PD-ASR improves by 2.5% over a PI-ASR system with a comparable acoustic model parameter count.