This study explores the use of artificial neural networks to model pronunciation variation in English spontaneous speech, focusing on the context-dependency of pronunciation changes. The paper delves into the predictive modeling of canonical and surface phones, incorporating distinctive and prosodic features to enhance pronunciation accuracy.
Modeling pronunciation variation using artificial neural networks for English spontaneous speech
Ken Chen and Mark Hasegawa-Johnson
Pronunciation Variability
• Manual phonetic transcriptions: TIMIT (14 hours, read speech), ICSI-Switchboard (3.5 hours, spontaneous speech).
• Example: “interesting.” 35 tokens in ICSI-Switchboard; arbitrarily pick 8 of them. Total canonical pronunciations: 0. Total different pronunciations: 8.
  • iy y ih n t r ih s t iy ng
  • ix n t r ah s t ih ng
  • ih dx er s t ix ng
  • ih t r ih s t ih n
  • ih t r ih s t iy ng
  • ix n ch r ih s t ih ng
  • ih n t r ih s t ih ng
  • ih n ax r ah s t ih ng
• Not all words have this problem: “newspaper” is always produced canonically.
Why not just use a multi-pronunciation dictionary?
• Changes are context-dependent:
  • “and by = ax m b ay” is likely
  • “and do = ax m d uw” is unlikely
• Unnecessary ambiguity:
  • The entry “and = ax m” makes “and” and “um” (and “them”) indistinguishable
Predictive Pronunciation Modeling
Riley & Ljolje, 1995; Riley et al., 1999; Fukada et al., 1999
• Canonical phones $c_n$: r eh n ae n d b ay
• Surface phones $s_n$: DEL er n ae m DEL b ay
• Canonical and surface phones are each encoded as feature vectors
• Estimate the PDF $p(s_n = \text{“m”} \mid s_{n-d}, \ldots, s_{n-1}, c_{n-d}, \ldots, c_{n+d})$
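The aligned canonical and surface strings above, with DEL marking a deleted phone, can be obtained from a minimum-edit-distance alignment. The following is a minimal sketch, not the authors' implementation: the function name `align`, the unit costs, and the `INS` marker for inserted surface phones are all assumptions made for illustration. When several alignments have the same cost, this sketch may break ties differently from the slide.

```python
def align(canonical, surface):
    """Align two phone sequences with unit-cost edit distance (assumed costs).

    Returns a list of (canonical_phone, surface_phone) pairs, where a deleted
    canonical phone is paired with "DEL" and an inserted surface phone is
    paired with the hypothetical marker "INS".
    """
    n, m = len(canonical), len(surface)
    # dist[i][j] = edit distance between canonical[:i] and surface[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (canonical[i - 1] != surface[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (canonical[i - 1] != surface[j - 1])):
            pairs.append((canonical[i - 1], surface[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((canonical[i - 1], "DEL"))
            i -= 1
        else:
            pairs.append(("INS", surface[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


canonical = "r eh n ae n d b ay".split()
surface = "er n ae m b ay".split()
pairs = align(canonical, surface)
for c, s in pairs:
    print(f"{c:>3} -> {s}")
```

Each resulting pair supplies one training target: the surface phone (or DEL) to be predicted for a given canonical phone in its context.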
The PDF Estimator: Neural Network
Similar to Fukada et al., 1999
• Input: feature vectors
• Hidden layer (28-57 nodes)
• Output layer: # nodes = # phones + 1, normalized
• Output nodes: $z_i = p(s_n = i \mid \text{inputs})$, $z_0 = p(s_n = \text{DEL} \mid \text{inputs})$
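A forward pass through a network of this shape might look like the sketch below. The slide specifies only a hidden layer, an output layer with one node per phone plus one for DEL, and a normalization step; the tanh hidden activation, the softmax normalization, the random weights, and the toy sizes are assumptions for illustration, not details taken from the slides.

```python
import math
import random


def softmax(scores):
    """Normalize raw scores into probabilities (softmax is an assumed choice)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer (tanh, assumed) followed by a normalized output layer."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    return softmax(scores)


random.seed(0)
n_inputs = 12    # toy input dimension (hypothetical)
n_hidden = 28    # lower end of the 28-57 range given on the slide
n_phones = 5     # toy phone inventory size (hypothetical)
n_outputs = n_phones + 1  # index 0 is reserved for DEL


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


w_hidden, b_hidden = rand_matrix(n_hidden, n_inputs), [0.0] * n_hidden
w_out, b_out = rand_matrix(n_outputs, n_hidden), [0.0] * n_outputs

x = [1.0 if k % 4 == 0 else 0.0 for k in range(n_inputs)]
z = forward(x, w_hidden, b_hidden, w_out, b_out)
print("p(DEL | inputs) =", z[0])
```

With untrained weights the output is close to uniform; training would adjust the weights so that `z` approximates the conditional distribution over surface phones.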
Phone Labels → Feature Vectors
• Indicator Features (Fukada et al.):
  • dim($v_n$) = # phones
  • $v_n[i] = 1$ iff $c_n$ = $i$th phone, $v_n[i] = 0$ otherwise
• Multivalued Distinctive Features (DFs) (Riley et al.):
  • $v_n$ = [ consonant_manner, consonant_place, vowel_manner, vowel_place ]
  • consonant_manner: stop, fric, nasal, glide, affricate
  • consonant_place: lips, blade, body, larynx
• Binary Distinctive Features (DFs):
  • dim($v_n$) = 15
  • $v_n$ is a fully specified binary distinctive feature vector
  • feature specifications based on Stevens, 1999
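The indicator encoding is simple enough to sketch directly. In the example below, the seven-phone inventory `PHONES`, the helper names, and the all-zero padding for positions outside the utterance are assumptions for illustration; the slides do not say how utterance edges are handled.

```python
# Toy phone inventory (hypothetical); a real system would use the full phone set.
PHONES = ["ae", "ay", "b", "d", "eh", "n", "r"]
INDEX = {phone: i for i, phone in enumerate(PHONES)}


def indicator(phone):
    """Return v with v[i] = 1 iff `phone` is the i-th phone, else 0.

    `None` stands for a position outside the utterance and maps to all zeros
    (an assumed padding convention).
    """
    v = [0] * len(PHONES)
    if phone is not None:
        v[INDEX[phone]] = 1
    return v


def context_window(canonical, n, d):
    """Concatenate indicator vectors for canonical phones n-d .. n+d."""
    vec = []
    for k in range(n - d, n + d + 1):
        phone = canonical[k] if 0 <= k < len(canonical) else None
        vec.extend(indicator(phone))
    return vec


canonical = "r eh n ae n d b ay".split()
window = context_window(canonical, n=4, d=1)  # 3-phone context around the second "n"
print(window)
```

With `d=1` the input covers a 3-phone context, so its length is three times the inventory size.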
Inference w/binary distinctive features
[Figure: p( sn = DEL )]
Prosodic and Auxiliary Features
• In all experiments:
  • Phone position in word (normalized to [0,1])
  • Phone position in syllable (onset vs. rhyme)
  • Lexical stress (binary)
  • Function word vs. content word (binary)
• Prosodically transcribed data (Yoon et al., ICSLP 2004):
  • Pitch accent (binary: presence vs. absence)
Test Metric: Cross Entropy
• $H(T) = -\frac{1}{N} \sum_{n} \log p(s_n \mid \text{context})$
• Context includes:
  • Canonical phones
  • Surface phones
  • 4 auxiliary features (not pitch accent)
• Context computed using minimum-Levenshtein-distance alignment
• Sum is over all phones, n, in TIMIT/TEST
• Baseline: H(T) computed using unigram pronunciation model, $p(s_n \mid c_n)$
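The metric and its unigram baseline can be sketched as follows. The toy aligned pairs, the function names, the natural-log base, and the probability floor for unseen pairs are assumptions for illustration; the slides do not state the log base or any smoothing scheme.

```python
import math
from collections import Counter, defaultdict


def train_unigram(pairs):
    """Estimate p(s | c) by relative frequency from aligned (c, s) pairs."""
    counts = defaultdict(Counter)
    for c, s in pairs:
        counts[c][s] += 1
    return {c: {s: k / sum(ctr.values()) for s, k in ctr.items()}
            for c, ctr in counts.items()}


def cross_entropy(pairs, prob):
    """H(T) = -(1/N) * sum over n of log p(s_n | context), natural log assumed."""
    total = sum(math.log(prob(c, s)) for c, s in pairs)
    return -total / len(pairs)


# Toy aligned data (hypothetical): each canonical phone surfaces two ways.
train_pairs = [("n", "n"), ("n", "m"), ("d", "d"), ("d", "DEL")]
model = train_unigram(train_pairs)


def unigram_prob(c, s, floor=1e-6):
    """Look up p(s | c); unseen pairs get a small floor (assumed smoothing)."""
    return model.get(c, {}).get(s, floor)


h = cross_entropy(train_pairs, unigram_prob)
print("unigram cross-entropy (nats):", h)
```

A context-dependent model would be scored the same way, with `prob` replaced by the network's output for each phone in its context; a lower value than the unigram baseline indicates that the context is informative.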
Results
[Table: results not preserved in this copy]
* Results in this row are from Riley et al., Speech Communication, 1999. Every effort has been made to ensure that the experiments are comparable, but the usual caveats apply.
Results: with Prosody
• Too little data with both prosodic and phonetic transcription is available to form separate training and test corpora
• Testing on the training corpus:
  • Including pitch accent as an auxiliary feature reduces cross-entropy by 20% relative to a nearly identical pronunciation model without pitch accent
• The training corpus is small, so significance is unclear
Results: Entropy of Pronunciation Model on Training Data, as a function of NN Training Epoch, with p (pitch accent) and without p
Conclusions
• Neural network can learn to predict pronunciation with low cross-entropy
• Binary distinctive features (DF) with 3-phoneme context give best performance, but:
  • Difference between binary DF and indicator feature encodings is not statistically significant
  • Binary DF encoding leads to overtraining when presented with a 5-phoneme context
• Prosodic features: results are promising, but data are sparse