SPEECH VARIATION AND THE USE OF DISTANCE METRICS ON THE ARTICULATORY FEATURE SPACE
Louis ten Bosch
Contents • Introduction • Objectives • Articulatory Features • Speech Material • Experimental set-up • Results • Questions, future plans
Introduction • Speech is usually represented as sequences drawn from a limited set of phone-like symbols (ASR, synthesis, annotation) • ‘Beads-on-a-string’ paradigm (Ostendorf, 1999; etc.) • Powerful as a meta-description • Weak for describing articulatory variation and pronunciation variation • Research on new descriptions & models of speech • Many proposals for new signal representations (continuity-preserving, auditorily inspired) and new models (neural models, long-span models, parallel models) • Here: articulatory features (AFs)
Objectives • To obtain alternative representations that intrinsically model variation in speech better • Focus on articulatory/pronunciation variation • To investigate the relation between better representations and decoding
Articulatory Features (AFs) • AF advantages are twofold: • They allow feature asynchrony • They deal with ‘incompleteness’: incomplete nasalization, partial voicing • Intrinsically better modelling of continuous processes • Assumed to better model fine phonetic detail (FPD) • FPD mediates human speech processing (lexical access) • [together with indexical information]
Distance Metric in AF Space • Each utterance is a path in AF space • A distance metric in AF space defines the ‘speed’ along the path • Compare with delta features in ASR • Speed-peak detection imposes an intrinsic temporal structure • Which distances to use? • Three types (L1, L2, cosine) • How does this ‘intrinsic’ temporal structure relate to external temporal structure, e.g. phone boundaries?
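The speed-and-peaks idea above can be sketched as follows. This is a minimal illustration, not the talk's actual implementation: it assumes AF vectors arrive as rows of a NumPy array (one per frame), and the function names are hypothetical.

```python
import numpy as np

def speed_profile(af, metric="cosine"):
    """Frame-to-frame 'speed' along an AF trajectory.

    af: (T, D) array, one articulatory-feature vector per frame
    (shape and dimensionality are assumptions for this sketch).
    Returns a length T-1 array: distance between frames t and t+1.
    """
    a, b = af[:-1], af[1:]
    if metric == "l1":
        return np.abs(b - a).sum(axis=1)
    if metric == "l2":
        return np.linalg.norm(b - a, axis=1)
    if metric == "cosine":
        num = (a * b).sum(axis=1)
        den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
        return 1.0 - num / den
    raise ValueError("unknown metric: %s" % metric)

def speed_peaks(speed):
    """Indices of local maxima in the speed curve: these are the
    candidate segment boundaries that impose the 'intrinsic'
    temporal structure."""
    return [i for i in range(1, len(speed) - 1)
            if speed[i] > speed[i - 1] and speed[i] >= speed[i + 1]]
```

For a trajectory that jumps abruptly from one AF configuration to another, all three metrics place a speed peak at the jump; they differ only in how they weight larger excursions, which is exactly where the metrics are expected to diverge.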
Speech Material • IFA corpus (Dutch; read + prepared speech; 8 speakers, 6 used for training and development, 2 for testing) • Many rich annotation levels
Alignment Results • Number of hits (detected → observed) versus time-window size (cf. Wesenick & Kipp, 1996)
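A hit count of this kind is typically obtained by matching each detected peak to at most one manual boundary within a tolerance window. The sketch below is a hypothetical scoring routine in that spirit; the talk does not specify the exact matching rule.

```python
def count_hits(detected, observed, window):
    """Greedy one-to-one matching of detected boundary candidates to
    observed (manual) boundaries within +/- window frames.

    detected, observed: frame indices; window: tolerance in frames.
    Returns the number of detected candidates that found a match.
    """
    used = set()   # manual boundaries already claimed by a candidate
    hits = 0
    for d in detected:
        best, best_gap = None, None
        for j, o in enumerate(observed):
            if j in used:
                continue
            gap = abs(d - o)
            if gap <= window and (best_gap is None or gap < best_gap):
                best, best_gap = j, gap
        if best is not None:
            used.add(best)
            hits += 1
    return hits
```

Sweeping `window` over a range of sizes then yields the hits-versus-window-size curve referred to above.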
Asynchrony and Phonetic Classes
Average (in number of frames) and standard deviation of the difference between cosine-peak location and manual boundary. Only the transitions with the most extreme negative and positive differences are shown.

Manner transition       avg. (st.dev.)
Fricative-fricative     -0.57 (1.6)
Vowel-vowel             -0.31 (1.8)
…
Silence-approximant      0.49 (1.8)
Approximant-stop         0.63 (1.6)
Vowel-silence            0.64 (2.1)
Nasal-approximant        0.66 (1.0)
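Per-transition statistics like those in the table can be computed by bucketing the peak-minus-manual offsets by transition label. A minimal sketch, assuming the alignment step yields (label, peak frame, manual frame) triples; the function name and input format are illustrative.

```python
from collections import defaultdict
import statistics

def offsets_by_transition(triples):
    """triples: iterable of (transition_label, peak_frame, manual_frame).

    Returns {label: (mean, st.dev.)} of the offset peak - manual,
    in frames, per manner-transition class.
    """
    buckets = defaultdict(list)
    for label, peak, manual in triples:
        buckets[label].append(peak - manual)
    return {label: (statistics.mean(v),
                    statistics.stdev(v) if len(v) > 1 else 0.0)
            for label, v in buckets.items()}
```

A negative mean means the cosine peak tends to precede the manual boundary for that class (as for fricative-fricative transitions in the table); a positive mean means it tends to lag.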
Open questions 1 • To what extent does the type of distance (L1, L2, cosine) distinguish fine detail in the alignment with the manual segmentation? • For distances close to 0, all metrics give about the same result • The metrics diverge for larger distances, thereby putting more weight on different types of distinctions • This means that event parsing along the AF trajectory may result in essentially different segmentations for different metrics
Open questions 2 • What about cue trading (by using weights)? • Difficult; depends on the phone • What about the precise quantification of asynchrony? • The variation of observed AF vectors around a canonical AF vector = feature asynchrony + variation in the classifier output
Near-future plans • Exploit the phenomena described here as design principles for alternative, data-driven procedures for annotation and unit selection • Design a word-recognition framework based on an AF representation of speech • Study usability for memory-prediction models