Speech acoustics and phonetics

Louis C.W. Pols
Institute of Phonetic Sciences (IFA)
Amsterdam Center for Language and Communication (ACLC)
NATO-ASI “Dynamics of Speech Production and Perception”
Il Ciocco, Tuscany, Italy, July 1, 2002

Overview
• Dynamics in speech acoustics
• Contour modeling (mainly formants)
• Aspects of spectral undershoot
• Modeling V and C reduction
• Phonetic knowledge from speech corpora
  • IFA, CGN, TIMIT, found speech
• Conclusions
Speech acoustics and phonetics, Il Ciocco

Dynamics in speech acoustics
• Dynamics is the norm, not stationarity
  • articulatory efficiency
• Dynamics is everywhere
  • generally no word boundaries in speech
  • deletion of words, syllables, phonemes; insertion
  • within/between word coarticulation/assimilation
  • vowel and consonant reduction
• Acoustic manifestations
  • segment duration, F0, loudness, spectral quality

Dynamics is the norm
• The speaker speaks as sloppily as the listeners allow him to in communication
  • communicative efficiency
• Articulatory vs. perceptual efficiency
  • do spectral transitions facilitate or hamper perception? → see other presentation
• Speaker flexibility; speaking style (clear vs. sloppy); speaking rate

Dynamics is everywhere
• Deletion
  • ‘bread and butter’ /brEmbY3/
  • ‘Amsterdam’ (Du) /Amst@rdAm/ → /Ams@dAm/
  • ‘koninklijke’ (Du) /konIŋkl@k@/ → /kol@k@/
• Insertion
  • homorganic glide insertion: ‘die een’ (Du) /dij@n/
• Degemination
  • ‘is zichtbaar’ (Du) /Is zIxtbar/ → /IsIxbar/
• Reduction, coarticulation, assimilation

Acoustic manifestations
• pitch, loudness, formant, component contours
• contour stylization (e.g., pitch in Praat)
• contour modeling
  • n-th degree curve fitting (D. van Bergem)
  • Legendre polynomials (R. van Son)
  • 16 points per segment (R. van Son)
• (phoneme) segmentation
  • by hand (time-consuming; inconsistent)
  • automatically (via forced phoneme recognition and a pronunciation lexicon with alternatives; systematic errors)
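
The "16 points per segment" representation can be read as time-normalizing each contour to a fixed number of samples, so that segments of different duration become comparable. A minimal sketch of that idea, assuming linear interpolation between measured frames (the slides do not say which interpolation was used); the function name `resample_contour` and the example values are purely illustrative:

```python
def resample_contour(times, values, n_points=16):
    """Linearly interpolate a contour at n_points equally spaced instants.

    `times` must be strictly increasing; the output spans the whole segment,
    from the first to the last measured frame.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) pairs of equal length")
    start, end = times[0], times[-1]
    resampled = []
    j = 0  # index of the frame interval [times[j], times[j + 1]] in use
    for i in range(n_points):
        t = start + (end - start) * i / (n_points - 1)
        # Advance to the interval that contains t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        weight = (t - t0) / (t1 - t0)
        resampled.append(values[j] + weight * (values[j + 1] - values[j]))
    return resampled


# Invented F2 measurements (Hz) at irregular frame times (s) within one vowel.
frame_times = [0.000, 0.012, 0.030, 0.055, 0.080]
f2_values = [1400.0, 1550.0, 1700.0, 1650.0, 1500.0]
f2_16 = resample_contour(frame_times, f2_values)  # 16 time-normalized samples
```

After this step every segment is described by the same number of values, regardless of its duration, which is what makes track shapes at different speaking rates directly comparable.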

Contour modeling
• allows modeling of specific phenomena
  • pitch accentuation (vs. vowel onset)
  • reduction, centralization, undershoot
• allows generation of stimuli for perception experiments
  • phoneme identification in extending context
  • two-alternative forced-choice identification of continua
  • discrimination, RT
• allows statistics on large speech corpora
  • TIMIT, CGN, IFA-corpus, Switchboard

Static vs. dynamic V recognition
• see Weenink (2001)
  • “Vowel normalizations with the TIMIT acoustic phonetic speech corpus”, IFA Proc. 24, 117-123
• 438 males, both train & test sentences of TIMIT
• 35,385 vowel segments, hand segmented
• 13 monophthongal vowel categories
• 1-Bark band-filter analysis (18 filters), intensity normalization
• 3 frames per segment: central and 25 ms left/right
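
To give a feel for what a 1-Bark filter layout looks like in Hz, the sketch below lists the edges of 18 adjacent bands, each 1 Bark wide. It rests on two assumptions that are not in the slides: the Hz-to-Bark conversion is Traunmüller's (1990) approximation, and the first band edge sits at 0 Bark. The analysis cited above may have used a different formula and band placement.

```python
def hz_to_bark(f_hz):
    """Hz to Bark, Traunmueller (1990) approximation (an assumed choice)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z_bark):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z_bark + 0.53) / (26.28 - z_bark)


def bark_band_edges_hz(n_bands=18, first_edge_bark=0.0):
    """Edges (Hz) of n_bands adjacent bands, each 1 Bark wide.

    first_edge_bark is a hypothetical parameter; the slides do not state
    where the lowest band starts.
    """
    return [bark_to_hz(first_edge_bark + k) for k in range(n_bands + 1)]


edges = bark_band_edges_hz()
for low, high in zip(edges[:-1], edges[1:]):
    print(f"{low:7.0f} - {high:7.0f} Hz")
```

The bands are narrow at low frequencies and widen toward high frequencies, which is the point of analysing on an auditory rather than a linear frequency scale.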

Some results
• Vowel classification (%) with discriminant functions

Formant tracks / speaking rate
• Ph.D. thesis Rob van Son (1993)
  • “Spectro-temporal features of vowel segments”
  • see also Speech Comm. 13, 135-148 (Pols & van Son)
• 850-word text, read at normal and fast rate
• hand segmentation of 7 most frequent V + schwa
• formant tracks
  • via 16 points per segment or 5 Legendre polynomials
• influence of rate, V duration, context, sentence accent
• evidence for duration-controlled undershoot?
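
Describing a formant track with 5 Legendre polynomials amounts to fitting orders 0 to 4 to the time-normalized track: the zeroth-order coefficient captures the overall level, the first-order coefficient the tilt, and the second-order coefficient the curvature. The following is a minimal least-squares sketch in plain Python, not the procedure from the thesis; the helper names and the synthetic track are assumptions made for illustration.

```python
def legendre_basis(x, order):
    """Return [P_0(x), ..., P_order(x)] via Bonnet's recurrence."""
    values = [1.0, x]
    for n in range(1, order):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: order + 1]


def solve_linear(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_legendre(track, order=4):
    """Least-squares Legendre coefficients of a track sampled uniformly on [-1, 1]."""
    n = len(track)
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    basis = [legendre_basis(x, order) for x in xs]
    k = order + 1
    # Normal equations (B^T B) c = B^T y for the basis matrix B.
    btb = [[sum(row[i] * row[j] for row in basis) for j in range(k)] for i in range(k)]
    bty = [sum(row[i] * y for row, y in zip(basis, track)) for i in range(k)]
    return solve_linear(btb, bty)


# Synthetic 16-point F2 track (Hz): level 1500, tilt +200, curvature -100.
xs16 = [-1.0 + 2.0 * i / 15 for i in range(16)]
f2_track = [1500.0 + 200.0 * x - 100.0 * 0.5 * (3 * x * x - 1) for x in xs16]
coeffs = fit_legendre(f2_track, order=4)
# coeffs[0], coeffs[1], coeffs[2] come out close to 1500, 200 and -100;
# the third- and fourth-order coefficients are close to zero.
```

Because the coefficients separate level, tilt and curvature, two tracks can be compared on shape alone by ignoring the zeroth-order term.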

Some results
• no differences for F1/F2 in vowel center between normal- and fast-rate speech; only some overall rise in F1 for fast rate (irrespective of V)
• same formant track shape (normalized to 16 points) for normal- and fast-rate speech
• same results when using the more elaborate Legendre polynomials
• Conclusion: changes in V duration do not change the amount of undershoot → active control of articulation speed

Formant representations
[Figure: vowel formant data shown as zeroth-order Legendre polynomial coefficients (mean Fi in vowel segment) and as second-order Legendre polynomial coefficients (axes reversed)]

Modeling vowel reduction
• Ph.D. thesis Dick van Bergem (1995)
  • “Acoustic and lexical vowel reduction”
  • see also Speech Communication 16, 329-358
• lexical V reduction: Fr /betõ/ vs. Du /b@tOn/
• acoustic V reduction: /banan, bAnan, b@nan/
  • f(sentence accent, word stress, word class): can-candy-canteen
• coarticulatory effects on the schwa
  • C1@C2V- and VC1@C2-type nonsense words
• perceptual effects (full V or schwa, e.g. ‘ananas’)

Some results
[Figure panels: t-n, w-l]
• The schwa is not just a centralized vowel but something that is completely assimilated with its phonemic context

Modeling consonant reduction
• Sp. Comm. (1999) 28, 125-140 (van Son & Pols)
• 20 min. of speech, both spontaneous and read
• 2 x 791 similar VCV; hand segmented
• 5 aspects of V and C reduction
  • related to coarticulation: F2 slope differences at CV- vs. VC-boundaries; F2 locus equations (F2 onset vs. F2 target)
  • related to speaking effort: duration; spectral COG (mean freq.); V-C sound energy differences
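
An F2 locus equation is a straight-line fit of F2 at the consonant-vowel boundary against F2 at the vowel target, computed across vowel contexts for one consonant. The sketch below fits that line by ordinary least squares. The data points are invented for illustration and do not come from the cited study, and `fit_locus_equation` is a hypothetical helper name.

```python
def fit_locus_equation(f2_target_hz, f2_onset_hz):
    """Fit F2_onset = slope * F2_target + intercept by ordinary least squares."""
    n = len(f2_target_hz)
    if n != len(f2_onset_hz) or n < 2:
        raise ValueError("need at least two paired measurements")
    mean_x = sum(f2_target_hz) / n
    mean_y = sum(f2_onset_hz) / n
    sxx = sum((x - mean_x) ** 2 for x in f2_target_hz)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(f2_target_hz, f2_onset_hz))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented measurements for one consonant in five vowel contexts (Hz).
f2_target = [800.0, 1200.0, 1600.0, 2000.0, 2400.0]
f2_onset = [1180.0, 1420.0, 1660.0, 1900.0, 2140.0]
slope, intercept = fit_locus_equation(f2_target, f2_onset)
```

In the usual reading of locus equations, a slope near 1 means the onset follows the vowel closely (strong coarticulation), while a slope near 0 means the onset stays near a fixed consonant locus. Comparing slopes between speaking styles is then one way to ask whether coarticulation changes.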

Some results
• V markedly reduced in spontaneous speech
• lower F2-slope difference in spontaneous speech → decrease in articulation speed
• no systematic effect on F2 locus equation; V onsets and targets change in concert → any V reduction mirrored by comparable change in C
• spontaneous speech: V and C shorter; lower COG → decrease in vocal and articulatory effort
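
The spectral center of gravity (COG) referred to here is the mean frequency of a spectrum, with each frequency weighted by how much energy it carries. A minimal sketch, assuming power weighting (the slides only say "mean freq.", so the weighting is an assumption) and using two invented four-bin spectra:

```python
def spectral_cog(freqs_hz, power):
    """Power-weighted mean frequency of a spectrum, in Hz."""
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs_hz, power)) / total


# Invented spectra on the same frequency bins: one with most energy in the
# higher bins, one with most energy in the lower bins.
bins_hz = [500.0, 1500.0, 2500.0, 3500.0]
cog_high = spectral_cog(bins_hz, [1.0, 1.0, 4.0, 4.0])  # 2600.0 Hz
cog_low = spectral_cog(bins_hz, [4.0, 4.0, 1.0, 1.0])   # 1400.0 Hz
```

A spectrum with relatively less high-frequency energy yields a lower COG, which is why a lower COG is taken above as a sign of reduced vocal and articulatory effort.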

Access to large corpora
• more, and more realistic, data
• phonetic knowledge via statistical analyses
• e.g. highly accessible IFA-corpus (free, SQL)
  • see “Structure and access of the open source IFA-corpus”, IFA Proc. 24, 15-26 (van Son & Pols)
  • on-line: http://www.fon.hum.uva.nl/IFAcorpus/
  • 4 M / 4 F speakers, 5.5 hrs of speech
  • from informal to read + sentences, words, syllables
  • ~50K words segmented and labeled at phoneme level

Some results
• speech + annotation + meta data: relational DB
• realization of final n, e.g. Du ‘geven’ /xev@(n)/
[Figure: realization of final n; surviving panel label: Read]
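
With annotations held in a relational database, a question such as "how often is final n realized, per speaking style?" becomes a single aggregate query. The sketch below shows the shape of such a query on an in-memory SQLite database. The table `word_tokens`, its columns and the four rows are all hypothetical: they are not the IFA-corpus schema or its data, and serve only to make the example run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per word token, with canonical and realized
# phonemic transcriptions and the speaking style of the recording.
conn.execute(
    "CREATE TABLE word_tokens ("
    "orthography TEXT, canonical TEXT, realized TEXT, style TEXT)"
)
toy_rows = [  # invented rows, not corpus data
    ("geven", "xev@n", "xev@", "informal"),
    ("geven", "xev@n", "xev@n", "read"),
    ("lopen", "lop@n", "lop@", "informal"),
    ("lopen", "lop@n", "lop@", "read"),
]
conn.executemany("INSERT INTO word_tokens VALUES (?, ?, ?, ?)", toy_rows)

# Per style: tokens whose canonical form ends in /@n/, and how many of
# those keep the final /n/ in the realized form.
query = """
SELECT style,
       COUNT(*) AS n_tokens,
       SUM(CASE WHEN realized LIKE '%n' THEN 1 ELSE 0 END) AS n_realized
FROM word_tokens
WHERE canonical LIKE '%@n'
GROUP BY style
ORDER BY style
"""
final_n_counts = conn.execute(query).fetchall()
for style, n_tokens, n_realized in final_n_counts:
    print(style, n_tokens, n_realized)
```

The same pattern extends to other breakdowns (by speaker, word class, or position) by adding columns to the `GROUP BY` clause, provided the real schema exposes them.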

Spoken Dutch Corpus (CGN)
• 10 M words, 1,000 hrs of speech
• variety of styles, incl. telephone speech
• adult Dutch and Flemish speakers
• for linguistic and technological research
• see various LREC and ICSLP papers (2002)
• see also http://lands.let.kun.nl/cgn/home.htm
• fully transcribed: orthographic, POS, lemmas
• partly transcribed: phonemic, prosodic, syntactic

TIMIT
• popular DB in acoustic phonetics and ASR
  • also telephone version (NTIMIT)
• hand segmented & labeled at phoneme level
• 438 males, 192 females (8 dialect regions)
• 10 sentences per speaker (2 fixed, 5 phonetically compact, 3 phonetically diverse)
  • sa1: “She had your dark suit in greasy wash water all year”
• includes separate test data (112 M, 56 F)
• e.g. Ph.D. thesis X. Wang (1997), “Incorporating knowledge on segmental duration in HMM-based continuous speech recognition”

Useful info: durational variability
[Figure, adapted from Wang (1998): durations by condition]
• overall average = 95 ms
• normal rate = 95 ms
• primary stress = 104 ms
• word final = 136 ms
• utterance final = 186 ms

[Figure: normalized phone duration vs. speaking rate for all 3,696 training sentences (sx + si) of the TIMIT training set]

‘found’ speech
• DARPA-LVSR community rather ambitious
• Broadcast News (BN), Sp. Comm. 37 (2002)
• For Proc. DARPA Workshops, see http://www.nist.gov/speech/proc/darpa99/index.htm

Articulatory-acoustic features in ASR
• “A Dutch treatment of an elitist approach to articulatory-acoustic feature classification”, Proc. Eurospeech-2001, 1729-1732 (M. Wester et al.)
• “Integrating articulatory features into acoustic models for speech recognition”, Phonus 5, 73-86 (K. Kirchhoff, 2000)
• “An overlapping-feature-based phonological model incorporating linguistic constraints: Applications to speech recognition”, JASA 111 (2), 1086-1101 (J. Sun & L. Deng, 2002)

Conclusions
• examples of dynamics in speech acoustics
• going from formal to informal speech:
  • less dynamics, more reduction (artic. guided)
• undershoot vs. speaking style
  • sloppiness or articulatory limits?
• functionality of dynamics? → other paper
• systematicity of dynamics?
  • easing ASR, rules for TTS, acquiring knowledge?
