A new Golden Age of phonetics?

Mark Liberman, University of Pennsylvania, myl@cis.upenn.edu

The promise
“We see that the computer has opened up to linguists a host of challenges, partial insights, and potentialities. We believe these can be aptly compared with the challenges, problems, and insights of particle physics. Certainly, language is second to no phenomenon in importance. And the tools of computational linguistics are considerably less costly than the multibillion-volt accelerators of particle physics. The new linguistics presents an attractive as well as an extremely important challenge. There is every reason to believe that facing up to this challenge will ultimately lead to important contributions in many fields.”
Language and Machines: Computers in Translation and Linguistics. Report by the Automatic Language Processing Advisory Committee (ALPAC), National Academy of Sciences, 1966

The paradox
• ALPAC (and Pierce 1969):
  • computers → new language science
  • language science → language engineering
• What actually happened:
  • computers → new language engineering
  • engineering → new language science ???

Focusing on speech science…
• Plenty of computer use
  • minicomputers in the 1960s
  • micro- and super-computers in the 1980s
  • ubiquitous laptops today
• Applications:
  • replaced tape splicing
  • replaced sound spectrograph
  • easier pitch tracking, formant tracking
  • more convenient statistical analysis
  • and so on
• BUT…

No phonetic quantum mechanics
• Great speech science by smart people
• But surprisingly little change
  • in style and scale of research 1966-2009
  • in scientific questions about speech
  • in the rate of progress compared to 1946-1966 (the first golden age of phonetics)
  …at least on the acoustic analysis side.
• Peterson & Barney 1951
  • data is still relevant
  • many contemporary publications are similar in style and scale

“Challenges, partial insights, and potentialities”
• In speech science
  • there are many unmet challenges,
  • not enough new insights,
  • and the potentialities are still mostly potential
• Something was missing in 1966
  • adequate accessible digital speech data
  • tools for large-scale automated analysis
  • applicable research paradigms
• We now have two out of three…

“Challenges…”
• Variation and invariants
  • Individual
  • Contextual
  • Social
  • Communicative
• Problem of correlated variables
  • Factorial design vs. MLM
  • Laboratory vs. natural data
• Descriptive dimensions?
  • e.g. F0/amplitude/time/etc. for prosody

“…partial insights…” […in fact we know a lot about speech…]

“…potentialities”
• 4-6 orders of magnitude more speech
  • “Found data” as well as gifts from DARPA
• Purely bottom-up analysis
  • F0, voice/noise/silence, speaking rate
  • New descriptive dimensions
• Analysis based on forced alignment
  • Good results from OK transcripts
  • Qualitative: “pronunciation modeling”
  • Quantitative: old and new dimensions
• Better statistical methods

“…important contributions in many fields…”
• Social sciences
• Sociolinguistics and dialect geography
• New approaches to survey data
• Phonetics of rhetoric
• Speech pathology/therapy
• Language teaching/learning

Two kinds of science
1. Explore, observe, explain
   Yogi Berra: “Sometimes you can observe a lot just by watching”
   “Botanizing” / exploratory data analysis
2. Hypothesize and test

(11,700 conversational sides; mean=173, sd=27)
(Male mean 174.3, female 172.6: difference 1.7, effect size d=0.06)
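
The effect size quoted above can be checked with quick arithmetic. This is a minimal sketch, assuming d is the difference of the two group means divided by the overall standard deviation of 27; the slide does not say which standard deviation was used, and the function name is only illustrative.

```python
def cohens_d(mean_a: float, mean_b: float, sd: float) -> float:
    """Standardized mean difference: (mean_a - mean_b) / sd."""
    return (mean_a - mean_b) / sd

# Values taken from the slide. Using the overall sd as the denominator
# is an assumption; the slide does not state how d was computed.
male_mean, female_mean, overall_sd = 174.3, 172.6, 27.0
d = cohens_d(male_mean, female_mean, overall_sd)
print(f"difference = {male_mean - female_mean:.1f}, d = {d:.2f}")
```

Under that assumption the result rounds to the d=0.06 shown on the slide, i.e. the male/female difference is tiny relative to the spread across speakers.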

Data from Switchboard; phrases defined by silent pauses (Yuan, Liberman & Cieri, ICSLP 2006)

Data from CallHome M/F conversations; about 1M F0 values per category.

Evanini, Isard & Liberman, “Automatic formant extraction for sociolinguistic analysis of large corpora”, Interspeech 2009

Yuan & Liberman, Interspeech 2009
• Orthographically-transcribed natural speech is available in very large quantities
• With pronunciation modeling and forced alignment we can use this data for phonetics research
• Automatic acoustic measures based on simple statistical models can sometimes be helpful
• Here we examine the distribution of /l/-darkness in ~26 (out of ~9000) hours of U.S. Supreme Court oral arguments
• … ~22,000 tokens of /l/

Introduction
• English /l/ is traditionally classified into at least two allophones: “dark /l/”, which appears in syllable rimes, and “clear /l/”, which appears in syllable onsets.
• Sproat and Fujimura (1993): clear and dark allophones are not categorically distinct; the single phonological entity /l/ involves two gestures, a vocalic dorsal gesture and a consonantal apical gesture.
• The two gestures are inherently asynchronous: the vocalic gesture is attracted to the nucleus of the syllable, and the consonantal gesture is attracted to the margin (“gestural affinity”).
• In a syllable-final /l/, the tongue dorsum gesture shifts left to the syllable nucleus, making the vocalic gesture precede the consonantal, tongue apex gesture. In a syllable-initial /l/, the apical gesture precedes the dorsal gesture.

Introduction
• Clear /l/ has a relatively high F2 and a low F1; dark /l/ has a lower F2 and a higher F1; intervocalic /l/s are intermediate between the clear and dark variants (Lehiste 1964).
• An important piece of evidence for the “gestural affinity” proposal: Sproat and Fujimura (1993) found that the backness of pre-boundary intervocalic /l/ (in /i - ɪ/) is correlated with the duration of the pre-boundary rime. The /l/ in longer rimes is darker.
• S&F (1993) devised a set of boundaries with a variety of strengths, to ‘elicit’ different rime durations in laboratory speech:
  Major intonation boundary: Beel, equate the actors. “|”
  VP phrase boundary: Beel equates the actors. “V”
  Compound-internal boundary: The beel-equator’s amazing. “C”
  ‘#’ boundary: The beel-ing men are actors. “#”
  No boundary: Mr Beelik wants actors. “%”

Introduction
• Figure 1 in Sproat and Fujimura (1993): Relation between F2-F1 (in Hz) and pre-boundary rime duration (in s) for (a) speaker CS and (b) speaker RS.

Introduction
• Figure 4 in Sproat and Fujimura (1993): A schematic illustration of the effects of rime duration on pre-boundary post-nuclear /l/.

Introduction
• Huffman (1997) showed that onset [l]s also vary in backness: the dorsum gesture for the intervocalic onset [l]s (e.g., in below) may be shifted leftward in time relative to the apical gesture, which makes a dark(er) /l/.
• The data utilized in these studies (both S&F and Huffman) comprised only a few hundred tokens of /l/ in laboratory speech.
• “The relation of duration and backness can be complicated by differences in coarticulatory effects of neighboring vowels, or by speaker-specific constraints on absolute degree of backness.” (Huffman 1997)
  => Our study uses a very large speech corpus where these complications average out.
• Automatic formant tracking is error-prone, and it is time-consuming to measure formants by hand.
  => We develop a new method to quantify /l/ backness without formant tracking.

Our Data
• The SCOTUS corpus includes more than 50 years of oral arguments from the Supreme Court of the United States, nearly 9,000 hours in total. For this study, we used only the Justices’ speech (25.5 hours) from the 2001-term arguments, along with the orthographic transcripts.
• The phone boundaries were automatically aligned using the Penn Phonetics Lab (PPL) forced aligner trained on the same data, with the HTK toolkit and the CMU pronouncing dictionary.
• The dataset contains 21,706 tokens of /l/, including 3,410 word-initial [l]s, 7,565 word-final [l]s, and 10,731 word-medial [l]s.

The Penn Phonetics Lab Forced Aligner
• The aligner’s acoustic models are GMM-based monophone HMMs on 39 PLP coefficients. The monophones include:
  speech segments: /t/, /l/, /aa1/, /ih0/, … (ARPAbet)
  non-speech segments: {sil} silence; {LG} laugh; {NS} noise; {BR} breath; {CG} cough; {LS} lip smack
  {sp} is a “tee” model with a direct transition from the entry to the exit node in the HMM (so “sp” can have 0 length) … used for handling possible inter-word silence.
• The mean absolute difference between manual and automatically-aligned phone boundaries in TIMIT is about 12 milliseconds.
• http://www.ling.upenn.edu/phonetics/p2fa/
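
The boundary-accuracy figure above is a mean absolute difference over paired boundaries. A minimal sketch of that measure follows, assuming the manual and automatic boundaries have already been paired one-to-one; the function name and the example times are made up for illustration and are not TIMIT data.

```python
def mean_abs_boundary_diff(manual, automatic):
    """Mean absolute difference (in seconds) between paired boundary times."""
    if len(manual) != len(automatic):
        raise ValueError("boundary lists must be paired one-to-one")
    if not manual:
        raise ValueError("need at least one boundary")
    return sum(abs(m - a) for m, a in zip(manual, automatic)) / len(manual)

# Made-up boundary times in seconds, for illustration only.
manual_s = [0.100, 0.250, 0.410]
auto_s = [0.110, 0.240, 0.425]
mad_ms = 1000 * mean_abs_boundary_diff(manual_s, auto_s)
print(f"mean absolute difference: {mad_ms:.1f} ms")
```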

Forced Alignment Architecture
[Figure: architecture diagram, ending in “Word and phone boundaries located”]

Method
• To measure the “darkness” of /l/ through forced alignment, we first split /l/ into two phones, L1 for the clear /l/ and L2 for the dark /l/, and retrained the acoustic models for the new phone set.
• In training, word-initial [l]s (e.g., like, please) were categorized as L1 (clear); the word-final [l]s (e.g., full, felt) were L2 (dark). All other [l]s were ambiguous, which could be either L1 or L2.
• During each iteration of training, the ‘real’ pronunciations of the ambiguous [l]s were automatically determined, and then the acoustic models of L1 and L2 were updated.
• The new acoustic models were tested on both the training data and on a data subset that had been set aside for testing. During the tests, all [l]s were treated as ambiguous: the aligner determined whether a given [l] was L1 or L2.
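
The training labels described above could be assigned roughly as follows. This is a sketch under assumptions, not the authors' code: judging from the examples (please, felt), “word-initial” is taken to mean before the first vowel of the word and “word-final” after the last vowel, vowels are recognized by the stress digit that CMU-dictionary ARPAbet vowels carry, and the "L1|L2" marker for ambiguous tokens is a made-up notation.

```python
def is_vowel(phone: str) -> bool:
    """CMU-dictionary vowels carry a stress digit (0, 1 or 2) at the end."""
    return phone[-1].isdigit()

def l_training_labels(phones):
    """Relabel every 'L' in one word's ARPAbet phone list.

    L before the first vowel -> 'L1' (clear)
    L after the last vowel   -> 'L2' (dark)
    any other L              -> 'L1|L2' (ambiguous, left to the aligner)
    """
    vowel_idx = [i for i, p in enumerate(phones) if is_vowel(p)]
    out = []
    for i, p in enumerate(phones):
        if p != "L" or not vowel_idx:
            out.append(p)
        elif i < vowel_idx[0]:
            out.append("L1")
        elif i > vowel_idx[-1]:
            out.append("L2")
        else:
            out.append("L1|L2")
    return out

for word, phones in [("like", ["L", "AY1", "K"]),
                     ("please", ["P", "L", "IY1", "Z"]),
                     ("felt", ["F", "EH1", "L", "T"]),
                     ("below", ["B", "IH0", "L", "OW1"])]:
    print(word, l_training_labels(phones))
```

In an HTK-style setup the ambiguous case would more likely be expressed as two alternative dictionary pronunciations for the word, so that the aligner picks one during each training pass.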

Method
• An example of L1/L2 classification through forced alignment: [figure]

Method
• If we use word-initial vs. word-final as the gold standard, the accuracy of /l/ classification by forced alignment is 93.8% on the training data and 92.8% on the test data.
  Rows: gold standard by word position. Columns: classified by the aligner.
  Training data      L1      L2
    L1             2987     235
    L2              414    6757
  Test data          L1      L2
    L1              169      19
    L2               23     371
• These results suggest that acoustic fit to clear/dark allophones in forced alignment is a plausible way to estimate the darkness of /l/.
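
The two accuracy figures can be reproduced from the confusion counts on the slide. A minimal sketch, where accuracy is the sum of the diagonal (agreeing) cells divided by the total; the function name is illustrative.

```python
def accuracy(confusion):
    """Fraction of tokens on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Counts from the slide: [[L1 as L1, L1 as L2], [L2 as L1, L2 as L2]]
train = [[2987, 235], [414, 6757]]
test = [[169, 19], [23, 371]]

print(f"training: {100 * accuracy(train):.1f}%")
print(f"test:     {100 * accuracy(test):.1f}%")
```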

Method
• To compute a metric to measure the degree of /l/-darkness, we therefore ran forced alignment twice. All [l]s were first aligned with the L1 model, and then with the L2 model.
• The difference in log likelihood scores between the L2 and L1 alignments (the D score) measures the darkness of [l]. The larger the D score, the darker the [l]. The histograms of the D scores: [figure]
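
In code, the D score is just a difference of two log likelihoods. A minimal sketch, assuming the per-token log likelihoods from the two alignment passes are already available; the numbers below are made up for illustration and are not results from the study.

```python
def d_score(loglik_l1: float, loglik_l2: float) -> float:
    """Darkness score: log likelihood under L2 (dark) minus under L1 (clear)."""
    return loglik_l2 - loglik_l1

# Hypothetical (L1, L2) log likelihoods for two tokens, illustration only.
tokens = {"token_a": (-310.0, -300.0), "token_b": (-280.0, -295.0)}
for name, (ll1, ll2) in tokens.items():
    d = d_score(ll1, ll2)
    better = "L2 (dark) fits better" if d > 0 else "L1 (clear) fits better"
    print(f"{name}: D = {d:+.1f}, {better}")
```

Because D is a difference of log likelihoods, it is a log likelihood ratio: positive when the dark model fits the token better, negative when the clear model does.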

Results
• To study the relation between rime duration and /l/-darkness, we use the [l]s that follow a primary-stress vowel (denoted as ‘1’).
• Such [l]s can precede a word boundary (‘#’), or a consonant (‘C’) or a non-stress vowel (‘0’) within the word.
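
The three contexts can be read off a CMU-style pronunciation. A minimal sketch, assuming each word is a list of ARPAbet phones with stress digits on the vowels; the function name and the choice to return None for contexts the slide does not cover are assumptions.

```python
def l_context(phones, i):
    """Classify the /l/ at index i of one word's phone list.

    Returns '1_L_#', '1_L_C' or '1_L_0' when the /l/ follows a
    primary-stress vowel, and None for any other context.
    """
    if phones[i] != "L" or i == 0 or not phones[i - 1].endswith("1"):
        return None
    if i == len(phones) - 1:
        return "1_L_#"          # followed by a word boundary
    nxt = phones[i + 1]
    if nxt.endswith("0"):
        return "1_L_0"          # followed by a non-stress vowel
    if not nxt[-1].isdigit():
        return "1_L_C"          # followed by a consonant
    return None                 # followed by a stressed vowel: not covered

print(l_context(["F", "UH1", "L"], 2))              # full
print(l_context(["F", "EH1", "L", "T"], 2))         # felt
print(l_context(["F", "IY1", "L", "IH0", "NG"], 2)) # feeling
```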

Results
From the figure we can see that:
• The [l]s in longer rimes have larger D scores, and hence are darker. This result is consistent with Sproat and Fujimura (1993).
• The [l]s preceding a non-stress vowel (1_L_0) are less dark than the [l]s preceding a word boundary (1_L_#) or a consonant (1_L_C).
• The relation between duration and darkness for 1_L_C is non-linear. The /l/ reaches maximum darkness when the stressed vowel plus /l/ is about 150-200 ms.
• The syllable-final [l]s are always dark, even in very short rimes. This contradicts the finding of Sproat and Fujimura (1993) that the syllable-final /l/ in very short rimes is as clear as the canonical clear /l/.

Results
• To further examine the difference between clear and dark /l/, we compare intervocalic /l/ in the 1_L_0 context (syllable-final or “ambisyllabic”) with intervocalic /l/ in the 0_L_1 context (syllable-initial).
• The “rime” duration here means the duration of the previous vowel plus the duration of [l], regardless of putative syllabic affinity…

Results
From the figures we can see that:
• The intervocalic syllable-final [l]s have positive D scores whereas the intervocalic syllable-initial [l]s have negative D scores.
• There is a positive correlation between darkness and rime duration (i.e., the duration of /l/ and its preceding vowel) for the intervocalic syllable-final [l]s, but no correlation for the intervocalic syllable-initial [l]s.
• For the intervocalic syllable-final /l/, there is a positive correlation between /l/ duration and darkness. No correlation between /l/ duration and darkness was found, however, for the intervocalic syllable-initial /l/.
• These results suggest that there is a clear difference between the intervocalic syllable-final and syllable-initial /l/s.
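
The slides do not say which correlation statistic was used. As one possibility, a Pearson correlation between duration and D score could be computed per context as sketched below; the function name is illustrative and the data values are made up, not results from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

# Made-up rime durations (s) and D scores for one context, illustration only.
durations = [0.10, 0.15, 0.20, 0.25]
d_scores = [1.0, 2.0, 2.5, 4.0]
r = pearson_r(durations, d_scores)
print(f"r = {r:.2f}")
```

Pearson's r only captures linear association, so it would understate a non-linear pattern such as the one reported for the 1_L_C context; binned means over duration are a common complement.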

Conclusions
• We found a strong correlation between the rime duration and /l/-darkness for syllable-final /l/. This result is consistent with Sproat and Fujimura (1993). We found no correlation between /l/ duration and darkness for syllable-initial /l/. This result differs from Huffman (1997).
• We found a clear difference in /l/ darkness between the 0_1 and 1_0 stress contexts, across all values of V+/l/ duration and of /l/ duration.
• We found that the syllable-final /l/ preceding a non-stress vowel was less dark than preceding a consonant or a word boundary. Also, there was a non-linear relationship between timing and quality for the /l/ preceding a consonant and following a primary-stress vowel. These segments reach a peak of darkness when the duration of the stressed vowel plus /l/ is about 150-200 ms. Further research is needed to confirm and explain these results.

Meta-Conclusion
Large “found” collections of speech can be used effectively in phonetics research. Better pronunciation modeling and better forced alignment will be helpful. But the existing technology is good enough to start with.

A new Golden Age of phonetics?