AN ACOUSTIC PROFILE OF SPEECH EFFICIENCY
R.J.J.H. van Son, Barbertje M. Streefkerk, and Louis C.W. Pols
Institute of Phonetic Sciences / ACLC, University of Amsterdam, Herengracht 338, 1016 CG Amsterdam, The Netherlands
tel: +31 20 5252183; fax: +31 20 5252197; email: Rob.van.Son@hum.uva.nl
ICSLP2000, Beijing, China, Oct. 20, 2000

INTRODUCTION
• Speech is "efficient":
  • Important components are emphasized
  • Less important ones are de-emphasized
• Two mechanisms:
  1) Prosody: Lexical Stress and Sentence Accent (Prominence)
  2) Predictability: Frequency of Occurrence (tested) and Context (not tested)

MECHANISMS FOR EFFICIENT SPEECH
Speech emphasis should mirror importance, which largely corresponds to unpredictability
• Prosodic structure distributes emphasis according to importance (lexical stress, sentence accent / prominence)
• Speakers can (de-)emphasize according to supposed (un)importance
• Speech production mechanisms can facilitate redundant speech or hamper unpredictable speech

QUESTIONS
• Can the distribution of emphasis or reduction be completely explained from Prosody (Lexical Stress and Sentence Accent / Prominence)?
• If not, can we identify a speech production mechanism that would assist efficiency in speech, e.g. preprogrammed articulation of redundant and/or high-frequency syllable-like segments?

SPEECH MATERIAL (DUTCH)
• Single Male Speaker: Vowels and Consonants
  Matched Informal and Read speech, 791 matched VCV pairs
• Polyphone: Vowels only
  273 speakers (out of 5000), telephone speech, 1244 read sentences
  Segmented with a modified HMM recognizer (Xue Wang)
• Corpora sizes: number of realizations of vowels and consonants

  Corpus                         Unstressed       Stressed       Total
                      Accent:     –       +       –       +
  Single Speaker, consonants     550     180     569     283      1582
  Single Speaker, vowels         812     461     528     224      2025
  Polyphone, vowels             4435    4942    9603    3516     22496

• Accent: Sentence accent / Prominence
• Stressed/Unstressed: Lexical stress

METHODS: SPEECH PREPARATION
• Single speaker corpus
  • All 2 x 791 VCV segments hand-labeled
  • Sentence accent also determined by hand
  • 22 native listeners identified consonants from this corpus
• Polyphone corpus
  • Automatically labeled using a pronunciation lexicon and a modified HMM recognizer
  • 10 judges marked prominent words (prominence 1-10)
• Word and syllable -log2(frequencies) for both corpora were determined from Dutch CELEX
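
The -log2(frequency) measure mentioned above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that "frequency" means a relative frequency (a count divided by the corpus total), and the function name and the example counts are hypothetical rather than taken from CELEX.

```python
import math


def neg_log2_frequency(count, total):
    """Return -log2(count / total), i.e. the information in bits.

    Assumption: frequency is a relative frequency (count / total).
    Rare items get large values, frequent items get values near 0.
    """
    if count <= 0 or total <= 0 or count > total:
        raise ValueError("need 0 < count <= total")
    return -math.log2(count / total)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not CELEX data).
    total_tokens = 1_000_000
    for label, count in [("frequent item", 50_000), ("rare item", 5)]:
        print(label, round(neg_log2_frequency(count, total_tokens), 2))
```

On this scale, halving an item's frequency adds exactly one bit, which is why a logarithmic measure is convenient for correlating with acoustic measures.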

METHODS: ANALYSIS
Single Speaker Corpus, Consonants and Vowels
• Duration in ms (vowels and consonants)
• Contrast (vowels only): F1/F2 distance to (300, 1450) Hz in semitones
• Spectral Center of Gravity (CoG) (V and C): weighted mean frequency in semitones at point of maximum energy
• Log2(Perplexity) from consonant identification: calculated from confusion matrices
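
The three derived measures listed for the single speaker corpus can be sketched as follows. The slides do not give the exact formulas, so every detail here is an assumption: contrast is taken as the Euclidean distance in a semitone-scaled F1/F2 plane, CoG as a power-weighted mean frequency (returned in Hz and converted with a caller-chosen reference), and log2(perplexity) as the entropy in bits of one row of a confusion matrix. All function names are hypothetical.

```python
import math

# Reference point from the slides: (F1, F2) = (300, 1450) Hz.
REF_F1_HZ = 300.0
REF_F2_HZ = 1450.0


def semitones(freq_hz, ref_hz):
    """Distance of freq_hz from ref_hz in semitones (12 per octave)."""
    return 12.0 * math.log2(freq_hz / ref_hz)


def vowel_contrast(f1_hz, f2_hz):
    """Assumed definition: Euclidean distance to the reference point
    after converting each formant to semitones re its reference."""
    return math.hypot(semitones(f1_hz, REF_F1_HZ),
                      semitones(f2_hz, REF_F2_HZ))


def center_of_gravity_hz(freqs_hz, powers):
    """Power-weighted mean frequency of a spectrum, in Hz."""
    total = sum(powers)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs_hz, powers)) / total


def log2_perplexity(response_counts):
    """Entropy in bits of the listener responses to one stimulus
    (one row of a confusion matrix). Perplexity is 2 ** entropy,
    so log2(perplexity) is the entropy itself."""
    total = sum(response_counts)
    if total <= 0:
        raise ValueError("no responses")
    entropy = 0.0
    for count in response_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


if __name__ == "__main__":
    print(vowel_contrast(600.0, 1450.0))      # F1 one octave above reference
    print(center_of_gravity_hz([100.0, 300.0], [1.0, 1.0]))
    print(log2_perplexity([11, 11]))          # 22 listeners split evenly
```

A vowel at the reference point has zero contrast, and a consonant that all listeners identify the same way has a log2(perplexity) of zero, so larger values mean more peripheral vowels and more confusable consonants respectively.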

METHODS: ANALYSIS
Polyphone Corpus, Vowels only
• Loudness in sone
• Spectral Center of Gravity (CoG): weighted mean frequency in semitones, averaged over the segment
• Prominence (1-10): the number of 'PROMINENT' listener judgements
  0-5 is considered Unaccented
  6-10 is considered Accented
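
The prominence score and the accent split described above amount to a count and a threshold. A minimal sketch, assuming each judge gives a yes/no mark per word; the function names are hypothetical:

```python
def prominence_score(judgements):
    """Number of judges who marked the word as PROMINENT.

    judgements: iterable of booleans, one per judge.
    """
    return sum(1 for marked in judgements if marked)


def is_accented(score, threshold=5):
    """Scores of 0-5 count as unaccented, 6-10 as accented."""
    return score > threshold


if __name__ == "__main__":
    # Hypothetical word marked prominent by 7 of the 10 judges.
    marks = [True] * 7 + [False] * 3
    score = prominence_score(marks)
    print(score, is_accented(score))
```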

CONSISTENCY OF MEASUREMENTS
Correlation coefficients between factors
[Figure: correlation coefficients for each pair of measures. Single Speaker, consonants (n=1582): Duration x CoG, Duration x Px, CoG x Px. Single Speaker, vowels (n=2025): Duration x Contrast, Duration x CoG, Contrast x CoG. Polyphone (n=22496): Loudness x CoG. Filled symbols: p<=0.01]
• Duration in ms
• Loudness in sones
• CoG: Spectral Center of Gravity (semitones)
• Px: log2(Perplexity); plotted is –R
• Contrast: F1/F2 distance to (300, 1450) Hz (semitones)

CONSONANT REDUCTION VERSUS FREQUENCY OF OCCURRENCE (correlation coefficients)
Single speaker corpus (n=1582)
[Figure: correlation coefficients of Duration, CoG and Perplexity with frequency of occurrence. Filled symbols: p<=0.01]
• CoG: Spectral Center of Gravity (semitones)
• Perplexity: log2(Perplexity); plotted is –R
• Syllable and word frequencies were correlated (R=0.230, p=0.01)

VOWEL REDUCTION VERSUS FREQUENCY OF OCCURRENCE (correlation coefficients)
Single speaker corpus (n=2025)
Filled symbols: p<=0.01
• Duration in ms
• Contrast: F1/F2 distance to (300, 1450) Hz (semitones)
• CoG: Spectral Center of Gravity (semitones)
• Syllable and word frequencies were correlated (R=0.280, p<=0.01)

DISCUSSION OF SINGLE SPEAKER DATA
• There are consistent correlations between frequency of occurrence and "acoustic reduction" (duration, CoG and contrast), but not for consonant identification (perplexity)
• Correlations for syllable frequencies tend to be larger than those for word frequencies (p<=0.01)
• Correlations were found after accounting for Phoneme identity, Lexical Stress and Sentence Accent
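
One simple way to compute a correlation "after accounting for" categorical factors is to subtract, from both variables, the mean of each cell (for example each combination of phoneme, stress and accent) and then correlate the residuals. The slides do not say how the factors were accounted for, so the sketch below is an assumed stand-in, with hypothetical names and toy data; it also does not adjust degrees of freedom for the removed cell means.

```python
import math
from collections import defaultdict


def center_within_groups(values, groups):
    """Subtract from each value the mean of its group (cell)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        sums[group] += value
        counts[group] += 1
    return [value - sums[group] / counts[group]
            for value, group in zip(values, groups)]


def pearson_r(xs, ys):
    """Plain Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def within_group_correlation(xs, ys, groups):
    """Correlate xs and ys after removing group (cell) means."""
    return pearson_r(center_within_groups(xs, groups),
                     center_within_groups(ys, groups))


if __name__ == "__main__":
    # Toy data: group labels stand for (phoneme, stress, accent) cells.
    cells = ["a", "a", "a", "b", "b", "b"]
    freq = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
    dur = [2.0, 4.0, 6.0, 30.0, 28.0, 26.0]
    print(pearson_r(freq, dur))                        # driven by cell means
    print(within_group_correlation(freq, dur, cells))  # cell means removed
```

In the toy data the raw correlation is high only because the two cells differ in their means; once the cell means are removed, nothing remains. That is the kind of confound the "after accounting for" step is meant to rule out.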

PROMINENCE VERSUS VOWEL REDUCTION AND FREQUENCY OF OCCURRENCE (correlation coefficients)
Polyphone corpus (n=22496)
[Figure: correlation coefficients of prominence with Loudness, CoG, Syllable freq. and Word freq. Filled symbols: p<=0.01]
• Loudness (sone)
• CoG: Spectral Center of Gravity (semitones)
• Syllable and word frequencies (-log2(freq))

VOWEL REDUCTION VERSUS FREQUENCY OF OCCURRENCE (correlation coefficients)
Polyphone corpus (n=22496)
Filled symbols: p<=0.01
Accent: + is Prom > 5; – is Prom <= 5
• Loudness (sone)
• CoG: Spectral Center of Gravity (semitones)
• Syllable and word frequencies were correlated (R=0.316, p<=0.01)

DISCUSSION OF POLYPHONE DATA
• Perceived prominence correlates with "acoustic vowel reduction" (loudness, CoG) and frequency of occurrence (syllable and word)
• There are small but consistent correlations between "acoustic vowel reduction" and frequency of occurrence
• Correlations were found after accounting for Vowel identity, Lexical Stress and Prominence

CONCLUSIONS
• LEXICAL STRESS and SENTENCE ACCENT / PROMINENCE cannot explain all of the "efficiency" of speech: FREQUENCY OF OCCURRENCE and possibly CONTEXT in general are needed for a full account
• A SYLLABARY which speeds up (and reduces) the articulation of "stored", high-frequency syllables with respect to "computed", rare syllables might explain at least part of our data

SPOKEN LANGUAGE CORPUS: How Efficient is Speech
• 8-10 speakers: ~60 minutes of speech each (fixed and variable materials)
  • Informal story telling and retold stories ~15 min
  • Reading continuous texts ~15 min
  • Reading isolated (pseudo-)sentences ~20 min
  • Word lists ~5 min
  • Syllable lists ~5 min

MEASURING SPEECH EFFICIENCY
• Speaking Style differences (Informal, Retold, Read, Sentences, Lists)
• Predictability
  • Frequency of Occurrence (words and syllables)
  • In Context (language models)
  • Cloze-tests
  • Shadowing (RT or delay)
• Acoustic Reduction
  • Segment identification
  • Duration
  • Spectral reduction