Informative Dialect Identification Nancy Chen Oct. 31, 2008
Language Recognizer or L1 Detector? [Diagram: a language recognizer given Indian English speech outputs "Hindi", i.e., it behaves more like a first-language (L1) detector.]
Automatic Speech Recognizers [Cartoon: given Indian English speech, the recognizer responds "I only understand English. You are speaking a foreign language."]
Traditional Automatic Recognizers [Diagram: speech goes in, opaque numbers ("18, 53, …") come out] • Big black box • Input features are not intuitive (not F0, F1, F2) • Thousands of Gaussians, each with 40+ dimensions • Efficiently process lots of data • Training data ~100+ hrs • Hard to interpret models and results
Linguistic Studies [Example utterance: "Spread the peanut butter"] • Few speakers (20-30 at most) • Perceptual analysis takes much time and effort • Output: phonological rules
American English Speaker ["Spread the peanut butter"] • Voiceless stop consonants are unaspirated when preceded by fricatives: the "p" in spread sounds more like "b" • Intervocalic /t/ is flapped when followed by an unstressed syllable: the "t" in butter does not build up intra-oral pressure
Indian English Speaker ["I can't spread the peanut butter with Harr"] • Voiceless stop consonants are always unaspirated: /p/, /t/, /k/ sound like /b/, /d/, /g/ • Inter-dental fricatives become stop-like: "the" sounds like "de" • Alveolar consonants /t/, /d/, /n/ are retroflex • /w/ → /v/ • British English influence: rhoticity is lost in "vowel + /r/" sequences; /ae/ → /a/, e.g., bath, can't
Goal [Diagram: combine the scalability of traditional automatic recognizers with the interpretability of linguistic studies. From input speech ("Spread the peanut butter"), informative dialect identification outputs both phone-level evidence (e.g., …[t][er] vs. [dx][er]…) and a dialect label (American English).]
Potential Applications • Forensic phonetics • Speaker recognition and characterization • Automated speech recognition and synthesis • Accent training in education • Articulatory and phonological disorder diagnosis
Challenges • Automatic phone recognition limitations: state-of-the-art "phone recognition" accuracy is only 50-60%; commercial speech recognition relies heavily on grammar and social context • Dialect differences are inadequately captured, e.g., retroflex [t] recognized as typical [t], [r], [ax], … • Sub-dialects within Indian English
Related Research • Automatic speech recognition for non-native speech (Fung 2005; Livescu 2000) • Accent classification (Angkititrakul & Hansen 2006) • Language identification (Li, Ma & Lee 2007)
Techniques • Acoustic modeling (e.g., Torres-Carrasquillo et al. 2004) • Gaussian mixture models, hidden Markov models • N-grams of phonetic units (e.g., Zissman 1995) • Models the “grammar” of phones • PRLM (Phone Recognition followed by Language Modeling) • Our approach: acoustic modeling of dialect-discriminating phonetic contexts
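As a rough illustration of the PRLM idea above (not the actual system described in these slides), the sketch below trains one smoothed phone-bigram language model per dialect and scores a decoded phone string against each; the phone symbols, toy training strings, and smoothing constant are all made-up placeholders.

```python
from collections import defaultdict
import math

def train_bigram_lm(phone_sequences, alpha=0.1):
    """Train an add-alpha-smoothed phone-bigram LM; returns a scoring function."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in phone_sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1.0
    vocab_size = len(vocab)

    def log_prob(seq):
        padded = ["<s>"] + list(seq) + ["</s>"]
        lp = 0.0
        for prev, cur in zip(padded, padded[1:]):
            total = sum(counts[prev].values())
            lp += math.log((counts[prev][cur] + alpha) / (total + alpha * vocab_size))
        return lp

    return log_prob

# In PRLM, one LM per dialect scores the phone string produced by a single
# dialect-independent phone recognizer; the higher-scoring LM wins.
american_lm = train_bigram_lm([["b", "ah", "dx", "er"], ["s", "p", "r", "eh", "d"]])
indian_lm = train_bigram_lm([["b", "ah", "t", "er"], ["s", "b", "r", "eh", "d"]])
test_phones = ["b", "ah", "dx", "er"]
print("American" if american_lm(test_phones) > indian_lm(test_phones) else "Indian")
```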
Terminology & Notation • Monophone: e.g., [t], [a] • Biphone: a monophone in the context of other phones • Phonetic notation: [k-r] is an [r] preceded by [k]; [t+a] is a [t] followed by [a] • Mathematical notation: a biphone variable b is a phone p1 followed by a phone p2, with p1, p2 ∈ {monophone set} • Only two dialects are considered, d ∈ {d1, d2}: d1 = American English, d2 = Indian English
Finding Dialect-Specific Phonological Rules • Supervised Learning • If phone transcriptions are available • Unsupervised Learning • If no phone transcriptions are available
Supervised Classification • Extract phonological rules • Adapt biphone models • Dialect recognition task via likelihood ratio test
Supervised Rule Extraction: Example 1 [Diagram: the phone recognizer hypothesizes [v] both for an Indian English speaker's "wine" and an American English speaker's "vine".] • Recognition accuracy of the recognizer-hypothesized [v] is 0% for Indian English, but 100% for American English • Recognition accuracy of [v] differs across dialects
Supervised Rule Extraction: Example 2 [Diagram: the phone recognizer hypothesizes [b] both for an Indian English speaker's "pat" and an American English speaker's "bat".] • Recognition accuracy of the recognizer-hypothesized [b] is 0% for Indian English, but 100% for American English • Recognition accuracy of [b] differs across dialects
Supervised Rule Extraction: Example 3 [Diagram: phone recognizer output for an Indian English speaker's "beats" and an American English speaker's "butter".] • Recognition accuracy of the recognizer-hypothesized [dx+er] is 0% for Indian English, but 100% for American English • Recognition accuracy of [dx+er] differs across dialects
Rule Extraction Criteria • Biphone b is dialect-discriminating for dialects d1 and d2 if: • The recognition accuracy of biphone b in dialect d1 differs from that in dialect d2 • The occurrence frequency of biphone b is sufficient (one possible formalization is sketched below)
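The slide's two criteria could be written out as follows; the accuracy-difference threshold and the minimum count are placeholders, since the exact thresholds are not given in these slides.

```latex
% Biphone b is dialect-discriminating for dialects d_1, d_2 if, on training data,
% (\theta_{\mathrm{acc}} and N_{\min} are placeholder thresholds):
\[
  \bigl|\mathrm{Acc}(b \mid d_1) - \mathrm{Acc}(b \mid d_2)\bigr| > \theta_{\mathrm{acc}}
  \qquad\text{and}\qquad
  \min\bigl(N(b \mid d_1),\, N(b \mid d_2)\bigr) \ge N_{\min},
\]
% where Acc(b | d) is the recognition accuracy of biphone b in dialect d and
% N(b | d) is its occurrence count in dialect d.
```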
Adapt Biphone Models [Diagram: the dialect-neutral monophone model is adapted into an American-English-specific monophone model, which is in turn adapted into an American-English-specific biphone model; a minimal adaptation sketch follows.]
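The adaptation method is not specified on the slide; as an illustration, here is a minimal MAP-style mean-adaptation sketch (a common choice for dialect and language GMM systems). The relevance factor, model sizes, and toy data are placeholders, not values from this work.

```python
import numpy as np

def map_adapt_means(prior_means, frames, responsibilities, relevance=16.0):
    """MAP-style mean adaptation: shift each Gaussian mean toward the data it
    is responsible for, weighted by its soft occupancy count.

    prior_means:      (M, D) means of the starting model
    frames:           (T, D) adaptation feature frames
    responsibilities: (T, M) posterior of each mixture component per frame
    relevance:        relevance factor (placeholder value)
    """
    occupancy = responsibilities.sum(axis=0)                      # (M,) soft counts
    weighted_sum = responsibilities.T @ frames                    # (M, D)
    ml_means = weighted_sum / np.maximum(occupancy, 1e-8)[:, None]
    alpha = (occupancy / (occupancy + relevance))[:, None]        # adaptation weight
    return alpha * ml_means + (1.0 - alpha) * prior_means

# Cascade as in the diagram: neutral monophone -> American monophone -> American
# biphone, each step re-adapting on the relevant frames (toy random data here).
rng = np.random.default_rng(0)
neutral_means = rng.normal(size=(8, 13))           # 8 Gaussians, 13-dim features
dialect_frames = rng.normal(size=(200, 13))
posteriors = rng.dirichlet(np.ones(8), size=200)   # (200, 8) soft assignments
am_monophone_means = map_adapt_means(neutral_means, dialect_frames, posteriors)
am_biphone_means = map_adapt_means(am_monophone_means, dialect_frames[:40], posteriors[:40])
```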
Dialect Recognition [Diagram: a test utterance is scored against the American-English biphone models and the Indian-English biphone models; the two log likelihoods feed a log-likelihood ratio test, and a threshold chosen via detection-error analysis yields the dialect decision. A minimal sketch of the ratio test follows.]
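A minimal sketch of the likelihood-ratio decision, assuming the adapted biphone models can be scored as diagonal-covariance GMMs; the model parameters, feature dimensionality, and threshold below are hypothetical stand-ins.

```python
import numpy as np
from scipy.stats import multivariate_normal

def utterance_log_likelihood(frames, means, variances, weights):
    """Average per-frame log likelihood under a diagonal-covariance GMM,
    standing in for the adapted biphone models of one dialect."""
    per_component = np.stack(
        [w * multivariate_normal.pdf(frames, mean=m, cov=np.diag(v))
         for m, v, w in zip(means, variances, weights)],
        axis=1)
    return float(np.mean(np.log(per_component.sum(axis=1) + 1e-300)))

def dialect_decision(frames, american_model, indian_model, threshold=0.0):
    """Log-likelihood ratio test: positive LLR favors American English.
    The threshold would be tuned on a development set (0.0 is a placeholder)."""
    llr = (utterance_log_likelihood(frames, *american_model)
           - utterance_log_likelihood(frames, *indian_model))
    return ("American English" if llr > threshold else "Indian English"), llr

# Toy usage with 2-dimensional features and 2-component models
rng = np.random.default_rng(0)
frames = rng.normal(loc=1.0, size=(50, 2))
am = (np.array([[1.0, 1.0], [2.0, 0.0]]), np.ones((2, 2)), np.array([0.5, 0.5]))
ind = (np.array([[-1.0, -1.0], [0.0, -2.0]]), np.ones((2, 2)), np.array([0.5, 0.5]))
print(dialect_decision(frames, am, ind))
```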
Unsupervised Classification • Unsupervised rule extraction • Adapt all biphone models • Prune out non-dialect-specific biphone models • Dialect recognition via likelihood ratio test
Retaining Biphone Models: Example [Diagram: starting from the dialect-neutral monophone model, an American biphone model and an Indian biphone model are adapted; American English data is then scored against both.] The larger the log likelihood ratio of biphone [dx+er], the more dialect-specific [dx+er] is to American English.
Quantifying Dialect Discriminability [Diagram: American English and Indian English utterances are each scored against the American and Indian biphone models; the resulting log-likelihood ratios quantify how well each biphone separates the two dialects, and sufficiently discriminating biphone models are kept. One possible retention criterion is sketched below.]
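The exact "keep" criterion is not spelled out in these slides; one plausible reading is sketched here: for each biphone, compare its development-set log-likelihood ratios from the two dialects and retain the biphones that separate them best. The separation measure and the keep fraction (chosen to echo the ~25% pruning reported later) are assumptions.

```python
import numpy as np

def biphone_separation(llr_american, llr_indian):
    """How well a biphone's log-likelihood ratio separates the dialects on
    development data: mean LLR on American utterances minus mean LLR on
    Indian utterances (larger means more dialect-specific)."""
    return float(np.mean(llr_american) - np.mean(llr_indian))

def retain_biphones(dev_llrs, keep_fraction=0.75):
    """Keep the most dialect-discriminating fraction of biphone models.
    dev_llrs maps biphone -> (LLRs on American dev utts, LLRs on Indian dev utts).
    keep_fraction=0.75 mirrors pruning roughly 25% of the models (placeholder)."""
    scored = sorted(dev_llrs.items(),
                    key=lambda kv: biphone_separation(*kv[1]),
                    reverse=True)
    n_keep = max(1, int(round(keep_fraction * len(scored))))
    return [biphone for biphone, _ in scored[:n_keep]]

# Toy usage with made-up per-biphone development LLRs
dev = {"[dx+er]": ([2.1, 1.8, 2.4], [-1.9, -2.2]),
       "[zh+a]":  ([0.1, -0.2, 0.0], [0.1, -0.1]),
       "[ae+s]":  ([1.2, 0.9, 1.5], [-1.0, -1.3])}
print(retain_biphones(dev, keep_fraction=0.67))
```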
Experimental Setup • Training set: • 104 hrs of dialect-marked data without transcriptions • Test set: • 1298 American English trials • 200 Indian English trials • Each trial is 30 seconds • Dialect-neutral monophone HMM models: • trained on 23 hrs of transcribed data • 47 English monophones
Pilot Study: Dialect-Specific [r] Biphones • Recognizer-decoded [r] instances were manually labeled in both dialects
Detection Error Trade-off Curve [Plot: detection error trade-off (DET) curves; EER = Equal Error Rate.]
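Since the results are reported as equal error rates, here is a minimal sketch of how an EER is read off a threshold sweep over target and non-target scores; the synthetic scores are illustrative only (the trial counts merely echo the test-set sizes given above).

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep the decision threshold over all observed scores and return the
    operating point where the miss rate and false-alarm rate are closest."""
    target_scores = np.asarray(target_scores)
    nontarget_scores = np.asarray(nontarget_scores)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([np.mean(target_scores < t) for t in thresholds])
    false_alarm = np.array([np.mean(nontarget_scores >= t) for t in thresholds])
    i = np.argmin(np.abs(miss - false_alarm))
    return (miss[i] + false_alarm[i]) / 2.0

# Toy check with synthetic log-likelihood-ratio scores
rng = np.random.default_rng(1)
eer = equal_error_rate(rng.normal(1.0, 1.0, 200), rng.normal(-1.0, 1.0, 1298))
print(f"EER ~ {100 * eer:.1f}%")
```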
Discussion • [r]-biphones perform at least as well as monophones • [r]-biphones perform better when false alarms are penalized more • [r]-biphones are not necessarily interpretable, due to phone recognition errors and rules learned from only minimal transcriptions (~1 min of speech) • Sub-dialect issues with Indian English: rules were derived from speakers with Hindi as their first language, but the first-language distribution of speakers in the test data is unknown • Next step: study more data with the unsupervised algorithm
Unsupervised Learning Experiment • A developmental set (instead of the test set) was used to determine which biphone models to retain • The proposed filtered-biphone system uses 25% fewer biphone models, while its EER performance remains comparable to the baseline unfiltered-biphone system
Equal Error Rate (EER) Results [Table: EERs for the biphone models and the fusion experiments.] • All biphone systems are superior to the baseline monophone system • The filtered-biphone system is comparable to the unfiltered-biphone system, with or without fusion with PRLM • A 29.3% relative gain is obtained when the proposed unfiltered-biphone system is fused with PRLM
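The slides do not say how the biphone and PRLM systems were fused; as one hedged illustration, a simple score-level fusion could look like this, with the weight an assumed tuning parameter rather than a value from this work.

```python
import numpy as np

def fuse_scores(biphone_llr, prlm_llr, weight=0.5):
    """Weighted-sum score-level fusion of the biphone system and PRLM.
    The weight is a placeholder that would normally be tuned on a
    development set."""
    return weight * np.asarray(biphone_llr) + (1.0 - weight) * np.asarray(prlm_llr)
```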
Discussion of Learned Rules • Dialect-discriminating biphones: flap biphones [dx+r], [dx+axr], [dx+er]; biphones [ae+s], [ae+th] occurring in "class", "bath"; biphones also learned by the supervised method, e.g., [r+s] • Non-dialect-discriminating biphones: non-speech sounds (e.g., filled pauses, coughing); /zh/ biphones
What if more biphones are pruned? [Plot: EER on the test set (%) vs. the amount of biphone models pruned, as determined on the developmental set (%).]
Contributions • We present systematic approaches to discovering dialect-discriminating biphones, with and without phone transcriptions • The proposed filtered-biphone system achieves performance comparable to the baseline unfiltered-biphone system despite using 25% fewer biphone models • Our approach complements other systems: when the filtered-biphone system is fused with a PRLM system, we obtain a 29% relative gain • This is a first step towards a linguistically informative dialect recognition system
Future Work • Investigate corpora with transcriptions to enhance interpretability of phonological rules • Model dialect-specific biphones in other dialects to ensure approach is language/dialect independent • Incorporate more sophisticated techniques to enhance recognition performance • Potential clinical applications: diagnosing articulatory and phonological disorders