Informative Dialect Identification Nancy Chen Oct. 31, 2008
Language Recognizer or L1 Detector? [Diagram: a language recognizer given Indian English speech outputs "Hindi", i.e., it behaves more like a first-language (L1) detector.]
Automatic Speech Recognizers [Cartoon: given Indian English speech, the recognizer responds "I only understand English. You are speaking a foreign language."]
Traditional Automatic Recognizers [Diagram: speech goes in, opaque numbers ("18, 53, …") come out] • Big black box • Input features are not intuitive (not F0, F1, F2) • Thousands of Gaussians, each with 40+ dimensions • Efficiently process lots of data • Training data ~100+ hrs • Hard to interpret models and results
Linguistic Studies [Example utterance: "Spread the peanut butter"] • Few speakers (20-30 at most) • Perceptual analysis takes much time and effort • Output: phonological rules
American English Speaker ["Spread the peanut butter"] • Voiceless stop consonants are unaspirated when preceded by fricatives: the "p" in spread sounds more like "b" • Intervocalic /t/ is flapped when followed by an unstressed syllable: the "t" in butter does not build up intra-oral pressure
Indian English Speaker ["I can't spread the peanut butter with Harr"] • Voiceless stop consonants are always unaspirated: /p/, /t/, /k/ sound like /b/, /d/, /g/ • Inter-dental fricatives become stop-like: "the" sounds like "de" • Alveolar consonants /t/, /d/, /n/ are retroflex • /w/ → /v/ • British English influence: rhoticity is lost in "vowel + /r/" sequences; /ae/ → /a/, e.g., bath, can't
Goal [Diagram: combine the scalability of traditional automatic recognizers with the interpretability of linguistic studies. From input speech ("Spread the peanut butter"), informative dialect identification outputs both phone-level evidence (e.g., …[t][er] vs. [dx][er]…) and a dialect label (American English).]
Potential Applications • Forensic phonetics • Speaker recognition and characterization • Automated speech recognition and synthesis • Accent training in education • Articulatory and phonological disorder diagnosis
Challenges • Automatic phone recognition limitations: state-of-the-art "phone recognition" accuracy is only 50-60%; commercial speech recognition relies heavily on grammar and social context • Dialect differences are inadequately captured, e.g., retroflex [t] recognized as typical [t], [r], [ax], … • Sub-dialects within Indian English
Related Research • Automatic speech recognition for non-native speech (Fung 2005; Livescu 2000) • Accent classification (Angkititrakul & Hansen 2006) • Language identification (Li, Ma & Lee 2007)
Techniques • Acoustic modeling (e.g., Torres-Carrasquillo et al. 2004) • Gaussian mixture models, hidden Markov models • N-grams of phonetic units (e.g., Zissman 1995) • Models the “grammar” of phones • PRLM (Phone Recognition followed by Language Modeling) • Our approach: acoustic modeling of dialect-discriminating phonetic contexts
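As a rough illustration of the PRLM idea above (not the actual system described in these slides), the sketch below trains one smoothed phone-bigram language model per dialect and scores a decoded phone string against each; the phone symbols, toy training strings, and smoothing constant are all made-up placeholders.

```python
from collections import defaultdict
import math

def train_bigram_lm(phone_sequences, alpha=0.1):
    """Train an add-alpha-smoothed phone-bigram LM; returns a scoring function."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in phone_sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1.0
    vocab_size = len(vocab)

    def log_prob(seq):
        padded = ["<s>"] + list(seq) + ["</s>"]
        lp = 0.0
        for prev, cur in zip(padded, padded[1:]):
            total = sum(counts[prev].values())
            lp += math.log((counts[prev][cur] + alpha) / (total + alpha * vocab_size))
        return lp

    return log_prob

# In PRLM, one LM per dialect scores the phone string produced by a single
# dialect-independent phone recognizer; the higher-scoring LM wins.
american_lm = train_bigram_lm([["b", "ah", "dx", "er"], ["s", "p", "r", "eh", "d"]])
indian_lm = train_bigram_lm([["b", "ah", "t", "er"], ["s", "b", "r", "eh", "d"]])
test_phones = ["b", "ah", "dx", "er"]
print("American" if american_lm(test_phones) > indian_lm(test_phones) else "Indian")
```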
Terminology & Notation • Monophone: e.g., [t], [a] • Biphone: a monophone in the context of other phones • Phonetic notation: [k-r] is an [r] preceded by [k]; [t+a] is a [t] followed by [a] • Mathematical notation: a biphone variable b is a phone p1 followed by a phone p2, with p1, p2 ∈ {monophone set} • Only two dialects are considered, d ∈ {d1, d2}: d1 = American English, d2 = Indian English
Finding Dialect-Specific Phonological Rules • Supervised Learning • If phone transcriptions are available • Unsupervised Learning • If no phone transcriptions are available
Supervised Classification • Extract phonological rules • Adapt biphone models • Dialect recognition task via likelihood ratio test
Supervised Rule Extraction: Example 1 [Diagram: the phone recognizer hypothesizes [v] both for an Indian English speaker's "wine" and an American English speaker's "vine".] • Recognition accuracy of the recognizer-hypothesized [v] is 0% for Indian English, but 100% for American English • Recognition accuracy of [v] differs across dialects
Supervised Rule Extraction: Example 2 [Diagram: the phone recognizer hypothesizes [b] both for an Indian English speaker's "pat" and an American English speaker's "bat".] • Recognition accuracy of the recognizer-hypothesized [b] is 0% for Indian English, but 100% for American English • Recognition accuracy of [b] differs across dialects
Supervised Rule Extraction: Example 3 [Diagram: phone recognizer output for an Indian English speaker's "beats" and an American English speaker's "butter".] • Recognition accuracy of the recognizer-hypothesized [dx+er] is 0% for Indian English, but 100% for American English • Recognition accuracy of [dx+er] differs across dialects
Rule Extraction Criteria • Biphone b is dialect-discriminating for dialects d1 and d2 if: • The recognition accuracy of biphone b in dialect d1 differs from that in dialect d2 • The occurrence frequency of biphone b is sufficient (one possible formalization is sketched below)
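The slide's two criteria could be written out as follows; the accuracy-difference threshold and the minimum count are placeholders, since the exact thresholds are not given in these slides.

```latex
% Biphone b is dialect-discriminating for dialects d_1, d_2 if, on training data,
% (\theta_{\mathrm{acc}} and N_{\min} are placeholder thresholds):
\[
  \bigl|\mathrm{Acc}(b \mid d_1) - \mathrm{Acc}(b \mid d_2)\bigr| > \theta_{\mathrm{acc}}
  \qquad\text{and}\qquad
  \min\bigl(N(b \mid d_1),\, N(b \mid d_2)\bigr) \ge N_{\min},
\]
% where Acc(b | d) is the recognition accuracy of biphone b in dialect d and
% N(b | d) is its occurrence count in dialect d.
```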
Adapt Biphone Models [Diagram: the dialect-neutral monophone model is adapted into an American-English-specific monophone model, which is in turn adapted into an American-English-specific biphone model; a minimal adaptation sketch follows.]
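The adaptation method is not specified on the slide; as an illustration, here is a minimal MAP-style mean-adaptation sketch (a common choice for dialect and language GMM systems). The relevance factor, model sizes, and toy data are placeholders, not values from this work.

```python
import numpy as np

def map_adapt_means(prior_means, frames, responsibilities, relevance=16.0):
    """MAP-style mean adaptation: shift each Gaussian mean toward the data it
    is responsible for, weighted by its soft occupancy count.

    prior_means:      (M, D) means of the starting model
    frames:           (T, D) adaptation feature frames
    responsibilities: (T, M) posterior of each mixture component per frame
    relevance:        relevance factor (placeholder value)
    """
    occupancy = responsibilities.sum(axis=0)                      # (M,) soft counts
    weighted_sum = responsibilities.T @ frames                    # (M, D)
    ml_means = weighted_sum / np.maximum(occupancy, 1e-8)[:, None]
    alpha = (occupancy / (occupancy + relevance))[:, None]        # adaptation weight
    return alpha * ml_means + (1.0 - alpha) * prior_means

# Cascade as in the diagram: neutral monophone -> American monophone -> American
# biphone, each step re-adapting on the relevant frames (toy random data here).
rng = np.random.default_rng(0)
neutral_means = rng.normal(size=(8, 13))           # 8 Gaussians, 13-dim features
dialect_frames = rng.normal(size=(200, 13))
posteriors = rng.dirichlet(np.ones(8), size=200)   # (200, 8) soft assignments
am_monophone_means = map_adapt_means(neutral_means, dialect_frames, posteriors)
am_biphone_means = map_adapt_means(am_monophone_means, dialect_frames[:40], posteriors[:40])
```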
Dialect Recognition [Diagram: a test utterance is scored against the American-English biphone models and the Indian-English biphone models; the two log likelihoods feed a log-likelihood ratio test, and a threshold chosen via detection-error analysis yields the dialect decision. A minimal sketch of the ratio test follows.]
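A minimal sketch of the likelihood-ratio decision, assuming the adapted biphone models can be scored as diagonal-covariance GMMs; the model parameters, feature dimensionality, and threshold below are hypothetical stand-ins.

```python
import numpy as np
from scipy.stats import multivariate_normal

def utterance_log_likelihood(frames, means, variances, weights):
    """Average per-frame log likelihood under a diagonal-covariance GMM,
    standing in for the adapted biphone models of one dialect."""
    per_component = np.stack(
        [w * multivariate_normal.pdf(frames, mean=m, cov=np.diag(v))
         for m, v, w in zip(means, variances, weights)],
        axis=1)
    return float(np.mean(np.log(per_component.sum(axis=1) + 1e-300)))

def dialect_decision(frames, american_model, indian_model, threshold=0.0):
    """Log-likelihood ratio test: positive LLR favors American English.
    The threshold would be tuned on a development set (0.0 is a placeholder)."""
    llr = (utterance_log_likelihood(frames, *american_model)
           - utterance_log_likelihood(frames, *indian_model))
    return ("American English" if llr > threshold else "Indian English"), llr

# Toy usage with 2-dimensional features and 2-component models
rng = np.random.default_rng(0)
frames = rng.normal(loc=1.0, size=(50, 2))
am = (np.array([[1.0, 1.0], [2.0, 0.0]]), np.ones((2, 2)), np.array([0.5, 0.5]))
ind = (np.array([[-1.0, -1.0], [0.0, -2.0]]), np.ones((2, 2)), np.array([0.5, 0.5]))
print(dialect_decision(frames, am, ind))
```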
Unsupervised Classification • Unsupervised rule extraction • Adapt all biphone models • Prune out non-dialect-specific biphone models • Dialect recognition via likelihood ratio test
Retaining Biphone Models: Example [Diagram: starting from the dialect-neutral monophone model, an American biphone model and an Indian biphone model are adapted; American English data is then scored against both.] The larger the log likelihood ratio of biphone [dx+er], the more dialect-specific [dx+er] is to American English.
Quantifying Dialect Discriminability [Diagram: American English and Indian English utterances are each scored against the American and Indian biphone models; the resulting log-likelihood ratios quantify how well each biphone separates the two dialects, and sufficiently discriminating biphone models are kept. One possible retention criterion is sketched below.]
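The exact "keep" criterion is not spelled out in these slides; one plausible reading is sketched here: for each biphone, compare its development-set log-likelihood ratios from the two dialects and retain the biphones that separate them best. The separation measure and the keep fraction (chosen to echo the ~25% pruning reported later) are assumptions.

```python
import numpy as np

def biphone_separation(llr_american, llr_indian):
    """How well a biphone's log-likelihood ratio separates the dialects on
    development data: mean LLR on American utterances minus mean LLR on
    Indian utterances (larger means more dialect-specific)."""
    return float(np.mean(llr_american) - np.mean(llr_indian))

def retain_biphones(dev_llrs, keep_fraction=0.75):
    """Keep the most dialect-discriminating fraction of biphone models.
    dev_llrs maps biphone -> (LLRs on American dev utts, LLRs on Indian dev utts).
    keep_fraction=0.75 mirrors pruning roughly 25% of the models (placeholder)."""
    scored = sorted(dev_llrs.items(),
                    key=lambda kv: biphone_separation(*kv[1]),
                    reverse=True)
    n_keep = max(1, int(round(keep_fraction * len(scored))))
    return [biphone for biphone, _ in scored[:n_keep]]

# Toy usage with made-up per-biphone development LLRs
dev = {"[dx+er]": ([2.1, 1.8, 2.4], [-1.9, -2.2]),
       "[zh+a]":  ([0.1, -0.2, 0.0], [0.1, -0.1]),
       "[ae+s]":  ([1.2, 0.9, 1.5], [-1.0, -1.3])}
print(retain_biphones(dev, keep_fraction=0.67))
```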
Experimental Setup • Training set: • 104 hrs of dialect-marked data without transcriptions • Test set: • 1298 American English trials • 200 Indian English trials • Each trial is 30 seconds • Dialect-neutral monophone HMM models: • trained on 23 hrs of transcribed data • 47 English monophones
Pilot Study: Dialect-Specific [r] Biphones • Recognizer-decoded [r] instances were manually labeled in both dialects
Detection Error Trade-off Curve [Plot: detection error trade-off (DET) curves; EER = Equal Error Rate.]
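Since the results are reported as equal error rates, here is a minimal sketch of how an EER is read off a threshold sweep over target and non-target scores; the synthetic scores are illustrative only (the trial counts merely echo the test-set sizes given above).

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep the decision threshold over all observed scores and return the
    operating point where the miss rate and false-alarm rate are closest."""
    target_scores = np.asarray(target_scores)
    nontarget_scores = np.asarray(nontarget_scores)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([np.mean(target_scores < t) for t in thresholds])
    false_alarm = np.array([np.mean(nontarget_scores >= t) for t in thresholds])
    i = np.argmin(np.abs(miss - false_alarm))
    return (miss[i] + false_alarm[i]) / 2.0

# Toy check with synthetic log-likelihood-ratio scores
rng = np.random.default_rng(1)
eer = equal_error_rate(rng.normal(1.0, 1.0, 200), rng.normal(-1.0, 1.0, 1298))
print(f"EER ~ {100 * eer:.1f}%")
```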
Discussion • [r]-biphones perform at least as well as monophones • [r]-biphones perform better when false alarms are penalized more • [r]-biphones are not necessarily interpretable, due to phone recognition errors and rules learned from only minimal transcriptions (~1 min of speech) • Sub-dialect issues with Indian English: rules were derived from speakers with Hindi as their first language, but the first-language distribution of speakers in the test data is unknown • Next step: study more data with the unsupervised algorithm
Unsupervised Learning Experiment • A developmental set (instead of the test set) was used to determine which biphone models to retain • The proposed filtered-biphone system uses 25% fewer biphone models, while its EER performance remains comparable to the baseline unfiltered-biphone system
Equal Error Rate (EER) Results [Table: EERs for the biphone models and the fusion experiments.] • All biphone systems are superior to the baseline monophone system • The filtered-biphone system is comparable to the unfiltered-biphone system, with or without fusion with PRLM • A 29.3% relative gain is obtained when the proposed unfiltered-biphone system is fused with PRLM
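The slides do not say how the biphone and PRLM systems were fused; as one hedged illustration, a simple score-level fusion could look like this, with the weight an assumed tuning parameter rather than a value from this work.

```python
import numpy as np

def fuse_scores(biphone_llr, prlm_llr, weight=0.5):
    """Weighted-sum score-level fusion of the biphone system and PRLM.
    The weight is a placeholder that would normally be tuned on a
    development set."""
    return weight * np.asarray(biphone_llr) + (1.0 - weight) * np.asarray(prlm_llr)
```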
Discussion of Learned Rules • Dialect-discriminating biphones: flap biphones [dx+r], [dx+axr], [dx+er]; biphones [ae+s], [ae+th] occurring in "class", "bath"; biphones also learned by the supervised method, e.g., [r+s] • Non-dialect-discriminating biphones: non-speech sounds (e.g., filled pauses, coughing); /zh/ biphones
What if more biphones are pruned? [Plot: EER on the test set (%) vs. the amount of biphone models pruned, as determined on the developmental set (%).]
Contributions • We present systematic approaches to discovering dialect-discriminating biphones, with and without phone transcriptions • The proposed filtered-biphone system achieves performance comparable to the baseline unfiltered-biphone system despite using 25% fewer biphone models • Our approach complements other systems: when the filtered-biphone system is fused with a PRLM system, we obtain a 29% relative gain • This is a first step towards a linguistically informative dialect recognition system
Future Work • Investigate corpora with transcriptions to enhance interpretability of phonological rules • Model dialect-specific biphones in other dialects to ensure approach is language/dialect independent • Incorporate more sophisticated techniques to enhance recognition performance • Potential clinical applications: diagnosing articulatory and phonological disorders