Explore the GoP algorithm for assessing pronunciation quality at the phone level, with explicit error modeling and collection of non-native speech data. Learn about performance measures and experimental results.
Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning
Speech Communication, 2000
Authors: S. M. Witt, S. J. Young
Presenter: Davidson
Date: 2009/07/08, 2009/07/15
Contents • Introduction • Goodness of Pronunciation (GoP) algorithm • Basic GoP algorithm • Phone dependent thresholds • Explicit error modeling • Collection of a non-native database • Performance measures • The labeling consistency of the human judges • Experimental results • Conclusions and future work
Introduction (1/3) • CAPT systems (Computer-Assisted Pronunciation Training) • Word and phrase level scoring (’93, ’94, ’97) • Intonation, stress, and rhythm • Requires several recordings of native utterances for each word • Difficult to add new teaching material • Selected phonemic error teaching (1997) • Uses duration information or models trained on non-native speech
Introduction (2/3) • HMMs have been used to produce sentence-level scores (1990, 1996) • Eskenazi’s system (1996) produces phone-level scores but makes no attempt to relate them to human judgement • Authors’ proposed system: • Measures pronunciation quality for non-native speech at the phone level
Introduction (3/3) • Other issues • GoP algorithms with refinements • Performance measures for both GoP scores and scores by human judges • Non-native database • Experiments on these performance measures
Goodness of Pronunciation (GoP) algorithm: Basic GoP algorithm • A score for each phone • $p(O^{(p)} \mid p)$ = likelihood of the acoustic segment $O^{(p)}$ corresponding to phone $p$ • GoP = duration-normalized log of the posterior probability for a phone given the corresponding acoustic segment: $\mathrm{GoP}(p) = \left|\log P(p \mid O^{(p)})\right| / NF(p)$
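Basic GoP algorithm (2/5) • $\mathrm{GoP}(p) = \left|\log \frac{p(O^{(p)} \mid p)\,P(p)}{\sum_{q \in Q} p(O^{(p)} \mid q)\,P(q)}\right| / NF(p)$ • $Q$ = the set of all phone models • $NF(p)$ = number of frames in the acoustic segment $O^{(p)}$ • By assuming equal phone priors and approximating the sum in the denominator by its maximum: $\mathrm{GoP}(p) = \left|\log \frac{p(O^{(p)} \mid p)}{\max_{q \in Q} p(O^{(p)} \mid q)}\right| / NF(p)$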
Basic GoP algorithm (3/5) • Numerator term is computed using forced alignment with known transcription • Denominator term is determined using an unconstrained phone loop
Basic GoP algorithm (4/5) • If a mispronunciation has occurred, it is not reasonable to constrain the acoustic segment used to compute the maximum likelihood phone to be identical to the assumed phone • Hence, the denominator score is computed by summing the log likelihood per frame over the duration of the acoustic segment $O^{(p)}$ • In practice, this will often mean that more than one phone in the unconstrained phone sequence has contributed to the computation of the denominator $\max_{q} p(O^{(p)} \mid q)$
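A minimal sketch of the scoring step described above, assuming the per-frame log likelihoods are already available from a recogniser. The function name `gop_score` and both input lists are hypothetical; the numbers are made up purely for illustration.

```python
def gop_score(target_frame_loglik, loop_frame_loglik):
    """Duration-normalised GoP for one phone segment.

    target_frame_loglik: per-frame log likelihoods of the expected phone,
        taken from the forced alignment (numerator term).
    loop_frame_loglik: per-frame log likelihoods of the best path through
        the unconstrained phone loop over the SAME frames (denominator term).
    """
    if not target_frame_loglik or len(target_frame_loglik) != len(loop_frame_loglik):
        raise ValueError("both inputs must cover the same non-empty frame range")
    num_frames = len(target_frame_loglik)
    numerator = sum(target_frame_loglik)    # log p(O | p)
    denominator = sum(loop_frame_loglik)    # approximates log max_q p(O | q)
    return abs(numerator - denominator) / num_frames


# Illustrative values only (not taken from the paper).
target = [-6.0, -5.5, -7.0, -6.5]
loop = [-5.0, -5.5, -6.0, -5.5]
score = gop_score(target, loop)
print(score)
```

A score near 0 means the expected phone explains the segment about as well as the best unconstrained hypothesis; larger values indicate a poorer match.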
Basic GoP algorithm (5/5) • It is intuitive to use speech data from native speakers to train the acoustic models • However, non-native speech is characterized by formant structures that differ from those of a native speaker for the same phone • Adapt Gaussian means by MLLR • Use a single global transform of the HMM Gaussian component means to avoid adapting to specific phone error patterns
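To illustrate what a single global mean transform does, the sketch below applies one affine map (mean' = A·mean + b) to every Gaussian mean. Estimating A and b from adaptation data is the actual MLLR step and is not shown; the function name and the toy values are assumptions for illustration.

```python
def apply_global_mean_transform(means, A, b):
    """Apply the same affine transform A @ mu + b to every Gaussian mean.

    means: list of mean vectors (lists of floats), one per Gaussian component.
    A: square matrix as a list of rows; b: bias vector.
    Because one transform is shared by all phones, it can shift the models
    towards the speaker without learning phone-specific error patterns.
    """
    adapted = []
    for mu in means:
        adapted.append([
            sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)
        ])
    return adapted


# Toy 2-dimensional example (values are illustrative only).
means = [[1.0, 2.0], [0.0, -1.0]]
A = [[1.0, 0.0], [0.0, 2.0]]
b = [0.5, 0.0]
adapted = apply_global_mean_transform(means, A, b)
print(adapted)
```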
Phone dependent thresholds • The acoustic fit of phone-based HMMs differs from phone to phone • E.g. fricatives tend to have lower log likelihood than vowels • 2 ways to determine phone-specific thresholds • By using the mean and variance of the GoP scores for each phone • By approximating human labeling behavior
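The first option can be sketched as follows, assuming a threshold of the form mean + alpha × standard deviation + beta per phone. The scaling constants `alpha` and `beta`, the function names, and the sample scores are assumptions for illustration, not values from the slides.

```python
from statistics import mean, pstdev


def phone_thresholds(scores_by_phone, alpha=1.0, beta=0.0):
    """Per-phone rejection thresholds from GoP score statistics.

    scores_by_phone: dict mapping a phone label to a list of GoP scores
        observed for that phone (e.g. on training data).
    alpha, beta: hypothetical scaling constants, tuned empirically.
    """
    return {
        phone: mean(scores) + alpha * pstdev(scores) + beta
        for phone, scores in scores_by_phone.items()
    }


def is_rejected(phone, gop, thresholds):
    """A phone is flagged as mispronounced when its GoP exceeds its threshold."""
    return gop > thresholds[phone]


# Illustrative scores: the fricative "s" fits worse on average than the vowel "aa".
scores = {"s": [2.0, 4.0], "aa": [0.5, 1.5]}
thr = phone_thresholds(scores, alpha=1.0, beta=0.0)
print(thr, is_rejected("aa", 2.0, thr), is_rejected("s", 2.0, thr))
```

With these toy numbers the same GoP of 2.0 is rejected for the vowel but accepted for the fricative, which is the point of making thresholds phone dependent.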
Explicit error modeling (1/3) • 2 types of pronunciation errors • Individual mispronunciations • Systematic mispronunciations • Consist of substitutions of native sounds for sounds of the target language that do not exist in the native language • Knowledge of the learner’s native language is included in order to detect systematic mispronunciations
Explicit error modeling (2/3) • Solution: a recognition network incorporating both correct pronunciation and common pronunciation errors in the form of error sublattices for each phone. • E.g. “but”
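As a rough illustration of the idea, the sketch below expands a target phone sequence into the alternative pronunciations an error network would allow. A real system would build a recognition lattice rather than enumerate paths, and the substitutions in `error_map` are hypothetical placeholders, not the error patterns used by the authors.

```python
from itertools import product


def expand_with_errors(target_phones, error_map):
    """List every phone sequence permitted by per-phone error alternatives.

    target_phones: canonical pronunciation, e.g. ["b", "ah", "t"].
    error_map: dict mapping a target phone to the list of phones a learner
        is expected to substitute for it (hypothetical in this sketch).
    The canonical phone always comes first among the alternatives.
    """
    alternatives = [[p] + error_map.get(p, []) for p in target_phones]
    return [list(path) for path in product(*alternatives)]


# Hypothetical substitutions for the vowel in "but".
error_map = {"ah": ["uw", "aa"]}
paths = expand_with_errors(["b", "ah", "t"], error_map)
for path in paths:
    print(" ".join(path))
```

If the recogniser prefers a path through one of the error alternatives, the system can report which substitution was made rather than only that the phone was wrong.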
Explicit error modeling (3/3) • Target phone posterior probability • Scores for systematic mispronunciations • GoP that includes additional penalty for systematic mispronunciation
Collection of a non-native database (1/2) • Based on the procedures used for the WSJCAM0 corpus • Texts are composed of a limited vocabulary of 1500 words • 6 females and 4 males whose mother tongues are Korean (3), Japanese (3), Latin-American Spanish (3), and Italian (1). • Each speaker reads 120 sentences • A common set of 40 phonetically balanced sentences • 80 sentences that vary from session to session
Collection of a non-native database (2/2) • 6 human judges who are native speakers of British English • Each speaker was labeled by 1 judge • 20 sentences from a female Spanish speaker are used as calibration sentences • Annotated by all 6 judges • Transcriptions reflect the actual sounds uttered by the speakers • Including phonemes from other languages
Performance measures (1/3) • Compares 2 transcriptions of the same sentence • Transcriptions are either transcribed by human judges or generated automatically • 4 types of performance measures • Strictness • Agreement • Cross-correlation • Overall phone correlation
Performance measures (2/3) • Compared on a frame by frame basis • Each frame is marked 1 if it belongs to an error, 0 otherwise. • Yields a vector $x$ of length $N$ with $x(i) \in \{0, 1\}$ • Apply a Hamming window • Transition between 0 and 1 is too abrupt, whereas in practice the boundary is often uncertain • Forced alignment might be erroneous due to poor acoustic modeling of non-native speech • Window length
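The smoothing step can be sketched in plain Python as below. The window length of 5 frames and the normalisation by the window sum (which keeps values in [0, 1]) are assumptions made for this sketch; the slide does not preserve the value that was actually used.

```python
import math


def hamming(length):
    """Standard Hamming window: 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def smooth_error_vector(x, window_length=5):
    """Blur a 0/1 frame-level error vector with a centred Hamming window.

    window_length is assumed odd; frames outside the utterance count as 0.
    Dividing by the window sum is an assumption made here so that a long
    run of error frames smooths to 1.0 rather than to a larger value.
    """
    w = hamming(window_length)
    half = window_length // 2
    norm = sum(w)
    smoothed = []
    for i in range(len(x)):
        total = 0.0
        for k, wk in enumerate(w):
            j = i + k - half
            if 0 <= j < len(x):
                total += wk * x[j]
        smoothed.append(total / norm)
    return smoothed


x = [0, 0, 1, 1, 1, 0, 0]
smoothed = smooth_error_vector(x, window_length=5)
print([round(v, 3) for v in smoothed])
```

Frames next to an error region receive a non-zero weight, so two transcriptions whose error boundaries differ by a frame or two are not penalised as heavily.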
Strictness (S) • Measures how strict the judge was in marking pronunciation errors • Relative strictness between 2 transcriptions: $\delta_S = |S_{T1} - S_{T2}|$
Overall Agreement (A) • Measures the agreement of all frames between 2 transcriptions • Defined in terms of the cityblock distance between the 2 transcription vectors: $A_{T1,T2} = 1 - \frac{1}{N}\,\|x_{T1} - x_{T2}\|_C$, where $\|x\|_C = \sum_i |x(i)|$
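Cross-correlation (CC) • Measures the agreement between the error frames in either or both transcriptions: $CC_{T1,T2} = \frac{x_{T1}^{\top} x_{T2}}{\|x_{T1}\|_E\,\|x_{T2}\|_E}$ • $\|x\|_E = \sqrt{\sum_i x(i)^2}$ is the Euclidean distance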
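Phoneme Correlation (PC) • Measures the agreement of the overall rejection statistics for each phone between 2 judges/systems • PC is defined as $PC_{T1,T2} = \frac{\sum_m (c_{T1}(m) - \mu_{c_{T1}})(c_{T2}(m) - \mu_{c_{T2}})}{\sqrt{\sum_m (c_{T1}(m) - \mu_{c_{T1}})^2 \, \sum_m (c_{T2}(m) - \mu_{c_{T2}})^2}}$ • $c$ is a vector of rejection counts for each phone • $\mu_c$ denotes the mean rejection count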
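The four measures can be sketched in plain Python as follows. This assumes strictness is the fraction of phones rejected, agreement is one minus the mean cityblock distance, cross-correlation is a normalised dot product, and phoneme correlation is a Pearson correlation of per-phone rejection counts. The function names and sample vectors are hypothetical.

```python
import math


def strictness(rejected_flags):
    """Fraction of phones marked as mispronounced (assumed definition)."""
    return sum(rejected_flags) / len(rejected_flags)


def agreement(x1, x2):
    """1 minus the cityblock distance between two frame vectors, per frame."""
    return 1.0 - sum(abs(a - b) for a, b in zip(x1, x2)) / len(x1)


def cross_correlation(x1, x2):
    """Dot product normalised by the Euclidean norms of both vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm = math.sqrt(sum(a * a for a in x1)) * math.sqrt(sum(b * b for b in x2))
    return dot / norm if norm else 0.0


def phoneme_correlation(c1, c2):
    """Pearson correlation between two vectors of per-phone rejection counts."""
    m1, m2 = sum(c1) / len(c1), sum(c2) / len(c2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(c1, c2))
    var = sum((a - m1) ** 2 for a in c1) * sum((b - m2) ** 2 for b in c2)
    return cov / math.sqrt(var) if var else 0.0


# Two toy frame-level transcriptions (1 = error frame) and rejection counts.
x_t1, x_t2 = [0, 1, 1, 0], [0, 1, 0, 0]
a = agreement(x_t1, x_t2)
cc = cross_correlation(x_t1, x_t2)
pc = phoneme_correlation([1, 2, 3], [2, 4, 6])
delta_s = abs(strictness([1, 0, 0, 1]) - strictness([1, 0, 0, 0]))
print(a, cc, pc, delta_s)
```

Agreement counts matching non-error frames as well, so it is dominated by the many correct frames; cross-correlation only rewards overlap in the error frames, which is why its typical values are lower.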
Labeling consistency of the human judges (2/4) • All results are within an acceptable range • 0.85 < A < 0.95, mean = 0.91 • 0.40 < CC < 0.65, mean = 0.47 • 0.70 < PC < 0.85, mean = 0.78 • 0.03 < $\delta_S$ (relative strictness) < 0.14, mean = 0.06 • These mean values can be used as benchmark values
Experimental results (1/7) • Multiple mixture monophone models • Corpus: WSJCAM0 • Range of rejection threshold was restricted to lie within one standard deviation of the judges’ strictness • where
Experimental results (6/7) • Add error handling with Latin-American Spanish models to detect systematic mispronunciations
Experimental results (7/7) • Transcriptions comparison between human judges and the system with error network
Conclusions and future work • 2 GoP scoring mechanisms • Basic GoP • GoP with systematic mispronunciation penalty • Refinement methods • MLLR adaptation • Independent thresholds trained from human judgement • Error network • Future work • Information about the type of mistake