
Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning

Explore the GoP algorithm for assessing pronunciation quality at the phone level, with explicit error modeling and collection of non-native speech data. Learn about performance measures and experimental results.



Presentation Transcript


  1. Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning Speech Communication, 2000 Authors: S. M. Witt, S. J. Young Presenter: Davidson Date: 2009/07/08, 2009/07/15

  2. Contents • Introduction • Goodness of Pronunciation (GoP) algorithm • Basic GoP algorithm • Phone dependent thresholds • Explicit error modeling • Collection of a non-native database • Performance measures • The labeling consistency of the human judges • Experimental results • Conclusions and future work

  3. Introduction (1/3) • CAPT systems (Computer-Assisted Pronunciation Training) • Word and phrase level scoring (’93, ’94, ’97) • Intonation, stress, and rhythm • Requires several recordings of native utterances for each word • Difficult to add new teaching material • Selected phonemic error teaching (1997) • Uses duration information or models trained on non-native speech

  4. Introduction (2/3) • HMMs have been used to produce sentence-level scores (1990, 1996) • Eskenazi’s system (1996) produces phone-level scores but makes no attempt to relate them to human judgement • Authors’ proposed system: • Measures pronunciation quality for non-native speech at the phone level

  5. Introduction (3/3) • Other issues • GoP algorithms with refinements • Performance measures for both GoP scores and scores by human judges • Non-native database • Experiments on these performance measures

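  6. Goodness of Pronunciation (GoP) algorithm: Basic GoP algorithm • A score for each phone • $p(O^{(p)} \mid p)$ = likelihood of the acoustic segment $O^{(p)}$ corresponding to phone $p$ • GoP = duration-normalized log of the posterior probability for a phone given the corresponding acoustic segment: $GoP(p) = \left| \log P(p \mid O^{(p)}) \right| / NF(p)$

  7. Basic GoP algorithm (2/5) • $GoP(p) = \left| \log \frac{p(O^{(p)} \mid p)\, P(p)}{\sum_{q \in Q} p(O^{(p)} \mid q)\, P(q)} \right| / NF(p)$ • $Q$ = the set of all phone models • $NF(p)$ = number of frames in the acoustic segment $O^{(p)}$ • By assuming equal phone priors and approximating the sum in the denominator by its maximum: $GoP(p) = \left| \log \frac{p(O^{(p)} \mid p)}{\max_{q \in Q} p(O^{(p)} \mid q)} \right| / NF(p)$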

  8. Basic GoP algorithm (3/5) • Numerator term is computed using forced alignment with known transcription • Denominator term is determined using an unconstrained phone loop

  9. Basic GoP algorithm (4/5) • If a mispronunciation has occurred, it is not reasonable to constrain the acoustic segment used to compute the maximum likelihood phone to be identical to the assumed phone • Hence, the denominator score is computed by summing the log likelihood per frame over the duration of the acoustic segment $O^{(p)}$ • In practice, this will often mean that more than one phone in the unconstrained phone sequence has contributed to the computation of the denominator term $\max_{q \in Q} p(O^{(p)} \mid q)$
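A minimal Python sketch of the basic GoP score from slides 6 to 9, assuming the per-frame log likelihoods have already been produced by a forced alignment (numerator) and by an unconstrained phone loop (denominator). The function name `gop_score` and the example values are hypothetical; no particular recognition toolkit is implied.

```python
def gop_score(forced_ll, loop_ll):
    """Duration-normalised GoP for one phone segment.

    forced_ll: per-frame log likelihoods of the expected phone model
               (forced alignment), one value per frame of the segment.
    loop_ll:   per-frame log likelihoods along the best unconstrained
               phone-loop path over the same frames; this path may
               cross several phones.
    """
    if len(forced_ll) != len(loop_ll) or not forced_ll:
        raise ValueError("need two non-empty sequences of equal length")
    num_frames = len(forced_ll)
    # The log of a likelihood ratio is the difference of the summed logs.
    log_ratio = sum(forced_ll) - sum(loop_ll)
    return abs(log_ratio) / num_frames


# Hypothetical values for a 4-frame segment.
forced = [-8.0, -7.5, -9.0, -8.5]
loop = [-7.0, -7.0, -8.0, -7.0]
score = gop_score(forced, loop)
print(score)
```

A score near zero means the expected phone fits the segment about as well as the best unconstrained path; larger values indicate a poorer match.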

  10. Basic GoP algorithm (5/5) • It is intuitive to train the acoustic models on speech data from native speakers • However, non-native speech is characterized by different formant structures compared to those from a native speaker for the same phone • Adapt Gaussian means by MLLR • Use a single global transform of the HMM Gaussian component means to avoid adapting to specific phone error patterns
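The global mean adaptation can be illustrated with a short Python sketch. It only shows how one shared affine transform (mu' = A mu + b, the usual form of MLLR mean adaptation) is applied to every Gaussian mean; estimating `A` and `b` from the speaker's adaptation data is not shown. The function name and the toy numbers are assumptions for illustration.

```python
def apply_global_mean_transform(means, A, b):
    """Apply the same affine transform mu' = A @ mu + b to every mean.

    means: list of mean vectors (lists of floats), one per Gaussian.
    A:     square matrix as a list of rows.
    b:     bias vector.
    """
    adapted = []
    for mu in means:
        new_mu = [
            sum(A[i][j] * mu[j] for j in range(len(mu))) + b[i]
            for i in range(len(b))
        ]
        adapted.append(new_mu)
    return adapted


# Toy 2-dimensional example with hypothetical values.
means = [[1.0, 2.0], [0.0, -1.0]]
A = [[1.0, 0.5], [0.0, 2.0]]
b = [0.5, -1.0]
adapted = apply_global_mean_transform(means, A, b)
print(adapted)
```

Because every phone model shares the same transform, the adaptation can shift the models toward the speaker's overall voice characteristics without learning that speaker's individual phone errors.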

  11. Phone dependent thresholds • The acoustic fit of phone-based HMMs differs from phone to phone • E.g. fricatives tend to have lower log likelihood than vowels • 2 ways to determine phone-specific thresholds • By using the mean and variance of the GoP scores for each phone • By approximating human labeling behavior
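A hedged Python sketch of the first option, thresholds derived from per-phone score statistics. It assumes a threshold of the form mean + alpha * standard deviation + beta; the scaling constants `alpha` and `beta`, the function names, and the sample scores are illustrative assumptions, not values from the presentation.

```python
from statistics import mean, pstdev


def phone_thresholds(scores_by_phone, alpha=1.0, beta=0.0):
    """Return one rejection threshold per phone.

    scores_by_phone maps a phone label to a list of GoP scores
    observed for that phone.
    """
    return {
        phone: mean(scores) + alpha * pstdev(scores) + beta
        for phone, scores in scores_by_phone.items()
    }


def is_rejected(phone, gop, thresholds):
    """A phone is flagged as mispronounced when its GoP exceeds its threshold."""
    return gop > thresholds[phone]


# Hypothetical scores: the fricative "s" scatters more than the vowel "aa".
scores = {"s": [2.0, 4.0, 6.0], "aa": [1.0, 1.0, 1.0]}
thr = phone_thresholds(scores, alpha=1.0, beta=0.5)
print(thr)
print(is_rejected("aa", 2.0, thr), is_rejected("s", 2.0, thr))
```

The same GoP value of 2.0 is rejected for the vowel but accepted for the fricative, which is the effect a phone-dependent threshold is meant to achieve.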

  12. Explicit error modeling (1/3) • 2 types of pronunciation errors • Individual mispronunciations • Systematic mispronunciations • Consist of substitutions of native sounds for sounds of the target language that do not exist in the native language • Knowledge of the learner’s native language is included in order to detect systematic mispronunciations

  13. Explicit error modeling (2/3) • Solution: a recognition network incorporating both correct pronunciation and common pronunciation errors in the form of error sublattices for each phone. • E.g. “but”
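The idea of an error sublattice per phone can be sketched in Python as follows. The phone symbols and the single substitution in `hypothetical_errors` are invented for illustration; the actual error rules for "but" shown on the slide are not preserved in this transcript.

```python
from itertools import product


def expand_pronunciations(canonical, error_map):
    """List every path through a network in which each phone may be
    replaced by one of its expected mispronunciations.

    canonical: list of phone labels for the correct pronunciation.
    error_map: phone label -> list of likely substitutions.
    """
    options = [[p] + error_map.get(p, []) for p in canonical]
    return [list(path) for path in product(*options)]


canonical_but = ["b", "ah", "t"]          # assumed phone labels
hypothetical_errors = {"ah": ["uh"]}      # assumed systematic substitution
paths = expand_pronunciations(canonical_but, hypothetical_errors)
print(paths)
```

A recogniser constrained to these paths can report which alternative best matches the audio, and therefore which systematic error, if any, was made.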

  14. Explicit error modeling (3/3) • Target phone posterior probability • Scores for systematic mispronunciations • GoP that includes additional penalty for systematic mispronunciation

  15. Collection of a non-native database (1/2) • Based on the procedures used for the WSJCAM0 corpus • Texts are composed of a limited vocabulary of 1500 words • 6 females and 4 males whose mother tongues are Korean (3), Japanese (3), Latin-American Spanish (3), and Italian (1) • Each speaker reads 120 sentences • A common set of 40 phonetically balanced sentences • 80 sentences that varied from session to session

  16. Collection of a non-native database (2/2) • 6 human judges, all native speakers of British English • Each speaker was labeled by 1 judge • 20 sentences from a female Spanish speaker are used as calibration sentences • Annotated by all 6 judges • Transcriptions reflect the actual sounds uttered by the speakers • Including phonemes from other languages

  17. Performance measures (1/3) • Compares 2 transcriptions of the same sentence • Transcriptions are either produced by human judges or generated automatically • 4 types of performance measures • Strictness • Agreement • Cross-correlation • Overall phone correlation

  18. Performance measures (2/3) • Compared on a frame by frame basis • Each frame with an error is marked as 1, all other frames as 0 • Yields a vector $x$ of length $N$ (the number of frames) with $x(i) \in \{0, 1\}$ • Apply a Hamming window • Transition between 0 and 1 is too abrupt, whereas in practice the boundary is often uncertain • Forced alignment might be erroneous due to poor acoustic modeling of non-native speech • Window length
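A small Python sketch of the frame-level smoothing step, under stated assumptions: the window length of 5 frames is arbitrary (the value used in the presentation is not preserved here), and dividing by the window sum to keep values in [0, 1] is a choice made for this example.

```python
import math


def hamming(n):
    """Standard Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def smooth_error_vector(errors, window_len=5):
    """Convolve a 0/1 frame-level error vector with a Hamming window.

    Frames outside the utterance are treated as 0 (no error).
    """
    win = hamming(window_len)
    total = sum(win)
    half = window_len // 2
    smoothed = []
    for t in range(len(errors)):
        acc = 0.0
        for k, w in enumerate(win):
            idx = t + k - half
            if 0 <= idx < len(errors):
                acc += w * errors[idx]
        smoothed.append(acc / total)
    return smoothed


frames = [0, 0, 0, 1, 1, 1, 0, 0, 0]  # one rejected phone spanning 3 frames
soft = smooth_error_vector(frames)
print([round(v, 3) for v in soft])
```

The hard 0/1 edges become gradual ramps, so two transcriptions whose phone boundaries differ by a frame or two are penalised less than with the raw binary vectors.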

  19. Performance measures (3/3)

  20. Strictness (S) • Measures how strict the judge was in marking pronunciation errors • Relative strictness
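A minimal Python sketch of the strictness measure, assuming strictness is the fraction of phones a judge (or system) marks as mispronounced and relative strictness is the absolute difference between two such fractions. Both definitions are assumptions consistent with the slide text, whose formulas are not preserved here.

```python
def strictness(labels):
    """Fraction of phones marked as errors (1 = rejected, 0 = accepted)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)


def relative_strictness(labels_a, labels_b):
    """Absolute difference in strictness between two transcriptions."""
    return abs(strictness(labels_a) - strictness(labels_b))


# Hypothetical phone-level labels for the same 10 phones.
judge_1 = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
judge_2 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(strictness(judge_1), strictness(judge_2))
print(relative_strictness(judge_1, judge_2))
```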

  21. Overall Agreement (A) • Measures the agreement of all frames between 2 transcriptions • Defined in terms of cityblock distance between 2 transcription vectors
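A hedged Python sketch of the agreement measure, assuming it is one minus the city-block distance between the two frame-level vectors divided by the number of frames. The normalisation by vector length is an assumption of this example.

```python
def agreement(x1, x2):
    """Agreement between two frame-level transcription vectors.

    Returns 1.0 for identical vectors and 0.0 when every frame of two
    binary vectors differs.
    """
    if len(x1) != len(x2) or not x1:
        raise ValueError("need two non-empty vectors of equal length")
    cityblock = sum(abs(a - b) for a, b in zip(x1, x2))
    return 1.0 - cityblock / len(x1)


# Hypothetical frame-level labels; the vectors differ in 2 of 5 frames.
t1 = [0, 0, 1, 1, 0]
t2 = [0, 1, 1, 0, 0]
a = agreement(t1, t2)
print(a)
```

Because most frames are error-free in both transcriptions, this measure is dominated by the frames where both agree that nothing is wrong.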

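  22. Cross-correlation (CC) • Measures the agreement between the frames marked as errors in either or both transcriptions • CC is defined as $CC = \frac{x_1^T x_2}{\|x_1\|_E \, \|x_2\|_E}$ • $\|x\|_E$ is the Euclidean norm of $x$

  23. Phoneme Correlation (PC) • Measures the agreement of the overall rejection statistics for each phone between 2 judges/systems • PC is defined as $PC = \frac{\sum_i (c_1(i) - \mu_{c_1})(c_2(i) - \mu_{c_2})}{\sqrt{\sum_i (c_1(i) - \mu_{c_1})^2 \, \sum_i (c_2(i) - \mu_{c_2})^2}}$ • $c$ is a vector of rejection counts for each phone • $\mu_c$ denotes the mean rejection count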
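A minimal Python sketch of the cross-correlation measure, assuming it is the inner product of the two transcription vectors divided by the product of their Euclidean norms. Returning 0.0 when either vector contains no errors is a convention chosen for this example.

```python
import math


def cross_correlation(x1, x2):
    """Normalised inner product of two frame-level error vectors."""
    if len(x1) != len(x2):
        raise ValueError("vectors must have equal length")
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention: no marked errors, no correlation
    return dot / (norm1 * norm2)


# Hypothetical labels: each transcription marks 2 frames, 1 in common.
t1 = [0, 0, 1, 1, 0]
t2 = [0, 1, 1, 0, 0]
cc = cross_correlation(t1, t2)
print(cc)
```

Frames that both transcriptions accept contribute nothing to either the numerator or the norms, so this measure only reflects how well the marked errors overlap.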
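A hedged Python sketch of the phoneme correlation, assuming it is the standard correlation coefficient between two per-phone rejection count vectors. The counts are hypothetical, and returning 0.0 for a constant vector is a convention chosen for this example.

```python
import math


def phoneme_correlation(c1, c2):
    """Correlation coefficient between two rejection count vectors.

    c1[i] and c2[i] are the number of times phone i was rejected by
    the first and second judge (or system).
    """
    if len(c1) != len(c2) or len(c1) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    m1 = sum(c1) / len(c1)
    m2 = sum(c2) / len(c2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(c1, c2))
    var1 = sum((a - m1) ** 2 for a in c1)
    var2 = sum((b - m2) ** 2 for b in c2)
    if var1 == 0 or var2 == 0:
        return 0.0  # assumed convention for a constant count vector
    return cov / math.sqrt(var1 * var2)


# Hypothetical rejection counts for four phones.
counts_judge = [5, 1, 0, 3]
counts_system = [4, 2, 0, 3]
pc = phoneme_correlation(counts_judge, counts_system)
print(round(pc, 3))
```

Unlike the frame-level measures, this one ignores where in an utterance the rejections occur and only asks whether both sides tend to reject the same phones.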

  24. Labeling consistency of the human judges (1/4)

  25. Labeling consistency of the human judges (2/4) • All results are within an acceptable range • 0.85 < A < 0.95, mean = 0.91 • 0.40 < CC < 0.65, mean = 0.47 • 0.70 < PC < 0.85, mean = 0.78 • 0.03 < $\delta_S$ < 0.14 (relative strictness), mean = 0.06 • These mean values can be used as benchmark values

  26. Labeling consistency of the human judges (3/4)

  27. Labeling consistency of the human judges (4/4)

  28. Experimental results (1/7) • Multiple mixture monophone models • Corpus: WSJCAM0 • The range of the rejection threshold was restricted to lie within one standard deviation of the judges’ strictness • where

  29. Experimental results (2/7)

  30. Experimental results (3/7)

  31. Experimental results (4/7)

  32. Experimental results (5/7)

  33. Experimental results (6/7) • Add error handling with Latin-American Spanish models to detect systematic mispronunciations

  34. Experimental results (7/7) • Comparison of transcriptions between the human judges and the system with the error network

  35. Conclusions and future work • 2 GoP scoring mechanisms • Basic GoP • GoP with systematic mispronunciation penalty • Refinement methods • MLLR adaptation • Independent thresholds trained from human judgement • Error network • Future work • Information about the type of mistake
