Speech Assessment: Methods and Applications for Spoken Language Learning J.-S. Roger Jang (張智星) jang@cs.nthu.edu.tw http://www.cs.nthu.edu.tw/~jang Multimedia Information Retrieval Lab CS Dept, Tsing Hua Univ, Taiwan
Outline • Introduction to speech assessment • Methods • Using learning to rank for speech assessment • Demos • Conclusions
Intro. to Speech Assessment • Goal • Evaluate a person’s utterance based on some acoustic features, for language learning • Also known as • Pronunciation scoring • CAPT (computer-assisted pronunciation training)
Computer-Assisted Language Learning (CALL) • 4 aspects of CALL • Listening (easier) • Speaking (harder) • Reading (easier) • Writing (harder) • Receptive skills (listening, reading) are easier to assist with computers, while productive skills (speaking, writing) are harder to evaluate automatically. • SA plays an essential role in CALL for speech/pronunciation scoring.
Speech Assessment • Characteristics of an ideal SA system • Assessment levels: as detailed as possible • Syllables, words, sentences, paragraphs • Assessment criteria: as many as possible • Timbre, tone, energy, rhythm, co-articulation, … • Feedback: as specific as possible • High-level corrections and suggestions
Basic Assessment Criteria • Timbre (咬字/音色) • Based on acoustic models • Tone (音調/音高) • Based on tone recognition (for tonal languages) • Based on pitch similarity with the target utterance • Rhythm (韻律/音長) • Based on duration comparison with the target utterance • Energy (強度/音量) • Based on energy comparison with the target utterance
Additional Assessment Criteria • English • Stress (重音) • Word or sentence level • Intonation (整句音調) • Declarative sentences • Interrogative sentences • Co-articulation (連音) • A red apple. • Did you call me? • Won’t you go? • Raise your hand. • Mandarin • Tone (聲調) • Retroflex (捲舌音) • Co-articulation (連音) • Erhua (兒化音, r-coloring) • Others • Pause
Types of SA • Types of SA (ordered by difficulty) • Type 1: target text available, target utterance available • Type 2: target text available, no target utterance • Type 3: no target text, target utterance available • Type 4: no target text, no target utterance • We are focusing on types 1 and 2.
Our Approach • Basic approach to timbre assessment • Lexicon net construction (usually a sausage net) • Forced alignment to identify phone boundaries • Phone scoring based on several criteria, such as ranking, histograms, posterior probabilities, etc. • Weighted averaging to obtain syllable/sentence scores
Lexicon Net Construction • Lexicon net for “What are you allergic to?” • Sausage net with all possible (and correct) pronunciation variants • Optional silence (sil) between words
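The sausage-net idea above can be sketched as follows. The two-entry pronunciation lexicon and phone symbols are hypothetical stand-ins for a real pronunciation dictionary.

```python
# Minimal sausage-net builder: each word expands into parallel
# pronunciation variants, with an optional "sil" arc between words.
# The two-entry LEXICON below is illustrative only.
LEXICON = {
    "what": [["w", "aa", "t"], ["w", "ah", "t"]],
    "are": [["aa", "r"], ["er"]],
}

def sausage_net(words, lexicon):
    """Return the net as a list of slots; each slot is a list of
    alternative phone sequences (an empty sequence = skippable arc)."""
    net = []
    for i, word in enumerate(words):
        if i > 0:
            net.append([["sil"], []])  # optional inter-word silence
        net.append(lexicon[word])
    return net
```

Any path through the net (one alternative per slot) is a legal pronunciation of the sentence, which is exactly what forced alignment searches over.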
Lexicon Net with Confusing Phones • Common errors for Japanese learners of Chinese • ㄖ → ㄌ, e.g., 天氣熱 pronounced as 天氣樂 • ㄑ → ㄐ, e.g., 打哈欠 pronounced as 打哈見 • ㄘ → ㄗ, e.g., 一次旅行 pronounced as 一字旅行 • ㄢ → ㄤ, e.g., 晚安 pronounced as 晚ㄤ • Rule-based approach to creating confusing syllables • Rules: • Rule 1: re → le • Rule 2: qi → ji • Rule 3: ci → zi • Rule 4: an → ang • Example • 欠 (qian) → 見 (jian), 嗆 (qiang), 降 (jiang)
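A sketch of the rule-based expansion above. The rule table follows the four rules on the slide; applying every non-empty subset of applicable rules is an assumption about how variants are combined, motivated by the 欠 (qian) example.

```python
from itertools import combinations

# Substitution rules on syllable initials/finals (from the slide).
RULES = [("r", "l"), ("q", "j"), ("c", "z"), ("an", "ang")]

def confusing_variants(syllable):
    """Generate confusable syllables by applying every non-empty subset
    of the applicable substitution rules (each rule applied once, to the
    initial or the final of the syllable)."""
    applicable = [(old, new) for old, new in RULES
                  if syllable.startswith(old) or syllable.endswith(old)]
    variants = set()
    for k in range(1, len(applicable) + 1):
        for subset in combinations(applicable, k):
            s = syllable
            for old, new in subset:
                if s.startswith(old):
                    s = new + s[len(old):]      # substitute the initial
                elif s.endswith(old):
                    s = s[:-len(old)] + new     # substitute the final
            variants.add(s)
    variants.discard(syllable)
    return variants
```

For "qian", both the q → j and an → ang rules apply, so the subsets yield jian, qiang, and jiang, matching the slide's example.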
Lexicon Net with Confusing Phones • Lexicon net for “天氣熱、打哈欠” • Canonical form: tian qi re da ha qian • 16 variant paths in the net (2 for qi × 2 for re × 4 for qian):
Automatic Confusing Syllable Identification • Iterative procedure (flowchart): • Start from a corpus of Japanese learners of Chinese • Run forced alignment to obtain an initial segmentation • Compare each phone against the 411 Mandarin syllables to identify its confusing syllables • Add the confusing syllables to the recognition net, then redo forced alignment and segmentation • If the segmentation no longer changes, output the confusing syllables and the recognition net; otherwise repeat the comparison step
Error Pattern Identification (EPI) • Common insertions/deletions from users, taking 「朝辭白帝彩雲間」 as the target sentence: • Ending anywhere, e.g., 「朝辭白帝」 • Starting anywhere, e.g., 「彩雲間」 • Starting and ending anywhere, e.g., 「白帝彩雲」 • Starting and ending anywhere with skipped characters, e.g., 「白彩雲」 • Repeated characters, e.g., 「朝…朝辭白帝彩雲間」 • Repeated words, e.g., 「朝辭…朝辭白帝彩雲間」 • Repeated characters with a pronunciation change, e.g., 「朝(cao)…朝(zhao)辭白帝彩雲間」 • Transposed characters, e.g., 「朝辭彩帝白雲間」 • Wrong characters, e.g., 「朝辭白帝黑山間」
Lexicon Net for EPI (I) • Detects utterances that start from the beginning but may end anywhere
Lexicon Net for EPI (II) • Detects utterances that may start anywhere but end at the end
Lexicon Net for EPI (III) • Detects utterances that may start and end anywhere (no skipped characters)
Lexicon Net for EPI (IV) • Detects utterances that may start and end anywhere, with skipped characters allowed
Design Philosophy of Lexicon Nets • We need to strike a balance between recognition accuracy and lexicon coverage • In the extreme, we could use free syllable decoding to catch all error patterns. • However, the recognition rate of free syllable decoding is too low to make it practical.
Scoring Methods for Speech Assessment • Five phone-based scoring methods • Duration-distribution scores • Log-likelihood scores • Log-posterior scores • Log-likelihood-distribution scores • Rank ratio scores • All based on forced alignment to segment phones
Method 1: Duration-distribution Scores • PDF of phone duration • Obtained from forced alignment • Normalized by speech rate • Fitted by a log-normal PDF • The maximum of the PDF is mapped to a score of 100
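A minimal sketch of the duration score, assuming μ and σ have already been fitted to the log of the rate-normalized durations from forced alignment. Scaling so that the mode of the distribution maps to 100 matches the "max PDF → score 100" convention above.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """PDF of a log-normal distribution with parameters mu, sigma."""
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (
        x * sigma * math.sqrt(2 * math.pi))

def duration_score(duration, mu, sigma):
    """Score a (rate-normalized) phone duration by its log-normal PDF,
    scaled so that the mode of the distribution maps to a score of 100."""
    mode = math.exp(mu - sigma ** 2)  # mode of the log-normal PDF
    return 100.0 * lognormal_pdf(duration, mu, sigma) / lognormal_pdf(mode, mu, sigma)
```

Durations near the typical (modal) duration for the phone score close to 100; unusually short or long phones score lower.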
Method 2: Log-likelihood Scores • Log-likelihood of phone $q$ with a duration of $d$ frames: $\hat{l} = \frac{1}{d}\sum_{t=t_0}^{t_0+d-1}\log p(o_t \mid q)$ where $p(o_t \mid q)$ is the likelihood of the $t$-th frame with observation vector $o_t$, and $t_0$ is the starting frame of the phone
Method 3: Log-posterior Scores • Log-posterior of phone $q$ with duration $d$: $\hat{\rho} = \frac{1}{d}\sum_{t=t_0}^{t_0+d-1}\log P(q \mid o_t)$ where $P(q \mid o_t) = \dfrac{p(o_t \mid q)\,P(q)}{\sum_{q'} p(o_t \mid q')\,P(q')}$ and $q'$ ranges over the set of competing phones
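Methods 2 and 3 can be sketched as follows, assuming per-frame log-likelihoods are already available from the acoustic models. Uniform phone priors are assumed in the posterior, which is a simplification.

```python
import math

def log_likelihood_score(frame_logliks):
    """Method 2: average per-frame log-likelihood of the target phone
    over the d frames of the segment."""
    return sum(frame_logliks) / len(frame_logliks)

def log_posterior_score(frames, target):
    """Method 3: average per-frame log-posterior of the target phone.
    Each frame is a dict {phone: log p(o_t | q)}; the posterior
    normalizes the target likelihood over all competing phones
    (uniform priors assumed here for simplicity)."""
    total = 0.0
    for frame in frames:
        log_denom = math.log(sum(math.exp(v) for v in frame.values()))
        total += frame[target] - log_denom
    return total / len(frames)
```

The posterior score is bounded above by 0 (log of a probability), which makes it less sensitive to acoustic mismatch than the raw likelihood.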
Method 4: Log-likelihood-distribution Scores • Use the CDF of a Gaussian fitted to the log-likelihood • A CDF value of 1 maps to a score of 100
Method 5: Rank Ratio Scores • Rank ratio: the rank of the target phone’s likelihood among a set of competing phones, divided by the size of the set • RR-to-score conversion, where the parameters a, b are phone specific • Possible sets of competing phones for a biphone x+y • *+y • *+*
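The rank ratio itself can be sketched as below; the phone-specific RR-to-score conversion (parameters a, b) is left out since its functional form is not given on the slide, and the competing set here is simply whatever phones the caller supplies (e.g., all *+y biphones).

```python
def rank_ratio(logliks, target):
    """Rank ratio of the target phone: its 1-based rank (sorting the
    competing set by descending log-likelihood) divided by the number
    of phones in the competing set. Smaller is better."""
    ranked = sorted(logliks, key=logliks.get, reverse=True)
    return (ranked.index(target) + 1) / len(ranked)
```

If the target phone has the highest likelihood in its competing set, RR = 1/N; if it ranks last, RR = 1.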
Intro. to Learning to Rank • Learning to rank (LTR) • A supervised learning approach that generates a ranking model from a training set of partially ordered items. • Methods • Pointwise (e.g., PRanking) • Pairwise (e.g., RankSVM, RankBoost, RankNet) • Listwise (e.g., ListNet)
Application of LTR to SA • Why use LTR for SA? • Human scoring is rank-based: A+, A, B, B-, … • Tsing Hua’s grading system is moving from scores (0–100) to ranks (A, B, C, D, …). • Combination of features (scores) • Features are complementary. • Effective determination of ranking • LTR only generates numerical output whose ordering should be as close as possible to the correct ranking; an optimal DP-based approach is proposed to convert the scores into ranks.
LTR Score Segmentation • Given: sorted LTR scores $s_1 \le s_2 \le \dots \le s_n$ and the desired rank $r_i$ of each item • We want to find the separating scores $\theta_1 < \theta_2 < \theta_3 < \theta_4$ defining a score-to-rank function that maps each $s_i$ to one of ranks 1–5 • Such that the number of mismatches between the computed ranks and the desired ranks is minimized
LTR Score Segmentation by DP (I) • Formulate the problem in a DP framework • Optimum-value function $D(i, j)$: the minimum cost of mapping $s_1, \dots, s_i$ to ranks $1, \dots, j$, with item $i$ assigned rank $j$ • Recurrence: $D(i, j) = \min\{D(i-1, j),\, D(i-1, j-1)\} + c(i, j)$, where $c(i, j) = 0$ if $r_i = j$ and $1$ otherwise • Boundary condition: $D(1, 1) = c(1, 1)$ • Optimum cost: $D(n, 5)$
LTR Score Segmentation by DP (II) • DP trellis: computed rank (1–5) on the vertical axis vs. items sorted by LTR score on the horizontal axis • Local constraint: each step keeps the same rank or moves up by exactly one, so the computed ranks are non-decreasing along the sorted scores
LTR Score Segmentation by DP (III) • Plots: distribution of the LTR scores and the resulting DP path
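The DP described on the preceding slides can be sketched as follows. The boundary conditions (first item takes rank 1, last item takes the top rank k) are an assumed reading of the slides' unstated details.

```python
def segment_ranks(desired, k):
    """Map items (already sorted by LTR score) to a non-decreasing rank
    sequence 1..k, minimizing mismatches against the desired human ranks.
    D[i][j] = minimum cost of assigning items 1..i with item i at rank j.
    Assumed boundary: first item gets rank 1, last item gets rank k."""
    n = len(desired)
    INF = float("inf")

    def cost(i, j):  # 0 if the desired rank matches, else 1
        return 0 if desired[i - 1] == j else 1

    D = [[INF] * (k + 1) for _ in range(n + 1)]
    D[1][1] = cost(1, 1)
    for i in range(2, n + 1):
        for j in range(1, k + 1):
            prev = min(D[i - 1][j], D[i - 1][j - 1] if j > 1 else INF)
            if prev < INF:
                D[i][j] = prev + cost(i, j)  # stay at rank j or step up from j-1
    return D[n][k]
```

The local constraint (stay at the same rank or step up by one) guarantees the computed rank sequence is non-decreasing, so the thresholds between consecutive ranks can be read off the resulting path.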
Corpora for Experiments • WSJ • For training biphone acoustic models for forced alignment • MIR-SD • Recordings of about 4000 multi-syllable English words by 22 students (12 females, 10 males) with an intermediate competence level • Originally designed for stress detection • Available at http://mirlab.org/dataSet/public
Human Scoring of MIR-SD • Human scoring • 50 utterances from each speaker of MIR-SD were scored by 2 human raters, for a total of 1100 utterances • The human scores are consistent across raters:
Performance Indices • Performance indices used in the literature, illustrated with human ranks hr = [1 3 5 4 2 2] and computed ranks cr = [2 3 5 2 1 4]: • Recognition rate rRate = 33.33% • Recognition rate with tolerance 1 = 66.67% • Average absolute difference = 1 • Correlation coefficient = 0.54
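The four indices can be reproduced directly from the slide's example:

```python
def performance_indices(hr, cr):
    """Compute the four indices from human ranks hr and computed ranks cr:
    exact recognition rate, recognition rate with tolerance 1,
    average absolute difference, and Pearson correlation coefficient."""
    n = len(hr)
    exact = sum(h == c for h, c in zip(hr, cr)) / n
    within1 = sum(abs(h - c) <= 1 for h, c in zip(hr, cr)) / n
    avg_abs = sum(abs(h - c) for h, c in zip(hr, cr)) / n
    mh, mc = sum(hr) / n, sum(cr) / n
    cov = sum((h - mh) * (c - mc) for h, c in zip(hr, cr))
    ss_h = sum((h - mh) ** 2 for h in hr)
    ss_c = sum((c - mc) ** 2 for c in cr)
    corr = cov / (ss_h * ss_c) ** 0.5
    return exact, within1, avg_abs, corr
```

Running it on hr = [1, 3, 5, 4, 2, 2] and cr = [2, 3, 5, 2, 1, 4] yields 33.33%, 66.67%, 1, and 0.54, matching the figures above.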
Overall Performance Comparison • Legend • Score segmentation • Circles: DP • Triangles: k-means • Inside/outside tests • Solid lines: inside • Dashed lines: outside • Black lines: baselines
Demo: Practice of Mandarin Idioms of Length 4 (一語中的) • The level (difficulty) of an idiom is based on its frequency via Google search: • 孤掌難鳴 ===> 260,000 • 鶼鰈情深 ===> 43,300 • 亡鈇意鄰 ===> 22,700 • 舉案齊眉 ===> 235,000 • Can be adapted for English learning • Next step: multi-threading, fast decoding via FSM
Demo: Recitation Machine (唸唸不忘) • Supports Mandarin & English • Supports user-defined recitation scripts • Next step: multi-threading for recording & recognition
Licensing for PC Applications • For Mandarin, English, Japanese
SA for Embedded Systems • Embedded platforms: PMP, iPhone, Android devices
Demo: Tangible Companions • Chicken run (落跑雞) • Penguin for Tang Poetry (唐詩企鵝) • Robot Fighter (蘿蔔戰士) • Singing Bass & Dog (大嘴鱸魚和唱歌狗)
Tools and Tutorials • Tools • DCPR toolbox • http://mirlab.org/jang/matlab/toolbox/dcpr • SAP toolbox • http://mirlab.org/jang/matlab/toolbox/sap • ASR Toolbox • http://mirlab.org/jang/matlab/toolbox/asr • Tutorials • Data clustering and pattern recognition: • http://mirlab.org/jang/books/dcpr • Audio signal processing • http://mirlab.org/jang/books/audioSignalProcessing • Lab page (with demos): • http://mirlab.org
Other SA Issues to be Addressed • Core technology • Other acoustic features for scoring • Pitch: tone/intonation • Volume • Duration • Pause • Coarticulation • Error pattern identification • Application side • Multimodal GUI • Extensions • Slight adaptation • Paragraph-level SA • Text-free SA • Beyond pronunciation • Translation + recognition + assessment • Microphone types
Examples • Coarticulation • Knock it off! • Mom woke her up • Consonant+consonant • Bus stop • Push Shirley • Ask question • Jeff flew south through Tainan • Exception • Change jobs • Which Chair
Examples • Changes due to coarticulation • Would you like it? • Won’t you go? • Raise your hand. • It makes you look younger. • Softened sounds • Junction • Popcorn • Fruitful • Can and can’t • I can read the letter. • I can’t read the letter. • d and t • Better • Cider
Most Likely to be Mispronounced • Within Taiwan • Pleasure/pressure • World/war/word • Shirt/short • Walk/work • Flesh/fresh • Supply/surprise • Some/son • Confirm/conform • Cancel/cancer • Mouth/mouse • Measure/major • Police/please • Version/virgin
Conclusions • Conclusions • SA calls for more cues than ASR • SA requires techniques from ML/IR • Multi-modal approach to SA is a must • “Popcorn”, “Thursday” • On-going & future work • Tone recognition & assessment • Reliable error pattern identification
References • Witt, S. M. and Young, S. J., “Phone-level Pronunciation Scoring and Assessment for Interactive Language Learning”, Speech Communication 30, 95-108, 2000. • Kim, Y., Franco, H., and Neumeyer, L., “Automatic Pronunciation Scoring of Specific Phone Segments for Language Instruction”, in Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech ’97), pp. 649-652, Rhodes, 1997. • Neumeyer, L., Franco, H., Digalakis, V., and Weintraub, M., “Automatic Scoring of Pronunciation Quality”, Speech Communication 30, 83-93, 2000. • Franco, H., Neumeyer, L., Digalakis, V., and Ronen, O., “Combination of Machine Scores for Automatic Grading of Pronunciation Quality”, Speech Communication 30, 121-130, 2000. • Cincarek, T., Gruhn, R., Hacker, C., Nöth, E., and Nakamura, S., “Automatic Pronunciation Scoring of Words and Sentences Independent from the Non-Native’s First Language”, Computer Speech and Language 23, 65-88, 2009. • Crammer, K. and Singer, Y., “Pranking with Ranking”, in Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2001. • Joachims, T., “Optimizing Search Engines using Clickthrough Data”, in Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2002. • Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y., “An Efficient Boosting Algorithm for Combining Preferences”, in Proceedings of ICML, pp. 170-178, 1998. • Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G., “Learning to Rank using Gradient Descent”, in Proceedings of ICML, pp. 89-96, 2005. • Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H., “Learning to Rank: From Pairwise Approach to Listwise Approach”, in Proceedings of the 24th International Conference on Machine Learning, pp. 129-136, Corvallis, OR, 2007.
• Chen, L.-Y. and Jang, J.-S. R., “Automatic Pronunciation Scoring using Learning to Rank and DP-based Score Segmentation”, submitted to Interspeech 2010.