Speech Assessment: Methods and Applications for Spoken Language Learning J.-S. Roger Jang (張智星) jang@cs.nthu.edu.tw http://www.cs.nthu.edu.tw/~jang Multimedia Information Retrieval Lab CS Dept, Tsing Hua Univ, Taiwan
Outline • Introduction to speech assessment • Methods • Using learning to rank for speech assessment • Demos • Conclusions
Intro. to Speech Assessment • Goal • Evaluate a person’s utterance based on some acoustic features, for language learning • Also known as • Pronunciation scoring • CAPT (computer-assisted pronunciation training)
Computer-Assisted Language Learning (CALL) • 4 aspects of CALL • Listening (easier) • Speaking (harder) • Reading (easier) • Writing (harder) • Receptive skills (listening, reading) are easier to assist with computers, while productive skills (speaking, writing) are harder to evaluate automatically. • SA plays an essential role in CALL for speech/pronunciation scoring.
Speech Assessment • Characteristics of an ideal SA system • Assessment levels: as detailed as possible • Syllables, words, sentences, paragraphs • Assessment criteria: as many as possible • Timbre, tone, energy, rhythm, co-articulation, … • Feedback: as specific as possible • High-level corrections and suggestions
Basic Assessment Criteria • Timbre (咬字/音色) • Based on acoustic models • Tone (音調/音高) • Based on tone recognition (for tonal languages) • Based on pitch similarity with the target utterance • Rhythm (韻律/音長) • Based on duration comparison with the target utterance • Energy (強度/音量) • Based on energy comparison with the target utterance
Additional Assessment Criteria • English • Stress (重音) • Word or sentence level • Intonation (整句音調) • Declarative sentences • Interrogative sentences • Co-articulation (連音) • A red apple. • Did you call me? • Won’t you go? • Raise your hand. • Mandarin • Tone (聲調) • Retroflex (捲舌音) • Co-articulation (連音) • Erhua (兒化音, r-coloring) • Others • Pause
Types of SA • Types of SA (ordered by difficulty) • Type 1: target text available, target utterance available • Type 2: target text available, no target utterance • Type 3: no target text, target utterance available • Type 4: no target text, no target utterance • We are focusing on types 1 and 2.
Our Approach • Basic approach to timbre assessment • Lexicon net construction (usually a sausage net) • Forced alignment to identify phone boundaries • Phone scoring based on several criteria, such as ranking, histograms, posterior probabilities, etc. • Weighted averaging to obtain syllable/sentence scores
Lexicon Net Construction • Lexicon net for “What are you allergic to?” • Sausage net with all possible (and correct) pronunciation variants • Optional silence (sil) between words
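The sausage-net idea above can be sketched as follows. The two-entry pronunciation lexicon and phone symbols are hypothetical stand-ins for a real pronunciation dictionary.

```python
# Minimal sausage-net builder: each word expands into parallel
# pronunciation variants, with an optional "sil" arc between words.
# The two-entry LEXICON below is illustrative only.
LEXICON = {
    "what": [["w", "aa", "t"], ["w", "ah", "t"]],
    "are": [["aa", "r"], ["er"]],
}

def sausage_net(words, lexicon):
    """Return the net as a list of slots; each slot is a list of
    alternative phone sequences (an empty sequence = skippable arc)."""
    net = []
    for i, word in enumerate(words):
        if i > 0:
            net.append([["sil"], []])  # optional inter-word silence
        net.append(lexicon[word])
    return net
```

Any path through the net (one alternative per slot) is a legal pronunciation of the sentence, which is exactly what forced alignment searches over.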
Lexicon Net with Confusing Phones • Common errors for Japanese learners of Chinese • ㄖ → ㄌ, e.g., 天氣熱 pronounced as 天氣樂 • ㄑ → ㄐ, e.g., 打哈欠 pronounced as 打哈見 • ㄘ → ㄗ, e.g., 一次旅行 pronounced as 一字旅行 • ㄢ → ㄤ, e.g., 晚安 pronounced as 晚ㄤ • Rule-based approach to creating confusing syllables • Rules: • Rule 1: re → le • Rule 2: qi → ji • Rule 3: ci → zi • Rule 4: an → ang • Example • 欠 (qian) → 見 (jian), 嗆 (qiang), 降 (jiang)
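A sketch of the rule-based expansion above. The rule table follows the four rules on the slide; applying every non-empty subset of applicable rules is an assumption about how variants are combined, motivated by the 欠 (qian) example.

```python
from itertools import combinations

# Substitution rules on syllable initials/finals (from the slide).
RULES = [("r", "l"), ("q", "j"), ("c", "z"), ("an", "ang")]

def confusing_variants(syllable):
    """Generate confusable syllables by applying every non-empty subset
    of the applicable substitution rules (each rule applied once, to the
    initial or the final of the syllable)."""
    applicable = [(old, new) for old, new in RULES
                  if syllable.startswith(old) or syllable.endswith(old)]
    variants = set()
    for k in range(1, len(applicable) + 1):
        for subset in combinations(applicable, k):
            s = syllable
            for old, new in subset:
                if s.startswith(old):
                    s = new + s[len(old):]      # substitute the initial
                elif s.endswith(old):
                    s = s[:-len(old)] + new     # substitute the final
            variants.add(s)
    variants.discard(syllable)
    return variants
```

For "qian", both the q → j and an → ang rules apply, so the subsets yield jian, qiang, and jiang, matching the slide's example.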
Lexicon Net with Confusing Phones • Lexicon net for “天氣熱、打哈欠” • Canonical form: tian qi re da ha qian • 16 variant paths in the net (2 for qi × 2 for re × 4 for qian):
Automatic Confusing Syllable Identification • Iterative procedure (flowchart): • Start from a corpus of Japanese learners of Chinese • Run forced alignment to obtain an initial segmentation • Compare each phone against the 411 Mandarin syllables to identify its confusing syllables • Add the confusing syllables to the recognition net, then redo forced alignment and segmentation • If the segmentation no longer changes, output the confusing syllables and the recognition net; otherwise repeat the comparison step
Error Pattern Identification (EPI) • Common insertions/deletions from users, taking 「朝辭白帝彩雲間」 as the target sentence: • Ending anywhere, e.g., 「朝辭白帝」 • Starting anywhere, e.g., 「彩雲間」 • Starting and ending anywhere, e.g., 「白帝彩雲」 • Starting and ending anywhere with skipped characters, e.g., 「白彩雲」 • Repeated characters, e.g., 「朝…朝辭白帝彩雲間」 • Repeated words, e.g., 「朝辭…朝辭白帝彩雲間」 • Repeated characters with a pronunciation change, e.g., 「朝(cao)…朝(zhao)辭白帝彩雲間」 • Transposed characters, e.g., 「朝辭彩帝白雲間」 • Wrong characters, e.g., 「朝辭白帝黑山間」
Lexicon Net for EPI (I) • Detects utterances that start from the beginning but may end anywhere
Lexicon Net for EPI (II) • Detects utterances that may start anywhere but end at the end
Lexicon Net for EPI (III) • Detects utterances that may start and end anywhere (no skipped characters)
Lexicon Net for EPI (IV) • Detects utterances that may start and end anywhere, with skipped characters allowed
Design Philosophy of Lexicon Nets • We need to strike a balance between recognition accuracy and lexicon coverage • In the extreme, we could use free syllable decoding to catch all error patterns. • However, the recognition rate of free syllable decoding is too low to make it practical.
Scoring Methods for Speech Assessment • Five phone-based scoring methods • Duration-distribution scores • Log-likelihood scores • Log-posterior scores • Log-likelihood-distribution scores • Rank ratio scores • All based on forced alignment to segment phones
Method 1: Duration-distribution Scores • PDF of phone duration • Obtained from forced alignment • Normalized by speech rate • Fitted by a log-normal PDF • The maximum of the PDF is mapped to a score of 100
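A minimal sketch of the duration score, assuming μ and σ have already been fitted to the log of the rate-normalized durations from forced alignment. Scaling so that the mode of the distribution maps to 100 matches the "max PDF → score 100" convention above.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """PDF of a log-normal distribution with parameters mu, sigma."""
    return math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (
        x * sigma * math.sqrt(2 * math.pi))

def duration_score(duration, mu, sigma):
    """Score a (rate-normalized) phone duration by its log-normal PDF,
    scaled so that the mode of the distribution maps to a score of 100."""
    mode = math.exp(mu - sigma ** 2)  # mode of the log-normal PDF
    return 100.0 * lognormal_pdf(duration, mu, sigma) / lognormal_pdf(mode, mu, sigma)
```

Durations near the typical (modal) duration for the phone score close to 100; unusually short or long phones score lower.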
Method 2: Log-likelihood Scores • Log-likelihood of phone $q$ with a duration of $d$ frames: $\hat{l} = \frac{1}{d}\sum_{t=t_0}^{t_0+d-1}\log p(o_t \mid q)$ where $p(o_t \mid q)$ is the likelihood of the $t$-th frame with observation vector $o_t$, and $t_0$ is the starting frame of the phone
Method 3: Log-posterior Scores • Log-posterior of phone $q$ with duration $d$: $\hat{\rho} = \frac{1}{d}\sum_{t=t_0}^{t_0+d-1}\log P(q \mid o_t)$ where $P(q \mid o_t) = \dfrac{p(o_t \mid q)\,P(q)}{\sum_{q'} p(o_t \mid q')\,P(q')}$ and $q'$ ranges over the set of competing phones
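Methods 2 and 3 can be sketched as follows, assuming per-frame log-likelihoods are already available from the acoustic models. Uniform phone priors are assumed in the posterior, which is a simplification.

```python
import math

def log_likelihood_score(frame_logliks):
    """Method 2: average per-frame log-likelihood of the target phone
    over the d frames of the segment."""
    return sum(frame_logliks) / len(frame_logliks)

def log_posterior_score(frames, target):
    """Method 3: average per-frame log-posterior of the target phone.
    Each frame is a dict {phone: log p(o_t | q)}; the posterior
    normalizes the target likelihood over all competing phones
    (uniform priors assumed here for simplicity)."""
    total = 0.0
    for frame in frames:
        log_denom = math.log(sum(math.exp(v) for v in frame.values()))
        total += frame[target] - log_denom
    return total / len(frames)
```

The posterior score is bounded above by 0 (log of a probability), which makes it less sensitive to acoustic mismatch than the raw likelihood.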
Method 4: Log-likelihood-distribution Scores • Use the CDF of a Gaussian fitted to the log-likelihood • A CDF value of 1 maps to a score of 100
Method 5: Rank Ratio Scores • Rank ratio: the rank of the target phone’s likelihood among a set of competing phones, divided by the size of the set • RR-to-score conversion, where the parameters a, b are phone specific • Possible sets of competing phones for a biphone x+y • *+y • *+*
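The rank ratio itself can be sketched as below; the phone-specific RR-to-score conversion (parameters a, b) is left out since its functional form is not given on the slide, and the competing set here is simply whatever phones the caller supplies (e.g., all *+y biphones).

```python
def rank_ratio(logliks, target):
    """Rank ratio of the target phone: its 1-based rank (sorting the
    competing set by descending log-likelihood) divided by the number
    of phones in the competing set. Smaller is better."""
    ranked = sorted(logliks, key=logliks.get, reverse=True)
    return (ranked.index(target) + 1) / len(ranked)
```

If the target phone has the highest likelihood in its competing set, RR = 1/N; if it ranks last, RR = 1.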
Intro. to Learning to Rank • Learning to rank (LTR) • A supervised learning approach that generates a ranking model from a training set of partially ordered items. • Methods • Pointwise (e.g., PRanking) • Pairwise (e.g., RankSVM, RankBoost, RankNet) • Listwise (e.g., ListNet)
Application of LTR to SA • Why use LTR for SA? • Human scoring is rank-based: A+, A, B, B-, … • Tsing Hua’s grading system is moving from scores (0–100) to ranks (A, B, C, D, …). • Combination of features (scores) • Features are complementary. • Effective determination of ranking • LTR only generates numerical output whose ordering should be as close as possible to the correct ranking; an optimal DP-based approach is proposed to convert the scores into ranks.
LTR Score Segmentation • Given: sorted LTR scores $s_1 \le s_2 \le \dots \le s_n$ and the desired rank $r_i$ of each item • We want to find the separating scores $\theta_1 < \theta_2 < \theta_3 < \theta_4$ defining a score-to-rank function that maps each $s_i$ to one of ranks 1–5 • Such that the number of mismatches between the computed ranks and the desired ranks is minimized
LTR Score Segmentation by DP (I) • Formulate the problem in a DP framework • Optimum-value function $D(i, j)$: the minimum cost of mapping $s_1, \dots, s_i$ to ranks $1, \dots, j$, with item $i$ assigned rank $j$ • Recurrence: $D(i, j) = \min\{D(i-1, j),\, D(i-1, j-1)\} + c(i, j)$, where $c(i, j) = 0$ if $r_i = j$ and $1$ otherwise • Boundary condition: $D(1, 1) = c(1, 1)$ • Optimum cost: $D(n, 5)$
LTR Score Segmentation by DP (II) • DP trellis: computed rank (1–5) on the vertical axis vs. items sorted by LTR score on the horizontal axis • Local constraint: each step keeps the same rank or moves up by exactly one, so the computed ranks are non-decreasing along the sorted scores
LTR Score Segmentation by DP (III) • Plots: distribution of the LTR scores and the resulting DP path
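The DP described on the preceding slides can be sketched as follows. The boundary conditions (first item takes rank 1, last item takes the top rank k) are an assumed reading of the slides' unstated details.

```python
def segment_ranks(desired, k):
    """Map items (already sorted by LTR score) to a non-decreasing rank
    sequence 1..k, minimizing mismatches against the desired human ranks.
    D[i][j] = minimum cost of assigning items 1..i with item i at rank j.
    Assumed boundary: first item gets rank 1, last item gets rank k."""
    n = len(desired)
    INF = float("inf")

    def cost(i, j):  # 0 if the desired rank matches, else 1
        return 0 if desired[i - 1] == j else 1

    D = [[INF] * (k + 1) for _ in range(n + 1)]
    D[1][1] = cost(1, 1)
    for i in range(2, n + 1):
        for j in range(1, k + 1):
            prev = min(D[i - 1][j], D[i - 1][j - 1] if j > 1 else INF)
            if prev < INF:
                D[i][j] = prev + cost(i, j)  # stay at rank j or step up from j-1
    return D[n][k]
```

The local constraint (stay at the same rank or step up by one) guarantees the computed rank sequence is non-decreasing, so the thresholds between consecutive ranks can be read off the resulting path.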
Corpora for Experiments • WSJ • For training biphone acoustic models for forced alignment • MIR-SD • Recordings of about 4000 multi-syllable English words by 22 students (12 females, 10 males) with an intermediate competence level • Originally designed for stress detection • Available at http://mirlab.org/dataSet/public
Human Scoring of MIR-SD • Human scoring • 50 utterances from each speaker of MIR-SD were scored by 2 human raters, for a total of 1100 utterances • The human scores are consistent across raters:
Performance Indices • Performance indices used in the literature, illustrated with human ranks hr = [1 3 5 4 2 2] and computed ranks cr = [2 3 5 2 1 4]: • Recognition rate rRate = 33.33% • Recognition rate with tolerance 1 = 66.67% • Average absolute difference = 1 • Correlation coefficient = 0.54
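The four indices can be reproduced directly from the slide's example:

```python
def performance_indices(hr, cr):
    """Compute the four indices from human ranks hr and computed ranks cr:
    exact recognition rate, recognition rate with tolerance 1,
    average absolute difference, and Pearson correlation coefficient."""
    n = len(hr)
    exact = sum(h == c for h, c in zip(hr, cr)) / n
    within1 = sum(abs(h - c) <= 1 for h, c in zip(hr, cr)) / n
    avg_abs = sum(abs(h - c) for h, c in zip(hr, cr)) / n
    mh, mc = sum(hr) / n, sum(cr) / n
    cov = sum((h - mh) * (c - mc) for h, c in zip(hr, cr))
    ss_h = sum((h - mh) ** 2 for h in hr)
    ss_c = sum((c - mc) ** 2 for c in cr)
    corr = cov / (ss_h * ss_c) ** 0.5
    return exact, within1, avg_abs, corr
```

Running it on hr = [1, 3, 5, 4, 2, 2] and cr = [2, 3, 5, 2, 1, 4] yields 33.33%, 66.67%, 1, and 0.54, matching the figures above.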
Overall Performance Comparison • Legend • Score segmentation • Circles: DP • Triangles: k-means • Inside/outside tests • Solid lines: inside • Dashed lines: outside • Black lines: baselines
Demo: Practice of Mandarin Idioms of Length 4 (一語中的) • The level (difficulty) of an idiom is based on its frequency via Google search: • 孤掌難鳴 ===> 260,000 • 鶼鰈情深 ===> 43,300 • 亡鈇意鄰 ===> 22,700 • 舉案齊眉 ===> 235,000 • Can be adapted for English learning • Next step: multi-threading, fast decoding via FSM
Demo: Recitation Machine (唸唸不忘) • Supports Mandarin & English • Supports user-defined recitation scripts • Next step: multi-threading for recording & recognition
Licensing for PC Applications • For Mandarin, English, Japanese
SA for Embedded Systems • Embedded platforms: PMP, iPhone, Android devices
Demo: Tangible Companions • Chicken run (落跑雞) • Penguin for Tang Poetry (唐詩企鵝) • Robot Fighter (蘿蔔戰士) • Singing Bass & Dog (大嘴鱸魚和唱歌狗)
Tools and Tutorials • Tools • DCPR toolbox • http://mirlab.org/jang/matlab/toolbox/dcpr • SAP toolbox • http://mirlab.org/jang/matlab/toolbox/sap • ASR Toolbox • http://mirlab.org/jang/matlab/toolbox/asr • Tutorials • Data clustering and pattern recognition: • http://mirlab.org/jang/books/dcpr • Audio signal processing • http://mirlab.org/jang/books/audioSignalProcessing • Lab page (with demos): • http://mirlab.org
Other SA Issues to be Addressed • Core technology • Other acoustic features for scoring • Pitch: tone/intonation • Volume • Duration • Pause • Coarticulation • Error pattern identification • Application side • Multimodal GUI • Extensions • Slight adaptation • Paragraph-level SA • Text-free SA • Beyond pronunciation • Translation + recognition + assessment • Microphone types
Examples • Coarticulation • Knock it off! • Mom woke her up • Consonant+consonant • Bus stop • Push Shirley • Ask question • Jeff flew south through Tainan • Exception • Change jobs • Which Chair
Examples • Changes due to coarticulation • Would you like it? • Won’t you go? • Raise your hand. • It makes you look younger. • Softened sounds • Junction • Popcorn • Fruitful • Can and can’t • I can read the letter. • I can’t read the letter. • d and t • Better • Cider
Most Likely to be Mispronounced • Within Taiwan • Pleasure/pressure • World/war/word • Shirt/short • Walk/work • Flesh/fresh • Supply/surprise • Some/son • Confirm/conform • Cancel/cancer • Mouth/mouse • Measure/major • Police/please • Version/virgin
Conclusions • Conclusions • SA calls for more cues than ASR • SA requires techniques from ML/IR • Multi-modal approach to SA is a must • “Popcorn”, “Thursday” • On-going & future work • Tone recognition & assessment • Reliable error pattern identification
References • Witt, S. M. and Young, S. J., “Phone-level Pronunciation Scoring and Assessment for Interactive Language Learning”, Speech Communication 30, 95-108, 2000. • Kim, Y., Franco, H., and Neumeyer, L., “Automatic Pronunciation Scoring of Specific Phone Segments for Language Instruction”, in Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech ’97), pp. 649-652, Rhodes, 1997. • Neumeyer, L., Franco, H., Digalakis, V., and Weintraub, M., “Automatic Scoring of Pronunciation Quality”, Speech Communication 30, 83-93, 2000. • Franco, H., Neumeyer, L., Digalakis, V., and Ronen, O., “Combination of Machine Scores for Automatic Grading of Pronunciation Quality”, Speech Communication 30, 121-130, 2000. • Cincarek, T., Gruhn, R., Hacker, C., Nöth, E., and Nakamura, S., “Automatic Pronunciation Scoring of Words and Sentences Independent from the Non-Native’s First Language”, Computer Speech and Language 23, 65-88, 2009. • Crammer, K. and Singer, Y., “Pranking with Ranking”, in Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2001. • Joachims, T., “Optimizing Search Engines using Clickthrough Data”, in Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2002. • Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y., “An Efficient Boosting Algorithm for Combining Preferences”, in Proceedings of ICML, pp. 170-178, 1998. • Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G., “Learning to Rank using Gradient Descent”, in Proceedings of ICML, pp. 89-96, 2005. • Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H., “Learning to Rank: From Pairwise Approach to Listwise Approach”, in Proceedings of the 24th International Conference on Machine Learning, pp. 129-136, Corvallis, OR, 2007.
• Chen, L.-Y. and Jang, J.-S. R., “Automatic Pronunciation Scoring using Learning to Rank and DP-based Score Segmentation”, submitted to Interspeech 2010.