Speech Assessment: Methods and Applications for Spoken Language Learning (語音評分的方法、應用與分享) J.-S. Roger Jang (張智星) jang@cs.nthu.edu.tw http://www.cs.nthu.edu.tw/~jang Multimedia Information Retrieval Lab CS Dept, Tsing Hua Univ, Taiwan
Outline • Introduction to speech assessment • Methods • Using learning to rank for speech assessment • Demos • Conclusions
Intro. to Speech Assessment • Goal • Evaluate a person’s utterance based on some acoustic features, for language learning • Also known as • Pronunciation scoring • CAPT (computer-assisted pronunciation training)
Four Aspects of Language Learning • [Figure: grid of language skills by media, with cells labeled "easier for CALL" and "harder for CALL"; speech assessment (SA) is highlighted]
Speech Assessment • Characteristics of ideal SA • Assessment levels: as detailed as possible • Syllables, words, sentences, paragraphs • Assessment criteria: as many as possible • timbre, tone, energy, rhythm, co-articulation, … • Feedback: as specific as possible • High-level correction and suggestions
Basic Assessment Criteria • Timbre (articulation) • Based on acoustic models • Tone (pitch) • Based on tone recognition (for tonal languages) • Based on pitch similarity with the target utterance • Rhythm (duration) • Based on duration comparison with the target utterance • Energy (intensity/volume) • Based on energy comparison with the target utterance
Additional Assessment Criteria • English • Stress • Levels (word or sentence) • Intonation (sentence level) • Declarative sentence • Interrogative sentence • Co-articulation (linking) • A red apple. • Did you call me? • Won’t you go? • Raise your hand. • Mandarin • Tone • Retroflex • Co-articulation (linking) • Erhua (兒化音, r-coloring of finals) • Others • Pause
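Types of SA • Types of SA (ordered by difficulty) • Type 1: target text given, target utterance given • Type 2: target text given, no target utterance • Type 3: no target text, target utterance given • Type 4: no target text, no target utterance • We focus on Types 1 and 2.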
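Type 1: Target Text Given, Target Utterance Given • Method: • Using a speech recognition engine, perform forced alignment (FA) between the speech and the text, then score each phone by its similarity • Scoring • Timbre: compared against the acoustic models of the speech recognition engine • Tone, rhythm, energy: compared against the target utterance • Characteristics: • Because FA is highly accurate, the resulting scores tend to be consistent • Examples: • myET (艾爾實驗室): www.myet.com • Saybot (說寶堂): www.saybot.com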
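Type 2: Target Text Given, No Target Utterance • Method: • Using a speech recognition engine, perform forced alignment (FA) between the speech and the text, then score each phone by its similarity • Scoring • Timbre: compared against the acoustic models of the speech recognition engine • Tone: for Mandarin, the canonical tones can be derived from the text, and tone recognition on the speech then gives a score. No comparable method exists for English. • Rhythm, energy: cannot be compared • Characteristics: • Because FA is highly accurate, the resulting scores tend to be consistent • Teaching materials are easier to prepare • But rhythm and volume cannot be scored • Example: • "speak & score" from 階梯英文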
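Type 3: No Target Text, Target Utterance Given • Method: • Using a speech recognition engine, perform free syllable decoding (FSD) on the speech, then score by the similarity of the resulting syllable strings. • Scoring • Timbre: compared against the syllable string of the target utterance • Tone, rhythm, energy: compared using the syllables produced by FSD • Characteristics: • Because the recognition rate of FSD is only about 60~70%, consistent scores are harder to obtain. • DTW can be used for direct comparison instead, but differences in individual timbre make the scores less consistent.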
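To make the DTW alternative concrete, here is a minimal sketch of dynamic time warping between two one-dimensional contours (for example, pitch values per frame). The function name `dtw_distance`, the absolute-difference local cost, and the three-way step pattern are assumptions for illustration; the slides do not specify them, and a practical system would compare multi-dimensional frame features instead.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two 1-D sequences.

    Assumptions: local cost is |a[i] - b[j]|, and each cell may be reached
    from its left, lower, or lower-left neighbour.
    """
    n, m = len(a), len(b)
    # table[i][j]: minimum accumulated cost of aligning a[:i] with b[:j]
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            table[i][j] = cost + min(
                table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]
            )
    return table[n][m]
```

A repeated frame costs nothing here, so `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero, which is exactly why DTW tolerates differences in speaking rate.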
Our Approach • Basic approach to timbre assessment • Lexicon net construction (Usually a sausage net) • Forced alignment to identify phone boundaries • Phone scoring based on several criteria, such as ranking, histograms, posterior prob., etc. • Weighted average to get syllable/sentence scores
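The last step, combining phone scores into a syllable or sentence score, can be sketched as a plain weighted average. The helper name `weighted_score` and the choice of phone durations as weights are hypothetical; the slides only say that a weighted average is used.

```python
def weighted_score(phone_scores, weights=None):
    """Weighted average of phone scores (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(phone_scores)
    if len(weights) != len(phone_scores):
        raise ValueError("phone_scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(phone_scores, weights)) / total


# Example: three phones scored 80, 60, 90, weighted by duration in frames
# (duration weighting is an assumption made for this example).
syllable_score = weighted_score([80, 60, 90], [10, 30, 20])
```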
Lexicon Net Construction • Lexicon net for “what are you allergic to?” • Sausage net with all possible (and correct) multiple pronunciations • Optional sil between words
Lexicon Net with Confusing Phones • Common errors for Japanese learners of Chinese • ㄖ → ㄌ • Example: 天氣熱 → 天氣樂 • ㄑ → ㄐ • Example: 打哈欠 → 打哈見 • ㄘ → ㄗ • Example: 一次旅行 → 一字旅行 • ㄢ → ㄤ • Example: 晚安 → 晚ㄤ • Rule-based approach to creating confusing syllables (phonological rules!) • Rules: • Rule 1: re → le • Rule 2: qi → ji • Rule 3: ci → zi • Rule 4: an → ang • Example • 欠 (qian) → 見 (jian), 嗆 (qiang), 降 (jiang)
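A rough sketch of how such rules could generate confusing syllables in Pinyin. It assumes the four rules act on the initial (r/l, q/j, c/z) and on the final (an/ang), which is inferred from the 欠 (qian) example; the names `INITIAL_RULES`, `FINAL_RULES`, and `confusable_syllables` are hypothetical, and the optional `valid` set stands in for a list of legal Mandarin syllables.

```python
from itertools import product

# Assumed reading of the four rules: substitutions on the initial or the final.
INITIAL_RULES = {"r": "l", "q": "j", "c": "z"}
FINAL_RULES = {"an": "ang"}


def split_initial(syllable):
    """Split off the initial only when a rule might apply to it."""
    for two_letter in ("zh", "ch", "sh"):
        if syllable.startswith(two_letter):
            return two_letter, syllable[2:]  # "ch" must not match the c rule
    if syllable[:1] in INITIAL_RULES:
        return syllable[:1], syllable[1:]
    return "", syllable


def confusable_syllables(syllable, valid=None):
    """Return the variants of `syllable` produced by the rules."""
    initial, rest = split_initial(syllable)
    initials = [initial]
    if initial in INITIAL_RULES:
        initials.append(INITIAL_RULES[initial])
    rests = [rest]
    for old, new in FINAL_RULES.items():
        if rest.endswith(old):
            rests.append(rest[: -len(old)] + new)
    variants = [i + r for i, r in product(initials, rests)]
    variants.remove(syllable)  # drop the canonical form itself
    if valid is not None:
        variants = [v for v in variants if v in valid]
    return variants
```

Without a `valid` filter the rules over-generate (for instance "tian" yields "tiang", which is not a Mandarin syllable), so some check against the syllable inventory seems necessary in practice.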
Example of Japanese Learners Speaking Chinese • 去年夏天熱死了 • Example 1 • Example 2 • 晚安 • Example 1 • Example 2 • 坐下來、慢慢吃 • Example 1 • 他不住的打哈欠 • Example 1 • 一次旅行 • Example 1 • 起風 • Example 1 • 休息 • Example 1
Lexicon Net with Confusing Phones • Lexicon net for “天氣熱、打哈欠” • Canonical form: tian qi re da ha qian • 16 variant paths in the net, from the alternatives 氣/記 (qi/ji), 熱/樂 (re/le), and 欠/見/嗆/降 (qian/jian/qiang/jiang)
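The path count can be checked by enumerating the net as a sequence of slots, one per syllable, each holding its alternatives. This is only an illustrative enumeration (the variable names are made up here); a real lexicon net would be a graph handed to the decoder rather than an explicit list of paths.

```python
from itertools import product

# One slot per syllable of "tian qi re da ha qian"; each slot lists the
# canonical syllable first, followed by its confusing variants.
slots = [
    ["tian"],
    ["qi", "ji"],
    ["re", "le"],
    ["da"],
    ["ha"],
    ["qian", "jian", "qiang", "jiang"],
]

paths = [" ".join(p) for p in product(*slots)]
num_paths = len(paths)  # 1 * 2 * 2 * 1 * 1 * 4
```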
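Automatic Confusing Syllable Id. • Input: corpus of Japanese learners of Chinese • Step 1: Forced alignment to obtain an initial segmentation • Step 2: Compare each segment against the 411 Mandarin syllables to find the confusing syllables of each sound • Step 3: Add the confusing syllables to the recognition network • Step 4: Redo forced alignment and segmentation • Step 5: Has the segmentation stopped changing? • No: repeat from the comparison step • Yes: output the confusing syllables and the recognition network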
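Error Pattern Identification (EPI) • Common insertions/deletions from users • Using 「朝辭白帝彩雲間」 as the reference sentence • Ending at any point, e.g. 「朝辭白帝」 • Starting at any point, e.g. 「彩雲間」 • Starting and ending at any point, e.g. 「白帝彩雲」 • Starting and ending at any point, with skipped characters, e.g. 「白彩雲」 • Repeated character, e.g. 「朝…朝辭白帝彩雲間」 • Repeated word, e.g. 「朝辭…朝辭白帝彩雲間」 • Repeated character with a changed pronunciation, e.g. 「朝(cao)…朝(zhao)辭白帝彩雲間」 • Two characters swapped, e.g. 「朝辭彩帝白雲間」 • Wrong characters, e.g. 「朝辭白帝黑山間」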
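Lexicon Net for EPI (I) • Detects utterances that start at the beginning and end at any point
Lexicon Net for EPI (II) • Detects utterances that start at any point and end at the end
Lexicon Net for EPI (III) • Detects utterances that start at any point and end at any point (without skipping characters)
Lexicon Net for EPI (IV) • Detects utterances that start at any point and end at any point, with skipped characters allowed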
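One way to see what the four nets accept is to enumerate the character sequences each one allows for a reference sentence. The sketch below does this by brute force; the function names are hypothetical, and an actual lexicon net would encode the same sets compactly as a graph with skip and entry/exit arcs rather than as explicit lists.

```python
from itertools import combinations


def net1_prefixes(ref):
    """Net I: start at the beginning, end anywhere."""
    return [ref[:k] for k in range(1, len(ref) + 1)]


def net2_suffixes(ref):
    """Net II: start anywhere, end at the end."""
    return [ref[k:] for k in range(len(ref))]


def net3_substrings(ref):
    """Net III: start and end anywhere, no skipping."""
    n = len(ref)
    return [ref[i:j] for i in range(n) for j in range(i + 1, n + 1)]


def net4_subsequences(ref):
    """Net IV: start and end anywhere, skipping allowed."""
    n = len(ref)
    return [
        "".join(ref[i] for i in idx)
        for k in range(1, n + 1)
        for idx in combinations(range(n), k)
    ]


reference = "朝辭白帝彩雲間"
```

For this 7-character reference the accepted sets grow from 7 (nets I and II) to 28 (net III) to 127 (net IV), which illustrates the trade-off between coverage and recognition difficulty.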
Design Philosophy of Lexicon Nets • We need to strike a balance between recognition accuracy and the coverage of the lexicon net • In the extreme, we can use a net for free syllable decoding to catch all error patterns. • However, the usefulness of free syllable decoding is limited by its low recognition rate.
Scoring Methods for Speech Assessment • Five phone-based scoring methods • Duration-distribution scores (durDis) • Log-likelihood scores (hmmLike) • Log-posterior scores (hmmPost) • Log-likelihood-distribution scores (likeDis) • Rank ratio scores (rkRatio) • All based on forced alignment to segment phones
Method 1: Duration-distribution Scores • PDF of phone duration • Obtained from forced alignment • Normalized by speech rate • Fitted by log-normal PDF • Maximum of the PDF ↔ score of 100
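A minimal sketch of a duration score under these bullets: the log-normal PDF is rescaled so that its peak (the mode) maps to 100. The parameters `mu` and `sigma` are assumed to have been fitted to rate-normalized durations of one phone; how the speech-rate normalization is done is not given in the slides, so it is left out here, and the function names are hypothetical.

```python
import math


def lognormal_pdf(d, mu, sigma):
    """Log-normal density at duration d (zero for non-positive d)."""
    if d <= 0:
        return 0.0
    z = (math.log(d) - mu) / sigma
    return math.exp(-0.5 * z * z) / (d * sigma * math.sqrt(2.0 * math.pi))


def duration_score(d, mu, sigma):
    """Scale the PDF so that its maximum corresponds to a score of 100."""
    mode = math.exp(mu - sigma ** 2)  # location of the log-normal peak
    return 100.0 * lognormal_pdf(d, mu, sigma) / lognormal_pdf(mode, mu, sigma)
```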
Method 2: Log-likelihood Scores • Log-likelihood of phone $q$ with duration of $d$ frames: $l = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log p(y_t \mid q)$, where $p(y_t \mid q)$ is the likelihood of frame $t$ with the observation vector $y_t$
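Method 3: Log-posterior Scores • Log-posterior of phone $q$ with duration $d$: $\rho = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log P(q \mid y_t)$, where $P(q \mid y_t) = \frac{p(y_t \mid q)\, P(q)}{\sum_{q'} p(y_t \mid q')\, P(q')}$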
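The two HMM-based scores can be sketched from per-frame log-likelihoods. The sketch assumes duration-normalized (per-frame average) scores and, unless priors are supplied, uniform phone priors; both are assumptions rather than details confirmed by the slides. The input format (one dict of phone log-likelihoods per frame) and the function names are also hypothetical.

```python
import math


def log_likelihood_score(frame_loglik, phone):
    """Average of log p(y_t | phone) over the frames of the segment."""
    return sum(frame[phone] for frame in frame_loglik) / len(frame_loglik)


def log_posterior_score(frame_loglik, phone, log_prior=None):
    """Average of log P(phone | y_t) over the frames of the segment."""
    total = 0.0
    for frame in frame_loglik:
        # log of p(y_t | q) P(q) for every competing phone q
        joint = {
            q: ll + (log_prior[q] if log_prior else 0.0)
            for q, ll in frame.items()
        }
        peak = max(joint.values())
        # log-sum-exp for the denominator, shifted for numerical stability
        log_evidence = peak + math.log(
            sum(math.exp(v - peak) for v in joint.values())
        )
        total += joint[phone] - log_evidence
    return total / len(frame_loglik)
```

Unlike the raw log-likelihood, the posterior score is normalized against competing phones frame by frame, so it is less sensitive to channel or speaker effects that shift all likelihoods together.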
Method 4: Log-likelihood-distribution Scores • Use the CDF of a Gaussian fitted to the log-likelihoods • CDF = 1 ↔ score = 100
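As a sketch, the score can be taken as 100 times the Gaussian CDF evaluated at the phone's log-likelihood. Here `mu` and `sigma` are assumed to be the mean and standard deviation of log-likelihoods observed for that phone in reference data; the function names are hypothetical.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """CDF of a normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def like_dis_score(loglik, mu, sigma):
    """Map a log-likelihood to 0..100 so that CDF = 1 corresponds to 100."""
    return 100.0 * gaussian_cdf(loglik, mu, sigma)
```

Under this mapping a log-likelihood equal to the reference mean scores 50, and the score rises toward 100 as the log-likelihood moves above the mean.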
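Method 5: Rank Ratio Scores • Rank ratio: [equation not preserved] • RR to score conversion: [equation not preserved], where parameters a, b are phone specific. • Possible sets of competing phones for the biphone x+y • *+y (biphones with the same right context) • *+* (all biphones)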
Demo of Our Prototype • ASR toolbox • http://mirlab.org/jang/matlab/toolbox/asr • Command: goDemoSa.m
Intro. to Learning to Rank • Learning to rank • A supervised learning algorithm which generates a ranking model based on a training set of partially ordered items. (A task somewhat between classification and regression.) • [Figure: a rank function maps a set of items to a list ordered by preference]
Learning to Rank: Methods and App. • Methods • Pointwise (e.g., Pranking) • Pairwise (e.g., RankSVM, RankBoost, RankNet) • Listwise (e.g., ListNet) • Applications • Webpage ranking • Machine translation • Protein structure prediction
Application of LTR to SA • Why use LTR for SA? • Human scoring is rank-based • Tsing Hua’s grading system is moving from scores (0~100) to ranks (A, B, C, D…). • Combination of features (scores) • Features are complementary. • Effective determination of ranking • LTR only generates numerical output with a ranking order as close as possible to the correct order. An optimal DP-based approach is proposed to convert this output into ranks.
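LTR Score Segmentation • Given: LTR scores (sorted) $s_1 \le s_2 \le \dots \le s_n$ • Desired ranks $r_1, r_2, \dots, r_n$, with $r_i \in \{1, \dots, 5\}$ • We want to find the separating scores $\theta_1 \le \theta_2 \le \theta_3 \le \theta_4$ with score-to-rank function $f(s) = k$ if $\theta_{k-1} < s \le \theta_k$ (taking $\theta_0 = -\infty$ and $\theta_5 = +\infty$), which splits the score axis into Rank 1 through Rank 5 • Such that $\sum_{i=1}^{n} |f(s_i) - r_i|$ is minimized.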
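LTR Score Segmentation by DP (I) • Formulate the problem in DP framework (sorted scores $s_1, \dots, s_n$ with desired ranks $r_1, \dots, r_n$) • Optimum-value function $D(i,j)$: The minimum cost of mapping $s_1, \dots, s_i$ to ranks $1, \dots, j$, with $s_i$ mapped to rank $j$ • Recurrent equation: $D(i,j) = |r_i - j| + \min\{D(i-1,j),\ D(i-1,j-1)\}$ • Boundary condition: $D(1,1) = |r_1 - 1|$ • Optimum cost: $D(n,5)$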
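LTR Score Segmentation by DP (II) • [Figure: DP grid with the desired ranks of the sorted items on the horizontal axis and the computed rank (1 to 5) on the vertical axis] • Local constraint: cell $(i,j)$ is reached from $(i-1,j)$ or $(i-1,j-1)$ • Recurrent formula: $D(i,j) = |r_i - j| + \min\{D(i-1,j),\ D(i-1,j-1)\}$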
LTR Score Segmentation with DP (III) • [Figures: data distribution and the resulting DP path]
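The DP segmentation can be sketched as follows. This is one plausible reading of the slides, not necessarily the author's exact formulation: items are sorted by ascending LTR score, the cost is the sum of absolute differences between assigned and desired ranks, the path starts at rank 1, ends at the highest rank, and moves up by at most one rank per item. The function names `segment_ranks` and `separating_scores` are hypothetical.

```python
import math


def segment_ranks(desired, num_ranks=5):
    """Assign non-decreasing ranks to items sorted by score.

    Minimizes sum(|assigned - desired|) with steps of 0 or +1 rank,
    starting at rank 1 and ending at rank num_ranks (assumed constraints).
    Returns (minimum cost, list of assigned ranks).
    """
    n = len(desired)
    cost = [[math.inf] * num_ranks for _ in range(n)]
    back = [[-1] * num_ranks for _ in range(n)]
    cost[0][0] = abs(desired[0] - 1)
    for i in range(1, n):
        for j in range(num_ranks):
            stay = cost[i - 1][j]
            up = cost[i - 1][j - 1] if j > 0 else math.inf
            best, prev = (stay, j) if stay <= up else (up, j - 1)
            if best < math.inf:
                cost[i][j] = best + abs(desired[i] - (j + 1))
                back[i][j] = prev
    if cost[n - 1][num_ranks - 1] == math.inf:
        raise ValueError("need at least num_ranks items")
    ranks = [0] * n
    j = num_ranks - 1
    for i in range(n - 1, -1, -1):  # backtrack along the optimal path
        ranks[i] = j + 1
        j = back[i][j]
    return cost[n - 1][num_ranks - 1], ranks


def separating_scores(scores, ranks):
    """Midpoints between neighbouring scores where the assigned rank changes."""
    return [
        (scores[i] + scores[i + 1]) / 2
        for i in range(len(ranks) - 1)
        if ranks[i] != ranks[i + 1]
    ]
```

For example, with desired ranks `[1, 1, 2, 1, 3, 3, 2, 4, 5, 5]` in score order, the two out-of-order items each cost 1, so the minimum total cost is 2.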
Corpora for Experiments • WSJ0 • 8000 training utterances, 84 speakers. Used to train biphone acoustic models for forced alignment • MIR-SD • Recordings of about 4000 multi-syllable English words by 22 students (12 females and 10 males) with an intermediate competence level. • Originally designed for stress detection • Available at http://mirlab.org/dataSet/public
Human Scoring of MIR-SD • Human scoring • Only 50 utterances from each speaker of MIR-SD are scored by 2 humans, making a total of 1100 utterances • The human scores are consistent.
Examples of MIR-SD • Level 5 • apparent, paragraphic, constellation • Level 3 • additive, timorous, availably • Level 1 • ambiguity, auxiliary, anachronism
Performance Indices • Performance indices used in the literature • hr = [1 3 5 4 2 2], cr = [2 3 5 2 1 4] • Recognition rate rRate = 33.33% • Recognition rate with tolerance 1 = 66.67% • Average absolute difference = 1 • Correlation coef = 0.54
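The four indices can be reproduced for the example vectors with a short sketch. It assumes `hr` holds the human ranks and `cr` the computed ranks, which is how the abbreviations are read here; the function name `performance_indices` is hypothetical.

```python
import math


def performance_indices(hr, cr):
    """Return (recognition rate, rate with tolerance 1,
    average absolute difference, Pearson correlation)."""
    n = len(hr)
    diffs = [abs(h - c) for h, c in zip(hr, cr)]
    r_rate = sum(d == 0 for d in diffs) / n
    r_rate_tol1 = sum(d <= 1 for d in diffs) / n
    avg_abs_diff = sum(diffs) / n
    mean_h, mean_c = sum(hr) / n, sum(cr) / n
    cov = sum((h - mean_h) * (c - mean_c) for h, c in zip(hr, cr))
    var_h = sum((h - mean_h) ** 2 for h in hr)
    var_c = sum((c - mean_c) ** 2 for c in cr)
    corr = cov / math.sqrt(var_h * var_c)
    return r_rate, r_rate_tol1, avg_abs_diff, corr


hr = [1, 3, 5, 4, 2, 2]
cr = [2, 3, 5, 2, 1, 4]
r_rate, r_rate_tol1, avg_abs_diff, corr = performance_indices(hr, cr)
```

Only two of the six positions match exactly and four are within one rank, giving the 33.33% and 66.67% figures on the slide.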
LTR Combination of Scores • Features for LTR • durDis and rkRatio: raw scores • hmmLike, hmmPost, likeDis: DP segmentation • LTR • RankSVM • Linear kernel • Baseline • hmmPost with DP-based segmentation
Overall Performance Comparison • Legends • Score segmentation • Circles: DP • Triangles: k-means • Inside/outside tests • Solid lines: Inside • Dashed lines: Outside • Black lines: Baselines
Summary of the Experiment • Segmentation • DP (supervised learning) is better than k-means (unsupervised learning) • Performance indices • Correlation coefficient is not intuitive (consider [4 5 4] and [1 2 1]: the correlation is 1 even though every rank is off by 3) • Recog. rate and sum of abs. diff. can be optimized by LTR and DP segmentation
Demo: Practice of Mandarin Idioms of Length 4 (一語中的) • Level (difficulty) of an idiom is based on its frequency in Google search results: • 孤掌難鳴 ===> 260,000 • 鶼鰈情深 ===> 43,300 • 亡鈇意鄰 ===> 22,700 • 舉案齊眉 ===> 235,000 • Can be adapted for English learning • Next step: multi-threading, fast decoding via FSM
Demo: Recitation Machine (唸唸不忘) • Supports Mandarin & English • Supports user-defined recitation scripts • Next step: multithreading for recording & recognition
Licensing for PC Applications • For Mandarin, English, Japanese