Speech Assessment: Methods and Applications for Spoken Language Learning 語音評分的方法、應用與分享

Speech Assessment: Methods and Applications for Spoken Language Learning (語音評分的方法、應用與分享) • J.-S. Roger Jang (張智星) • jang@cs.nthu.edu.tw • http://www.cs.nthu.edu.tw/~jang • Multimedia Information Retrieval Lab, CS Dept, Tsing Hua Univ, Taiwan

Outline • Introduction to speech assessment • Methods • Using learning to rank for speech assessment • Demos • Conclusions

Intro. to Speech Assessment • Goal • Evaluate a person’s utterance based on some acoustic features, for language learning • Also known as • Pronunciation scoring • CAPT (computer-assisted pronunciation training)

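Four Aspects of Language Learning • [Table: language skills by media, ranging from easier to harder for CALL (computer-assisted language learning); speech assessment (SA) is highlighted.]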

Speech Assessment • Characteristics of ideal SA • Assessment levels: as detailed as possible • Syllables, words, sentences, paragraphs • Assessment criteria: as many as possible • Timbre, tone, energy, rhythm, co-articulation, … • Feedback: as specific as possible • High-level corrections and suggestions

Basic Assessment Criteria • Timbre (咬字/音色) • Based on acoustic models • Tone (音調/音高) • Based on tone recognition (for tonal languages) • Based on pitch similarity with the target utterance • Rhythm (韻律/音長) • Based on duration comparison with the target utterance • Energy (強度/音量) • Based on energy comparison with the target utterance

Additional Assessment Criteria • English • Stress (重音) • Levels (word or sentence) • Intonation (整句音調) • Declarative sentence • Interrogative sentence • Co-articulation (連音) • A red apple. • Did you call me? • Won’t you go? • Raise your hand. • Mandarin • Tone (聲調) • Retroflex (捲舌音) • Co-articulation (連音) • Erhua, the r-colored syllable ending (兒化音) • Others • Pause

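Types of SA (ordered by difficulty) • Type 1: with target text, with target utterance • Type 2: with target text, without target utterance • Type 3: without target text, with target utterance • Type 4: without target text, without target utterance • We focus on Types 1 and 2.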

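Type 1: With Target Text, With Target Utterance • Method: • Built on a speech recognition engine: perform forced alignment (FA) between the speech and the text, then score each phone by its similarity • Scoring: • Timbre: compared against the acoustic models of the recognition engine • Tone, rhythm, energy: compared against the target utterance • Characteristics: • Because FA is highly accurate, the scoring results tend to be consistent • Examples: • myET (艾爾實驗室): www.myet.com • Saybot (說寶堂): www.saybot.com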

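Type 2: With Target Text, Without Target Utterance • Method: • Built on a speech recognition engine: perform forced alignment (FA) between the speech and the text, then score each phone by its similarity • Scoring: • Timbre: compared against the acoustic models of the recognition engine • Tone: for Mandarin, the canonical tones can be derived from the text, and tone recognition and scoring are then performed on the speech; no comparable method exists for English • Rhythm, energy: cannot be compared • Characteristics: • Because FA is highly accurate, the scoring results tend to be consistent • Teaching materials are easier to prepare • But rhythm and volume cannot be scored • Examples: • speak & score from 階梯英文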

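Type 3: Without Target Text, With Target Utterance • Method: • Built on a speech recognition engine: perform free syllable decoding (FSD) on the speech, then score by the similarity of the syllable strings • Scoring: • Timbre: compared against the syllable string of the target utterance • Tone, rhythm, energy: compared using the syllables produced by FSD • Characteristics: • Because the recognition rate of FSD is only 60 to 70%, consistent scores are harder to obtain • DTW can be used for direct comparison instead, but scoring is less consistent because timbre differs between speakers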

Our Approach • Basic approach to timbre assessment • Lexicon net construction (Usually a sausage net) • Forced alignment to identify phone boundaries • Phone scoring based on several criteria, such as ranking, histograms, posterior prob., etc. • Weighted average to get syllable/sentence scores
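The last step above, combining phone scores into a syllable or sentence score, can be sketched as follows. The slide does not say which weights are used, so weighting each phone by its duration in frames is an assumption of this sketch, and the scores and durations are made-up values.

```python
def weighted_score(phone_scores, weights):
    """Weighted average of phone scores; weights need not sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(s * w for s, w in zip(phone_scores, weights)) / total


# Hypothetical phone scores (0-100) and phone durations (frames) for one syllable.
phone_scores = [80.0, 60.0, 90.0]
durations = [10, 20, 10]

# Assumption: longer phones contribute more to the syllable score.
syllable_score = weighted_score(phone_scores, durations)
print(syllable_score)
```

The same function could be reused one level up, averaging syllable scores into a sentence score.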

Lexicon Net Construction • Lexicon net for “what are you allergic to?” • Sausage net with all possible (and correct) multiple pronunciations • Optional sil between words

Lexicon Net with Confusing Phones • Common errors for Japanese learners of Chinese • ㄖ → ㄌ • e.g., 天氣熱 → 天氣樂 • ㄑ → ㄐ • e.g., 打哈欠 → 打哈見 • ㄘ → ㄗ • e.g., 一次旅行 → 一字旅行 • ㄢ → ㄤ • e.g., 晚安 → 晚ㄤ • Rule-based approach to creating confusing syllables (phonological rules!) • Rules: • Rule 1: re → le • Rule 2: qi → ji • Rule 3: ci → zi • Rule 4: an → ang • Example • 欠 (qian) → 見 (jian), 嗆 (qiang), 降 (jiang)

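Examples of Japanese Learners Speaking Chinese • 去年夏天熱死了 (It was scorching hot last summer) • Example 1 • Example 2 • 晚安 (Good night) • Example 1 • Example 2 • 坐下來、慢慢吃 (Sit down and eat slowly) • Example 1 • 他不住的打哈欠 (He keeps yawning) • Example 1 • 一次旅行 (A trip) • Example 1 • 起風 (The wind is picking up) • Example 1 • 休息 (Rest) • Example 1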

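Lexicon Net with Confusing Phones • Lexicon net for “天氣熱、打哈欠” • Canonical form: tian qi re da ha qian • 16 variant paths in the net, built from the alternatives 氣 (qi) / 記 (ji), 熱 (re) / 樂 (le), and 欠 (qian) / 見 (jian) / 嗆 (qiang) / 降 (jiang)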
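A minimal sketch of how the four rewrite rules (re → le, qi → ji, ci → zi, an → ang) could expand the canonical syllables into the 16 paths. Treating the first three rules as syllable-initial rewrites and the last as a syllable-final rewrite, and filtering the results against a syllable inventory, are assumptions of this sketch; `VALID_SYLLABLES` is a tiny hypothetical stand-in for the full Mandarin syllable table.

```python
from itertools import product

# (kind, source, target). "onset" rules rewrite the start of a syllable,
# "rime" rules rewrite the end. This split is an assumption of the sketch.
RULES = [
    ("onset", "re", "le"),
    ("onset", "qi", "ji"),
    ("onset", "ci", "zi"),
    ("rime", "an", "ang"),
]

# Hypothetical mini inventory; a real system would use the full syllable list.
VALID_SYLLABLES = {
    "tian", "qi", "ji", "re", "le", "da", "ha",
    "qian", "jian", "qiang", "jiang",
}


def confusable(syllable):
    """Return the syllable plus every valid variant reachable via the rules."""
    variants = {syllable}
    for kind, src, dst in RULES:
        for v in list(variants):
            if kind == "onset" and v.startswith(src):
                variants.add(dst + v[len(src):])
            elif kind == "rime" and v.endswith(src):
                variants.add(v[: -len(src)] + dst)
    # Drop rewrites that are not real syllables (e.g. "tiang").
    return {v for v in variants if v in VALID_SYLLABLES}


canonical = "tian qi re da ha qian".split()
paths = list(product(*(sorted(confusable(s)) for s in canonical)))
print(len(paths))  # one path per combination of alternatives
```

The inventory filter is what keeps "tian" from gaining a spurious "tiang" branch, so only qi, re, and qian contribute alternatives.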

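Automatic Confusing Syllable Identification • Input: corpus of Japanese learners of Chinese • Forced alignment to obtain an initial segmentation • Compare each segment against the 411 Mandarin syllables to find the confusing syllables of each sound • Add the confusing syllables to the recognition network • Redo forced alignment and segmentation • Has the segmentation stopped changing? • No: repeat • Yes: output the confusing syllables and the recognition network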

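Error Pattern Identification (EPI) • Common insertions/deletions from users, with 「朝辭白帝彩雲間」 as the reference sentence • Ending anywhere, e.g., 「朝辭白帝」 • Starting anywhere, e.g., 「彩雲間」 • Starting and ending anywhere, e.g., 「白帝彩雲」 • Starting and ending anywhere with skipped characters, e.g., 「白彩雲」 • Repeated character, e.g., 「朝…朝辭白帝彩雲間」 • Repeated word, e.g., 「朝辭…朝辭白帝彩雲間」 • Repeated character with a changed pronunciation, e.g., 「朝 (cao)…朝 (zhao) 辭白帝彩雲間」 • Two characters swapped, e.g., 「朝辭彩帝白雲間」 • Wrong characters, e.g., 「朝辭白帝黑山間」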

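Lexicon Net for EPI (I) • Detects utterances that start at the beginning and end anywhere

Lexicon Net for EPI (II) • Detects utterances that start anywhere and end at the end

Lexicon Net for EPI (III) • Detects utterances that start anywhere and end anywhere (no skipped characters)

Lexicon Net for EPI (IV) • Detects utterances that start anywhere and end anywhere, with skipped characters allowed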
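To make the difference between the four nets concrete, the sketch below enumerates the character strings each one would accept for the reference sentence 朝辭白帝彩雲間. It is only an illustration of the accepted patterns; it does not build a decoding network, and the function names are hypothetical.

```python
from itertools import combinations


def net1_prefixes(ref):
    """Net I: start at the beginning, end anywhere."""
    return {ref[:k] for k in range(1, len(ref) + 1)}


def net2_suffixes(ref):
    """Net II: start anywhere, end at the end."""
    return {ref[k:] for k in range(len(ref))}


def net3_substrings(ref):
    """Net III: start and end anywhere, no skipped characters."""
    n = len(ref)
    return {ref[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def net4_subsequences(ref):
    """Net IV: start and end anywhere, skipped characters allowed."""
    n = len(ref)
    return {
        "".join(ref[i] for i in idx)
        for size in range(1, n + 1)
        for idx in combinations(range(n), size)
    }


ref = "朝辭白帝彩雲間"
for name, accepted in [
    ("I", net1_prefixes(ref)),
    ("II", net2_suffixes(ref)),
    ("III", net3_substrings(ref)),
    ("IV", net4_subsequences(ref)),
]:
    print(name, len(accepted))
```

The number of accepted strings grows quickly from net I to net IV, which is one way to see why a more permissive net makes recognition harder.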

Design Philosophy of Lexicon Nets • We need to strike a balance between recognition and lexicon • In the extreme, we can have a net for free syllable decoding to catch all error patterns. • The feasibility of free syllable decoding is offset by its not-so-high recognition rate.

Scoring Methods for Speech Assessment • Five phone-based scoring methods • Duration-distribution scores (durDis) • Log-likelihood scores (hmmLike) • Log-posterior scores (hmmPost) • Log-likelihood-distribution scores (likeDis) • Rank ratio scores (rkRatio) • All based on forced alignment to segment phones

Method 1: Duration-distribution Scores • PDF of phone duration • Obtained from forced alignment • Normalized by speech rate • Fitted by log-normal PDF • Max PDF → score 100
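A minimal sketch of a duration-distribution score, assuming the log-normal parameters are estimated from the mean and standard deviation of log durations and that the score scales linearly with the PDF so that the PDF maximum maps to 100. The durations below are hypothetical, already rate-normalized values.

```python
import math
from statistics import fmean, pstdev


def fit_lognormal(durations):
    """Estimate (mu, sigma) of a log-normal from positive durations."""
    logs = [math.log(d) for d in durations]
    return fmean(logs), pstdev(logs)


def lognormal_pdf(x, mu, sigma):
    """Log-normal probability density at x > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def duration_score(x, mu, sigma):
    """Scale the PDF so that its maximum (at the mode) maps to 100."""
    mode = math.exp(mu - sigma ** 2)  # location of the log-normal PDF maximum
    return 100.0 * lognormal_pdf(x, mu, sigma) / lognormal_pdf(mode, mu, sigma)


# Hypothetical normalized durations of one phone in the training data.
training = [0.8, 0.9, 1.0, 1.1, 1.3]
mu, sigma = fit_lognormal(training)
for d in (0.5, 1.0, 2.0):
    print(d, round(duration_score(d, mu, sigma), 1))
```

Durations far from the typical value on either side receive low scores, which penalizes both clipped and overly stretched phones.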

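Method 2: Log-likelihood Scores • Log-likelihood of phone $q$ with a duration of $d$ frames: $\hat{l} = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log p(\mathbf{y}_t \mid q)$, where $p(\mathbf{y}_t \mid q)$ is the likelihood of frame $t$ with the observation vector $\mathbf{y}_t$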

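Method 3: Log-posterior Scores • Log-posterior of phone $q$ with a duration of $d$ frames: $\hat{\rho} = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log P(q \mid \mathbf{y}_t)$, where $P(q \mid \mathbf{y}_t) = \frac{p(\mathbf{y}_t \mid q)\, P(q)}{\sum_{j} p(\mathbf{y}_t \mid q_j)\, P(q_j)}$ and the sum in the denominator runs over all phone models $q_j$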
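The two HMM-based scores can be sketched as below. The sketch assumes the commonly used duration-normalized form (the average over a phone's frames of the frame log-likelihood or frame log-posterior); the per-frame likelihoods and the uniform priors are hypothetical values, not output of a real acoustic model.

```python
import math


def log_likelihood_score(frames, target):
    """Average frame log-likelihood of the target phone over its frames."""
    return sum(math.log(f[target]) for f in frames) / len(frames)


def log_posterior_score(frames, target, priors):
    """Average frame log-posterior of the target phone (Bayes' rule per frame)."""
    total = 0.0
    for f in frames:
        numerator = f[target] * priors[target]
        denominator = sum(f[q] * priors[q] for q in f)
        total += math.log(numerator / denominator)
    return total / len(frames)


# Hypothetical per-frame likelihoods p(y_t | q) for three phone models.
frames = [
    {"aa": 0.020, "ae": 0.010, "ah": 0.005},
    {"aa": 0.030, "ae": 0.010, "ah": 0.010},
    {"aa": 0.010, "ae": 0.020, "ah": 0.010},
]
priors = {"aa": 1 / 3, "ae": 1 / 3, "ah": 1 / 3}  # assumed uniform

print(log_likelihood_score(frames, "aa"))
print(log_posterior_score(frames, "aa", priors))
```

Because the posterior divides by the total over all phone models, it is less sensitive than the raw likelihood to speaker and channel effects that scale every model's likelihood together.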

Method 4: Log-likelihood-distribution Scores • Use CDF of Gaussian for log-likelihood • CDF = 1 → score = 100
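A minimal sketch of the log-likelihood-distribution score, assuming that each phone has a Gaussian fitted to its log-likelihood scores and that the score is simply 100 times the Gaussian CDF. The mean and standard deviation below are hypothetical per-phone statistics.

```python
import math


def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def like_dis_score(log_like, mean, std):
    """Map a phone log-likelihood to 0-100; CDF = 1 corresponds to 100."""
    return 100.0 * gaussian_cdf(log_like, mean, std)


# Hypothetical statistics of one phone's log-likelihood in the training data.
mean, std = -60.0, 5.0
for ll in (-70.0, -60.0, -50.0):
    print(ll, round(like_dis_score(ll, mean, std), 1))
```

The score reads as a percentile: a phone scoring 90 has a higher log-likelihood than about 90% of the training instances of that phone under the fitted Gaussian.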

Method 5: Rank Ratio Scores • Rank ratio • RR to score conversion where parameters a, b are phone specific. • Possible sets of competing phones for x+y • *+y • *+*
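A rough sketch of the ranking step only. The rank-ratio formula and the RR-to-score conversion are not preserved in this transcript, so the definition used here, (rank − 1) / (N − 1) among N competing models sorted by log-likelihood, is an assumption, as is reading x+y as a biphone with right context y. The biphone names and log-likelihoods are made up.

```python
def competitors(target, inventory, mode):
    """Select competing biphones for a target written as 'x+y'."""
    right = target.split("+")[1]
    if mode == "*+y":  # same right context as the target
        return [m for m in inventory if m.split("+")[1] == right]
    if mode == "*+*":  # every biphone in the inventory
        return list(inventory)
    raise ValueError("unknown mode: " + mode)


def rank_ratio(target, loglikes):
    """Assumed definition: 0 when the target ranks first, 1 when it ranks last."""
    ranked = sorted(loglikes, key=loglikes.get, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two competing phones")
    rank = ranked.index(target) + 1
    return (rank - 1) / (len(ranked) - 1)


# Hypothetical segment log-likelihoods under each biphone model.
all_loglikes = {"a+t": -50.0, "e+t": -48.0, "i+t": -55.0, "o+t": -60.0, "a+k": -47.0}
pool = competitors("a+t", all_loglikes, "*+y")
rr = rank_ratio("a+t", {m: all_loglikes[m] for m in pool})
print(pool, rr)
```

A smaller competitor set such as *+y makes the ranking cheaper to compute, while *+* compares the target against every model.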

Examples of Rank Ratio Scores

Demo of Our Prototype • ASR toolbox • http://mirlab.org/jang/matlab/toolbox/asr • Command: goDemoSa.m

Intro. to Learning to Rank • Learning to rank • A supervised learning algorithm which generates a ranking model based on a training set of partially ordered items. (A task somewhat between classification and regression.) • [Diagram: a rank function maps a set of items to a list ordered by preference.]

Learning to Rank: Methods and App. • Methods • Pointwise (e.g., Pranking) • Pairwise (e.g., RankSVM, RankBoost, RankNet) • Listwise (e.g., ListNet) • Applications • Webpage ranking • Machine translation • Protein structure prediction

Application of LTR to SA • Why use LTR for SA? • Human scoring is rank-based • Tsing Hua’s grading system is moving from scores (0~100) to ranks (A, B, C, D…). • Combination of features (scores) • Features are complementary. • Effective determination of ranking • LTR only generates numerical output whose order is as close as possible to the correct order, so an optimal DP-based approach is proposed to convert that output into ranks.

LTR Score Segmentation • Given: the LTR scores (sorted) and the desired rank of each item • Goal: find the separating scores that define a score-to-rank function (Rank 1, Rank 2, Rank 3, Rank 4, Rank 5) such that the total cost between computed and desired ranks is minimized

LTR Score Segmentation by DP (I) • Formulate the problem in a DP framework • Optimum-value function D(i,j): the minimum cost of mapping the first i sorted scores to ranks 1 through j • Recurrent equation, boundary condition, and optimum cost: [equations not preserved]

LTR Score Segmentation by DP (II) • [Figure: DP grid of computed rank (1 to 5) against desired rank, illustrating the local constraint and the recurrent formula.]

LTR Score Segmentation with DP (III) • [Figures: data distribution and the resulting DP path.]
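Since the slides' equations are not preserved, the following is one possible DP formulation rather than the authors' exact recurrence. It assumes the items are sorted by LTR score in ascending order, that computed ranks must be non-decreasing along that order (which is what separating scores imply), that a rank may be left empty, and that the cost is the sum of absolute differences between computed and desired ranks.

```python
def segment_ranks(desired, num_ranks):
    """Assign non-decreasing ranks to items sorted by score, minimizing
    the sum of |computed rank - desired rank|.

    cost[i][j] is the minimum cost of labelling items 0..i when item i
    receives rank j + 1.
    """
    n = len(desired)
    cost = [[0.0] * num_ranks for _ in range(n)]
    back = [[0] * num_ranks for _ in range(n)]
    for j in range(num_ranks):
        cost[0][j] = abs(desired[0] - (j + 1))
    for i in range(1, n):
        best_k, best = 0, cost[i - 1][0]
        for j in range(num_ranks):
            # Running minimum over previous ranks k <= j keeps ranks monotone.
            if cost[i - 1][j] < best:
                best_k, best = j, cost[i - 1][j]
            cost[i][j] = best + abs(desired[i] - (j + 1))
            back[i][j] = best_k
    # Pick the cheapest final rank, then trace the path backwards.
    j = min(range(num_ranks), key=lambda k: cost[n - 1][k])
    total = cost[n - 1][j]
    ranks = [0] * n
    for i in range(n - 1, -1, -1):
        ranks[i] = j + 1
        j = back[i][j]
    return ranks, total


# Hypothetical desired ranks of six items, listed in ascending score order.
desired = [1, 2, 1, 3, 2, 3]
ranks, total = segment_ranks(desired, num_ranks=3)
print(ranks, total)
```

A separating score can then be placed anywhere between two adjacent sorted scores whose computed ranks differ, for example at their midpoint.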

Flow Charts of Our Experiment

Corpora for Experiments • WSJ0 • 8000 training utterances from 84 speakers, used to train biphone acoustic models for forced alignment • MIR-SD • Recordings of about 4000 multi-syllable English words by 22 students (12 females and 10 males) with an intermediate competence level • Originally designed for stress detection • Available at http://mirlab.org/dataSet/public

Human Scoring of MIR-SD • Human scoring • Only 50 utterances from each speaker of MIR-SD are scored by 2 humans, for a total of 1100 utterances • Human scoring is consistent

Examples of MIR-SD • Level 5 • apparent, paragraphic, constellation • Level 3 • additive, timorous, availably • Level 1 • ambiguity, auxiliary, anachronism

Performance Indices • Performance indices used in the literature • hr = [1 3 5 4 2 2], cr = [2 3 5 2 1 4] • Recognition rate rRate = 33.33% • Recognition rate with tolerance 1 = 66.67% • Average absolute difference = 1 • Correlation coef = 0.54
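The four indices can be checked against the example vectors on this slide with a few lines of code. Reading hr and cr as human ranks and computed ranks is an assumption; the function names are hypothetical.

```python
import math


def recognition_rate(a, b, tolerance=0):
    """Percentage of positions where the two ranks differ by at most tolerance."""
    hits = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return 100.0 * hits / len(a)


def mean_abs_diff(a, b):
    """Average absolute difference between the two rank vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pearson(a, b):
    """Pearson correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


hr = [1, 3, 5, 4, 2, 2]
cr = [2, 3, 5, 2, 1, 4]

print(f"{recognition_rate(hr, cr):.2f}")     # 33.33
print(f"{recognition_rate(hr, cr, 1):.2f}")  # 66.67
print(f"{mean_abs_diff(hr, cr):.2f}")        # 1.00
print(f"{pearson(hr, cr):.2f}")              # 0.54
```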

Performance Evaluation of Different Scoring Methods

LTR Combination of Scores • Features for LTR • durDis and rkRatio: raw scores • hmmLike, hmmPost, likeDis: DP segmentation • LTR • RankSVM • Linear kernel • Baseline • hmmPost with DP-based segmentation

Overall Performance Comparison • Legends • Score segmentation • Circles: DP • Triangles: k-means • Inside/outside tests • Solid lines: Inside • Dashed lines: Outside • Black lines: Baselines

Summary of the Experiment • Segmentation • DP (supervised learning) is better than k-means (unsupervised learning) • Performance indices • Correlation coefficient is not intuitive (consider [4 5 4] and [1 2 1]) • Recog. rate and sum of abs. diff. can be optimized by LTR and DP segmentation
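The example [4 5 4] versus [1 2 1] can be worked out directly: the two vectors differ by a constant offset, so the correlation is perfect even though no rank matches.

```latex
\begin{align*}
x &= (4, 5, 4), \qquad y = (1, 2, 1), \qquad y_i = x_i - 3 \\
\bar{x} &= \tfrac{13}{3}, \qquad \bar{y} = \tfrac{4}{3}, \qquad
x_i - \bar{x} = y_i - \bar{y} \in \left\{ -\tfrac{1}{3},\ \tfrac{2}{3},\ -\tfrac{1}{3} \right\} \\
r &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}
   = \frac{2/3}{\sqrt{(2/3)(2/3)}} = 1 \\
\frac{1}{3} \sum_i \lvert x_i - y_i \rvert &= 3, \qquad
\text{recognition rate} = 0\%
\end{align*}
```

A correlation of 1 here coexists with an average error of three rank levels, which is why recognition rate and absolute difference are the more intuitive indices.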

Demo: Practice of Mandarin Idioms of Length 4 (一語中的) • Level (difficulty) of an idiom is based on its frequency via Google search: • 孤掌難鳴 ===> 260,000 • 鶼鰈情深 ===> 43,300 • 亡鈇意鄰 ===> 22,700 • 舉案齊眉 ===> 235,000 • Can be adapted for English learning • Next step: multi-threading, fast decoding via FSM

Demo: Recitation Machine（唸唸不忘） • Supports Mandarin & English • Supports user-defined recitation scripts • Next step: multithreading for recording & recognition

Licensing for PC Applications • For Mandarin, English, Japanese
