Thomas Fang Zheng Aug. 24, 2007 @ Cambridge University, UK Dialectal Chinese Speech Recognition
Outline • Motivation • Dialectal Chinese database collection • Wu • Min • Chuan • Approaches • Chinese syllable mapping • Lexicon adaptation • State-dependent phoneme-based model merging (SDPBMM) • Integration of SDPBMM with adaptation • Remarks
Motivation • Chinese ASR faces a dialect problem bigger than that of any other language. • In addition to Mandarin (Northern China), there are 8 major dialectal regions, including:- • Wu (Southern Jiangsu, Zhejiang, and Shanghai); • Yue (Guangdong, Hong Kong, Nanning Guangxi); • Min (Fujian, Shantou Guangdong, Haikou Hainan, Taipei Taiwan); • Hakka (Meixian Guangdong, Hsin-chu Taiwan); • Gan (Jiangxi); • Xiang (Hunan); • Hui (Anhui); • Jin (Shanxi, Hohhot Inner Mongolia). • These can be further divided into over 40 sub-categories.
Chinese dialects share the same written language:- • The same Chinese pinyin set (canonically), • The same Chinese character set (canonically), and • The same vocabulary (canonically). • Standard Chinese (known as Putonghua, or PTH) is widely spoken in most regions of China. • However, speech is strongly influenced by the native dialects: most Chinese people speak both standard Chinese and their own dialect, resulting in dialectal Chinese - Putonghua influenced by the native dialect. • In dialectal Chinese:- • Word usage, pronunciation, and syntax/grammar vary depending on the speaker's dialect. • ASR relies to a great extent on the consistent pronunciation and usage of words within a language. • ASR systems built to process PTH therefore perform poorly for the great majority of the population.
Research Goal • To develop a general framework for modeling, in dialectal Chinese ASR tasks:- • phonetic variability, • lexical variability, and • pronunciation variability. • To find suitable methods to modify the baseline PTH recognizer into a dialectal Chinese recognizer for the specific dialect of interest, employing:- • dialect-related knowledge (syllable mapping, cross-dialect synonyms, …), and • training/adaptation data (in relatively small quantities). • Expectation: the resulting recognizer should also work for PTH; in other words, it should handle a mixture of PTH and dialectal Chinese. • This proposal was selected as one of three projects for the 2003 Johns Hopkins University Summer Workshop from dozens of proposals collected from universities and companies worldwide, and was postponed to 2004 due to SARS.
Framework overview (diagram): Standard Chinese Speech Recognizer + Dialectal Chinese Related Knowledge & Resources → Dialectal Chinese Speech Recognition Framework → Dialectal Chinese Speech Recognizer
For practical reasons, during the summer we focused on only one specific dialect, the Wu dialect (Shanghai area), and the target language was Wu dialectal Chinese (WDC for short). • Why the Wu dialect? • Population: more than 70 million people use the Wu dialect, the 2nd most widely spoken dialect in China; • Economy: spoken in one of the most economically advanced cities in China – Shanghai; • The Wu dialect is a fully developed language: • its syntax is very complex; • its vocabulary is even larger than that of Mandarin; • many literary masterpieces were historically influenced by the Wu dialect.
Useful Dialect-Related Knowledge • Chinese Syllable Mapping (CSM) • This CSM is dialect-related. • Two types: • Word-independent CSM: e.g. in Southern Chinese, Initial mappings include zh→z, ch→c, sh→s, n→l, and so on, and Final mappings include eng→en, ing→in, and so on; • Word-dependent CSM: e.g. in dialectal Chuan Chinese, the pinyin 'guo2' is changed into 'gui0' in the word '中国 (China)', but only the tone is changed in the word '过去 (past)'.
The CSM is not exact. For any mapping A→B, the resulting pronunciation is usually not exactly B, but something quite similar to B - more similar to B than to any other syllable. • The CSM could be N→1, 1→N, or crossed. • Bi is a variation of B, such as:- • nasalization, centralization, voicing/devoicing, rounding, syllabicity, pharyngealization, aspiration.
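To make the mapping concrete, here is a minimal Python sketch (illustrative mapping table and hypothetical helper names, not the workshop's actual code) of how a word-independent CSM could generate dialectal surface-form variants of a canonical syllable:

```python
# Minimal sketch of word-independent Chinese syllable mapping (CSM).
# The mapping tables below are illustrative only; real tables are
# dialect-specific and come from expert knowledge or data.

INITIAL_MAP = {"zh": "z", "ch": "c", "sh": "s", "n": "l"}
FINAL_MAP = {"eng": "en", "ing": "in"}

def split_syllable(syllable: str) -> tuple[str, str]:
    """Split a toneless pinyin syllable into (Initial, Final).

    Very simplified: tries the longest matching initial first.
    """
    for init in ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                 "g", "k", "h", "j", "q", "x", "z", "c", "s", "r"):
        if syllable.startswith(init):
            return init, syllable[len(init):]
    return "", syllable          # zero-initial syllable

def dialectal_variants(syllable: str) -> set[str]:
    """Return the canonical syllable plus its CSM-mapped surface forms."""
    init, final = split_syllable(syllable)
    inits = {init, INITIAL_MAP.get(init, init)}
    finals = {final, FINAL_MAP.get(final, final)}
    return {i + f for i in inits for f in finals}

print(dialectal_variants("zheng"))   # {'zheng', 'zhen', 'zeng', 'zen'}
```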
Lexicon • Linguists estimate the vocabulary similarity rate between PTH and the Wu dialect at about 60~70%. • A dialect-related lexicon contains two parts:- • a common part shared by standard Chinese and most dialectal Chinese languages (over 50k words), and • a dialect-related part (several hundred words). • In this lexicon:- • each word has one pinyin string for its standard Chinese pronunciation and a representation of its dialectal Chinese pronunciation, and • each dialect-related word corresponds to a word in the common part with the same meaning.
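A minimal sketch of what such a lexicon entry might look like as a data structure; the field names and the dialect-pronunciation strings are placeholders, not the actual dictionary format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LexEntry:
    """One word in the dialect-related lexicon (illustrative fields only)."""
    hanzi: str                    # written form (Chinese characters)
    pth_pron: str                 # canonical PTH pinyin string
    dialect_pron: str             # representation of the Wu pronunciation (placeholder)
    pth_synonym: Optional[str] = None   # for dialect-related words: the common-part
                                        # PTH word with the same meaning

lexicon = [
    LexEntry("做饭", "zuo4 fan4", "<wu-pron>"),                       # common-part word
    LexEntry("烧饭", "shao1 fan4", "<wu-pron>", pth_synonym="做饭"),   # Wu dialect word
]
```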
Language • Though it is difficult to collect dialect texts, dialect-related lexical entry replacement rules can be learned in advance, and therefore • language post-processing or language model adaptation techniques can be adopted (see the sketch after the examples below).
Examples: • Dialectal word substitution: 我 做饭 给 你 吃 (PTH) → 我 烧饭 给 你 吃 (Wu) • Word-order change: 你 先 走 (PTH) → 你 走 先 (Wu)
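As a hedged illustration of how such replacement rules could support language model adaptation, this sketch rewrites PTH training text with the two rules from the examples above; the rule representation and function name are assumptions:

```python
# Minimal sketch: rewrite PTH LM training text with dialect-related
# replacement rules so the language model better matches WDC usage.
# The two rules below mirror the slide examples; real rule sets would
# be derived from the dialect lexicon and transcriptions.

SUBSTITUTION_RULES = {"做饭": "烧饭"}        # dialectal word substitution
REORDER_RULES = [("你 先 走", "你 走 先")]    # word-order change (as phrases)

def adapt_text(sentence: str) -> str:
    for pth_word, wu_word in SUBSTITUTION_RULES.items():
        sentence = sentence.replace(pth_word, wu_word)
    for pth_phrase, wu_phrase in REORDER_RULES:
        sentence = sentence.replace(pth_phrase, wu_phrase)
    return sentence

print(adapt_text("我 做饭 给 你 吃"))   # -> 我 烧饭 给 你 吃
```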
Data Creation for WDC (diagram): an e-dictionary database (PTH words and Wu dialect words, with Chinese characters, syllables, IFs/GIFs, PTH and Wu dialect pronunciations, PTH synonyms, and miscellaneous information), IF & syllable set definitions, and a speech database (read and spontaneous speech on predefined topics) with transcriptions. IF: a Chinese Initial or Final; GIF: generalized IF; PTH: Putonghua (standard Chinese); WDC: Wu Dialectal Chinese.
Wu Dialectal Chinese (WDC) Database Collection (1) • Collection: • 11 hours in total - half read (R) + half spontaneous (S): • 100 Shanghai speakers × (3R + 3S) minutes per speaker; • 10 Beijing speakers × 6S minutes per speaker. • Read speech with well-balanced prompting sentences: • Type I: each sentence contains PTH words only (5-6k); • Type II: each sentence contains one or two of the most commonly used Wu dialectal words, while the others are PTH words. • Spontaneous speech with pre-defined talking topics: • conversations with a PTH speaker on a self-selected topic from: sports, politics/economy, entertainment, lifestyles, technology. • Speakers balanced by gender, age, education, PTH level, …
WDC data diversity: goal vs. actual (table).
Accent assessment by experts: • 1A. CCTV-level radio broadcaster; • 1B. Province-level radio broadcaster; • 2A. Quite good; • 2B. Less accented; • 3A. More accented; • 3B. Hard to understand, but recognizable as PTH.
Wu Dialectal Chinese (WDC) Database Collection (2) • Transcriptions include:- • For 100 Wu Dialectal Chinese speakers:- • Canonical Chinese Initial/Final labels, and • Generalized IF (GIF) labels. • For 10 Beijing speakers:- • Chinese character and pinyin transcriptions only
Dialectal Lexicon Construction • Establish a 50k-word electronic dialect dictionary, with each word having:- • its PTH pronunciation as a PTH IF string, and • its Wu dialect pronunciation as a Wu IF string. • Purpose: summarizing dialect-related knowledge. • Figure out Chinese syllable mappings:- • same written form (character), different pronunciations; • both word-independent and word-dependent. • Find dialect-related word variations:- • same meanings in the Chinese language; • different written forms (characters); • uttered in the standard Chinese manner; • used for LM adaptation/modification.
Post-workshop Database Collection -- Min and Chuan * With the aid of the Chinese Academy of Social Sciences (CASS)
Accent distribution for Min/Chuan-dialectal Chinese corpora
Workshop Experiments • Experiment Conditions: • Using HTK 3.2.1; • Data Set Division: • Using spontaneous speech data only • Data were split according to age (younger, older), education (higher, lower), and PTH level into • Training Set: 80 speakers • devTest Set: 20 speakers (a part of devTrain) • Test Set: 20 speakers • Acoustic model: • Trained from Mandarin Broadcast News (MBN); • 39-dimensional MFCC_E_D_A_Z; • diagonal covariance matrices; • 4 states per unit; • 103,041 units (triIF), 10,641 real units (triIF); • 3,063 different states (after state tying); • 16 mixtures per state, 28 mixtures per state for the silence unit; • Language model: • Built on HKUST 100-hour CTS data, plus Hub5, plus the Wu-dialectal training data transcriptions
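For illustration, a minimal sketch of the kind of stratified speaker split described above (80 training / 20 test speakers, with a 20-speaker devTest set drawn from the training, i.e. devTrain, portion); the attribute names and the exact partitioning scheme are assumptions:

```python
import random
from collections import defaultdict

def split_speakers(speakers, seed=0):
    """speakers: list of dicts with 'id', 'age', 'education', 'pth_level'.

    Balances age, education, and PTH level across sets by splitting
    80/20 within each attribute combination (stratum).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for spk in speakers:
        strata[(spk["age"], spk["education"], spk["pth_level"])].append(spk["id"])
    train, test = [], []
    for ids in strata.values():
        rng.shuffle(ids)
        cut = int(round(len(ids) * 0.8))      # 80/20 within each stratum
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    dev_test = rng.sample(train, k=min(20, len(train)))  # devTest drawn from devTrain
    return train, dev_test, test
```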
Observation on WDC Data • IF-mapping / syllable-mapping: • Influenced by the Wu dialect, a Wu dialectal Chinese (WDC) speaker often pronounces certain IFs as other IFs, and there are rules to follow, such as zh -> z, ch -> c, sh -> s, and so on. • Observations on the three sets - Train (80 speakers), devTest (20), and Test (20): • Mapping pairs are almost the same across all three sets; • Mapping pairs are almost identical to experts' knowledge; • Mapping probabilities are also almost equal. • Remarks: • Experts' knowledge can be useful; • Mapping rules can be learned from a small amount of data.
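A minimal sketch of how such mapping pairs and probabilities could be estimated from aligned canonical-IF / surface-GIF label pairs in the transcriptions; the input format and the pruning threshold are assumptions:

```python
from collections import Counter, defaultdict

# Minimal sketch: estimate IF-mapping pairs and probabilities by counting
# aligned (canonical IF, surface GIF) label pairs from the transcriptions.

def learn_if_mappings(aligned_pairs, min_prob=0.05):
    """aligned_pairs: iterable of (canonical_if, surface_if) tuples."""
    counts = defaultdict(Counter)
    for canon, surface in aligned_pairs:
        counts[canon][surface] += 1
    rules = {}
    for canon, surf_counts in counts.items():
        total = sum(surf_counts.values())
        rules[canon] = {s: c / total for s, c in surf_counts.items()
                        if s != canon and c / total >= min_prob}
    return rules

pairs = [("zh", "z"), ("zh", "zh"), ("zh", "z"), ("sh", "s"), ("sh", "sh")]
print(learn_if_mappings(pairs))   # roughly {'zh': {'z': 0.67}, 'sh': {'s': 0.5}}
```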
Using only the devTest set + dialect-based knowledge • Step 1: Apply PTH-IF mapping rules; • Step 2: Apply WDC-IF mapping rules; • Step 3: Apply syllable-dependent mapping rules; • Step 4: Perform multi-pronunciation expansion (MPE) based on unigram probability.
Why try this method? • "IF-mapping" in dialectal Chinese is a fact (humans use it); • "In-domain data training" would surely give good results, but collecting data is a huge task, especially for the 40+ sub-dialects of Chinese; • "Mere adaptation" would be easier and better, but might make it hard to distinguish the mapping pairs - each pair tends to collapse into a single IF; • this is not practical in applications such as call centers, where no further information about the speakers is available and a mixture of WDC and PTH is used; • It is expected that a knowledge-based method would give overall good performance for both WDC and PTH.
Step 1: Applying PTH-IF mapping rules • Rules are based on experts' knowledge (with AM unchanged) • (zh, z) (z, zh) • (ch, c) (c, ch) • (sh, s) (s, sh) • (eng, en) (en, eng) • (ing, in) (in, ing) • (r, l) • Gain not so significant: 0.5% Chinese Character Error Rate (CER) reduction • Pronunciation entry probability does not help improve performance
Step 2: Applying WDC-IF mapping rules • There are indeed some Wu-dialect-specific IFs, such as iao -> io^; • Rules learned from devTest; • Newly introduced WDC-specific IFs trained from devTest using the adaptation method; • 8.66% absolute CER reduction; • MLLR adaptation outperforms MLLR+MAP • by about 10%, • possibly due to the limited data. • We refer to this as surface-form (WDC) MLLR adaptation; for comparison, base-form (PTH) MLLR adaptation, where only canonical IFs are used, is also evaluated.
Step 3: Applying syllable-dependent mapping rules • Assumption: most IF-mappings are context-independent, but some are syllable-dependent (such as iii | (sh iii) -> ii | (s ii)), and we believe there are others; • Rules learned from devTest; • We did not succeed in improving accuracy - on the contrary, character accuracy decreased by about 6%; • We do not yet have a clear explanation; • So we keep using context-free mapping rules.
Step 4: Multi-pronunciation expansion (MPE) based on unigram probability • Motivation: more pronunciations help model pronunciation variation, but lead to more confusion - there should be a tradeoff; • Accumulated unigram probability (AccProb) is used as the criterion: • only words with higher unigram probabilities get multiple pronunciations; • words with lower unigram probabilities keep a single standard pronunciation.
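A minimal sketch of AccProb-based MPE under these assumptions (the data structures and function name are hypothetical): words are ranked by unigram probability, and only those within the top AccProb mass receive their dialectal pronunciation variants:

```python
# Minimal sketch of AccProb-based multi-pronunciation expansion (MPE).
# Words are ranked by unigram probability; only the most frequent words,
# whose cumulative probability mass stays within `acc_prob`, get their
# dialectal pronunciation variants added to the lexicon.

def expand_lexicon(unigram, pronunciations, variants, acc_prob=0.94):
    """unigram: {word: prob}; pronunciations: {word: canonical_pron};
    variants: {word: [extra dialectal prons]}."""
    expanded = {w: [p] for w, p in pronunciations.items()}
    cumulative = 0.0
    for word, prob in sorted(unigram.items(), key=lambda x: -x[1]):
        if cumulative > acc_prob:
            break                    # lower-frequency words stay canonical
        cumulative += prob
        expanded[word].extend(variants.get(word, []))
    return expanded
```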
Base-form MLLR + PTH-IF mapping + MPE (CER): best result achieved at a suitable AccProb value, around 94%, with VocSizeRatio = 1.10. (AccProb: 0% means no multi-pronunciation expansion; 100% means full expansion.)
Surface-form MLLR + WDC-IF mapping + MPE (CER): best result achieved at a suitable AccProb value, around 94%, with VocSizeRatio = 1.24.
Comparison of base-form MLLR + PTH-IF mapping + MPE and surface-form MLLR + WDC-IF mapping + MPE (CER): both achieve their best result at AccProb ≈ 94% (VocSizeRatio = 1.24 for the surface-form system).
Performance improvement comparison: overall, and in terms of speaker clusters
Q: How about recognizing PTH with the resulting WDC recognizer? • We obtained the WDC recognizer from the PTH recognizer; • we get a CER reduction of over 10% on average when recognizing WDC; • how about using it to recognize PTH?
(Diagram) Conventional adaptation tends to merge a mapping pair such as sh/s toward a single model, whereas our MPE + rule method keeps separate sh and s pronunciations.
We can expect that when the WDC recognizer is used to recognize PTH, performance will degrade; • but we would expect it not to decrease too much. • Results: using the WDC recognizer, we get • over 10% CER reduction when recognizing WDC; • a 0.62% CER increase when recognizing PTH.
Conclusions: • The use of dialect-related knowledge is useful and effective. • In this project, there are several problems to solve: channel, speaking-style, dialect-background, and domain problems. • It is easier to address all of these simply by using the adaptation method; • our method focuses only on the dialect problem; • the results of our method could be better if we integrated methods addressing channel and speaking style.
State-Dependent Phoneme-Based Model Merging (SDPBMM) • At the acoustic level, existing approaches include: • retraining the AM on standard speech plus a certain amount of dialectal speech; • interpolation between standard-speech-based HMMs and their corresponding dialectal-speech-based HMMs; • combination of the AM with state-level pronunciation modeling; • adaptation of the standard-speech-based AM with a certain amount of dialectal speech. • Existing problems: • a large amount of dialectal speech is needed to build dialect-specific acoustic models; • the acoustic model cannot perform well on both standard speech and dialectal speech; • some acoustic modeling methods are too complicated to be deployed readily.
What we proposed: • Take a precise context-dependent HMM trained on standard speech and its corresponding, less precise, context-independent HMM trained on dialectal speech into consideration simultaneously; • merge the HMMs on a state-level basis according to certain criteria.
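A minimal sketch in the spirit of state-level merging (not the exact SDPBMM criterion): the Gaussian mixture of a context-dependent PTH state is pooled with that of the corresponding context-independent dialectal state, with re-scaled component weights; the interpolation weight and the per-state merging decision are assumptions for illustration:

```python
import numpy as np

# Minimal sketch of state-level model merging: the output distribution of a
# precise context-dependent PTH state is combined with that of the
# corresponding context-independent dialectal state by pooling their
# Gaussian components with re-scaled mixture weights.

def merge_states(pth_state, dialect_state, lam=0.7):
    """Each state is a list of (weight, mean, var) Gaussian components;
    `lam` is an assumed interpolation weight toward the PTH model."""
    merged = [(lam * w, m, v) for (w, m, v) in pth_state]
    merged += [((1.0 - lam) * w, m, v) for (w, m, v) in dialect_state]
    return merged

pth = [(1.0, np.zeros(39), np.ones(39))]         # one-component PTH state
wu = [(1.0, 0.5 * np.ones(39), np.ones(39))]     # one-component dialectal state
print(len(merge_states(pth, wu)))                # 2 components in the merged state
```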