Cutting-Edge Speech Recognition Technology and Emerging Mandarin Engine Research

New Speech Recognition Technology and Research on High-Performance Mandarin Recognition Engines. Liu Jia, Department of Electronic Engineering, Tsinghua University. Email: liuj@tsinghua.edu.cn

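Foundations and Applications of Speech Recognition Technology • Application background and disciplinary foundations of speech recognition • Practical applications: dictation machines, products for people with disabilities, information query systems, telephone dialing, consumer electronics, … • Disciplinary foundations: acoustics, mathematical statistics, information theory, pattern recognition, phonetics and linguistics, artificial intelligence, signal processing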

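Review of Speech Recognition Technology • Important events in the history of speech recognition technology (timeline, in slide order): birth of the first speech recognition machine • 1950 • application of dynamic programming to speech recognition • 1960 • acoustic theory of speech production • 1970 • emergence of the DTW algorithm; application of LPC to speech recognition • 1980 • application of HMMs and related techniques to speech recognition • 1990 • speaker-independent large-vocabulary continuous speech recognition algorithms gradually mature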

Basic Modeling Framework • Optimal word sequence: • Acoustic modeling: where Q is a base phone sequence, weighted by the probability that the word sequence is pronounced as that base phone sequence.
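
The formulas on this slide did not survive in the transcript. As a hedged reconstruction, the standard statistical formulation that matches the slide's wording is given below. The symbols W (word sequence) and Y (acoustic observation sequence) are assumptions introduced here; only Q appears in the original text.

```latex
% Decision rule: pick the word sequence with the highest posterior probability.
\hat{W} = \arg\max_{W} P(W \mid Y) = \arg\max_{W} \; p(Y \mid W)\, P(W)

% Acoustic model: sum over the base phone sequences Q that can realise W,
% where P(Q \mid W) is the probability that W is pronounced as Q.
p(Y \mid W) = \sum_{Q} p(Y \mid Q)\, P(Q \mid W)
```

Here P(W) is supplied by the language model and p(Y | Q) by the HMM acoustic model.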

Basic Modeling Framework • Each base phone sequence Q is represented by a composite HMM formed by concatenating the models of all its constituent base phones; the acoustic likelihood is given by a sum over the state sequences through the composite model. • Model parameters can be efficiently estimated from a corpus of training utterances using the EM algorithm.
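
The sum over state sequences can be evaluated without enumerating every path by using the forward recursion. The following is a minimal sketch, assuming a discrete-observation HMM with made-up toy parameters; the function name `forward_likelihood` and all numbers are illustrative and not taken from the slides.

```python
def forward_likelihood(init, trans, emit, observations):
    """Return p(Y | model), summed over all state sequences (forward algorithm)."""
    n_states = len(init)
    # alpha[j] = p(y_1 .. y_t, state at time t = j)
    alpha = [init[j] * emit[j][observations[0]] for j in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][obs]
            for j in range(n_states)
        ]
    return sum(alpha)


# Toy 2-state left-to-right model with 2 observation symbols (hypothetical values).
init = [1.0, 0.0]
trans = [[0.6, 0.4], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward_likelihood(init, trans, emit, [0, 1])
```

A real recognizer would use continuous densities and log-domain arithmetic; the recursion itself is unchanged.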

Basic Modeling Framework • Language Modeling: N-gram • Decoding • Compile a network of all vocabulary words in parallel with a loop. Each word is represented by a sub-network of HMM phone models. • Exploit sharing and pruning to limit the number of active hypotheses. • Multi-pass decoding with a lattice of word sequence hypotheses. • Different models are used in different passes.
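
To make the N-gram bullet concrete, here is a small sketch of a bigram model (N = 2). Add-one smoothing is an assumption chosen for brevity; the slides do not say which smoothing the system uses, and the function names are hypothetical.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over sentences given as lists of words."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens[:-1])                  # counts of each history word
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # counts of (history, word)
    vocab = {w for words in sentences for w in words} | {"</s>"}
    return unigrams, bigrams, vocab


def bigram_prob(word, history, unigrams, bigrams, vocab):
    """P(word | history) with add-one smoothing (an illustrative choice)."""
    return (bigrams[(history, word)] + 1) / (unigrams[history] + len(vocab))


unigrams, bigrams, vocab = train_bigram([["call", "home"], ["call", "office"]])
p_home = bigram_prob("home", "call", unigrams, bigrams, vocab)
```

The smoothed probabilities for a fixed history sum to one over the vocabulary, which is what lets the decoder combine them with acoustic scores.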

[Figure: empirical duration distribution compared with Gauss, Gamma, uniform and exponential distributions] Basic Modeling Framework • Duration-Based HMM or Segmental HMM • Probability of spectrum and durations can be expressed as: • Context-dependent duration distributions are used. • Speech rate normalization is adopted by

Basic Modeling Framework • Beam Search Algorithms • Path pruning: and • Dynamic beam pruning method using • Beam width fixed to a constant, keeping the N-best paths • Path merging: • When two or more paths meet, it may be possible to merge them. • Strategy: merges may occur when the respective history equivalence classifications match.
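
The pruning and merging ideas can be sketched as follows. This is only an illustration under assumptions: scores are log probabilities, the pruning criterion (score within a fixed beam of the best, then an N-best cap) is one common choice rather than the slide's missing formula, and the function names are hypothetical.

```python
def prune(hypotheses, beam_width, max_active):
    """Keep paths within beam_width of the best score, then at most max_active.

    hypotheses: dict mapping a path id to its log score.
    """
    best = max(hypotheses.values())
    survivors = {h: s for h, s in hypotheses.items() if s >= best - beam_width}
    ranked = sorted(survivors.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max_active])


def merge_paths(paths, history_key):
    """Merge paths whose history equivalence class matches, keeping the best one.

    paths: list of (path, log score); history_key maps a path to its class.
    """
    merged = {}
    for path, score in paths:
        key = history_key(path)
        if key not in merged or score > merged[key][1]:
            merged[key] = (path, score)
    return list(merged.values())


active = prune({"a": -10.0, "b": -12.0, "c": -30.0, "d": -11.0},
               beam_width=5.0, max_active=2)
# With a bigram language model, two paths are equivalent if they end in the same word.
merged = merge_paths([(("a", "b"), -5.0), (("c", "b"), -3.0), (("a", "d"), -4.0)],
                     history_key=lambda path: path[-1])
```

With a longer-span language model, `history_key` would return more of the path, so fewer merges are possible.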

The State of the Art • Phonological Modeling • Co-articulation • Tri-phone and bi-phone • Decision tree clustering for shared physical models and shared states, using phonetic knowledge • Soft-tying: group each state with its one or two nearest neighbors and pool all of their Gaussians. • Acoustical Model where L is the number of logical phones; M is the number of shared physical models.

The State of the Art • Gaussian mixtures for compensating intra-speaker and inter-speaker variability. • Discriminative Training • Minimum Classification Error Rate (MCE) • Minimum Discrimination Information (MDI) • Maximum Mutual Information (MMI)

The State of the Art • Linear discriminant analysis (LDA), confusion data analysis (CDA) or heteroscedastic discriminant analysis (HDA) for improving the discrimination of the models. • State-specific confusion classes are collected. • A state-specific linear discriminant transformation is used. • The discrimination of the acoustic models is improved by using discriminant analysis.

The State of the Art • Robust Speech Recognition • Microphone array • Normalization • Channel normalization by cepstral mean subtraction • RASTA (Relative Spectra) • Vocal tract length normalization for speaker variations • PMC (Parallel Model Combination) • Speech enhancement technology • Speech enhancement + PMC • At present there are no truly effective methods
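
Of the techniques listed, cepstral mean subtraction is the simplest to show. A minimal sketch, assuming per-utterance normalization over a list of cepstral vectors (the function name is hypothetical):

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-utterance mean from every cepstral vector.

    A fixed channel adds a constant offset in the cepstral domain,
    so removing the mean removes that offset.
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


clean = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 6.0]])
# The same utterance through a different (constant-offset) channel.
shifted = cepstral_mean_subtraction([[1.5, 1.0], [3.5, 5.0]])
```

Both calls give the same normalized features, which is the point of the method.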

The State of the Art • Speaker Adaptation • Adaptation seeks to modify the models to make them a better fit to the speech. • Two main approaches are used • Model parameters can be treated as random variables and estimated using traditional Bayesian MAP techniques. • Maximum likelihood linear regression (MLLR). MLLR seeks to find an affine transform for Gaussian means which maximizes the likelihood of the adaptation data. • Fast adaptation • Eigenvoice (uses a set of speaker-dependent models to adapt quickly to a new speaker's model) • Extended Maximum a Posteriori, EMAP (mainly considers the spatial correlation between models)
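
The two main approaches differ in what they update. The sketch below shows, under simplifying assumptions, a MAP update of one Gaussian mean (all adaptation frames hard-assigned to that Gaussian, prior weight `tau`) and the application of an already-estimated MLLR affine transform to a mean. Estimating the MLLR transform itself is not shown; the function names and numbers are illustrative.

```python
def map_mean(prior_mean, frames, tau):
    """MAP estimate of a Gaussian mean: interpolate prior mean and data.

    tau controls how many frames of evidence the prior is worth.
    """
    n = len(frames)
    dim = len(prior_mean)
    sums = [sum(f[d] for f in frames) for d in range(dim)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


def mllr_mean(mean, A, b):
    """Apply an MLLR affine transform: new_mean = A * mean + b."""
    return [
        sum(A[r][c] * mean[c] for c in range(len(mean))) + b[r]
        for r in range(len(b))
    ]


adapted_map = map_mean([0.0, 0.0], [[2.0, 4.0], [4.0, 8.0]], tau=2.0)
adapted_mllr = mllr_mean([1.0, 2.0], [[1.0, 0.0], [0.0, 2.0]], [0.5, -1.0])
```

MAP only moves Gaussians that were observed in the adaptation data, while one MLLR transform can be shared by many Gaussians, which is why MLLR works with less data.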

The State of the Art • Confidence Measure (CM) • Information Source • Likelihood • Duration • Likelihood Ratio: • Confidence Measure (CM) Effects • Posterior probability: • The a posteriori probability can be approximated by:
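
The slide's formulas are missing from the transcript, so the following is only a sketch of two widely used confidence scores, not necessarily the ones the author used: a log likelihood ratio against an alternative model, and a posterior approximated by normalizing over a list of competing hypotheses. Function names and inputs are assumptions.

```python
import math


def log_likelihood_ratio(log_target, log_alternative):
    """Log likelihood ratio between the recognized model and an alternative."""
    return log_target - log_alternative


def nbest_posteriors(log_scores):
    """Approximate posteriors by normalizing over competing hypotheses.

    log_scores: combined log scores of the N-best hypotheses.
    Uses the log-sum-exp trick to avoid underflow.
    """
    m = max(log_scores)
    total = m + math.log(sum(math.exp(s - m) for s in log_scores))
    return [math.exp(s - total) for s in log_scores]


scores = [math.log(0.6) - 100.0, math.log(0.3) - 100.0, math.log(0.1) - 100.0]
posteriors = nbest_posteriors(scores)
```

The approximation is only as good as the hypothesis list: probability mass of hypotheses that were pruned away is ignored.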

[Figure: multi-layer perceptron with an input layer (input pattern p), a hidden layer (j) and an output layer (k) that outputs the a posteriori probability] The State of the Art • Integrated CM and Hierarchical Averaging • frame～half-syllable～word～keyword • The a posteriori probability can be estimated by multi-layer perceptrons (MLP) using HMM traces as input.

[Figure: model space showing the model of [in] and its competition models] The State of the Art • Rejection Models • Online Garbage Models • Filler Models, Competition Models, Anti-Models

The State of the Art • Confusion networks and multi-model combination (or integration) • Convert the decoder lattice output into a confusion network. • The lattice arcs can be clustered to form a linear graph in which each group of parallel arcs, with their probabilities, forms a confusion set. • Confusion graphs can be used to compute confidence scores. • Posterior probabilities can be computed using the forward-backward algorithm based on the confusion graphs. • More accurate acoustic models and long-distance language models can be used.

[Figure: four system combination diagrams built from feature streams (Feature1 to Feature3), "AM and LM" search blocks and output stages, one for each approach listed on this slide] The State of the Art • Multiple front-end approach • Multiple features concatenated approach • Post-processing approach by voting • Mixed interaction searching approach
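
The post-processing approach by voting can be illustrated with a simple majority vote. This sketch assumes the outputs of the individual recognizers have already been aligned to the same length, which is the hard part in practice and is not shown; the function name is hypothetical.

```python
from collections import Counter


def vote(aligned_outputs):
    """Pick the most frequent word at each position across recognizers.

    aligned_outputs: list of word lists, one per front end, all the same length.
    Ties go to the word seen first (Counter keeps insertion order).
    """
    result = []
    for candidates in zip(*aligned_outputs):
        word, _count = Counter(candidates).most_common(1)[0]
        result.append(word)
    return result


combined = vote([
    ["6", "1", "8"],   # output using Feature1
    ["9", "1", "8"],   # output using Feature2
    ["6", "7", "8"],   # output using Feature3
])
```

Weighting each vote by a confidence score instead of counting is a natural refinement.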

The State of the Art • Language Model Improvement • Capture local syntactic and semantic dependencies for many languages • Under-training problem • Interpolate the word n-gram with a class-based language model • Combine the regular language models with the statistical language models • More statistical language structures are needed in dialog systems.
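
Interpolating a word n-gram with a class-based model can be sketched as a linear mixture. The bigram order, the fixed weight `lam` and the toy probability tables are assumptions for illustration; in practice the weight is tuned on held-out data.

```python
def interpolated_prob(word, history, lam, word_bigram, class_bigram,
                      word_given_class, word_class):
    """P(w | h) = lam * P_word(w | h) + (1 - lam) * P(c(w) | c(h)) * P(w | c(w))."""
    p_word = word_bigram.get((history, word), 0.0)
    c_hist, c_word = word_class[history], word_class[word]
    p_class = class_bigram.get((c_hist, c_word), 0.0) * word_given_class.get(word, 0.0)
    return lam * p_word + (1.0 - lam) * p_class


# Hypothetical toy tables: "chengdu" was never seen after "to" in training.
word_class = {"to": "FUNC", "beijing": "CITY", "chengdu": "CITY"}
word_bigram = {("to", "beijing"): 0.5}
class_bigram = {("FUNC", "CITY"): 0.8}
word_given_class = {"beijing": 0.5, "chengdu": 0.5}

p_seen = interpolated_prob("beijing", "to", 0.5, word_bigram, class_bigram,
                           word_given_class, word_class)
p_unseen = interpolated_prob("chengdu", "to", 0.5, word_bigram, class_bigram,
                             word_given_class, word_class)
```

The unseen word pair still receives probability through its class, which is how the class-based component eases the under-training problem.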

The State of the Art • Long range dependencies • Trigger model • Maximum Entropy (ME) Framework • Approaches to exploiting syntactic and semantic models by using probabilistic parsers to uncover head words which can then be used as predictors. • Using Maximum Entropy (ME) training • Combining statistical approaches with computational linguistics

Mandarin Recognition • Mandarin pronunciations are in fact monosyllabic. Some pronunciations are very short and easily confused. • At present the best recognition rate of acoustic models over all Mandarin syllables is lower than 80%. • I think that 85% all-syllable recognition accuracy is a threshold. If this threshold is reached, the recognition rate of a dictation machine can approach 100%.

Confusion Sets in Mandarin • Confusion sets: in Initials: “[q] set”, “[m]”, …, in Finals: “[ing] set”, “[iong] set”, …. Table 1. Recognition error rate for some Initials Table 2. Recognition error rate for some Finals • Co-articulation effects make some confusable Mandarin words very difficult to recognize. Table 3. Recognition error rates for some confusable digits

Speech Feature Parameters • LPCC – Linear predictive coding cepstrum • MFCC – Mel frequency cepstral coefficients • PLPCC – Perceptual LPCC • Normalized energy and zero-crossing rate • Duration information with speech rate normalization • Prosodic information with fundamental frequency and stress • Harmonic spectral structure (HSS) • Mixed feature parameters • New features are being researched and developed

[Figure: a basic model and Models 1 to 3, combined as parallel acoustic models with different tied states into an integrated acoustic model] Parallel Model Structures • Conventional context-dependent sub-word HMMs and beads-on-a-string concatenated word models fail to recognize highly confusable Mandarin syllables. • Integrated HMMs with parallel structures and tied states, designed according to the characteristics of Mandarin pronunciation, are used.

[Figure: block diagram of the speech recognition system. Blocks: speech input; feature extraction (LPCC, MFCC or PLPCC); sub-HMM with multi-models; robust algorithm; all-syllable and language models; output results. Integrated multi-models: tone and non-tone models, prosodic model, fuzzy-pronunciation model, speech rate model, VAD model, channel-correction model, duration model, rejection and command understanding model] Integrated Multi-Models • The integrated multi-models are used in order to improve the discrimination and robustness of the acoustic models

[Figure: speech recognition networks guided by grammar, from input to output result, with sub-networks for sender (S1, S2, … SM), receiver (R1, R2, … RN) and ID number (ID1, ID2, … IDK)] Post Parcel Checking System • Post parcel checking task based on speech recognition • Speaker-independent continuous speech recognition system for Mandarin and Sichuan dialect, using multi-sub-tree networks • Vocabulary size: about 4500 city, town, or post office names, 1021 number strings

Post Parcel Checking System Table 4. Recognition accuracy comparisons with different combinations of models and testing data. Here ‘MM’ denotes the Mandarin models; ‘MS’ denotes the Sichuan dialect models; ‘MX’ denotes the mixed models; ‘TM’ denotes the Mandarin testing data; ‘TS’ denotes the Sichuan dialect testing data.

[Figure: speech recognition networks. From Start, optionally through 请接 ("please connect me to"), into parallel rule branches Rule1, Rule2, … RuleKrule, each combining names (Name1 … NameKname), titles (Title1 … TitleKtitle) and sites (Site1 … SiteKsite)] Telephone Speech Recognition • Telephone speech recognition system based on a private automatic branch exchange (PABX) • 214 different command sentence forms • 200 person names, 50 site names (unit names) and 50 telephone numbers

Telephone Speech Recognition Table 5. Recognition accuracy comparisons with the different combinations of models and testing data

[Figure: speech in, recognizer, stream of recognized digits (6 8 ··· 0 7); the HMM traces feed a rejection model that outputs accept or reject] Isolated Digit Recognition • Mandarin isolated digit recognition • Easily confused syllables, such as “6/liu/ 9/jiu/ 0/ling/”, “1/yi/ 7/qi/”, “2/er/ 8/ba/” • Using CDA, digit recognition accuracy can be improved from 97.1% to 99.3% • Alternatively, by rejecting 4.9% of utterances, the MLP rejection model can boost digit recognition accuracy from 97.1% to 99.6%

[Figure: two plots of accuracy after rejection (%) against rejecting rate (%), 0 to 20%. Legend entries: MLP-24, MLP-12, LD, LR, AD; no differentials, all trace, relative duration] Isolated Digit Recognition • The RR-AR curves: the curve with “+” is for multi-layer perceptrons with 24 hidden neurons (MLP-24), the curve with “” for MLP-12, the curve with “▲” for linear discrimination (LD), the curve with “×” for LR, and the curve with “*” for the anti-digit (AD) model.
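
Each point on an RR-AR curve comes from one rejection threshold. A minimal sketch of that calculation, assuming each test utterance has a confidence score from the rejection model and a correct/incorrect flag (the function name and toy data are hypothetical):

```python
def rejection_point(results, threshold):
    """Return (rejecting rate, accuracy after rejection) for one threshold.

    results: list of (confidence, is_correct) pairs, one per utterance.
    Utterances with confidence below the threshold are rejected.
    """
    accepted = [ok for conf, ok in results if conf >= threshold]
    rejecting_rate = 1.0 - len(accepted) / len(results)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return rejecting_rate, accuracy


results = [(0.9, True), (0.8, True), (0.4, False), (0.7, True), (0.3, True)]
no_rejection = rejection_point(results, threshold=0.0)
with_rejection = rejection_point(results, threshold=0.5)
```

Sweeping the threshold from low to high traces the whole curve; a good rejection model raises accuracy quickly while rejecting few correct utterances.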

[Figure: digit string recognition network. Labels: Digit Model 0 … Digit Model 9 (7 states), virtual nodes, silence models, pause model, 1 state] Digit String Recognition • Continuous Mandarin Digit String Recognition • MCE Training • CDA for Linear Discriminant Transformation • Duration Model • Tone Discrimination Models:

Digit String Recognition • Duration information extraction • Table 6. Recognition Rates for Mandarin digit strings

[Figure: block diagram of LPCC/MFCC/PLP extraction. Blocks: speech, pre-emphasis, frame blocking, windowing, autocorrelation analysis, LPC analysis, LPC conversion, LPCC; FFT, Mel-frequency filter banks, log, DCT, MFCC; perceptual weighting, IDFT, LPC analysis, LPC conversion, PLP] All Syllable Recognition • 408 Mandarin Syllable Recognition • Spectral Feature extraction
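
The first three blocks of the diagram are shared by all three feature types. A minimal sketch of them follows; the pre-emphasis coefficient 0.97 and the Hamming window are common choices assumed here, not values stated on the slide, and the function names are illustrative.

```python
import math


def pre_emphasis(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n - 1]."""
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


def frame_blocks(signal, frame_len, frame_shift):
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, frame_shift)
    ]


def hamming(frame):
    """Apply a Hamming window to one frame (frame length must be at least 2)."""
    n = len(frame)
    return [
        x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]


emphasized = pre_emphasis([1.0, 1.0, 1.0], coeff=0.5)
frames = frame_blocks(list(range(10)), frame_len=4, frame_shift=2)
windowed = hamming([1.0] * 5)
```

The windowed frames would then go to autocorrelation and LPC analysis for LPCC, or to the FFT for the MFCC and PLP branches.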

All Syllable Recognition • Feature discrimination comparisons
[Figure label: The Impact of Dimension’s Difference]
Table 7. Performance with LPCC, MFCC or MF-PLP
Features                  SER      Time spent
12LPCC+12ΔLPCC+E+ΔE       33.48%   22.04%
12MFCC+12ΔMFCC+C0+ΔC0     25.84%   89.13%
12PLP+12ΔPLP+C0+ΔC0       25.36%   100%

All Syllable Recognition • [Figure: syllable error rate versus feature dimension, MFCC vs. PLP] • MFCC-45: 14MFCC+14ΔMFCC+14ΔΔMFCC+C0+ΔC0+ΔΔC0 • PLPCC-45: 14PLPCC+14ΔPLPCC+14ΔΔPLPCC+C0+ΔC0+ΔΔC0 Table 8. Syllable recognition error rate for MFCC and PLPCC
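
The Δ and ΔΔ terms in these feature sets are time derivatives of the static coefficients. The slides do not say how they were computed; the sketch below uses the common regression formula with edge replication as an assumption, applied to one coefficient track at a time. ΔΔ is obtained by applying the same function to the Δ track.

```python
def delta(track, window=2):
    """Regression-based delta of one coefficient over time.

    delta[t] = sum_k k * (c[t + k] - c[t - k]) / (2 * sum_k k^2), k = 1..window,
    with the first and last values repeated at the edges.
    """
    n = len(track)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        acc = 0.0
        for k in range(1, window + 1):
            nxt = track[min(t + k, n - 1)]
            prv = track[max(t - k, 0)]
            acc += k * (nxt - prv)
        out.append(acc / denom)
    return out


c0 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # a coefficient rising by 1 per frame
d_c0 = delta(c0)                       # first-order differential
dd_c0 = delta(d_c0)                    # second-order differential
```

Under this scheme, a 45-dimensional vector such as MFCC-45 is the 15 static values (14 cepstra plus C0) followed by their 15 Δ and 15 ΔΔ values.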

[Figure: telephone speech and VAD] Voice Activity Detection • Voice activity detection (VAD) Table 9. Recognition error comparisons without VAD and with VAD

Speaker Adaptation • Speaker adaptation combining MLLR and confidence measures • Maximum Likelihood Linear Regression (MLLR) • Confidence Measures (CM)

Speaker Adaptation • Supervised • Unsupervised • Supervised+Confidence Measure • Unsupervised+Confidence Measure Table 10. Performance (error rate) of Speaker Adaptation System

8Bit MCU Core Speech Chip • Low-cost speech recognition chip design • Embedded ASIC design • 8-bit MCU core with MAC 8×8 (multiplier), ROM, RAM, A/D, D/A, LCD driver, etc. • Areas of application • Dialog toys • Voice-controlled toys • Voice remote controllers for TV, DVD, air conditioner • Voice dialers for telephones

8Bit MCU Core Speech Chip • Embedded Systems • Recognition rate of speaker-dependent speech recognition is 97% for 30 voice commands • Recognition rate of speaker-independent speech recognition is 97% for 20 voice commands • Speaker verification • 16-32 Kbit/s speech coder • Speech synthesis

16Bit DSP Core Speech Chip • High-quality speech recognition chip • Embedded ASIC design • 16-bit fixed-point DSP core (90 MIPS), 8-bit MCU core (40 MIPS), ROM, RAM, 2-channel A/D, 2-channel D/A and 4 PWM. • Embedded Speech Systems • Recognition rate of speaker-dependent speech recognition is 97% for 300 voice commands • Recognition rate of speaker-independent speech recognition is 97% for 300 voice commands • Speaker verification and speaker identification • Text to speech (under development) • Two-channel 5.3 Kbit/s or 8 Kbit/s speech coding

Language Learning Machine • Computer-assisted pronunciation teaching system based on speech recognition technology • Automatically evaluates the speaker's pronunciation based on confidence measures • Directs the speaker to improve his or her pronunciation • The system can be used by • Children • People with hearing impairments • Beginners in a second language, such as Chinese speakers learning a foreign language or foreigners learning Chinese

Language Learning Machine • Scores • Phoneme error detection in pronunciation • Fluency evaluation • Stress judgment • Feedback • Text explanation • Image display • Replay of the speech at different speech rates • Pronunciation Error Detection • Substitution errors • Omission errors • Insertion errors
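
The three error types correspond to the operations of an edit-distance alignment between the expected phoneme sequence and the recognized one. The sketch below is one plausible way to classify them, not the system's documented method; the function name and the toy phoneme strings are assumptions.

```python
def classify_errors(reference, recognized):
    """Align two phoneme lists and list substitution/omission/insertion errors."""
    n, m = len(reference), len(recognized)
    # cost[i][j] = edit distance between reference[:i] and recognized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = reference[i - 1] != recognized[j - 1]
            cost[i][j] = min(cost[i - 1][j - 1] + diff,   # match or substitution
                             cost[i - 1][j] + 1,          # omission
                             cost[i][j - 1] + 1)          # insertion
    # Trace back from the end to recover which operations were used.
    errors = []
    i, j = n, m
    while i > 0 or j > 0:
        diff = i > 0 and j > 0 and reference[i - 1] != recognized[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + diff:
            if diff:
                errors.append(("substitution", reference[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            errors.append(("omission", reference[i - 1], None))
            i -= 1
        else:
            errors.append(("insertion", None, recognized[j - 1]))
            j -= 1
    errors.reverse()
    return errors


# The learner says "l i ao" where "n i h ao" was expected.
detected = classify_errors(["n", "i", "h", "ao"], ["l", "i", "ao"])
```

Each detected error can then drive the feedback channels listed on the slide, for example a text explanation of the substituted phoneme.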

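Summary (Reflections) • Small- and medium-vocabulary speech recognition technology is now fairly mature. It can be applied in ordinary environments with acceptable performance, but it still faces the problem of robust recognition in noisy environments. • From an application standpoint, large-vocabulary continuous speech recognition is still a premature baby; the technology cannot yet meet users' expectations. • Most speech technology companies, in China and abroad, have not yet really made money from speech recognition; they are mainly "burning money". • In mobile communication terminals and fixed servers, speech recognition is currently not a core or indispensable technology; it mainly serves as icing on the cake.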

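Summary (Reflections) • What are the biggest application settings for speech recognition technology? • Miniaturized mobile terminals; • Information query systems; • Places the hands cannot reach, or situations where nothing can be seen in the dark; • Hands-busy or eyes-busy situations, or users with disabilities; • The keyboard is the biggest obstacle to terminal miniaturization. When a mobile phone is only as big as a wristwatch and people can no longer control it with a keyboard, speech recognition will become an indispensable key technology. • Because speech technology has seen no major new breakthrough in recent years, speech recognition has entered a new plateau and faces huge challenges; the corresponding research directions are also in an important period of adjustment.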

Summary (Reflections) • DARPA began the EARS (Effective, Affordable, Reusable Speech-to-Text) program in 2002. • GOAL: The program will focus on natural, unconstrained human-human speech from broadcasts and telephone conversations in a number of languages. • OBJECTIVES: Word error rates in the 5-10% range (reached within 36 months for broadcasts and within 60 months for conversations). Multiple source languages (including varieties of English, Chinese, and Arabic). • SCOPE: Wide-ranging, multidisciplinary research. Quantitative comparative evaluations of algorithm accuracy. • Duration: • 2002-2005 readiness • 2005-2007 application

Thank you all
