
Speech Recognition Fundamentals: State of the Art and the Challenge of Noise

This seminar provides an overview of the fundamentals and state-of-the-art techniques in speech recognition, with a focus on the major challenge of noise. Topics covered include waveform and spectrogram analysis, speech enhancement, speaker recognition, speech synthesis, and more.


Presentation Transcript


  1. Speech Recognition Fundamentals, State of the Art, and a Major Challenge - Noise. Ji Ming, Audio Engineering Seminar, 23/01/2019. Overview, September 2004

  2. Information encoded in a voice signal • Language/words • Topic/meaning • The speaker(s): ID, gender, age, etc. • Dialect • Emotion: happy, sad, angry, normal • Background: clean or noisy; noise type and level • Transducer/communication channel: bandwidth reduction, codec-introduced distortion [Figure: waveform and spectrogram, linked to speech recognition, speaker recognition and speech enhancement]

  3. Other subjects of speech processing • Speech synthesis: generating human speech, i.e. TTS (text to speech) • Speech communication: compressing speech for communication, e.g. wireless, Internet, Bluetooth

  4. Speech recognition • Fundamentals • State of the art • Major challenges. Fundamentals: how to build a speech recogniser - classical approaches

  5. [System diagram] Corpora train the language model, the lexicon model (phoneme transcriptions) and the acoustic model; hypotheses generated from these models are compared against the unknown speech

  6. [System diagram, continued] The hypothesis that matches the unknown speech is output as the recognition result

  7. Some major challenges in speech recognition [Diagram: speaker → channel → computer-based recognition] • Speaker variability • Dialect • Pronunciation • Style (read or casual) • Stress (Lombard effect) • ……

  8. [System diagram, annotated] The language model must predict all possible expressions of realistic speech; the lexicon model must predict all possible pronunciations of every word

  9. Some major challenges in speech recognition • Speaker variability • Dialect • Pronunciation • Style (read or casual) • Stress (Lombard effect) • …… • Microphone / channel variability • Distortion • Distance • Wireless codec etc. • …… • Environment variability • Noise • Other speakers • ……

  10. [System diagram, annotated] In addition, the acoustic model must predict all possible effects of noise/channel/emotion etc. on the sound of every phoneme

  11. [Chart] Task difficulty, from dictation and casual telephony to casual, distant speech with crosstalk and non-speech audio; progress achieved by the DNN technology approach

  12. Speech recognition • Fundamentals • State of the art • Major challenges. State of the art: how to build a speech recogniser - deep learning approaches

  13. Deep learning approaches for speech recognition • Three main frameworks • Hybrid DNN-HMM architecture • Use of DNN-derived features in a classical GMM-HMM • End-to-end DNNs

  14. [System diagram repeated from slide 5]

  15. Classical GMM-HMM [Diagram] Unknown speech → GMM → HMM → word / phoneme string hypothesis

  16. Hybrid DNN-HMM [Diagram] Unknown speech → DNN (layer 1, layer 2, …, layer N) → HMM → word / phoneme string hypothesis
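
A detail worth noting for the hybrid architecture (standard in the hybrid DNN-HMM literature, though not spelled out on the slide): the DNN outputs state posteriors, while the HMM decoder expects acoustic likelihoods, so the posteriors are divided by the state priors to obtain scaled likelihoods. A minimal sketch, with illustrative numbers:

```python
def posteriors_to_scaled_likelihoods(posteriors, priors):
    """Convert DNN state posteriors P(state | frame) into scaled
    likelihoods p(frame | state) proportional to
    P(state | frame) / P(state), the quantity the HMM decoder
    consumes in a hybrid DNN-HMM system."""
    return [p / q for p, q in zip(posteriors, priors)]

# Toy example: 3 HMM states, one frame of DNN softmax output.
post = [0.7, 0.2, 0.1]    # DNN posteriors for one frame
prior = [0.5, 0.3, 0.2]   # state priors counted from training alignments
scaled = posteriors_to_scaled_likelihoods(post, prior)
```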

  17. DNN as feature extractor [Diagram] Unknown speech → DNN extracting bottleneck features → GMM-HMM

  18. [System diagram repeated from slide 5]

  19. End-to-end DNNs • Integrate the language model, lexicon model and acoustic model • Take raw waveform/spectra as input and produce characters or words as output • Example: speech to spelling characters • Directly compute P(w1 w2 … wN | x1 x2 … xT)
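
The slide leaves the architecture open; one common way end-to-end models realise P(w1 w2 … wN | x1 x2 … xT) is CTC (an assumption here, not named on the slide), whose greedy decoding rule is easy to sketch: merge repeated symbols along the best frame-level path, then drop the blank symbol. The example string below is hypothetical:

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse a frame-level best-path labelling into an output
    string: merge repeated symbols, then drop blanks (the CTC
    decoding rule used by many end-to-end models)."""
    out = []
    prev = None
    for sym in frame_labels:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# "hh-e-ll-llo-" collapses to "hello": repeats merge, blanks vanish,
# and the blank between "ll" and "llo" keeps the double-l distinct.
decoded = ctc_greedy_decode(list("hh-e-ll-llo-"))
```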

  20. State of the art – the TIMIT task • Phonetic transcription • Artificial, phonetically rich sentences • Recognising phones • Phones’ pronunciations vary heavily with context • Some phones are as short as a single frame (10-20 ms) [Plot: phone accuracy, 1990-2012 (axis 75%-80%), classical HMM vs. DNN (e.g. DBN, CDNN)]

  21. State of the art – the Switchboard task • Large-vocabulary, conversational, telephony speech [Plot: error rate (%)]

  22. State of the art – end-to-end speech recognition • The WSJ task: large-vocabulary, read speech recognition • Input: raw speech spectrogram • Output: spelling characters • No language model, no dictionary

  23. State of the art – voice search / YouTube transcription • Bing mobile voice search application (BMVS): sentence accuracy 63.8% based on GMM-HMM, 69.6% based on DNN-HMM • YouTube data transcription (Google): word accuracy 47.7% based on GMM-HMM, 53.8% based on DBN-DNN

  24. Speech recognition • Fundamentals • State of the art • Major challenges. Major challenges: an unsolved problem - noise

  25. A major challenge - noise • Systems trained using clean data do not work for noisy data

  26. A major challenge - noise • Systems trained using clean data do not work for noisy data [Chart: speaker recognition accuracy (%) in noise – in collaboration with MIT]

  27. A major challenge - noise • Systems trained using clean data do not work for noisy data • The Aurora 2 task – connected digits recognition • The Aurora 4 task – large vocabulary, read speech recognition • Typical noises considered in databases: airport, street traffic, restaurant, babble, subway, car, exhibition, etc. [Audio examples: subway, babble]

  28. Methods to improve noise robustness • Noise-robust acoustic features • Speech enhancement • RASTA • SPLICE • Normalization / adaptation etc. • Noise-robust acoustic models • Noise compensation (e.g., PMC, Taylor expansion, etc.) • Missing feature theory • Uncertainty decoding • Data augmentation / Multi-condition training
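
As a concrete instance of the "normalization" bullet above, a minimal sketch of cepstral mean and variance normalisation (CMVN), assuming features arrive as a list of frames, each a list of cepstral coefficients; per dimension it subtracts the utterance mean and divides by the standard deviation, which removes stationary channel offsets:

```python
import math

def cmvn(frames):
    """Cepstral mean and variance normalisation: make each feature
    dimension zero-mean and unit-variance over the utterance.
    Constant dimensions are left at zero (std clamped to 1)."""
    dims = len(frames[0])
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
            for d in range(dims)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)]
            for f in frames]

# Toy features: dim 0 varies, dim 1 is a constant channel offset.
feats = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
norm = cmvn(feats)   # dim 1 becomes all zeros
```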

  29. Deep learning for noisy speech recognition [Diagram] Artificially add noise (car, restaurant, street, subway, …) at several SNRs (0 dB, 5 dB, 10 dB) to the clean training speech to learn variable environments; the resulting multi-condition data is used to train the DNNs
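
The noise addition shown on the slide reduces to one step: scale the noise so the mixture hits a target SNR, then add it to the clean signal. A minimal sketch (the function name and signal shapes are illustrative, not from the slides):

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean-to-noise power ratio of the
    mixture equals snr_db, then add it to the clean signal - the
    core step of multi-condition training data generation."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]
```

Running this over each clean utterance with every noise type and SNR in the grid on the slide yields the multi-condition training set.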

  30. Deep learning for noisy speech recognition [Diagram] A DNN learns a mapping from a fixed-length segment of noisy speech (multi-condition data) to a clean speech estimate, supervised by the parallel clean data
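
The slide does not name the learning target for the clean speech estimate; one widely used choice (an assumption here, not stated on the slide) is the ideal ratio mask over power spectra, which the network learns to predict and which is then applied to the noisy spectrum:

```python
def ideal_ratio_mask(clean_power, noise_power):
    """Per-frequency-bin ratio of clean power to total power - a
    common regression target for DNN speech enhancement (this
    choice of target is an illustration, not from the slides)."""
    return [c / (c + n) for c, n in zip(clean_power, noise_power)]

def apply_mask(noisy_power, mask):
    """Multiply each noisy power-spectrum bin by its mask value."""
    return [m * x for m, x in zip(mask, noisy_power)]

# Two toy bins: bin 0 is speech-dominated, bin 1 is noise-dominated.
mask = ideal_ratio_mask([4.0, 1.0], [1.0, 4.0])   # [0.8, 0.2]
clean_estimate = apply_mask([5.0, 5.0], mask)      # recovers [4.0, 1.0]
```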

  31. Deep learning for speech recognition [Results] • Aurora 2 (digit strings), noisy at SNR = 0 dB • Aurora 4 (large vocabulary)

  32. Challenges faced by data augmentation • Impossible to collect all possible forms of noise • Hence DNNs generalise poorly to unseen noise • Can Big Data solve the problem? Not really • Training with too much noise blurs the speech / noise boundary • Data collection and training can be very slow [Diagram: clean speech buried under layers of added noise; noisy speech becomes noisier speech]

  33. Examples of poor noise generalisation [Chart] • Aurora 2 (digit strings), SNR = 0 dB, unseen noise, state-of-the-art DNN

  34. Going wide: a radical solution to unseen noise • Classical methods denoise frame by frame (a frame: 20-30 ms) • Deep learning methods denoise segment by segment (a segment: 100-200 ms) • Within windows this short there is no clear speech/noise distinction, so the noise must be known to recover the speech • A DNN only ‘hears’ a short period of speech for denoising

  35. Going wide: a radical solution to unseen noise (continued) • How do humans pick out speech in strong noise? They try to make sense of it: Is it a human voice? Does it make semantic sense? • By hearing the voice over a long span (a much longer segment, 1 s; an even longer segment, 2 s) and by making sense of it, we can extract the speech from arbitrary noise • Can we emulate what humans do, to gain greater robustness to untrained noise?

  36. Going wide: an Oracle experiment [Diagram] The ground truth and the noise are hidden; only the noisy speech to recognise is observable. Find the match among the candidates (clean training data). Under what conditions can we find the perfect match, without knowledge of the noise?

  37. Going wide: an Oracle experiment [Diagram as slide 36] Consider comparing two segments of increasing lengths

  38. [Diagram repeated from slide 37]

  39. Going wide: an Oracle experiment. Measures for comparing two segments of L frames: • Correlation (ZNCC) • Euclidean distance (used in GMM etc.)
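
The two measures named on the slide can be sketched directly; the example below also shows the property that matters for matching under noise: ZNCC is invariant to gain and offset, while Euclidean distance is not. (The 1-D segments are illustrative; the slides compare feature segments.)

```python
import math

def zncc(a, b):
    """Zero-mean normalised cross-correlation of two equal-length
    segments: subtract each segment's mean, then compute the cosine
    of the angle between the centred vectors. Range [-1, 1]."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db)

def euclidean(a, b):
    """Plain Euclidean distance (the measure used in GMM etc.)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0 * x + 5.0 for x in a]   # same shape, distorted gain + offset
# ZNCC still reports a perfect match (1.0); Euclidean distance is large.
scores = (zncc(a, b), euclidean(a, b))
```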

  40. Going wide: an Oracle experiment [Plot] Without knowing the noise, the accuracy of retrieving the perfect match as a function of segment length L (SNR = -5 dB), for symphony noise and Gaussian noise; L ranges from 10 frames (0.1 s) through 20, 50 (0.5 s), 80 and 100 (1 s) to 140 (1.4 s)

  41. Going wide: a major obstacle [Diagram as slide 36] It is impossible to have a perfect match for all possible very long speech segments

  42. Going wide: a practical implementation [Diagram as slide 36] Consider all possible chains of shorter segments

  43. [Diagram repeated from slide 42]

  44. Going wide: a practical implementation [Diagram] Consider all possible chains of shorter segments; accept a match if it sounds like the same speaker and is meaningful. This converts noisy speech recognition into “clean” speech/speaker identification: clean speaker recognition plus clean speech recognition over the candidates (clean training data)
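
Searching over "all possible chains of shorter segments" has the shape of a Viterbi search. The sketch below is hypothetical, not the slides' actual algorithm: each noisy segment gets a match score against each clean candidate, a transition bonus rewards chains that stay coherent (e.g. same speaker, plausible continuation), and dynamic programming picks the best chain:

```python
def best_chain(scores, transition):
    """Viterbi-style chain search (hypothetical sketch).

    scores[t][k]     : match score of clean candidate k against
                       noisy segment t
    transition[j][k] : bonus for candidate k following candidate j
                       (coherence: same speaker, meaningful continuation)
    Returns the candidate index chosen for each segment."""
    n_seg, n_cand = len(scores), len(scores[0])
    best = list(scores[0])
    back = [[0] * n_cand for _ in range(n_seg)]
    for t in range(1, n_seg):
        new = []
        for k in range(n_cand):
            j = max(range(n_cand), key=lambda j: best[j] + transition[j][k])
            back[t][k] = j
            new.append(best[j] + transition[j][k] + scores[t][k])
        best = new
    k = max(range(n_cand), key=lambda k: best[k])
    chain = [k]
    for t in range(n_seg - 1, 0, -1):
        k = back[t][k]
        chain.append(k)
    return chain[::-1]

# With no continuity bonus the best candidate is chosen per segment:
# best_chain([[2, 0], [0, 1], [2, 0]], [[0, 0], [0, 0]]) -> [0, 1, 0]
```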

  45. Wide vs. deep: unseen/untrained noise [Chart] • Aurora 2 (digit strings), SNR = 0 dB, unseen noise: state-of-the-art DNN vs. wide matching

  46. Some other applications • Bluetooth & in-car communication, with CSR (now part of Qualcomm), EPSRC KTS project [Audio examples: noisy, enhanced, classical (LMMS)] • Best accuracy for speaker recognition on MIT hand-held device data • CHiME Speech Separation and Recognition Challenge, first place, by NTT (Japan) • Hearing aid, joint MRC/EPSRC project • Best paper in Speech Enhancement, Interspeech 2010 • 1st in the UK, and 3rd internationally, in the 1st International Speech Separation Challenge

  47. Two-talker speech separation [Chart] Difficulty increases as the TMR (dB) falls from 0 dB to -10 dB, and again as the two talkers go from different gender to the same gender

  48. Thank you
