This seminar provides an overview of the fundamentals and state-of-the-art techniques in speech recognition, with a focus on the major challenge of noise. Topics covered include waveforms and spectrograms, speech enhancement, speaker recognition, speech synthesis, and more.
Speech Recognition: Fundamentals, State of the Art, and a Major Challenge - Noise
Ji Ming
Audio Engineering Seminar, 23/01/2019
Overview (September 2004)
Information encoded in a voice signal
• Language/words
• Topic/meaning
• The speaker(s): ID, gender, age, etc.
• Dialect
• Emotion: happy, sad, angry, normal
• Background: clean or noisy; noise type and level
• Transducer/communication channel: bandwidth reduction; codec-introduced distortion
[Figure labels: waveform, spectrogram, speech recognition, speaker recognition, speech enhancement]
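The slide's waveform-to-spectrogram illustration can be reproduced numerically. Below is a minimal NumPy sketch (not from the slides; the frame length, hop size and Hann window are illustrative choices):

```python
import numpy as np

def spectrogram(wave, frame_len=256, hop=128):
    """Short-time power spectrum: one row per windowed frame
    (256 samples is ~32 ms at an 8 kHz sampling rate)."""
    frames = [wave[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(wave) - frame_len + 1, hop)]
    # Magnitude-squared FFT of each windowed frame
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

# Example: a 440 Hz tone sampled at 8 kHz concentrates its energy
# in the frequency bin nearest 440 Hz
t = np.arange(8000) / 8000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

Each row of `spec` corresponds to one short analysis frame, which is the representation the recognisers discussed below operate on.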
Other subjects of speech processing
• Speech synthesis: generating human speech, i.e. TTS (text to speech)
• Speech communication: compressing speech for transmission, e.g. wireless, Internet, Bluetooth
Speech recognition
• Fundamentals
• State of the art
• Major challenges
How to build a speech recogniser - classical approaches
[Diagram: classical recogniser - corpora train a language model, a lexicon model (phoneme transcriptions) and an acoustic model; the hypotheses they generate are compared against the unknown speech.]
Some major challenges in speech recognition (computer-based recognition over a channel)
• Speaker variability: dialect; pronunciation; style (read or casual); stress (Lombard effect); …
• Microphone/channel variability: distortion; distance; wireless codecs; …
• Environment variability: noise; other speakers; …
[Diagram: the language model must predict all possible expressions of realistic speech, the lexicon model all possible pronunciations of every word, and the acoustic model all possible effects of noise/channel/emotion etc. on the sound of every phoneme; hypotheses are compared against the unknown speech.]
[Chart: recognition progress across tasks of increasing difficulty - dictation, telephony, casual speech with non-speech audio, casual distant speech with crosstalk - achieved by the DNN technology approach.]
Speech recognition
• Fundamentals
• State of the art
• Major challenges
How to build a speech recogniser - deep learning approaches
Deep learning approaches for speech recognition - three main frameworks:
• Hybrid DNN-HMM architecture
• DNN-derived features in a classical GMM-HMM
• End-to-end DNNs
Classical GMM-HMM: [Diagram: unknown speech → GMM (frame-level emission scores) → HMM → word/phoneme string hypothesis.]
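As a rough sketch of how the GMM box scores each frame (an illustration under standard diagonal-covariance assumptions, not the seminar's actual models), the per-frame log-likelihood can be computed as:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log p(x | state) under a diagonal-covariance GMM.
    x: (D,) feature frame; weights: (M,); means, variances: (M, D)."""
    # Per-component Gaussian log-density, summed over feature dimensions
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances)
                               + (x - means) ** 2 / variances, axis=1))
    # Log-sum-exp over mixture components for numerical stability
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Toy two-component, one-dimensional example
w = np.array([0.5, 0.5])
mu = np.array([[0.0], [4.0]])
var = np.array([[1.0], [1.0]])
ll = gmm_loglik(np.array([0.0]), w, mu, var)
```

In decoding, this per-frame score plays the role of the HMM state's emission probability.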
Hybrid DNN-HMM: [Diagram: unknown speech → DNN (layer 1, layer 2, …, layer N) → HMM → word/phoneme string hypothesis.]
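The hybrid system still decodes with an HMM, but the DNN outputs state posteriors P(s|x). These are commonly converted to scaled likelihoods for Viterbi decoding by dividing by the state priors. A minimal sketch (the numbers are illustrative, not from the slides):

```python
import numpy as np

def pseudo_loglik(log_posteriors, log_priors):
    """Turn DNN state posteriors into scaled log-likelihoods:
    log p(x|s) = log P(s|x) - log P(s) + const (const drops out in decoding)."""
    return log_posteriors - log_priors

# Toy example: one frame, three HMM states
post = np.array([0.7, 0.2, 0.1])    # DNN softmax output P(s|x)
prior = np.array([0.5, 0.3, 0.2])   # state priors from training alignments
scores = pseudo_loglik(np.log(post), np.log(prior))
```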
DNN as feature extractor: [Diagram: unknown speech → DNN extracting bottleneck features → GMM-HMM.]
End-to-end DNNs
• Integrate the language model, lexicon model and acoustic model
• Take raw waveforms/spectra as input and produce characters or words as output
• Example: speech-to-spelling characters
• Directly compute P(w1 w2 … wN | x1 x2 … xT)
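As one hedged illustration of "speech-to-spelling characters" (the slides do not specify the decoder; this sketch assumes a CTC-style output layer with a blank symbol), greedy decoding of per-frame character scores looks like:

```python
import numpy as np

def greedy_ctc_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC-style decoding: pick the best symbol per frame,
    collapse consecutive repeats, then drop the blank symbol."""
    best = np.argmax(frame_scores, axis=1)
    out, prev = [], blank
    for b in best:
        if b != prev and b != blank:
            out.append(alphabet[b])
        prev = b
    return "".join(out)

# Toy per-frame scores over {blank, 'h', 'i'} for six frames
alphabet = ["_", "h", "i"]
scores = np.array([[0.1, 0.8, 0.1],   # 'h'
                   [0.1, 0.8, 0.1],   # 'h' repeated -> collapsed
                   [0.8, 0.1, 0.1],   # blank
                   [0.1, 0.1, 0.8],   # 'i'
                   [0.1, 0.1, 0.8],   # 'i' repeated -> collapsed
                   [0.8, 0.1, 0.1]])  # blank
decoded = greedy_ctc_decode(scores, alphabet)
```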
State of the art - the TIMIT task
• Phonetic transcription; artificial, phonetically rich sentences
• Recognising phones
• Phone pronunciations vary heavily with context
• Some phones are as short as a single frame (10-20 ms)
[Chart: phone accuracy, 1990-2012 (axis 75-80%): classical HMMs, then DNNs (e.g. DBN, CDNN).]
State of the art - the Switchboard task: large-vocabulary, conversational, telephony speech. [Chart: error rate (%).]
State of the art - end-to-end speech recognition
• The WSJ task: large-vocabulary, read speech recognition
• Input: raw speech spectrogram; output: spelling characters
• No language model, no dictionary
State of the art - voice search / YouTube transcription
• Bing mobile voice search application (BMVS): sentence accuracy 63.8% with GMM-HMM vs 69.6% with DNN-HMM
• YouTube data transcription (Google): word accuracy 47.7% with GMM-HMM vs 53.8% with DBN-DNN
Speech recognition
• Fundamentals
• State of the art
• Major challenges
An unsolved problem - noise
A major challenge - noise
• Systems trained using clean data do not work for noisy data
[Table: speaker recognition accuracy (%) in noise - in collaboration with MIT.]
A major challenge - noise
• Systems trained using clean data do not work for noisy data
• The Aurora 2 task: connected-digit recognition
• The Aurora 4 task: large-vocabulary, read speech recognition
• Typical noises in these databases: airport, street traffic, restaurant, babble, subway, car, exhibition, etc. (audio examples: subway, babble)
Methods to improve noise robustness
• Noise-robust acoustic features: speech enhancement; RASTA; SPLICE; normalisation/adaptation, etc.
• Noise-robust acoustic models: noise compensation (e.g. PMC, Taylor expansion); missing-feature theory; uncertainty decoding; data augmentation / multi-condition training
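As a concrete example of the normalisation family above, cepstral mean normalisation (CMN) removes stationary channel effects by subtracting each utterance's per-coefficient mean. A minimal sketch:

```python
import numpy as np

def cepstral_mean_norm(features):
    """Cepstral mean normalisation: subtract the per-utterance mean
    of each cepstral coefficient, cancelling stationary channel effects."""
    return features - features.mean(axis=0, keepdims=True)

# A convolutive channel adds a constant offset in the cepstral domain,
# so CMN makes clean and channel-distorted features identical
clean = np.random.randn(100, 13)        # 100 frames of 13 cepstral coefficients
distorted = clean + np.full(13, 2.5)    # constant channel offset
```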
Deep learning for noisy speech recognition: artificially add noise to clean training data to learn variable environments. [Diagram: clean training speech mixed with noise types (car, restaurant, street, subway, …) at SNRs of 0, 5, 10 dB, …; the resulting data are used to train the DNNs.]
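Mixing noise into clean speech at a chosen SNR is the core of this augmentation step. A minimal NumPy sketch (illustrative; it assumes SNR is defined as a power ratio over the whole utterance):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
    then mix. Assumes `noise` is at least as long as `speech`."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a clean utterance
noise = rng.standard_normal(16000)    # stand-in for a noise recording
noisy = add_noise_at_snr(speech, noise, snr_db=5)
```

Repeating this over many noise types and SNR levels produces the multi-condition training set shown in the diagram.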
Deep learning for noisy speech recognition: [Diagram: a DNN learns, from paired multi-condition and clean data, to map a fixed-length segment of noisy speech to a clean speech estimate.]
Deep learning for speech recognition: [Charts: results on Aurora 2 (digit strings, noisy, SNR = 0 dB) and Aurora 4 (large vocabulary).]
Challenges faced by data augmentation
• It is impossible to collect all possible forms of noise, so many DNNs generalise poorly to unseen noise
• Can big data solve the problem? Not really:
• Training with too much noise blurs the speech/noise boundary
• Data collection and training can be very slow
[Diagram: clean speech buried under ever more noise sources becomes noisy, then noisier, speech.]
Examples of poor noise generalisation: [Chart: Aurora 2 (digit strings), SNR = 0 dB, unseen noise, state-of-the-art DNN.]
Going wide: a radical solution to unseen noise
• Classical methods denoise frame by frame (a frame: 20-30 ms); with no speech/noise distinction at that scale, the noise must be known to recover the speech
• Deep learning methods denoise segment by segment (a segment: 100-200 ms); still no clear speech/noise distinction, so the noise must still be known
• A DNN only 'hears' a short period of speech for denoising
• How do humans pick out speech in strong noise? They try to make sense of it: Is it a human voice? Does it make semantic sense?
• By hearing the voice for much longer (a 1 s or even 2 s segment) and by making sense of it, we can extract the speech from arbitrary noise
• Can we emulate what humans do, to gain greater robustness to untrained noise?
Going wide: an Oracle experiment. [Diagram: the noisy speech to recognise (observable) is a hidden ground truth plus hidden noise; find its match among candidates from the clean training data.] Under what conditions can we find the perfect match without knowledge of the noise?
Going wide: an Oracle experiment - consider comparing two segments of increasing lengths. [Diagram: noisy speech matched against candidates from the clean training data.]
Going wide: an Oracle experiment. Measures for comparing two segments of length L frames: correlation (ZNCC), and Euclidean distance (as used in GMMs etc.).
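The two measures can be sketched directly (illustrative NumPy code, operating on flattened feature segments of equal length):

```python
import numpy as np

def zncc(a, b):
    """Zero-mean normalised cross-correlation between two equal-length
    segments; invariant to a constant gain and offset on either input."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    """Plain Euclidean distance, as implied by Gaussian (GMM) scoring."""
    return float(np.linalg.norm(a - b))

# ZNCC ignores a fixed gain/offset (e.g. a level change); Euclidean does not
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
```

ZNCC's invariance to gain and offset is one reason the two measures behave differently when matching noisy segments against clean candidates.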
Going wide: an Oracle experiment. [Chart: without knowing the noise, the accuracy of retrieving the perfect match as a function of segment length L (SNR = -5 dB), for symphony noise and Gaussian noise; L = 10 (0.1 s), 20, 50 (0.5 s), 80, 100 (1 s), 140 (1.4 s).]
Going wide: a major obstacle - it is impossible to have a perfect match for all possible very long speech segments. [Diagram: noisy speech matched against candidates from the clean training data.]
Going wide: a practical implementation - consider all possible chains of shorter segments. [Diagram: noisy speech matched against chained candidates from the clean training data.]
Going wide: a practical implementation
• Consider all possible chains of shorter segments
• Convert noisy speech recognition into "clean" speech/speaker identification
• Accept a match if it sounds like the same speaker (clean speaker recognition) and is meaningful (clean speech recognition)
[Diagram: noisy speech (hidden ground truth plus hidden noise) matched against chained candidates from the clean training data.]
Wide vs. deep on unseen/untrained noise: [Chart: Aurora 2 (digit strings), SNR = 0 dB, unseen noise - state-of-the-art DNN vs wide matching.]
Some other applications
• Bluetooth & in-car communication, with CSR (now part of Qualcomm); EPSRC KTS project (audio examples: noisy / enhanced / classical LMMS)
• Best accuracy for speaker recognition on MIT hand-held device data
• CHiME Speech Separation and Recognition Challenge: first place, by NTT (Japan)
• Hearing aids: joint MRC/EPSRC project
• Best paper in Speech Enhancement, Interspeech 2010
• 1st in the UK, and 3rd internationally, in the 1st International Speech Separation Challenge
Two-talker speech separation: [Chart: difficulty increases as the target-to-masker ratio (TMR) falls from 0 dB to -10 dB, and from different-gender to same-gender talkers.]