
Speech Recognition Fundamentals: State of the Art and the Challenge of Noise

This seminar provides an overview of the fundamentals and state-of-the-art techniques in speech recognition, with a focus on the major challenge of noise. Topics covered include waveform and spectrogram analysis, speech enhancement, speaker recognition, speech synthesis, and more.


Presentation Transcript


  1. Speech Recognition Fundamentals, State of the Art, and a Major Challenge - Noise. Ji Ming, Audio Engineering Seminar, 23/01/2019. Overview, September 2004

  2. Information encoded in a voice signal • Language/words • Topic/meaning • The speaker(s): ID, gender, age, etc. • Dialect • Emotion: happy, sad, angry, normal • Background: clean or noisy; noise type and level • Transducer/communication channel: bandwidth reduction, codec-introduced distortion [Figure: waveform and spectrogram, linked to speech recognition, speaker recognition and speech enhancement]

  3. Other subjects of speech processing • Speech synthesis: generating human speech, i.e. TTS (text to speech) • Speech communication: compressing speech for communication, e.g. wireless, Internet, Bluetooth

  4. Speech recognition • Fundamentals • State of the art • Major challenges. Fundamentals: how to build a speech recogniser - classical approaches

  5. [System diagram] Corpora train the language model, the lexicon model (phoneme transcriptions) and the acoustic model; hypotheses generated from these models are compared against the unknown speech

  6. [System diagram, continued] The hypothesis that matches the unknown speech is output as the recognition result

  7. Some major challenges in speech recognition [Diagram: speaker → channel → computer-based recognition] • Speaker variability • Dialect • Pronunciation • Style (read or casual) • Stress (Lombard effect) • ……

  8. [System diagram, annotated] The language model must predict all possible expressions of realistic speech; the lexicon model must predict all possible pronunciations of every word

  9. Some major challenges in speech recognition • Speaker variability • Dialect • Pronunciation • Style (read or casual) • Stress (Lombard effect) • …… • Microphone / channel variability • Distortion • Distance • Wireless codec etc. • …… • Environment variability • Noise • Other speakers • ……

  10. [System diagram, annotated] In addition, the acoustic model must predict all possible effects of noise/channel/emotion etc. on the sound of every phoneme

  11. [Chart] Task difficulty, from dictation and casual telephony to casual, distant speech with crosstalk and non-speech audio; progress achieved by the DNN technology approach

  12. Speech recognition • Fundamentals • State of the art • Major challenges. State of the art: how to build a speech recogniser - deep learning approaches

  13. Deep learning approaches for speech recognition • Three main frameworks • Hybrid DNN-HMM architecture • Use of DNN-derived features in a classical GMM-HMM • End-to-end DNNs

  14. [System diagram repeated from slide 5]

  15. Classical GMM-HMM [Diagram] Unknown speech → GMM → HMM → word / phoneme string hypothesis

  16. Hybrid DNN-HMM [Diagram] Unknown speech → DNN (layer 1, layer 2, …, layer N) → HMM → word / phoneme string hypothesis
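
A detail worth noting for the hybrid architecture (standard in the hybrid DNN-HMM literature, though not spelled out on the slide): the DNN outputs state posteriors, while the HMM decoder expects acoustic likelihoods, so the posteriors are divided by the state priors to obtain scaled likelihoods. A minimal sketch, with illustrative numbers:

```python
def posteriors_to_scaled_likelihoods(posteriors, priors):
    """Convert DNN state posteriors P(state | frame) into scaled
    likelihoods p(frame | state) proportional to
    P(state | frame) / P(state), the quantity the HMM decoder
    consumes in a hybrid DNN-HMM system."""
    return [p / q for p, q in zip(posteriors, priors)]

# Toy example: 3 HMM states, one frame of DNN softmax output.
post = [0.7, 0.2, 0.1]    # DNN posteriors for one frame
prior = [0.5, 0.3, 0.2]   # state priors counted from training alignments
scaled = posteriors_to_scaled_likelihoods(post, prior)
```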

  17. DNN as feature extractor [Diagram] Unknown speech → DNN extracting bottleneck features → GMM-HMM

  18. [System diagram repeated from slide 5]

  19. End-to-end DNNs • Integrate the language model, lexicon model and acoustic model • Take raw waveform/spectra as input and produce characters or words as output • Example: speech to spelling characters • Directly compute P(w1 w2 … wN | x1 x2 … xT)
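
The slide leaves the architecture open; one common way end-to-end models realise P(w1 w2 … wN | x1 x2 … xT) is CTC (an assumption here, not named on the slide), whose greedy decoding rule is easy to sketch: merge repeated symbols along the best frame-level path, then drop the blank symbol. The example string below is hypothetical:

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse a frame-level best-path labelling into an output
    string: merge repeated symbols, then drop blanks (the CTC
    decoding rule used by many end-to-end models)."""
    out = []
    prev = None
    for sym in frame_labels:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# "hh-e-ll-llo-" collapses to "hello": repeats merge, blanks vanish,
# and the blank between "ll" and "llo" keeps the double-l distinct.
decoded = ctc_greedy_decode(list("hh-e-ll-llo-"))
```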

  20. State of the art – the TIMIT task • Phonetic transcription • Artificial, phonetically rich sentences • Recognising phones • Phones’ pronunciations vary heavily with context • Some phones are as short as a single frame (10-20 ms) [Plot: phone accuracy, 1990-2012 (axis 75%-80%), classical HMM vs. DNN (e.g. DBN, CDNN)]

  21. State of the art – the Switchboard task • Large-vocabulary, conversational, telephony speech [Plot: error rate (%)]

  22. State of the art – end-to-end speech recognition • The WSJ task: large-vocabulary, read speech recognition • Input: raw speech spectrogram • Output: spelling characters • No language model, no dictionary

  23. State of the art – voice search / YouTube transcription • Bing mobile voice search application (BMVS): sentence accuracy 63.8% based on GMM-HMM, 69.6% based on DNN-HMM • YouTube data transcription (Google): word accuracy 47.7% based on GMM-HMM, 53.8% based on DBN-DNN

  24. Speech recognition • Fundamentals • State of the art • Major challenges. Major challenges: an unsolved problem - noise

  25. A major challenge - noise • Systems trained using clean data do not work for noisy data

  26. A major challenge - noise • Systems trained using clean data do not work for noisy data [Chart: speaker recognition accuracy (%) in noise – in collaboration with MIT]

  27. A major challenge - noise • Systems trained using clean data do not work for noisy data • The Aurora 2 task – connected digits recognition • The Aurora 4 task – large vocabulary, read speech recognition • Typical noises considered in databases: airport, street traffic, restaurant, babble, subway, car, exhibition, etc. [Audio examples: subway, babble]

  28. Methods to improve noise robustness • Noise-robust acoustic features • Speech enhancement • RASTA • SPLICE • Normalization / adaptation etc. • Noise-robust acoustic models • Noise compensation (e.g., PMC, Taylor expansion, etc.) • Missing feature theory • Uncertainty decoding • Data augmentation / Multi-condition training
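
As a concrete instance of the "normalization" bullet above, a minimal sketch of cepstral mean and variance normalisation (CMVN), assuming features arrive as a list of frames, each a list of cepstral coefficients; per dimension it subtracts the utterance mean and divides by the standard deviation, which removes stationary channel offsets:

```python
import math

def cmvn(frames):
    """Cepstral mean and variance normalisation: make each feature
    dimension zero-mean and unit-variance over the utterance.
    Constant dimensions are left at zero (std clamped to 1)."""
    dims = len(frames[0])
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
            for d in range(dims)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)]
            for f in frames]

# Toy features: dim 0 varies, dim 1 is a constant channel offset.
feats = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
norm = cmvn(feats)   # dim 1 becomes all zeros
```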

  29. Deep learning for noisy speech recognition [Diagram] Artificially add noise (car, restaurant, street, subway, …) at several SNRs (0 dB, 5 dB, 10 dB) to the clean training speech to learn variable environments; the resulting multi-condition data is used to train the DNNs
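
The noise addition shown on the slide reduces to one step: scale the noise so the mixture hits a target SNR, then add it to the clean signal. A minimal sketch (the function name and signal shapes are illustrative, not from the slides):

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean-to-noise power ratio of the
    mixture equals snr_db, then add it to the clean signal - the
    core step of multi-condition training data generation."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]
```

Running this over each clean utterance with every noise type and SNR in the grid on the slide yields the multi-condition training set.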

  30. Deep learning for noisy speech recognition [Diagram] A DNN learns a mapping from a fixed-length segment of noisy speech (multi-condition data) to a clean speech estimate, supervised by the parallel clean data
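
The slide does not name the learning target for the clean speech estimate; one widely used choice (an assumption here, not stated on the slide) is the ideal ratio mask over power spectra, which the network learns to predict and which is then applied to the noisy spectrum:

```python
def ideal_ratio_mask(clean_power, noise_power):
    """Per-frequency-bin ratio of clean power to total power - a
    common regression target for DNN speech enhancement (this
    choice of target is an illustration, not from the slides)."""
    return [c / (c + n) for c, n in zip(clean_power, noise_power)]

def apply_mask(noisy_power, mask):
    """Multiply each noisy power-spectrum bin by its mask value."""
    return [m * x for m, x in zip(mask, noisy_power)]

# Two toy bins: bin 0 is speech-dominated, bin 1 is noise-dominated.
mask = ideal_ratio_mask([4.0, 1.0], [1.0, 4.0])   # [0.8, 0.2]
clean_estimate = apply_mask([5.0, 5.0], mask)      # recovers [4.0, 1.0]
```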

  31. Deep learning for speech recognition [Results] • Aurora 2 (digit strings), noisy at SNR = 0 dB • Aurora 4 (large vocabulary)

  32. Challenges faced by data augmentation • Impossible to collect all possible forms of noise • Hence DNNs generalise poorly to unseen noise • Can Big Data solve the problem? Not really • Training with too much noise blurs the speech / noise boundary • Data collection and training can be very slow [Diagram: clean speech buried under layers of added noise; noisy speech becomes noisier speech]

  33. Examples of poor noise generalisation [Chart] • Aurora 2 (digit strings), SNR = 0 dB, unseen noise, state-of-the-art DNN

  34. Going wide: a radical solution to unseen noise • Classical methods denoise frame by frame (a frame: 20-30 ms) • Deep learning methods denoise segment by segment (a segment: 100-200 ms) • Within windows this short there is no clear speech/noise distinction, so the noise must be known to recover the speech • A DNN only ‘hears’ a short period of speech for denoising

  35. Going wide: a radical solution to unseen noise (continued) • How do humans pick out speech in strong noise? They try to make sense of it: Is it a human voice? Does it make semantic sense? • By hearing the voice over a long span (a much longer segment, 1 s; an even longer segment, 2 s) and by making sense of it, we can extract the speech from arbitrary noise • Can we emulate what humans do, to gain greater robustness to untrained noise?

  36. Going wide: an Oracle experiment [Diagram] The ground truth and the noise are hidden; only the noisy speech to recognise is observable. Find the match among the candidates (clean training data). Under what conditions can we find the perfect match, without knowledge of the noise?

  37. Going wide: an Oracle experiment [Diagram as slide 36] Consider comparing two segments of increasing lengths

  38. [Diagram repeated from slide 37]

  39. Going wide: an Oracle experiment. Measures for comparing two segments of L frames: • Correlation (ZNCC) • Euclidean distance (used in GMM etc.)
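
The two measures named on the slide can be sketched directly; the example below also shows the property that matters for matching under noise: ZNCC is invariant to gain and offset, while Euclidean distance is not. (The 1-D segments are illustrative; the slides compare feature segments.)

```python
import math

def zncc(a, b):
    """Zero-mean normalised cross-correlation of two equal-length
    segments: subtract each segment's mean, then compute the cosine
    of the angle between the centred vectors. Range [-1, 1]."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db)

def euclidean(a, b):
    """Plain Euclidean distance (the measure used in GMM etc.)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0 * x + 5.0 for x in a]   # same shape, distorted gain + offset
# ZNCC still reports a perfect match (1.0); Euclidean distance is large.
scores = (zncc(a, b), euclidean(a, b))
```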

  40. Going wide: an Oracle experiment [Plot] Without knowing the noise, the accuracy of retrieving the perfect match as a function of segment length L (SNR = -5 dB), for symphony noise and Gaussian noise; L ranges from 10 frames (0.1 s) through 20, 50 (0.5 s), 80 and 100 (1 s) to 140 (1.4 s)

  41. Going wide: a major obstacle [Diagram as slide 36] It is impossible to have a perfect match for all possible very long speech segments

  42. Going wide: a practical implementation [Diagram as slide 36] Consider all possible chains of shorter segments

  43. [Diagram repeated from slide 42]

  44. Going wide: a practical implementation [Diagram] Consider all possible chains of shorter segments; accept a match if it sounds like the same speaker and is meaningful. This converts noisy speech recognition into “clean” speech/speaker identification: clean speaker recognition plus clean speech recognition over the candidates (clean training data)
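
Searching over "all possible chains of shorter segments" has the shape of a Viterbi search. The sketch below is hypothetical, not the slides' actual algorithm: each noisy segment gets a match score against each clean candidate, a transition bonus rewards chains that stay coherent (e.g. same speaker, plausible continuation), and dynamic programming picks the best chain:

```python
def best_chain(scores, transition):
    """Viterbi-style chain search (hypothetical sketch).

    scores[t][k]     : match score of clean candidate k against
                       noisy segment t
    transition[j][k] : bonus for candidate k following candidate j
                       (coherence: same speaker, meaningful continuation)
    Returns the candidate index chosen for each segment."""
    n_seg, n_cand = len(scores), len(scores[0])
    best = list(scores[0])
    back = [[0] * n_cand for _ in range(n_seg)]
    for t in range(1, n_seg):
        new = []
        for k in range(n_cand):
            j = max(range(n_cand), key=lambda j: best[j] + transition[j][k])
            back[t][k] = j
            new.append(best[j] + transition[j][k] + scores[t][k])
        best = new
    k = max(range(n_cand), key=lambda k: best[k])
    chain = [k]
    for t in range(n_seg - 1, 0, -1):
        k = back[t][k]
        chain.append(k)
    return chain[::-1]

# With no continuity bonus the best candidate is chosen per segment:
# best_chain([[2, 0], [0, 1], [2, 0]], [[0, 0], [0, 0]]) -> [0, 1, 0]
```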

  45. Wide vs. deep: unseen/untrained noise [Chart] • Aurora 2 (digit strings), SNR = 0 dB, unseen noise: state-of-the-art DNN vs. wide matching

  46. Some other applications • Bluetooth & in-car communication, with CSR (now part of Qualcomm), EPSRC KTS project [Audio examples: noisy, enhanced, classical (LMMS)] • Best accuracy for speaker recognition on MIT hand-held device data • CHiME Speech Separation and Recognition Challenge, first place, by NTT (Japan) • Hearing aid, joint MRC/EPSRC project • Best paper in Speech Enhancement, Interspeech 2010 • 1st in the UK, and 3rd internationally, in the 1st International Speech Separation Challenge

  47. Two-talker speech separation [Chart] Difficulty increases as the TMR (dB) falls from 0 dB to -10 dB, and again as the two talkers go from different gender to the same gender

  48. Thank you
