Fast & Open-source Speech Recognition System Overview

Explore Facebook's wav2letter++ speech recognition system. Learn about acoustic model architectures, training methods, language models, decoding techniques, and benchmarks. Discover how neural networks and convolutional networks are used for efficient speech recognition.

Presentation Transcript


  1. wav2letter++: Facebook’s fast open-source speech recognition system • Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, Ronan Collobert • Facebook AI Research • Several slides borrowed from Ronan Collobert, who was not hurt in the process

  2. Agenda • Research: Automatic Speech Recognition (overview and how it works); acoustic model architectures; training, ASG loss vs CTC loss (criteria); language models, decoding • Toolkit: overview, design; Flashlight • Benchmarks: Word Error Rate; training/decoding speed

  3. research

  4. Automatic speech recognition • Pipeline: features => acoustic model => ðə kæt sæt => phonetic dictionary => the cat sat

  5. End-to-end speech recognition • Pipeline: features => acoustic model => decoder (with language model) => the cat sat • End-to-end training • Features: can we train them? • Acoustic model: can we make it simple? • Decoder: can we make it differentiable? • Language model: better but scalable?

  6. End-to-end speech recognition • Pipeline: features => acoustic model => decoder (with language model) => the cat sat

  7. Features: train them • Trainable front-end vs. log-mel filterbanks • Approximates log-mel filterbanks at initialization • Trained with the rest of the network • “Learning filterbanks from raw speech for phone recognition”, Zeghidour et al., ICASSP 2018 • “End-to-end speech recognition from the raw waveform”, Zeghidour et al., Interspeech 2018

  9. Acoustic model: how it works • A neural network scores the letters at each frame of audio • Duration model (max): |||the|caat|||ssaattt| => |the|cat|sat| • Let’s remember that | stands for silence
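
The per-frame output on this slide can be turned into a transcription by merging runs of repeated symbols. The following is a minimal Python sketch of that step only; the function name `collapse_repeats` is hypothetical, and a word with a genuine double letter would need extra handling that this sketch ignores.

```python
from itertools import groupby


def collapse_repeats(frames: str) -> str:
    """Merge each run of identical per-frame symbols into one symbol."""
    return "".join(symbol for symbol, _run in groupby(frames))


# The example from the slide: '|' marks silence between words.
print(collapse_repeats("|||the|caat|||ssaattt|"))  # |the|cat|sat|
```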

  10. Acoustic model: make it simple • Gated convnet block: features => 1D ConvNet => GLU => Dropout • Gated Linear Units (GLU): • Address the vanishing gradient problem • Successfully applied to NLP problems • Gates: $h(X) = (X * W + b) \otimes \sigma(X * V + c)$, where $\otimes$ (⨂) is the element-wise product between matrices • “Language Modeling with Gated Convolutional Networks”, Dauphin et al., ICML, 2017 • “Letter-Based Speech Recognition with Gated ConvNets”, Liptchinsky et al., arXiv 2017
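
A gated linear unit multiplies one set of activations element-wise by the sigmoid of a second set, so the second set acts as a soft on/off gate. The sketch below shows only that gating step in plain Python; it assumes the two inputs were already produced by the convolution, and the names `glu`, `values`, and `gates` are illustrative rather than taken from wav2letter++.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def glu(values, gates):
    """Element-wise product of `values` with the sigmoid of `gates`."""
    if len(values) != len(gates):
        raise ValueError("values and gates must have the same length")
    return [v * sigmoid(g) for v, g in zip(values, gates)]


# A gate of 0 halves the value; a large positive gate passes it through;
# a large negative gate suppresses it.
print(glu([2.0, -1.0, 4.0], [0.0, 50.0, -50.0]))
```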

  11. Acoustic model: architecture and a few tricks • features • Gated convnet block: kernel width 13, channels 40 => 200, dropout 0.2 • Gated convnet block: kernel width 14, channels 200 => 220, dropout 0.214 • … • Gated convnet block: kernel width 29, channels 826 => 908, dropout 0.59 • Linear layer 908 => 908, dropout 0.59 • Linear layer 908 => 30 • For each consecutive convolutional layer: increase kernel width, increase channels, increase dropout • Overall network receptive field of ~2.2 seconds, i.e. each output character is predicted from about 2.2 seconds of audio • Motivation: more modeling capacity and regularization towards the output layers • “Fully Convolutional Speech Recognition”, Zeghidour et al., arXiv, 2019

  13. Language model: how it works • Statistical (n-gram) language models estimate the probability distribution of a sequence of words, e.g. a 3-gram language model estimates $P(w_i \mid w_{i-2}, w_{i-1})$ • Feed-forward neural network models estimate the probability of the next word given a sequence of preceding words • Character-based language models • Do acoustic models learn language modeling? • Anecdotally, acoustic models were observed to output […] instead of […] in noisy audio segments. • Thus, more regularization and more capacity at the output layers of acoustic models.
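
To make the n-gram idea concrete, here is a minimal Python sketch of a 3-gram model estimated by counting, with no smoothing or back-off (which practical n-gram models rely on). The toy corpus and the names `train_trigram` and `trigram_prob` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_trigram(tokens):
    """Count, for every two-word context, which word follows it."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts


def trigram_prob(counts, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for unseen contexts."""
    context = counts.get((w1, w2))
    if not context:
        return 0.0
    return context[w3] / sum(context.values())


corpus = "the cat sat on the mat the cat ran".split()
model = train_trigram(corpus)
# "the cat" is followed once by "sat" and once by "ran".
print(trigram_prob(model, "the", "cat", "sat"))  # 0.5
```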

  14. Language model: architecture of feed-forward neural network language models • Word embeddings • Gated Linear Units to the rescue! • Hierarchical Softmax: output probabilities for all words in the dictionary • “Language Modeling with Gated Convolutional Networks”, Dauphin et al., ICML, 2017

  16. Acoustic model: training with the ASG loss (criterion) • ASG stands for Auto Segmentation Criterion • Segmentation problem: say "cab" is the target (the letter vocabulary is {a, b, c}) • Over 4 frames, it can be written caab, ccab or cabb • Unnormalized transition scores between letters • A neural network scores the letters a, b, c at each frame • “Wav2Letter: an End-to-End ConvNet-based Speech Recognition System”, Collobert et al., arXiv, 2016
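
The segmentation problem on this slide can be illustrated by listing every way a target can be stretched over a fixed number of frames. This is a brute-force Python sketch for intuition only; the function name `alignments` is hypothetical, and a real criterion would sum over these paths with dynamic programming rather than enumerate them.

```python
from itertools import combinations


def alignments(target: str, num_frames: int):
    """All ways to stretch `target` over `num_frames` frames by repeating letters."""
    n = len(target)
    if n == 0 or num_frames < n:
        return []
    results = []
    # Choose the frames at which letters 2..n start; letter 1 starts at frame 0.
    for starts in combinations(range(1, num_frames), n - 1):
        bounds = (0, *starts, num_frames)
        results.append(
            "".join(letter * (bounds[i + 1] - bounds[i])
                    for i, letter in enumerate(target))
        )
    return results


print(sorted(alignments("cab", 4)))  # ['caab', 'cabb', 'ccab']
```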

  17. Acoustic model: training, CTC vs ASG • CTC stands for Connectionist Temporal Classification • Extensively used in Speech Recognition, Optical Character Recognition • Has a blank label ø • Handles letøter repetitions • Handles garbage frames • “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks”, A. Graves et al., ICML, 2006 • “Letter-Based Speech Recognition with Gated ConvNets”, Liptchinsky et al., arXiv 2017
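
The role of the blank label is easiest to see in the rule that maps a CTC frame sequence to text: merge repeated symbols first, then drop the blanks. A minimal Python sketch of that rule follows; the name `ctc_collapse` is illustrative and not taken from any toolkit.

```python
from itertools import groupby

BLANK = "ø"


def ctc_collapse(frames: str) -> str:
    """Merge runs of identical symbols, then remove the blank label."""
    merged = (symbol for symbol, _run in groupby(frames))
    return "".join(symbol for symbol in merged if symbol != BLANK)


# The blank between the two t's keeps them from being merged into one.
print(ctc_collapse("lleetøtteerr"))  # letter
print(ctc_collapse("lleettteerr"))   # leter
```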

  19. Decoder: how it works • Components: acoustic model A, lexicon L, word LM G • Beam search, constrained to a fixed beam size • Bookkeeping of (A, L, G) positions • At each time step: • For each previous hypothesis (A, L, G, score) • Add new hypotheses constrained to L • If a word is emitted, add the score from G • Merge new hypotheses leading to the same (L, G) states • Example: prefix (L) “The cat|sa”, words (G) “The cat|sat” • For details on the differentiable decoder, see the paper cited here • “A Fully Differentiable Beam Search Decoder”, Collobert et al., arXiv, 2019
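
The loop on this slide can be sketched in a few lines of Python. This is a toy under loud assumptions, not the wav2letter++ decoder and not the differentiable decoder: G is reduced to a unigram word score, repeated letters are collapsed, merged hypotheses keep the maximum score, and `beam_search`, `frames_from`, and the toy scores are all hypothetical.

```python
import math


def beam_search(emissions, word_lm, beam_size=4):
    """Lexicon-constrained beam search; state = (words so far, current prefix)."""
    prefixes = {w[:i] for w in word_lm for i in range(1, len(w) + 1)}
    beam = {((), ""): 0.0}
    for frame in emissions:  # one dict of symbol -> score per time step (A)
        candidates = {}
        for (words, prefix), score in beam.items():
            last = prefix[-1] if prefix else "|"
            for symbol, symbol_score in frame.items():
                new_score = score + symbol_score
                if symbol == last:            # repeated symbol: same state
                    state = (words, prefix)
                elif symbol == "|":           # word boundary: must match L
                    if prefix not in word_lm:
                        continue
                    state = (words + (prefix,), "")
                    new_score += word_lm[prefix]  # word emitted: add G score
                elif prefix + symbol in prefixes:
                    state = (words, prefix + symbol)
                else:                         # not a prefix of any word in L
                    continue
                # Merge hypotheses that reach the same (L, G) state.
                if new_score > candidates.get(state, -math.inf):
                    candidates[state] = new_score
        ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
        beam = dict(ranked[:beam_size])
    finished = {state: s for state, s in beam.items() if state[1] == ""}
    return max(finished, key=finished.get)[0] if finished else ()


def frames_from(text, alphabet="|acehmst"):
    """Toy emissions: a high score for the given symbol, low for the others."""
    return [{s: math.log(0.9 if s == c else 0.01) for s in alphabet}
            for c in text]


word_lm = {"the": math.log(0.5), "cat": math.log(0.2),
           "sat": math.log(0.2), "mat": math.log(0.1)}
print(beam_search(frames_from("|tthe|caat|ssat|"), word_lm))
# ('the', 'cat', 'sat')
```

Keeping the prefix inside the state is what lets hypotheses that spell the same partial word merge instead of competing for beam slots.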

  20. Toolkit

  21. wav2letter++ design • Components: Recipes (WSJ, LibriSpeech...); wav2letter++ Executables (Train, Test, Decode); NN Lib (Autograd, Modules, Serialization, Training); Criteria (CTC, ASG, Seq2seq); Collectives Com. Lib. (NCCL, MPI); ArrayFire Tensor Library; Accelerator Package (CuDNN, NNPACK) • Why C++? • It’s fast • It’s fast • Type safety/static typing • It’s fast • Why ArrayFire? • JIT compilation • Portability: supports CUDA, CPU, OpenCL • NCCL, MPI: GPU (NVIDIA) and CPU communication libs • CuDNN, NNPACK: GPU (NVIDIA) and CPU accelerator packages • “wav2letter++: The Fastest Open-source Speech Recognition System”, Pratap et al., ICASSP, 2019

  22. PReLU implementation • TensorFlow/Keras • Intermediate results (pos, neg) are evaluated, then added

          def call(self, inputs, mask=None):
              pos = K.relu(inputs)
              if K.backend() == 'theano':
                  neg = (
                      K.pattern_broadcast(self.alpha, self.param_broadcast)
                      * (inputs - math_ops.abs(inputs))
                      * 0.5)
              else:
                  neg = -self.alpha * K.relu(-inputs)
              return pos + neg

  23. PReLU implementation with a JIT • ArrayFire • Intermediate results are NOT evaluated; the JIT avoids intermediate copies • Works on CPU and GPU

          Variable PReLU::forward(const Variable &input) {
            auto mask = input >= 0.0;
            return (input * mask) + (input * !mask * tileAs(m_parameters[0], input));
          }
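
For readers who want to check the arithmetic, the mask-based formulation can be reproduced in plain Python. This sketch mirrors only the math (pass non-negative inputs through, scale negative ones); it assumes a single shared slope `alpha` and has none of the JIT or tensor behavior discussed on the slide.

```python
def prelu(inputs, alpha):
    """PReLU via a 0/1 mask: x where x >= 0, alpha * x elsewhere."""
    outputs = []
    for x in inputs:
        mask = 1.0 if x >= 0.0 else 0.0
        outputs.append(x * mask + x * (1.0 - mask) * alpha)
    return outputs


print(prelu([-2.0, 0.0, 3.0], alpha=0.25))  # [-0.5, 0.0, 3.0]
```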

  24. gfor and batchFunc • gfor: parallel (vectorized) loop • batchFunc: execute a function on a batch, in parallel • Example: Hamming window batched over input

          // Apply a Hamming window to speech frames in parallel
          coefs = 0.54 - 0.46 * af::cos(2 * M_PI * af::iota(N, 1) / (N - 1));
          af::array multiplyOp(const af::array& a, const af::array& b) { return a * b; }
          af::batchFunc(coefs, input, multiplyOp);
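
The same Hamming coefficients can be computed in plain Python to see what the batched multiply does to each frame. This is a sequential sketch with illustrative names (`hamming_window`, `apply_window`); it does not use ArrayFire and does not run in parallel.

```python
import math


def hamming_window(n: int):
    """Hamming coefficients 0.54 - 0.46 * cos(2*pi*i / (n - 1)) for i in 0..n-1."""
    if n < 2:
        raise ValueError("window length must be at least 2")
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(frames, window):
    """Multiply every frame element-wise by the same window."""
    return [[sample * w for sample, w in zip(frame, window)] for frame in frames]


window = hamming_window(5)
frames = [[1.0] * 5, [2.0] * 5]  # two toy frames of five samples each
for row in apply_window(frames, window):
    print([round(value, 3) for value in row])
```

The window tapers each frame toward its edges (about 0.08 at both ends, 1.0 in the middle), and every frame is scaled by the same coefficients.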

  25. Flashlight • Neural network library • From the creators of Torch • Entirely written in C++ • JIT compilation • CPU and GPU backends • https://github.com/facebookresearch/flashlight

  26. benchmarks

  27. Word Error Rate (WER) • How it is computed: • Levenshtein distance between the transcription produced by the ASR system and the reference, at the word level, divided by the number of words in the reference • Examples: • REF: the cat sat on the mat • HYP: the cat sat mat • WER: 33%, 2 deletions • REF: the cat sat on the mat • HYP: the bat sat on at the mat • WER: 33%, 1 substitution, 1 insertion
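
The two examples on this slide can be checked with a short word-level edit-distance routine. This is a minimal Python sketch, assuming whitespace tokenization and no text normalization; the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


ref = "the cat sat on the mat"
print(round(word_error_rate(ref, "the cat sat mat"), 2))            # 0.33
print(round(word_error_rate(ref, "the bat sat on at the mat"), 2))  # 0.33
```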

  28. Fully Convolutional ASR • Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert • All results in Word Error Rate • [1] Deep Recurrent Neural Networks for Acoustic Modelling, Chan and Lane • [2] Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, Amodei et al. • [3] Towards better decoding and language model integration in sequence to sequence models, Chorowski and Jaitly

  29. ASR toolkits

  30. Benchmark: training epoch time (log-scale) • 8-GPU nodes (Tesla V100), 100 Gbps InfiniBand • CTC training; Kaldi: LF-MMI • Models: 30M parameters (2 convolutions, 5 bi-LSTM); 100M parameters (19 convolutions)

  31. Benchmark: training epoch time • 8-GPU nodes (Tesla V100), 100 Gbps InfiniBand • CTC training; Kaldi: LF-MMI • Model: 100M parameters (19 convolutions)

  32. Benchmark: decoding • LibriSpeech dev-clean, 4-gram LM • Same pre-computed emissions for all frameworks • Chart annotation: does not support n-gram LM