Use of CUDA for Continuous Space Language Model
Elizabeth A. Thompson, Ph.D. (a), Timothy R. Anderson, Ph.D. (b)
(a) Purdue University, Fort Wayne, Fort Wayne, IN, USA 46805
(b) Air Force Research Lab, Wright Patterson Air Force Base, Dayton, OH, USA 45433
Outline I. CSLM Algorithm II. Use of CUDA III. CUDA Architecture IV. CUDA Implementation of CSLM V. Results VI. Conclusions
Continuous-Space Language Models (CSLM) This work was based on the article “Continuous-Space Language Models for Statistical Machine Translation” by Holger Schwenk of the University of Le Mans, France, published in the Prague Bulletin of Mathematical Linguistics, January 2010, and on his corresponding open-source implementation.
CSLM (Cont'd) The CSLM consists of a 3-layer neural network: a projection layer, a hidden layer, and an output layer. Input: a 3-word sequence. Output: the probability of every word in the vocabulary being the 4th word in the sequence.
Training of the CSLM The neural network must be trained through a process of adaptive learning. It is trained using a series of 63,070 overlapping 4-grams such as the following, where the last word of each 4-gram is the target word (a small sketch of this windowing follows the list):
• Prague Stock Market falls
• Stock Market falls to
• Market falls to minus
• falls to minus by
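To make the sliding-window construction concrete, here is a small illustrative sketch (not the CSLM toolkit's own code); the word indices in `sentence` are hypothetical:

```cuda
/* Illustrative sketch: slide a 4-word window over a tokenized sentence; the
 * first 3 indices are the context and the 4th is the target word the
 * network learns to predict. The word indices below are hypothetical. */
#include <stdio.h>

int main(void)
{
    /* hypothetical indices for "Prague Stock Market falls to minus ..." */
    int sentence[] = { 4107, 862, 915, 2336, 7, 3981, 120 };
    int n = (int)(sizeof(sentence) / sizeof(sentence[0]));

    for (int i = 0; i + 3 < n; i++) {
        printf("context: %d %d %d  target: %d\n",
               sentence[i], sentence[i + 1], sentence[i + 2], sentence[i + 3]);
    }
    return 0;
}
```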
Training of the CSLM (Cont’d) The text file vocab.txt contains the list of vocabulary terms. Each of the 14,024 terms in vocab.txt is assigned a numerical index, which is used for training the neural network:
• index 0: >
• index 1: -
• …
• index 619: abandon
Training the Neural Network In the training stage, values are propagated in the forward direction through the neural network, applying weighting values to the input data, and errors are then propagated in the reverse direction to improve these weighting factors.
Projection Layer The projection layer maps each of the 3 input words to a unique vector of 256 values. Initially, these vectors are generated as uniformly distributed random values, but they change as the neural network is trained. For each input word, the corresponding 256-value vector is the output of the projection layer.
Projection Layer (Cont’d) The projection layer consists of a lookup table.
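As a rough sketch of that lookup (hypothetical code, not Schwenk's implementation), the 256-value row of each of the 3 context words can be copied side by side into a 768-value projection vector:

```cuda
/* Minimal sketch of the projection-layer lookup: one 256-float row per
 * vocabulary word, and the rows of the 3 context words are concatenated
 * into the 768-value projection output for one example. */
#include <string.h>

#define PROJ_DIM   256     /* values per word           */
#define CONTEXT    3       /* context words per example */
#define VOCAB_SIZE 14024   /* rows in the lookup table  */

void project(const float table[VOCAB_SIZE][PROJ_DIM],
             const int context[CONTEXT],
             float out[CONTEXT * PROJ_DIM])
{
    for (int w = 0; w < CONTEXT; w++) {
        /* copy the row of context word w into its slot of the output */
        memcpy(out + w * PROJ_DIM, table[context[w]],
               PROJ_DIM * sizeof(float));
    }
}
```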
Hidden Layer For the forward pass, the output of the projection layer is fed as input to the hidden layer (see the sketch below):
• 192x768 weight matrix
• 768x128 output of the projection layer
• 192x128 bias matrix
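A plain-C reference sketch of this forward step with the dimensions above; reading the 128 columns as a block of training examples and using a tanh activation are assumptions, not stated on the slide:

```cuda
/* CPU reference sketch of the hidden-layer forward step:
 * H (192x128) = tanh( W (192x768) * P (768x128) + B (192x128) ).
 * The tanh activation and the block-of-examples reading are assumptions. */
#include <math.h>

#define H_DIM 192   /* hidden units                 */
#define P_DIM 768   /* projection-layer output size */
#define BATCH 128   /* examples per block           */

void hidden_forward(const float *W,  /* H_DIM x P_DIM, row-major */
                    const float *P,  /* P_DIM x BATCH, row-major */
                    const float *B,  /* H_DIM x BATCH, row-major */
                    float       *H)  /* H_DIM x BATCH, row-major */
{
    for (int i = 0; i < H_DIM; i++) {
        for (int j = 0; j < BATCH; j++) {
            float acc = B[i * BATCH + j];
            for (int k = 0; k < P_DIM; k++)
                acc += W[i * P_DIM + k] * P[k * BATCH + j];
            H[i * BATCH + j] = tanhf(acc);
        }
    }
}
```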
Output Layer For the forward pass, the output of the hidden layer is fed as input to the output layer. After applying these weights and biases, a softmax normalization is applied:
• 14024x192 weight matrix
• 192x128 output of the hidden layer
• 14024x128 bias matrix
Backward Pass for Training The error of the output compared to the target value is propagated backward through the network. Weights and biases in the output layer and then the hidden layer are updated. Finally, the projection layer table is updated to reflect the results of the forward pass.
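Schematically, each of these updates amounts to a small gradient-descent correction of the parameters; the sketch below is generic, and the names (`lr`, `grad`) are hypothetical:

```cuda
/* Schematic sketch of one parameter update during the backward pass: a plain
 * gradient-descent step. The actual CSLM training also handles the
 * layer-by-layer propagation of the error. */
void sgd_update(float *weights, const float *grad, int n, float lr)
{
    for (int i = 0; i < n; i++)
        weights[i] -= lr * grad[i];   /* step against the gradient */
}
```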
Outline I. CSLM Algorithm II. Use of CUDA III. CUDA Architecture IV. CUDA Implementation of CSLM V. Results VI. Conclusions
CUDA for CSLM The GPU is specialized for compute-intensive, highly parallel computation. All NVIDIA GPUs can support at least 768 concurrently active threads per multiprocessor. However, there is an overhead associated with using the GPU.
GPU Overhead To use the GPU, memory must be allocated on both the host CPU and the GPU. Variables to be used in the computation must be transferred to the GPU. The computation is then performed on the GPU. Finally, the results must be transferred back to the host CPU.
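The host-side pattern looks roughly like the following CUDA runtime sketch (error checking omitted):

```cuda
/* Sketch of the host-side pattern described above: allocate device memory,
 * upload the inputs, compute on the GPU, download the results, free. */
#include <cuda_runtime.h>
#include <stddef.h>

void gpu_roundtrip(const float *h_in, float *h_out, size_t n)
{
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));       /* device allocation */
    cudaMemcpy(d_buf, h_in, n * sizeof(float),
               cudaMemcpyHostToDevice);                   /* upload inputs     */

    /* ... launch kernels or library calls that operate on d_buf ... */

    cudaMemcpy(h_out, d_buf, n * sizeof(float),
               cudaMemcpyDeviceToHost);                   /* download results  */
    cudaFree(d_buf);                                      /* release memory    */
}
```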
Outline I. CSLM Algorithm II. Use of CUDA III. CUDA Architecture IV. CUDA Implementation of CSLM V. Results VI. Conclusions
CUDA Architecture A GPU consists of streaming multiprocessors, each of which contains a number of processors (cores).
CUDA Architecture (Cont’d) • The CUDA programmer defines functions, called kernels. • A kernel is executed as a grid of thread blocks. • The number of threads per block and the number of threads per multiprocessor depend on the compute capability of the CUDA device.
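A minimal, self-contained example of this model (not taken from the CSLM code): a vector-addition kernel executed as a grid of thread blocks, with an arbitrary block size of 128 threads:

```cuda
/* Each thread adds one element of two vectors; the grid size is chosen so
 * that every element is covered. */
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

void launch_vecAdd(const float *d_a, const float *d_b, float *d_c, int n)
{
    int threadsPerBlock = 128;                                /* arbitrary   */
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
}
```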
Outline I. CSLM Algorithm II. Use of CUDA III. CUDA Architecture IV. CUDA Implementation of CSLM V. Results VI. Conclusions
Implementation of CSLM using CUDA The CSLM algorithm is highly computationally intensive and a good candidate for implementation with CUDA. The matrix multiplications in the hidden and output layers, in both the forward and backward passes, are highly parallel.
CUBLAS Routines for CSLM CUBLAS is a CUDA implementation of BLAS (Basic Linear Algebra Subprograms), which performs matrix operations such as matrix multiplication. The CUBLAS routines provide matrix multiplication and handle all overhead issues regarding the programming of threads; the programmer does not need to define kernels, grids, or thread blocks.
CUBLAS Implementation of CSLM The matrix operations were replaced with the CUBLAS function cublasSgemm(), which performs the operation C = α op(A) op(B) + β C, where op(X) is either X or its transpose. A, B, and C are matrices containing single-precision values (floats); α and β are scalars.
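For illustration, the hidden-layer product from the forward pass could be expressed with cublasSgemm() roughly as follows (cuBLAS v2 API, column-major storage, error checking omitted). Preloading the bias matrix into C and using β = 1 is one possible way to fold in the bias addition, not necessarily what the CSLM code does:

```cuda
/* Sketch: C(192x128) = alpha * A(192x768) * B(768x128) + beta * C,
 * with A the hidden-layer weights, B the projection-layer output, and
 * C preloaded with the bias matrix. */
#include <cublas_v2.h>

void hidden_gemm(cublasHandle_t handle,
                 const float *d_W,  /* 192 x 768 weight matrix, on device     */
                 const float *d_P,  /* 768 x 128 projection output, on device */
                 float       *d_H)  /* 192 x 128, preloaded with the bias     */
{
    const float alpha = 1.0f;
    const float beta  = 1.0f;   /* keep the bias already stored in d_H */

    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                192, 128, 768,
                &alpha, d_W, 192,
                        d_P, 768,
                &beta,  d_H, 192);
}
```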
CUBLAS Implementation of CSLM (Cont’d) • NVIDIA Performance Primitives Library (NPP) • nppsExp_32f_I – performs an exponential operation “in-place” on single precision values • nppsMulC_32f_I – performs “in-place” multiplication of a single precision matrix by a constant. • These functions were used to implement the softmax normalization operations.
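A minimal sketch of such a softmax over one output column, combining these NPP calls with a cuBLAS sum. Using cublasSasum() for the sum is an assumption made here for illustration (the exponentiated values are positive, so the absolute sum equals the sum); the usual max-subtraction for numerical stability and error checking are omitted:

```cuda
/* Softmax sketch: exponentiate in place, sum the column, scale by 1/sum. */
#include <cublas_v2.h>
#include <npps.h>

void softmax_column(cublasHandle_t handle, float *d_col, int len)
{
    float sum = 0.0f;

    nppsExp_32f_I(d_col, len);                 /* in-place exponential         */
    cublasSasum(handle, len, d_col, 1, &sum);  /* sum of the (positive) values */
    nppsMulC_32f_I(1.0f / sum, d_col, len);    /* in-place scale by 1/sum      */
}
```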
Outline I. CSLM Algorithm II. Use of CUDA III. CUDA Architecture IV. CUDA Implementation of CSLM V. Results VI. Conclusions
Comparison of revised CUDA version using Quadro FX 5800 vs. original Schwenk algorithm using MKL
Outline I. CSLM Algorithm II. Use of CUDA III. CUDA Architecture IV. CUDA Implementation of CSLM V. Results VI. Conclusions
Conclusions A framework has been provided to introduce CUDA to the CSLM, and a time savings over the traditional CPU approach has been demonstrated. The CUBLAS and NPP libraries provide a good starting point for the use of GPUs. For best performance, avoid redundant uploading and downloading of interim results.
Conclusions (Cont’d) GPUs provide a substantial performance benefit at relatively low cost, making high-performance computing accessible to the average user. The availability of GPUs in laptops may make them more appealing and practical than a supercomputer for some applications.