GKLEE: Concolic Verification and Test Generation for GPUs
Guodong Li, Fujitsu Labs of America
Peng Li, Geof Sawaya, and Ganesh Gopalakrishnan, School of Computing, University of Utah
Indradeep Ghosh and Sreeranga P. Rajan, Fujitsu Labs of America
Work associated with the Center for Parallel Computing (CPU) and the Gauss Group at Utah
http://www.cs.utah.edu/fv/GKLEE
Motivation for this work
• GPUs are exciting in so many ways
  • Parallelism for the masses!
  • Growing relevance: from hand-held devices to Exascale
• There are many ways to arrive at GPU code:
  • Write it from scratch
  • Various compilation approaches
• Debugging GPU code is important
  • Library functions, students learning GPU programming, ...
• Compiler transformations need to be verified as well
• We contribute GKLEE, a tool that finds real bugs
• Main take-away message: formal methods can be exciting and practical in the GPU domain!
What is GKLEE?
• A CUDA/C++ concrete + symbolic (concolic) execution tool
• Designers can decide which variables to declare as symbolic (a minimal sketch follows this slide)
  • Symbolic execution considers all possible values, not just the test inputs that the designer happened to pick
  • This is made possible by the power of SMT (constraint) solving
  • Provides far more incisive coverage
  • Yet bugs are displayed as concrete traces
  • Concolic tools can also generate tests that can be run on the hardware
• GKLEE also models all possible schedules
  • E.g., different warps executed in different orders
  • Helps expose bugs that are execution-platform dependent
  • GKLEE does this very efficiently by exploring a canonical schedule
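To make the "declare variables symbolic" point concrete, here is a minimal driver sketch, modeled on the examples later in this deck (the kernel myKernel and the launch geometry are hypothetical):

#include "klee.h"

#define N 32

__global__ void myKernel(int *in);  // hypothetical kernel under test

int main() {
  int in[N];
  // Instead of picking one concrete test vector, declare the whole
  // input symbolic; GKLEE then reasons about ALL possible values.
  klee_make_symbolic(in, sizeof(in), "in_symb");
  // Optionally constrain the input space with an assumption.
  klee_assume(in[0] >= 0);
  __modify_Block(N);   // GKLEE directive: one block of N threads
  __begin_GPU();
  myKernel(in);
  __end_GPU();
  return 0;
}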
Value of GKLEE to CUDA Programmers
• Finds deadlocks caused by incorrect uses of __syncthreads
  • GKLEE detects barriers that are not textually aligned
• GKLEE can help verify functional correctness
  • Verification can be conducted over symbolic inputs
• Detects many types of races (a tiny illustration follows this slide)
  • Shared memory races:
    • Intra-warp under warp divergence (we call it a "porting race")
    • Intra-warp without warp divergence
    • Inter-warp races
  • Global memory races
• GKLEE can solve control-flow constraints and generate a test input that exposes a race (Example-5, presented later)
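As a flavor of what "race" means here, the following tiny kernel (our own illustrative example, not from the GKLEE distribution) contains a write-read race on shared memory:

// Thread t reads s[t+1] before the thread that owns s[t+1] has
// necessarily written it -- there is no barrier in between. With 64
// threads, the conflicting pair (31, 32) even spans two warps, so
// this is an inter-warp shared memory race.
__global__ void shiftLeft(int *out) {
  __shared__ int s[64];
  int t = threadIdx.x;
  s[t] = t;
  // BUG: a __syncthreads() is needed here
  if (t < 63)
    out[t] = s[t + 1];
}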
Value of GKLEE to CUDA Programmers (contd.)
• Detects many causes of performance loss
  • Bank conflicts, warp divergences, non-coalesced memory accesses (a bank-conflict illustration follows this slide)
  • Currently reported as a percentage of affected barrier intervals / warps
  • Considers all inputs and schedules
  • Again, it is sufficient to analyze the canonical schedule
• Multi-kernel examples with 2K threads have been verified
• Additional scalability through parameterized verification (in progress)
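For intuition on the bank-conflict checks, here is a classic access pattern (again our own illustration) that such an analysis reports under the 2.x rules (32 four-byte banks):

// Stride-2 access: threads t and t+16 of the same warp both hit bank
// (2*t) % 32, at different addresses -- a 2-way bank conflict that
// serializes each warp's shared memory access into two transactions.
__global__ void strided(float *out) {
  __shared__ float s[64];
  int t = threadIdx.x;
  s[t] = (float) t;   // conflict-free: one word per bank
  __syncthreads();
  out[t] = s[(2 * t) % 64];
}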
Architecture of GKLEE
• GKLEE was realized by extending KLEE (Cadar, Dunbar, Engler, OSDI 2008)
• GKLEE employs a symbolic virtual machine that "understands" CUDA
GKLEE through examples
• Basic usage (including Emacs mode)
• Example-1: Porting a prefix-sum example
  • Automation of these steps is in progress
• Example-2: Bitonic sort
  • Shows automatic test generation covering execution paths
• Example-3: Deliberately introduced deadlock (Sanders/Kandrot, p. 88)
  • The textbook shows the risk of "too much optimization"; GKLEE can be a safety net
• Example-4: A multi-kernel example: (AB)^T = B^T A^T
  • The whole assertion was verified for 2K threads
  • A broken calculation is immediately caught (no wading through results)
• Example-5: Detecting "unexpected" bank conflicts
  • The code claims that all bank conflicts have been eliminated
  • Yet GKLEE finds bank conflicts (and provides a scenario)
• Example-6: Input-dependent race / bank conflict in an SDK kernel
  • The racing location was input-dependent (likewise for the bank conflict)
  • Without symbolic analysis, it is nearly impossible to hit these errors
GKLEE features not covered by these examples (see our paper)
• Test generation and reduction heuristics
• Scripts to convert GKLEE tests to hardware tests
• Different kinds of races
  • Shared memory vs. global memory races
  • Intra-warp races
    • With warp divergence ("porting race")
    • Without warp divergence
  • Inter-warp races
• Bank conflicts and non-coalesced accesses
  • Computed with respect to the 1.x and 2.x rules
• Bugs revealed as a function of compiler optimization level
  • Volatile bugs
  • Other compilation issues
• Handy Emacs mode with
  • Thread, block, warp stepping
  • Ability to see the LLVM byte codes
  • Trace actions with respect to the source code
Basic Usage
• Shell mode:
    klee-l++ <options> file.C
    gklee --device-capability=[0,1,2] file.o
• Emacs mode:
    Load gklee-mode.el
    Visit the buffer for file.C
    ESC-gr runs GKLEE
    View results in the buffers *gklee-compile-debug*, *gklee-run-debug*, and *gklee-run*
    Click on trace files in *gklee-run* and navigate the traces
    See the GKLEE manual for commands that control the various views
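For instance, checking a single file at compute capability 2.x looks like this (the file name is hypothetical; the compile step is assumed to emit the .o bytecode file that gklee consumes, per the pattern above):

    klee-l++ prefixsum.C
    gklee --device-capability=2 prefixsum.o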
Example-1: Porting Prefix-Sum (example adapted from the Allinea DDT distribution)

#include "cutil.h"
#include "klee.h"
#include <stdio.h>
#include <stdlib.h>

bool verify(int data[], int ROM_data[], int length) {
  // Do a prefix-sum sequentially onto ROM_data
  for (int i = 1; i < length; ++i) {
    ROM_data[i] += ROM_data[i-1];
    printf("ROM_data[%d]=%d\n", i, ROM_data[i]);
  }
  // Now, verify
  for (int i = 1; i < length; ++i) {
    if (data[i] != ROM_data[i]) {
      printf("error, results disagree at loc %d\n", i);
      return false;
    }
  }
  return true;
}

//#define BLOCK_SIZE 64
#define BLOCK_SIZE 32

__global__ void prefixsumblock(int *in, int *out, int length);
__global__ void correctsumends(int *ends, int *in, int *out);
__global__ void gathersumends(int *in, int *out);
__global__ void zarro(int *data, int length);

void prefixsum(int *in, int *out, int length) {
  //dim3 dimGrid(blocks, 1, 1);
  __modify_Grid(blocks, 1);
  //dim3 dimBlock(BLOCK_SIZE, 1, 1);
  __modify_Block(BLOCK_SIZE, 1, 1);
  __begin_GPU();
  zarro(out, length);
  __end_GPU();
  ...
}
Example-1 (contd.) intmain(intargc, char *argv[]) { int length; if (argc < 2) { length = NITEMS; } else length = atoi(argv[1]); int *data = (int*) malloc(length * sizeof(int)); int *ROM_data = (int*) malloc(length * sizeof(int)); klee_make_symbolic(data, NITEMS * sizeof(int), "data_symb"); klee_assume(data[0] != data[1]); // Copy all the symbolic stuff in! for (inti = 0; i < length; ++i) { ROM_data[i] = data[i]; } // Fun fooling compiler, making it do both paths if (data[0] < data[1]) { printf("a\n"); cudasummer(data, length); } else { printf("b\n"); cudasummer(data, length); } if (length < 1000) for (inti = 0 ; i < length; ++i) { printf("%d\n", data[i]); } // Symbolic verification verify(data, ROM_data, length); }
Example-1 (contd.)
• With the indicated changes, the example can be easily verified
• With the trick forcing the compiler to consider both paths, we can examine the behavior under two scenarios
• The kernel verifies fine
• Seeded calculation bugs are easily caught (try breaking the computation)
Example-2: Bitonic Sorting
• CUDA SDK 2.0 example
• Can be verified for functional correctness
• The concolic verifier generates 28 (or so) paths
  • For each conditional on symbolic data, GKLEE forks two executions (see the sketch after this slide)
• Test-limiting heuristics are available
  • -Path-Reduce:
    • B: item covered by some thread at least once
    • T: item covered by all threads at least once
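To see where these paths come from, recall that every branch on symbolic data forks the execution. A minimal sketch of the mechanism (our own illustration, not SDK code; the variable and its branch are made up):

#include "klee.h"

int main() {
  int x;
  klee_make_symbolic(&x, sizeof(x), "x");
  // GKLEE forks here: one execution carries the constraint x > 0,
  // the other carries x <= 0; each path gets its own generated test.
  if (x > 0) {
    /* path 1 */
  } else {
    /* path 2 */
  }
  return 0;
}

In bitonic sort, each data-dependent comparison in the swap network forks in the same way, which is what yields the ~28 paths.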
#include "cutil.h" #include "klee.h" #include "stdio.h" #ifdef _SYM #define NUM 4 #else #define NUM 6 #endif __shared__ intshared[NUM]; __device__ inline void swap(int & a, int & b) { inttmp = a; a = b; b = tmp; } __global__ void BitonicKernel(int * values) { unsigned inttid = threadIdx.x; // Copy input to shared mem. shared[tid] = values[tid]; printf("tid: %d, blockDim: %d\n", tid, blockDim.x); __syncthreads(); Example-2: Bitonic Sorting int main() { #ifdef _SYM //__device__ intvalues[NUM]; __input__ int *values = (int *)malloc(sizeof(int) * NUM); #else __input__ intvalues[NUM] = {6, 5}; // , 2, 1, 4, 3}; //__input__ intvalues[NUM] = {6, 5, 2, 1, 4, 3}; // for debugging printf("\nInput values:\n"); for (inti = 0; i < NUM; i++) { printf("%u ", values[i]); } printf("\n"); #endif klee_make_symbolic(values, sizeof(int)*NUM, "values"); int *dvalues; cudaMalloc((void **)&dvalues, sizeof(int) * NUM); cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice); __modify_Block(NUM); __begin_GPU(); BitonicKernel(dvalues); __end_GPU(); #ifndef _SYM // for debugging for (inti = 0; i < NUM; i++) { printf("%d ", values[i]); } printf("\n"); #endif // here blockDim.x should be NUM; we use this hack for (inti = 1; i < NUM; i++) { if (dvalues[i] < dvalues[i-1]) { printf("The sorting algorithm is incorrect since values[%d] < values[%d]!\n", i, i-1); return 1; } } cudaFree(dvalues); cudaFree(values); return 0; } // Parallel bitonic sort. for (unsigned intk = 2; k <= blockDim.x; k *= 2) { for (unsigned intj = k / 2; j>0; j /= 2) { unsigned intixj = tid ^ j; if (ixj > tid) { if ((tid & k) == 0) { if (shared[tid] > shared[ixj]) swap(shared[tid], shared[ixj]); } else { if (shared[tid] < shared[ixj]) swap(shared[tid], shared[ixj]); }} __syncthreads(); } } // Write result. values[tid] = shared[tid]; }
Example-3: Deadlock due to an incorrect __syncthreads call in dot-product
(Illustration from p. 88 of Sanders and Kandrot, "CUDA By Example")

// corrected code, as suggested on page 88
while (i != 0) {
  if (cacheIndex < i)
    cache[cacheIndex] += cache[cacheIndex + i];
  __syncthreads();
  i /= 2;
}

// buggy code discussed on page 88: the barrier sits inside the
// divergent branch, so threads with cacheIndex >= i never reach it
while (i != 0) {
  if (cacheIndex < i) {
    cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
  }
  i /= 2;
}

Report:
GKLEE: Thread 128 and Thread 127 encounter different barrier sequences, one hits the end of kernel, but the other does not!
t128 found a deadlock: #barriers at the threads:
Example-4: Symbolic verification of a multi-kernel example
Verify symbolically that (AB)^T = B^T A^T, for matrices A, B

int main(int argc, char *argv[]) {
  // const unsigned int seed = 99;
  //doGkleeTransposeTest();
  //doGkleeMultTest();

  // A^T ...
  int *A, *AT; // A: [64 * 32]
  cudaMalloc((void **)&A, sizeof(int) * AN);
  cudaMalloc((void **)&AT, sizeof(int) * AN);
  // Make the input 'A' symbolic...
  klee_make_symbolic(A, sizeof(int) * AN, "A_var");
  __modify_Grid(GRIDSIZE_X, P/BLOCKSIZE); // (1, 2)
  __modify_Block(BLOCKSIZE, BLOCKSIZE);   // (8, 8)
  __begin_GPU();
  MatTrans(A, AT);
  __end_GPU();
  printf("After A's transpose!\n");

  // B^T ...
  int *B, *BT; // B: [32 * 64]
  cudaMalloc((void **)&B, sizeof(int) * BN);
  cudaMalloc((void **)&BT, sizeof(int) * BN);
  // Make the input 'B' symbolic...
  klee_make_symbolic(B, sizeof(int) * BN, "B_var");
  __modify_Grid(P/BLOCKSIZE, GRIDSIZE_Y); // (1, 2)
  __modify_Block(BLOCKSIZE, BLOCKSIZE);   // (8, 8)
  __begin_GPU();
  MatTrans(B, BT);
  __end_GPU();
  printf("After B's transpose!\n");

  // A^T * B^T = C ...
  int *C;
  cudaMalloc((void **)&C, sizeof(int) * CN);
  __modify_Grid(GRIDSIZE_Y, GRIDSIZE_X); // (1, 1)
  __modify_Block(BLOCKSIZE, BLOCKSIZE);  // (8, 8)
  __begin_GPU();
  matrixMul(AT, BT, C, P, DIM_X);
  __end_GPU();
  printf("After AT and BT multiplication!\n");

  // B * A = T
  int *T;
  cudaMalloc((void **)&T, sizeof(int) * CN);
  __modify_Grid(GRIDSIZE_Y, GRIDSIZE_X);
  __modify_Block(BLOCKSIZE, BLOCKSIZE);
  __begin_GPU();
  matrixMul(B, A, T, P, DIM_X);
  __end_GPU();

  // T^T = C'
  int *C_P;
  cudaMalloc((void **)&C_P, sizeof(int) * CN);
  __modify_Grid(GRIDSIZE_X, GRIDSIZE_Y);
  __modify_Block(BLOCKSIZE, BLOCKSIZE);
  __begin_GPU();
  MatTrans(T, C_P);
  __end_GPU();

  if (!matricesEquiv(C, C_P, CN)) {
    printf("**********************\n");
    printf("Post-condition failed!\n");
    printf("**********************\n");
  } else {
    printf("**********************\n");
    printf("Post-condition succeeded!\n");
    printf("**********************\n");
  }

  cudaFree(C_P); cudaFree(T); cudaFree(C);
  cudaFree(B); cudaFree(BT); cudaFree(A); cudaFree(AT);
}
Example-4 (contd.): The actual seeded bug

Inside the matrix multiplication routine:
  //--correct--> C[c + wB * ty + tx] = Csub;
  //--buggy----> C[c + wB * ty + tx] = ++Csub;

GKLEE performs this verification in under a minute on a slow laptop.
Example-5: Find a deep race and bank conflict
These are functions of the input. The designer picks the locations to make symbolic.

int main() {
  __device__ unsigned int d_Histogram[BIN_COUNT];
  __device__ unsigned int d_Data[DATA_N];
  unsigned int h_result[BIN_COUNT];
  __device__ unsigned int data[10];
  klee_make_symbolic(data, sizeof(data), "input");
  for (int i = 0; i < 10; i++)
    d_Data[i] = data[i];

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data) {
  s_Hist[threadPos + IMUL(data, THREAD_N)]++;
}
Example-5 (contd.)

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data) {
  s_Hist[threadPos + IMUL(data, THREAD_N)]++;
}

// The first 10 elements of the d_Data array are symbolic
__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos =
      ((threadIdx.x & (~63)) >> 0) |
      ((threadIdx.x & 15) << 2)    |
      ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
       pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    unsigned data4 = d_Data[pos];
    addData64(s_Hist, threadPos, (data4 >>  2) & 0x3FU);
    addData64(s_Hist, threadPos, (data4 >> 10) & 0x3FU);
    addData64(s_Hist, threadPos, (data4 >> 18) & 0x3FU);
    addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
  }
  __syncthreads();
  ...
}
Example-5 (contd.)

// The first 10 elements of the d_Data array are symbolic
__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos =
      ((threadIdx.x & (~63)) >> 0) |
      ((threadIdx.x & 15) << 2)    |
      ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
       pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    // d_Data[0] (symbolic) for thread 0, and d_Data[8] (symbolic) for thread 8 ...
    unsigned data4 = d_Data[pos];
    addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU);
    ...
  }
  __syncthreads();
  ...
}
Example-5 (contd.)

// The first 10 elements of the d_Data array are symbolic
__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos =
      ((threadIdx.x & (~63)) >> 0) |
      ((threadIdx.x & 15) << 2)    |
      ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
       pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    // d_Data[0] (symbolic) for thread 0, and d_Data[8] (symbolic) for thread 8 ...
    unsigned data4 = d_Data[pos];
    // threadPos: 0 for thread 0, and threadPos: 32 for thread 8
    // s_Hist[threadPos + IMUL(data, THREAD_N)]++;
    addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU);
    ...
  }
  __syncthreads();
  ...
}
Example-5 (contd.)

// The first 10 elements of the d_Data array are symbolic
__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos =
      ((threadIdx.x & (~63)) >> 0) |
      ((threadIdx.x & 15) << 2)    |
      ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
       pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    // d_Data[0] (symbolic) for thread 0, and d_Data[8] (symbolic) for thread 8 ...
    unsigned data4 = d_Data[pos];
    // threadPos: 0 for thread 0, and threadPos: 32 for thread 8
    // s_Hist[threadPos + IMUL(data, THREAD_N)]++;
    // threadPos + IMUL(data, THREAD_N) expands as follows (THREAD_N is 32):
    //   threadPos + 32 * ((d_Data[pos] >> 2) & 0x3FU)
    //     = threadPos + ((d_Data[pos] << 3) & 2016)
    addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU);
    ...
  }
  __syncthreads();
  ...
}
Example-5 (contd.)

// The first 10 elements of the d_Data array are symbolic
__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos =
      ((threadIdx.x & (~63)) >> 0) |
      ((threadIdx.x & 15) << 2)    |
      ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
       pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    // d_Data[0] (symbolic) for thread 0, and d_Data[8] (symbolic) for thread 8 ...
    unsigned data4 = d_Data[pos];
    // threadPos: 0 for thread 0, and threadPos: 32 for thread 8
    // s_Hist[threadPos + IMUL(data, THREAD_N)]++;
    // threadPos + IMUL(data, THREAD_N) expands as follows (THREAD_N is 32):
    //   threadPos + 32 * ((d_Data[pos] >> 2) & 0x3FU)
    //     = threadPos + ((d_Data[pos] << 3) & 2016)
    // Constraint: 0 + ((d_Data[0] << 3) & 2016) == 32 + ((d_Data[8] << 3) & 2016)
    // This constraint is satisfied when d_Data[0] is \x50\xa4\xb8\x84 and
    // d_Data[8] is \x4c\xa0\xb4\x80
    addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU);
    ...
  }
  __syncthreads();
  ...
}
Example-5 (contd.): Generated test case leading to the race

ktest file : 'klee-last/test000005.ktest'
args       : ['histogram64_kernel.o']
num objects: 1
object 0: name: 'input'
object 0: size: 40
object 0: data: '\x50\xa4\xb8\x84\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x4c\xa0\xb4\x80\x00\x00\x00\x00'
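As a sanity check (our own arithmetic, not GKLEE output): reading the first and ninth 32-bit words of the generated input as little-endian values, the histogram indices computed by thread 0 and thread 8 do collide, exactly as the constraint on the previous slide demands:

#include <assert.h>
#include <stdio.h>

int main(void) {
  // Bytes from the ktest file above, read as little-endian words.
  unsigned d0 = 0x84b8a450u;  // d_Data[0] = \x50\xa4\xb8\x84
  unsigned d8 = 0x80b4a04cu;  // d_Data[8] = \x4c\xa0\xb4\x80
  // threadPos is 0 for thread 0 and 32 for thread 8 (earlier slides).
  unsigned loc0 = 0  + ((d0 << 3) & 2016);  // s_Hist index, thread 0
  unsigned loc8 = 32 + ((d8 << 3) & 2016);  // s_Hist index, thread 8
  printf("loc0 = %u, loc8 = %u\n", loc0, loc8);  // both are 640
  assert(loc0 == loc8);  // same s_Hist location, both increments: a race
  return 0;
}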
Concluding Remarks
• A concolic verifier for CUDA/C++
  • Detects correctness / performance issues
  • High coverage, automatic test generation
  • The tool finds issues in well-known (SDK) kernels
• Tool demos during the talk will illustrate these examples
  • We can provide a LiveDVD or ISO image (will be posted at the URL below)
• Our paper provides details on all the issues glossed over here
• Paper, user manual, and example code available from http://www.cs.utah.edu/fv/GKLEE
• Comments / suggestions are very welcome!
Some Future Directions
• Support for CUDA 4.0 features
  • Atomics + SIMD
  • GPU-to-GPU transfers
  • GPU + MPI
• Incorporate into GPU-oriented compilation frameworks
  • E.g., OpenACC, others
• Suggestions are welcome
Consider an arbitrary schedule that brings the execution to the illustrated point where a race FIRST occurs. The race is between actions A and B.

[Diagram: threads P1, P2, ..., Pi, Pi+1, ..., Pj, with conflicting actions A and B]
Then, clearly, the red execution is equivalent to the race-free execution, because it occurs in the race-free region of the execution space.

[Diagram: the same threads, with the "red" schedule running through the race-free region]
Our canonical schedule is shown by the dashed edges here.

[Diagram: the same threads, with the canonical schedule drawn as dashed edges]
The extra executions should not matter, unless they themselves race! But that race would then be caught! So, in the absence of ANY race, ALL schedules within a barrier interval are equivalent.

[Diagram: the same threads, with an "extra execution" highlighted]
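A minimal sketch of how this argument turns into a checking algorithm (our own simplification, not GKLEE's actual data structures): execute the threads of a barrier interval one after another under the canonical schedule, log every shared access, and report any conflicting pair.

#include <stdbool.h>
#include <stddef.h>

typedef struct { size_t addr; bool isWrite; } Access;

// acc[t] holds the shared-memory accesses thread t performed in one
// barrier interval under the canonical (fully serialized) schedule;
// n[t] is their count. By the argument above, a race under ANY
// schedule shows up as a conflicting pair here.
bool hasRace(const Access *const acc[], const size_t n[], int numThreads) {
  for (int t1 = 0; t1 < numThreads; ++t1)
    for (int t2 = t1 + 1; t2 < numThreads; ++t2)
      for (size_t i = 0; i < n[t1]; ++i)
        for (size_t j = 0; j < n[t2]; ++j)
          if (acc[t1][i].addr == acc[t2][j].addr &&
              (acc[t1][i].isWrite || acc[t2][j].isWrite))
            return true;  // two threads conflict: at least one write
  return false;
}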