GMAC Global Memory for Accelerators

GMACGlobal Memory for Accelerators Isaac Gelado, John E. Stone, Javier Cabezas, Nacho Navarro and Wen-mei W. Hwu GTC 2010

GMAC in a nutshell • GMAC: Unified Virtual Address Space for CUDA • Simplifies the CPU code • Exploits advanced CUDA features for free • Vector addition example • Really simple kernel code • But, what about the CPU code? __global__ void vector(float *c, float *a, float *b, size_t size) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if(idx < size) c[idx] = a[idx] + b[idx]; } GTC 2010

CPU CUDA code (I) • Read from disk, transfer to GPU and compute intmain(intargc, char *argv[]) { float *h_a, *h_b, *h_c, *d_a, *d_b, *d_c; size_t size = LENGTH * sizeof(float); assert((h_a = malloc(size) != NULL); assert((h_b = malloc(size) != NULL); assert((h_c = malloc(size) != NULL); assert(cudaMalloc((void **)&d_a, size) == cudaSuccess)); assert(cudaMalloc((void **)&d_b, size) == cudaSuccess)); assert(cudaMalloc((void **)&d_c, size) == cudaSuccess)); read_file(argv[A], h_a); read_file(argv[B], h_b); assert(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice) == cudaSuccess); assert(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice) == cudaSuccess); GTC 2010

CPU CUDA code (and II) • Read from disk, transfer to GPU and compute Db(BLOCK_SIZE); Dg(LENGTH / BLOCK_SIZE); if(LENGTH % BLOCK_SIZE) Dg.x++; vector<<<Dg, Db>>>(d_c, d_a, d_b, LENGTH); assert(cudaThreadSynchronize() == cudaSuccess); assert(cudaMemcpy(d_c, h_c, LENGTH * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess); save_file(argv[C], h_c); free(h_a); cudaFree(d_a); free(h_b); cudaFree(d_b); free(h_c); cudaFree(d_c); return 0; } GTC 2010

CPU GMAC code intmain(intargc, char *argv[]) { float *a, *b, *c; size_t size = LENGTH * sizeof(float); assert(gmacMalloc((void **)&a, size) ==gmacSuccess)); assert(gmacMalloc((void **)&b, size) ==gmacSuccess)); assert(gmacMalloc((void **)&c, size) ==gmacSuccess)); read_file(argv[A], a); read_file(argv[B],b); Db(BLOCK_SIZE); Dg(LENGTH / BLOCK_SIZE); if(LENGTH % BLOCK_SIZE) Dg.x++; vector<<<Dg, Db>>>(c, a, b, LENGTH); assert(gmacThreadSynchronize() == gmacSuccess); save_file(argv[C], c); gmacFree(a); gmacFree(b); gmacFree(c); return 0; } There is no memory copy There is no memory copy GTC 2010

Getting GMAC • GMAC is at http://adsm.googlecode.com/ • Debian / Ubuntu binary and development .deb files • UNIX (also MacOS X) source code package • Experimental versions from mercurial repository GTC 2010

Outline • Introduction • GMAC Memory Model • Asymmetric Memory • Global Memory • GMAC Execution Model • Multi-threading • Inter-thread communication • Conclusions GTC 2010

GMAC Memory Model • Unified CPU / GPU virtual address space • Asymmetric address space accessibility Shared Data Memory CPU GPU CPU Data GTC 2010

GMAC Consistency Model • Implicit acquire / release primitives at accelerator call / return boundaries CPU ACC CPU ACC GTC 2010

GMAC Memory API • Allocate shared memory gmacError_tgmacMalloc(void **ptr, size_t size) • Allocated memory address (returned by reference) • Gets the size of the data to be allocated • Error code, gmacSuccess if no error • Example usage #include <gmac.h> int main(intargc, char *argv[]) { float *foo = NULL; gmacError_t error; if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess) FATAL(“Error allocating memory %s”, gmacErrorString(error)); . . . } GTC 2010

GMAC Memory API • Release shared memory gmacError_tgmacFree(void *ptr) • Memory address to be released • Error code, gmacSuccess if no error • Example usage #include <gmac.h> int main(intargc, char *argv[]) { float *foo = NULL; gmacError_t error; if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess) FATAL(“Error allocating memory %s”, gmacErrorString(error)); . . . gmacFree(foo); } GTC 2010

GMAC Unified Address Space • Use fixed-size segments to map accelerator memory • Implement and export Accelerator Virtual Memory System Memory Accelerator Memory 0x00100000 0x00100000 CPU Accelerator GTC 2010

GMAC Memory API • Translate shared memory (multi-GPU) void *gmacPtr(void *ptr) template<typename T> T *gmacPtr(T *ptr) • Receives CPU memory address • Returns GPU memory address • Example usage #include <gmac.h> int main(int argc, char *argv[]) { . . . kernel<<<Dg, Db>>>(gmacPtr(buffer), size); . . . } GTC 2010

GMAC Example Code (I) intfdtd(FILE *fpMat, FILE *fpMed, int N) { /* Read and create data structures */ MaterialList materials if(readMaterials(fpMat, materials) == 0) return -1; Media media; if(readMedia(fpMed, media) == 0) return -1; Field field; if(createField(media.dim, field) == 0) return -1; for(int n = 0; n < N; n++) { . . . updateElectic<<<Dg, Db>>>(materials, media, field); . . . n++; updateMagnetic<<<Dg, Db>>>(materials, media, field); . . . } } GTC 2010

GMAC Example Code (II) typedefstruct { float Ke[3][3], km[3][3]; } Material; typedefstruct { size_t n; Material *data; } MaterialList; /* Read materials from disk */ size_treadMaterials(FILE *fp, MaterialList &list) { uint16_t n = 0; fread(&n, sizeof(n), 1, fp); ret = gmacMalloc((void **)&list.data, n * sizeof(Material)); if(ret != gmacSuccess) return 0; fread(list.data, sizeof(Material), n, fp); return n; } /* Read media description from file */ typedefstruct { dim3 dim; uint16_t *data } Media; void readMedia(FILE *fp, Media &media); /* Allocate a electromagnetic field */ typedefstruct{ dim3 dim; float3 *e; float3 *h; float3 *p; float3 *m } Field; void allocateField(Field &f, dim3 dim); GTC 2010

GMAC I/O Handling • Functions overridden (interposition) by GMAC: • Memory: memset(), memcpy() • I/O: fread(), fwrite(), read(), write() • MPI: MPI_Send(), MPI_Receive • Get advanced CUDA features for free • Asynchronous data transfers • Pinned memory Asynchronous Copies to device memory Pinned memory for I/O transfers GTC 2010

GMAC Example Code (III) __global__ void updateElectric(Materials mats, Media media, Field f) { intIdx = threadIdx.x + blockDim.x * blockIdx.x; intIdy = threadIdx.y + blockDim.y * blockIdx.y; for(intIdz = 0; Idz < f.dim.z; Idz++) { intpos = Idx + Idy * f.dim.x + Idz * f.dim.x * f.dim.y; float3 E = f.e[pos]; Material m = mats[media[pos]]; float3 P; P.x = E.x * m.ke[0][0] + E.y * m.ke[0][1] + E.z * m.ke[0][2]; P.y= E.x * m.ke[1][0] + E.y * m.ke[1][1] + E.z * m.ke[1][2]; P.z= E.x * m.ke[2][0] + E.y * m.ke[2][1] + E.z * m.ke[2][2]; f.p[pos] = P; } } GTC 2010

GMAC Global Memory • For multi-GPU systems • Data accessible by all accelerators, but owned by the CPU GPU Memory CPU GPU GTC 2010

GMAC Global memory API • Allocate global shared Memory gmacError_tgmacGlobalMalloc(void **ptr, size_t size) • Allocated memory address (returned by reference) • Gets the size of the data to be allocated • Error code, gmacSuccess if no error • Example usage #include <gmac.h> int main(int argc, char *argv[]) { float *foo = NULL; gmacError_t error; if((error = gmacGlobalMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess) FATAL(“Error allocating memory %s”, gmacErrorString(error)); . . . } GTC 2010

GMAC Example Code (I) typedefstruct { float Ke[3][3], km[3][3]; } Material; typedefstruct { size_t n; Material *data; } MaterialList; /* Read materials from disk */ size_treadMaterials(FILE *fp, MaterialList &list) { uint16_t n = 0; fread(&n, sizeof(n), 1, fp); ret = gmacGlobalMalloc((void **)&list.data, n * sizeof(Material)); if(ret != gmacSuccess) return 0; fread(list.data, sizeof(Material), n, fp); return n; } /* Read media description from file */ typedefstruct { dim3 dim; uint16_t *data } Media; void readMedia(FILE *fp, Media &media); /* Allocate a electromagnetic field */ typedefstruct{ dim3 dim; float3 *e; float3 *h; float3 *p; float3 *m } Field; void allocateField(Field &f, dim3 dim); GTC 2010

GMAC and Multi-threading • In the past, one host thread had one CPU • In GMAC, each host thread has: • One CPU • One GPU • A GMAC thread is running at GPU or at the CPU, but not in both at the same time • Create threads using what you already know • pthread_create(...) GTC 2010

GMAC and Multi-threading • Virtual memory accessibility: • Complete address space in CPU mode • Partial address space in GPU mode Memory CPU CPU GPU GPU GTC 2010

Getting Full-duplex PCIe • Use multi-threading to fully utilize the PCIe • One CPU thread launch kernels • One CPU thread writes to shared memory • Once CPU thread reads from shared memory CPU GPU System Memory GPU Memory PCIe GTC 2010

GPU Handoff and Copying • GPU handoff: • Send the thread’s virtual GPU to another thread • Do not move data, move computation • API Calls • Virtual GPU sending gmacError_tgmacSend(thread_iddest) • Virtual GPU receiving gmacError_tgmacReceive() • Virtual GPU copying gmacError_tgmacCopy(thread_iddest) GTC 2010

GPU virtual GPUs use Case • Exploit data locality in the CPU and GPU • Example: MPEG-4 Encoder: • Each GMAC thread executes one stage • Then, moves to the GPU where the input data is GPU GPU GPU GPU Dequantization and IDCT Motion Compensation Motion Estimation DCT and Quantization GTC 2010

GMAC Performance GTC 2010

GMAC on Actual Applications (I) • Reverse Time Migration (BSC / Repsol) • Six months – one programmer • Currently in use by Repsol • Single-GPU using CUDA Run-time • Can live with it: double-allocations, memory consistency • Nightmare: overlap GPU computation and data transfers (CUDA streams and double-buffering with pinned memory) • Multi-GPU using CUDA Run-time • Can live with it: lack of IDE for Linux • Nightmare: everything else Cancelled GTC 2010

GMAC on Actual Applications (II) • Multi-GPU using GMAC: • Double-buffering and pinned memory for free • Disk transfers • GPU to GPU (inter-domain) communication • MPI communication • Clean threading model • One task per CPU thread • Well-know synchronization primitives • It took shorter than the single-GPU version GTC 2010

Conclusions • Single virtual address space for CPUs and GPUs • Use CUDA advanced features • Automatic overlap data communication and computation • Get access to any GPU from any CPU thread • Get more performance from your application more easily • Go: http://adsm.googlecode.com GTC 2010

Future Features • OpenCL and Windows 7 support coming soon • Data-dependence tracking: • Avoid transferring data to the GPU when not used by kernels • Avoid transferring data to the CPU when not modified kernels • Global shared memory partitioning between multiple GPUs GTC 2010

GMACGlobal Memory for Accelerators http://adsm.googlecode.com

Backup Slides

GMAC Advanced Free Features • Get advanced CUDA features for free • Asynchronous data transfers • Pinned memory Asynchronous Copies to device memory Pinned memory for I/O transfers GTC 2010

GMAC Unified Address Space • When allocating memory • Allocate accelerator memory • Allocate CPU memory at the same virtual address System Memory Accelerator Memory CPU Accelerator GTC 2010

Lazy Update Data Transfers • Avoid unnecessary data copies • Lazy-update: • Call: transfer modified data • Return: transfer when needed System Memory Accelerator Memory CPU Accelerator GTC 2010

Rolling Update Data Transfers • Overlap CPU execution and data transfers • Minimal transfer on-demand • Rolling-update: • Memory-block size granularity System Memory Accelerator Memory CPU Accelerator GTC 2010

GMACGlobal Memory for Accelerators http://adsm.googlecode.com

GMAC Global Memory for Accelerators

GMAC Global Memory for Accelerators

Presentation Transcript

Serial Code Accelerators for Heterogeneous Multi-core Processor with 3D memory

ACCELERATORS

ACCELERATORS

Gmac Car Insurance

Accelerators

In-memory Accelerators with Memristors

ACCELERATORS

GMAC Global Memory for Accelerators

Microfabrication Techniques for Accelerators

Accelerators

Accelerators

Superconducting magnets for Accelerators

Accelerators

Accelerators

ACCELERATORS

ACCELERATORS

GMAC Hosted Meets

Magnetrons for accelerators

Accelerators

Accelerators

Superconducting Magnets for Accelerators

ACCELERATORS