Reverse Time Migration on GMAC
Javier Cabezas (BSC), Mauricio Araya (Repsol/BSC), Isaac Gelado (UPC/UIUC), Thomas Bradley (NVIDIA), Gladys González (Repsol), José María Cela (UPC/BSC), Nacho Navarro (UPC/BSC)
NVIDIA GTC, 22nd of September, 2010
Outline • Introduction • Reverse Time Migration on CUDA • GMAC at a glance • Reverse Time Migration on GMAC • Conclusions
Reverse Time Migration on CUDA
• RTM generates an image of the subsurface layers
  - Uses traces recorded by sensors in the field
• RTM's algorithm:
  - Propagation of a modeled wave (forward in time)
  - Propagation of the recorded traces (backward in time)
  - Correlation of the forward and backward wavefields: the last forward wavefield with the first backward wavefield
• FDTD (finite differences in the time domain) is preferred over FFT-based propagators (see the stencil sketch below):
  - 2nd-order finite differencing in time
  - High-order finite differencing in space
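To make the scheme concrete, here is a minimal sketch of such an FDTD update kernel, 2nd-order in time and high-order (8th-order, RADIUS = 4) in space. It is illustrative only, not the production RTM kernel: the array names, the coefficient layout in constant memory, and the assumption that vel already contains v^2 * dt^2 are all choices made for the example.

// Minimal sketch of the FDTD update used by RTM: 2nd-order in time,
// high-order central differences in space. Names are illustrative.
#define RADIUS 4
__constant__ float c[RADIUS + 1];     // finite-difference coefficients per axis

__global__ void fdtd_step(const float *p_prev, const float *p_cur,
                          float *p_next, const float *vel,
                          int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < RADIUS || x >= nx - RADIUS || y < RADIUS || y >= ny - RADIUS)
        return;

    for (int z = RADIUS; z < nz - RADIUS; ++z) {
        size_t i = ((size_t)z * ny + y) * nx + x;
        // High-order Laplacian: central differences along x, y and z
        float lap = 3.0f * c[0] * p_cur[i];
        for (int r = 1; r <= RADIUS; ++r) {
            lap += c[r] * (p_cur[i - r]                   + p_cur[i + r] +
                           p_cur[i - (size_t)r * nx]      + p_cur[i + (size_t)r * nx] +
                           p_cur[i - (size_t)r * nx * ny] + p_cur[i + (size_t)r * nx * ny]);
        }
        // 2nd-order time integration; vel[i] is assumed to already hold v^2 * dt^2
        p_next[i] = 2.0f * p_cur[i] - p_prev[i] + vel[i] * lap;
    }
}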
Introduction
• BSC and Repsol: Kaleidoscope project
  - Develop better algorithms/techniques for seismic imaging
  - We focused on Reverse Time Migration (RTM), as it is the most popular seismic imaging technique for depth exploration
• Due to the high computational power required, the project started a quest for the most suitable hardware:
  - PowerPC: scalability issues
  - Cell: good performance (in production at Repsol), difficult programmability
  - FPGA: potentially the best performance, a programmability nightmare
  - GPUs: 5x speedup vs. Cell (GTX280), but what about programmability?
Outline • Introduction • Reverse Time Migration on CUDA General approach Disk I/O Domain decomposition Overlapping computation and communication • GMAC at a glance • Reverse Time Migration on GMAC • Conclusions
Reverse Time Migration on CUDA: General approach
• We focus on the host-side part of the implementation
• Avoid memory transfers between host and GPU memories
  - Implement as many computations as possible on the GPU
• Hide the latency of memory transfers
  - Overlap memory transfers and kernel execution
• Take advantage of the full-duplex capabilities of PCIe (Fermi)
  - Overlap deviceToHost and hostToDevice memory transfers (see the sketch below)
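A minimal sketch of what this overlap looks like with plain CUDA streams, assuming page-locked host buffers and a placeholder kernel; all buffer and stream names are made up for the example and error checking is omitted.

__global__ void kernel(float *out, const float *in);   // placeholder compute kernel

// One step of the overlap pattern: the current step's kernel runs in its own
// stream while the previous result is downloaded and the next input is
// uploaded in two other streams, exploiting the full-duplex PCIe bus.
void step_overlapped(float *d_cur, float *d_out, float *d_prev, float *d_in,
                     float *h_in, float *h_out, size_t bytes,
                     dim3 grid, dim3 block, cudaStream_t s_compute,
                     cudaStream_t s_h2d, cudaStream_t s_d2h)
{
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s_h2d);    // upload next input
    cudaMemcpyAsync(h_out, d_prev, bytes, cudaMemcpyDeviceToHost, s_d2h); // download previous result
    kernel<<<grid, block, 0, s_compute>>>(d_out, d_cur);                  // compute current step
    cudaStreamSynchronize(s_h2d);
    cudaStreamSynchronize(s_d2h);
    cudaStreamSynchronize(s_compute);
}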
Reverse Time Migration on CUDA: General approach
• [Diagram: the forward pass chains 3D-Stencil, Absorbing Boundary Conditions, source insertion, compression and write to disk; the backward pass chains 3D-Stencil, Absorbing Boundary Conditions, traces insertion, read from disk, decompression and correlation]
Reverse Time Migration on CUDA: General approach
• Data structures used in the RTM algorithm
• Read/Write structures
  - 3D volume for the wavefield (can be larger than 1000x1000x1000 points)
  - State of the wavefield in previous time steps, needed to compute the finite differences in time
  - Some extra points in each direction at the boundaries (halos)
• Read-only structures
  - Velocity model: a 3D volume of the same size as the wavefield
  - Geophones' recorded traces: time steps x number of geophones
Reverse Time Migration on CUDA: General approach
• Data flow-graph (forward)
• [Diagram: the 3D-Stencil, ABC, Source and Compress kernels operate on the wavefields; the velocity model and the geophones' traces are constant read-only inputs]
Reverse Time Migration on CUDA: General approach
• Simplified data flow-graph (forward)
• [Diagram: the 3D-Stencil, ABC and Source kernels are merged into a single "RTM Kernel" box, followed by Compress; the wavefields are read/write, while the velocity model and the geophones' traces remain constant read-only inputs]
Reverse Time Migration on CUDA: General approach
• Control flow-graph (forward)
• RTM Kernel computation runs on the GPU
• Compress and transfer to disk
  - deviceToHost transfer + disk I/O
  - Performed every N steps
  - Can run in parallel with the next compute steps (see the host-loop sketch below)
• [Flowchart: for i = 0 to steps, launch the RTM Kernel on the GPU; when i % N == 0, compress, copy the snapshot to the host and write it to disk on the CPU]
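A host-loop sketch of this control flow in its simplest, blocking form; the overlapped variants are what the following Disk I/O slides are about. Kernel and helper names (rtm_kernel, compress_kernel, write_snapshot) are illustrative, not the actual RTM code.

for (int i = 0; i < steps; ++i) {
    rtm_kernel<<<grid, block>>>(d_next, d_cur, d_prev, d_vel);         // one time step on the GPU
    if (i % N == 0) {
        compress_kernel<<<grid, block>>>(d_snapshot, d_cur);           // compress on the GPU
        cudaMemcpy(h_snapshot, d_snapshot, snap_bytes,
                   cudaMemcpyDeviceToHost);                            // blocking copy, for clarity
        write_snapshot(file, h_snapshot, snap_bytes);                  // disk I/O on the CPU
    }
    float *tmp = d_prev; d_prev = d_cur; d_cur = d_next; d_next = tmp; // rotate wavefields
}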
Outline • Introduction • Reverse Time Migration on CUDA General approach Disk I/O Domain decomposition Overlapping computation and communication • GMAC at a glance • Reverse Time Migration on GMAC • Conclusions
Reverse Time Migration on CUDA: Disk I/O
• GPU → Disk transfers are very time-consuming
• Transferring to disk can be overlapped with the next (compute-only) steps
• [Timeline: without overlap, the compress, deviceToHost copy and disk I/O delay kernel K5; with overlap, kernels K5 to K8 run on the GPU while the copy and the disk I/O proceed on the CPU]
Reverse Time Migration on CUDA: Disk I/O
• Single transfer: wait for all the data to be in host memory before starting the disk I/O
• Multiple transfers: overlap the deviceToHost transfers with the disk I/O
  - Double buffering (see the sketch below)
• [Timeline: one long deviceToHost copy followed by one long disk write, versus chunked copies each overlapped with the disk write of the previous chunk]
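A sketch of the multiple-transfer path with double buffering: the snapshot is copied back in CHUNK-byte pieces through two page-locked staging buffers, so each disk write overlaps with the copy of the next chunk. All names (d_snapshot, h_buf, s_io, CHUNK, snap_bytes, file) are illustrative assumptions.

const size_t nchunks = snap_bytes / CHUNK;          // assume snap_bytes is a multiple of CHUNK
const size_t chunk_elems = CHUNK / sizeof(float);
cudaMemcpyAsync(h_buf[0], d_snapshot, CHUNK, cudaMemcpyDeviceToHost, s_io);
for (size_t ck = 0; ck < nchunks; ++ck) {
    cudaStreamSynchronize(s_io);                    // chunk ck is now in h_buf[ck % 2]
    if (ck + 1 < nchunks)                           // start copying the next chunk
        cudaMemcpyAsync(h_buf[(ck + 1) % 2], d_snapshot + (ck + 1) * chunk_elems,
                        CHUNK, cudaMemcpyDeviceToHost, s_io);
    fwrite(h_buf[ck % 2], 1, CHUNK, file);          // disk I/O overlaps the in-flight copy
}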
Reverse Time Migration on CUDA: Disk I/O
• CUDA-RT limitations
  - GPU memory is accessible by the owner host thread only
  - Therefore, deviceToHost transfers must be performed by the compute thread, not by the I/O thread
• [Diagram: the compute thread owns the GPU address space; the I/O thread only sees the CPU address space]
Reverse Time Migration on CUDA: Disk I/O
• CUDA-RT implementation (single transfer)
  - CUDA streams must be used to avoid blocking GPU execution
  - An intermediate page-locked buffer must be used: for real-size problems the system can run out of memory! (see the sketch below)
• [Diagram: the compute thread copies from the GPU address space into a full-size page-locked buffer in the CPU address space, which is then written to disk]
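A sketch of this single-transfer path: the whole compressed snapshot is staged through one full-size page-locked buffer, which is exactly what can exhaust host memory for real-size volumes. Names (d_snapshot, snap_bytes, s_io, file) are illustrative.

float *h_staging;
cudaHostAlloc((void **)&h_staging, snap_bytes, cudaHostAllocDefault);   // full-size page-locked buffer
cudaMemcpyAsync(h_staging, d_snapshot, snap_bytes, cudaMemcpyDeviceToHost, s_io);
cudaStreamSynchronize(s_io);              // wait until the full snapshot is in host memory
fwrite(h_staging, 1, snap_bytes, file);   // only then can the disk I/O start
cudaFreeHost(h_staging);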
Reverse Time Migration on CUDA: Disk I/O
• CUDA-RT implementation (multiple transfers)
  - Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while the next compute-only steps execute on the GPU
  - Lots of synchronization code ends up in the compute thread
• [Diagram: the compute thread shuttles chunks from the GPU address space through small page-locked buffers to the I/O thread]
Outline • Introduction • Reverse Time Migration on CUDA General approach Disk I/O Domain decomposition Overlapping computation and communication • GMAC at a glance • Reverse Time Migration on GMAC • Conclusions
Reverse Time Migration on CUDA: Domain decomposition
• But… wait, real-size problems require more than 16 GB of data!
• Volumes are split into tiles along the Z axis (see the sketch below)
  - The 3D stencil introduces data dependencies between neighboring tiles
• [Diagram: the volume is cut into domains D1 to D4 along the Z axis]
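A sketch of how the Z-axis decomposition can be sized: each of the domains gets roughly nz / num_gpus interior planes plus RADIUS halo planes on each internal face, because the high-order stencil reads RADIUS neighbors in z. All names are illustrative.

int planes  = nz / num_gpus;                            // interior planes per domain
int z_begin = gpu_id * planes;
int z_end   = (gpu_id == num_gpus - 1) ? nz : z_begin + planes;
int halo_lo = (gpu_id > 0)            ? RADIUS : 0;     // halo towards the previous domain
int halo_hi = (gpu_id < num_gpus - 1) ? RADIUS : 0;     // halo towards the next domain
size_t local_planes = (size_t)(z_end - z_begin) + halo_lo + halo_hi;
size_t local_bytes  = local_planes * nx * ny * sizeof(float);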
Reverse Time Migration on CUDA: Domain decomposition
• Multiple nodes may be required to overcome memory capacity limitations
  - Shared memory for intra-node communication
  - MPI for inter-node communication (see the sketch below)
• [Diagram: two nodes, each hosting four GPUs attached to host memory, exchange boundaries over MPI]
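A sketch of the inter-node part of the exchange: each process sends the halo it just computed to one neighbor and receives the other neighbor's halo with a combined send/receive to avoid deadlock. The buffers are assumed to be host copies of the boundary planes; ranks, tags and names are illustrative.

#include <mpi.h>

static void exchange_halo(float *send_halo, float *recv_halo, int halo_elems,
                          int up_rank, int down_rank)
{
    // send our halo "up" while receiving the neighbor's halo from "down"
    MPI_Sendrecv(send_halo, halo_elems, MPI_FLOAT, up_rank,   0,
                 recv_halo, halo_elems, MPI_FLOAT, down_rank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}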
Reverse Time Migration on CUDA: Domain decomposition
• Data flow-graph (multi-domain)
• [Diagram: one RTM Kernel + Compress pipeline per domain, each operating on its own wavefields; the velocity model and the geophones' traces remain constant read-only inputs shared by all domains]
Reverse Time Migration on CUDA: Domain decomposition
• Control flow-graph (multi-domain)
• Boundaries are exchanged every time step
  - Inter-domain communication blocks the execution of the next steps!
• [Flowchart: for i = 0 to steps, launch the kernel, synchronize, exchange boundaries; every N steps, compress, copy to the host and write to disk on the CPU]
Reverse Time Migration on CUDA: Domain decomposition
• A boundary exchange is needed every time step
• [Timeline: each kernel Ki is followed by an exchange X; the compress, deviceToHost copy and disk I/O still overlap with later steps, but the exchanges serialize with the kernels]
Reverse Time Migration on CUDA: Domain decomposition
• Single-transfer exchange
  - "Easy" to program, but needs large page-locked buffers
• Multiple-transfer exchange, to maximize PCI-Express utilization
  - "Complex" to program, but needs smaller page-locked buffers (see the sketch below)
• [Timeline: one long deviceToHost copy followed by one long hostToDevice copy, versus many small chunks whose deviceToHost and hostToDevice copies overlap]
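A sketch of the multiple-transfer exchange pattern only; it ignores the CUDA-RT thread-ownership issue discussed on the next slide, where each copy must really be issued by the thread that owns the corresponding GPU. The halo moves through two small page-locked staging buffers, chunk by chunk, so the deviceToHost copy of one chunk overlaps with the hostToDevice copy of the previous one on the full-duplex bus. All names (d_halo_src, d_halo_dst, stage, s_d2h, s_h2d, CHUNK, halo_bytes) are illustrative.

const size_t chunk_elems = CHUNK / sizeof(float);
cudaEvent_t uploaded[2];
cudaEventCreate(&uploaded[0]);
cudaEventCreate(&uploaded[1]);
for (size_t ck = 0; ck < halo_bytes / CHUNK; ++ck) {
    cudaEventSynchronize(uploaded[ck % 2]);              // stage[ck % 2] is free again
    cudaMemcpyAsync(stage[ck % 2], d_halo_src + ck * chunk_elems,
                    CHUNK, cudaMemcpyDeviceToHost, s_d2h);
    cudaStreamSynchronize(s_d2h);                        // chunk ck staged in host memory
    cudaMemcpyAsync(d_halo_dst + ck * chunk_elems, stage[ck % 2],
                    CHUNK, cudaMemcpyHostToDevice, s_h2d);
    cudaEventRecord(uploaded[ck % 2], s_h2d);
    // while chunk ck is being uploaded, the next iteration already starts
    // downloading chunk ck+1 into the other staging buffer
}
cudaStreamSynchronize(s_h2d);

Even this simplified sketch already shows the amount of synchronization bookkeeping the slide refers to.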
Reverse Time Migration on CUDA: Domain decomposition
• CUDA-RT limitations
  - Each host thread can only access the memory objects it allocates
• [Diagram: four GPUs with four separate GPU address spaces, each reachable only from its owner thread in the CPU address space]
Reverse Time Migration on CUDA: Domain decomposition
• CUDA-RT implementation (single-transfer exchange)
  - Streams and page-locked memory buffers must be used
  - The page-locked memory buffers can be too big
• [Diagram: boundaries travel from each GPU's address space through large page-locked buffers in the CPU address space to the neighboring GPU]
Reverse Time Migration on CUDA: Domain decomposition
• CUDA-RT implementation (multiple-transfer exchange)
  - Uses small page-locked buffers
  - Requires even more synchronization code
• Too complex to be represented using PowerPoint!
• Very difficult to implement in real code!
Outline • Introduction • Reverse Time Migration on CUDA General approach Disk I/O Domain decomposition Overlapping computation and communication • GMAC at a glance • Reverse Time Migration on GMAC • Conclusions
Reverse Time Migration on CUDA: Overlapping computation and communication
• Problem: the boundary exchange blocks the execution of the following time step
• [Timeline: each kernel Ki must wait for the exchange X of the previous step, so the GPU idles between kernels]
Reverse Time Migration on CUDA: Overlapping computation and communication
• Solution: with a 2-stage execution plan we can effectively overlap the boundary exchange between domains with computation
• [Timeline: the small stage-1 kernels ki compute the boundaries, whose exchange X overlaps with the large stage-2 kernels Ki computing the interior; the copies to host and the disk I/O also overlap]
Reverse Time Migration on CUDA: Overlapping computation and communication
• Approach: two-stage execution
  - Stage 1: compute the wavefield points that have to be exchanged (the planes adjacent to the domain boundary)
• [Diagram: GPU1 and GPU2 each compute only the planes next to the shared boundary]
Reverse Time Migration on CUDA: Overlapping computation and communication
• Approach: two-stage execution
  - Stage 2: compute the remaining points while exchanging the boundaries (see the sketch below)
• [Diagram: GPU1 and GPU2 compute their interior points while the boundary planes are in flight]
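A per-domain, per-time-step sketch of the two-stage scheme. Kernel names, pointers and offsets are illustrative placeholders, not the real RTM code.

// Stage 1: compute only the boundary planes, so their exchange can start early
stage1_boundary<<<grid_bnd, block, 0, s_compute>>>(d_next, d_cur, d_prev, d_vel);
cudaStreamSynchronize(s_compute);                       // boundary planes are ready

// start moving the freshly computed boundary towards the neighboring domain
cudaMemcpyAsync(h_halo, d_next + boundary_offset, halo_bytes,
                cudaMemcpyDeviceToHost, s_exchange);

// Stage 2: meanwhile, compute the interior points of this domain
stage2_interior<<<grid_int, block, 0, s_compute>>>(d_next, d_cur, d_prev, d_vel);

cudaStreamSynchronize(s_exchange);                      // halo is now in host memory
// ... hand h_halo to the neighboring domain's thread (or MPI) and upload its
//     halo with another cudaMemcpyAsync on s_exchange
cudaStreamSynchronize(s_compute);                       // interior done: step complete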
Reverse Time Migration on CUDA: Overlapping computation and communication
• But two-stage execution adds yet more abstractions and code complexity
  - An additional stream per domain: we already have one to launch kernels, one to overlap transfers to disk, and one to exchange boundaries
• At this point the code is a complete mess!
  - Four streams per domain, many page-locked buffers, lots of inter-thread synchronization
  - Poor readability and maintainability
  - Easy to introduce bugs
Outline • Introduction • Reverse Time Migration on CUDA • GMAC at a glance Features Code examples • Reverse Time Migration on GMAC • Conclusions
GMAC at a glance
• A library that enhances the host programming model of CUDA
• Freely available at http://code.google.com/p/adsm/
  - Developed by BSC and UIUC
  - NCSA license (BSD-like)
  - Works on Linux and Mac OS X (Windows version coming soon)
• Presented in detail tomorrow at 9 am @ San Jose Ballroom
GMAC at a glance: Features
• Unified virtual address space for all the memories in the system
  - A single allocation for shared objects
  - Special API calls: gmacMalloc, gmacFree
  - GPU memory allocated by a host thread is visible to all host threads
  - Brings POSIX thread semantics back to developers
• [Diagram: a single shared data object is mapped into both GPU and CPU memory]
GMAC at a glance: Features
• Parallelism exposed via regular POSIX threads
  - Replaces the explicit use of CUDA streams
  - OpenMP support
• GMAC uses streams and page-locked buffers internally
  - Concurrent kernel execution and memory transfers for free
GMAC at a glance: Features
• Optimized bulk memory operations via library interposition
  - File I/O through the standard functions: fwrite, fread
  - Automatic overlap of disk I/O with hostToDevice and deviceToHost transfers
  - Optimized GPU to GPU transfers via a regular memcpy
  - Enhanced versions of the MPI send/receive calls
Outline • Introduction • Reverse Time Migration on CUDA • GMAC at a glance Features Code examples • Reverse Time Migration on GMAC • Conclusions
GMAC at a glance: Examples
• A single allocation (and pointer) for shared objects

CUDA-RT:

void compute(FILE *file, int size)
{
    float *foo, *dev_foo;
    foo = malloc(size);
    fread(foo, size, 1, file);
    cudaMalloc(&dev_foo, size);
    cudaMemcpy(dev_foo, foo, size, cudaMemcpyHostToDevice);
    kernel<<<Dg, Db>>>(dev_foo, size);
    cudaThreadSynchronize();
    cudaMemcpy(foo, dev_foo, size, cudaMemcpyDeviceToHost);
    cpuComputation(foo);
    cudaFree(dev_foo);
    free(foo);
}

GMAC:

void compute(FILE *file, int size)
{
    float *foo;
    foo = gmacMalloc(size);
    fread(foo, size, 1, file);
    kernel<<<Dg, Db>>>(foo, size);
    gmacThreadSynchronize();
    cpuComputation(foo);
    gmacFree(foo);
}
GMAC at a glance: Examples
• Optimized support for bulk memory operations
  - Same CUDA-RT vs. GMAC listing as above: note that the fread reads straight into the gmacMalloc'd buffer, so GMAC can overlap the disk read with the hostToDevice transfer
Outline • Introduction • GMAC at a glance • Reverse Time Migration on GMAC Disk I/O Domain decomposition Overlapping computation and communication Development cycle and debugging • Conclusions
Reverse Time Migration on GMAC: Disk I/O
• Recall the CUDA-RT implementation (multiple transfers)
  - Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while the next compute-only steps execute on the GPU
  - Lots of synchronization code in the compute thread
• [Diagram: the compute thread shuttles chunks from the GPU address space through small page-locked buffers to the I/O thread]
Reverse Time Migration on GMAC: Disk I/O
• GMAC implementation
  - The deviceToHost transfers are performed by the I/O thread
  - deviceToHost and disk I/O transfers overlap for free
  - Small page-locked buffers are used internally (see the sketch below)
• [Diagram: both threads see the same global address space, so the I/O thread reads the GPU-resident data directly]
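A sketch of what this buys, based on the features described on the previous slides: the snapshot lives in a single gmacMalloc'd allocation that is visible to every host thread, so the I/O thread can hand the pointer straight to fwrite, which GMAC interposes and overlaps with the deviceToHost copy. The argument struct and thread body are illustrative, not the actual RTM code.

#include <pthread.h>
#include <stdio.h>

// Illustrative argument block: pointer returned by gmacMalloc, its size,
// and the destination file.
struct snapshot { float *data; size_t bytes; FILE *file; };

// I/O thread body: because the gmacMalloc'd pointer is valid in every host
// thread, a plain fwrite is enough; GMAC overlaps the deviceToHost transfer
// with the disk write internally.
static void *io_thread(void *arg)
{
    struct snapshot *s = (struct snapshot *)arg;
    fwrite(s->data, 1, s->bytes, s->file);
    return NULL;
}

In this sketch, the compute thread would fill a struct snapshot and pthread_create this function once the compression kernel has finished.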
Outline • Introduction • GMAC at a glance • Reverse Time Migration on GMAC Disk I/O Domain decomposition Overlapping computation and communication Development cycle and debugging • Conclusions
Reverse Time Migration on GMAC: Domain decomposition
• Recall the CUDA-RT implementation (single-transfer exchange)
  - Streams and page-locked memory buffers must be used
  - The page-locked memory buffers can be too big
• [Diagram: boundaries travel from each GPU's address space through large page-locked buffers in the CPU address space to the neighboring GPU]
Reverse Time Migration on GMAC: Domain decomposition
• GMAC implementation (multiple-transfer exchange)
  - The exchange of boundaries is performed with a simple memcpy!
  - Full PCIe utilization: internally, GMAC performs several transfers and double buffering
• [Diagram: the four GPUs share a unified global address space, so boundaries move directly between them]
Outline • Introduction • GMAC at a glance • Reverse Time Migration on GMAC Disk I/O Domain decomposition Overlapping computation and communication Development cycle and debugging • Conclusions
Reverse Time Migration on GMAC: Overlapping computation and communication
• No streams, no page-locked buffers, and similar performance: within ±2%

CUDA-RT:

readVelocity(velocity);
cudaMalloc(&d_input, W_SIZE);
cudaMalloc(&d_output, W_SIZE);
cudaHostAlloc(&i_halos, H_SIZE);
cudaHostAlloc(&disk_buffer, W_SIZE);
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
cudaMemcpy(d_velocity, velocity, W_SIZE);
for all time steps do
    launch_stage1(d_output, d_input, s1);
    launch_stage2(d_output, d_input, s2);
    cudaMemcpyAsync(i_halos, d_output, s1);
    cudaStreamSynchronize(s1);
    barrier();
    cudaMemcpyAsync(d_output, i_halos, s1);
    cudaThreadSynchronize();
    barrier();
    if (timestep % N == 0) {
        compress(output, c_output);
        transfer_to_host(disk_buffer);
        barrier_write_to_disk();
    }
    // ... update pointers
end for

GMAC:

fread(velocity);
gmacMalloc(&input, W_SIZE);
gmacMalloc(&output, W_SIZE);
for all time steps do
    launch_stage1(output, input);
    gmacThreadSynchronize();
    launch_stage2(output, input);
    memcpy(neighbor, output);
    gmacThreadSynchronize();
    barrier();
    if (timestep % N == 0) {
        compress(output, c_output);
        barrier_write_to_disk();
    }
    // ... update pointers
end for
Outline • Introduction • GMAC at a glance • Reverse Time Migration on GMAC Disk I/O Domain decomposition Inter-domain communication Development cycle and debugging • Conclusions