
1. Reverse Time Migration on GMAC
Javier Cabezas (BSC), Mauricio Araya (Repsol/BSC), Isaac Gelado (UPC/UIUC), Thomas Bradley (NVIDIA), Gladys González (Repsol), José María Cela (UPC/BSC), Nacho Navarro (UPC/BSC)
NVIDIA GTC, 22nd of September, 2010

2. Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

3. Introduction
• BSC and Repsol: the Kaleidoscope project
  - Develop better algorithms/techniques for seismic imaging
  - We focused on Reverse Time Migration (RTM), the most popular seismic imaging technique for depth exploration
• Due to the high computational power required, the project started a quest for the most suitable hardware:
  - PowerPC: scalability issues
  - Cell: good performance (in production at Repsol), difficult programmability
  - FPGA: potentially the best performance, a programmability nightmare
  - GPUs: 5x speedup vs. Cell (GTX280), but what about programmability?

4. Reverse Time Migration on CUDA
• RTM generates an image of the subsurface layers
• Uses traces recorded by sensors in the field
• RTM's algorithm:
  - Propagation of a modeled wave (forward in time)
  - Propagation of the recorded traces (backward in time)
  - Correlation of the forward and backward wavefields (the last forward wavefield with the first backward wavefield)
• FDTD schemes are preferred over FFT:
  - 2nd-order finite differencing in time
  - High-order finite differencing in space

5. Outline
• Introduction
• Reverse Time Migration on CUDA
  - General approach
  - Disk I/O
  - Domain decomposition
  - Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

6. Reverse Time Migration on CUDA: General approach
• We focus on the host-side part of the implementation
• Avoid memory transfers between host and GPU memories
  - Implement as many computations as possible on the GPU
• Hide the latency of memory transfers
  - Overlap memory transfers and kernel execution (see the sketch below)
• Take advantage of the PCIe full-duplex capabilities (Fermi)
  - Overlap deviceToHost and hostToDevice memory transfers
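A minimal sketch of this overlap pattern (ours, not from the deck): the next time step runs on one stream while the previous snapshot is copied back on another. The kernel name, grid sizes and buffers are hypothetical.

    #include <cuda_runtime.h>

    __global__ void rtm_step(float *wf);   /* stencil body omitted */

    void step_with_overlap(float *d_wf, const float *d_snapshot,
                           float *h_pinned,     /* must be page-locked */
                           size_t bytes,
                           cudaStream_t compute, cudaStream_t copy)
    {
        rtm_step<<<128, 256, 0, compute>>>(d_wf);      /* time step i+1      */
        cudaMemcpyAsync(h_pinned, d_snapshot, bytes,   /* snapshot of step i */
                        cudaMemcpyDeviceToHost, copy); /* overlaps the kernel */
    }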

7. Reverse Time Migration on CUDA: General approach
[Diagram: the forward pass applies the 3D stencil, absorbing boundary conditions and source insertion, then compresses and writes wavefields to disk; the backward pass applies the 3D stencil, absorbing boundary conditions and traces insertion, then reads and decompresses the stored wavefields for the correlation.]

8. Reverse Time Migration on CUDA: General approach
• Data structures used in the RTM algorithm (a possible layout is sketched below)
• Read/write structures:
  - 3D volume for the wavefield (can be larger than 1000x1000x1000 points)
  - State of the wavefield in previous time steps, to compute finite differences in time
  - Some extra points in each direction at the boundaries (halos)
• Read-only structures:
  - 3D volume of the same size as the wavefield (velocity model)
  - Geophones' recorded traces: time-steps x #geophones
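A possible layout for these structures; the field names are ours, not the authors'.

    /* The wavefield keeps the previous time step for the 2nd-order
       scheme in time, and halo points for the high-order spatial
       stencil. */
    typedef struct {
        int    nx, ny, nz;   /* interior points per dimension            */
        int    halo;         /* extra boundary points (stencil radius)   */
        float *curr, *prev;  /* wavefield at steps t and t-1 (on device) */
    } Wavefield;

    typedef struct {
        float *velocity;     /* read-only 3D volume, wavefield-sized     */
        float *traces;       /* recorded traces: time-steps x #geophones */
    } ReadOnlyData;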

9. Reverse Time Migration on CUDA: General approach
• Data flow-graph (forward)
[Diagram: the wavefields flow through the 3D-Stencil, ABC (absorbing boundary conditions), Source insertion and Compress stages; the velocity model and the geophones' traces are constant read-only inputs.]

10. Reverse Time Migration on CUDA: General approach
• Simplified data flow-graph (forward)
[Diagram: the stages above collapse into a single RTM kernel followed by a Compress stage, operating on the wavefields with the same constant read-only inputs.]

11. Reverse Time Migration on CUDA: General approach
• Control flow-graph (forward); a sketch in code follows below
  - The RTM kernel computation runs on the GPU
  - Compress and transfer to disk: deviceToHost + disk I/O, performed every N steps; the disk I/O runs on the CPU and can run in parallel with the next compute steps
[Flowchart: for i = 0 .. steps-1, run the RTM kernel; if i % N == 0, compress, copy toHost and write to disk.]
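A sketch of that loop in plain CUDA. The names are ours; compress_on_gpu and write_to_disk are hypothetical helpers.

    #include <cuda_runtime.h>

    __global__ void rtm_kernel(float *curr, const float *prev);
    void compress_on_gpu(const float *in, float *out);    /* hypothetical */
    void write_to_disk(const float *buf, size_t bytes);   /* hypothetical */

    void forward(float *d_curr, float *d_prev, float *d_comp, float *h_buf,
                 size_t c_bytes, int steps, int N, dim3 grid, dim3 block)
    {
        for (int i = 0; i < steps; i++) {
            rtm_kernel<<<grid, block>>>(d_curr, d_prev);
            if (i % N == 0) {
                compress_on_gpu(d_curr, d_comp);
                cudaMemcpy(h_buf, d_comp, c_bytes, cudaMemcpyDeviceToHost);
                write_to_disk(h_buf, c_bytes);  /* blocks the compute thread:
                                                   this is what the next
                                                   slides try to overlap   */
            }
            float *tmp = d_curr; d_curr = d_prev; d_prev = tmp;  /* advance */
        }
    }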

12. Outline
• Introduction
• Reverse Time Migration on CUDA
  - General approach
  - Disk I/O
  - Domain decomposition
  - Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

13. Reverse Time Migration on CUDA: Disk I/O
• GPU → disk transfers are very time-consuming
• Transferring to disk can be overlapped with the next (compute-only) steps
[Timeline: without overlap, the compression, toHost copy and disk I/O serialize between kernels K4 and K5; with overlap, they proceed on the CPU while kernels K5..K8 run on the GPU.]

14. Reverse Time Migration on CUDA: Disk I/O
• Single transfer: wait for all the data to be in host memory before starting the disk I/O
• Multiple transfers: overlap deviceToHost transfers with disk I/O via double buffering (see the sketch below)
[Timeline: one long deviceToHost copy followed by one long disk write, versus interleaved short toH copies and disk writes.]
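A double-buffering sketch under our own names: while chunk k is being written to disk, chunk k+1 is already crossing PCIe into the other buffer.

    #include <stdio.h>
    #include <cuda_runtime.h>

    void snapshot_to_disk(const float *d_data, size_t chunks,
                          size_t chunk_bytes, FILE *file, cudaStream_t s)
    {
        float *buf[2];
        cudaHostAlloc((void **)&buf[0], chunk_bytes, cudaHostAllocDefault);
        cudaHostAlloc((void **)&buf[1], chunk_bytes, cudaHostAllocDefault);

        cudaMemcpyAsync(buf[0], d_data, chunk_bytes,
                        cudaMemcpyDeviceToHost, s);
        for (size_t k = 0; k < chunks; k++) {
            cudaStreamSynchronize(s);           /* chunk k has arrived     */
            if (k + 1 < chunks)                 /* start copying chunk k+1 */
                cudaMemcpyAsync(buf[(k + 1) % 2],
                                d_data + (k + 1) * chunk_bytes / sizeof(float),
                                chunk_bytes, cudaMemcpyDeviceToHost, s);
            fwrite(buf[k % 2], chunk_bytes, 1, file); /* overlaps the copy */
        }
        cudaFreeHost(buf[0]);
        cudaFreeHost(buf[1]);
    }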

15. Reverse Time Migration on CUDA: Disk I/O
• CUDA-RT limitations:
  - GPU memory is accessible by the owner host thread only
  - deviceToHost transfers must therefore be performed by the compute thread, not by the I/O thread
[Diagram: the GPU address space is reachable from the compute thread but not from the I/O thread in the CPU address space.]

16. Reverse Time Migration on CUDA: Disk I/O
• CUDA-RT implementation (single transfer), sketched below:
  - CUDA streams must be used to avoid blocking GPU execution
  - An intermediate page-locked buffer must be used: for real-size problems the system can run out of memory!
[Diagram: the snapshot is staged through a page-locked buffer in the CPU address space.]
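Our sketch of the single-transfer variant: one page-locked staging buffer as large as the whole snapshot. Pinning gigabytes of host memory at once is what can exhaust the system on real-size problems.

    #include <stdio.h>
    #include <cuda_runtime.h>

    void snapshot_single(const float *d_data, size_t bytes, FILE *file,
                         cudaStream_t s)
    {
        float *staging;
        cudaHostAlloc((void **)&staging, bytes, cudaHostAllocDefault);
        cudaMemcpyAsync(staging, d_data, bytes,     /* async: kernels on   */
                        cudaMemcpyDeviceToHost, s); /* other streams go on */
        cudaStreamSynchronize(s);    /* wait for ALL the data on the host */
        fwrite(staging, bytes, 1, file);
        cudaFreeHost(staging);
    }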

17. Reverse Time Migration on CUDA: Disk I/O
• CUDA-RT implementation (multiple transfers):
  - Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while the next compute-only steps execute on the GPU
  - Lots of synchronization code ends up in the compute thread

18. Outline
• Introduction
• Reverse Time Migration on CUDA
  - General approach
  - Disk I/O
  - Domain decomposition
  - Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

19. Reverse Time Migration on CUDA: Domain decomposition
• But… wait, real-size problems require > 16 GB of data!
• Volumes are split into tiles along the Z-axis (see the sketch below)
  - The 3D stencil introduces data dependencies between neighboring tiles
[Diagram: the volume is partitioned into domains D1..D4 along the Z-axis.]
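A sketch of the Z-axis decomposition (our helper, not the authors' code): each of `gpus` devices gets roughly nz/gpus interior planes, plus `halo` planes shared with each neighbor because of the stencil dependencies.

    typedef struct { int z0, z1; } Tile;   /* interior plane range [z0, z1) */

    Tile tile_for_gpu(int gpu, int gpus, int nz, int halo)
    {
        Tile t;
        t.z0 = (gpu * nz) / gpus;
        t.z1 = ((gpu + 1) * nz) / gpus;
        /* the allocation must also cover the halo planes:
           (t.z1 - t.z0 + 2 * halo) * nx * ny points in total */
        return t;
    }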

20. Reverse Time Migration on CUDA: Domain decomposition
• Multiple nodes may be required to overcome memory capacity limitations
  - Shared memory for intra-node communication
  - MPI for inter-node communication (a minimal sketch follows)
[Diagram: two nodes with four GPUs each; GPUs within a node exchange through host memory, nodes exchange through MPI.]
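A minimal inter-node exchange sketch, assuming the boundary planes are already staged in host buffers; `neighbor` is the hypothetical rank of the adjacent domain.

    #include <mpi.h>

    void exchange_with_neighbor(float *send_buf, float *recv_buf,
                                int halo_points, int neighbor)
    {
        /* send our boundary and receive the neighbor's in one call */
        MPI_Sendrecv(send_buf, halo_points, MPI_FLOAT, neighbor, 0,
                     recv_buf, halo_points, MPI_FLOAT, neighbor, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }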

21. Reverse Time Migration on CUDA: Domain decomposition
• Data flow-graph (multi-domain)
[Diagram: one RTM kernel + Compress pipeline per domain, each operating on its own wavefields; the velocity model and geophones' traces remain constant read-only inputs.]

22. Reverse Time Migration on CUDA: Domain decomposition
• Control flow-graph (multi-domain)
  - Boundary exchange every time-step
  - Inter-domain communication blocks the execution of the next steps!
[Flowchart: each step runs the kernel, synchronizes, exchanges boundaries, and every N steps compresses, copies toHost and writes to disk.]

23. Reverse Time Migration on CUDA: Domain decomposition
• A boundary exchange is needed every time-step
[Timeline: K1 X K2 X K3 X K4 … with an exchange (X) serializing after every kernel, plus the periodic toHost copy and disk I/O.]

24. Reverse Time Migration on CUDA: Domain decomposition
• Single-transfer exchange: "easy" to program, needs large page-locked buffers
• Multiple-transfer exchange, to maximize PCI-Express utilization: "complex" to program, needs smaller page-locked buffers (see the sketch below)
[Timeline: three long deviceToHost copies followed by three long hostToDevice copies, versus many small interleaved toH/toD chunks that keep both PCIe directions busy.]
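A sketch of a double-buffered multiple-transfer exchange: the hostToDevice copy of chunk k overlaps with the deviceToHost copy of chunk k+1 on a full-duplex link. For brevity this ignores the one-context-per-thread restriction discussed on the next slide; all names and the layout are ours.

    #include <cuda_runtime.h>

    void exchange_chunked(const float *d_src, float *d_dst, float *h_buf[2],
                          size_t chunks, size_t bytes,
                          cudaStream_t toH, cudaStream_t toD)
    {
        size_t n = bytes / sizeof(float);      /* floats per chunk */
        cudaMemcpyAsync(h_buf[0], d_src, bytes, cudaMemcpyDeviceToHost, toH);
        for (size_t k = 0; k < chunks; k++) {
            cudaStreamSynchronize(toH);        /* chunk k is on the host   */
            cudaStreamSynchronize(toD);        /* its buffer twin is free  */
            if (k + 1 < chunks)
                cudaMemcpyAsync(h_buf[(k + 1) % 2], d_src + (k + 1) * n,
                                bytes, cudaMemcpyDeviceToHost, toH);
            cudaMemcpyAsync(d_dst + k * n, h_buf[k % 2], bytes,
                            cudaMemcpyHostToDevice, toD);
        }
        cudaStreamSynchronize(toD);
    }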

25. Reverse Time Migration on CUDA: Domain decomposition
• CUDA-RT limitations:
  - Each host thread can only access the memory objects it allocates
[Diagram: four GPU address spaces, each private to its owning host thread, next to the CPU address space.]

26. Reverse Time Migration on CUDA: Domain decomposition
• CUDA-RT implementation (single-transfer exchange):
  - Streams and page-locked memory buffers must be used
  - The page-locked memory buffers can be too big
[Diagram: each boundary travels GPU → page-locked host buffer → GPU.]

27. Reverse Time Migration on CUDA: Domain decomposition
• CUDA-RT implementation (multiple-transfer exchange):
  - Uses small page-locked buffers
  - More synchronization code
• Too complex to be represented using PowerPoint!
• Very difficult to implement in real code!

28. Outline
• Introduction
• Reverse Time Migration on CUDA
  - General approach
  - Disk I/O
  - Domain decomposition
  - Overlapping computation and communication
• GMAC at a glance
• Reverse Time Migration on GMAC
• Conclusions

29. Reverse Time Migration on CUDA: Overlapping computation and communication
• Problem: the boundary exchange blocks the execution of the following time-step
[Timeline: K1 X K2 X K3 X K4 … with each exchange (X) serializing after its kernel, plus the periodic toHost copy and disk I/O.]

30. Reverse Time Migration on CUDA: Overlapping computation and communication
• Solution: with a 2-stage execution plan we can effectively overlap the boundary exchange between domains
[Timeline: the small stage-1 kernels (k1..k9) compute the boundaries, whose exchanges (X) run concurrently with the large stage-2 kernels (K1..K9); the toHost copies and disk I/O also proceed in parallel.]

31. Reverse Time Migration on CUDA: Overlapping computation and communication
• Approach: two-stage execution
  - Stage 1: compute the wavefield points to be exchanged
[Diagram: the boundary planes between GPU1's and GPU2's tiles are updated first.]

32. Reverse Time Migration on CUDA: Overlapping computation and communication
• Approach: two-stage execution
  - Stage 2: compute the remaining points while exchanging the boundaries (a sketch follows)
[Diagram: the interior of each tile is updated while the halos travel between GPU1 and GPU2.]
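A two-stage sketch with hypothetical kernels: stage 1 updates only the boundary planes; once they are done, their deviceToHost copy (the first leg of the exchange) overlaps with the stage-2 interior kernel. Grid sizes are illustrative, and the halo offset into d_curr is omitted.

    #include <cuda_runtime.h>

    __global__ void stencil_boundary(float *curr, const float *prev);
    __global__ void stencil_interior(float *curr, const float *prev);

    void two_stage_step(float *d_curr, const float *d_prev,
                        float *h_halo, size_t halo_bytes,
                        cudaStream_t compute, cudaStream_t copy)
    {
        dim3 block(256), grid_b(64), grid_i(4096);   /* illustrative sizes */
        stencil_boundary<<<grid_b, block, 0, compute>>>(d_curr, d_prev);
        cudaStreamSynchronize(compute);            /* boundary planes ready */
        cudaMemcpyAsync(h_halo, d_curr, halo_bytes,   /* exchange overlaps  */
                        cudaMemcpyDeviceToHost, copy);
        stencil_interior<<<grid_i, block, 0, compute>>>(d_curr, d_prev);
    }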

33. Reverse Time Migration on CUDA: Overlapping computation and communication
• But two-stage execution requires more abstractions and code complexity:
  - An additional stream per domain; we already have 1 to launch kernels, 1 to overlap transfers to disk, and 1 to exchange boundaries
• At this point the code is a complete mess!
  - Requires 4 streams per domain, many page-locked buffers, and lots of inter-thread synchronization
  - Poor readability and maintainability
  - Easy to introduce bugs

34. Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
  - Features
  - Code examples
• Reverse Time Migration on GMAC
• Conclusions

35. GMAC at a glance: Introduction
• A library that enhances the host programming model of CUDA
• Freely available at http://code.google.com/p/adsm/
  - Developed by BSC and UIUC
  - NCSA license (BSD-like)
  - Works on Linux and Mac OS X (Windows version coming soon)
• Presented in detail tomorrow at 9 am @ San Jose Ballroom

36. GMAC at a glance: Features
• Unified virtual address space for all the memories in the system
• Single allocation for shared objects
  - Special API calls: gmacMalloc, gmacFree
  - GPU memory allocated by a host thread is visible to all host threads
  - Brings POSIX thread semantics back to developers
[Diagram: a shared data object appears at the same address in the GPU and CPU views of memory.]

37. GMAC at a glance: Features
• Parallelism exposed via regular POSIX threads
  - Replaces the explicit use of CUDA streams
  - OpenMP support
• GMAC uses streams and page-locked buffers internally
  - Concurrent kernel execution and memory transfers for free

38. GMAC at a glance: Features
• Optimized bulk memory operations via library interposition (see the sketch below)
  - File I/O through the standard functions: fwrite, fread
  - Automatic overlap of disk I/O with hostToDevice and deviceToHost transfers
  - Optimized GPU to GPU transfers via regular memcpy
  - Enhanced versions of the MPI send/receive calls
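A sketch of the interposed file I/O: a GMAC pointer is passed straight to the standard library, and GMAC stages the data through its internal page-locked buffers. It follows the gmacMalloc style used in the slides' own examples; `snapshot` and `bytes` are our names.

    #include <stdio.h>

    void roundtrip(FILE *snapshot, size_t bytes)
    {
        float *wf = gmacMalloc(bytes);   /* visible to CPU and GPU alike  */
        fread(wf, bytes, 1, snapshot);   /* disk -> GPU, overlap for free */
        /* ... run kernels on wf ... */
        rewind(snapshot);
        fwrite(wf, bytes, 1, snapshot);  /* GPU -> disk, no manual staging */
        gmacFree(wf);
    }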

39. Outline
• Introduction
• Reverse Time Migration on CUDA
• GMAC at a glance
  - Features
  - Code examples
• Reverse Time Migration on GMAC
• Conclusions

40. GMAC at a glance: Examples
• Single allocation (and pointer) for shared objects

CUDA-RT:

    void compute(FILE *file, int size) {
        float *foo, *dev_foo;
        foo = malloc(size);
        fread(foo, size, 1, file);
        cudaMalloc(&dev_foo, size);
        cudaMemcpy(dev_foo, foo, size, ToDevice);
        kernel<<<Dg, Db>>>(dev_foo, size);
        cudaThreadSynchronize();
        cudaMemcpy(foo, dev_foo, size, ToHost);
        cpuComputation(foo);
        cudaFree(dev_foo);
        free(foo);
    }

GMAC:

    void compute(FILE *file, int size) {
        float *foo;
        foo = gmacMalloc(size);
        fread(foo, size, 1, file);
        kernel<<<Dg, Db>>>(foo, size);
        gmacThreadSynchronize();
        cpuComputation(foo);
        gmacFree(foo);
    }

41. GMAC at a glance: Examples
• Optimized support for bulk memory operations
• (Same compute() comparison as above: the fread on the GMAC pointer replaces the malloc + fread + cudaMemcpy staging of the CUDA-RT version, and GMAC overlaps the disk read with the hostToDevice transfer internally.)

42. Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  - Disk I/O
  - Domain decomposition
  - Overlapping computation and communication
  - Development cycle and debugging
• Conclusions

43. Reverse Time Migration on GMAC: Disk I/O
• CUDA-RT implementation (multiple transfers), as seen before:
  - Besides launching kernels, the compute thread must program and monitor several deviceToHost transfers while executing the next compute-only steps on the GPU
  - Lots of synchronization code in the compute thread

44. Reverse Time Migration on GMAC: Disk I/O
• GMAC implementation (sketched below):
  - deviceToHost transfers are performed by the I/O thread
  - deviceToHost and disk I/O transfers overlap for free
  - Small page-locked buffers are used internally
[Diagram: the compute and I/O threads share a single global address space covering the GPU.]
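A sketch of that scheme: because the GMAC pointer is visible to every host thread, a dedicated I/O thread can write snapshots with a plain fwrite, and the interposed call stages the data through small pinned buffers, overlapping PCIe and disk. The work queue is hypothetical.

    #include <pthread.h>
    #include <stdio.h>

    typedef struct { float *gmac_ptr; size_t bytes; FILE *file; } Snapshot;

    Snapshot *queue_pop(void *queue);   /* hypothetical blocking pop */

    void *io_thread(void *queue)
    {
        for (;;) {
            Snapshot *s = queue_pop(queue);
            if (s == NULL) break;               /* compute thread is done */
            fwrite(s->gmac_ptr, s->bytes, 1, s->file);   /* GPU -> disk   */
        }
        return NULL;
    }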

45. Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  - Disk I/O
  - Domain decomposition
  - Overlapping computation and communication
  - Development cycle and debugging
• Conclusions

46. Reverse Time Migration on GMAC: Domain decomposition (CUDA-RT)
• CUDA-RT implementation (single-transfer exchange), as seen before:
  - Streams and page-locked memory buffers must be used
  - The page-locked memory buffers can be too big

47. Reverse Time Migration on GMAC: Domain decomposition (GMAC)
• GMAC implementation (multiple-transfer exchange):
  - The exchange of boundaries is performed using a simple memcpy! (see below)
  - Full PCIe utilization: internally, GMAC performs several transfers and double buffering
[Diagram: the four GPUs live in one unified global address space.]
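The whole exchange from the earlier slides reduces to one call. In this sketch, the domain pointers come from gmacMalloc on different GPUs and the offsets are illustrative; GMAC recognizes that source and destination live on different devices and pipelines chunked, double-buffered transfers internally.

    #include <string.h>

    void exchange_boundary(float *curr_dom2, const float *curr_dom1,
                           size_t dst_halo_off, size_t src_boundary_off,
                           size_t halo_bytes)
    {
        memcpy(curr_dom2 + dst_halo_off,       /* halo region of domain 2 */
               curr_dom1 + src_boundary_off,   /* boundary of domain 1    */
               halo_bytes);
    }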

48. Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  - Disk I/O
  - Domain decomposition
  - Overlapping computation and communication
  - Development cycle and debugging
• Conclusions

49. Reverse Time Migration on GMAC: Overlapping computation and communication
• No streams, no page-locked buffers, similar performance: ±2%

CUDA-RT:

    readVelocity(velocity);
    cudaMalloc(&d_input, W_SIZE);
    cudaMalloc(&d_output, W_SIZE);
    cudaHostAlloc(&i_halos, H_SIZE);
    cudaHostAlloc(&disk_buffer, W_SIZE);
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaMemcpy(d_velocity, velocity, W_SIZE);
    for all time steps do
        launch_stage1(d_output, d_input, s1);
        launch_stage2(d_output, d_input, s2);
        cudaMemcpyAsync(i_halos, d_output, s1);
        cudaStreamSynchronize(s1);
        barrier();
        cudaMemcpyAsync(d_output, i_halos, s1);
        cudaThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            transfer_to_host(disk_buffer);
            barrier_write_to_disk();
        }
        // ... update pointers
    end for

GMAC:

    fread(velocity);
    gmacMalloc(&input, W_SIZE);
    gmacMalloc(&output, W_SIZE);
    for all time steps do
        launch_stage1(output, input);
        gmacThreadSynchronize();
        launch_stage2(output, input);
        memcpy(neighbor, output);
        gmacThreadSynchronize();
        barrier();
        if (timestep % N == 0) {
            compress(output, c_output);
            barrier_write_to_disk();
        }
        // ... update pointers
    end for

50. Outline
• Introduction
• GMAC at a glance
• Reverse Time Migration on GMAC
  - Disk I/O
  - Domain decomposition
  - Inter-domain communication
  - Development cycle and debugging
• Conclusions
