
Introduction to C++ AMP: Accelerated Massive Parallelism

Marc Grégoire, Software Architect (marc.gregoire@nuonsoft.com, http://www.nuonsoft.com/, http://www.nuonsoft.com/blog/). Disclaimer: This presentation was made based on the released Visual C++ 11 Preview.




Presentation Transcript


  1. Introduction to C++ AMP: Accelerated Massive Parallelism. Marc Grégoire, Software Architect, marc.gregoire@nuonsoft.com, http://www.nuonsoft.com/, http://www.nuonsoft.com/blog/. Disclaimer: This presentation was made based on the released Visual C++ 11 Preview. It is time to start taking advantage of the computing power of GPUs…

  2. Agenda • Introduction • Demo: N-Body Simulation • Technical • The C++ AMP Technology • Coding Demo: Mandelbrot • Summary • Resources

  3. The Power of Heterogeneous Computing • 146X: Interactive visualization of volumetric white matter connectivity • 36X: Ionic placement for molecular dynamics simulation on GPU • 100X: Astrophysics N-body simulation • 17X: Simulation in Matlab using .mex file CUDA function • 19X: Transcoding HD video stream to H.264 • 47X: Financial simulation of LIBOR model with swaptions • 30X: Ultrasound medical imaging for cancer diagnostics • 149X: Highly optimized object-oriented molecular dynamics • 20X: GLAME@lab: An M-script API for linear algebra operations on GPU • 24X: Cmatch exact string matching to find similar proteins and gene sequences

  4. CPUs vs GPUs today • CPU • Low memory bandwidth • Higher power consumption • Medium level of parallelism • Deep execution pipelines • Random accesses • Supports general code • Mainstream programming • GPU • High memory bandwidth • Lower power consumption • High level of parallelism • Shallow execution pipelines • Sequential accesses • Supports data-parallel code • Niche programming images source: AMD

  5. Tomorrow… • CPUs and GPUs coming closer together… • …nothing settled in this space yet, things still in motion… • C++ AMP is designed as a mainstream solution not only for today, but also for tomorrow images source: AMD

  6. C++ AMP • Part of Visual C++ (available now in the Visual C++ 11 preview) • Complete Visual Studio integration • STL-like library for multidimensional data • Builds on Direct3D • Performance, productivity, portability

  7. C++ AMP • Abstracts accelerators • The current version supports DirectX 11 GPUs • Could support others, like FPGAs, off-site cloud computing… • Supports a heterogeneous mix of accelerators • Example: C++ AMP can use both an NVIDIA and an AMD GPU in your system at the same time

  8. Demo… N-Body Simulation

  9. N-Body Simulation Demo

  10. Agenda • Introduction • Demo: N-Body Simulation • Technical • The C++ AMP Technology • Coding Demo: Mandelbrot • Summary • Resources

  11. C++ AMP – Basics • Everything is in the concurrency namespace • concurrency::array_view • Wraps a user-allocated buffer so that C++ AMP can use it • concurrency::array • Allocates a buffer that can be used by C++ AMP • C++ AMP automatically transfers data between those buffers and memory on the accelerators
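The distinction can be sketched as follows, against the Visual C++ 11 preview API (the buffer name `data` and the sizes are illustrative, and the preview headers are required, so treat this as a sketch rather than a drop-in sample):

```
#include <amp.h>
#include <vector>
using namespace concurrency;

std::vector<int> data(1024);        // user-allocated CPU buffer

// array_view wraps the existing buffer; C++ AMP copies it to the
// accelerator when the lambda runs, and back on synchronization.
array_view<int, 1> view(1024, data);

// array allocates its own buffer (typically in accelerator memory);
// moving data in or out of it is an explicit copy.
array<int, 1> arr(1024);
copy(data.begin(), data.end(), arr);
```

The practical difference is who owns the memory: `array_view` leaves ownership with your CPU buffer and lets the runtime manage transfers, while `array` owns accelerator-side storage outright.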

  12. C++ AMP – array_view • Read/write buffer of given dimensionality, with elements of given type: • array_view<type, dim> • Read-only buffer: • array_view<const type, dim> • Write-only buffer: • array_view<writeonly<type>, dim>

  13. C++ AMP – parallel_for_each() • concurrency::parallel_for_each(grid, lambda); • grid is the compute domain over which you want the computation performed • For array_view objects, use the array_view::grid property • The lambda is the code to execute on the accelerator • A restriction must be applied to the lambda, restrict(direct3d), which is a compile-time check that validates the lambda for execution on Direct3D accelerators • The lambda accepts 1 parameter: the index into the compute domain (= grid) • Same dimensionality as the grid, so if the grid is 2D, the index is also 2D: • index<2> idx • Access the two dimensions as idx[0] and idx[1] • Example: array_view<writeonly<int>, 2> a(height, width, pBuffer); parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) { /* … */ });

  14. C++ AMP – parallel_for_each() – lambda • Several restrictions apply to the code in the lambda: • Can only call other restrict(direct3d) functions • Must capture everything by value, except concurrency::array objects • No recursion, no virtual functions, no pointers to functions, no pointers to pointers, no goto, no try/catch/throw statements, no global variables, no static variables, no dynamic_cast, no typeid, no asm, no varargs, … • restrict(direct3d), see http://blogs.msdn.com/b/nativeconcurrency/archive/2011/09/05/restrict-a-key-new-language-feature-introduced-with-c-amp.aspx + MSDN
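The first restriction means the marker propagates through the call graph. A sketch in the preview syntax (`square` is a made-up helper, and `a` is assumed to be a 2-D float array_view):

```
// A helper must itself be restrict(direct3d) to be callable
// from a restrict(direct3d) lambda.
float square(float x) restrict(direct3d) {
    return x * x;
}

parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) {
    // OK: both the lambda and square() carry restrict(direct3d)
    a[idx] = square(a[idx]);
});
```

Calling an unrestricted function from the lambda is rejected at compile time, which is the point of the restriction: everything reaching the accelerator is validated up front.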

  15. C++ AMP – parallel_for_each() – lambda • The lambda executes in parallel with CPU code that follows parallel_for_each() until a synchronization point is reached • Synchronization: • Manually when calling array_view::synchronize() • Good idea, because you can handle exceptions gracefully • Automatically when for example array_view goes out of scope • Bad idea, errors will be ignored silently because destructors are not allowed to throw exceptions • Automatically, when CPU code observes the array_view • Not recommended, because you might lose error information if there is no try/catch block catching exceptions at that point

  16. C++ AMP – accelerator / accelerator_view • The concurrency::accelerator and concurrency::accelerator_view classes can be used to query for information on the installed accelerators • get_accelerators() returns a vector of the accelerators in the system
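A minimal sketch of such a query, assuming the preview API as described above (the exact member used to obtain the device description may differ from this guess):

```
#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

void list_accelerators() {
    // Enumerate every accelerator C++ AMP can see in this system.
    std::vector<accelerator> accs = get_accelerators();
    for (const accelerator& acc : accs) {
        // Print a human-readable description, e.g. the GPU's name.
        std::wcout << acc.get_description() << L"\n";
    }
}
```

This is how you would pick a specific device (say, the discrete GPU rather than a reference rasterizer) before dispatching work to it.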

  17. Coding Demo… Mandelbrot

  18. Mandelbrot – Single-Threaded
for (int y = -halfHeight; y < halfHeight; ++y) {
    // Formula: zi = z^2 + z0
    float Z0_i = view_i + y * zoomLevel;
    for (int x = -halfWidth; x < halfWidth; ++x) {
        float Z0_r = view_r + x * zoomLevel;
        float Z_r = Z0_r;
        float Z_i = Z0_i;
        float res = 0.0f;
        for (int iter = 0; iter < maxiter; ++iter) {
            float Z_rSquared = Z_r * Z_r;
            float Z_iSquared = Z_i * Z_i;
            if (Z_rSquared + Z_iSquared > escapeValue) {
                // We escaped
                res = iter + 1 - log(log(sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
                break;
            }
            Z_i = 2 * Z_r * Z_i + Z0_i;
            Z_r = Z_rSquared - Z_iSquared + Z0_r;
        }
        unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
        pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result;
    }
}

  19. Mandelbrot – PPL
parallel_for(-halfHeight, halfHeight, 1, [&](int y) {
    // Formula: zi = z^2 + z0
    float Z0_i = view_i + y * zoomLevel;
    for (int x = -halfWidth; x < halfWidth; ++x) {
        float Z0_r = view_r + x * zoomLevel;
        float Z_r = Z0_r;
        float Z_i = Z0_i;
        float res = 0.0f;
        for (int iter = 0; iter < maxiter; ++iter) {
            float Z_rSquared = Z_r * Z_r;
            float Z_iSquared = Z_i * Z_i;
            if (Z_rSquared + Z_iSquared > escapeValue) {
                // We escaped
                res = iter + 1 - log(log(sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
                break;
            }
            Z_i = 2 * Z_r * Z_i + Z0_i;
            Z_r = Z_rSquared - Z_iSquared + Z0_r;
        }
        unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
        pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result;
    }
});

  20. Mandelbrot – C++ AMP
array_view<writeonly<unsigned __int32>, 2> a(m_nBuffHeight, m_nBuffWidth, pBuffer);
parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) {
    // Formula: zi = z^2 + z0
    int x = idx[1] - halfWidth;
    int y = idx[0] - halfHeight;
    float Z0_i = view_i + y * zoomLevel;
    float Z0_r = view_r + x * zoomLevel;
    float Z_r = Z0_r;
    float Z_i = Z0_i;
    float res = 0.0f;
    for (int iter = 0; iter < maxiter; ++iter) {
        float Z_rSquared = Z_r * Z_r;
        float Z_iSquared = Z_i * Z_i;
        if (Z_rSquared + Z_iSquared > escapeValue) {
            // We escaped
            res = iter + 1 - fast_log(fast_log(fast_sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
            break;
        }
        Z_i = 2 * Z_r * Z_i + Z0_i;
        Z_r = Z_rSquared - Z_iSquared + Z0_r;
    }
    unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
    a[idx] = result;
});
a.synchronize();

  21. Mandelbrot – C++ AMP • Wrap C++ AMP code inside a try-catch block!
try {
    array_view<writeonly<unsigned __int32>, 2> a(m_nBuffHeight, m_nBuffWidth, pBuffer);
    parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) { ... });
    a.synchronize();
} catch (const Concurrency::runtime_exception& ex) {
    MessageBoxA(NULL, ex.what(), "Error", MB_ICONERROR);
}

  22. Summary • C++ AMP allows anyone to make use of parallel hardware • Easy to use • High-level abstractions in C++ (not C) • In-depth support for C++ AMP in Visual Studio, including the debugger • Abstracts multi-vendor hardware • The intention is to make C++ AMP an open specification

  23. Resources • Daniel Moth's blog (PM of C++ AMP), lots of interesting C++ AMP related posts • http://www.danielmoth.com/Blog/ • MSDN Native parallelism blog (team blog) • http://blogs.msdn.com/b/nativeconcurrency/ • MSDN Dev Center for Parallel Computing • http://msdn.com/concurrency • MSDN Forums to ask questions • http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads

  24. Questions? I would like to thank Daniel Moth from Microsoft for his input on this presentation.
