
Parallelism in the Standard C++: What to Expect in C++ 17

Parallelism in the Standard C++: What to Expect in C++ 17. Artur Laksberg Microsoft Corp. May 8th, 2014. Agenda. Fundamentals Task regions Parallel Algorithms Parallelization Vectorization. Part 1: The Fundamentals. Renderscript. OpenMP. CUDA. C++ AMP. PPL. TBB. MPI. OpenACC.

Presentation Transcript


  1. Parallelism in the Standard C++: What to Expect in C++ 17 Artur Laksberg Microsoft Corp. May 8th, 2014

  2. Agenda • Fundamentals • Task regions • Parallel Algorithms • Parallelization • Vectorization

  3. Part 1: The Fundamentals

  4. Renderscript OpenMP CUDA C++ AMP PPL TBB MPI OpenACC OpenCL Cilk Plus GCD

  5. Parallelism in C++11/14 • Fundamentals: • Memory model • Atomics • Basics: • thread • mutex • condition_variable • async • future
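
The C++11 building blocks listed on this slide are already enough for a simple fork and join. The sketch below is illustrative only: `parallel_sum` is a hypothetical helper, not something from the talk. It forks one task with `std::async` and joins it through the returned `std::future`.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums a vector by forking one asynchronous task for
// the first half and summing the second half on the calling thread.
long parallel_sum(const std::vector<int>& data) {
    const auto mid = data.begin() + static_cast<std::ptrdiff_t>(data.size() / 2);

    // Fork: std::launch::async requests a separate thread of execution.
    std::future<long> first_half = std::async(std::launch::async, [&data, mid] {
        return std::accumulate(data.begin(), mid, 0L);
    });

    const long second_half = std::accumulate(mid, data.end(), 0L);

    // Join: get() blocks until the task is done and rethrows anything it threw.
    return first_half.get() + second_half;
}
```

Calling `parallel_sum(v)` on a vector holding 1 through 1000 should give 500500, the same as a serial `std::accumulate`.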

  6. Quicksort: Serial void quicksort(int *v, int start, int end) { if (start < end) { int pivot = partition(v, start, end); quicksort(v, start, pivot - 1); quicksort(v, pivot + 1, end); } }

  7. Quicksort: Use Threads void quicksort(int *v, int start, int end) { if (start < end) { int pivot = partition(v, start, end); std::thread t1([&] { quicksort(v, start, pivot - 1); }); std::thread t2([&] { quicksort(v, pivot + 1, end); }); t1.join(); t2.join(); } } • Problem 1: expensive • Problem 2: Fork-join not enforced • Problem 3: Exceptions??

  8. Quicksort: Fork-Join Parallelism void quicksort(int *v, int start, int end) { if (start < end) { int pivot = partition(v, start, end); quicksort(v, start, pivot - 1); quicksort(v, pivot + 1, end); } } [Slide annotations: parallel region, task, task]

  9. Quicksort: Using Task Regions (N3832) void quicksort(int *v, int start, int end) { if (start < end) { task_region([&] (auto& r) { int pivot = partition(v, start, end); r.run([&] { quicksort(v, start, pivot - 1); }); r.run([&] { quicksort(v, pivot + 1, end); }); }); } } [Slide annotations: parallel region, task, task]
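
`task_region` is a proposal (N3832), so the following sketch swaps in `std::async` to get a similar fork-join shape with standard C++11 only. It is not the proposed API. `lomuto_partition` is a hypothetical stand-in for the slide's `partition`, and the `depth` cutoff is an assumption added so that only the top few levels spawn threads.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for partition(): Lomuto scheme over the inclusive
// range [start, end]; returns the final index of the pivot.
int lomuto_partition(int* v, int start, int end) {
    const int pivot = v[end];
    int i = start;
    for (int j = start; j < end; ++j) {
        if (v[j] < pivot) std::swap(v[i++], v[j]);
    }
    std::swap(v[i], v[end]);
    return i;
}

// depth is an assumed cutoff: once it reaches zero, recursion stays serial.
void quicksort(int* v, int start, int end, int depth = 3) {
    if (start >= end) return;
    const int pivot = lomuto_partition(v, start, end);
    if (depth <= 0) {
        quicksort(v, start, pivot - 1, 0);
        quicksort(v, pivot + 1, end, 0);
        return;
    }
    // Fork: the left half runs as a separate task.
    auto left = std::async(std::launch::async,
                           [=] { quicksort(v, start, pivot - 1, depth - 1); });
    // The right half runs on the current thread.
    quicksort(v, pivot + 1, end, depth - 1);
    // Join: get() waits for the child and rethrows its exception, if any.
    left.get();
}
```

This addresses problems 2 and 3 from the thread-based version: a future returned by `std::async` waits for its task even if the parent exits early, and `get()` forwards a child's exception. Thread creation cost (problem 1) is only limited by the cutoff, not removed.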

  10. Under The Hood…

  11. Work Stealing Scheduling [Diagram labels: proc 1, proc 2, proc 3, proc 4]

  12. Work Stealing Scheduling [Diagram labels: proc 1, proc 2, proc 3, proc 4; Old items; New items]

  13. Work Stealing Scheduling [Diagram labels: proc 1, proc 2, proc 3, proc 4; Old items; New items]

  14. Work Stealing Scheduling [Diagram labels: proc 1, proc 2, proc 3, proc 4; Old items; New items; “Thief”]
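
The queue behind these diagrams can be sketched as follows, assuming the usual work-stealing arrangement in which a processor pushes and pops at the "new" end of its own queue while a thief takes from the "old" end. `WorkQueue` and its member names are hypothetical, and a mutex is used for brevity where production schedulers typically use lock-free deques.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical per-processor work queue (one instance per "proc").
template <typename Task>
class WorkQueue {
public:
    // Owner: add the newest item.
    void push(Task t) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push_back(std::move(t));
    }

    // Owner: take the newest item. Returns false if the queue is empty.
    bool pop(Task& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.back());
        items_.pop_back();
        return true;
    }

    // Thief: take the oldest item. Returns false if the queue is empty.
    bool steal(Task& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<Task> items_;
};
```

After pushing 1, 2, 3, a thief calling `steal` receives 1 while the owner calling `pop` receives 3.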

  15. Fork-Join Parallelism and Work Stealing e(); task_region([] (auto& r) { r.run(f); g(); }); h(); • Q1: What thread runs f? • Q2: What thread runs g? • Q3: What thread runs h? [Diagram labels: e(), f(), g(), h()]

  16. Work Stealing Design Choices • What Thread Executes After a Spawn? • Child Stealing • Continuation (parent) Stealing • What Thread Executes After a Join? • Stalling: initiating thread waits • Greedy: the last thread to reach join continues task_region([] (auto& r) { for(int i=0; i<n; ++i) r.run(f); });

  17. Part 2: The Algorithms

  18. Alex Stepanov: Start With The Algorithms

  19. Inspiration Performing Parallel Operations On Containers • Intel • Threading Building Blocks • Microsoft • Parallel Patterns Library, C++ AMP • Nvidia • Thrust

  20. Parallel STL • Just like STL, only parallel… • Can be faster • If you know what you’re doing • Two Execution Policies: • std::par • std::vec

  21. Parallelization: What’s a Big Deal? • Why not already parallel? std::sort(begin, end, [](int a, int b) { return a < b; }); • User-provided closures must be thread safe: int comparisons = 0; std::sort(begin, end, [&](int a, int b) { comparisons++; return a < b; }); • But also special-member functions, std::swap etc.
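
One way to make the counting comparator safe is an atomic counter. The sketch below does not use the Parallelism TS. It imitates a parallel sort by running two ordinary `std::sort` calls on two threads that share one closure. `count_comparisons` is a hypothetical name.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Sorts two independent vectors on two threads using one shared comparator
// and returns the total number of comparisons made.
int count_comparisons(std::vector<int>& left, std::vector<int>& right) {
    // A plain int incremented from both threads would be a data race.
    std::atomic<int> comparisons{0};
    auto counting_less = [&comparisons](int a, int b) {
        comparisons.fetch_add(1, std::memory_order_relaxed);
        return a < b;
    };

    std::thread worker([&] { std::sort(left.begin(), left.end(), counting_less); });
    std::sort(right.begin(), right.end(), counting_less);
    worker.join();

    return comparisons.load();
}
```

The atomic fixes the race on the counter only. As the slide notes, element copy, move, and swap operations must be safe to call concurrently as well.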

  22. It’s a Contract • What the user can do • What the implementer can do • Asymptotic Guarantees: std::sort: O(n*log(n)), std::stable_sort: O(n*log²(n)), what about parallel sort? • What is a valid implementation? (see next slide)

  23. Chaos Sort template<typename Iterator, typename Compare> void chaos_sort( Iterator first, Iterator last, Compare comp ) { auto n = last-first; std::vector<char> c(n); for(;;) { bool flag = false; for( size_t i=1; i<n; ++i ) { c[i] = comp(first[i],first[i-1]); flag |= c[i]; } if( !flag ) break; for( size_t i=1; i<n; ++i ) if( c[i] ) std::swap( first[i-1], first[i] ); } }

  24. Execution Policies • Built-in Execution Policies: extern const sequential_execution_policy seq; extern const parallel_execution_policy par; extern const vector_execution_policy vec; • Dynamic Execution Policy: class execution_policy { public: // ... const type_info& target_type() const; template<class T> T *target(); template<class T> const T *target() const; };

  25. Using Execution Policy To Write Parallel Code std::vector<int> v = ... // standard sequential sort std::sort(v.begin(), v.end()); using namespace std::experimental::parallel; // explicitly sequential sort sort(seq, v.begin(), v.end()); // permitting parallel execution sort(par, v.begin(), v.end()); // permitting vectorization as well sort(vec, v.begin(), v.end());

  26. Picking Execution Policy Dynamically size_t threshold = ... execution_policy exec = seq; if(vec.size() > threshold) { exec = par; } sort(exec, vec.begin(), vec.end());
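
The same threshold idea can be sketched without the TS. Here the `policy` enum is a hypothetical stand-in for `execution_policy`, and the "parallel" branch is an assumed, deliberately simple two-way split using `std::async` and `std::inplace_merge`, not how a real Parallel STL sort is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for the TS's dynamic execution_policy.
enum class policy { seq, par };

void sort_with(policy p, std::vector<int>& v) {
    if (p == policy::seq || v.size() < 2) {
        std::sort(v.begin(), v.end());
        return;
    }
    // Assumed parallel strategy: sort the two halves concurrently, then merge.
    auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
    auto first_half = std::async(std::launch::async,
                                 [&v, mid] { std::sort(v.begin(), mid); });
    std::sort(mid, v.end());
    first_half.get();  // join before touching the first half again
    std::inplace_merge(v.begin(), mid, v.end());
}

// Mirrors the slide: pick the policy at run time from the input size.
void sort_adaptive(std::vector<int>& v, std::size_t threshold) {
    policy exec = policy::seq;
    if (v.size() > threshold) {
        exec = policy::par;
    }
    sort_with(exec, v);
}
```

The threshold is workload-dependent. Below it, the cost of starting a task tends to outweigh any gain from running the halves concurrently.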

  27. Exception Handling • In C++ philosophy, no exception is silently ignored • Exception list: container of exception_ptr objects try { r = std::inner_product(std::par, a.begin(), a.end(), b.begin(), 0, func1, func2); } catch(const exception_list& list) { for(auto& exptr : list) { // process exception pointer exptr } }
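
The idea of an exception list can be approximated with C++11 facilities alone. In this sketch `run_all` is a hypothetical helper that plays the role of a parallel algorithm, and a plain `std::vector<std::exception_ptr>` stands in for the TS's `exception_list`.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <future>
#include <stdexcept>
#include <vector>

// Runs every task asynchronously and returns one exception_ptr for each
// task that threw, in task order. No exception is silently dropped.
std::vector<std::exception_ptr> run_all(
        const std::vector<std::function<void()>>& tasks) {
    std::vector<std::future<void>> pending;
    for (const auto& task : tasks) {
        pending.push_back(std::async(std::launch::async, task));
    }

    std::vector<std::exception_ptr> errors;
    for (auto& result : pending) {
        try {
            result.get();  // rethrows whatever the task threw
        } catch (...) {
            errors.push_back(std::current_exception());
        }
    }
    return errors;
}
```

A caller can then walk the returned vector and pass each entry to `std::rethrow_exception` inside its own `try` block, much like the loop over `exception_list` on the slide.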

  28. Vectorization: A Tale From Agriculture

  29. A Tale From Agriculture

  30. A Tale From Agriculture

  31. Idea: Fewer Tractors, Wider Plows

  32. Vectorization: What’s a Big Deal? int a[n] = ...; int b[n] = ...; for(int i=0; i<n; ++i) { a[i] = b[i] + c; } movdqu xmm1, XMMWORD PTR _b$[esp+eax+132] movdqu xmm0, XMMWORD PTR _a$[esp+eax+132] paddd xmm1, xmm2 paddd xmm1, xmm0 movdqu XMMWORD PTR _a$[esp+eax+132], xmm1 a[i:i+3] = b[i:i+3] + c;
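
The array-section notation `a[i:i+3] = b[i:i+3] + c` is not standard C++. A minimal sketch of what it stands for, written as a four-wide strip-mined loop, is shown below. `add_constant` is a hypothetical name, and the width of four assumes 32-bit `int` values in a 128-bit register such as `xmm1`.

```cpp
#include <cassert>
#include <cstddef>

// Four-wide equivalent of: for (i = 0; i < n; ++i) a[i] = b[i] + c;
// Assumes a and b do not overlap.
void add_constant(int* a, const int* b, int c, std::size_t n) {
    constexpr std::size_t width = 4;  // assumed lanes per vector register
    std::size_t i = 0;

    // Main loop: each trip handles one group of four elements,
    // i.e. a[i:i+3] = b[i:i+3] + c in the slide's notation.
    for (; i + width <= n; i += width) {
        for (std::size_t lane = 0; lane < width; ++lane) {
            a[i + lane] = b[i + lane] + c;
        }
    }

    // Remainder loop: leftover elements when n is not a multiple of four.
    for (; i < n; ++i) {
        a[i] = b[i] + c;
    }
}
```

An optimizing compiler may turn the inner loop into a single vector add, but that is up to the compiler. The code only makes the grouping explicit.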

  33. Vector Lane is not a Thread! • Taking locks • Thread with thread_id x takes a lock… • Then another “thread” with the same thread_id enters the lock… • Deadlock!!! • Exceptions • Can we unwind 1/4th of the stack?

  34. Vectorization: Not So Easy Any More… void f(int* a, int* b) { for(int i=0; i<n; ++i) { a[i] = b[i] + c; func(); } } mov ecx, DWORD PTR _b$[esp+esi+140] add ecx, edi add DWORD PTR _a$[esp+esi+140], ecx call func • Aliasing? • Side effects? • Dependence? • Exceptions?
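
The aliasing question can be made concrete with a small experiment. In this sketch all names are hypothetical. `chunked_add` imitates a four-wide vector operation by doing all four loads before any store, and `run_overlapping` calls both versions with `b` pointing one element before `a`, so the loop effectively computes `a[i] = a[i-1] + c`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial semantics: each iteration sees the stores of earlier iterations.
void serial_add(int* a, const int* b, int c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + c;
}

// Four-wide semantics: load four elements of b, then store four elements of a.
void chunked_add(int* a, const int* b, int c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        int lanes[4];
        for (std::size_t j = 0; j < 4; ++j) lanes[j] = b[i + j] + c;  // all loads
        for (std::size_t j = 0; j < 4; ++j) a[i + j] = lanes[j];      // all stores
    }
    for (; i < n; ++i) a[i] = b[i] + c;  // remainder
}

// Runs one version on overlapping ranges: a starts one element after b.
std::vector<int> run_overlapping(bool chunked) {
    std::vector<int> data(9, 0);
    int* a = data.data() + 1;
    const int* b = data.data();
    if (chunked) {
        chunked_add(a, b, 1, 8);
    } else {
        serial_add(a, b, 1, 8);
    }
    return data;
}
```

With overlapping ranges the serial loop builds the running sequence 0, 1, 2, ..., 8, while the four-wide version reads stale values and produces a different result. This is why a compiler that cannot rule out aliasing has to keep the loop scalar.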

  35. Vectorization Hazard: Locks Consider: f takes a lock, g releases the lock: for(int i=0; i<n; ++i) { lock.enter(); a[i] = b[i] + c; lock.release(); } transformed into: for(int i=0; i<n; i+=4) { for(int j=0; j<4; ++j) lock.enter(); a[i:i+3] = b[i:i+3] + c; for(int j=0; j<4; ++j) lock.release(); } This transformation is not safe!

  36. How Do We Get This? void f(int* a, int* b) { for(int i=0; i<n; ++i) { a[i] = b[i] + c; func(); } } for(int i=0; i<n; i+=4) { a[i:i+3] = b[i:i+3] + c; for(int j=0; j<4; ++j) func(); } Need a helping hand from the programmer, because…

  37. Vector Loop with Parallel STL void f(int* a, int* b) { integer_iterator begin {0}; integer_iterator end {n}; std::for_each( std::vec, begin, end, [&](int i) { a[i] = b[i] + c; func(); }); }

  38. Parallelization vs. Vectorization • Parallelization: • Threads • Stack • Good for divergent code • Relatively heavy-weight • Vectorization: • Vector lanes • No stack • Lock-step execution • Very light-weight

  39. When To Vectorize • std::par: • No race conditions • No aliasing • std::vec: • Same as std::par, plus: • No exceptions • No locks • No/little divergence

  40. References • N3832: Task Region • N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing • N3724: A Parallel Algorithms Library • N3850: Working Draft, Technical Specification for C++ Extensions for Parallelism • parallelstl.codeplex.com
