Parallelism in C++ using the Concurrency Runtime Don McCrady, Principal Development Lead, Parallel Computing Platform June 7-10
Topics • Some cool new C++ • Parallel Iteration • Tricks for reducing shared state • Asynchronous Agents • Concurrent containers • Demo: N-Bodies
Concurrency Runtime • Part of the C++ Runtime • No new libraries to link in • PPL: Parallel Pattern Library • Agents: Asynchronous Agents Library • Abstracts away the notion of threads • Tasks are computations that may be run in parallel • Use PPL & Agents to express your potential concurrency • Let the runtime map it to the available concurrency • Scale from 1 to 256 cores
Lambdas – Cool New C++ class _FT { public: _FT(int x, int& y) : _x(x), _y(y) { } void operator()(int z) { _y += _x - z; } private: int _x; int& _y; }; int x = 5; int y = 7; _FT functor(x, y); functor(3); cout << y; • int x = 5; • int y = 7; • auto functor = • [x, &y] (int z) { • y += x - z; • }; • functor(3); • cout << y;
Lambdas – Functional Programming Lambdas make functional programming palatable in C++ • #include <vector> • #include <algorithm> • #include <iostream> • using namespace std; • vector<int> v = …; • for_each(v.begin(), v.end(), [] (int item) { • cout << item << endl; • });
parallel_for parallel_for iterates over a range in parallel • #include <ppl.h> • using namespace Concurrency; • parallel_for(0, 1000, [] (int i) { • work(i); • });
parallel_for parallel_for(0, 1000, [] (int i) { work(i); }); • Order of iteration is indeterminate. • Cores may come and go. • Ranges may be stolen by newly idle cores. Example split across four cores: Core 1 runs work(0…249), Core 2 runs work(250…499), Core 3 runs work(500…749), Core 4 runs work(750…999).
parallel_for • parallel_for considerations: • Designed for unbalanced loop bodies • An idle core can steal a portion of another core’s range of work • Supports cancellation • Early exit in search scenarios • For fixed-sized loop bodies that don’t need cancellation, consider parallel_for_fixed from the sample pack.
parallel_for: Tips Parallelize outer loops first Usually plenty of outer loop iterations to spread out to all cores Inner loops do sufficient work to overcome parallel overheads • parallel_for(0, yBound, [] (int y) { • for (int x = 0; x < xBound; ++x) { • complex c(minReal + deltaReal * x, • minImag + deltaImag * y); • Color pixel = ComputeMandelBrotColor(c); • … • } • });
parallel_for_each parallel_for_each iterates over an STL container in parallel • #include <ppl.h> • using namespace Concurrency; • vector<int> v = …; • parallel_for_each(v.begin(), v.end(), [] (int i) { • work(i); • });
parallel_for_each: Tips • Works best with containers that support random-access iterators: • std::vector, std::array, std::deque, Concurrency::concurrent_vector, … • Works okay, but with higher overhead on containers that support forward (or bi-di) iterators: • std::list, std::map, …
Shared State Shared state kills scalability of parallel iteration critical_section cs; double sum = 0; parallel_for(0, 1000, [&sum, &cs] (int i) { cs.lock(); if (SomeCondition(i)) sum += SomeComputation(i); cs.unlock(); SomeFurtherComputation(i); }); • High contention: entire loop is serialized. • Cache thrashing. • Potential thread explosion.
Shared State Reduce contention if possible. critical_section cs; double sum = 0; parallel_for(0, 1000, [&sum, &cs] (int i) { if (SomeCondition(i)) { cs.lock(); sum += SomeComputation(i); cs.unlock(); } SomeFurtherComputation(i); }); • Contention potentially reduced by moving lock inside the if-statement. • Still thrashes the cache.
Shared State Use combinable for per-thread computations. Each thread has its own state; no shared state. Operations must be commutative. combinable<double> sums; parallel_for(0, 1000, [&sums] (int i) { if (SomeCondition(i)) sums.local() += SomeComputation(i); SomeFurtherComputation(i); }); double sum = sums.combine(std::plus<double>()); • Practically zero contention. • No cache thrashing.
Messaging and Agents • Not all patterns map to loops or tasks. • Pipelines, state machines, producer/consumer • Agent: an asynchronous object that communicates through message passing. • Message Blocks: participants in message-passing which transport from source to target. • Message: encapsulates state that is transferred between message blocks.
Asynchronous Agents Library Message blocks for storing data • unbounded_buffer<T> • overwrite_buffer<T> • single_assignment<T> Message blocks for pipelining • transformer<T,U> • call<T> Send and receive • send, asend • receive • try_receive Message blocks for joining data • choice • join
Simple Agents Example Pipeline: main sends "glorp" into an unbounded_buffer, which propagates it to a transformer that reverses it to "prolg"; the agent then receives the reversed string.
Simple Agents Example: ReverserAgent • class ReverserAgent : public Concurrency::agent • { • private: • transformer<string, string> reverser; • public: • unbounded_buffer<string> inputBuffer; • ReverserAgent() • : reverser([] (string in) -> string { • string reversed(in); • reverse(reversed.begin(), reversed.end()); • return reversed; • }) • { • inputBuffer.link_target(&reverser); • } • protected: • virtual void run(); • };
Simple Agents Example: ReverserAgent::run • void ReverserAgent::run() { • for (;;) { • string s = receive(&reverser); • if (s == "pots") { // "stop" reversed • done(); • return; • } • cout << "Received message : " << s << endl; • } • }
Simple Agents Example: Sending messages • int main() • { • ReverserAgent reverseAgent; • reverseAgent.start(); • for (;;) { • string s; • cin >> s; • send(reverseAgent.inputBuffer, s); • if (s == "stop") • break; • } • agent::wait(&reverseAgent); • }
Concurrent Containers • Two thread-safe, lock-free containers provided: • concurrent_vector<T>: • Lock-free push_back, element access, and iteration • No deletion! • concurrent_queue<T>: • Lock-free push and pop • Sample pack adds: • concurrent_unordered_map<T,U> • concurrent_set<T>
concurrent_vector<T> • #include <ppl.h> • #include <concurrent_vector.h> • using namespace Concurrency; • concurrent_vector<int> carmVec; • parallel_for(2, 5000000, [&carmVec] (int i) { • if (is_carmichael(i)) • carmVec.push_back(i); • });
concurrent_queue<T> • #include <ppl.h> • #include <concurrent_queue.h> • using namespace Concurrency; • concurrent_queue<int> itemQueue; • parallel_invoke([&itemQueue] { // Produce 1000 items • for (int i = 0; i < 1000; ++i) • itemQueue.push(i); • }, • [&itemQueue] { // Consume 1000 items • for (int i = 0; i < 1000; ++i) { • int result = -1; • while (!itemQueue.try_pop(result)) • Context::Yield(); • ProcessItem(result); • } • });
Take-aways • The “Many Core Shift” is happening • VS2010 with the Concurrency Runtime can help • Use PPL & Agents to express your potential concurrency • Let the runtime figure out the actual concurrency • Parallel iteration can help your application scale • Asynchronous Agents provide isolation from shared state • Concurrent collections are scalable and lock-free
Resources • Parallel Computing Developer Center http://msdn.com/Concurrency • ConcRT Sample Pack http://code.msdn.com/concrtextras • Native Concurrency Blog http://blogs.msdn.com/nativeconcurrency • Forums http://social.msdn.microsoft.com/Forums/en-US/category/parallelcomputing
Backup: is_carmichael() • bool is_carmichael(const int n) { • if (n < 2) { return false; } • int k = n; • for (int i = 2; i <= k / i; ++i) { • if (k % i == 0) { • if ((k / i) % i == 0) { return false; } • if ((n - 1) % (i - 1) != 0) { return false; } • k /= i; • i = 1; • } • } • return k != n && (n - 1) % (k - 1) == 0; • }