“Going Parallel with C++11” Supercomputing 2012. Joe Hummel, PhD UC-Irvine hummelj@ics.uci.edu http://www.joehummel.net/downloads.html. C++ 11. New standard of C++ has been ratified “C++0x” ==> “C++11” Lots of new features Workshop will focus on concurrency features.
“Going Parallel with C++11”
Supercomputing 2012
Joe Hummel, PhD
UC-Irvine
hummelj@ics.uci.edu
http://www.joehummel.net/downloads.html

C++ 11
• New standard of C++ has been ratified
• “C++0x” ==> “C++11”
• Lots of new features
• Workshop will focus on concurrency features

Why are we here?
• Async programming:
    • Better responsiveness…
    • GUIs (desktop, web, mobile)
    • Cloud
    • Windows 8
• Parallel programming:
    • Better performance…
    • Financials
    • Scientific
    • Big data

Hello World --- of threading

    #include <thread>
    #include <iostream>

    // A simple function for the thread to do…
    void func()
    {
      std::cout << "**Inside thread " << std::this_thread::get_id() << "!" << std::endl;
    }

    int main()
    {
      std::thread t;
      t = std::thread(func);   // Create and schedule thread…
      t.join();                // Wait for thread to finish…
      return 0;
    }

Demo #1
• Hello world…

Avoiding early program termination…

    #include <thread>
    #include <iostream>

    void func()
    {
      std::cout << "**Hello world...\n";
    }

    int main()
    {
      std::thread t;
      t = std::thread(func);
      t.join();
      return 0;
    }

(1) Thread function must do exception handling; unhandled exceptions ==> termination…

    void func()
    {
      try
      {
        // computation:
      }
      catch(...)
      {
        // do something:
      }
    }

(2) Must join, otherwise termination… (avoid use of detach( ), difficult to use safely)

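Points (1) and (2) can be combined in one small sketch: the thread function catches everything and records the failure, and the caller always joins before reading the result. The names guarded_worker and run_and_join are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical worker: catches everything so no exception escapes the thread.
void guarded_worker(bool should_fail, std::string& error)
{
  try
  {
    if (should_fail)
      throw std::runtime_error("computation failed");
    // the real computation would go here
  }
  catch (const std::exception& e)
  {
    error = e.what();          // record the failure for the caller
  }
  catch (...)
  {
    error = "unknown error";
  }
}

// Hypothetical helper: starts the worker, joins it, returns any error text.
std::string run_and_join(bool should_fail)
{
  std::string error;
  std::thread t(guarded_worker, should_fail, std::ref(error));
  t.join();                    // join before 'error' is read or destroyed
  assert(!t.joinable());       // a joined thread no longer owns a running thread
  return error;
}
```

Calling run_and_join(true) returns the exception text instead of terminating the program; run_and_join(false) returns an empty string.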
Pick your style…
• Old school:
    • distinct thread functions (what we just saw)
• New school:
    • lambda expressions (aka anonymous functions)

Demo #2
• A thread that loops until we tell it to stop…
• When user presses ENTER, we’ll tell thread to stop…

(1) via thread function

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <iostream>
    #include <string>
    #include <thread>
    using namespace std;

    // stop is atomic so that main's write is a race-free signal to the thread:
    void loopUntil(atomic<bool>* stop)
    {
      auto duration = chrono::seconds(2);

      while (!(*stop))
      {
        cout << "**Inside thread...\n";
        this_thread::sleep_for(duration);
      }
    }
    . . .
    int main()
    {
      atomic<bool> stop(false);

      thread t(loopUntil, &stop);

      getchar();     // wait for user to press enter:
      stop = true;   // stop thread:
      t.join();
      return 0;
    }

(2) via lambda expression

Closure semantics: [ ]: none, [&]: by ref, [=]: by val, …

    int main()
    {
      atomic<bool> stop(false);   // atomic: written by main, read by the thread

      thread t = thread([&]()
      {
        auto duration = chrono::seconds(2);

        while (!stop)
        {
          cout << "**Inside thread...\n";
          this_thread::sleep_for(duration);
        }
      });

      getchar();     // wait for user to press enter:
      stop = true;   // stop thread:
      t.join();
      return 0;
    }

Anatomy of thread t([&]() { … }): [&] gives the closure semantics, ( ) holds the lambda arguments, and { … } is the body; together they form the lambda expression.

Trade-offs
• Lambdas:
    • Easier and more readable -- code remains inline
    • Potentially more dangerous ([&] captures everything by ref)
• Functions:
    • Potentially more efficient -- lambdas are compiled into classes (function objects)
    • Potentially safer -- requires explicit variable scoping
    • More cumbersome and less legible

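The danger of [&] can be reduced by naming each capture explicitly. The sketch below, with the hypothetical function name capture_demo, captures one variable by value and one by reference so the difference is visible.

```cpp
#include <cassert>
#include <thread>

// Hypothetical demo: returns true if the captures behaved as described.
bool capture_demo()
{
  int by_val = 1;
  int by_ref = 1;

  // Explicit capture list: by_val is copied now, by_ref is shared.
  std::thread t([by_val, &by_ref]()
  {
    by_ref = by_val + 41;      // writes through to the caller's variable
  });

  by_val = 100;                // does not affect the copy held by the lambda
  t.join();                    // by_ref must outlive the thread, so join first

  return by_ref == 42 && by_val == 100;
}
```

An explicit list such as [by_val, &by_ref] documents exactly what the thread shares with its creator, which is the "explicit variable scoping" advantage that plain functions have.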
Demo #3
• Multiple threads looping…
• When user presses ENTER, all threads stop…

Solution

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <iostream>
    #include <string>
    #include <vector>
    #include <algorithm>
    using namespace std;

    int main()
    {
      cout << "** Main Starting **\n\n";

      atomic<bool> stop(false);   // atomic: written by main, read by the workers

      vector<thread> workers;

      for (int i = 1; i <= 3; i++)
      {
        workers.push_back(
          thread([i, &stop]()
          {
            while (!stop)
            {
              cout << "**Inside thread " << i << "...\n";
              this_thread::sleep_for(chrono::seconds(i));
            }
          })
        );
      }

      getchar();

      cout << "** Main Done **\n\n";

      stop = true;   // stop threads:

      // wait for threads to complete:
      for (thread& t : workers)
        t.join();

      return 0;
    }

A Real Example
• Matrix multiply…

Multi-threaded solution

    // 1 thread per core:
    numthreads = thread::hardware_concurrency();

    int rows  = N / numthreads;
    int extra = N % numthreads;
    int start = 0;      // each thread does [start..end)
    int end   = rows;

    vector<thread> workers;

    for (int t = 1; t <= numthreads; t++)
    {
      if (t == numthreads)   // last thread does extra rows:
        end += extra;

      workers.push_back( thread([start, end, N, &C, &A, &B]()
      {
        for (int i = start; i < end; i++)
          for (int j = 0; j < N; j++)
          {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
              C[i][j] += (A[i][k] * B[k][j]);
          }
      }));

      start = end;
      end   = start + rows;
    }

    for (thread& t : workers)
      t.join();

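The row arithmetic in the loop above can be checked in isolation. The helper below, partition_rows, is a hypothetical name; it reproduces the same [start..end) ranges and also guards against hardware_concurrency() returning 0, which the standard permits when the value cannot be determined.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: split rows [0, N) into numthreads half-open ranges,
// giving the leftover rows to the last range.
std::vector<std::pair<int, int>> partition_rows(int N, int numthreads)
{
  assert(N >= 0);
  if (numthreads < 1)
    numthreads = 1;            // fall back to one range if the count is unknown

  std::vector<std::pair<int, int>> ranges;
  int rows  = N / numthreads;
  int extra = N % numthreads;
  int start = 0;

  for (int t = 1; t <= numthreads; t++)
  {
    int end = start + rows;
    if (t == numthreads)       // last range absorbs the extra rows
      end += extra;
    ranges.push_back(std::make_pair(start, end));
    start = end;
  }
  return ranges;
}
```

For N = 10 and 3 threads this yields [0,3), [3,6), [6,10), so the last thread does one extra row.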
High-Performance Computing
• Parallelism alone is not enough…

HPC == Parallelism + Memory Hierarchy - Contention

• Expose parallelism
• Minimize interaction:
    • false sharing
    • locking
    • synchronization
• Maximize data locality:
    • network
    • disk
    • RAM
    • cache
    • core

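As one illustration of minimizing false sharing, per-thread counters can be padded so that each sits on its own cache line. This is a sketch only: the 64-byte line size is an assumption (common on x86, not guaranteed), and PaddedCounter, kThreads, and count_without_false_sharing are hypothetical names.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines (common on x86, not guaranteed elsewhere).
struct alignas(64) PaddedCounter
{
  long value;
};

const int kThreads = 4;  // hypothetical fixed thread count for the sketch

// Each thread increments only its own padded slot; main sums after joining.
long count_without_false_sharing(long iterations)
{
  assert(iterations >= 0);
  PaddedCounter counters[kThreads] = {};  // zero-initialized, one line each

  std::vector<std::thread> workers;
  for (int t = 0; t < kThreads; t++)
    workers.push_back(std::thread([t, iterations, &counters]()
    {
      for (long i = 0; i < iterations; i++)
        counters[t].value++;   // no other thread writes this cache line
    }));

  for (std::thread& w : workers)
    w.join();

  long total = 0;
  for (int t = 0; t < kThreads; t++)
    total += counters[t].value;
  return total;
}
```

Without the alignas padding the four counters would likely share one cache line, and the cores would contend for it even though no variable is logically shared.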
Cache-friendly matrix multiply
[Diagram: matrix access pattern]

Cache-friendly solution
• Loop interchange is first step…

    workers.push_back( thread([start, end, N, &C, &A, &B]()
    {
      for (int i = start; i < end; i++)
        for (int j = 0; j < N; j++)
          C[i][j] = 0.0;

      for (int i = start; i < end; i++)
        for (int k = 0; k < N; k++)
          for (int j = 0; j < N; j++)
            C[i][j] += (A[i][k] * B[k][j]);
    }));

• Next step is to block multiply…

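The slides stop before showing the blocked version, so the following is only one possible sketch of block (tiled) multiply, not the presenter's code. The Matrix typedef, the function name block_multiply, and the tile-size parameter BS are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef std::vector<std::vector<double>> Matrix;

// Hypothetical sketch of blocking: C += A * B for rows [start, end),
// processed in BS x BS tiles of k and j. The caller must zero C first.
void block_multiply(const Matrix& A, const Matrix& B, Matrix& C,
                    int start, int end, int N, int BS)
{
  assert(BS > 0);
  for (int kk = 0; kk < N; kk += BS)
    for (int jj = 0; jj < N; jj += BS)
      for (int i = start; i < end; i++)
        for (int k = kk; k < std::min(kk + BS, N); k++)
          for (int j = jj; j < std::min(jj + BS, N); j++)
            C[i][j] += A[i][k] * B[k][j];
}
```

Each worker thread could call block_multiply on its own [start, end) row range in place of the interchanged loops. A good tile size depends on the cache and is best found by measurement.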
C++11 Features and Status

Compilers…
• No compiler as yet fully implements C++11
• Visual C++ 2012 has best concurrency support
    • Part of Visual Studio 2012
• gcc 4.7 has best overall support
    • http://gcc.gnu.org/projects/cxx0x.html
• clang 3.1 appears very good as well
    • I did not test
    • http://clang.llvm.org/cxx_status.html

Compiling with gcc

# makefile

# threading library: one of these should work
# tlib=thread
tlib=pthread

# gcc 4.6:
ver=c++0x
# gcc 4.7:
# ver=c++11

build:
	g++ -std=$(ver) -Wall main.cpp -l$(tlib)

Executive Summary

Locking
• Use mutex to protect against concurrent access…

    #include <mutex>

    mutex m;
    int sum;

    thread t1([&]()
    {
      m.lock();
      sum += compute();
      m.unlock();
    });

    thread t2([&]()
    {
      m.lock();
      sum += compute();
      m.unlock();
    });

RAII
• “Resource Acquisition Is Initialization”
    • Advocated by B. Stroustrup for resource management
    • Uses constructor & destructor to properly manage resources (files, threads, locks, …) in presence of exceptions, etc.

    thread t([&]()
    {
      m.lock();
      sum += compute();
      m.unlock();
    });

should be written as…

    thread t([&]()
    {
      lock_guard<mutex> lg(m);   // Locks m in constructor
      sum += compute();
    });                          // Unlocks m in destructor

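A complete version of the lock_guard pattern might look like the sketch below. The function name locked_sum is hypothetical, and a simple increment stands in for the slide's compute().

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical demo: several threads add into one sum under a lock_guard.
int locked_sum(int numthreads, int per_thread)
{
  assert(numthreads > 0 && per_thread >= 0);
  std::mutex m;
  int sum = 0;

  std::vector<std::thread> workers;
  for (int t = 0; t < numthreads; t++)
    workers.push_back(std::thread([&m, &sum, per_thread]()
    {
      for (int i = 0; i < per_thread; i++)
      {
        std::lock_guard<std::mutex> lg(m);  // locks m here
        sum += 1;                           // stand-in for compute()
      }                                     // unlocks m at end of each iteration
    }));

  for (std::thread& w : workers)
    w.join();
  return sum;
}
```

Because the guard's destructor releases the mutex, the lock is dropped even if the protected statement throws, which a manual m.unlock() would miss.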
Atomics
• Use atomic to protect shared variables…
    • Lighter-weight than locking, but much more limited in applicability

    #include <atomic>

    atomic<int> count;
    count = 0;

    thread t1([&]() { count++; });
    thread t2([&]() { count++; });

    thread t3([&]() { count = count + 1; });   // X not safe…

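The safe form can be exercised with many increments per thread, as in this sketch (atomic_count is a hypothetical name). It relies on count++ being a single atomic read-modify-write, whereas count = count + 1 is a separate atomic load followed by an atomic store and can lose updates.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical demo: every increment is one atomic read-modify-write.
int atomic_count(int numthreads, int per_thread)
{
  assert(numthreads > 0 && per_thread >= 0);
  std::atomic<int> count(0);

  std::vector<std::thread> workers;
  for (int t = 0; t < numthreads; t++)
    workers.push_back(std::thread([&count, per_thread]()
    {
      for (int i = 0; i < per_thread; i++)
        count++;               // atomic increment, no update is lost
    }));

  for (std::thread& w : workers)
    w.join();
  return count.load();
}
```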
Lock-free programming
• Atomics enable safe, lock-free programming
• “Safe” is a relative word…

Lazy init:

    int x;
    atomic<bool> initd;
    initd = false;

    thread t1([&]()
    {
      if (!initd)
      {
        lock_guard<mutex> _(m);
        x = 42;
        initd = true;
      }
      << consume x, … >>
    });

    thread t2([&]()
    {
      if (!initd)
      {
        lock_guard<mutex> _(m);
        x = 42;
        initd = true;
      }
      << consume x, … >>
    });

Done flag:

    int x;
    atomic<bool> done;
    done = false;

    thread t1([&]()
    {
      x = 42;
      done = true;
    });

    thread t2([&]()
    {
      while (!done)
        ;
      assert(x==42);
    });

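The done-flag pattern can be written as a self-contained function, sketched here under the hypothetical name publish_with_flag. It relies on the default sequentially consistent ordering of atomic<bool>: the write to x happens before the store to done, and the consumer reads x only after it has seen done become true.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo of the done-flag pattern: returns what the consumer saw.
int publish_with_flag()
{
  int x = 0;                       // plain data, published through the flag
  std::atomic<bool> done(false);
  int seen = -1;

  std::thread producer([&]()
  {
    x = 42;                        // write the data first
    done = true;                   // then publish it
  });

  std::thread consumer([&]()
  {
    while (!done)
      std::this_thread::yield();   // spin politely until published
    seen = x;                      // ordered after the producer's write
  });

  producer.join();
  consumer.join();
  return seen;
}
```

If done were a plain bool, the same code would contain a data race and the guarantee would disappear.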
Demo
• Prime numbers…

Futures
• Futures provide a higher level of abstraction
    • Starts an asynchronous operation on some thread, await result…

    #include <future>
    .
    .
    future<int> fut = async( []() -> int      // -> int: return type…
      {
        int result = PerformLongRunningOperation();
        return result;
      }
    );
    .
    .
    try
    {
      int x = fut.get();   // join, harvest result:
      cout << x << endl;
    }
    catch(exception &e)
    {
      cout << "**Exception: " << e.what() << endl;
    }

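The try/catch around fut.get() matters because an exception thrown inside the asynchronous operation is stored in the future and rethrown by get(). A minimal sketch, with risky_operation and harvest as hypothetical names standing in for the slide's PerformLongRunningOperation:

```cpp
#include <cassert>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a long-running operation that can fail.
int risky_operation(bool fail)
{
  if (fail)
    throw std::runtime_error("operation failed");
  return 7;
}

// Returns the result as text, or the exception message if the task threw.
std::string harvest(bool fail)
{
  std::future<int> fut = std::async(std::launch::async, risky_operation, fail);
  assert(fut.valid());                 // the future refers to shared state
  try
  {
    return std::to_string(fut.get());  // get() waits, then returns or rethrows
  }
  catch (const std::exception& e)
  {
    return std::string("error: ") + e.what();
  }
}
```

Unlike a raw std::thread, where an escaping exception terminates the program, the future carries the failure back to the thread that asks for the value.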
Execution of futures
• May run on the current thread
• May run on a new thread
• Often better to let system decide…

    // run on current thread when someone asks for value (“lazy”):
    future<T> fut1 = async( launch::deferred, []() -> ... );

    // run on a new thread:
    future<T> fut2 = async( launch::async, []() -> ... );

    // let system decide:
    future<T> fut3 = async( launch::async | launch::deferred, []() -> ... );
    future<T> fut4 = async( []() ... );

Note: standard C++11 defines only launch::async and launch::deferred. The names launch::sync (lazy, like deferred) and launch::any (let system decide) date from pre-standard drafts; some implementations of the time accepted them as extensions, but they are not portable.

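The difference between the two standard policies can be observed through thread ids, as in this sketch (both function names are hypothetical). A deferred task runs lazily inside get() on the calling thread, while launch::async runs as if on a new thread.

```cpp
#include <cassert>
#include <future>
#include <thread>

// Hypothetical check: a deferred task runs on the thread that calls get().
bool deferred_runs_on_caller()
{
  std::thread::id caller = std::this_thread::get_id();
  std::future<std::thread::id> fut =
    std::async(std::launch::deferred,
               []() { return std::this_thread::get_id(); });
  return fut.get() == caller;      // the task body executes inside get()
}

// Hypothetical check: launch::async runs the task on a different thread.
bool async_runs_elsewhere()
{
  std::thread::id caller = std::this_thread::get_id();
  std::future<std::thread::id> fut =
    std::async(std::launch::async,
               []() { return std::this_thread::get_id(); });
  return fut.get() != caller;
}
```

With no policy argument the implementation is free to choose either behavior, so code should not depend on which thread runs the task.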
Demo
• Netflix data-mining…
• Computes average review for a movie…
[Diagram: Netflix Movie Reviews (.txt) feeds the Netflix Data Mining App]

Memory Model
• C++ committee thought long and hard on memory model semantics…
    • “You Don’t Know Jack About Shared Variables or Memory Models”, Boehm and Adve, CACM, Feb 2012
• Conclusion:
    • No suitable definition in presence of race conditions
• Solution:
    • Predictable memory model *only* in data-race-free codes
    • Computer may “catch fire” in presence of data races

Example…

    int x, y, r1, r2;
    x = y = r1 = r2 = 0;

    // both threads are started before either is joined, so they run concurrently:
    thread t1([&]()
    {
      x = 1;
      r1 = y;
    });

    thread t2([&]()
    {
      y = 1;
      r2 = x;
    });

    t1.join();
    t2.join();

What can we say about r1 and r2?

Dekker’s example…
• If we think in terms of all possible thread interleavings (aka “sequential consistency”), then we know r1 = 1, r2 = 1, or both
• In C++ 11? Not only are the values of x, y, r1 and r2 undefined, but the program may crash!

C++ 11 Memory Model
• Def: two memory accesses conflict if they
    • access the same scalar object or contiguous sequence of bit fields, and
    • at least one access is a store.
• Def: two memory accesses participate in a data race if they
    • conflict, and
    • can occur simultaneously.
• A program is data-race-free (DRF) if no sequentially-consistent execution results in a data race (achieved via independent threads, locks, atomics, …). Avoid anything else.

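Dekker's example becomes data-race-free if the two shared variables are made atomic. The sketch below (Result and dekker_with_atomics are hypothetical names) uses the default sequentially consistent ordering, under which the interleaving argument applies again and at least one of r1, r2 must be 1.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Result { int r1; int r2; };

// Hypothetical data-race-free variant of the example: x and y are atomic,
// r1 and r2 are each written by one thread and read only after the joins.
Result dekker_with_atomics()
{
  std::atomic<int> x(0), y(0);
  int r1 = 0, r2 = 0;

  std::thread t1([&]() { x = 1; r1 = y; });
  std::thread t2([&]() { y = 1; r2 = x; });

  t1.join();
  t2.join();

  Result res = { r1, r2 };
  return res;
}
```

With plain int x and y the two threads conflict on both variables and can run simultaneously, so by the definitions above the program has a data race and no outcome is guaranteed.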
Beyond Threads

Tasks vs. Threads
• Tasks are a higher-level abstraction
    • Task: a unit of work; an object denoting an ongoing operation or computation.
• Idea:
    • developers identify work
    • run-time system deals with execution details

Example
• Microsoft PPL: Parallel Patterns Library
• Matrix Multiply

    #include <ppl.h>

    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        C[i][j] = 0.0;

    // for (int i = 0; i < N; i++)
    Concurrency::parallel_for(0, N, [&](int i)
    {
      for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
          C[i][j] += (A[i][k] * B[k][j]);
    });

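PPL is specific to Visual C++. To show the same shape with only standard C++11, here is a deliberately naive stand-in built on std::async. The name simple_parallel_for is hypothetical and this is not how PPL is implemented: it launches one task per index, whereas a real task scheduler batches work onto a pool of worker threads.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical stand-in for a task-based parallel_for over [first, last).
template <typename Body>
void simple_parallel_for(int first, int last, Body body)
{
  assert(first <= last);
  std::vector<std::future<void>> tasks;

  for (int i = first; i < last; i++)
    tasks.push_back(std::async(std::launch::async, body, i));  // one task per index

  for (std::future<void>& f : tasks)
    f.get();                   // wait; rethrows any exception from the task
}
```

Usage mirrors the slide: simple_parallel_for(0, N, [&](int i) { … }); with each iteration writing only row i of C, so no two tasks touch the same element.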
Execution Model
[Diagram: parallel_for( ... ) produces tasks inside a Windows process; the Parallel Patterns Library layers a Task Scheduler over a Thread Pool of worker threads, above the Resource Manager, Windows, and the cores (C).]

That’s it!

Thank you for attending
• Presenter: Joe Hummel
    • Email: hummelj@ics.uci.edu
    • Materials: http://www.joehummel.net/downloads.html
• References:
    • Book: “C++ Concurrency in Action”, by Anthony Williams
    • Talks: Bjarne and friends at MSFT’s “Going Native 2012”
        • http://channel9.msdn.com/Events/GoingNative/GoingNative-2012
    • Tutorials: really nice series by Bartosz Milewski
        • http://bartoszmilewski.com/2011/08/29/c11-concurrency-tutorial/
    • FAQ: Bjarne Stroustrup’s extensive FAQ
        • http://www.stroustrup.com/C++11FAQ.html