Houston Tech Fest 2011 Scalable Concurrent C++ Using Microsoft ConcRT and AMP

Houston Tech Fest 2011 Scalable Concurrent C++ Using Microsoft ConcRT and AMP Presented by David Cravey 10/15/2011

About Me – David Cravey • Started programming in 4th grade • Learned BASIC on a V-Tech “Precomputer 1000” and then GW-BASIC, and eventually QuickBasic • Got bored with BASIC in 8th Grade so moved to C++ • Software Development Manager at Vivicom • President of the Houston C++ User Group • Meets at Microsoft’s Houston Office • 1st Thursday of Each Month @ 7PM • Microsoft Visual C++ MVP

Agenda • Why C++? • Concurrent Runtime • Tasks • PPL • Agents • GPGPU • AMP • Resources • Summary

C++ The language of power!

Why C++ ? • C++ Provides • Speed • Down to the metal performance! • Access to the Latest Hardware and Drivers • Example: GPGPU • Multi-paradigm Programming • Procedural • Object Oriented • Generic Programming • High Level Programming (i.e. Strong Abstractions) • Classes AND Templates • But still allows you to step down to Low Level as needed! • Portable Code

Modern C++: Clean Safe Fast *Used with permission from Herb Sutter’s “Writing modern C++ code: how C++ has evolved over the years” http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-835T

Automatic Memory Management • Never type “delete” again! unique_ptr< > shared_ptr< > weak_ptr< >
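A minimal sketch of how the three smart pointers named on this slide divide up ownership, using only the standard `<memory>` header. The `Widget` type and both function names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical type used only for this illustration.
struct Widget {
    explicit Widget(int v) : value(v) {}
    int value;
};

// unique_ptr: exactly one owner; ownership is moved, never copied.
inline int unique_owner_demo() {
    std::unique_ptr<Widget> a(new Widget(42));
    std::unique_ptr<Widget> b = std::move(a);   // a is now empty
    assert(!a);
    return b->value;                            // Widget is freed when b goes out of scope
}

// shared_ptr + weak_ptr: shared ownership plus a non-owning observer.
inline bool weak_observer_demo() {
    std::weak_ptr<Widget> observer;
    {
        std::shared_ptr<Widget> s1 = std::make_shared<Widget>(7);
        std::shared_ptr<Widget> s2 = s1;        // two owners now
        observer = s1;                          // observing does not extend the lifetime
        assert(s1.use_count() == 2);
        assert(!observer.expired());
    }                                           // last owner gone: Widget destroyed
    return observer.expired();
}
```

Neither function contains a `delete`; the destructors of the smart pointers release the `Widget` on every path out of the scope.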

What’s Different: At a Glance

Then

    circle* p = new circle( 42 );
    vector<shape*> vw = load_shapes();
    for( vector<shape*>::iterator i = vw.begin(); i != vw.end(); ++i ) {
        if( *i && **i == *p )
            cout << **i << " is a match\n";
    }
    for( vector<shape*>::iterator i = vw.begin(); i != vw.end(); ++i ) {
        delete *i;
    }
    delete p;

Now

    auto p = make_shared<circle>( 42 );
    vector<shared_ptr<shape>> vw = load_shapes();
    for_each( begin(vw), end(vw), [&]( shared_ptr<shape>& s ) {
        if( s && *s == *p )
            cout << *s << " is a match\n";
    } );

Callouts • T* becomes shared_ptr<T> • new becomes make_shared • auto type deduction • for/while/do become std:: algorithms with [&] lambda functions • Then: not exception-safe (missing try/catch, __try/__finally) • Now: no need for “delete”, automatic lifetime management, exception-safe

*Used with permission from Herb Sutter’s “Writing modern C++ code: how C++ has evolved over the years” http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-835T

Concurrency Because processors will keep getting more cores … but not very many more GHz!

Why Concurrency? You can deal with problems faster if you have more threads (or “light sabers”)!!! My HERO!

Why A Concurrency Runtime? • According to the MSDN: http://msdn.microsoft.com/en-us/library/ee207192.aspx • A runtime for concurrency provides uniformity and predictability to applications and application components that run simultaneously. • (i.e. Without a single concurrency runtime various libraries and routines will end up “competing” instead of “cooperating” for processor resources.)

Without a Concurrency Runtime Threads will compete for system resources and the program will run slower instead of faster!!!! OUCH!

With a Concurrency Runtime Threads will cooperate to make maximum use of system resources and the program will run faster!!!! Success!

What does ConcRT Provide? • Improved use of processing resources • Cooperative Task Scheduling • Cooperative Blocking • Work Stealing Task Queues • Low Level Building Blocks • Synchronization Primitives • Task Schedulers • Resource Managers • 2 High Level Libraries • PPL – Parallel Patterns Library • Agents – Asynchronous Agents Library • Concurrent Container and Message Passing Libraries

ConcRT Architecture Diagram (Diagram taken from MSDN http://msdn.microsoft.com/en-us/library/ee207192.aspx)

ConcRT Tasks MSDN - http://msdn.microsoft.com/en-us/library/dd492427.aspx • Basic building block for concurrency under ConcRT • A Task is a unit of work that performs a specific job • Tasks can be further broken down into more fine-grained tasks (fork and join on “child” tasks) • Tasks are kind of like very lightweight Threads • Threads normally reserve 1MB of memory for their stacks. • Thread context switches eat processing time, reducing throughput
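The fork-and-join idea on child tasks can be sketched portably with `std::async`. This is not the ConcRT task API: `std::async` with `std::launch::async` starts a full thread per fork, which is heavier than the lightweight tasks the slide describes. The function name `fork_join_sum` and the cutoff value are assumptions made for illustration.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sum data[first, last) by forking a child task for the left half
// and joining on it once the parent has finished the right half.
inline long long fork_join_sum(const std::vector<int>& data,
                               std::size_t first, std::size_t last) {
    const std::size_t kCutoff = 1024;   // assumed threshold: below this, just sum serially
    if (last - first <= kCutoff) {
        return std::accumulate(data.begin() + first, data.begin() + last, 0LL);
    }
    const std::size_t mid = first + (last - first) / 2;
    // Fork: the child task sums the left half on another thread.
    std::future<long long> left = std::async(std::launch::async, [&data, first, mid] {
        return fork_join_sum(data, first, mid);
    });
    const long long right = fork_join_sum(data, mid, last);  // parent sums the right half
    return left.get() + right;                               // join
}
```

The cutoff keeps the number of forks small; with a real task scheduler the same structure can afford much finer-grained splitting.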

Work Stealing • When a running task creates additional tasks it adds them to the bottom of the queue for the current Processor. • If another Processor does not have any tasks in its queue it will steal a task from the top of another Processor’s queue (the top of the queue is the least likely to still be in the other Processor’s Cache).
(Diagram: per-processor task queues for Processor #1 and Processor #2.)
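The queue discipline described above can be sketched with a double-ended queue. This is an illustration only: the class and method names are hypothetical, tasks are represented by integer ids, and a single mutex stands in for the lock-free techniques a production scheduler would use. That the owner also takes work from the bottom is an assumption based on the usual work-stealing design; the slide only states where new tasks are added and where they are stolen from.

```cpp
#include <deque>
#include <mutex>

// Hypothetical per-processor queue; tasks are represented by integer ids.
class WorkStealingQueue {
public:
    // Owner: newly created tasks go on the bottom.
    void push_bottom(int task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(task);
    }
    // Owner: take the newest task, the one most likely still in this processor's cache.
    bool pop_bottom(int& task) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        task = tasks_.back();
        tasks_.pop_back();
        return true;
    }
    // Idle processor: steal the oldest task from the top.
    bool steal_top(int& task) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        task = tasks_.front();
        tasks_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::deque<int> tasks_;
};
```

Because the owner and the thief work at opposite ends, they only contend when the queue is nearly empty.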

Synchronization Data Structures • Concurrency::critical_section • Cooperative mutual exclusion object • (yields to other tasks instead of preempting) • Concurrency::reader_writer_lock • Only allows a single writer • Allows multiple readers if no writers • critical_section::scoped_lock, reader_writer_lock::scoped_lock and reader_writer_lock::scoped_lock_read • RAII locking for critical_section and reader_writer_lock • Concurrency::event • Allows Tasks to signal each other that an Event has occurred
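The RAII locking pattern can be shown with the standard library. Here `std::mutex` and `std::lock_guard` are swapped in for the ConcRT types: unlike the cooperative `critical_section`, `std::mutex` blocks the operating-system thread. `SharedCounter` and `count_concurrently` are hypothetical names.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared counter protected by a mutex.
class SharedCounter {
public:
    void increment() {
        std::lock_guard<std::mutex> lock(mutex_);  // lock acquired here
        ++value_;
    }                                              // lock released here, even if an exception is thrown
    int value() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }
private:
    mutable std::mutex mutex_;
    int value_ = 0;
};

// Increment one counter from several threads and return the final value.
inline int count_concurrently(int threads, int increments_per_thread) {
    SharedCounter counter;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) counter.increment();
        });
    }
    for (auto& w : workers) w.join();
    return counter.value();
}
```

The scoped lock object ties the unlock to scope exit, so no code path can forget to release it.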

Potential Concurrency • Potential Concurrency is the concurrency that your application could have if the computer could utilize it. • Tasks are lightweight so that they are “cheap” to create. This allows you to create many tasks to express the Potential Concurrency of your program. • In other words … expressing the Potential Concurrency of your application Future Proofs your application!

Parallel Patterns Library Overview • Task Parallelism • Tasks and Task Groups • Concurrency::task_group • Concurrency::structured_task_group • Parallel Algorithms • Concurrency::parallel_for • Concurrency::parallel_for_each • Concurrency::parallel_invoke • Parallel Containers and Objects • Concurrency::concurrent_vector<T> • Concurrency::concurrent_queue<T> • Concurrency::combinable<T>

PPL Task Groups • Tasks are grouped by the task group they are created within. • Tasks are cancelled as a group • This is useful for operations such as a search, where once the item searched for is found all tasks that are still searching should be cancelled. • Note that if a Task Group is cancelled while waiting on another Task Group to complete, the Task Group that is waiting will also be cancelled.
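The search scenario can be sketched with a shared cancellation flag that every worker in the group polls. This uses `std::thread` and `std::atomic` rather than `task_group`; the name `parallel_find` and the even chunking scheme are assumptions for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Search data for target using a group of worker threads.
// Returns the index of a match, or -1 if there is none.
inline long parallel_find(const std::vector<int>& data, int target, unsigned workers) {
    if (workers == 0) workers = 1;
    std::atomic<bool> cancelled(false);
    std::atomic<long> found(-1);
    std::vector<std::thread> group;
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t first = std::min(data.size(), w * chunk);
        const std::size_t last = std::min(data.size(), first + chunk);
        group.emplace_back([&, first, last] {
            for (std::size_t i = first; i < last; ++i) {
                if (cancelled.load()) return;      // another worker already found it
                if (data[i] == target) {
                    found.store(static_cast<long>(i));
                    cancelled.store(true);         // cancel the rest of the group
                    return;
                }
            }
        });
    }
    for (auto& t : group) t.join();
    return found.load();
}
```

If the target occurs more than once, whichever worker finds its copy first wins, so the returned index is only guaranteed to be some match.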

PPL Algorithms Today • Concurrency::parallel_for • Performs parallel tasks using iteration values • (much like a normal for loop) • Concurrency::parallel_for_each • Performs parallel tasks for each item in an iterator range • (much like std::for_each) • Concurrency::parallel_invoke • Executes a set of tasks in parallel • PPL algorithms do not return until all the tasks within them complete or are canceled.
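To make the shape of a parallel loop concrete, here is a portable sketch of the idea behind `parallel_for`: split the index range into chunks, run each chunk on its own thread, and return only when all chunks have finished. It is not the PPL implementation; `simple_parallel_for`, `squares`, and the fixed worker count are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Call body(i) for every i in [first, last), split into contiguous chunks
// across up to `workers` threads. Returns only after every chunk has finished.
template <typename Body>
void simple_parallel_for(std::size_t first, std::size_t last, unsigned workers, Body body) {
    if (workers == 0) workers = 1;
    const std::size_t count = last > first ? last - first : 0;
    const std::size_t chunk = (count + workers - 1) / workers;
    std::vector<std::thread> group;
    for (unsigned w = 0; w < workers && chunk > 0; ++w) {
        const std::size_t begin = std::min(last, first + w * chunk);
        const std::size_t end = std::min(last, begin + chunk);
        if (begin == end) break;                   // nothing left to hand out
        group.emplace_back([begin, end, &body] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    for (auto& t : group) t.join();
}

// Example use: each iteration writes only its own element, so no locking is needed.
inline std::vector<int> squares(std::size_t n) {
    std::vector<int> out(n);
    simple_parallel_for(0, n, 4, [&out](std::size_t i) {
        out[i] = static_cast<int>(i * i);
    });
    return out;
}
```

A real runtime would balance load dynamically instead of using fixed chunks, but the contract is the same: the call blocks until all iterations are done.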

ConcRT Extras and Sample Pack • Microsoft has released the ConcRT Extras and Sample Pack to give early access to new enhancements to ConcRT before the next version of VC++. • The ConcRT Extras and Sample Pack can be downloaded at: http://archive.msdn.microsoft.com/concrtextras • These are Template Libraries, so you only need to include the header files. • Microsoft has stated that they encourage users not only to use the Libraries, but also to modify them to learn more.

Upcoming PPL Algorithms • Currently Available as part of the ConcRT Sample Pack • Concurrency::parallel_transform • Concurrency::parallel_reduce • Concurrency::parallel_sort • Concurrency::parallel_buffered_sort • Concurrency::parallel_radixsort • Parallel Partitioners • These have been announced to be part of vNext http://blogs.msdn.com/b/nativeconcurrency/archive/2011/06/16/announcing-the-ppl-agents-and-concrt-efforts-for-v-next.aspx

PPL Containers and Objects • Concurrency::concurrent_vector<T> • Provides Concurrent Safe • Random Access, Element Access, Iterator Access/Traversal • Append • Does Not Provide Deletion Of Elements • Concurrency::concurrent_queue<T> • Provides Concurrent Safe • Enqueue and Dequeue operations • Concurrency::combinable<T> • Reusable Thread Local Storage • Allows Associative Operations to be combined at the end of a parallel_for, parallel_for_each, etc.
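The pattern behind `combinable<T>` can be sketched by hand: every worker accumulates into a private slot, and the slots are merged once at the end with an associative operation. The code below uses plain threads and a vector of partial results instead of `combinable<T>`; `sum_of_squares` is a hypothetical example function.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum of squares: each worker accumulates privately, and the per-worker
// results are combined once at the end with an associative operation (+).
inline long long sum_of_squares(const std::vector<int>& data, unsigned workers) {
    if (workers == 0) workers = 1;
    std::vector<long long> partial(workers, 0);    // one private result slot per worker
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    std::vector<std::thread> group;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t first = std::min(data.size(), w * chunk);
        const std::size_t last = std::min(data.size(), first + chunk);
        group.emplace_back([&data, &partial, w, first, last] {
            long long local = 0;                   // nothing shared inside the loop
            for (std::size_t i = first; i < last; ++i) {
                local += static_cast<long long>(data[i]) * data[i];
            }
            partial[w] = local;                    // single write to this worker's slot
        });
    }
    for (auto& t : group) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);  // the combine step
}
```

Because addition is associative, the result does not depend on how the input was split among workers.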

Upcoming PPL Containers • Currently Available as part of the ConcRT Sample Pack • concurrent_unordered_map • concurrent_unordered_multimap • concurrent_unordered_set • concurrent_unordered_multiset • Like the new algorithms these new containers have been announced to be part of vNext http://blogs.msdn.com/b/nativeconcurrency/archive/2011/06/16/announcing-the-ppl-agents-and-concrt-efforts-for-v-next.aspx

When To Use PPL • When you have reasonably large tasks that can be processed in parallel • This often requires that you change your algorithm to be parallelizable (for example using combinable<T>) • It is easy to change your existing code to use PPL to accomplish: • Parallel Sorts • Parallel Sums/Counts/Averages (use combinable<T>) • Parallel Map/Reduce

PPL Best Practices From MSDN - http://msdn.microsoft.com/en-us/library/ff601930.aspx • Do Not Parallelize Small Loop Bodies • Express Parallelism at the Highest Possible Level • Use parallel_invoke to Solve Divide-and-Conquer Problems • Use Cancellation or Exception Handling to Break from a Parallel Loop • Understand how Cancellation and Exception Handling Affect Object Destruction • Do Not Block Repeatedly in a Parallel Loop • Do Not Perform Blocking Operations When You Cancel Parallel Work • Do Not Write to Shared Data in a Parallel Loop • When Possible, Avoid False Sharing • Make Sure That Variables Are Valid Throughout the Lifetime of a Task

DEMO Using the PPL to parallelize loops

Asynchronous Agents Overview • According to MSDN: An asynchronous agent (or just agent) is an application component that works asynchronously with other agents to solve larger computing tasks.
(Diagram: example agent stages) • Read File From Disk • Decrypt Input Data • Decompress Input Data • Process File Data • Transmit Output Data • Encrypt Output Data • Compress Output Data

Agent Message Passing • Programming Model • Message Passing Based “Life Cycle” Pattern • Asynchronous Message Blocks • Concurrency::unbounded_buffer<T> • Concurrency::overwrite_buffer<T> • Concurrency::single_assignment<T> • Message Passing Functions • Concurrency::send<T> • Concurrency::asend<T> • Concurrency::receive<T> • Concurrency::try_receive<T>
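The behaviour of an unbounded message buffer with blocking and non-blocking receives can be sketched with a mutex and a condition variable. This is a stand-in written for illustration, not the Agents Library: `MessageBuffer` and its member functions are hypothetical and only loosely mirror `unbounded_buffer<T>`, `send`, `receive`, and `try_receive`.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for an unbounded message buffer.
template <typename T>
class MessageBuffer {
public:
    // Enqueue a message and wake one waiting receiver.
    void send(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            messages_.push(std::move(value));
        }
        ready_.notify_one();
    }
    // Block until a message is available, then remove and return it.
    T receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !messages_.empty(); });
        T value = std::move(messages_.front());
        messages_.pop();
        return value;
    }
    // Non-blocking variant: returns false if no message is waiting.
    bool try_receive(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (messages_.empty()) return false;
        out = std::move(messages_.front());
        messages_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> messages_;
};
```

Messages come out in the order they were sent, so a producer and a consumer can run at different speeds without losing or reordering data.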

Agent Message Passing Diagram (Diagram taken from MSDN http://msdn.microsoft.com/en-us/library/ee207192.aspx)

When to use Asynchronous Agents • When you have multiple processing steps that can work in parallel to process data as a pipeline • (i.e. when you can arrange your code to work as an assembly line such that you can achieve parallelism) • Examples: • Image Processing • Large Calculations That Build Upon Previous Calculations
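The assembly-line arrangement can be sketched as two stages running on their own threads, connected by blocking channels. This is a portable illustration rather than Agents Library code: `IntChannel`, `run_pipeline`, the sentinel `kDone`, and the two arithmetic stages are all hypothetical, and the sentinel assumes the inputs are non-negative.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal blocking channel between two pipeline stages (hypothetical helper).
class IntChannel {
public:
    void send(int v) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(v); }
        cv_.notify_one();
    }
    int receive() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        int v = q_.front();
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
};

const int kDone = -1;  // end-of-stream marker (assumes inputs are non-negative)

// Stage 1 doubles each value and stage 2 adds one; both stages run concurrently.
inline std::vector<int> run_pipeline(const std::vector<int>& input) {
    IntChannel a, b, c;
    std::thread stage1([&] {
        for (int v = a.receive(); v != kDone; v = a.receive()) b.send(v * 2);
        b.send(kDone);                     // pass the end marker downstream
    });
    std::thread stage2([&] {
        for (int v = b.receive(); v != kDone; v = b.receive()) c.send(v + 1);
        c.send(kDone);
    });
    for (int v : input) a.send(v);         // feed the first stage
    a.send(kDone);
    std::vector<int> output;
    for (int v = c.receive(); v != kDone; v = c.receive()) output.push_back(v);
    stage1.join();
    stage2.join();
    return output;
}
```

While stage 2 is working on one item, stage 1 can already be processing the next, which is where the pipeline gets its parallelism.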

Heterogeneous Computing Programming the GPU using AMP

The Power of Heterogeneous Computing
(Figure: GPU speedups for ten applications.)
Speedups shown: 146X, 36X, 100X, 19X, 17X, 149X, 47X, 20X, 24X, 30X
Applications shown: • Interactive visualization of volumetric white matter connectivity • Ionic placement for molecular dynamics simulation on GPU • Astrophysics N-body simulation • Simulation in Matlab using .mex file CUDA function • Transcoding HD video stream to H.264 • Financial simulation of LIBOR model with swaptions • Ultrasound medical imaging for cancer diagnostics • Highly optimized object oriented molecular dynamics • GLAME@lab: An M-script API for linear Algebra operations on GPU • Cmatch exact string matching to find similar proteins and gene sequences
*Used with permission from Daniel Moth’s “Taming GPU compute with C++ Accelerated Massive Parallelism” http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-802T

CPUs vs GPUs today
CPU • Low memory bandwidth • Higher power consumption • Medium level of parallelism • Deep execution pipelines • Random accesses • Supports general code • Mainstream programming
GPU • High memory bandwidth • Lower power consumption • High level of parallelism • Shallow execution pipelines • Sequential accesses • Supports data-parallel code • Niche programming
images source: AMD
*Used with permission from Daniel Moth’s “Taming GPU compute with C++ Accelerated Massive Parallelism” http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-802T

C++ AMP • Accelerated Massive Parallelism • Best for Data Parallelism • Bring GPGPU to the Masses • Write C++ Code that runs on the GPU • Available as part of the Visual Studio 11 Developer Preview • http://msdn.microsoft.com/en-US/vstudio/hh127353 • When running VS11 on Win8 there is even GPGPU debugging! • Microsoft is submitting it as an Open Specification • Several other compiler vendors have committed to implementing AMP.

Hello World: Array Addition

Serial CPU version:

    void AddArrays(int n, int * pA, int * pB, int * pC)
    {
        for (int i=0; i<n; i++)
        {
            pC[i] = pA[i] + pB[i];
        }
    }

C++ AMP version:

    #include <amp.h>
    using namespace concurrency;

    void AddArrays(int n, int * pA, int * pB, int * pC)
    {
        // Wrap the raw arrays so the runtime can copy them to and from the accelerator.
        array_view<int,1> a(n, pA);
        array_view<int,1> b(n, pB);
        array_view<int,1> sum(n, pC);

        // Developer Preview syntax; the released C++ AMP spells these sum.extent and restrict(amp).
        parallel_for_each(
            sum.grid,
            [=](index<1> i) restrict(direct3d)
            {
                sum[i] = a[i] + b[i];
            }
        );
    }

*Used with permission from Daniel Moth’s “Taming GPU compute with C++ Accelerated Massive Parallelism” http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-802T

Links For your reference

General C++ Links Microsoft’s MSDN C++ Developer Center http://msdn.microsoft.com/en-us/visualc/default.aspx CPlusPlus.com (Great Site for quick reference to C++ and STL) http://www.cplusplus.com/reference/stl/ Visual Studio Team Blog http://blogs.msdn.com/b/visualstudio/ Herb Sutter’s Blog (ISO C++ Chairman and Microsoft Software Architect) http://herbsutter.com/

Parallel Programming in Native Code Blog Best Way To Stay Up To Date • Parallel Programming in Native Code Blog http://blogs.msdn.com/b/nativeconcurrency/ Great tutorials and more • How to pick your parallel sort? http://blogs.msdn.com/b/nativeconcurrency/archive/2011/01/26/how-to-pick-your-parallel-sort.aspx • concurrent_vector and concurrent_queue explained http://blogs.msdn.com/b/nativeconcurrency/archive/2010/01/16/concurrent-vector-and-concurrent-queue-explained.aspx • Synchronization with the Concurrency Runtime (2 parts) http://blogs.msdn.com/b/nativeconcurrency/archive/2009/04/22/synchronization-with-the-concurrency-runtime.aspx • Resource Management in the Concurrency Runtime (3 parts) http://blogs.msdn.com/b/nativeconcurrency/archive/2009/03/10/resource-management-in-the-concurrency-runtime-part-1.aspx

ConcRT Written Resources MSDN - Concurrency Runtime http://msdn.microsoft.com/en-us/library/dd504870.aspx ConcRTExtras http://archive.msdn.microsoft.com/concrtextras Parallel Programming with Microsoft Visual C++ (Free Book Online, PBook and EBook not free) http://msdn.microsoft.com/en-us/library/gg675934.aspx Introducing the Visual C++ Concurrency Runtime (59 page hands on lab) http://archive.msdn.microsoft.com/cppconcrt Parallel Programming in Native Code Blog http://blogs.msdn.com/b/nativeconcurrency/

ConcRT Video Resources Don McCrady - Parallelism in C++ Using the Concurrency Runtime https://channel9.msdn.com/posts/Don-McCrady-Parallelism-in-C-Using-the-Concurrency-Runtime The Concurrency Runtime: Fine Grained Parallelism for C++ http://channel9.msdn.com/Blogs/Charles/The-Concurrency-Runtime-Fine-Grained-Parallelism-for-C Parallel Programming for C++ Developers: Tasks and Continuations (2 Parts) http://channel9.msdn.com/Shows/Going+Deep/Parallel-Programming-in-Native-Code-Tasks-and-Continuations-Part-1-of-2 Native Parallelism with the Parallel Patterns Library http://channel9.msdn.com/blogs/visualstudio/native-parallelism-with-the-parallel-patterns-library

AMP Resources Herb Sutter: Heterogeneous Computing and C++ AMP (Learn about the future of computing) http://channel9.msdn.com/posts/AFDS-Keynote-Herb-Sutter-Heterogeneous-Computing-and-C-AMP Taming GPU compute with C++ AMP http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-802T Walkthrough: Debugging an AMP Application http://msdn.microsoft.com/en-us/library/hh368280(VS.110).aspx Daniel Moth’s Blog (AMP Project Manager) http://www.danielmoth.com/Blog/

Conclusions • C++ is a Modern Language • C++ is the language of choice to: • Maximize Speed • Minimize Power Consumption • Target the latest hardware • Have full control of your application • Native Concurrency using C++ PPL, Agents, and AMP provide a powerful set of tools to enable you to unlock your potential concurrency!!! • C++ is AMPed!!!

Thank you for coming! Please fill out an evaluation form before you leave! If you would like a copy of this slide deck please email me at dcravey@gmail.com If you would like more information please contact me or, better yet, come to either of the local C++ User Groups: Houston C++ User Group (1st Thursday each month) University of Houston C++ User Group (Wednesday before 1st Thursday each month)
