virtual techdays

INDIA │ 18-20 August 2010 │ virtual techdays
Parallelize Applications Using Intel® Threading Building Blocks
Om Sachan │ SSG, Intel Corporation

SESSION AGENDA
• Intel® Threading Building Blocks overview
• Generic Parallel Algorithms
• Lab: Parallelize a serial application
• Generic Concurrent Containers
• Synchronization Primitives
• Advanced Features Overview
• Summary

Intel® Threading Building Blocks Overview
• Enables you to specify tasks instead of threads
  • Automatically maps tasks onto physical threads in a way that makes efficient use of processor resources
• Targets threading for performance
  • A solution for parallelizing computationally intensive work units while preserving good scalability across various hardware
• Compatible with other threading packages
  • Works well for CPU-bound tasks, not I/O-bound ones; coexists with other threading packages
• Emphasizes scalable, data-parallel programming
  • Scales well as the number of processors grows
• Relies on generic programming
  • The set of templates implemented in Intel® TBB lets you write flexible algorithms

Intel® Threading Building Blocks Overview
• Product package includes:
  • Dynamic libraries (debug and release)
  • Header files
  • Sample code
  • Documentation: tutorial, getting started guide, reference
• Supported platforms:
  • IA-32, Intel 64
  • Parallel Studio
• Intel® TBB is a set of generic algorithms and data structures (C++ templates)
• Trivial Intel® TBB program:

    #include "tbb/task_scheduler_init.h"
    using namespace tbb;

    int main () {
        task_scheduler_init TBB_Init;  // library stays initialized while this object is alive
        return 0;
    }

• All public classes and functions are in the tbb namespace
• The library requires explicit initialization: at least one task_scheduler_init object must be active

Intel® Threading Building Blocks Usage Model
• Algorithms and data structures are defined in terms of concepts
  • A concept is a set of requirements on a type
  • A type models a concept when it meets those requirements
  • The program defines the types required by Intel® TBB constructs
• Parallel Generic Algorithms and Concurrent Containers
  • C++ programming experience, basic STL knowledge, and basic threading knowledge are required to get started. You do not need to be a threading expert.
• Task Scheduler
  • The engine that powers the Parallel Generic Algorithms, which hide the complexity of task management. The Task Scheduler may be used directly for advanced programming when your algorithm doesn't map naturally onto one of the pre-packaged parallel algorithms. Threading programming and tuning experience is required.
• Synchronization Primitives
  • Use these objects carefully: inappropriate use of synchronization may lead to performance and correctness issues. Solid threading programming and tuning experience is required.

Intel® Threading Building Blocks Generic Parallel Algorithms

Intel® Threading Building Blocks Generic Parallel Algorithms: Basic Concepts
• Splittable Concept
  • A type X is splittable if it has a constructor that allows an instance to be split into two pieces
  • X::X (X& x, split): splitting constructor. Splits x into x and y, where y is the newly constructed object
• Range Concept
  • A type R represents a recursively divisible set of values; it must model the Splittable Concept
  • R::R (const R&): copy constructor
  • R::~R (): destructor
  • bool R::is_empty() const: returns true if the range is empty
  • bool R::is_divisible() const: returns true if the range can be partitioned into two sub-ranges
  • R::R (R&, split): splitting constructor
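The Range requirements listed above can be sketched with a small hand-written class. This is an illustration in plain C++, not the library's own range type: `IndexRange`, `SplitTag`, and the grain parameter are hypothetical names, and the split-in-the-middle policy is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tag type standing in for the library's split tag.
struct SplitTag {};

// Hypothetical half-open index range [begin, end) that models the Range concept.
class IndexRange {
public:
    IndexRange(int begin, int end, std::size_t grain)
        : begin_(begin), end_(end), grain_(grain) {
        assert(begin <= end && grain > 0);
    }

    // Copy constructor and destructor: the compiler-generated ones suffice.

    bool is_empty() const { return begin_ >= end_; }

    // Divisible while the range holds more than `grain` values.
    bool is_divisible() const { return size() > grain_; }

    // Splitting constructor: `other` keeps the lower half,
    // the newly constructed range takes the upper half.
    IndexRange(IndexRange& other, SplitTag)
        : begin_(other.begin_ + (other.end_ - other.begin_) / 2),
          end_(other.end_),
          grain_(other.grain_) {
        other.end_ = begin_;
    }

    int begin() const { return begin_; }
    int end() const { return end_; }
    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }

private:
    int begin_;
    int end_;
    std::size_t grain_;
};
```

Constructing `IndexRange upper(r, SplitTag())` from a range `r` over [0, 10) leaves `r` covering [0, 5) and `upper` covering [5, 10).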

Intel® Threading Building Blocks Generic Parallel Algorithms: parallel_for Template Function
• #include "tbb/parallel_for.h"
• template <typename Range, typename Body> void parallel_for (const Range& range, const Body& body);
  • Represents parallel execution of Body over each value in the Range
  • Range type must model the Intel® Threading Building Blocks Range Concept described on the previous slide
• parallel_for Body Concept Requirements
  • Body::Body (const Body&): copy constructor
  • Body::~Body (): destructor
  • void Body::operator() (Range&) const: apply Body to Range

Intel® Threading Building Blocks Example: Parallelizing Simple Loops
• Task: loop over a fixed-size array of elements and apply a function to each of them (iterations are independent)
• Serial version of the solution:

    const int N = 20000000;

    void ChangeArraySerial (int* array, int M) {
        for (int i = 0; i < M; i++) {
            array[i] *= 2;
        }
    }

    int main () {
        static int A[N];  // static storage: an array this large (~80 MB) would overflow a typical stack
        for (int i = 0; i < N; i++) { A[i] = i; }
        ChangeArraySerial (A, N);
        return 0;
    }

Intel® Threading Building Blocks
• Parallel solution with Intel® TBB: using parallel_for

    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    using namespace tbb;

    const int IdealGrainSize = <some number>;  // experiment with the grain size

    // ChangeArray models the parallel_for Body concept
    class ChangeArray {
        int* array;
    public:
        ChangeArray (int* a): array(a) {}
        // Apply the change to each array element in the body of operator()
        void operator()( const blocked_range<int>& r ) const {
            for (int i = r.begin(); i != r.end(); i++) {
                array[i] *= 2;
            }
        }
    };

    void ChangeArrayParallel (int* a, int n) {
        // Call the generic function parallel_for<Range, Body>:
        // Range is blocked_range<int>, Body is ChangeArray
        parallel_for (blocked_range<int>(0, n, IdealGrainSize), ChangeArray(a));
    }

    int main () {
        static int A[N];  // N as in the serial version; static storage avoids overflowing the stack
        // initialize tbb, array here…
        ChangeArrayParallel (A, N);
        return 0;
    }

• blocked_range is a pre-packaged 1D iteration space that models the Range Concept
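To get a feel for what the grain size controls, the recursive splitting can be walked through serially. The sketch below is plain C++ and creates no threads; `for_each_chunk`, `DoubleElements`, and `double_all` are hypothetical helpers, and halving the range until a piece is no larger than the grain is an assumption made for the illustration, not a statement about the scheduler's actual policy.

```cpp
#include <cassert>
#include <vector>

// Applies body(begin, end) to each leaf chunk of [begin, end) and returns
// the number of leaf chunks. Serial stand-in for recursive range splitting.
template <typename Body>
int for_each_chunk(int begin, int end, int grain, const Body& body) {
    assert(grain > 0);
    if (begin >= end) return 0;
    if (end - begin <= grain) {          // not divisible any further: run the body
        body(begin, end);
        return 1;
    }
    int middle = begin + (end - begin) / 2;
    return for_each_chunk(begin, middle, grain, body) +
           for_each_chunk(middle, end, grain, body);
}

// Hypothetical body, analogous to ChangeArray: doubles each element.
struct DoubleElements {
    int* array;
    explicit DoubleElements(int* a) : array(a) {}
    void operator()(int begin, int end) const {
        for (int i = begin; i != end; ++i) array[i] *= 2;
    }
};

// Doubles every element and reports how many chunks the grain size produced.
inline int double_all(std::vector<int>& v, int grain) {
    if (v.empty()) return 0;
    return for_each_chunk(0, static_cast<int>(v.size()), grain,
                          DoubleElements(&v[0]));
}
```

With 16 elements, a grain of 4 yields 4 chunks and a grain of 1 yields 16. A smaller grain exposes more potential parallelism but pays more per-chunk overhead, which is why the value is worth experimenting with.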

Intel® Threading Building Blocks Lab 1
• Convert the serial matrix multiplication application into a parallel application using parallel_for.

Intel® Threading Building Blocks Generic Parallel Algorithms: parallel_reduce Template Function
• #include "tbb/parallel_reduce.h"
• template <typename Range, typename Body> void parallel_reduce (const Range& range, Body& body);
  • Represents parallel reduction of Body over each value in the Range
  • Range type must model the Intel® Threading Building Blocks Range Concept
• parallel_reduce Body Concept Requirements
  • Body::Body (const Body&): copy constructor
  • Body::~Body (): destructor
  • void Body::operator() (Range&): apply Body to Range
  • Body::Body (Body&, split): splitting constructor; must be able to run concurrently with join and operator()
  • void Body::join (const Body& rhs): the result of rhs must be merged into the result of this

Intel® Threading Building Blocks
• Parallel solution with Intel® TBB: using parallel_reduce

    #include "tbb/blocked_range.h"
    #include "tbb/parallel_reduce.h"
    using namespace tbb;

    const int IdealGrainSize = <some number>;

    // SumArray models the parallel_reduce Body concept
    class SumArray {
        int* array;
    public:
        int sum;
        SumArray (int* a): array(a), sum(0) {}
        // Calculate the partial sum of array elements in the body of operator()
        void operator()( const blocked_range<int>& r ) {
            for (int i = r.begin(); i != r.end(); i++) {
                sum += array[i];
            }
        }
        // Splitting constructor: the new body starts with an empty partial sum
        SumArray (SumArray& partial_sum, split): array(partial_sum.array), sum(0) {}
        // Perform the reduction in the body of join
        void join (const SumArray& partial_sum) { sum += partial_sum.sum; }
    };

    int SumArrayParallel (int* a, int n) {
        SumArray sum_array (a);
        // Call the generic function parallel_reduce<Range, Body>
        parallel_reduce (blocked_range<int>(0, n, IdealGrainSize), sum_array);
        return sum_array.sum;
    }
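The interplay of the splitting constructor and join is easier to follow when traced serially. The following sketch uses only standard C++; `SumBody`, `SplitTag`, `reduce_range`, and `sum_array` are hypothetical names, and the strictly recursive split order is an assumption for illustration. A real scheduler decides when to split and may run the two halves on different threads.

```cpp
#include <cassert>
#include <vector>

struct SplitTag {};  // hypothetical stand-in for the library's split tag

// Hypothetical body that follows the parallel_reduce Body requirements.
class SumBody {
public:
    explicit SumBody(const int* a) : array_(a), sum_(0) {}
    // Splitting constructor: shares the input, starts a fresh partial sum.
    SumBody(SumBody& other, SplitTag) : array_(other.array_), sum_(0) {}
    // Accumulates the partial sum over [begin, end).
    void operator()(int begin, int end) {
        for (int i = begin; i != end; ++i) sum_ += array_[i];
    }
    // Merges the partial result of rhs into this body.
    void join(const SumBody& rhs) { sum_ += rhs.sum_; }
    long sum() const { return sum_; }
private:
    const int* array_;
    long sum_;
};

// Serial walk through the split/join protocol over [begin, end).
inline void reduce_range(int begin, int end, int grain, SumBody& body) {
    assert(grain > 0);
    if (end - begin <= grain) {
        if (begin < end) body(begin, end);
        return;
    }
    int middle = begin + (end - begin) / 2;
    SumBody right(body, SplitTag());           // split off a body for the upper half
    reduce_range(begin, middle, grain, body);  // lower half accumulates into body
    reduce_range(middle, end, grain, right);   // upper half accumulates into right
    body.join(right);                          // merge the partial result
}

inline long sum_array(const std::vector<int>& v, int grain) {
    if (v.empty()) return 0;
    SumBody body(&v[0]);
    reduce_range(0, static_cast<int>(v.size()), grain, body);
    return body.sum();
}
```

The result does not depend on the grain size because addition is associative; a reduction whose join is not associative would not be safe to split this way.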

Intel® Threading Building Blocks Generic Concurrent Containers

Intel® Threading Building Blocks Concurrent Containers
• Provides concurrent containers
  • STL containers are not thread-safe: an attempt to modify them concurrently can corrupt the container
  • Standard practice is to wrap a lock around an STL container
  • This turns the container into a serial bottleneck
• Interfaces are similar to the STL but don't match 100%
  • Some STL interfaces are inherently not thread-safe
• Fine-grained locking or lockless implementations
  • Worse single-thread performance, but better scalability
• Can be used with the library, OpenMP, or native threads
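The "wrap a lock around an STL container" practice mentioned above might look like the sketch below. `LockedVector` is a hypothetical class, and it uses `std::mutex` from C++11, which is an assumption of this example rather than something the slides use. Every operation takes the same lock, so concurrent callers are serialized, which is the bottleneck the concurrent containers are meant to avoid.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical coarse-grained wrapper: one mutex guards the whole vector.
template <typename T>
class LockedVector {
public:
    void push_back(const T& value) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push_back(value);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.size();
    }

    // Returns a copy: handing out a reference would let the caller touch
    // the element after the lock has been released.
    T at(std::size_t index) const {
        std::lock_guard<std::mutex> guard(mutex_);
        assert(index < items_.size());
        return items_[index];
    }

private:
    mutable std::mutex mutex_;   // mutable so const members can lock
    std::vector<T> items_;
};
```

Returning by value from `at` also hints at why some STL interfaces cannot be made thread-safe as they stand: an accessor that returns a reference cannot keep the element protected once it returns.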

Intel® Threading Building Blocks Concurrent Containers: concurrent_hash_map
• concurrent_hash_map <Key, T, HashCompare>
• Maps a Key to an element of type T
• Hash table of std::pair <const Key, T> elements
• You implement a HashCompare class that defines two methods: hash (maps a Key to a hash code of type size_t) and the predicate equal (returns true if two Keys are equal)
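A HashCompare class for string keys could be sketched as follows. `StringHashCompare` is a hypothetical name and the multiplicative hash is just one simple choice; the only requirement it must meet is that keys for which equal returns true also produce the same hash code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical HashCompare for std::string keys.
struct StringHashCompare {
    // Maps a key to a hash code of type size_t (simple multiplicative hash).
    static std::size_t hash(const std::string& key) {
        std::size_t h = 0;
        for (std::string::size_type i = 0; i < key.size(); ++i) {
            h = h * 31 + static_cast<unsigned char>(key[i]);
        }
        return h;
    }

    // Predicate: returns true if the two keys are equal.
    static bool equal(const std::string& a, const std::string& b) {
        return a == b;
    }
};
```

Such a class would be passed as the third template argument of the hash container, for example with `std::string` as Key and `int` as T.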

Intel® Threading Building Blocks Concurrent Containers: concurrent_vector
• concurrent_vector <T>
• Dynamically growable array of T: grow_by and grow_to_at_least
• clear() is not thread-safe with respect to resizing
• concurrent_vector never moves an element until the array is cleared

Intel® Threading Building Blocks Concurrent Containers: concurrent_queue
• concurrent_queue <T>
• In a single-threaded run it provides first-in-first-out ordering
• If one thread pushes two values and another thread pops those two values, they come out in the order in which they were pushed
• size() returns a signed number: if the queue is empty and size() returns -n, then n pops are pending
• empty() returns true if size() is zero or negative

Intel® Threading Building Blocks Synchronization Primitives

Intel® Threading Building Blocks Synchronization Primitives: Mutex Concept
• Mutexes are C++ objects based on the scoped locking pattern
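The scoped locking pattern ties lock ownership to an object's lifetime: the constructor acquires the mutex and the destructor releases it, so the mutex is released on every exit path. The sketch below is a toy illustration in standard C++11; `ToySpinMutex` and `guarded_increment` are hypothetical, and this is not how any library mutex is actually implemented.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical toy spin lock with a nested scoped_lock class.
class ToySpinMutex {
public:
    ToySpinMutex() : locked_(false) {}

    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() { return !locked_.exchange(true, std::memory_order_acquire); }
    void lock() { while (!try_lock()) { /* spin in user space */ } }
    void unlock() { locked_.store(false, std::memory_order_release); }

    // Scoped lock: the constructor acquires, the destructor releases.
    class scoped_lock {
    public:
        explicit scoped_lock(ToySpinMutex& m) : mutex_(m) { mutex_.lock(); }
        ~scoped_lock() { mutex_.unlock(); }
    private:
        scoped_lock(const scoped_lock&);             // non-copyable
        scoped_lock& operator=(const scoped_lock&);  // non-assignable
        ToySpinMutex& mutex_;
    };

private:
    std::atomic<bool> locked_;
};

// Increments a counter under the lock and returns the new value.
inline int guarded_increment(ToySpinMutex& m, int& counter) {
    ToySpinMutex::scoped_lock lock(m);
    return ++counter;  // the lock is released when `lock` goes out of scope
}
```

Because the release happens in a destructor, an early return or an exception inside the protected region cannot leave the mutex held.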

Intel® Threading Building Blocks Synchronization Primitives: Mutex Flavors
• spin_mutex
  • Non-reentrant, unfair, spins in user space
  • VERY FAST in lightly contended situations; use it if you need to protect very few instructions
• queuing_mutex
  • Non-reentrant, fair, spins in user space
  • Use queuing_mutex when scalability and fairness are important
• queuing_rw_mutex
  • Non-reentrant, fair, spins in user space
• spin_rw_mutex
  • Non-reentrant, unfair, spins in user space
  • Use a reader-writer mutex to allow non-blocking reads by multiple threads
• mutex
  • Wrapper for OS synchronization: CRITICAL_SECTION on Windows*, pthread_mutex on Linux*

Intel® Threading Building Blocks Synchronization Primitives: Example of spin_rw_mutex
• Allows multiple threads to read the protected data, but only one thread (the writer) can change the data exclusively
• Upgrade/downgrade operations
  • upgrade_to_writer: returns true if it successfully upgraded the lock without temporarily releasing the mutex
  • downgrade_to_reader

    #include "tbb/spin_rw_mutex.h"
    using namespace tbb;

    spin_rw_mutex MyMutex;

    int foo () {
        /* Construction of 'lock' acquires 'MyMutex' */
        spin_rw_mutex::scoped_lock lock (MyMutex, /*is_writer*/ false);
        …
        if (!lock.upgrade_to_writer ()) { … }  /* mutex was temporarily released during the upgrade */
        else { … }                             /* upgraded without releasing the mutex */
        return 0;
        /* Destructor of 'lock' releases 'MyMutex' */
    }

Intel® Threading Building Blocks Advanced Features Overview

Intel® Threading Building Blocks
• Generic Parallel Algorithms
  • parallel_for
  • parallel_while
  • parallel_reduce
  • pipeline
  • parallel_sort
  • parallel_scan
• Concurrent Containers
  • concurrent_hash_map
  • concurrent_queue
  • concurrent_vector
• task_scheduler
• Low-Level Synchronization Primitives
  • spin_mutex
  • queuing_rw_mutex
  • spin_rw_mutex
  • mutex

Intel® Threading Building Blocks: Summary
• Scalable data-parallel decomposition, providing patterns for parallel algorithms and concurrent data structures
• A paradigm of logical tasks that are efficiently and automatically mapped onto physical threads by the task scheduler
• Works well for computationally intensive tasks, because the task scheduler is cache-aware and efficiently load-balances tasks across the physical threads

RESOURCES
• http://www.threadingbuildingblocks.org/
• You may participate in our community support web site.
  • Tools Knowledge Base: http://software.intel.com/en-us/articles/tools
  • User forums: http://software.intel.com/en-us/forums/
  • Intel® Software Product support info: http://www.intel.com/software/support


THANKS
Email: om.p.sachan@intel.com
