Intel Threading Building Blocks TBB

Intel Threading Building Blocks TBB http://www.threadingbuildingblocks.org

Overview • TBB enables you to specify tasks instead of threads • TBB targets threading for performance • TBB is compatible with other threading packages • TBB emphasize scalable, data parallel programming • TBB relies on generic programming

Example 1: Compute Average of 3 numbers void SerialAverage(float *output,float *input, size_t) { for(size_t i=0 ; I < n ; i++) { output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); } }

#include "tbb/parallel_for.h" #include "tbb/blocked_range.h" #include "tbb/task_scheduler_init.h" using namespace tbb; class Average { public: float* input; float* output; void operator()( const blocked_range<int>& range ) const { for( int i=range.begin(); i!=range.end(); ++i ) output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); } }; // Note: The input must be padded such that input[-1] and input[n] // can be used to calculate the first and last output values. void ParallelAverage( float* output, float* input, size_t n ) { Average avg; avg.input = input; avg.output = output; parallel_for( blocked_range<int>( 0, n, 1000 ), avg ); }

//! Problem size const int N = 100000; int main( int argc, char* argv[] ) { float output[N]; float raw_input[N+2]; raw_input[0] = 0; raw_input[N+1] = 0; float* padded_input = raw_input+1; task_scheduler_init ; ............ ............ ParallelAverage(output, padded_input, N); }

Serial Parallel

Notes on Grain Size

Effect of Grain Size on A[i]=B[i]*c Computation (one million indices)

Auto Partitioner

Reduction Serial Parallel

Class for use by Parallel Reduce

Split-Join Sequence

Parallel_scan (parallel prefix)

Parallel Scan

Parallel Scan with Partitioner • Parallel_scan, breaks a range into subranges and computes a • partial result in each subrange in parallel. • Then, the partial result for subrange k is used to update the • information in subrange k+1, starting from k=0 and • proceeding sequentially up to the last subrange. • Finally, each subrange uses its updated information to compute its • final result in parallel with all the other subranges.

Parallel Scan Requirements

Parallel Scan

Parallel While and Pipeline

Linked List Example • Assume Foo takes at least a few thousand instructions to run, • then it is possible to get speedup by parallelizing

Parallel_while • Requires two user-defined objects • Object that defines the stream of objects • - must have pop_if_present • - pop_if_present need not be thread safe • - nonscalable, since fetching is serialized • - possible to get useful speedup • 2. Object that defines the loop body i.e. the operator()

Stream of Objects

Loop Body (Operator)

Parallelized Foo Acting on Linked List • Note: the body of parallel_while can add more work by calling • w.add(item)

Notes on parallel_while scaling

Pipelining • Single pipeline

Pipelining • Parallel pipeline

Pipelining Example

Character Buffer

Top Level Code for for Building and Running the Pipeline

Non-Linear Pipes

Trick soln: Use Topologically Sorted Pipeline • Note that the latency is increased

parallel_sort

Intel Threading Building Blocks TBB

Intel Threading Building Blocks TBB

Presentation Transcript

The Building Blocks

Intel Multimedia Extensions and Hyper-Threading

Mississippi Building Blocks

Threading Building Blocks

Studio Building Blocks

Threading Building Blocks

Intel ® Threading Building Blocks

Dynamical building blocks

Delaware Building BLOCKS

Building Blocks

Building Blocks

Chemistry Building Blocks

The Building Blocks

TBB

Organic Building Blocks

The Building Blocks

custom building blocks