920 likes | 1.31k Views
Intel Threading Building Blocks TBB. http://www.threadingbuildingblocks.org. Overview. TBB enables you to specify tasks instead of threads TBB targets threading for performance TBB is compatible with other threading packages TBB emphasize scalable, data parallel programming
E N D
Intel Threading Building Blocks TBB http://www.threadingbuildingblocks.org
Overview • TBB enables you to specify tasks instead of threads • TBB targets threading for performance • TBB is compatible with other threading packages • TBB emphasize scalable, data parallel programming • TBB relies on generic programming
Example 1: Compute Average of 3 numbers void SerialAverage(float *output,float *input, size_t) { for(size_t i=0 ; I < n ; i++) { output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); } }
#include "tbb/parallel_for.h" #include "tbb/blocked_range.h" #include "tbb/task_scheduler_init.h" using namespace tbb; class Average { public: float* input; float* output; void operator()( const blocked_range<int>& range ) const { for( int i=range.begin(); i!=range.end(); ++i ) output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); } }; // Note: The input must be padded such that input[-1] and input[n] // can be used to calculate the first and last output values. void ParallelAverage( float* output, float* input, size_t n ) { Average avg; avg.input = input; avg.output = output; parallel_for( blocked_range<int>( 0, n, 1000 ), avg ); }
//! Problem size const int N = 100000; int main( int argc, char* argv[] ) { float output[N]; float raw_input[N+2]; raw_input[0] = 0; raw_input[N+1] = 0; float* padded_input = raw_input+1; task_scheduler_init ; ............ ............ ParallelAverage(output, padded_input, N); }
Serial Parallel
Effect of Grain Size on A[i]=B[i]*c Computation (one million indices)
Reduction Serial Parallel
Parallel Scan with Partitioner • Parallel_scan, breaks a range into subranges and computes a • partial result in each subrange in parallel. • Then, the partial result for subrange k is used to update the • information in subrange k+1, starting from k=0 and • proceeding sequentially up to the last subrange. • Finally, each subrange uses its updated information to compute its • final result in parallel with all the other subranges.
Linked List Example • Assume Foo takes at least a few thousand instructions to run, • then it is possible to get speedup by parallelizing
Parallel_while • Requires two user-defined objects • Object that defines the stream of objects • - must have pop_if_present • - pop_if_present need not be thread safe • - nonscalable, since fetching is serialized • - possible to get useful speedup • 2. Object that defines the loop body i.e. the operator()
Parallelized Foo Acting on Linked List • Note: the body of parallel_while can add more work by calling • w.add(item)
Pipelining • Single pipeline
Pipelining • Parallel pipeline
Trick soln: Use Topologically Sorted Pipeline • Note that the latency is increased