1 / 86

Intel Threading Building Blocks TBB

Intel Threading Building Blocks TBB. http://www.threadingbuildingblocks.org. Overview. TBB enables you to specify tasks instead of threads TBB targets threading for performance TBB is compatible with other threading packages TBB emphasize scalable, data parallel programming

Download Presentation

Intel Threading Building Blocks TBB

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Intel Threading Building Blocks TBB http://www.threadingbuildingblocks.org

  2. Overview • TBB enables you to specify tasks instead of threads • TBB targets threading for performance • TBB is compatible with other threading packages • TBB emphasize scalable, data parallel programming • TBB relies on generic programming

  3. Example 1: Compute Average of 3 numbers void SerialAverage(float *output,float *input, size_t) { for(size_t i=0 ; I < n ; i++) { output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); } }

  4. #include "tbb/parallel_for.h" #include "tbb/blocked_range.h" #include "tbb/task_scheduler_init.h" using namespace tbb; class Average { public: float* input; float* output; void operator()( const blocked_range<int>& range ) const { for( int i=range.begin(); i!=range.end(); ++i ) output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); } }; // Note: The input must be padded such that input[-1] and input[n] // can be used to calculate the first and last output values. void ParallelAverage( float* output, float* input, size_t n ) { Average avg; avg.input = input; avg.output = output; parallel_for( blocked_range<int>( 0, n, 1000 ), avg ); }

  5. //! Problem size const int N = 100000; int main( int argc, char* argv[] ) { float output[N]; float raw_input[N+2]; raw_input[0] = 0; raw_input[N+1] = 0; float* padded_input = raw_input+1; task_scheduler_init ; ............ ............ ParallelAverage(output, padded_input, N); }

  6. Serial Parallel

  7. Notes on Grain Size

  8. Effect of Grain Size on A[i]=B[i]*c Computation (one million indices)

  9. Auto Partitioner

  10. Reduction Serial Parallel

  11. Class for use by Parallel Reduce

  12. Split-Join Sequence

  13. Parallel_scan (parallel prefix)

  14. Parallel Scan

  15. Parallel Scan with Partitioner • Parallel_scan, breaks a range into subranges and computes a • partial result in each subrange in parallel. • Then, the partial result for subrange k is used to update the • information in subrange k+1, starting from k=0 and • proceeding sequentially up to the last subrange. • Finally, each subrange uses its updated information to compute its • final result in parallel with all the other subranges.

  16. Parallel Scan Requirements

  17. Parallel Scan

  18. Parallel While and Pipeline

  19. Linked List Example • Assume Foo takes at least a few thousand instructions to run, • then it is possible to get speedup by parallelizing

  20. Parallel_while • Requires two user-defined objects • Object that defines the stream of objects • - must have pop_if_present • - pop_if_present need not be thread safe • - nonscalable, since fetching is serialized • - possible to get useful speedup • 2. Object that defines the loop body i.e. the operator()

  21. Stream of Objects

  22. Loop Body (Operator)

  23. Parallelized Foo Acting on Linked List • Note: the body of parallel_while can add more work by calling • w.add(item)

  24. Notes on parallel_while scaling

  25. Pipelining • Single pipeline

  26. Pipelining • Parallel pipeline

  27. Pipelining Example

  28. Character Buffer

  29. Top Level Code for for Building and Running the Pipeline

  30. Non-Linear Pipes

  31. Trick soln: Use Topologically Sorted Pipeline • Note that the latency is increased

  32. parallel_sort

More Related