
STAPL The C++ Standard Template Adaptive Parallel Library

Alin Jula, Department of Computer Science, Texas A&M. Ping An, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy M. Amato and Lawrence Rauchwerger. http://www.cs.tamu.edu/research/parasol


Presentation Transcript


  1. STAPL: The C++ Standard Template Adaptive Parallel Library. Alin Jula, Department of Computer Science, Texas A&M. Ping An, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy M. Amato and Lawrence Rauchwerger. http://www.cs.tamu.edu/research/parasol

  2. Motivation: STAPL, the C++ Standard Template Adaptive Parallel Library • Building block library • Nested parallelism • Interoperability with existing code • Superset of STL • Portability and performance • Layered architecture • Run-time adaptivity

  3. Philosophy • Interface Layer: STL compatible • Concurrency & Communication Layer: generic parallelism, synchronization • Software Implementation Layer: instantiates concurrency & communication • Machine Layer: architecture-dependent code

  4. Related Work * Parallel programming language

  5. STL Overview (diagram labels: Iterator, Algorithm, Container) • Data is stored in Containers • STL provides standardized Algorithms • Iterators bind Algorithms to Containers and are generalized pointers • Example:
      vector<int> vect;
      … // initialization of 'vect' variable
      sort(vect.begin(), vect.end());

  6. STAPL Overview • Data is stored in pContainers • STAPL provides standardized pAlgorithms • pRanges bind pAlgorithms to pContainers • Similar to STL iterators, but must also support parallelism

  7. pRange • pRange is the parallel counterpart of the STL iterator: • Binds pAlgorithms to pContainers • Provides an abstract view of a scoped data space • The data space is (recursively) partitioned into subranges • More than an iterator, since it supports parallelization • Scheduler/distributor decides how computation and data structures should be mapped to the machine • Data dependences among subranges can be represented by a data dependence graph (DDG) • Executor launches parallel computation, manages communication, and enforces dependences

  8. pRange • Provides random access to a partition of the data space • View and access are provided by a collection of iterators describing the pRange boundary • pRanges are partitioned into subranges • Automatically by STAPL, based on machine characteristics, number of processors, partition factors, etc. • Manually, according to user-specified partitions • A pRange can represent relationships among subspaces as Data Dependence Graphs (DDG) for scheduling

  9-14. Data Space and pRange (animation build over six slides; diagram labels: Data Space, pRange, subspace) • Each subspace is disjoint and could itself be a pRange • Nested parallelism
      stapl::pRange<stapl::pVector<int>::iterator> dataRange(segBegin, segEnd);
      dataRange.partition();
      stapl::pRange<stapl::pVector<int>::iterator> dataSubrange = dataRange.get_subrange(3);
      dataSubrange.partition_like(<0.25,0.25,0.25,0.25> * size);  // slide notation: four fractions of size
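
The recursive partitioning shown on these slides can be illustrated with a minimal sketch in standard C++. This is not STAPL code: `Range`, its `partition(parts)` signature, and the even-split policy are assumptions made for illustration (the slide's `partition()` takes no argument and lets STAPL choose the split).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a pRange: a half-open iterator interval that
// can be split into disjoint subranges, each of which is itself a Range.
template <class Iterator>
struct Range {
    Iterator first, last;
    std::vector<Range> subranges;

    Range(Iterator f, Iterator l) : first(f), last(l) {}

    std::size_t size() const { return static_cast<std::size_t>(last - first); }

    // Split into `parts` contiguous, near-equal, disjoint subranges.
    void partition(std::size_t parts) {
        assert(parts > 0);
        subranges.clear();
        const std::size_t n = size();
        Iterator cur = first;
        for (std::size_t p = 0; p < parts; ++p) {
            // Spread the remainder over the first n % parts subranges.
            const std::size_t len = n / parts + (p < n % parts ? 1 : 0);
            Iterator stop = cur + static_cast<std::ptrdiff_t>(len);
            subranges.emplace_back(cur, stop);
            cur = stop;
        }
    }

    // Subranges are Ranges too, so they can be partitioned again (nesting).
    Range& get_subrange(std::size_t i) { return subranges.at(i); }
};
```

Calling `partition(4)` on a range and then `get_subrange(3).partition(2)` mirrors the two-level structure in the diagram: the outer range has four subspaces, and the last one is partitioned again.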

  15. pContainer (diagram labels: pVector, pRange, STL vector) • pContainer is the parallel counterpart of the STL container • Provides parallel and concurrent methods • Maintains an internal pRange • Updated during insert/delete operations • Minimizes redistribution • Completed: pVector, pList, pTree • Example: [figure not included in transcript]

  16. pAlgorithm • pAlgorithm is the parallel counterpart of the STL algorithm • Parallel algorithms take as input • a pRange • work functions that operate on subranges; the pAlgorithm applies the work function to all subranges
      template<class SubRange>
      class pAddOne : public stapl::pFunction {
      public:
        ...
        // Increment every element of one subrange.
        void operator()(SubRange& spr) {
          typename SubRange::iterator i;
          for (i = spr.begin(); i != spr.end(); i++)
            (*i)++;
        }
      };
      ...
      p_transform(pRange, pAddOne);
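
To make the "apply a work function to every subrange" idea concrete, here is a small sketch using only the C++ standard library. `SubRange`, `AddOne`, and `apply_to_subranges` are hypothetical names, and running one `std::thread` per subrange is an assumption; STAPL's scheduler and executor decide the real mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// A subrange here is just a pair of iterators (an assumption; STAPL's
// subranges carry more information).
template <class Iterator>
struct SubRange {
    using iterator = Iterator;
    Iterator first, last;
    Iterator begin() const { return first; }
    Iterator end() const { return last; }
};

// Work function in the style of the slide's pAddOne.
struct AddOne {
    template <class SR>
    void operator()(SR& sr) const {
        for (auto i = sr.begin(); i != sr.end(); ++i) ++(*i);
    }
};

// Hypothetical driver: split [first, last) into `parts` disjoint
// subranges and run `work` on each one in its own thread.
template <class Iterator, class Work>
void apply_to_subranges(Iterator first, Iterator last, std::size_t parts, Work work) {
    assert(parts > 0);
    const std::size_t n = static_cast<std::size_t>(last - first);
    std::vector<std::thread> threads;
    Iterator cur = first;
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t len = n / parts + (p < n % parts ? 1 : 0);
        SubRange<Iterator> sr{cur, cur + static_cast<std::ptrdiff_t>(len)};
        threads.emplace_back([sr, work]() mutable { work(sr); });
        cur = sr.last;
    }
    for (auto& t : threads) t.join();  // wait for every subrange to finish
}
```

Because the subranges are disjoint, the threads never write to the same element, so no locking is needed for this particular work function.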

  17. Run-Time System • Support for different architectures • HP V2200 • SGI Origin 2000, SGI Power Challenge • Support for different paradigms • OpenMP, Pthreads • MPI • Memory allocation • HOARD (diagram labels: pAlgorithm, Run-Time, Cluster 1, Cluster 2, Cluster 3, Cluster 4 with Proc 12, Proc 13, Proc 14, Proc 15)

  18. Run-Time System • Scheduler: determines an execution order (DDG) • Policies: automatic (Static, Block, Dynamic, Partial Self Scheduling, Complete Self Scheduling) or user defined • Distributor: hierarchical data distribution • Automatic and user defined • Executor: executes the DDG • Processor assignment • Synchronization and communication
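
As a rough illustration of what "execute the DDG" means, the sketch below runs tasks in an order that respects their dependences, using Kahn's topological ordering. The `DDG` struct and `execute` function are hypothetical, and the tasks run sequentially here; a real executor would hand each ready task to a processor.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical DDG: node i is a task; edges[i] lists the tasks that
// must wait until task i has finished.
struct DDG {
    std::vector<std::function<void()>> tasks;
    std::vector<std::vector<std::size_t>> edges;

    std::size_t add_task(std::function<void()> f) {
        tasks.push_back(std::move(f));
        edges.emplace_back();
        return tasks.size() - 1;
    }
    void add_dependence(std::size_t before, std::size_t after) {
        assert(before < tasks.size() && after < tasks.size());
        edges[before].push_back(after);
    }
};

// Runs every task after all of its predecessors (Kahn's algorithm).
// Returns the execution order; it is shorter than the task count if
// the graph contains a cycle.
inline std::vector<std::size_t> execute(const DDG& g) {
    std::vector<std::size_t> pending(g.tasks.size(), 0);
    for (const auto& successors : g.edges)
        for (std::size_t s : successors) ++pending[s];

    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < pending.size(); ++i)
        if (pending[i] == 0) ready.push(i);

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t t = ready.front();
        ready.pop();
        g.tasks[t]();  // a real executor would dispatch this to a processor
        order.push_back(t);
        for (std::size_t s : g.edges[t])
            if (--pending[s] == 0) ready.push(s);
    }
    return order;
}
```

All tasks sitting in `ready` at the same time are mutually independent, which is where a scheduling policy such as the ones listed on the slide would come into play.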

  19. STL to STAPL Automatic Translation • C++ preprocessor converts STL code into STAPL parallel code • Iterators are used to construct pRanges • User is responsible for safe parallelization • In some cases automatic translation provides similar performance to STAPL-written code (5% deterioration)
      Original STL code:
        #include <start_STAPL>
        accumulate(x.begin(), x.end(), 0);
        for_each(x.begin(), x.end(), foo());
        #include <stop_STAPL>
      After the preprocessing phase:
        pi_accumulate(x.begin(), x.end(), 0);
        pi_for_each(x.begin(), x.end(), foo());
      After pRange construction:
        p_accumulate(x_pRange, 0);
        p_for_each(x_pRange, foo());
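
One way to picture what a parallel `accumulate` has to do is the partial-sum sketch below. It is written against the standard library only; `parallel_accumulate` and its `parts` parameter are hypothetical and are not the `p_accumulate` implementation. It also assumes that `T{}` is the identity for `+`.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical parallel accumulate: each of `parts` threads sums one
// disjoint subrange, then the partial sums are combined sequentially.
// The result matches std::accumulate only when the operation is
// associative, which is one reason the user stays responsible for
// safe parallelization.
template <class Iterator, class T>
T parallel_accumulate(Iterator first, Iterator last, T init, std::size_t parts) {
    assert(parts > 0);
    const std::size_t n = static_cast<std::size_t>(last - first);
    std::vector<T> partial(parts, T{});
    std::vector<std::thread> threads;
    Iterator cur = first;
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t len = n / parts + (p < n % parts ? 1 : 0);
        Iterator stop = cur + static_cast<std::ptrdiff_t>(len);
        // Each thread writes only its own slot of `partial`.
        threads.emplace_back([cur, stop, p, &partial] {
            partial[p] = std::accumulate(cur, stop, T{});
        });
        cur = stop;
    }
    for (auto& t : threads) t.join();
    return std::accumulate(partial.begin(), partial.end(), init);
}
```

The call site keeps the STL shape (`first`, `last`, `init`), which is the property the automatic translation relies on: the iterators carry enough information to build the range that is then split.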

  20. Performance: p_inner_product Experimental results on HP V2200

  21. pTree (diagram labels: Base (atomic), Subtrees (parallel), P1, P2, P3) • Parallel tree supports bulk commutative operations in parallel • Each processor is assigned a set of subtrees to maintain • Operations on the base are atomic • Operations on subtrees are parallel • Example: parallel insertion algorithm • Each processor is given a set of elements • Each processor creates local buckets corresponding to the subtrees • Each processor collects the buckets that correspond to its subtrees • Elements in the subtree buckets are inserted into the tree in parallel
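
The bucket-based insertion can be sketched as follows. This is a simplified model, not the pTree implementation: `BucketTree` is a hypothetical type in which the base is a sorted list of splitter keys and each subtree is a `std::set`, and the two bucketing steps on the slide (local buckets, then collection per owner) are collapsed into one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical model of the slide's pTree.
struct BucketTree {
    std::vector<int> splitters;           // base: sorted splitter keys
    std::vector<std::set<int>> subtrees;  // subtree i holds [splitters[i-1], splitters[i])

    explicit BucketTree(std::vector<int> s)
        : splitters(std::move(s)), subtrees(splitters.size() + 1) {
        assert(std::is_sorted(splitters.begin(), splitters.end()));
    }

    // Index of the subtree responsible for `key`.
    std::size_t subtree_of(int key) const {
        return static_cast<std::size_t>(
            std::upper_bound(splitters.begin(), splitters.end(), key) - splitters.begin());
    }

    // Bucket the elements by subtree, then insert every bucket into its
    // own subtree concurrently; no two threads touch the same std::set.
    void bulk_insert(const std::vector<int>& elements) {
        std::vector<std::vector<int>> buckets(subtrees.size());
        for (int e : elements) buckets[subtree_of(e)].push_back(e);

        std::vector<std::thread> threads;
        for (std::size_t i = 0; i < subtrees.size(); ++i)
            threads.emplace_back([this, i, &buckets] {
                subtrees[i].insert(buckets[i].begin(), buckets[i].end());
            });
        for (auto& t : threads) t.join();
    }

    std::size_t size() const {
        std::size_t n = 0;
        for (const auto& s : subtrees) n += s.size();
        return n;
    }
};
```

The base is only read during `bulk_insert`, so it needs no locking in this sketch; operations that change the base would have to be atomic, as the slide notes.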

  22. pTree • Basis for STAPL pSet, pMultiSet, pMap, pMultiMap containers • Covers all remaining STL containers • Results are sequentially consistent although internal structure may vary • Requires negligible additional memory • pTrees can be used either sequentially or in parallel in the same execution • allows switching back and forth between parallel & sequential

  23. Performance: pTree Experimental results on HP V2200

  24. Algorithm Adaptivity • Problem: the performance of parallel algorithms is highly sensitive to • Architecture: number of processors, memory interconnection, cache, available resources, etc. • Environment: thread management, memory allocation, operating system policies, etc. • Data characteristics: input type, layout, etc. • Solution: implement a number of different algorithms and adaptively choose the best one at run time

  25. Adaptive Framework

  26. Case Study - Adaptive Sorting

  27. Performance: Adaptive Sorting (panels: V2200, Power Challenge, Origin 2000) • Performance on 10 million integers

  28. Performance: Run-Time Tests (Origin 2000)
      if (data_type == INTEGER) radix_sort();
      else if (num_procs < 5) merge_sort();
      else column_sort();
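
The run-time test above can be written as an ordinary selection function. The sketch below encodes the same rule with hypothetical names (`DataType`, `SortAlgorithm`, `choose_sort`) and adds a plain sequential LSD radix sort as a stand-in for the integer case; the merge and column sorts are not shown, and none of this is STAPL's adaptive framework.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class DataType { Integer, Other };
enum class SortAlgorithm { Radix, Merge, Column };

// Mirrors the decision rule on the slide.
inline SortAlgorithm choose_sort(DataType type, std::size_t num_procs) {
    assert(num_procs > 0);
    if (type == DataType::Integer) return SortAlgorithm::Radix;
    if (num_procs < 5) return SortAlgorithm::Merge;
    return SortAlgorithm::Column;
}

// Sequential LSD radix sort on 32-bit keys, one byte per pass.
inline void radix_sort(std::vector<std::uint32_t>& v) {
    std::vector<std::uint32_t> tmp(v.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t count[257] = {0};
        // count[b + 1] = number of keys whose current byte equals b.
        for (std::uint32_t x : v) ++count[((x >> shift) & 0xFF) + 1];
        // Prefix sums turn counts into starting offsets per byte value.
        for (int i = 0; i < 256; ++i) count[i + 1] += count[i];
        // Stable scatter into tmp, then make tmp the current array.
        for (std::uint32_t x : v) tmp[count[(x >> shift) & 0xFF]++] = x;
        v.swap(tmp);
    }
}
```

Keeping the choice in one function means the thresholds (integer input, fewer than five processors) can be re-tuned per machine without touching the sorting code.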

  29. Performance: Molecular Dynamics* • Discrete time particle interaction simulation • Written in STL • Time steps calculate system evolution (dependence) • Parallelized within time step • STAPL utilization: • pAlgorithms: p_for_each, p_transform, p_accumulate • pContainers: pVector (push_back) • Automatic vs. Manual (5% performance deterioration ) * Code written by Danny Rintoul at Sandia National Labs

  30. Performance: Molecular Dynamics • 40%-49% parallelized • Input sensitive • Use pTree on rest • Experimental results on HP V2200 • Execution time (sec) on 1, 4, 8, 12, 16 processors: 108K particles: 2815, 1102, 546, 386, 309; 23k particles: 627, 238, 132, 94, 86
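
The timings on this slide can be turned into speedup and efficiency figures with a few lines of code. The sketch assumes the first five times belong to the 108K-particle run and the last five to the 23k-particle run, on 1, 4, 8, 12 and 16 processors; `Scaling` and `scaling` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

struct Scaling {
    double speedup;     // T(1) / T(p)
    double efficiency;  // speedup / p
};

// Computes speedup and parallel efficiency from a one-processor time
// and a p-processor time (both in seconds).
inline Scaling scaling(double t1, double tp, std::size_t procs) {
    assert(t1 > 0 && tp > 0 && procs > 0);
    const double s = t1 / tp;
    return {s, s / static_cast<double>(procs)};
}
```

Under that reading of the table, `scaling(2815.0, 309.0, 16)` gives a speedup of about 9.1 for the larger input and `scaling(627.0, 86.0, 16)` about 7.3 for the smaller one, which is consistent with the slide's "input sensitive" remark.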

  31. Performance - Particle Transport* • Generic particle transport solver • Regular and arbitrary grids • Numerically intensive, 25k line, C++ STAPL code • Sweep function unaware of parallel issues • STAPL utilization: • pAlgorithms: p_for_each • pContainers: pVector (for data distribution) • Scheduler: determine grid data dependencies • Executor: satisfy data dependencies * Joint effort between Texas A&M Nuclear Engineering and Computer Science, funded by DOE ASCI

  32. Performance - Particle Transport Profile and Speedups on SGI Origin 2000 using 16 processors

  33. Performance - Particle Transport Experimental results on SGI Origin 2000

  34. Summary • Parallel equivalent to STL • Many codes can immediately utilize STAPL • Automatic translation • Building block library • Portability (layered architecture) • Performance (adaptive) • Automatic recursive parallelism • STAPL performs well in small pAlgorithm test cases and large codes

  35. STAPL Status and Current Work • pAlgorithms - fully implemented • pContainers - pVector, pList, pTree • pRange - mostly implemented • Run-Time • Executor fully implemented • Scheduler fully implemented • Distributor work in progress • Adaptive mechanism (case study – sorting) • OpenMP + MPI (mixed) work in progress • OpenMP version fully implemented • MPI version work in progress

  36. http://www.cs.tamu.edu/research/parasol • Project funded by • NSF • DOE
