
Optimizing Data Processing with STAPL's Efficient Static Parallel Container

Explore the benefits of Standard Template Adaptive Parallel Library (STAPL) in parallel programming. Learn about pContainers, pArrays, and their performance results in solving computational tasks efficiently.



Presentation Transcript


  1. pArray as an Efficient Static Parallel Container in STAPL (Standard Template Adaptive Parallel Library) • Presenter: Olga Tkachyshyn • Grad Student Advisors: Ping An, Gabriel Tanase • Faculty Advisor: Nancy Amato

  2. Presentation Plan • Motivation • STAPL Overview • pContainer Design • pArray • Prefix-sum using pArray and pVector • Performance results • Conclusion and Future Work

  3. Motivation • The time it takes to complete a task is limited by the speed of the worker • Alternative: share the task among several workers • Similarly, the time it takes to solve a problem on a computer is limited by the speed of the processor • Alternative: parallel processing, that is, the concurrent use of multiple processors to process data

  4. Parallel/Distributed Architecture • Multiple processors are connected together • A processor can have its own memory or share memory with another processor • [Figure: Processors 0, 1, and 2, each with its own cache (Cache 0, 1, 2), attached to Memory 0 and Memory 1]

  5. Motivation • Powerful parallel computers can solve hard-to-compute problems, for example in: • Computational physics • Protein folding • Parallel programming is challenging because of communication and synchronization issues • Parallel libraries reduce the complexity of parallel programming

  6. STAPL Introduction • The Parasol Lab in the Computer Science Department at TAMU is developing a Standard Template Adaptive Parallel Library (STAPL) • STAPL is designed as a platform independent parallel library • STAPL provides a collection of parallel containers (generic distributed data structures) that are efficient and easy to use

  7. STAPL Overview • STAPL is a C++ parallel library designed as a superset of the Standard Template Library (STL). • STAPL simplifies parallel programming by letting the user ignore distributed-machine details such as: • data partitioning • distribution • communication
     STL          STAPL
     Container    pContainer
     Iterator     pRange
     Algorithms   pAlgorithms
     STAPL also has a runtime system that optimizes performance for all platforms.

  8. STAPL Main Components • pContainer • Generic distributed data structure • STAPL requires an efficient array data structure for numerically intensive applications • pRange • Presents an abstract view of a scoped data space, which allows random access to a partition or subrange of the data in a pContainer • pAlgorithms • Parallel algorithms that provide basic functionality and are bound to a pContainer through a pRange

  9. pContainer: Basic Design • [Figure: class diagram of BaseDistributionManager, BaseSequentialContainer, and BasePContainer (which holds a distribution manager and a collection of sequential containers), together with their Specific counterparts] • pContainers are data structures that allow users to store and use distributed data as if it were stored in a single memory • All pContainers have similar functionality • pContainers have three basic components: • Base pContainer • Base Distribution Manager • Base Sequential Container Interface

  10. Base Distribution Manager • The Base Distribution Manager is responsible for locating elements (finding the memory that contains the element) • Each pContainer element has a unique global identifier (GID) • Local element: Processor 0 needs an element with GID = 2, which is stored in its own memory • Remote element: Processor 0 needs an element with GID = 4, which is stored on Processor 1

  11. Base Sequential Container • A pContainer is composed of sequential containers • The Base Sequential Container interface (the "part") provides a uniform interface for easily building pContainers from different sequential containers

  12. Base pContainer • Generic methods to construct a pContainer • Methods to add, access, and modify elements • Methods to efficiently locate elements
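
The composition described on slides 9 to 12 can be pictured with a small single-process sketch. Everything here is an assumption made for illustration: the names `BlockDistributionSketch` and `BasePContainerSketch`, the equal-block layout, and the method names are hypothetical and are not STAPL's actual classes. All parts live in one address space, whereas in a real pContainer each part would sit in a different processor's memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using GID = std::size_t;  // global identifier of an element

// Hypothetical distribution manager: equal-sized blocks, one block per part.
struct BlockDistributionSketch {
    std::size_t block_size = 1;
    std::size_t part_of(GID gid) const { return gid / block_size; }
    std::size_t offset_in_part(GID gid) const { return gid % block_size; }
};

// Hypothetical base pContainer: a distribution manager plus a collection of
// sequential containers (std::vector stands in for the sequential container).
template <typename T>
class BasePContainerSketch {
public:
    BasePContainerSketch(std::size_t n, std::size_t nparts) : size_(n) {
        assert(n > 0 && nparts > 0);
        dist_.block_size = (n + nparts - 1) / nparts;  // ceiling division
        for (std::size_t begin = 0; begin < n; begin += dist_.block_size) {
            const std::size_t len = std::min(dist_.block_size, n - begin);
            parts_.push_back(std::vector<T>(len));
        }
    }

    std::size_t size() const { return size_; }
    std::size_t part_count() const { return parts_.size(); }

    // Element access: the distribution manager says which part holds the GID.
    const T& get_element(GID gid) const {
        assert(gid < size_);
        return parts_[dist_.part_of(gid)][dist_.offset_in_part(gid)];
    }
    void set_element(GID gid, const T& value) {
        assert(gid < size_);
        parts_[dist_.part_of(gid)][dist_.offset_in_part(gid)] = value;
    }

private:
    std::size_t size_;
    BlockDistributionSketch dist_;
    std::vector<std::vector<T>> parts_;
};
```

The caller only ever uses GIDs, which is the "as if stored in a single memory" view the slides describe. For instance, `BasePContainerSketch<int> c(10, 3)` creates three parts, and `c.set_element(7, 42)` lands in the second one without the caller knowing it.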

  13. Presentation Plan • Motivation • STAPL Overview • pContainer Design • pArray • Prefix-sum using pArray and pVector • Performance results • Conclusion and Future Work

  14. pArray: Introduction • An array: • A data structure with a fixed (unchangeable) size • Elements can be accessed randomly using their index, e.g. array[5] = 45; • Arrays are useful for numerically intensive applications • The STL provides no fixed-size array container (std::array arrived later, in C++11) • The C++ vector allows insertion and deletion of elements in the middle and is therefore hard to optimize • We have designed pArray for STAPL to fill this gap

  15. pArray: Basic Design • [Figure: the pContainer class diagram with Array Distribution, Array Part, and pArray as the specific distribution manager, sequential container, and pContainer] • pArray is derived from the base classes of the pContainer • Three major components: • Array Part • Array Distribution • pArray

  16. Array Distribution • Responsible for locating local and remote elements • Two ways this can be done: • Duplicated Distribution Information • Each processor has information about where all the elements are • Decentralized Distribution Information • Each processor keeps track of the locations of an equal share of the elements

  17. Duplicated Distribution Information • Array distribution information is stored in a vector of pair< pair<Start_Index, Size>, Processor ID > • Each processor has a copy of the distribution vector • Lookup process: search the distribution vector for the entry whose range [Start_Index, Start_Index + Size) contains the GID • [Figure: Processors 0 and 1, each holding its data and a distribution vector with entries of the form (Start_Index, Size):PID]
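
The lookup on this slide can be sketched as a linear scan over the distribution vector. This is a minimal illustration, not STAPL code: the type aliases, `kNoOwner`, `make_block_distribution`, and `find_owner` are hypothetical names, and the equal-block layout is an assumption chosen so that GID 2 falls on Processor 0 and GID 4 on Processor 1, as in the slide 10 example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using GID = std::size_t;
using ProcId = int;
// ((Start_Index, Size), Processor ID), the entry layout named on the slide.
using DistributionEntry = std::pair<std::pair<GID, std::size_t>, ProcId>;
using DistributionVector = std::vector<DistributionEntry>;

constexpr ProcId kNoOwner = -1;  // hypothetical "not found" marker

// Assumed layout: n elements split into contiguous blocks, one per processor.
DistributionVector make_block_distribution(std::size_t n, int nprocs) {
    assert(nprocs > 0);
    const std::size_t p = static_cast<std::size_t>(nprocs);
    const std::size_t block = (n + p - 1) / p;  // ceiling division
    DistributionVector dist;
    for (std::size_t start = 0, pid = 0; start < n; start += block, ++pid) {
        const std::size_t size = std::min(block, n - start);
        dist.push_back({{start, size}, static_cast<ProcId>(pid)});
    }
    return dist;
}

// Every processor holds its own copy of dist, so this lookup is purely local.
ProcId find_owner(const DistributionVector& dist, GID gid) {
    for (const DistributionEntry& entry : dist) {
        const GID start = entry.first.first;
        const std::size_t size = entry.first.second;
        if (gid >= start && gid - start < size) return entry.second;
    }
    return kNoOwner;
}
```

With `make_block_distribution(8, 2)` the vector holds `(0, 4):0` and `(4, 4):1`. The cost of this scheme is that every processor stores and scans the whole vector; the benefit is that no communication is needed to find an owner.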

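  18. Decentralized Distribution Information • Evenly divide the array into segments • Each processor is responsible for knowing the location of the elements in one segment • The processor responsible for a GID is its map owner: MapOwner = GID * nprocs / n (integer division) • Lookup algorithm for a GID: • Is the element local? If yes, use the local data • If not, is its location in the local location cache? If yes, use the cached location • If not, get the location information from the map owner's location map and cache it locally • Example: Processor 0 needs the element with GID = 5; with nprocs = 2 and n = 8, MapOwner = 5 * 2 / 8 = 1, so Processor 0 asks Processor 1 • [Figure: Processors 0 and 1, each holding data, a distribution, a location cache, and a location map]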
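
The flowchart on this slide can be approximated in a single process as follows. This is a sketch under stated assumptions: `ProcessorSketch`, `make_block_placement`, and `locate` are hypothetical names, the processors are simulated as entries of a vector, and the "request to the map owner" is a plain function call instead of a message. Only the `MapOwner = GID * nprocs / n` formula is taken from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using GID = std::size_t;
using ProcId = std::size_t;

// MapOwner = GID * nprocs / n, with integer division as on the slide.
ProcId map_owner(GID gid, std::size_t nprocs, std::size_t n) {
    assert(n > 0 && gid < n);
    return gid * nprocs / n;
}

// One simulated processor's state.
struct ProcessorSketch {
    std::unordered_set<GID> local_data;              // GIDs stored here
    std::unordered_map<GID, ProcId> location_map;    // entries this processor owns
    std::unordered_map<GID, ProcId> location_cache;  // answers learned earlier
};

// Assumed setup: elements stored in contiguous blocks, one per processor.
std::vector<ProcessorSketch> make_block_placement(std::size_t n, std::size_t nprocs) {
    assert(n > 0 && nprocs > 0);
    std::vector<ProcessorSketch> procs(nprocs);
    const std::size_t block = (n + nprocs - 1) / nprocs;
    for (GID gid = 0; gid < n; ++gid) {
        const ProcId holder = gid / block;
        procs[holder].local_data.insert(gid);
        procs[map_owner(gid, nprocs, n)].location_map[gid] = holder;
    }
    return procs;
}

// Local? -> in cache? -> ask the map owner and cache the answer locally.
ProcId locate(std::vector<ProcessorSketch>& procs, ProcId self, GID gid, std::size_t n) {
    assert(self < procs.size());
    ProcessorSketch& me = procs[self];
    if (me.local_data.count(gid) > 0) return self;
    const auto cached = me.location_cache.find(gid);
    if (cached != me.location_cache.end()) return cached->second;
    const ProcId owner = map_owner(gid, procs.size(), n);  // remote request in practice
    const ProcId location = procs[owner].location_map.at(gid);
    me.location_cache[gid] = location;
    return location;
}
```

For the slide's example (`n = 8`, two processors), `locate(procs, 0, 5, 8)` consults Processor 1's location map once; a second call for GID 5 is answered from Processor 0's cache. Compared with the duplicated scheme, each processor stores only its share of the location information, at the price of a possible remote request on a cache miss.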

  19. Duplicated Distribution Information vs. Decentralized Distribution Information

  20. Array Part • Array Part is a wrapper over the sequential STL container valarray • It has all of the functionality of valarray
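
A wrapper of the kind this slide describes might look roughly like the sketch below. The class name `ArrayPartSketch` and its member set are hypothetical and chosen only for illustration; the `std::valarray` calls themselves (`size`, `operator[]`, `sum`, `+=`, `*=`, element-wise `*`) are standard.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Hypothetical array part: a thin wrapper that forwards to std::valarray.
template <typename T>
class ArrayPartSketch {
public:
    explicit ArrayPartSketch(std::size_t size) : data_(size) {}  // value-initialized

    std::size_t size() const { return data_.size(); }

    T& operator[](std::size_t i) {
        assert(i < data_.size());
        return data_[i];
    }
    const T& operator[](std::size_t i) const {
        assert(i < data_.size());
        return data_[i];
    }

    // Sum of the local elements; valarray::sum requires a non-empty array.
    T sum() const {
        assert(data_.size() > 0);
        return data_.sum();
    }

    // Scalar operations applied to every local element.
    ArrayPartSketch& operator+=(const T& scalar) { data_ += scalar; return *this; }
    ArrayPartSketch& operator*=(const T& scalar) { data_ *= scalar; return *this; }

    // Local contribution to a dot product of two equally sized parts.
    T dot(const ArrayPartSketch& other) const {
        assert(size() == other.size() && size() > 0);
        const std::valarray<T> product = data_ * other.data_;
        return product.sum();
    }

private:
    std::valarray<T> data_;
};
```

Because the numeric work is delegated to `valarray`, a container built from such parts only needs to combine per-part results, for example by adding up the `sum()` of every part to obtain a global accumulate.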

  21. pArray Class • BasePContainer instantiates ArrayPart and ArrayDistribution • The pArray class is derived from the Base pContainer to implement the functionality specific to pArray

  22. pArray Class
      class pArray {
      public:
          // constructors
          pArray();                                   // default constructor
          pArray(int size);                           // given size, default distribution
          pArray(int size, ArrayDistribution distr);  // given size and distribution

          // element access methods
          Data GetElement(GID gid);                   // returns the element with the specified GID
          void SetElement(GID gid, Data value);       // sets the element with the specified GID to the given value

          // operators and array-specific methods
          Data operator[](GID gid);                   // indexed array access
          pArray operator+(Data scalar);              // adds a scalar to the pArray
          pArray operator+(pArray array);             // adds two pArrays of the same size term by term (undefined otherwise);
                                                      // the result has the same distribution as the calling array
          pArray operator*(Data scalar);              // multiplies the pArray by a scalar
          pArray operator*(pArray array);             // multiplies two pArrays of the same size term by term (undefined otherwise);
                                                      // the result has the same distribution as the calling array
          Data accumulate();                          // sums up all the values stored in the pArray
          Data dotproduct(pArray array);              // dot product of two pArrays of the same size (undefined otherwise)
          long double euclideannorm();                // Euclidean norm of the pArray
      };

  23. Presentation Plan • Motivation • STAPL Overview • pContainer Design • pArray • Prefix-sum using pArray and pVector • Performance results • Conclusion and Future Work

  24. Prefix Sums • One of the most basic parallel algorithms • Used in other parallel algorithms like sorts • The prefix sums of a sequence $S = \{x_1, x_2, \dots, x_n\}$ of $n$ elements are the $n$ partial sums $P_i = x_1 + x_2 + \dots + x_i$, $1 \le i \le n$ • Sequential algorithm (0-based, so P[i] holds $P_{i+1}$):
      // S: original array of n elements (n >= 1)
      // P: prefix sums, also n elements
      P[0] = S[0];
      for (int i = 1; i < n; i++) {
          P[i] = P[i-1] + S[i];
      }
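
For reference, the same sequential computation is available in the C++ standard library as `std::partial_sum` from `<numeric>`. The wrapper name `prefix_sums` and the choice of `long` elements are assumptions made for this sketch.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sequential prefix sums: p[i] = s[0] + s[1] + ... + s[i].
std::vector<long> prefix_sums(const std::vector<long>& s) {
    std::vector<long> p(s.size());
    std::partial_sum(s.begin(), s.end(), p.begin());
    // Sanity check: the last prefix sum equals the sum of the whole sequence.
    assert(p.empty() || p.back() == std::accumulate(s.begin(), s.end(), 0L));
    return p;
}
```

Each output depends on the previous one, which is why this loop cannot be split across processors as it stands and a different formulation is needed for the parallel case.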

  25. Parallel Prefix Sums (example with Processor 0 and Processor 1) • Step 1: Each processor sums up its part (Part Sum = 5 on Processor 0, Part Sum = 4 on Processor 1) • Step 2: Processor 0 receives all part sums, calculates the starting sum for each processor, and sends each processor its starting sum (Starting Sum = 0 for Processor 0, Starting Sum = 5 for Processor 1) • Step 3: Each processor calculates its prefix sums, beginning from its starting sum
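
The three steps can be traced in a sequential simulation, where each "processor" is a block of one vector and each per-processor loop body is independent of the others. This is a sketch, not the pArray implementation: the function name, the block partitioning, and the use of `long` are assumptions, and no real communication or threading takes place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<long> parallel_prefix_sums_sketch(const std::vector<long>& data,
                                              std::size_t nprocs) {
    assert(nprocs > 0);
    const std::size_t n = data.size();
    const std::size_t block = (n + nprocs - 1) / nprocs;  // elements per processor

    // Half-open index range [first(p), last(p)) owned by simulated processor p.
    auto first = [&](std::size_t p) { return std::min(p * block, n); };
    auto last = [&](std::size_t p) { return std::min((p + 1) * block, n); };

    // Step 1: each processor sums up its own part.
    std::vector<long> part_sum(nprocs, 0L);
    for (std::size_t p = 0; p < nprocs; ++p)
        for (std::size_t i = first(p); i < last(p); ++i) part_sum[p] += data[i];

    // Step 2: processor 0 gathers the part sums and derives each starting sum.
    std::vector<long> starting_sum(nprocs, 0L);
    for (std::size_t p = 1; p < nprocs; ++p)
        starting_sum[p] = starting_sum[p - 1] + part_sum[p - 1];

    // Step 3: each processor computes its prefix sums from its starting sum.
    std::vector<long> prefix(n);
    for (std::size_t p = 0; p < nprocs; ++p) {
        long running = starting_sum[p];
        for (std::size_t i = first(p); i < last(p); ++i) {
            running += data[i];
            prefix[i] = running;
        }
    }
    return prefix;
}
```

With the illustrative input `{2, 3, 1, 3}` on two processors (values invented to reproduce the slide's part sums of 5 and 4), the starting sums are 0 and 5 and the result is `{2, 5, 6, 9}`. Steps 1 and 3 each touch every element once, and only step 2, which handles one number per processor, is inherently serial.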

  26. Presentation Plan • Motivation • STAPL Overview • pContainer Design • pArray • Prefix-sum using pArray and pVector • Performance results • Conclusion and Future Work

  27. Performance Results • Scalability is the ability of a program to exhibit good speed-up as the number of processors increases • Speed-up = running time on 1 processor / parallel running time • Experiment: Prefix Sums for a pArray of 1,000,000 elements on 1 to 6 processors

  28. Performance Results • pVector is a data structure similar to pArray but with a dynamic size (elements can be added and deleted at run time) • Running Prefix Sums on 1,000,000 elements using pArray and pVector • pArray is faster due to lower overhead

  29. Conclusions • pArray is a useful pContainer • pArray shows good scalability • pArray is faster than pVector in parallel Prefix Sum • Parallel Prefix Sums is an efficient pAlgorithm

  30. Future Work • Array re-distribution • Optimize Prefix Sums • More pAlgorithms


  32. Thank you • To my mentors: • Nancy Amato • Ping An • Gabriel Tanase
