pArray as an Efficient Static Parallel Container in STAPL (Standard Template Adaptive Parallel Library)

Presenter: Olga Tkachyshyn • Grad Student Advisors: Ping An, Gabriel Tanase • Faculty Advisor: Nancy Amato

Presentation Plan • Motivation • STAPL Overview • pContainer Design • pArray • Prefix-sum using pArray and pVector • Performance results • Conclusion and Future Work

Motivation • The time it takes to complete a task is limited by the speed of the worker • Alternative: several workers sharing the task • Similarly, the time it takes to solve a problem on a computer is limited by the speed of the processor • Alternative: parallel processing, i.e. the concurrent use of multiple processors to process data

Parallel/Distributed Architecture • Multiple processors are connected together • A processor can have its own memory or share memory with another processor • [Figure: Processor 0, Processor 1, Processor 2; Cache 0, Cache 1, Cache 2; Memory 0, Memory 1]

Motivation • Powerful parallel computers can solve hard-to-compute problems, such as: • Computational physics • Protein folding • Parallel programming is challenging because of communication and synchronization issues • Parallel libraries reduce the complexity of parallel programming

STAPL Introduction • The Parasol Lab in the Computer Science Department at TAMU is developing a Standard Template Adaptive Parallel Library (STAPL) • STAPL is designed as a platform independent parallel library • STAPL provides a collection of parallel containers (generic distributed data structures) that are efficient and easy to use

STAPL Overview • STAPL is a C++ parallel library designed as a superset of the Standard Template Library (STL) • STAPL simplifies parallel programming by letting the user ignore details of the distributed machine such as: • data partitioning • distribution • communication • [Figure: STL components and their STAPL counterparts: Container and pContainer, Iterator and pRange, Algorithms and pAlgorithms; STAPL also has a runtime system that optimizes performance for all platforms]

STAPL Main Components • pContainer • Generic distributed data structure • STAPL requires an efficient array data structure for numerically intensive applications • pRange • Presents an abstract view of a scoped data space, which allows random access to a partition or subrange of the data in a pContainer • pAlgorithms • Parallel algorithms that provide basic functionality, bound to the pContainer by the pRange

pContainer: Basic Design • pContainers are data structures that allow users to store and use distributed data as if it were stored in a single memory • All pContainers have similar functionality • pContainers have three basic components: • Base pContainer • Base Distribution Manager • Base Sequential Container Interface • [Figure: class diagram. BasePContainer holds a distribution manager and a collection of sequential containers; SpecificPContainer, SpecificDistributionManager, and SpecificSequentialContainer correspond to BasePContainer, BaseDistributionManager, and BaseSequentialContainer]

Base Distribution Manager • The Base Distribution Manager is responsible for locating elements (finding the memory that contains the element) • Each pContainer element has a unique global identifier (GID) • Local element: Processor 0 needs an element with GID = 2, which is stored on Processor 0 • Remote element: Processor 0 needs an element with GID = 4, which is stored on Processor 1

Base Sequential Container • A pContainer is composed of sequential containers • The Base Sequential Container Interface (Part) provides a uniform interface that makes it easy to build pContainers from different sequential containers

Base pContainer • Generic methods to construct a pContainer • Methods to add, access, and modify elements • Methods to efficiently locate elements

pArray: Introduction • An array is a data structure with a fixed (unchangeable) size • Elements can be accessed randomly using their index, e.g. array[5] = 45; • Arrays are useful for numerically intensive applications • The C++ STL provides no fixed-size array container (std::array was added later, in C++11) • The C++ vector allows insertion and deletion of elements in the middle and is therefore hard to optimize • We have designed pArray for STAPL for this purpose

pArray: Basic Design • pArray is derived from the base classes of the pContainer • Three major components: • Array Part • Array Distribution • pArray • [Figure: class diagram. Array Distribution corresponds to BaseDistributionManager, Array Part to BaseSequentialContainer, and pArray to BasePContainer, which holds a distribution manager and a collection of sequential containers]

Array Distribution • Responsible for locating local and remote elements • Two ways this can be done: • Duplicated Distribution Information • Each processor has information about where all the elements are • Decentralized Distribution Information • Each processor keeps track of the location of an equal share of the elements

Duplicated Distribution Information • Array distribution information is stored in a vector of pair< pair<Start_Index, Size>, Processor ID > • Each processor has a copy of the distribution vector • Lookup process: scan the distribution vector and check whether the GID falls within an entry's (Start_Index, Size) range • [Figure: Processor 0 and Processor 1, each holding its data and a copy of the distribution vector with entries (Start_Index, Size):PID]
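The lookup on this slide can be sketched in a few lines of C++. This is only an illustration under assumptions: the type aliases and the functions find_owner and example_distribution are hypothetical names, not STAPL identifiers, and the example distribution (8 elements over 2 processors) is made up to match the GID = 2 and GID = 4 cases from the Base Distribution Manager slide.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical names; the slides only give the layout
// pair< <Start_Index, Size>, Processor ID >.
using Range = std::pair<std::size_t, std::size_t>;  // (start index, size)
using DistributionEntry = std::pair<Range, int>;    // (range, processor id)
using DistributionVector = std::vector<DistributionEntry>;

// Returns the id of the processor whose range contains `gid`,
// or -1 if no entry in the distribution vector contains it.
int find_owner(const DistributionVector& dist, std::size_t gid) {
    for (const DistributionEntry& entry : dist) {
        const std::size_t start = entry.first.first;
        const std::size_t size = entry.first.second;
        if (gid >= start && gid - start < size) {
            return entry.second;
        }
    }
    return -1;
}

// Assumed example: 8 elements, GIDs 0..3 on processor 0, GIDs 4..7 on processor 1.
DistributionVector example_distribution() {
    DistributionVector dist;
    dist.push_back(DistributionEntry(Range(0, 4), 0));
    dist.push_back(DistributionEntry(Range(4, 4), 1));
    return dist;
}
```

Because every processor holds the same vector, find_owner needs no communication; the cost is the memory for one copy per processor and a scan that grows with the number of ranges.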

Decentralized Distribution Information • Evenly divide the array into segments • Each processor is responsible for knowing the location of the elements in one segment • The lookup algorithm for a GID: • If the element is local, access it directly • Otherwise, if its location is in the local location cache, use the cached location • Otherwise, get the location information from the map owner, MapOwner = GID * nprocs / n, and cache it locally • Example: Processor 0 needs the element with GID = 5; with nprocs = 2 and n = 8, MapOwner = 5 * 2 / 8 = 1 (integer division) • [Figure: Processor 0 and Processor 1, each with data, a distribution, a location cache, and a location map]
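A minimal single-process sketch of this lookup follows. The formula MapOwner = GID * nprocs / n is taken from the slide; everything else is an assumption made for illustration: the struct LocationLookup, its members, and the query_owner callback (a stand-in for the message that would be sent to the map owner) are hypothetical and do not come from STAPL.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>

// Processor that keeps the location record for `gid`:
// MapOwner = GID * nprocs / n, using integer division as on the slide.
int map_owner(std::size_t gid, std::size_t nprocs, std::size_t n) {
    assert(n > 0 && gid < n);
    return static_cast<int>(gid * nprocs / n);
}

// Hypothetical per-processor lookup state (not a STAPL type).
struct LocationLookup {
    int my_id;                 // id of this processor
    std::size_t nprocs;        // number of processors
    std::size_t n;             // total number of elements
    std::size_t local_start;   // first GID stored locally
    std::size_t local_size;    // number of GIDs stored locally
    // Stand-in for a remote request: asks `owner` where `gid` is stored.
    std::function<int(int owner, std::size_t gid)> query_owner;
    std::unordered_map<std::size_t, int> location_cache;
    int remote_queries;        // counts requests sent to map owners

    // Returns the id of the processor that stores `gid`.
    int locate(std::size_t gid) {
        if (gid >= local_start && gid - local_start < local_size) {
            return my_id;                           // element is local
        }
        auto hit = location_cache.find(gid);
        if (hit != location_cache.end()) {
            return hit->second;                     // location already cached
        }
        const int owner = map_owner(gid, nprocs, n);
        const int location = query_owner(owner, gid);
        ++remote_queries;
        location_cache[gid] = location;             // cache locally
        return location;
    }
};
```

With the slide's numbers, map_owner(5, 2, 8) evaluates to 1. A second locate call for the same GID is answered from the cache, so remote_queries does not increase.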

Duplicated Distribution Information vs. Decentralized Distribution Information

Array Part • A wrapper over the sequential STL container valarray • Has all of the functionality of valarray
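To make the wrapper idea concrete, here is a rough sketch of what a local array part built on std::valarray could look like. The class name ArrayPartSketch and its member functions are assumptions chosen for this example; the slides do not show the actual Array Part interface.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <valarray>

// Hypothetical local part of a pArray: a thin wrapper over std::valarray.
template <class Data>
class ArrayPartSketch {
public:
    // Creates `size` value-initialized elements.
    explicit ArrayPartSketch(std::size_t size) : data_(Data(), size) {}

    Data& operator[](std::size_t i) { return data_[i]; }
    Data operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // Sum of the local elements (the part must not be empty).
    Data accumulate() const { return data_.sum(); }

    // Dot product of two local parts of the same size.
    Data dot(const ArrayPartSketch& other) const {
        assert(size() == other.size());
        std::valarray<Data> product = data_ * other.data_;
        return product.sum();
    }

    // Euclidean norm of the local part.
    long double euclidean_norm() const {
        return std::sqrt(static_cast<long double>(dot(*this)));
    }

private:
    std::valarray<Data> data_;
};
```

For a part holding 1, 2, 2 the local sum is 5, the dot product with itself is 9, and the norm is 3. In a distributed setting, each processor would compute these local values and a reduction would combine them.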

pArray Class • BasePContainer instantiates ArrayPart and ArrayDistribution • The pArray class is derived from the Base pContainer to implement the functionality specific to pArray

pArray Class

class pArray {
    // constructors
    pArray();                                      // default constructor
    pArray(int size);                              // constructor with default distribution
    pArray(int size, ArrayDistribution distr);     // constructor with specified distribution

    // element access methods
    Data GetElement(GID);                          // returns the element with the specified GID
    void SetElement(GID, Data);                    // sets the specified location to the given value

    // operators and array-specific methods
    Data operator[](GID);                          // indexed array access operator
    pArray operator+(Data scalar);                 // adds a scalar to the pArray
    pArray operator+(pArray array);                // adds two pArrays of the same size term by term (undefined otherwise);
                                                   // returns an array with the same distribution as the calling array
    pArray operator*(Data scalar);                 // multiplies the pArray by a scalar
    pArray operator*(pArray array);                // multiplies two pArrays of the same size term by term (undefined otherwise);
                                                   // returns an array with the same distribution as the calling array
    Data accumulate();                             // sums up all the values stored in the pArray
    Data dotproduct(pArray array);                 // dot product of two pArrays of the same size (undefined otherwise)
    long double euclideannorm();                   // Euclidean norm of a pArray
};

Prefix Sums • The prefix sums of a sequence S = {x1, x2, …, xn} of n elements are the n partial sums defined by Pi = x1 + x2 + … + xi, 1 ≤ i ≤ n • One of the most basic parallel algorithms • Used in other parallel algorithms like sorts • Sequential Algorithm:

    S[n];  // original array
    P[n];  // prefix sums
    P[0] = S[0];
    for (int i = 1; i < n; i++) {
        P[i] = P[i-1] + S[i];
    }

Parallel Prefix Sums • Step 1: Each processor sums up its part (in the example, Processor 0 has part sum 5 and Processor 1 has part sum 4) • Step 2: Processor 0 receives all part sums, calculates the starting sum for each processor, and sends each processor its starting sum (starting sum 0 for Processor 0, starting sum 5 for Processor 1) • Step 3: Each processor calculates its prefix sums, beginning from its starting sum
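The three steps can be traced with a small sequential simulation. This is a sketch under assumptions, not the STAPL pAlgorithm: each inner vector stands in for the part owned by one processor, the loops over parts replace what would run concurrently, and the names Parts and prefix_sums_three_steps are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Each inner vector stands for the part of the array owned by one processor.
using Parts = std::vector<std::vector<long>>;

// Replaces every element with its prefix sum over the whole (concatenated) array.
void prefix_sums_three_steps(Parts& parts) {
    const std::size_t nprocs = parts.size();

    // Step 1: each processor sums up its own part.
    std::vector<long> part_sum(nprocs, 0L);
    for (std::size_t p = 0; p < nprocs; ++p) {
        part_sum[p] = std::accumulate(parts[p].begin(), parts[p].end(), 0L);
    }

    // Step 2: processor 0 turns the part sums into starting sums
    // (the sum of all parts that come before processor p).
    std::vector<long> starting_sum(nprocs, 0L);
    for (std::size_t p = 1; p < nprocs; ++p) {
        starting_sum[p] = starting_sum[p - 1] + part_sum[p - 1];
    }

    // Step 3: each processor computes its prefix sums from its starting sum.
    for (std::size_t p = 0; p < nprocs; ++p) {
        long running = starting_sum[p];
        for (long& x : parts[p]) {
            running += x;
            x = running;
        }
    }
}
```

With parts {1, 2, 2} and {3, 1}, the part sums are 5 and 4 and the starting sums are 0 and 5, as on the slide; the parts end up as {1, 3, 5} and {8, 9}. Only Step 2 requires communication, and it moves one value per processor in each direction.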

Performance Results • Scalability is the ability of a program to exhibit good speed-up as the number of processors increases • Speed-up = running time on 1 processor / parallel running time • [Figure: running Prefix Sums for a pArray of 1,000,000 elements on 1 to 6 processors]

Performance Results • pVector is a data structure similar to pArray but with a dynamic size (elements can be added and deleted at run time) • Running Prefix Sums on 1,000,000 elements using pArray and pVector • pArray is faster because it has less overhead

Conclusions • pArray is a useful pContainer • pArray shows good scalability • pArray is faster than pVector in parallel Prefix Sum • Parallel Prefix Sums is an efficient pAlgorithm

Future Work • Array re-distribution • Optimize Prefix Sums • More pAlgorithms


Thank you • To my mentors: • Nancy Amato • Ping An • Gabriel Tanase
