pArray as an Efficient Static Parallel Container in STAPL (Standard Template Adaptive Parallel Library) • Presenter: Olga Tkachyshyn • Grad Student Advisors: Ping An, Gabriel Tanase • Faculty Advisor: Nancy Amato
Presentation Plan • Motivation • STAPL Overview • pContainer Design • pArray • Prefix-sum using pArray and pVector • Performance results • Conclusion and Future Work
Motivation • The time it takes to complete a task is limited by the speed of the worker • Alternative: divide the task among several workers • Similarly, the time it takes to solve a problem on a computer is limited by the speed of the processor • Alternative: parallel processing, the concurrent use of multiple processors to process data
Parallel/Distributed Architecture • Multiple processors are connected together • A processor can have its own memory or share memory with another processor [Figure: Processors 0 to 2, each with its own cache (Cache 0 to 2), and two memories (Memory 0, Memory 1)]
Motivation • Powerful parallel computers can solve hard-to-compute problems • Computational physics • Protein folding • Parallel programming is challenging because of communication and synchronization issues • Parallel libraries reduce the complexity of parallel programming
STAPL Introduction • The Parasol Lab in the Computer Science Department at TAMU is developing the Standard Template Adaptive Parallel Library (STAPL) • STAPL is designed as a platform-independent parallel library • STAPL provides a collection of parallel containers (generic distributed data structures) that are efficient and easy to use
STAPL Overview • STAPL is a C++ parallel library designed as a superset of the Standard Template Library (STL) • STAPL simplifies parallel programming by letting the user ignore distributed machine details such as: • data partitioning • distribution • communication [Figure: STL components and their STAPL counterparts: Container / pContainer, Iterator / pRange, Algorithms / pAlgorithms; STAPL also includes a runtime system that optimizes performance for all platforms]
STAPL Main Components • pContainer • Generic distributed data structure • STAPL requires an efficient array data structure for numerically intensive applications • pRange • Presents an abstract view of a scoped data space, which allows random access to a partition or subrange of the data in a pContainer • pAlgorithms • Parallel algorithms that provide basic functionality and are bound to pContainers through pRanges
pContainer: Basic Design • pContainers are data structures that let users store and use distributed data as if it were stored in a single memory • All pContainers have similar functionality • pContainers have three basic components: • Base pContainer • Base Distribution Manager • Base Sequential Container Interface [Figure: class diagram. BasePContainer, parameterized by a Distribution Manager and a Sequential Container, holds a distribution manager and a collection of sequential containers. Classes shown: BaseDistributionManager, SpecificDistributionManager, BaseSequentialContainer, SpecificSequentialContainer, BasePContainer, SpecificPContainer]
Base Distribution Manager • The Base Distribution Manager is responsible for locating elements (finding the memory that contains the element) • Each pContainer element has a unique global identifier (GID) • Local element: Processor 0 needs an element with GID = 2, which is in its own memory • Remote element: Processor 0 needs an element with GID = 4, which is in the memory of Processor 1
Base Sequential Container • A pContainer is composed of sequential containers • The Base Sequential Container Interface (Part) provides a uniform interface that makes it easy to build pContainers from different sequential containers
Base pContainer • Generic methods to construct a pContainer • Methods to add, access, and modify elements • Methods to efficiently locate elements
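The three components described above can be pictured with a small single-process sketch. Every name here (`BlockDistributionSketch`, `PContainerSketch`, `part_of`, `offset_of`) is hypothetical and chosen for illustration; this is not the STAPL source, and a fixed block distribution is assumed only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the STAPL source.

// Stands in for a specific distribution manager: fixed-size blocks.
struct BlockDistributionSketch {
    std::size_t block;  // number of elements per part
    std::size_t part_of(std::size_t gid) const { return gid / block; }
    std::size_t offset_of(std::size_t gid) const { return gid % block; }
};

// Stands in for the base pContainer: one distribution manager plus a
// collection of sequential containers ("parts").
template <class Distribution, class Part>
class PContainerSketch {
public:
    using value_type = typename Part::value_type;

    PContainerSketch(Distribution dist, std::vector<Part> parts)
        : dist_(dist), parts_(std::move(parts)) {}

    // Access an element by GID: the distribution says which part holds it.
    value_type get(std::size_t gid) const {
        const std::size_t p = dist_.part_of(gid);
        assert(p < parts_.size());
        return parts_[p][dist_.offset_of(gid)];
    }

    // Modify an element by GID.
    void set(std::size_t gid, const value_type& v) {
        const std::size_t p = dist_.part_of(gid);
        assert(p < parts_.size());
        parts_[p][dist_.offset_of(gid)] = v;
    }

private:
    Distribution dist_;
    std::vector<Part> parts_;
};
```

With two parts of three elements each, `get(4)` resolves to part 1, offset 1. In a real distributed setting the parts would live in different memories and a remote access would involve communication, which this sketch leaves out.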
pArray: Introduction • An array is a data structure with a fixed (unchangeable) size • Elements can be accessed randomly by index: array[5] = 45; • Arrays are useful for numerically intensive applications • The STL (as of C++98) has no fixed-size array container • The C++ vector allows insertion and deletion of elements in the middle and is therefore hard to optimize • We designed pArray for STAPL to fill this gap
pArray: Basic Design • pArray is derived from the base classes of the pContainer • Three major components: • Array Part • Array Distribution • pArray [Figure: class diagram. Array Distribution is shown with BaseDistributionManager, Array Part with BaseSequentialContainer, and pArray with BasePContainer, which holds a distribution manager and a collection of sequential containers]
Array Distribution • Responsible for locating local and remote elements • Two ways this can be done: • Duplicated Distribution Information • Each processor has information about where all the elements are • Decentralized Distribution Information • Each processor keeps track of the locations of an equal share of the elements
Duplicated Distribution Information • Array distribution information is stored in a vector of pair<pair<Start_Index, Size>, Processor_ID> • Each processor has a copy of the distribution vector • Lookup process: look in the distribution vector • Check whether the GID is in the range of an entry [Figure: Processor 0 and Processor 1, each holding its data and its own copy of the distribution vector, with entries of the form (Start_Index, Size):PID]
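The lookup over a duplicated distribution vector can be sketched as a linear scan. The entry layout follows the slide; the names `DistributionEntry` and `find_owner`, and the choice to return -1 for an unknown GID, are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One entry of the distribution vector: ((start_index, size), processor_id).
using DistributionEntry = std::pair<std::pair<std::size_t, std::size_t>, int>;

// Returns the id of the processor whose range contains `gid`,
// or -1 if no entry covers it. Every processor would hold an
// identical copy of `dist`, so no communication is needed here.
int find_owner(const std::vector<DistributionEntry>& dist, std::size_t gid) {
    assert(!dist.empty());  // an empty vector would describe no elements
    for (const DistributionEntry& e : dist) {
        const std::size_t start = e.first.first;
        const std::size_t size = e.first.second;
        if (gid >= start && gid - start < size) {
            return e.second;
        }
    }
    return -1;
}
```

For eight elements split evenly over two processors, the vector would hold ((0, 4), 0) and ((4, 4), 1), so GID 2 resolves to Processor 0 and GID 4 to Processor 1. If the entries are kept sorted by start index, a binary search could replace the scan.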
Decentralized Distribution Information • Evenly divide the array into segments • Each processor is responsible for knowing the location of one segment • Example: Processor 0 needs the element with GID = 5 • The algorithm: • (1) Look up the GID: if the element is local, use it directly • (2) If not, check the local Location Cache: if the location is cached, use it • (3) Otherwise, get the location information from the map owner, where MapOwner = GID*nprocs/n (here 5*2/8 = 1 with integer division) • (4) Cache the location locally [Figure: Processor 0 and Processor 1, each with its data and a distribution made of a Location Cache and a Location Map]
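The lookup steps above can be sketched in a single address space, with a plain function call standing in for the remote request to the map owner. The struct `ProcessorSketch` and the functions `map_owner` and `locate` are hypothetical names; only the formula GID*nprocs/n comes from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Processor that answers location queries for `gid`: GID * nprocs / n.
std::size_t map_owner(std::size_t gid, std::size_t nprocs, std::size_t n) {
    assert(n > 0 && gid < n);
    return gid * nprocs / n;  // integer division
}

// Hypothetical per-processor state, simulated in one process.
struct ProcessorSketch {
    std::unordered_map<std::size_t, int> local_data;             // GID -> value
    std::unordered_map<std::size_t, std::size_t> cache;          // GID -> holder
    std::unordered_map<std::size_t, std::size_t> location_map;   // GID -> holder, for this processor's segment
};

// Returns the processor that stores `gid`, as seen from processor `self`.
std::size_t locate(std::vector<ProcessorSketch>& procs, std::size_t self,
                   std::size_t gid, std::size_t n) {
    ProcessorSketch& me = procs[self];
    if (me.local_data.count(gid) != 0) return self;      // 1. local?
    auto hit = me.cache.find(gid);
    if (hit != me.cache.end()) return hit->second;       // 2. cached?
    const std::size_t owner = map_owner(gid, procs.size(), n);
    // 3. ask the map owner (a remote request in a real system)
    const std::size_t holder = procs[owner].location_map.at(gid);
    me.cache[gid] = holder;                              // 4. cache locally
    return holder;
}
```

With n = 8 and two processors, `map_owner(5, 2, 8)` evaluates to 1, matching the slide's example. A second lookup of GID 5 from Processor 0 is then answered from its cache without contacting the map owner.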
Duplicated Distribution Information vs. Decentralized Distribution Information
Array Part • Implemented as a wrapper over the sequential standard library class valarray • Has all the functionality of valarray
pArray Class • BasePContainer instantiates ArrayPart and ArrayDistribution • The pArray class is derived from the Base pContainer to implement the functionality specific to pArray
pArray Class

    class pArray {
    public:
        // constructors
        pArray();                                   // default constructor
        pArray(int size);                           // given size, default distribution
        pArray(int size, ArrayDistribution distr);  // given size and distribution

        // element access methods
        Data GetElement(GID);        // returns the element with the specified GID
        void SetElement(GID, Data);  // stores the given value at the specified GID

        // operators and array-specific methods
        Data operator[](GID);             // indexed array access
        pArray operator+(Data scalar);    // adds a scalar to every element
        pArray operator+(pArray array);   // term-by-term sum of two pArrays of the same size
                                          // (undefined otherwise); the result has the same
                                          // distribution as the calling array
        pArray operator*(Data scalar);    // multiplies every element by a scalar
        pArray operator*(pArray array);   // term-by-term product of two pArrays of the same size
                                          // (undefined otherwise); the result has the same
                                          // distribution as the calling array
        Data accumulate();                // sums all the values stored in the pArray
        Data dotproduct(pArray array);    // dot product of two pArrays of the same size
                                          // (undefined otherwise)
        long double euclideannorm();      // Euclidean norm of the pArray
    };
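Since the Array Part wraps valarray, the numeric operations listed for pArray have direct single-processor analogues in `std::valarray`. The sketch below shows those analogues only; the free functions `accumulate_all`, `dot_product`, and `euclidean_norm` are illustrative names, not pArray methods, and nothing here is distributed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <valarray>

// Sum of all elements; sequential stand-in for accumulate().
double accumulate_all(const std::valarray<double>& a) {
    return a.sum();
}

// Dot product of two arrays, which must have the same size.
double dot_product(const std::valarray<double>& a,
                   const std::valarray<double>& b) {
    assert(a.size() == b.size());
    return (a * b).sum();  // term-by-term product, then sum
}

// Euclidean norm: square root of the sum of squared elements.
double euclidean_norm(const std::valarray<double>& a) {
    return std::sqrt((a * a).sum());
}
```

Scalar and term-by-term arithmetic come from valarray itself: for `std::valarray<double> a = {3.0, 4.0}` and `b = {1.0, 2.0}`, `a + b` holds 4 and 6, `a * 2.0` holds 6 and 8, the dot product is 11, and the Euclidean norm of `a` is 5. A distributed version would presumably compute these per part and then combine the partial results across processors.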
Prefix Sums • One of the most basic parallel algorithms • Used in other parallel algorithms such as sorting • The prefix sums of a sequence $S = \{x_1, x_2, \dots, x_n\}$ of $n$ elements are the $n$ partial sums $P_i = x_1 + x_2 + \dots + x_i$, $1 \le i \le n$ • Sequential algorithm:

    std::vector<int> S(n);  // original array (requires <vector>, n >= 1)
    std::vector<int> P(n);  // prefix sums; 0-based, so P[i] = S[0] + ... + S[i]
    P[0] = S[0];
    for (int i = 1; i < n; i++) {
        P[i] = P[i-1] + S[i];
    }
Parallel Prefix Sums • Step 1: Each processor sums up its part (example: Processor 0 part sum = 5, Processor 1 part sum = 4) • Step 2: Processor 0 receives all part sums, calculates the starting sum for each processor, and sends each processor its starting sum (example: Processor 0 starting sum = 0, Processor 1 starting sum = 5) • Step 3: Each processor calculates its prefix sums
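The three steps can be traced with a sequential simulation in which each inner vector plays the role of one processor's part. The function name `parallel_prefix_sums` and the data layout are assumptions for illustration; a real implementation would run steps 1 and 3 concurrently and use communication for step 2.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// `parts[p]` holds the elements assigned to (simulated) processor p.
// Returns the prefix sums, laid out per processor like `parts`.
std::vector<std::vector<long>> parallel_prefix_sums(
        const std::vector<std::vector<long>>& parts) {
    const std::size_t nprocs = parts.size();
    assert(nprocs > 0);

    // Step 1: each processor sums up its own part.
    std::vector<long> part_sum(nprocs);
    for (std::size_t p = 0; p < nprocs; ++p) {
        part_sum[p] = std::accumulate(parts[p].begin(), parts[p].end(), 0L);
    }

    // Step 2: processor 0 turns the part sums into starting sums.
    std::vector<long> start(nprocs, 0L);
    for (std::size_t p = 1; p < nprocs; ++p) {
        start[p] = start[p - 1] + part_sum[p - 1];
    }

    // Step 3: each processor computes its prefix sums from its starting sum.
    std::vector<std::vector<long>> result(nprocs);
    for (std::size_t p = 0; p < nprocs; ++p) {
        long running = start[p];
        for (long x : parts[p]) {
            running += x;
            result[p].push_back(running);
        }
    }
    return result;
}
```

As a made-up input that reproduces the slide's numbers, the parts {1, 4} and {3, 1} give part sums 5 and 4 and starting sums 0 and 5, so the result is {1, 5} on Processor 0 and {8, 9} on Processor 1.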
Performance Results • Scalability is the ability of a program to exhibit good speed-up as the number of processors increases • Speed-up = running time on 1 processor / parallel running time [Figure: running Prefix Sums for a pArray of 1,000,000 elements on 1 to 6 processors]
Performance Results • pVector is a data structure similar to pArray but with a dynamic size (elements can be added and deleted at run time) • Running Prefix Sums on 1,000,000 elements using pArray and pVector • pArray is faster due to less overhead
Conclusions • pArray is a useful pContainer • pArray shows good scalability • pArray is faster than pVector in parallel Prefix Sum • Parallel Prefix Sums is an efficient pAlgorithm
Future Work • Array re-distribution • Optimize Prefix Sums • More pAlgorithms
Thank you • To my mentors: • Nancy Amato • Ping An • Gabriel Tanase