A Performance Model for Fine-Grain Accesses in UPC
Zhang Zhang, Steve Seidel
Department of Computer Science, Michigan Technological University
{zhazhang,steve}@mtu.edu
http://www.upc.mtu.edu
Outline
• Motivation and approach
• The UPC programming model
• Performance model design
• Microbenchmarks
• Application analysis
• Measurements
• Summary and continuing work
1. Motivation
• Unified Parallel C (UPC) is an extension of ANSI C that provides a partitioned shared memory model for parallel programming.
• UPC compilers are available for platforms ranging from Macs to the Cray X1.
• The Partitioned Global Address Space (PGAS) community is asking for performance models.
• An accurate model
  • determines whether an application code takes best advantage of the underlying parallel system and
  • identifies the code and system optimizations that are needed.
Approach
• Construct an application-level analytical performance model.
• Model fine-grain access performance based on platform benchmarks and code analysis.
  • Platform benchmarks determine the optimization capabilities of the compiler and runtime system.
  • Code analysis determines where those optimizations will be applied in the code.
2. The UPC programming model
• UPC is an extension of ISO C99.
• UPC processes are called threads: the predefined identifiers THREADS and MYTHREAD are provided.
• UPC is based on a partitioned shared memory model:
  • A single address space that is logically partitioned among processors.
  • Partitioned memory is part of the programming paradigm.
  • Physical memory may or may not be partitioned.
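As a minimal illustration (a sketch, not from the original slides), a UPC program can use these predefined identifiers directly:

    #include <upc.h>
    #include <stdio.h>

    int main(void) {
        /* Every UPC thread runs main(); THREADS and MYTHREAD
           are predefined by the language. */
        printf("hello from thread %d of %d\n", MYTHREAD, THREADS);
        upc_barrier;   /* all threads synchronize here */
        return 0;
    }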
UPC’s partitioned shared address space
• Each thread has a private (local) address space.
• All threads share a global address space that is partitioned among the threads.
• A shared object in thread i's region of the partition is said to have affinity to thread i.
• If thread i has affinity to a shared object x, accesses to x are likely to take less time than accesses to shared objects to which thread i does not have affinity.
• A performance model must capture this property.
UPC programming model
[Figure: the array shared [5] int A[10*THREADS] laid out in blocks of 5 elements (block boundaries at indices 0, 5, 10, 15, 20, 25) across threads th0, th1, th2, with a private int i in each thread; after the assignments below, i = 3 in each thread, A[0] = 7, and A[3] = 9.]

    int i;
    shared [5] int A[10*THREADS];
    i = 3;
    A[0] = 7;
    A[i] = A[0] + 2;
3. Performance model design
• Platform abstraction
  • identify potential optimizations
  • microbenchmarks measure platform properties with respect to those optimizations
• Application analysis
  • code is partitioned by sequence points: barriers, fences, strict memory accesses, library calls
  • characterize patterns of shared memory accesses
Platform abstraction
• UPC compilers and runtime systems try to avoid and/or reduce the latency of remote accesses:
  • exploit spatial locality
  • overlap remote accesses with other work
• Each platform applies a different set of optimization techniques.
• The model must capture the effects of those optimizations in the presence of some uncertainty about how they are actually applied.
Potential optimizations
• access aggregation: multiple accesses to shared memory that have the same affinity can be postponed and combined
• access vectorization: a special case of aggregation in which the stride is regular
• runtime caching: exploits temporal and spatial reuse
• access pipelining: overlap independent concurrent remote accesses to multiple threads, if the network allows
Potential optimizations (cont'd)
• communication/computation overlapping: the usual technique applied by experienced programmers
• multistreaming: provided by hardware with a memory system that can handle multiple streams of data
• Notes:
  • The effects of these optimizations are not disjoint, e.g., caching and coalescing can have similar effects.
  • It can be difficult to determine with certainty which optimizations are actually at work.
  • The microbenchmarks associated with the performance model try to reveal the available optimizations.
A short code sketch contrasting the vector and baseline patterns follows.
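As an illustration (a sketch, not from the original slides; R, idx, and N are hypothetical names), two loops that exercise different optimizations:

    #include <upc.h>
    #define N 1000

    shared [] double *R;   /* assumed to point to a block of shared memory
                              with affinity to a single (remote) thread,
                              e.g. obtained from upc_all_alloc() */
    int idx[N];            /* assumed: filled with random indices in [0,N) */

    void sum_patterns(void) {
        double sum = 0.0;
        int i;

        /* Vector pattern: consecutive remote locations with unit stride;
           amenable to vectorization or runtime caching. */
        for (i = 0; i < N; i++)
            sum += R[i];

        /* Baseline pattern: randomly indexed remote locations; none of
           the optimizations listed above applies. */
        for (i = 0; i < N; i++)
            sum += R[idx[i]];
    }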
4. Microbenchmarks identify available optimizations
• Baseline: cost of random remote shared accesses
  • when no optimizations can be applied
• Vector: cost of accesses to consecutive remote locations
  • captures vectorization and runtime caching
• Coalesce: random, small-strided remote accesses
  • captures pipelining and aggregation
• Local vs. private: accesses to local (shared) memory
  • captures the overhead of shared memory addressing
• Costs are expressed as access rates, in double words per second.
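A minimal sketch of what the baseline microbenchmark might look like (hypothetical; the slides do not show the benchmark code, and a static-THREADS layout is assumed):

    #include <upc.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define N 100000              /* accesses per thread */
    shared double A[N*THREADS];   /* default cyclic layout: A[i] has
                                     affinity to thread i % THREADS */

    /* Time N randomly indexed shared reads and return the access rate
       in double words per second. The volatile sum keeps the loop from
       being optimized away. */
    double baseline_read_rate(void) {
        struct timeval t0, t1;
        volatile double sum = 0.0;
        srand(MYTHREAD + 1);
        upc_barrier;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < N; i++)
            sum += A[rand() % (N*THREADS)];   /* mostly remote, no regular stride */
        gettimeofday(&t1, NULL);
        double secs = (t1.tv_sec - t0.tv_sec) + 1e-6*(t1.tv_usec - t0.tv_usec);
        return N / secs;
    }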
5. Application analysis overview
• The application code is partitioned into intervals based on sequence points such as barriers, fences, and strict memory accesses.
• A dependence graph is constructed for all accesses to the same shared object (i.e., array) in each interval.
• References are partitioned into groups corresponding to the four types of microbenchmarks; the references in each group are amenable to the associated optimizations.
• Costs are accumulated to obtain a performance prediction.
A reference partition
• A partition (C, pattern, name) is a set of accesses that occur in a synchronization interval, where
  • C is the set of accesses,
  • pattern is in {baseline, vector, coalesce, local}, and
  • name is the accessed object, e.g., shared array A.
• User-defined functions are inlined to obtain flat code.
• Alias analysis must be done.
• Recursion is not modeled.
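One way to picture this record (a hypothetical sketch; the analysis tool's actual data structures are not shown in the slides):

    typedef enum { BASELINE, VECTOR, COALESCE, LOCAL } pattern_t;

    typedef struct access access_t;   /* one shared-memory reference in the code */

    typedef struct {
        access_t  **accesses;   /* C: the accesses in this synchronization interval */
        size_t      count;      /* |C| */
        pattern_t   pattern;    /* the matched microbenchmark pattern */
        const char *name;       /* the accessed shared object, e.g. "A" */
    } partition_t;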
Dependence analysis
• The dependence graph G=(V,E) of an interval has one vertex for each reference to a shared object (its name); edges connect dependent vertices.
• The dependences considered are true dependence, antidependence, output dependence, and input dependence.
• The goal is to determine sets of accesses that can be done concurrently.
Reference partitioning graph
• The reference partitioning graph G'=(V',E') is constructed from G.
• Let B be the subset of E consisting of edges denoting true dependences and antidependences.
• Construct V' by grouping vertices of V that
  • have the same name,
  • reference memory locations with the same affinity, and
  • are not connected by an edge in B.
• Each vertex in V' is assigned a reference pattern.
Example 1

    shared [] float *A;   /* A points to a remote block of shared memory */
    for (i = 1; i < N; i++) {
        ... = A[i]; ... = A[i-1]; ...
    }

• A[i] and A[i-1] are in the same partition.
• If the platform supports vectorization, this pattern is assigned the vector type.
• If not, the pair of accesses can be coalesced on each iteration.
• Otherwise, the baseline pattern is assigned to this partition.
Example 2

    shared [1] float *B;   /* B is distributed one element per thread
                              (round robin) */
    for (i = 1; i < N; i++) {
        ... = B[i]; ... = B[i-1]; ...
    }

• B[i] and B[i-1] are in different partitions.
• Vectorization and coalescing cannot be applied.
• The pattern is a baseline-local mix, e.g., if THREADS=4 the mix is 75% baseline, 25% local.
• For large numbers of threads the pattern is effectively just baseline.
Communication cost
• The communication cost of interval i is

    T_i^{comm} = \sum_j \frac{N_j}{r(N_j,\, pattern_j)}

  where N_j is the number of shared memory accesses in partition j and r(N_j, pattern_j) is the access rate for that number of references and that pattern.
• The functions r(N_j, pattern) are determined by microbenchmarking.
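For illustration only (the numbers are hypothetical, not measurements from the paper): a partition of 10,000 vector accesses served at a measured rate of 5 x 10^6 double words per second contributes 10,000 / (5 x 10^6 words/s) = 2 ms to the interval's communication cost.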
Computation cost
• Computation cost T_i^{comp} is measured by simulating the computation using only private memory accesses.
• The total run time of interval i is then

    T_i = T_i^{comm} + T_i^{comp}

• The cost of barriers is measured separately.
• The predicted cost for each thread is the sum of all of these costs.
• The highest predicted cost among all threads is taken to be the total cost.
Speedup prediction
• Speedup can be estimated by the ratio of the number of accesses in the sequential code to the weighted sum of the numbers of remote accesses of each type.
• Details are given in the paper.
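A plausible form of this estimate (a sketch under assumptions; the paper's exact weights are not reproduced here) is

    S \approx \frac{N_{seq}}{\sum_p w_p N_p}

where N_{seq} is the number of accesses in the sequential code, N_p is the number of remote accesses with pattern p, and w_p is the relative cost of pattern p obtained from the microbenchmarks.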
6. Measurements
• Microbenchmarks
• Applications
  • Histogramming
  • Matrix multiply
  • Sobel edge detection
• Platforms measured
  • 16-node 2 GHz x86 Linux Myrinet cluster
    • MuPC V1.1.2 beta
    • Berkeley UPC V2.2
  • 48-node 300 MHz Cray T3E
    • GCC UPC
Prediction precision
• Execution time prediction precision is expressed as

    \frac{T_{measured} - T_{predicted}}{T_{measured}} \times 100\%

• A negative value indicates that the cost is overestimated.
Microbenchmark measurements
• A few observations:
  • Increasing the number of threads from 2 to 12 decreases performance in all cases: the decrease ranges from 0-10% on the T3E to as much as 25% for Berkeley and 50% in one case (*) for MuPC.
  • Caching improves MuPC performance for vector and coalesce accesses but reduces performance for baseline writes (*).
  • Berkeley successfully coalesces reads.
  • GCC is unusually slow at local writes.
Histogramming

    shared [1] int array[N];
    for (i = 0; i < N*percentage; i++) {
        loc = random_index(i);
        array[loc]++;
    }
    upc_barrier;

• Random elements of an array are updated in parallel.
• Races are ignored.
• percentage determines how much of the table is accessed.
• Collisions grow as percentage gets smaller.
• This fits a mixed baseline-local pattern when the number of threads is small, as explained earlier.
Histogramming performance estimates
• For 12 threads the predicted cost is usually within 5%.
Matrix multiply

    upc_forall (i=0; i<N; i++; &A[i][0]) {
        for (j=0; j<N; j++) {
            C[i][j] = 0.0;
            for (k=0; k<N; k++)
                C[i][j] += A[i][k]*B[k][j];
        }
    }

• Thread t executes pass i if A[i][0] has affinity to t.
• Remote accesses are minimized by distributing A and C across threads by rows and B across threads by columns.
• Both cyclic striped and block striped distributions are measured.
• Accesses to A and C are local; accesses to B are mixed vector-local.
Matrix multiply performance estimates
• N x N = 240 x 240
• Berkeley and GCC costs are underestimated.
• MuPC-with-cache costs are overestimated because temporal locality is not modeled.
Sobel edge detection
• A 2000 x 2000-pixel image is distributed so that each thread gets approximately 2000/THREADS contiguous rows.
• All accesses to the computed image are local.
• Read-only accesses to the source array are mixed-local: the north and south border rows are in neighboring threads; all other rows are local.
• The source array access patterns are
  • local-vector on MuPC with cache,
  • local-coalesce on Berkeley, because it coalesces, and
  • local-baseline on GCC, because it does not optimize.
A sketch of this row-blocked layout follows.
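A sketch of the layout described above (not the authors' code; it assumes a static-THREADS compilation environment so the block size is a compile-time constant, and that THREADS divides N):

    #include <upc.h>
    #define N 2000

    shared [N/THREADS*N] float src[N][N];   /* each thread owns N/THREADS
                                               contiguous rows */
    shared [N/THREADS*N] float dst[N][N];   /* same layout: all writes local */

    /* Apply the Sobel operator to row i (1 <= i <= N-2). For rows in the
       interior of a thread's band all nine reads are local; for the first
       and last rows of the band, the src[i-1][..] or src[i+1][..] reads
       have affinity to a neighboring thread and are remote. */
    void sobel_row(int i) {
        for (int j = 1; j < N-1; j++) {
            float gx = src[i-1][j+1] + 2*src[i][j+1] + src[i+1][j+1]
                     - src[i-1][j-1] - 2*src[i][j-1] - src[i+1][j-1];
            float gy = src[i+1][j-1] + 2*src[i+1][j] + src[i+1][j+1]
                     - src[i-1][j-1] - 2*src[i-1][j] - src[i-1][j+1];
            dst[i][j] = gx*gx + gy*gy;   /* squared gradient magnitude */
        }
    }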
Sobel performance estimates
• Precision is worst for MuPC because of unaccounted-for cache overhead for 2 threads, and because the vector pattern only approximates cache behavior for larger numbers of threads.
7. Summary and continuing work
• This is a first attempt at a performance model for a PGAS language.
• The model identifies potential optimizations that a platform may provide and offers a set of microbenchmarks to capture their effects.
• Code analysis identifies the access patterns in use and matches them with the available platform optimizations.
• Performance predictions for simple codes are usually within 15% of actual run times, and most of the time they are better than that.
Improvements
• Model coarse-grain shared memory operations such as upc_memcpy().
• Model the overlap between memory accesses and computation.
• Model contention for access to shared memory.
• Model computational load imbalance.
• Apply the model to larger applications and make measurements for larger numbers of threads.
• Explore how the model can be automated.