
A Performance Model for Fine-Grain Accesses in UPC



Presentation Transcript


  1. Zhang Zhang, Steve Seidel Department of Computer Science Michigan Technological University {zhazhang,steve}@mtu.edu http://www.upc.mtu.edu A Performance Model for Fine-Grain Accesses in UPC

  2. Outline • Motivation and approach • The UPC programming model • Performance model design • Microbenchmarks • Application analysis • Measurements • Summary and continuing work

  3. 1. Motivation • Unified Parallel C (UPC) is an extension of ANSI C that provides a partitioned shared memory model for parallel programming. • UPC compilers are available for platforms ranging from Macs to the Cray X1. • The Partitioned Global Address Space (PGAS) community is asking for performance models. • An accurate model • determines whether an application code takes best advantage of the underlying parallel system and • identifies code and system optimizations that are needed.

  4. Approach • Construct an application-level analytical performance model. • Model fine-grain access performance based on platform benchmarks and code analysis. • Platform benchmarks determine compiler and runtime system optimization abilities. • Code analysis determines where these optimizations will be applied in the code.

  5. 2. The UPC programming model • UPC is an extension of ISO C99. • UPC processes are called threads: predefined identifiers THREADS and MYTHREAD are provided. • UPC is based on a partitioned shared memory model: A single address space that is logically partitioned among processors. Partitioned memory is part of the programming paradigm. Physical memory may or may not be partitioned.
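To make the SPMD model concrete, here is a minimal hypothetical UPC program (not taken from the slides): every thread executes main() and uses MYTHREAD and THREADS to identify itself.

     #include <upc.h>
     #include <stdio.h>

     int main(void)
     {
         /* all THREADS threads run this code; MYTHREAD distinguishes them */
         printf("hello from thread %d of %d\n", MYTHREAD, THREADS);
         upc_barrier;   /* wait until every thread has printed */
         return 0;
     }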

  6. UPC’s partitionedshared address space • Each thread has a private (local) address space. • All threads share a global address space that is partitioned among the threads. • A shared object in thread i’s region of the partition is said to have affinity to thread i. • If thread i has affinity to a shared object x, it is likely that accesses to x take less time than accesses to shared objects to which thread i does not have affinity. • A performance model must capture this property.
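As a small illustrative sketch (not from the slides), the standard UPC library function upc_threadof() reports the thread to which a shared address has affinity; run with at least two threads:

     #include <upc.h>
     #include <stdio.h>

     shared int x[THREADS];   /* one element per thread: x[i] has affinity to thread i */

     int main(void)
     {
         if (MYTHREAD == 0)   /* assumes THREADS >= 2 so x[1] lives on another thread */
             printf("x[1] has affinity to thread %lu\n",
                    (unsigned long) upc_threadof(&x[1]));
         return 0;
     }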

  7. UPC programming model
     int i;
     shared [5] int A[10*THREADS];
     i = 3;
     A[0] = 7;
     A[i] = A[0] + 2;
     [Figure: memory layout for three threads th0, th1, th2. The shared space is dealt out in blocks of 5 elements: A[0..4] to th0, A[5..9] to th1, A[10..14] to th2, then A[15..19] to th0, A[20..24] to th1, A[25..29] to th2. The writes put 7 in A[0] and 9 in A[3], both in th0's partition. Each thread holds its own private copy of i, with value 3.]

  8. 3. Performance model design • Platform abstraction • identify potential optimizations • microbenchmarks measure platform properties with respect to those optimizations • Application analysis • code is partitioned by sequence points: barriers, fences, strict memory accesses, library calls. • characterize patterns of shared memory accesses

  9. Platform abstraction • UPC compilers and runtime systems try to avoid and/or reduce the latency of remote accesses. • exploit spatial locality • overlap remote accesses with other work • Each platform applies a different set of optimization techniques. • The model must capture the effects of those optimizations in the presence of some uncertainty about how they are actually applied.

  10. Potential optimizations • access aggregation: multiple accesses to shared memory that have the same affinity and can be postponed and combined • access vectorization: a special case of aggregation where the stride is regular • runtime caching: exploits temporal and spatial reuse • access pipelining: overlap independent concurrent remote accesses to multiple threads if the network allows

  11. Potential optimizations (cont’d) • communication/computation overlapping: the usual technique applied by experienced programmers • multistreaming: provided by hardware with a memory system that can handle multiple streams of data • Notes: • The effects of these optimizations are not disjoint, e.g., caching and coalescing can have similar effects. • It can be difficult to determine with certainty which optimizations are actually at work. • Microbenchmarks associated with the performance model try to reveal available optimizations.

  12. 4. Microbenchmarks identify available optimizations • Baseline: cost of random remote shared accesses • when no optimizations can be applied • Vector: cost of accesses to consecutive remote locations • captures vectorization and runtime caching • Coalesce: random, small-strided remote accesses • captures pipelining and aggregation • Local vs. private: accesses to local (shared) memory • captures overhead of shared memory addressing • Costs are expressed in units of double words per second.
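For illustration, a baseline-style microbenchmark might look like the following sketch (names and sizes are invented; this is not the authors' harness). Random indices defeat vectorization and coalescing, so the measured rate approximates the unoptimized access cost; note that roughly 1/THREADS of the accesses happen to fall on local memory.

     #include <upc.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <sys/time.h>

     #define N 100000

     /* cyclic layout: element i has affinity to thread i % THREADS */
     shared [1] double A[N*THREADS];

     static double now(void)          /* wall-clock seconds */
     {
         struct timeval tv;
         gettimeofday(&tv, NULL);
         return tv.tv_sec + 1e-6 * tv.tv_usec;
     }

     int main(void)
     {
         double sum = 0.0, t;
         int i;

         srandom(MYTHREAD + 1);
         upc_barrier;
         t = now();
         for (i = 0; i < N; i++)
             sum += A[random() % (N * THREADS)];   /* random, mostly remote reads */
         upc_barrier;
         t = now() - t;

         if (MYTHREAD == 0)   /* printing sum keeps the loop from being optimized away */
             printf("baseline read rate: %.2f Mwords/s (checksum %g)\n",
                    N / t / 1e6, sum);
         return 0;
     }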

  13. 5. Application analysis overview • Application code is partitioned into intervals based on sequence points such as barriers, fences, strict memory accesses, etc. • A dependence graph is constructed for all accesses to the same shared object (i.e., array) in each interval. • References are partitioned into groups matching the four benchmark patterns (baseline, vector, coalesce, local); the references in each group are amenable to the associated optimizations. • Costs are accumulated to obtain a performance prediction.

  14. A reference partition • A partition (C, pattern, name) is a set of accesses that occur in a synchronization interval, where • C is the set of accesses, • pattern is in {baseline, vector, coalesce, local}, and • name is the accessed object, e.g., shared array A. • User-defined functions are inlined to obtain flat code. • Alias analysis must be done. • Recursion is not modeled.

  15. Dependence analysis • The dependence graph G=(V,E) of an interval consists of one vertex for each reference to a shared object (its name); edges connect dependent vertices. • Dependences considered are true dependence, antidependence, output dependence and input dependence. • The goal is to determine sets of accesses that can be done concurrently; to that end the reference partitioning graph G’=(V’,E’) (next slide) is constructed from G.

  16. Reference partitioning graph • The reference partitioning graph G’=(V’,E’) is constructed from G. • Let B be a subset of E consisting of edges denoting true and antidependences. • Construct V’ by grouping vertices in V that • have the same name, • reference memory locations with the same affinity, • are not connected by an edge in B. • Each vertex in V’ is assigned a reference pattern.

  17. Example 1
     shared [] float *A;  // A points to a remote block of shared memory
     for (i=1; i<N; i++) {
        ... = A[i]; ... = A[i-1]; ...
     }
     • A[i] and A[i-1] are in the same partition. • If the platform supports vectorization, this pattern is assigned the vector type. • If not, the pair of accesses can be coalesced on each iteration. • Otherwise, the baseline pattern is assigned to this partition.

  18. Example 2
     shared [1] float *B;  // B is distributed one element per thread (round robin)
     for (i=1; i<N; i++) {
        ... = B[i]; ... = B[i-1]; ...
     }
     • B[i] and B[i-1] are in different partitions. • Vectorization and coalescing cannot be applied. • The pattern is mixed baseline-local, e.g., if THREADS=4 the mix is 75%-25%. • For large numbers of threads the pattern is just baseline.

  19. Communication cost • The communication cost of interval i is
     Tcomm(i) = Σj Nj / r(Nj, patternj)
  where Nj is the number of shared memory accesses in partition j and r(Nj, pattern) is the access rate for that number of references and that pattern. • The functions r(Nj, pattern) are determined by microbenchmarking.
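As an illustration with invented numbers: an interval with one vector partition of 10,000 accesses at a measured rate of 5 Mwords/s and one baseline partition of 1,000 accesses at 0.5 Mwords/s has communication cost 10,000/(5×10^6) + 1,000/(0.5×10^6) = 2 ms + 2 ms = 4 ms: the thousand random accesses cost as much as the ten thousand vectorized ones.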

  20. Computation cost • Computation cost Tcomp(i) is measured by simulating the computation using only private memory accesses. • The total run time of each interval i is
     T(i) = Tcomm(i) + Tcomp(i)
  • The cost of barriers is measured separately. • The predicted cost for each thread is the sum of all of these costs over its intervals. • The highest predicted cost among all threads is taken to be the total cost.

  21. Speedup prediction • Speedup can be estimated by the ratio of the number of accesses in the sequential code to the weighted sum of the number of remote accesses of each type. • Details are given in the paper.

  22. 6. Measurements • Microbenchmarks • Applications • Histogramming • Matrix multiply • Sobel edge detection • Platforms measured • 16-node 2 GHz x86 Linux Myrinet cluster • MuPC V1.1.2 beta • Berkeley UPC V2.2 • 48-node 300 MHz Cray T3E • GCC UPC

  23. Prediction precision • Execution time prediction precision is expressed as
     precision = (Tactual − Tpredicted) / Tactual × 100%
  • A negative value indicates that the cost is overestimated.
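For example (illustrative numbers, not from the measurements): a run predicted at 1.15 s that actually takes 1.00 s has precision (1.00 − 1.15)/1.00 = −15%, i.e., the cost was overestimated by 15%.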

  24. Microbenchmark measurements • A few observations • Increasing the number of threads from 2 to 12 decreases performance in all cases: the decrease ranges from 0-10% on the T3E to as high as 25% for Berkeley and 50% in one case (*) for MuPC. • Caching improves MuPC performance for vector and coalesce and reduces performance for the (*) baseline write. • Berkeley successfully coalesces reads. • GCC is unusually slow at local writes.


  26. Histogramming
     shared [1] int array[N];
     for (i=0; i<N*percentage; i++) {
        loc = random_index(i);
        array[loc]++;
     }
     upc_barrier;
     • Random elements of an array are updated in parallel. • Races are ignored. • percentage determines how much of the table is accessed. • Collisions grow as percentage gets smaller. • This fits a mixed baseline-local pattern when the number of threads is small, as explained earlier.

  27. Histogramming performance estimates • For 12 threads the predicted cost is usually within 5%.

  28. Matrix multiply
     upc_forall (i=0; i<N; i++; &A[i][0]) {
        for (j=0; j<N; j++) {
           C[i][j] = 0.0;
           for (k=0; k<N; k++)
              C[i][j] += A[i][k]*B[k][j];
        }
     }
     • Thread t executes pass i if A[i][0] has affinity to t. • Remote accesses are minimized by distributing A and C across threads by rows and B across threads by columns. • Both cyclic striped and block striped distributions are measured. • Accesses to A and C are local; accesses to B are mixed vector-local.
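The distributions might be declared as follows (a sketch with assumed names; the deck does not show the declarations, and block sizes containing THREADS assume a static-THREADS compilation):

     #define N 240

     /* cyclic striped: row i has affinity to thread i % THREADS */
     shared [N] double A_cyc[N][N], C_cyc[N][N];

     /* block striped: thread t owns rows t*N/THREADS .. (t+1)*N/THREADS - 1 */
     shared [N*N/THREADS] double A_blk[N][N], C_blk[N][N];

     /* element-cyclic B: when THREADS divides N, B[k][j] has affinity to
        thread j % THREADS, i.e., B is effectively distributed by columns */
     shared [1] double B[N][N];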

  29. Matrix multiply performance estimates • N x N = 240 x 240 • Berkeley and GCC costs are underestimated. • MuPC-with-cache costs are overestimated because temporal locality is not modeled.

  30. Sobel edge detection • A 2000 x 2000-pixel image is distributed so that each thread gets approximately 2000/THREADS contiguous rows. • All accesses to the computed image are local. • Read-only accesses to the source array are mixed-local: the north and south border rows are in neighboring threads; all other rows are local. • Source array access patterns are • local-vector on MuPC with cache, • local-coalesce on Berkeley because it coalesces, and • local-baseline on GCC because it does not optimize.
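A row-blocked Sobel of this kind might look like the following sketch (assumed names, not the authors' code; the block size again assumes a static-THREADS compilation, and orig is left uninitialized here):

     #include <upc.h>
     #include <stdlib.h>

     #define N 2000
     typedef unsigned char BYTE;

     /* each thread owns N/THREADS contiguous rows of both images */
     shared [N*N/THREADS] BYTE orig[N][N], edge[N][N];

     int main(void)
     {
         int i, j;
         /* each thread computes the rows it owns; reads of orig are local
            except at the first and last rows of its band, whose north and
            south neighbor rows live on adjacent threads */
         upc_forall (i = 1; i < N-1; i++; &edge[i][0]) {
             for (j = 1; j < N-1; j++) {
                 int gx = orig[i-1][j+1] + 2*orig[i][j+1] + orig[i+1][j+1]
                        - orig[i-1][j-1] - 2*orig[i][j-1] - orig[i+1][j-1];
                 int gy = orig[i+1][j-1] + 2*orig[i+1][j] + orig[i+1][j+1]
                        - orig[i-1][j-1] - 2*orig[i-1][j] - orig[i-1][j+1];
                 int mag = abs(gx) + abs(gy);   /* |G| approximated by |Gx|+|Gy| */
                 edge[i][j] = mag > 255 ? 255 : (BYTE) mag;
             }
         }
         upc_barrier;
         return 0;
     }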

  31. Sobel performance estimates • Precision is worst for MuPC because of unaccounted-for cache overhead for 2 threads, and because the vector pattern only approximates cache behavior for larger numbers of threads.

  32. 7. Summary and continuing work • This is a first attempt at a model for a PGAS language. • The model identifies potential optimizations that a platform may provide and offers a set of microbenchmarks to capture their effects. • Code analysis identifies the access patterns in use and matches them with available platform optimizations. • Performance predictions for simple codes are usually within 15% of actual run times and most of the time they are better than that.

  33. Improvements • Model coarse-grain shared memory operations such as upc_memcpy(). • Model the overlap between memory accesses and computation. • Model contention for access to shared memory. • Model computational load imbalance. • Apply the model to larger applications and make measurements for larger numbers of threads. • Explore how the model can be automated.

  34. Questions?
