Towards Optimized UPC Implementations Tarek A. El-Ghazawi, The George Washington University, tarek@gwu.edu
Agenda • Background • UPC Language Overview • Productivity • Performance Issues • Automatic Optimizations • Conclusions
Parallel Programming Models • What is a programming model? • An abstract machine that defines the programmer's view of data and execution • Where architecture and applications meet • A non-binding contract between the programmer and the compiler/system • Good Programming Models Should • Allow efficient mapping on different architectures • Keep programming easy • Benefits • To the application: independence from the architecture • To the architecture: independence from the applications
Programming Models • [Figure: how the process/thread view maps onto the address space for each model family] • Message Passing (e.g., MPI) • Shared Memory (e.g., OpenMP) • Distributed Shared Memory/PGAS (e.g., UPC)
Programming Paradigms Expressivity • Implicit parallelism, implicit locality: Sequential (e.g., C, Fortran, Java) • Implicit parallelism, explicit locality: Data Parallel (e.g., HPF, C*) • Explicit parallelism, implicit locality: Shared Memory (e.g., OpenMP) • Explicit parallelism, explicit locality: Distributed Shared Memory/PGAS (e.g., UPC, CAF, and Titanium)
What is UPC? • Unified Parallel C • An explicit parallel extension of ISO C • A distributed shared memory/PGAS parallel programming language
Why not message passing? • Performance • High-penalty for short transactions • Cost of calls • Two sided • Excessive buffering • Ease-of-use • Explicit data transfers • Domain decomposition does not maintain the original global application view • More code and conceptual difficulty
Why DSM/PGAS? • Performance • No call overhead • Efficient short transfers • Locality exploitation • Ease-of-use • Implicit transfers • Consistent global application view • Less code and conceptual difficulty
Why DSM/PGAS: New Opportunities for Compiler Optimizations • The DSM programming model exposes sequential remote accesses at compile time • Opportunity for compiler-directed prefetching • [Figure: Sobel operator on an image distributed by rows across Thread0-Thread3, with ghost zones at the block boundaries]
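To make the opportunity concrete, here is a minimal sketch (not from the talk) of how a row-blocked Sobel kernel can fetch its ghost rows with bulk transfers once the remote access pattern is known; the layout, sizes, and function name are illustrative, and a static THREADS translation environment is assumed so THREADS may appear in the block size.

    #include <upc_relaxed.h>

    #define N 512                               /* N x N image, N divisible by THREADS */

    /* Row-blocked layout: each thread owns N/THREADS contiguous rows. */
    shared [N*N/THREADS] unsigned char img[N][N];

    void prefetch_ghost_rows(unsigned char top[N], unsigned char bot[N])
    {
        int rows  = N / THREADS;
        int first = MYTHREAD * rows;            /* first row with affinity to this thread */
        int last  = first + rows - 1;           /* last row with affinity to this thread  */

        /* Pull the neighbours' boundary rows into private buffers with two bulk
           transfers, instead of issuing N fine-grain remote reads per boundary
           row inside the stencil loop. */
        if (MYTHREAD > 0)
            upc_memget(top, &img[first - 1][0], N);
        if (MYTHREAD < THREADS - 1)
            upc_memget(bot, &img[last + 1][0], N);
    }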
History • Initial Tech. Report from IDA in collaboration with LLNL and UCB in May 1999 • UPC consortium of government, academia, and HPC vendors coordinated by GWU, IDA, and DoD • The participants currently are: IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U. of Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, …
Status • Specification v1.0 completed February of 2001, v1.1.1 in October of 2003, v1.2 will add collectives and UPC/IO • Benchmarking Suites: Stream, GUPS, RandomAccess, NPB suite, Splash-2, and others • Testing suite v1.0, v1.1 • Short courses and tutorials in the US and abroad • Research Exhibits at SC 2000-2004 • UPC web site: upc.gwu.edu • UPC Book by mid 2005 from John Wiley and Sons • Manual(s)
Hardware Platforms • UPC implementations are available for: • SGI Origin 2000/3000 (Intrepid: 32- and 64-bit GCC; UCB: 32-bit GCC) • Cray T3D/E • Cray X-1 • HP AlphaServer SC, Superdome • UPC Berkeley Compiler: Myrinet, Quadrics, and InfiniBand clusters • Beowulf Reference Implementation (MPI-based, MTU) • New ongoing efforts by IBM and Sun
UPC Execution Model • A number of threads working independently in a SPMD fashion • MYTHREAD specifies thread index (0..THREADS-1) • Number of threads specified at compile-time or run-time • Process and Data Synchronization when needed • Barriers and split phase barriers • Locks and arrays of locks • Fence • Memory consistency control
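A minimal UPC program illustrating the execution model (an illustrative sketch, not from the talk):

    #include <upc.h>
    #include <stdio.h>

    int main(void)
    {
        /* All THREADS threads execute main() in SPMD fashion. */
        printf("hello from thread %d of %d\n", MYTHREAD, THREADS);

        upc_barrier;                    /* no implicit synchronization: add it where needed */

        if (MYTHREAD == 0)
            printf("all %d threads reached the barrier\n", THREADS);
        return 0;
    }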
UPC Memory Model • Shared space with thread affinity, plus private spaces • A pointer-to-shared can reference all locations in the shared space • A private pointer may reference only addresses in its private space or addresses in its portion of the shared space • Static and dynamic memory allocation are supported for both shared and private memory • [Figure: the shared space is partitioned with affinity to Thread 0 .. Thread THREADS-1, above the per-thread private spaces Private 0 .. Private THREADS-1]
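Illustrative declarations (a sketch, not from the talk) showing how objects land in the shared and private spaces:

    #include <upc.h>

    #define N 100

    shared int a;                  /* one shared scalar, with affinity to thread 0       */
    shared int b[N*THREADS];       /* default block size 1: b[i] has affinity to         */
                                   /* thread i % THREADS (round-robin)                   */
    shared [N] int c[N*THREADS];   /* block size N: N consecutive elements per thread    */
    int priv;                      /* ordinary C object: one private copy per thread     */
    shared int *p;                 /* private pointer into shared space; can also hold   */
                                   /* dynamically allocated shared memory, e.g.          */
                                   /* p = upc_all_alloc(THREADS, N*sizeof(int));         */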
UPC Pointers • How to declare them?

    int *p1;                  /* private pointer pointing locally                  */
    shared int *p2;           /* private pointer pointing into the shared space    */
    int *shared p3;           /* shared pointer pointing locally                   */
    shared int *shared p4;    /* shared pointer pointing into the shared space     */

• You may find many people using "shared pointer" to mean a pointer pointing to a shared object, i.e., equivalent to p2 (though it could be p4 as well).
[Figure: where the four pointers live: p1 and p2 reside in each thread's private space, p3 and p4 reside in the shared space; p2 and p4 may point anywhere in the shared space, while p1 and p3 point into local/private memory]
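A common idiom these pointer kinds enable is privatizing locally owned shared data, i.e., accessing it through an ordinary C pointer; a minimal sketch under an assumed layout:

    #include <upc.h>

    #define BLK 4
    shared [BLK] int a[BLK*THREADS];      /* one block of BLK elements per thread */

    void scale_my_block(int factor)
    {
        shared [BLK] int *sp = &a[MYTHREAD * BLK];   /* pointer-to-shared to my block    */
        int *lp = (int *)sp;                         /* legal cast: the target has       */
                                                     /* affinity to MYTHREAD             */
        int i;
        for (i = 0; i < BLK; i++)
            lp[i] *= factor;                         /* plain local accesses: no shared- */
                                                     /* address translation per element  */
    }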
Synchronization - Barriers • No implicit synchronization among the threads • UPC provides the following synchronization mechanisms: • Barriers • Locks • Memory Consistency Control • Fence
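A small sketch of these mechanisms in use (illustrative only; names and layout are assumptions):

    #include <upc.h>

    shared int partial[THREADS];
    shared int total;
    upc_lock_t *sum_lock;                   /* every thread holds the same lock pointer */

    void demo(void)
    {
        sum_lock = upc_all_lock_alloc();    /* collective lock allocation               */

        partial[MYTHREAD] = MYTHREAD + 1;   /* independent SPMD work                    */
        upc_barrier;                        /* blocking barrier                         */

        upc_notify;                         /* split-phase barrier: signal arrival ...  */
        /* ... independent local work can be overlapped here ... */
        upc_wait;                           /* ... then wait for the other threads      */

        upc_lock(sum_lock);                 /* critical section guarded by the lock     */
        total += partial[MYTHREAD];
        upc_unlock(sum_lock);
    }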
Memory Consistency Models • Has to do with ordering of shared operations, and when a change of a shared object by a thread becomes visible to others • Consistency can be strict or relaxed • Under the relaxed consistency model, the shared operations can be reordered by the compiler / runtime system • The strict consistency model enforces sequential ordering of shared operations. (No operation on shared can begin before the previous ones are done, and changes become visible immediately)
Memory Consistency Models • The user specifies the memory model through: • declarations • pragmas for a particular statement or sequence of statements • use of barriers and global operations • Programmers are responsible for using the correct consistency model
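A minimal producer/consumer sketch (illustrative, not from the talk) using the declaration-level qualifiers and a fence; the pragma form, #pragma upc strict / #pragma upc relaxed, can scope the model to a block of statements instead:

    #include <upc_relaxed.h>        /* relaxed is the default model for this file     */

    shared int data;
    strict shared int flag;         /* per-object strict qualifier                    */

    void producer(void)
    {
        data = 42;                  /* relaxed write: may be reordered locally ...    */
        flag = 1;                   /* ... but never past this strict write           */
    }

    void consumer(void)
    {
        while (flag == 0)           /* strict read: ordered before later shared reads */
            ;
        /* data == 42 is guaranteed to be visible here */
        upc_fence;                  /* explicit fence: complete all outstanding       */
                                    /* shared accesses                                */
    }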
UPC and Productivity • Metrics • Lines of ‘useful’ Code • indicates the development time as well as the maintenance cost • Number of ‘useful’ Characters • alternative way to measure development and maintenance efforts • Conceptual Complexity • function level, • keyword usage, • number of tokens, • max loop depth, • …
UPC Optimizations Issues • Particular Challenges • Avoiding Address Translation • Cost of Address Translation • Special Opportunities • Locality-driven compiler-directed prefetching • Aggregation • General • Low-level optimized libraries, e.g. collective • Backend optimizations • Overlapping of remote accesses and synchronization with other work
Showing Potential Optimizations Through Emulated Hand-Tunings • Different hand-tuning levels: • Unoptimized UPC code: referred to as UPC.O0 • Privatized UPC code: referred to as UPC.O1 • Prefetched UPC code: hand-optimized variant using block get/put to mimic the effect of prefetching, referred to as UPC.O2 • Fully hand-tuned UPC code: hand-optimized variant integrating privatization, aggregation of remote accesses, and prefetching, referred to as UPC.O3 • T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 2001 International Conference on Parallel Processing (ICPP 2001), pp. 365-372
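The sketch below (illustrative, loosely modeled on a STREAM-style copy; array names and layout are assumptions) contrasts the kind of code meant by UPC.O0, UPC.O1, and UPC.O2:

    #include <upc_relaxed.h>

    #define BLK 1024
    shared [BLK] double a[BLK*THREADS], c[BLK*THREADS];

    /* UPC.O0 -- unoptimized: every element access pays shared-address translation */
    void copy_O0(void)
    {
        int i;
        upc_forall (i = 0; i < BLK*THREADS; i++; &c[i])
            c[i] = a[i];
    }

    /* UPC.O1 -- privatized: locally owned shared data accessed via plain C pointers */
    void copy_O1(void)
    {
        double *la = (double *)&a[MYTHREAD * BLK];
        double *lc = (double *)&c[MYTHREAD * BLK];
        int i;
        for (i = 0; i < BLK; i++)
            lc[i] = la[i];
    }

    /* UPC.O2 -- prefetched: one bulk get of a remote block mimics compiler-directed
       prefetching, replacing BLK fine-grain remote reads */
    void fetch_next_block_O2(double buf[BLK])
    {
        int next = (MYTHREAD + 1) % THREADS;
        upc_memget(buf, &a[next * BLK], BLK * sizeof(double));
    }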
Address Translation Cost and Local Space Privatization – Cluster • [Figure: STREAM benchmark results gathered on a Myrinet cluster]
Address Translation and Local Space Privatization – DSM Architecture • [Figure: STREAM benchmark bandwidth in MB/s on a DSM architecture, for bulk and element-by-element operations]
Aggregation and Overlapping of Remote Shared Memory Accesses • [Figures: UPC N-Queens and UPC Sobel Edge execution times on the SGI Origin 2000] • The benefit of hand-optimizations is greatly application dependent: • N-Queens does not perform any better, mainly because it is an embarrassingly parallel program • The Sobel Edge Detector gains an order-of-magnitude speedup after hand-optimization and scales almost perfectly linearly
Impact of Hand-Optimizations on NPB.CG, Class A, on the SGI Origin 2000
Shared Address Translation Overhead • Address translation overhead is quite significant: more than 70% of the work for a local-shared memory access • Demonstrates the real need for optimization • [Figures: overhead present in local-shared memory accesses (SGI Origin 2000, GCC-UPC); quantification of the address translation overheads]
Shared Address Translation Overheads for Sobel Edge Detection • UPC.O0: unoptimized UPC code; UPC.O3: hand-optimized UPC code • Ox notations from T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues", Proceedings of the 2001 International Conference on Parallel Processing, Valencia, September 2001
Reducing Address Translation Overheads via Translation Look-Aside Buffers • F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers", IPDPS'05, Denver, CO, April 2005 • Use look-up Memory Model Translation Buffers (MMTBs) to perform fast translations • Two alternative methods proposed to create and use MMTBs: • FT: basic method using direct addressing • RT: advanced method using indexed addressing • Prototyped as a compiler-enabled optimization: no modifications to the actual UPC codes are needed
Different Strategies – Full-Table (FT) • Pros: direct mapping, no address calculation • Cons: large memory required; can lead to competition over caches and main memory • Consider shared [B] int array[8]; • To initialize FT: for all i in [0,7], FT[i] = _get_vaddr(&array[i]) • To access array[]: for all i in [0,7], array[i] = _get_value_at(FT[i])
Different Strategies – Reduced-Table: Infinite block size • BLOCKSIZE = infinite: only the address of the first element of the array needs to be saved, since all array data is contiguous • RT strategy: only one table entry in this case, and the address calculation step is simple • Consider shared [] int array[4]; • To initialize RT: RT[0] = _get_vaddr(&array[0]) • To access array[]: for all i in [0,3], array[i] = _get_value_at(RT[0] + i) • [Figure: array[0..3] contiguous on THREAD0; every thread's RT[0] points to its first element]
Different Strategies – Reduced-Table: Default block size • BLOCKSIZE = 1: only the address of the first element on each thread is saved, since each thread's data is contiguous • RT strategy: less memory required than FT (the MMTB has THREADS entries); the address calculation step is a bit costly, but much cheaper than in current implementations • Consider shared [1] int array[16]; • To initialize RT: for all i in [0,THREADS-1], RT[i] = _get_vaddr(&array[i]) • To access array[]: for all i in [0,15], array[i] = _get_value_at(RT[i mod THREADS] + i/THREADS) • [Figure: with 4 threads, THREADi holds array[i], array[i+4], array[i+8], array[i+12]; RT[i] points to the first of them]
Different Strategies – Reduced-Table: Arbitrary block size • Arbitrary block sizes: only the address of the first element of each block is saved, since all data within a block is contiguous • RT strategy: less memory required than for FT, but more than in the previous cases; the address calculation step is more costly than in the previous cases • Consider shared [2] int array[16]; • To initialize RT: for all i in [0,7], RT[i] = _get_vaddr(&array[i*blocksize(array)]) • To access array[]: for all i in [0,15], array[i] = _get_value_at(RT[i / blocksize(array)] + (i mod blocksize(array))) • [Figure: with 4 threads and block size 2, RT has 8 entries, one per block, each pointing to the first element of its block on the owning thread]
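A minimal sketch of building and using the RT table for the arbitrary block size case. _get_vaddr() and _get_value_at() are the runtime translation helpers named on these slides; their exact signatures are implementation specific, so the prototypes below are assumptions for illustration, and a static THREADS environment (4 threads, as on the slide) is assumed:

    #define NELEM 16
    #define BSIZE 2                                /* shared [2] int array[16];            */

    extern shared [BSIZE] int array[NELEM];

    /* Assumed helper signatures, following the slides' notation */
    extern int *_get_vaddr(shared void *sp);       /* pointer-to-shared -> virtual address */
    extern int  _get_value_at(int *va);            /* load through a virtual address       */

    static int *RT[NELEM / BSIZE];                 /* one entry per block                  */

    void rt_init(void)
    {
        int b;
        for (b = 0; b < NELEM / BSIZE; b++)        /* translate the first element of each  */
            RT[b] = _get_vaddr(&array[b * BSIZE]); /* block once, at table-build time      */
    }

    int rt_read(int i)
    {
        /* One divide and one modulo replace a full pointer-to-shared translation. */
        return _get_value_at(RT[i / BSIZE] + i % BSIZE);
    }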
Performance Impact of the MMTB – Sobel Edge • [Figure: performance of Sobel edge detection using the new MMTB strategies (with and without O0)] • FT and RT perform around 6 to 8 times better than the regular basic UPC version (O0) • The RT strategy is slower than FT since the address calculation (arbitrary block size case) becomes more complex • FT, on the other hand, performs almost as well as the hand-tuned versions (O3 and MPI)
Performance Impact of the MMTB – Matrix Multiplication • [Figure: performance and hardware profiling of matrix multiplication using the new MMTB strategies] • FT strategy: increase in L1 data cache misses due to the large table size • RT strategy: L1 misses kept low, but an increase in the number of loads and stores is observed, reflecting the extra address computations (arbitrary block size used)
Comparison Among Optimizations of Storage, Memory Access, and Computation Requirements • [Table: time and storage requirements of the address translation methods for the matrix multiply microkernel (E: element size in bytes, P: pointer size in bytes)] • The number of loads and stores can increase with the added arithmetic operations
UPC Work-Sharing Construct Optimizations
• By thread/index number (upc_forall integer):
    upc_forall(i=0; i<N; i++; i)
        loop body;
• By the address of a shared variable (upc_forall address):
    upc_forall(i=0; i<N; i++; &shared_var[i])
        loop body;
• By thread/index number (for optimized):
    for(i=MYTHREAD; i<N; i+=THREADS)
        loop body;
• By thread/index number (for integer):
    for(i=0; i<N; i++) {
        if(MYTHREAD == i%THREADS)
            loop body;
    }
• By the address of a shared variable (for address):
    for(i=0; i<N; i++) {
        if(upc_threadof(&shared_var[i]) == MYTHREAD)
            loop body;
    }
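For reference, a complete sketch (illustrative; names and sizes are assumptions) filling in the loop body for the "upc_forall address" and "for optimized" variants, which distribute the work identically for the default cyclic layout:

    #include <upc_relaxed.h>

    #define N 1000

    shared double x[N*THREADS], y[N*THREADS];

    /* "upc_forall address" variant: iteration i runs on the thread with affinity to y[i] */
    void axpy_forall(double alpha)
    {
        int i;
        upc_forall (i = 0; i < N*THREADS; i++; &y[i])
            y[i] += alpha * x[i];
    }

    /* "for optimized" variant: the same distribution expressed with a plain C for loop */
    void axpy_for(double alpha)
    {
        int i;
        for (i = MYTHREAD; i < N*THREADS; i += THREADS)
            y[i] += alpha * x[i];
    }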
Performance Limitations Imposed by Sequential C Compilers -- STREAM
Loopmark – SET/ADD Operations • Let us compare the loopmark listings for the Fortran and C versions of each operation
Fortran:
MEMSET (bulk set)
 146. 1            t = mysecond(tflag)
 147. 1  V M--<><> a(1:n) = 1.0d0
 148. 1            t = mysecond(tflag) - t
 149. 1            times(2,k) = t
SET
 158. 1            arrsum = 2.0d0
 159. 1            t = mysecond(tflag)
 160. 1  MV------< DO i = 1,n
 161. 1  MV          c(i) = arrsum
 162. 1  MV          arrsum = arrsum + 1
 163. 1  MV------> END DO
 164. 1            t = mysecond(tflag) - t
 165. 1            times(4,k) = t
ADD
 180. 1            t = mysecond(tflag)
 181. 1  V M--<><> c(1:n) = a(1:n) + b(1:n)
 182. 1            t = mysecond(tflag) - t
 183. 1            times(7,k) = t

C:
MEMSET (bulk set)
 163. 1            times[1][k] = mysecond_();
 164. 1            memset(a, 1, NDIM*sizeof(elem_t));
 165. 1            times[1][k] = mysecond_() - times[1][k];
SET
 217. 1            set = 2;
 220. 1            times[5][k] = mysecond_();
 222. 1  MV--<     for (i=0; i<NDIM; i++)
 223. 1  MV        {
 224. 1  MV          c[i] = (set++);
 225. 1  MV-->     }
 227. 1            times[5][k] = mysecond_() - times[5][k];
ADD
 283. 1            times[10][k] = mysecond_();
 285. 1  Vp--<     for (j=0; j<NDIM; j++)
 286. 1  Vp        {
 287. 1  Vp          c[j] = a[j] + b[j];
 288. 1  Vp-->     }
 290. 1            times[10][k] = mysecond_() - times[10][k];

Legend: V: Vectorized – M: Multistreamed – p: conditional, partial and/or computed
UPC vs. CAF using the NPB workloads • In general, UPC is slower than CAF, mainly due to: • Point-to-point vs. barrier synchronization: better scalability with proper collective operations; program writers can do point-to-point synchronization using the current constructs • Scalar performance of source-to-source translated code: • Alias analysis (C pointers): highlights the need for explicitly using restrict to help several compiler back ends (see the sketch below) • Lack of support for multi-dimensional arrays in C: can prevent high-level loop transformations and software pipelining, causing a 2x slowdown in SP for UPC • Need for exhaustive C compiler analysis: a failure to perform proper loop fusion and alignment in the critical section of MG can lead to 51% more loads for UPC than CAF; a failure to adequately unroll the sparse matrix-vector multiplication in CG can lead to more cycles in UPC
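A sketch of the alias-analysis point above: once the translator has privatized local shared data into plain C pointers, annotating them with restrict (an assumption about how the generated code could be written, not the actual translator output) lets the C back end vectorize and pipeline freely:

    void triad(double * restrict a,
               const double * restrict b,
               const double * restrict c,
               double s, int n)
    {
        int i;
        /* With restrict, the back end may assume a, b, and c do not alias,
           enabling vectorization and software pipelining. */
        for (i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }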
Conclusions • UPC is a locality-aware parallel programming language • With proper optimizations, UPC can outperform MPI on random short accesses and can otherwise perform as well as MPI • UPC is very productive, and UPC applications result in much smaller and more readable code than MPI • UPC compiler optimizations are still lagging, in spite of the substantial progress that has been made • For future architectures, UPC has the unique opportunity of having very efficient implementations, as most of the pitfalls and obstacles have been revealed along with adequate solutions
Conclusions • In general, four types of optimizations: • Optimizations to exploit the locality consciousness and other unique features of UPC • Optimizations to keep the overhead of UPC low • Optimizations to exploit architectural features • Standard optimizations that are applicable to all system compilers
Conclusions • Optimizations are possible at three levels: • A source-to-source translator acting during the compilation phase and incorporating most UPC-specific optimizations • C back-end compilers that compete with Fortran • A strong run-time system that can work effectively with the operating system
Selected Publications • T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick, UPC: Distributed Shared Memory Programming, John Wiley & Sons, New York, June 2005. ISBN 0-471-22048-5. • T. El-Ghazawi, F. Cantonnet, Y. Yao, S. Annareddy, and A. Mohamed, "Benchmarking Parallel Compilers for Distributed Shared Memory Languages: A UPC Case Study", Journal of Future Generation Computer Systems, North-Holland (accepted).