A Performance Characterization of UPC
Presented by: Anup Tapadia, Fallon Chen
CSE 260 – Parallel Processing, UCSD, Fall 2006
Introduction
• Unified Parallel C (UPC) is:
  • An explicit parallel extension of ANSI C
  • A partitioned global address space (PGAS) language
  • Similar to the C language philosophy: concise and efficient syntax
• Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
• Based on ideas in Split-C, AC, and PCP
UPC Execution Model
• A number of threads work independently in SPMD fashion
• The number of threads is specified at compile time or run time, and is available as the program variable THREADS
• MYTHREAD gives the index of the calling thread (0 .. THREADS-1)
• upc_barrier is a global synchronization: all threads wait
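A minimal UPC sketch of this execution model (not from the talk; build with your UPC compiler, e.g. Berkeley UPC's upcc, and run with the desired number of threads):

    /* hello.upc -- every thread runs main() in SPMD fashion */
    #include <upc.h>
    #include <stdio.h>

    int main(void) {
        printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
        upc_barrier;                 /* all threads wait here */
        if (MYTHREAD == 0)
            printf("All %d threads passed the barrier\n", THREADS);
        return 0;
    }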
Simple Shared Memory Example
shared [1] int data[4][THREADS];
• [1] is the block size; 4 × THREADS is the array size
• With block size 1 the elements are laid out cyclically, so column j of data has affinity to thread j
• (Figure: thread t holds data[0][t], data[1][t], data[2][t], data[3][t], for t = 0 .. THREADS-1)
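A minimal sketch (not from the talk) of how each thread can restrict work to the elements it owns in the layout above, using upc_forall with an affinity expression:

    #include <upc.h>

    shared [1] int data[4][THREADS];   /* block size 1: column j lives on thread j */

    int main(void) {
        int i, r;
        /* The affinity expression &data[0][i] makes thread i execute
           iteration i, so each thread initializes only its own column. */
        upc_forall (i = 0; i < THREADS; i++; &data[0][i])
            for (r = 0; r < 4; r++)
                data[r][i] = MYTHREAD;
        upc_barrier;
        return 0;
    }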
Example: Monte Carlo Pi Calculation
• Estimate π by throwing darts at a unit square
• Calculate the percentage that fall inside the unit circle
  • Area of square = r² = 1
  • Area of circle quadrant = ¼ πr² = π/4
• Randomly throw darts at (x, y) positions
• If x² + y² < 1, the point is inside the circle
• Compute the ratio: # points inside / # points total
• π ≈ 4 × ratio
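A minimal UPC sketch of the dart-throwing loop (not the code measured in the talk; the trial count and the rand()-based sampling are illustrative):

    #include <upc.h>
    #include <stdio.h>
    #include <stdlib.h>

    shared int hits[THREADS];            /* one hit counter per thread */

    int main(void) {
        const int trials = 1000000;      /* darts per thread (illustrative) */
        int i, my_hits = 0;

        srand(1 + MYTHREAD);             /* different seed on each thread */
        for (i = 0; i < trials; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y < 1.0)     /* inside the quarter circle */
                my_hits++;
        }
        hits[MYTHREAD] = my_hits;
        upc_barrier;

        if (MYTHREAD == 0) {             /* thread 0 sums and prints pi = 4 * ratio */
            long total = 0;
            for (i = 0; i < THREADS; i++)
                total += hits[i];
            printf("pi ~= %f\n", 4.0 * total / ((double)trials * THREADS));
        }
        return 0;
    }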
Monte Carlo Pi Scaling (figure: scaling results)
Ring Performance - DataStar (figure)
Ring Performance - Spindel (figure)
Ring Performance - DataStar (figure)
Ring Performance - Spindel (figure)
Parallel Binary Sort
• Example input sequence 1 7 5 2 8 4 6 3, distributed across the threads
Parallel Binary Sort (cont.)
• Sorted runs are merged pairwise up a binary tree:
  1 7 5 2 8 4 6 3  →  1 7 | 2 5 | 4 8 | 3 6  →  1 2 5 7 | 3 4 6 8  →  1 2 3 4 5 6 7 8
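For reference, a sketch of the sequential merge used at each level of the tree above (hypothetical helper in plain C, not the authors' code; both the MPI and UPC versions would do something like this after receiving a partner's sorted run):

    /* Merge two already-sorted runs a[0..na-1] and b[0..nb-1] into out[]. */
    void merge_runs(const int *a, int na, const int *b, int nb, int *out) {
        int i = 0, j = 0, k = 0;
        while (i < na && j < nb)
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < na) out[k++] = a[i++];   /* copy leftovers from a */
        while (j < nb) out[k++] = b[j++];   /* copy leftovers from b */
    }

With P threads, log2(P) such merge rounds combine the per-thread runs into one fully sorted sequence.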
MPI Binary Sort Scaling (Spindel Test Cluster) (figure)
A Performance Characterization of UPC
Presented by: Fallon Chen
Matrix Multiply
• Basic square matrix multiply: A × B = C
• A, B and C are N×N matrices
• In UPC we can take advantage of the data layout when N is a multiple of the number of THREADS:
  • Store A row-wise
  • Store B column-wise
Data Layout
• A (N×P) is blocked by rows: thread 0 owns elements 0 .. (N*P/THREADS)-1, thread 1 owns (N*P/THREADS) .. (2*N*P/THREADS)-1, ..., thread THREADS-1 owns ((THREADS-1)*N*P/THREADS) .. (N*P)-1
• B (P×M) is blocked by columns: thread 0 owns columns 0 .. (M/THREADS)-1, and so on up to thread THREADS-1
• Note: N and M are assumed to be multiples of THREADS
(images by Kathy Yelick, from the UPC Tutorial)
Algorithm
• Each thread makes a local copy of the row(s) of A that have affinity to it
• Each thread obtains a full copy of B by broadcasting B's columns with a UPC collective function
• Each thread multiplies its row(s) of A by B to produce the corresponding row(s) of C
• Very short: about 100 lines of code (a sketch of the kernel follows)
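A minimal sketch of the kernel under the layout on the previous slide (square case, P = M = N; assumes a static THREADS and N a multiple of THREADS; this is not the tuned code from the talk, which first copies the rows of A and all of B into private buffers before the triple loop):

    #include <upc.h>

    #define N 256                          /* illustrative size */

    shared [N*N/THREADS] double A[N][N];   /* blocked by rows              */
    shared [N*N/THREADS] double C[N][N];   /* C uses the same row blocking */
    shared [N/THREADS]   double B[N][N];   /* blocked by columns           */

    void matmul(void) {
        int i, j, k;
        /* Each thread computes the rows of C that have affinity to it. */
        upc_forall (i = 0; i < N; i++; &C[i][0]) {
            for (j = 0; j < N; j++) {
                double sum = 0.0;
                for (k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];   /* B[k][j] may be a remote read */
                C[i][j] = sum;
            }
        }
    }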
Connected Components Labeling
• Used a union-find algorithm for global relabeling
• Stored global labels in a shared array, and used a shared array to exchange ghost cells
• Directly accessing a shared array element-by-element in a loop is slow for large amounts of data
• Need to use the bulk copies upc_memput and upc_memget, but then you have to attend carefully to how the data is laid out (see the next two slides for what happens if you don't)
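A minimal sketch of the bulk ghost-row exchange idea (hypothetical names and layout, not the authors' code; assumes each thread owns ROWS contiguous rows of the label grid):

    #include <upc.h>

    #define ROWS 256     /* rows owned per thread (illustrative) */
    #define COLS 256     /* row width (illustrative)             */

    shared [ROWS*COLS] int labels[ROWS*THREADS][COLS];

    int ghost_above[COLS];   /* private buffer for the neighbour's boundary row */

    void exchange_ghost_row(void) {
        if (MYTHREAD > 0) {
            /* One bulk transfer instead of COLS individual shared reads:
               fetch the last row owned by thread MYTHREAD-1. */
            upc_memget(ghost_above,
                       &labels[MYTHREAD * ROWS - 1][0],
                       COLS * sizeof(int));
        }
        upc_barrier;
    }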
UPC CCL Scaling (figure)
Did UPC help or hurt?
• The global view of memory is a useful aid in debugging and development
• Redistribution routines are pretty easy to write
• Efficient code is no easier to write than in MPI, because you have to consider the shared-memory data layout when fine-tuning the code
Conclusions
• UPC is easy to program in for C writers, at times significantly easier than alternative paradigms
• UPC exhibits very little overhead compared with MPI for problems that are embarrassingly parallel; no tuning is necessary
• For other problems, compiler optimizations are happening but are not fully there yet
• With hand-tuning, UPC performance compared favorably with MPI
• Hand-tuned code with block moves is still substantially simpler than message-passing code