
A Performance Characterization of UPC


Presentation Transcript


  1. A Performance Characterization of UPC Presented by – Anup Tapadia Fallon Chen CSE 260 – Parallel Processing UCSD Fall 2006

  2. Introduction • Unified Parallel C (UPC) is: • An explicit parallel extension of ANSI C • A partitioned global address space language • Similar to the C language philosophy • Concise and efficient syntax • Common and familiar syntax and semantics for parallel C with simple extensions to ANSI C • Based on ideas in Split-C, AC, and PCP

  3. UPC Execution Model • A number of threads working independently in a SPMD fashion • Number of threads specified at compile-time or run-time; available as program variable THREADS • MYTHREAD specifies thread index (0..THREADS-1) • upc_barrier is a global synchronization: all wait

  4. Simple Shared Memory Example shared [1] int data[4][THREADS] (block size 1, array size 4×THREADS; layout diagram: elements are dealt out round-robin, so column j has affinity to thread j: thread 0 holds 0,0 1,0 2,0 3,0; thread 1 holds 0,1 1,1 2,1 3,1; …; thread n holds 0,n 1,n 2,n 3,n)

  5. Example: Monte Carlo Pi Calculation • Estimate π by throwing darts at a unit square • Calculate the percentage that fall in the unit circle • Area of square = r² = 1 • Area of circle quadrant = ¼·πr² = π/4 • Randomly throw darts at x,y positions • If x² + y² < 1, then the point is inside the circle • Compute the ratio: # points inside / # points total • π = 4·ratio

  6. Monte Carlo Pi Scaling

  7. Ring Performance - DataStar

  8. Ring Performance - Spindel

  9. Ring Performance - DataStar

  10. Ring Performance - Spindel

  11. Parallel Binary Sort (diagram: the unsorted array 1 7 5 2 8 4 6 3 is recursively split in halves across the threads)

  12. Parallel Binary Sort (cont.) (diagram, merging bottom-up: 1 7 5 2 8 4 6 3 → 1 7 | 2 5 | 4 8 | 3 6 → 1 2 5 7 | 3 4 6 8 → 1 2 3 4 5 6 7 8)

  13. MPI Binary Sort Scaling (Spindel Test Cluster)

  14. A Performance Characterization of UPC Fallon Chen

  15. Matrix Multiply • Basic square matrix multiply: A x B = C • A, B and C are NxN matrices • In UPC, we can take advantage of the data layout for matrix multiply when N is a multiple of THREADS • Store A row-wise • Store B column-wise

  16. Data Layout (diagram: the N×P matrix A is laid out row-wise, with thread 0 owning elements 0 .. (N*P/THREADS)-1, thread 1 owning (N*P/THREADS) .. (2*N*P/THREADS)-1, …, and thread THREADS-1 owning ((THREADS-1)*N*P/THREADS) .. (THREADS*N*P/THREADS)-1; B is laid out column-wise, with each thread owning M/THREADS columns, e.g. thread 0 owns columns 0 .. (M/THREADS)-1) • Note: N and M are assumed to be multiples of THREADS (images by Kathy Yelick, from the UPC Tutorial)

  17. Algorithm • At each thread, get a local copy of the row(s) of A that have affinity to that particular thread • At each thread, broadcast the columns of B using a UPC collective function, so that at the end each thread has a full copy of B • Multiply the row(s) of A by B to produce the corresponding row(s) of C • Very short: about 100 lines of code

  18. Connected Components Labeling • Used a union-find algorithm for global relabeling • Stored global labels as a shared array, and used a shared array to exchange ghost cells • Directly accessing a shared array in a loop is slow for large amounts of data • Need to use the bulk copy functions upc_memput and upc_memget, but then you have to attend carefully to how the data is laid out (see next two slides for what happens if you don't)

  19. UPC CCL Scaling

  20. Did UPC help or hurt? • The global view of memory is a useful aid in debugging and development • Redistribution routines are pretty easy to write • Efficient code is no easier to write than in MPI, because you have to consider the shared memory data layout when fine-tuning the code

  21. Conclusions • UPC is easy to program in for C writers, at times significantly easier than alternative paradigms • UPC exhibits very little overhead compared with MPI for problems that are embarrassingly parallel; no tuning is necessary • For other problems, compiler optimizations are happening but are not fully there yet • With hand-tuning, UPC performance compared favorably with MPI • Hand-tuned code, with block moves, is still substantially simpler than message-passing code
