
Programming Heterogeneous (GPU) Systems

Jeffrey Vetter. Presented to the Extreme Scale Computing Training Program, ANL: St. Charles, IL, 2 August 2013. http://ft.ornl.gov vetter@computer.org

Presentation Transcript


  1. Programming Heterogeneous (GPU) Systems. Jeffrey Vetter. Presented to the Extreme Scale Computing Training Program, ANL: St. Charles, IL, 2 August 2013. http://ft.ornl.gov vetter@computer.org

  2. TH-2 System
     • Compute nodes have 3.432 Tflop/s per node
       • 16,000 nodes
       • 32,000 Intel Xeon CPUs
       • 48,000 Intel Xeon Phis
     • Operations nodes
       • 4,096 FT CPUs as operations nodes
     • Proprietary interconnect: TH2 Express
     • 1 PB memory (host memory only)
     • Global shared parallel storage is 12.4 PB
     • Cabinets: 125 + 13 + 24 = 162 compute/communication/storage cabinets
     • ~750 m²
     • NUDT and Inspur
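
The TH-2 figures can be cross-checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the aggregate is a peak number derived solely from the per-node value and node count quoted on the slide.

```python
# Figures copied from the TH-2 slide; variable names are illustrative.
NODES = 16_000
TFLOPS_PER_NODE = 3.432
XEON_CPUS = 32_000
XEON_PHIS = 48_000

# Aggregate peak of the compute nodes, converted from Tflop/s to Pflop/s.
peak_pflops = NODES * TFLOPS_PER_NODE / 1000

# Processors per node implied by the totals.
cpus_per_node = XEON_CPUS // NODES
phis_per_node = XEON_PHIS // NODES

print(f"Aggregate peak: {peak_pflops:.3f} Pflop/s")
print(f"Per node: {cpus_per_node} Xeon CPUs, {phis_per_node} Xeon Phis")
```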

  3. ORNL’s “Titan” Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors
     System specifications:
     • Peak performance of 27.1 PF (24.5 GPU + 2.6 CPU)
     • 18,688 compute nodes, each with:
       • 16-core AMD Opteron CPU
       • NVIDIA Tesla “K20x” GPU
       • 32 + 6 GB memory
     • 512 service and I/O nodes
     • 200 cabinets
     • 710 TB total system memory
     • Cray Gemini 3D torus interconnect
     • 8.9 MW peak power
     • 4,352 ft²
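
The Titan totals are consistent with the per-node figures, as the short check below suggests. It assumes that "32 + 6 GB" means 32 GB of host memory plus 6 GB of GPU memory per node and that decimal units are used (1 TB = 1000 GB); the variable names are illustrative.

```python
# Figures copied from the Titan slide; variable names are illustrative.
GPU_PEAK_PF = 24.5
CPU_PEAK_PF = 2.6
COMPUTE_NODES = 18_688
HOST_MEMORY_GB = 32   # assumed: CPU-attached memory per node
GPU_MEMORY_GB = 6     # assumed: GPU memory per node

# Peak performance is the sum of the GPU and CPU contributions.
peak_pf = GPU_PEAK_PF + CPU_PEAK_PF

# Total memory across all compute nodes, in decimal terabytes.
total_memory_tb = COMPUTE_NODES * (HOST_MEMORY_GB + GPU_MEMORY_GB) / 1000

print(f"Peak: {peak_pf:.1f} PF")
print(f"Total memory: {total_memory_tb:.0f} TB")
```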

  4. Keeneland – Full Scale System (http://keeneland.gatech.edu)
     [Figure: system hierarchy with peak performance at each level]
     • Xeon E5-2670: 166 GFLOPS
     • M2090: 665 GFLOPS
     • ProLiant SL250 G8 node (2 CPUs, 3 GPUs): 2327 GFLOPS, 32/18 GB
     • S6500 chassis (4 nodes): 9308 GFLOPS
     • Rack (6 chassis): 55848 GFLOPS
     • Keeneland system (11 compute racks): 614450 GFLOPS
     • Mellanox 384p FDR InfiniBand switch
     • Full PCIe G3 x16 bandwidth to all GPUs
     • Integrated with NICS Datacenter Lustre and XSEDE
     J.S. Vetter, R. Glassbrook et al., “Keeneland: Bringing heterogeneous GPU computing to the computational science community,” IEEE Computing in Science and Engineering, 13(5):90-5, 2011, http://dx.doi.org/10.1109/MCSE.2011.83.
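
The peak figures on the Keeneland slide roll up level by level. The sketch below assumes the pairing 166 GFLOPS per Xeon E5-2670 and 665 GFLOPS per M2090, which is inferred from the arithmetic rather than stated on the slide. The node, chassis, and rack values reproduce exactly. The slide's system figure (614450 GFLOPS) is slightly higher than 11 racks would give, and the slide does not say what accounts for the difference.

```python
# Per-device peaks, paired with devices by assumption (see lead-in).
XEON_GFLOPS = 166
M2090_GFLOPS = 665

# Roll up the hierarchy shown on the slide.
node_gflops = 2 * XEON_GFLOPS + 3 * M2090_GFLOPS   # 2 CPUs + 3 GPUs per node
chassis_gflops = 4 * node_gflops                   # 4 nodes per chassis
rack_gflops = 6 * chassis_gflops                   # 6 chassis per rack
racks_only_gflops = 11 * rack_gflops               # 11 compute racks

SLIDE_SYSTEM_GFLOPS = 614450
unexplained = SLIDE_SYSTEM_GFLOPS - racks_only_gflops

print(node_gflops, chassis_gflops, rack_gflops, racks_only_gflops, unexplained)
```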

  5. Contemporary HPC Architectures

  6. Emerging Computing Architectures
     • Heterogeneous processing
       • Many cores
       • Fused, configurable memory
     • Memory
       • 3D stacking
       • New devices (PCRAM, ReRAM)
     • Interconnects
       • Collective offload
       • Scalable topologies
     • Storage
       • Active storage
       • Non-traditional storage architectures (key-value stores)
     • Improving performance and programmability in the face of increasing complexity
       • Power, resilience
     HPC (all) computer design is more fluid now than in the past two decades.

  7. AMD Llano’s fused memory hierarchy
     K. Spafford, J.S. Meredith, S. Lee, D. Li, P.C. Roth, and J.S. Vetter, “The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Architectures,” in ACM Computing Frontiers (CF). Cagliari, Italy: ACM, 2012.
     Note: Both SB and Llano are consumer parts, not server parts.

  8. Future Directions in Heterogeneous Computing
     • Over the next decade: heterogeneous computing will continue to increase in importance
     • Manycore
     • Hardware features
       • Transactional memory
       • Random number generators
       • Scatter/gather
       • Wider SIMD/AVX
     • Synergies with BIGDATA, mobile markets, graphics
     • Top 10 list of features to include from application perspective. Now is the time!

  9. Applications must use a mix of programming models

  10. Communication: MPI Profiling

  11. Communication – MPI
      • MPI dominates HPC
      • Communication can severely restrict performance and scalability
      • Developer has explicit control of MPI in the application
        • Communication/computation overlap
        • Collectives
      • MPI tools provide a wealth of information
        • Statistics: number and size of messages sent in a given time
        • Tracing: event-based log per task of all communication events
      Georgia Tech / Computational Science and Engineering / Vetter

  12. MPI Provides the MPI Profiling Layer
      • The MPI specification provides the MPI Profiling Layer to allow interposition between the application and the MPI runtime
      • PERUSE is a recent attempt to provide more detailed information from the runtime for performance measurement
      • http://www.mpi-peruse.org/
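
The interposition idea can be illustrated with a small analogy. In MPI itself the mechanism is name shifting at link time: every `MPI_` routine is also reachable under a `PMPI_` name, so a tool can supply its own `MPI_Send` that records data and then calls `PMPI_Send`. The Python below only mimics that pattern. The functions `pmpi_send` and `mpi_send` are hypothetical stand-ins and not MPI calls.

```python
import time
from collections import defaultdict

def pmpi_send(payload):
    """Hypothetical stand-in for the real library entry point (the PMPI_Send role)."""
    return len(payload)

# Statistics gathered by the wrapper, keyed by operation name.
stats = defaultdict(lambda: {"count": 0, "bytes": 0, "seconds": 0.0})

def mpi_send(payload):
    """Wrapper under the name the application calls (the MPI_Send role)."""
    start = time.perf_counter()
    result = pmpi_send(payload)  # forward to the underlying implementation
    elapsed = time.perf_counter() - start
    entry = stats["send"]
    entry["count"] += 1
    entry["bytes"] += len(payload)
    entry["seconds"] += elapsed
    return result

# The application code is unchanged: it simply calls mpi_send.
for message in (b"abc", b"defgh"):
    mpi_send(message)

print(stats["send"]["count"], stats["send"]["bytes"])
```

The point of the design is that the application needs no source changes; only the symbol it resolves to is replaced.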

  13. MPI Performance Tools Provide Varying Levels of Detail
      • mpiP (http://mpip.sourceforge.net/)
        • Statistics on
          • Counts, sizes, min, max for point-to-point and collective operations
          • MPI I/O counts, sizes, min, max
        • Lightweight
          • Has scaled to 64k processors on BGL
          • No large tracefiles
          • Low perturbation
        • Callsite-specific information
      • Tau, Vampir, Intel Tracing Tool, Paraver
        • Statistical and tracing information
        • Varying levels of complexity, perturbation, and tracefile size
      • Paraver (http://www.bsc.es/plantillaA.php?cat_id=486)
        • Covered in detail here

  14. MPI Profiling

  15. Why do these systems have different performance on POP?

  16. MPI Performance Profiling: mpiP
      • mpiP basics
        • Easy-to-use tool
        • Statistics-based MPI profiling library
        • Requires relinking but no source-level changes
        • Compiling with “-g” is recommended
        • Provides average times for each MPI call site
        • Has been shown to be very useful for scaling analysis

  17. mpiP example
      • POP MPI performance
      @ mpiP
      @ Command             : ./pop
      @ Version             : 2.4
      @ MPIP Build date     : Jul 18 2003, 11:41:57
      @ Start time          : 2003 07 18 15:01:16
      @ Stop time           : 2003 07 18 15:03:53
      @ MPIP env var        : [null]
      @ Collector Rank      : 0
      @ Collector PID       : 25656
      @ Final Output Dir    : .
      @ MPI Task Assignment : 0 h0107.nersc.gov
      @ MPI Task Assignment : 1 h0107.nersc.gov

  18. More mpiP Output for POP
      ---------------------------------------------------------------------------
      @--- MPI Time (seconds) ---------------------------------------------------
      ---------------------------------------------------------------------------
      Task    AppTime    MPITime     MPI%
         0        157       1.89     1.21
         1        157       6.01     3.84
         *        313       7.91     2.52
      ---------------------------------------------------------------------------
      @--- Callsites: 6 ---------------------------------------------------------
      ---------------------------------------------------------------------------
       ID Lev File                 Line Parent_Funct  MPI_Call
        1   0 global_reductions.f     0 ??            Wait
        2   0 stencils.f              0 ??            Waitall
        3   0 communicate.f        3122 .MPI_Send     Cart_shift
        4   0 boundary.f           3122 .MPI_Send     Isend
        5   0 communicate.f           0 .MPI_Send     Type_commit
        6   0 boundary.f              0 .MPI_Send     Isend
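
The MPI% column in the MPI Time table can be recomputed as the MPI time divided by the application time. The sketch below assumes that definition and uses the values as printed on the slide. Because the printed times are rounded, the recomputed percentages differ from the reported ones in the second decimal place.

```python
# (task, AppTime in s, MPITime in s, reported MPI%) copied from the table.
rows = [
    ("0", 157, 1.89, 1.21),
    ("1", 157, 6.01, 3.84),
    ("*", 313, 7.91, 2.52),
]

def mpi_percent(app_time, mpi_time):
    """Share of application time spent inside MPI, in percent."""
    return 100.0 * mpi_time / app_time

for task, app_time, mpi_time, reported in rows:
    recomputed = mpi_percent(app_time, mpi_time)
    print(f"task {task}: recomputed {recomputed:.2f}% (reported {reported:.2f}%)")
```

The "*" row aggregates all tasks, so its percentage is the ratio of the summed times and not the average of the per-task percentages.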

  19. Still More mpiP Output for POP
      ---------------------------------------------------------------------------
      @--- Aggregate Time (top twenty, descending, milliseconds) ----------------
      ---------------------------------------------------------------------------
      Call           Site       Time    App%    MPI%
      Waitall           4   2.22e+03    0.71   28.08
      Waitall           6   1.82e+03    0.58   23.04
      Wait              1   1.46e+03    0.46   18.41
      Waitall           2        831    0.27   10.51
      Allreduce         1        499    0.16    6.31
      Bcast             1        275    0.09    3.47
      Isend             2        256    0.08    3.24
      Isend             4        173    0.06    2.18
      Barrier           1        113    0.04    1.43
      Irecv             2       80.3    0.03    1.01
      Irecv             4       40.6    0.01    0.51
      Cart_create       3         28    0.01    0.35
      Cart_coords       3       17.4    0.01    0.22
      Type_commit       5       12.7    0.00    0.16
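
One way to read the aggregate table is to sum the time per MPI call across call sites, which shows how much of the MPI time goes to waiting on outstanding requests. The sketch below parses the rows shown on this slide only; the helper name `time_by_call` is illustrative and not part of mpiP.

```python
from collections import defaultdict

# Rows copied from the aggregate-time table: call, site, time (ms), App%, MPI%.
REPORT = """\
Waitall 4 2.22e+03 0.71 28.08
Waitall 6 1.82e+03 0.58 23.04
Wait 1 1.46e+03 0.46 18.41
Waitall 2 831 0.27 10.51
Allreduce 1 499 0.16 6.31
Bcast 1 275 0.09 3.47
Isend 2 256 0.08 3.24
Isend 4 173 0.06 2.18
Barrier 1 113 0.04 1.43
Irecv 2 80.3 0.03 1.01
Irecv 4 40.6 0.01 0.51
Cart_create 3 28 0.01 0.35
Cart_coords 3 17.4 0.01 0.22
Type_commit 5 12.7 0.00 0.16
"""

def time_by_call(report):
    """Sum the time column (milliseconds) per MPI call name."""
    totals = defaultdict(float)
    for line in report.splitlines():
        call, _site, time_ms, _app_pct, _mpi_pct = line.split()
        totals[call] += float(time_ms)
    return dict(totals)

totals = time_by_call(REPORT)
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for call, time_ms in ranked:
    print(f"{call:12s} {time_ms:8.1f} ms")
```

In this profile the three Waitall sites together outweigh every other call, which points at waiting on nonblocking communication as the first place to look.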

  20. Remaining mpiP Output for POP
      ---------------------------------------------------------------------------
      @--- Aggregate Time (top twenty, descending, milliseconds) ----------------
      ---------------------------------------------------------------------------
      Isend             1       12.7    0.00    0.16
      Bcast             3       12.4    0.00    0.16
      Barrier           5       12.2    0.00    0.15
      Cart_shift        5         12    0.00    0.15
      Irecv             1       10.7    0.00    0.13
      Isend             6       9.28    0.00    0.12
      ---------------------------------------------------------------------------
      @--- Callsite statistics (all, milliseconds): 53 --------------------------
      ---------------------------------------------------------------------------
      Name         Site Rank  Count    Max    Mean    Min   App%   MPI%
      Allreduce       1    0   1121   2.35   0.182  0.079   0.13  10.79
      Allreduce       1    1   1121   11.1   0.263  0.129   0.19   4.90
      Allreduce       1    *   2242   11.1   0.222  0.079   0.16   6.31
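
The callsite statistics tie back to the aggregate table: count times mean time gives the total time at a call site. The sketch below applies that to the Allreduce rows and compares the result with the 499 ms that the aggregate table reports for Allreduce at site 1. The printed means are rounded, so the agreement is approximate.

```python
# (rank, count, mean time in ms) for Allreduce at site 1, copied from the table.
per_rank = [("0", 1121, 0.182), ("1", 1121, 0.263)]
all_ranks = ("*", 2242, 0.222)

AGGREGATE_MS = 499  # Allreduce, site 1, from the aggregate-time table

# Total time per rank is count * mean; summing ranks gives the site total.
rank_totals = {rank: count * mean for rank, count, mean in per_rank}
site_total_ms = sum(rank_totals.values())

# The "*" row should give nearly the same total from its own count and mean.
star_total_ms = all_ranks[1] * all_ranks[2]

print(f"sum over ranks: {site_total_ms:.1f} ms")
print(f"from '*' row:   {star_total_ms:.1f} ms")
print(f"aggregate row:  {AGGREGATE_MS} ms")
```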
