
Parallel Programming Trends in Extremely Scalable Architectures

Carlo Cavazzoni, HPC department, CINECA


Presentation Transcript


  1. Parallel Programming Trends in Extremely Scalable Architectures Carlo Cavazzoni, HPC department, CINECA

  2. CINECA CINECA is a non-profit consortium made up of 50 Italian universities*, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR). CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology transfer initiatives for industry.

  3. Why parallel programming? • Solve larger problems • Run memory demanding codes • Solve problems with greater speed

  4. Modern Parallel Architectures • Two basic architectural schemes: • Distributed Memory • Shared Memory • Now most computers have a mixed architecture • + accelerators -> hybrid architectures

  5. Distributed Memory [diagram: nodes, each with its own CPU and local memory, connected through a network]

  6. Shared Memory [diagram: several CPUs attached to a single shared memory]

  7. Real Shared Memory [diagram: CPUs connected through a system bus to shared memory banks]

  8. Virtual Shared Memory [diagram: nodes, each with a CPU, local memory, and a hub, interconnected through a network]

  9. Mixed Architectures [diagram: nodes, each containing multiple CPUs that share the node's local memory, connected through a network]

  10. Most Common Networks • switched (switch) • Cube, hypercube, n-cube • Torus in 1, 2, ..., N dimensions • Fat Tree

  11. HPC Trends

  12. Top500 • Paradigm change in HPC… What about applications? The next HPC system installed at CINECA will have 200,000 cores.

  13. Roadmap to Exascale (architectural trends)

  14. The core frequency and performance no longer grow following Moore's law. The Dennard scaling law (MOSFET) does not hold anymore! Dennard scaling: L' = L / 2, V' = V / 2, F' = F * 2, D' = 1 / L^2 = 4 * D, P' = P. Today: L' = L / 2, V' = ~V, F' = ~F * 2, D' = 1 / L^2 = 4 * D, P' = 4 * P. The power crisis! CPU + accelerator to maintain the architecture evolution within Moore's law. Programming crisis!
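
  A short sketch (not from the slides) checking the two scaling regimes above. It assumes the standard dynamic-power relation P ~ C * V^2 * F per device, with per-device capacitance scaling with L; all quantities are normalized:

  #include <stdio.h>

  int main(void) {
      /* normalized per-device capacitance, supply voltage, clock frequency */
      double C = 1.0, V = 1.0, F = 1.0;
      double P = C * V * V * F;   /* power per unit area (one device, normalized), this generation */

      /* next generation: L' = L/2 -> 4x devices per area, each with ~C/2 */
      double dennard = 4.0 * (C / 2.0) * (V / 2.0) * (V / 2.0) * (2.0 * F); /* V' = V/2, F' = 2F */
      double today   = 4.0 * (C / 2.0) * (V * V) * (2.0 * F);               /* V' ~ V,  F' ~ 2F  */

      printf("Dennard scaling:   P' = %.1f * P\n", dennard / P);  /* 1.0: constant power   */
      printf("without V scaling: P' = %.1f * P\n", today / P);    /* 4.0: the power crisis */
      return 0;
  }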

  15. Where are Watts burnt? • Today (at 40 nm), moving the 3 64-bit operands needed to compute a 64-bit floating-point FMA, D = A + B * C, takes 4.7x the energy of the FMA operation itself. • Extrapolating down to 10 nm integration, the energy required to move the data becomes 100x! At that point the arithmetic accounts for only about 1% of the total energy of the operation.

  16. MPP System • Arch. option for BG/Q • When? 2012 • PFlop/s: >2 • Power: >1 MWatt • Cores: >150,000 • Threads: >500,000

  17. Accelerator • A set (one or more) of very simple execution units that can perform few operations (with respect to a standard CPU) with very high efficiency. When combined with a full-featured CPU (CISC or RISC) it can accelerate the "nominal" speed of a system. (Carlo Cavazzoni) [diagram: from physical to architectural integration of CPU and accelerator; the CPU provides single-thread performance, the accelerator provides throughput]

  18. nVIDIA GPU: the Fermi implementation packs 512 processor cores.

  19. ATI FireStream, AMD GPU • 2012: new Graphics Core Next ("GCN"), with a new instruction set and a new SIMD design.

  20. Intel MIC (Knights Ferry)

  21. What about parallel apps? • In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law). The maximum speedup tends to 1 / (1 − P), where P is the parallel fraction. With 1,000,000 cores: P = 0.999999, i.e. a serial fraction of 0.000001.
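
  A minimal sketch (not from the slides) that evaluates Amdahl's law for the figures quoted above; the function name amdahl_speedup is just an illustrative choice:

  #include <stdio.h>

  /* Amdahl's law: speedup on n cores with parallel fraction p. */
  static double amdahl_speedup(double p, double n) {
      return 1.0 / ((1.0 - p) + p / n);
  }

  int main(void) {
      double n = 1e6;        /* 1,000,000 cores            */
      double p = 0.999999;   /* serial fraction = 0.000001 */

      printf("speedup on %.0f cores: %.0f\n", n, amdahl_speedup(p, n));
      printf("asymptotic limit 1/(1-P): %.0f\n", 1.0 / (1.0 - p));
      return 0;
  }

  Even with 99.9999% of the work parallel, the speedup on one million cores is only about 500,000: half of the ideal, which illustrates how punishing the serial fraction becomes at this scale.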

  22. Programming Models • Message Passing (MPI) • Shared Memory (OpenMP) • Partitioned Global Address Space Programming (PGAS) Languages • UPC, Coarray Fortran, Titanium • Next Generation Programming Languages and Models • Chapel, X10, Fortress • Languages and Paradigm for Hardware Accelerators • CUDA, OpenCL • Hybrid: MPI + OpenMP + CUDA/OpenCL

  23. Trends [diagram: evolution from scalar applications, vector, distributed-memory, and shared-memory machines toward today's MPP systems with message passing (MPI), multi-core nodes (OpenMP), accelerators (GPGPU, FPGA) programmed with CUDA/OpenCL, and hybrid codes]

  24. Message Passing: domain decomposition [diagram: nodes, each with its own CPU and memory, connected by an internal high-performance network; the problem domain is partitioned across the nodes]

  25. Ghost Cells - Data exchange [diagram: two processors with adjacent sub-domains; the stencil (i-1,j), (i+1,j), (i,j-1), (i,j+1) around cell (i,j) reaches across the sub-domain boundary, so a layer of ghost cells is kept on each side and exchanged between processors at every update]
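
  A minimal sketch (not from the slides) of this ghost-cell exchange for a 1-D decomposition, assuming each rank owns nloc rows of nx doubles plus one ghost row on each side, with periodic neighbours as an example:

  #include <mpi.h>

  /* Exchange ghost rows with the upper and lower neighbours of a 1-D
     domain decomposition. u has (nloc + 2) rows of nx doubles:
     row 0 and row nloc + 1 are the ghost rows. */
  void exchange_ghosts(double *u, int nx, int nloc, MPI_Comm comm) {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);
      int up   = (rank + 1) % size;          /* periodic neighbours (illustrative) */
      int down = (rank - 1 + size) % size;

      /* send my last owned row up, receive my lower ghost row from below */
      MPI_Sendrecv(&u[nloc * nx], nx, MPI_DOUBLE, up,   0,
                   &u[0],         nx, MPI_DOUBLE, down, 0,
                   comm, MPI_STATUS_IGNORE);
      /* send my first owned row down, receive my upper ghost row from above */
      MPI_Sendrecv(&u[1 * nx],          nx, MPI_DOUBLE, down, 1,
                   &u[(nloc + 1) * nx], nx, MPI_DOUBLE, up,   1,
                   comm, MPI_STATUS_IGNORE);
  }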

  26. Message Passing: MPI • Main characteristics • Library • Coarse grain • Inter-node parallelization (few real alternatives) • Domain partition • Distributed memory • Almost all HPC parallel apps • Open issues • Latency • OS jitter • Scalability

  27. Shared memory node [diagram: one node where threads 0-3, each on its own CPU, operate on different slices (along x and y) of the same shared memory]

  28. Shared Memory: OpenMP • Main characteristics • Compiler directives • Medium grain • Intra-node parallelization (pthreads) • Loop or iteration partition • Shared memory • Many HPC apps • Open issues • Thread creation overhead • Memory/core affinity • Interface with MPI
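
  A minimal sketch (not from the slides) of the directive-based, loop-partition style listed above; the next slide shows the author's own FFT example. All names and sizes here are illustrative:

  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      const int n = 1000000;
      static double a[1000000], b[1000000];
      double sum = 0.0;

      /* the compiler directive splits the loop iterations among the threads of the node */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; ++i) {
          a[i] = 2.0 * b[i];
          sum += a[i];
      }

      printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
      return 0;
  }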

  29. OpenMP (3D FFT pseudocode, as on the slide):
      !$omp parallel do
      do i = 1, nsl
         call 1DFFT along z ( f [ offset( threadid ) ] )
      end do
      !$omp end parallel do

      call fw_scatter ( . . . )

      !$omp parallel
      do i = 1, nzl
         !$omp do        ! work-sharing inside the enclosing parallel region
         do j = 1, Nx
            call 1DFFT along y ( f [ offset( threadid ) ] )
         end do
         !$omp end do
         !$omp do
         do j = 1, Ny
            call 1DFFT along x ( f [ offset( threadid ) ] )
         end do
         !$omp end do
      end do
      !$omp end parallel

  30. Accelerator/GPGPU [diagram: element-wise sum of 1D arrays, the operation offloaded to the GPU in the next slide]

  31. CUDA sample
      void CPUCode(int* input1, int* input2, int* output, int length) {
          for (int i = 0; i < length; ++i) {
              output[i] = input1[i] + input2[i];
          }
      }

      __global__ void GPUCode(int* input1, int* input2, int* output, int length) {
          int idx = blockDim.x * blockIdx.x + threadIdx.x;
          if (idx < length) {
              output[idx] = input1[idx] + input2[idx];
          }
      }

      Each thread executes one loop iteration.
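
  A minimal host-side sketch (not on the slide) of how the GPUCode kernel above could be launched; the buffer names and the 256-thread block size are illustrative choices:

  #include <cuda_runtime.h>

  void gpu_add(int *h_in1, int *h_in2, int *h_out, int length) {
      int *d_in1, *d_in2, *d_out;
      size_t bytes = length * sizeof(int);

      /* allocate device buffers and copy the inputs to the GPU */
      cudaMalloc((void **)&d_in1, bytes);
      cudaMalloc((void **)&d_in2, bytes);
      cudaMalloc((void **)&d_out, bytes);
      cudaMemcpy(d_in1, h_in1, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_in2, h_in2, bytes, cudaMemcpyHostToDevice);

      /* one thread per element: enough 256-thread blocks to cover 'length' */
      int threads = 256;
      int blocks  = (length + threads - 1) / threads;
      GPUCode<<<blocks, threads>>>(d_in1, d_in2, d_out, length);

      /* copy the result back and release device memory */
      cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
      cudaFree(d_in1); cudaFree(d_in2); cudaFree(d_out);
  }

  The explicit host-to-device and device-to-host copies are exactly the "memory copy" open issue listed on the next slide.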

  32. CUDA / OpenCL • Main characteristics • Ad-hoc compiler • Fine grain • Offload parallelization (GPU) • Single-iteration parallelization • Ad-hoc memory • Few HPC apps • Open issues • Memory copy • Standard • Tools • Integration with other languages

  33. Hybrid (MPI + OpenMP + CUDA + … + python) • Take the positive of all models • Exploit the memory hierarchy • Many HPC applications are adopting this model • Mainly due to developer inertia • Hard to rewrite millions of source lines

  34. Hybrid parallel programming • Python: ensemble simulations • MPI: domain partition • OpenMP: external loop partition • CUDA: assign inner-loop iterations to GPU threads • Quantum ESPRESSO http://www.qe-forge.org/
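
  A minimal hybrid sketch (not from the slides, and not Quantum ESPRESSO code) showing the MPI + OpenMP layers of the hierarchy above: MPI ranks partition the domain, OpenMP threads partition the loop inside each rank. The problem size and the per-cell work are placeholders:

  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int provided, rank, size;
      /* request thread support because OpenMP threads live inside each MPI rank */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const int n = 1 << 20;       /* global problem size (illustrative)              */
      int nloc = n / size;         /* MPI: domain partition (assumes n divisible)     */
      double local = 0.0, global = 0.0;

      /* OpenMP: partition of the local loop among the threads of the node */
      #pragma omp parallel for reduction(+:local)
      for (int i = 0; i < nloc; ++i) {
          double x = (double)(rank * nloc + i);
          local += x * x;          /* stand-in for the real per-cell work */
      }

      MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("ranks=%d threads/rank=%d result=%e\n",
                 size, omp_get_max_threads(), global);
      MPI_Finalize();
      return 0;
  }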

  35. Storage I/O • The I/O subsystem is not keeping pace with the CPU • Checkpointing will not be possible • Reduce I/O • On-the-fly analysis and statistics • Disk only for archiving • Scratch on non-volatile memory ("close to RAM")

  36. PRACE • PRACE Research Infrastructure (www.prace-ri.eu): the top level of the European HPC ecosystem • The vision of PRACE is to enable and support European global leadership in public and private research and development. • CINECA (representing Italy) is a hosting member of PRACE and can host a Tier-0 system. [diagram: HPC pyramid with Tier 0 = European (PRACE), Tier 1 = National (CINECA today), Tier 2 = Local]

  37. FERMI @ CINECA • PRACE Tier-0 System • Architecture: 10 BG/Q frames • Model: IBM BG/Q • Processor type: IBM PowerA2, 1.6 GHz • Computing cores: 163,840 • Computing nodes: 10,240 • RAM: 1 GByte/core • Internal network: 5D torus • Disk space: 2 PByte of scratch • Peak performance: 2 PFlop/s • ISCRA & PRACE calls for projects now open!

  38. Conclusion • Parallel programming trends in extremely scalable architectures: • Exploit millions of ALUs • Hybrid hardware • Hybrid codes • Memory hierarchy • Flops/Watt (more than Flops/sec) • I/O subsystem • Non-volatile memory • Fault tolerance!
