Parallel Programming Trends in Extremely Scalable Architectures Carlo Cavazzoni, HPC department, CINECA
CINECA
CINECA is a non-profit consortium made up of 50 Italian universities, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR). CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology-transfer initiatives for industry.
Why parallel programming?
• Solve larger problems
• Run memory-demanding codes
• Solve problems with greater speed
Modern Parallel Architectures
• Two basic architectural schemes: distributed memory and shared memory
• Most computers now have a mixed architecture
• + accelerators -> hybrid architectures
Distributed Memory (diagram: nodes, each with its own CPU and local memory, connected by a network)
Shared Memory (diagram: several CPUs all accessing a single shared memory)
Real Shared Memory (diagram: CPUs connected to shared memory banks through a system bus)
Virtual Shared Memory (diagram: nodes, each with a CPU, local memory and a HUB, connected through a network)
Mixed Architectures (diagram: multi-CPU shared-memory nodes connected by a network)
Most Common Networks (diagrams): switched networks; cube, hypercube, n-cube; torus in 1, 2, ..., N dimensions; fat tree
Top500: a paradigm change in HPC. What about applications? The next HPC system installed at CINECA will have 200,000 cores.
The power crisis! Core frequency and performance no longer grow following Moore's law, because the Dennard scaling law (MOSFET) no longer holds.
• Dennard scaling (each new generation): L' = L / 2, V' = V / 2, F' = F * 2, D' = 1 / L'^2 = 4 * D, P' = P
• Today: L' = L / 2, V' = ~V, F' = ~F * 2, D' = 1 / L'^2 = 4 * D, P' = 4 * P
CPU + accelerator to keep the architecture evolution on the Moore's-law track. The programming crisis!
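A short derivation (a sketch, assuming the standard dynamic-power-density model P ∝ D·C·V²·F, which the slide does not spell out; D = device density, C = gate capacitance ∝ L, V = supply voltage, F = clock frequency) shows why power density stays flat under Dennard scaling and quadruples once the voltage can no longer be reduced:
\[
\text{Dennard:}\quad P' \propto (4D)\,\tfrac{C}{2}\,\bigl(\tfrac{V}{2}\bigr)^{2}\,(2F) = P
\qquad\qquad
\text{Today:}\quad P' \propto (4D)\,\tfrac{C}{2}\,V^{2}\,(2F) = 4P
\]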
Where are Watts burnt? Today (at 40 nm), moving the three 64-bit operands needed to compute a 64-bit floating-point FMA (D = A + B * C) takes 4.7x the energy of the FMA operation itself. Extrapolating down to 10 nm integration, the energy required to move the data becomes 100x!
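Put together (simple arithmetic from the figures above; the absolute per-operation energies are not given on the slide):
\[
E_{\text{FMA+move}} \approx (1 + 4.7)\,E_{\text{FMA}} \;\text{at 40 nm}
\qquad\longrightarrow\qquad
E_{\text{FMA+move}} \approx (1 + 100)\,E_{\text{FMA}} \;\text{at 10 nm},
\]
i.e. at 10 nm the arithmetic itself would account for only about 1% of the energy of each floating-point update.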
MPP System: Arch Option for BG/Q
• When? 2012
• PFlop/s: > 2
• Power: > 1 MWatt
• Cores: > 150,000
• Threads: > 500,000
Accelerator: a set (one or more) of very simple execution units that can perform only a few operations (with respect to a standard CPU) with very high efficiency. When combined with a full-featured CPU (CISC or RISC), it can accelerate the "nominal" speed of a system. (Carlo Cavazzoni) (diagram: CPU and accelerator, from physical integration to architectural integration; single-thread performance vs. throughput)
NVIDIA GPU: the Fermi implementation packs 512 processor cores.
ATI FireStream, AMD GPU: in 2012, the new Graphics Core Next ("GCN") architecture, with a new instruction set and a new SIMD design.
What about parallel apps? In a massively parallel context, an upper limit on the scalability of parallel applications is set by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law): the maximum speedup tends to 1 / (1 - P), where P is the parallel fraction. With 1,000,000 cores this requires P = 0.999999, i.e. a serial fraction of 0.000001.
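As a worked example (using the standard form of Amdahl's law with N cores; only P and N come from the slide):
\[
S(N) = \frac{1}{(1-P) + P/N},\qquad
S(10^{6})\Big|_{P=0.999999} = \frac{1}{10^{-6} + 0.999999\times 10^{-6}} \approx 5\times 10^{5},\qquad
\lim_{N\to\infty} S(N) = \frac{1}{1-P} = 10^{6}.
\]
Even with a serial fraction of just one part per million, one million cores deliver only about half of the ideal speedup.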
Programming Models
• Message Passing (MPI)
• Shared Memory (OpenMP)
• Partitioned Global Address Space (PGAS) languages: UPC, Coarray Fortran, Titanium
• Next-generation programming languages and models: Chapel, X10, Fortress
• Languages and paradigms for hardware accelerators: CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL
Trends: scalar application -> vector -> distributed memory -> shared memory, and in turn MPP systems (message passing: MPI) -> multi-core nodes (OpenMP) -> accelerators (GPGPU, FPGA: CUDA, OpenCL) -> hybrid codes
Message passing: domain decomposition (diagram: nodes, each with a CPU and local memory holding one sub-domain, connected by an internal high-performance network)
Ghost cells - data exchange (diagram: two processors, each owning a sub-domain; for a stencil update of cell (i,j) using (i-1,j), (i+1,j), (i,j-1), (i,j+1), the cells lying across the sub-domain boundary are stored as ghost cells and exchanged between processors at every update)
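A minimal sketch of this ghost-cell (halo) exchange in C with MPI, assuming a 1D domain decomposition along x; the array layout and the neighbour ranks left/right are illustrative, not taken from the slides (boundary ranks can pass MPI_PROC_NULL):

#include <mpi.h>

/* u is a local (nx+2) x ny block stored row-major;
   rows 0 and nx+1 are the ghost rows, rows 1..nx are owned. */
void exchange_ghost_cells(double *u, int nx, int ny,
                          int left, int right, MPI_Comm comm)
{
    MPI_Status status;

    /* send the first owned row to the left neighbour,
       receive the right neighbour's first row into the right ghost row */
    MPI_Sendrecv(&u[1 * ny],        ny, MPI_DOUBLE, left,  0,
                 &u[(nx + 1) * ny], ny, MPI_DOUBLE, right, 0, comm, &status);

    /* send the last owned row to the right neighbour,
       receive the left neighbour's last row into the left ghost row */
    MPI_Sendrecv(&u[nx * ny],       ny, MPI_DOUBLE, right, 1,
                 &u[0],             ny, MPI_DOUBLE, left,  1, comm, &status);
}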
Message Passing: MPI
Main characteristics:
• library
• coarse grain
• inter-node parallelization (few real alternatives)
• domain partition
• distributed memory
• almost all HPC parallel applications
Open issues:
• latency
• OS jitter
• scalability
Shared-memory node (diagram: threads 0-3, each running on a CPU, all accessing the same memory)
Shared Memory: OpenMP
Main characteristics:
• compiler directives
• medium grain
• intra-node parallelization (pthreads)
• loop or iteration partition
• shared memory
• many HPC applications
Open issues:
• thread creation overhead
• memory/core affinity
• interface with MPI
OpenMP example (pseudocode):
!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do
call fw_scatter ( . . . )
!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel
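For comparison, a minimal self-contained C version of the same idea (an outer loop partitioned among threads by a compiler directive); the array names and size are illustrative, not taken from the slides:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    /* the compiler directive splits the iteration space among the threads */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

    printf("max threads available: %d\n", omp_get_max_threads());
    return 0;
}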
Accelerator/GPGPU (diagram: element-wise sum of 1D arrays offloaded to the GPU)
CUDA sample:

void CPUCode(int* input1, int* input2, int* output, int length)
{
    for (int i = 0; i < length; ++i) {
        output[i] = input1[i] + input2[i];
    }
}

__global__ void GPUCode(int* input1, int* input2, int* output, int length)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < length) {
        output[idx] = input1[idx] + input2[idx];
    }
}

Each thread executes one loop iteration.
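A sketch of the host-side code that would drive this kernel (standard CUDA runtime calls; the buffer names and the 256-thread block size are illustrative choices, not from the slides):

void RunOnGPU(int* h_in1, int* h_in2, int* h_out, int length)
{
    int *d_in1, *d_in2, *d_out;
    size_t bytes = length * sizeof(int);

    /* allocate device memory and copy the inputs host -> device */
    cudaMalloc((void**)&d_in1, bytes);
    cudaMalloc((void**)&d_in2, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in1, h_in1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_in2, h_in2, bytes, cudaMemcpyHostToDevice);

    /* one thread per element, 256 threads per block */
    int threads = 256;
    int blocks  = (length + threads - 1) / threads;
    GPUCode<<<blocks, threads>>>(d_in1, d_in2, d_out, length);

    /* copy the result device -> host and free device memory */
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in1); cudaFree(d_in2); cudaFree(d_out);
}

The two cudaMemcpy transfers are exactly the "memory copy" cost listed among the open issues on the next slide.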
CUDA / OpenCL
Main characteristics:
• ad-hoc compiler
• fine grain
• offload parallelization (GPU)
• single-iteration parallelization
• ad-hoc memory
• few HPC applications
Open issues:
• memory copy
• standard
• tools
• integration with other languages
Hybrid (MPI + OpenMP + CUDA + ... + Python)
• Takes the positives of all models
• Exploits the memory hierarchy
• Many HPC applications are adopting this model, mainly due to developer inertia: it is hard to rewrite millions of lines of source code
Hybrid parallel programming (Quantum ESPRESSO, http://www.qe-forge.org/)
• Python: ensemble simulations
• MPI: domain partition
• OpenMP: external loop partition
• CUDA: assign inner-loop iterations to GPU threads
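A minimal sketch of how the MPI and OpenMP layers of such a hybrid scheme combine in C (the domain size and the per-point work are made up for illustration; the CUDA and Python layers are omitted):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* MPI level: one rank per node / sub-domain */
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* each rank owns a slice of a hypothetical global domain of NG points */
    const int NG = 1 << 20;
    int nlocal = NG / nranks;
    double local_sum = 0.0, global_sum = 0.0;

    /* OpenMP level: threads partition the rank's local loop */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < nlocal; ++i)
        local_sum += 1.0;              /* stand-in for the real per-point work */

    /* combine the per-rank partial results */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("points processed: %.0f (threads per rank: %d)\n",
               global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}

MPI_THREAD_FUNNELED is the usual request when only the master thread makes MPI calls, as here.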
Storage I/O
• The I/O subsystem is not keeping pace with the CPU
• Checkpointing will not be possible
• Reduce I/O: on-the-fly analysis and statistics, disk only for archiving
• Scratch on non-volatile memory ("close to RAM")
PRACE
• The PRACE Research Infrastructure (www.prace-ri.eu) is the top level of the European HPC ecosystem
• The vision of PRACE is to enable and support European global leadership in public and private research and development
• CINECA (representing Italy) is a hosting member of PRACE and can host a Tier-0 system
(diagram: pyramid with Tier 0 = European, Tier 1 = national (CINECA today), Tier 2 = local)
FERMI @ CINECA: PRACE Tier-0 system
• Architecture: 10 BG/Q frames
• Model: IBM BG/Q
• Processor type: IBM PowerA2, 1.6 GHz
• Computing cores: 163,840
• Computing nodes: 10,240
• RAM: 1 GByte / core
• Internal network: 5D torus
• Disk space: 2 PByte of scratch space
• Peak performance: 2 PFlop/s
ISCRA & PRACE calls for projects are now open!
Conclusions: parallel programming trends in extremely scalable architectures
• Exploit millions of ALUs
• Hybrid hardware
• Hybrid codes
• Memory hierarchy
• Flops/Watt (more than Flops/sec)
• I/O subsystem
• Non-volatile memory
• Fault tolerance!