Software Engineering Challenges of the H2020 DEEP-EST Modular Supercomputing Architecture

Software Engineering Challenges of the H2020 DEEP-EST Modular Supercomputing Architecture Helmut Neukirchen University of Iceland helmut@hi.is The research leading to these results has received funding from the European Union Horizon 2020 – the Framework Programme for Research and Innovation (2014-2020) under Grant Agreement n°754304

Moore’s law (1965) • Number of transistors in an integrated circuit (IC) doubles every two years. • Still valid! • But believed to end approx. 2025-2030. • Clock speed & performance per clock cycle used to double each as well every two years. • Not true anymore! http://wccftech.com/moores-law-will-be-dead-2020-claim-experts-fab-process-cpug-pus-7-nm-5-nm/ Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 2

For what is the doubling of number transistors used? • More cores, bigger caches. • Today’s main way to achieve speed: • Parallel processing: • Many cores per CPU, • Many CPU nodes. • Needs to be supported by software  Software Engineering (SE) challenges! https://hpc.postech.ac.kr/~jangwoo/research/research.html Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 3

Parallel processing programming models (1/2) • Big data processing (Apache Hadoop, Apache Spark): • Write serial high-level code, frameworks take care of parallelization, • Nice from SE point of view. • Performance not on par with HPC. • Partly due to Java/Scala, Python: 10x slower than C/C++. F Glaser, H Neukirchen, T Rings, J Grabowski: Using MapReduce for High Energy Physics Data Analysis. Proc. Int. Symp. on MapReduce and Big Data Infrastructure (MR.BDI), 2013. H Neukirchen: Performance of Big Data versus High-Performance Computing: Some Observations. Extended Abstract. Clausthal-Göttingen Int. Workshop on Simulation Science (SimScience), 2017. Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 4

Parallel processing programming models (2/2) • High-Performance Computing (HPC): • Rather low-level code (C, C++, Fortran), explicit parallelization. • Message Passing Interface (MPI): communication between nodes. • MPI_Send, MPI_Recv,…, MPI_Bcast, MPI_Reduce • OpenMP: Multithreading (single node), communication via shared memory. • SE point of view: bad! • MPI pretty low-level. • Well, OpenMPI pragma annotations not that bad. • Both are prevailing: lot’s of codes exist, we have to live with them! Broadcast Reduce received broadcast responses, e.g. sum, max. #pragma omp parallel for for(i=0; i<N; i++) { c[i] = a[i] + b[i]; } Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 5

Scalability: Strong scaling • Naïve view: more cores  higher speedup. • Amdahl’s law (1967): non-parallel parts limit speedup. • “Strong scaling”: increase number of cores for constant problem size. http://www.drdobbs.com/parallel/break-amdahls-law/205900309 Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 6

Scalability: Weak scaling • While Amdahl’s law is true, we can still get more work done per time. (Gustafson’s law 1988) • “Weak scaling”: increase number of cores and problem size. • Instead of considering fixed problem size: • Fixed runtime (Amdahl’s law), but: • Possible to grow problem size while growing number of cores. http://www.drdobbs.com/parallel/break-amdahls-law/205900309 Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 7

Beyond Amdahl’s and Gustafson’s laws?! • Even weak scaling has limits: • Parallelisation overhead, • E.g. communication/co-ordination. • Make parallel parts faster. • Clock speed limits reached, but more suitable hardware possible (“accelerators”). • E.g. Graphics Processing Unit (GPU), • Many Integrated Cores (MIC). • Make sequential parts faster. • More suitable hardware, e.g. do things in hardware that were before done in software by CPUs. • E.g. using Field-Programmable Gate Array (FPGA)). • E.g. to speed-up communication. Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 8

Towards Exascale computing • Towards Exascale via accelerators (typically shared memory): • General Purpose GPU (many “shader” cores): • NVIDIA Tesla K20x (2688 cores), P100 (3584 cores), Volta (5120 cores). • Many Integrated Cores (MIC): • Intel Xeon Phi (72 x86 cores), Sunway SW26010 (260 RISC cores). • Homogenous HPC Cluster  Heterogeneous HPC cluster. ≈100 Peta Flop/s=0.1 Exa Flop/s TOP500 6/2017 (update: next week)https://www.top500.org/lists/2017/06/ 9

EU Exascale projects: DEEP, DEEP-ER • Flexible integration of accelerators and special hardware. • “Blueprints” of future Exascale technology: small prototype platforms. • Co-design: HPC applications drive hardware requirements. • 2011-2015: DEEP (Dynamical Exascale Entry Platform): • Standard cluster + flexible attachment of MIC “booster”. • 2013-2017: DEEP-ER (Dynamical Exascale Entry Platform – Extended Reach): • E.g. I/O and storage for fast checkpointing (resiliency/fault tolerance). Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 10

EU Exascale projects: DEEP-EST • 7/2017-6/2020: DEEP-EST (Dynamical Exascale Entry Platform – Extreme Scale Technologies): • MSA Modular Supercomputing Architecture: more accelerators, e.g.: • Network Attached Memory (NAM), FPGA-supported communication. • EC Funding: 14.998.342 € • http://www.deep-est.eu • University of Iceland provides machine learning applications. • Port HPC machine learning implementations to DEEP-EST hardware. • Clustering (DBSCAN), Classification (SVM), Deep learning (Neural networks). • Software Engineering more a voluntary side aspect (no deliverables). Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 11

GPU GPU InfiniBandinterconnect GPU GPU GPU GPU Standard heterogeneous cluster CPU CPU • Disadvantages: • Static assignment of CPUs to GPUs. • What if one CPU process needs more than one GPU? • Accelerators not capable to act autonomously. CPU CPU CPU CPU Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 12

Acc CPU Acc CPU Acc Acc CPU Acc DEEP heterogeneous cluster Booster Cluster • Go for more capable accelerators (MIC). • All nodes might act autonomously. • Attach all nodes to a low-latency fabric (EXTOLL: latency ‹‹ InfiniBand). • MICs can communicate even with each other (via MPI). • Dynamical assignment of cluster and booster nodes! • Run code with limited scalability (energy-)efficiently on cluster, • Run highly scaleable code (energy-)efficiently on booster. • Additional SSD Non-volatile memory (NVM) attached to booster nodes. InfiniBandinterconnect EXTOLLfabric Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 13

DEEP-EST: Modular Supercomputing Architecture (MSA), e.g. NAM • Allow more HW modules than just the MIC booster, e.g.: • Network Attached Memory (NAM): • Introduced in DEEP-ER as extreme fast /tmp for checkpointing. • Network attached RAM storage, • access through FPGA via InfiniBand & EXTOLL fabric from cluster & booster. • DEEP-EST: extend NAM to support general fast volatile storage. • E.g. for storing intermediate machine learning data. • Let FPGA even do calculations on the stored data (Near-Data Processing). • E.g. offload machine learning steps. Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 14

DEEP-EST: Modular Supercomputing Architecture (MSA), e.g. GCE • Allow more HW modules than just the MIC booster, e.g.: • Global Collective Engine (GCE): • HW-supported MPI collectives. • E.g. MPI_Reduce in FPGA: • FPGA might be even re-programmedto support custom functions beyond standard max, sum, etc. • Many more HW modules thinkable, e.g.: • GPU booster module, • Data Analytics Module. • In the style of Google’s Tensor Processing Units (TPU) accelerators. Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 15

Software Engineering challenges resulting from heterogeneous clusters • Homogenous cluster: • Huge investment into MPI/OpenMP codes. • Code base grown over decades. • Heterogeneous cluster: • GPUs: need to re-code existing codes completely. • (CUDA for NVIDIA, OpenCL for AMD, NVIDIA, etc.). • Additional investment needed. • MIC: OpenMP multi-core annotations can be re-used for many-core! • OpenCL maps also on MIC. • SE point of view: heterogeneous clusters are bad! – Well: • MIC (due to OpenMP code re-use) not as bad as GPUs. • Code using ArrayFire array library can map to CUDA, OpenCL, CPUs. Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 16

Software Engineering challenges in DEEP-EST • Modular Supercomputing Architecture of DEEP-EST:adds even more heterogeneity! • Even worse from SE point of view: more code adjustments needed! • DEEP-EST=“die Pest”!? • Seems to be the price to pay on the road to Exascale… • Well, DEEP did some homework wrt. MIC booster: • MPI implementation of DEEP abstracts away hardware details. • MPI can be used in a unified way to communicate globally via MPI between cluster and booster side. • Not possible in standard MIC approaches. • OmpSs (add-on to OpenMP): • Pragmas to specify what code parts to run on booster/cluster nodes. • Pragmas and run-time system to give impression of shared memory between separate address spaces of booster and cluster nodes. • Example next 3 slides… http://www.toll-collect-blog.de/ Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 17

OmpSs OpenMP extension used in DEEP-* (1) • copy_* pragma to copy between different address spaces. • E.g. share array A between two nodes. • Run-time system copies as necessary. Shared array Might run on Cluster node A A B CPU CPU Might run on Cluster node B http://www.training.prace-ri.eu/uploads/tx_pracetmo/OmpSsQuickOverviewXT.pdf Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 18

OmpSs OpenMP extension used in DEEP-* (2) • target device( ) pragma to specify target (smp, cuda, mic). • Still: share array between GPU and CPU address spaces. • Run-time system copies as necessary. Run on CPU B GPU CPU Run on GPU http://www.training.prace-ri.eu/uploads/tx_pracetmo/OmpSsQuickOverviewXT.pdf Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 19

OmpSs OpenMP extension used in DEEP-* (3) • in, out, inout pragma to specify data flow dependencies. • Allows to omit synchronisation pragmas (taskwait). • Run-time system waits as necessary. B GPU Data flow CPU Data flow http://www.training.prace-ri.eu/uploads/tx_pracetmo/OmpSsQuickOverviewXT.pdf Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 20

No homework done in DEEP-ER: API for NAM (Network Attached Memory) • libNAM API does not match existing APIs. :-( • Possible SE Solutions: • Add NAM as further address space of OmpSs to give a unified access? • Well, OmpSs is for multi-threading. NAM access wouldnot be a separate thread. • FPGA processing might be a case. • Refactoring (systematic source code transformations, preferablyautomated by a tool) to introduceNAM API calls. • Replace POSIX file operations. • Replace memory access (malloc) • M.Sc. student started to work on C refactoring for HPC (targeting, e.g., MPI usage). Note: more advanced API for Near-Data Processing (FPGA processing of data stored in NAM), e.g. similarto MPI_Reduce. Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 21

Software Engineering challenges in DEEP-ESTSupport (Re-)Engineering via Interaction Room • Successfully used in business information system development. • Make domain experts and SE experts talk, • Capture knowledge in semi-structured way. • An Interaction Room is • a dedicated room for the project team • where domain experts and software development experts feel at home • with large whiteboards (paper or digital canvas)on the walls • but without a classic conference table • to visualize and discuss key project aspects informally • instead of going over tedious documents. M Book, S Grapenthin, and V Gruhn: “Seeing the forest and the trees: focusing team interaction on value and effort drivers”, Int. Symp. On Foundations of SE, 2012. Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 22

Example: Process canvas for an information system with annotations to capture implicit assumptions/knowledge Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 23

Interaction Room canvasses to discover & capture implicit assumptions embodied in HPC code M Book, M Riedel, H Neukirchen, M Götz. Facilitating Collaboration in High Performance Computing Projects with an Interaction Room. Int. Workshop on SE for Parallel Systems (SEPS), 2017. Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture • Problem canvas: • Goal and scope of research question (the domain). • Real-world canvas: • Description of the pertinent aspects of the science domain. • Decomposition canvas: • Breakdown of scientific model into parallelizable units. • Architecture canvas: • Implementation of simulation on suitable HPC technology/HW architecture. • Not necessary sequential use, but iterative refinement of canvas contents.

Conclusion • Road to Exascale via accelerators. • DEEP-EST Modular Supercomputer Architecture: • Include flexibly accelerators via EXTOLL fabric to support different workloads. • HPC bad from Software Engineering point of view: • Low level HPC code: bad! Heterogeneity even worse! • Higher abstractions (e.g. Apache Spark RDDs) would help, but slower runtime. • OpenMP/OmpSs are a first step towards higher level. • In general: lack of SE for HPC! • Currently together with Software Engineering for Distributed Systems Group: Survey of state of SE usage by HPC developers. • Current work at University of Iceland to support HPC by SE: • HPC refactoring/patterns, • Interaction room for HPC. • Outlook DEEP-EST: • Just started. Not yet experience from porting HPC codes to new HW. • If there is a follow-up project to DEEP, DEEP-ER, and DEEP-EST: how will it be named? Helmut Neukirchen: SE Challenges of DEEP-EST Modular Supercomputing Architecture 25

Software Engineering Challenges of the H2020 DEEP-EST Modular Supercomputing Architecture

Software Engineering Challenges of the H2020 DEEP-EST Modular Supercomputing Architecture

Presentation Transcript

Deep Blue Architecture

Modular Software/ Component Software

The Future of Software Architecture

Deep Blue Architecture

The Future of Supercomputing

CSE503: Software Engineering Software architecture

Software Engineering Model Driven Architecture

Software Engineering Model Driven Architecture

The Context of Software Architecture

The Future of Supercomputing

Challenges in Automotive Software Engineering

Challenges in Agent-oriented Software Engineering

H2020 and Societal Challenges

Defining the Software Engineering Profession: Challenges Ahead

SUNDER DEEP COLLEGE OF ARCHITECTURE

Software Engineering Challenges of Deep Learning

Software Engineering: Challenges of Distributed Projects Global Projects

Challenges of Microservice Architecture