Overview of Extreme-Scale Software Research in China • Depei Qian • Sino-German Joint Software Institute (JSI), Beihang University • China-USA Computer Software Workshop, Sep. 27, 2011
Outline • Related R&D efforts in China • Algorithms and Computational Methods • HPC and e-Infrastructure • Parallel programming frameworks • Programming heterogeneous systems • Advanced compiler technology • Tools • Domain specific programming support
Related R&D efforts in China • NSFC • Basic algorithms and computable modeling for high performance scientific computing • Network based research environment • Many-core parallel programming • 863 program • High productivity computer and Grid service environment • Multicore/many-core programming support • HPC software for earth system modeling • 973 program • Parallel algorithms for large scale scientific computing • Virtual computing environment
NSFC’s Key Initiative on Algorithm and Modeling • Basic algorithms and computable modeling for high performance scientific computing • 8-year, launched in 2011 • 180 million Yuan funding • Focused on • Novel computational methods and basic parallel algorithms • Computable modeling for selected domains • Implementation and verification of parallel algorithms by simulation
863's key projects on HPC and Grid • "High Productivity Computer and Grid Service Environment" • Period: 2006-2010 • 940 million Yuan from the MOST, plus more than 1B Yuan in matching funds from other sources • Major R&D activities: developing PFlops computers; building up a grid service environment (CNGrid); developing Grid and HPC applications in selected areas
CNGrid GOS Architecture (layered, top to bottom) • Grid Portal, Gsh+CLI, GSML Workshop and Grid Apps • Core, System and App Level Services • Axis Handlers for Message Level Security • Tomcat (5.0.28) + Axis (1.2 rc2) • J2SE (1.4.2_07, 1.5.0_07) • OS (Linux/Unix/Windows) • PC Server (Grid Server)
Abstractions • Grid community: Agora, for persistent information storage and organization • Grid process: Grip, for runtime control
CNGrid GOS deployment • CNGrid GOS is deployed on 11 sites and several application Grids • Supports heterogeneous HPCs: Galaxy, Dawning, DeepComp • Supports multiple platforms: Unix, Linux, Windows • Uses public network connections, with only the HTTP port enabled • Flexible clients: web browser, special client, GSML client
CNGrid sites • Tsinghua University: 1.33 TFlops, 158 TB storage, 29 applications, 100+ users, IPv4/v6 access • CNIC: 150 TFlops, 1.4 PB storage, 30 applications, 269 users all over the country, IPv4/v6 access • IAPCM: 1 TFlops, 4.9 TB storage, 10 applications, 138 users, IPv4/v6 access • Shandong University: 10 TFlops, 18 TB storage, 7 applications, 60+ users, IPv4/v6 access • GSCC: 40 TFlops, 40 TB storage, 6 applications, 45 users, IPv4/v6 access • SSC: 200 TFlops, 600 TB storage, 15 applications, 286 users, IPv4/v6 access • XJTU: 4 TFlops, 25 TB storage, 14 applications, 120+ users, IPv4/v6 access • USTC: 1 TFlops, 15 TB storage, 18 applications, 60+ users, IPv4/v6 access • HUST: 1.7 TFlops, 15 TB storage, IPv4/v6 access • SIAT: 10 TFlops, 17.6 TB storage, IPv4/v6 access • HKU: 20 TFlops, 80+ users, IPv4/v6 access
CNGrid: resources • 11 sites • >450 TFlops • 2900 TB storage • Three PF-scale sites will be integrated into CNGrid soon
CNGrid: services and users • 230 services, >1400 users • Users include: China Commercial Aircraft Corp., Bao Steel, automobile industry, institutes of CAS, universities, ……
CNGrid:applications • Supporting >700 projects • 973, 863, NSFC, CAS Innovative, and Engineering projects
JASMIN: A parallel programming framework • [Framework diagram] Application codes are separated into special and common parts; the framework extracts data dependency, communications, parallel computing models, and load balancing from the codes, and promotes data structures and a library of models, stencils, and algorithms, which the supported computers execute • Also supported by the 973 and 863 projects
Basic ideas • Hide the complexity of programming millions of cores • Integrate efficient implementations of parallel fast numerical algorithms • Provide efficient data structures and solver libraries • Support software engineering for code extensibility
Basic Ideas • [Scaling diagram] Application codes scale up, using common infrastructures, from serial programming on a personal computer, to a TeraFlops cluster, to a PetaFlops MPP
JASMIN: J parallel Adaptive Structured Mesh INfrastructure • http://www.iapcm.ac.cn/jasmin, software registration 2010SR050446 • Developed 2003-now • Supports structured grids, with unstructured grids and particle simulation alongside • Application areas: Inertial Confinement Fusion, Global Climate Modeling, CFD, material simulations, ……
JASMIN V. 2.0 • User provides: physics, parameters, numerical methods, expert experience, special algorithms, etc. • User interfaces: component-based parallel programming models (C++ classes) • Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc. • HPC implementations (thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory, etc. • Architecture: multilayered, modularized, object-oriented • Codes: C++/C/F90/F77 + MPI/OpenMP, 500,000 lines • Installation: personal computers, clusters, MPPs
Numerical simulations on TianHe-1A • Simulation duration: several hours to tens of hours
GPU programming support • Source to source translation • Runtime optimization • Mixed programming model for multi-GPU systems
S2S translation for GPU • GPU-S2S, a source-to-source translator for GPU programming • Facilitates the development of parallel programs on GPUs by combining automatic mapping and static compilation
S2S translation for GPU (cont'd) • Insert directives into the source program • Guide implicit calls of CUDA runtime libraries • Enable the user to control the mapping from the homogeneous CPU platform to the GPU's streaming platform • Optimization based on runtime profiling • Take full advantage of the GPU according to application characteristics by collecting runtime dynamic information
Runtime optimization based on profiling • First level profiling (function level) • Second level profiling (memory access and kernel improvement) • Third level profiling (data partition)
First level profiling • Identify computing kernels • Instrument the source code, measure the execution time of every function, and identify the computing kernels
Second level profiling • Identify memory access patterns and improve the kernels • Instrument the computing kernels, extract and analyze the profile information, optimize according to the features of the application, and finally generate CUDA code with optimized kernels
Third level profiling • Optimization by improving data partitioning • Get copy time and computing time by instrumentation • Compute the number of streams and the data size of each stream • Generate the optimized CUDA code with streams
Matrix multiplication: performance comparison before and after profiling • The CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory access optimization, and a 91% improvement over the CUDA code using only global memory for computing • Execution performance was also compared across different platforms
FFT (1,048,576 points): performance comparison before and after profiling • The CUDA code after three-level profiling optimization achieves a 38% improvement over the CUDA code with memory access optimization, and a 77% improvement over the CUDA code using only global memory for computing • Execution performance was also compared across different platforms
Programming multi-GPU systems • The memory of a CPU+GPU system is both distributed and shared, so it is feasible to use the MPI and PGAS programming models for this new kind of system • MPI and PGAS use message passing or shared data for communication between parallel tasks or GPUs
Mixed Programming Model • NVIDIA GPU: CUDA • Traditional programming models: MPI/UPC • Combined: MPI+CUDA or UPC+CUDA for CUDA program execution
MPI+CUDA experiment • Platform: 2 NF5588 servers, each equipped with 1 Xeon CPU (2.27 GHz) and 12 GB memory • 2 NVIDIA Tesla C1060 GPUs (GT200 architecture, 4 GB device memory) • 1 Gbit Ethernet • Red Hat Linux 5.3 • CUDA Toolkit 2.3 and CUDA SDK • OpenMPI 1.3, Berkeley UPC 2.1
MPI+CUDA experiment (cont'd) • Matrix multiplication program • Uses block matrix multiplication for UPC programming • Data are spread across the UPC threads • The computing kernel multiplies two blocks at a time, implemented in CUDA • Total execution time: Tsum = Tcom + Tcuda = Tcom + Tcopy + Tkernel, where Tcom is the UPC thread communication time, Tcuda the CUDA program execution time, Tcopy the data transmission time between host and device, and Tkernel the GPU computing time
MPI+CUDA experiment (cont'd) • For 4094*4096 data, the speedup of 1 MPI+CUDA task (using 1 GPU for computing) is 184x over the case with 8 MPI tasks (2 servers, at most 8 MPI tasks; 1 server with 2 GPUs) • For small data scales such as 256 and 512, the execution time with 2 GPUs is even longer than with 1 GPU: the computing scale is too small, and the communication between the two tasks overwhelms the reduction in computing time
PKU Manycore Software Research Group • Software tool development for GPU clusters • Unified multicore/manycore/cluster programming • Resilience technology for very large GPU clusters • Software porting service • Joint project, <3k lines of code, supporting Tianhe • Advanced training program
PKU-Tianhe Turbulence Simulation • Reaches a scale 43 times larger than that of the Earth Simulator • 7168 nodes / 14336 CPUs / 7168 GPUs • FFT speed: 1.6x that of Jaguar (PKUFFT using GPUs vs. MKL not using GPUs) • Proof of the feasibility of GPU speedup for large-scale systems
Advanced Compiler Technology (ACT) Group at the ICT, CAS • Current research: parallel programming languages and models • Optimized compilers and tools for HPC (Dawning) and multicore processors (Loongson) • Will lead the new multicore/many-core programming support project
PTA: Process-based TAsk parallel programming model • A new process-based task construct with properties of isolation, atomicity and deterministic submission • Annotates a loop into two parts, prologue and task segment • #pragma pta parallel [clauses] • #pragma pta task • #pragma pta propagate (varlist) • Suitable for expressing coarse-grained, irregular parallelism on loops • Implementation and performance • PTA compiler, runtime system and assistant tool (helps write correct programs) • Speedup: 4.62 to 43.98 (average 27.58) on 48 cores; 3.08 to 7.83 (average 6.72) on 8 cores • Code changes are within 10 lines, much smaller than with OpenMP
UPC-H: A Parallel Programming Model for Deep Parallel Hierarchies • Hierarchical UPC: provides multi-level data distribution • Implicit and explicit hierarchical loop parallelism • Hybrid execution model: SPMD with fork-join • Multi-dimensional data distribution and super-pipelining • Implementations on CUDA clusters and the Dawning 6000 cluster • Based on Berkeley UPC • Enhanced optimizations such as localization and communication optimization • Supports SIMD intrinsics • CUDA cluster: 72% of the hand-tuned version's performance, with code reduced to 68% • Multi-core cluster: better process mapping and cache reuse than UPC
OpenMP and Runtime Support for Heterogeneous Platforms • Heterogeneous platforms consist of CPUs and GPUs; multiple GPUs, or CPU-GPU cooperation, bring extra data transfers that hurt the performance gain, so programmers need a unified data management system • OpenMP extension: specify the partitioning ratio to optimize data transfer globally; specify heterogeneous blocking sizes to reduce false sharing among computing devices • Runtime support: DSM system based on the specified blocking size; intelligent runtime prefetching with the help of compiler analysis • Implementation and results: built on the OpenUH compiler; gains a 1.6x speedup through prefetching on NPB/SP (class C)
Analyzers Based on Compiling Techniques for MPI Programs • Communication slicing and process mapping tool • Compiler part: PDG graph building and slice generation; iteration set transformation for approximation • Optimized mapping tool: weighted graph and hardware characteristics; graph partitioning and feedback-based evaluation • Memory bandwidth measuring tool for MPI programs: detects bursts of bandwidth requirements • Enhanced performance of MPI error checking: redundant error checking removed by dynamically turning the global error checking on/off, with the help of compiler analysis on communicators; integrated with a model checking tool (ISP) and a runtime checking tool (MARMOT)
LoongCC: An Optimizing Compiler for Loongson Multicore Processors • Based on Open64-4.2, supporting C/C++/Fortran • Open source at http://svn.open64.net/svnroot/open64/trunk/ • Powerful optimizer and analyzer with better performance • SIMD intrinsic support • Memory locality optimization • Data layout optimization • Data prefetching • Load/store grouping for 128-bit memory access instructions • Integrated with an Aggressive Auto-Parallelization Optimization (AAPO) module • Dynamic privatization • Parallel model with dynamic alias optimization • Array reduction optimization
Testing and evaluation of HPC systems • A center led by Tsinghua University (Prof. Wenguang Chen) • Developing accurate and efficient testing and evaluation tools • Developing benchmarks for HPC evaluation • Providing services to HPC developers and users
LSP3AS: large-scale parallel program performance analysis system • Designed for performance tuning on peta-scale HPC systems • Method: the source code is instrumented; the instrumented code is executed, generating profiling and tracing data files; the profiling and tracing data are then analyzed and a visualization report is generated • Instrumentation: based on TAU from the University of Oregon