New Techniques for Programming GPU Clusters
Yifeng Chen, School of EECS, Peking University, China
Two Conflicting Approaches for Programmability in HPC
• Top-down Approach
  • Core programming model is high-level (e.g. a functional parallel language)
  • Must rely on heavy heuristic runtime optimization
  • Adds low-level program constructs to improve low-level control
  • Risks:
    • Programmers tend to avoid using the “extra” constructs.
    • Low-level controls do not fit well into the core model.
• Bottom-up Approach (PARRAY, PPoPP’12)
  • Core programming model exposes the memory hierarchy
  • Same algorithm, same performance, same intellectual challenge, but shorter code
GPU Clusters
• Tianhe: 1 GPU / 2 CPUs
• Tsubame: 3 GPUs / 2 CPUs
• Mole-8.5: 6 GPUs / 2 CPUs
• PKU McClus: 2 GPUs / 1 CPU
Basic Notation
• Dimension tree
• Type reference
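An informal reading of this notation as it is used on the following slides (my annotation, not the official PARRAY grammar; the memory-space interpretations of "paged" and "dmem" are assumptions):

/* Dimension tree: [2][[2048][4096]] declares a 2-way top dimension whose
   elements are 2048 x 4096 blocks of floats.                              */
#parray {paged float [2][[2048][4096]]} H   /* assumed: paged host memory  */
#parray {dmem  float # H_1}             D   /* assumed: GPU device memory;
                                               "# H_1" is a type reference
                                               reusing dimension 1 of H    */
#parray {[#P][#D]}                      G   /* composes thread type P with
                                               device array type D         */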
Generating CUDA+Pthread

#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G

float* host; _pa_pthd* p;
#mainhost {
    #create P(p)
    #create H(host)
    #detour P(p) {
        float* dev;
        INIT_GPU($tid$);
        #create D(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy P(p)
}

(generated threading calls: pthread_create, sem_post, sem_wait, pthread_join)
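To make the expansion concrete, here is a hand-written sketch in plain C (CUDA runtime + pthreads) of roughly what this program does: two pthreads, each bound to one GPU, each copying its 2048x4096 half of the paged host array to device memory. It is an illustration only, not the code PARRAY actually generates; names such as worker, ROWS and COLS are my own.

#include <pthread.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define ROWS 2048
#define COLS 4096

static float *host;   /* [2][ROWS][COLS], contiguous, in paged host memory */

static void *worker(void *arg)
{
    int tid = (int)(long)arg;
    float *dev;
    cudaSetDevice(tid);                                     /* INIT_GPU($tid$) */
    cudaMalloc((void **)&dev, ROWS * COLS * sizeof(float)); /* #create D       */
    cudaMemcpy(dev, host + (size_t)tid * ROWS * COLS,       /* DataTransfer    */
               ROWS * COLS * sizeof(float), cudaMemcpyHostToDevice);
    /* ... kernel launches on dev ... */
    cudaFree(dev);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    host = (float *)malloc(2UL * ROWS * COLS * sizeof(float)); /* #create H  */
    for (long i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, worker, (void *)i);        /* #create P  */
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);                              /* #destroy P */
    free(host);                                                /* #destroy H */
    return 0;
}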
Generating MPI or IB/verbs

#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G

float* host; _pa_mpi* m;
#mainhosts {
    #create M(m)
    #create H(host)
    #detour M(m) {
        float* dev;
        #create H_1(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy M(m)
}

(generated communication call: MPI_Scatter)
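Likewise, a hand-written plain-MPI sketch of the same data movement: rank 0 owns the full [2][2048][4096] host array and MPI_Scatter distributes one 2048x4096 block to each of the two ranks. Again an illustration of the pattern, not PARRAY output.

#include <mpi.h>
#include <stdlib.h>

#define ROWS 2048
#define COLS 4096

int main(int argc, char **argv)
{
    int rank;
    float *host = NULL, *local;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)                                            /* #create H   */
        host = malloc(2UL * ROWS * COLS * sizeof(float));
    local = malloc((size_t)ROWS * COLS * sizeof(float));      /* #create H_1 */

    MPI_Scatter(host, ROWS * COLS, MPI_FLOAT,                 /* DataTransfer */
                local, ROWS * COLS, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    /* ... local computation on the received block ... */
    free(local);
    if (rank == 0) free(host);
    MPI_Finalize();
    return 0;
}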
Other Communication Patterns
• ALLTOALL
• BCAST
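These patterns map onto the standard MPI collectives. A minimal plain-MPI illustration (not PARRAY-generated code; the block size N is an assumed placeholder):

#include <mpi.h>
#include <stdlib.h>

#define N 1024   /* assumed block size per process */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    float *buf  = malloc(N * sizeof(float));                   /* BCAST buffer  */
    float *sbuf = malloc((size_t)N * nprocs * sizeof(float));  /* ALLTOALL out  */
    float *rbuf = malloc((size_t)N * nprocs * sizeof(float));  /* ALLTOALL in   */

    /* BCAST: root 0 sends the same N floats to every process */
    MPI_Bcast(buf, N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* ALLTOALL: each process sends a distinct N-float block to every other
       process and receives one block from each of them */
    MPI_Alltoall(sbuf, N, MPI_FLOAT, rbuf, N, MPI_FLOAT, MPI_COMM_WORLD);

    free(buf); free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}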
Generating Code for IB/verbs and YH Communication Layer
• Semi-bypassing the MPI layer
• Patching the InfiniBand layer
• Discontiguous RDMA communication pattern achieving zero-copy
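The point of the discontiguous zero-copy pattern can be illustrated directly with the verbs API: a single work request may carry a scatter/gather list, so separately located, already-registered chunks leave the node in one RDMA write without an intermediate packing copy. A minimal sketch, assuming an already-connected queue pair and a registered memory region (qp, mr, remote_addr and rkey are placeholders); this is not the actual YH-layer code.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post one RDMA write whose local source is two discontiguous chunks. */
int post_discontiguous_write(struct ibv_qp *qp, struct ibv_mr *mr,
                             void *chunk0, void *chunk1, uint32_t len,
                             uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge[2];
    struct ibv_send_wr wr, *bad_wr = NULL;

    sge[0].addr = (uintptr_t)chunk0; sge[0].length = len; sge[0].lkey = mr->lkey;
    sge[1].addr = (uintptr_t)chunk1; sge[1].length = len; sge[1].lkey = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* one-sided write, no staging copy */
    wr.sg_list    = sge;                 /* gather list: two source chunks   */
    wr.num_sge    = 2;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);  /* 0 on success */
}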
Large-Scale FFT in 20 lines
• Deeply optimized algorithm (ICS 2010)
• Zero-copy for hmem
Direct Simulation of Turbulent Flows
• Scale
  • Up to 14336³ grid, 3D, single precision
  • 12 distributed arrays, each with 11 TB of data (128 TB total)
  • Entire Tianhe-1A with 7168 nodes
• Progress
  • 4096³ completed
  • 8192³ half-way
  • 14336³ tested for performance
• Software Technologies
  • PARRAY code only 300 lines
  • Programming-level resilience technology for stable computation
• Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.
Discussions
• Other programming models?
  • MPI (more expressive datatypes)
  • OpenACC (optimization for coalescing accesses)
  • PGAS (generating PGAS library calls)
  • IB/verbs (directly generating zero-copy IB calls)
  • We need a software stack!
• Irregular structures must be encoded into arrays and can then benefit from PARRAY.
• A runtime workflow is possible above PARRAY.
• Generates Pthread + CUDA + MPI (future support of FPGA and MIC possible) + macros
  • Macros are compiled out: no performance loss.
• Typical training = 3 days; friendly to engineers…