New Techniques for Programming GPU Clusters
Yifeng Chen, School of EECS, Peking University, China
Two Conflicting Approaches for Programmability in HPC
• Top-down Approach
  • Core programming model is high-level (e.g. a functional parallel language)
  • Must rely on heavy heuristic runtime optimization
  • Adds low-level program constructs to improve low-level control
  • Risks:
    • Programmers tend to avoid using the “extra” constructs.
    • Low-level controls do not fit well into the core model.
• Bottom-up Approach (PARRAY, PPoPP’12)
  • Core programming model exposes the memory hierarchy
  • Same algorithm, same performance, same intellectual challenge, but shorter code
GPU Clusters
• Tianhe: 1 GPU / 2 CPUs
• Tsubame: 3 GPUs / 2 CPUs
• Mole-8.5: 6 GPUs / 2 CPUs
• PKU McClus: 2 GPUs / 1 CPU
Basic Notation
• Dimension tree
• Type reference
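An informal reading of this notation as it is used on the following slides (my annotation, not the official PARRAY grammar; the memory-space interpretations of "paged" and "dmem" are assumptions):

/* Dimension tree: [2][[2048][4096]] declares a 2-way top dimension whose
   elements are 2048 x 4096 blocks of floats.                              */
#parray {paged float [2][[2048][4096]]} H   /* assumed: paged host memory  */
#parray {dmem  float # H_1}             D   /* assumed: GPU device memory;
                                               "# H_1" is a type reference
                                               reusing dimension 1 of H    */
#parray {[#P][#D]}                      G   /* composes thread type P with
                                               device array type D         */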
Generating CUDA+Pthread

#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G

float* host; _pa_pthd* p;
#mainhost {
    #create P(p)
    #create H(host)
    #detour P(p) {
        float* dev;
        INIT_GPU($tid$);
        #create D(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy P(p)
}

(generated threading calls: pthread_create, sem_post, sem_wait, pthread_join)
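To make the expansion concrete, here is a hand-written sketch in plain C (CUDA runtime + pthreads) of roughly what this program does: two pthreads, each bound to one GPU, each copying its 2048x4096 half of the paged host array to device memory. It is an illustration only, not the code PARRAY actually generates; names such as worker, ROWS and COLS are my own.

#include <pthread.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define ROWS 2048
#define COLS 4096

static float *host;   /* [2][ROWS][COLS], contiguous, in paged host memory */

static void *worker(void *arg)
{
    int tid = (int)(long)arg;
    float *dev;
    cudaSetDevice(tid);                                     /* INIT_GPU($tid$) */
    cudaMalloc((void **)&dev, ROWS * COLS * sizeof(float)); /* #create D       */
    cudaMemcpy(dev, host + (size_t)tid * ROWS * COLS,       /* DataTransfer    */
               ROWS * COLS * sizeof(float), cudaMemcpyHostToDevice);
    /* ... kernel launches on dev ... */
    cudaFree(dev);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    host = (float *)malloc(2UL * ROWS * COLS * sizeof(float)); /* #create H  */
    for (long i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, worker, (void *)i);        /* #create P  */
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);                              /* #destroy P */
    free(host);                                                /* #destroy H */
    return 0;
}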
Generating MPI or IB/verbs

#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G

float* host; _pa_mpi* m;
#mainhosts {
    #create M(m)
    #create H(host)
    #detour M(m) {
        float* dev;
        #create H_1(dev)
        #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy M(m)
}

(generated communication call: MPI_Scatter)
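Likewise, a hand-written plain-MPI sketch of the same data movement: rank 0 owns the full [2][2048][4096] host array and MPI_Scatter distributes one 2048x4096 block to each of the two ranks. Again an illustration of the pattern, not PARRAY output.

#include <mpi.h>
#include <stdlib.h>

#define ROWS 2048
#define COLS 4096

int main(int argc, char **argv)
{
    int rank;
    float *host = NULL, *local;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)                                            /* #create H   */
        host = malloc(2UL * ROWS * COLS * sizeof(float));
    local = malloc((size_t)ROWS * COLS * sizeof(float));      /* #create H_1 */

    MPI_Scatter(host, ROWS * COLS, MPI_FLOAT,                 /* DataTransfer */
                local, ROWS * COLS, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    /* ... local computation on the received block ... */
    free(local);
    if (rank == 0) free(host);
    MPI_Finalize();
    return 0;
}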
Other Communication Patterns
• ALLTOALL
• BCAST
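These patterns map onto the standard MPI collectives. A minimal plain-MPI illustration (not PARRAY-generated code; the block size N is an assumed placeholder):

#include <mpi.h>
#include <stdlib.h>

#define N 1024   /* assumed block size per process */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    float *buf  = malloc(N * sizeof(float));                   /* BCAST buffer  */
    float *sbuf = malloc((size_t)N * nprocs * sizeof(float));  /* ALLTOALL out  */
    float *rbuf = malloc((size_t)N * nprocs * sizeof(float));  /* ALLTOALL in   */

    /* BCAST: root 0 sends the same N floats to every process */
    MPI_Bcast(buf, N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* ALLTOALL: each process sends a distinct N-float block to every other
       process and receives one block from each of them */
    MPI_Alltoall(sbuf, N, MPI_FLOAT, rbuf, N, MPI_FLOAT, MPI_COMM_WORLD);

    free(buf); free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}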
Generating Code for IB/verbs and YH Communication Layer
• Semi-bypassing the MPI layer
• Patching the InfiniBand layer
• Discontiguous RDMA communication pattern achieving zero-copy
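The point of the discontiguous zero-copy pattern can be illustrated directly with the verbs API: a single work request may carry a scatter/gather list, so separately located, already-registered chunks leave the node in one RDMA write without an intermediate packing copy. A minimal sketch, assuming an already-connected queue pair and a registered memory region (qp, mr, remote_addr and rkey are placeholders); this is not the actual YH-layer code.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post one RDMA write whose local source is two discontiguous chunks. */
int post_discontiguous_write(struct ibv_qp *qp, struct ibv_mr *mr,
                             void *chunk0, void *chunk1, uint32_t len,
                             uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge[2];
    struct ibv_send_wr wr, *bad_wr = NULL;

    sge[0].addr = (uintptr_t)chunk0; sge[0].length = len; sge[0].lkey = mr->lkey;
    sge[1].addr = (uintptr_t)chunk1; sge[1].length = len; sge[1].lkey = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* one-sided write, no staging copy */
    wr.sg_list    = sge;                 /* gather list: two source chunks   */
    wr.num_sge    = 2;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);  /* 0 on success */
}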
Large-Scale FFT in 20 lines
• Deeply optimized algorithm (ICS 2010)
• Zero-copy for hmem
Direct Simulation of Turbulent Flows
• Scale
  • Up to 14336³ grid, 3D, single precision
  • 12 distributed arrays, each with 11 TB of data (128 TB total)
  • Entire Tianhe-1A with 7168 nodes
• Progress
  • 4096³ completed
  • 8192³ half-way
  • 14336³ tested for performance
• Software Technologies
  • PARRAY code only 300 lines
  • Programming-level resilience technology for stable computation
• Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.
Discussions
• Other programming models?
  • MPI (more expressive datatypes)
  • OpenACC (optimization for coalescing accesses)
  • PGAS (generating PGAS library calls)
  • IB/verbs (directly generating zero-copy IB calls)
  • We need a software stack!
• Irregular structures must be encoded into arrays and can then benefit from PARRAY.
• A runtime workflow is possible above PARRAY.
• Generates Pthread + CUDA + MPI (future support of FPGA and MIC possible) + macros
  • Macros are compiled out: no performance loss.
• Typical training = 3 days; friendly to engineers…