Cell processor implementation of a MILC lattice QCD application Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb
Presentation outline • Introduction • Our view of MILC applications • Introduction to Cell Broadband Engine • Implementation in Cell/B.E. • PPE performance and STREAM benchmark • Profile on the CPU and kernels to be ported • Different approaches • Performance • Conclusion
Introduction
• Our target
  • MIMD Lattice Computation (MILC) Collaboration code – dynamical clover fermions (clover_dynamical) using the hybrid-molecular dynamics R algorithm
• Our view of the MILC applications
  • A sequence of communication and computation blocks (sketched below)
[Figure: original CPU-based implementation, alternating MPI scatter/gather and compute loops 1 through n]
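A minimal sketch of this alternating structure, with hypothetical routine names (exchange_neighbors, compute_loop) standing in for the actual MILC gather and site-loop code:

  #include <stddef.h>

  typedef struct { double re, im; } site_field;

  /* Hypothetical stand-ins for the real MILC gather and compute routines. */
  static void exchange_neighbors(site_field *f, size_t n) { (void)f; (void)n; /* MPI scatter/gather in the real code */ }
  static void compute_loop(site_field *f, size_t n, int k) { (void)f; (void)n; (void)k; /* k-th loop over local sites */ }

  void md_trajectory(site_field *f, size_t n, int n_loops)
  {
      for (int k = 1; k <= n_loops; k++) {
          exchange_neighbors(f, n);   /* MPI scatter/gather for loop k */
          compute_loop(f, n, k);      /* compute loop k */
      }
  }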
Introduction • Cell/B.E. processor • One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs), each with 256 KB of local store • 3.2 GHz processor • 25.6 GB/s processor-to-memory bandwidth • > 200 GB/s EIB sustained aggregate bandwidth • Theoretical peak performance: 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP)
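The quoted single-precision peak follows from the SPE count and SIMD width: 8 SPEs × 4-wide SP SIMD × 2 flops per fused multiply-add × 3.2 GHz = 204.8 GFLOPS. The double-precision figure of 14.63 GFLOPS corresponds to roughly 1.83 GFLOPS per SPE, reflecting the much lower DP issue rate of the original SPE pipeline.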
Presentation outline • Introduction • Our view of MILC applications • Introduction to Cell Broadband Engine • Implementation in Cell/B.E. • PPE performance and STREAM benchmark • Profile on the CPU and kernels to be ported • Different approaches • Performance • Conclusion
Performance in PPE • Step 1: try to run the code on the PPE • On the PPE it runs approximately 2-3x slower than on a modern CPU • MILC is bandwidth-bound • This agrees with what we see with the STREAM benchmark
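A minimal triad-style loop in the spirit of the STREAM benchmark (not the official STREAM code) illustrates the kind of bandwidth measurement referred to here; the array size and timing details are illustrative:

  #define _POSIX_C_SOURCE 199309L
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (1 << 22)   /* 4M doubles per array, 32 MB each */

  int main(void)
  {
      double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b), *c = malloc(N * sizeof *c);
      if (!a || !b || !c) return 1;
      for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (long i = 0; i < N; i++)
          a[i] = b[i] + 3.0 * c[i];           /* triad: three arrays touched per iteration */
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
      printf("triad bandwidth: %.2f GB/s\n", 3.0 * N * sizeof(double) / sec / 1e9);
      free(a); free(b); free(c);
      return 0;
  }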
Execution profile and kernels to be ported • 10 subroutines are responsible for >90% of the overall runtime • All kernels together account for 98.8%
Kernel memory access pattern
[Figure: data accesses for lattice site 0, including data from a neighboring site]
One sample kernel from the udadu_mu_nu() routine:

  #define FORSOMEPARITY(i,s,choice) \
    for( i=((choice)==ODD ? even_sites_on_node : 0 ), \
         s= &(lattice[i]); \
         i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
         i++,s++)

  FORSOMEPARITY(i,s,parity) {
    mult_adj_mat_wilson_vec( &(s->link[nu]), ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
    mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
    mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
    mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
    mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
    su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)), ((su3_matrix*)F_PT(s,mat)) );
  }

• Kernel code must be SIMDized
• Performance is determined by how fast data can be DMAed in and out, not by the SIMDized code
• In each iteration only a few small elements of each site are accessed
  • struct site size: 1832 bytes
  • su3_matrix: 72 bytes
  • wilson_vector: 96 bytes
• Challenge: how to get data into the SPEs as fast as possible
  • Cell/B.E. DMA performs best when data is aligned to 128 bytes and the size is a multiple of 128 bytes
  • The data layout in MILC meets neither requirement
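As a rough sketch of what the SPE side of such a transfer looks like (assuming one padded, 128-byte-aligned element as in Approach III below; the buffer and function names are illustrative, the MFC calls are from the Cell SDK's spu_mfcio.h):

  #include <stdint.h>
  #include <spu_mfcio.h>

  static volatile char buf[128] __attribute__((aligned(128)));

  void fetch_element(uint64_t ea)   /* effective address of one padded element */
  {
      const unsigned int tag = 0;
      /* Both ea and buf are 128-byte aligned and the size is a multiple
       * of 128 bytes, which is when the MFC DMA engines perform best. */
      mfc_get(buf, ea, sizeof(buf), tag, 0, 0);
      mfc_write_tag_mask(1 << tag);
      mfc_read_tag_status_all();    /* block until the transfer completes */
  }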
Approach I: packing and unpacking
[Figure: PPE and main memory vs. SPEs; fields of struct site are packed into contiguous buffers, DMAed to the SPEs, then DMAed back and unpacked into struct site]
• Good performance in DMA operations
• Packing and unpacking are expensive on the PPE
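A minimal sketch of the packing step, with simplified stand-in types (the real MILC types use a complex struct, and struct site has many more fields):

  #include <string.h>

  typedef struct { float e[3][3][2]; } su3_matrix;      /* 72 bytes */
  typedef struct { float d[4][3][2]; } wilson_vector;   /* 96 bytes */
  typedef struct { su3_matrix link[4]; wilson_vector tmp; /* ... many more fields ... */ } site;

  /* The PPE copies the fields a kernel needs into a contiguous,
   * 128-byte-aligned buffer that the SPEs can DMA efficiently;
   * results are unpacked the same way afterwards. */
  void pack_links(const site *lattice, int nsites, int dir,
                  su3_matrix *packed /* contiguous, 128-byte-aligned buffer */)
  {
      for (int i = 0; i < nsites; i++)
          memcpy(&packed[i], &lattice[i].link[dir], sizeof(su3_matrix));
  }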
Approach II: Indirect memory access
[Figure: PPE and main memory vs. SPEs; elements of the original lattice are replaced by pointers into contiguous memory regions that the SPEs DMA from]
• Replace elements in struct site with pointers
• Pointers point to contiguous memory regions
• PPE overhead due to indirect memory access
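A minimal sketch of this layout, with illustrative types and field names:

  typedef struct { float e[3][3][2]; } su3_matrix;

  typedef struct {
      su3_matrix *link;          /* points into the contiguous link storage */
      /* ... other fields replaced by pointers in the same way ...          */
  } site;

  /* The SPEs DMA directly from the contiguous arrays, while the PPE
   * pays an extra indirection on every access. */
  void init_links(site *lattice, int nsites, su3_matrix *link_storage)
  {
      for (int i = 0; i < nsites; i++)
          lattice[i].link = &link_storage[4 * i];   /* 4 link matrices per site */
  }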
Approach III: Padding and small memory DMAs
[Figure: PPE and memory vs. SPEs; original lattice vs. lattice after padding, with SPEs DMAing the padded elements directly]
• Padding elements to an appropriate size
• Padding struct site to an appropriate size
• Good bandwidth performance gained, at the cost of padding overhead
  • su3_matrix: from 3x3 complex to 4x4 complex, 72 bytes → 128 bytes, bandwidth efficiency lost: 44%
  • wilson_vector: from 4x3 complex to 4x4 complex, 96 bytes → 128 bytes, bandwidth efficiency lost: 23%
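A sketch of what such padded element types look like (the type names are illustrative; only the array dimensions and the 128-byte alignment matter):

  /* A 3x3 complex su3_matrix (72 B) stored as 4x4 complex and a 4x3 complex
   * wilson_vector (96 B) stored as 4x4 complex each occupy exactly
   * 4*4*2*4 = 128 bytes and can be moved with one aligned DMA. */
  typedef struct {
      float e[4][4][2];          /* only e[0..2][0..2] carry real data */
  } su3_matrix_pad __attribute__((aligned(128)));

  typedef struct {
      float d[4][4][2];          /* only d[0..3][0..2] carry real data */
  } wilson_vector_pad __attribute__((aligned(128)));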
Struct site padding
• 128-byte accesses show different bandwidth for different stride sizes
• This is due to the 16 banks in main memory
• Strides that are odd multiples of 128 bytes always reach peak bandwidth
• We chose to pad struct site to 2688 bytes (21*128)
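A quick check of the bank arithmetic, assuming the 16 banks are interleaved at 128-byte granularity (an assumption; the slide only states that there are 16 banks):

  #include <stdio.h>

  int main(void)
  {
      const int stride_lines = 2688 / 128;              /* 21 units of 128 bytes per site */
      for (int i = 0; i < 16; i++)
          printf("site %2d -> bank %2d\n", i, (i * stride_lines) % 16);
      /* gcd(21,16) = 1, so 16 consecutive sites land in 16 distinct banks;
       * an even multiple such as 22*128 would reuse only 8 of the 16 banks. */
      return 0;
  }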
Presentation outline • Introduction • Our view of MILC applications • Introduction to Cell Broadband Engine • Implementation in Cell/B.E. • PPE performance and STREAM benchmark • Profile on the CPU and kernels to be ported • Different approaches • Performance • Conclusion
Kernel performance • GFLOPS are low for all kernels • Bandwidth is around 80% of peak for most kernels • Kernel speedups over the CPU are between 10x and 20x for most kernels • The set_memory_to_zero kernel has ~40x speedup; su3mat_copy() has >15x
Application performance (8x8x16x16 and 16x16x16x16 lattices)
• Single Cell application speedup: ~8-10x compared to a single Xeon core
• Cell blade application speedup: 1.5-4.1x compared to a 2-socket, 8-core Xeon blade
• Profile on the Xeon: 98.8% parallel code, 1.2% serial code
• On the Cell, kernel (SPU) time is 38-67% and PPE time is 33-62% of the overall runtime
• The PPE is standing in the way of further improvement
Application performance on two blades • For comparison, we ran the application on two Intel Xeon blades and on two Cell/B.E. blades connected through Gigabit Ethernet • More data are needed for Cell blades connected through InfiniBand
Application performance: a fair comparison • The PPE is slower than the Xeon • PPE + 1 SPE is ~2x faster than the Xeon • A Cell blade is 1.5-4.1x faster than an 8-core Xeon blade
Conclusion
• We achieved reasonably good performance
  • 4.5-5.0 GFLOPS on one Cell processor for the whole application
• We maintained the MPI framework
  • Without the assumption that the code runs on one Cell processor, certain optimizations cannot be done, e.g. loop fusion
• The current site-centric data layout forces us to take the padding approach
  • 23-44% of the bandwidth efficiency is lost
  • Fix: a field-centric data layout is desired
• The PPE slows the serial part, which is a problem for further improvement
  • Fix: IBM putting a full-version Power core in the Cell/B.E.
• The PPE may impose problems in scaling to multiple Cell blades
  • Tests of Cell blades over InfiniBand are needed