Cell processor implementation of a MILC lattice QCD application Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb
Presentation outline • Introduction • Our view of MILC applications • Introduction to Cell Broadband Engine • Implementation in Cell/B.E. • PPE performance and STREAM benchmark • Profile on the CPU and kernels to be ported • Different approaches • Performance • Conclusion
Introduction
• Our target
  • MIMD Lattice Computation (MILC) Collaboration code – dynamical clover fermions (clover_dynamical) using the hybrid-molecular dynamics R algorithm
• Our view of the MILC applications
  • A sequence of communication and computation blocks (sketched below)
[Figure: original CPU-based implementation, alternating MPI scatter/gather and compute loops 1 through n]
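A minimal sketch of this alternating structure, with hypothetical routine names (exchange_neighbors, compute_loop) standing in for the actual MILC gather and site-loop code:

  #include <stddef.h>

  typedef struct { double re, im; } site_field;

  /* Hypothetical stand-ins for the real MILC gather and compute routines. */
  static void exchange_neighbors(site_field *f, size_t n) { (void)f; (void)n; /* MPI scatter/gather in the real code */ }
  static void compute_loop(site_field *f, size_t n, int k) { (void)f; (void)n; (void)k; /* k-th loop over local sites */ }

  void md_trajectory(site_field *f, size_t n, int n_loops)
  {
      for (int k = 1; k <= n_loops; k++) {
          exchange_neighbors(f, n);   /* MPI scatter/gather for loop k */
          compute_loop(f, n, k);      /* compute loop k */
      }
  }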
Introduction • Cell/B.E. processor • One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs), each with 256 KB of local store • 3.2 GHz processor • 25.6 GB/s processor-to-memory bandwidth • > 200 GB/s EIB sustained aggregate bandwidth • Theoretical peak performance: 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP)
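The quoted single-precision peak follows from the SPE count and SIMD width: 8 SPEs × 4-wide SP SIMD × 2 flops per fused multiply-add × 3.2 GHz = 204.8 GFLOPS. The double-precision figure of 14.63 GFLOPS corresponds to roughly 1.83 GFLOPS per SPE, reflecting the much lower DP issue rate of the original SPE pipeline.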
Presentation outline • Introduction • Our view of MILC applications • Introduction to Cell Broadband Engine • Implementation in Cell/B.E. • PPE performance and STREAM benchmark • Profile on the CPU and kernels to be ported • Different approaches • Performance • Conclusion
Performance in PPE • Step 1: try to run the code on the PPE • On the PPE it runs approximately 2-3x slower than on a modern CPU • MILC is bandwidth-bound • This agrees with what we see with the STREAM benchmark
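A minimal triad-style loop in the spirit of the STREAM benchmark (not the official STREAM code) illustrates the kind of bandwidth measurement referred to here; the array size and timing details are illustrative:

  #define _POSIX_C_SOURCE 199309L
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (1 << 22)   /* 4M doubles per array, 32 MB each */

  int main(void)
  {
      double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b), *c = malloc(N * sizeof *c);
      if (!a || !b || !c) return 1;
      for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (long i = 0; i < N; i++)
          a[i] = b[i] + 3.0 * c[i];           /* triad: three arrays touched per iteration */
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
      printf("triad bandwidth: %.2f GB/s\n", 3.0 * N * sizeof(double) / sec / 1e9);
      free(a); free(b); free(c);
      return 0;
  }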
Execution profile and kernels to be ported • 10 subroutines are responsible for >90% of the overall runtime • All kernels together account for 98.8%
Kernel memory access pattern
[Figure: data accesses for lattice site 0, including data from a neighboring site]
One sample kernel from the udadu_mu_nu() routine:

  #define FORSOMEPARITY(i,s,choice) \
    for( i=((choice)==ODD ? even_sites_on_node : 0 ), \
         s= &(lattice[i]); \
         i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
         i++,s++)

  FORSOMEPARITY(i,s,parity) {
    mult_adj_mat_wilson_vec( &(s->link[nu]), ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
    mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
    mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
    mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
    mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
    su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)), ((su3_matrix*)F_PT(s,mat)) );
  }

• Kernel code must be SIMDized
• Performance is determined by how fast data can be DMAed in and out, not by the SIMDized code
• In each iteration only a few small elements of each site are accessed
  • struct site size: 1832 bytes
  • su3_matrix: 72 bytes
  • wilson_vector: 96 bytes
• Challenge: how to get data into the SPEs as fast as possible
  • Cell/B.E. DMA performs best when data is aligned to 128 bytes and the size is a multiple of 128 bytes
  • The data layout in MILC meets neither requirement
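As a rough sketch of what the SPE side of such a transfer looks like (assuming one padded, 128-byte-aligned element as in Approach III below; the buffer and function names are illustrative, the MFC calls are from the Cell SDK's spu_mfcio.h):

  #include <stdint.h>
  #include <spu_mfcio.h>

  static volatile char buf[128] __attribute__((aligned(128)));

  void fetch_element(uint64_t ea)   /* effective address of one padded element */
  {
      const unsigned int tag = 0;
      /* Both ea and buf are 128-byte aligned and the size is a multiple
       * of 128 bytes, which is when the MFC DMA engines perform best. */
      mfc_get(buf, ea, sizeof(buf), tag, 0, 0);
      mfc_write_tag_mask(1 << tag);
      mfc_read_tag_status_all();    /* block until the transfer completes */
  }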
Approach I: packing and unpacking
[Figure: PPE and main memory vs. SPEs; fields of struct site are packed into contiguous buffers, DMAed to the SPEs, then DMAed back and unpacked into struct site]
• Good performance in DMA operations
• Packing and unpacking are expensive on the PPE
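A minimal sketch of the packing step, with simplified stand-in types (the real MILC types use a complex struct, and struct site has many more fields):

  #include <string.h>

  typedef struct { float e[3][3][2]; } su3_matrix;      /* 72 bytes */
  typedef struct { float d[4][3][2]; } wilson_vector;   /* 96 bytes */
  typedef struct { su3_matrix link[4]; wilson_vector tmp; /* ... many more fields ... */ } site;

  /* The PPE copies the fields a kernel needs into a contiguous,
   * 128-byte-aligned buffer that the SPEs can DMA efficiently;
   * results are unpacked the same way afterwards. */
  void pack_links(const site *lattice, int nsites, int dir,
                  su3_matrix *packed /* contiguous, 128-byte-aligned buffer */)
  {
      for (int i = 0; i < nsites; i++)
          memcpy(&packed[i], &lattice[i].link[dir], sizeof(su3_matrix));
  }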
Approach II: Indirect memory access
[Figure: PPE and main memory vs. SPEs; elements of the original lattice are replaced by pointers into contiguous memory regions that the SPEs DMA from]
• Replace elements in struct site with pointers
• Pointers point to contiguous memory regions
• PPE overhead due to indirect memory access
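A minimal sketch of this layout, with illustrative types and field names:

  typedef struct { float e[3][3][2]; } su3_matrix;

  typedef struct {
      su3_matrix *link;          /* points into the contiguous link storage */
      /* ... other fields replaced by pointers in the same way ...          */
  } site;

  /* The SPEs DMA directly from the contiguous arrays, while the PPE
   * pays an extra indirection on every access. */
  void init_links(site *lattice, int nsites, su3_matrix *link_storage)
  {
      for (int i = 0; i < nsites; i++)
          lattice[i].link = &link_storage[4 * i];   /* 4 link matrices per site */
  }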
Approach III: Padding and small memory DMAs
[Figure: PPE and memory vs. SPEs; original lattice vs. lattice after padding, with SPEs DMAing the padded elements directly]
• Padding elements to an appropriate size
• Padding struct site to an appropriate size
• Good bandwidth performance gained, at the cost of padding overhead
  • su3_matrix: from 3x3 complex to 4x4 complex, 72 bytes → 128 bytes, bandwidth efficiency lost: 44%
  • wilson_vector: from 4x3 complex to 4x4 complex, 96 bytes → 128 bytes, bandwidth efficiency lost: 23%
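A sketch of what such padded element types look like (the type names are illustrative; only the array dimensions and the 128-byte alignment matter):

  /* A 3x3 complex su3_matrix (72 B) stored as 4x4 complex and a 4x3 complex
   * wilson_vector (96 B) stored as 4x4 complex each occupy exactly
   * 4*4*2*4 = 128 bytes and can be moved with one aligned DMA. */
  typedef struct {
      float e[4][4][2];          /* only e[0..2][0..2] carry real data */
  } su3_matrix_pad __attribute__((aligned(128)));

  typedef struct {
      float d[4][4][2];          /* only d[0..3][0..2] carry real data */
  } wilson_vector_pad __attribute__((aligned(128)));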
Struct site padding
• 128-byte accesses show different bandwidth for different stride sizes
• This is due to the 16 banks in main memory
• Strides that are odd multiples of 128 bytes always reach peak bandwidth
• We chose to pad struct site to 2688 bytes (21*128)
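A quick check of the bank arithmetic, assuming the 16 banks are interleaved at 128-byte granularity (an assumption; the slide only states that there are 16 banks):

  #include <stdio.h>

  int main(void)
  {
      const int stride_lines = 2688 / 128;              /* 21 units of 128 bytes per site */
      for (int i = 0; i < 16; i++)
          printf("site %2d -> bank %2d\n", i, (i * stride_lines) % 16);
      /* gcd(21,16) = 1, so 16 consecutive sites land in 16 distinct banks;
       * an even multiple such as 22*128 would reuse only 8 of the 16 banks. */
      return 0;
  }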
Presentation outline • Introduction • Our view of MILC applications • Introduction to Cell Broadband Engine • Implementation in Cell/B.E. • PPE performance and STREAM benchmark • Profile on the CPU and kernels to be ported • Different approaches • Performance • Conclusion
Kernel performance • GFLOPS are low for all kernels • Bandwidth is around 80% of peak for most kernels • Kernel speedups over the CPU are between 10x and 20x for most kernels • The set_memory_to_zero kernel has ~40x speedup; su3mat_copy() has >15x
Application performance (8x8x16x16 and 16x16x16x16 lattices)
• Single Cell application speedup: ~8-10x compared to a single Xeon core
• Cell blade application speedup: 1.5-4.1x compared to a 2-socket, 8-core Xeon blade
• Profile on the Xeon: 98.8% parallel code, 1.2% serial code
• On the Cell, kernel (SPU) time is 38-67% and PPE time is 33-62% of the overall runtime
• The PPE is standing in the way of further improvement
Application performance on two blades • For comparison, we ran the application on two Intel Xeon blades and on two Cell/B.E. blades connected through Gigabit Ethernet • More data are needed for Cell blades connected through InfiniBand
Application performance: a fair comparison • The PPE is slower than the Xeon • PPE + 1 SPE is ~2x faster than the Xeon • A Cell blade is 1.5-4.1x faster than an 8-core Xeon blade
Conclusion
• We achieved reasonably good performance
  • 4.5-5.0 GFLOPS on one Cell processor for the whole application
• We maintained the MPI framework
  • Without the assumption that the code runs on one Cell processor, certain optimizations cannot be done, e.g. loop fusion
• The current site-centric data layout forces us to take the padding approach
  • 23-44% of the bandwidth efficiency is lost
  • Fix: a field-centric data layout is desired
• The PPE slows the serial part, which is a problem for further improvement
  • Fix: IBM putting a full-version Power core in the Cell/B.E.
• The PPE may impose problems in scaling to multiple Cell blades
  • Tests of Cell blades over InfiniBand are needed