
Cell processor implementation of a MILC lattice QCD application


Presentation Transcript


  1. Cell processor implementation of a MILC lattice QCD application Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb

  2. Presentation outline
     • Introduction
     • Our view of MILC applications
     • Introduction to the Cell Broadband Engine
     • Implementation in Cell/B.E.
     • PPE performance and the STREAM benchmark
     • Execution profile on the CPU and kernels to be ported
     • Different approaches
     • Performance
     • Conclusion

  3. Introduction
     • Our target: the MIMD Lattice Computation (MILC) Collaboration code – dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm
     • Our view of the MILC applications: a sequence of communication and computation blocks (sketched in the code below)
     [Figure: the original CPU-based implementation as a chain of alternating blocks: MPI scatter/gather, compute loop 1, MPI scatter/gather for loop 2, compute loop 2, MPI scatter/gather for loop 3, ..., compute loop n, MPI scatter/gather for loop n+1]
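     A minimal sketch of this alternating compute/communicate structure; gather_neighbors and compute_loop are illustrative placeholders, not MILC identifiers:

        static void gather_neighbors(int loop_id)
        {
            /* In the real code: MPI scatter/gather of the neighbor (boundary)
               sites needed by the next compute loop. */
            (void)loop_id;
        }

        static void compute_loop(int loop_id)
        {
            /* Local computation over the lattice sites owned by this rank. */
            (void)loop_id;
        }

        void md_step(int n_loops)
        {
            for (int k = 1; k <= n_loops; k++) {
                gather_neighbors(k);            /* communication block k */
                compute_loop(k);                /* computation block k   */
            }
            gather_neighbors(n_loops + 1);      /* trailing scatter/gather, as in the figure */
        }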

  4. Introduction
     • Cell/B.E. processor
     • One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store
     • 3.2 GHz clock
     • 25.6 GB/s processor-to-memory bandwidth
     • > 200 GB/s sustained aggregate EIB bandwidth
     • Theoretical peak performance: 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP) (see the derivation below)
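     For reference, the quoted peaks follow from the SPE pipelines alone: each SPE can issue a 4-wide single-precision fused multiply-add every cycle, while the first-generation SPEs issue a 2-wide double-precision FMA only roughly once every 7 cycles:

        SP: 8 SPEs x 4 lanes x 2 flops (FMA) x 3.2 GHz     = 204.8 GFLOPS
        DP: 8 SPEs x 2 lanes x 2 flops (FMA) x 3.2 GHz / 7 ≈  14.6 GFLOPS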

  5. Presentation outline
     • Introduction
     • Our view of MILC applications
     • Introduction to the Cell Broadband Engine
     • Implementation in Cell/B.E.
     • PPE performance and the STREAM benchmark
     • Execution profile on the CPU and kernels to be ported
     • Different approaches
     • Performance
     • Conclusion

  6. Performance on the PPE
     • Step 1: try running the code on the PPE
     • On the PPE it runs approximately 2-3x slower than a modern CPU
     • MILC is bandwidth-bound
     • This agrees with what we see with the STREAM benchmark (a triad-style sketch follows)
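     A generic triad-style loop (a sketch of what STREAM measures, not the benchmark source) makes the bandwidth limit concrete: each iteration moves 24 bytes for only 2 flops, so the 25.6 GB/s memory bus, not the floating-point units, caps performance:

        #include <stddef.h>

        /* STREAM-style triad: reads b[i] and c[i], writes a[i], i.e. 24 bytes
           of traffic per 2 flops, so performance is set by memory bandwidth. */
        void triad(double *a, const double *b, const double *c,
                   double scalar, size_t n)
        {
            for (size_t i = 0; i < n; i++)
                a[i] = b[i] + scalar * c[i];
        }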

  7. Execution profile and kernels to be ported
     • The top 10 subroutines are responsible for >90% of the overall runtime
     • All kernels together account for 98.8%

  8. Kernel memory access pattern
     [Figure: data accesses for lattice site 0, including data pulled from neighboring sites]

     One sample kernel from the udadu_mu_nu() routine:

        #define FORSOMEPARITY(i,s,choice) \
            for( i=((choice)==ODD ? even_sites_on_node : 0 ), s= &(lattice[i]); \
                 i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
                 i++,s++)

        FORSOMEPARITY(i,s,parity) {
            mult_adj_mat_wilson_vec( &(s->link[nu]), ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
            mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
            mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
            mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
            mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
            su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)), ((su3_matrix *)F_PT(s,mat)) );
        }

     • Kernel code must be SIMDized
     • Performance is determined by how fast you can DMA the data in and out, not by the SIMDized code
     • In each iteration only small elements are accessed:
       • lattice site (struct site): 1832 bytes
       • su3_matrix: 72 bytes
       • wilson_vector: 96 bytes
     • Challenge: how do we get data into the SPUs as fast as possible?
     • Cell/B.E. DMA performs best when data is aligned to 128 bytes and the transfer size is a multiple of 128 bytes (see the SPE-side sketch below)
     • The data layout in MILC meets neither requirement
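     A minimal SPE-side sketch (not the paper's code) of pulling one padded site into local store with the MFC intrinsics from spu_mfcio.h; PADDED_SITE_BYTES and fetch_site are illustrative names:

        #include <spu_mfcio.h>

        #define PADDED_SITE_BYTES 2688                 /* 21 * 128 bytes, see slide 12 */

        /* Local-store buffer: both the buffer and the effective address should be
           128-byte aligned, and the size a multiple of 128 bytes, for best DMA rates. */
        static char site_buf[PADDED_SITE_BYTES] __attribute__((aligned(128)));

        void fetch_site(unsigned long long site_ea)    /* effective address in main memory */
        {
            const unsigned int tag = 0;

            mfc_get(site_buf, site_ea, PADDED_SITE_BYTES, tag, 0, 0);

            mfc_write_tag_mask(1 << tag);              /* wait on this tag only         */
            mfc_read_tag_status_all();                 /* block until the DMA completes */
        }

     In practice the SPE code would double-buffer, overlapping the DMA for site i+1 with the computation on site i, rather than blocking as this sketch does.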

  9. Approach I: packing and unpacking
     [Figure: the PPE packs fields scattered across struct site in main memory into contiguous buffers, the SPEs DMA them in, and results are unpacked back into struct site after the return DMA]
     • Good performance for the DMA operations
     • Packing and unpacking are expensive on the PPE (a packing sketch follows)
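     An illustrative PPE-side packing routine for Approach I, assuming the MILC headers that define site, su3_matrix, lattice and sites_on_node are included; pack_links and pack_buf are illustrative names:

        #include <string.h>

        /* Gather one field (the nu-direction gauge links) out of the
           array-of-struct site layout into a contiguous, 128-byte aligned
           buffer that the SPEs can DMA in large, well-aligned chunks. */
        void pack_links(su3_matrix *pack_buf, const site *lat, int nsites, int nu)
        {
            for (int i = 0; i < nsites; i++)
                memcpy(&pack_buf[i], &lat[i].link[nu], sizeof(su3_matrix));
        }

     Unpacking is the mirror-image copy after the results are DMAed back, and both copies run on the PPE, which is where the cost noted above comes from.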

  10. Approach II: indirect memory access
     [Figure: fields of the original lattice are moved into contiguous memory regions; the modified struct site keeps only pointers into them, and the SPEs DMA from the contiguous regions directly]
     • Replace elements in struct site with pointers
     • The pointers point into contiguous memory regions
     • PPE overhead due to the indirect memory access (see the sketch below)
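     An illustrative sketch of the indirect layout, assuming the MILC su3_matrix and wilson_vector types; site_indirect, links_mem, tmp_mem and alloc_indirect are illustrative names:

        #include <stdlib.h>

        typedef struct {
            /* ...small per-site bookkeeping stays inline... */
            su3_matrix    *link;   /* points at links_mem[4*i] */
            wilson_vector *tmp;    /* points at tmp_mem[i]     */
        } site_indirect;

        static su3_matrix    *links_mem;   /* 4 * nsites matrices, contiguous */
        static wilson_vector *tmp_mem;     /* nsites vectors, contiguous      */

        void alloc_indirect(site_indirect *lat, int nsites)
        {
            posix_memalign((void **)&links_mem, 128, 4u * nsites * sizeof(su3_matrix));
            posix_memalign((void **)&tmp_mem,   128, (size_t)nsites * sizeof(wilson_vector));
            for (int i = 0; i < nsites; i++) {
                lat[i].link = &links_mem[4 * i];   /* the extra dereference on the PPE */
                lat[i].tmp  = &tmp_mem[i];         /* is the overhead mentioned above  */
            }
        }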

  11. Approach III: padding and small memory DMAs
     [Figure: the original lattice vs. the lattice after padding; the SPEs DMA individual padded elements]
     • Pad the elements, and struct site itself, to appropriate sizes
     • Good bandwidth, at the cost of padding overhead (see the sketch below):
       • su3_matrix: from a 3x3 to a 4x4 complex matrix, 72 bytes -> 128 bytes, 44% of bandwidth efficiency lost
       • wilson_vector: from 4x3 to 4x4 complex, 96 bytes -> 128 bytes, 23% of bandwidth efficiency lost
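     An illustrative sketch of the padded element types (the _padded names are illustrative, not MILC's); only the 3x3 or 4x3 corner carries data, the rest is dead space traded for 128-byte-aligned, 128-byte-sized DMAs:

        typedef struct { float real, imag; } complex;   /* single precision, 8 bytes   */

        typedef struct {                                /* was 3x3 complex = 72 bytes  */
            complex e[4][4];                            /* now 4x4 complex = 128 bytes */
        } su3_matrix_padded;

        typedef struct {                                /* was 4x3 complex = 96 bytes  */
            complex d[4][4];                            /* now 4x4 complex = 128 bytes */
        } wilson_vector_padded;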

  12. Struct site padding
     • 128-byte strided access performs differently for different stride sizes
     • This is due to the 16 banks in main memory
     • Odd multiples of 128 bytes always reach peak bandwidth
     • We choose to pad struct site to 2688 (21 * 128) bytes (see the note below)
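     A quick illustration of why an odd number of 128-byte blocks per site helps, assuming banks are interleaved on 128-byte blocks, i.e. bank = (address / 128) mod 16 (a simplification of the real mapping): since gcd(21, 16) = 1, sixteen consecutive sites land in sixteen different banks:

        #include <stdio.h>

        int main(void)
        {
            const unsigned blocks_per_site = 2688 / 128;     /* 21 blocks per padded site */
            for (unsigned s = 0; s < 16; s++)                /* 16 consecutive sites      */
                printf("site %2u -> bank %2u\n", s, (s * blocks_per_site) % 16);
            return 0;                                        /* prints each bank exactly once */
        }

     With an even block count the gcd would exceed 1, so only a subset of the banks would be hit, which is the stride effect the slide describes.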

  13. Presentation outline
     • Introduction
     • Our view of MILC applications
     • Introduction to the Cell Broadband Engine
     • Implementation in Cell/B.E.
     • PPE performance and the STREAM benchmark
     • Execution profile on the CPU and kernels to be ported
     • Different approaches
     • Performance
     • Conclusion

  14. Kernel performance
     • GFLOPS are low for all kernels
     • Bandwidth is around 80% of peak for most kernels
     • Kernel speedup over the CPU is between 10x and 20x for most kernels
     • The set_memory_to_zero kernel has ~40x speedup; su3mat_copy() speedup is >15x

  15. Application performance (8x8x16x16 and 16x16x16x16 lattices)
     • Single-Cell application speedup: ~8–10x compared to a single Xeon core
     • Cell blade application speedup: 1.5-4.1x compared to a 2-socket, 8-core Xeon system
     • Profile on the Xeon: 98.8% parallel code, 1.2% serial code
     • On the Cell, the kernels account for 67-38% of the overall runtime (SPU time); the remaining 33-62% is PPU time
     • The PPE is standing in the way of further improvement

  16. Application performance on two blades
     • For comparison, we ran two Intel Xeon blades and two Cell/B.E. blades connected through Gigabit Ethernet
     • More data is needed for Cell blades connected through InfiniBand

  17. Application performance: a fair comparison
     • The PPE is slower than the Xeon
     • The PPE + 1 SPE is ~2x faster than the Xeon
     • A Cell blade is 1.5-4.1x faster than an 8-core Xeon blade

  18. Conclusion
     • We achieved reasonably good performance: 4.5-5.0 GFLOPS on one Cell processor for the whole application
     • We maintained the MPI framework
       • Without the assumption that the code runs on a single Cell processor, certain optimizations (e.g. loop fusion) cannot be done
     • The current site-centric data layout forces us to take the padding approach
       • 23-44% of bandwidth efficiency is lost
       • Fix: a field-centric data layout is desired
     • The PPE slows down the serial part, which is a problem for further improvement
       • Fix: IBM putting a full-version Power core in the Cell/B.E.
     • The PPE may also impose problems in scaling to multiple Cell blades
       • A test over InfiniBand is needed
