Nikola Vujić nvujic@bsc.es

Architecture-Agnostic Parallel Programming in Supercomputing Research
Xavier Martorell xavim@ac.upc.edu
Eduard Ayguadé eduard@ac.upc.edu
Nikola Vujić nvujic@bsc.es
Marc Gonzàlez marc@ac.upc.edu
Barcelona Supercomputing Center
Department of Computer Architecture, Technical University of Catalonia
VIPSI-2009, Belgrade, April 2-5, 2009

Introduction
• Moore’s Law – one of the guiding principles of computer architecture
• Technology scaling will continue
• Processor generations
  • 1st generation (1980s and 1990s): single cores
  • 2nd generation (2001): homogeneous multi-cores
  • 3rd generation (2005): heterogeneous multi-cores
• Power density of modern multiprocessors is reaching detrimental levels
• Power is becoming a main challenge for the performance of new processors
• Heterogeneity in a multi-core design appears as a solution for power-efficient processors
  • Cell BE processor, AMD Fusion, Intel Larrabee
VIPSI 2009, Belgrade, April 2-5, 2009

Cell BE Architecture
• The Cell BE architecture is a hybrid multi-core design that mixes two architectures
  • Power Processor Element (PPE)
  • Synergistic Processor Elements (SPEs)
• The Cell BE processor is used in the Sony PlayStation 3 for gaming applications and in supercomputing for scientific applications
• As of late 2008, Cell-based clusters dominate the Green500 list of the most energy-efficient supercomputers
• The IBM Roadrunner supercomputer, currently the world’s fastest, consists of 12,240 PowerXCell 8i processors along with 6,562 AMD Opteron processors
• Roadrunner is also among the ten most energy-efficient supercomputers in the world

Programmability versus Performance
• Software control of DMA transfers provides concurrency between data access and computation while making efficient use of the available memory bandwidth, but it adds the burden of explicitly programming DMA transfers and decreases programmability
• Manual solution: PERFORMANCE but not PROGRAMMABILITY
  • Transform the original code
    • Allocate buffers in the local stores
    • Introduce DMA operations within the code
    • Add synchronization statements
    • Translate from the original address space to the local address space
  • Very optimized code, but at the cost of programmability
  • Overlap of communication with computation
• Automatic solution: PROGRAMMABILITY but not PERFORMANCE
  • Tiling, double buffering
    • Good solution for regular applications
    • Needs considerable information at compile time
  • Software cache
    • Emulates hardware caches with a software technique
    • Performance is usually limited by the information available at compile time
    • Very difficult to generate code that overlaps communication with computation
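The manual steps listed on this slide (allocate local buffers, introduce DMA operations, synchronize, overlap communication with computation) can be illustrated with a minimal double-buffering sketch. This is portable C written for illustration only and is not the authors' code: `dma_get` and `dma_wait` are hypothetical stand-ins built on `memcpy`, so nothing is actually asynchronous here, and `CHUNK`, `sum_chunk` and `sum_double_buffered` are assumed names. On a real SPE the transfer would be issued through the MFC and would proceed while the previous buffer is being processed.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 128  /* assumed number of elements per local-store buffer */

/* Hypothetical stand-in for an asynchronous DMA get. memcpy completes
   immediately; a real SPE would start the transfer and keep running. */
static void dma_get(int *local, const int *global, size_t count) {
    memcpy(local, global, count * sizeof *local);
}

/* Hypothetical stand-in for waiting on a transfer. It is a no-op here
   because the memcpy above has already finished. */
static void dma_wait(int buffer_id) {
    (void)buffer_id;
}

/* The "computation" phase: sum one buffer held in local memory. */
static long sum_chunk(const int *buf, size_t count) {
    long s = 0;
    for (size_t i = 0; i < count; i++)
        s += buf[i];
    return s;
}

/* Sum n elements of global memory with two alternating local buffers.
   Assumes n is a multiple of CHUNK to keep the sketch short. */
long sum_double_buffered(const int *global, size_t n) {
    static int local[2][CHUNK];   /* the two local-store buffers */
    size_t chunks = n / CHUNK;
    long total = 0;
    int cur = 0;

    if (chunks == 0)
        return 0;

    dma_get(local[cur], global, CHUNK);          /* prefetch first chunk */
    for (size_t c = 0; c < chunks; c++) {
        int next = cur ^ 1;
        if (c + 1 < chunks)                      /* start next transfer */
            dma_get(local[next], global + (c + 1) * CHUNK, CHUNK);
        dma_wait(cur);                           /* current data ready */
        total += sum_chunk(local[cur], CHUNK);   /* compute on current */
        cur = next;                              /* swap buffers */
    }
    return total;
}
```

Calling `sum_double_buffered(data, 1024)` on an array holding 0 through 1023 walks eight chunks. The point of the pattern is the ordering inside the loop: the transfer for chunk c+1 is issued before the computation on chunk c begins, which is the overlap that the manual solution buys at the cost of programmability.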

SPE Code Transformation
• SPE code transformation is done in three phases
[Figure: software cache structures: High Locality Cache, Transactional Cache, Memory Consistency Block]

Original SPE Code (r1, r2 and r3 label the memory references)

    for (i=0; i<N; i++) {
      tmp = index[i];      /* r1 */
      w[tmp] = v[i];       /* r2, r3 */
      v[i]++;              /* r3 */
    }

Optimization of High Locality Accesses

    i=0;
    while (i<N){
      n = N;
      if (!AVAIL(h1, &index[i]))                  /* r1 */
        HMAP(h1, &index[i]);
      n = min(n, i+AVAIL(h1, &index[i]));
      if (!AVAIL(h3, &v[i]))                      /* r3 */
        HMAP(h3, &v[i]);
      n = min(n, i+AVAIL(h3, &v[i]));
      HCONSISTENCY(n, h3);
      HSYNC(h1, h3);
      for (;i<n;i++){                             /* inner loop */
        tmp = REF(h1, &index[i]);                 /* r1 */
        w[tmp] = REF(h3, &v[i]);                  /* r3 */
        REF(h3, &v[i]) = REF(h3, &v[i])+1;        /* r3 */
      }
    }

Optimization of Irregular Accesses

    for(;i<2*[n/2];i+=2){
      TINIT_PF();
      tmp = REF(h1, &index[i]);                   /* r1 */
      GET(h2, &w[tmp]);                           /* r2 */
      tmp’ = REF(h1, &index[i+1]);                /* r1 */
      GET(h2’, &w[tmp’]);                         /* r2’ */
      TSYNC(h2, h2’);
      REF(h2, &w[tmp]) = REF(h3, &v[i]);          /* r3 */
      PUT(h2, &w[tmp]);                           /* r2 */
      REF(h3, &v[i]) = REF(h3, &v[i])+1;          /* r3 */
      REF(h2’, &w[tmp’]) = REF(h3, &v[i+1]);      /* r3 */
      PUT(h2’, &w[tmp’]);                         /* r2’ */
      REF(h3, &v[i+1]) = REF(h3, &v[i+1])+1;      /* r3 */
    }

• Marc Gonzalez, Nikola Vujic, Xavier Martorell, Eduard Ayguade, Alexander E. Eichenberger, Tong Chen, Zehra Sura, Tao Zhang, Kevin O’Brien and Kathryn O’Brien, “Hybrid Access-Specific Software-Cache Techniques for the Cell BE Architecture”, Proceedings of the Seventeenth International Conference on Parallel Architectures and Compilation Techniques (PACT 2008)
• Nikola Vujic, Marc Gonzalez, Xavier Martorell, Eduard Ayguade, “Automatic Pre-Fetch and Modulo Scheduling Transformations for the Cell BE Architecture”, Proceedings of the 21st Annual Workshop on Languages and Compilers for Parallel Computing (LCPC 2008)
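The loop structure of the high-locality optimization on this slide can be emulated in plain C to show why the cache check moves out of the inner loop. The sketch below is an illustration under stated assumptions, not the authors' runtime: `Handle`, `avail`, `hmap`, `LINE_ELEMS` and `scatter_increment` are hypothetical names, `avail` is assumed to return how many elements remain in the mapped line starting at index i (0 if the line is not mapped), and `hmap` only records which line is mapped without copying any data. The consistency and synchronization steps (HCONSISTENCY, HSYNC) and the separate handling of the irregular access `w[tmp]` are omitted.

```c
#include <stddef.h>

#define LINE_ELEMS 16  /* assumed cache line size, in array elements */

/* Hypothetical handle: records which line of an array is mapped. */
typedef struct {
    size_t line_start;  /* index of the first element of the mapped line */
    int mapped;         /* 0 until the first hmap call */
} Handle;

/* Assumed AVAIL semantics: elements left in the mapped line from index i,
   or 0 if index i is not inside the mapped line. */
static size_t avail(const Handle *h, size_t i) {
    if (!h->mapped || i < h->line_start || i >= h->line_start + LINE_ELEMS)
        return 0;
    return h->line_start + LINE_ELEMS - i;
}

/* Assumed HMAP semantics: map the line that contains index i.
   A real software cache would also fetch the line by DMA. */
static void hmap(Handle *h, size_t i) {
    h->line_start = i - (i % LINE_ELEMS);
    h->mapped = 1;
}

static size_t min_size(size_t a, size_t b) {
    return a < b ? a : b;
}

/* Runs the slide's loop (w[index[i]] = v[i]; v[i]++) in chunks that stay
   inside the mapped lines of index and v. Returns the number of outer
   iterations, that is, how many times the cache was checked. */
size_t scatter_increment(const int *index, int *w, int *v, size_t N) {
    Handle h1 = {0, 0};  /* handle for index[i] (r1) */
    Handle h3 = {0, 0};  /* handle for v[i] (r3) */
    size_t i = 0, checks = 0;

    while (i < N) {
        size_t n = N;
        if (!avail(&h1, i))
            hmap(&h1, i);
        n = min_size(n, i + avail(&h1, i));
        if (!avail(&h3, i))
            hmap(&h3, i);
        n = min_size(n, i + avail(&h3, i));
        checks++;
        for (; i < n; i++) {  /* inner loop: no cache checks needed */
            int tmp = index[i];
            w[tmp] = v[i];
            v[i]++;
        }
    }
    return checks;
}
```

With 40 elements and 16-element lines, the outer loop runs three times (16, 16 and 8 iterations of the inner loop), so the check cost is paid once per line instead of once per access.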

Performance and Programmability
• Performance comparison of the Cell BE with the PowerPC 970MP and Power5
  • The MareNostrum supercomputer consists of 10,240 PowerPC 970MP processors
  • Power5 has been optimized for numerical applications such as the ones found in the NAS benchmarks
[Figure labels: BladeCenter JS21, BladeCenter QS21]
• A shared memory model with relaxed consistency is transparently offered to programmers
• The software cache design makes the Cell BE programmable like a cache-based multi-core processor, with similar performance in return
• Cell matches the computational power of the PowerPC 970MP and Power5 for a set of NAS Parallel Benchmarks

Architecture-Agnostic Parallel Programming in Supercomputing Research
Questions?
Nikola Vujić nikola.vujic@bsc.es
Barcelona Supercomputing Center
Department of Computer Architecture, Technical University of Catalonia
