Impact of Compiler-based Data-Prefetching Techniques on SPEC OMP Application Performance 2005-23523 이영준 (Lee Young Joon) MSL, EECS, SNU 2006.06.07.
Contents • Introduction • Intel Compiler Overview • Prefetching for Itanium 2 Processor • Experimental Evaluation • Concluding Remarks • References MSL, EECS, SNU
1. Introduction • The memory wall challenge • the processor-memory speed gap • Remedies • Latency tolerance • software data-prefetching • Latency elimination • long-latency elimination techniques (locality optimizations) MSL, EECS, SNU
In this paper, • Examine the impact of software data-prefetching on SPEC OMP applications • OpenMP application performance on a shared-memory system • OpenMP C/C++ and Fortran 2.0 standards • using the Intel C++ and Fortran compilers • on an SGI Altix 32-way SMP machine built with Itanium 2 processors • Most compiler analyses and optimizations are done before the data-prefetching stage • utilizing the services of an advanced memory-disambiguation module • pointer analysis, address-taken analysis, array dependence analysis, language semantics, and other sources MSL, EECS, SNU
2. Intel Compiler Overview • The Intel Itanium 2 processor has new architectural and micro-architectural features • the Intel Itanium compiler takes advantage of them • EPIC (Explicitly Parallel Instruction Computing) for large amounts of ILP • Control and data speculation • allowing loads to be scheduled across branches or other memory operations • Predication MSL, EECS, SNU
Intel Compiler Features • Supports both automatic optimization and programmer-controlled methods • Advanced compiler technologies • profile-guided multi-file inter-procedural analysis and optimizations • memory disambiguation/optimizations • parallelization • data and loop transformations • global code scheduling • predication, speculation • Users can utilize a multiprocessor • by making small changes to the source code • e.g., OpenMP directives (see the sketch below) MSL, EECS, SNU
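A minimal sketch of the kind of small source change meant above: a single OpenMP directive parallelizes a loop when OpenMP support is enabled in the compiler. The function and its arguments are illustrative, not from the paper.

```c
/* One directive is enough to spread the loop iterations across threads;
 * without OpenMP support enabled, the pragma is simply ignored. */
void saxpy(float a, const float *x, float *y, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```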
Compiler Optimizations • Compiler optimizations in the Intel compiler • Multi-Level Parallelism (MLP) • Instruction-Level Parallelism + Thread-Level Parallelism • Inter-Procedural Optimization (IPO) • points-to analysis (helps memory disambiguation), mod/ref analysis • High-Level Optimization (HLO) • loop transformations (loop fusion, loop tiling, loop unroll-and-jam, loop distribution), software data prefetching, scalar replacement, data transformations • improve data locality and reduce memory access latency • Scalar Optimizations • branch merging, strength reduction, constant propagation, dead code elimination, copy propagation, partial dead store elimination, and partial redundancy elimination (PRE) • Task Queuing Model • to exploit irregular parallelism effectively • extends the scope beyond the standard OpenMP programming model MSL, EECS, SNU
3. Prefetching for Itanium 2 Processor • Software data-prefetching • hides memory access latency • by moving referenced data closer to the CPU • prefetches do not block the instruction stream • and do not raise an exception • Software data prefetching in the Intel compiler takes advantage of Itanium 2 architectural features • predication • rotating registers • data speculation • (a minimal source-level sketch follows) MSL, EECS, SNU
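A minimal sketch of what software data-prefetching looks like at the source level, using the GCC/Clang __builtin_prefetch intrinsic purely for illustration (the Intel compiler inserts the equivalent lfetch instructions itself); PREFETCH_DIST is an assumed distance, not a value from the paper.

```c
#include <stddef.h>

#define PREFETCH_DIST 16   /* assumed distance, in loop iterations */

void scale(double *restrict x, const double *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Non-blocking and non-faulting: overshooting the end of y at the
         * tail of the loop is harmless, matching the slide's description. */
        __builtin_prefetch(&y[i + PREFETCH_DIST], 0, 3);  /* read, keep cached */
        x[i] = 2.0 * y[i];
    }
}
```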
Rotating Registers • enables succinct implementation of software pipelining with predication • rotated by one register position each time one of the special loop branches is executed • after one rotation, the content of register X will be found in register X+1 • r32-r127, f32-f127, p16-p63 (predicate regs) rotate • others do not rotate (static registers) • [Figure: register file — r0-r31 static, r32-r127 rotating (selectable)] MSL, EECS, SNU
Prefetch Principles • Avoid already-loaded data • already in cache • Issue at the right time • early enough that the data is available when needed • late enough that it is not evicted before use • Prefetch distance • estimated based on memory latency, resource requirements, and data-dependence info (see the estimate sketched below) • [Figure: timing of prefetch request vs. cache eviction] MSL, EECS, SNU
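As a first-order illustration (a sketch of the idea, not the Intel compiler's exact heuristic), the distance can be estimated from the memory latency and the loop's iteration time:

```c
/* Prefetch far enough ahead that the line arrives before its use, but not so
 * far ahead that it is evicted before the loop reaches it. */
unsigned prefetch_distance(unsigned mem_latency_cycles,
                           unsigned cycles_per_iteration)
{
    /* ceil(memory latency / loop-iteration time), at least one iteration */
    unsigned d = (mem_latency_cycles + cycles_per_iteration - 1)
                 / cycles_per_iteration;
    return d ? d : 1;
}
```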
Data-locality analysis • Three types of data locality are identified by the Intel compiler • Spatial locality • if data references inside a loop access different memory locations that fall within the same cache line • Temporal locality • if a data reference accesses the same memory location multiple times • Group locality • if different data references access the same cache line • (illustrative loop below) MSL, EECS, SNU
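An assumed loop nest (not from the paper) showing all three locality classes in one statement:

```c
void stencil(int n, double a[n][n], double b[n][n], const double *c)
{
    for (int j = 0; j < n; j++)
        for (int i = 1; i < n - 1; i++)
            /* spatial : a[j][i] walks consecutive elements of row j
             * group   : b[j][i-1] and b[j][i+1] usually share a cache line
             * temporal: c[j] names the same location on every i iteration */
            a[j][i] = b[j][i - 1] + b[j][i + 1] + c[j];
}
```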
An Example of Data-Prefetching • spatial locality: x(0), ..., x(99) and y(-1), ..., y(100) • group locality: y(k-1), y(k+1) - w.r.t. the k loop iterations • the if() statement can be replaced by predication • control dependence -> data dependence • reduces the branch misprediction penalty • If the cache line size is 128B and the array element size is 8B • prefetch distance: D = 16 iterations • calculated by the compiler • Assume k=0 and D=8 • If the array elements x(k+D) and y(k-1+D) are prefetched, array accesses to x(9:15) and y(8:14) will hit the cache • (a C reconstruction follows) MSL, EECS, SNU
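The loop itself is not reproduced on the slide; below is a hedged C reconstruction (the original example used Fortran-style arrays x(0:99) and y(-1:100); the guard array cond, the use of __builtin_prefetch, and the distance D = 8 are illustrative assumptions).

```c
#define D 8   /* prefetch distance used in the slide's walk-through */

void example(double *x, double *y /* points at element y(0) */, const int *cond)
{
    for (int k = 0; k < 100; k++) {
        if (cond[k]) {                            /* may become a predicate  */
            __builtin_prefetch(&x[k + D], 1, 3);  /* x(k+D), written         */
            __builtin_prefetch(&y[k - 1 + D]);    /* y(k-1+D); prefetches may
                                                     overshoot and do not fault */
            x[k] = y[k - 1] + y[k + 1];
        }
    }
}
```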
Other... • Large number of registers • the memory addresses to prefetch are kept in registers • no need for register spill and fill within loops • The Itanium 2 architecture supports memory access hints • e.g., if a data reference will not be reused, avoid cache pollution - the lfetch 'nta' hint • These features help the compiler do better data-reuse analysis on data movement across loop bodies • and avoid unnecessary prefetches • (streaming example below) MSL, EECS, SNU
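A sketch of a streaming access with no reuse, where a non-temporal hint is appropriate; the __builtin_prefetch locality argument of 0 is used here as a stand-in for the lfetch 'nta' hint, and the distance of 64 elements is an assumption.

```c
#include <stddef.h>

void copy_stream(double *restrict dst, const double *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* read, no temporal reuse: ask that the line not pollute the cache */
        __builtin_prefetch(&src[i + 64], 0, 0);
        dst[i] = src[i];
    }
}
```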
4. Experimental Evaluation 4.1. Methodology • the SPEC OMPM2001 benchmark suite • consists of a set of OpenMP-based application programs • input reference data sets are derived from scientific computations on SMP systems • 11 large application programs • 8 in Fortran, 3 in C • requires a virtual address space of 2GB • larger than SPEC CPU2000 • but can still run in a 32-bit address space MSL, EECS, SNU
Experimental System • SGI Altix 3000 system • a distributed shared memory (DSM) architecture • NUMAflex (NUMA 3) • a global-address-space, cache-coherent multiprocessor • ccNUMA • 32 Intel Itanium 2 1.5GHz processors • each CPU has 16KB I + 16KB D L1 caches • 256KB on-chip L2 cache • 6MB on-chip L3 cache • 256GB memory per 4-CPU module • OS: SGI ProPack v3 • The compiler: Intel C++ and Fortran95 compilers, version 8.1 beta • All experiments use 32 threads mapped onto 32 processors (one thread per CPU) MSL, EECS, SNU
SGI Altix 3000 Block Diagram • [Figure: C-bricks (compute nodes) with Super-Bedrock ASICs connected through R-brick crossbar/router ASICs, 6.4GB/s per link, with direct connections to I/O] MSL, EECS, SNU
4.2. Impact of Software Data-Prefetching • Data-prefetching is enabled in the Intel compiler together with optimizations such as • parallelization, privatization • loop transformations • IPO • scalar replacement • prefetching • software pipelining • The data-prefetching phase runs after all these optimizations • it can benefit from the previous ones, making prefetching more effective • The interaction between these optimizations is very complex MSL, EECS, SNU
Performance gain with software data-prefetching • 314.mgrid_m: almost 100% gain • 6 others: greater than 10% • 332.ammp_m: less than 1% gain • These results are discussed in detail in the following sections MSL, EECS, SNU
4.3. Impact of Prefetching for Loads Only • For applications that are memory bound, loads/stores dominate performance • adding extra prefetches increases the pressure on the memory channel • On an SMP system, avoiding resource contention on the memory system is important • Experiment: issue prefetches only for memory references that are loads • and compare the result with the full prefetching-for-loads-and-stores version MSL, EECS, SNU
Gain/loss with prefetching loads only • Reduced memory bandwidth pressure → performance gain • 312.swim_m and 314.mgrid_m • memory-bandwidth-bound applications • with a lot of streaming data accesses • For programs that are not memory bound, performance loss • due to the memory latency of stores • Not a generally applicable scheme for most applications • The geometric mean: 0.06% MSL, EECS, SNU
4.4. Prefetching for Spatial Locality • When a cache line is filled • it contains a number of elements of an array • For a data reference with spatial locality • only one prefetch instruction is needed every several iterations • and the memory reference then incurs no cache misses (sketch below) MSL, EECS, SNU
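A sketch with assumed sizes: with 128-byte lines and 8-byte elements, one line covers 16 iterations, so unrolling by 16 lets a single prefetch serve the whole group instead of one prefetch per iteration.

```c
#include <stddef.h>

enum { LINE_ELEMS = 16, DIST = 4 * LINE_ELEMS };   /* distance: 4 lines ahead */

double sum(const double *a, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    for (; i + LINE_ELEMS <= n; i += LINE_ELEMS) {
        __builtin_prefetch(&a[i + DIST]);        /* one prefetch per cache line */
        for (size_t j = 0; j < LINE_ELEMS; j++)
            s += a[i + j];
    }
    for (; i < n; i++)                           /* remainder iterations */
        s += a[i];
    return s;
}
```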
Gain/loss with prefetching for references exhibiting spatial locality • Geometric mean: 21.89% performance gain • 332.ammp_m: a slowdown • due to performance-measurement noise (OS thread scheduling) • Compared to Section 4.2, this contributes 73.09% of the total gain • typical loops exhibit spatial locality • the compiler should take advantage of it MSL, EECS, SNU
4.5. Prefetching using Rotating Registers • The Itanium 2 processor has rotating registers • Register rotation provides a hardware renaming mechanism • helps the compiler control prefetching with minimum overhead • A clever scheme for optimizing software data-prefetching • reduces the number of issue slots for prefetch instructions • avoids branch-mispredict penalties • from conditional or predicate computation • avoids the need for loop unrolling • Note: some of the prefetches will be redundant (same cache line) MSL, EECS, SNU
Gain/Loss of Prefetching using Rotating Registers • baseline performance • without using the rotating-register scheme • a conditional statement inside the loop controls the prefetch • may get predicated by the compiler (sketch below) • Geometric mean: 2.71% gain • prefetching using the rotating-registers scheme brings a positive impact MSL, EECS, SNU
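A sketch of that conditional baseline (register rotation itself has no C-level equivalent; LINE_ELEMS and D are assumed values):

```c
void baseline(double *x, const double *y, int n)
{
    enum { LINE_ELEMS = 16, D = 64 };
    for (int k = 0; k < n; k++) {
        if ((k % LINE_ELEMS) == 0)          /* once per cache line; the compiler
                                               may turn this test into a predicate */
            __builtin_prefetch(&y[k + D]);
        x[k] = 2.0 * y[k];
    }
}
```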
4.6. Prefetching for Spatial References with No Predication • Almost all Itanium 2 processor instructions have a qualifying predicate • 64 predicate registers: p0-p63 • Rotating predicate registers • avoid overwriting a predicate value that is still live • control the filling and draining of a software-pipelined loop • To prefetch spatially-local references, the compiler minimizes redundant prefetches and avoids branch-mispredict penalties • Note: the rotating-register technique works only for software-pipelined loops MSL, EECS, SNU
Gain/Loss of prefetching with no predication • 7 out of 11 applications achieved performance gains • the Itanium 2 processor discards redundant prefetch instructions • 324.apsi_m: 13.25% gain • because it is a memory-bound program MSL, EECS, SNU
4.7. Impact of Prefetching References with No Spatial Locality • The overheads of prefetching memory references in loops that exhibit no spatial locality • such references are prefetched in every iteration of the loop • and cannot be optimized using predication or rotating registers • We measure the performance impact of prefetching for such data references (sketch below) MSL, EECS, SNU
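A sketch of such a reference: a large stride touches a new cache line every iteration, so there is no spatial reuse and nothing for predication or rotation to save; STRIDE and DIST are assumed values, and the caller is assumed to provide n * STRIDE elements.

```c
enum { STRIDE = 32, DIST = 8 };   /* 32 doubles = 256 bytes, more than one line */

double strided_sum(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        /* a prefetch must be issued on every iteration; overshoot does not fault */
        __builtin_prefetch(&a[(i + DIST) * STRIDE]);
        s += a[i * STRIDE];
    }
    return s;
}
```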
Gain/loss of prefetching references with no locality • 8 out of 11 applications achieved gains • 320.equake_m: a nice performance boost • with this aggressive prefetching scheme MSL, EECS, SNU
4.8. Impact of Prefetching for Outer Loops • Generally, prefetching is applied to inner loops • where most performance gains come from • Some applications may benefit from issuing prefetches for references that appear in outer loops (sketch below) MSL, EECS, SNU
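A sketch of outer-loop prefetching: while row i is processed, the start of row i+1 is prefetched so the next outer iteration starts warm; the row-major layout and the single-line prefetch are assumptions.

```c
double row_sums(const double *a, int rows, int cols)
{
    double s = 0.0;
    for (int i = 0; i < rows; i++) {
        if (i + 1 < rows)
            __builtin_prefetch(&a[(i + 1) * cols]);  /* next row's first line */
        for (int j = 0; j < cols; j++)
            s += a[i * cols + j];
    }
    return s;
}
```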
Gain/loss with prefetching for outer loops • 7 out of 11 applications get a performance degradation • The geometric mean shows a 0.48% degradation • Enabling prefetching for outer loops brings a minor negative impact MSL, EECS, SNU
4.9. Prefetching Arrays with Indirect Indexing • For an indexed array reference (ex: a[b[i]]), the memory indirection requires a more sophisticated prefetching strategy • If cache misses occur for both the index array and the data array, prefetches have to be issued for both references • The distance for the index array should be larger than the distance for the data array • to ensure that the index-array access itself does not miss the cache • Data speculation support in Itanium 2 is used to load the index array while computing the prefetch address for the data array • any out-of-bounds accesses of the index array are silently ignored (no exception) • Generally this kind of prefetching is useful for applications with irregular memory access patterns (sketch below) MSL, EECS, SNU
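A sketch of the two-level prefetch described above, with assumed distances D_IDX > D_DATA so that b[i + D_DATA] is already cached when it is used to form the data-array prefetch address. Where the Itanium compiler would use a speculative load of b[] and safely run past the end of the array, this plain-C sketch guards the loop bound instead.

```c
double gather_sum(const double *a, const int *b, int n)
{
    enum { D_IDX = 64, D_DATA = 16 };
    double s = 0.0;
    int i = 0;
    for (; i + D_DATA < n; i++) {
        __builtin_prefetch(&b[i + D_IDX]);      /* index array, farther ahead;
                                                   prefetch overshoot does not fault */
        __builtin_prefetch(&a[b[i + D_DATA]]);  /* data array via the index */
        s += a[b[i]];
    }
    for (; i < n; i++)                          /* remainder, no prefetch */
        s += a[b[i]];
    return s;
}
```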
Prefetching memory with indirect array indexing • the impact is minor for the SPEC OMPM2001 suite • this is not the most valuable scheme for SPEC OMPM2001 MSL, EECS, SNU
Concluding Remarks • This is the first paper that studies the impact of software data-prefetching on the SPEC OMPM2001 suite • Software data-prefetching is an effective compiler technique for tolerating long memory latencies without notably increasing the memory traffic • Most of the performance gain comes from • prefetching for memory accesses exhibiting spatial locality • prefetching for array references with no spatial locality • prefetching using rotating registers • It remains to be seen whether more advanced data-prefetching can bring further performance gains on Intel Itanium 2 processor-based SMP systems MSL, EECS, SNU
References • STREAM, http://www.cs.virginia.edu/stream/ • SGI Altix 3000 User’s Guide • Intel Itanium Architecture Software Developer’s Manual - Volume 3: Instruction Set Reference MSL, EECS, SNU