Impact of Compiler-based Data-Prefetching Techniques on SPEC OMP Application Performance 2005-23523 이영준 (Lee Young Joon) MSL, EECS, SNU 2006.06.07.
Contents • Introduction • Intel Compiler Overview • Prefetching for Itanium 2 Processor • Experimental Evaluation • Concluding Remarks • References MSL, EECS, SNU
1. Introduction • The memory wall challenge • the processor-memory speed gap • Remedies • Latency tolerance • software data-prefetching • Latency elimination • long-latency elimination techniques (locality optimizations) MSL, EECS, SNU
In this paper, • Examine the impact of software data-prefetching on SPEC OMP applications • OpenMP application performance on a shared-memory system • OpenMP C/C++ and Fortran 2.0 standards • using the Intel C++ and Fortran compilers • on an SGI Altix 32-way SMP machine built with Itanium 2 processors • Most compiler analyses and optimizations are done before the data-prefetching stage • utilizing the services of an advanced memory-disambiguation module • pointer analysis, address-taken analysis, array dependence analysis, language semantics, and other sources MSL, EECS, SNU
2. Intel Compiler Overview • The Intel Itanium 2 processor has new architectural and micro-architectural features • the Intel Itanium compiler takes advantage of them • EPIC (Explicitly Parallel Instruction Computing) for large amounts of ILP • Control and data speculation • allowing loads to be scheduled across branches or other memory operations • Predication MSL, EECS, SNU
Intel Compiler Features • Supports both automatic optimization and programmer-controlled methods • Advanced compiler technologies • profile-guided multi-file inter-procedural analysis and optimizations • memory disambiguation/optimizations • parallelization • data and loop transformations • global code scheduling • predication, speculation • Users can utilize a multiprocessor • by making small changes to the source code • e.g., OpenMP directives (see the sketch below) MSL, EECS, SNU
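A minimal sketch of the kind of small source change meant above: a single OpenMP directive parallelizes a loop when OpenMP support is enabled in the compiler. The function and its arguments are illustrative, not from the paper.

```c
/* One directive is enough to spread the loop iterations across threads;
 * without OpenMP support enabled, the pragma is simply ignored. */
void saxpy(float a, const float *x, float *y, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```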
Compiler Optimizations • Compiler optimizations in the Intel compiler • Multi-Level Parallelism (MLP) • Instruction-Level Parallelism + Thread-Level Parallelism • Inter-Procedural Optimization (IPO) • points-to analysis (helps memory disambiguation), mod/ref analysis • High-Level Optimization (HLO) • loop transformations (loop fusion, loop tiling, loop unroll-and-jam, loop distribution), software data prefetching, scalar replacement, data transformations • improve data locality and reduce memory access latency • Scalar Optimizations • branch merging, strength reduction, constant propagation, dead code elimination, copy propagation, partial dead store elimination, and partial redundancy elimination (PRE) • Task Queuing Model • to exploit irregular parallelism effectively • extends the scope beyond the standard OpenMP programming model MSL, EECS, SNU
3. Prefetching for Itanium 2 Processor • Software data-prefetching • hides memory access latency • by moving referenced data closer to the CPU • prefetches do not block the instruction stream • and do not raise an exception • Software data prefetching in the Intel compiler takes advantage of Itanium 2 architectural features • predication • rotating registers • data speculation • (a minimal source-level sketch follows) MSL, EECS, SNU
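A minimal sketch of what software data-prefetching looks like at the source level, using the GCC/Clang __builtin_prefetch intrinsic purely for illustration (the Intel compiler inserts the equivalent lfetch instructions itself); PREFETCH_DIST is an assumed distance, not a value from the paper.

```c
#include <stddef.h>

#define PREFETCH_DIST 16   /* assumed distance, in loop iterations */

void scale(double *restrict x, const double *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Non-blocking and non-faulting: overshooting the end of y at the
         * tail of the loop is harmless, matching the slide's description. */
        __builtin_prefetch(&y[i + PREFETCH_DIST], 0, 3);  /* read, keep cached */
        x[i] = 2.0 * y[i];
    }
}
```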
Rotating Registers • enables succinct implementation of software pipelining with predication • rotated by one register position each time one of the special loop branches is executed • after one rotation, the content of register X will be found in register X+1 • r32-r127, f32-f127, p16-p63 (predicate regs) rotate • others do not rotate (static registers) • [Figure: register file — r0-r31 static, r32-r127 rotating (selectable)] MSL, EECS, SNU
Prefetch Principles • Avoid already-loaded data • already in cache • Issue at the right time • early enough that the data is available when needed • late enough that it is not evicted before use • Prefetch distance • estimated based on memory latency, resource requirements, and data-dependence info (see the estimate sketched below) • [Figure: timing of prefetch request vs. cache eviction] MSL, EECS, SNU
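As a first-order illustration (a sketch of the idea, not the Intel compiler's exact heuristic), the distance can be estimated from the memory latency and the loop's iteration time:

```c
/* Prefetch far enough ahead that the line arrives before its use, but not so
 * far ahead that it is evicted before the loop reaches it. */
unsigned prefetch_distance(unsigned mem_latency_cycles,
                           unsigned cycles_per_iteration)
{
    /* ceil(memory latency / loop-iteration time), at least one iteration */
    unsigned d = (mem_latency_cycles + cycles_per_iteration - 1)
                 / cycles_per_iteration;
    return d ? d : 1;
}
```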
Data-locality analysis • Three types of data locality are identified by the Intel compiler • Spatial locality • if data references inside a loop access different memory locations that fall within the same cache line • Temporal locality • if a data reference accesses the same memory location multiple times • Group locality • if different data references access the same cache line • (illustrative loop below) MSL, EECS, SNU
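An assumed loop nest (not from the paper) showing all three locality classes in one statement:

```c
void stencil(int n, double a[n][n], double b[n][n], const double *c)
{
    for (int j = 0; j < n; j++)
        for (int i = 1; i < n - 1; i++)
            /* spatial : a[j][i] walks consecutive elements of row j
             * group   : b[j][i-1] and b[j][i+1] usually share a cache line
             * temporal: c[j] names the same location on every i iteration */
            a[j][i] = b[j][i - 1] + b[j][i + 1] + c[j];
}
```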
An Example of Data-Prefetching • spatial locality: x(0), ..., x(99) and y(-1), ..., y(100) • group locality: y(k-1), y(k+1) - w.r.t. the k loop iterations • the if() statement can be replaced by predication • control dependence -> data dependence • reduces the branch misprediction penalty • If the cache line size is 128B and the array element size is 8B • prefetch distance: D = 16 iterations • calculated by the compiler • Assume k=0 and D=8 • If the array elements x(k+D) and y(k-1+D) are prefetched, array accesses to x(9:15) and y(8:14) will hit the cache • (a C reconstruction follows) MSL, EECS, SNU
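The loop itself is not reproduced on the slide; below is a hedged C reconstruction (the original example used Fortran-style arrays x(0:99) and y(-1:100); the guard array cond, the use of __builtin_prefetch, and the distance D = 8 are illustrative assumptions).

```c
#define D 8   /* prefetch distance used in the slide's walk-through */

void example(double *x, double *y /* points at element y(0) */, const int *cond)
{
    for (int k = 0; k < 100; k++) {
        if (cond[k]) {                            /* may become a predicate  */
            __builtin_prefetch(&x[k + D], 1, 3);  /* x(k+D), written         */
            __builtin_prefetch(&y[k - 1 + D]);    /* y(k-1+D); prefetches may
                                                     overshoot and do not fault */
            x[k] = y[k - 1] + y[k + 1];
        }
    }
}
```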
Other... • Large number of registers • the memory addresses to prefetch are kept in registers • no need for register spill and fill within loops • The Itanium 2 architecture supports memory access hints • e.g., if a data reference will not be reused, avoid cache pollution - the lfetch 'nta' hint • These features help the compiler do better data-reuse analysis on data movement across loop bodies • and avoid unnecessary prefetches • (streaming example below) MSL, EECS, SNU
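A sketch of a streaming access with no reuse, where a non-temporal hint is appropriate; the __builtin_prefetch locality argument of 0 is used here as a stand-in for the lfetch 'nta' hint, and the distance of 64 elements is an assumption.

```c
#include <stddef.h>

void copy_stream(double *restrict dst, const double *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* read, no temporal reuse: ask that the line not pollute the cache */
        __builtin_prefetch(&src[i + 64], 0, 0);
        dst[i] = src[i];
    }
}
```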
4. Experimental Evaluation 4.1. Methodology • the SPEC OMPM2001 benchmark suite • consists of a set of OpenMP-based application programs • input reference data sets are derived from scientific computations on SMP systems • 11 large application programs • 8 in Fortran, 3 in C • requires a virtual address space of 2GB • larger than SPEC CPU2000 • but can still run in a 32-bit address space MSL, EECS, SNU
Experimental System • SGI Altix 3000 system • a distributed shared memory (DSM) architecture • NUMAflex (NUMA 3) • a global-address-space, cache-coherent multiprocessor • ccNUMA • 32 Intel Itanium 2 1.5GHz processors • each CPU has 16KB I + 16KB D L1 caches • 256KB on-chip L2 cache • 6MB on-chip L3 cache • 256GB memory per 4-CPU module • OS: SGI ProPack v3 • The compiler: Intel C++ and Fortran95 compilers, version 8.1 beta • All experiments use 32 threads mapped onto 32 processors (one thread per CPU) MSL, EECS, SNU
SGI Altix 3000 Block Diagram • [Figure: C-bricks (compute nodes) with Super-Bedrock ASICs connected through R-brick crossbar/router ASICs, 6.4GB/s per link, with direct connections to I/O] MSL, EECS, SNU
4.2. Impact of Software Data-Prefetching • Data-prefetching is enabled in the Intel compiler together with optimizations such as • parallelization, privatization • loop transformations • IPO • scalar replacement • prefetching • software pipelining • The data-prefetching phase runs after all these optimizations • it can benefit from the previous ones, making prefetching more effective • The interaction between these optimizations is very complex MSL, EECS, SNU
Performance gain with software data-prefetching • 314.mgrid_m: almost 100% gain • 6 others: greater than 10% • 332.ammp_m: less than 1% gain • These results are discussed in detail in the following sections MSL, EECS, SNU
4.3. Impact of Prefetching for Loads Only • For applications that are memory bound, loads/stores dominate performance • adding extra prefetches increases the pressure on the memory channel • On an SMP system, avoiding resource contention on the memory system is important • Experiment: issue prefetches only for memory references that are loads • and compare the result with the full prefetching-for-loads-and-stores version MSL, EECS, SNU
Gain/loss with prefetching loads only • Reduced memory bandwidth pressure → performance gain • 312.swim_m and 314.mgrid_m • memory-bandwidth-bound applications • with a lot of streaming data accesses • For programs that are not memory bound, performance loss • due to the memory latency of stores • Not a generally applicable scheme for most applications • The geometric mean: 0.06% MSL, EECS, SNU
4.4. Prefetching for Spatial Locality • When a cache line is filled • it contains a number of elements of an array • For a data reference with spatial locality • only one prefetch instruction is needed every several iterations • and the memory reference then incurs no cache misses (sketch below) MSL, EECS, SNU
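A sketch with assumed sizes: with 128-byte lines and 8-byte elements, one line covers 16 iterations, so unrolling by 16 lets a single prefetch serve the whole group instead of one prefetch per iteration.

```c
#include <stddef.h>

enum { LINE_ELEMS = 16, DIST = 4 * LINE_ELEMS };   /* distance: 4 lines ahead */

double sum(const double *a, size_t n)
{
    double s = 0.0;
    size_t i = 0;
    for (; i + LINE_ELEMS <= n; i += LINE_ELEMS) {
        __builtin_prefetch(&a[i + DIST]);        /* one prefetch per cache line */
        for (size_t j = 0; j < LINE_ELEMS; j++)
            s += a[i + j];
    }
    for (; i < n; i++)                           /* remainder iterations */
        s += a[i];
    return s;
}
```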
Gain/loss with prefetching for references exhibiting spatial locality • Geometric mean: 21.89% performance gain • 332.ammp_m: a slowdown • due to performance-measurement noise (OS thread scheduling) • Compared to Section 4.2, this contributes 73.09% of the total gain • typical loops exhibit spatial locality • the compiler should take advantage of it MSL, EECS, SNU
4.5. Prefetching using Rotating Registers • The Itanium 2 processor has rotating registers • Register rotation provides a hardware renaming mechanism • helps the compiler control prefetching with minimum overhead • A clever scheme for optimizing software data-prefetching • reduces the number of issue slots for prefetch instructions • avoids branch-mispredict penalties • from conditional or predicate computation • avoids the need for loop unrolling • Note: some of the prefetches will be redundant (same cache line) MSL, EECS, SNU
Gain/Loss of Prefetching using Rotating Registers • baseline performance • without using the rotating-register scheme • a conditional statement inside the loop controls the prefetch • may get predicated by the compiler (sketch below) • Geometric mean: 2.71% gain • prefetching using the rotating-registers scheme brings a positive impact MSL, EECS, SNU
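A sketch of that conditional baseline (register rotation itself has no C-level equivalent; LINE_ELEMS and D are assumed values):

```c
void baseline(double *x, const double *y, int n)
{
    enum { LINE_ELEMS = 16, D = 64 };
    for (int k = 0; k < n; k++) {
        if ((k % LINE_ELEMS) == 0)          /* once per cache line; the compiler
                                               may turn this test into a predicate */
            __builtin_prefetch(&y[k + D]);
        x[k] = 2.0 * y[k];
    }
}
```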
4.6. Prefetching for Spatial References with No Predication • Almost all Itanium 2 processor instructions have a qualifying predicate • 64 predicate registers: p0-p63 • Rotating predicate registers • avoid overwriting a predicate value that is still live • control the filling and draining of a software-pipelined loop • To prefetch spatially-local references, the compiler minimizes redundant prefetches and avoids branch-mispredict penalties • Note: the rotating-register technique works only for software-pipelined loops MSL, EECS, SNU
Gain/Loss of prefetching with no predication • 7 out of 11 applications achieved performance gains • the Itanium 2 processor discards redundant prefetch instructions • 324.apsi_m: 13.25% gain • because it is a memory-bound program MSL, EECS, SNU
4.7. Impact of Prefetching References with No Spatial Locality • The overheads of prefetching memory references in loops that exhibit no spatial locality • such references are prefetched in every iteration of the loop • and cannot be optimized using predication or rotating registers • We measure the performance impact of prefetching for such data references (sketch below) MSL, EECS, SNU
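A sketch of such a reference: a large stride touches a new cache line every iteration, so there is no spatial reuse and nothing for predication or rotation to save; STRIDE and DIST are assumed values, and the caller is assumed to provide n * STRIDE elements.

```c
enum { STRIDE = 32, DIST = 8 };   /* 32 doubles = 256 bytes, more than one line */

double strided_sum(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        /* a prefetch must be issued on every iteration; overshoot does not fault */
        __builtin_prefetch(&a[(i + DIST) * STRIDE]);
        s += a[i * STRIDE];
    }
    return s;
}
```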
Gain/loss of prefetching references with no locality • 8 out of 11 applications achieved gains • 320.equake_m: a nice performance boost • with this aggressive prefetching scheme MSL, EECS, SNU
4.8. Impact of Prefetching for Outer Loops • Generally, prefetching is applied to inner loops • where most performance gains come from • Some applications may benefit from issuing prefetches for references that appear in outer loops (sketch below) MSL, EECS, SNU
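A sketch of outer-loop prefetching: while row i is processed, the start of row i+1 is prefetched so the next outer iteration starts warm; the row-major layout and the single-line prefetch are assumptions.

```c
double row_sums(const double *a, int rows, int cols)
{
    double s = 0.0;
    for (int i = 0; i < rows; i++) {
        if (i + 1 < rows)
            __builtin_prefetch(&a[(i + 1) * cols]);  /* next row's first line */
        for (int j = 0; j < cols; j++)
            s += a[i * cols + j];
    }
    return s;
}
```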
Gain/loss with prefetching for outer loops • 7 out of 11 applications get a performance degradation • The geometric mean shows a 0.48% degradation • Enabling prefetching for outer loops brings a minor negative impact MSL, EECS, SNU
4.9. Prefetching Arrays with Indirect Indexing • For an indexed array reference (ex: a[b[i]]), the memory indirection requires a more sophisticated prefetching strategy • If cache misses occur for both the index array and the data array, prefetches have to be issued for both references • The distance for the index array should be larger than the distance for the data array • to ensure that the index-array access itself does not miss the cache • Data speculation support in Itanium 2 is used to load the index array while computing the prefetch address for the data array • any out-of-bounds accesses of the index array are silently ignored (no exception) • Generally this kind of prefetching is useful for applications with irregular memory access patterns (sketch below) MSL, EECS, SNU
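A sketch of the two-level prefetch described above, with assumed distances D_IDX > D_DATA so that b[i + D_DATA] is already cached when it is used to form the data-array prefetch address. Where the Itanium compiler would use a speculative load of b[] and safely run past the end of the array, this plain-C sketch guards the loop bound instead.

```c
double gather_sum(const double *a, const int *b, int n)
{
    enum { D_IDX = 64, D_DATA = 16 };
    double s = 0.0;
    int i = 0;
    for (; i + D_DATA < n; i++) {
        __builtin_prefetch(&b[i + D_IDX]);      /* index array, farther ahead;
                                                   prefetch overshoot does not fault */
        __builtin_prefetch(&a[b[i + D_DATA]]);  /* data array via the index */
        s += a[b[i]];
    }
    for (; i < n; i++)                          /* remainder, no prefetch */
        s += a[b[i]];
    return s;
}
```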
Prefetching memory with indirect array indexing • the impact is minor for the SPEC OMPM2001 suite • this is not the most valuable scheme for SPEC OMPM2001 MSL, EECS, SNU
Concluding Remarks • This is the first paper that studies the impact of software data-prefetching on the SPEC OMPM2001 suite • Software data-prefetching is an effective compiler technique for tolerating long memory latencies without notably increasing the memory traffic • Most of the performance gain comes from • prefetching for memory accesses exhibiting spatial locality • prefetching for array references with no spatial locality • prefetching using rotating registers • It remains to be seen whether more advanced data-prefetching can bring further performance gains on Intel Itanium 2 processor-based SMP systems MSL, EECS, SNU
References • STREAM, http://www.cs.virginia.edu/stream/ • SGI Altix 3000 User’s Guide • Intel Itanium Architecture Software Developer’s Manual - Volume 3: Instruction Set Reference MSL, EECS, SNU