
Impact of Compiler-based Data-Prefetching Techniques on SPEC OMP Application Performance



  1. Impact of Compiler-based Data-Prefetching Techniques on SPEC OMP Application Performance
     2005-23523 Lee Young Joon (이영준), MSL, EECS, SNU
     2006.06.07

  2. Contents
     • Introduction
     • Intel Compiler Overview
     • Prefetching for Itanium 2 Processor
     • Experimental Evaluations
     • Concluding Remarks
     • References

  3. 1. Introduction
     • The memory wall challenge
       • the processor-memory speed gap
     • Remedy
       • Latency tolerance
         • software data-prefetching
       • Latency elimination
         • long-latency elimination techniques (locality optimizations)

  4. In this paper,
     • Examine the impact of software data-prefetching on SPEC OMP applications
       • OpenMP application performance on a shared-memory system
       • OpenMP C/C++ and Fortran 2.0 standards
       • using the Intel C++ and Fortran compilers
       • on an SGI Altix 32-way SMP machine built with Itanium 2 processors
     • Most of the compiler analyses and optimizations are performed before the data-prefetching stage
       • utilizing the services of an advanced memory disambiguation module
         • pointer analysis, address-taken analysis, array dependence analysis, language semantics, and other sources

  5. 2. Intel Compiler Overview
     • The Intel Itanium 2 processor has new architectural and micro-architectural features
     • The Intel Itanium compiler takes advantage of them
       • EPIC (Explicitly Parallel Instruction Computing) for large amounts of ILP
       • Control and data speculation
         • allows loads to be scheduled across branches or other memory operations
       • Predication

  6. Intel Compiler Features
     • Supports both automatic optimization and programmer-controlled methods
     • Advanced compiler technologies
       • profile-guided, multi-file inter-procedural analysis and optimizations
       • memory disambiguation/optimizations
       • parallelization
       • data and loop transformations
       • global code scheduling
       • predication, speculation
     • Users can utilize multiprocessors by making small changes to the source code
       • e.g., OpenMP directives (see the sketch below)
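
  As a minimal illustration (not taken from the paper or the benchmarks), a single
  OpenMP directive is often the only source change needed to let the compiler
  distribute a loop across threads:

      /* Minimal OpenMP-in-C sketch: one pragma parallelizes the loop. */
      #include <stdio.h>

      #define N 1000000

      int main(void)
      {
          static double a[N], b[N];

          for (int i = 0; i < N; i++)
              b[i] = (double)i;

          /* The only change to serial code: distribute iterations across threads. */
          #pragma omp parallel for
          for (int i = 0; i < N; i++)
              a[i] = 2.0 * b[i];

          printf("a[N-1] = %f\n", a[N - 1]);
          return 0;
      }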

  7. Compiler Optimizations
     • Compiler optimizations in the Intel compiler
       • Multi-Level Parallelism (MLP)
         • instruction-level parallelism + thread-level parallelism
       • Inter-Procedural Optimization (IPO)
         • points-to analysis (helps memory disambiguation), mod/ref analysis
       • High-Level Optimization (HLO)
         • loop transformations (loop fusion, loop tiling, loop unroll-and-jam, loop distribution), software data prefetching, scalar replacement (see the sketch after this slide), data transformations
         • improve data locality and reduce memory access latency
       • Scalar Optimizations
         • branch merging, strength reduction, constant propagation, dead code elimination, copy propagation, partial dead store elimination, and partial redundancy elimination (PRE)
       • Task Queuing Model
         • to exploit irregular parallelism effectively
         • extends the scope beyond the standard OpenMP programming model
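
  As a generic illustration of one HLO pass (scalar replacement), not the Intel
  compiler's actual output: a value that is reloaded from and stored to memory on
  every iteration is kept in a register-resident scalar instead.

      /* Before scalar replacement: a[j] is loaded and stored in every iteration. */
      void sum_rows(double *a, const double *b, int n, int j)
      {
          for (int i = 0; i < n; i++)
              a[j] += b[i];
      }

      /* After scalar replacement (legal only when a[j] cannot alias b[]):
       * the accumulator lives in a register, memory is touched once. */
      void sum_rows_sr(double *a, const double *b, int n, int j)
      {
          double t = a[j];
          for (int i = 0; i < n; i++)
              t += b[i];
          a[j] = t;
      }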

  8. 3. Prefetching for Itanium 2 Processor
     • Software data-prefetching
       • hides memory access latency by moving referenced data closer to the CPU
       • does not block the instruction stream
       • does not raise an exception
     • Software data-prefetching in the Intel compiler takes advantage of Itanium 2 architectural features
       • predication
       • rotating registers
       • data speculation

  9. Rotating Registers
     • enable a succinct implementation of software pipelining with predication
     • rotated by one register position each time one of the special loop branches is executed
       • after one rotation, the content of register X will be found in register X+1
     • r32-r127, f32-f127, and p16-p63 (predicate registers) rotate; the others do not rotate (static registers)
     [figure: register file diagram - r0-r31 not rotating (static), r32-r127 rotating (selectable)]

  10. Prefetch Principles
     • Avoid prefetching data that is already loaded
       • i.e., already in the cache
     • Issue prefetches at the right time
       • early enough that the data is available when needed
       • late enough that it is not evicted before use
     • Prefetch distance (see the sketch below)
       • estimated from memory latency, resource requirements, and data-dependence information
     [figure: timeline between prefetch request and cache eviction]
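
  A commonly used heuristic for the prefetch distance, shown here only as an
  illustrative assumption (the slide does not give the Intel compiler's exact
  formula): the memory latency divided by the estimated cycles per loop
  iteration, rounded up.

      /* Illustrative heuristic: issue the prefetch D iterations ahead so the
       * cache line arrives just in time, without lingering long enough to be
       * evicted. */
      static int prefetch_distance(int mem_latency_cycles, int cycles_per_iteration)
      {
          /* ceil(latency / work per iteration) */
          return (mem_latency_cycles + cycles_per_iteration - 1) / cycles_per_iteration;
      }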

  11. Data-locality Analysis
     • Three types of data locality are identified in the Intel compiler (illustrated in the sketch below)
       • Spatial locality
         • data references inside a loop access different memory locations that fall within the same cache line
       • Temporal locality
         • a data reference accesses the same memory location multiple times
       • Group locality
         • different data references access the same cache line
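
  A small made-up loop (not from the paper) in which all three locality types
  appear; the comments name the type each reference exhibits with respect to the
  inner (j) loop.

      void locality_demo(double *a, const double *b, const double *c, int n, int m)
      {
          for (int i = 0; i < n; i++) {
              for (int j = 1; j < m - 1; j++) {
                  /* a[i*m + j]     : spatial locality  - consecutive j share a cache line */
                  /* c[i]           : temporal locality - same location reused for every j */
                  /* b[j-1], b[j+1] : group locality    - different references, same line  */
                  a[i * m + j] = c[i] * (b[j - 1] + b[j + 1]);
              }
          }
      }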

  12. An Example of Data-Prefetching
     • spatial locality: x(0), ..., x(99) and y(-1), ..., y(100)
     • group locality: y(k-1) and y(k+1), w.r.t. the k loop iterations
     • the if() statement guarding the prefetch can be replaced by predication
       • control dependence -> data dependence
       • reduces the branch misprediction penalty
     • if the cache line size is 128B and the array element size is 8B
       • prefetch distance: D = 16 iterations, calculated by the compiler
     • assume k=0 and D=8
       • if the array elements x(k+D) and y(k-1+D) are prefetched, array accesses to x(9:15) and y(8:14) will hit the cache
     (a C sketch of a loop with this access pattern follows this slide)
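
  A C sketch of a loop with the same access pattern; the slide does not show the
  source, so the loop body is an assumption, and GCC-style __builtin_prefetch is
  used as a stand-in for the lfetch instructions the Intel compiler would emit.

      #define LINE_ELEMS 16   /* 128-byte cache line / 8-byte elements (slide's assumption) */
      #define D          16   /* prefetch distance in iterations                            */

      /* Assumed loop shape: x[k] built from y[k-1] and y[k+1] (spatial + group locality).
       * y is assumed valid from index -1, mirroring the slide's y(-1:100); the
       * prefetches may run past the end of the arrays, which is harmless since
       * prefetch instructions do not fault. */
      void example(double *x, const double *y, int n)
      {
          for (int k = 0; k < n; k++) {
              if (k % LINE_ELEMS == 0) {                   /* candidate for predication */
                  __builtin_prefetch(&x[k + D], 1, 3);     /* prefetch for write        */
                  __builtin_prefetch(&y[k - 1 + D], 0, 3); /* prefetch for read         */
              }
              x[k] = y[k - 1] + y[k + 1];
          }
      }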

  13. Other Features
     • Large number of registers
       • the memory addresses used for prefetching can be kept in registers
       • no need for register spills and fills within loops
     • The Itanium 2 architecture supports memory access hints
       • e.g., if a data reference will not be reused, cache pollution can be avoided with the lfetch 'nta' hint (see the sketch below)
     • These features help the compiler perform better data-reuse analysis on data movement across loop bodies
       • unnecessary prefetches can be avoided
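
  A hedged sketch of how a non-temporal hint can be expressed from C; GCC-style
  __builtin_prefetch with a locality argument of 0 is used here only as an
  approximation of the Itanium 2 lfetch.nta instruction that the compiler emits.

      /* Streaming copy: the source data is not reused, so hint the hardware not
       * to keep it in the caches (locality 0 ~ non-temporal, akin to lfetch.nta). */
      void stream_copy(double *dst, const double *src, int n)
      {
          for (int i = 0; i < n; i++) {
              if (i % 16 == 0)
                  __builtin_prefetch(&src[i + 64], 0 /* read */, 0 /* no reuse */);
              dst[i] = src[i];
          }
      }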

  14. 4. Experimental Evaluation - 4.1. Methodology
     • The SPEC OMPM2001 benchmark suite
       • consists of a set of OpenMP-based application programs
       • input reference data sets are derived from scientific computations on SMP systems
       • 11 large application programs: 8 in Fortran, 3 in C
       • requires a virtual address space of 2GB
         • larger than SPEC CPU2000, yet still able to run in a 32-bit address space

  15. Experimental System
     • SGI Altix 3000 system
       • a distributed shared memory (DSM) architecture: NUMAflex (NUMA 3)
       • a global-address-space, cache-coherent multiprocessor (ccNUMA)
       • 32 Intel Itanium 2 1.5GHz processors
         • each CPU has a 16KB I + 16KB D L1 cache, a 256KB on-chip L2 cache, and a 6MB on-chip L3 cache
       • 256GB memory per 4-CPU module
       • OS: SGI ProPack v3
     • Compiler: Intel C++ and Fortran95 compilers, version 8.1 beta
     • All experiments use 32 threads mapped onto 32 processors (one thread per CPU)

  16. SGI Altix 3000 Block Diagram
     [figure: C-brick (a.k.a. compute node) with Super-Bedrock ASIC and a direct connection to I/O; R-brick (crossbar switch) with Router ASIC; 6.4GB/s per link]

  17. 4.2. Impact of Software Data-Prefetching
     • Data-prefetching is enabled in the Intel compiler together with optimizations such as
       • parallelization, privatization
       • loop transformations
       • IPO
       • scalar replacement
       • prefetching
       • software pipelining
     • The data-prefetching phase runs after all of these optimizations
       • it can benefit from the previous ones, making prefetching more effective
     • The interaction between these optimizations is very complex

  18. Performance gain with software data-prefetching
     • 314.mgrid_m: almost 100% gain
     • 6 others: greater than 10%
     • 332.ammp_m: less than 1% gain
     • These results are discussed in detail in the following sections

  19. 4.3. Impact of Prefetching for Loads Only
     • For applications that are memory bound, load/store behavior dominates performance
       • adding extra prefetches increases the pressure on the memory channel
     • On an SMP system, avoiding resource contention on the memory system is important
     • Experiment: issue prefetches only for memory references that are loads
       • and compare the result with the full prefetching-for-loads-and-stores version

  20. Gain/loss with prefetching loads only
     • Reduced memory bandwidth pressure -> performance gain
       • 312.swim_m and 314.mgrid_m
         • memory-bandwidth-bound applications with a lot of streaming data accesses
     • For programs that are not memory bound, performance loss
       • due to the memory latency of stores
     • Not a generally applicable scheme for most applications
       • geometric mean: 0.06%

  21. 4.4. Prefetching for Spatial Locality
     • When a cache line is filled, it contains a number of elements of an array
     • For a data reference with spatial locality
       • only one prefetch instruction is needed every several iterations
       • no cache misses occur for this memory reference

  22. Gain/loss with prefetching for references exhibiting spatial locality
     • Geometric mean: 21.89% performance gain
     • 332.ammp_m: a slowdown
       • due to noise in the performance measurement (OS thread scheduling)
     • Compared to Section 4.2, this contributes 73.09% of the total gain
       • typical loops exhibit spatial locality
       • the compiler should take advantage of it

  23. 4.5. Prefetching using Rotating Registers
     • The Itanium 2 processor has rotating registers
       • register rotation provides a hardware renaming mechanism
       • it helps the compiler control prefetching with minimal overhead
     • A clever scheme for optimizing software data-prefetching
       • reduces the number of issue slots for prefetch instructions
       • avoids branch misprediction penalties from conditional or predicate computation
       • avoids the need for loop unrolling
     • Note: some of the prefetches will be redundant (same cache line)

  24. Gain/Loss of Prefetching using Rotating Registers
     • Baseline performance
       • without using the rotating-register scheme
       • using a conditional statement inside the loop (a sketch of this baseline follows this slide)
         • which may get predicated by the compiler
     • Geometric mean: 2.71% gain
       • prefetching using the rotating-register scheme brings a positive impact
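
  The rotating-register scheme itself lives at the Itanium assembly level; as a
  C-level approximation (an assumption for illustration only), the baseline
  described above guards the prefetch with a conditional that fires once per
  cache line.

      /* Baseline scheme sketched in C: one prefetch per 16-element cache line,
       * guarded by a conditional the compiler may turn into a predicated
       * instruction. The rotating-register scheme removes this conditional by
       * cycling prefetch addresses through rotating registers as the
       * software-pipelined loop executes. */
      double baseline_sum(const double *a, int n)
      {
          const int D = 16;                      /* prefetch distance (iterations) */
          double sum = 0.0;

          for (int i = 0; i < n; i++) {
              if ((i & 15) == 0)                 /* once per 128-byte line */
                  __builtin_prefetch(&a[i + D], 0, 3);
              sum += a[i];
          }
          return sum;
      }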

  25. 4.6. Prefetching for Spatial References with No Predication
     • Almost all Itanium 2 processor instructions have a qualifying predicate
       • 64 predicate registers: p0-p63
     • Rotating predicate registers
       • avoid overwriting a predicate value that is still live
       • control the filling and draining of a software-pipelined loop
     • To prefetch spatially-local references, the compiler minimizes redundant prefetches and avoids branch misprediction penalties
     • Note: the rotating-register technique works only for software-pipelined loops

  26. Gain/Loss of prefetching with no predication
     • 7 out of 11 applications achieved performance gains
       • the Itanium 2 processor discards redundant prefetch instructions
     • 324.apsi_m: 13.25% gain
       • because it is a memory-bound program

  27. 4.7. Impact of Prefetching References with No Spatial Locality
     • Prefetching memory references in loops that exhibit no spatial locality has overheads
       • such references are prefetched in every iteration of the loop
       • the prefetches cannot be optimized using predication or rotating registers
     • We measure the performance impact of prefetching for such data references

  28. Gain/loss of prefetching references with no locality
     • 8 out of 11 applications achieved a gain
     • 320.equake_m: a nice performance boost with this aggressive prefetching scheme

  29. 4.8. Impact of Prefetching for Outer Loops
     • Generally, prefetching is applied to inner loops
       • that is where most of the performance gains come from
     • Some applications may benefit from issuing prefetches for references that appear in outer loops

  30. Gain/loss with prefetching for outer loops
     • 7 out of 11 applications get a performance degradation
     • The geometric mean is a 0.48% degradation
     • Enabling prefetching for outer loops brings minor negative impacts

  31. 4.9. Prefetching Arrays with Indirect Indexing
     • For an indexed array reference (e.g., a[b[i]]), the memory indirection requires a sophisticated prefetching strategy
     • If cache misses occur for both the index array and the data array, prefetches have to be issued for both references
       • the distance for the index array should be larger than the distance for the data array
         • to ensure that the index array does not encounter a cache miss when the data-array address is computed
     • Data speculation support in Itanium 2 is used to load the index array while computing the address for the data array
       • any out-of-bounds accesses of the index array are silently ignored (no exception)
     • Generally, this kind of prefetching is useful for applications with irregular memory access patterns
     (a C sketch of this two-level scheme follows this slide)
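
  A hedged C sketch of the two-level scheme (illustrative only; the real compiler
  emits speculative loads and lfetch instructions rather than these builtins):
  the index array is prefetched further ahead than the data array so its value is
  resident when the data-array address is formed.

      /* Two-level prefetching for a[b[i]]. D_IDX > D_DATA: the index value must
       * arrive first, because it is needed to form the data-array address. */
      #define D_DATA 16
      #define D_IDX  32

      double gather_sum(const double *a, const int *b, int n)
      {
          double sum = 0.0;
          for (int i = 0; i < n; i++) {
              __builtin_prefetch(&b[i + D_IDX], 0, 3);         /* index array, far ahead */
              if (i + D_DATA < n)                              /* guard the b[] load; on  */
                  __builtin_prefetch(&a[b[i + D_DATA]], 0, 3); /* Itanium, a speculative  */
                                                               /* ld.s removes this check */
              sum += a[b[i]];
          }
          return sum;
      }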

  32. Prefetching memory with indirect array indexing
     • The impact is minor for the SPEC OMPM2001 suite
     • This is not the most valuable scheme for SPEC OMPM2001

  33. Concluding Remarks
     • This is the first paper that studies the impact of software data-prefetching on SPEC OMPM2001
     • Software data-prefetching is an effective compiler technique for tolerating long memory latencies without notably increasing memory traffic
     • Most of the performance gain comes from
       • prefetching for memory accesses exhibiting spatial locality
       • prefetching for array references with no spatial locality
       • prefetching using rotating registers
     • It remains to be seen whether more advanced data-prefetching can bring further performance gains on Itanium 2 based SMP systems

  34. References
     • STREAM, http://www.cs.virginia.edu/stream/
     • SGI Altix 3000 User's Guide
     • Intel Itanium Architecture Software Developer's Manual, Volume 3: Instruction Set Reference
