Alias Speculation using Atomic Regions (To appear at ASPLOS 2013)
Wonsun Ahn*, Yuelu Duan, Josep Torrellas
University of Illinois at Urbana-Champaign

Disclaimer
• This talk is not about parallelism.
• This talk is about decreasing the amount of work that needs to be done through better code generation.
• We want to do this by making the software-hardware barrier more porous.
[Diagram labels: Compiler, Hardware, Assumptions, Information]

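What prevents good code generation?
• Many popular optimizations require code motion
  • Loop Invariant Code Motion (LICM): from the body to the preheader of a loop
  • Redundancy elimination: from the location of the redundant computation to the first computation
• Memory aliasing prevents code motion

Redundancy elimination with no intervening store (every step is legal):
    1. r1 = a + b;  …;  c = a + b
    2. r1 = a + b;  …;  r2 = a + b;  c = r2
    3. r1 = a + b;  r2 = a + b;  …;  c = r2
    4. r1 = a + b;  r2 = r1;  …;  c = r2

Same code with an intervening store through a pointer:
    1. r1 = a + b;  *p = …;  c = a + b
    2. r1 = a + b;  *p = …;  r2 = a + b;  c = r2
    3. r1 = a + b;  r2 = a + b;  *p = …;  c = r2
Step 3 moves the computation above the store, which is not allowed if *p may alias a or b.
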
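The effect of the intervening store can be reproduced with a small model. This is only a sketch: memory is modelled as a Python list, and `a`, `b`, `p` are indices standing in for addresses; the function names are made up for this illustration.

```python
def original(mem, a, b, p):
    """r1 = a + b; *p = 0; c = a + b (recomputed after the store)."""
    r1 = mem[a] + mem[b]
    mem[p] = 0                 # store through the pointer
    c = mem[a] + mem[b]        # sees the store if p aliases a or b
    return c

def eliminated(mem, a, b, p):
    """Same code after redundancy elimination: c reuses r1."""
    r1 = mem[a] + mem[b]
    r2 = r1                    # computation moved above the store
    mem[p] = 0
    return r2

# No alias: both versions agree.
no_alias = (original([3, 4, 5], 0, 1, 2), eliminated([3, 4, 5], 0, 1, 2))
# p aliases a: the optimized version returns a stale value.
with_alias = (original([3, 4, 5], 0, 1, 0), eliminated([3, 4, 5], 0, 1, 0))
```

When `p` equals `a`, the two versions disagree, which is exactly the case a compiler must rule out (or check at run time) before moving the code.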
Alias Analysis is Difficult
• Alias analysis returns one of three results
  • Must-Alias, No-Alias, May-Alias
• Accurate static analysis is fundamentally difficult
  • Requires points-to analysis, heap modeling, etc.
  • Quickly becomes intractable in space/time complexity
• Alternative: insert runtime checks
  • Software checks
  • Hardware checks (e.g. Itanium ALAT, Transmeta)
• We propose to leverage atomic regions to do runtime checks and automatic recovery

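For comparison, a software runtime check can be sketched as an explicit address comparison guarding the optimized code, with the unoptimized code as the fallback. This is an illustration only, not the mechanism proposed in the talk; memory is a Python list and `b`, `p` are indices standing in for addresses.

```python
def loop_original(mem, b, p, n):
    """for i in 0..n-1: *p = i; total += b + 10"""
    total = 0
    for i in range(n):
        mem[p] = i
        total += mem[b] + 10   # invariant only if p does not alias b
    return total

def loop_checked(mem, b, p, n):
    """LICM guarded by a software alias check."""
    if p != b:                 # the runtime check
        r = mem[b] + 10        # hoisted out of the loop
        total = 0
        for i in range(n):
            mem[p] = i
            total += r
        return total
    return loop_original(mem, b, p, n)   # fall back when the addresses alias
```

The check costs an extra compare and branch per guarded pointer pair, which is the overhead that hardware-assisted checking aims to avoid.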
Background: Atomic Regions (aka Transactions)
• Sections of code demarcated in software that are either committed atomically on success or rolled back on failure
• Atomic regions are here and now:
  • Intel TSX, AMD ASF, IBM Blue Gene/Q, IBM Power
  • Originally intended to ease parallel programming, but again, that is not what this talk is about
• They do two things well that software finds difficult
  • Checkpointing: to guarantee atomic commit of transaction
    • Exposed to software through begin atomic, end atomic
  • Memory alias detection: to guarantee isolation of transaction
    • Hidden from software

Proposal: Leverage Atomic Regions for Alias Speculation
• Expose alias checking HW to SW through ISA extensions
• Use HW support for Atomic Regions to perform alias speculation in a compiler for optimizations
  • Cover path of code motion in an Atomic Region
  • Speculate that may-aliases in the code motion path are no-aliases
  • Check speculated aliases using alias checking HW
  • Recover from failure by rolling back to checkpoint
• Apply this to optimizations such as:
  • Loop Invariant Code Motion (LICM)
  • Partial Redundancy Elimination (PRE)
  • Global Value Numbering (GVN)

Modifications to Atomic Regions
• Key insight
  • Atomic regions maintain a read set and a write set
    • Speculative Read (SR), Speculative Written (SW) bits in speculative cache
  • Only SW bits are needed for checkpointing
• Repurpose SR bits to mark certain load locations for monitoring alias speculation failures
  • Do not mark SR bits for regular loads
• Add ISA extensions to manipulate and check SR and SW bits to do alias checks

Extensions to the ISA (for Checkpointing) [already supported]
• begin_atomic PC / end_atomic / abort_atomic
  • Starts / ends / aborts atomic region
  • PC is the address of the Safe-Version of atomic region
    • Atomic region code without speculative optimizations
  • abort_atomic jumps to Safe-Version after rollback

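The checkpointing behaviour can be pictured with a toy model, assuming speculative writes are buffered until commit. The names `run_region`, `speculative` and `safe` are hypothetical; `safe` plays the role of the Safe-Version that PC points to.

```python
class Abort(Exception):
    """Stands in for abort_atomic."""

def run_region(mem, speculative, safe):
    """begin_atomic / end_atomic with a Safe-Version fallback (toy model)."""
    buf = {}                      # speculative writes, invisible until commit
    try:
        speculative(mem, buf)     # body of the atomic region
    except Abort:
        safe(mem)                 # rollback: buf is dropped, Safe-Version runs
        return "aborted"
    mem.update(buf)               # end_atomic: commit all writes at once
    return "committed"

def spec_ok(mem, buf):
    buf["x"] = mem["x"] + 1

def spec_fail(mem, buf):
    buf["x"] = 99                 # never becomes visible
    raise Abort()

def safe_version(mem):
    mem["x"] = mem["x"] + 1
```

Running `run_region(mem, spec_fail, safe_version)` leaves `mem` exactly as the Safe-Version alone would have left it, because the buffered write of 99 is discarded.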
Extensions to the ISA (for Alias Checking) [newly added]
• load.add.sr r1, addr
  • Loads location addr to r1 just like a regular load
  • Marks SR bit in cache line containing addr
  • Used for marking monitored loads
• clear.sr addr
  • Clears SR bit in cache line containing addr
  • Used to mark end of load monitoring
• store.chk.(sr / sw / srsw) addr, r1
  • Stores r1 to location addr just like a regular store
  • sr: If SR bit is set, atomic region is aborted
  • sw: If SW bit is set, atomic region is aborted
  • srsw: If either the SR bit or the SW bit is set, atomic region is aborted

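A minimal sketch of the bit manipulation these instructions perform, tracking only the SR and SW bits per cache line (no data). The class and method names are invented for this illustration, and the 64-byte line size is taken from the experimental setup later in the talk.

```python
LINE_BYTES = 64  # line size from the experimental setup

class Abort(Exception):
    """Raised when an alias check fails; the region would roll back."""

class SpecCache:
    def __init__(self):
        self.sr = set()   # cache lines with the SR bit set
        self.sw = set()   # cache lines with the SW bit set

    @staticmethod
    def line(addr):
        return addr // LINE_BYTES

    def load_add_sr(self, addr):          # load.add.sr
        self.sr.add(self.line(addr))

    def clear_sr(self, addr):             # clear.sr
        self.sr.discard(self.line(addr))

    def store(self, addr):                # regular store inside the region
        self.sw.add(self.line(addr))

    def store_chk(self, addr, mode):      # store.chk.(sr / sw / srsw)
        ln = self.line(addr)
        if "sr" in mode and ln in self.sr:
            raise Abort()
        if "sw" in mode and ln in self.sw:
            raise Abort()
        self.store(addr)

def aborts(action):
    """Helper: True if the action triggers an abort."""
    try:
        action()
    except Abort:
        return True
    return False
```

Because the bits live on cache lines, a checked store to a different word in a monitored line also aborts in this model: after `load_add_sr(0x100)`, a `store_chk(0x120, "sr")` fails since both addresses fall in the same 64-byte line.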
How are these Instructions Used?
• Instrumentation goals
  • Minimize alias checking instruction overhead
  • Allow alias checks on a subset of accesses in AR
    • A single AR can enable multiple optimizations
    • Each code motion involves only a subset of accesses
• Two cases of code motion that involve alias checks
  • Moving (hoisting) loads
  • Moving (sinking) stores

Code Motion 1: Hoisting Loads

Original:
    begin_atomic
    store x
    load a
    store y
    end_atomic

After hoisting load a:
    begin_atomic
    load.add.sr a
    store.chk.sr x
    clear.sr a
    store y
    end_atomic

After sinking and removing clear.sr a:
    begin_atomic
    load.add.sr a
    store.chk.sr x
    store y
    end_atomic

• Assume a may alias with x and y
• Hoist load a above store x and set up monitoring of a
  • store.chk.sr x will roll back AR on alias check failure
• Sink clear.sr a to end of AR (if possible)
  • store y will not trigger rollback on alias with a
  • Now clear.sr a can be removed
• Can selectively check against stores in path of code motion
• (Often) no instruction overhead for checking

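The hoisting case can be exercised with a toy model. This is a sketch under simplifying assumptions: one SR and one SW bit per address rather than per cache line, speculative writes held in a buffer, and the original code run directly as the Safe-Version. All Python names are invented for the illustration.

```python
class Abort(Exception):
    pass

class AtomicRegion:
    """Toy atomic region: SR/SW tracked per address, writes buffered."""
    def __init__(self, mem):
        self.mem, self.buf, self.sr = mem, {}, set()

    def load_add_sr(self, addr):              # monitored load
        self.sr.add(addr)
        return self.buf.get(addr, self.mem[addr])

    def store(self, addr, val):               # regular store (sets SW)
        self.buf[addr] = val

    def store_chk(self, addr, val, sr=False, sw=False):
        if (sr and addr in self.sr) or (sw and addr in self.buf):
            raise Abort()
        self.store(addr, val)

    def commit(self):
        self.mem.update(self.buf)

def original(mem, a, x, y):
    mem[x] = 1          # store x
    r = mem[a]          # load a
    mem[y] = 2          # store y
    return r

def hoisted(ar, a, x, y):
    r = ar.load_add_sr(a)          # load a hoisted above store x
    ar.store_chk(x, 1, sr=True)    # aborts if x aliases a
    ar.store(y, 2)                 # regular store: never checks SR
    return r

def run(mem, a, x, y):
    ar = AtomicRegion(mem)
    try:
        r = hoisted(ar, a, x, y)
    except Abort:
        return original(mem, a, x, y), "aborted"   # Safe-Version
    ar.commit()
    return r, "committed"
```

With `a`, `x`, `y` distinct, or with `y` aliasing `a`, the region commits; with `x` aliasing `a`, the checked store aborts and the Safe-Version produces the value the original code would have loaded.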
Code Motion 2: Sinking Stores

Original:
    begin_atomic
    store a
    load x
    store y
    end_atomic

After sinking store a:
    begin_atomic
    load.add.sr x
    store y
    store.chk.srsw a
    end_atomic

• Assume a may alias with x and y
• Sink store a below load x and store y
  • Alias with x is checked when SR bits are checked in store.chk.srsw a
  • Alias with y is checked when SW bits are checked in store.chk.srsw a
• Can selectively check only loads in path of code motion
• Must check against all previous stores in atomic region
  • Because SW bits cannot be set selectively

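The sinking case can be checked the same way. Again this is only a sketch with per-address SR/SW bits and buffered writes; the class and function names are hypothetical, and the original code doubles as the Safe-Version.

```python
class Abort(Exception):
    pass

class AtomicRegion:
    """Toy atomic region: SR/SW tracked per address, writes buffered."""
    def __init__(self, mem):
        self.mem, self.buf, self.sr = mem, {}, set()

    def load_add_sr(self, addr):
        self.sr.add(addr)
        return self.buf.get(addr, self.mem[addr])

    def store(self, addr, val):
        self.buf[addr] = val               # presence in buf acts as the SW bit

    def store_chk(self, addr, val, sr=False, sw=False):
        if (sr and addr in self.sr) or (sw and addr in self.buf):
            raise Abort()
        self.store(addr, val)

    def commit(self):
        self.mem.update(self.buf)

def original(mem, a, x, y):
    mem[a] = 7          # store a
    r = mem[x]          # load x
    mem[y] = 9          # store y
    return r

def sunk(ar, a, x, y):
    r = ar.load_add_sr(x)                   # monitored: store a used to precede it
    ar.store(y, 9)                          # sets the SW bit for y
    ar.store_chk(a, 7, sr=True, sw=True)    # store a sunk below both
    return r

def run(mem, a, x, y):
    ar = AtomicRegion(mem)
    try:
        r = sunk(ar, a, x, y)
    except Abort:
        return original(mem, a, x, y), "aborted"
    ar.commit()
    return r, "committed"
```

An alias between `a` and `x` is caught through the SR bit, and an alias between `a` and `y` through the SW bit; in both cases the final memory matches the original ordering.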
Illustrative Example: LICM and GVN

Original loop:
    // a, b may alias with *p, *q, *s.
    // *p, *q, *s may alias with each other.
    for(i=0; i < 100; i++) {
      a = b + 10;
      *p = *q + 20;
      *s = *q + 20;
    }

Loop wrapped in an atomic region:
    // PC points to the original loop
    begin_atomic PC
    for(i=0; i < 100; i++) {
      a = b + 10;
      *p = *q + 20;
      *s = *q + 20;
    }
    end_atomic

• Put atomic region around loop
• Perform optimizations after inserting appropriate checks

Illustrative Example: LICM and GVN

Original loop:
    // a may alias with *p, *q
    // b may alias with *p
    // *p, *q, *s may alias with each other
    for(i=0; i < 100; i++) {
      a = b + 10;
      *p = *q + 20;
      *s = *q + 20;
    }

Optimized loop:
    // PC points to the original loop
    register int r1, r2;
    begin_atomic PC
    load.add.sr r1, b
    r2 = r1 + 10;
    for(i=0; i < 100; i++) {
      store a, r2;
      store.chk.sr *p, *q + 20;
      store *s, *q + 20;
    }
    clear.sr b
    end_atomic

• Put atomic region around loop
• Perform optimizations after inserting appropriate checks
  • Hoist b + 10 (LICM)

Illustrative Example: LICM and GVN

Original loop:
    // a may alias with *p, *q
    // b may alias with *p
    // *p, *q, *s may alias with each other
    for(i=0; i < 100; i++) {
      a = b + 10;
      *p = *q + 20;
      *s = *q + 20;
    }

Optimized loop:
    // PC points to the original loop
    register int r1, r2, r3, r4;
    begin_atomic PC
    load.add.sr r1, b
    r2 = r1 + 10;
    for(i=0; i < 100; i++) {
      store a, r2;
      load.add.sr r3, *q
      r4 = r3 + 20
      store.chk.sr *p, r4;
      clear.sr *q
      store *s, r4;
    }
    clear.sr b
    end_atomic

• Put atomic region around loop
• Perform optimizations after inserting appropriate checks
  • Hoist b + 10 (LICM)
  • Eliminate second *q + 20 (GVN)

Illustrative Example: LICM and GVN

Original loop:
    // a may alias with *p, *q
    // b may alias with *p
    // *p, *q, *s may alias with each other
    for(i=0; i < 100; i++) {
      a = b + 10;
      *p = *q + 20;
      *s = *q + 20;
    }

Optimized loop:
    // PC points to the original loop
    register int r1, r2, r3, r4;
    begin_atomic PC
    load.add.sr r1, b
    r2 = r1 + 10;
    for(i=0; i < 100; i++) {
      store a, r2;
      load.add.sr r3, *q
      r4 = r3 + 20
      store.chk.sr *p, r4;
      store *s, r4;
    }
    clear.sr *q
    clear.sr b
    end_atomic

• Put atomic region around loop
• Perform optimizations after inserting appropriate checks
  • Hoist b + 10 (LICM)
  • Eliminate second *q + 20 (GVN)
  • Sink clear.sr *q

Illustrative Example: LICM and GVN

Original loop:
    // a may alias with *p, *q
    // b may alias with *p
    // *p, *q, *s may alias with each other
    for(i=0; i < 100; i++) {
      a = b + 10;
      *p = *q + 20;
      *s = *q + 20;
    }

Optimized loop:
    // PC points to the original loop
    register int r1, r2, r3, r4;
    begin_atomic PC
    load.add.sr r1, b
    r2 = r1 + 10;
    for(i=0; i < 100; i++) {
      load.add.sr r3, *q
      r4 = r3 + 20
      store.chk.sr *p, r4;
      store *s, r4;
    }
    store.chk.srsw a, r2;
    clear.sr *q
    clear.sr b
    end_atomic

• Put atomic region around loop
• Perform optimizations after inserting appropriate checks
  • Hoist b + 10 (LICM)
  • Eliminate second *q + 20 (GVN)
  • Sink clear.sr *q
  • Sink the store a = r2 (LICM)
• Note: the store to *s is checked needlessly by store.chk.srsw a (its SW bit is set), but this is fine since *s does not alias with a

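The final optimized loop can be compared against the original loop under different alias configurations with a toy model. This is a sketch only: SR/SW bits are tracked per address instead of per cache line, writes are buffered until commit, the original loop serves as the Safe-Version, and every Python name is invented for the illustration.

```python
class Abort(Exception):
    pass

class AtomicRegion:
    """Toy atomic region: SR/SW tracked per address, writes buffered."""
    def __init__(self, mem):
        self.mem, self.buf, self.sr = mem, {}, set()

    def load_add_sr(self, addr):
        self.sr.add(addr)
        return self.buf.get(addr, self.mem[addr])

    def clear_sr(self, addr):
        self.sr.discard(addr)

    def store(self, addr, val):
        self.buf[addr] = val

    def store_chk(self, addr, val, sr=False, sw=False):
        if (sr and addr in self.sr) or (sw and addr in self.buf):
            raise Abort()
        self.store(addr, val)

def original(mem, a, b, p, q, s):
    for _ in range(100):
        mem[a] = mem[b] + 10
        mem[p] = mem[q] + 20
        mem[s] = mem[q] + 20

def optimized(ar, a, b, p, q, s):
    r2 = ar.load_add_sr(b) + 10              # hoisted b + 10
    for _ in range(100):
        r4 = ar.load_add_sr(q) + 20          # single *q + 20 per iteration
        ar.store_chk(p, r4, sr=True)         # checks *p against b and *q
        ar.store(s, r4)
    ar.store_chk(a, r2, sr=True, sw=True)    # sunk store to a
    ar.clear_sr(q)
    ar.clear_sr(b)

def run(mem, a, b, p, q, s):
    ar = AtomicRegion(mem)
    try:
        optimized(ar, a, b, p, q, s)
    except Abort:
        original(mem, a, b, p, q, s)         # Safe-Version after rollback
        return "aborted"
    mem.update(ar.buf)                       # commit
    return "committed"

def check(a, b, p, q, s):
    """Run both versions from the same initial memory and compare."""
    ref = {addr: 1 for addr in range(5)}
    mem = dict(ref)
    original(ref, a, b, p, q, s)
    status = run(mem, a, b, p, q, s)
    return status, mem == ref
```

For instance, `check(0, 1, 2, 3, 4)` (no aliases) commits, while `check(0, 1, 1, 3, 4)` (where `*p` aliases `b`) aborts; in both cases the final memory matches the original loop. The configurations tried should stay within the may-alias relation stated in the slide comments, since accesses the compiler already knows to be no-alias are deliberately left unchecked.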
Where should we Place Atomic Regions?
• We chose to focus on loops
  • Where most of the execution time is spent
  • Loops provide ample range for optimizations such as LICM or PRE to perform large-scale redundancy elimination
  • Can amortize cost of atomic region instrumentation over multiple iterations for a given optimization
• When loops can potentially overflow speculation resources, loops are blocked into nested sub-loops appropriately

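Loop blocking can be sketched as splitting the iteration space into chunks whose estimated footprint fits the speculative buffer. The function below is a hypothetical illustration: `lines_per_iter` stands for a per-iteration footprint estimate in cache lines and `capacity_lines` for the number of lines the region may hold; neither name comes from the talk.

```python
def block_loop(n_iters, lines_per_iter, capacity_lines):
    """Return (start, end) iteration ranges, one per atomic region."""
    # Largest number of iterations whose footprint fits, but at least one.
    chunk = max(1, capacity_lines // lines_per_iter)
    return [(start, min(start + chunk, n_iters))
            for start in range(0, n_iters, chunk)]

# 100 iterations, 8 lines each, 256 lines available: 32 iterations per region.
blocks = block_loop(100, 8, 256)
```

Each returned range would become one inner sub-loop wrapped in its own begin_atomic / end_atomic pair.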
Memory Consistency Issues
• In a multiprocessor system, disabling conflict checks on speculative read lines can change access ordering
  • Stores commit out of order at the end of an atomic region even when loads read values from remote processors
  • Conventionally, this causes a rollback
• Not a problem in reality
  • Compiler code motion causes access re-orderings anyway
  • If it is legal for the compiler to re-order, it is legal for HW
  • If it was illegal for the compiler to re-order (e.g. due to synchronization), the atomic region would not be placed there

Compiler Toolchain
• Run loop blocking pass that uses loop footprint estimation
• Run application instrumented with alias check instructions to profile how many Atomic Region aborts a particular speculation would have caused
• Run Atomic Region instrumentation pass for loops that would benefit according to a cost-benefit model and the abort profile information
• Run modified optimization passes (e.g. LICM, PRE, GVN) that perform the code movements deemed beneficial by the cost-benefit model, and insert appropriate alias checks

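The talk does not spell out the cost-benefit model, so the following is purely an illustrative guess at its general shape: weigh the cycles saved when speculation succeeds against the penalty when the region aborts, using the abort rate from the profile. The formula and all parameter names are assumptions made for this sketch.

```python
def expected_gain(abort_rate, saved_cycles, abort_penalty, overhead):
    """Expected cycles gained per region execution (hypothetical model).

    abort_rate    -- fraction of executions that abort, from the profile
    saved_cycles  -- cycles saved by the code motion when the region commits
    abort_penalty -- cycles lost to rollback plus re-running the Safe-Version
    overhead      -- cycles spent on begin/end and alias check instructions
    """
    return ((1 - abort_rate) * saved_cycles
            - abort_rate * abort_penalty
            - overhead)

def should_instrument(abort_rate, saved_cycles, abort_penalty, overhead):
    return expected_gain(abort_rate, saved_cycles, abort_penalty, overhead) > 0
```

Under this sketch a speculation that never aborts in the profile is worth applying whenever the saved cycles exceed the instrumentation overhead, while a speculation that aborts half the time is rejected unless aborts are very cheap.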
Experimental Setup
• Compare three environments using LICM and GVN/PRE optimizations:
  • BaselineAA:
    • Unmodified LLVM-2.8 using basic alias analysis
    • Default alias analysis used by -O3 optimization
  • DSAA:
    • Unmodified LLVM-2.8 using data structure alias analysis
    • Experimental alias analysis with high time/space complexity
  • LAS:
    • Modified LLVM-2.8 using loop-based alias speculation
• Applications:
  • SPEC INT2006, SPEC FP2006
• Simulation:
  • SESC with Pin-based front end with Atomic Region support
  • 32KB 8-way associative speculative L1 cache w/ 64B lines

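The cache geometry above fixes how many lines an atomic region can buffer. The arithmetic can be sketched as follows; the `fits` helper additionally assumes that a speculatively written line cannot be evicted without aborting the region, which is a common design for such hardware but is not stated in the talk.

```python
from collections import Counter

CACHE_BYTES = 32 * 1024
WAYS = 8
LINE_BYTES = 64

NUM_LINES = CACHE_BYTES // LINE_BYTES   # 512 lines in total
NUM_SETS = NUM_LINES // WAYS            # 64 sets of 8 lines each

def set_index(addr):
    """Set that the line containing addr maps to."""
    return (addr // LINE_BYTES) % NUM_SETS

def fits(written_addrs):
    """True if no set would need more than WAYS speculatively written lines."""
    lines = {addr // LINE_BYTES for addr in written_addrs}
    per_set = Counter(line % NUM_SETS for line in lines)
    return all(count <= WAYS for count in per_set.values())
```

A region that writes 512 consecutive lines fills the cache exactly, whereas nine lines spaced 4KB apart all map to the same set and would overflow it in this model.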
Alias Analysis Results
• Breakdown of alias analysis results when run with LICM pass
• LAS is able to convert almost all may-aliases to no-aliases using profile information

Speedups
• Speedups normalized to BaselineAA

Atomic Region Characterization
• Low L1 cache occupancy due to not buffering speculatively read lines
• Overhead amortized over large atomic region

Summary
• Proposed exposing HW Atomic Region alias checking primitive to SW using ISA extensions
• Proposed loop-based Atomic Region instrumentation
  • To maximize speculation opportunity
  • To minimize instrumentation overhead
• Proposed an alias speculation framework leveraging Atomic Regions and evaluated it using LICM and GVN/PRE
  • May-alias results: 56% → 4% for SPEC INT2006, 43% → 1% for SPEC FP2006
  • Speedup: 3% for SPEC INT2006, 9% for SPEC FP2006