Memory Consistency in Vector IRAM
David Martin
The Memory Consistency Model
• Consistency model applies to instructions in a single instruction stream (different than multi-processor consistency!).
• Legend:
  • a = after
  • S = scalar
  • V = vector
  • VP = virtual processor
  • R = read
  • W = write
  • * = no sync required
  • + = sync required
• Definition of an “XaY” sync:
  • All operations of type Y occurring before the sync in program order appear to execute before any operation of type X occurring after the sync in program order.
• Definition of an “XaY” sync to vector register $vri:
  • The most recent operation of type Y to $vri appears to execute before any operation of type X occurring after the sync in program order.
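The “XaY” definition above can be read as a simple predicate over a program listing. The following is a minimal sketch under stated assumptions, not the VIRAM specification: the `Op`, `Sync`, `parse_sync` and `ordered_by_sync` names are hypothetical, operation types are reduced to the issuing unit (S or V), and the per-register form of the sync is not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    unit: str   # "S" (scalar) or "V" (vector); an illustrative simplification
    kind: str   # "R" (read) or "W" (write)
    addr: int

@dataclass(frozen=True)
class Sync:
    after: str   # X in "XaY": operations of this type after the sync are held back
    before: str  # Y in "XaY": operations of this type before the sync must appear done

def parse_sync(name):
    """Split a name such as "VaS" into X = "V" and Y = "S"."""
    x, y = name.split("a", 1)
    return Sync(after=x, before=y)

def ordered_by_sync(program, i, j):
    """True if a sync between positions i and j orders program[i] before program[j]."""
    first, second = program[i], program[j]
    for item in program[i + 1 : j]:
        if (isinstance(item, Sync)
                and item.before == first.unit
                and item.after == second.unit):
            return True
    return False

program = [
    Op("S", "W", 0x100),   # scalar store
    parse_sync("VaS"),     # vector-after-scalar sync
    Op("V", "R", 0x100),   # vector load of the same location
    Op("S", "R", 0x100),   # scalar load with no SaV sync before it
]

print(ordered_by_sync(program, 0, 2))  # scalar store vs. vector load
print(ordered_by_sync(program, 2, 3))  # vector load vs. scalar load
```

In this sketch the VaS sync orders the scalar store before the vector load, while the last two operations have no sync between them and so are not ordered by this rule.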
Why Relax Memory Consistency?
• Natural micro-architecture has multiple paths to memory
• Want to decouple scalar and vector units without complex hardware
[Diagram: Fetch, Scalar Core, Sync, Vector Unit, Memory]
• Trade-off between more complex hardware (speculation, disambiguation, cache coherence) and more complex software (sync instructions)
• Should explore solutions to this trade-off that involve more hardware: e.g., hardware guarantees SaV and VaS ordering, but leaves VaV and VP orderings to software.
Software Conventions for Syncs
• Vector code is responsible for not messing things up.
  • Allows us to vectorize libraries to speed up existing programs.
  • Don’t want to assume that our compiler will compile and globally optimize all non-vector code that we run.
• Alternative model: Pass around flags to communicate sync requirements or history
  • Must assume that our compiler compiles all code run on IRAM.
  • Not sure we want to accept that restriction.
Vector Function Conventions:
1. Execute VaS and VaV syncs on entry to vector code.
2. Execute SaV sync on exit from vector code.
[Diagram: Scalar Code and Vector Code, with VaS and VaV syncs on entry to vector code and an SaV sync on exit]
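The two vector function conventions can be illustrated with a wrapper that brackets a routine with the required syncs. This is only a sketch: the `trace` list, the `vector_routine` decorator and the `vector_add` routine are hypothetical stand-ins for emitted instructions, not part of any VIRAM toolchain.

```python
import functools

trace = []  # hypothetical instruction trace, for illustration only

def vector_routine(fn):
    """Wrap fn so that it follows the entry/exit sync conventions."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Convention 1: VaS and VaV syncs on entry to vector code.
        trace.append("sync VaS")
        trace.append("sync VaV")
        try:
            return fn(*args, **kwargs)
        finally:
            # Convention 2: SaV sync on exit from vector code.
            trace.append("sync SaV")
    return wrapper

@vector_routine
def vector_add(a, b):
    trace.append("vector add")  # stands in for the vector loop body
    return [x + y for x, y in zip(a, b)]

result = vector_add([1, 2, 3], [10, 20, 30])
print(trace)
```

Because the wrapper knows nothing about its caller, every call pays for all three syncs, which is the cost the later optimization slides try to remove.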
Sync Implementations and Costs
• SaV: Stall fetch unit until vector unit has committed all vector memory instructions.
  • Could take 1000s of cycles with many indexed vector memory operations in flight!
  • Very difficult to delay issue since it is often issued at the end of a vector routine.
• VaS: Stall fetch unit until scalar unit has committed all scalar memory instructions.
  • Not too expensive (10s of cycles?) because scalar unit is ahead of the vector unit, because the scalar core is simple, and because the data cache is write-through.
  • Easy to delay issue because it is often issued at the start of a vector routine.
• VaV and VPaVP: No operation.
  • Nop because we have 1 vector memory unit and no vector caches.
Current Sync Analysis Tool
• Executes a program and tells you:
  1. Whenever two memory references are not:
     • Ordered by architectural guarantees
     • Ordered by register dependencies
     • Ordered by an intervening sync instruction
  2. Whenever a sync instruction is not used to resolve any hazard, as described in (1).
• Caveats:
  • Hazards are detected from a single program execution: Information may not hold true for all possible executions of the program.
  • Hazard detection is conservative in the presence of synchronization chains.
Two Examples of Synchronization Chains
Example 1:
  Write(A) <- r1
  RAW SYNC
  Read(A) <- r2
  WAR SYNC
  Write(A) <- r3
  Hazard?
Example 2:
  Write(A) <- r1
  RAW SYNC
  Read(A) <- r2
  Write(A) <- r2
  Hazard?
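A pairwise check of this kind can be sketched over a recorded trace. The code below is an illustration under several assumptions, not the actual analysis tool: the trace format and the `analyze` function are hypothetical, a sync is taken to cover only the hazard kind it names (RAW, WAR or WAW), and ordering by architectural guarantees or register dependencies is not modeled.

```python
from itertools import combinations

def hazard_kind(first, second):
    """Classify the dependence between an earlier and a later access."""
    kinds = {("W", "R"): "RAW", ("R", "W"): "WAR", ("W", "W"): "WAW"}
    return kinds.get((first, second))  # None for read-after-read

def analyze(trace):
    """trace items: ("R" | "W", addr) for accesses, ("SYNC", kind) for syncs.

    Returns (hazards, unused_syncs), where hazards are (i, j, kind) index pairs
    with no covering sync in between, and unused_syncs are indices of syncs
    that covered no pair at all.
    """
    hazards, used = [], set()
    accesses = [(i, t) for i, t in enumerate(trace) if t[0] != "SYNC"]
    for (i, a), (j, b) in combinations(accesses, 2):
        if a[1] != b[1]:
            continue  # different addresses never conflict
        kind = hazard_kind(a[0], b[0])
        if kind is None:
            continue
        covering = [k for k in range(i + 1, j) if trace[k] == ("SYNC", kind)]
        if covering:
            used.update(covering)
        else:
            hazards.append((i, j, kind))
    unused = [k for k, t in enumerate(trace) if t[0] == "SYNC" and k not in used]
    return hazards, unused

# First synchronization chain from the slide: W, RAW sync, R, WAR sync, W.
chain = [("W", "A"), ("SYNC", "RAW"), ("R", "A"), ("SYNC", "WAR"), ("W", "A")]
print(analyze(chain))

# A sync that resolves nothing is reported as unused.
print(analyze([("W", "A"), ("SYNC", "WAR"), ("R", "B")]))
```

For the chain, this sketch reports the two writes as an uncovered WAW pair even though the write, read, write sequence is ordered step by step. That is one way a pairwise detector can be conservative about synchronization chains; whether the real tool behaves the same way is an assumption.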
Optimizing Code
• Basic problem:
  • Vector unit requires setup: VL, VPW, mask, exceptions
  • Vector code responsible for issuing syncs
  • Both of these are required in a vector routine if nothing is known about the calling context!
• All solutions share the notion of giving control of the calling context to the compiler. Two options:
  (1) Pass around flags so that syncs and setup code can be avoided at run-time
  (2) Do global optimizations so that syncs and setup code can be eliminated at compile-time
[Diagram: instruction sequence]
  ...
  Scalar code
  Vector setup
  VaS and VaV sync
  Vector function
  SaV sync
  Scalar code
  Vector setup
  VaS and VaV sync
  Vector function
  SaV sync
  Scalar code
  ...
Optimization Example
• Demonstrates potential benefit from optimizing scalar-vector communication
• Code computes A+B+C+D+E+F in the following manner:
[Diagram: operands A, B, C, D, E, F combined by five + operations]
• Unoptimized code calls a general vector add routine 5 times
• First optimization inlines the 5 routines and removes vector initialization sequences
• Second optimization also removes unnecessary sync instructions
• Optimization goal is to avoid “sawtooth” in instantaneous performance graphs caused by draining the vector pipelines between vector loops
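The three code variants can be compared by counting the setup and sync steps each one issues. The sketch below is illustrative only: the step names are hypothetical labels, and it assumes that after inlining a single vector setup suffices, and that in the second optimization only the entry syncs and one final SaV sync remain necessary. The slides do not state exactly which syncs were removed.

```python
from collections import Counter

CALLS = 5  # A+B+C+D+E+F needs five vector adds

def unoptimized():
    """Five calls to a general routine: each one pays setup and all syncs."""
    seq = []
    for _ in range(CALLS):
        seq += ["setup", "sync VaS", "sync VaV", "vadd", "sync SaV"]
    return seq

def inlined():
    """First optimization: inline and keep a single vector setup (assumed)."""
    seq = ["setup"]
    for _ in range(CALLS):
        seq += ["sync VaS", "sync VaV", "vadd", "sync SaV"]
    return seq

def inlined_fewer_syncs():
    """Second optimization: sync once on entry and once on exit (assumed)."""
    return ["setup", "sync VaS", "sync VaV"] + ["vadd"] * CALLS + ["sync SaV"]

for variant in (unoptimized, inlined, inlined_fewer_syncs):
    counts = Counter(variant())
    print(variant.__name__, "setups:", counts["setup"], "SaV syncs:", counts["sync SaV"])
```

Under these assumptions the SaV count drops from five to one, which corresponds to draining the vector pipelines once instead of between every pair of vector loops.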
• Large optimization potential for short vector loops.
• SaV syncs are most important to eliminate or delay.
• VaS sync performance impact is unclear.
• VaV syncs are virtually free in VIRAM-1.
• Setup code is expensive. For this example, it is as expensive as the SaV syncs.