Practical Reduction for Store Buffers

Practical Reduction for Store Buffers Ernie Cohen, Microsoft Norbert Schirmer, DFKI

problem • practical reasoning about imperative code is based on state assertions and invariants • such reasoning tacitly assumes sequential consistency (SC) … • … but real MP hardware doesn’t provide SC • needed: a programming discipline that • guarantees SC • is flexible enough to handle real software • is practical to check

x86/x64 hardware model: TSO • FIFO store buffer (SB) between each processor (P) and the (shared, SC) memory • P writes are queued onto its SB • concurrently, writes leave SBs and are applied to memory • a read by P reads from P’s SB if possible; otherwise, it reads from memory (“SB forwarding”) • P can flush its own SB (expensive) • note: TSO != “load-acquire, store-release” • a read can move backward past a write to the same location, turning into a read of a constant • note: UP TSO machines are SC, but …

TSO is not SC • TSO is not SC, because of the delay in writes becoming visible to other processors, e.g. P0: <x := 1> <y = 0> P1: <y := 1> <x = 0> • both Ps can complete under TSO, but not under SC (whichever thread writes second gets stuck)

a simple SC discipline • make sure that P reads only when P’s SB is empty • writes dirty the SB; flushes clean it • read allowed only when the SB is clean • (lazy caching uses a similar trick to achieve SC) • proof of SC: • each P simulates a virtual P (that might fall behind) • virtual P takes a write step when that write hits memory • real and virtual P are in sync on read steps • but this discipline isn’t practical • disjoint concurrency shouldn’t require any flushes! • idea: distinguish private and shared memory

ownership • each location can be either owned (by a unique processor) or unowned • each access is volatile or nonvolatile • modified discipline: • nonvolatile access requires ownership of the location • volatile writes dirty the SB • volatile reads allowed only when SB is clean • simulation proof is similar, but novolatile accesses happen as soon as there are no volatile writes in front of them • they’re guaranteed to see the same values when they hit the SB, because other Ps don’t modify

moving ownership around • use ghost operations to take and release ownership • P can take ownership of unowned locations • P can release ownership of locations P owns • (this fits with ownership in VCC, where “unowned” means owned by a data object rather than a thread) • discipline in the paper also adds unowned read-only locations, which allows shared non-volatile reading

ex: spinlocks typedef… struct _SPIN_LOCK { volatileint Lock; _(ghost \object prot_obj;) _(invariant !Lock ==> \mine(prot_obj)) } SPIN_LOCK; void Acquire(SPIN_LOCK *SpinLock …) … { int stop; do { …{ //atomic stop = (__interlockedcompareexchange(&SpinLock->Lock, 1, 0) == 0); _(if (stop) \giveup_closed_owner(SpinLock->prot_obj, SpinLock);) } } while (!stop); } Microsoft confidential

key points • discipline follows some basic VCC methodology • discipline expressed in terms of ghost state • ghost code “witnesses” conformance to the discipline (much as ghost code is used to witness simulations) • by replacing proof obligations with programming obligations, we’re more likely to get programmers to do it • when checking the discipline, we get to assume a SC execution, so we never have to think about the SBs.

the only tricky part of the proof • key observation: ownership changes cannot race on their own • if they do, there are executions that violate the discipline • therefore, we can pretend that ownership doesn’t get released until the next volatile write

a note on ghosts • VCC requires lots of ghost code, incude racy operations on volatile ghost state • why doesn’t this introduce flushing? SC code follows discipline on real data => {SC stripped code simulates SC code} SC stripped code follows discipline on real data => {reduction theorem} stripped code simulates SC stripped code => {SC stripped code simulates SC code} stripped code simulates SC code

how close is this to practice? • discipline followed almost everywhere in the Hv codebase • even non-interlocked volatile writes are fairly rare • exceptions (outside of device ops) are writes where • the write doesn’t race with other writes • racing reads can safely read the old value • ex: releasing a spinlock, broadcasting signals • a solution: introduce a new kind of volatile • one reader, multiple writers • keep track of an upper and lower bound • writes must be above the upper bound, • writes raise the upper bound • flush raises the lower bound to the upper bound • reads by other processors raise the lower bound to the value read • this works, but is kind of gross

Practical Reduction for Store Buffers

Practical Reduction for Store Buffers

Presentation Transcript

Buffers

Buffers

Buffers

Buffers

Buffers

BUFFERS

Buffers

Buffers

Buffers

Buffers

Buffers

Buffers

Buffers

Buffers

Buffers

Buffers

Buffers

Buffers

Buffers

Buffers

Buffers