TCC on Smart Memories

Amin Firoozshahian, 26 April 2004

Outline
• TCC implementation requirements
• Register checkpoints
• Augmenting the cache
• Commits
• Violation detection
• Squashes
• Arbitration
• Overflows
• Open issues

TCC Implementation Requirements
• TCC is very similar to TLS (thread-level speculation). It requires:
  • Checkpointing system state
  • Keeping track of speculative reads / writes
  • Detecting RAW (read-after-write) violations
  • Commits and squashes
• But it also has some slight differences:
  • No need to have a coherence protocol running
  • No specific order between running transactions (unordered)
  • Requires arbitration for initiating commits
  • No data forwarding requirements
  • Needs burst traffic on caches at commit time

Assumptions
• Running just one transaction per Tile
• Second processor is turned off
• Have a gang-clear operation in the mat
• No conditional gang-clear

Register Checkpoints
• Need to save system state before starting each transaction
• Use Tensilica’s register windowing capabilities to create fast checkpoints
• Roll back the register window in case of a violation
• Same as TLS

Augmenting Cache for TCC
• Add SM and SR bits to each data word:
  • SM: Speculatively Modified (set upon stores to the data word)
  • SR: Speculatively Read (set upon loads from the data word)
• Add SM and SR bits to cache lines as well
• Bits are set either on processor accesses or on cache line refills
• Cleared on commit or squash
[Figure: cache line layout with SR and SM bits on the tag and on each data word (Data Word 1, Data Word 2)]
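The bit placement described above can be sketched as a small software model. This is only an illustration under stated assumptions: the names `Word`, `Line`, `WORDS_PER_LINE`, and `clear_speculative_bits` are hypothetical, the line size is invented, and setting bits on refills is not modelled.

```python
from dataclasses import dataclass, field

WORDS_PER_LINE = 4  # assumed line size; the slides do not give one


@dataclass
class Word:
    data: int = 0
    sr: bool = False  # Speculatively Read
    sm: bool = False  # Speculatively Modified


@dataclass
class Line:
    tag: int = 0
    words: list = field(
        default_factory=lambda: [Word() for _ in range(WORDS_PER_LINE)]
    )
    sr: bool = False  # line-level copy of the SR information
    sm: bool = False  # line-level copy of the SM information

    def load(self, index):
        # A load marks both the word and its line as speculatively read.
        self.words[index].sr = True
        self.sr = True
        return self.words[index].data

    def store(self, index, value):
        # A store marks both the word and its line as speculatively modified.
        self.words[index].data = value
        self.words[index].sm = True
        self.sm = True

    def clear_speculative_bits(self):
        # Models the clearing that happens on commit or squash.
        for word in self.words:
            word.sr = False
            word.sm = False
        self.sr = False
        self.sm = False
```

Keeping a copy of the bits at line level lets a tag-only lookup answer "is anything in this line speculative?" without touching the data words.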

Augmenting Cache for TCC
• Keep track of speculative write addresses in a separate buffer (FIFO)
• Used to commit / squash speculative writes
• Same as TLS
[Figure labels: FIFO, Tag, Data, Address, Transactional Processor]

Problems
• Evictions: lines with SM or SR bits set cannot be evicted
  • Solution:
    • Stall the transaction on such a refill
    • Request the commit token
    • Continue non-speculatively once the token is acquired
• Sub-word accesses: need per-byte SR and SM bits
  • Solution:
    • Conservatively use per-word SR and SM bits
    • On a byte write, also set the SR bit of the word
    • Same as TLS
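The conservative sub-word rule can be illustrated with a short sketch. The helper names (`WordState`, `store_word`, `store_byte`), the 32-bit word size, and the little-endian byte numbering are assumptions made for the example, not details from the slides.

```python
WORD_MASK = 0xFFFFFFFF  # assumed 32-bit data words


class WordState:
    """One cached data word with per-word SR / SM bits (hypothetical model)."""

    def __init__(self, data=0):
        self.data = data & WORD_MASK
        self.sr = False
        self.sm = False


def store_word(word, value):
    # A full-word store overwrites every byte, so only SM needs to be set.
    word.data = value & WORD_MASK
    word.sm = True


def store_byte(word, byte_index, value):
    # A byte store keeps the other three bytes of the word, so the word
    # still depends on data that another transaction may overwrite.
    # Setting SR as well makes a later commit to this word a violation.
    shift = 8 * byte_index  # little-endian byte numbering (assumption)
    mask = 0xFF << shift
    word.data = (word.data & ~mask) | ((value & 0xFF) << shift)
    word.sm = True
    word.sr = True
```

The cost of this choice is occasional false violations on sub-word writes, in exchange for not storing SR and SM bits for every byte.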

Commits
• Commits happen when:
  • The transaction is successfully completed
  • The processor wins the arbitration for commit
• Commit procedure:
  • Arbitrate for the commit token
  • Clear all SR bits for words and lines (gang)
  • Traverse the address FIFO, access the cache and send out all SM words (sequential)
  • Clear all SM bits for words and lines
  • Clear the address FIFO
  • Release the commit token
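The commit procedure above can be sketched as follows. This is a simplified software model, not the hardware design: the `Tile` class, the dictionary-based cache, and the use of a `threading.Lock` as the commit token are all assumptions, and "sending out" a word is modelled as a write to a shared `memory` dictionary.

```python
import threading

commit_token = threading.Lock()  # stand-in for the system-wide commit token


class Tile:
    def __init__(self):
        self.cache = {}  # word address -> {"data": int, "sr": bool, "sm": bool}
        self.fifo = []   # addresses of speculatively modified words

    def load(self, addr, memory):
        entry = self.cache.setdefault(
            addr, {"data": memory.get(addr, 0), "sr": False, "sm": False}
        )
        entry["sr"] = True
        return entry["data"]

    def store(self, addr, value):
        entry = self.cache.setdefault(
            addr, {"data": 0, "sr": False, "sm": False}
        )
        if not entry["sm"]:
            self.fifo.append(addr)  # record each written address once
        entry["data"] = value
        entry["sm"] = True

    def commit(self, memory):
        with commit_token:                      # arbitrate, release on exit
            for entry in self.cache.values():   # gang-clear SR bits
                entry["sr"] = False
            for addr in self.fifo:              # sequential FIFO traversal
                memory[addr] = self.cache[addr]["data"]
            for entry in self.cache.values():   # clear SM bits
                entry["sm"] = False
            self.fifo.clear()                   # clear the address FIFO
```

Only the FIFO traversal is inherently sequential; the bit clearing is written as loops here but corresponds to single gang operations in the mats.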

Who does the commit?
• Addresses of all SM words are stored in the FIFO
• Need one FIFO access followed by one cache access to get each SM word
• Slightly different from Indexed Scatter DMA
• Commits can be performed either by the DMA engine or by the processor
• Lower Quad bus bandwidth requirement if done by the processor

Commit Broadcast
• Invalidations for the addresses of the SM words should be broadcast to all other caches
• In the committing Quad:
  • Done by the Cache Controller
• In other Quads:
  • Memory Controller receives commit packets
  • Broadcasts them to all other Cache Controllers
  • Cache Controllers do the broadcast inside their Quads
• Can be update based or invalidation based

Violation detection
• If a commit invalidation goes to a word / line with its SR bit set, a violation is detected
  • Detected by the Cache Controller
  • Reported to the processor
• Based on lines or words?
  • Words:
    • Access to both data and tag mats
    • Better if using an update-based protocol (same bandwidth requirements)
    • No false sharing (except sub-word writes)
  • Lines:
    • Commits are only broadcast to tags
    • Lower bandwidth requirements
    • False sharing problem
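The word-versus-line trade-off can be made concrete with a small sketch. The function name `violates`, the set-based representation of SR state, and the line size are assumptions chosen for illustration.

```python
WORDS_PER_LINE = 4  # assumed line size; the slides do not give one


def violates(sr_words, committed_word, per_word):
    """Return True if an incoming commit address hits speculatively read state.

    sr_words:       set of word addresses whose SR bit is set in this cache
    committed_word: word address carried by the incoming commit packet
    per_word:       True for word-granularity checks, False for line-granularity
    """
    if per_word:
        return committed_word in sr_words
    committed_line = committed_word // WORDS_PER_LINE
    return any(w // WORDS_PER_LINE == committed_line for w in sr_words)


# Word 5 was speculatively read; another transaction commits word 6,
# which lives in the same line (words 4..7).
sr_words = {5}
word_level = violates(sr_words, 6, per_word=True)    # no real conflict
line_level = violates(sr_words, 6, per_word=False)   # false sharing
```

In this example the line-granularity check reports a violation even though the two transactions touched different words, which is the false sharing problem named above.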

Squashes
• Are requested by the Cache Controller
• Squash procedure:
  • Traverse the FIFO, invalidate all SM lines (sequential)
    • Can be a gang operation if we have conditional gang-clears
  • Clear all SR bits for both words and lines (gang)
  • Clear the FIFO
  • Restore the checkpoint
• In an update-based commit scheme, data is already forwarded before restarting
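A minimal sketch of the squash steps, assuming a word-granularity cache modelled as a dictionary and a register checkpoint modelled as a plain dictionary. The function name `squash` and these data structures are hypothetical, and invalidation is simplified to dropping the entry.

```python
def squash(cache, fifo, checkpoint):
    """Discard speculative state and return the restored register state.

    cache:      word address -> {"data": int, "sr": bool, "sm": bool}
    fifo:       list of addresses of speculatively modified words
    checkpoint: register state saved when the transaction started
    """
    for addr in fifo:               # sequential walk of the address FIFO
        cache.pop(addr, None)       # invalidate the speculatively modified entry
    for entry in cache.values():    # gang-clear of the remaining SR bits
        entry["sr"] = False
    fifo.clear()                    # clear the address FIFO
    return dict(checkpoint)         # roll back to the saved register state


cache = {
    0: {"data": 1, "sr": True, "sm": False},
    4: {"data": 9, "sr": False, "sm": True},
}
fifo = [4]
registers = squash(cache, fifo, {"pc": 100})
```

After the call, only the non-speculative entry at address 0 remains cached and the transaction can restart from the checkpoint.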

Arbitration
• Proposal: do the arbitration in software
  • No need for additional hardware in the system
  • Can be flexible
• Arbitration done by acquiring a series of locks
  • Locks need to be stored in an un-cached address region
• Not as fast as hardware arbitration
  • But seems to be tolerable (TCC ISCA paper)

Arbitration Examples
• Unordered transactions:
  • Acquire a single lock
• Ordered transactions:
  • Use an array of locks: acquire yours, release the next
  • OR: compare your phase number with that of the committing transaction
    • Acquire the lock if it’s your turn to commit
• Partially ordered transactions:
  • Acquire one lock for the order
  • Acquire one global lock afterwards
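The phase-number variant for ordered transactions can be sketched with host threads. This is an analogy only: on the real system the locks would live in un-cached memory and be polled by the processors, whereas here a `threading.Condition` stands in for that mechanism. The names `OrderedArbiter` and `run_ordered` are hypothetical.

```python
import threading


class OrderedArbiter:
    """Grants the commit token strictly in phase order 0, 1, 2, ..."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_phase = 0  # phase number currently allowed to commit

    def acquire(self, phase):
        # Wait until it is this transaction's turn to commit.
        with self._cond:
            self._cond.wait_for(lambda: self._next_phase == phase)

    def release(self):
        # Hand the token to the next phase and wake the waiters.
        with self._cond:
            self._next_phase += 1
            self._cond.notify_all()


def run_ordered(phases):
    """Start one thread per phase and return the order in which they commit."""
    arbiter = OrderedArbiter()
    order = []

    def worker(phase):
        arbiter.acquire(phase)
        order.append(phase)  # the "commit" happens while holding the token
        arbiter.release()

    threads = [threading.Thread(target=worker, args=(p,)) for p in phases]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

For unordered transactions the same role is played by a single lock, since any transaction that is ready may commit next.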

Overflows
• Happen when:
  • Address FIFO is full
  • Need to evict an SR / SM line from the cache (run out of associativity)
• Interrupt to software:
  • Acquire the commit token
  • Continue non-speculatively:
    • Commit the current speculative state
    • Broadcast all the writes until the transaction ends
  • Release the commit token
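The overflow path can be sketched in software as below. Only the FIFO-full case is modelled; the `Transaction` class, the tiny `FIFO_CAPACITY`, the shared `memory` dictionary, and the `threading.Lock` used as the commit token are assumptions made for the example.

```python
import threading

commit_token = threading.Lock()  # stand-in for the commit token
FIFO_CAPACITY = 2                # deliberately tiny, assumed for the example


class Transaction:
    def __init__(self, memory):
        self.memory = memory
        self.buffer = {}          # speculative writes: address -> value
        self.holds_token = False  # True once running non-speculatively

    def store(self, addr, value):
        if self.holds_token:
            self.memory[addr] = value  # broadcast the write immediately
            return
        if addr not in self.buffer and len(self.buffer) == FIFO_CAPACITY:
            self._overflow()           # no room to track another address
            self.memory[addr] = value
            return
        self.buffer[addr] = value

    def _flush(self):
        # Commit the buffered speculative state.
        self.memory.update(self.buffer)
        self.buffer.clear()

    def _overflow(self):
        commit_token.acquire()         # acquire the commit token
        self.holds_token = True
        self._flush()                  # commit current speculative state

    def end(self):
        if not self.holds_token:       # normal commit at transaction end
            commit_token.acquire()
            self.holds_token = True
            self._flush()
        commit_token.release()         # release the commit token
        self.holds_token = False
```

Because the token is held from the overflow until the transaction ends, no other transaction can commit in that window, so the overflowed transaction cannot be violated while it runs non-speculatively.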

Two processor configuration?
• Processors share the cache in the Tile
• Need to store the state of the two processors separately
  • Separate FIFOs for the processors
  • Two sets of SR / SM bits
  • Only one set can be used for a line
• Need to stall / squash one processor in case of conflicts
• More conflict misses in the shared cache
• More complexity
• Do we want to have it?

Double buffering?
• Can use the second processor (or DMA engine) to commit state while the original processor is running another transaction
• Can have two FIFOs
• But…
  • Need to distinguish between old SM and new SM lines / words
  • Perhaps can use a second set of SR / SM bits
  • New transaction needs to be stalled in case of a conflict with old SM state
• Do we want to have it?
