
TCC on Smart Memories

Amin Firoozshahian, 26 April 2004


Presentation Transcript


  1. TCC on Smart Memories
     Amin Firoozshahian, 26 April 2004

  2. Outline
     • TCC implementation requirements
     • Register Checkpoints
     • Augmenting cache
     • Commits
     • Violation detections
     • Squashes
     • Arbitrations
     • Overflows
     • Open Issues

  3. TCC Implementation Requirements
     • TCC is very similar to TLS. It requires:
       • Checkpointing system state
       • Keeping track of speculative reads / writes
       • Detecting RAW violations
       • Commits and Squashes
     • But it also has some slight differences:
       • No need to have a coherency protocol running
       • No specific order between running transactions (un-ordered)
       • Requires arbitration for initiating commits
       • No data forwarding requirements
       • Needs burst traffic on caches at commit times

  4. Assumptions
     • Running just one Transaction per Tile
       • Second processor is turned off
     • Have a gang-clear operation in the mat
       • No conditional gang-clear

  5. Register Checkpoints
     • Need to save system state before starting each transaction
     • Use Tensilica’s register windowing capabilities to create fast checkpoints
     • Roll back the register window in case of violation
     • Same as TLS

  6. Augmenting Cache for TCC
     • Add SM and SR bits to each data word:
       • SM: Speculatively Modified (set upon Stores to the data word)
       • SR: Speculatively Read (set upon Loads from the data word)
     • Add SM and SR bits to cache lines as well
     • Bits are set either on processor accesses or cache line refills
     • Cleared on commit or squash
     [Figure: cache line layout, with SR / SM bits on the Tag and on each Data Word]
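The bit bookkeeping on this slide can be sketched in software. This is a minimal illustrative model, not the Smart Memories hardware: the class name `SpecLine`, the field names, and the line size of 8 words are assumptions made for the example.

```python
from dataclasses import dataclass, field

WORDS_PER_LINE = 8  # assumed line size; the slides do not give one


def _zeros():
    return [0] * WORDS_PER_LINE


def _clear_bits():
    return [False] * WORDS_PER_LINE


@dataclass
class SpecLine:
    """One cache line carrying per-word and per-line SR / SM bits."""
    tag: int
    data: list = field(default_factory=_zeros)
    word_sr: list = field(default_factory=_clear_bits)
    word_sm: list = field(default_factory=_clear_bits)
    line_sr: bool = False
    line_sm: bool = False

    def load(self, word):
        # A speculative load marks both the word and its line as read.
        self.word_sr[word] = True
        self.line_sr = True
        return self.data[word]

    def store(self, word, value):
        # A speculative store marks both the word and its line as modified.
        self.word_sm[word] = True
        self.line_sm = True
        self.data[word] = value

    def gang_clear(self):
        # On commit or squash, every speculative bit is cleared at once.
        self.word_sr = _clear_bits()
        self.word_sm = _clear_bits()
        self.line_sr = False
        self.line_sm = False
```

After `line.load(1)` and `line.store(2, 99)`, only word 1 has SR set and only word 2 has SM set, while both line-level bits are set; `gang_clear()` resets all of them.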

  7. Augmenting Cache for TCC
     • Keep track of speculative write addresses in a separate buffer
     • Used to commit / squash speculative writes
     • Same as TLS
     [Figure: Transactional Processor with a cache (Tag, Data) and an address FIFO]

  8. Problems
     • Evictions: Lines with SM or SR bits set cannot be evicted
       • Solution:
         • Stall the transaction on such a refill
         • Request the commit token
         • Continue non-speculatively once the token is acquired
     • Sub-word accesses: Need per-byte SR and SM bits
       • Solution:
         • Conservatively use per-word SR and SM bits
         • On a byte write, set the SR bit of the word
         • Same as TLS

  9. Commits
     • Commits happen when:
       • Transaction is successfully completed
       • Processor wins the arbitration for commit
     • Commit procedure:
       • Arbitrate for the commit token
       • Clear all SR bits for words and lines (gang)
       • Traverse address FIFO, access the cache and send out all SM words (sequential)
       • Clear all SM bits for words and lines
       • Clear address FIFO
       • Release commit token
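The commit procedure above can be sketched as a small software model. Everything named here is an assumption for illustration: `Tile`, `commit_token` (a `threading.Lock` standing in for the arbitrated commit token), and the `send` callback that stands in for putting an SM word on the bus. SR / SM bits are modelled as sets of word addresses rather than as per-word flags.

```python
import threading
from collections import deque

# Stand-in for the commit token that processors arbitrate for (assumption).
commit_token = threading.Lock()


class Tile:
    def __init__(self):
        self.cache = {}           # word address -> value
        self.sr = set()           # word addresses with the SR bit set
        self.sm = set()           # word addresses with the SM bit set
        self.addr_fifo = deque()  # speculative write addresses, oldest first

    def load(self, addr):
        self.sr.add(addr)
        return self.cache.get(addr, 0)

    def store(self, addr, value):
        if addr not in self.sm:
            # First speculative write to this word: record its address.
            self.addr_fifo.append(addr)
        self.sm.add(addr)
        self.cache[addr] = value

    def commit(self, send):
        with commit_token:               # arbitrate for the commit token
            self.sr.clear()              # gang-clear all SR bits
            for addr in self.addr_fifo:  # one FIFO access, then one cache access
                send(addr, self.cache[addr])
            self.sm.clear()              # clear all SM bits
            self.addr_fifo.clear()       # clear the address FIFO
        # Leaving the with-block releases the commit token.
```

In this model a word written twice in one transaction appears once in the FIFO and is sent once, carrying its latest value.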

  10. Who does the commit?
     • Addresses of all SM words are stored in the FIFO
     • Need one FIFO access followed by one Cache access to get each SM word
     • Slightly different from Indexed Scatter DMA
     • Commits can be performed either by the DMA engine or by the processor
     • Lower Quad bus bandwidth requirement if done by the processor

  11. Commit Broadcast
     • Invalidations for the addresses of the SM words should be broadcast to all other caches
     • In the committing Quad:
       • Done by the Cache Controller
     • In other Quads:
       • Memory Controller receives commit packets
       • Broadcasts them to all other Cache Controllers
       • Cache Controllers broadcast inside their Quads
     • Can be update-based or invalidation-based

  12. Violation detection
     • If a commit invalidation goes to a word / line with its SR bit set, a violation is detected
       • Detected by the Cache Controller
       • Reported to the processor
     • Based on Lines or Words?
       • Words:
         • Access to both data and tag mats
         • Better if using an update-based protocol (same bandwidth requirements)
         • No false sharing (except sub-word writes)
       • Lines:
         • Commits are only broadcast to tags
         • Lower bandwidth requirements
         • False sharing problem
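The word-versus-line trade-off can be illustrated with a short sketch. The function names and the line size of 8 words are assumptions for the example; addresses are word addresses.

```python
WORDS_PER_LINE = 8  # assumed line size; the slides do not give one


def line_of(word_addr):
    """Line index that a word address falls into."""
    return word_addr // WORDS_PER_LINE


def violates_by_word(sr_words, committed_words):
    # Violation only if a committed word was itself speculatively read.
    return any(w in sr_words for w in committed_words)


def violates_by_line(sr_words, committed_words):
    # Violation if a committed word falls in any line holding an SR word.
    sr_lines = {line_of(w) for w in sr_words}
    return any(line_of(w) in sr_lines for w in committed_words)
```

If a transaction has read word 16 and another transaction commits word 17, both words sit in line 2 under the assumed line size. Word-based detection reports no violation, while line-based detection reports one: this is the false sharing case.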

  13. Squashes
     • Are requested by the Cache Controller
     • Squash procedure:
       • Traverse FIFO, invalidate all SM lines (sequential)
         • Can be gang if we have conditional gang-clears
       • Clear all SR bits for both words and lines (gang)
       • Clear the FIFO
       • Restore the checkpoint
     • In an update-based commit scheme, data is already forwarded before restarting
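The squash procedure can be sketched in the same style. The names `Tile`, `begin`, and `squash` are hypothetical, the register checkpoint is modelled as a plain dictionary copy rather than a register window, and invalidation is modelled per word rather than per line.

```python
from collections import deque


class Tile:
    def __init__(self, registers):
        self.registers = dict(registers)
        self.checkpoint = None
        self.cache = {}           # word address -> value
        self.sr = set()           # word addresses with the SR bit set
        self.sm = set()           # word addresses with the SM bit set
        self.addr_fifo = deque()  # speculative write addresses

    def begin(self):
        # Checkpoint the register state before the transaction starts.
        self.checkpoint = dict(self.registers)

    def load(self, addr):
        self.sr.add(addr)
        return self.cache.get(addr, 0)

    def store(self, addr, value):
        if addr not in self.sm:
            self.addr_fifo.append(addr)
        self.sm.add(addr)
        self.cache[addr] = value

    def squash(self):
        for addr in self.addr_fifo:       # traverse FIFO, invalidate SM data
            self.cache.pop(addr, None)
        self.sm.clear()
        self.sr.clear()                   # gang-clear all SR bits
        self.addr_fifo.clear()            # clear the FIFO
        self.registers = dict(self.checkpoint)  # restore the checkpoint
```

After a squash, data that was only read stays in the cache, speculatively written data is gone, and the registers hold their pre-transaction values.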

  14. Arbitration
     • Proposal: Do the arbitration in software
       • No need for additional hardware in the system
       • Can be flexible
     • Arbitration done by acquiring a series of locks
       • Locks need to be stored in an un-cached address region
     • Not as fast as hardware arbitration
       • But seems to be tolerable (TCC ISCA paper)

  15. Arbitration Examples
     • Un-ordered transactions:
       • Acquire a single lock
     • Ordered transactions:
       • Use an array of locks: acquire yours, release the next
       • OR: Compare your phase number with that of the committing transaction
         • Acquire the lock if it’s your turn to commit
     • Partially ordered transactions:
       • Acquire one lock for the order
       • Acquire one global lock afterwards
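The array-of-locks scheme for ordered transactions can be sketched with host threads. This is an analogy only: `threading.Lock` stands in for locks kept in an un-cached address region, and `run_ordered` is a hypothetical helper. Every lock except the first starts out held, so transaction i can commit only after transaction i-1 has released lock i.

```python
import threading


def run_ordered(num_transactions):
    """Run transactions on threads and return the order in which they commit."""
    locks = [threading.Lock() for _ in range(num_transactions)]
    for lock in locks[1:]:
        lock.acquire()  # only transaction 0 may commit at first

    commit_order = []

    def transaction(i):
        # Speculative work would run here, in any order across threads.
        locks[i].acquire()            # acquire your own lock
        commit_order.append(i)        # commit
        if i + 1 < num_transactions:
            locks[i + 1].release()    # release the next transaction's lock

    # Start the threads in reverse to show that start order does not matter.
    threads = [
        threading.Thread(target=transaction, args=(i,))
        for i in reversed(range(num_transactions))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return commit_order
```

The sketch relies on a Python `threading.Lock` being releasable from a thread other than the one that acquired it. The un-ordered case reduces to a single shared lock acquired by every transaction.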

  16. Overflows
     • Happen when:
       • Address FIFO is full
       • Need to evict an SR / SM line from cache (run out of associativity)
     • Interrupt to software:
       • Acquire commit token
       • Continue non-speculatively:
         • Commit current speculative state
         • Broadcast all the writes until transaction ends
       • Release commit token
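The overflow path can be sketched as follows, covering only the FIFO-full case. The names `Tile`, `commit_token`, `send`, and the FIFO depth of 4 are assumptions for the example; the address FIFO and the SM data are folded into one insertion-ordered dictionary.

```python
import threading

commit_token = threading.Lock()  # stand-in for the commit token (assumption)
FIFO_DEPTH = 4                   # assumed address FIFO capacity


class Tile:
    def __init__(self, send):
        self.send = send              # callback that broadcasts one write
        self.pending = {}             # address -> value, in first-write order
        self.non_speculative = False

    def store(self, addr, value):
        if self.non_speculative:
            self.send(addr, value)    # broadcast each write immediately
            return
        if addr not in self.pending and len(self.pending) == FIFO_DEPTH:
            self._overflow()          # FIFO is full: switch modes
            self.send(addr, value)
            return
        self.pending[addr] = value

    def _flush(self):
        for addr, value in self.pending.items():
            self.send(addr, value)
        self.pending.clear()

    def _overflow(self):
        commit_token.acquire()        # acquire the commit token
        self._flush()                 # commit current speculative state
        self.non_speculative = True   # keep the token until the end

    def end_transaction(self):
        if self.non_speculative:
            self.non_speculative = False
            commit_token.release()    # release the token held since overflow
        else:
            with commit_token:        # normal commit path
                self._flush()
```

Holding the token from the overflow until the transaction ends means no other transaction can commit in between, which is what lets the remaining writes go out non-speculatively.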

  17. Two processor configuration?
     • Processors share the cache in the Tile
     • Need to store the state of the two processors separately
       • Separate FIFOs for the processors
       • Two sets of SR / SM bits
     • Only one set can be used for a line
       • Need to stall / squash one processor in case of conflicts
     • More conflict misses in the shared cache
     • More complexity
     • Do we want to have it?

  18. Double buffering?
     • Can use the second processor (or DMA engine) to commit state while the original processor runs another transaction
     • Can have two FIFOs
     • But…
       • Need to distinguish between old SM and new SM lines / words
       • Perhaps can use a second set of SR / SM bits
       • New transaction needs to be stalled in case of conflict with old SM state
     • Do we want to have it?
