Lecture 9 Outline



  1. Lecture 9 Outline
     • MESI protocol
     • Dragon update-based protocol
     • Impact of protocol optimizations
     ECE/CSC 506 - Spring 2007 - E. F. Gehringer, based on slides by Yan Solihin

  2. Lower-Level Protocol Choices
     • BusRd observed in M state: what transition to make?
       • Change to S: assumes I’ll read again soon
         • good for mostly-read data
         • what about “migratory” data? Thus:
       • Change to I: assumes another processor will write to it (Synapse)
         • I read and write, then you read and write, then X reads and writes...
     • Sequent Symmetry and MIT Alewife use adaptive protocols
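
The trade-off between the two transitions can be sketched with a small counting model. This is only an illustration under stated assumptions, not part of the lecture: it tracks a single block, ignores replacements, and assumes a MESI-style Exclusive state so that a reader that finds no other copy may later write silently. The function name `count_bus_transactions` and both traces are hypothetical.

```python
def count_bus_transactions(trace, on_busrd_in_m):
    """Count bus transactions for one block.

    trace: list of (processor, op) with op "r" or "w".
    on_busrd_in_m: "S" or "I", the state an M copy moves to when it snoops BusRd.
    Assumption: a read miss that finds no other copy fills in state E.
    """
    state = {}                      # proc -> "M", "E" or "S"; absent means I
    bus = 0
    for proc, op in trace:
        mine = state.get(proc, "I")
        if op == "r":
            if mine != "I":
                continue            # read hit, no bus transaction
            bus += 1                # BusRd
            for other, st in list(state.items()):
                if st == "M" and on_busrd_in_m == "I":
                    del state[other]        # owner gives the block up
                else:
                    state[other] = "S"      # M, E or S copy ends up shared
            state[proc] = "S" if state else "E"
        elif mine in ("M", "E"):
            state[proc] = "M"       # write hit on an exclusive copy: silent
        else:
            bus += 1                # BusRdX or BusUpgr invalidates the rest
            state = {proc: "M"}
    return bus

# Migratory data: each processor in turn reads, then writes.
migratory = [(p, op) for _ in range(2) for p in (0, 1, 2) for op in "rw"]
# Mostly-read data: one write, then the two processors keep reading.
read_mostly = [(0, "w"), (1, "r"), (0, "r"), (1, "r"), (0, "r")]

for name, trace in (("migratory", migratory), ("read-mostly", read_mostly)):
    print(name,
          "M->S:", count_bus_transactions(trace, "S"),
          "M->I:", count_bus_transactions(trace, "I"))
```

In this model the M → I choice saves one transaction per hand-off on the migratory trace, while the M → S choice avoids an extra read miss on the mostly-read trace.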

  3. MESI (4-state) Invalidation Protocol
     • Problem with MSI protocol
       • Rd, Wr sequence incurs 2 transactions
         • even when no one is sharing (e.g., serial program!)
         • BusRd (I → S) followed by BusRdX or BusUpgr (S → M)
       • In general, coherence traffic from serial programs is unacceptable
     • Add exclusive state:
       • Invalid
       • Modified (dirty)
       • Shared (two or more caches may have copies)
       • Exclusive (only this cache has clean copy, same value as in memory)
     • How to decide I → E or I → S?
       • Need to check whether someone else has copy
       • “Shared” signal on bus: wired-or line asserted in response to BusRd
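
As a rough illustration of the point above, the following sketch lists the bus transactions generated by a read miss followed by a write to the same block. The helper names `shared_signal` and `read_then_write` are made up for this example, and the write from S is assumed to use BusUpgr rather than BusRdX.

```python
def shared_signal(other_states):
    """Wired-OR 'Shared' line: asserted if any other cache holds the block."""
    return any(s != "I" for s in other_states)

def read_then_write(protocol, other_states=()):
    """Bus transactions for a read miss followed by a write to the same block."""
    bus = ["BusRd"]                      # read miss, starting from state I
    if protocol == "MESI" and not shared_signal(other_states):
        state = "E"                      # I -> E: no other cache has a copy
    else:
        state = "S"                      # I -> S (MSI has no E state)
    if state == "S":
        bus.append("BusUpgr")            # S -> M needs a second transaction
    return bus                           # E -> M would be silent

print(read_then_write("MSI"))                         # serial program under MSI
print(read_then_write("MESI"))                        # serial program under MESI
print(read_then_write("MESI", other_states=["S"]))    # block really is shared
```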

  4. MESI: Processor-Initiated Transactions (state diagram)
     • M: PrRd/–, PrWr/– (stay in M)
     • E: PrRd/– (stay in E); PrWr/– (E → M)
     • S: PrRd/– (stay in S); PrWr/BusRdX (S → M)
     • I: PrRd/BusRd(S) (I → S); PrRd/BusRd(~S) (I → E); PrWr/BusRdX (I → M)

  5. MESI: Bus-Initiated Transactions (state diagram)
     • M: BusRd/Flush (M → S); BusRdX/Flush (M → I)
     • E: BusRd/Flush (E → S); BusRdX/Flush (E → I)
     • S: BusRd/Flush' (stay in S); BusRdX/Flush' (S → I)
     • I: BusRd/–, BusRdX/– (stay in I)

  6. MESI State Transition Diagram
     • BusRd(S) means shared line asserted on BusRd transaction

  7. Flush vs. Flush'
     • Flush: mandatory
     • Flush' happens only when
       • cache-to-cache sharing is used, and
       • only one cache flushes data
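
The two MESI diagrams can be written down as lookup tables. The sketch below is one possible encoding, assuming the standard reading of the diagrams (for example, that E and M both answer BusRd with Flush and move to S); the names `PROCESSOR`, `SNOOP`, `on_processor`, and `on_snoop` are hypothetical.

```python
# (state, processor event) -> (next state, bus transaction or None)
PROCESSOR = {
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("E", "PrRd"): ("E", None),
    ("E", "PrWr"): ("M", None),        # silent upgrade, the point of state E
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("I", "PrWr"): ("M", "BusRdX"),
}

# (state, snooped bus transaction) -> (next state, action or None)
SNOOP = {
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
    ("E", "BusRd"): ("S", "Flush"),
    ("E", "BusRdX"): ("I", "Flush"),
    ("S", "BusRd"): ("S", "Flush'"),   # only with cache-to-cache sharing
    ("S", "BusRdX"): ("I", "Flush'"),
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
}

def on_processor(state, event, shared=False):
    """Processor-side transition; `shared` is the wired-OR line seen on BusRd."""
    if (state, event) == ("I", "PrRd"):
        return ("S" if shared else "E", "BusRd")
    return PROCESSOR[(state, event)]

def on_snoop(state, txn):
    """Bus-side transition for a transaction issued by another cache."""
    return SNOOP[(state, txn)]

print(on_processor("I", "PrRd", shared=False))
print(on_processor("E", "PrWr"))
print(on_snoop("M", "BusRd"))
```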

  8. MESI Visualization (diagram)
     • P1, P2, and P3 each have a cache and a snooper on a shared bus, with a memory controller and main memory
     • No cache holds X yet; memory: X=1

  9. MESI Visualization (diagram)
     • P1 reads X (rd &X): BusRd
     • Memory: X=1

  10. MESI Visualization (diagram)
     • P1 now holds X=1 in state E
     • Memory: X=1

  11. MESI Visualization (diagram)
     • P1 writes X=2 (wr &X): E → M with no bus transaction
     • One less bus request due to Exclusive state, esp. for serial programs
     • Memory: X=1

  12. MESI Visualization (diagram)
     • P3 reads X (rd &X): BusRd
     • P1 holds X=2 in M; memory: X=1

  13. MESI Visualization (diagram)
     • P1 flushes X=2 (Flush) and moves M → S; P3 gets X=2 in S
     • Memory is updated from X=1 to X=2

  14. MESI Visualization (diagram)
     • P3 writes X=3 (wr &X): BusUpgr
     • P3 moves S → M (X=3); P1 moves S → I
     • Note: BusUpgr instead of BusRdX
     • Memory: X=2

  15. MESI Visualization (diagram)
     • P1 reads X (rd &X): BusRd; P3 flushes X=3 (Flush)
     • P1 moves I → S and P3 moves M → S, both with X=3
     • Memory is updated from X=2 to X=3

  16. MESI Visualization (diagram)
     • A further read of X (rd &X) hits in S; no bus transaction
     • P1 and P3 hold X=3 in S; memory: X=3

  17. MESI Visualization (diagram)
     • P2 reads X (rd &X): BusRd, answered by Flush'
     • P1, P2, and P3 all hold X=3 in S
     • Referred to as cache-to-cache transfer in Illinois MESI protocol
     • Memory: X=3
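
The walkthrough in the visualization slides can be replayed with a small simulator. This is a sketch only: it models a single block X, uses BusUpgr for writes from S as the slides note, and assumes the access order P1 read, P1 write, P3 read, P3 write, P1 read, P3 read, P2 read, which is inferred from the diagrams rather than stated in them. `run_mesi` is a hypothetical name.

```python
def run_mesi(trace, nprocs=3, mem=1):
    """Replay (proc, op, value) accesses to one block X; return a step log."""
    state = ["I"] * nprocs          # per-cache MESI state of X
    value = [None] * nprocs         # per-cache copy of X (stale once invalid)
    log = []
    for p, op, v in trace:
        bus = "-"
        others = [q for q in range(nprocs) if q != p]
        if op == "r" and state[p] == "I":
            bus, data, shared = "BusRd", mem, False
            for q in others:
                if state[q] == "M":
                    mem = value[q]              # Flush: dirty copy written back
                if state[q] != "I":
                    shared, data = True, value[q]
                    state[q] = "S"
            state[p], value[p] = ("S" if shared else "E"), data
        elif op == "w":
            if state[p] == "I":
                bus = "BusRdX"
            elif state[p] == "S":
                bus = "BusUpgr"
            if bus != "-":
                for q in others:
                    if state[q] == "M":
                        mem = value[q]          # Flush before invalidation
                    state[q] = "I"
            state[p], value[p] = "M", v
        log.append((f"P{p + 1} {op}", bus, tuple(state), mem))
    return log

# Processors are numbered from 0 here, so 0 is P1 and 2 is P3.
trace = [(0, "r", None), (0, "w", 2), (2, "r", None), (2, "w", 3),
         (0, "r", None), (2, "r", None), (1, "r", None)]
log = run_mesi(trace)
for step in log:
    print(step)
```

Each printed tuple holds the access, the bus transaction it caused, the states of P1 to P3, and the value of X in memory after the step.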

  18. MESI Example (Cache-to-Cache Transfer) * Data from memory if no cache2cache transfer, BusRd/-

  19. MESI Example (Cache-to-Cache Transfer+BusUpgr) * Data from memory if no cache2cache transfer, BusRd/-

  20. Lower-Level Protocol Choices
     • Who supplies data on miss when not in M state: memory or cache?
     • Original, Illinois MESI: cache
       • assumes cache faster than memory (cache-to-cache transfer)
       • Not necessarily true
     • Adds complexity
       • How does memory know it should supply data? (must wait for caches)
       • Selection algorithm if multiple caches have valid data
     • Valuable for distributed memory
       • May be cheaper to obtain from nearby cache than distant memory
       • Especially when constructed out of SMP nodes (Stanford DASH)

  21. Lecture 9 Outline
     • MESI protocol
     • Dragon update-based protocol
     • Impact of protocol optimizations

  22. Dragon Writeback Update Protocol
     • Four states
       • Exclusive-clean (E): I and memory have it
       • Shared-clean (Sc): I, others, and maybe memory, but I’m not owner
       • Shared-modified (Sm): I and others but not memory, and I’m the owner
         • Sm and Sc can coexist in different caches, with at most one Sm
       • Modified or dirty (M): I and no one else
     • On replacement: Sc can silently drop, Sm has to flush
     • No invalid state
       • If in cache, cannot be invalid
       • If not present in cache, can view as being in not-present or invalid state
     • New processor events: PrRdMiss, PrWrMiss
       • Introduced to specify actions when block not present in cache
     • New bus transaction: BusUpd
       • Broadcasts single word written on bus; updates other relevant caches

  23. Dragon: Processor-Initiated Transactions (state diagram)
     • E: PrRd/– (stay in E); PrWr/– (E → M)
     • Sc: PrRd/– (stay in Sc); PrWr/BusUpd(S) (Sc → Sm); PrWr/BusUpd(~S) (Sc → M)
     • Sm: PrRd/– (stay in Sm); PrWr/BusUpd(S) (stay in Sm); PrWr/BusUpd(~S) (Sm → M)
     • M: PrRd/–, PrWr/– (stay in M)
     • Not present: PrRdMiss/BusRd(~S) (→ E); PrRdMiss/BusRd(S) (→ Sc); PrWrMiss/BusRd(~S) (→ M); PrWrMiss/(BusRd(S);BusUpd) (→ Sm)

  24. Dragon: Bus-Initiated Transactions (state diagram)
     • E: BusRd/– (E → Sc)
     • Sc: BusRd/– (stay in Sc); BusUpd/Update (stay in Sc)
     • Sm: BusRd/Flush (stay in Sm); BusUpd/Update (Sm → Sc)
     • M: BusRd/Flush (M → Sm)

  25. Dragon State Transition Diagram (figure)
     • Combines the processor-initiated and bus-initiated transitions of the two preceding slides over the states E, Sc, Sm, and M
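
One way to encode the Dragon transitions is shown below. It is a sketch based on the standard reading of the Dragon diagrams, with `None` standing for "block not present"; the names `dragon_processor` and `DRAGON_SNOOP` are hypothetical.

```python
def dragon_processor(state, event, shared):
    """Processor-side Dragon transition.

    state: "E", "Sc", "Sm", "M", or None when the block is not present.
    shared: value of the wired-OR shared line during the bus transaction.
    Returns (next state, bus transactions or None).
    """
    if event == "PrRdMiss":
        return ("Sc", "BusRd") if shared else ("E", "BusRd")
    if event == "PrWrMiss":
        return ("Sm", "BusRd;BusUpd") if shared else ("M", "BusRd")
    if event == "PrRd":
        return (state, None)                  # read hit in any state
    # PrWr on a block that is present
    if state in ("E", "M"):
        return ("M", None)                    # no other copies to update
    return ("Sm", "BusUpd") if shared else ("M", "BusUpd")

# (state, snooped bus transaction) -> (next state, action or None)
DRAGON_SNOOP = {
    ("E", "BusRd"): ("Sc", None),
    ("Sc", "BusRd"): ("Sc", None),
    ("Sc", "BusUpd"): ("Sc", "Update"),
    ("Sm", "BusRd"): ("Sm", "Flush"),         # owner supplies the data
    ("Sm", "BusUpd"): ("Sc", "Update"),       # ownership moves to the writer
    ("M", "BusRd"): ("Sm", "Flush"),
}

print(dragon_processor(None, "PrRdMiss", shared=False))
print(dragon_processor("Sc", "PrWr", shared=True))
print(DRAGON_SNOOP[("M", "BusRd")])
```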

  26. Dragon Visualization (diagram)
     • P1, P2, and P3 each have a cache and a snooper on a shared bus, with a memory controller and main memory
     • No cache holds X yet; memory: X=1

  27. Dragon Visualization (diagram)
     • P1 reads X (rd &X): BusRd
     • Memory: X=1

  28. Dragon Visualization (diagram)
     • P1 now holds X=1 in state E
     • Memory: X=1

  29. Dragon Visualization (diagram)
     • P1 writes X=2 (wr &X): E → M with no bus transaction
     • One less bus request due to Exclusive state, esp. for serial programs
     • Memory: X=1

  30. Dragon Visualization (diagram)
     • P3 reads X (rd &X): BusRd
     • P1 holds X=2 in M; memory: X=1

  31. Dragon Visualization (diagram)
     • P1 moves M → Sm; P3 gets X=2 in Sc
     • Memory: X=1 (not updated)

  32. Dragon Visualization (diagram)
     • P3 writes X=3 (wr &X): BusUpd
     • P3 moves Sc → Sm and P1 moves Sm → Sc; both now hold X=3
     • Note: BusUpdate instead of BusUpgr (no inval is performed)
     • Memory: X=1

  33. Dragon Visualization (diagram)
     • P1 reads X (rd &X) and hits in its updated copy (X=3, Sc)
     • This is a miss in the MESI and MSI protocols
     • Memory: X=1

  34. Dragon Visualization (diagram)
     • A further read of X (rd &X) also hits; states unchanged (X=3 Sc, X=3 Sm)
     • Memory: X=1

  35. Dragon Visualization (diagram)
     • P2 reads X (rd &X): BusRd; P2 gets X=3 in Sc
     • Note: Only the cache in State Sm is responsible for cache-to-cache transfer
     • Memory: X=1

  36. Dragon Visualization (diagram)
     • P1 replaces X
     • Before the replacement: P1 X=3 Sc, P2 X=3 Sc, P3 X=3 Sm; memory: X=1

  37. Dragon Visualization (diagram)
     • P3 replaces X
     • Owner responsible for writing back to mem: memory goes from X=1 to X=3
     • vs. MSI or MESI where write-back only when the line is in M state
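
The Dragon walkthrough can be replayed in the same spirit. The sketch below models one block X and follows the visualization in leaving memory untouched until the owner replaces the block; the access order (P1 read, P1 write, P3 read, P3 write, P1 read, P3 read, P2 read, then P1 and P3 replacing X) is inferred from the diagrams. `run_dragon` is a hypothetical name.

```python
def run_dragon(trace, nprocs=3, mem=1):
    """Replay (proc, op, value) on one block X; op is "r", "w" or "evict"."""
    state = [None] * nprocs         # None means the block is not present
    value = [None] * nprocs
    log = []
    for p, op, v in trace:
        bus = "-"
        others = [q for q in range(nprocs) if q != p and state[q] is not None]
        shared = bool(others)       # wired-OR shared line
        if op == "evict":
            if state[p] in ("Sm", "M"):
                mem = value[p]      # the owner writes the block back
            state[p] = value[p] = None      # an Sc or E copy is dropped silently
        else:
            if state[p] is None:    # PrRdMiss or PrWrMiss: fetch the block
                bus, data = "BusRd", mem
                for q in others:
                    if state[q] in ("M", "Sm"):
                        state[q], data = "Sm", value[q]   # owner supplies data
                    else:
                        state[q] = "Sc"
                state[p], value[p] = ("Sc" if shared else "E"), data
            if op == "w":
                if shared:          # BusUpd: other copies take the new value
                    bus = "BusUpd" if bus == "-" else "BusRd;BusUpd"
                    for q in others:
                        state[q], value[q] = "Sc", v
                    state[p] = "Sm"
                else:
                    if state[p] in ("Sc", "Sm"):
                        bus = "BusUpd"      # BusUpd(~S): no sharer answered
                    state[p] = "M"
                value[p] = v
        log.append((f"P{p + 1} {op}", bus, tuple(state), mem))
    return log

# Processors are numbered from 0 here, so 0 is P1 and 2 is P3.
trace = [(0, "r", None), (0, "w", 2), (2, "r", None), (2, "w", 3),
         (0, "r", None), (2, "r", None), (1, "r", None),
         (0, "evict", None), (2, "evict", None)]
log = run_dragon(trace)
for step in log:
    print(step)
```

Compared with the MESI replay, the P1 read after P3's write is a hit here, and memory is only brought up to date when the Sm owner replaces the block.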

  38. Dragon Example

  39. Lower-Level Protocol Choices
     • Can shared-modified state be eliminated?
       • If update memory as well on BusUpd transactions (DEC Firefly)
       • Dragon protocol doesn’t (assumes DRAM memory slow to update)
     • Should replacement of an Sc block be broadcast?
       • Would allow last copy to go to Exclusive state and not generate updates
       • Replacement bus transaction is not in critical path, later update may be
     • Shouldn’t update local copy on write hit before controller gets bus
       • Can mess up serialization
     • Coherence, consistency considerations much like write-through case
     • In general, many subtle race conditions in protocols
     • But first, let’s illustrate quantitative assessment at logical level

  40. Lecture 9 Outline
     • MESI protocol
     • Dragon update-based protocol
     • Impact of protocol optimizations

  41. Assessing Protocol Tradeoffs
     • Methodology:
       • Use simulator; choose parameters per earlier methodology (default 1MB, 4-way cache, 64-byte block, 16 processors; 64K cache for some)
       • Focus on frequencies, not end performance for now
         • transcends architectural details, but not what we’re really after
       • Use idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
         • Cheap simulation: no need to model contention

  42. Impact of Protocol Optimizations
     MESI vs. MSI (w/ BusUpgr) vs. MSI (w/ BusRdX)
     [Figure: two bar charts of bus traffic (MB/s), split into address-bus and data-bus traffic. Bars are labeled workload/protocol for LU, Barnes, Radix, Ocean, Radiosity, Raytrace, Appl-Code, Appl-Data, OS-Code, and OS-Data, each under III, 3St, and 3St-RdEx.]
     • MSI = MESI
     • Upgrades instead of read-exclusive help
     • Same story when working sets don’t fit for Ocean, Radix, Raytrace

  43. Impact of Cache-Block Size
     • Multiprocessors add new kind of miss to cold, capacity, conflict
       • Coherence misses: due to invalidations
         • True sharing: write to the same word
         • False sharing: write to a different word in the same block
     • Reducing misses architecturally in invalidation protocol
       • Capacity: enlarge cache; increase block size (if spatial locality)
       • Conflict: increase associativity
       • Cold and coherence: only block size
     • Increasing block size has advantages and disadvantages
       • Can reduce misses if spatial locality is good
       • Can hurt too
         • increase misses due to false sharing if spatial locality not good
         • increase misses due to conflicts in fixed-size cache
         • increase traffic due to fetching unnecessary data and due to false sharing
         • can increase miss penalty and perhaps hit cost
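
The distinction between true and false sharing, and its dependence on block size, can be illustrated with a few lines of code. This is a simplified sketch: it assumes 4-byte words, considers one remote write followed by one local access, and the name `classify` is made up for the example.

```python
WORD_BYTES = 4   # assumed word size

def classify(block_bytes, written_addr, accessed_addr):
    """Classify the local access that follows a remote write.

    One processor caches the block holding accessed_addr, another processor
    writes written_addr, then the first processor accesses accessed_addr.
    """
    same_block = written_addr // block_bytes == accessed_addr // block_bytes
    if not same_block:
        return "no coherence miss"       # the local copy was never invalidated
    same_word = written_addr // WORD_BYTES == accessed_addr // WORD_BYTES
    return "true sharing" if same_word else "false sharing"

# Two unrelated words, 32 bytes apart, under growing block sizes.
for block in (8, 16, 32, 64, 128):
    print(block, classify(block, written_addr=0, accessed_addr=32))
```

With these addresses the access becomes a false-sharing miss once the block reaches 64 bytes, which is the effect the last group of bullets warns about.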

  44. Impact of Block Size on Miss Rate
     [Figure: two stacked-bar charts of miss rate (%) versus block size (8, 16, 32, 64, 128, 256 bytes) for Barnes, Lu, Radiosity, Ocean, Radix, and Raytrace. Each bar is split into cold, capacity, true-sharing, false-sharing, and upgrade components.]
     • For default problem size: vary block/line size from 8-256 bytes
     • Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)
     • Increases with larger lines: false sharing
     • Working set doesn’t fit: impact of capacity misses large (Ocean, Radix)

  45. Impact of Block Size on Traffic
     [Figure: bar chart of bus traffic (bytes/instruction), split into address-bus and data-bus traffic, versus block size (8 to 256 bytes) for Barnes, Radiosity, and Raytrace.]
     • Traffic (bytes/inst) affects performance indirectly through contention
     • Results differ from those for miss rate: traffic almost always increases
     • When working set fits, overall traffic still small, except for Radix
     • Fixed overhead is significant component
       • So total traffic often minimized at 16-32 byte block, not smaller
     • Working set doesn’t fit: even 128-byte good for Ocean due to capacity
     • Address bus traffic behaves in the opposite way to data bus traffic
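
The role of the fixed per-transaction overhead can be seen in a back-of-the-envelope model: traffic = misses × (overhead + block size). Every number below is invented for illustration and is not taken from the measurements on the slide, including the 6-byte overhead and the miss counts.

```python
OVERHEAD_BYTES = 6   # assumed address + command bytes per bus transaction

# Hypothetical misses per 1000 instructions for each block size (bytes).
misses_per_1000_instr = {8: 10.0, 16: 5.5, 32: 3.5, 64: 2.8, 128: 2.5, 256: 2.4}

def traffic(block_bytes):
    """Return (address-bus bytes, data-bus bytes) per 1000 instructions."""
    misses = misses_per_1000_instr[block_bytes]
    return misses * OVERHEAD_BYTES, misses * block_bytes

for block in misses_per_1000_instr:
    address, data = traffic(block)
    print(f"{block:4d} B: address {address:6.1f}  data {data:6.1f}  "
          f"total {address + data:6.1f}")

best = min(misses_per_1000_instr, key=lambda b: sum(traffic(b)))
print("block size with least total traffic:", best)
```

With these made-up miss counts, address-bus traffic falls and data-bus traffic rises as the block grows, and the total is lowest at a small but not minimal block size because of the fixed overhead.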
