Lecture 9 Outline
• MESI protocol
• Dragon update-based protocol
• Impact of protocol optimizations
ECE/CSC 506 - Spring 2007 - E. F. Gehringer, based on slides by Yan Solihin
Lower-Level Protocol Choices
• BusRd observed in M state: what transition to make?
  • Change to S: assume I’ll read again soon
    • good for mostly-read data
    • what about “migratory” data, thus:
  • Change to I: assume other will write to it (Synapse)
    • I read and write, then you read and write, then X reads and writes...
  • Sequent Symmetry and MIT Alewife use adaptive protocols
MESI (4-state) Invalidation Protocol
• Problem with MSI protocol
  • A Rd, Wr sequence incurs 2 bus transactions
    • even when no one is sharing (e.g., serial program!)
    • BusRd (I → S) followed by BusRdX or BusUpgr (S → M)
  • In general, coherence traffic from serial programs is unacceptable
• Add exclusive state:
  • Invalid
  • Modified (dirty)
  • Shared (two or more caches may have copies)
  • Exclusive (only this cache has a clean copy, same value as in memory)
• How to decide I → E or I → S?
  • Need to check whether someone else has a copy
  • “Shared” signal on bus: wired-OR line asserted in response to BusRd
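The saving from the E state can be counted with a minimal sketch. The function name `bus_transactions` and the flag `other_sharers` are hypothetical, and the model tracks a single block in a single cache, assuming nothing else touches the block in between.

```python
def bus_transactions(protocol, ops, other_sharers=False):
    """Count bus transactions one cache issues for a block that starts Invalid.

    protocol is "MSI" or "MESI"; ops is a sequence of "rd" / "wr".
    other_sharers models whether the shared line is asserted on a BusRd.
    """
    state = "I"
    count = 0
    for op in ops:
        if op == "rd":
            if state == "I":
                count += 1  # BusRd
                if protocol == "MESI" and not other_sharers:
                    state = "E"  # shared line not asserted
                else:
                    state = "S"
        elif op == "wr":
            if state in ("I", "S"):
                count += 1  # BusRdX from I; BusRdX or BusUpgr from S
            state = "M"  # E -> M and M -> M are silent
        else:
            raise ValueError(f"unknown operation: {op}")
    return count, state


msi = bus_transactions("MSI", ["rd", "wr"])
mesi = bus_transactions("MESI", ["rd", "wr"])
mesi_shared = bus_transactions("MESI", ["rd", "wr"], other_sharers=True)
```

With no sharers the MSI cache issues two transactions for the Rd, Wr pair and the MESI cache issues one; when another cache holds the block, MESI falls back to the same two transactions as MSI.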
MESI: Processor-Initiated Transactions
• M: PrRd/–, PrWr/– (stay in M)
• E: PrRd/– (stay in E); PrWr/– (E → M)
• S: PrRd/– (stay in S); PrWr/BusRdX (S → M)
• I: PrRd/BusRd(S) (I → S); PrRd/BusRd(~S) (I → E); PrWr/BusRdX (I → M)
MESI: Bus-Initiated Transactions
• M: BusRd/Flush (M → S); BusRdX/Flush (M → I)
• E: BusRd/Flush (E → S); BusRdX/Flush (E → I)
• S: BusRd/Flush' (stay in S); BusRdX/Flush' (S → I)
• I: BusRd/–, BusRdX/– (stay in I)
MESI State Transition Diagram
• BusRd(S) means the shared line is asserted on the BusRd transaction; BusRd(~S) means it is not
Flush vs. Flush'
• Flush: mandatory
• Flush' happens only when
  • cache-to-cache sharing is used, and
  • only one cache flushes the data
MESI Visualization
• Setup: P1, P2, and P3 each have a cache and a snooper on a shared bus; a memory controller connects the bus to main memory, which initially holds X=1
• P1 rd &X: BusRd; no other cache holds X, so P1 loads X=1 in state E (memory: X=1)
• P1 wr &X (X=2): E → M with no bus transaction (memory still X=1)
  • One less bus request due to Exclusive state, esp. for serial programs
• P3 rd &X: BusRd; P1 flushes X=2 and goes M → S; P3 loads X=2 in S (memory: X=2)
• P3 wr &X (X=3): BusUpgr; P1 goes S → I and P3 goes S → M (memory still X=2)
  • Note: BusUpgr instead of BusRdX
• P1 rd &X: BusRd; P3 flushes X=3 and goes M → S; P1 loads X=3 in S (memory: X=3)
• P3 rd &X: hit in S, no bus transaction
• P2 rd &X: BusRd; a sharer supplies X=3 with Flush'; P2 loads X=3 in S
  • Referred to as cache-to-cache transfer in the Illinois MESI protocol
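The MESI visualization can be replayed with a toy model of a single block X. This is a sketch under stated assumptions, not the course's simulator: the class `MesiBus` and its method names are hypothetical, Flush is assumed to update memory, and the access order (P1, P1, P3, P3, P1, P3, P2) is the one read off the visualization slides.

```python
class MesiBus:
    """Toy model of one block X held by several MESI caches on a snooping bus."""

    def __init__(self, n_caches, mem_value):
        self.state = ["I"] * n_caches
        self.value = [None] * n_caches
        self.mem = mem_value
        self.log = []  # bus transactions, in order

    def read(self, p):
        if self.state[p] != "I":
            return self.value[p]  # PrRd hit in M, E, or S: no bus transaction
        self.log.append("BusRd")
        data, shared = self.mem, False
        for q, st in enumerate(self.state):
            if q == p or st == "I":
                continue
            shared = True  # this snooper asserts the shared line
            if st == "M":
                self.mem = self.value[q]  # assumed: Flush also updates memory
            data = self.value[q]
            self.state[q] = "S"
        self.state[p] = "S" if shared else "E"
        self.value[p] = data
        return data

    def write(self, p, v):
        if self.state[p] in ("I", "S"):
            self.log.append("BusUpgr" if self.state[p] == "S" else "BusRdX")
            for q, st in enumerate(self.state):
                if q != p and st != "I":
                    if st == "M":
                        self.mem = self.value[q]  # owner flushes, then invalidates
                    self.state[q] = "I"
        self.state[p] = "M"  # E -> M and M -> M need no bus transaction
        self.value[p] = v


P1, P2, P3 = 0, 1, 2
bus = MesiBus(n_caches=3, mem_value=1)
bus.read(P1)      # I -> E
bus.write(P1, 2)  # E -> M, silent
bus.read(P3)      # P1 flushes, both end in S
bus.write(P3, 3)  # BusUpgr, P1 invalidated
bus.read(P1)      # P3 flushes, both end in S
bus.read(P3)      # hit
bus.read(P2)      # third sharer
```

After the replay `bus.log` holds five transactions (four BusRd and one BusUpgr), all three caches are in S with X=3, and memory also holds 3.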
MESI Example (Cache-to-Cache Transfer)
* Data from memory if no cache2cache transfer, BusRd/-
MESI Example (Cache-to-Cache Transfer+BusUpgr)
* Data from memory if no cache2cache transfer, BusRd/-
Lower-Level Protocol Choices
• Who supplies data on miss when not in M state: memory or cache?
• Original, Illinois MESI: cache
  • assume cache faster than memory (cache-to-cache transfer)
  • Not necessarily true
• Adds complexity
  • How does memory know it should supply data? (must wait for caches)
  • Selection algorithm if multiple caches have valid data
• Valuable for distributed memory
  • May be cheaper to obtain from nearby cache than distant memory
  • Especially when constructed out of SMP nodes (Stanford DASH)
Dragon Writeback Update Protocol
• Four states
  • Exclusive-clean (E): I and memory have it
  • Shared clean (Sc): I, others, and maybe memory, but I’m not owner
  • Shared modified (Sm): I and others but not memory, and I’m the owner
    • Sm and Sc can coexist in different caches, with at most one Sm
  • Modified or dirty (M): I and no one else
  • On replacement: Sc can silently drop, Sm has to flush
• No invalid state
  • If in cache, cannot be invalid
  • If not present in cache, can view as being in not-present or invalid state
• New processor events: PrRdMiss, PrWrMiss
  • Introduced to specify actions when block not present in cache
• New bus transaction: BusUpd
  • Broadcasts single word written on bus; updates other relevant caches
Dragon: Processor-Initiated Transactions
• Not present: PrRdMiss/BusRd(~S) (→ E); PrRdMiss/BusRd(S) (→ Sc); PrWrMiss/BusRd(~S) (→ M); PrWrMiss/(BusRd(S);BusUpd) (→ Sm)
• E: PrRd/– (stay in E); PrWr/– (E → M)
• Sc: PrRd/– (stay in Sc); PrWr/BusUpd(S) (Sc → Sm); PrWr/BusUpd(~S) (Sc → M)
• Sm: PrRd/– (stay in Sm); PrWr/BusUpd(S) (stay in Sm); PrWr/BusUpd(~S) (Sm → M)
• M: PrRd/–, PrWr/– (stay in M)
Dragon: Bus-Initiated Transactions
• E: BusRd/– (E → Sc)
• Sc: BusRd/– (stay in Sc); BusUpd/Update (stay in Sc)
• Sm: BusRd/Flush (stay in Sm); BusUpd/Update (Sm → Sc)
• M: BusRd/Flush (M → Sm)
Dragon State Transition Diagram
• Combined diagram over states E, Sc, Sm, and M, showing the processor-initiated and bus-initiated transitions together
Dragon Visualization
• Setup: P1, P2, and P3 each have a cache and a snooper on a shared bus; a memory controller connects the bus to main memory, which initially holds X=1
• P1 rd &X: BusRd; no other cache holds X, so P1 loads X=1 in state E (memory: X=1)
• P1 wr &X (X=2): E → M with no bus transaction (memory still X=1)
  • One less bus request due to Exclusive state, esp. for serial programs
• P3 rd &X: BusRd; P1 supplies X=2 and goes M → Sm; P3 loads X=2 in Sc (memory still X=1)
• P3 wr &X (X=3): BusUpd; P3 goes Sc → Sm, and P1 updates its copy to X=3 and goes Sm → Sc (memory still X=1)
  • Note: BusUpd instead of BusUpgr (no invalidation is performed)
• P1 rd &X: hit in Sc (X=3)
  • This is a miss in the MESI and MSI protocols
• P3 rd &X: hit in Sm
• P2 rd &X: BusRd; P3 supplies X=3; P2 loads X=3 in Sc
  • Note: only the cache in state Sm is responsible for cache-to-cache transfer
• P1 replaces X: the Sc copy is dropped silently (memory still X=1)
• P3 replaces X: the owner is responsible for writing back to memory (memory: X=1 → 3)
  • vs. MSI or MESI, where write-back happens only when the line is in M state
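The Dragon visualization can be replayed the same way. Again this is only a sketch: `DragonBus` and its methods are hypothetical names, `None` stands for "block not present", the write-back on replacement is logged as "Flush", and the access order is the one read off the visualization slides.

```python
class DragonBus:
    """Toy model of one block X held by several Dragon caches on a snooping bus."""

    def __init__(self, n_caches, mem_value):
        self.state = [None] * n_caches  # None = block not present
        self.value = [None] * n_caches
        self.mem = mem_value
        self.log = []  # bus transactions, in order

    def _others(self, p):
        return [q for q, st in enumerate(self.state) if q != p and st is not None]

    def read(self, p):
        if self.state[p] is not None:
            return self.value[p]  # PrRd hit: no bus transaction
        self.log.append("BusRd")  # PrRdMiss
        others = self._others(p)
        data = self.mem
        for q in others:
            if self.state[q] in ("M", "Sm"):
                data = self.value[q]  # owner flushes; memory is not updated
                self.state[q] = "Sm"
            else:
                self.state[q] = "Sc"  # E or Sc
        self.state[p] = "Sc" if others else "E"
        self.value[p] = data
        return data

    def write(self, p, v):
        if self.state[p] is None:
            self.read(p)  # PrWrMiss: fetch the block with BusRd first
        if self.state[p] in ("Sc", "Sm"):
            self.log.append("BusUpd")
            others = self._others(p)
            for q in others:
                self.value[q] = v     # other copies are updated, not invalidated
                self.state[q] = "Sc"  # a previous owner gives up ownership
            self.state[p] = "Sm" if others else "M"
        else:
            self.state[p] = "M"  # E -> M or M -> M: no bus transaction
        self.value[p] = v

    def replace(self, p):
        if self.state[p] in ("M", "Sm"):
            self.log.append("Flush")  # owner writes the block back to memory
            self.mem = self.value[p]
        self.state[p] = None  # E and Sc copies are dropped silently
        self.value[p] = None


P1, P2, P3 = 0, 1, 2
bus = DragonBus(n_caches=3, mem_value=1)
bus.read(P1)      # -> E
bus.write(P1, 2)  # E -> M, silent
bus.read(P3)      # P1: M -> Sm, P3: Sc
bus.write(P3, 3)  # BusUpd, P3: Sm, P1: Sc
bus.read(P1)      # hit (a miss under MESI)
bus.read(P3)      # hit
bus.read(P2)      # owner P3 supplies the data
mem_before_replacement = bus.mem
bus.replace(P1)   # Sc: silent drop
bus.replace(P3)   # Sm: write-back
```

Memory stays at the stale value 1 until the owner is replaced, which is the point of the last two visualization steps; P2 is left holding the only copy in Sc.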
Dragon Example
Lower-Level Protocol Choices
• Can shared-modified state be eliminated?
  • Yes, if memory is updated as well on BusUpd transactions (DEC Firefly)
  • Dragon protocol doesn’t (assumes DRAM memory slow to update)
• Should replacement of an Sc block be broadcast?
  • Would allow last copy to go to Exclusive state and not generate updates
  • Replacement bus transaction is not in critical path, later update may be
• Shouldn’t update local copy on write hit before controller gets bus
  • Can mess up serialization
• Coherence, consistency considerations much like write-through case
• In general, many subtle race conditions in protocols
• But first, let’s illustrate quantitative assessment at logical level
Assessing Protocol Tradeoffs
• Methodology:
  • Use simulator; choose parameters per earlier methodology (default 1 MB, 4-way cache, 64-byte block, 16 processors; 64 KB cache for some)
  • Focus on frequencies, not end performance, for now
    • transcends architectural details, but not what we’re really after
  • Use idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
    • Cheap simulation: no need to model contention
Impact of Protocol Optimizations
MESI vs. MSI (w/ BusUpgr) vs. MSI (w/ BusRdX)
[Figure: address-bus and data-bus traffic (MB/s) for Barnes, LU, Ocean, Radiosity, Radix, Raytrace, Appl-Code, Appl-Data, OS-Code, and OS-Data, each under three protocol variants labeled III, 3St, and 3St-RdEx]
• MSI ≈ MESI in bus traffic
• Using upgrades instead of read-exclusive helps
• Same story when working sets don’t fit for Ocean, Radix, Raytrace
Impact of Cache-Block Size
• Multiprocessors add a new kind of miss to cold, capacity, conflict
  • Coherence misses: due to invalidations
    • True sharing: write to the same word
    • False sharing: write to a different word in the same block
• Reducing misses architecturally in invalidation protocol
  • Capacity: enlarge cache; increase block size (if spatial locality)
  • Conflict: increase associativity
  • Cold and coherence: only block size
• Increasing block size has advantages and disadvantages
  • Can reduce misses if spatial locality is good
  • Can hurt too
    • increase misses due to false sharing if spatial locality not good
    • increase misses due to conflicts in fixed-size cache
    • increase traffic due to fetching unnecessary data and due to false sharing
    • can increase miss penalty and perhaps hit cost
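The true/false sharing distinction depends only on how two addresses map onto blocks and words, which a short sketch makes concrete. The function name `sharing_kind` and the 4-byte word size are assumptions chosen for illustration.

```python
def sharing_kind(addr_a, addr_b, block_size, word_size=4):
    """Classify two byte addresses written by different processors.

    Returns "none" if they fall in different blocks, "true" if they hit the
    same word, and "false" if they hit different words of the same block.
    word_size=4 is an assumed default, not a value from the slides.
    """
    if addr_a // block_size != addr_b // block_size:
        return "none"
    if addr_a // word_size == addr_b // word_size:
        return "true"
    return "false"


# Two counters 60 bytes apart: independent with 8-byte blocks,
# falsely shared once the block grows to 64 bytes.
small_blocks = sharing_kind(0, 60, block_size=8)
large_blocks = sharing_kind(0, 60, block_size=64)
same_word = sharing_kind(8, 10, block_size=64)
```

The same pair of addresses moves from "none" to "false" as the block size grows, which is why larger blocks can raise the coherence miss rate even though nothing about the program changed.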
Impact of Block Size on Miss Rate
[Figure: two panels of miss rate (%) versus block size (8, 16, 32, 64, 128, 256 bytes) for Barnes, Lu, Ocean, Radiosity, Radix, and Raytrace; each bar is split into Cold, Capacity, True sharing, False sharing, and Upgrade misses]
• For default problem size: vary block/line size from 8 to 256 bytes
• Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)
• Increases with larger lines: false sharing
• When the working set doesn’t fit, the impact of capacity misses is large (Ocean, Radix)
Impact of Block Size on Traffic
[Figure: address-bus and data-bus traffic (bytes/instruction) versus block size (8 to 256 bytes) for Barnes, Radiosity, and Raytrace]
Traffic (bytes/inst) affects performance indirectly through contention
• Results different from those for miss rate: traffic almost always increases
• When the working set fits, overall traffic is still small, except for Radix
• Fixed overhead is a significant component
  • So total traffic is often minimized at 16-32 byte blocks, not smaller
• Working set doesn’t fit: even 128-byte blocks are good for Ocean due to capacity
• Address-bus traffic behaves in the opposite way from data-bus traffic
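One way to read the fixed-overhead point is a first-order model. It is an assumption for illustration, not taken from the slides: each miss moves one block of B bytes on the data bus plus a fixed h bytes of address and command information on the address bus, and m(B) is the number of misses per instruction at block size B.

```latex
T(B) = m(B)\,(B + h), \qquad
T_{\text{addr}}(B) = m(B)\,h, \qquad
T_{\text{data}}(B) = m(B)\,B
```

Under this model, address-bus traffic falls whenever a larger block reduces m(B), while data-bus traffic rises unless m(B) drops faster than 1/B. For very small blocks the h term dominates, which is consistent with total traffic bottoming out at a moderate block size and not at the smallest one.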