Integrating Memory Compression and Decompression with Coherence Protocols in DSM Multiprocessors
Lakshmana R Vittanala (Intel), Mainak Chaudhuri (IIT Kanpur)

Talk in Two Slides (1/2)
• Memory footprint of data-intensive workloads is ever-increasing
• We explore compression to reduce memory pressure in a medium-scale DSM multiprocessor
• Dirty blocks evicted from the last level of cache are sent to the home node
  • Compress in home memory controller
• A last-level cache miss request from a node is sent to the home node
  • Decompress in home memory controller
Memory Compression and Decompression

Talk in Two Slides (2/2)
• No modification in the processor
  • Cache hierarchy sees decompressed blocks
• All changes are confined to the directory-based cache coherence protocol
  • Leverage spare core(s) to execute compression-enabled protocols in software
  • Extend directory structure for compression book-keeping
• Use a hybrid of two compression algorithms
• On 16 nodes for seven scientific computing workloads, up to 73% storage saving with at most 15% increase in execution time

Contributions
• Two major contributions
  • First attempt to look at compression/decompression as directory protocol extensions in mid-range servers
  • First proposal to execute a compression-enabled directory protocol in software on spare core(s) of a multi-core die
    • Makes the solution attractive in many-core systems

Sketch
• Background: Programmable Protocol Core
• Directory Protocol Extensions
• Compression/Decompression Algorithms
• Simulation Results
• Related Work and Summary

Programmable Protocol Core
• Past studies have considered off-die programmable protocol processors
  • They offer flexibility in the choice of coherence protocols compared to hardwired FSMs, but suffer from performance loss [Sun S3.mp, Sequent STiNG, Stanford FLASH, Piranha, …]
• With on-die integration of the memory controller and the availability of a large number of on-die cores, programmable protocol cores may become an attractive design
  • Recent studies show almost no performance loss [IEEE TPDS, Aug’07]

Programmable Protocol Core
• In our simulated system, each node contains
  • One complex out-of-order issue core which runs the application thread
  • One or two simple in-order static dual-issue programmable protocol core(s) which run the directory-based cache coherence protocol in software
  • On-die integrated memory controller, network interface, and router
• Compression/decompression algorithms are integrated into the directory protocol software

Programmable Protocol Core
[Figure: node block diagram. Labels: OOO Core (AT) and In-order Core, i.e. Protocol Core/Protocol Processor (PT), each with IL1 and DL1; L2; Memory Control; SDRAM; Network Router]

Anatomy of a Protocol Handler
• On arrival of a coherence transaction at the memory controller of a node, a protocol handler is scheduled on the protocol core of that node
  • Calculates the directory address if home node (simple hash function on transaction address)
  • Reads 64-bit directory entry if home node
  • Carries out simple integer arithmetic operations to figure out coherence actions
  • May send messages to remote nodes
  • May initiate transactions to local OOO core

Baseline Directory Protocol
• Invalidation-based three-state (MSI) bitvector protocol
  • Derived from SGI Origin MESI protocol and improved to handle early and late intervention races better
• Directory entry (64-bit datapath), field widths:
  • 4 bits: states (L, M, two busy)
  • 44 bits: unused
  • 16 bits: sharer vector
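
The directory entry fields can be sketched as bit-packing code. This is a minimal illustration, not the authors' implementation: the slide gives only the field widths (4 + 44 + 16 = 64), so the field positions (state in the top 4 bits, sharer vector in the low 16 bits) and the helper names `pack_entry`, `unpack_entry`, and `add_sharer` are assumptions.

```python
# Field widths from the slide; field POSITIONS below are an assumption.
STATE_BITS, UNUSED_BITS, SHARER_BITS = 4, 44, 16
SHARER_MASK = (1 << SHARER_BITS) - 1
STATE_SHIFT = UNUSED_BITS + SHARER_BITS  # 60 under the assumed layout


def pack_entry(state, sharers):
    """Build a 64-bit directory entry from a 4-bit state and a 16-bit sharer vector."""
    assert 0 <= state < (1 << STATE_BITS)
    assert 0 <= sharers <= SHARER_MASK
    return (state << STATE_SHIFT) | sharers


def unpack_entry(entry):
    """Return (state, sharer vector); the 44 unused bits are ignored."""
    return entry >> STATE_SHIFT, entry & SHARER_MASK


def add_sharer(entry, node):
    """Set the sharer bit for one of the 16 nodes."""
    assert 0 <= node < SHARER_BITS
    return entry | (1 << node)
```

With one bit per node, the 16-bit sharer vector matches the 16-node system evaluated in the talk, and the 44 unused bits are what later slides reuse for compression meta-data.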

Directory Protocol Extensions
• Compression support
  • All handlers that update memory blocks need extension with compression algorithm
  • Two major categories: writeback handlers and GET intervention response handlers
    • Latter involves a state demotion from M to S and hence requires an update of memory block at home
    • GETX interventions do not require memory update as they involve ownership hand-off only
• Decompression support
  • All handlers that access memory in response to last-level cache miss requests

Directory Protocol Extensions
• Compression support (writeback cases)
[Figure: message-flow diagram. Nodes: SP, SPP, HPP, DRAM. Messages: WB, WB_ACK. Action: Compress]

Directory Protocol Extensions
• Compression support (writeback cases)
[Figure: message-flow diagram. Nodes: HP, HPP, DRAM. Message: WB. Action: Compress]

Directory Protocol Extensions
• Compression support (intervention cases)
[Figure: message-flow diagram. Nodes: RP, RPP, HPP, DP, DRAM. Messages: GET, PUT, SWB. Action: Compress]

Directory Protocol Extensions
• Compression support (intervention cases)
[Figure: message-flow diagram. Nodes: RP, RPP, HPP, HP, DRAM. Messages: GET, PUT, PUT (Uncompressed). Action: Compress]

Directory Protocol Extensions
• Compression support (intervention cases)
[Figure: message-flow diagram. Nodes: HP, HPP, DP, DRAM. Messages: GET, PUT, PUT (Uncompressed). Action: Compress]

Directory Protocol Extensions
• Decompression support
[Figure: message-flow diagram. Nodes: RP, RPP, HPP, DRAM. Messages: GET/GETX, PUT/PUTX. Action: Decompress]

Directory Protocol Extensions
• Decompression support
[Figure: message-flow diagram. Nodes: HP, HPP, DRAM. Messages: GET/GETX, PUT/PUTX. Action: Decompress]

Compression Algorithms
• Consider one 64-bit chunk at a time of a 128-byte cache block

Algorithm I
Original                   Compressed    Encoding
All zero                   Zero bytes    00
MS 4 bytes zero            LS 4 bytes    01
MS 4 bytes = LS 4 bytes    LS 4 bytes    10
None of the above          64 bits       11

Algorithm II
• Differs in encoding 10: LS 4 bytes zero; the compressed block stores the MS 4 bytes
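
The per-chunk encodings can be sketched as follows. This is an illustrative reading of the table, not the authors' handler code: the function names, the big-endian byte order, and the choice to store nothing for an all-zero chunk are assumptions.

```python
MASK32 = 0xFFFFFFFF


def compress_chunk(chunk, algorithm):
    """Return (2-bit encoding, stored bytes) for one 64-bit chunk.

    algorithm is 1 (Algorithm I) or 2 (Algorithm II).
    Byte order is an assumption made for illustration.
    """
    ms, ls = chunk >> 32, chunk & MASK32
    if chunk == 0:
        return 0b00, b""                      # all zero: nothing stored (assumed)
    if ms == 0:
        return 0b01, ls.to_bytes(4, "big")    # MS 4 bytes zero: keep LS
    if algorithm == 1 and ms == ls:
        return 0b10, ls.to_bytes(4, "big")    # Algorithm I: MS = LS, keep LS
    if algorithm == 2 and ls == 0:
        return 0b10, ms.to_bytes(4, "big")    # Algorithm II: LS zero, keep MS
    return 0b11, chunk.to_bytes(8, "big")     # no pattern: store all 64 bits


def decompress_chunk(encoding, payload, algorithm):
    """Invert compress_chunk for the same algorithm."""
    if encoding == 0b00:
        return 0
    word = int.from_bytes(payload, "big")
    if encoding == 0b01 or encoding == 0b11:
        return word
    # encoding 10 is the only one whose meaning depends on the algorithm
    return (word << 32) | word if algorithm == 1 else word << 32
```

For example, `compress_chunk(0xABCDEF01ABCDEF01, 1)` yields encoding 10 with four stored bytes, while the same chunk under Algorithm II falls through to encoding 11 and is stored in full.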

Compression Algorithms
• Ideally, we want to compute the compressed size under both algorithms for each of the 16 double-words in a cache block and pick the best
  • Overhead is too high
• Trade-off #1
  • Speculate based on the first 64 bits
  • If MS 32 bits ^ LS 32 bits = 0, use Algorithm I (covers two cases of Algorithm I)
  • If MS 32 bits & LS 32 bits = 0, use Algorithm II (covers three cases of Algorithm II)
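
The speculation step in trade-off #1 can be sketched as a small selector. The two tests come from the slide; the order in which they are applied and the fallback when neither test passes are not stated there, so the `default` parameter and the function name `pick_algorithm` are assumptions.

```python
MASK32 = 0xFFFFFFFF


def pick_algorithm(first_chunk, default=1):
    """Choose Algorithm I (1) or II (2) from the first 64 bits of a block."""
    ms, ls = first_chunk >> 32, first_chunk & MASK32
    if (ms ^ ls) == 0:
        return 1        # halves are equal (includes the all-zero chunk)
    if (ms & ls) == 0:
        return 2        # halves share no set bits (includes either half zero)
    return default      # fallback not given on the slide; assumed here
```

Note that the AND test is cheaper than checking each half against zero, but it also fires for chunks such as `0xF0F0F0F00F0F0F0F`, where neither half is zero; in that case the speculation simply picks Algorithm II and the chunk is stored uncompressed.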

Compression Algorithms
• Trade-off #2
  • If compression ratio is low, it is better to avoid decompression overhead
    • Decompression is fully on the critical path
  • After compressing every 64 bits, compare the running compressed size against a threshold maxCsz (best: 48 bytes)
  • Abort compression and store entire block uncompressed as soon as the threshold is crossed
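
The early-abort check in trade-off #2 can be sketched as a running-size loop. This is a sketch under assumptions: "crossed" is read as strictly greater than maxCsz, an all-zero chunk is assumed to cost zero bytes, and the names `chunk_size` and `compressed_size_or_abort` are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MAX_CSZ = 48  # bytes; the value the slide reports as best


def chunk_size(chunk, algorithm):
    """Bytes stored for one 64-bit chunk under Algorithm I (1) or II (2)."""
    ms, ls = chunk >> 32, chunk & MASK32
    if chunk == 0:
        return 0
    if ms == 0:
        return 4
    if (algorithm == 1 and ms == ls) or (algorithm == 2 and ls == 0):
        return 4
    return 8


def compressed_size_or_abort(chunks, algorithm, max_csz=MAX_CSZ):
    """Return the compressed size in bytes, or None if compression is aborted."""
    running = 0
    for chunk in chunks:
        running += chunk_size(chunk, algorithm)
        if running > max_csz:   # threshold crossed: store the block uncompressed
            return None
    return running
```

Under these assumptions, a block of 16 incompressible chunks is abandoned after the seventh chunk (56 bytes), so the handler never pays for compressing the remaining nine.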

Compression Algorithms
• Meta-data
  • Required for decompression
  • Most meta-data are stored in the unused 44 bits of the directory entry
    • Cache controller generates uncompressed block address; so directory address computation remains unchanged
  • 32 bits to locate the compressed block
    • Compressed block size is a multiple of 4 bytes, but we extend it to next 8-byte boundary to have a cushion for future use
    • 32 bits allow us to address 32 GB of compressed memory
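
The 32 GB figure follows from 8-byte alignment: a 32-bit pointer counts 8-byte units, so it reaches 2^32 × 8 bytes. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
POINTER_BITS = 32
ALIGNMENT = 8  # compressed blocks are padded to an 8-byte boundary


def round_up_to_8(nbytes):
    """Pad a compressed size (a multiple of 4 bytes) to the next 8-byte boundary."""
    return (nbytes + ALIGNMENT - 1) & ~(ALIGNMENT - 1)


def block_address(pointer):
    """Turn the 32-bit pointer kept in the directory entry into a byte address."""
    assert 0 <= pointer < (1 << POINTER_BITS)
    return pointer << 3


# Total compressed memory reachable with a 32-bit pointer to 8-byte units.
ADDRESSABLE_BYTES = (1 << POINTER_BITS) * ALIGNMENT
```

Here `ADDRESSABLE_BYTES` equals 32 × 2^30 bytes, and a 36-byte compressed block occupies 40 bytes after padding.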

Compression Algorithms
• Meta-data
  • Two bits to know the compression algorithm
    • Algorithm I, Algorithm II, uncompressed, all zero
    • All-zero blocks do not store anything in memory
  • For each 64 bits need to know one of four encodings
    • Maintained in a 32-bit header (two bits for each of the 16 double words)
  • Optimization to speed up relocation: store the size of the compressed block in directory entry
    • Requires four bits (16 double words maximum)
  • 70 bits of meta-data per compressed block
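
The 70-bit total and the 32-bit header can be checked with a short sketch. The bit widths come from the slides; the header bit order (first double word in the most significant pair) and the function names are assumptions made for illustration.

```python
ADDRESS_BITS = 32     # locates the compressed block
ALGORITHM_BITS = 2    # Algorithm I, Algorithm II, uncompressed, all zero
SIZE_BITS = 4         # compressed size in double words
HEADER_BITS = 32      # two bits for each of the 16 double words

DIRECTORY_BITS = ADDRESS_BITS + ALGORITHM_BITS + SIZE_BITS   # kept in the directory entry
TOTAL_BITS = DIRECTORY_BITS + HEADER_BITS


def pack_header(encodings):
    """Pack 16 two-bit encodings; the first double word lands in the top pair (assumed)."""
    assert len(encodings) == 16
    header = 0
    for enc in encodings:
        assert 0 <= enc <= 3
        header = (header << 2) | enc
    return header


def unpack_header(header):
    """Recover the 16 encodings in double-word order."""
    return [(header >> shift) & 0b11 for shift in range(30, -2, -2)]
```

`TOTAL_BITS` comes to 70, and the 38 bits kept in the directory entry fit within its 44 unused bits.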

Decompression Example
• Directory entry information
  • 32-bit address: 0x4fd1276a
    • Actual address = 0x4fd1276a << 3
  • Compression state: 01
    • Algorithm II was used
  • Compressed size: 0101
    • Actual size = 40 bytes (not used in decompression)
• Header information
  • 32-bit header: 00 11 10 00 00 01…
    • Upper 64 bits used encoding 00 of Algorithm II
    • Next 64 bits used encoding 11 of Algorithm II
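
The example values can be worked through in a few lines. This sketch decodes only the six header pairs shown on the slide (the rest of the header is elided there and is not reconstructed here); the constant and function names are hypothetical, and an all-zero chunk is assumed to occupy no bytes.

```python
POINTER = 0x4FD1276A      # 32-bit address field from the directory entry
SIZE_FIELD = 0b0101       # compressed size in 8-byte units
HEADER_PREFIX = ["00", "11", "10", "00", "00", "01"]   # pairs shown on the slide

# Bytes each Algorithm II encoding occupies in the compressed block.
ALGORITHM_II_BYTES = {"00": 0, "01": 4, "10": 4, "11": 8}


def walk_header(pairs):
    """Return ([(chunk index, encoding, byte offset, byte count)], bytes consumed)."""
    offset = 0
    layout = []
    for index, pair in enumerate(pairs):
        nbytes = ALGORITHM_II_BYTES[pair]
        layout.append((index, pair, offset, nbytes))
        offset += nbytes
    return layout, offset


actual_address = POINTER << 3     # byte address of the compressed block
actual_size = SIZE_FIELD * 8      # 5 double words

layout, consumed = walk_header(HEADER_PREFIX)
print(hex(actual_address), actual_size, consumed)
```

The shift gives byte address 0x27e893b50, the size field decodes to 40 bytes, and the six visible header pairs account for the first 16 bytes of the compressed block.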

Performance Optimization
• Protocol thread occupancy is critical
  • Two protocol cores
  • Out-of-order NI scheduling to improve protocol core utilization
  • Cached message buffer (filled with writeback payload)
    • 16 uncached loads/stores needed to message buffer if not cached during compression
    • Caching requires invalidating the buffer contents at the end of compression (coherence issue)
    • Flushing dirty contents occupies the datapath; so we allow only cached loads
    • Compression ratio remains unaffected

Storage Saving
[Figure: bar chart of storage saving per application (Barnes, FFT, FFTW, LU, Ocean, Radix, Water); y-axis from 0% to 80%; data labels visible in the chart: 73%, 66%, 21%, 16%]

Slowdown
[Figure: bar chart of slowdown per application (Barnes, FFT, FFTW, LU, Ocean, Radix, Water); y-axis from 1.00 to 1.60; configurations: 1PP, 2PP, 2PP+OOO NI, 2PP+OOO NI+CLS, 2PP+OOO NI+CL; data labels in extraction order: 2%, 5%, 7%, 1%, 11%, 15%, 8%]

Memory Stall Cycles
[Figure: memory stall cycles; chart contents not recoverable]

Protocol Core Occupancy
• Dynamic instruction count and handler occupancy

          w/o compression      w/ compression
Barnes    29.1 M (7.5 ns)      215.5 M (31.9 ns)
FFT       82.7 M (6.7 ns)      185.6 M (16.7 ns)
FFTW      177.8 M (10.5 ns)    417.6 M (22.7 ns)
LU        11.4 M (6.3 ns)      29.2 M (14.8 ns)
Ocean     376.6 M (6.7 ns)     1553.5 M (24.1 ns)
Radix     24.7 M (8.1 ns)      87.0 M (36.9 ns)
Water     62.4 M (5.5 ns)      137.3 M (8.8 ns)

• Occupancy still hidden under fastest memory access (40 ns)

Related Work
• Dictionary-based
  • IBM MXT
  • X-Match
  • X-RL
  • Not well-suited for cache block grain
• Frequent pattern-based
  • Applied to on-chip cache blocks
• Zero-aware compression
  • Applied to memory blocks
• See paper for more details

Summary
• Explored memory compression and decompression as coherence protocol extensions in DSM multiprocessors
• The compression-enabled handlers run on simple core(s) of a multi-core node
• The protocol core occupancy increases significantly, but can still be hidden under memory access latency
• On seven scientific computing workloads, our best design saves 16% to 73% memory while slowing down execution by at most 15%

THANK YOU!