Integrating Memory Compression and Decompression with Coherence Protocols in DSM Multiprocessors
Lakshmana R Vittanala (Intel)
Mainak Chaudhuri (IIT Kanpur)
Talk in Two Slides (1/2)
• Memory footprint of data-intensive workloads is ever-increasing
• We explore compression to reduce memory pressure in a medium-scale DSM multiprocessor
• Dirty blocks evicted from the last-level cache are sent to the home node and compressed in the home memory controller
• A last-level cache miss request from a node is sent to the home node, where the block is decompressed in the home memory controller
Memory Compression and Decompression
Talk in Two Slides (2/2)
• No modification in the processor
• Cache hierarchy sees only decompressed blocks
• All changes are confined to the directory-based cache coherence protocol
• Leverage spare core(s) to execute compression-enabled protocols in software
• Extend the directory structure for compression book-keeping
• Use a hybrid of two compression algorithms
• On 16 nodes, for seven scientific computing workloads: 73% storage saving on average with at most 15% increase in execution time
Contributions
• Two major contributions
  • First attempt to look at compression/decompression as directory protocol extensions in mid-range servers
  • First proposal to execute a compression-enabled directory protocol in software on spare core(s) of a multi-core die
• Makes the solution attractive in many-core systems
Sketch
• Background: Programmable Protocol Core
• Directory Protocol Extensions
• Compression/Decompression Algorithms
• Simulation Results
• Related Work and Summary
Programmable Protocol Core
• Past studies have considered off-die programmable protocol processors
  • These offer flexibility in the choice of coherence protocols compared to hardwired FSMs, but suffer from performance loss [Sun S3.mp, Sequent STiNG, Stanford FLASH, Piranha, …]
• With on-die integration of the memory controller and the availability of a large number of on-die cores, programmable protocol cores may become an attractive design
  • Recent studies show almost no performance loss [IEEE TPDS, Aug’07]
Programmable Protocol Core
• In our simulated system, each node contains
  • One complex out-of-order issue core which runs the application thread
  • One or two simple in-order static dual-issue programmable protocol core(s) which run the directory-based cache coherence protocol in software
  • On-die integrated memory controller, network interface, and router
• Compression/decompression algorithms are integrated into the directory protocol software
Programmable Protocol Core
[Node diagram: an out-of-order core (IL1/DL1, L2) and an in-order protocol core/protocol processor (IL1/DL1, AT, PT) share the on-die memory controller (SDRAM), network interface, and router]
Anatomy of a Protocol Handler
• On arrival of a coherence transaction at the memory controller of a node, a protocol handler is scheduled on the protocol core of that node
• Calculates the directory address if home node (simple hash function on the transaction address)
• Reads the 64-bit directory entry if home node
• Carries out simple integer arithmetic operations to figure out coherence actions
• May send messages to remote nodes
• May initiate transactions to the local OOO core
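The handler steps above can be sketched in C. This is a minimal illustration only: the directory base, the hash, and the bit layout of the entry (taken from the baseline-protocol slide: 4-bit state, 44 unused bits, 16-bit sharer vector) are our assumptions, not the actual handler code.

```c
#include <stdint.h>

#define DIR_BASE    0x00000000ULL  /* assumed base of the directory region   */
#define BLOCK_SHIFT 7              /* 128-byte cache blocks                  */

/* Step 1: compute the directory entry address from the transaction address
   (a simple hash: one 8-byte entry per 128-byte memory block).             */
static inline uint64_t dir_entry_addr(uint64_t txn_addr) {
    return DIR_BASE + ((txn_addr >> BLOCK_SHIFT) << 3);
}

/* Step 2: decode the 64-bit directory entry.
   Assumed layout: [63:60] state | [59:16] unused | [15:0] sharer vector.   */
static inline unsigned dir_state(uint64_t entry)   { return (unsigned)(entry >> 60); }
static inline unsigned dir_sharers(uint64_t entry) { return (unsigned)(entry & 0xffff); }
```

A handler would then branch on `dir_state()` to decide which messages to send to remote nodes or which transactions to initiate toward the local OOO core.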
Baseline Directory Protocol
• Invalidation-based three-state (MSI) bitvector protocol
• Derived from the SGI Origin MESI protocol and improved to handle early and late intervention races better
• 64-bit directory entry datapath: 4-bit state field (states include L, M, and two busy states), 44 unused bits, 16-bit sharer vector
Sketch
• Background: Programmable Protocol Core
• Directory Protocol Extensions
• Compression/Decompression Algorithms
• Simulation Results
• Related Work and Summary
Directory Protocol Extensions
• Compression support
  • All handlers that update memory blocks need extension with the compression algorithm
  • Two major categories: writeback handlers and GET intervention response handlers
  • The latter involve a state demotion from M to S and hence require an update of the memory block at home
  • GETX interventions do not require a memory update as they involve an ownership hand-off only
• Decompression support
  • All handlers that access memory in response to last-level cache miss requests
Directory Protocol Extensions
• Compression support (writeback cases)
[Diagrams: in the remote case, the WB message travels from the sender's protocol processor (SPP) to the home protocol processor (HPP), which compresses the block into DRAM and returns a WB_ACK; in the local case, the home processor (HP) writes back directly to the HPP, which compresses the block into DRAM]
Directory Protocol Extensions
• Compression support (intervention cases)
[Diagrams: a GET arriving at the HPP is forwarded to the dirty node (DP via its protocol processor RPP, or the local HP); the dirty node's PUT/SWB response delivers the uncompressed block to the requester, while the HPP compresses the block into DRAM at the home]
Directory Protocol Extensions
• Decompression support
[Diagrams: a GET/GETX from the requester (via its protocol processor RPP, or from the local HP) reaches the HPP, which reads the compressed block from DRAM, decompresses it, and returns the block in a PUT/PUTX]
Sketch
• Background: Programmable Protocol Core
• Directory Protocol Extensions
• Compression/Decompression Algorithms
• Simulation Results
• Related Work and Summary
Compression Algorithms
• Consider each 64-bit chunk of a 128-byte cache block at a time

Algorithm I:
  Original                   Compressed     Encoding
  All zero                   Zero byte      00
  MS 4 bytes zero            LS 4 bytes     01
  MS 4 bytes = LS 4 bytes    LS 4 bytes     10
  None of the above          Full 64 bits   11

• Algorithm II differs in encoding 10: LS 4 bytes zero; the compressed block stores the MS 4 bytes
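The per-chunk encodings above can be sketched as a pair of C helpers. This is an illustrative sketch, not the talk's implementation; in particular, treating the all-zero case as a zero-byte payload is our assumption, consistent with the later slide stating that compressed sizes are multiples of 4 bytes.

```c
#include <stdint.h>

/* Encode one 64-bit chunk under Algorithm I. Returns the 2-bit encoding
   and stores the compressed payload size in bytes.                       */
static unsigned encode_chunk_alg1(uint64_t chunk, unsigned *payload_bytes) {
    uint32_t ms = (uint32_t)(chunk >> 32);  /* most-significant 4 bytes  */
    uint32_t ls = (uint32_t)chunk;          /* least-significant 4 bytes */
    if (chunk == 0) { *payload_bytes = 0; return 0; } /* 00: all zero    */
    if (ms == 0)    { *payload_bytes = 4; return 1; } /* 01: store LS    */
    if (ms == ls)   { *payload_bytes = 4; return 2; } /* 10: store LS    */
    *payload_bytes = 8; return 3;                     /* 11: store as-is */
}

/* Algorithm II differs only in encoding 10: LS 4 bytes zero, store MS.  */
static unsigned encode_chunk_alg2(uint64_t chunk, unsigned *payload_bytes) {
    uint32_t ms = (uint32_t)(chunk >> 32);
    uint32_t ls = (uint32_t)chunk;
    if (chunk == 0) { *payload_bytes = 0; return 0; }
    if (ms == 0)    { *payload_bytes = 4; return 1; }
    if (ls == 0)    { *payload_bytes = 4; return 2; } /* 10: store MS    */
    *payload_bytes = 8; return 3;
}
```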
Compression Algorithms
• Ideally we would compute the compressed size under both algorithms for each of the 16 double-words in a cache block and pick the better one
  • Overhead is too high
• Trade-off #1: speculate based on the first 64 bits
  • If MS 32 bits ^ LS 32 bits = 0, use Algorithm I (covers two cases of Algorithm I)
  • If MS 32 bits & LS 32 bits = 0, use Algorithm II (covers three cases of Algorithm II)
Compression Algorithms
• Trade-off #2: if the compression ratio is low, it is better to avoid the decompression overhead
  • Decompression is fully on the critical path
• After compressing every 64 bits, compare the running compressed size against a threshold maxCsz (best: 48 bytes)
• Abort compression and store the entire block uncompressed as soon as the threshold is crossed
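The two trade-offs combine into a single compression loop, sketched below under stated assumptions: the tie-break when neither speculation test fires (we default to Algorithm I) and the behavior when both fire (all-zero first chunk, where either algorithm works) are our choices, since the slides do not specify them.

```c
#include <stdint.h>

#define CHUNKS  16  /* sixteen 64-bit chunks per 128-byte block          */
#define MAX_CSZ 48  /* abort threshold from the slide (bytes)            */

/* Returns the compressed size in bytes, or -1 if compression is aborted
   and the block is stored uncompressed. *use_alg2 records the algorithm
   chosen by speculating on the first 64 bits (Trade-off #1).            */
static int compress_block(const uint64_t blk[CHUNKS], int *use_alg2) {
    uint32_t ms0 = (uint32_t)(blk[0] >> 32), ls0 = (uint32_t)blk[0];
    /* MS ^ LS == 0 -> Algorithm I; otherwise MS & LS == 0 -> Algorithm II */
    *use_alg2 = ((ms0 ^ ls0) != 0) && ((ms0 & ls0) == 0);
    int size = 0;
    for (int i = 0; i < CHUNKS; i++) {
        uint32_t ms = (uint32_t)(blk[i] >> 32), ls = (uint32_t)blk[i];
        if (blk[i] == 0)                 size += 0;  /* encoding 00 */
        else if (ms == 0)                size += 4;  /* encoding 01 */
        else if (!*use_alg2 && ms == ls) size += 4;  /* Alg I,  10  */
        else if (*use_alg2 && ls == 0)   size += 4;  /* Alg II, 10  */
        else                             size += 8;  /* encoding 11 */
        if (size > MAX_CSZ) return -1;   /* Trade-off #2: abort      */
    }
    return size;
}
```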
Compression Algorithms
• Meta-data (required for decompression)
  • Most meta-data are stored in the unused 44 bits of the directory entry
  • The cache controller generates the uncompressed block address, so the directory address computation remains unchanged
• 32 bits to locate the compressed block
  • The compressed block size is a multiple of 4 bytes, but we extend it to the next 8-byte boundary to leave a cushion for future use
  • 32 bits therefore allow us to address 32 GB of compressed memory
Compression Algorithms
• Meta-data (continued)
  • Two bits to identify the compression algorithm: Algorithm I, Algorithm II, uncompressed, or all zero
    • All-zero blocks do not store anything in memory
  • For each 64 bits, one of four encodings must be known: maintained in a 32-bit header (two bits for each of the 16 double-words)
  • Optimization to speed up relocation: store the size of the compressed block in the directory entry; requires four bits (16 double-words maximum)
• Total: 70 bits of meta-data per compressed block
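The 70 bits of meta-data can be sketched as bit-field packing in C. The exact bit positions are our assumption: we pack the 38 bits that fit in the directory entry's unused field (32-bit compressed-block address, 2-bit algorithm code, 4-bit size) and assume the 32-bit per-chunk header lives alongside the compressed block.

```c
#include <stdint.h>

/* Assumed packing of per-block meta-data into the directory entry's
   44 unused bits: [31:0] compressed-block address (8-byte units),
   [33:32] algorithm code, [37:34] compressed size (8-byte units).    */
static uint64_t pack_meta(uint32_t caddr, unsigned alg, unsigned csize) {
    return (uint64_t)caddr
         | ((uint64_t)(alg   & 0x3) << 32)
         | ((uint64_t)(csize & 0xf) << 34);
}
static uint32_t meta_caddr(uint64_t m) { return (uint32_t)m; }
static unsigned meta_alg(uint64_t m)   { return (unsigned)((m >> 32) & 0x3); }
static unsigned meta_csize(uint64_t m) { return (unsigned)((m >> 34) & 0xf); }
```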
Decompression Example
• Directory entry information
  • 32-bit address: 0x4fd1276a → actual address = 0x4fd1276a << 3
  • Compression state: 01 → Algorithm II was used
  • Compressed size: 0101 → actual size = 40 bytes (not used in decompression)
• Header information
  • 32-bit header: 00 11 10 00 00 01 …
  • Upper 64 bits used encoding 00 of Algorithm II
  • Next 64 bits used encoding 11 of Algorithm II
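The arithmetic in this example can be written out directly. Two interpretations are ours but follow the slide's numbers: the 4-bit size field counts 8-byte units (0101 → 5 × 8 = 40 bytes), and chunk 0's encoding occupies the most-significant two header bits (so header 00 11 10 … gives chunk 0 → 00, chunk 1 → 11, as in the example).

```c
#include <stdint.h>

/* The 32-bit directory address names an 8-byte unit, giving 32 GB reach. */
static uint64_t compressed_block_addr(uint32_t dir_addr) {
    return (uint64_t)dir_addr << 3;
}

/* Assumed: the 4-bit size field counts 8-byte units.                     */
static unsigned compressed_size_bytes(unsigned size_field) {
    return size_field * 8;
}

/* Assumed: chunk 0 is in the MS two bits of the 32-bit header.           */
static unsigned chunk_encoding(uint32_t header, unsigned chunk) {
    return (header >> (30 - 2 * chunk)) & 0x3;
}
```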
Performance Optimization
• Protocol thread occupancy is critical
• Two protocol cores
• Out-of-order NI scheduling to improve protocol core utilization
• Cached message buffer (filled with the writeback payload)
  • 16 uncached loads/stores to the message buffer are needed during compression if it is not cached
  • Caching requires invalidating the buffer contents at the end of compression (coherence issue)
  • Flushing dirty contents would occupy the datapath, so we allow only cached loads
• The compression ratio remains unaffected
Sketch
• Background: Programmable Protocol Core
• Directory Protocol Extensions
• Compression/Decompression Algorithms
• Simulation Results
• Related Work and Summary
Storage Saving
[Bar chart: percentage of memory storage saved on each benchmark (Barnes, FFT, FFTW, LU, Ocean, Radix, Water); labeled values include 73%, 66%, 21%, and 16%]
Slowdown
[Bar chart: execution-time slowdown (1.00–1.60 scale) for five configurations — 1PP, 2PP, 2PP+OOO NI, 2PP+OOO NI+CLS, 2PP+OOO NI+CL — on Barnes, FFT, FFTW, LU, Ocean, Radix, and Water; labeled slowdowns of 2%, 5%, 7%, 1%, 11%, 15%, and 8% appear across the seven benchmarks]
Memory Stall Cycles
[Chart: memory stall cycle comparison across benchmarks and configurations]
Protocol Core Occupancy
• Dynamic instruction count and average handler occupancy

            w/o compression       w/ compression
  Barnes    29.1 M  (7.5 ns)      215.5 M  (31.9 ns)
  FFT       82.7 M  (6.7 ns)      185.6 M  (16.7 ns)
  FFTW      177.8 M (10.5 ns)     417.6 M  (22.7 ns)
  LU        11.4 M  (6.3 ns)      29.2 M   (14.8 ns)
  Ocean     376.6 M (6.7 ns)      1553.5 M (24.1 ns)
  Radix     24.7 M  (8.1 ns)      87.0 M   (36.9 ns)
  Water     62.4 M  (5.5 ns)      137.3 M  (8.8 ns)

• Occupancy is still hidden under the fastest memory access (40 ns)
Sketch
• Background: Programmable Protocol Core
• Directory Protocol Extensions
• Compression/Decompression Algorithms
• Simulation Results
• Related Work and Summary
Related Work
• Dictionary-based: IBM MXT, X-Match, X-RL
  • Not well-suited for cache-block grain
• Frequent pattern-based: applied to on-chip cache blocks
• Zero-aware compression: applied to memory blocks
• See the paper for more details
Summary
• Explored memory compression and decompression as coherence protocol extensions in DSM multiprocessors
• The compression-enabled handlers run on simple core(s) of a multi-core node
• The protocol core occupancy increases significantly, but can still be hidden under the memory access latency
• On seven scientific computing workloads, our best design saves 16% to 73% memory while slowing down execution by at most 15%
Integrating Memory Compression and Decompression with Coherence Protocols in DSM Multiprocessors
THANK YOU!
Lakshmana R Vittanala (Intel)
Mainak Chaudhuri (IIT Kanpur)