Performance and Power Optimization through Data Compression in Network-on-Chip Architectures
Reetuparna Das, Asit K. Mishra, Chrysostomos Nicopoulos, Dongkook Park, Vijaykrishnan Narayanan, Ravi Iyer*, Mazin S. Yousif*, Chita R. Das
Why On-Chip Networks?
• Global interconnect delays are not scaling: no longer single-cycle global wire traversal!
• The march to multicores …
• We need a controlled, structured, low-power communication fabric
[Figure: relative delays based on the 2005 ITRS — from 250 nm to 32 nm, global wiring delay comes to dominate gate delay]
What is a Network-on-Chip?
[Figure: tiles connected by a mesh of routers (R)]
Is Network Latency Critical?
• in-net: on-chip interconnect
• off-chip: main memory access
• other: cache access and queuing
[Figure: average memory response time breakdown across network-intensive, memory-intensive, and balanced workloads]
Up to 50% of memory latency can come from the network!
NUCA-Specific Network-on-Chip Design Challenges
• NoCs have high bandwidth demands
• NoCs for NUCA are latency critical: latency directly affects memory access time
• NoCs consume significant system power
• NoC design is area constrained: need to reduce buffering requirements
• Goal: use compression to minimize latency, power, bandwidth, and buffer requirements
Compression?
• Is compression a viable solution for NoCs?
• Application profiling shows extensive value locality in NUCA data traffic (e.g., > 50% zero patterns!)
• Associated overheads:
  • Compression / decompression latency
  • Increased cache access latency to load/store variable-length cache blocks
Outline
• Network-on-Chip Background
• Compression
  • Algorithm
  • Cache Compression (CC) Scheme
  • NIC Compression (NC) Scheme
• Results
  • Network Results
  • CPI / Memory Response Time
  • Scalability
• Conclusion
Compressing Frequent Data Patterns (Alameldeen et al., ISCA 2004)
• One 32-bit segment of a cache block is compressed at a time
• Fewer bits encode a frequently occurring pattern (variable-length encoding): each pattern => unique prefix + significant bits
• Compression in a single cycle via a parallel compressor circuit
• Decompression takes five cycles: serialized due to the variable-length encoding
Compressing Frequent Data Patterns
[Figure: compression format — a 512-bit cache block / network packet is split into 32-bit segments; on a pattern match, a segment is encoded as a 3-bit prefix plus 0–32 data bits (compressed segment of 3–37 bits); otherwise it remains an uncompressed 32-bit segment]
Compressing Frequent Data Patterns
• The prefix uniquely identifies each pattern; pattern matching is parallel, so the compressed block is produced in 1 cycle
• Data segments are variable length, so decompression cannot be made parallel: the starting address of each compressed segment is known only after the previous segments are decoded
• Decompression therefore uses an optimized five-stage pipeline (5 cycles)
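To make the format concrete, here is a minimal Python sketch of frequent-pattern-style encoding. The specific pattern set and prefix assignment are illustrative assumptions, not the exact table from the paper; the point is that compression can test all patterns in parallel (one cycle in hardware), while decompression must walk the segments in order because each segment's start depends on the lengths of all earlier ones.

```python
# Illustrative frequent-pattern compression of 32-bit segments.
# The pattern set and 3-bit prefixes below are assumptions for this
# sketch, not the exact encoding used in the paper.

PATTERNS = [
    # (prefix, data bits, match predicate, data to keep)
    (0b000, 0,  lambda w: w == 0,              lambda w: 0),        # all zeros
    (0b001, 4,  lambda w: w < 0x8,             lambda w: w),        # fits in 4 bits
    (0b010, 8,  lambda w: w < 0x80,            lambda w: w),        # fits in 8 bits
    (0b011, 16, lambda w: w < 0x8000,          lambda w: w),        # fits in 16 bits
    (0b100, 16, lambda w: (w & 0xFFFF) == 0,   lambda w: w >> 16),  # low half zero
    (0b111, 32, lambda w: True,                lambda w: w),        # uncompressed
]
DATA_BITS = {prefix: nbits for prefix, nbits, _, _ in PATTERNS}
REBUILD = {0b000: lambda d: 0, 0b100: lambda d: d << 16}  # others: identity

def compress_block(words):
    """Encode each 32-bit word as prefix + significant bits.
    In hardware all pattern checks run in parallel (1-cycle compression);
    here we just take the first matching pattern."""
    out = []
    for w in words:
        prefix, nbits, match, keep = next(p for p in PATTERNS if p[2](w))
        out.append(format(prefix, "03b") +
                   (format(keep(w), f"0{nbits}b") if nbits else ""))
    return "".join(out)

def decompress_block(bits, n_words):
    """Decoding is inherently serial: segment i+1's start address is
    known only after segment i's prefix (hence length) is decoded --
    this is why decompression needs a multi-cycle pipeline."""
    words, pos = [], 0
    for _ in range(n_words):
        prefix = int(bits[pos:pos + 3], 2); pos += 3
        nbits = DATA_BITS[prefix]
        data = int(bits[pos:pos + nbits], 2) if nbits else 0
        pos += nbits
        words.append(REBUILD.get(prefix, lambda d: d)(data))
    return words

block = [0x0, 0x5, 0x1234, 0xDEAD0000, 0xCAFEBABE]
bits = compress_block(block)
assert decompress_block(bits, len(block)) == block
print(f"{len(block) * 32} bits -> {len(bits)} bits")  # 160 -> 83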
Frequent Data Patterns
• Frequent value locality (Zhang et al., ASPLOS 2000, MICRO 2000)
Outline
• Network-on-Chip Background
• Compression
  • Algorithm
  • Cache Compression (CC) Scheme
  • NIC Compression (NC) Scheme
• Results
  • Network Results
  • CPI / Memory Response Time
  • Scalability
• Conclusion
NIC Compression (NC) and Cache Compression (CC)
[Figure: CPUs with L1 caches and shared L2 banks connected by the NoC — the two places compression can live]
Cache Compression (CC) vs. NIC Compression (NC)
• NIC Compression
  <+> Sends compressed data over the network
  <-> Overhead: (1) compressor, (2) decompressor
• Cache Compression
  <+> Sends compressed data over the network
  <+> Stores compressed data in the L2 cache, increasing effective cache capacity
  <-> Overhead: (1) compressor, (2) decompressor, (3) variable-line cache architecture
Cache Compression (CC)
[Figure: CMP mesh — each CPU/L1 node reaches L2 banks through routers; compression/decompression logic sits in the NIC; the L2 stores compressed, variable-length cache blocks]
Cache Compression (CC)
• Compressor penalty for L1 writebacks (1 cycle)
• The network communicates compressed data
• The L2 cache stores compressed, variable-length cache blocks: 2 cycles of additional L2 hit latency to store/load
• Decompression penalty on every L1 miss (5 cycles)
[Figure: CPU node (CPU, L1, NIC with comp/decomp, router) and L2 cache bank node exchanging compressed data; L2 holds variable-length cache blocks]
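As a rough sanity check on where these cycles go, the toy model below tallies one L1-miss round trip under CC versus an uncompressed baseline. The 1-, 2-, and 5-cycle penalties are the numbers on this slide; the hop count, per-hop latency, L2 hit latency, and flit counts are illustrative assumptions.

```python
# Toy latency model for one L1 miss serviced by a remote L2 bank.
# Penalties marked 'slide' come from the deck; HOPS, CYCLES_PER_HOP,
# L2_HIT, and the flit counts are placeholder assumptions.

HOPS = 4             # assumed one-way hop count
CYCLES_PER_HOP = 2   # two-stage router pipeline (see backup slide)
L2_HIT = 10          # assumed uncompressed L2 hit latency

def baseline_miss(reply_flits):
    net = 2 * HOPS * CYCLES_PER_HOP      # request + reply traversal
    return net + L2_HIT + reply_flits    # + reply serialization

def cc_miss(compressed_reply_flits):
    net = 2 * HOPS * CYCLES_PER_HOP
    l2 = L2_HIT + 2                      # +2-cycle variable-line access (slide)
    decomp = 5                           # decompression on the L1 miss (slide)
    # the 1-cycle compression penalty (slide) applies to L1 writebacks,
    # which are off the read path, so it is not charged here
    return net + l2 + compressed_reply_flits + decomp

print(baseline_miss(9), cc_miss(4))  # a 9-flit reply compressed to 4 flits
```

In this unloaded sketch the fixed CC overheads roughly cancel the serialization savings; as the results slides later show, the real gains come from reduced queuing and blocking latency under load, plus the capacity benefit of storing compressed blocks in L2.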
Variable-Line Cache (CC Scheme)
• Uncompressed cache set: fixed 64-byte lines, so ADDR(LINE 2) = BASE + 0x80
• Compressed cache set: variable-length lines, so ADDR(LINE 2) = BASE + LENGTH(LINE 0) + LENGTH(LINE 1)
• All lines must be contiguous, or the address calculation will not work
• Compaction is needed on evictions and fat writes (see the sketch after these slides)
Variable-Line Cache (CC Scheme)
• An 8-way set ≡ 16 × 8 segments in the set
Variable-Line Cache (CC Scheme)
• Compaction is off the critical path
• Overhead: 2 cycles added to the hit latency
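A minimal sketch of the address arithmetic behind the variable-line cache (segment sizes and names are illustrative): a line's offset within a compressed set is the running sum of the lengths of the lines before it, which is why lines must stay contiguous and why evictions and fat writes trigger compaction.

```python
# Variable-line cache address calculation: line i starts where the
# lengths of lines 0..i-1 end, so the set's lines must be contiguous.

LINE_BYTES = 64  # uncompressed line size

def fixed_offset(way):
    """Uncompressed set: fixed-size lines."""
    return way * LINE_BYTES

def variable_offset(lengths, way):
    """Compressed set: offset = sum of the preceding lines' lengths."""
    return sum(lengths[:way])

assert fixed_offset(2) == 0x80                 # ADDR(LINE 2) = BASE + 0x80

lengths = [24, 40, 64, 16]                     # illustrative compressed lengths
assert variable_offset(lengths, 2) == 24 + 40  # BASE + LEN(0) + LEN(1)

def evict(lengths, data, way):
    """Eviction (or a write that grows a line) leaves a hole; later
    lines must be shifted down -- the compaction step, which the CC
    design keeps off the critical path."""
    del lengths[way], data[way]
    return lengths, data
```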
Outline
• Network-on-Chip Background
• Compression
  • Algorithm
  • Cache Compression (CC) Scheme
  • NIC Compression (NC) Scheme
• Results
  • Network Results
  • CPI / Memory Response Time
  • Scalability
• Conclusion
NIC Compression (NC)
[Figure: CMP mesh of routers — compression and decompression happen only at the network interfaces]
NIC Compression (NC)
• No modifications to the L2 cache: it stores uncompressed data
• The network communicates compressed data
• Compression penalty on each data network transfer (1 cycle)
• Decompression penalty on each data network transfer (5 cycles)
[Figure: CPU node and L2 cache bank node each hold uncompressed data; comp/decomp sits in the NICs, so only the links carry compressed data]
NC Decompression Optimization
Overlap of decompression and communication latency: an example timeline for a five-flit message
• Without overlap: flits arrive over cycles 0–4, then the five decompression stages run over cycles 5–9
• With overlap: precomputation stages 1, 2, and 3 run while flits are still arriving, leaving only stages 4 and 5 after the tail flit (cycles 5–6)
• Savings: 3 cycles per message — effective decompression latency drops from 5 cycles to 2
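The slide's timeline arithmetic as a tiny sketch, using exactly the parameters shown (five flits, one flit per cycle, a five-stage decompressor whose first three stages can be precomputed):

```python
def nc_receive_latency(flits=5, decomp_stages=5, precompute_stages=3):
    """Cycles from head-flit arrival until decompressed data is ready,
    assuming one flit arrives per cycle."""
    naive = flits + decomp_stages                # decompress after the tail flit
    overlapped = flits + (decomp_stages - precompute_stages)
    return naive, overlapped

print(nc_receive_latency())  # (10, 7): 3 cycles saved per message,
                             # effective decompression latency 5 -> 2
```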
Outline
• Network-on-Chip Background
• Compression
  • Algorithm
  • Cache Compression (CC) Scheme
  • NIC Compression (NC) Scheme
• Results
  • Network Results
  • CPI / Memory Response Time
  • Scalability
• Conclusion
Experimental Platform
• Detailed trace-driven, cycle-accurate hybrid NoC/cache simulator for CMP architectures
• MESI-based coherence protocol with distributed directories
• Router model: two-stage pipeline, wormhole flow control, finite input buffering, deterministic X-Y routing
• Interconnect power estimates from synthesis in Synopsys Design Compiler using a TSMC 90 nm standard cell library
• Memory traces generated with the Simics full-system simulator
Workloads
• Commercial: TPC-W, SPECjbb, static web serving (Apache and Zeus)
• SPEComp benchmarks
• SPLASH-2
• MediaBench-II
Compressed Packet Length
[Figure: per-workload compressed packet lengths — packet compression ratio of up to 60%!]
Packet Latency
[Figure: per-workload packet latency — interconnect latency reduction of up to 33%]
Network Latency Breakdown
[Figure: interconnect latency breakdown]
Network Latency Breakdown
[Figure: interconnect latency breakdown — major reductions in queuing latency and blocking latency; average reduction of 21%]
Network Power
[Figure: network power reduction of up to 22%]
Buffer Utilization
[Figure: normalized router buffer utilization]
Memory Response Time
[Figure: memory response time reduction of up to 20%]
System Performance
[Figure: normalized CPI — reduction of up to 15%]
Scalability Study
(All scalability results are for SPECjbb)
Conclusions
• Compression is a simple and effective technique for optimizing the performance and power of on-chip networks
• The CC (NC) scheme provides an average 21% (20%) reduction in network latency, with maximum savings of 33% (32%)
• Network power consumption is reduced by an average of 7% (23% maximum)
• Average 7% reduction in CPI over the generic (uncompressed) case, by reducing network latency
Thank you! Questions?
Basics: Router Architecture
[Figure: router microarchitecture — five input ports (from East, West, North, South, and the PE), each with three virtual-channel buffers (VC 0–2) and a VC identifier; control logic comprising the routing unit (RC), VC allocator (VA), and switch allocator (SA); a 5 × 5 crossbar driving the output ports (to East, West, North, South, PE)]
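For reference, a minimal sketch of the router on this slide: five input ports, three virtual channels per port, X-Y route computation, and a 5×5 crossbar. The port/VC structure matches the diagram; the code organization, coordinate convention, and the grouping of stages into the two-stage pipeline mentioned earlier are illustrative assumptions.

```python
# Sketch of the wormhole router on this slide: 5 input ports
# (N, S, E, W, PE), 3 VCs per port, deterministic X-Y routing,
# and a 5x5 crossbar. Illustrative, not the simulator's code.

from collections import deque

PORTS = ["N", "S", "E", "W", "PE"]
VCS_PER_PORT = 3

class Router:
    def __init__(self, x, y):
        self.x, self.y = x, y
        # one flit FIFO per (input port, virtual channel)
        self.in_bufs = {p: [deque() for _ in range(VCS_PER_PORT)]
                        for p in PORTS}

    def route_compute(self, dest_x, dest_y):
        """RC: deterministic X-Y routing -- correct X first, then Y.
        (Assumes x grows eastward and y grows northward.)"""
        if dest_x != self.x:
            return "E" if dest_x > self.x else "W"
        if dest_y != self.y:
            return "N" if dest_y > self.y else "S"
        return "PE"  # arrived: eject to the local processing element

# Two-stage pipeline (as modeled by the simulator); one common grouping:
#   stage 1: route computation (RC) + VC allocation (VA)
#   stage 2: switch allocation (SA) + 5x5 crossbar traversal
```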
Scalability with Flit Width
(All scalability results are for SPECjbb)
Scalability with System Size
(All scalability results are for SPECjbb)
Scalability with System Size
• Network sizes compared: 8p CMP with 24, 40, and 72 nodes; 16p CMP with 32, 48, and 80 nodes
• A 16p CMP has a larger network, which is detrimental for compression
• But a 16p CMP also has more processors, hence higher injection rates and load on the network, where compression should help
Scalability with System Size
• CC relative to NC does better with more processors: cache compression helps more as the system scales