
Performance and Power Optimization through Data Compression in Network-on-Chip Architectures



  1. Performance and Power Optimization through Data Compression in Network-on-Chip Architectures
  Reetuparna Das, Asit K Mishra, Chrysostomos Nicopoulos, Dongkook Park, Vijaykrishnan Narayanan, Ravi Iyer*, Mazin S. Yousif*, Chita R. Das

  2. Why On-Chip Networks?
  Relative delays based on the 2005 ITRS:
  • Global interconnect delays are not scaling …
  • No longer single-cycle global wire traversal!
  • March to multicores …
  • We need a controlled, structured, low-power communication fabric
  [Figure: relative gate delay vs. global wiring delay from 250 nm down to 32 nm — global wiring delay dominates]

  3. What is Network-on-Chip?
  [Figure: tiled CMP with a mesh of routers (R), one per node]

  4. Is Network Latency Critical?
  Average memory response time break-down:
  • in-net: on-chip interconnect
  • off-chip: main memory access
  • other: cache access and queuing
  Workloads split into network-intensive, memory-intensive, and balanced cases.
  Up to 50% of memory latency can come from the network!

  5. NUCA-specific Network-on-Chip Design Challenges …
  • NoCs have high bandwidth demands
  • NoCs specific to NUCA are latency critical: latency directly affects memory access time
  • NoCs consume significant system power
  • NoC design is area constrained: need to reduce buffering requirements
  • Goal: use compression to minimize latency, power, bandwidth, and buffer requirements

  6. Compression?
  • Is compression a viable solution for NoC?
  • Application profiling shows extensive value locality in NUCA data traffic (e.g. > 50% zero patterns!) — a profiling sketch follows below
  • Associated overheads:
    • Compression / decompression latency
    • Increased cache access latency to load/store variable-length cache blocks
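One way to sanity-check the zero-pattern claim on a trace of one's own is to count all-zero 32-bit segments in the data blocks crossing the network. A minimal sketch, assuming the trace is available as raw 64-byte blocks (the function name and trace format here are hypothetical):

```python
import struct

def zero_word_fraction(blocks):
    """Fraction of 32-bit segments that are the all-zero pattern."""
    zeros = total = 0
    for block in blocks:                          # e.g. 64-byte L2 data blocks
        for (word,) in struct.iter_unpack("<I", block):
            total += 1
            zeros += (word == 0)
    return zeros / total if total else 0.0

# Toy data: 31 of the 32 words are zero
blocks = [bytes(64), b"\x01" + bytes(63)]
print(zero_word_fraction(blocks))                 # 0.96875
```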

  7. Outline
  • Network-on-Chip Background
  • Compression: Algorithm, Cache Compression (CC) Scheme, NIC Compression (NC) Scheme
  • Results: Network Results, CPI / Memory Response Time, Scalability
  • Conclusion

  8. Compressing Frequent Data Patterns
  • One 32-bit segment of a cache block is compressed at a time
  • Fewer bits encode a frequently occurring pattern — variable-length encoding: each pattern ⇒ unique prefix + significant bits
  • Compression takes a single cycle (parallel compressor circuit)
  • Decompression takes five cycles — serialized due to the variable-length encoding
  Alameldeen et al., ISCA 2004

  9. Compressing Frequent Data Patterns
  A cache block / network packet (512 bits) is scanned 32 bits at a time:
  • Pattern match → compressed segment (3-37 bits): 3-bit prefix + 0-32 data bits
  • No match → uncompressed segment (32 bits)

  10. Compressing Frequent Data Patterns
  • Parallel pattern match → compressed block; the prefix uniquely identifies each pattern (1 cycle)
  • Decompression cannot be made parallel: data segments are variable length, so the starting address of each compressed segment must be known before it can be decoded (a functional sketch follows)
  • Optimized five-stage decompression pipeline (5 cycles)
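A minimal functional sketch of the scheme on slides 8-10. The pattern table below is an assumption in the spirit of frequent-pattern compression — the deck does not show the exact prefix assignments, so the codes and patterns are illustrative:

```python
MASK32 = 0xFFFFFFFF

def is_sign_extended(word, bits):
    """True if `word` is the 32-bit sign extension of its low `bits` bits."""
    lo = word & ((1 << bits) - 1)
    if lo >> (bits - 1):                          # negative: upper bits all ones
        return word == (lo | (MASK32 & ~((1 << bits) - 1)))
    return word == lo                             # positive: upper bits all zeros

def compress_segment(word):
    """Encode one 32-bit segment as (3-bit prefix, significant bits, bit count)."""
    if word == 0:
        return ("000", 0, 0)                      # all zeros: prefix only
    if is_sign_extended(word, 8):
        return ("001", word & 0xFF, 8)            # sign-extended byte
    if is_sign_extended(word, 16):
        return ("010", word & 0xFFFF, 16)         # sign-extended halfword
    if word & 0xFFFF == 0:
        return ("011", word >> 16, 16)            # halfword padded with zeros
    return ("111", word, 32)                      # no match: stored verbatim

def compress_block(words):
    """512-bit block = 16 segments; hardware matches all 16 in parallel (1 cycle)."""
    return [compress_segment(w) for w in words]

def decompress_block(segments):
    """Serial by construction: segment i's start position in the bitstream is
    known only once the lengths of segments 0..i-1 have been decoded."""
    out = []
    for prefix, data, bits in segments:
        if prefix == "000":
            out.append(0)
        elif prefix in ("001", "010"):
            width = 8 if prefix == "001" else 16
            ext = MASK32 & -(data >> (width - 1))     # replicate the sign bit
            out.append(((ext << width) | data) & MASK32)
        elif prefix == "011":
            out.append(data << 16)
        else:
            out.append(data)
    return out

block = [0x00000000, 0xFFFFFFF0, 0x00310000, 0xDEADBEEF] + [0] * 12
enc = compress_block(block)
assert decompress_block(enc) == block
print(sum(3 + bits for _, _, bits in enc), "bits vs 512")   # 104 bits vs 512
```

Note how decompress_block must walk the segments in order — each segment's starting bit is the sum of the preceding lengths — which is exactly why the hardware decompressor is a serial five-stage pipeline while compression can match all segments in parallel.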

  11. Frequent Data Patterns
  • Frequent value locality — Zhang et al., ASPLOS 2000; MICRO 2000

  12. Frequent Data Patterns
  • Frequent value locality — Zhang et al., ASPLOS 2000; MICRO 2000

  13. Outline
  • Network-on-Chip Background
  • Compression: Algorithm, Cache Compression (CC) Scheme, NIC Compression (NC) Scheme
  • Results: Network Results, CPI / Memory Response Time, Scalability
  • Conclusion

  14. NIC Compression (NC) vs. Cache Compression (CC)
  [Figure: CMP block diagram — CPU/L1 nodes and L2 banks connected by the NoC, contrasting where each scheme compresses]

  15. Cache Compression (CC) vs. NIC Compression (NC)
  • NIC Compression
    <+> Sends compressed data over the network
    <-> Overhead: (1) compressor (2) decompressor
  • Cache Compression
    <+> Sends compressed data over the network
    <+> Stores compressed data in the L2 cache → increased cache capacity
    <-> Overhead: (1) compressor (2) decompressor (3) variable-line cache architecture

  16. Cache Compression (CC)
  L2 stores compressed, variable-length cache blocks.
  [Figure: mesh of routers (R); the CPU/L1 node's NIC contains the compression/decompression logic]

  17. Cache Compression (CC)
  • Compressor penalty on L1 write-backs (1 cycle)
  • Network communicates compressed data
  • L2 cache stores compressed, variable-length cache blocks: 2-cycle additional L2 hit latency to store/load
  • Decompression penalty on every L1 miss (5 cycles)
  A rough cycle accounting follows below.
  [Figure: CPU node (L1, comp/decomp in NIC, router) exchanging compressed data with an L2 cache bank node]
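To see how these penalties interact, here is a back-of-the-envelope cycle model of one L1 miss served by a remote L2 bank under CC. Only the 2- and 5-cycle penalties come from this slide; the baseline network and L2 hit latencies and the flit savings are assumed numbers, purely for illustration:

```python
# Quoted on this slide:
CC_EXTRA_HIT = 2        # extra L2 hit cycles for variable-length blocks
DECOMP = 5              # decompression on the L1 miss fill

# Assumed baseline numbers (hypothetical, unloaded network):
REQ_NET, L2_HIT, REPLY_NET = 12, 10, 16   # request, L2 hit, 5-flit data reply
REPLY_SAVED = 3                           # compressed reply is a few flits shorter

baseline = REQ_NET + L2_HIT + REPLY_NET                                      # 38 cycles
cc = REQ_NET + (L2_HIT + CC_EXTRA_HIT) + (REPLY_NET - REPLY_SAVED) + DECOMP  # 42 cycles
print(baseline, cc)
```

In an unloaded network the fixed 7-cycle overhead can outweigh the serialization savings; the scheme pays off because shorter packets also cut queuing and blocking delay under load (see the breakdown on slide 33) and reduce buffer pressure.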

  18. Variable Line Cache (CC Scheme)
  • Uncompressed cache set — fixed line size (64 bytes): ADDR(LINE 2) = BASE + 0x80
  • Compressed cache set — variable line size: ADDR(LINE 2) = BASE + LENGTH(LINE 0) + LENGTH(LINE 1) (see the toy model below)
  • All lines need to be contiguous, or the address calculation will not work
  • Need to compact the set on evictions or fat writes (a rewritten block that compresses to a larger size)
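The address arithmetic above amounts to a prefix sum over the compressed line lengths in the set. A toy model — names and units are illustrative, and the real design keeps the lengths in set metadata:

```python
def line_addr(base, lengths, way):
    """A line's start address = base + total length of all earlier lines."""
    return base + sum(lengths[:way])

def evict(lengths, way):
    """On eviction (or a fat write) later lines must slide down so the set
    stays contiguous and the prefix-sum addressing stays valid."""
    del lengths[way]

lengths = [0x10, 0x36, 0x24, 0x40]         # compressed line sizes in bytes
print(hex(line_addr(0x4000, lengths, 2)))  # 0x4000 + 0x10 + 0x36 = 0x4046
```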

  19. Variable Line Cache (CC Scheme)
  An 8-way set ≡ 16 × 8 segments per set.
  [Figure: set storage organized as a shared pool of segments across the 8 ways]

  20. Variable Line Cache (CC Scheme)
  Address computation is off the critical path; overhead: 2 cycles added to the hit latency.

  21. Outline
  • Network-on-Chip Background
  • Compression: Algorithm, Cache Compression (CC) Scheme, NIC Compression (NC) Scheme
  • Results: Network Results, CPI / Memory Response Time, Scalability
  • Conclusion

  22. NIC Compression (NC)
  [Figure: mesh of routers (R) with compression/decompression at every network interface]

  23. NIC Compression (NC)
  • No modifications to the L2 cache — it stores uncompressed data
  • Network communicates compressed data
  • Compressor penalty on each data network transfer (1 cycle)
  • Decompression penalty on each data network transfer (5 cycles)
  [Figure: CPU node and L2 bank node, each with comp/decomp in its NIC; data is uncompressed in the caches and compressed only on the wire]

  24. NC Decompression Optimization
  Overlap decompression with communication latency — example timeline for a five-flit message (see the model below):
  • Without overlap: flits arrive over cycles 0-4, then the five decompression cycles follow (cycles 0-9)
  • With overlap: the precomputation stages (stages 1, 2, and 3) run while flits are still arriving; only stages 4 and 5 remain after the last flit (cycles 0-6)
  • Savings: 3 cycles per message — effective decompression latency drops from 5 cycles to 2
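The arithmetic behind the timeline, as a tiny model (the flit count and stage split are taken from the slide; everything else is just bookkeeping):

```python
FLITS = 5          # five-flit message; last flit arrives at cycle FLITS - 1
STAGES = 5         # decompression pipeline depth
PRECOMPUTE = 3     # stages 1-3 can run while flits are still arriving

serial = FLITS + STAGES                     # cycles 0-9: receive, then decompress
overlapped = FLITS + (STAGES - PRECOMPUTE)  # cycles 0-6: only stages 4-5 at the end
print(serial, overlapped, serial - overlapped)  # 10 7 3 -> effective latency 2 cycles
```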

  25. Outline
  • Network-on-Chip Background
  • Compression: Algorithm, Cache Compression (CC) Scheme, NIC Compression (NC) Scheme
  • Results: Network Results, CPI / Memory Response Time, Scalability
  • Conclusion

  26. Experimental Platform
  • Detailed trace-driven, cycle-accurate hybrid NoC / cache simulator for CMP architectures
  • MESI-based coherence protocol with distributed directories
  • Router model: two-stage pipeline, wormhole flow control, finite input buffering, deterministic X-Y routing
  • Interconnect power estimated from synthesis in Synopsys Design Compiler with a TSMC 90 nm standard-cell library
  • Memory traces generated with the Simics full-system simulator

  27. Workloads
  • Commercial: TPC-W, SPECJBB, static web serving (Apache and Zeus)
  • SPECOMP benchmarks
  • SPLASH-2
  • MediaBench-II

  28. System Configuration

  29. Compressed Packet Length — compression ratio for packets of up to 60%!

  30. Packet Latency — interconnect latency reduction of up to 33%

  31. Network Latency Breakdown
  [Chart: interconnect latency breakdown]

  32. Network Latency Breakdown
  [Chart: interconnect latency breakdown]

  33. Network Latency Breakdown — major reductions in queuing and blocking latency; average reduction of 21%

  34. Network Power — network power reduction of up to 22%

  35. Buffer Utilization
  [Chart: normalized router buffer utilization]

  36. Memory Response Time — memory response time reduction of up to 20%

  37. System Performance — normalized CPI reduction of up to 15%

  38. Scalability Study — all scalability results are for SPECJBB

  39. Conclusions
  • Compression is a simple and effective technique for optimizing the performance and power of OCNs.
  • The CC (NC) scheme provides an average 21% (20%) reduction in network latency, with maximum savings of 33% (32%).
  • Network power consumption is reduced by an average of 7% (23% maximum).
  • Average 7% reduction in CPI over the generic case, by reducing network latency.

  40. Thank you! Questions?

  41. Backup

  42. Basics: Router Architecture
  [Figure: canonical router microarchitecture — per-input-port buffers with three virtual channels (VC 0-2) from East, West, North, South, and the PE; control logic with routing unit (RC), VC allocator (VA), and switch allocator (SA); a 5 × 5 crossbar to the five output ports]

  43. L2 Miss Rates

  44. Decompression Pipeline

  45. Scalability with Flit Width — all scalability results are for SPECJBB

  46. Scalability with System Size — all scalability results are for SPECJBB

  47. Scalability with System Size
  • 16p CMP: the larger network is detrimental for compression …
  • … but more processors mean higher injection rate/load on the network, where compression should help
  Network sizes studied: 8p — 24, 40, 72 nodes; 16p — 32, 48, 80 nodes

  48. Scalability with System Size — CC relative to NC does better with more processors; cache compression helps more as the system scales.
