
The Effect of Interconnect Design on the Performance of Large L2 Caches



  1. The Effect of Interconnect Design on the Performance of Large L2 Caches • Naveen Muralimanohar, Rajeev Balasubramonian • University of Utah

  2. Motivation: Large Caches • Future processors will have large on-chip caches • Intel Montecito has a 24MB on-chip cache • Wire delay dominates in large caches • A conventional design can lead to very high hit times (the CACTI access time for a 24MB cache is 90 cycles at 5GHz, 65nm technology) • Careful network choices • Improve access time • Open room for several other optimizations • Reduce power significantly

  3. Effect of L2 Hit Time (Figure: performance improvement on an 8-issue, out-of-order processor when L2 hit time is reduced from 30 to 15 cycles; average improvement = 17%)

  4. Cache Design (Figure: cache organization — the input address drives the decoder; wordlines and bitlines span the tag and data arrays; column muxes, sense amps, comparators, mux drivers, and output drivers produce the data output and valid output)

  5. Existing Model - CACTI • Access time comprises decoder delay plus wordline & bitline delay (Figures: cache models with 4 sub-arrays and with 16 sub-arrays)

  6. Shortcomings • CACTI • Suboptimal for large cache sizes • Access delay equals the delay of the slowest sub-array • Very high hit time for large caches • Employs a separate bus for each cache bank in multi-banked caches

  7. Non-Uniform Cache Access (NUCA) • The large cache is broken into a number of small banks • Employs an on-chip network for communication • Access delay ∝ distance between the bank and the cache controller (Figure: CPU & L1 surrounded by an array of cache banks)
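The NUCA latency model above can be sketched with a toy calculation, assuming a grid of banks reached from a controller in one corner; the one-cycle-per-hop link matches the Kim et al. design on the next slide, but the bank access time and grid coordinates are illustrative, not from the talk:

```python
# Hypothetical sketch: NUCA access latency grows with the Manhattan
# distance (in router hops) from the cache controller to the bank.
# Constants are illustrative, not measured values from the slides.

HOP_LATENCY = 1      # cycles per link (one cycle per hop)
BANK_ACCESS = 3      # cycles to access one small bank (assumed)

def nuca_latency(bank_row: int, bank_col: int) -> int:
    """Cycles to access the bank at (row, col) from a controller at (0, 0)."""
    hops = bank_row + bank_col            # Manhattan distance on the grid
    return hops * HOP_LATENCY + BANK_ACCESS

# The nearest bank is fast; the farthest corner is much slower:
print(nuca_latency(0, 0))   # 3 cycles
print(nuca_latency(7, 7))   # 17 cycles
```

This is the source of non-uniformity: the same cache serves some requests in a few cycles and others only after many hops.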

  8. Shortcomings • NUCA • Banks are sized such that the link latency is one cycle (Kim et al., ASPLOS '02) • Increased routing complexity • Dissipates more power

  9. Extension to CACTI • On-chip network • Wires are modeled using ITRS 2005 parameters • Grid network • No. of rows = no. of columns (or half the no. of columns) • Network latency vs. bank access latency trade-off • Modified the exhaustive search to include the network overhead
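The modified search can be illustrated with a toy model: for each candidate bank count, total access time is the network delay to the farthest bank plus the access delay of one (now smaller) bank. The bank-delay formula, hop latency, and constants below are invented for the sketch, not CACTI's actual equations:

```python
# Toy version of the exhaustive search with network overhead included.
# All delay models and constants here are assumptions for illustration.
import math

def bank_access_delay(bank_size_mb: float) -> float:
    # Assumed model: bank delay grows with the square root of its size,
    # since wordline/bitline lengths scale with the array's side.
    return 2.0 + 4.0 * math.sqrt(bank_size_mb)

def total_access_time(cache_mb: float, n_banks: int) -> float:
    side = math.isqrt(n_banks)            # square grid of banks
    worst_hops = 2 * (side - 1)           # farthest corner of the grid
    hop_latency = 1.0                     # cycles per link (assumed)
    return worst_hops * hop_latency + bank_access_delay(cache_mb / n_banks)

# Sweep square bank counts for a 32MB cache and keep the best:
best = min((total_access_time(32, n * n), n * n) for n in range(1, 9))
print(best)   # (delay-optimal access time, bank count)
```

Few large banks suffer slow sub-arrays; many tiny banks suffer network hops; the delay-optimal point (slide 10) sits in between.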

  10. Effect of Network Delay (32MB cache) (Figure: total access time vs. bank count, marking the delay-optimal point)

  11. Outline • Overview • Cache Design • Effect of Network Delay • Wire Design Space • Exploiting Heterogeneous Wires • Results University of Utah

  12. Wire Characteristics • Wire resistance and capacitance per unit length

  13. Design Space Exploration • Tuning wire width and spacing • Base case: B-wires • L-wires: fast but low bandwidth • Increasing width & spacing decreases delay but also decreases bandwidth
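The width/spacing trade-off can be sketched with a toy RC model (the constants are invented; real wire models use the ITRS parameters mentioned earlier): wider wires have lower resistance and wider spacing lowers coupling capacitance, so delay falls, but fewer wires fit in the metal plane, so bandwidth falls too.

```python
# Toy RC model of the wire width/spacing trade-off.
# Constants and functional forms are illustrative assumptions.

def wire_delay(width: float, spacing: float) -> float:
    r = 1.0 / width                 # resistance per mm falls with width
    c = 0.5 + 0.5 / spacing        # ground + coupling capacitance per mm
    return r * c                    # delay ~ RC (toy units)

def wires_per_plane(plane_width: float, width: float, spacing: float) -> int:
    # Bandwidth ~ how many wires fit across a fixed metal plane.
    return int(plane_width // (width + spacing))

# Doubling both width and spacing: faster wires, half the bandwidth.
print(wire_delay(1.0, 1.0), wires_per_plane(64, 1.0, 1.0))   # 1.0, 32
print(wire_delay(2.0, 2.0), wires_per_plane(64, 2.0, 2.0))   # 0.375, 16
```

This is the L-wire recipe: spend area (track pitch) to buy latency.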

  14. Design Space Exploration • Tuning repeater size and spacing • Traditional wires: large repeaters at the delay-optimal spacing • Power-optimal wires: smaller repeaters with increased spacing (Figure: power vs. delay trade-off)
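The repeater trade-off can be sketched numerically: shrinking the repeaters and spreading them out raises delay but cuts power. The functional forms and constants below are invented for the sketch, not the actual repeater-insertion equations:

```python
# Toy model of repeater sizing/spacing, relative to the delay-optimal
# design (size_scale = spacing_scale = 1.0). All formulas are assumed.

def repeated_wire(size_scale: float, spacing_scale: float) -> tuple[float, float]:
    """Return (relative delay, relative power) for a repeated wire."""
    # Delay rises as repeaters shrink (weaker drive) and move apart.
    delay = 0.5 * (1.0 / size_scale + spacing_scale)
    # Dynamic + leakage power falls with repeater size and count.
    power = size_scale / spacing_scale
    return delay, power

print(repeated_wire(1.0, 1.0))   # baseline: (1.0, 1.0)
print(repeated_wire(0.5, 2.0))   # smaller, sparser: slower but low power
```

This captures the PW-wire point on the next slide: a few multiples of latency bought back as a large power reduction.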

  15. Design Space Exploration
  • B-wires (base case, 8x plane): latency 1x, power 1x, area 1x
  • W-wires (base case, 4x plane): latency 1.6x, power 0.9x, area 0.5x
  • PW-wires (power-optimized, 4x plane): latency 3.2x, power 0.3x, area 0.5x
  • L-wires (fast, low bandwidth, 8x plane): latency 0.5x, power 0.5x, area 5x

  16. Access Time for Different Link Types (Figure: access time comparison across link types)

  17. Outline • Overview • Cache Design • Effect of Network Delay • Wire Design Space • Exploiting Heterogeneous Wires • Results University of Utah

  18. Cache Look-Up • Total cache access time comprises: • Network delay (requires 6-8 bits to identify the cache bank) • Bank access: decoder, wordline, bitline delay (requires 10-15 bits of the address) • Comparator and output driver delay (requires the remaining address bits for the tag match) • The entire access happens sequentially

  19. Early Look-Up • Send the partial address on L-wires • Initiate the bank lookup • Wait for the complete address • Complete the access (Figure: early lookup uses 10-15 bits of the address, followed by the tag match) • We can hide 60-70% of the bank access delay
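The benefit of overlapping the network and bank delays can be sketched with made-up cycle counts (the numbers below are illustrative; the slides only claim that 60-70% of the bank access delay can be hidden):

```python
# Sketch of sequential vs. early lookup timing. In the baseline, the
# bank access starts only after the full address crosses the network;
# with early lookup, the index bits race ahead on fast L-wires and the
# row access overlaps the remaining network delay. Cycle counts assumed.

NET_B = 10       # cycles for the full address on baseline B-wires
NET_L = 5        # cycles for the partial address on fast L-wires
ROW_ACCESS = 6   # decoder + wordline + bitline delay
TAG_MATCH = 2    # comparator + output driver delay

def baseline() -> int:
    return NET_B + ROW_ACCESS + TAG_MATCH        # fully sequential

def early_lookup() -> int:
    # The row access starts when the L-wire bits arrive; the tag match
    # still waits for the full address on the slower wires.
    row_done = NET_L + ROW_ACCESS
    return max(row_done, NET_B) + TAG_MATCH

print(baseline(), early_lookup())   # 18 vs 13 cycles
```

With these numbers, 5 of the 6 row-access cycles are hidden under the network transfer.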

  20. Aggressive Look-Up • Send partial address bits on L-wires • Do the early look-up and a partial tag match • Aggressively send all matching blocks back • Network delay is reduced • The full tag match happens at the cache controller (Figure: aggressive lookup requires an additional 8 bits of the address for the partial tag match)
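The partial tag match can be sketched as follows; the 8-bit partial tag width comes from the slides, but the tag values and helper function are hypothetical:

```python
# Sketch of the partial tag match: the bank compares only the low-order
# tag bits that arrived early on L-wires, ships back every way that
# matches, and the controller performs the full tag match. The set
# contents below are made up for illustration.

PARTIAL_BITS = 8
MASK = (1 << PARTIAL_BITS) - 1

def partial_matches(stored_tags: list[int], partial_tag: int) -> list[int]:
    """Ways whose low-order tag bits match the early partial tag."""
    return [way for way, tag in enumerate(stored_tags)
            if (tag & MASK) == (partial_tag & MASK)]

tags = [0x1A2B, 0x3C2B, 0x5E0F, 0x7A2B]      # 4-way set, made-up tags
ways = partial_matches(tags, 0x2B)           # candidate ways: 0, 1, 3
full_hit = [w for w in ways if tags[w] == 0x3C2B]   # controller's final match
print(ways, full_hit)
```

Ways 0 and 3 are false matches shipped back needlessly; the slides report this extra traffic stays below 1%.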

  21. Aggressive Look-Up • Significant reduction in network delay (for the address transfer) • Increase in traffic due to false matches < 1% • Marginal increase in link overhead • An additional 8 bits of L-wires compared to early lookup • Adds complexity to the cache controller • Needs logic to perform the tag match

  22. Outline • Overview • Cache Design • Effect of Network Delay • Wire Design Space • Exploiting Heterogeneous Wires • Results University of Utah

  23. Experimental Setup • SimpleScalar with contention modeled in detail • Single-core, 8-issue, out-of-order processor • 32MB, 8-way set-associative, on-chip L2 cache (S-NUCA organization) • 32KB I-cache and 32KB D-cache with a hit latency of 3 cycles • Main memory latency: 300 cycles

  24. Cache Models (Table: cache models evaluated)

  25. Performance Results (Global Wires)
  • Model 2 (CACTI-L2): average improvement 11%; L2-latency-sensitive benchmarks 16.3%
  • Model 3 (Early Lookup): average improvement 14.4%; L2-latency-sensitive benchmarks 21.6%
  • Model 4 (Aggressive Lookup): average improvement 17.6%; L2-latency-sensitive benchmarks 26.6%
  • Model 6 (L-Network): average improvement 11.4%; L2-latency-sensitive benchmarks 16.2%

  26. Performance Results (4x Wires) • Wire-delay-constrained model • Performance improvements are better • Early lookup performs 5% better • The aggressive model performs 28% better

  27. Future Work • Heterogeneous network in a CMP environment • Hybrid network: a combination of point-to-point links and buses for L-messages • Effective use of L-wires • Latency/bandwidth trade-off • Use of heterogeneous wires in a D-NUCA environment • Cache design focusing on power • Prefetching (power-optimized wires) • Writeback (power-optimized wires)

  28. Conclusion • Traditional design approaches for large caches are sub-optimal • Network parameters play a significant role in the performance of large caches • A modified CACTI model that includes the network overhead performs 16.3% better than previous models • A heterogeneous network has the potential to further improve performance • Early lookup: 21.6% • Aggressive lookup: 26.6%
