Self-Tuned Congestion Control for Multiprocessor Networks
Mithuna Thottethodi, Alvin R. Lebeck ({mithuna,alvy}@cs.duke.edu)
Department of Computer Sciences, Duke University, Durham, North Carolina
Shubhendu S. Mukherjee (Shubu.Mukherjee@compaq.com)
VSSAD, Alpha Development Group, Compaq Computer Corporation, Shrewsbury, Massachusetts
Appeared in the 7th International Symposium on High-Performance Computer Architecture (HPCA), Monterrey, Mexico, January 2001
Why Network Saturation? • Tree saturation • Deadlock cycles • New packets block older packets • Backpressure takes 1000s of cycles to propagate back
Why Do We Care? • Computation power per router is increasing • More aggressive speculation • Simultaneous Multithreading • Chip Multiprocessors • “Unstable” behavior makes designers very nervous
So, what’s the solution? • Throttle • stop injecting packets when you hit a “threshold” • “threshold” = % full network buffers • But • Local estimate of threshold insufficient • Saturation point differs for communication patterns • Questions • How do we collect global estimate of % full network buffers? • How do we “tune” the threshold to different patterns?
Outline • Overview • Multiprocessor Network Basics • Deadlocks & virtual channels • Adaptive routing & Duato’s theory • How to collect global estimate of congestion? • How to “tune” the throttle threshold? • Methodology & Results • Summary, Future Work, & Other Projects
A Multiprocessor Network [figure: routers]
Deadlock Avoidance [figure: numbered packets in a deadlocked cycle; the same packets split across red and yellow virtual channels]
Virtual Channels (VC) • One buffer per VC • Logically, red and yellow networks (deadlock-free) [figure: numbered packets in per-VC buffers]
Duato’s Theory • Adaptive network for high performance • deadlock-prone • Deadlock-free network when adaptive network deadlocks • drop down to deadlock-free when router is congested • Implemented with different virtual channels • adaptive virtual channels • deadlock-free virtual channels (escape channels)
Global Estimate of Congestion • % of full buffers in entire network • more & more buffers occupied when network saturates • throttle network when % full buffers cross threshold • Advantages • simple aggregation • empirical observation: works well • Disadvantages • doesn’t detect localized congestion • threshold differs for communication patterns (we solve this)
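The throttling rule on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name `should_throttle` and its parameters are hypothetical, and the comparison is assumed to be strict ("exceeds the threshold").

```python
def should_throttle(full_buffers: int, total_buffers: int, threshold_pct: float) -> bool:
    """Return True when nodes should stop injecting new packets.

    full_buffers  -- globally aggregated count of full network buffers
    total_buffers -- number of buffers in the entire network
    threshold_pct -- throttle threshold, expressed as % of full buffers
    """
    pct_full = 100.0 * full_buffers / total_buffers
    # Assumption: throttle only when the estimate strictly exceeds the threshold.
    return pct_full > threshold_pct


if __name__ == "__main__":
    # 3072 buffers in total (the configuration used later in the deck), threshold 1%.
    print(should_throttle(30, 3072, 1.0))
    print(should_throttle(31, 3072, 1.0))
```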
Gather Global Information • Global Information • % full network buffers in an “interval” • % packets or flits delivered during an “interval” • Constraint • gather time << backpressure buildup time (1000s of cycles) • Mechanisms • piggybacking • meta-packets • side-band signal
Sideband: Dimension-wise Aggregation • Each hop takes h cycles on the sideband • After 2 hops, aggregation in one dimension is done • 2 such phases (Phase 1, Phase 2) • Total gather time = 2 * 2 * h = 4h cycles • For k-ary n-cubes, gather time (g) = n * k * h / 2 • For a 16x16 network with h = 2, g = 2 * 16 * 2 / 2 = 32 cycles
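As a worked check of the gather-time formula, and a toy view of the two aggregation phases on a 2D network, here is a hedged sketch. The helper names `gather_time` and `dimension_wise_sum` are hypothetical, and the aggregation is modeled as plain sums rather than as actual sideband hardware.

```python
def gather_time(n: int, k: int, h: int) -> int:
    """Gather time g = n * k * h / 2 for a k-ary n-cube (assumes n * k * h is even)."""
    return n * k * h // 2


def dimension_wise_sum(counts):
    """Aggregate per-node counts of a 2D network one dimension at a time.

    counts[y][x] is the local value (e.g. number of full buffers) at node (x, y).
    """
    # Phase 1: every node learns the total of its own row.
    row_totals = [sum(row) for row in counts]
    # Phase 2: the row totals are combined along the other dimension.
    return sum(row_totals)


if __name__ == "__main__":
    print(gather_time(n=2, k=16, h=2))           # 16x16 network from the slide
    print(dimension_wise_sum([[1, 2], [3, 4]]))  # toy 2x2 example
```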
Dynamic Detection of Threshold (Hill Climbing)
[figure: throughput vs. % full buffers, with points A, B, C and the threshold marked]
Tuning decision table:
Drop in bandwidth > 25%? | Currently throttling: Yes | Currently throttling: No
No | Increment | No change
Yes | Decrement | Decrement
… we may still creep into saturation (later)
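One tuning step of the hill climbing can be sketched as follows. This is an illustration under stated assumptions: the decision table is read as "decrement on a bandwidth drop of more than 25%; otherwise increment if currently throttling; otherwise no change", the step sizes (1% up, 4% down) are the values listed under Tuning Parameters, and the name `tune_threshold` and the clamping to [0, 100] are hypothetical.

```python
def tune_threshold(threshold, throttling, prev_tput, curr_tput,
                   inc=1.0, dec=4.0, drop_frac=0.25):
    """Return the new throttle threshold (in % of full buffers).

    threshold  -- current threshold, in percent
    throttling -- True if the network was throttled during this tuning period
    prev_tput  -- throughput observed in the previous tuning period
    curr_tput  -- throughput observed in the current tuning period
    """
    # "Drop in bandwidth > 25%": current throughput fell below 75% of the previous one.
    dropped = prev_tput > 0 and curr_tput < (1.0 - drop_frac) * prev_tput
    if dropped:
        # Decrement whether or not we are throttling (clamped at 0, an assumption).
        return max(0.0, threshold - dec)
    if throttling:
        # Throttling with no drop: probe a higher threshold (clamped at 100).
        return min(100.0, threshold + inc)
    # Not throttling and no drop: leave the threshold alone.
    return threshold


if __name__ == "__main__":
    print(tune_threshold(10.0, True, 100, 100))
    print(tune_threshold(10.0, False, 100, 70))
```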
Summary of Approach • Global Knowledge of a Network • Collect % full network buffers and overall throughput • Dimension-wise aggregation, g-cycle snapshots • Aggregation via sideband signals • Dynamically detect throttling threshold • Threshold = % of full network buffers • Self-tuned using hill climbing • Reset if hill climbing fails
Methodology • Flitsim 2.0 Simulator (Pinkston’s group at USC) • warmup for 10k cycles, simulate for 50k cycles • Network architecture • 16x16 two-dimensional torus (16-ary, 2-cube) • Full-duplex links • Packet size = 16 flits • Wormhole routing • Deadlock avoidance (paper has deadlock recovery results) • Router architecture • 3 virtual channels per physical channel • Each virtual channel buffer holds 8 flits • 1 cycle central arbitration, 1 cycle switching
Input Traffic • Packet Generation Frequency • “attempt” to send one packet per packet regeneration interval • Traffic Patterns • Random destination • Perfect Shuffle: a_{n-1} a_{n-2} ... a_1 a_0 → a_{n-2} a_{n-3} ... a_0 a_{n-1} • Butterfly: a_{n-1} a_{n-2} ... a_1 a_0 → a_0 a_{n-2} ... a_1 a_{n-1} • Bit Reversal: a_{n-1} a_{n-2} ... a_1 a_0 → a_0 a_1 ... a_{n-2} a_{n-1}
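The three permutation patterns map a source address to a destination by rearranging its bits. A small sketch, assuming nodes are numbered with n-bit integers (n = 8 would cover the 256 nodes of a 16x16 network); the function names are illustrative only.

```python
def perfect_shuffle(a: int, n: int) -> int:
    """Rotate the n-bit address left by one: the top bit moves to the bottom."""
    mask = (1 << n) - 1
    return ((a << 1) | (a >> (n - 1))) & mask


def butterfly(a: int, n: int) -> int:
    """Swap the most significant and least significant bits."""
    msb = (a >> (n - 1)) & 1
    lsb = a & 1
    if msb == lsb:
        return a  # swapping equal bits changes nothing
    return a ^ ((1 << (n - 1)) | 1)  # flip both end bits


def bit_reversal(a: int, n: int) -> int:
    """Reverse the order of the n address bits."""
    out = 0
    for i in range(n):
        out = (out << 1) | ((a >> i) & 1)
    return out


if __name__ == "__main__":
    src = 0b00000110
    for pattern in (perfect_shuffle, butterfly, bit_reversal):
        print(pattern.__name__, format(pattern(src, 8), "08b"))
```

Each function is a permutation of the address space, so every node is the destination of exactly one source.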
Throttling Algorithms • Base • no throttling • ALO (At Least One) • Lopez, Martinez, and Duato, ICPP, August, 1998 • Throttling based on local estimation of congestion • Inject new packet only if • some “useful” physical channel has all of its virtual channels free, or • at least one virtual channel on every “useful” channel is free • Tune (this work)
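The ALO injection test, as summarized on this slide, can be sketched as a predicate over the "useful" output channels of a node. This is an interpretation of the slide's two conditions rather than the published algorithm's code; the name `alo_may_inject`, the data layout, and the behavior for an empty channel list are assumptions.

```python
def alo_may_inject(useful_channels) -> bool:
    """Decide whether a new packet may be injected under the ALO rule.

    useful_channels -- one list per "useful" physical channel; each inner
                       list holds one bool per virtual channel (True = free).
    """
    if not useful_channels:
        return False  # assumption: no useful channel means no injection
    # Condition 1: some useful physical channel has all virtual channels free.
    some_channel_fully_free = any(all(vcs) for vcs in useful_channels)
    # Condition 2: every useful channel has at least one free virtual channel.
    every_channel_has_free_vc = all(any(vcs) for vcs in useful_channels)
    return some_channel_fully_free or every_channel_has_free_vc


if __name__ == "__main__":
    # Two useful channels with 3 virtual channels each.
    print(alo_may_inject([[True, False, False], [False, True, False]]))
    print(alo_may_inject([[True, False, False], [False, False, False]]))
```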
Tuning Parameters • Total number of network buffers = 256 * 3 * 4 = 3072 • Gather time (g) = n * k * h / 2 = 32 cycles • Sideband communication latency (h) = 2 cycles • Sideband communication bandwidth = 25 bits (!) • # network buffers = 3072 = 12 bits • max throughput = g * 256 * 1 = 8192 = 13 bits • Tuning frequency = once every 96 cycles • Initial threshold value = 1% ~= 30 buffers • Threshold increment = 1%, decrement = 4%
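The arithmetic behind these parameters can be reproduced with a short script. Two readings are assumptions here: the factors in 256 * 3 * 4 are taken as nodes, virtual channels per physical channel, and physical channels per node; and the bit widths are computed as the bits needed to encode values in the range [0, x), which matches the slide's 12 and 13 bits. All names are hypothetical.

```python
NODES = 16 * 16          # 16x16 torus
VCS_PER_CHANNEL = 3      # from the router architecture
CHANNELS_PER_NODE = 4    # assumption: the "4" in 256 * 3 * 4
HOP_LATENCY = 2          # sideband latency h, in cycles


def bits_for_range(x: int) -> int:
    """Bits needed to encode any value in [0, x) (an assumed convention)."""
    return (x - 1).bit_length()


total_buffers = NODES * VCS_PER_CHANNEL * CHANNELS_PER_NODE
gather_cycles = 2 * 16 * HOP_LATENCY // 2          # g = n * k * h / 2
max_throughput = gather_cycles * NODES * 1         # g * 256 * 1
sideband_bits = bits_for_range(total_buffers) + bits_for_range(max_throughput)
initial_threshold_buffers = total_buffers * 1 // 100   # 1% of buffers, rounded down
gathers_per_tuning = 96 // gather_cycles               # tuning once every 96 cycles

if __name__ == "__main__":
    print(total_buffers, gather_cycles, max_throughput)
    print(sideband_bits, initial_threshold_buffers, gathers_per_tuning)
```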
Random Pattern • Beyond saturation point, Tune outperforms ALO and Base
Delayed Collection of Global Knowledge (h = 2, 3, 6 cycles) • Tune fairly insensitive to delayed collection of information
Static Threshold Choice • Optimal thresholds differ for random and butterfly • Tune performs close to the best static threshold
With Bursty Load • Tune outperforms ALO [figure panels: random, bit reversal, shuffle, butterfly]
Avoiding Local Maxima • What if steady decrease in bandwidth < 25%? • potential to “creep” into saturation • Solution: remember global maxima • max = maximum throughput seen in any tuning period • Nmax = number of full buffers at max • Tmax = threshold at max • Reset threshold to min(Tmax, Nmax) if throughput < 50% of max • If “r” consecutive resets don’t fix the problem, then restart • hypothesis: communication pattern has changed
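The reset rule can be sketched as a small state machine that runs once per tuning period. Several details are assumptions, since the slide does not spell them out: Nmax and Tmax are kept in the same unit (% of full buffers) so that min(Tmax, Nmax) is meaningful, "restart" is modeled as forgetting the remembered maximum and returning to the initial threshold, and the default r = 3 is arbitrary. The class name `ResetGuard` is hypothetical.

```python
class ResetGuard:
    """Remembers the best throughput seen and resets the threshold on collapse."""

    def __init__(self, r=3, initial_threshold=1.0):
        self.r = r                                  # resets tolerated before restart
        self.initial_threshold = initial_threshold
        self._forget()

    def _forget(self):
        self.max_tput = 0.0   # max: best throughput in any tuning period
        self.n_max = 0.0      # Nmax: full buffers (in %) when max was seen
        self.t_max = 0.0      # Tmax: threshold (in %) when max was seen
        self.resets = 0       # consecutive resets performed so far

    def step(self, tput, full_pct, threshold):
        """Return the threshold to use for the next tuning period."""
        if tput > self.max_tput:
            # New global maximum: remember where it happened.
            self.max_tput, self.n_max, self.t_max = tput, full_pct, threshold
            self.resets = 0
            return threshold
        if tput < 0.5 * self.max_tput:
            if self.resets >= self.r:
                # r resets did not help: assume the pattern changed and restart.
                self._forget()
                return self.initial_threshold
            self.resets += 1
            return min(self.t_max, self.n_max)
        self.resets = 0
        return threshold


if __name__ == "__main__":
    guard = ResetGuard(r=2)
    print(guard.step(100.0, 18.0, 20.0))  # records the maximum
    print(guard.step(40.0, 60.0, 30.0))   # throughput collapsed: reset
```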
Threshold Reset Necessary • Packet regeneration interval = 10 cycles [figure: curves for “Hill Climbing” and “Hill Climbing + Local Maxima”]
Summary • Network saturation is a severe problem • advent of powerful processors, SMT, and CMPs • “unstable” behavior makes designers nervous • We propose throttling based on global knowledge • aggregate global knowledge (% full buffers, throughput) • throttle when % full buffers exceeds threshold • tune threshold for communication patterns & offered load