Kilo-NOC: A Network-on-Chip Architecture for Scalability and Service Guarantees Boris Grot The University of Texas at Austin
Technology Trends [Figure: transistor count vs. year of introduction, tracing the 4004, 8086, 286, 386, 486, Pentium, Pentium 4, Pentium D, Core i7, and Xeon Nehalem-EX]
Networks-on-Chip (NOCs) • The backbone of highly integrated chips • Transport of memory, operand, and control traffic • Structured, packet-based, multi-hop networks • Increasing importance with greater levels of integration • Major impact on chip performance, energy, and area • TRIPS: 28% performance loss on SPEC 2K in the NOC • Intel Polaris: 28% of chip power consumption in the NOC "Moving data is more expensive [energy-wise] than operating on it" - William Dally, SC '10
On-chip vs Off-chip Interconnects • Topology • Routing • Flow control • Pins • Bandwidth • Power • Area
Future NOC Requirements • 100's to 1000's of network clients • Cores, caches, accelerators, I/O ports, … • Efficient topologies • High performance, small footprint • Intelligent routing • Performance through better load balance • Light-weight flow control • High performance, low buffer requirements • Service guarantees • Cloud computing and real-time apps demand QOS support (publications: HPCA '08, HPCA '09, MICRO '09, plus work under submission)
Outline • Introduction • Service Guarantees in Networks-on-Chip • Motivation • Desiderata, prior work • Preemptive Virtual Clock • Evaluation highlights • Efficient Topologies for On-chip Interconnects • Kilo-NOC: A Network for 1000+ Nodes • Summary and Future Work
Why On-chip Quality-of-Service? • Shared on-chip resources • Memory controllers, accelerators, network-on-chip • … require QOS support • Fairness, service differentiation, performance isolation • End-point QOS solutions are insufficient • Data has to traverse the on-chip network • Need QOS support at the interconnect level: hard guarantees in NOCs
NOC QOS Desiderata • Fairness • Isolation of flows • Bandwidth efficiency • Low overhead: • delay • area • energy
Conventional QOS Disciplines • Fixed schedule • Pros: algorithmic and implementation simplicity • Cons: inefficient BW utilization; per-flow queuing • Example: Round Robin • Rate-based • Pros: fine-grained scheduling; BW efficient • Cons: complex scheduling; per-flow queuing • Example: Weighted Fair Queuing (WFQ) [SIGCOMM '89] • Frame-based • Pros: good throughput at modest complexity • Cons: throughput-complexity trade-off; per-flow queuing • Example: Rotating Combined Queuing (RCQ) [ISCA '96] • Common cost: per-flow queuing, which carries area, energy, and delay overheads plus scheduling complexity
Preemptive Virtual Clock (PVC) [HPCA ‘09] • Goal: high-performance, cost-effective mechanism for fairness and service differentiation in NOCs. • Full QOS support • Fairness, prioritization, performance isolation • Modest area and energy overhead • Minimal buffering in routers & source nodes • High Performance • Low latency, good BW efficiency
PVC: Scheduling • Combines rate-based and frame-based features • Rate-based: evolved from Virtual Clock [SIGCOMM '90] • Routers track each flow's bandwidth consumption • Cheap priority computation: f(provisioned rate, consumed BW) • Problem: the history effect, where a long-idle flow builds up an unbounded priority advantage • Framing: PVC's solution to the history effect • Frame rollover clears all BW counters and resets priorities • Fixed frame duration
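The priority computation and frame rollover can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the class and method names and the frame-duration constant are ours.

```python
# Minimal sketch of PVC-style scheduling state at a router.
# Priority = f(provisioned rate, consumed BW): flows that have consumed
# less bandwidth relative to their provisioned rate win arbitration.

FRAME_DURATION = 10_000  # cycles per frame (hypothetical value)

class PvcRouterState:
    def __init__(self, provisioned_rate):
        self.rate = provisioned_rate                      # flow -> share
        self.consumed = {f: 0 for f in provisioned_rate}  # BW counters

    def on_flit(self, flow_id):
        """Charge one flit of bandwidth to a flow."""
        self.consumed[flow_id] += 1

    def priority(self, flow_id):
        """Cheap priority: smaller value = higher priority."""
        return self.consumed[flow_id] / self.rate[flow_id]

    def frame_rollover(self):
        """Clear all counters, erasing accumulated history."""
        for f in self.consumed:
            self.consumed[f] = 0

# Flow A (rate 2) sends 6 flits, flow B (rate 1) sends 1:
state = PvcRouterState({"A": 2, "B": 1})
for _ in range(6):
    state.on_flit("A")
state.on_flit("B")
assert state.priority("B") < state.priority("A")  # B is now favored
```

Without `frame_rollover`, a flow that idles for a long stretch would keep an arbitrarily large priority advantage; clearing the counters at a fixed frame boundary bounds that history effect.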
PVC: Freedom from Priority Inversion • PVC: simple routers without per-flow buffering or BW reservation • Problem: high-priority packets may be blocked by lower-priority packets (priority inversion) • Solution: preemption of lower-priority packets
PVC: Preemption Recovery • Retransmission of dropped packets • Buffer outstanding packets at the source node • ACK/NACK protocol via a dedicated network • All packets acknowledged • Narrow, low-complexity network • Lower overhead than timeout-based recovery • In a 64-node network, a 30-flit backup buffer per node suffices
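The recovery protocol can be sketched as follows; the interfaces and the buffer-capacity check are illustrative assumptions, not the actual hardware design.

```python
# Sketch of source-side preemption recovery. Each injected packet is
# held in a small backup buffer until an ACK arrives on the dedicated
# ACK/NACK network; a NACK (the packet was preempted) triggers a
# retransmission from the buffered copy.

from collections import OrderedDict

BACKUP_CAPACITY = 30  # flits; the talk's figure for a 64-node network

class SourceNode:
    def __init__(self, inject):
        self.inject = inject              # callback into the main NOC
        self.outstanding = OrderedDict()  # seq -> packet (list of flits)

    def send(self, seq, packet):
        in_use = sum(len(p) for p in self.outstanding.values())
        assert in_use + len(packet) <= BACKUP_CAPACITY, \
            "backup buffer full: throttle injection"
        self.outstanding[seq] = packet    # keep a copy until ACKed
        self.inject(seq, packet)

    def on_ack(self, seq):
        self.outstanding.pop(seq, None)   # delivery confirmed, free the copy

    def on_nack(self, seq):
        self.inject(seq, self.outstanding[seq])  # retransmit preempted packet
```

Because every packet is acknowledged, no timeouts are needed, and the buffer stays small because it only has to cover packets still awaiting acknowledgment.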
PVC: Preemption Throttling • Relaxed definition of priority inversion • Reduces preemption frequency • Small fairness penalty • Per-flow bandwidth reservation • Flits within the reserved quota are non-preemptible • Reserved quota is a function of rate and frame size • Coarsened priority classes • Mask out lower-order bits of each flow’s BW counter • Induces coarser priority classes • Enables a fairness/throughput trade-off
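A tiny sketch of the counter masking (the mask width is a tunable knob, not a value from the talk):

```python
# Coarsened priority classes: drop low-order bits of the BW counter so
# flows with similar consumption compare as equal and do not preempt
# one another.

MASK_BITS = 4  # 0 = exact fairness; larger = fewer preemptions

def priority_class(bw_counter, mask_bits=MASK_BITS):
    return bw_counter >> mask_bits

# Counters 35 and 42 fall into the same class (both >> 4 == 2), so
# neither packet preempts the other:
assert priority_class(35) == priority_class(42) == 2
```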
PVC: Guarantees • Minimum Bandwidth • Based on reserved quota • Fairness • Subject to BW counter resolution • Worst-case Latency • Packet enters source buffer in frame N, guaranteed delivery by the end of frame N+1
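Written out, the latency guarantee gives a simple bound (F, the frame duration in cycles, is notation we introduce here):

```latex
% Worst case: the packet enters its source buffer at the very start of
% frame N and is delivered at the very end of frame N+1, so
T_{\text{worst}} \;\le\; 2F
```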
Performance Isolation • Baseline NOC • No QOS support • Globally Synchronized Frames (GSF) • J. Lee, et al. ISCA 2008 • Frame-based scheme adapted for on-chip implementation • Source nodes enforce bandwidth quotas via self-throttling • Multiple frames in-flight for performance • Network prioritizes packets based on frame number • Preemptive Virtual Clock (PVC) • Highest fairness setting (unmasked bandwidth counters)
PVC Summary • Full QOS support • Fairness & service differentiation • Strong performance isolation • High performance • Simple routers → low latency • Good bandwidth efficiency • Modest area and energy overhead • 3.4 KB of storage per node (1.8× a no-QOS router) • 12-20% extra energy per packet • Will it scale to 1000 nodes?
Outline • Introduction • Service Guarantees in Networks-on-Chip • Efficient Topologies for On-chip Interconnects • Mesh-based networks • Toward low-diameter topologies • Multidrop Express Channels • Kilo-NOC: A Network for 1000+ Nodes • Summary and Future Work
NOC Topologies • Topology is the principal determinant of network performance, cost, and energy efficiency • Topology desiderata • Rich connectivity reduces router traversals • High bandwidth reduces latency and contention • Low router complexity reduces area and delay • On-chip constraints • 2D substrates limit implementable topologies • Logic area/energy constrains use of wire resources • Power constraints restrict routing choices
2-D Mesh • Pros • Low design & layout complexity • Simple, fast routers • Cons • Large diameter • Energy & latency impact
Concentrated Mesh (Balfour & Dally, ICS '06) • Pros • Multiple terminals at each node • Fast nearest-neighbor communication via the crossbar • Hop count reduction proportional to concentration degree (worked example below) • Cons • Benefits limited by crossbar complexity
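A quick worked example of the hop-count benefit (illustrative numbers, using the standard diameter formula for a square mesh):

```latex
% An n-node 2-D mesh has diameter 2(\sqrt{n}-1); concentration degree c
% shrinks the router grid to n/c nodes:
D_{\text{mesh}} = 2\left(\sqrt{n}-1\right), \qquad
D_{\text{cmesh}} = 2\left(\sqrt{n/c}-1\right)
% e.g., n = 64, c = 4: diameter drops from 14 hops to 6 hops.
```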
Flattened Butterfly (Kim et al., Micro ‘07) • Objectives: • Improve connectivity • Exploit the wire budget
Flattened Butterfly • Point-to-point links • Nodes fully connected in each dimension
Flattened Butterfly • Pros • Excellent connectivity • Low diameter: 2 hops • Cons • High channel count: k²/2 per row/column • Low channel utilization • Control complexity
Multidrop Express Channels (MECS) [Grot et al., Micro ‘09] • Objectives: • Connectivity • More scalable channel count • Better channel utilization
Multidrop Express Channels (MECS) • Point-to-multipoint channels • Single source • Multiple destinations • Drop points: • Propagate further -OR- • Exit into a router
Multidrop Express Channels (MECS) • Pros • One-to-many topology • Low diameter: 2 hops • Only k channels per row/column • Cons • I/O asymmetry • Control complexity (channel-count comparison below)
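The channel-count difference is easy to quantify; the function names below are ours, and the exact constants depend on how bidirectional links are tallied:

```python
# Per-row channel counts for k routers in one dimension.

def fbfly_row_channels(k):
    # Flattened butterfly: dedicated point-to-point links between every
    # pair of routers in the row, k(k-1)/2 ~ k^2/2.
    return k * (k - 1) // 2

def mecs_row_channels(k):
    # MECS: one point-to-multipoint channel sourced per router.
    return k

for k in (4, 8, 16):
    print(f"k={k}: fbfly={fbfly_row_channels(k)}, mecs={mecs_row_channels(k)}")
# k=8: the flattened butterfly needs 28 row links where MECS needs 8.
```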
MECS Summary • MECS: a novel one-to-many topology • Excellent connectivity • Effective wire utilization • Good fit for planar substrates • Results summary • MECS: lowest latency, high energy efficiency • Mesh-based topologies: best throughput • Flattened butterfly: smallest router area
Outline • Introduction • Service Guarantees in Networks-on-Chip • Efficient Topologies for On-chip Interconnects • Kilo-NOC: A Network for 1000+ Nodes • Requirements and obstacles • Topology-centric Kilo-NOC architecture • Evaluation highlights • Summary and Future Work
Scaling to a kilo-node NOC • Goal: a NOC architecture that scales to 1000+ clients with good efficiency and strong guarantees • MECS scalability obstacles • Buffer requirements (more ports, deeper buffers) → area, energy, latency overheads • PVC scalability obstacles • Flow state and other storage → area, energy overheads • Preemptions → energy, latency overheads • Prioritization and arbitration → latency overheads • Kilo-NOC addresses the topology and QOS scalability bottlenecks; this talk: reducing QOS overheads
NOC QOS: Conventional Approach • Multiple virtual machines (VMs) sharing a die • Shared resources (e.g., memory controllers) • VM-private resources (cores, caches)
NOC QOS: Conventional Approach • NOC contention scenarios: • Shared-resource accesses (e.g., memory access) • Intra-VM traffic (e.g., shared cache access) • Inter-VM traffic (e.g., VM page sharing)
Network-wide guarantees without network-wide QOS support
Kilo-NOC QOS: Topology-centric Approach • Dedicated, QOS-enabled regions • Rest of die: QOS-free • A richly-connected topology (MECS) • Traffic isolation • Special routing rules • Ensure interference freedom (routing sketch below)
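A sketch of the routing rule (the region layout, coordinates, and function names are hypothetical; the actual rules are richer):

```python
# Topology-centric QOS routing: intra-VM traffic stays on QOS-free
# paths inside its own region; any traffic that could interfere across
# VMs is detoured through the QOS-enabled region, where PVC arbitrates.

QOS_REGION = {(6, y) for y in range(8)}  # hypothetical column of QOS routers

def route_waypoints(src, dst, vm_of):
    """Return the intermediate stops for a packet from src to dst."""
    if vm_of(src) == vm_of(dst) and dst not in QOS_REGION:
        return [dst]  # intra-VM: direct, QOS-free path
    # Shared-resource or inter-VM traffic: go via the nearest QOS router
    # so all cross-VM interference happens under QOS arbitration.
    nearest = min(QOS_REGION,
                  key=lambda n: abs(n[0] - src[0]) + abs(n[1] - src[1]))
    return [nearest, dst]

# Same-VM traffic routes directly; cross-VM traffic detours:
vm = lambda node: 0 if node[0] < 3 else 1
print(route_waypoints((0, 0), (2, 2), vm))  # [(2, 2)]
print(route_waypoints((0, 0), (5, 5), vm))  # [(6, 0), (5, 5)]
```

The low-diameter MECS topology is what makes the detour cheap: any node reaches the QOS region in a couple of hops, so the QOS-free majority of the die sheds per-flow state while guarantees still hold end to end.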