Kilo-NOC: A Network-on-Chip Architecture for Scalability and Service Guarantees Boris Grot The University of Texas at Austin
Technology Trends [Figure: transistor count vs. year of introduction, tracing the 4004, 8086, 286, 386, 486, Pentium, Pentium 4, Pentium D, Core i7, and Xeon Nehalem-EX]
Networks-on-Chip (NOCs) • The backbone of highly integrated chips • Transport of memory, operand, and control traffic • Structured, packet-based, multi-hop networks • Increasing importance with greater levels of integration • Major impact on chip performance, energy, and area • TRIPS: 28% performance loss on SPEC 2K in the NOC • Intel Polaris: 28% of chip power consumption in the NOC "Moving data is more expensive [energy-wise] than operating on it" - William Dally, SC '10
On-chip vs Off-chip Interconnects • Topology • Routing • Flow control • Pins • Bandwidth • Power • Area
Future NOC Requirements • 100's to 1000's of network clients • Cores, caches, accelerators, I/O ports, … • Efficient topologies • High performance, small footprint • Intelligent routing • Performance through better load balance • Light-weight flow control • High performance, low buffer requirements • Service guarantees • Cloud computing and real-time apps demand QOS support (publications: HPCA '08, HPCA '09, MICRO '09, plus work under submission)
Outline • Introduction • Service Guarantees in Networks-on-Chip • Motivation • Desiderata, prior work • Preemptive Virtual Clock • Evaluation highlights • Efficient Topologies for On-chip Interconnects • Kilo-NOC: A Network for 1000+ Nodes • Summary and Future Work
Why On-chip Quality-of-Service? • Shared on-chip resources • Memory controllers, accelerators, network-on-chip • … require QOS support • Fairness, service differentiation, performance isolation • End-point QOS solutions are insufficient • Data has to traverse the on-chip network • Need QOS support at the interconnect level: hard guarantees in NOCs
NOC QOS Desiderata • Fairness • Isolation of flows • Bandwidth efficiency • Low overhead: • delay • area • energy
Conventional QOS Disciplines • Fixed schedule • Pros: algorithmic and implementation simplicity • Cons: inefficient BW utilization; per-flow queuing • Example: Round Robin • Rate-based • Pros: fine-grained scheduling; BW efficient • Cons: complex scheduling; per-flow queuing • Example: Weighted Fair Queuing (WFQ) [SIGCOMM '89] • Frame-based • Pros: good throughput at modest complexity • Cons: throughput-complexity trade-off; per-flow queuing • Example: Rotating Combined Queuing (RCQ) [ISCA '96] • Common cost: per-flow queuing, which carries area, energy, and delay overheads plus scheduling complexity
Preemptive Virtual Clock (PVC) [HPCA ‘09] • Goal: high-performance, cost-effective mechanism for fairness and service differentiation in NOCs. • Full QOS support • Fairness, prioritization, performance isolation • Modest area and energy overhead • Minimal buffering in routers & source nodes • High Performance • Low latency, good BW efficiency
PVC: Scheduling • Combines rate-based and frame-based features • Rate-based: evolved from Virtual Clock [SIGCOMM '90] • Routers track each flow's bandwidth consumption • Cheap priority computation: f(provisioned rate, consumed BW) • Problem: the history effect, where a long-idle flow builds up an unbounded priority advantage • Framing: PVC's solution to the history effect • Frame rollover clears all BW counters and resets priorities • Fixed frame duration
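The priority computation and frame rollover can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the class and method names and the frame-duration constant are ours.

```python
# Minimal sketch of PVC-style scheduling state at a router.
# Priority = f(provisioned rate, consumed BW): flows that have consumed
# less bandwidth relative to their provisioned rate win arbitration.

FRAME_DURATION = 10_000  # cycles per frame (hypothetical value)

class PvcRouterState:
    def __init__(self, provisioned_rate):
        self.rate = provisioned_rate                      # flow -> share
        self.consumed = {f: 0 for f in provisioned_rate}  # BW counters

    def on_flit(self, flow_id):
        """Charge one flit of bandwidth to a flow."""
        self.consumed[flow_id] += 1

    def priority(self, flow_id):
        """Cheap priority: smaller value = higher priority."""
        return self.consumed[flow_id] / self.rate[flow_id]

    def frame_rollover(self):
        """Clear all counters, erasing accumulated history."""
        for f in self.consumed:
            self.consumed[f] = 0

# Flow A (rate 2) sends 6 flits, flow B (rate 1) sends 1:
state = PvcRouterState({"A": 2, "B": 1})
for _ in range(6):
    state.on_flit("A")
state.on_flit("B")
assert state.priority("B") < state.priority("A")  # B is now favored
```

Without `frame_rollover`, a flow that idles for a long stretch would keep an arbitrarily large priority advantage; clearing the counters at a fixed frame boundary bounds that history effect.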
PVC: Freedom from Priority Inversion • PVC: simple routers without per-flow buffering or BW reservation • Problem: high-priority packets may be blocked by lower-priority packets (priority inversion) • Solution: preemption of lower-priority packets
PVC: Preemption Recovery • Retransmission of dropped packets • Buffer outstanding packets at the source node • ACK/NACK protocol via a dedicated network • All packets acknowledged • Narrow, low-complexity network • Lower overhead than timeout-based recovery • In a 64-node network, a 30-flit backup buffer per node suffices
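The recovery protocol can be sketched as follows; the interfaces and the buffer-capacity check are illustrative assumptions, not the actual hardware design.

```python
# Sketch of source-side preemption recovery. Each injected packet is
# held in a small backup buffer until an ACK arrives on the dedicated
# ACK/NACK network; a NACK (the packet was preempted) triggers a
# retransmission from the buffered copy.

from collections import OrderedDict

BACKUP_CAPACITY = 30  # flits; the talk's figure for a 64-node network

class SourceNode:
    def __init__(self, inject):
        self.inject = inject              # callback into the main NOC
        self.outstanding = OrderedDict()  # seq -> packet (list of flits)

    def send(self, seq, packet):
        in_use = sum(len(p) for p in self.outstanding.values())
        assert in_use + len(packet) <= BACKUP_CAPACITY, \
            "backup buffer full: throttle injection"
        self.outstanding[seq] = packet    # keep a copy until ACKed
        self.inject(seq, packet)

    def on_ack(self, seq):
        self.outstanding.pop(seq, None)   # delivery confirmed, free the copy

    def on_nack(self, seq):
        self.inject(seq, self.outstanding[seq])  # retransmit preempted packet
```

Because every packet is acknowledged, no timeouts are needed, and the buffer stays small because it only has to cover packets still awaiting acknowledgment.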
PVC: Preemption Throttling • Relaxed definition of priority inversion • Reduces preemption frequency • Small fairness penalty • Per-flow bandwidth reservation • Flits within the reserved quota are non-preemptible • Reserved quota is a function of rate and frame size • Coarsened priority classes • Mask out lower-order bits of each flow’s BW counter • Induces coarser priority classes • Enables a fairness/throughput trade-off
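A tiny sketch of the counter masking (the mask width is a tunable knob, not a value from the talk):

```python
# Coarsened priority classes: drop low-order bits of the BW counter so
# flows with similar consumption compare as equal and do not preempt
# one another.

MASK_BITS = 4  # 0 = exact fairness; larger = fewer preemptions

def priority_class(bw_counter, mask_bits=MASK_BITS):
    return bw_counter >> mask_bits

# Counters 35 and 42 fall into the same class (both >> 4 == 2), so
# neither packet preempts the other:
assert priority_class(35) == priority_class(42) == 2
```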
PVC: Guarantees • Minimum Bandwidth • Based on reserved quota • Fairness • Subject to BW counter resolution • Worst-case Latency • Packet enters source buffer in frame N, guaranteed delivery by the end of frame N+1
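Written out, the latency guarantee gives a simple bound (F, the frame duration in cycles, is notation we introduce here):

```latex
% Worst case: the packet enters its source buffer at the very start of
% frame N and is delivered at the very end of frame N+1, so
T_{\text{worst}} \;\le\; 2F
```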
Performance Isolation • Baseline NOC • No QOS support • Globally Synchronized Frames (GSF) • J. Lee, et al. ISCA 2008 • Frame-based scheme adapted for on-chip implementation • Source nodes enforce bandwidth quotas via self-throttling • Multiple frames in-flight for performance • Network prioritizes packets based on frame number • Preemptive Virtual Clock (PVC) • Highest fairness setting (unmasked bandwidth counters)
PVC Summary • Full QOS support • Fairness & service differentiation • Strong performance isolation • High performance • Simple routers → low latency • Good bandwidth efficiency • Modest area and energy overhead • 3.4 KB of storage per node (1.8× a no-QOS router) • 12-20% extra energy per packet • Will it scale to 1000 nodes?
Outline • Introduction • Service Guarantees in Networks-on-Chip • Efficient Topologies for On-chip Interconnects • Mesh-based networks • Toward low-diameter topologies • Multidrop Express Channels • Kilo-NOC: A Network for 1000+ Nodes • Summary and Future Work
NOC Topologies • Topology is the principal determinant of network performance, cost, and energy efficiency • Topology desiderata • Rich connectivity reduces router traversals • High bandwidth reduces latency and contention • Low router complexity reduces area and delay • On-chip constraints • 2D substrates limit implementable topologies • Logic area/energy constrains use of wire resources • Power constraints restrict routing choices
2-D Mesh • Pros • Low design & layout complexity • Simple, fast routers • Cons • Large diameter • Energy & latency impact
Concentrated Mesh (Balfour & Dally, ICS '06) • Pros • Multiple terminals at each node • Fast nearest-neighbor communication via the crossbar • Hop count reduction proportional to concentration degree (worked example below) • Cons • Benefits limited by crossbar complexity
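A quick worked example of the hop-count benefit (illustrative numbers, using the standard diameter formula for a square mesh):

```latex
% An n-node 2-D mesh has diameter 2(\sqrt{n}-1); concentration degree c
% shrinks the router grid to n/c nodes:
D_{\text{mesh}} = 2\left(\sqrt{n}-1\right), \qquad
D_{\text{cmesh}} = 2\left(\sqrt{n/c}-1\right)
% e.g., n = 64, c = 4: diameter drops from 14 hops to 6 hops.
```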
Flattened Butterfly (Kim et al., Micro ‘07) • Objectives: • Improve connectivity • Exploit the wire budget
Flattened Butterfly • Point-to-point links • Nodes fully connected in each dimension
Flattened Butterfly • Pros • Excellent connectivity • Low diameter: 2 hops • Cons • High channel count: k²/2 per row/column • Low channel utilization • Control complexity
Multidrop Express Channels (MECS) [Grot et al., Micro ‘09] • Objectives: • Connectivity • More scalable channel count • Better channel utilization
Multidrop Express Channels (MECS) • Point-to-multipoint channels • Single source • Multiple destinations • Drop points: • Propagate further -OR- • Exit into a router
Multidrop Express Channels (MECS) • Pros • One-to-many topology • Low diameter: 2 hops • Only k channels per row/column • Cons • I/O asymmetry • Control complexity (channel-count comparison below)
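The channel-count difference is easy to quantify; the function names below are ours, and the exact constants depend on how bidirectional links are tallied:

```python
# Per-row channel counts for k routers in one dimension.

def fbfly_row_channels(k):
    # Flattened butterfly: dedicated point-to-point links between every
    # pair of routers in the row, k(k-1)/2 ~ k^2/2.
    return k * (k - 1) // 2

def mecs_row_channels(k):
    # MECS: one point-to-multipoint channel sourced per router.
    return k

for k in (4, 8, 16):
    print(f"k={k}: fbfly={fbfly_row_channels(k)}, mecs={mecs_row_channels(k)}")
# k=8: the flattened butterfly needs 28 row links where MECS needs 8.
```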
MECS Summary • MECS: a novel one-to-many topology • Excellent connectivity • Effective wire utilization • Good fit for planar substrates • Results summary • MECS: lowest latency, high energy efficiency • Mesh-based topologies: best throughput • Flattened butterfly: smallest router area
Outline • Introduction • Service Guarantees in Networks-on-Chip • Efficient Topologies for On-chip Interconnects • Kilo-NOC: A Network for 1000+ Nodes • Requirements and obstacles • Topology-centric Kilo-NOC architecture • Evaluation highlights • Summary and Future Work
Scaling to a kilo-node NOC • Goal: a NOC architecture that scales to 1000+ clients with good efficiency and strong guarantees • MECS scalability obstacles • Buffer requirements (more ports, deeper buffers) → area, energy, latency overheads • PVC scalability obstacles • Flow state and other storage → area, energy overheads • Preemptions → energy, latency overheads • Prioritization and arbitration → latency overheads • Kilo-NOC addresses the topology and QOS scalability bottlenecks; this talk: reducing QOS overheads
NOC QOS: Conventional Approach • Multiple virtual machines (VMs) sharing a die • Shared resources (e.g., memory controllers) • VM-private resources (cores, caches)
NOC QOS: Conventional Approach • NOC contention scenarios: • Shared-resource accesses (e.g., memory access) • Intra-VM traffic (e.g., shared cache access) • Inter-VM traffic (e.g., VM page sharing)
Network-wide guarantees without network-wide QOS support
Kilo-NOC QOS: Topology-centric Approach • Dedicated, QOS-enabled regions • Rest of die: QOS-free • A richly-connected topology (MECS) • Traffic isolation • Special routing rules • Ensure interference freedom (routing sketch below)
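A sketch of the routing rule (the region layout, coordinates, and function names are hypothetical; the actual rules are richer):

```python
# Topology-centric QOS routing: intra-VM traffic stays on QOS-free
# paths inside its own region; any traffic that could interfere across
# VMs is detoured through the QOS-enabled region, where PVC arbitrates.

QOS_REGION = {(6, y) for y in range(8)}  # hypothetical column of QOS routers

def route_waypoints(src, dst, vm_of):
    """Return the intermediate stops for a packet from src to dst."""
    if vm_of(src) == vm_of(dst) and dst not in QOS_REGION:
        return [dst]  # intra-VM: direct, QOS-free path
    # Shared-resource or inter-VM traffic: go via the nearest QOS router
    # so all cross-VM interference happens under QOS arbitration.
    nearest = min(QOS_REGION,
                  key=lambda n: abs(n[0] - src[0]) + abs(n[1] - src[1]))
    return [nearest, dst]

# Same-VM traffic routes directly; cross-VM traffic detours:
vm = lambda node: 0 if node[0] < 3 else 1
print(route_waypoints((0, 0), (2, 2), vm))  # [(2, 2)]
print(route_waypoints((0, 0), (5, 5), vm))  # [(6, 0), (5, 5)]
```

The low-diameter MECS topology is what makes the detour cheap: any node reaches the QOS region in a couple of hops, so the QOS-free majority of the die sheds per-flow state while guarantees still hold end to end.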