Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Barroso, Gharachorloo, McNamara, et al.
Proceedings of the 27th Annual ISCA, June 2000
Presented by Garver Moore, ECE259 Spring 2006, Professor Daniel Sorin
Motivation
• Economic: high demand for OLTP machines
• Disconnect between the industry's ILP focus and this demand
• OLTP workloads: high memory latency, little ILP (get, process, store), abundant TLP
• OLTP is therefore poorly served by aggressive ILP machines
• Use simple "old" cores and an ASIC design methodology to build "glueless," scalable OLTP machines with low development cost and short time to market
• Amdahl's Law: speeding up the small ILP-friendly fraction buys little overall (worked example below)
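A quick worked instance of Amdahl's Law (illustrative numbers, not from the paper) shows why aggressive ILP cores pay off poorly on OLTP:

```latex
% Amdahl's Law: overall speedup when a fraction f of execution is
% accelerated by a factor s. The numbers below are illustrative only.
\[
  S_{\text{overall}} \;=\; \frac{1}{(1 - f) + \dfrac{f}{s}}
\]
% If only f = 0.2 of an OLTP run is ILP-friendly and an aggressive
% core speeds that part up by s = 4:
\[
  S_{\text{overall}} \;=\; \frac{1}{0.8 + 0.2/4} \;=\; \frac{1}{0.85} \;\approx\; 1.18
\]
% A 4x-better ILP core yields only ~18% overall gain, which argues
% for many simple cores that attack TLP instead.
```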
The Piranha Processing Node*
• CPU: single-issue, in-order, 8-stage pipelined Alpha core (simple enough to be ECE152 work)
• 180 nm process (2000); almost entirely ASIC design
• Roughly 50% of the clock speed at 200% of the area versus a full-custom methodology
• Separate I and D L1 caches for each CPU
• Logically shared, interleaved L2 cache
• Eight memory controllers, each interfacing to a bank of up to 32 Rambus DRAM chips; aggregate peak bandwidth of 12.8 GB/s (arithmetic below)
*Directly from Barroso et al.
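The 12.8 GB/s total is the paper's figure; the per-channel rate below (1.6 GB/s, i.e. a 16-bit Rambus channel at 800 MT/s) is our own back-of-envelope assumption that makes the total come out:

```latex
% Eight memory controllers, one Rambus channel each
% (assumed 1.6 GB/s per channel):
\[
  8 \;\times\; 1.6\,\mathrm{GB/s} \;=\; 12.8\,\mathrm{GB/s}
\]
```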
Communication Assist
• Home Engine and Remote Engine support shared memory across multiple nodes
• System Control handles system miscellany: interrupts, exceptions, initialization, monitoring, etc.
• OQ (output queue), Router, IQ (input queue), and Switch are standard components
• Total inter-node I/O bandwidth: 32 GB/s
• Each link and block here corresponds to actual wiring and a module, which allows rapid parallel development and a semi-custom design methodology; it also facilitates multiple clock domains
• NOTE: the processing node has NO inherent I/O capability (home/remote dispatch sketch below)
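A minimal sketch (assumed names and address layout, not the paper's actual logic) of the division of labor the slide describes: requests whose directory lives on this node go to the Home Engine, everything else to the Remote Engine:

```c
/* Hypothetical communication-assist dispatch. The node-ID field
 * position in the physical address is an assumption for illustration. */
#include <stdio.h>
#include <stdint.h>

#define NODE_SHIFT 33u    /* assumed: node ID above the 8 GB boundary */
#define NODE_MASK  0x3FFu /* room for the 1024 nodes the slides cite  */

static unsigned home_node(uint64_t paddr) {
    return (unsigned)((paddr >> NODE_SHIFT) & NODE_MASK);
}

static void ca_dispatch(uint64_t paddr, unsigned my_node) {
    if (home_node(paddr) == my_node)
        printf("0x%llx -> Home Engine (directory is local)\n",
               (unsigned long long)paddr);
    else
        printf("0x%llx -> Remote Engine (home is node %u)\n",
               (unsigned long long)paddr, home_node(paddr));
}

int main(void) {
    ca_dispatch(((uint64_t)3 << NODE_SHIFT) | 0x1000, 3); /* local home  */
    ca_dispatch(((uint64_t)7 << NODE_SHIFT) | 0x2000, 3); /* remote home */
    return 0;
}
```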
I/O Organization
• Smaller than the processing node
• Router has only 2 links, alleviating the need for a routing table
• Memory is globally visible and part of the coherence scheme
• CPU placement is optimized for drivers, address translations, and other code with low-latency access needs to I/O
• The re-used dL1 design provides the interface to PCI/X
• Supports an arbitrary I/O-to-processing-node ratio and network topology
• Glueless scaling up to 1024 nodes of either type supports application-specific customization
Coherence: Local
• Each L2 bank and its associated controller holds the directory data for intra-chip requests, forming a centralized on-chip directory
• The intra-chip switch (ICS) carries all on-chip communication
• The L2 is non-inclusive: it behaves as a "large victim buffer" for the L1s, but keeps duplicate tags and state copies for all L1 data
• The L2 controller can thus determine whether data is cached on chip, and whether exclusively; the majority of L1 misses then require no CA assistance
• On a request, the L2 can service it directly, forward it to the owning L1, forward it to a protocol engine, or fetch the line from memory (decision sketch below)
• While a forward is outstanding, the L2 blocks conflicting requests to the same line
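The slide's four-way decision is easy to render as a lookup; this is a sketch with assumed names, not the actual controller logic:

```c
/* How an L2 bank might service an L1 miss using its duplicate L1
 * tags plus its own (non-inclusive) L2 tags. All names assumed. */
#include <stdio.h>
#include <stdbool.h>

typedef enum {
    SERVICE_FROM_L2,      /* hit in this L2 bank                   */
    FORWARD_TO_OWNER_L1,  /* duplicate tags show an on-chip owner  */
    FORWARD_TO_PROTOCOL,  /* line may be cached on a remote node   */
    FETCH_FROM_MEMORY     /* homed here and uncached everywhere    */
} L2Action;

typedef struct {
    bool in_l2;        /* present in this L2 bank                   */
    bool in_some_l1;   /* duplicate L1 tags show an on-chip copy    */
    bool maybe_remote; /* directory says a remote node may cache it */
} LineState;

static L2Action l2_dispatch(LineState s) {
    if (s.in_l2)        return SERVICE_FROM_L2;
    if (s.in_some_l1)   return FORWARD_TO_OWNER_L1;
    if (s.maybe_remote) return FORWARD_TO_PROTOCOL;
    return FETCH_FROM_MEMORY;
}

int main(void) {
    LineState miss = { .in_l2 = false, .in_some_l1 = true,
                       .maybe_remote = false };
    printf("action = %d (1 = FORWARD_TO_OWNER_L1)\n", l2_dispatch(miss));
    return 0;
}
```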
Coherence: Global
• Trades ECC granularity for "free" directory storage: computing ECC at 4x granularity leaves 44 bits per 64-byte line (arithmetic below)
• Invalidation-based distributed directory protocol
• Some optimizations:
• No NACKing. Deadlock avoidance through I/O, L, and H priority virtual lanes (L: requests to the home node, low priority; H: forwarded requests and replies, high priority)
• Also guarantees that forwards are always serviced by their targets: e.g. when the owner writes back to home, it holds the data until home acknowledges
• Removes NACK/retry traffic, as well as "ownership change" messages (DASH), retry counts (Origin), and "no, seriously" retries (Token)
• Schemes that route old messages toward empty buffers need buffer space linear in the node count; Piranha instead shares buffer space among lanes, and "cruise missile invalidates" (CMI) keep invalidations deadlock-free
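The 44-bit figure is consistent with standard SECDED overheads; the check below is our own arithmetic, matching the slide's numbers:

```latex
% SECDED over d data bits needs r check bits with 2^{r-1} >= d + r:
% r = 8 for d = 64, and r = 10 for d = 256 (4x granularity).
% Per 64-byte (512-bit) cache line:
\[
  \underbrace{8 \times 8}_{\text{ECC at 64-bit granularity}}
  \;-\;
  \underbrace{2 \times 10}_{\text{ECC at 256-bit granularity}}
  \;=\; 64 - 20 \;=\; 44 \ \text{bits freed per line}
\]
```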
Evaluation Methodology
• Admittedly favorable OLTP benchmarks chosen (modified TPC-B and TPC-D)
• Simulated and compared against an aggressive OOO core (Alpha 21364) with integrated coherence and cache hardware
• Results "fudged" to estimate the effect of a full-custom Piranha
• Four configurations: P1 (single-core Piranha @ 500 MHz), INO (1 GHz single-issue in-order aggressive core), OOO (1 GHz 4-issue out-of-order core), and P8 (the full eight-core Piranha chip @ 500 MHz)
Questions/Discussion
• Deadlock avoidance without NACKs
• CMP vs. SMP
• "Fishy" evaluation methodology?
• Specialized computing
• Buildability?