ΕΠΛ 605: Προχωρημένη Αρχιτεκτονική Υπολογιστών

ΕΠΛ 605: Προχωρημένη Αρχιτεκτονική Υπολογιστών Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing Luiz André Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000

Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing • Problem: complex processors are ill-suited for commercial applications • Solution: CMP approach • Piranha System: • research prototype developed at COMPAQ • Exploits CMP, intergrades 8 simple Alpha processor cores with 2-level cache hierarchy on a single chip • Piranha unique design choices: • shared second level cache with no inclusion • Highly optimized cache coherence protocol • Novel I/O architecture

Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing • Different behavior of commercial workloads relative to technical workloads • large memory stalls • Data-dependent nature of the computation & lack of ILP • No use of high-performance FP and multimedia functionality • Techniques: • SMT • CMT • Goal of Piranha: build a system to achieve superior performance on commercial workloads

Piranha: Architecture Overview

Alpha CPU Core and L1 Caches • CPU • Single issue, in order design capable of executing Alpha ISA • 500 MHZ pipelined datapath • Performance enhancing features: branch target buffer, pre-compute logic for branch predictions, fully by-passed datapath • L1 caches • 64KB two-way set associative blocking caches • 2-bit state field per cache line-> 4 stages in a typical MESI protocol • I and D-cache are kept coherent by hardware

Intra-Chip Switch • Uses bidirectional, push-only interface • Initiator sources data, if destination ready then ICS schedules the data transfer. A grant is issued to the initiator to commense data transfer. The destination receives a request signal: ID of initiator and type of transfer • Each port to ICS consists of 2 independent datapaths • Implemented by a set of 8 internal datapaths • Supports 2 logical lanes (low and high priority)

Second - Level Cache • 1MB unified I/O cache, physically partitioned into 8 banks • Each bank 8-way set associative and uses round robin replacement policy • L2 controllers responsible for intra-chip coherence and cooperate with engines to to enforce inter-chip coherence • Non-inclusive on-chip cache: Keep a duplicate copy of the L1 tags and state at the L2 controllers • L1 misses that also miss in L2 are filled directly from memory. L2 behaves as victim cache • The duplicate L1 state is extended to include “ownership” • Intra-chip coherence protocol • L2 controllers are responsible • Similarities to a full map centralized directory based protocol

Piranha architecture Memory controller • Does not have direct access to ICS, is controlled by and routed through L2 controller • Two parts: 1) RAC and 2) Memory Controller Engine Protocol Engines • Home Engine: responsible for exporting memory whose home is at the local node • Remote Engine: imports memory whose hoem is remote

Inter-node Coherence Protocol • Invalidation-based directory protocol • Support for 4 request types: read, read-exclusive, exclusive, exclusive-without-data • Support features: clean-exclusive optimization, reply forwarding from remote owner, eager exclusive replies • Unique property: avoids NAK • Unique techniques: 1) Network uses “hot potato” routing, 2)Buffer space is shared among all lanes 3)Cruise-missile-invalidates (CMI)

System Interconnect • OO Output queue: accepts packets via the packet switch from the protocol engines or from the system controller • Router: transmits and receives packets to and from other nodes • IQ Input queue: receives packets that are addressed to the local node and forwards them to the target module via the packet switch

Reliability Features RAS features: redundancy on all memory components, CRC protection on most datapaths, redundant datapaths, protocol error recovery, error logging, hot-swappable links and in-band system reconfiguration support

Evaluation • Workloads • DSS workload with TCP-D benchmark • OLTP workload with TCP-B benchmark • Simulation environment • Use of SinOS Alpha environment:simulates hardware components of Alpha based multiprocessors • Simulated architectures

Performance Evaluation of Piranha

Conclusions • Use of CMP in future multiprocessor designs • Piranha: from evaluation: outperforms other designs

ΕΠΛ 605: Προχωρημένη Αρχιτεκτονική Υπολογιστών

ΕΠΛ 605: Προχωρημένη Αρχιτεκτονική Υπολογιστών

Presentation Transcript