Tarantula: A Vector Extension to the Alpha Architecture

Espasa, et al.
Compaq-UPC Microprocessor Lab in Spain
Alpha Development Group in Massachusetts
Presented by Curt Harting

Motivation
• Control logic scales non-linearly when building CMPs, multithreaded systems, etc.
  • Power and area are spent without doing any computation
  • Each instruction performs only a limited number of computations
• Multi-billion-dollar scientific (parallel) computing industry
• Want to exploit memory/L2 bandwidth as much as possible
  • On unit, regular non-unit, and irregular strides
• Can’t invent a new ISA or coherence protocol, or spend much time on development

Overview
• A vector extension to the Alpha architecture
• A vector unit (VBox) on chip with the EV8 core, made up of 16 lanes
• Capable of 32 flops/cycle
• 16 MB L2 cache to pipe data directly into the VBox

Architectural Changes
• 45 new instructions
• Predication, not prediction, within the VBox
• 35 architectural registers
  • 32 vector registers (128 × 64-bit elements each), plus VL (vector length), VS (vector stride), and VM (vector mask)
• Register renaming
• V31 tied to 0 for easy prefetching (128 lines, or 8 kB, with 1 instruction)
• Runs old code
  • Must recompile to take advantage of the VBox
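A minimal Python sketch of how predication under VL and VM might behave, plus the prefetch arithmetic from this slide. The function names are hypothetical, the 64-byte line size is inferred from the slide's "128 lines, or 8 kB", and leaving masked-off elements unchanged is an assumption, not something the slides state.

```python
LINE_BYTES = 64   # inferred from the slide: 8 kB / 128 lines
MAX_VL = 128      # assumed number of elements per vector register


def masked_add(va, vb, vdest, vl, vm):
    """Add element-wise over the first vl elements whose mask bit is set.

    Elements that are masked off or beyond vl keep their old value
    (an assumption made for this sketch).
    """
    out = list(vdest)
    for i in range(vl):
        if vm[i]:
            out[i] = va[i] + vb[i]
    return out


def strided_addresses(base, vs, vl):
    """Byte addresses touched by a strided access of vl elements."""
    return [base + i * vs for i in range(vl)]


# One strided load targeting V31, with the stride set to one cache line,
# touches a different line per element.
addrs = strided_addresses(base=0, vs=LINE_BYTES, vl=MAX_VL)
lines_touched = {a // LINE_BYTES for a in addrs}
prefetched_bytes = len(lines_touched) * LINE_BYTES
print(len(lines_touched), prefetched_bytes)  # 128 8192
```

Because V31 is tied to 0, the loaded data is discarded and the only effect of such an instruction would be to bring the lines into the cache.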

The VBox
• 16 lanes
  • Each lane holds a slice of the registers and mask; a unified register file would be too large
  • 2 functional units (North and South)
• Instruction, load, and store queues
• TLB: 1 per lane, 32 entries each
  • 512 MB virtual pages
  • On a miss, fill either one TLB or all of them
• Symmetric
• Multithreaded
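The figures on this slide and the overview can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes each functional unit completes one flop per cycle, that a vector register holds 128 elements split evenly across the lanes, and that "512 MB" means 512 × 2^20 bytes.

```python
LANES = 16
FUNCTIONAL_UNITS_PER_LANE = 2        # North and South
ELEMENTS_PER_VECTOR_REGISTER = 128   # assumption, see lead-in
TLB_ENTRIES_PER_LANE = 32
PAGE_BYTES = 512 * 2**20             # 512 MB pages

# Peak rate if every functional unit produces one result per cycle.
peak_flops_per_cycle = LANES * FUNCTIONAL_UNITS_PER_LANE

# Elements of each vector register stored in one lane's slice.
elements_per_lane = ELEMENTS_PER_VECTOR_REGISTER // LANES

# Address range one lane's TLB can map without a miss, in GiB.
tlb_reach_gib = TLB_ENTRIES_PER_LANE * PAGE_BYTES // 2**30

print(peak_flops_per_cycle, elements_per_lane, tlb_reach_gib)  # 32 8 16
```

The first number matches the 32 flops/cycle quoted in the overview.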

Communication with EV8 Core
• 3-instruction bus to the IQ
• Three 9-bit instruction-ID buses for retirement from the VCU
• A bus to carry scalars (2 × 64 bits)
• Kill signal
• On an exception, only the faulting instruction is reported, not the faulting lane

Memory, Addressing
• VBox communicates solely with the L2
• L2 has 16 banks that can be accessed in parallel
• Normal strides: those that are neither 1 nor self-conflicting
  • Built-in ROM to generate 8 slices of 16 quadwords (8 cycles)
• PUMP operation: stride of 1 (16/17 cache lines)
  • 2x the bandwidth (4 cycles), routed through a special structure
  • Makes a difference (sometimes)
• Gather/scatter: arbitrary addresses
  • Greedy algorithm in the CR box
  • Worst case: 128 cycles
• Self-conflicting strides: stride = y*2^x where y % 2 = 1 and x > 4
  • Treated as a gather/scatter
• Caveat: still have to wait the full time regardless of the number of quadwords needed
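The stride categories above can be sketched as a small classifier. This is an illustration that applies the slide's rule literally; the function name is hypothetical, the stride is taken to be a positive integer in whatever unit the slide intends, and the cycle counts are simply the ones quoted on the slide.

```python
# Address-generation cycles quoted on the slide. Self-conflicting strides
# go down the gather/scatter path, whose worst case is 128 cycles.
CYCLES = {"pump": 4, "normal": 8}
GATHER_SCATTER_WORST_CASE_CYCLES = 128


def classify_stride(stride):
    """Classify a positive stride as 'pump', 'normal' or 'self-conflicting'.

    Writes the stride as y * 2**x with y odd; the slide calls the stride
    self-conflicting when x > 4.
    """
    if stride < 1:
        raise ValueError("this sketch only handles positive strides")
    if stride == 1:
        return "pump"
    # Number of trailing zero bits gives x; the remaining factor y is odd.
    x = (stride & -stride).bit_length() - 1
    if x > 4:
        return "self-conflicting"
    return "normal"


for s in (1, 3, 16, 32, 48, 96):
    print(s, classify_stride(s))
```

For example, 48 = 3 × 2^4 is a normal stride, while 96 = 3 × 2^5 is self-conflicting and would be handled like a gather/scatter.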

Memory, Consistency
• Problem: the VBox writes to the L2, behind the L1’s back!
• Every line in the L2 has a presence bit (P-bit) that is set if the EV8 core has touched that line
  • If a line has its P-bit set, the L2 must essentially issue a GETX to the L1
• Scalar write, vector read: the vector read can’t see the store/write buffers (no P-bit set yet)
  • Programmer/compiler must anticipate this case and add an extra barrier
  • DrainM forces a purge of the store/write buffers into the cache
  • It also forces younger instructions to be killed and re-fetched
• On a cache miss, the entire slice waits until the offending block is replaced and a retry occurs
  • After a threshold of retries, the cache enters a panic mode
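A toy Python model of the scalar-write/vector-read hazard and the P-bit mechanism described above. Everything here is a simplification for illustration: the class and method names are hypothetical, each line holds a single value, the L1 is assumed to be write-through, and the "GETX to the L1" is modeled as a plain invalidation on a vector write.

```python
class ToyMemory:
    """One value per cache line; scalar side has an L1 and a store buffer."""

    def __init__(self):
        self.l2 = {}            # line -> value held in the L2
        self.p_bit = set()      # L2 lines the scalar core has touched
        self.l1 = {}            # line -> value cached in the scalar L1
        self.store_buffer = []  # pending scalar stores as (line, value)

    def scalar_write(self, line, value):
        # The store waits in the buffer; the L2 does not see it yet.
        self.store_buffer.append((line, value))

    def drain_m(self):
        # Models DrainM: push buffered stores into the caches.
        for line, value in self.store_buffer:
            self.l1[line] = value
            self.l2[line] = value   # assumed write-through L1
            self.p_bit.add(line)
        self.store_buffer.clear()

    def scalar_read(self, line):
        if line not in self.l1:
            self.l1[line] = self.l2.get(line, 0)
            self.p_bit.add(line)    # the core has now touched this line
        return self.l1[line]

    def vector_read(self, line):
        # The VBox talks only to the L2 and cannot see the store buffer.
        return self.l2.get(line, 0)

    def vector_write(self, line, value):
        if line in self.p_bit:
            self.l1.pop(line, None)  # invalidate the stale L1 copy
            self.p_bit.discard(line)
        self.l2[line] = value


mem = ToyMemory()
mem.scalar_write(7, 42)
stale = mem.vector_read(7)   # misses the buffered store
mem.drain_m()
fresh = mem.vector_read(7)   # sees it after the barrier
print(stale, fresh)          # 0 42
```

In this model, a later `mem.vector_write(7, 99)` drops the L1 copy of line 7, so the next `mem.scalar_read(7)` refetches 99 from the L2 instead of returning the old 42.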

Evaluation
• Vectorizable portions of vector benchmarks chosen
• Large vectors chosen; they do better
• All but 2 (sixtrack, linpack100) have over 98% vector code
• EV8 code compiled with an EV6 scheduler
• Hand-compiled and hand-tuned for Tarantula
• All benchmarks “cache-friendly” or custom tiled (up to 2x speedup)
• Many more registers in Tarantula
• Large prefetches
• A standard microprocessor is being compared to a specialized processor

Low-Level Questions
• Power: only 20% more than an equivalent EV8 CMP. Is 144 W really a power win, even with the increase in performance per watt?
• Memory bandwidth: “One of the most expensive pieces of overall cost”
  • Why was it assumed to quadruple in four years?
  • Why does the system not have it as a bottleneck? Amdahl’s Law and Fig. 8
• L2 sizing/bandwidth seemed critical to the performance. What would happen if the L2 were smaller and/or slower?
• Was DrainM the best way of accomplishing its goal?

High-Level Questions
• Gather/scatter support seems like a great idea
  • How many programs touch random parts of an array in a parallel fashion?
  • How can you compile pointer walk-throughs?
• Is there a multi-billion-dollar scientific compute industry?
  • If so, is this processor an answer for it? It only does well for large vectors.
• Is this a commodity processor or an expensive system?
  • The paper implies the goal of making it a commodity plug-and-play processor
  • It talks of very large memory bandwidth requirements, power requirements, and a huge L2
• Is Tarantula one of those ideas that goes from good to bad to good?
• Did Tarantula catch on? Just Google it!
