Tarantula: A Vector Extension to the Alpha Architecture

Espasa, et al. Compaq-UPC Microprocessor Lab in Spain; Alpha Development Group in Massachusetts. Presented by Curt Harting.

Presentation Transcript


  1. Tarantula: A Vector Extension to the Alpha Architecture • Espasa, et al. • Compaq-UPC Microprocessor Lab in Spain • Alpha Development Group in Massachusetts • Presented by Curt Harting

  2. Motivation • When building CMPs, multithreaded systems, etc., control logic scales non-linearly • Power and area are spent without doing any computation • Each instruction only does a limited number of computations • Multi-billion-dollar scientific (parallel) industry • Want to exploit memory/L2 bandwidth as much as possible • On unit, regular non-unit, and irregular strides • Can’t invent a new ISA/coherence protocol, or spend much time on development

  3. Overview • A vector extension to the Alpha architecture • A vector unit (VBox) on chip with the EV8 core, made up of 16 lanes. • Capable of 32 flops/cycle • 16MB L2 cache to pipe data directly into the VBox

  4. Architectural Changes • 45 new instructions • Predication, not prediction, within the VBox • 35 architectural registers • 32 vector registers (128 × 64-bit elements each), VL, VS, VM • Register renaming • V31 tied to 0 for easy prefetching (128 lines, or 8kB with 1 insn) • Runs old code • Must recompile to take advantage of the VBox
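The figures on slides 3 and 4 can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation, reading the vector register size as 128 elements of 64 bits each; the constant and function names are made up for illustration, and the cache line size is not stated on the slide but inferred from "128 lines, or 8kB".

```python
# Back-of-the-envelope sizing from the slide figures (names are illustrative).
ELEMENTS_PER_VREG = 128      # elements per vector register (assumed reading of slide 4)
ELEMENT_BITS = 64            # one quadword per element
NUM_VREGS = 32               # vector registers V0..V31
NUM_LANES = 16               # VBox lanes (slide 3)
PREFETCH_BYTES = 8 * 1024    # "8kB with 1 insn"


def vreg_bytes() -> int:
    """Bytes held by a single vector register."""
    return ELEMENTS_PER_VREG * ELEMENT_BITS // 8


def vector_file_bytes() -> int:
    """Bytes of architectural vector state (before renaming)."""
    return NUM_VREGS * vreg_bytes()


def elements_per_lane() -> int:
    """Elements of each register that live in one lane's slice."""
    return ELEMENTS_PER_VREG // NUM_LANES


def implied_line_bytes() -> int:
    """Cache line size implied by one 128-line, 8kB prefetch."""
    return PREFETCH_BYTES // ELEMENTS_PER_VREG


if __name__ == "__main__":
    print(vreg_bytes(), vector_file_bytes(), elements_per_lane(), implied_line_bytes())
```

Under these assumptions each register holds 1 KB, and a single prefetch to V31 touches one element per cache line, which is how one instruction covers 128 lines.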

  5. The VBox • 16 Lanes • Slice of registers & mask – unified register file would be too large • 2 functional units (North and South) • Instruction, LD, and ST queues • TLB - 1 per lane, 32 entries each • 512MB virtual pages • On a miss, either fill one or all • Symmetric • Multithreaded

  6. Communication with EV8 Core • A 3-instruction bus to the IQ • Three 9-bit instruction-ID buses for retirement from the VCU • A bus to carry scalars (2 × 64 bits) • Kill signal • On an exception, only the faulting instruction is reported, not the faulting lane

  7. Memory, Addressing • VBox communicates solely with the L2 • L2 has 16 banks that can be accessed in parallel • Normal Strides – Those that are neither self-conflicting nor 1 • Built-in ROM to generate 8 slices of 16qw (8 cycles) • PUMP operation – Stride of 1 (16/17 cache lines) • 2x the bandwidth (4 cycles) – Routed through a special structure • Makes a difference (sometimes) • Gather/Scatters – Arbitrary Addresses • Greedy algorithm in the CR box • Worst Case: 128 cycles • Self-Conflicting Strides – Stride=y*2^x where y%2=1, x>4 • Treated as a Gather/Scatter • Caveat: Still have to wait the full time regardless of the number of quadwords needed
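The stride classes on slide 7 can be sketched as a small classifier. This is a rough model, not the hardware's logic: it assumes strides are counted in quadwords, treats negative strides by their absolute value, and reuses the slide's cycle counts as per-vector bounds. The function names are hypothetical.

```python
def classify_stride(stride: int) -> str:
    """Classify a vector stride using the rule on slide 7.

    Assumption: stride is in quadwords; negative strides are handled by
    absolute value, which the slide does not discuss.
    """
    if stride == 0:
        raise ValueError("stride must be non-zero")
    s = abs(stride)
    if s == 1:
        return "pump"                      # unit stride, PUMP path
    # Write s = y * 2**x with y odd; x is the number of trailing zero bits.
    x = (s & -s).bit_length() - 1
    return "self-conflicting" if x > 4 else "normal"


# Cycle figures quoted on the slide for a full vector. The 128 is the
# gather/scatter worst case, since self-conflicting strides take that path.
CYCLE_BOUND = {"pump": 4, "normal": 8, "self-conflicting": 128}


def address_cycles_bound(stride: int) -> int:
    """Upper bound on address-generation cycles under this crude model."""
    return CYCLE_BOUND[classify_stride(stride)]


if __name__ == "__main__":
    for s in (1, 3, 16, 32, 48, 96):
        print(s, classify_stride(s), address_cycles_bound(s))
```

For example, a stride of 48 (3 × 2^4) stays on the normal path, while 96 (3 × 2^5) falls through to the gather/scatter path.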

  8. Memory, Consistency • Problem: VBox writes to the L2, behind the L1’s back! • Every line in the L2 has a presence bit that is set if the EV8 core has touched that line • If a line has its P-bit set, the L2 must essentially issue a GETX to the L1. • Scalar Write, Vector Read: The vector read can’t see store/write buffers (no P-bit set yet) • Programmer/Compiler must anticipate this case and add an extra barrier • DrainM forces a purging of the store/write buffers into cache • Also forces the killing and re-fetching of younger instructions • On a cache miss, the entire slice waits until the offending block is replaced and a retry occurs • After a threshold of retries, the cache enters a panic mode
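The scalar-write, vector-read hazard on slide 8 can be illustrated with a toy model. Everything here is a simplification for illustration: the class and method names are invented, lines hold a single value, the L1 is modeled as write-through, and the "GETX to the L1" is reduced to dropping the L1 copy before a vector store.

```python
class ToyMemory:
    """Toy model of the P-bit scheme; not the real EV8/Tarantula protocol."""

    def __init__(self):
        self.l2 = {}            # line address -> value, the only level the VBox sees
        self.l1 = {}            # scalar core's cached copies
        self.p_bit = set()      # L2 lines the scalar core has touched
        self.write_buffer = []  # pending scalar stores, invisible to the VBox

    def scalar_store(self, addr, value):
        # The store sits in the write buffer; no P-bit is set yet.
        self.write_buffer.append((addr, value))

    def drainm(self):
        # Barrier: purge buffered stores into the caches.
        for addr, value in self.write_buffer:
            self.l1[addr] = value
            self.l2[addr] = value      # assumption: write-through L1
            self.p_bit.add(addr)
        self.write_buffer.clear()

    def scalar_load(self, addr):
        for a, v in reversed(self.write_buffer):
            if a == addr:
                return v               # forwarded from the newest buffered store
        if addr not in self.l1:
            self.l1[addr] = self.l2.get(addr, 0)
            self.p_bit.add(addr)       # core has now touched this line
        return self.l1[addr]

    def vector_load(self, addr):
        return self.l2.get(addr, 0)    # reads the L2 directly

    def vector_store(self, addr, value):
        if addr in self.p_bit:
            self.l1.pop(addr, None)    # stand-in for the GETX: invalidate L1 copy
            self.p_bit.discard(addr)
        self.l2[addr] = value


if __name__ == "__main__":
    m = ToyMemory()
    m.scalar_store(0x40, 7)
    stale = m.vector_load(0x40)        # misses the buffered store
    m.drainm()
    fresh = m.vector_load(0x40)        # sees it after the barrier
    print(stale, fresh)
```

In this model a vector load issued before `drainm()` returns the old L2 contents, which is the case the slide says the programmer or compiler has to guard with a barrier.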

  9. Evaluation • Vectorizable portions of vector benchmarks chosen • Large vectors chosen; they do better • All but 2 (sixtrack, linpack100) have over 98% vector code • EV8 code compiled with an EV6 scheduler • Hand compiled and hand tuned for Tarantula • All benchmarks “cache-friendly” or custom tiled (up to 2x speedup) • Many more registers in Tarantula • Large prefetches • A standard microprocessor being compared to the specialized processor

  10. Low-Level Questions • Power – only 20% more than equivalent EV8 CMP. Is 144W really a power win, even with the increase in performance per watt? • Memory Bandwidth – “One of the most expensive pieces of overall cost” • Why was it assumed to quadruple in four years? • Why does the system not have it as a bottleneck? Amdahl’s Law and Fig. 8 • L2 sizing/bandwidth seemed critical to the performance; what would happen if the L2 were smaller and/or slower? • Was DrainM the best way of accomplishing its goal?

  11. High-Level Questions • Gather/Scatter support seems like a great idea • How many programs touch random parts of an array in a parallel fashion? • How can you compile pointer walk throughs? • Is there a multi-billion dollar scientific compute industry? • If so, is this processor an answer for it? Only does well for large vectors. • Is this a commodity processor or an expensive system? • The paper implies the goal of making it a commodity plug and play processor • Talks of very large memory bandwidth requirements, power requirements, huge L2 • Is Tarantula one of those ideas that goes from good to bad to good? • Did Tarantula catch on? Just Google it!