
CDA 5155



Presentation Transcript


  1. Superscalar, VLIW, Vector, Decoupled Week 4 CDA 5155

  2. Processor Design Families • Superscalar • Not an architectural specification! • Vector Processors • Simplest hardware – great for the right problems • Statically Scheduled Multiple Issue • Better known as Very Long Instruction Word (VLIW) • Compiler-Dominated Scheduling • Better known as EPIC (almost VLIW) • Decoupled Architectures • Tightly interconnected scalar processors • Relatively unknown area, influencing current designs • (also my dissertation research)

  3. Vector Processors “I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI(ASC) processor. Those three were all pioneering processors… One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made” - Seymour Cray (Cray-1 1976)

  4. Vector Processor Design • Early “supercomputers” • Add special instructions (e.g., addV) that operate on sequences (vectors) of data • A single instruction defines a long sequence of operations to be performed • The operations within a sequence are independent of one another, so there are no hazards: no stalling, forwarding, etc. • Eliminates the overhead instructions needed for loop iteration • Very simple pipeline organization • Memory access is more constrained, so LV/SV instructions can be scheduled to match the memory banking design • This enables very efficient use of the memory bus (as caches do, to a smaller extent)
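
The point about loop overhead can be illustrated with a rough dynamic instruction count. This is only a sketch: the per-iteration costs (`SCALAR_WORK`, `SCALAR_OVERHEAD`) and the maximum vector length of 64 are assumptions made for illustration, not figures from the slides, and `addv` is a hypothetical functional model of the addV instruction.

```python
# Assumed cost of one scalar iteration of z[i] = x[i] + y[i]:
# 2 loads + 1 add + 1 store, plus 2 overhead instructions
# (index update and loop branch). These numbers are illustrative.
SCALAR_WORK = 4
SCALAR_OVERHEAD = 2


def scalar_instruction_count(n):
    """Dynamic instructions executed by the scalar loop for n elements."""
    return n * (SCALAR_WORK + SCALAR_OVERHEAD)


def vector_instruction_count(n, max_vector_length=64):
    """Dynamic instructions for the vector version: LV, LV, ADDV, SV.

    Assumes the whole array fits in one vector register.
    """
    if n > max_vector_length:
        raise ValueError("n exceeds the assumed maximum vector length")
    return 4


def addv(v1, v2):
    """Functional model of a vector add: one instruction, many element ops."""
    if len(v1) != len(v2):
        raise ValueError("vector lengths must match")
    return [a + b for a, b in zip(v1, v2)]


if __name__ == "__main__":
    print(scalar_instruction_count(64), vector_instruction_count(64))
    print(addv([1, 2, 3], [10, 20, 30]))
```

Under these assumptions the scalar loop executes hundreds of instructions where the vector version executes four, which is where the savings in fetch and decode work come from.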

  5. Organization of a Vector Machine

  6. Handling Vectors in Memory • LV V1 ← Mem[R1]: loads an entire vector of data starting at location Mem[R1] • This looks a lot like a cache line fill operation • Can design the number of memory banks to reflect the vector size • What about non-contiguous accesses? • Column access on a 2D array; elements out of a structure • LV V1 ← Mem[R1], R2: loads a vector starting at R1, with a stride of R2 bytes • What about more complex accesses? • Indexed (scatter/gather) access • LV V1 ← Mem[R1], V2: V1[1] ← Mem[R1+V2[1]]; V1[2] ← Mem[R1+V2[2]]; etc.
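
The three addressing forms of LV can be sketched as plain functions. This is a functional model only: memory is a Python list of words, and the stride and index offsets are counted in elements rather than bytes, which is a simplifying assumption. The function names `lv_unit`, `lv_strided`, and `lv_indexed` are hypothetical.

```python
def lv_unit(mem, base, vl):
    """LV V1 <- Mem[R1]: load vl consecutive elements starting at base."""
    return [mem[base + i] for i in range(vl)]


def lv_strided(mem, base, stride, vl):
    """LV V1 <- Mem[R1], R2: load vl elements, stride elements apart."""
    return [mem[base + i * stride] for i in range(vl)]


def lv_indexed(mem, base, index_vec):
    """LV V1 <- Mem[R1], V2: gather Mem[base + V2[i]] for each i."""
    return [mem[base + offset] for offset in index_vec]


if __name__ == "__main__":
    # A 3x4 row-major matrix stored flat: element (r, c) is at index 4*r + c.
    mem = [10 * i for i in range(12)]
    print(lv_unit(mem, 4, 4))            # row 1
    print(lv_strided(mem, 1, 4, 3))      # column 1
    print(lv_indexed(mem, 0, [7, 2, 9])) # arbitrary gather
```

The strided call shows the column-access case from the slide: walking down a column of a row-major array means stepping by the row length.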

  7. Pipelining Vectors

  8. Chaining Vectors • Enable forwarding of vectors (DAXPY: Z = aX + Y)
       LV    V1, R1       ; load X
       LV    V2, R2       ; load Y
       MULSV V3, F0, V1   ; calculate aX
       ADDV  V4, V3, V2   ; calculate (aX) + Y
       SV    V4, R3       ; store at Z
     • How can we overlap instructions?
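
One common first-order way to see the benefit of chaining is a timing model in which each vector instruction costs a start-up latency plus one cycle per element. The sketch below applies that model to the dependent MULSV/ADDV pair only. The start-up latencies (7 and 6 cycles) and the vector length of 64 are assumed values, and the model ignores the memory instructions and any issue restrictions.

```python
def unchained_cycles(startups, vector_length):
    """Each dependent instruction waits for the previous one to finish."""
    return sum(startup + vector_length for startup in startups)


def chained_cycles(startups, vector_length):
    """Each dependent instruction starts as soon as the first element of
    its source vector is available, so the element streams overlap."""
    return sum(startups) + vector_length


if __name__ == "__main__":
    # Assumed start-up latencies: multiply = 7 cycles, add = 6 cycles.
    startups = [7, 6]
    vl = 64
    print("unchained:", unchained_cycles(startups, vl))
    print("chained:  ", chained_cycles(startups, vl))
```

With these assumed numbers the pair takes 141 cycles unchained and 77 cycles chained: the per-element cost is paid once instead of once per instruction.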

  9. Other Vector Issues • Compiler analysis to find vectorizable code • Determining vector length • Amdahl’s law • Complexity • Code base • Image Processing, scientific code (genomes?), graphics (MMX)
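
The Amdahl's law bullet can be made concrete with a small calculation. This sketch uses the usual form of the law, speedup = 1 / ((1 - f) + f / s), where f is the fraction of run time that vectorizes and s is the speedup of that fraction. The function name and the sample values are illustrative assumptions.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when only vector_fraction of the run time is sped up."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0:
        raise ValueError("vector_speedup must be positive")
    scalar_part = 1.0 - vector_fraction
    vector_part = vector_fraction / vector_speedup
    return 1.0 / (scalar_part + vector_part)


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, round(amdahl_speedup(fraction, 10), 2))
```

Even with an arbitrarily fast vector unit, code that is 90% vectorizable cannot run more than 10 times faster, which is why compiler analysis to find vectorizable code matters so much.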

  10. VLIW Processors • What happens to hardware complexity if we make the microarchitecture (pipeline organization) visible to the programmer/compiler? • Scheduling is a software problem • Hazard detection is a software problem • Memory Scheduling is (mostly) a software problem • Speculation (branch prediction) is (mostly) a software problem • Hardware is simpler! • Compiler/Programmer’s job is much harder
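
To give a feel for what "scheduling is a software problem" means, here is a minimal sketch of a greedy compiler pass that packs operations into fixed-width VLIW bundles. It is not a real compiler algorithm from the slides. It assumes a single basic block, a uniform latency, only read-after-write dependences (every register is written once), and any operation can go in any slot. The function `schedule` and the operation tuples are hypothetical.

```python
def schedule(ops, width=2, latency=1):
    """Pack ops into bundles of `width` slots, padding with nops.

    ops is a list of (name, dest_register, source_registers) in program
    order. Assumes each register is written at most once.
    """
    ready = {}    # register -> first cycle in which it may be read
    bundles = []  # bundles[c] holds the op names issued in cycle c
    for name, dest, sources in ops:
        # Earliest cycle at which all source operands are available.
        cycle = max((ready.get(reg, 0) for reg in sources), default=0)
        # Skip forward past bundles that are already full.
        while cycle < len(bundles) and len(bundles[cycle]) >= width:
            cycle += 1
        while len(bundles) <= cycle:
            bundles.append([])
        bundles[cycle].append(name)
        ready[dest] = cycle + latency
    # The hardware does no hazard checking, so empty slots become nops.
    return [bundle + ["nop"] * (width - len(bundle)) for bundle in bundles]


if __name__ == "__main__":
    one_daxpy_element = [
        ("ld x", "r1", []),
        ("ld y", "r2", []),
        ("mul", "r3", ["r1"]),
        ("add", "r4", ["r3", "r2"]),
        ("st z", "mem", ["r4"]),
    ]
    for cycle, bundle in enumerate(schedule(one_daxpy_element)):
        print(cycle, bundle)
```

The nops are the visible cost of moving hazard handling into software: where a superscalar would stall, the VLIW compiler has to emit an explicit empty slot.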

  11. Non-unit latency • No hazard detection • If we write code that reads R3, it means whatever is in R3 at that cycle • Note that a superscalar will get the most recent definition (that is what the hazard detector checks for) • R1 ← 5 • R1 ← 10 • R2 ← R1 (5 or 10?) • It depends on the structure of the pipeline (which is known by the software) • Pipeline registers are visible to the compiler (but may not be accessed)
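
The "5 or 10?" question can be explored with a toy model of a pipeline without hazard detection. This is a sketch under stated assumptions: one instruction issues per cycle, a write issued in cycle c becomes visible to reads in cycle c + `write_latency`, and a read simply returns whatever the register holds in its issue cycle. The function `run` and the program encoding are hypothetical.

```python
def run(program, write_latency, initial=0):
    """Execute (dest, src) pairs with no interlocks; return final registers.

    src is either an integer constant or the name of a register to read.
    """
    regs = {}
    pending = []  # (cycle_visible, register, value), kept in issue order
    for cycle, (dest, src) in enumerate(program):
        # Retire writes whose latency has elapsed by this cycle.
        still_pending = []
        for visible, reg, value in pending:
            if visible <= cycle:
                regs[reg] = value
            else:
                still_pending.append((visible, reg, value))
        pending = still_pending
        # A read sees only what has already been written back.
        value = regs.get(src, initial) if isinstance(src, str) else src
        pending.append((cycle + write_latency, dest, value))
    # Drain writes still in flight after the last instruction issues.
    for _, reg, value in pending:
        regs[reg] = value
    return regs


if __name__ == "__main__":
    program = [("R1", 5), ("R1", 10), ("R2", "R1")]
    for latency in (1, 2, 3):
        print(latency, run(program, latency)["R2"])
```

In this model a latency of 1 gives R2 = 10 and a latency of 2 gives R2 = 5, while a latency of 3 reads the old contents of R1 from before either write. The same three instructions mean different things on different pipelines, which is why the compiler has to know the pipeline structure.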

  12. Decoupled Processors • Multiple Processors • Asynchronous Queues
       P1: LD X[i] → P3
       P2: LD Y[i] → P4
       P3: Mul a, Mem → P4
       P4: Add P3, Mem → Mem
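
The four-processor DAXPY arrangement can be imitated in software with threads and queues. This is only an analogy, not a model of the hardware: each Python thread stands in for one processor, `queue.Queue` stands in for an asynchronous hardware queue, and a result list stands in for the final store to memory. The names `loader`, `multiplier`, `adder`, and `decoupled_daxpy` are hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream


def loader(values, out_q):
    """P1 / P2: stream operands from 'memory' into a queue."""
    for value in values:
        out_q.put(value)
    out_q.put(DONE)


def multiplier(a, in_q, out_q):
    """P3: multiply each incoming X[i] by a and forward the product."""
    while True:
        x = in_q.get()
        if x is DONE:
            break
        out_q.put(a * x)
    out_q.put(DONE)


def adder(products_q, y_q, result):
    """P4: add a*X[i] (from P3) to Y[i] (from P2) and store the sum."""
    while True:
        product = products_q.get()
        y = y_q.get()
        if product is DONE or y is DONE:
            break
        result.append(product + y)


def decoupled_daxpy(a, x, y):
    q_p1_p3, q_p2_p4, q_p3_p4 = queue.Queue(), queue.Queue(), queue.Queue()
    result = []
    workers = [
        threading.Thread(target=loader, args=(x, q_p1_p3)),
        threading.Thread(target=loader, args=(y, q_p2_p4)),
        threading.Thread(target=multiplier, args=(a, q_p1_p3, q_p3_p4)),
        threading.Thread(target=adder, args=(q_p3_p4, q_p2_p4, result)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return result


if __name__ == "__main__":
    print(decoupled_daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

No stage needs to know how far ahead or behind the others are. The queues absorb the timing differences, so the loaders can run ahead of the arithmetic, which is the essence of decoupling.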
