
VLIW Architectures. Στέφανος Καξίρας { kaxiras@cs.wisc.edu, kaxiras@ee.upatras.gr }


Presentation Transcript


  1. VLIW Architectures • Στέφανος Καξίρας { kaxiras@cs.wisc.edu, kaxiras@ee.upatras.gr }

  2. VLIW Principles • ILP (Instruction-Level Parallelism) • Superscalar, OoO: the hardware finds it • VLIW: let the software (the COMPILER) find it! • No need for DYNAMIC EXECUTION • Register renaming out • Reservation stations out • Reorder buffer out • Out-of-order issue out

  3. VLIW Principles

  4. VLIW: Very Long Instruction Word

  5. VLIW architecture

  6. VLIW Execution Semantics

  7. VLIW Execution Semantics

  8. SuperScalar vs. VLIW

  9. VLIW execution semantics • UAL: Unit Assumed Latencies • Every operation is assumed to take one cycle • A new instruction issues only after the previous one completes • It always finds its source results ready • NUAL: Non-Unit Assumed Latencies • Operation latencies can be longer than one cycle • A new instruction issues immediately, while earlier ops may still be in progress • The compiler must schedule each operation no earlier than the cycle in which its source results are ready (the hardware has no interlocks)!
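A minimal sketch of the NUAL scheduling constraint from slide 9, in Python. The operation names and latencies are made up for illustration and do not describe any real machine. The sketch also assumes unlimited functional units, so only data dependences delay an operation.

```python
# Hypothetical latencies in cycles (illustrative values only).
LATENCY = {"load": 3, "mul": 2, "add": 1}

def schedule(ops):
    """Return the earliest issue cycle of each op on a machine without interlocks.

    ops: list of (name, opcode, sources) in program order; sources name earlier ops.
    """
    issue = {}
    ready = {}  # cycle at which each op's result becomes readable
    for name, opcode, sources in ops:
        # An op may not issue before all of its source results are ready.
        issue[name] = max((ready[s] for s in sources), default=0)
        ready[name] = issue[name] + LATENCY[opcode]
    return issue

ops = [
    ("a", "load", []),
    ("b", "load", []),
    ("c", "mul", ["a", "b"]),
    ("d", "add", ["c", "a"]),
]
cycles = schedule(ops)
print(cycles)
```

Because the hardware does not stall, the gaps between dependent operations in this schedule have to be filled by the compiler with independent work or explicit NOPs.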

  10. VLIW execution semantics • NUAL: Non-Unit Assumed Latencies • Two models: • Equals (EQ) model: each operation takes exactly its specified latency. Register values don’t change until the operation completes. Example: TI C6x • Less-Than-or-Equals (LEQ) model: an operation may complete at any time up to its specified latency

  11. VLIW execution semantics • Equals (EQ) model • Reduces register pressure: a destination register keeps its old value until the operation completes, so that value can still be read in the meantime • Operation latencies can’t be reduced without breaking compatibility with existing code • Less-Than-or-Equals (LEQ) model • Destination register contents become unreliable as soon as the operation issues • Operation latencies can be reduced while keeping compatibility with existing code
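The EQ behaviour on slides 10 and 11 can be sketched with a toy register file in Python. The class name, register names, and latency below are hypothetical. The model covers only the timing of register writes, not a real pipeline.

```python
class EqRegisterFile:
    """Toy register file under the EQ model: a result lands exactly
    `latency` cycles after the operation issues, never earlier."""

    def __init__(self):
        self.regs = {}
        self.pending = []  # (completion_cycle, register, value)
        self.cycle = 0

    def issue(self, dest, value, latency):
        self.pending.append((self.cycle + latency, dest, value))

    def tick(self):
        """Advance one cycle and commit every write that completes now."""
        self.cycle += 1
        still_pending = []
        for done, dest, value in self.pending:
            if done <= self.cycle:
                self.regs[dest] = value
            else:
                still_pending.append((done, dest, value))
        self.pending = still_pending

    def read(self, reg):
        return self.regs.get(reg, 0)

rf = EqRegisterFile()
rf.regs["r1"] = 10
rf.issue("r1", 99, latency=3)  # 3-cycle op targeting r1, issued at cycle 0
rf.tick()
old = rf.read("r1")            # cycle 1: r1 still holds the old value
rf.tick()
rf.tick()
new = rf.read("r1")            # cycle 3: the result has landed
print(old, new)
```

Under LEQ the read at cycle 1 could return either value, so a compiler targeting LEQ has to treat r1 as dead from the moment the operation issues.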

  12. VLIW Problems • The compiler is tied to the implementation • The scheduler must know the operation latencies • Binaries cannot run on another implementation • Dynamically scheduled VLIW • Decouples operation latencies from the compiler

  13. Dynamically Scheduled VLIW • Compatibility problem: the compiler must know the latencies • Schedule with assumed latencies • A delay buffer between the FUs and the register file holds register updates, so the code sees the “assumed” latencies rather than the real ones (similar to LEQ) • A scoreboard dynamically schedules VLIW instructions according to their dependencies • VERY SIMILAR to OoO but simpler

  14. Role of COMPILER in VLIW • Find parallelism -- schedule independent instructions • Find independent operations to create VLIW • Many available registers to reduce false data dependencies • INCREASE ILP (create parallelism) • Loop unrolling • Software Pipelining • Trace scheduling • Predication

  15. Loop Unrolling • Basic Idea: Unroll loops to get loop with fewer but longer iterations • Pros: • Creates parallelism -- instructions from different original iterations can be issued in parallel • Latency Tolerance -- can issue instructions from one iteration while waiting for instructions from another to complete • Reduces overhead -- fewer iterations means fewer compares and branches

  16. Loop Unrolling • Cons: • Register pressure -- combining multiple iterations means more live values and a risk of running out of registers (spills) • REQUIRES MANY ARCHITECTURAL REGISTERS • Intel’s EPIC (Itanium) architecture has 128 general-purpose registers!
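A small sketch of the trade-off on slides 15 and 16, written in Python for readability. The dot-product loop and the unroll factor of 4 are illustrative choices, not taken from the example slides.

```python
def dot(a, b):
    """Reference loop: one multiply-add, one compare and one branch per element."""
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def dot_unrolled4(a, b):
    """Same loop unrolled by 4: the four multiply-adds are independent."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0  # four live accumulators: more ILP, more registers
    i = 0
    while i + 4 <= n:      # one compare and branch per four elements
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    total = s0 + s1 + s2 + s3
    while i < n:           # cleanup loop for the leftover 0..3 elements
        total += a[i] * b[i]
        i += 1
    return total

a = list(range(10))
b = list(range(10, 20))
print(dot(a, b), dot_unrolled4(a, b))
```

The four accumulators are what lets a VLIW compiler pack the multiply-adds into one long instruction, and they are also the extra live values behind the register-pressure concern.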

  17. Loop Unrolling Example 1

  18. Loop Unrolling Example 2: no Unroll

  19. Loop Unrolling Example 2: no Unroll

  20. Loop Unrolling Example 2

  21. Loop Unrolling Example 2

  22. Software pipelining • Idea: Transform loop which performs one iteration at a time into loop which performs pipelined steps of different iterations. • Scheduling: Increase time between dependent instructions • Combines well with loop unrolling
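The transformation on slide 22 can be sketched as follows, in Python. The three-stage loop body (load, multiply, add-and-store) and the constants are hypothetical. The statements inside the kernel run sequentially here, but they belong to three different iterations and do not depend on each other within one pass, which is what would let a VLIW machine issue them together.

```python
def scale_plain(a):
    """One iteration at a time: load, multiply, then add and store."""
    out = []
    for x in a:
        loaded = x              # stage 1: load
        scaled = loaded * 3     # stage 2: multiply
        out.append(scaled + 1)  # stage 3: add and store
    return out

def scale_pipelined(a):
    """Same computation with the three stages overlapped across iterations."""
    n = len(a)
    if n < 2:
        return scale_plain(a)   # too short to fill the pipeline
    out = []
    # Prologue: fill the pipeline.
    loaded = a[0]               # load(0)
    scaled = loaded * 3         # mul(0)
    loaded = a[1]               # load(1)
    # Kernel: each pass does one stage of three different iterations.
    for i in range(2, n):
        out.append(scaled + 1)  # store(i-2)
        scaled = loaded * 3     # mul(i-1)
        loaded = a[i]           # load(i)
    # Epilogue: drain the pipeline.
    out.append(scaled + 1)      # store(n-2)
    scaled = loaded * 3         # mul(n-1)
    out.append(scaled + 1)      # store(n-1)
    return out

data = [5, 1, 4, 2, 8]
print(scale_pipelined(data))
```

In the kernel, the distance between a load and the multiply that consumes it grows from zero to one full loop pass, which is the "increase time between dependent instructions" point on the slide.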

  23. Software Pipelining • Modulo Scheduling

  24. Software Pipelining: modulo scheduling

  25. Comparison to Superscalar • Loop Unrolling + Software pipelining = Register Renaming + Multiple branch prediction (loop branch) + Dynamic Scheduling

  26. COMPILER: Reduce CONTROL dependencies • About 1 in 5 instructions is a branch • 5-op VLIW => on average every VLIW instruction contains a branch! • Unacceptable ... • INCREASE STRAIGHT-LINE CODE (code without branches) • 2 techniques in addition to loop unrolling: • TRACE SCHEDULING • PREDICATION

  27. TRACE SCHEDULING • Parallelism across IF branches vs. LOOP branches • Compiler support in two steps: • Trace Selection • Find a likely sequence of basic blocks (a trace) that, under static branch prediction, forms a long run of straight-line code • Trace Compaction • Squeeze the trace into a few VLIW instructions • Needs bookkeeping code in case the prediction is wrong

  28. Trace Scheduling • Similar to branch prediction in SuperScalar OoO • When things go wrong: execute fix-up code (undo wrong path). Compiler inserts all necessary code.
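A minimal sketch of the trace selection step, in Python. It uses a simple greedy rule (always follow the most probable successor), which is an assumption made here for illustration. The block names and branch probabilities are invented, and trace compaction and bookkeeping code are not modelled.

```python
def select_trace(cfg, entry):
    """Greedily follow the most likely successor from `entry`.

    cfg maps a basic block to a list of (successor, probability) edges.
    The trace ends at a block with no successors or when the next block
    is already on the trace (a loop back edge).
    """
    trace = [entry]
    seen = {entry}
    block = entry
    while cfg.get(block):
        successor, _prob = max(cfg[block], key=lambda edge: edge[1])
        if successor in seen:
            break
        trace.append(successor)
        seen.add(successor)
        block = successor
    return trace

# Hypothetical control-flow graph with statically predicted probabilities.
cfg = {
    "A": [("B", 0.9), ("C", 0.1)],
    "B": [("D", 1.0)],
    "C": [("D", 1.0)],
    "D": [("A", 0.95), ("E", 0.05)],
    "E": [],
}
trace = select_trace(cfg, "A")
print(trace)
```

Block C is off the trace here. If the compiler moves operations across the A-to-B edge during compaction, the path through C is where fix-up code would have to go.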

  29. PREDICATION • Avoid branch prediction by turning branches into conditionally executed instructions: • if (x) then A = B op C else NOP • If x is false, the instruction neither stores its result nor raises an exception • The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction. • Drawbacks of conditional instructions • Complex conditions reduce effectiveness • Cannot predicate very large blocks
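If-conversion, as described on slide 29, can be sketched in Python as follows. The helper `predicated` is a hypothetical model of a predicated instruction, not a real API. Python still evaluates a conditional internally, so this shows only the semantics and not the performance effect.

```python
def abs_diff_branch(b, c):
    """Original code: a two-way branch selects which subtraction runs."""
    if b > c:
        a = b - c
    else:
        a = c - b
    return a

def predicated(pred, new_value, old_value):
    """Model of a predicated op: the result is committed only if pred is true."""
    return new_value if pred else old_value

def abs_diff_predicated(b, c):
    """If-converted code: no branch, both operations are issued."""
    p = b > c                        # a compare sets predicate p
    a = 0
    a = predicated(p, b - c, a)      # (p)  a = b - c
    a = predicated(not p, c - b, a)  # (!p) a = c - b
    return a

print(abs_diff_predicated(7, 3), abs_diff_predicated(3, 7))
```

Both subtractions occupy issue slots regardless of the outcome, which is one reason very large blocks are poor candidates for predication.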

  30. Predication • Branch Prediction vs. Predication

  31. Intel/HP EPIC • Intel/HP IA-64 “Explicitly Parallel Instruction Computing (EPIC)” • IA-64 is the instruction set architecture; EPIC is the architecture style • EPIC = 2nd generation VLIW? • Itanium™ is the name of the first implementation (2001)

  32. Intel EPIC VLIW Instructions • IA-64 instructions are encoded in bundles, which are 128 bits wide. • Each bundle consists of a 5-bit template field and 3 instructions, each 41 bits long (5 + 3 × 41 = 128) • 3 instructions per 128-bit bundle; the template field indicates which instructions are dependent or independent • Smaller code size than old VLIW, larger than x86/RISC • Bundles can be chained to express independence across more than 3 instructions
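The bundle arithmetic on slide 32 (5 + 3 × 41 = 128 bits) can be checked with a short packing sketch in Python. Placing the template in the lowest 5 bits with the slots above it is an assumption made for this sketch. The template and slot values are arbitrary numbers, not real encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1

def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly 3 instruction slots")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    ]
    return template, slots

bundle = pack_bundle(0x10, [0x123, 0x456, 0x789])
print(hex(bundle), unpack_bundle(bundle))
```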

  33. Intel EPIC VLIW Instructions

  34. Itanium

  35. Instruction group/Bundle

  36. Intel IA-64 VLIW Instruction groups • Instruction group: a sequence of consecutive instructions with no register data dependences • All the instructions in a group could be executed in parallel, if sufficient hardware resources existed and if any dependencies through memory were preserved • An instruction group can be arbitrarily long, but the compiler must explicitly indicate the boundary between one instruction group and another by placing a stop between 2 instructions that belong to different groups
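A simplified sketch of how a compiler might place stops between instruction groups, in Python. It treats a read or a write of a register already written in the current group as a dependence and ignores memory dependences and any architecture-specific exceptions. The instruction tuples and register names are hypothetical.

```python
def insert_stops(instrs):
    """Split a sequence into instruction groups.

    instrs: list of (dest_register_or_None, [source_registers]).
    Returns a list of groups, each a list of instruction indices.
    A stop is placed before any instruction that reads or writes a
    register written earlier in the current group.
    """
    groups = []
    current = []
    written = set()
    for index, (dest, sources) in enumerate(instrs):
        depends = dest in written or any(s in written for s in sources)
        if depends:
            groups.append(current)        # stop: close the current group
            current, written = [], set()
        current.append(index)
        if dest is not None:
            written.add(dest)
    if current:
        groups.append(current)
    return groups

instrs = [
    ("r1", ["r2", "r3"]),
    ("r4", ["r5"]),
    ("r6", ["r1", "r4"]),  # reads r1 and r4: needs a stop before it
    ("r7", ["r6"]),        # reads r6: needs another stop
    ("r8", ["r2"]),        # independent of r7, same group
]
groups = insert_stops(instrs)
print(groups)
```

Note that the groups have different lengths and none of them lines up with the 3-instruction bundle size, which is why stops are encoded separately from bundle boundaries.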

  37. Intel IA-64 VLIW Instruction groups

  38. Itanium (or Itanic as in Titanic) • Highly parallel and deeply pipelined hardware at 800 MHz (2000) • 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process • Hardware checks dependencies (interlocks => binary compatibility over time) • DYNAMICALLY SCHEDULED VLIW • Predicated execution (each instruction selects 1 of 64 1-bit predicate flags) => 40% fewer mispredictions?

  39. Itanium • IA-64 Registers • The integer registers are configured to help accelerate procedure calls using a register stack • 8 64-bit Branch registers used to hold branch destination addresses for indirect branches • 64 1-bit predication registers

  40. IA-64/Itanium registers

  41. Itanium • Both the integer and floating-point registers support register rotation for registers 32-127. • Register rotation is designed to ease the task of allocating registers in software-pipelined loops • When combined with predication, it can avoid the need for unrolling and for separate prologue and epilogue code in a software-pipelined loop • Makes software pipelining usable for loops with fewer iterations
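A toy model of register rotation, in Python, to show why it helps software-pipelined loops. The single rotating base `rrb`, the fixed rotating region of registers 32 to 127, and the function names are simplifying assumptions made for this sketch. The real architecture has more detail than is modelled here.

```python
ROT_BASE = 32   # first rotating register
ROT_SIZE = 96   # registers 32..127 rotate

def physical(reg, rrb):
    """Map an architectural register number to a physical one.

    Registers below ROT_BASE are static; the rest are offset by the
    rotating register base rrb, wrapping around inside the region.
    """
    if reg < ROT_BASE:
        return reg
    return ROT_BASE + (reg - ROT_BASE + rrb) % ROT_SIZE

def running_pairs(values):
    """Loop body written once: write r32, read the previous iteration's
    value under the name r33, then rotate."""
    regfile = [0] * (ROT_BASE + ROT_SIZE)
    rrb = 0
    sums = []
    for v in values:
        regfile[physical(32, rrb)] = v               # this iteration's value
        sums.append(v + regfile[physical(33, rrb)])  # last iteration's value
        rrb -= 1                                     # loop branch rotates registers
    return sums

print(running_pairs([1, 2, 3, 4]))
```

Without rotation, the same loop would need either an explicit register-to-register copy per iteration or an unrolled body that renames the registers by hand.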

  42. Itanium

  43. Itanium
