VLIW Architectures, Stefanos Kaxiras { kaxiras@cs.wisc.edu, kaxiras@ee.upatras.gr }
VLIW Principles • ILP (Instruction-Level Parallelism) • Superscalar, OoO: the hardware finds it • VLIW: let the software (the COMPILER) find it! • No need for DYNAMIC EXECUTION • Register renaming out • Reservation stations out • Reorder buffer out • Out-of-order issue out
VLIW execution semantics • UAL: Unit Assumed Latency • Every operation is assumed to take one cycle • A new instruction issues only after the previous one completes • An instruction always finds its operands ready • NUAL: Non-Unit Assumed Latency • Operation latencies may be longer than one cycle • A new instruction issues immediately, while earlier ops may still be in progress • The compiler must schedule each instruction for a cycle in which its operands are ready (there are no interlocks)!
VLIW execution semantics • NUAL: Non-Unit Assumed Latency • Two models: • Equals (EQ) Model: Each operation takes exactly its specified latency. Register values don’t change until the operation completes. Example: TI C6x • Less-Than-or-Equals (LEQ) Model: Each operation may complete at any time up to its specified latency
VLIW execution semantics • Equals (EQ) Model • Reduces register pressure: a destination register keeps its old value, and can still be read as a source, until the pending operation completes. • Can’t reduce operation latencies in a new implementation and maintain binary compatibility. • Less-Than-or-Equals (LEQ) Model • Destination register contents become unreliable as soon as the operation issues • Can reduce operation latencies and maintain binary compatibility
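The EQ behaviour above can be sketched with a toy register file. This is a minimal illustration, not a model of any real machine: the class `EqRegisterFile` and its methods `issue`, `tick`, and `read` are hypothetical names introduced here, and the 3-cycle latency is an assumed value.

```python
class EqRegisterFile:
    """Toy register file with EQ semantics (hypothetical, for illustration).

    A result is written exactly `latency` cycles after issue; until then the
    destination register still holds, and returns, its old value.
    """

    def __init__(self):
        self.regs = {}
        self.cycle = 0
        self.pending = []  # list of (ready_cycle, dest, value)

    def issue(self, dest, value, latency):
        # Record the write; it becomes visible at cycle + latency.
        self.pending.append((self.cycle + latency, dest, value))

    def tick(self):
        # Advance one cycle and commit every write whose latency has elapsed.
        self.cycle += 1
        still_pending = []
        for ready, dest, value in self.pending:
            if ready <= self.cycle:
                self.regs[dest] = value
            else:
                still_pending.append((ready, dest, value))
        self.pending = still_pending

    def read(self, reg):
        return self.regs.get(reg, 0)


rf = EqRegisterFile()
rf.regs["r1"] = 10
rf.issue("r1", 99, latency=3)   # assumed 3-cycle operation writing r1
rf.tick()
old_value = rf.read("r1")       # one cycle later: r1 still holds the old value
rf.tick()
rf.tick()
new_value = rf.read("r1")       # three cycles after issue: the new value is visible
```

Under LEQ the read at cycle 1 could return either value, which is why the compiler must treat the destination as unreliable from the moment of issue.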
VLIW Problems • Compiler is tied to the implementation • Scheduler must know operation latencies • Binaries cannot run on another implementation • Dynamically scheduled VLIW • Decouples operation latencies from the compiler
Dynamically Scheduled VLIW • Compatibility problem: the compiler must know the latencies • Schedule with assumed latencies • A delay buffer inserted between the FUs and the register file holds register updates and presents to the code the “assumed” latencies, not the real latencies (similar to LEQ) • A scoreboard dynamically schedules VLIW instructions according to their dependencies • VERY SIMILAR to OoO but simpler
Role of COMPILER in VLIW • Find parallelism -- schedule independent instructions • Find independent operations to fill each VLIW instruction • Use the many available registers to reduce false data dependencies • INCREASE ILP (create parallelism) • Loop unrolling • Software Pipelining • Trace scheduling • Predication
Loop Unrolling • Basic Idea: Unroll a loop to get a loop with fewer but longer iterations • Pros: • Creates parallelism -- instructions from different original iterations can be issued in parallel • Latency tolerance -- can issue instructions from one iteration while waiting for instructions from another to complete • Reduces overhead -- fewer iterations means fewer compares and branches
Loop Unrolling • Cons: • Register pressure -- combining multiple iterations means more live values, with the potential for register spills. • REQUIRES MANY ARCHITECTURAL REGISTERS • INTEL’s EPIC (ITANIUM) architecture has 128 general registers (and 128 floating-point registers)!!!
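A small sketch of the transformation, written at source level for illustration. The function names and the unroll factor of 4 are assumptions; a real compiler does this on its intermediate representation.

```python
def saxpy_rolled(a, x, y):
    # One multiply-add, one compare and one branch per element.
    out = list(y)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out


def saxpy_unrolled4(a, x, y):
    n = len(x)
    out = list(y)
    i = 0
    # Unrolled body: four independent multiply-adds per loop test.
    # t0..t3 are four values live at once, which is the register-pressure cost.
    while i + 4 <= n:
        t0 = a * x[i] + y[i]
        t1 = a * x[i + 1] + y[i + 1]
        t2 = a * x[i + 2] + y[i + 2]
        t3 = a * x[i + 3] + y[i + 3]
        out[i], out[i + 1], out[i + 2], out[i + 3] = t0, t1, t2, t3
        i += 4
    # Remainder loop for trip counts that are not a multiple of 4.
    while i < n:
        out[i] = a * x[i] + y[i]
        i += 1
    return out
```

The four statements computing `t0` to `t3` have no dependences on each other, so a VLIW compiler could place them in the same long instruction if enough functional units exist.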
Software pipelining • Idea: Transform a loop that performs one iteration at a time into a loop whose body performs pipelined steps of different iterations. • Scheduling: Increases the time between dependent instructions • Combines well with loop unrolling
Software Pipelining • Modulo Scheduling
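A hand-written sketch of what a software-pipelined loop looks like, assuming a three-stage body (load, multiply, store). The function names are hypothetical, and the schedule is fixed by hand rather than derived by a modulo scheduler.

```python
def scale_sequential(src, k):
    dst = [0] * len(src)
    for i in range(len(src)):
        v = src[i]      # stage 1: load
        r = v * k       # stage 2: multiply (depends on the load)
        dst[i] = r      # stage 3: store (depends on the multiply)
    return dst


def scale_pipelined(src, k):
    n = len(src)
    if n < 2:
        return scale_sequential(src, k)
    dst = [0] * n
    # Prologue: fill the pipeline.
    v = src[0]          # load(0)
    r = v * k           # mul(0)
    v = src[1]          # load(1)
    # Kernel: each pass holds stages of three different iterations.
    # The three operations do not depend on each other within one pass.
    for i in range(n - 2):
        dst[i] = r          # store(i)
        r = v * k           # mul(i+1)
        v = src[i + 2]      # load(i+2)
    # Epilogue: drain the pipeline.
    dst[n - 2] = r      # store(n-2)
    r = v * k           # mul(n-1)
    dst[n - 1] = r      # store(n-1)
    return dst
```

In the kernel, every consumer reads a value produced in an earlier pass, so the distance between dependent operations grows from zero to one full kernel iteration.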
Comparison to Superscalar • Loop Unrolling + Software pipelining = Register Renaming + Multiple branch prediction (loop branch) + Dynamic Scheduling
COMPILER: Reduce CONTROL dependencies • 1 in 5 instructions is a branch • 5-op VLIW? On average, each VLIW instruction contains a branch! • Unacceptable ... • INCREASE STRAIGHT-LINE CODE • code without branches • 2 Techniques in addition to loop unrolling: • TRACE SCHEDULING • PREDICATION
TRACE SCHEDULING • Parallelism across IF branches vs. LOOP branches • Compiler Support - Two steps: • Trace Selection • Find a likely sequence of basic blocks (a trace) that forms a long, statically predicted sequence of straight-line code • Trace Compaction • Squeeze the trace into a few VLIW instructions • Need bookkeeping code in case the prediction is wrong
Trace Scheduling • Similar to branch prediction in SuperScalar OoO • When things go wrong: execute fix-up code (undo wrong path). Compiler inserts all necessary code.
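Trace selection can be sketched as a greedy walk over the control-flow graph. This is a simplified illustration: the function `select_trace`, the CFG encoding, and the branch probabilities are all assumptions, and real trace schedulers use profile data and more careful heuristics.

```python
def select_trace(cfg, entry):
    """Greedy trace selection (illustrative sketch).

    cfg maps a basic-block name to a list of (successor, probability) pairs.
    Starting at `entry`, follow the most likely successor until reaching a
    block with no successors or a block already on the trace.
    """
    trace = [entry]
    seen = {entry}
    block = entry
    while cfg.get(block):
        succ, _prob = max(cfg[block], key=lambda edge: edge[1])
        if succ in seen:        # do not follow a loop back-edge
            break
        trace.append(succ)
        seen.add(succ)
        block = succ
    return trace


# Hypothetical if-then-else diamond: A branches to B (likely) or C (unlikely).
cfg = {
    "A": [("B", 0.9), ("C", 0.1)],
    "B": [("D", 1.0)],
    "C": [("D", 1.0)],
    "D": [],
}
trace = select_trace(cfg, "A")
off_trace = [b for b in cfg if b not in trace]
```

Blocks left off the trace (here `C`) are where the compiler must place the fix-up code if it moves operations across the branch during compaction.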
PREDICATION • Avoid branch prediction by turning branches into conditionally executed instructions: • if (x) then A = B op C else NOP • If the condition is false, the instruction neither stores its result nor causes an exception • Extended ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instruction. • Drawbacks to conditional instructions • Complex conditions reduce effectiveness; • Cannot predicate very large blocks
Predication • [Figure with two panels: Branch Prediction, Predication]
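A minimal sketch of if-conversion, assuming a toy machine in which every operation carries a predicate. The names `run_predicated` and `safe_div` and the register and predicate names are hypothetical; the `if` inside `run_predicated` stands in for the hardware nullifying an operation, not for a program branch.

```python
def run_predicated(ops, regs, preds):
    """Execute every op in order; an op whose predicate is false is nullified.

    A nullified op neither writes its destination nor raises an exception,
    because its function is never evaluated.
    """
    for pred, dest, fn in ops:
        if preds[pred]:
            regs[dest] = fn(regs)
    return regs


def safe_div(n, d):
    # Source form:  if d != 0: q = n // d  else: q = 0
    regs = {"n": n, "d": d, "q": None}
    # One compare produces a predicate and its complement.
    preds = {"p1": d != 0, "p2": d == 0}
    # Both paths are straight-line code; no branch remains.
    ops = [
        ("p1", "q", lambda r: r["n"] // r["d"]),  # then-path
        ("p2", "q", lambda r: 0),                 # else-path
    ]
    return run_predicated(ops, regs, preds)["q"]
```

With `d == 0` the division is fetched but nullified, so no divide-by-zero exception occurs. Both paths still occupy issue slots, which is one reason very large blocks are poor candidates for predication.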
Intel/HP EPIC • Intel/HP IA-64 “Explicitly Parallel Instruction Computing (EPIC)” • IA-64 is the instruction set architecture; EPIC is the architecture style • EPIC = 2nd generation VLIW? • Itanium™ is the name of the first implementation (2001)
Intel EPIC VLIW Instructions • IA-64 instructions are encoded in bundles, which are 128 bits wide. • Each bundle consists of a 5-bit template field and 3 instructions, each 41 bits in length • The template field indicates whether the instructions are dependent or independent • Smaller code size than old VLIW, larger than x86/RISC • Bundles can be chained to express independence among more than 3 instructions
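The bundle arithmetic (5 + 3 × 41 = 128 bits) can be checked with a short packing sketch. The helper names are hypothetical and the slot contents are opaque integers; the template is placed in the low-order bits, as in the IA-64 encoding.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instruction slots into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots
```

Packing the largest legal values, `pack_bundle(31, [2**41 - 1] * 3)`, sets all 128 bits, which confirms that the fields fill the bundle with no padding.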
Intel IA-64 VLIW Instruction groups • Instruction group: a sequence of consecutive instructions with no register data dependences • All the instructions in a group could be executed in parallel, if sufficient hardware resources existed and if any dependencies through memory were preserved • An instruction group can be arbitrarily long, but the compiler must explicitly indicate the boundary between one instruction group and another by placing a stop between 2 instructions that belong to different groups
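Stop placement can be sketched as a single pass over the instruction stream. This is a simplification under stated assumptions: instructions are modelled as hypothetical `(dest, sources)` tuples, only read-after-write and write-after-write register dependences force a stop, and memory dependences and the architecture's special cases are ignored.

```python
def insert_stops(instrs):
    """Split a sequence of (dest, sources) instructions into instruction groups.

    A new group starts (a stop is placed) whenever an instruction reads or
    writes a register already written earlier in the current group.
    """
    groups = [[]]
    written = set()
    for dest, srcs in instrs:
        if dest in written or any(s in written for s in srcs):
            groups.append([])   # stop between the two groups
            written = set()
        groups[-1].append((dest, srcs))
        written.add(dest)
    return groups


program = [
    ("r1", ("r2", "r3")),   # r1 = r2 + r3
    ("r4", ("r5", "r6")),   # r4 = r5 + r6  (independent of the first)
    ("r7", ("r1", "r4")),   # r7 = r1 + r4  (needs r1 and r4: stop before it)
    ("r8", ("r2", "r3")),   # r8 = r2 + r3  (independent of r7)
]
groups = insert_stops(program)
```

Here the first group holds two instructions and the second holds two more; a group of six or seven independent instructions would simply span several bundles.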
Itanium (or Itanic as in Titanic) • Highly parallel and deeply pipelined hardware (2000) • 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process • Hardware checks dependencies (interlocks => binary compatibility over time) • DYNAMICALLY SCHEDULED VLIW • Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
Itanium • IA-64 Registers • The integer registers are configured to help accelerate procedure calls using a register stack • 8 64-bit Branch registers used to hold branch destination addresses for indirect branches • 64 1-bit predication registers
Itanium • Both the integer and floating-point registers support register rotation for registers 32-127. • Register rotation is designed to ease the task of allocating registers in software-pipelined loops • When combined with predication, it is possible to avoid the need for unrolling and for separate prologue and epilogue code for a software-pipelined loop • Makes software pipelining usable for loops with smaller numbers of iterations
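Register rotation can be sketched as a base offset applied to every register access. This is a toy model, not the IA-64 definition: the class `RotatingRegs`, its four-register size, and the function `pipelined_double` are hypothetical, logical indices are offsets from the first rotating register, and the `if` guards stand in for stage predicates.

```python
class RotatingRegs:
    """Toy rotating register file (illustrative only)."""

    def __init__(self, size):
        self.phys = [0] * size
        self.rrb = 0    # rotating register base

    def _index(self, logical):
        return (logical + self.rrb) % len(self.phys)

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def read(self, logical):
        return self.phys[self._index(logical)]

    def rotate(self):
        # Done once per iteration by the loop-closing branch: after the base
        # is decremented, the value written to logical k is named k + 1.
        self.rrb = (self.rrb - 1) % len(self.phys)


def pipelined_double(src):
    """Two-stage software pipeline with no unrolling, prologue or epilogue."""
    regs = RotatingRegs(4)
    out = []
    for i in range(len(src) + 1):
        if i >= 1:                          # stage 2 predicate: iteration i-1
            out.append(regs.read(1) * 2)
        if i < len(src):                    # stage 1 predicate: iteration i
            regs.write(0, src[i])
        regs.rotate()
    return out
```

The loop body always writes logical register 0 and reads logical register 1, yet each iteration's value lands in a different physical register, so the compiler needs neither explicit copies nor an unrolled kernel.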