VLIW Architectures, Στέφανος Καξίρας { kaxiras@cs.wisc.edu, kaxiras@ee.upatras.gr }

VLIW Principles • ILP (Instruction-Level Parallelism) • Superscalar, OoO: the hardware finds it • VLIW: let the software, the COMPILER, find it! • No need for DYNAMIC EXECUTION • Register renaming: out • Reservation stations: out • Reorder buffer: out • Out-of-order issue: out

VLIW Principles

VLIW: Very Long Instruction Word

VLIW architecture

VLIW Execution Semantics

SuperScalar vs. VLIW

VLIW execution semantics • UAL: Unit Assumed Latencies • All latencies are assumed equal (one cycle) • A new instruction issues only after the previous one completes • It always finds its operands ready • NUAL: Non-Unit Assumed Latencies • Operation latencies are not one cycle • A new instruction issues immediately, while earlier operations may still be in progress • The compiler must schedule each operation so that its operands are ready when it issues (no interlocks)!

VLIW execution semantics • NUAL: Non-Unit Assumed Latencies • Two models: • Equals (EQ) model: each operation takes exactly its specified latency. The destination register does not change until the operation completes. Example: TI C6x • Less-Than-or-Equals (LEQ) model: an operation may complete at any time up to its specified latency

VLIW execution semantics • Equals (EQ) model: • Reduces register pressure because a destination register keeps its old value until the operation completes, so that value stays usable for longer • Cannot reduce operation latencies and still maintain binary compatibility • Less-Than-or-Equals (LEQ) model: • Destination register contents become unreliable as soon as the operation issues • Can reduce operation latencies and still maintain binary compatibility
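The EQ model can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class `EqMachine`, its methods, and the register names are hypothetical and do not model any real processor such as the TI C6x. It shows that, under EQ semantics, a destination register still holds its old value until exactly `latency` cycles after issue.

```python
class EqMachine:
    """Toy NUAL machine with EQ semantics (hypothetical model).

    A result is written to its destination register exactly `latency`
    cycles after the operation issues, never earlier.
    """

    def __init__(self):
        self.regs = {}
        self.cycle = 0
        self.pending = []  # list of (ready_cycle, dest, value)

    def issue(self, dest, value, latency):
        # Record the write; it becomes visible at cycle + latency.
        self.pending.append((self.cycle + latency, dest, value))

    def tick(self):
        # Advance one cycle and commit every write that is now due.
        self.cycle += 1
        still_pending = []
        for ready, dest, value in self.pending:
            if ready <= self.cycle:
                self.regs[dest] = value
            else:
                still_pending.append((ready, dest, value))
        self.pending = still_pending


def eq_demo():
    """Issue a 3-cycle operation into r1 and observe r1 after each cycle."""
    m = EqMachine()
    m.regs["r1"] = 10
    m.issue("r1", 99, latency=3)
    observed = []
    for _ in range(3):
        m.tick()
        observed.append(m.regs["r1"])
    return observed
```

In this sketch the old value of `r1` remains readable for the first two cycles, which is what lets an EQ compiler reuse the register during the operation's latency. Under LEQ, code could not rely on those intermediate reads.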

VLIW Problems • The compiler is tied to the implementation • The scheduler must know the operation latencies • Binaries cannot run on another implementation • Dynamically scheduled VLIW • Decouples operation latencies from the compiler

Dynamically Scheduled VLIW • Compatibility problem: the compiler must know the latencies • The compiler schedules with assumed latencies • A delay buffer inserted between the FUs and the register file holds register updates and presents the “assumed” latencies to the code, not the real latencies (similar to LEQ) • A scoreboard dynamically schedules VLIW instructions according to their dependencies • VERY SIMILAR to OoO, but simpler

Role of the COMPILER in VLIW • Find parallelism -- schedule independent instructions • Find independent operations to pack into each VLIW instruction • Many available registers to reduce false data dependencies • INCREASE ILP (create parallelism) • Loop unrolling • Software pipelining • Trace scheduling • Predication
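As a rough illustration of how a compiler might pack independent operations into VLIW instructions, here is a minimal greedy sketch. It assumes unit latencies, a single basic block, and one destination register per operation; the function name `pack_vliw` and the tuple format are hypothetical, and production schedulers use far more elaborate heuristics.

```python
def pack_vliw(ops, width):
    """Greedily pack operations into VLIW words of at most `width` slots.

    ops: list of (name, dest, srcs) in program order.
    Returns a list of words, each a list of operation names.
    Assumes unit latency and that all reads in a word see old values.
    """
    words = []        # words[i] holds the operations issued in cycle i
    last_write = {}   # register -> word index of its most recent writer
    last_read = {}    # register -> latest word index that reads it
    for name, dest, srcs in ops:
        earliest = 0
        for s in srcs:                      # RAW: must follow the producer
            if s in last_write:
                earliest = max(earliest, last_write[s] + 1)
        if dest in last_write:              # WAW: must follow the older write
            earliest = max(earliest, last_write[dest] + 1)
        if dest in last_read:               # WAR: may share a word with the reader
            earliest = max(earliest, last_read[dest])
        w = earliest
        while w < len(words) and len(words[w]) >= width:
            w += 1                          # skip words that are already full
        while len(words) <= w:
            words.append([])
        words[w].append(name)
        last_write[dest] = w
        for s in srcs:
            last_read[s] = max(last_read.get(s, 0), w)
    return words


example_ops = [
    ("a", "r1", ["r10"]),
    ("b", "r2", ["r11"]),
    ("c", "r3", ["r1", "r2"]),
    ("d", "r4", ["r12"]),
    ("e", "r5", ["r3", "r4"]),
]
```

With `width=2`, the example packs into three words: `a` and `b` together, then `c` and `d`, then `e`. Operation `d` is independent of everything before it, but it lands in the second word because the first is already full.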

Loop Unrolling • Basic idea: unroll a loop to get a loop with fewer but longer iterations • Pros: • Creates parallelism -- instructions from different original iterations can be issued in parallel • Latency tolerance -- can issue instructions from one iteration while waiting for instructions from another to complete • Reduces overhead -- fewer iterations means fewer compares and branches

Loop Unrolling • Cons: • Register pressure -- combining multiple iterations means more live values, with potential for register overflow (spills) • REQUIRES MANY ARCHITECTURAL REGISTERS • Intel’s EPIC (Itanium) architecture has 128 general-purpose registers!
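The trade-off described above can be seen in a small sketch of unrolling by four. Python does not expose registers, so this only illustrates the transformation: the function names are hypothetical, and the four accumulators stand in for the extra live values (registers) that unrolling requires.

```python
def dot_rolled(x, y):
    """Original loop: one multiply-add, one compare and one branch per element."""
    acc = 0.0
    for i in range(len(x)):
        acc += x[i] * y[i]
    return acc


def dot_unrolled4(x, y):
    """Loop unrolled by 4 with separate accumulators.

    The four multiply-adds in the body are independent of each other,
    so a VLIW compiler could schedule them in parallel. The cost is
    four live accumulators instead of one.
    """
    n = len(x)
    a0 = a1 = a2 = a3 = 0.0
    i = 0
    while i + 4 <= n:
        a0 += x[i] * y[i]
        a1 += x[i + 1] * y[i + 1]
        a2 += x[i + 2] * y[i + 2]
        a3 += x[i + 3] * y[i + 3]
        i += 4
    while i < n:          # remainder loop for the leftover 0..3 elements
        a0 += x[i] * y[i]
        i += 1
    return (a0 + a1) + (a2 + a3)
```

Note that splitting the sum across accumulators changes the order of floating-point additions, so results can differ in the last bits for general inputs. Compilers usually need permission to reassociate in this way.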

Loop Unrolling Example 1

Loop Unrolling Example 2: no Unroll


Loop Unrolling Example 2

Software pipelining • Idea: transform a loop that performs one iteration at a time into a loop whose body performs pipelined steps of different iterations • Scheduling: increases the time between dependent instructions • Combines well with loop unrolling
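A hand-written sketch of the idea follows, assuming a hypothetical three-stage loop body (load, multiply, store). The function names are illustrative only. In the kernel, the three statements belong to three different original iterations, and a prologue and an epilogue fill and drain the pipeline.

```python
def scale_plain(a, k):
    """Original loop: load, multiply and store for one iteration at a time."""
    return [v * k for v in a]


def scale_pipelined(a, k):
    """Software-pipelined version of scale_plain (three stages)."""
    n = len(a)
    if n < 2:
        return scale_plain(a, k)   # too short to pipeline
    b = [None] * n

    # Prologue: start iterations 0 and 1.
    loaded = a[0]                  # load     (iteration 0)
    product = loaded * k           # multiply (iteration 0)
    loaded = a[1]                  # load     (iteration 1)

    # Kernel: each pass finishes one iteration and advances two others.
    for i in range(n - 2):
        b[i] = product             # store    (iteration i)
        product = loaded * k       # multiply (iteration i + 1)
        loaded = a[i + 2]          # load     (iteration i + 2)

    # Epilogue: drain the two iterations still in flight.
    b[n - 2] = product
    product = loaded * k
    b[n - 1] = product
    return b
```

Each kernel statement only reads a value produced in the previous pass, so on a machine where all operations in an instruction read their operands before any of them writes, the three could plausibly share a single VLIW instruction.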

Software Pipelining • Modulo Scheduling

Software Pipelining: modulo scheduling

Comparison to Superscalar • Loop Unrolling + Software pipelining = Register Renaming + Multiple branch prediction (loop branch) + Dynamic Scheduling

COMPILER: Reduce CONTROL dependencies • 1 in 5 instructions is a branch • 5-op VLIW => each VLIW instruction contains a branch! • Unacceptable ... • INCREASE STRAIGHT-LINE CODE • code without branches • 2 techniques in addition to loop unrolling: • TRACE SCHEDULING • PREDICATION

TRACE SCHEDULING • Parallelism across IF branches vs. LOOP branches • Compiler support, in two steps: • Trace selection • Find a likely sequence of basic blocks (a trace) that, with statically predicted branches, forms a long sequence of straight-line code • Trace compaction • Squeeze the trace into few VLIW instructions • Needs bookkeeping code in case the prediction is wrong

Trace Scheduling • Similar to branch prediction in SuperScalar OoO • When things go wrong: execute fix-up code (undo wrong path). Compiler inserts all necessary code.
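The trace selection step can be sketched as a greedy walk over the control-flow graph. This is a simplified assumption of how selection might work: the graph format, the probabilities, and the name `select_trace` are all hypothetical, and the sketch leaves out trace compaction and the bookkeeping code.

```python
def select_trace(cfg, entry):
    """Pick one likely trace of basic blocks, starting at `entry`.

    cfg maps a block name to a list of (successor, probability) pairs,
    where the probabilities are static predictions (for example, from
    profiling). The walk always follows the most likely successor and
    stops at an exit block or when it would revisit a block.
    """
    trace = []
    seen = set()
    block = entry
    while block is not None and block not in seen:
        trace.append(block)
        seen.add(block)
        successors = cfg.get(block, [])
        if successors:
            block = max(successors, key=lambda s: s[1])[0]
        else:
            block = None
    return trace


example_cfg = {
    "A": [("B", 0.9), ("C", 0.1)],   # if-branch: B is the likely side
    "B": [("D", 1.0)],
    "C": [("D", 1.0)],
    "D": [],
}
```

For `example_cfg`, the selected trace is A, B, D. Block C is off-trace, so any operation the compiler moves above the branch at the end of A would need compensation code on the path through C.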

PREDICATION • Avoid branch prediction by turning branches into conditionally executed instructions: • if (x) then A = B op C else NOP • If the condition is false, the instruction neither stores its result nor causes an exception • The extended ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional moves; PA-RISC can annul any following instruction • Drawbacks of conditional instructions: • Complex conditions reduce effectiveness • Cannot predicate very large blocks
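The if-conversion described above can be sketched as follows. The helper `predicated` is a hypothetical model of a predicated instruction; the Python `if` inside it stands in for the hardware's decision to commit or squash the result, and is not a branch in the modelled program.

```python
def predicated(pred, regs, dest, value):
    """Model of a predicated operation: commit `value` to `dest` only
    if `pred` is true, otherwise behave as a NOP."""
    if pred:
        regs[dest] = value


def max_with_branch(a, b):
    """Original code: control flow depends on the comparison."""
    if a > b:
        return a
    return b


def max_if_converted(a, b):
    """If-converted code: straight-line, both arms are issued."""
    regs = {"r": None}
    p = a > b                          # compare writes predicate p
    predicated(p, regs, "r", a)        # (p)  r = a
    predicated(not p, regs, "r", b)    # (!p) r = b
    return regs["r"]
```

Both arms occupy issue slots regardless of the outcome, which is part of why very large blocks are poor candidates for predication.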

Predication (figure: Branch Prediction vs. Predication)

Intel/HP EPIC • Intel/HP IA-64 “Explicitly Parallel Instruction Computing (EPIC)” • IA-64 is the instruction set architecture; EPIC is the architecture style • EPIC = 2nd generation VLIW? • Itanium™ is the name of the first implementation (2001)

Intel EPIC VLIW Instructions • IA-64 instructions are encoded in bundles, which are 128 bits wide • Each bundle consists of a 5-bit template field and 3 instructions, each 41 bits long • The template field determines whether the instructions are dependent or independent • Smaller code size than old VLIW, larger than x86/RISC • Bundles can be linked to express independence across more than 3 instructions
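The bundle arithmetic (5 + 3 × 41 = 128 bits) can be checked with a short packing sketch. The function names are hypothetical, the slot values are treated as opaque integers rather than real IA-64 encodings, and the placement of the template in the low-order bits is stated here as an assumption about the layout.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer.

    Assumed layout: template in bits 0-4, then slot 0, slot 1, slot 2.
    """
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly 3 instruction slots")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + SLOT_BITS * i)
    return word


def unpack_bundle(word):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (word >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots
```

Packing the largest template with three all-ones slots fills exactly 128 bits, which confirms that the fields leave no spare bits in the bundle.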

Intel EPIC VLIW Instructions

Itanium

Instruction group/Bundle

Intel IA-64 VLIW Instruction groups • Instruction group: a sequence of consecutive instructions with no register data dependences • All the instructions in a group could be executed in parallel, if sufficient hardware resources existed and if any dependencies through memory were preserved • An instruction group can be arbitrarily long, but the compiler must explicitly indicate the boundary between one instruction group and another by placing a stop between 2 instructions that belong to different groups
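The placement of stops can be sketched as a single pass over the instruction sequence. This is a simplified model: it treats a read or a write of a register already written in the current group as a dependence requiring a stop, ignores memory dependences and any special cases in the real architecture, and does not reorder instructions. The name `split_into_groups` and the tuple format are hypothetical.

```python
def split_into_groups(ops):
    """Split a sequence of operations into instruction groups.

    ops: list of (name, dests, srcs) in program order.
    A stop is placed before any operation that reads or writes a
    register already written inside the current group.
    Returns a list of groups, each a list of operation names.
    """
    groups = []
    current = []
    written = set()
    for name, dests, srcs in ops:
        conflict = any(r in written for r in srcs) or any(
            r in written for r in dests
        )
        if conflict and current:
            groups.append(current)      # stop: close the current group
            current = []
            written = set()
        current.append(name)
        written.update(dests)
    if current:
        groups.append(current)
    return groups


example_seq = [
    ("ld1", ["r1"], ["r10"]),
    ("ld2", ["r2"], ["r11"]),
    ("add", ["r3"], ["r1", "r2"]),
    ("sub", ["r4"], ["r12", "r13"]),
    ("st", [], ["r3", "r14"]),
]
```

For `example_seq`, two stops are needed, giving three groups: the two loads, then `add` with `sub`, then the store. A group's length is independent of the 3-instruction bundle size, which matches the point that groups can be arbitrarily long.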

Intel IA-64 VLIW Instruction groups

Itanium (or Itanic as in Titanic) • Highly parallel and deeply pipelined hardware at 800 MHz (2000) • 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process • Hardware checks dependencies (interlocks => binary compatibility over time) • DYNAMICALLY SCHEDULED VLIW • Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?

Itanium • IA-64 Registers • The integer registers are configured to help accelerate procedure calls using a register stack • 8 64-bit Branch registers used to hold branch destination addresses for indirect branches • 64 1-bit predication registers

IA-64/Itanium registers

Itanium • Both the integer and floating-point registers support register rotation for registers 32-127 • Register rotation is designed to ease the task of allocating registers in software-pipelined loops • When combined with predication, it is possible to avoid the need for unrolling and for separate prologue and epilogue code in a software-pipelined loop • Makes software pipelining usable for loops with smaller numbers of iterations
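Register rotation can be modelled with a rotating base that is adjusted on each loop iteration. The class below is a hypothetical sketch, not the IA-64 specification: it assumes a fixed rotating region of registers 32 to 127 and a base that decrements by one per rotation, so a value written to register 32 in one iteration is visible as register 33 in the next.

```python
class RotatingFile:
    """Toy register file with a static part and a rotating part."""

    FIRST = 32      # first rotating register (assumed)
    SIZE = 96       # registers 32..127 rotate (assumed fixed size)

    def __init__(self):
        self.phys = [0] * (self.FIRST + self.SIZE)
        self.rrb = 0    # rotating register base

    def _index(self, r):
        # Static registers map to themselves; rotating ones are renamed.
        if r < self.FIRST:
            return r
        return self.FIRST + (r - self.FIRST + self.rrb) % self.SIZE

    def read(self, r):
        return self.phys[self._index(r)]

    def write(self, r, value):
        self.phys[self._index(r)] = value

    def rotate(self):
        # Done once per loop iteration (by the loop branch in this model).
        self.rrb = (self.rrb - 1) % self.SIZE
```

In a software-pipelined loop, this renaming lets each stage refer to "the value produced one iteration ago" by a fixed register name, without copying values between registers or unrolling the loop to rename them by hand.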

Itanium
