Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures

Instruction Level Parallelism (ILP)
• Simultaneous execution of multiple instructions.

    do {
        Swap = 0;
        for (I = 0; I < Last; I++) {
            if (Tab[I] > Tab[I+1]) {
                Temp = Tab[I];
                Tab[I] = Tab[I+1];
                Tab[I+1] = Temp;
                Swap = 1;
            }
        }
    } while (Swap);

Barriers to detecting ILP
• Control dependences
    • Arise due to conditional branches
• Data dependences
    • Register dependences
    • Memory dependences

Branches

    j = 0; *q = false;
    while ((*q == false) && (j != 8)) {
        j = j + 1;
        *q = false;
        if ((b[j] == true) && (a[i+j] == true) && (c[i-j + 7] == true)) {
            x[i] = j;
            b[j] = false; a[i+j] = false; c[i-j + 7] = false;
            if ( ….

The compound condition becomes a chain of separate conditional branches:

    if (b[j])
        if (a[i+j])
            if (c[i-j+7])
                x[i] = j; ...

Frequent Branches
• A sequence of branch instructions in the dynamic stream separated by at most one non-branch instruction.
[Figure: dynamic branches (%)]
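The definition above can be turned into a small counting routine. This is only a sketch under one reading of the slide: a branch is counted as "frequent" when another branch lies within two positions of it in the dynamic stream. The function name `count_frequent_branches` and the `is_branch` trace encoding are hypothetical.

```c
#include <stddef.h>

/* Counts branches that belong to a "frequent branch" sequence.
 * Assumption: a branch qualifies when another branch is at most two
 * positions away, i.e. at most one non-branch instruction separates them.
 * is_branch[i] is nonzero when dynamic instruction i is a branch. */
size_t count_frequent_branches(const int *is_branch, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!is_branch[i])
            continue;
        int near = 0;
        /* Look back one and two positions. */
        if (i >= 1 && is_branch[i - 1]) near = 1;
        if (i >= 2 && is_branch[i - 2]) near = 1;
        /* Look ahead one and two positions. */
        if (i + 1 < n && is_branch[i + 1]) near = 1;
        if (i + 2 < n && is_branch[i + 2]) near = 1;
        if (near)
            count++;
    }
    return count;
}
```

Dividing the result by the total number of branches in the trace gives a percentage of the kind plotted on the slide.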

Branch Prediction Accuracy of gshare
[Figure: prediction accuracy (%)]
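For readers unfamiliar with gshare, a minimal sketch follows. It indexes a table of 2-bit saturating counters with the branch address XORed with a global history register. The table size (`GSHARE_BITS`), the `pc >> 2` alignment shift, the initial counter value, and all names are assumptions made for illustration, not the configuration measured on the slide.

```c
#include <stdint.h>

#define GSHARE_BITS 12                    /* assumed index width */
#define GSHARE_SIZE (1u << GSHARE_BITS)

typedef struct {
    uint8_t  counters[GSHARE_SIZE];  /* 2-bit saturating counters, 0..3 */
    uint32_t history;                /* global branch history register */
} gshare_t;

void gshare_init(gshare_t *g)
{
    for (uint32_t i = 0; i < GSHARE_SIZE; i++)
        g->counters[i] = 1;          /* start weakly not-taken (assumption) */
    g->history = 0;
}

/* XOR the branch address with the global history to pick a counter. */
static uint32_t gshare_index(const gshare_t *g, uint32_t pc)
{
    return ((pc >> 2) ^ g->history) & (GSHARE_SIZE - 1);
}

/* Returns 1 for predicted taken, 0 for predicted not-taken. */
int gshare_predict(const gshare_t *g, uint32_t pc)
{
    return g->counters[gshare_index(g, pc)] >= 2;
}

/* Trains the selected counter, then shifts the outcome into the history. */
void gshare_update(gshare_t *g, uint32_t pc, int taken)
{
    uint32_t idx = gshare_index(g, pc);
    if (taken) {
        if (g->counters[idx] < 3) g->counters[idx]++;
    } else {
        if (g->counters[idx] > 0) g->counters[idx]--;
    }
    g->history = ((g->history << 1) | (taken ? 1u : 0u)) & (GSHARE_SIZE - 1);
}
```

Prediction accuracy for a trace would be measured by calling `gshare_predict` before each `gshare_update` and counting the matches.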

Memory Dependences
• Reordering of memory instructions, loads and stores, is not always possible.

Example 1, original order:
    Store R5, addr
    Store R2, addr'
    Load R1, addr'
    Add R1, R3
Reordered (valid only if addr != addr'):
    Store R2, addr'
    Load R1, addr'
    Store R5, addr
    Add R1, R3

Example 2, original order:
    Store R1, addr
    Load R2, addr'
    Add R1, R2
Reordered (valid only if addr != addr'):
    Load R2, addr'
    Store R1, addr
    Add R1, R2
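The addr != addr' condition can be written as a simple legality check. The sketch below assumes both addresses are already resolved and that all accesses are the same size and aligned, so two accesses conflict only when their addresses are equal. The types and the name `can_reorder` are hypothetical.

```c
#include <stdint.h>

typedef enum { OP_LOAD, OP_STORE } mem_op_t;

typedef struct {
    mem_op_t  op;
    uintptr_t addr;   /* resolved effective address */
} mem_access_t;

/* Returns 1 if `later` may be moved ahead of `earlier` without changing
 * the result, 0 otherwise.
 * Assumption: same-size, aligned accesses, so overlap means equal address. */
int can_reorder(mem_access_t earlier, mem_access_t later)
{
    if (earlier.op == OP_LOAD && later.op == OP_LOAD)
        return 1;                       /* two loads never conflict */
    /* Any pair that involves a store needs distinct addresses. */
    return earlier.addr != later.addr;
}
```

Real hardware and compilers face the harder case where one address is not yet known at the time the reordering decision is made, which is what disambiguation mechanisms address.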

Memory Disambiguation

Value based Store-set disambiguator

Register Dependences
• True data dependences (read after write, RAW)
• False data dependences (write after read, WAR, and write after write, WAW)
[Figure: example Load/Add/Sub instruction sequences on registers R1 to R4; original layout not recoverable]
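The two kinds of register dependence can be detected mechanically from register numbers. The sketch below assumes a two-source, one-destination instruction format and treats a two-operand instruction such as `Add R1, R2` as R1 = R1 + R2. The type `instr_t` and the function `classify_deps` are hypothetical names.

```c
typedef struct {
    int dst;       /* destination register number, -1 if none */
    int src[2];    /* source register numbers, -1 if unused */
} instr_t;

enum { DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

/* Returns a bitmask of the register dependences of `second` on `first`,
 * where `first` precedes `second` in program order. */
int classify_deps(instr_t first, instr_t second)
{
    int mask = 0;
    for (int i = 0; i < 2; i++) {
        /* True dependence: second reads what first writes. */
        if (first.dst >= 0 && second.src[i] == first.dst)
            mask |= DEP_RAW;
        /* False (anti) dependence: second overwrites what first reads. */
        if (second.dst >= 0 && first.src[i] == second.dst)
            mask |= DEP_WAR;
    }
    /* False (output) dependence: both write the same register. */
    if (first.dst >= 0 && first.dst == second.dst)
        mask |= DEP_WAW;
    return mask;
}
```

For example, `Load R2` followed by `Add R1, R2` yields `DEP_RAW`, while `Add R1, R2` followed by `Load R2` yields `DEP_WAR`.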

Window Size vs ILP (issue width = 16)

Parallelism Study - ILP in SPEC95

Conclusions
• There is an ample amount of parallelism to scale the issue width.
• Very large instruction windows must be implemented.
• A highly accurate memory disambiguation mechanism is required.
• Highly accurate branch prediction must be performed.
• Register dependences should be avoided.

Processors
• Pipelined
• Advanced Pipelining
• Superscalars
• Very Long Instruction Word (VLIW)
• Multiprocessors/Multicores

Pipelined Processors
• In-order, overlapped execution of instructions.
• E.g., a 5-stage pipeline:
    • instruction fetch (F),
    • decode and register operand fetch (D),
    • execute (E),
    • memory operand fetch (M), and
    • write-back of results (WB).
• The MIPS R4000 has an 8-stage pipeline.

    F  D  E  M  WB
       F  D  E  M  WB
          F  D  E  M  WB
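The benefit of overlapped execution can be checked with a short calculation. This sketch assumes an ideal pipeline with no stalls and one instruction entering per cycle; the function names are hypothetical.

```c
/* Cycles to run n instructions on an ideal k-stage pipeline with no stalls:
 * the first instruction needs k cycles, and each later instruction
 * completes one cycle after its predecessor. */
unsigned long pipelined_cycles(unsigned long n, unsigned long k)
{
    return n == 0 ? 0 : k + (n - 1);
}

/* Cycles for the same work when each instruction runs all k stages
 * before the next one starts. */
unsigned long unpipelined_cycles(unsigned long n, unsigned long k)
{
    return n * k;
}
```

For three instructions on the 5-stage pipeline, `pipelined_cycles(3, 5)` gives 7 cycles against 15 without overlap. The delays listed on the next slide push real pipelines above this ideal figure.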

Causes of Pipeline Delays
• Data dependences: RAW hazards
    • Reduced by register bypass and code reordering by the compiler.
• Register hazards
    • WAW hazards: instructions may reach the WB stage out of order.
    • No WAR hazards.
• Branch delays
    • Compiler fills branch delay slots vs. hardware performs branch prediction.
• Structural hazards due to nonpipelined units.
• Register writes when multiple instructions reach the WB stage at the same time (issue vs. retire rate).

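Advanced Pipelining
• In-order issue but out-of-order execution

    DIVD F0, F2, F4
    ADDD F10, F0, F8
    SUBD F8, F8, F14

• Execute SUBD before ADDD: ADDD must wait for F0 from DIVD, while SUBD does not depend on DIVD. SUBD writes F8, which ADDD reads, so ADDD must still receive the old value of F8 (a WAR hazard).
• Dynamic scheduling: Scoreboard, Tomasulo's algorithm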
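The three-instruction example can be traced with a toy scheduler. The sketch below assumes one instruction issues per cycle in program order, functional units are unlimited, and only true (RAW) dependences delay execution. That last assumption treats the F8 write-after-read conflict as already handled, as it would be with operand copying or renaming. The latencies used in the usage note and all names are made up for illustration.

```c
#include <stddef.h>

typedef struct {
    int dst, src1, src2;  /* floating-point register numbers */
    int latency;          /* execution latency in cycles (assumed values) */
} fp_instr_t;

/* Fills start[i] with the cycle in which instruction i begins execution.
 * Assumptions: in-order issue of one instruction per cycle, unlimited
 * functional units, and only RAW dependences cause waiting. */
void schedule(const fp_instr_t *prog, size_t n, int *start)
{
    for (size_t i = 0; i < n; i++) {
        int ready = (int)i;            /* cannot start before it issues */
        int found1 = 0, found2 = 0;
        /* Scan backwards so the nearest earlier writer is the producer. */
        for (size_t j = i; j-- > 0; ) {
            int done = start[j] + prog[j].latency;
            if (!found1 && prog[j].dst == prog[i].src1) {
                found1 = 1;
                if (done > ready) ready = done;
            }
            if (!found2 && prog[j].dst == prog[i].src2) {
                found2 = 1;
                if (done > ready) ready = done;
            }
        }
        start[i] = ready;
    }
}
```

With an assumed latency of 40 cycles for DIVD and 2 cycles for ADDD and SUBD, the start cycles come out as 0, 40, and 2: SUBD issues after ADDD but executes well before it.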

Superscalar Processors
• Multiple instructions can be issued in each cycle.
• Speculative execution is incorporated (commit or discard results).
• AMD-K7 is a 9-issue superscalar.
• PowerPC is a 4-issue superscalar.
[Figure: pipeline diagram of six instructions through the F, D, E, M, WB stages]

VLIW
• Each long instruction contains multiple operations that are executed in parallel.
• Compiler performs speculation and recovery.
• Multiflow 500 can issue up to 28 operations in each instruction (instructions can be up to 1024 bits).
• Itanium: 128-bit instruction bundle holding 3 operations (41 bits each) and a 5-bit template.
[Figure: two long instructions, each with one F, one D, parallel E stages, and one WB]

Control Dependences - Instruction Window

Superscalar
• Hardware branch prediction guides fetching of instructions to fill up the processor's instruction window.
• Instructions are issued from the window as they become ready; that is, out-of-order execution is possible.

VLIW
• Programs are first profiled. The compiler uses the profiles to trace out likely paths. A trace is a software instruction window.
• Instruction reordering is performed by the compiler within the trace.

Data Dependences - Exploiting ILP

Superscalar
• Memory dependences: HW load-store disambiguation techniques are used to enable out-of-order execution.
• False register dependences: Avoided using register renaming.
• True data dependences: Must be honored. Value prediction allows out-of-order execution of dependent instructions.

VLIW
• Memory dependences: Detected by the compiler using dependency analysis or address profiling.
• False data dependences: Avoided by the compiler through renaming (memory) and register allocation.
• True data dependences: Are strictly followed. Reordering is possible with HW support.
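Register renaming, named above as the way to avoid false register dependences, can be sketched with a mapping table. This is a simplified illustration that assumes an unbounded pool of physical registers that are never freed; real designs recycle them at commit. `NUM_ARCH_REGS` and all names are hypothetical.

```c
#define NUM_ARCH_REGS 32   /* assumed number of architectural registers */

typedef struct { int dst, src1, src2; } rinstr_t;

typedef struct {
    int map[NUM_ARCH_REGS];  /* architectural -> physical register */
    int next_free;           /* next unused physical register */
} rename_table_t;

void rename_init(rename_table_t *t)
{
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        t->map[r] = r;               /* identity mapping at start */
    t->next_free = NUM_ARCH_REGS;
}

/* Renames one instruction: sources read the current mapping, and the
 * destination always receives a fresh physical register, which removes
 * WAR and WAW conflicts with earlier instructions. */
rinstr_t rename_instr(rename_table_t *t, rinstr_t in)
{
    rinstr_t out;
    out.src1 = t->map[in.src1];
    out.src2 = t->map[in.src2];
    out.dst  = t->next_free++;
    t->map[in.dst] = out.dst;
    return out;
}
```

Renaming `ADDD F10, F0, F8` and then `SUBD F8, F8, F14` makes SUBD write a new physical register instead of the one ADDD reads, so only true dependences remain between them.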
