Exploiting Instruction-Level Parallelism with Software Approaches

1. Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches

2. Overview: Basic Compiler Techniques (pipeline scheduling, loop unrolling); Static Branch Prediction; Static Multiple Issue: VLIW; Advanced Compiler Support for Exposing ILP (detecting loop-level parallelism; software pipelining, i.e. symbolic loop unrolling; global code scheduling); Hardware Support for Exposing More Parallelism (conditional or predicated instructions; compiler speculation with hardware support); Hardware vs. Software Speculation Mechanisms; Intel IA-64 ISA

3. Review of Multi-issue Taxonomy

4. Quote about IA-64 Architecture: "One of the surprises about IA-64 is that we hear no claims of high frequency, despite claims that an EPIC processor is less complex than a superscalar processor. It's hard to know why this is so, but one can speculate that the overall complexity involved in focusing on CPI, as IA-64 does, makes it hard to get high megahertz." - M. Hopkins, 2000

5. Basic Pipeline Scheduling: To keep the pipeline full, find sequences of unrelated instructions to overlap, and separate each dependent instruction from its source instruction by at least the latency of that source instruction. Compiler success depends on the amount of ILP available and the latencies of the functional units.
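The separation rule above can be sketched as a small calculation. This is only an illustration: the function name `stalls_needed` and the latency of 2 used in the demonstration are assumptions, not values taken from the slides.

```python
def stalls_needed(latency, independent_between):
    """Stall cycles that remain between a producer and its consumer.

    latency: cycles that must separate the two dependent instructions.
    independent_between: unrelated instructions the compiler placed
    between them (each one fills a cycle that would otherwise stall).
    """
    if latency < 0 or independent_between < 0:
        raise ValueError("counts must be non-negative")
    return max(0, latency - independent_between)


if __name__ == "__main__":
    # Hypothetical producer with a latency of 2 cycles.
    for fillers in range(4):
        print(fillers, "independent ->", stalls_needed(2, fillers), "stalls")
```

Once the compiler finds as many independent instructions as the latency, the dependent pair no longer stalls; finding more than that gives no further benefit for this pair.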

6. Assumptions for Examples: Standard 5-stage integer pipeline plus floating-point pipeline. Branches have a delay of 1 cycle. Integer load latency is 1 cycle; integer ALU latency is 0. Functional units are fully pipelined or replicated, so there are no structural hazards. Latencies between dependent FP instructions: [table not included in transcript]

7. Loop Example: Add a scalar to an array.
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
Iterations of the loop are parallel: there are no dependences between iterations.
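A quick way to see what "no dependences between iterations" means is to run the iterations in a different order and check that the result does not change. The sketch below assumes a Python list standing in for the array, with index 0 unused so that the indices 1..1000 match the loop on the slide; the function names are illustrative only.

```python
import random


def add_scalar_descending(x, s):
    """Same order as the slide: i = 1000 down to 1."""
    out = list(x)
    for i in range(len(out) - 1, 0, -1):
        out[i] = out[i] + s
    return out


def add_scalar_any_order(x, s, seed=0):
    """Same loop body, but iterations run in a shuffled order."""
    out = list(x)
    order = list(range(1, len(out)))
    random.Random(seed).shuffle(order)
    for i in order:
        out[i] = out[i] + s  # touches only x[i], never another element
    return out


if __name__ == "__main__":
    x = [0.0] + [float(i) for i in range(1, 1001)]  # x[0] is unused padding
    print(add_scalar_descending(x, 2.5) == add_scalar_any_order(x, 2.5))
```

Because each iteration reads and writes only its own element, any ordering (or overlapping) of iterations produces the same array.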

8. Straightforward Conversion: R1 holds the address of the highest array element; F2 holds the scalar; R2 is pre-computed so that 8(R2) is the last element to operate on.
    loop: L.D     F0, 0(R1)     ;F0 = array element
          ADD.D   F4, F0, F2    ;add scalar in F2
          S.D     F4, 0(R1)     ;store result
          DADDUI  R1, R1, #-8   ;decrement pointer (DW)
          BNE     R1, R2, loop  ;branch if R1 != R2
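The register setup can be checked with a short address walk. This is a sketch under assumptions: the base address is hypothetical, element `x[i]` is taken to live at `base + 8*i`, and the function name `addresses_visited` is made up for illustration.

```python
ELEMENT_BYTES = 8  # one double word per array element


def addresses_visited(base, n):
    """Addresses used by 0(R1) on each pass through the loop (n >= 1)."""
    r1 = base + ELEMENT_BYTES * n  # R1: address of the highest element x[n]
    r2 = base                      # R2: chosen so that 8(R2) is x[1]
    visited = []
    while True:
        visited.append(r1)         # L.D and S.D both use 0(R1)
        r1 -= ELEMENT_BYTES        # DADDUI R1, R1, #-8
        if r1 == r2:               # BNE R1, R2, loop falls through
            break
    return visited


if __name__ == "__main__":
    trace = addresses_visited(0x1000, 1000)  # 0x1000 is a hypothetical base
    print(len(trace), hex(trace[0]), hex(trace[-1]))
```

With this setup the loop body runs once per element, and the final address touched is exactly 8(R2).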

9. Program in MIPS Pipeline
    Instruction                   Clock cycle issued
    loop: L.D     F0, 0(R1)       1
          stall                   2
          ADD.D   F4, F0, F2      3
          stall                   4
          stall                   5
          S.D     F4, 0(R1)       6
          DADDUI  R1, R1, #-8     7
          stall                   8
          BNE     R1, R2, loop    9
          stall                   10
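The issue cycles in this schedule can be reproduced with a small in-order, single-issue model. This is a sketch, not the textbook's method: the stall counts in `LATENCY` are read off the schedule on this slide (they are assumptions, since the latency table is missing from the transcript), and every name in the code is illustrative.

```python
# Stall cycles needed between a producer and a dependent consumer,
# inferred from the slide's schedule (assumed values).
LATENCY = {
    ("load", "fp_alu"): 1,     # L.D -> ADD.D
    ("fp_alu", "store"): 2,    # ADD.D -> S.D
    ("int_alu", "branch"): 1,  # DADDUI -> BNE
}
BRANCH_DELAY = 1  # one stall cycle after the branch

# (text, kind, destination register, source registers)
LOOP = [
    ("L.D    F0, 0(R1)", "load", "F0", ("R1",)),
    ("ADD.D  F4, F0, F2", "fp_alu", "F4", ("F0", "F2")),
    ("S.D    F4, 0(R1)", "store", None, ("F4", "R1")),
    ("DADDUI R1, R1, #-8", "int_alu", "R1", ("R1",)),
    ("BNE    R1, R2, loop", "branch", None, ("R1", "R2")),
]


def issue_cycles(program):
    """Return the clock cycle in which each instruction issues."""
    last_writer = {}  # register -> (issue cycle, kind) of its latest producer
    cycles = []
    cycle = 0
    for _text, kind, dest, sources in program:
        earliest = cycle + 1  # in-order, one instruction per cycle
        for reg in sources:
            if reg in last_writer:
                prod_cycle, prod_kind = last_writer[reg]
                gap = LATENCY.get((prod_kind, kind), 0)
                earliest = max(earliest, prod_cycle + 1 + gap)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycles


def cycles_per_iteration(program):
    """Issue cycle of the final branch plus the branch delay."""
    return issue_cycles(program)[-1] + BRANCH_DELAY


if __name__ == "__main__":
    for (text, *_), c in zip(LOOP, issue_cycles(LOOP)):
        print(f"{text:<22}{c}")
    print("cycles per iteration:", cycles_per_iteration(LOOP))
```

Under these assumptions the model gives issue cycles 1, 3, 6, 7, 9 and 10 cycles per iteration, matching the slide; only 5 of those 10 cycles issue an instruction, and the rest are stalls.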
