1. Chapter 4 Exploiting Instruction-Level Parallelism with Software Approaches
2. Overview
Basic Compiler Techniques
  Pipeline scheduling
  Loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
  Detecting loop-level parallelism
  Software pipelining (symbolic loop unrolling)
  Global code scheduling
Hardware support for exposing more parallelism
  Conditional or predicated instructions
  Compiler speculation with hardware support
Hardware vs. software speculation mechanisms
Intel IA-64 ISA
3. Review of Multi-issue Taxonomy
4. Quote about IA-64 Architecture
One of the surprises about IA-64 is that we hear no claims of high frequency, despite claims that an EPIC processor is less complex than a superscalar processor. It's hard to know why this is so, but one can speculate that the overall complexity involved in focusing on CPI, as IA-64 does, makes it hard to get high megahertz.
- M. Hopkins, 2000
5. Basic Pipeline Scheduling
To keep the pipeline full:
Find sequences of unrelated instructions to overlap
Separate dependent instructions by at least the latency of the source instruction
Compiler success depends on:
Amount of ILP available
Latencies of functional units
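The separation rule above comes down to simple arithmetic. The following is a minimal C sketch, not taken from the slides: the function name `stalls_needed` and its parameters are hypothetical, and it assumes a single-issue, in-order pipeline in which each unrelated instruction placed between producer and consumer hides one cycle of latency.

```c
/* Hypothetical helper (not from the slides): stall cycles that remain
   between a producer and its dependent consumer when the compiler has
   placed `independent_between` unrelated instructions between them.
   Assumes single issue, in order, one instruction per cycle. */
int stalls_needed(int latency, int independent_between)
{
    int remaining = latency - independent_between;
    return remaining > 0 ? remaining : 0;
}
```

With a latency of 2 and one unrelated instruction in between, one stall cycle remains; with two or more unrelated instructions, none remain.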
6. Assumptions for Examples
Standard 5-stage integer pipeline plus floating-point pipeline
Branches have delay of 1 cycle
Integer load latency of 1 cycle, ALU latency of 0
Functional units fully pipelined or replicated so that there are no structural hazards
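Latencies between dependent FP instructions (those visible in the slide 9 schedule):
Load double to FP ALU op: 1 stall cycle
FP ALU op to store double: 2 stall cycles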
7. Loop Example
Add a scalar to an array.
for (i=1000; i>0; i=i-1)
x[i] = x[i] + s;
Iterations of the loop are parallel with no dependencies between iterations.
8. Straightforward Conversion
R1 initially holds the address of the array element with the highest address
F2 holds the scalar s
R2 is pre-computed so that 8(R2) is the last element to operate on
loop: L.D    F0, 0(R1)      ;F0 = array element
      ADD.D  F4, F0, F2     ;add scalar in F2
      S.D    F4, 0(R1)      ;store result
      DADDUI R1, R1, #-8    ;decrement pointer (DW)
      BNE    R1, R2, loop   ;branch if R1 != R2
9. Program in MIPS Pipeline
                              Clock cycle issued
loop: L.D    F0, 0(R1)        1
      Stall                   2
      ADD.D  F4, F0, F2       3
      Stall                   4
      Stall                   5
      S.D    F4, 0(R1)        6
      DADDUI R1, R1, #-8      7
      Stall                   8
      BNE    R1, R2, loop     9
      Stall                   10
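The issue cycles in the schedule above can be reproduced with a small model. The following C sketch is illustrative only and not part of the slides: the type and function names (`struct instr`, `schedule_in_order`, `slide9_loop_cycles`) and the register ids are hypothetical. The stall values in `latency` are assumptions read off the schedule itself (1 between a load and a dependent FP add, 2 between an FP add and a dependent store, 1 between an integer ALU op and a dependent branch), plus one branch-delay cycle after the branch.

```c
#include <stddef.h>

/* Instruction classes that matter for the stall rules in this sketch. */
enum kind { LOAD_FP, FP_ALU, STORE_FP, INT_ALU, BRANCH };

/* Hypothetical register ids, chosen only to keep F and R registers apart. */
enum { F0 = 0, F2 = 2, F4 = 4, R1 = 101, R2 = 102, NONE = -1 };

struct instr {
    enum kind kind;
    int dest;    /* register written, or NONE */
    int src[2];  /* registers read, NONE for unused slots */
};

/* Stall cycles between a producer and a dependent consumer.
   Assumed values, read off the slide 9 schedule. */
static int latency(enum kind producer, enum kind consumer)
{
    if (producer == LOAD_FP && consumer == FP_ALU)   return 1;
    if (producer == FP_ALU  && consumer == STORE_FP) return 2;
    if (producer == INT_ALU && consumer == BRANCH)   return 1;
    return 0;
}

/* Single-issue, in-order model: fills issue[i] with the clock cycle in
   which instruction i issues (the first issues in cycle 1) and returns
   the cycle count, adding one branch-delay stall after a final branch. */
int schedule_in_order(const struct instr *code, size_t n, int *issue)
{
    if (n == 0)
        return 0;
    for (size_t i = 0; i < n; i++) {
        int cycle = (i == 0) ? 1 : issue[i - 1] + 1;
        for (int s = 0; s < 2; s++) {
            int reg = code[i].src[s];
            if (reg == NONE)
                continue;
            /* Find the most recent earlier instruction that writes reg. */
            for (size_t j = i; j-- > 0; ) {
                if (code[j].dest == reg) {
                    int ready = issue[j] + 1
                              + latency(code[j].kind, code[i].kind);
                    if (ready > cycle)
                        cycle = ready;
                    break;
                }
            }
        }
        issue[i] = cycle;
    }
    return issue[n - 1] + (code[n - 1].kind == BRANCH ? 1 : 0);
}

/* The loop body from slide 8, encoded for the model above. */
int slide9_loop_cycles(int issue[5])
{
    const struct instr loop[5] = {
        { LOAD_FP,  F0,   { R1, NONE } },  /* L.D    F0, 0(R1)    */
        { FP_ALU,   F4,   { F0, F2   } },  /* ADD.D  F4, F0, F2   */
        { STORE_FP, NONE, { F4, R1   } },  /* S.D    F4, 0(R1)    */
        { INT_ALU,  R1,   { R1, NONE } },  /* DADDUI R1, R1, #-8  */
        { BRANCH,   NONE, { R1, R2   } },  /* BNE    R1, R2, loop */
    };
    return schedule_in_order(loop, 5, issue);
}
```

Under these assumptions, `slide9_loop_cycles` gives issue cycles 1, 3, 6, 7 and 9 for the five instructions and 10 cycles per iteration, so half of each iteration is spent in stalls.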