3.13. Fallacies and Pitfalls • Fallacy: Processors with lower CPIs will always be faster • Fallacy: Processors with faster clock rates will always be faster • Balance must be found: • E.g. sophisticated pipeline: CPI ↓ clock cycle ↑
Fallacies and Pitfalls • Pitfall: Emphasizing an improvement in CPI by increasing the issue rate while sacrificing the clock rate can lead to lower performance • Again, a question of balance • SuperSPARC vs. HP PA 7100 • Complex interactions between cycle time and organisation
Fallacies and Pitfalls • Pitfall: Improving only one aspect of a multiple-issue processor and expecting overall performance improvement • Amdahl’s Law! • Boosting performance of one area may uncover problems in another
Fallacies and Pitfalls • Pitfall: Sometimes bigger and dumber is better! • Alpha 21264: sophisticated multilevel tournament branch predictor • Alpha 21164: simple two-bit predictor • 21164 performs better for transaction processing application! • Can handle twice as many local branch predictions
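The 21164's simple two-bit scheme can be sketched as a saturating counter in C (a sketch only; the state encoding below is an illustrative assumption, not the Alpha's actual hardware):

```c
#include <assert.h>
#include <stdbool.h>

/* Two-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken.
   Two mispredictions in a row are needed to flip the prediction. */
typedef struct { unsigned state; } predictor_t;

static bool predict(const predictor_t *p) {
    return p->state >= 2;
}

static void update(predictor_t *p, bool taken) {
    if (taken)  { if (p->state < 3) p->state++; }  /* saturate at 3 */
    else        { if (p->state > 0) p->state--; }  /* saturate at 0 */
}
```

A table of such counters, indexed by branch address, is all the 21164 needs; the 21264's tournament predictor layers several of these plus a chooser on top.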
Concluding Remarks • Lots of open questions! • Clock speed vs. CPI • Power issues • Exploiting parallelism • ILP vs. explicit parallelism
Characteristics of Modern (2001) Processors • Figure 3.61 • 3–4 way superscalar • 4–22 stage pipelines • Branch prediction • Register renaming (except UltraSPARC) • 400MHz – 1.7GHz • 7–130 million transistors
4.1. Compiler Techniques for Exposing ILP • Compilers can improve the performance of simple pipelines • Reduce data hazards • Reduce control hazards
Loop Unrolling • Compiler technique to increase ILP • Duplicate loop body • Decrease iterations • Example:

    for (int k = 0; k < 1000; k++) {
        x[k] = x[k] + s;
    }

• Basic code: 10 cycles per iteration • Scheduled: 6 cycles
Loop Unrolling • Unrolled four times (compare with the original loop above):

    for (int k = 0; k < 1000; k += 4) {
        x[k]   = x[k]   + s;
        x[k+1] = x[k+1] + s;
        x[k+2] = x[k+2] + s;
        x[k+3] = x[k+3] + s;
    }

• Basic code: 7 cycles per “iteration” • Scheduled: 3.5 cycles (no stalls!)
Loop Unrolling • Requires clever compilers • Analysing data dependences, name dependences and control dependences • Limitations • Code size • Decrease in amortisation of overheads • “Register pressure” • Compiler limitations • Useful for any architecture
Superscalar Performance • Two-issue MIPS (int + FP) • 2.4 cycles per “iteration” • Unrolled five times
4.2. Static Branch Prediction • Useful: • where behaviour can be predicted at compile-time • to assist dynamic prediction • Architectural support • Delayed branches
Static Branch Prediction • Simple: • Predict taken • Average misprediction rate of 34% (SPEC) • Range: 9%–59% • Better: • Predict backward taken, forward not-taken • Actually worse for SPEC!
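The backward-taken/forward-not-taken heuristic amounts to a single address comparison, sketched here in C (`predict_btfn` is a hypothetical helper name; the rule exploits the fact that backward branches are usually loop-closing branches):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Predict taken iff the branch target lies at a lower address than the
   branch itself, i.e. the branch jumps backward (typically a loop branch). */
static bool predict_btfn(uint32_t branch_pc, uint32_t target_pc) {
    return target_pc < branch_pc;
}
```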
Static Branch Prediction • Advanced compiler analysis can do better • Profiling is very useful • FP: 9% ± 4% • Int: 15% ± 5%
4.3. Static Multiple Issue: VLIW • Compiler groups instructions into “packets”, checking for dependences • Remove dependences • Flag dependences • Simplifies hardware
VLIW • First machines used a wide instruction with multiple operations per instruction • Hence Very Long Instruction Word (VLIW) • 64–128 bits • Alternative: group several instructions into an issue packet
VLIW Architectures • Multiple functional units • Compiler selects instructions for each unit to create one long instruction (an issue packet) • Example: five operations • Integer/branch, 2 × FP, 2 × memory access • Needs lots of parallelism • Use loop unrolling or global scheduling
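A five-slot issue packet like the one above can be sketched as a C struct (a hypothetical encoding for illustration only; real VLIW formats pack the operations into fixed bit fields of a 64–128-bit word):

```c
#include <assert.h>

/* Kinds of operation a slot can hold; OP_NOP fills unused slots. */
typedef enum { OP_NOP = 0, OP_INT, OP_FP, OP_MEM, OP_BRANCH } op_kind;

typedef struct {
    op_kind kind;
    /* ... register operands, immediates, etc. would go here ... */
} operation;

/* One long instruction: five slots the compiler must fill with
   independent operations; anything it cannot fill becomes a no-op. */
typedef struct {
    operation int_or_branch;  /* 1 integer/branch slot */
    operation fp[2];          /* 2 FP slots            */
    operation mem[2];         /* 2 memory-access slots */
} vliw_packet;
```

Zero-initialising a packet leaves every slot as a no-op, which is exactly the wasted space the “clever encoding” bullet below refers to.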
Example

    for (int k = 0; k < 1000; k++) {
        x[k] = x[k] + s;
    }

• Loop unrolled seven times! • 1.29 cycles per result • 60% of available instruction “slots” filled
Drawbacks of Original VLIWs • Large code size • Need to use loop unrolling • Wasted space for unused slots • Clever encoding techniques, compression • Lock-step execution • Stalling one unit stalls them all • Binary code compatibility • Variations on structure required recompilation
4.4. Compiler Support for Exploiting ILP • We will not cover this section in detail • Loop unrolling • Loop-carried dependences • Software pipelining • Interleave instructions from different iterations
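Software pipelining can be illustrated on the x[k] = x[k] + s loop by building each new iteration from the store of iteration k, the add of iteration k+1, and the load of iteration k+2 (a hand-written sketch of the idea, not actual compiler output; `add_scalar_pipelined` is a made-up name):

```c
#include <assert.h>

/* Software-pipelined form of: for (k = 0; k < n; k++) x[k] = x[k] + s;
   Assumes n >= 2. The prologue fills the pipeline, the epilogue drains it,
   so inside the steady-state loop the load, add and store come from three
   different original iterations and can be scheduled without stalls. */
void add_scalar_pipelined(double *x, double s, int n) {
    double loaded = x[0];        /* load  for iteration 0 (prologue) */
    double summed = loaded + s;  /* add   for iteration 0 (prologue) */
    loaded = x[1];               /* load  for iteration 1 (prologue) */
    for (int k = 0; k < n - 2; k++) {
        x[k]   = summed;         /* store for iteration k     */
        summed = loaded + s;     /* add   for iteration k + 1 */
        loaded = x[k + 2];       /* load  for iteration k + 2 */
    }
    x[n - 2] = summed;           /* epilogue */
    x[n - 1] = loaded + s;
}
```

Unlike loop unrolling, this keeps the code size essentially unchanged; the cost is the prologue and epilogue code.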
4.5. Hardware Support for Extracting More Parallelism • Techniques like loop-unrolling work well when branch behaviour can be predicted at compile time • If not, we need more advanced techniques: • Conditional instructions • Hardware support for compiler speculation
Conditional or Predicated Instructions • Instructions have associated conditions • If the condition is true, execution proceeds normally • If not, the instruction becomes a no-op • Example: if (a == 0) b = c;

With a branch:

    bnez %r8, L1
    nop
    mov  %r1, %r2
L1: ...

With a conditional move:

    cmovz %r8, %r1, %r2

• Removes control hazards
Conditional Instructions • Control hazards effectively replaced by data hazards • Can be used for speculation • Compiler reorders instructions depending on likely outcome of branches
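The conversion of a control hazard into a data hazard can be mimicked in C with a branch-free select (a sketch; `select_if_zero` is a hypothetical helper, and compilers typically lower this kind of pattern to a conditional-move instruction):

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free equivalent of: if (a == 0) b = c;  (returns the new b)
   The result now data-depends on a instead of being control-dependent. */
static int32_t select_if_zero(int32_t a, int32_t b, int32_t c) {
    int32_t mask = -(int32_t)(a == 0);   /* all ones if a == 0, else 0 */
    return (c & mask) | (b & ~mask);
}
```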
Limitations on Conditional Instructions • Annulled instructions still execute • But may occupy otherwise stalled time • Most useful when conditions evaluated early • Limited usefulness for complex conditions • May be slower than unconditional operations