Lecture 24: Instruction Level Parallelism
Computer Engineering 585, Fall 2001
Limits to Multi-Issue Machines
• Inherent limitations of ILP:
  • 1 branch in 5: how to keep a 5-way VLIW busy?
  • Latencies of units: many operations must be scheduled.
  • Need about Pipeline Depth x No. of Function Units independent operations to keep the machine busy, e.g. 5 x 4 = 20, so roughly 15 to 20 independent instructions.
• Difficulties in building HW:
  • Easy: more instruction bandwidth.
  • Easy: duplicate FUs to get parallel execution.
  • Hard: increase ports to the register file (bandwidth).
    • The VLIW example needs 7 read and 3 write ports for the integer register file, and 5 read and 3 write ports for the FP register file.
  • Harder: increase ports to memory (bandwidth).
  • Superscalar decoding and its impact on clock rate and pipeline depth?
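The two back-of-the-envelope figures on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and reading "5 x 4" as pipeline depth 5 and 4 function units is an assumption based on the order of the terms on the slide.

```python
def independent_ops_needed(pipeline_depth: int, function_units: int) -> int:
    """Operations that must be in flight to fill every stage of every unit."""
    return pipeline_depth * function_units


def branches_per_issue_packet(issue_width: int, branch_fraction: float) -> float:
    """Average number of branches in one group of instructions issued together."""
    return issue_width * branch_fraction


# Assumed reading of the slide's "5 x 4": depth 5, 4 function units.
print(independent_ops_needed(5, 4))        # 20

# "1 branch in 5" on a 5-way machine: about one branch per issue packet.
print(branches_per_issue_packet(5, 1 / 5))
```

With one branch per packet on average, a 5-way VLIW has to find its parallelism across branch boundaries, which is why the slide asks how to keep it busy.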
Limits to Multi-Issue Machines
• Limitations specific to either Superscalar or VLIW implementation:
  • Decode issue in Superscalar: how wide is practical?
  • VLIW code size: unrolled loops + wasted fields in VLIW.
    • IA-64 compresses dependent instructions, but code is still larger.
  • VLIW lock step => 1 hazard & all instructions stall.
    • IA-64 not lock step? Dynamic pipeline?
  • VLIW & binary compatibility is a practical weakness, as the number of FUs and their latencies vary over time.
    • IA-64 provides binary compatibility.
Limits to ILP
• Conflicting studies of the amount of parallelism available in the late 1980s and early 1990s. Different assumptions about:
  • Benchmarks (vectorized Fortran FP vs. integer C programs).
  • Hardware sophistication.
  • Compiler sophistication.
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
Limits to ILP
Initial HW model here; MIPS compilers. Assumptions for an ideal/perfect machine to start:
1. Register renaming: infinite virtual registers, so all WAW & WAR hazards are avoided.
2. Branch prediction: perfect; no mispredictions.
3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available.
4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided the addresses are not equal.
In addition: 1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle.
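Under these assumptions, the only thing that delays an instruction is a true (RAW) dependence on an earlier result. A minimal sketch of how the resulting IPC could be computed for a register-only trace follows. The trace format and the name `ideal_ipc` are hypothetical, and memory dependences are not modelled at all.

```python
from typing import Dict, List, Tuple

# Hypothetical trace format: (destination register, source registers).
Instr = Tuple[str, List[str]]


def ideal_ipc(trace: List[Instr]) -> float:
    """IPC with unit latency, unlimited issue, and only RAW dependences."""
    ready: Dict[str, int] = {}  # register -> cycle its newest value is available
    last_cycle = 0
    for dest, srcs in trace:  # program order
        # Issue as soon as every source value has been produced.
        issue = max((ready.get(s, 0) for s in srcs), default=0)
        done = issue + 1  # 1 cycle latency for all instructions
        # Perfect renaming: a new write never waits for older readers or
        # writers of the same register (no WAR or WAW stalls).
        ready[dest] = done
        last_cycle = max(last_cycle, done)
    return len(trace) / last_cycle if last_cycle else 0.0


trace = [
    ("r1", []),
    ("r2", []),
    ("r3", ["r1", "r2"]),
    ("r1", []),            # reuses r1; renaming removes the WAW/WAR hazards
    ("r4", ["r1"]),
    ("r5", ["r3", "r4"]),
]
print(ideal_ipc(trace))  # 2.0: six instructions, critical path of three
```

The limit is set by the longest dependence chain, not by the number of instructions, which is why the ideal machine reaches such high IPC on loop-heavy FP code.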
Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)
[Figure: IPC on the ideal machine. FP: 75 - 150 IPC; Integer: 18 - 60 IPC.]
More Realistic HW: Branch Impact (Figure 4.40, page 323)
Change from an infinite window to a window of 2000 instructions and a maximum issue of 64 instructions per clock cycle.
[Figure: IPC by branch prediction scheme: Perfect; Pick Cor. or BHT; BHT (512); Profile; No prediction. FP: 15 - 45 IPC; Integer: 6 - 12 IPC.]
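Selective History Predictor
[Figure: a selective predictor built from three tables indexed by branch address. Non-correlator: 8096 x 2 bits. Correlator: 2048 x 4 x 2 bits, with 2 bits of global history (00, 01, 10, 11) picking one of the 4 counters. Selector: 8K x 2 bits, choosing between the non-correlator and the correlator. Each 2-bit counter predicts Taken for 11 and 10 and Not Taken for 01 and 00. Output: Taken/Not Taken (1/0).]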
More Realistic HW: Register Impact (Figure 4.44, page 328)
Change: 2000-instruction window, 64-instruction issue, 8K 2-level prediction.
[Figure: IPC by number of renaming registers: Infinite; 256; 128; 64; 32; None. FP: 11 - 45 IPC; Integer: 5 - 15 IPC.]
More Realistic HW: Alias Impact (Figure 4.46, page 330)
Change: 2000-instruction window, 64-instruction issue, 8K 2-level prediction, 256 renaming registers.
[Figure: IPC by alias analysis scheme: Perfect; Global/Stack perfect, heap conflicts; Inspec. Assem.; None. FP: 4 - 45 IPC (Fortran, no heap); Integer: 4 - 9 IPC.]
Realistic HW for '9X: Window Impact (Figure 4.48, page 332)
Perfect disambiguation (HW), 1K selective prediction, 16-entry return, 64 registers, issue as many as window.
[Figure: IPC by window size: Infinite; 256; 128; 64; 32; 16; 8; 4. FP: 8 - 45 IPC; Integer: 6 - 12 IPC.]
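To see why a finite window caps IPC, the sketch below adds a window limit to a simple dataflow schedule. It is a simplified model and not the simulator behind the figure: it assumes unit latency, perfect renaming, issue width equal to the window, and that a window slot is freed in program order (instruction i can enter only after instruction i - window and everything before it has finished). The trace format and function name are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical trace format: (destination register, source registers).
Instr = Tuple[str, List[str]]


def windowed_ipc(trace: List[Instr], window: int) -> float:
    """IPC when at most `window` instructions are in flight at once."""
    if window < 1:
        raise ValueError("window must be at least 1")
    ready: Dict[str, int] = {}  # register -> cycle its newest value is available
    finished: List[int] = []    # finished[i]: cycle by which instrs 0..i are done
    for i, (dest, srcs) in enumerate(trace):
        issue = max((ready.get(s, 0) for s in srcs), default=0)
        if i >= window:
            # Wait for a free slot: the instruction `window` places earlier
            # (and all older ones) must have finished.
            issue = max(issue, finished[i - window])
        done = issue + 1
        ready[dest] = done
        finished.append(max(finished[-1], done) if finished else done)
    return len(trace) / finished[-1] if finished else 0.0


# Sixteen fully independent instructions: IPC is bounded by the window size.
independent = [(f"r{i}", []) for i in range(16)]
for w in (1, 4, 8, 16):
    print(w, windowed_ipc(independent, w))
```

Even with no dependences at all, the machine cannot issue more instructions per cycle than it can see, which matches the shape of the curve as the window shrinks from infinite to 4.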
3 1996 Era Machines
               Alpha 21164     PPro             HP PA-8000
Year           1995            1995             1996
Clock          400 MHz         200 MHz          180 MHz
Cache          8K/8K/96K/2M    8K/8K/0.5M       0/0/2M
Issue rate     2int+2FP        3 instr (x86)    4 instr
Pipe stages    7-9             12-14            7-9
Out-of-Order   6 loads         40 instr (µop)   56 instr
Rename regs    none            40               56
3 1997 Era Machines
               Alpha 21164     Pentium II       HP PA-8000
Year           1995            1996             1996
Clock          600 MHz ('97)   300 MHz ('97)    236 MHz ('97)
Cache          8K/8K/96K/2M    16K/16K/0.5M     0/0/4M
Issue          2int+2FP        3 instr (x86)    4 instr
Pipe stages    7-9             12-14            7-9
Out-of-Order   6 loads         40 instr (µop)   56 instr
Rename         none            40               56
3 2000-1 Era Machines
               Alpha 21364      Power4              Pentium 4
Year           2000             2001                2000-1
Clock          1 GHz ('01?)     >1 GHz (2001)       2 GHz (2001)
Cache          64K/64K/1.75M    32K/64K/1.5M/32M    12K microops trace cache/8K (D)/256K
Issue          2int+2FP         8 inst.             6 inst.
Pipe stages    7-9              15-20               20+
Out-of-Order   6 loads          200 inst.           126 inst.
Rename         none             >200                128
Summary
• Branch Prediction:
  • Branch History Table: 2 bits for loop accuracy.
  • Recently executed branches correlated with next branch?
  • Branch Target Buffer: include branch address & prediction.
  • Predicated Execution can reduce number of branches, number of mispredicted branches.
• Speculation: Out-of-order execution, In-order commit (reorder buffer).
• SW Pipelining:
  • Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead.
• Superscalar and VLIW: CPI < 1 (IPC > 1)
  • Dynamic issue vs. Static issue.
  • More instructions issue at same time => larger hazard penalty.
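The "2 bits for loop accuracy" point can be illustrated with a small simulation. This is a sketch under stated assumptions: a single saturating counter per branch, initialized to its strongest "taken" state, and a hypothetical loop branch that is taken 9 times and then falls through, executed 10 times in a row.

```python
from typing import List


def mispredictions(outcomes: List[bool], bits: int) -> int:
    """Count misses of one saturating counter of the given width."""
    top = (1 << bits) - 1         # strongest "taken" state
    threshold = 1 << (bits - 1)   # predict taken at or above this value
    counter = top                 # assumed initial state
    misses = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            misses += 1
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return misses


# Loop branch: taken 9 times, then not taken once; the loop runs 10 times.
loop = ([True] * 9 + [False]) * 10

print(mispredictions(loop, bits=1))  # 19: misses the exit and the re-entry
print(mispredictions(loop, bits=2))  # 10: misses only the exit
```

A 1-bit predictor flips on the loop exit and is then wrong again on the first iteration of the next run; the 2-bit counter needs two wrong outcomes in a row to change its prediction, so it pays only once per loop.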
Review: Who Cares About the Memory Hierarchy?
• Processor Only Thus Far in Course:
  • CPU cost/performance, ISA, Pipelined Execution.
• CPU-DRAM Gap:
  • 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip).
[Figure: performance (log scale, 1 to 1000) vs. year, 1980 to 2000. CPU ("Moore's Law"): µProc 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap grows 50% / year.]
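The "grows 50% / year" figure follows from the two growth rates on the chart, since the gap is the ratio of the two curves. A short check, with a hypothetical function name and the slide's 60% and 7% as defaults:

```python
def gap_after(years: int, cpu_growth: float = 0.60, dram_growth: float = 0.07) -> float:
    """Ratio of CPU to DRAM performance after `years`, starting from parity."""
    return ((1 + cpu_growth) / (1 + dram_growth)) ** years


# One year: 1.60 / 1.07 is about 1.495, i.e. the gap grows roughly 50% per year.
print(round(gap_after(1), 3))

# The gap compounds, so it widens quickly over a decade or two.
for years in (5, 10, 20):
    print(years, round(gap_after(years), 1))
```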
Processor-Memory Performance Gap "Tax"
Processor          % Area (cost)   % Transistors (power)
Alpha 21164        37%             77%
StrongArm SA110    61%             94%
Pentium Pro        64%             88%   (2 dies per package)
• Caches have no inherent value, only try to close performance gap.