Lecture 24: Instruction Level Parallelism

Lecture 24: Instruction Level Parallelism. Computer Engineering 585 Fall 2001. Limits to Multi-Issue Machines. Inherent limitations of ILP: 1 branch in 5 : How to keep a 5-way VLIW busy? Latencies of units : many operations must be scheduled.


Presentation Transcript


  1. Lecture 24: Instruction Level Parallelism Computer Engineering 585 Fall 2001

  2. Limits to Multi-Issue Machines
     • Inherent limitations of ILP:
        • 1 branch in 5: how to keep a 5-way VLIW busy?
        • Latencies of units: many operations must be scheduled.
        • Need about (Pipeline Depth x No. of Function Units) independent operations to keep machines busy, e.g. 5 x 4 = 20, so roughly 15-20 independent instructions?
     • Difficulties in building HW:
        • Easy: more instruction bandwidth.
        • Easy: duplicate FUs to get parallel execution.
        • Hard: increase ports to the register file (bandwidth).
           • The VLIW example needs 7 read and 3 write ports for the integer register file and 5 read and 3 write ports for the FP register file.
        • Harder: increase ports to memory (bandwidth).
        • Decoding in a superscalar and its impact on clock rate and pipeline depth?
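The rule of thumb on slide 2 is simple arithmetic, and a short sketch makes it easy to try other machine shapes. The function name `independent_ops_needed` is a hypothetical helper introduced here for illustration; it is not from the lecture.

```python
def independent_ops_needed(pipeline_depth: int, function_units: int) -> int:
    """Independent operations that must be in flight to keep every stage
    of every function unit busy (the slide's depth x units rule of thumb)."""
    return pipeline_depth * function_units


# The slide's example: pipeline depth 5 with 4 function units.
print(independent_ops_needed(5, 4))
```

With one branch in every five instructions, a compiler has to find most of those operations across branch boundaries, which is why the slide asks how a 5-way VLIW can be kept busy.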

  3. Limits to Multi-Issue Machines
     • Limitations specific to either the Superscalar or the VLIW implementation:
        • Decode issue in Superscalar: how wide is practical?
        • VLIW code size: unrolled loops + wasted fields in the VLIW.
           • IA-64 compresses dependent instructions, but code is still larger.
        • VLIW lock step => 1 hazard & all instructions stall.
           • IA-64 not lock step? Dynamic pipeline?
        • VLIW & binary compatibility is a practical weakness, because the number of FUs and their latencies vary over time.
           • IA-64 provides binary compatibility.

  4. Limits to ILP
     • Conflicting studies of the amount of parallelism available in the late 1980s and early 1990s. Different assumptions about:
        • Benchmarks (vectorized Fortran FP vs. integer C programs).
        • Hardware sophistication.
        • Compiler sophistication.
     • How much ILP is available using existing mechanisms with increasing HW budgets?
     • Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?

  5. Limits to ILP
     Initial HW model here; MIPS compilers. Assumptions for the ideal/perfect machine to start:
        1. Register renaming: infinite virtual registers, so all WAW & WAR hazards are avoided.
        2. Branch prediction: perfect; no mispredictions.
        3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available.
        4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided the addresses are not equal.
     1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle.
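Under the ideal-machine assumptions on slide 5, only true (read-after-write) dependences limit issue, so the achievable IPC of a trace is its length divided by the length of its longest dependence chain. The sketch below illustrates that idea on a toy trace. The trace format, the register names, and the function name `ideal_ipc` are assumptions made for this example; memory dependences are left out, which corresponds to assuming that no two memory addresses in the trace conflict.

```python
def ideal_ipc(trace):
    """IPC of an ideal machine: perfect renaming and prediction,
    1-cycle latency, unlimited issue width and window.

    trace: list of (dest, srcs) tuples in program order, where dest is a
    register name (or None) and srcs is a list of register names.
    """
    if not trace:
        return 0.0
    ready = {}       # register -> first cycle in which its latest value can be read
    total_cycles = 0
    for dest, srcs in trace:
        # Issue in the first cycle in which every source value is available.
        cycle = max((ready.get(s, 0) for s in srcs), default=0)
        if dest is not None:
            # Overwriting the entry models renaming: later readers see the
            # newest writer, and WAW/WAR never delay this instruction.
            ready[dest] = cycle + 1
        total_cycles = max(total_cycles, cycle + 1)
    return len(trace) / total_cycles


example = [
    ("r1", []),            # issues in cycle 0
    ("r2", []),            # issues in cycle 0
    ("r3", ["r1", "r2"]),  # waits for r1 and r2, issues in cycle 1
    ("r1", []),            # reuses r1; renaming removes the WAW/WAR hazards
    ("r4", ["r1", "r3"]),  # waits for r3, issues in cycle 2
]
print(ideal_ipc(example))
```

The later slides then remove one ideal assumption at a time (branch prediction, renaming registers, alias analysis, window size) and report how far the IPC falls.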

  6. Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)
     • FP: 75-150 IPC
     • Integer: 18-60 IPC

  7. More Realistic HW: Branch Impact (Figure 4.40, page 323)
     • Change from an infinite window to a 2000-instruction window and a maximum issue of 64 instructions per clock cycle.
     • FP: 15-45 IPC
     • Integer: 6-12 IPC
     • Predictors compared: Perfect; Pick Cor. or BHT; BHT (512); Profile; No prediction.

  8. Selective History Predictor
     [Figure: block diagram. Labels: non-correlator, 8096 x 2 bits, indexed by branch address; correlator, 2048 x 4 x 2 bits, selected by 2 bits of global history; selector, 8K x 2 bits, choosing the correlator or the non-correlator; output Taken/Not Taken.]
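The structure in the slide 8 diagram can be sketched as three tables of 2-bit saturating counters. This is a minimal illustration under stated assumptions, not the exact hardware from the figure: the class and method names are hypothetical, tables are indexed by a simple modulo of the branch address, the selector is trained only when the two component predictors disagree, and the default table sizes use powers of two (8192 rather than the 8096 printed on the slide).

```python
def bump(counter, up):
    """Move a 2-bit saturating counter (0..3) one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class SelectivePredictor:
    def __init__(self, local_entries=8192, corr_entries=2048, selector_entries=8192):
        self.local = [1] * local_entries                    # non-correlator; >= 2 means taken
        self.corr = [[1] * 4 for _ in range(corr_entries)]  # 4 counters per entry, picked by history
        self.selector = [1] * selector_entries              # >= 2 means trust the correlator
        self.history = 0                                    # outcomes of the last two branches

    def _components(self, addr):
        local_pred = self.local[addr % len(self.local)] >= 2
        corr_pred = self.corr[addr % len(self.corr)][self.history] >= 2
        return local_pred, corr_pred

    def predict(self, addr):
        local_pred, corr_pred = self._components(addr)
        use_corr = self.selector[addr % len(self.selector)] >= 2
        return corr_pred if use_corr else local_pred

    def update(self, addr, taken):
        local_pred, corr_pred = self._components(addr)
        if local_pred != corr_pred:
            # Only one component was right: move the selector toward it.
            i = addr % len(self.selector)
            self.selector[i] = bump(self.selector[i], corr_pred == taken)
        i = addr % len(self.local)
        self.local[i] = bump(self.local[i], taken)
        row = self.corr[addr % len(self.corr)]
        row[self.history] = bump(row[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & 0b11
```

A caller would invoke `predict(addr)` when a branch is fetched and `update(addr, taken)` once it resolves. On a branch that strictly alternates taken and not taken, the address-indexed counter in this sketch keeps mispredicting, while the history-indexed counters learn the pattern and the selector switches over to them.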

  9. More Realistic HW: Register Impact (Figure 4.44, page 328)
     • Change: 2000 instr window, 64 instr issue, 8K 2-level prediction.
     • FP: 11-45 IPC
     • Integer: 5-15 IPC
     • Renaming registers compared: Infinite, 256, 128, 64, 32, None.

  10. More Realistic HW: Alias Impact (Figure 4.46, page 330)
     • Change: 2000 instr window, 64 instr issue, 8K 2-level prediction, 256 renaming registers.
     • Integer: 4-9 IPC
     • FP: 4-45 IPC (Fortran, no heap)
     • Alias analysis compared: Perfect; Global/Stack perfect, heap conflicts; Inspec. Assem.; None.

  11. Realistic HW for '9X: Window Impact (Figure 4.48, page 332)
     • Perfect disambiguation (HW), 1K selective prediction, 16-entry return, 64 registers, issue as many as window.
     • FP: 8-45 IPC
     • Integer: 6-12 IPC
     • Window sizes compared: Infinite, 256, 128, 64, 32, 16, 8, 4.
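The effect of window size can be illustrated with a toy cycle-by-cycle model. This is a sketch under simplifying assumptions, not the methodology behind the figure: the window is taken to be the `window` oldest not-yet-issued instructions, every instruction has 1-cycle latency, anything in the window whose operands are ready issues ("issue as many as window"), and branches and memory are ignored. The trace format and the name `windowed_ipc` are hypothetical.

```python
def windowed_ipc(trace, window):
    """IPC of a toy machine that can only look at `window` instructions.

    trace: list of (dest, srcs) tuples in program order (register names).
    window: number of oldest unissued instructions examined each cycle (>= 1).
    """
    n = len(trace)
    if n == 0:
        return 0.0
    # For each instruction, find the earlier instructions that produce its sources.
    last_writer = {}
    producers = []
    for i, (dest, srcs) in enumerate(trace):
        producers.append([last_writer[s] for s in srcs if s in last_writer])
        if dest is not None:
            last_writer[dest] = i
    ready_at = [None] * n   # cycle in which each result becomes readable
    issued = 0
    cycle = 0
    while issued < n:
        candidates = [i for i in range(n) if ready_at[i] is None][:window]
        # Decide what issues using only results available at the start of the cycle.
        issue_now = [
            i for i in candidates
            if all(ready_at[p] is not None and ready_at[p] <= cycle for p in producers[i])
        ]
        for i in issue_now:
            ready_at[i] = cycle + 1
        issued += len(issue_now)
        cycle += 1
    return n / cycle


# A 4-instruction dependent chain followed by 4 independent instructions.
chain_then_parallel = [
    ("a", []), ("a", ["a"]), ("a", ["a"]), ("a", ["a"]),
    ("b", []), ("c", []), ("d", []), ("e", []),
]
for w in (2, 8):
    print(w, windowed_ipc(chain_then_parallel, w))
```

With a small window the independent instructions at the end are not visible while the chain is executing, so they cannot overlap with it; a larger window exposes them earlier and the IPC rises.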

  12. 3 1996 Era Machines

                     Alpha 21164      PPro              HP PA-8000
      Year           1995             1995              1996
      Clock          400 MHz          200 MHz           180 MHz
      Cache          8K/8K/96K/2M     8K/8K/0.5M        0/0/2M
      Issue rate     2int+2FP         3 instr (x86)     4 instr
      Pipe stages    7-9              12-14             7-9
      Out-of-Order   6 loads          40 instr (µop)    56 instr
      Rename regs    none             40                56

  13. SPECint95base Performance (July 1996)

  14. SPECfp95base Performance (July 1996)

  15. 3 1997 Era Machines

                     Alpha 21164      Pentium II        HP PA-8000
      Year           1995             1996              1996
      Clock          600 MHz ('97)    300 MHz ('97)     236 MHz ('97)
      Cache          8K/8K/96K/2M     16K/16K/0.5M      0/0/4M
      Issue          2int+2FP         3 instr (x86)     4 instr
      Pipe stages    7-9              12-14             7-9
      Out-of-Order   6 loads          40 instr (µop)    56 instr
      Rename         none             40                56

  16. 3 2000-1 Era Machines

                     Alpha 21364      Power4              Pentium 4
      Year           2000             2001                2000-1
      Clock          1 GHz ('01?)     >1 GHz (2001)       2 GHz (2001)
      Cache          64K/64K/1.75M    32K/64K/1.5M/32M    12K microops trace cache/8K (D)/256K
      Issue          2int+2FP         8 inst.             6 inst.
      Pipe stages    7-9              15-20               20+
      Out-of-Order   6 loads          200 inst.           126 inst.
      Rename         none             >200                128

  17. SPECint95base Performance (Oct. 1997)

  18. SPECfp95base Performance (Oct. 1997)

  19. Summary
     • Branch Prediction:
        • Branch History Table: 2 bits for loop accuracy.
        • Recently executed branches correlated with next branch?
        • Branch Target Buffer: include branch address & prediction.
        • Predicated execution can reduce the number of branches and the number of mispredicted branches.
     • Speculation: out-of-order execution, in-order commit (reorder buffer).
     • SW Pipelining:
        • Symbolic loop unrolling to get the most from the pipeline with little code expansion and little overhead.
     • Superscalar and VLIW: CPI < 1 (IPC > 1)
        • Dynamic issue vs. static issue.
        • More instructions issued at the same time => larger hazard penalty.
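The summary's "2 bits for loop accuracy" can be checked with a small simulation. The sketch below compares a 1-bit and a 2-bit saturating counter on a loop branch that is taken nine times and then falls through, repeated for many executions of the loop. The function name, the initial counter state (strongly taken), and the loop shape are assumptions chosen for the illustration.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions of a single saturating counter with `bits` bits."""
    top = (1 << bits) - 1          # highest counter value
    threshold = 1 << (bits - 1)    # predict taken at or above this value
    counter = top                  # assumed start state: strongly taken
    wrong = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            wrong += 1
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return wrong


# A loop branch: taken 9 times, then not taken once; the loop runs 100 times.
loop_branch = ([True] * 9 + [False]) * 100
print("1-bit:", mispredictions(loop_branch, 1))
print("2-bit:", mispredictions(loop_branch, 2))
```

The 1-bit scheme mispredicts on the loop exit and again on the first iteration of the next execution, because the single bit was flipped by the exit. The 2-bit counter only drops from strongly taken to weakly taken at the exit, so it still predicts taken on re-entry and misses once per loop execution.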

  20. Review: Who Cares About the Memory Hierarchy?
     • Processor only thus far in course: CPU cost/performance, ISA, pipelined execution.
     • CPU-DRAM gap: 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip).
     [Figure: performance (log scale, 1 to 1000) vs. year, 1980 to 2000. CPU ("Moore's Law"): µProc 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap: grows 50%/year.]
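The "grows 50% / year" figure on slide 20 follows from the two growth rates quoted there: the gap is the ratio of processor performance to DRAM performance, so it grows by the ratio of the two yearly factors. A minimal sketch of that arithmetic, with variable and function names chosen for this example:

```python
CPU_GROWTH = 1.60    # processor performance: +60% per year (from the slide)
DRAM_GROWTH = 1.07   # DRAM performance: +7% per year (from the slide)

# Yearly growth of the processor/DRAM performance ratio.
gap_growth_per_year = CPU_GROWTH / DRAM_GROWTH - 1


def gap_after_years(years):
    """Relative processor/DRAM gap after `years`, normalized to 1.0 at year 0,
    assuming both rates stay constant."""
    return (CPU_GROWTH / DRAM_GROWTH) ** years


print(f"gap grows about {gap_growth_per_year:.1%} per year")
```

The result is just under 50%, which matches the rounded number on the slide.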

  21. Processor-Memory Performance Gap “Tax”

      Processor          % Area (≈ cost)    % Transistors (≈ power)
      Alpha 21164        37%                77%
      StrongArm SA110    61%                94%
      Pentium Pro        64%                88%    (2 dies per package)

     • Caches have no inherent value; they only try to close the performance gap.