1. Evolution of ILP-processing Dezső Sima Fall 2006 D. Sima, 2006
Structure 1. Paradigms of ILP-processing 2. Introduction of temporal parallelism 3. Introduction of issue parallelism 3.1. VLIW processing 3.2. Superscalar processing 4. Introduction of data parallelism 5. The main road of evolution 6. Outlook
1. Paradigms of ILP-processing 1.1. Introduction (1) Figure 1.1: Evolution of computer classes, 1950-2000 (supercomputers: ENIAC, NORC, CDC-6600, Cray-1, Cray-2, Cray-3, Cray-4, Cray T3E; mainframes: UNIVAC, /360, /370, /390, z/900; minicomputers: PDP-8, PDP-11, VAX; servers/workstations: RS/6000, Xeon; desktop PCs: 4004, 8080, 8088, Altair, 80286, 80386, 80486, Pentium, PPro, PII, PIII, P4; value PCs: Celeron)
1.1. Introduction (2) Figure 1.2: The integer performance of Intel's x86 line of processors
1.2. Paradigms of ILP-processing (1) Paradigms of ILP-processing Issue parallelism Temporal parallelism Static dependency resolution Pipeline processors VLIW processors
VLIW processing: the processor fetches and executes independent instructions in parallel (static dependency resolution). VLIW: Very Long Instruction Word
1.2. Paradigms of ILP processing (1) Paradigms of ILP processing Issue parallelism Temporal parallelism Static dependency resolution Dynamic dependency resolution Pipeline processors VLIW processors Superscalar processors
Superscalar processing vs. VLIW processing: VLIW processing fetches and executes independent instructions (static dependency resolution), whereas superscalar processing accepts dependent instructions and resolves dependencies dynamically in the processor. VLIW: Very Long Instruction Word
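The static side of this contrast can be sketched in a few lines of code. The sketch below is illustrative only (the instruction format and register numbers are hypothetical, not from the slides): a compiler-side pass packs consecutive mutually independent instructions into fixed-width VLIW bundles and pads the rest with nops.

```python
# Sketch: static (compiler-side) dependency resolution as in VLIW processing.
# An instruction is (dest_register, source_registers); names are illustrative.

def independent(a, b):
    """True if b neither reads a's result nor conflicts on destinations/sources."""
    da, sa = a
    db, sb = b
    return da not in sb and db not in sa and da != db

def vliw_bundles(program, width):
    """Pack consecutive mutually independent instructions into fixed-width
    bundles; unused slots are padded with None (nops)."""
    bundles, current = [], []
    for instr in program:
        if len(current) < width and all(independent(x, instr) for x in current):
            current.append(instr)
        else:
            bundles.append(current + [None] * (width - len(current)))
            current = [instr]
    if current:
        bundles.append(current + [None] * (width - len(current)))
    return bundles

# r3 = f(r1, r2); r4 = f(r1, r2); r5 = f(r3, r4) -- the third depends on the first two
prog = [(3, (1, 2)), (4, (1, 2)), (5, (3, 4))]
print(vliw_bundles(prog, width=2))
```

A superscalar processor performs the equivalent grouping at run time in hardware; the dependent instruction simply waits until its operands are produced.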
1.2. Paradigms of ILP processing (1) Paradigms of ILP processing Issue parallelism Data parallelism Temporal parallelism Static dependency resolution Dynamic dependency resolution SIMD extension Pipeline processors VLIW processors Superscalar processors
1.2. Paradigms of ILP-processing (2) Issue parallelism Data parallelism Static dependency resolution Sequential processing Temporal parallelism VLIW processors EPIC processors Dynamic dependency resolution Pipeline processors Superscalar processors Superscalar proc.s with SIMD extension ~'85 ~'90 ~'95-'00 Figure 1.3: The emergence of ILP-paradigms and processor types
1.3. Performance potential of ILP-processors (1) [Figure: absolute performance, ideal case vs. real case, of sequential, pipeline, VLIW/superscalar and SIMD-extension processing]
1.3. Performance potential of ILP-processors (2) Performance components of ILP-processors: P = fc * η, with: fc: clock frequency, depends on technology/μarchitecture; η: per-cycle efficiency (temporal parallelism, issue parallelism, data parallelism, efficiency of speculative execution), depends on ISA, μarchitecture, system architecture, OS, compiler, application
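One simple way to make this decomposition concrete is a multiplicative model (an assumption for illustration; the slides name the components but not their exact combination, and the numbers below are invented):

```python
# Sketch (illustrative model and numbers, not measurements): absolute
# performance as clock frequency times per-cycle efficiency, where the
# efficiency is built from the parallelism components named above.

def ilp_performance(f_clock_hz, temporal, issue, data, spec_eff):
    """P = fc * eta, with eta modelled as the product of the components."""
    eta = temporal * issue * data * spec_eff
    return f_clock_hz * eta

# e.g. a fully pipelined 2-wide core, no SIMD, 90% useful speculation:
print(ilp_performance(1e9, temporal=1.0, issue=2.0, data=1.0, spec_eff=0.9))
```

The point of the decomposition is that fc is set by technology and μarchitecture, while the per-cycle factors are set by the ILP paradigms the rest of the lecture introduces.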
2. Introduction of temporal parallelism 2.1. Introduction (1) Types of temporal parallelism in ILP processors (F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle): pipeline processors overlap all phases of subsequent instructions. Examples: Atlas (1963), IBM 360/91 (1967), i80386 (1985), R2000 (1988), M68030 (1988). Figure 2.1: Implementation alternatives of temporal parallelism
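The gain from overlapping all phases can be quantified with the standard cycle-count argument (ideal case: no stalls, one instruction entering the pipeline per cycle):

```python
# Sketch: cycle counts for n instructions on a k-stage pipeline (F, D, E, W)
# versus purely sequential execution, assuming no stalls.

def sequential_cycles(n, stages=4):
    return n * stages          # each instruction runs F, D, E, W to completion

def pipelined_cycles(n, stages=4):
    return stages + (n - 1)    # fill the pipe once, then one result per cycle

n = 100
print(sequential_cycles(n), pipelined_cycles(n))  # 400 vs 103
```

For large n the speedup approaches the number of stages, which is why temporal parallelism was the first ILP paradigm to be exploited.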
2.1. Introduction (2) Figure 2.2: The appearance of pipeline (scalar) processors, 1980-1992 (x86: 80286, 80386, 80486; M68000 line: 68020, 68030, 68040; MIPS R line: R2000, R3000, R6000, R4000)
2.2. Processing bottlenecks evoked and their resolution 2.2.1. Overview: the scarcity of memory bandwidth (2.2.2); the problem of branch processing (2.2.3)
2.2.2. The scarcity of memory bandwidth (1) Sequential processing → pipeline processing: more instructions and data need to be fetched per cycle, so a larger memory bandwidth is required
2.2.2. The scarcity of memory bandwidth (2) Figure 2.3: Introduction of caches in pipeline (scalar) processors, 1980-1992 (x86: 80286, 80386 C(8), 80486 C(8); M68000 line: 68020 C(0,1/4), 68030 C(1/4,1/4), 68040 C(4,4); MIPS R line: R2000, R3000, R6000 C(16), R4000 C(8,8)). Legend: C(n): universal cache (size in kB); C(n/m): instruction/data cache (sizes in kB)
2.2.3. The problem of branch processing (1) (E.g. in case of conditional branches.) bc: conditional branch; bti: branch target instruction. The branch is recognized during decode, but condition checking and branch address calculation take further cycles, so the sequentially fetched instructions ii+1, ii+2 behind the branch are wasted and the bti is only fetched later. Figure 2.4: Processing of a conditional branch on a 4-stage pipeline
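The cost of this effect can be sketched with a simple penalty model (the stage numbers and frequencies below are illustrative assumptions, not figures from the slide):

```python
# Sketch: branch penalty on a pipeline when the target address and condition
# are only known once the branch reaches a given stage.

def branch_penalty(resolve_stage):
    """Bubbles injected = stages between fetch (stage 1) and resolution."""
    return resolve_stage - 1

def cpi_with_branches(base_cpi, branch_freq, taken_frac, penalty):
    """Average CPI once wasted branch slots are accounted for."""
    return base_cpi + branch_freq * taken_frac * penalty

# Condition checked in the execute stage (stage 3 of F, D, E, W):
print(branch_penalty(3))
# With ~20% branches, assuming 60% taken (invented number):
print(cpi_with_branches(1.0, 0.20, 0.6, branch_penalty(3)))
```

Even a 2-cycle penalty noticeably inflates the CPI, which is what motivates branch prediction on the next slide.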
2.2.3. The problem of branch processing (2) [Figure: basic blocks separated by conditional branches; guessed path vs. approved path] Figure 2.5: Principle of branch prediction in case of a conditional branch
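The guessed-path/approved-path principle can be demonstrated with a small simulation. The 2-bit saturating counter used here is a standard prediction scheme, chosen for illustration; the slide states the principle only and does not prescribe this mechanism:

```python
# Sketch of branch prediction: a 2-bit saturating counter guesses the path
# of a conditional branch; the approved (actual) path corrects it afterwards.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2            # 0,1: predict not-taken; 2,3: predict taken

    def predict(self):
        return self.state >= 2    # the guessed path

    def update(self, taken):      # feedback from the approved path
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
hits = 0
outcomes = [True] * 8 + [False] + [True] * 8   # a loop branch with one exit
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "/", len(outcomes))
```

The single loop exit costs only one misprediction because the saturating counter does not flip its guess on a lone contrary outcome.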
2.2.3. The problem of branch processing (3) Figure 2.6: Introduction of branch prediction in (scalar) pipeline processors, 1980-1992 (same processor timeline as Figure 2.3, with processors featuring speculative execution of branches marked)
2.3. Generations of pipeline processors (1)
1. generation pipelined: no cache, no speculative branch processing
1.5. generation pipelined: cache, no speculative branch processing
2. generation pipelined: cache, speculative branch processing
2.3. Generations of pipeline processors (2) Figure 2.7: Generations of pipeline processors, 1980-1992 (1. generation: no cache, no speculative branch processing; 1.5. generation: cache, no speculative branch processing; 2. generation: cache, speculative branch processing)
2.4. Exhausting the available temporal parallelism 2. generation pipeline processors already exhaust the available temporal parallelism
3. Introduction of issue parallelism 3.1. Options to implement issue parallelism Pipeline processing → VLIW (EPIC) instruction issue: static dependency resolution (3.2); superscalar instruction issue: dynamic dependency resolution (3.3)
3.2. VLIW processing (1) Memory/cache → VLIW processor (~10-30 EUs): VLIW instructions with independent sub-instructions (static dependency resolution) feed the execution units in parallel. Figure 3.1: Principle of VLIW processing
3.2. VLIW processing (2) VLIW: Very Long Instruction Word. Term: 1983 (Fisher). Length of sub-instructions: ~32 bit. Instruction length: ~n*32 bit (n: number of execution units (EUs)). Static dependency resolution with parallel optimization → complex VLIW compiler
The term ‘VLIW’ 3.2. VLIW processing (3) Figure 3.2: Experimental and commercially available VLIW processors Source: Sima et al., ACA, Addison-Wesley, 1997
3.2. VLIW processing (4) Benefits of static dependency resolution: Less complex processors Earlier appearance Either higher fc or larger ILP
3.2. VLIW processing (5) Drawbacks of static dependency resolution: Completely new ISA New compilers, OS Rewriting of applications Achieving the critical mass to convince the market The compiler uses technology dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization New proc. models require new compiler versions
3.2. VLIW processing (6) Drawbacks of static dependency resolution (cont.): VLIW instructions are only partially filled → poorly utilized memory space and bandwidth
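The waste from partially filled instructions is easy to quantify (the fill counts below are invented for illustration):

```python
# Sketch (illustrative numbers): code-memory waste from partially filled
# VLIW bundles, since nop slots occupy full instruction-word space.

def vliw_code_size(bundle_fills, width, slot_bits=32):
    """bundle_fills: number of real sub-instructions in each bundle.
    Returns (useful bits, stored bits)."""
    useful = sum(bundle_fills) * slot_bits
    stored = len(bundle_fills) * width * slot_bits   # nops take full slots
    return useful, stored

useful, stored = vliw_code_size([3, 1, 2, 1], width=4)
print(useful, stored, useful / stored)
```

Here fewer than half of the fetched bits encode real work, so both memory space and fetch bandwidth are wasted proportionally.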
3.2. VLIW processing (7) Commercial VLIW processors: Trace (1987, Multiflow), Cydra-5 (1989, Cydrome). Within a few years both firms went bankrupt; their developers moved to HP and IBM and became initiators/developers of EPIC processors
3.2. VLIW processing (8) VLIW EPIC Integration of SIMD instructions and advanced superscalar features 1994: Intel, HP announced the cooperation 1997: The EPIC term was born 2001: IA-64 Itanium
3.3. Superscalar processing 3.3.1. Introduction (1) Pipeline processing Superscalar instruction issue Main attributes of superscalar processing: Dynamic dependency resolution Compatible ISA
3.3.1. Introduction (2) Figure 3.3: Experimental superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.1. Introduction (3) Figure 3.4: Emergence of superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.2. Attributes of first generation superscalars (1)
Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"
Core: static branch prediction
Cache: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus
Examples: Alpha 21064, PA 7100, Pentium
3.3.2. Attributes of first generation superscalars (2) Consistency of processor features (1) Dynamic instruction frequencies in gen. purpose applications: FX instructions ~40 %, load instructions ~30 %, store instructions ~10 %, branches ~20 %, FP instructions ~1-5 %. Available parallelism in gen. purpose applications assuming direct issue: ~2 instructions/cycle (Wall 1989; Lam, Wilson 1992) Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.2. Attributes of first generation superscalars (3) Consistency of processor features (2)
Reasonable core width: 2-3 instructions/cycle
Required number of data cache ports (np): np ~ 0.4 * (2-3) = 0.8-1.2 accesses/cycle → single-port data caches
Required EUs (each L/S instruction generates an address calculation as well):
FX ~ 0.8 * (2-3) = 1.6-2.4 → 2-3 FX EUs
L/S ~ 0.4 * (2-3) = 0.8-1.2 → 1 L/S EU
Branch ~ 0.2 * (2-3) = 0.4-0.6 → 1 B EU
FP ~ (0.01-0.05) * (2-3) → 1 FP EU
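The sizing arithmetic above generalizes to any core width; a small sketch reproducing it (using the instruction mix from the previous slide):

```python
# Sketch reproducing the slide's sizing arithmetic: scale dynamic instruction
# frequencies by the intended core width to size cache ports and EUs.

MIX = {"FX": 0.40, "load": 0.30, "store": 0.10, "branch": 0.20}

def required_throughput(width):
    # each load/store also needs an FX-type address calculation
    ls = MIX["load"] + MIX["store"]
    return {
        "data_cache_ports": ls * width,
        "FX_EUs": (MIX["FX"] + ls) * width,   # 0.8 * width
        "LS_EUs": ls * width,
        "B_EUs": MIX["branch"] * width,
    }

for name, value in required_throughput(3).items():
    print(name, round(value, 2))
```

With width 3 this yields 1.2 cache accesses/cycle and 2.4 FX operations/cycle, matching the 2-3 FX EUs and single cache port of first generation designs.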
3.3.3. The bottleneck evoked and its resolution (1) The issue bottleneck (a): Simplified structure of the microarchitecture assuming direct issue (b): The issue process Figure 3.5: The principle of direct issue
3.3.3. The bottleneck evoked and its resolution (2) Eliminating the issue bottleneck Figure 3.6: Principle of the buffered (out of order) issue
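The difference between direct and buffered issue can be sketched for one issue cycle (the register names and instruction window below are hypothetical, for illustration only):

```python
# Sketch: direct issue stalls the whole issue group at the first instruction
# whose operands are not ready; buffered (out-of-order) issue lets every
# ready instruction in the window go, eliminating the issue bottleneck.

def direct_issue(window, ready):
    """In-order: stop at the first non-ready instruction."""
    issued = []
    for instr in window:
        if all(reg in ready for reg in instr["srcs"]):
            issued.append(instr["name"])
        else:
            break                      # blocks everything behind it
    return issued

def buffered_issue(window, ready):
    """Out-of-order: any instruction with ready operands may issue."""
    return [i["name"] for i in window if all(r in ready for r in i["srcs"])]

window = [
    {"name": "i1", "srcs": ["r1"]},
    {"name": "i2", "srcs": ["r9"]},    # r9 not yet produced
    {"name": "i3", "srcs": ["r2"]},
]
ready = {"r1", "r2"}
print(direct_issue(window, ready))     # i2 blocks i3
print(buffered_issue(window, ready))   # i3 bypasses the stalled i2
```

Real processors implement the buffer as reservation stations or a central instruction window; the principle is the same.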
3.3.3. The bottleneck evoked and its resolution (3) First generation (narrow) superscalars Second generation (wide) superscalars Elimination of the issue bottleneck and in addition widening the processing width of all subsystems of the core
3.3.4. Attributes of second generation superscalars (1)
First generation "narrow" superscalars → second generation "wide" superscalars:
Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle → 4 RISC instructions/cycle or 3 CISC instructions/cycle
Core: static branch prediction → buffered (ooo) issue, predecoding, dynamic branch prediction, register renaming, ROB
Caches: single-ported, blocking L1 data caches, off-chip L2 caches attached via the processor bus → dual-ported, non-blocking L1 data caches, directly attached off-chip L2 caches
Examples: Alpha 21064 → Alpha 21264; PA 7100 → PA 8000; Pentium → Pentium Pro, K6
3.3.4. Attributes of second generation superscalars (2) Consistency of processor features (1) Dynamic instruction frequencies in gen. purpose applications: FX instructions ~40 %, load instructions ~30 %, store instructions ~10 %, branches ~20 %, FP instructions ~1-5 %. Available parallelism in gen. purpose applications assuming buffered issue: ~4-6 instructions/cycle (Wall 1990) Source: Sima et al., ACA, Addison-Wesley, 1997
Figure 3.7: Extent of parallelism available in general purpose applications assuming buffered issue Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990
3.3.4. Attributes of second generation superscalars (3) Consistency of processor features (2)
Reasonable core width: 4-5 instructions/cycle
Required number of data cache ports (np): np ~ 0.4 * (4-5) = 1.6-2 accesses/cycle → dual-port data caches
Required EUs (each L/S instruction generates an address calculation as well):
FX ~ 0.8 * (4-5) = 3.2-4 → 3-4 FX EUs
L/S ~ 0.4 * (4-5) = 1.6-2 → 2 L/S EUs
Branch ~ 0.2 * (4-5) = 0.8-1 → 1 B EU
FP ~ (0.01-0.05) * (4-5) → 1 FP EU
3.3.5. Exhausting the issue parallelism In general purpose applications 2. generation („wide”) superscalars already exhaust the parallelism available at the instruction level
4. Introduction of data parallelism 4.1. Overview (1) Figure 4.1: Implementation alternatives of data parallelism
4.1. Overview (2) Superscalar extension / EPIC extension: SIMD instructions (FX/FP) perform multiple operations within a single instruction, combined with superscalar issue. Figure 4.2: Principle of introducing SIMD instructions in superscalar and VLIW (EPIC) processors
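The "multiple operations within a single instruction" idea can be emulated in a few lines (an illustrative emulation of a packed FX add on 4 x 16-bit lanes, not any specific ISA's instruction):

```python
# Sketch of the SIMD principle: one instruction applies the same operation
# to several packed fixed-point operands at once, here emulated as four
# 16-bit lanes inside one 64-bit word.

def simd_add16(a_lanes, b_lanes):
    """Lane-wise add with 16-bit wraparound, as a packed FX add would do."""
    return [(a + b) & 0xFFFF for a, b in zip(a_lanes, b_lanes)]

print(simd_add16([1, 2, 3, 0xFFFF], [10, 20, 30, 1]))
```

Real FX-SIMD extensions (e.g. MMX) additionally offer saturating variants, where the last lane would clamp at 0xFFFF instead of wrapping to 0.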
4.2. The appearance of SIMD instructions in superscalars (1) Intel's and AMD's ISA extensions (MMX, SSE, SSE2, SSE3, 3DNow!, 3DNow! Professional) Figure 4.3: The emergence of FX-SIMD and FP-SIMD instructions in superscalars
4.3. 2.5. and 3. generation superscalars (1) 2.5. generation superscalars: second generation superscalars + FX SIMD (MM). 3. generation superscalars: FX SIMD + FP SIMD (MM+3D)