Superscalar Processors (Procesadores Superescalares)
Prof. Mateo Valero
Las Palmas de Gran Canaria, November 26, 1999
Initial developments • Mechanical machines • 1854: Boolean algebra by G. Boole • 1904: Diode vacuum tube by J.A. Fleming • 1945: Stored-program concept by J. von Neumann • 1946: ENIAC by J.P. Eckert and J. Mauchly • 1949: EDSAC by M. Wilkes • 1952: UNIVAC I and IBM 701
Superscalar Processor
Pipeline: Fetch → Decode → Rename → Instruction Window → Wakeup + Select → Register File → Bypass → Data Cache
• Fetch of multiple instructions every cycle
• Renaming of registers to eliminate false dependences
• Instructions wait for source operands and for functional units
• Out-of-order execution, but in-order graduation
Technology Trends and Impact
[Chart: wakeup+select delay in psec, for issue width 4 vs. 8 and ROB size 32 vs. 64]
S. Palacharla et al., "Complexity Effective…", ISCA 1997, Denver.
Physical Scalability
[Chart: percentage of die reachable in one cycle vs. processor generation: 0.25, 0.18, 0.13, 0.10, 0.08, 0.06 microns]
Doug Matzke, "Will Physical Scalability…", IEEE Computer, Sept. 1997, pp. 37-39.
Register influence on ILP
• 8-way fetch/issue
• Window of 256 entries
• Up to 1 taken branch per cycle
• G-share predictor, 64K entries
• One-cycle latency
• Spec95
Register File Latency • 66% and 20% performance improvement when moving from a 2-cycle to a 1-cycle latency
Outline • Virtual-physical registers • A register file cache • VLIW architectures
Virtual-Physical Registers • Motivation • Conventional renaming scheme • Virtual-physical registers
[Diagram: register lifetime from Icache and Decode & Rename to Commit — the register is allocated at rename but unused until write-back, then used until commit]
Example
Latencies — cache miss: 20, fdiv: 20, fmul: 10, fadd: 5

load f2, 0(r4)      →rename→   load p1, 0(r4)
fdiv f2, f2, f10               fdiv p2, p1, p10
fmul f2, f2, f12               fmul p3, p2, p12
fadd f2, f2, 1                 fadd p4, p3, 1

• Register pressure (average allocated registers per cycle): Conventional: 3.6, Virtual-Physical: 0.7
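The renaming in the example above can be sketched with a small register alias table. This is an illustrative sketch, not the hardware design: physical registers are simply numbered upward instead of taken from a free list, and sources with no mapping yet (immediates, memory operands, registers live on entry) pass through unchanged.

```python
def rename(instructions):
    """Conventional renaming sketch: every logical destination gets a
    fresh physical register at decode; sources read the latest mapping."""
    rat = {}                                  # register alias table
    next_phys = 0
    renamed = []
    for op, dest, *srcs in instructions:
        srcs = [rat.get(s, s) for s in srcs]  # rename sources first
        next_phys += 1
        rat[dest] = "p%d" % next_phys         # fresh physical register
        renamed.append((op, rat[dest], *srcs))
    return renamed

code = [("load", "f2", "0(r4)"),
        ("fdiv", "f2", "f2", "f10"),
        ("fmul", "f2", "f2", "f12"),
        ("fadd", "f2", "f2", "1")]
# rename(code) gives the chain load p1 / fdiv p2 / fmul p3 / fadd p4,
# each instruction reading its predecessor's physical register
```

Note the point the example makes: in the conventional scheme p1–p4 are all allocated at decode, long before the load miss and the long-latency fdiv/fmul produce any value.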
Virtual-Physical Registers • Physical registers play two different roles • Keeping track of dependences (decode) • Providing a storage location for results (write-back) • Proposal: three types of registers • Logical: architected registers • Virtual-physical (VP): keep track of dependences • Physical: store values • Approach • Decode: rename from logical to VP • Write-back (or issue): rename from VP to physical
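The two-step renaming can be sketched as follows. Class and method names are illustrative, not the paper's terminology; the point is that VP tags are plentiful while physical registers are bound late.

```python
class VPRenamer:
    """Virtual-physical renaming sketch: decode hands out a VP tag so
    dependences can be tracked immediately; a real physical register is
    bound only at write-back, when the result value actually exists."""

    def __init__(self, num_phys):
        self.free = ["p%d" % i for i in range(num_phys)]
        self.next_vp = 0
        self.vp_to_phys = {}

    def decode(self):
        # VP tags are not a scarce resource: no stall at decode
        self.next_vp += 1
        return "vp%d" % self.next_vp

    def write_back(self, vp):
        # bind a physical register only now, when the value is produced
        if not self.free:
            return None          # the real scheme re-executes later
        self.vp_to_phys[vp] = self.free.pop(0)
        return self.vp_to_phys[vp]

    def commit(self, vp):
        # simplified: release this instruction's register at commit
        self.free.append(self.vp_to_phys.pop(vp))
```

With one physical register and two in-flight producers, the second producer simply waits until the first commits, instead of stalling decode.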
Virtual-Physical Registers • Hardware support
• General Map Table: logical register (Lreg) → VP register, with valid bit
• Physical Map Table: VP register → physical register (Preg)
• Instruction queue entries hold the VP tags of Src1, Src2 and the destination
• ROB entries hold the Lreg/VPreg pair until commit
[Diagram: pipeline stages Fetch → Decode → Issue → Execute → Write-back → Commit]
Virtual-Physical Registers • No free physical register at write-back: re-execute later, but if it is the oldest instruction no register will ever be freed • Avoiding deadlock • A number (NRR) of registers are reserved for the oldest instructions • 21% speedup for Spec95 on an 8-way issue processor [HPCA-4] • Conclusions • The optimal NRR is different for each program • For a given program, the best NRR may differ across sections of code
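The NRR reservation can be expressed as a simple allocation check. This is a sketch under the stated assumption that we only need a free-register count and an "is among the oldest" flag; the real design's bookkeeping is more involved.

```python
def may_allocate(free_count, nrr, is_among_oldest):
    """Deadlock-avoidance sketch: the last NRR free physical registers
    are reserved for the oldest in-flight instructions, so the head of
    the window can always obtain a register, complete, and free
    resources for everyone else."""
    if is_among_oldest:
        return free_count > 0   # reserved pool is available to it
    return free_count > nrr     # younger instructions must leave NRR free
```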
Performance evaluation
• SimpleScalar OoO with modified renaming, 8-way issue
• RUU: 128 entries
• Functional units (latency): 8 simple int (1), 4 int mult (7), 6 simple FP (4), 4 FP mult (4), 4 FP div (16), 4 memory ports
• L1 Dcache: 32 KB, 2-way, 32 B/line, 1 cycle
• L1 Icache: 32 KB, 2-way, 64 B/line, 1 cycle
• L2 cache: 1 MB, 2-way, 64 B/line, 12 cycles
• Main memory: 50 cycles
• Branch prediction: 18-bit Gshare, 2 taken branches
• Benchmarks: SPEC95, Compaq/DEC compilers, -O5
Virtual-Physical Registers
Virtual-Physical Registers • Performance evaluation
Virtual-Physical Registers • What is the optimal allocation policy? • Approximation • Registers should be allocated to the instructions that can use them earliest (avoid unused registers) • If some instruction must stall because of the lack of registers, choose the latest instructions (delaying the earliest would also delay the commit of the latest) • Implementation • Each instruction allocates a physical register at write-back. If none is available, it steals the register from the latest instruction after the current one
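The stealing step can be sketched as below. Ages are sequence numbers (smaller = older); the function name and data layout are illustrative, not the paper's implementation.

```python
def allocate_at_writeback(free, holders, age):
    """Allocate a physical register at write-back for the instruction
    with the given age.  If none is free, steal from the youngest
    current holder, provided it is younger than the requester; the
    victim will re-execute its write-back later."""
    if free:
        holders[age] = free.pop()
        return holders[age]
    youngest = max(holders)     # latest instruction holding a register
    if youngest <= age:
        return None             # nobody younger to steal from: wait
    holders[age] = holders.pop(youngest)
    return holders[age]
```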
DSY Performance SpecInt95 SpecFP95
Performance and Number of Registers SpecInt95 SpecFP95
Outline • Virtual-physical registers • A register file cache • VLIW architectures
Register File Latency • 66% and 20% performance improvement when moving from a 2-cycle to a 1-cycle latency
Register File Bypass SpecInt95
Register File Bypass SpecFP95
Register File Cache • Organization • Bank 1 (main register file, RF): all registers (128), 2-cycle latency • Bank 2 (register file cache, RFC): a subset of registers (16), 1-cycle latency
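The two-bank organization can be modeled in a few lines. Sizes and latencies follow the slide; FIFO replacement is assumed here (it is the replacement used by the first caching policy), and the class name is illustrative.

```python
class RegisterFileCache:
    """Two-level register file sketch: all registers live in the main
    bank (2-cycle reads); a small cache bank holds a subset of them
    and serves 1-cycle reads."""

    def __init__(self, cache_entries=16):
        self.cache = []              # register ids currently in the RFC
        self.entries = cache_entries

    def read_latency(self, reg):
        return 1 if reg in self.cache else 2

    def insert(self, reg):
        if reg in self.cache:
            return
        if len(self.cache) == self.entries:
            self.cache.pop(0)        # FIFO: evict the oldest entry
        self.cache.append(reg)
```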
Experimental Framework
• OoO simulator, 8-way issue/commit
• Functional units (latency): 2 simple integer (1), 3 complex integer — mult (2), div (14); 4 simple FP (2), 2 FP div (14), 3 branch (1), 4 load/store
• 128-entry ROB, 16-bit Gshare
• Icache and Dcache: 64 KB, 2-way set-associative, 1/8-cycle hit/miss
• Dcache: lock-up free, 16 outstanding misses
• Benchmarks: Spec95, DEC compiler, -O4 (int), -O5 (FP), 100 million instructions after initialization
• Access time and area models: extension of the Wilton & Jouppi models
Caching Policy (1 of 3) • First policy • Many values (85% int and 84% FP) are used at most once • Thus, only non-bypassed values are cached • FIFO replacement
Performance • 20% and 4% improvement over 2-cycle • 29% and 13% degradation over 1-cycle
Caching Policy (2 of 3) • Second policy • Cache values that are sources of any non-issued instruction with all its operands ready • Not issued because of lack of functional units, • or because the other operand is in the main register file
Performance • 24% and 5% improvement over 2-cycle • 25% and 12% degradation over 1-cycle
Caching Policy (3 of 3) • Third policy • Cache values that are sources of any non-issued instruction with all its operands ready • Prefetching • A table that, for each physical register, indicates the other operand of the first instruction that uses it • Replacement: give priority to values already read at least once
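The prefetch table can be sketched as follows. The interface is hypothetical (the slide only specifies what the table records): `rfc_insert` stands in for whatever mechanism moves a value into the register file cache.

```python
class PartnerPrefetch:
    """Third-policy sketch: for each physical register, remember the
    other source operand of the first instruction that reads it; when
    a result enters the RFC, prefetch its partner as well."""

    def __init__(self, rfc_insert):
        self.partner = {}            # phys reg -> operand read with it
        self.rfc_insert = rfc_insert # callback that caches a register

    def note_first_consumer(self, src1, src2):
        # only the FIRST consumer of each register is recorded
        self.partner.setdefault(src1, src2)
        self.partner.setdefault(src2, src1)

    def on_writeback(self, reg):
        self.rfc_insert(reg)
        if reg in self.partner:
            self.rfc_insert(self.partner[reg])  # prefetch the partner
```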
Performance • 27% and 7% improvement over 2-cycle • 24% and 11% degradation over 1-cycle
Speed for Different RFC Architectures Taking access time into account SpecInt95
Conclusions • Register file access time is critical • Virtual-physical registers significantly reduce the register pressure • 24% improvement for SpecFP95 • A register file cache can reduce the average access time • 27% and 7% improvement for a two-level, locality-based partitioning architecture
High performance instruction fetch through a software/hardware cooperation Alex Ramirez Josep Ll. Larriba-Pey Mateo Valero UPC-Barcelona
Superscalar Processor
Pipeline: Fetch → Decode → Rename → Instruction Window → Wakeup + Select → Register File → Bypass → Data Cache
• Fetch of multiple instructions every cycle
• Renaming of registers to eliminate false dependences
• Instructions wait for source operands and for functional units
• Out-of-order execution, but in-order graduation
J.E. Smith and S. Vajapeyam, "Trace Processors…", IEEE Computer, Sept. 1997, pp. 68-74.
Motivation
[Diagram: instruction fetch & decode feeds the instruction queue(s); instruction execution returns branch/jump outcomes to the fetch unit]
• Instruction fetch rate is important not only in steady state • Program start-up • Miss-speculation points • Program segments with little ILP
Motivation • Instruction fetch effectively limits the performance of superscalar processors • Even more relevant at program startup points • More aggressive processors need higher fetch bandwidth • Multiple basic block fetching becomes necessary • Current solutions need extensive additional hardware • Branch address cache • Collapsing buffer: multi-ported cache • Trace cache: special purpose cache
[Chart: PostgreSQL — 64KB I1, 64KB D1, 256KB L2]
Program Behaviour 64KB I1, 64KB D1, 256KB L2
The Fetch Unit (1 of 3)
• Scalar fetch unit • Few instructions per cycle • 1 branch
• Limitations • Prediction accuracy • I-cache miss rate
• Previous work, code reordering (software, to reduce cache misses) • Fisher (IEEE Tr. on Comp. '81) • Hwu and Chang (ISCA'89) • Pettis and Hansen (SIGPLAN'90) • Torrellas et al. (HPCA'95) • Kalamatianos et al. (HPCA'98)
[Diagram: the next fetch address feeds the branch prediction mechanism and the instruction cache (i-cache); next address logic and shift & mask deliver instructions to decode]
The Fetch Unit (2 of 3)
• Aggressive fetch unit • Many instructions per cycle • Several branches
• Limitations • Prediction accuracy • Sequentiality • I-cache miss rate
• Previous work, trace building (hardware, form traces at run time) • Yeh et al. (ICS'93) • Conte et al. (ISCA'95) • Rotenberg et al. (MICRO'96) • Friendly et al. (MICRO'97)
[Diagram: aggressive core fetch unit — multiple branch predictor, branch target buffer, return stack, i-cache, next address logic, shift & mask — next fetch address in, instructions to decode]
Trace Cache
• A trace is a sequence of logically contiguous instructions
• A trace cache line stores a segment of the dynamic instruction trace across multiple, potentially taken, branches (b1-b2-b4, b1-b3-b7, …)
• It is indexed by fetch address and branch outcomes
• History-based fetch mechanism
[Diagram: control-flow graph with basic blocks b0–b8]
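A minimal model of this indexing scheme, with trace-line building and size limits omitted (the class and its methods are an illustrative sketch, not a hardware description):

```python
class TraceCache:
    """Trace cache sketch: lines are indexed by the fetch address plus
    the outcomes of the branches inside the trace, so the same start
    address can hold several traces (e.g. b1-b2-b4 and b1-b3-b7)."""

    def __init__(self):
        self.lines = {}

    def fill(self, fetch_addr, outcomes, trace):
        self.lines[(fetch_addr, tuple(outcomes))] = trace

    def lookup(self, fetch_addr, predicted):
        # hit only if the multiple-branch prediction matches a stored trace
        return self.lines.get((fetch_addr, tuple(predicted)))
```

On a hit, the whole multi-block trace is delivered in one cycle; on a miss, the core fetch unit supplies instructions while the fill buffer builds a new line.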
The Fetch Unit (3 of 3)
• The trace cache (t-cache) aims at forming traces at run time
• A fill buffer builds trace lines from fetch or commit
[Diagram: aggressive core fetch unit — multiple branch predictor, branch target buffer, return stack, i-cache, next address logic, shift & mask — extended with a trace cache and fill buffer; next fetch address in, instructions to decode]
Our Contribution • Mixed software-hardware approach • Optimize performance at compile-time • Use profiling information • Make optimum use of the available hardware • Avoid redundant work at run-time • Do not repeat what was done at compile-time • Adapt hardware to the new software • Software Trace Cache • Profile-directed code reordering & mapping • Selective Trace Storage • Fill Unit modification