Superscalar Processors (Procesadores Superescalares)
Prof. Mateo Valero
Las Palmas de Gran Canaria, November 26, 1999
Initial developments • Mechanical machines • 1854: Boolean algebra by G. Boole • 1904: Diode vacuum tube by J.A. Fleming • 1945: Stored-program concept by J. von Neumann • 1946: ENIAC by J.P. Eckert and J. Mauchly • 1949: EDSAC by M. Wilkes • 1952: UNIVAC I and IBM 701
Superscalar Processor
Pipeline: Fetch → Decode → Rename → Instruction Window → Wakeup + Select → Register File → Bypass → Data Cache
• Fetch of multiple instructions every cycle
• Renaming of registers to eliminate false dependences
• Instructions wait for source operands and for functional units
• Out-of-order execution, but in-order graduation
Technology Trends and Impact
[Chart: wakeup+select delay in psec, for issue width 4 vs. 8 and ROB size 32 vs. 64]
S. Palacharla et al., "Complexity Effective…", ISCA 1997, Denver.
Physical Scalability
[Chart: percentage of die reachable in one cycle vs. processor generation: 0.25, 0.18, 0.13, 0.10, 0.08, 0.06 microns]
Doug Matzke, "Will Physical Scalability…", IEEE Computer, Sept. 1997, pp. 37-39.
Register influence on ILP
• 8-way fetch/issue
• Window of 256 entries
• Up to 1 taken branch per cycle
• G-share predictor, 64K entries
• One-cycle latency
• Spec95
Register File Latency • 66% and 20% performance improvement when moving from a 2-cycle to a 1-cycle latency
Outline • Virtual-physical registers • A register file cache • VLIW architectures
Virtual-Physical Registers • Motivation • Conventional renaming scheme • Virtual-physical registers
[Diagram: register lifetime from Icache and Decode & Rename to Commit — the register is allocated at rename but unused until write-back, then used until commit]
Example
Latencies — cache miss: 20, fdiv: 20, fmul: 10, fadd: 5

load f2, 0(r4)      →rename→   load p1, 0(r4)
fdiv f2, f2, f10               fdiv p2, p1, p10
fmul f2, f2, f12               fmul p3, p2, p12
fadd f2, f2, 1                 fadd p4, p3, 1

• Register pressure (average allocated registers per cycle): Conventional: 3.6, Virtual-Physical: 0.7
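The renaming in the example above can be sketched with a small register alias table. This is an illustrative sketch, not the hardware design: physical registers are simply numbered upward instead of taken from a free list, and sources with no mapping yet (immediates, memory operands, registers live on entry) pass through unchanged.

```python
def rename(instructions):
    """Conventional renaming sketch: every logical destination gets a
    fresh physical register at decode; sources read the latest mapping."""
    rat = {}                                  # register alias table
    next_phys = 0
    renamed = []
    for op, dest, *srcs in instructions:
        srcs = [rat.get(s, s) for s in srcs]  # rename sources first
        next_phys += 1
        rat[dest] = "p%d" % next_phys         # fresh physical register
        renamed.append((op, rat[dest], *srcs))
    return renamed

code = [("load", "f2", "0(r4)"),
        ("fdiv", "f2", "f2", "f10"),
        ("fmul", "f2", "f2", "f12"),
        ("fadd", "f2", "f2", "1")]
# rename(code) gives the chain load p1 / fdiv p2 / fmul p3 / fadd p4,
# each instruction reading its predecessor's physical register
```

Note the point the example makes: in the conventional scheme p1–p4 are all allocated at decode, long before the load miss and the long-latency fdiv/fmul produce any value.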
Virtual-Physical Registers • Physical registers play two different roles • Keeping track of dependences (decode) • Providing a storage location for results (write-back) • Proposal: three types of registers • Logical: architected registers • Virtual-physical (VP): keep track of dependences • Physical: store values • Approach • Decode: rename from logical to VP • Write-back (or issue): rename from VP to physical
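The two-step renaming can be sketched as follows. Class and method names are illustrative, not the paper's terminology; the point is that VP tags are plentiful while physical registers are bound late.

```python
class VPRenamer:
    """Virtual-physical renaming sketch: decode hands out a VP tag so
    dependences can be tracked immediately; a real physical register is
    bound only at write-back, when the result value actually exists."""

    def __init__(self, num_phys):
        self.free = ["p%d" % i for i in range(num_phys)]
        self.next_vp = 0
        self.vp_to_phys = {}

    def decode(self):
        # VP tags are not a scarce resource: no stall at decode
        self.next_vp += 1
        return "vp%d" % self.next_vp

    def write_back(self, vp):
        # bind a physical register only now, when the value is produced
        if not self.free:
            return None          # the real scheme re-executes later
        self.vp_to_phys[vp] = self.free.pop(0)
        return self.vp_to_phys[vp]

    def commit(self, vp):
        # simplified: release this instruction's register at commit
        self.free.append(self.vp_to_phys.pop(vp))
```

With one physical register and two in-flight producers, the second producer simply waits until the first commits, instead of stalling decode.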
Virtual-Physical Registers • Hardware support
• General Map Table: logical register (Lreg) → VP register, with valid bit
• Physical Map Table: VP register → physical register (Preg)
• Instruction queue entries hold the VP tags of Src1, Src2 and the destination
• ROB entries hold the Lreg/VPreg pair until commit
[Diagram: pipeline stages Fetch → Decode → Issue → Execute → Write-back → Commit]
Virtual-Physical Registers • No free physical register at write-back: re-execute later, but if it is the oldest instruction no register will ever be freed • Avoiding deadlock • A number (NRR) of registers are reserved for the oldest instructions • 21% speedup for Spec95 on an 8-way issue processor [HPCA-4] • Conclusions • The optimal NRR is different for each program • For a given program, the best NRR may differ across sections of code
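The NRR reservation can be expressed as a simple allocation check. This is a sketch under the stated assumption that we only need a free-register count and an "is among the oldest" flag; the real design's bookkeeping is more involved.

```python
def may_allocate(free_count, nrr, is_among_oldest):
    """Deadlock-avoidance sketch: the last NRR free physical registers
    are reserved for the oldest in-flight instructions, so the head of
    the window can always obtain a register, complete, and free
    resources for everyone else."""
    if is_among_oldest:
        return free_count > 0   # reserved pool is available to it
    return free_count > nrr     # younger instructions must leave NRR free
```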
Performance evaluation
• SimpleScalar OoO with modified renaming, 8-way issue
• RUU: 128 entries
• Functional units (latency): 8 simple int (1), 4 int mult (7), 6 simple FP (4), 4 FP mult (4), 4 FP div (16), 4 memory ports
• L1 Dcache: 32 KB, 2-way, 32 B/line, 1 cycle
• L1 Icache: 32 KB, 2-way, 64 B/line, 1 cycle
• L2 cache: 1 MB, 2-way, 64 B/line, 12 cycles
• Main memory: 50 cycles
• Branch prediction: 18-bit Gshare, 2 taken branches
• Benchmarks: SPEC95, Compaq/DEC compilers, -O5
Virtual-Physical Registers
Virtual-Physical Registers • Performance evaluation
Virtual-Physical Registers • What is the optimal allocation policy? • Approximation • Registers should be allocated to the instructions that can use them earliest (avoid unused registers) • If some instruction must stall because of the lack of registers, choose the latest instructions (delaying the earliest would also delay the commit of the latest) • Implementation • Each instruction allocates a physical register at write-back. If none is available, it steals the register from the latest instruction after the current one
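The stealing step can be sketched as below. Ages are sequence numbers (smaller = older); the function name and data layout are illustrative, not the paper's implementation.

```python
def allocate_at_writeback(free, holders, age):
    """Allocate a physical register at write-back for the instruction
    with the given age.  If none is free, steal from the youngest
    current holder, provided it is younger than the requester; the
    victim will re-execute its write-back later."""
    if free:
        holders[age] = free.pop()
        return holders[age]
    youngest = max(holders)     # latest instruction holding a register
    if youngest <= age:
        return None             # nobody younger to steal from: wait
    holders[age] = holders.pop(youngest)
    return holders[age]
```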
DSY Performance SpecInt95 SpecFP95
Performance and Number of Registers SpecInt95 SpecFP95
Outline • Virtual-physical registers • A register file cache • VLIW architectures
Register File Latency • 66% and 20% performance improvement when moving from a 2-cycle to a 1-cycle latency
Register File Bypass SpecInt95
Register File Bypass SpecFP95
Register File Cache • Organization • Bank 1 (main register file, RF): all registers (128), 2-cycle latency • Bank 2 (register file cache, RFC): a subset of registers (16), 1-cycle latency
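The two-bank organization can be modeled in a few lines. Sizes and latencies follow the slide; FIFO replacement is assumed here (it is the replacement used by the first caching policy), and the class name is illustrative.

```python
class RegisterFileCache:
    """Two-level register file sketch: all registers live in the main
    bank (2-cycle reads); a small cache bank holds a subset of them
    and serves 1-cycle reads."""

    def __init__(self, cache_entries=16):
        self.cache = []              # register ids currently in the RFC
        self.entries = cache_entries

    def read_latency(self, reg):
        return 1 if reg in self.cache else 2

    def insert(self, reg):
        if reg in self.cache:
            return
        if len(self.cache) == self.entries:
            self.cache.pop(0)        # FIFO: evict the oldest entry
        self.cache.append(reg)
```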
Experimental Framework
• OoO simulator, 8-way issue/commit
• Functional units (latency): 2 simple integer (1), 3 complex integer — mult (2), div (14); 4 simple FP (2), 2 FP div (14), 3 branch (1), 4 load/store
• 128-entry ROB, 16-bit Gshare
• Icache and Dcache: 64 KB, 2-way set-associative, 1/8-cycle hit/miss
• Dcache: lock-up free, 16 outstanding misses
• Benchmarks: Spec95, DEC compiler, -O4 (int), -O5 (FP), 100 million instructions after initialization
• Access time and area models: extension of the Wilton & Jouppi models
Caching Policy (1 of 3) • First policy • Many values (85% int and 84% FP) are used at most once • Thus, only non-bypassed values are cached • FIFO replacement
Performance • 20% and 4% improvement over 2-cycle • 29% and 13% degradation over 1-cycle
Caching Policy (2 of 3) • Second policy • Cache values that are sources of any non-issued instruction with all its operands ready • Not issued because of lack of functional units, • or because the other operand is in the main register file
Performance • 24% and 5% improvement over 2-cycle • 25% and 12% degradation over 1-cycle
Caching Policy (3 of 3) • Third policy • Cache values that are sources of any non-issued instruction with all its operands ready • Prefetching • A table that, for each physical register, indicates the other operand of the first instruction that uses it • Replacement: give priority to values already read at least once
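The prefetch table can be sketched as follows. The interface is hypothetical (the slide only specifies what the table records): `rfc_insert` stands in for whatever mechanism moves a value into the register file cache.

```python
class PartnerPrefetch:
    """Third-policy sketch: for each physical register, remember the
    other source operand of the first instruction that reads it; when
    a result enters the RFC, prefetch its partner as well."""

    def __init__(self, rfc_insert):
        self.partner = {}            # phys reg -> operand read with it
        self.rfc_insert = rfc_insert # callback that caches a register

    def note_first_consumer(self, src1, src2):
        # only the FIRST consumer of each register is recorded
        self.partner.setdefault(src1, src2)
        self.partner.setdefault(src2, src1)

    def on_writeback(self, reg):
        self.rfc_insert(reg)
        if reg in self.partner:
            self.rfc_insert(self.partner[reg])  # prefetch the partner
```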
Performance • 27% and 7% improvement over 2-cycle • 24% and 11% degradation over 1-cycle
Speed for Different RFC Architectures Taking access time into account SpecInt95
Conclusions • Register file access time is critical • Virtual-physical registers significantly reduce the register pressure • 24% improvement for SpecFP95 • A register file cache can reduce the average access time • 27% and 7% improvement for a two-level, locality-based partitioning architecture
High performance instruction fetch through a software/hardware cooperation Alex Ramirez Josep Ll. Larriba-Pey Mateo Valero UPC-Barcelona
Superscalar Processor
Pipeline: Fetch → Decode → Rename → Instruction Window → Wakeup + Select → Register File → Bypass → Data Cache
• Fetch of multiple instructions every cycle
• Renaming of registers to eliminate false dependences
• Instructions wait for source operands and for functional units
• Out-of-order execution, but in-order graduation
J.E. Smith and S. Vajapeyam, "Trace Processors…", IEEE Computer, Sept. 1997, pp. 68-74.
Motivation
[Diagram: instruction fetch & decode feeds the instruction queue(s); instruction execution returns branch/jump outcomes to the fetch unit]
• Instruction fetch rate is important not only in steady state • Program start-up • Miss-speculation points • Program segments with little ILP
Motivation • Instruction fetch effectively limits the performance of superscalar processors • Even more relevant at program startup points • More aggressive processors need higher fetch bandwidth • Multiple basic block fetching becomes necessary • Current solutions need extensive additional hardware • Branch address cache • Collapsing buffer: multi-ported cache • Trace cache: special purpose cache
[Chart: PostgreSQL — 64KB I1, 64KB D1, 256KB L2]
Program Behaviour 64KB I1, 64KB D1, 256KB L2
The Fetch Unit (1 of 3)
• Scalar fetch unit • Few instructions per cycle • 1 branch
• Limitations • Prediction accuracy • I-cache miss rate
• Previous work, code reordering (software, to reduce cache misses) • Fisher (IEEE Tr. on Comp. '81) • Hwu and Chang (ISCA'89) • Pettis and Hansen (SIGPLAN'90) • Torrellas et al. (HPCA'95) • Kalamatianos et al. (HPCA'98)
[Diagram: the next fetch address feeds the branch prediction mechanism and the instruction cache (i-cache); next address logic and shift & mask deliver instructions to decode]
The Fetch Unit (2 of 3)
• Aggressive fetch unit • Many instructions per cycle • Several branches
• Limitations • Prediction accuracy • Sequentiality • I-cache miss rate
• Previous work, trace building (hardware, form traces at run time) • Yeh et al. (ICS'93) • Conte et al. (ISCA'95) • Rotenberg et al. (MICRO'96) • Friendly et al. (MICRO'97)
[Diagram: aggressive core fetch unit — multiple branch predictor, branch target buffer, return stack, i-cache, next address logic, shift & mask — next fetch address in, instructions to decode]
Trace Cache
• A trace is a sequence of logically contiguous instructions
• A trace cache line stores a segment of the dynamic instruction trace across multiple, potentially taken, branches (b1-b2-b4, b1-b3-b7, …)
• It is indexed by fetch address and branch outcomes
• History-based fetch mechanism
[Diagram: control-flow graph with basic blocks b0–b8]
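A minimal model of this indexing scheme, with trace-line building and size limits omitted (the class and its methods are an illustrative sketch, not a hardware description):

```python
class TraceCache:
    """Trace cache sketch: lines are indexed by the fetch address plus
    the outcomes of the branches inside the trace, so the same start
    address can hold several traces (e.g. b1-b2-b4 and b1-b3-b7)."""

    def __init__(self):
        self.lines = {}

    def fill(self, fetch_addr, outcomes, trace):
        self.lines[(fetch_addr, tuple(outcomes))] = trace

    def lookup(self, fetch_addr, predicted):
        # hit only if the multiple-branch prediction matches a stored trace
        return self.lines.get((fetch_addr, tuple(predicted)))
```

On a hit, the whole multi-block trace is delivered in one cycle; on a miss, the core fetch unit supplies instructions while the fill buffer builds a new line.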
The Fetch Unit (3 of 3)
• The trace cache (t-cache) aims at forming traces at run time
• A fill buffer builds trace lines from fetch or commit
[Diagram: aggressive core fetch unit — multiple branch predictor, branch target buffer, return stack, i-cache, next address logic, shift & mask — extended with a trace cache and fill buffer; next fetch address in, instructions to decode]
Our Contribution • Mixed software-hardware approach • Optimize performance at compile-time • Use profiling information • Make optimum use of the available hardware • Avoid redundant work at run-time • Do not repeat what was done at compile-time • Adapt hardware to the new software • Software Trace Cache • Profile-directed code reordering & mapping • Selective Trace Storage • Fill Unit modification