Out-of-Order Execution Structures ECE1773 - Fall ‘07 ECE Toronto
MIPS R10000-Like Design • Based on: "Complexity-Effective Superscalar Processors", S. Palacharla, N. Jouppi and J. E. Smith, ISCA '97
Fetch Phase • Fetch: read instructions from the I-Cache • Predict branches • Pass instructions on to the Decode phase
Decode Phase • Decode: parse the instruction • Shuffle opcode parts to the appropriate ports for rename
Renaming Phase • Rename: map architectural registers to physical registers • Eliminates false dependences • Passes renamed instructions to the scheduler (this step is called Dispatch)
Scheduling Phase • Wakeup: instructions check whether they have become ready • Input from Writeback: physical register names • Select: among the ready instructions, select those to execute • Must respect structural hazards
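The wakeup step can be sketched as a tag broadcast matched against every waiting instruction. This is a minimal Python sketch, not the circuit from the slides: `WindowEntry` and `wakeup` are illustrative names, and each entry simply tracks which physical-register tags it is still waiting on.

```python
from dataclasses import dataclass, field


@dataclass
class WindowEntry:
    """One waiting instruction (illustrative structure, an assumption)."""
    name: str
    waiting_on: set = field(default_factory=set)  # tags not yet produced

    @property
    def ready(self) -> bool:
        return not self.waiting_on


def wakeup(window, broadcast_tags):
    """Match every broadcast tag against every waiting source tag."""
    for entry in window:
        entry.waiting_on -= set(broadcast_tags)
    # Entries with no outstanding tags may now request execution (Select).
    return [e.name for e in window if e.ready]


window = [
    WindowEntry("add", {"p7", "p9"}),
    WindowEntry("sub", {"p7"}),
    WindowEntry("mul", set()),
]
ready_now = wakeup(window, ["p7"])  # "sub" and "mul" are ready, "add" still waits on p9
```

The ready list is only the set of requesters; choosing which of them actually issue is the job of the select logic.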
Register File Read Phase • Read source operands
Bypass and Execute Phase
Data Cache Access Phase
Writeback Phase • Write result to register file • Broadcast tag in order to wake up waiting instructions • Notice that the tag broadcast should happen TWO cycles in advance of the result production
Reservation Station Model • Used by Pentium Pro, PowerPC 604 • Re-order buffer holds values • Renaming points to re-order buffer entries • Tomasulo-like
Physical Register File vs. Reservation Station • Physical Register File: • Values reside in the register file • At writeback, instructions broadcast the register name • Reservation Stations: • Values reside in the register file upon commit (non-speculative) • Values reside in reservation stations prior to commit (speculative)
Quantifying Complexity • Critical Path Delay as a function of architectural parameters • Instruction Window size (WinSize) • Issue Width (IW) • Full-custom Implementations • Study the critical path • Delay model • Extrapolate how it will scale with “future” technologies
Renaming • Inputs: • IW instructions, each with: • Up to 2 x input register names • Up to 1 x output register name • Outputs (per instruction): • 2 x input physical registers • 1 x new output physical register • 1 x previous physical register name for checkpointing • Updated rename table • Superscalar issue complicates things a bit
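The inputs and outputs listed above can be illustrated for a single instruction. This is a minimal Python sketch under assumed parameters: the table sizes, the initial identity mapping, and the names `rat`, `free_list` and `rename_one` are illustrative and do not come from the slides. Free-list exhaustion is not handled.

```python
from collections import deque

NUM_ARCH_REGS = 32   # assumed sizes, for illustration only
NUM_PHYS_REGS = 64

# Assumed reset state: architectural register i maps to physical register i.
rat = list(range(NUM_ARCH_REGS))
free_list = deque(range(NUM_ARCH_REGS, NUM_PHYS_REGS))


def rename_one(src1, src2, dest):
    """Rename one instruction; returns (ps1, ps2, new_pd, old_pd)."""
    ps1 = rat[src1]               # read port: first source
    ps2 = rat[src2]               # read port: second source
    old_pd = rat[dest]            # read port: previous mapping, kept for recovery
    new_pd = free_list.popleft()  # new register taken from the free list
    rat[dest] = new_pd            # write port: update the rename table
    return ps1, ps2, new_pd, old_pd


first = rename_one(1, 2, 3)    # r3 <- r1 op r2
second = rename_one(3, 1, 3)   # r3 <- r3 op r1, reads the mapping written by `first`
```

Because the two calls run one after the other, the second instruction sees the first one's new mapping for free. Renaming both in the same cycle is what requires the extra dependency check logic.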
Renaming One Instruction • [Figure: RAT with entries p0 ... p31; read ports look up s1 and s2, a further read port returns the old mapping of d (kept for misspeculation recovery), and the write port installs a new register from the free list as the new mapping of d]
Renaming Two Instructions • [Figure: both instructions look up s1, s2 and d in the RAT; cross-bundle dependency check logic chooses between the RAT outputs (ps1, ps2, old d) and the new d of the preceding instruction]
Renaming More Instructions • Dependency Checking logic for instruction i must match against all preceding destinations • If there are multiple matches it must enforce priority: • Pick the one closest to this instruction
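The priority rule above can be sketched for a whole rename bundle. This is an illustrative Python sketch, not the hardware organization: `rename_group` is a hypothetical helper, every RAT read sees the table as it was before the bundle, and the dependency check overrides a read with the destination of the closest preceding writer.

```python
def rename_group(rat, free_regs, group):
    """Rename a bundle of (src1, src2, dest) tuples against one RAT snapshot.

    Returns (renamed, new_rat), where each renamed entry is
    (ps1, ps2, new_pd, old_pd). Names and layout are assumptions.
    """
    free_regs = list(free_regs)
    new_dests = [free_regs.pop(0) for _ in group]
    renamed = []
    for i, (s1, s2, d) in enumerate(group):
        def lookup(arch):
            # Scan preceding instructions nearest-first to enforce priority.
            for j in range(i - 1, -1, -1):
                if group[j][2] == arch:
                    return new_dests[j]
            return rat[arch]  # no match in the bundle: use the RAT output
        renamed.append((lookup(s1), lookup(s2), new_dests[i], lookup(d)))
    # RAT update: the last writer of each architectural register wins.
    new_rat = list(rat)
    for (_, _, d), pd in zip(group, new_dests):
        new_rat[d] = pd
    return renamed, new_rat


rat0 = list(range(8))
bundle = [(1, 2, 3), (3, 1, 3), (3, 3, 4)]
renamed, rat1 = rename_group(rat0, [8, 9, 10], bundle)
```

In the example, the third instruction reads r3, which both earlier instructions write; it receives the mapping from the second one (the closest), not the first.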
RAT: SRAM Implementation • [Figure: SRAM array with #ARCH REGS rows and lg(#PHYS REGS) columns; a decoder indexed by the architectural register drives the wordlines, and sense amps on the bitlines output the physical register]
SRAM RAT cell
RAT: CAM Implementation • One CAM entry per physical register • Active bit indicates the current map • A new version is created by setting the active bit • [Figure: CAM array with #PHYS REGS rows and lg(#ARCH REGS) columns plus an active bit; the architectural register is matched against the CAM cells and an encoder outputs the physical register]
CAM Cell
SRAM vs. CAM • SRAM: • #arch regs rows • lg(#phys regs) columns • SRAM read/write • CAM: • #phys regs rows • lg(#arch regs) columns • CAM match • Update: • Reset previous valid (active) bit • Set current valid (active) bit
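The CAM organization and its update rule can be illustrated with a small model. This is a minimal Python sketch under assumptions: the class `CamRat`, its reset state (physical register i holds architectural register i), and the method names are hypothetical, and the associative match is modelled as a linear scan.

```python
class CamRat:
    """CAM-style map table: one entry per physical register (illustrative)."""

    def __init__(self, num_phys_regs, num_arch_regs):
        # Each entry stores an architectural register number and a valid bit.
        self.arch = [None] * num_phys_regs
        self.valid = [False] * num_phys_regs
        for r in range(num_arch_regs):  # assumed reset state
            self.arch[r] = r
            self.valid[r] = True

    def lookup(self, arch_reg):
        """CAM match: the one valid entry holding arch_reg is the mapping."""
        matches = [p for p, (a, v) in enumerate(zip(self.arch, self.valid))
                   if v and a == arch_reg]
        assert len(matches) == 1, "exactly one active mapping expected"
        return matches[0]

    def update(self, arch_reg, new_phys):
        """Reset the previous valid bit, then set the new entry's valid bit."""
        old_phys = self.lookup(arch_reg)
        self.valid[old_phys] = False
        self.arch[new_phys] = arch_reg
        self.valid[new_phys] = True
        return old_phys


cam = CamRat(num_phys_regs=8, num_arch_regs=4)
old = cam.update(2, 5)  # r2 now maps to p5; the p2 entry remains but is inactive
```

Note that the old entry still stores the architectural register number; only its valid bit changes, which is why the update is described as resetting one bit and setting another.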
Scheduler: Part #1 - Wakeup
Scheduler: Part #2 - Select • For a single FU • Tree of arbiters with REQ signals and GRANT signals • Anyreq is raised if any req is active • Grant is issued if the arbiter is enabled • Root is enabled if the FU is available • Location-based select policy
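The arbiter tree for a single FU can be sketched recursively. This is a minimal Python sketch, not the circuit itself: the function `select`, the fan-in of 4 per arbiter, and the lowest-index-first priority are assumptions chosen to illustrate a location-based policy.

```python
def select(requests, enable=True, fan_in=4):
    """Return grant bits for one FU using a tree of fan_in-input arbiters.

    requests: list of booleans, one per window entry.
    enable:   True if the FU is available (enables the root arbiter).
    """
    n = len(requests)
    if n <= fan_in:
        # Single arbiter: grant the first active request, if enabled.
        grants = [False] * n
        if enable:
            for i, req in enumerate(requests):
                if req:
                    grants[i] = True
                    break
        return grants
    # Split the inputs into at most fan_in groups, one per subtree.
    size = -(-n // fan_in)  # ceiling division
    groups = [requests[i:i + size] for i in range(0, n, size)]
    # Each subtree raises anyreq toward its parent.
    anyreq = [any(g) for g in groups]
    # The parent arbitrates among anyreq signals and enables one subtree.
    child_enable = select(anyreq, enable, fan_in)
    grants = []
    for g, en in zip(groups, child_enable):
        grants.extend(select(g, en, fan_in))
    return grants


requests = [False] * 16
requests[6] = requests[11] = True
grants = select(requests)  # only entry 6 is granted
```

With a fixed fan-in, the number of arbiter levels grows with the logarithm of the window size, and the entry's position in the window alone decides who wins.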
Select for More than One FU • Handling multiple FUs of the same type: • Stack select logic blocks in series (a hierarchy) • Mask the request granted to the previous unit • NOT feasible for more than 2 FUs • Alternative: • Statically partition the issue window among FUs (MIPS R10000, HP PA 8000)
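The two approaches above can be compared with a small model. This is an illustrative Python sketch: `select_first` stands in for a single-FU select block, and `select_stacked` and `select_partitioned` are hypothetical helpers. The equal-sized contiguous partitions are an assumption, not a description of how the R10000 or PA 8000 divide their windows.

```python
def select_first(requests):
    """Single-FU select: index of the first active request, or None."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None


def select_stacked(requests, num_fus):
    """Select blocks in series: each block sees the requests minus earlier grants."""
    remaining = list(requests)
    granted = []
    for _ in range(num_fus):
        idx = select_first(remaining)
        if idx is None:
            break
        granted.append(idx)
        remaining[idx] = False  # mask the request granted to the previous unit
    return granted


def select_partitioned(requests, num_fus):
    """Static partitioning: each FU only sees its own slice of the window."""
    size = -(-len(requests) // num_fus)  # ceiling division
    result = []
    for fu in range(num_fus):
        idx = select_first(requests[fu * size:(fu + 1) * size])
        result.append(None if idx is None else fu * size + idx)
    return result


reqs = [False, True, True, False, False, False, False, False]
stacked = select_stacked(reqs, 2)
partitioned = select_partitioned(reqs, 2)
```

In this example the stacked scheme issues entries 1 and 2, while the partitioned scheme issues only entry 1 and leaves the second FU idle. That lost opportunity is the price of avoiding the serial delay of stacked select blocks.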
Datapath and Bypass • Commonly used layout (1 bit-slice shown) • Example: turn on tri-state A to pass the result of FU1 to the left operand of FU0
Complexity Analysis • Critical path delay as a function of: • Issue Width • Window Size • Structures analyzed: • Register Renaming Table • Wakeup and Select • Bypass paths
Methodology • A representative CMOS design is selected from published alternatives • The circuits are implemented in 3 technologies: • 0.8um, 0.35um and 0.18um • Optimized for speed • Wire parasitics included in the delay model • Rmetal, Cmetal
Methodology • Feature size scaling: 1 / S • Voltage scaling: 1 / U • Logic Delay = (CL x V) / I • Capacitive load: CL scales from 1 to 1 / S • Supply voltage: V scales from 1 to 1 / U • Average charge/discharge current: I scales from 1 to 1 / U • So, Logic Delay scales as (1 / S x 1 / U) / (1 / U) = 1 / S
Wire Delay • Wire delay = 0.5 x Rmetal x Cmetal x L^2 • L: wire length • Intrinsic RC delay • Rmetal: resistance per unit length • Cmetal: capacitance per unit length • 0.5: 1st order approximation of the distributed RC model (uniformly distributed R and C)
Wire Delay Scaling • Metal thickness doesn’t scale much • Width ~ 1/S • Rmetal ~ S • Fringe capacitance dominates at smaller feature sizes (edges to parallel wires and the substrate) • Parallel-plate capacitance scales with 1/S • Cmetal ~ S • Length scales with 1/S • Overall scale factor: S x S x (1/S)^2 = 1 • Wire delay remains constant
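The two scaling arguments (logic delay and wire delay) can be checked with exact arithmetic. This is a small Python sketch; the function names and the example values S = 2 and U = 3/2 are assumptions chosen only to exercise the formulas stated on the slides.

```python
from fractions import Fraction


def logic_delay_scale(S, U):
    """Scale factor of logic delay = (CL x V) / I under the slide's model."""
    c_load = Fraction(1) / S   # capacitive load scales as 1/S
    v = Fraction(1) / U        # supply voltage scales as 1/U
    i = Fraction(1) / U        # average current scales as 1/U
    return c_load * v / i


def wire_delay_scale(S):
    """Scale factor of the intrinsic wire delay, proportional to Rmetal x Cmetal x L^2."""
    r_metal = S                # resistance per unit length scales as S
    c_metal = S                # capacitance per unit length scales as S
    length = Fraction(1) / S   # wire length scales as 1/S
    return r_metal * c_metal * length ** 2


logic = logic_delay_scale(2, Fraction(3, 2))  # 1/2: gates get faster by S
wire = wire_delay_scale(2)                    # 1: wire delay does not improve
```

The voltage factor U cancels in the logic delay, so only the feature-size factor S matters, while the wire delay stays constant. Wire-dominated structures therefore take up a growing share of the cycle as feature sizes shrink.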
Register Renaming Table
Dependency Checking Logic • Accessed in parallel with the map table • Every logical register is compared against the logical destination registers of the current rename group • For IW = 2, 4, 8, its delay is less than that of the map table
Renaming Delay • SRAM scheme • Delay Components: • Time to decode the arch reg index • Time to drive the wordline • Time to pull down the bitline • Time for the SenseAmp to detect the pull-down • MUX time is ignored because the control from the dependency check logic arrives in advance
Renaming Circuit
Decoder Delay
Decoder Delay • Predecoding for speed • Length of predecode lines = (Cellheight + 3 x IW x Wordline spacing) x NVREG • Cellheight: height of a single cell excluding wordlines • Wordline spacing: spacing between wordlines • NVREG: # of virtual registers • x3: 3-operand instructions
Decoder Delay • Tnand: fall delay of NAND • Tnor: rise delay of NOR • Req: NAND pull-down channel resistance (Rnandpd) + predecode line metal resistance • Ceq: diffusion capacitance of NAND + gate capacitance of NOR + interconnect capacitance
Decoder Delay • Substituting the predecode line length, Req and Ceq, we get a delay of the form c0 + c1 x IW + c2 x IW^2 • c2: intrinsic RC delay of the predecode line • c2 very small • Decoder delay ~linearly dependent on IW
Rename Delay: Wordline • c2: intrinsic RC delay of the wordline • c2 very small • Wordline delay ~linearly dependent on IW
Rename Delay: Bitline and SenseAmp • Bitline: c2 very small • Bitline delay ~linearly dependent on IW • SenseAmp delay ~linearly dependent on IW
Rename Logic Delay Scaling • Total delay increases linearly with IW • Each component shows a linear increase with IW • Bitline delay > wordline delay • Bitline length ~ # of logical registers • Wordline length ~ width of physical register designator • IW impact on delay worsens with decreasing feature size (increase in bitline and wordline delay with increasing IW): • 0.8um: IW 2 → 8, bitline delay +37% • 0.18um: IW 2 → 8, bitline delay +53%
Wakeup Delay • Critical path: a mismatch pulls the ready signal low • Delay Components: • Tag drivers drive the tag lines (vertical) • Mismatched bit: the pull-down stack pulls the matchline low (horizontal) • Final OR gate ORs all the matchlines of an operand tag • Ttagdrive ~ driver pull-up R, tagline length and tagline load C • Quadratic component significant for IW > 2 and 0.18um
Wakeup Delay • Quadratic component small for both cases • Both delays ~linearly dependent on IW
Wakeup Delay: IW and Window Size • 0.18um process • Quadratic dependence • Issue width has the greater effect: it increases all 3 delay components • [Figure: how delay actually changes as IW and WinSize increase together]
Wakeup Delay: Window Size • 8-way and 0.18um process • Tag drive delay increases rapidly as WinSize increases • Match OR delay stays constant
Wakeup Delay: Feature Size • 8-way and 64-entry window • Tag drive and tag match delays do not scale as well as Match OR delay • Match OR is logic delay only • The others also include wire delays
Selection Logic and Bypass Delay • Selection: delay logarithmically dependent on WinSize • Bypass: delay dependent on IW^2