Realizing High IPC Through a Scalable, Multipath Microarchitecture David Kaeli Northeastern University Computer Architecture Research Laboratory Boston, MA USA
The Team David Morano Alireza Khalafi Marcos de Alba Northeastern University Boston, MA USA Augustus Uht Sean Langford* University of Rhode Island Kingston, RI USA (*now at CMU)
The Road to High IPC • Many studies have concluded that typical programs (e.g., SPECint) contain a significant amount of Instruction Level Parallelism (ILP) • Lam and Wilson reported an IPC of ~40 for SP-CD-MF (speculative execution, perfect control dependence information, multi-path execution) • Gonzalez and Gonzalez reported an IPC of ~37 for an infinite instruction window with no value prediction (IPC dropped to just under 10 for a 128-entry instruction window) • So why are we still living with low, single-digit IPCs? • Nobody has been aggressive enough!
Machine Philosophy • Issue a column of instructions on every cycle (not always possible) • Spend the rest of the time executing, squashing, snarfing and re-executing as necessary to preserve true control flow and data flow dependencies • Retire instructions at a rate of a column at a time • Design a datapath that is scalable in terms of latency as the size of the machine grows • ISA independent
Outline for this Talk • Overview of the Levo microarchitecture • Discussion of scalability within the Levo datapath • Disjoint execution • Simulation methodology and results • Comments and summary
Levo Microarchitectural Features • In-order instruction load, in-order retirement, rampantly out-of-order execution • Active stations – a more intelligent version of Tomasulo’s reservation stations • Instruction/operand/memory/predicate time tags – used to enforce data and control dependencies in a distributed fashion • Hardware runtime predication – used for all BBs with targets within the execution window • Distributed register file – reduces contention for a shared register file • Aggressive speculation – execute instructions, independent of any data flow or control flow dependencies • Disjoint execution to cover control hazards • Limit study with real hardware constraints
In-order Instruction Load • Instructions are fetched in static order from the I-cache, except: • Unconditional jump paths are followed • Loops are dynamically unrolled • For a conditional branch with a far target (the distance to the target exceeds 2/3 of the execution window size) that is strongly predicted taken, static fetching continues from the target • A conventional 2-level gshare branch predictor is used • Dynamic run-time predicates are generated so that every branch domain in the Execution Window is control independent • Nullify operations are broadcast to cause dependent instructions to re-execute
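The far-target fetch rule on this slide can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name, the use of forward distance measured in instructions, and the 4-byte instruction size are hypothetical choices, not details confirmed by the slides.

```python
def fetch_from_target(branch_pc, target_pc, window_instructions,
                      strongly_taken, instr_bytes=4):
    """Return True if static fetch should continue at the branch target.

    Assumptions (not from the slides): distance is the forward distance in
    instructions, and instructions are 4 bytes wide. Backward branches are
    left to the loop-unrolling mechanism and never count as "far" here.
    """
    distance = (target_pc - branch_pc) // instr_bytes
    # "Far" means more than 2/3 of the execution window; integer form
    # of distance > (2/3) * window_instructions.
    is_far = 3 * distance > 2 * window_instructions
    return is_far and strongly_taken


# Example with a hypothetical 8 x 8 window (64 instructions).
print(fetch_from_target(0x1000, 0x1000 + 44 * 4, 64, strongly_taken=True))
```

In this sketch, a near target or a weakly predicted branch leaves fetch in static order, so the instructions between branch and target stay in the window and are covered by run-time predication.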
Microarchitecture [Figure: block diagram showing the Memory Window, the I-Cache, and the n x m time-ordered Execution Window]
Active Stations • More intelligent version of Tomasulo reservation stations • Each AS holds: • A single instruction • Instruction operands • A time tag denoting its logical position in the execution window • Each AS shares a processing element with a number of other ASs (as defined by the size of a sharing group)
Active Stations • Communicate with other active stations in order to: • Snoop for the latest operand values • Forward the results to other active stations • Request a value from other active stations • Re-execute their instructions with new operand values • Handle control flow changes through runtime predication
Time Tags • Enforce the nominal sequential order of the instructions executed • Accompany all in-flight register values, memory values and predicate values • Have two parts • Column tag – is decremented by 1 whenever the left-most column is loaded • Row tag – does not change [Figure: time tag fields, Row and Column]
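A minimal sketch of the two-part time tag described above. The class and function names are hypothetical, and the ordering key assumes the column tag is the high-order part and rows are ordered top to bottom within a column; the slides do not state the encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeTag:
    column: int  # decremented when the left-most column is loaded
    row: int     # never changes while the instruction is in flight

    def order_key(self, rows_per_column):
        # Assumed encoding: column is the high-order part of the tag.
        return self.column * rows_per_column + self.row


def shift_on_column_load(tags):
    """Decrement every in-flight column tag by 1; row tags are untouched."""
    return [TimeTag(t.column - 1, t.row) for t in tags]


tags = [TimeTag(column=1, row=3), TimeTag(column=2, row=0)]
shifted = shift_on_column_load(tags)
print(shifted)
```

Because every column tag drops by the same amount, the relative order of in-flight values is preserved across the shift, which is what lets the tags keep enforcing sequential order without renumbering rows.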
Execution Window [Figure: execution window of n rows (Row 0 to Row n-1) by m columns (Column 0 to Column m-1) of active stations. Inset: a sharing group in which AS(0,m-1), AS(1,m-1), AS(2,m-1) and AS(3,m-1) are connected to one PE. Caption: a sharing group of 4 mainline ASs sharing a single PE]
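One way to picture the sharing group is as a mapping from an AS position to the PE it uses. The sketch below assumes that groups are formed from consecutive rows within a single column, as the figure suggests for AS(0,m-1) through AS(3,m-1); the function names and the PE numbering are hypothetical.

```python
def pe_index(row, column, rows, group_size=4):
    """Map AS(row, column) to the index of the PE its sharing group uses.

    Assumes sharing groups are consecutive rows of one column and that
    the column height is a multiple of the group size.
    """
    if rows % group_size != 0:
        raise ValueError("column height must be a multiple of the group size")
    groups_per_column = rows // group_size
    return column * groups_per_column + row // group_size


def pe_count(rows, columns, group_size=4):
    """Total number of PEs for an n x m window under the same assumptions."""
    return (rows * columns) // group_size


# Hypothetical 8-row, 4-column window with sharing groups of 4.
print(pe_index(3, 0, rows=8), pe_index(4, 0, rows=8), pe_count(8, 4))
```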
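Active Station Operand Snooping and Snarfing [Figure: active station (AS) snoop logic. A transaction on the operand forwarding bus carries a time tag, address, value and path. Comparators (=, >=, !=, <) check these fields against the time tag, address, value and path held in the station, and their outcome decides whether the station executes or re-executes and places its result on the result bus]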
Time Tag Example: Last Snarfed Time Tag (LSTT) and Instruction Result Time Tag (ResTT)
Instr. No.   (a) Program Code   (b) With Renaming   (c) With Time Tags   ResTT   LSTT in Active Station
I1           R4 = 1             R4a = 1             R4 = 1               1       –
I5           R4 = 2             R4b = 2             R4 = 2               5       –
I9           R3 = R4            R3 = R4b            R3 = R4              9       1, then 5
(a) Sequential execution: at end, R3 holds '2'.
(b) Out-of-order (OOO) execution with renaming: I9 only snarfs the I5 result; at end, R3 holds '2'.
(c) Out-of-order (OOO) execution with time tags: the I1 result and ResTT are broadcast (R3 = 1, LSTT = 1), then the I5 result and ResTT are broadcast (R3 = 2, LSTT = 5); at end, R3 holds '2'. The result is the same if I5 broadcasts first: LSTT is set to and stays at '5', and the I1 result is not snarfed by I9.
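The I1/I5/I9 example can be replayed with a small model of the snarfing rule. This is a sketch under assumptions: the class, its field names, and the use of -1 for "nothing snarfed yet" are hypothetical, and the rule coded here (accept a broadcast only from an earlier instruction whose ResTT is not older than the current LSTT) is one reading of the example, not a statement of the actual hardware logic.

```python
class ActiveStation:
    """Toy model of one active station with a single source register."""

    def __init__(self, time_tag, src_reg):
        self.time_tag = time_tag
        self.src_reg = src_reg
        self.value = None
        self.lstt = -1        # assumed sentinel: nothing snarfed yet
        self.executions = 0   # counts execute / re-execute events

    def snoop(self, reg, value, res_tt):
        """Return True if the broadcast is snarfed."""
        if reg != self.src_reg:
            return False      # different register: ignore
        if res_tt >= self.time_tag:
            return False      # producer is not earlier in program order
        if res_tt < self.lstt:
            return False      # older than the value already held
        if value != self.value:
            self.executions += 1  # operand changed: (re-)execute
        self.value, self.lstt = value, res_tt
        return True


# Order 1: I1 broadcasts, then I5.
i9 = ActiveStation(time_tag=9, src_reg="R4")
i9.snoop("R4", 1, res_tt=1)
i9.snoop("R4", 2, res_tt=5)

# Order 2: I5 broadcasts first, so the later I1 broadcast is filtered out.
i9_alt = ActiveStation(time_tag=9, src_reg="R4")
i9_alt.snoop("R4", 2, res_tt=5)
i1_snarfed = i9_alt.snoop("R4", 1, res_tt=1)

print(i9.value, i9.lstt, i9_alt.value, i9_alt.lstt, i1_snarfed)
```

Both broadcast orders leave I9 holding the value 2 with LSTT = 5; the only difference is that the first order costs an extra execution of I9.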
Scalable Microarchitecture • Time tag size grows linearly with the total number of ASs • No reorder buffer (typically grows O(n^2)) • No centralized architected register file • Register forwarding units hold the ISA-defined register state • Forwarding transactions maintain state • Segmented result buses – fixed length • Distributed L0 caching in the datapath
Observation About Register Lifetimes • The MultiScalar Project demonstrated that register lifetimes are short (spanning 1-2 basic blocks, within 32 instructions) • If we have instructions laid out in a time-ordered fashion, the probability that we will have to forward very far in time is low • As a result, we can segment our interconnection fabric, assuming that communication will only span either the current, or at most the next, segment
Segmented Buses (Spanning Buses) • Use segmented buses to propagate execution results to later stations • Adjacent segments are interconnected with Forwarding Units (one forwarding unit, per bus, per column) • Register Forwarding/Filter Units (RFUs) hold a version of the ISA register state • Memory Forwarding/Filter Units (MFUs) and Predicate Forwarding Units (PFUs) are also provided • Backwarding buses are also provided • The number of I/Os to a FU is independent of the machine size and only depends on the column height • Segmented buses help to preserve scalability in our datapath
[Figure: segmented bus organization across three columns. Each column holds groups of active stations (AS) separated by forwarding units (FU); buses enter each column from the previous column and leave through a final FU to the next column]
Register Forwarding/Filter Units • Capture the persistent register state • All buses are register transaction buses • Consolidate update transactions on input • Updates are forwarded to the output bus request logic immediately when possible • Requests are “filtered” based on time-tag value • Updates are managed in the file store in FIFO order [Figure: RFU block diagram. An ISA register file per path, with time tags, sits between two logic blocks. On the backward-in-time side: backwarding read buses, the primary backwarding read bus and the backwarding write bus. On the forward-in-time side: forwarding read buses, the primary forwarding read bus and the forwarding write bus]
Memory Forwarding/Filter Units • Serve as an L0 cache • All buses are memory buses (the number of buses is set according to the interleave factor) • Consolidate update transactions on input • Updates are “forwarded” to the output bus request logic immediately when possible • Requests are “filtered” based on time-tag value • Current policy is to queue outgoing requests or responses in FIFOs until the buses are granted for use [Figure: MFU block diagram. A memory cache with time tags sits between two logic blocks, each with a FIFO. On the backward-in-time side: backwarding write buses and backwarding read buses. On the forward-in-time side: forwarding write buses and forwarding read buses]
Disjoint Path Execution • Levo can only obtain high IPC if: • we can provide a large window of instructions to execute • a large percentage of the instructions on the eventual committed control-flow path are included in the window • To address the issues with hard-to-predict conditional control flow, we utilize disjoint path spawning in Levo
Disjoint Path Execution • To enable path spawning we provide a disjoint path (D-path) set of ASs that share a processing element with a mainline set of ASs • D-paths are spawned in the case of hammock branches • The D-path is copied from the mainline path • The sign of the associated predicate is inverted for the D-path • The D-path receives lower priority for the PE than the mainline • When a hammock branch is mispredicted, we can treat the D-path as the new mainline path, and continue execution accordingly
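The D-path bullets above can be sketched as three small operations: copy the mainline with the predicate sense inverted, arbitrate the shared PE in favor of the mainline, and promote the D-path after a misprediction. All names below are hypothetical, and the model ignores time tags and operand state; it only illustrates the copy, invert, prioritize and promote steps.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StationEntry:
    instruction: str
    predicate_sense: bool   # sense of the controlling branch predicate
    is_dpath: bool = False


def spawn_dpath(mainline):
    """Copy the mainline entries, inverting the predicate sense."""
    return [replace(e, predicate_sense=not e.predicate_sense, is_dpath=True)
            for e in mainline]


def grant_pe(ready):
    """Pick the entry that gets the shared PE; mainline entries win.

    min() returns the first minimal element and False < True, so a
    mainline entry (is_dpath=False) is chosen ahead of any D-path entry.
    """
    return min(ready, key=lambda e: e.is_dpath)


def promote_on_mispredict(dpath):
    """Treat the D-path as the new mainline after a misprediction."""
    return [replace(e, is_dpath=False) for e in dpath]


mainline = [StationEntry("ADD R2,R2,#4", predicate_sense=True)]
dpath = spawn_dpath(mainline)
winner = grant_pe([dpath[0], mainline[0]])
new_mainline = promote_on_mispredict(dpath)
print(winner.is_dpath, new_mainline[0])
```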
Disjoint Path Example
Label    Addr   Instruction        History
START:   A100   LW R2,20(R4)
         A104   SUB R2,R2,#1
         A108   BEQZ R2,TAR1       Weakly T
         A10C   ADD R2,R2,#4
         A110   SW 30(R4),R2
TAR1:    A114   LW R2,30(R4)
         A118   SUB R2,R2,#8
         A11C   BEQZ R2,TAR2       Weakly NT
         A120   SW 20(R4),R2
TAR2:    A124   ADD R2,R2,#10
         A128   SUB R1,R1,#1
         A12C   BNEQZ R1,START     Strongly T
         A130   SW 40(R4),R2
[Figure: the same code laid out in execution window columns labeled "Mainline path" and "Disjoint path". One column holds A100 LW R2,20(R4), A104 SUB R2,R2,#1 and A108 BEQZ R2,TAR1; the other holds A10C ADD R2,R2,#4, A110 SW 30(R4),R2, A114 LW R2,28(R4), A118 SUB R2,R2,#8, A11C BEQZ R2,TAR2, A120 SW 20(R4),R2, A124 ADD R2,R2,#10, A128 SUB R1,R1,#1, A12C BNEQZ R1,START and A130 SW 40(R4),R2]
Modeling and Results • Present work utilizes • MIPS-1/MIPS-2 machine • SGI compiler • SPECint 95 (compress, go and ijpeg) and 2000 (bzip2, crafty, gcc, gzip, mcf, parser and vortex) benchmarks • 3 levels of modeling • Trace-driven model (FastLevo) – results in this presentation • Detailed cycle-accurate model (LevoSim) – still under development • Synthesizable VHDL hardware model (HDLevo) – validation • Design space exploration • Impact of D-paths • Real vs. ideal memory • Range of bus latency issues
Modeling Parameters [Two slides of modeling parameter tables; contents not recoverable]
IPC Obtained with Levo [Figure: IPC chart; data not recoverable]
Speedup Obtained Using D-paths versus Single Path Execution (harmonic means) [Figure: speedup chart; data not recoverable]
IPC of Levo Compared to Modeling 100% L1 I/D Hits (harmonic means) [Figure: IPC chart; data not recoverable]
Summary of additional experiments • Varying the L1-D/L2 hit time (versus 1 cycle) • Increased L1-D HT to 2/4/8 cycles = 10/22/43% IPC loss • Increased L2 HT to 2/4/8/16 cycles = 0.8/2.3/4.7/8.9% IPC loss • Varying the number of buses per FU • Decreased to 1 bus/FU = 14% IPC loss • Increased to 4 buses/FU = 3% IPC gain • Removal of stride predictor = 0.8% IPC loss • Varying the number of columns per D-path • Increased to 2 cols/D-path = 8% IPC loss • Use of D-paths = 45% IPC gain • Varying the number of branch prediction tables • Decreased from 1 per row to a single table of the same total size = 0.4% IPC loss
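The results slides quote harmonic means across benchmarks and percent IPC gains or losses relative to a baseline. A short sketch of both calculations follows; the IPC numbers in it are made-up placeholders chosen for easy arithmetic, not results from Levo, and the helper name is hypothetical.

```python
from statistics import harmonic_mean


def ipc_change_percent(baseline_ipc, variant_ipc):
    """Percent IPC change of a variant relative to the baseline.

    Negative values are an IPC loss, positive values an IPC gain.
    """
    return (variant_ipc - baseline_ipc) / baseline_ipc * 100.0


# Placeholder per-benchmark IPC values, NOT measured data.
placeholder_ipcs = [2.0, 4.0]
mean_ipc = harmonic_mean(placeholder_ipcs)   # 2 / (1/2 + 1/4)

# Placeholder baseline and variant IPC for one configuration change.
change = ipc_change_percent(baseline_ipc=4.0, variant_ipc=3.0)

print(round(mean_ipc, 3), change)
```

The harmonic mean is the usual choice for averaging rates such as IPC, since it is dominated by the slowest benchmarks rather than the fastest ones.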
Comments and Future Directions • I-fetch is the main barrier to further gains in IPC • The use of a detailed VHDL model of critical components in Levo has allowed us to design scalable resources • A number of novel microarchitectural features are present in a single design • Future challenges in Levo include: • Improved I-fetch – (EV8, trace cache, dynamic D-paths) • Finish design of an ARB-like memory • Consider compiler support to aid in-order issue and D-path execution • Consider multithreaded extensions to support coarse-grained multithreading
To learn more about Levo, visit: http://www.ece.neu.edu/info/architecture/research/Levo.html Also see our paper at Euro-Par 2002.