300 likes | 405 Views
Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue. David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department of Computer Science * Currently on internship/sabbatical at NVIDIA Research. L2. L2. L2. L2. L2. L2. L2. L2. L2. L2. L2. L2.
E N D
Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department of Computer Science * Currently on internship/sabbatical at NVIDIA Research
L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 Multithreadedscalar IO core 2-wayOO core Motivation Adaptive(Federation) Homogeneous Heterogeneous
Basic Insights • A multithreaded in-order core has many registers which can be reused for a reorder buffer oractive list • If cores are small, single cycle communication between neighbors is feasible • Prior work on making large OOO cores feasible can be applied at the low end to make low-cost OOO possible
Fetch Decode In-order & Out-of-order Pipelines In-order Out-of-order Fetch Bpred Decode Execute Execute Mem Allocate Mem Writeback Rename Writeback Issue Commit
Ready Bits Subscriber Slot 1 Subscriber Slot 2 1 2 3 4 5 Issue Queue Example + 1 1 1 IQ2 IQ3 1 0 1 IQ3 + 2 0 1 0 1 + 3 Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002 Sassone et al., Matrix Scheduler Reloaded, ISCA 2007
Simplified Load-Store Queue • Memory Alias Table (MAT) • No store forwarding • No conservative waiting on stores • Only detect memory order violations after they have occurred and flush the pipeline when the offending instruction commits Amir Roth, Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 2005
MAT 0 0 0 1 0 2 0 3 0 4 5 0 0 6 7 0 MAT Example st 0x13, r5 ld r1, 0x13
MAT 0 0 1 0 0 2 3 1 4 0 5 0 6 0 7 0 MAT Example st 0x13, r5 ld r1, 0x13 EXE ld executes and increments counter
MAT 0 0 1 0 2 0 1 ! 3 0 4 0 5 0 6 7 0 MAT Example st 0x13, r5 COM ld r1, 0x13 st commits and sets flag
MAT 0 0 1 0 2 0 1 ! 3 0 4 0 5 0 6 7 0 MAT Example ld r1, 0x13 COM Flush ld commits, sees flag, and flushes pipeline
MAT 0 0 0 1 0 2 3 0 4 0 5 0 0 6 7 0 MAT Example ld r1, 0x13 MAT is reset and execution resumes
Conclusions • Two in-order cores can be federated at run-time to form a 2-way OO core • Almost doubling IPC of throughput core is possible with very little extra hardware • Don’t want traditional OO structures because their performance comes at too high a price • Best combined area- and energy-efficiency
Core Fusion Data Figure from Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors” , ISCA 2007
Overall Results • Scalar in-order core is 8KB I/D, 256KB L2 • Base 2-way core has 16KB I and D-Caches, 256KB L2, 32 entry ROB, 16 entry issue queue, 16 entry LSQ, bimodal bpred • 4-way core is 32KB I/D, 2MB L2, 128 entry ROB, 32 IQ and LSQ, tournament bpred
Branch Prediction • Use only a Next Line and Set (NLS) predictor, Bimodal predictor and a Return Address Stack (RAS) • NLS ok if your instruction working set not > I$ size • Small bimodal predictor ik ok for small window processor
Fetch • Two I$’s act as a I$ of twice the size and associativity (and random replacement) • More logic and buffers to capture two instructions • Extra cycle to route instructions from two I$’s to two decoders
Decode • Cancel second instruction if first turns out to be branch • Extra cycle to route decoded instructions to new allocate stage
Allocate • New logic and free lists to allocate ROB, IQ entries
Rename • New table since it has too many ports • One, centralized rename table, not distributed • Has separate table (or field in each RAT entry) for each registers producer instructions IQ-slot number (see our new issue queue)
Issue • Uses a simple lookup table as wakeup structure, where instructions subscribe to their input instructions (explained in detail later) • Centralized, one IQ for the two cores
Register File • Register file is mirrored in the two cores • No extra copy instructions or load-balancing questions
Execute • Add extra cycle for copying result to other core’s register file (like EV6)
Memory Access • The two D$s are checked in parallel, each responsible for half of the merged D$’s ways • No standard LSQ, only a Memory Alias Table (details later) • Only detects ordering violations and send signal to pipeline
Commit • Centralized commit, no slippage • Recover from branch mispredictions since no checkpoints of RAT on branches • Recover from memory order violations (or false positives) from MAT