Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue

Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department of Computer Science * Currently on internship/sabbatical at NVIDIA Research

L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 L2 Multithreadedscalar IO core 2-wayOO core Motivation Adaptive(Federation) Homogeneous Heterogeneous

Basic Insights • A multithreaded in-order core has many registers which can be reused for a reorder buffer oractive list • If cores are small, single cycle communication between neighbors is feasible • Prior work on making large OOO cores feasible can be applied at the low end to make low-cost OOO possible

Fetch Decode In-order & Out-of-order Pipelines In-order Out-of-order Fetch Bpred Decode Execute Execute Mem Allocate Mem Writeback Rename Writeback Issue Commit

Ready Bits Subscriber Slot 1 Subscriber Slot 2 1 2 3 4 5 Issue Queue Example + 1 1 1 IQ2 IQ3 1 0 1 IQ3 + 2 0 1 0 1 + 3 Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002 Sassone et al., Matrix Scheduler Reloaded, ISCA 2007

Simplified Load-Store Queue • Memory Alias Table (MAT) • No store forwarding • No conservative waiting on stores • Only detect memory order violations after they have occurred and flush the pipeline when the offending instruction commits Amir Roth, Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 2005

MAT 0 0 0 1 0 2 0 3 0 4 5 0 0 6 7 0 MAT Example st 0x13, r5 ld r1, 0x13

MAT 0 0 1 0 0 2 3 1 4 0 5 0 6 0 7 0 MAT Example st 0x13, r5 ld r1, 0x13 EXE ld executes and increments counter

MAT 0 0 1 0 2 0 1 ! 3 0 4 0 5 0 6 7 0 MAT Example st 0x13, r5 COM ld r1, 0x13 st commits and sets flag

MAT 0 0 1 0 2 0 1 ! 3 0 4 0 5 0 6 7 0 MAT Example ld r1, 0x13 COM Flush ld commits, sees flag, and flushes pipeline

MAT 0 0 0 1 0 2 3 0 4 0 5 0 0 6 7 0 MAT Example ld r1, 0x13 MAT is reset and execution resumes

Performance Impact

Performance

Energy Efficiency

Area Efficiency

Conclusions • Two in-order cores can be federated at run-time to form a 2-way OO core • Almost doubling IPC of throughput core is possible with very little extra hardware • Don’t want traditional OO structures because their performance comes at too high a price • Best combined area- and energy-efficiency

Q & A

Backup

Core Fusion Data Figure from Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors” , ISCA 2007

Overall Results • Scalar in-order core is 8KB I/D, 256KB L2 • Base 2-way core has 16KB I and D-Caches, 256KB L2, 32 entry ROB, 16 entry issue queue, 16 entry LSQ, bimodal bpred • 4-way core is 32KB I/D, 2MB L2, 128 entry ROB, 32 IQ and LSQ, tournament bpred

Branch Prediction • Use only a Next Line and Set (NLS) predictor, Bimodal predictor and a Return Address Stack (RAS) • NLS ok if your instruction working set not > I$ size • Small bimodal predictor ik ok for small window processor

Fetch • Two I$’s act as a I$ of twice the size and associativity (and random replacement) • More logic and buffers to capture two instructions • Extra cycle to route instructions from two I$’s to two decoders

Decode • Cancel second instruction if first turns out to be branch • Extra cycle to route decoded instructions to new allocate stage

Allocate • New logic and free lists to allocate ROB, IQ entries

Rename • New table since it has too many ports • One, centralized rename table, not distributed • Has separate table (or field in each RAT entry) for each registers producer instructions IQ-slot number (see our new issue queue)

Issue • Uses a simple lookup table as wakeup structure, where instructions subscribe to their input instructions (explained in detail later) • Centralized, one IQ for the two cores

Register File • Register file is mirrored in the two cores • No extra copy instructions or load-balancing questions

Execute • Add extra cycle for copying result to other core’s register file (like EV6)

Memory Access • The two D$s are checked in parallel, each responsible for half of the merged D$’s ways • No standard LSQ, only a Memory Alias Table (details later) • Only detects ordering violations and send signal to pipeline

Commit • Centralized commit, no slippage • Recover from branch mispredictions since no checkpoints of RAT on branches • Recover from memory order violations (or false positives) from MAT

Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue

Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue

Presentation Transcript

In-Order vs. Out-of-Order

Batting Out of Order

Order out of Chaos

Issue a Warning Order

Batting Out of Order

Multiple Instruction Issue

Tomasulo Scheduling for Out-Of-Order Execution

Content Repurposing

ProfileMe : Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors

Out-of-Order Execution

Instruction Issue Logic

Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors

Hardware Support for Out-of-Order Instruction Profiling on Alpha 21264a

Two-issue Super Scalar CPU

Is Out-Of-Order Out Of Date ?

Correlation Between State and Control Signals in Out-of-order Issue Logic

Batting Out of Order

Correlation Between State and Control Signals in Out-of-Order Issue Logic

In-order vs. Out-of-order Execution

Out-of-Order Speculative Execution