Future Superscalar Processors Based on Instruction Compounding. Stamatis Vassiliadis Symposium Sept. 28, 2007 J. E. Smith. Instruction Compounding (Fusing). Instruction compounding, or “fusing” has become a key idea in high performance microprocessors
Future Superscalar Processors Based on Instruction Compounding
Stamatis Vassiliadis Symposium, Sept. 28, 2007
J. E. Smith
Instruction Compounding (Fusing)
Instruction compounding, or “fusing,” has become a key idea in high-performance microprocessors.
“A compound instruction reflects the parallel issue of instructions; it comprises some number of independent instructions or interlocked instructions”
“Instructions composing a compound instruction need not be consecutive.”
-- S. Vassiliadis et al., IBM Journal of R and D, Jan. 1994
Future Microprocessors
The Future Processor: Three Key Aspects
• Instruction compounding or fusing
  • Based on S. Vassiliadis's work
  • Employs compounding and a 3-input ALU
• Co-designed VM for dynamic translation/fusing
  • Concealed from all software
  • Optimized (fused) instructions held in a code cache
• Dual decoder front-end for fast startup
  • Hardware front-end decoder for fast startup
  • Software translator for sustained high performance
Processor Micro-architecture
Fusible Instruction Set
• RISC-ops with unique features:
  • A fusible bit per instruction fuses two dependent instructions
  • Dense instruction encoding, 16/32-bit ISA design
• Special features to support the x86 ISA
  • Condition codes
  • Addressing modes
  • Aware of long immediate & displacement values
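To make the role of the fusible bit concrete, here is a minimal Python sketch of how a decoder might group a RISC-op stream into macro-ops. The slides do not specify the bit's exact semantics or encoding; the sketch assumes that a set bit on an op means "fuse this op with the one that follows it", and the `RiscOp` tuple and `group_macro_ops` helper are hypothetical names used only for illustration.

```python
from typing import List, Tuple

# Hypothetical decoded RISC-op: (assembly text, fusible bit).
# Assumption (not stated in the slides): a set fusible bit on an op
# marks it as the head of a pair whose tail is the next op in the stream.
RiscOp = Tuple[str, bool]


def group_macro_ops(ops: List[RiscOp]) -> List[Tuple[str, ...]]:
    """Group a RISC-op stream into macro-ops of one or two ops each."""
    macro_ops: List[Tuple[str, ...]] = []
    i = 0
    while i < len(ops):
        text, fusible = ops[i]
        if fusible and i + 1 < len(ops):
            # Head and tail travel down the pipeline as a single entity.
            macro_ops.append((text, ops[i + 1][0]))
            i += 2
        else:
            macro_ops.append((text,))
            i += 1
    return macro_ops


stream = [
    ("ADD R18, Redi, 1", True),      # head: fusible bit set
    ("AND Reax, R18, 007f", False),  # tail of the pair
    ("ST R18, mem[R22]", False),     # un-fused op
]
groups = group_macro_ops(stream)
```

Under this assumption a single bit is enough because a pair is always laid out as two adjacent ops in the translated code.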
Microarchitecture: Macro-op Execution
• Enhanced OOO superscalar microarchitecture
• Process & execute fused macro-ops as single instructions throughout the entire pipeline
Macro-op Fusing Algorithm
• Objectives:
  • Maximize fused dependent pairs
  • Simple & fast
• Heuristics:
  • Pipelined scheduler: only single-cycle ALU ops can be a head. Minimize non-fused single-cycle ALU ops.
  • Criticality: fuse instructions that are “close” in the original sequence. ALU-op criticality is easier to estimate.
  • Simplicity: 2 or fewer distinct register operands per fused pair
• Solution: two-pass fusing algorithm
  • The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head
  • The 2nd pass considers all kinds of RISC-ops as tail candidates
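The two-pass scheme can be sketched in Python as a simplified greedy pairing. This is an illustration under stated assumptions, not the translator's actual implementation: registers are assumed to be already renamed so each is written once, the "2 or fewer distinct register operands" rule is read as a limit on source registers, the `WINDOW` distance standing in for "close" is a made-up value, and the `RiscOp`, `can_fuse`, and `fuse` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RiscOp:
    text: str
    kind: str             # "alu" (single-cycle ALU op) or "mem" (load/store)
    dst: Optional[str]    # destination register, None for stores
    srcs: Tuple[str, ...]  # source registers


WINDOW = 8  # hypothetical limit on head-to-tail distance ("close" ops)


def can_fuse(ops: List[RiscOp], h: int, t: int) -> bool:
    head, tail = ops[h], ops[t]
    # Only a single-cycle ALU op can be a head, and the tail must depend on it.
    if head.kind != "alu" or head.dst is None or head.dst not in tail.srcs:
        return False
    # Simplicity heuristic, read here as at most 2 distinct source registers.
    external = set(head.srcs) | (set(tail.srcs) - {head.dst})
    if len(external) > 2:
        return False
    for mid in ops[h + 1:t]:
        # The tail is hoisted up to the head, so its inputs must already exist.
        if mid.dst in tail.srcs:
            return False
        # Conservative assumption: never reorder memory ops past each other.
        if tail.kind == "mem" and mid.kind == "mem":
            return False
    return True


def fuse(ops: List[RiscOp]) -> List[Tuple[int, int]]:
    partner: Dict[int, int] = {}
    pairs: List[Tuple[int, int]] = []
    # Pass 1 takes only ALU ops as tails; pass 2 takes any remaining op.
    for tail_kinds in (("alu",), ("alu", "mem")):
        for t, tail in enumerate(ops):
            if t in partner or tail.kind not in tail_kinds:
                continue
            # Look backward from the tail for the nearest eligible head.
            for h in range(t - 1, max(-1, t - 1 - WINDOW), -1):
                if h not in partner and can_fuse(ops, h, t):
                    partner[h], partner[t] = t, h
                    pairs.append((h, t))
                    break
    return sorted(pairs)


# A six-op sequence with one renamed temporary (R18), for illustration.
EXAMPLE = [
    RiscOp("ADD R18, Redi, 1", "alu", "R18", ("Redi",)),
    RiscOp("ST R18, mem[R22]", "mem", None, ("R18", "R22")),
    RiscOp("LD.zx Rebx, mem[Rebp + Recx << 1]", "mem", "Rebx", ("Rebp", "Recx")),
    RiscOp("AND Reax, R18, 007f", "alu", "Reax", ("R18",)),
    RiscOp("ADD R17, Reax, Resi", "alu", "R17", ("Reax", "Resi")),
    RiscOp("LD Redx, mem[R17 + 0x7c]", "mem", "Redx", ("R17",)),
]
pairs = fuse(EXAMPLE)
```

On this input the first pass pairs the first ADD with the AND three ops later, and the second pass pairs the remaining ADD with the load that consumes its result. Running the ALU-only pass first is what keeps single-cycle ALU ops from being left un-fused when a memory op could have claimed the same head.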
Fusing Algorithm: Example
x86 asm:
  1. lea eax, DS:[edi + 01]
  2. mov [DS:080b8658], eax
  3. movzx ebx, SS:[ebp + ecx << 1]
  4. and eax, 0000007f
  5. mov edx, DS:[eax + esi << 0 + 0x7c]
RISC-ops:
  1. ADD Reax, Redi, 1
  2. ST Reax, mem[R22]
  3. LD.zx Rebx, mem[Rebp + Recx << 1]
  4. AND Reax, 0000007f
  5. ADD R17, Reax, Resi
  6. LD Redx, mem[R17 + 0x7c]
After fusing: Macro-ops
  1. ADD R18, Redi, 1 :: AND Reax, R18, 007f
  2. ST R18, mem[R22]
  3. LD.zx Rebx, mem[Rebp + Recx << 1]
  4. ADD R17, Reax, Resi :: LD Redx, mem[R17+0x7c]
Instruction Fusing Profile
• 55+% of RISC-ops are fused, which increases effective ILP by a factor of about 1.4
• Only 6% single-cycle ALU ops are left un-fused
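One plausible reading of how the two figures relate, sketched in Python: if each fused pair turns two RISC-ops into one pipeline entity, then the number of RISC-ops carried per entity follows directly from the fused fraction. The slides do not give this derivation, and the `ops_per_entity` helper is a hypothetical name.

```python
def ops_per_entity(fused_fraction: float) -> float:
    """RISC-ops carried per pipeline entity when a given fraction is fused."""
    # Un-fused ops occupy one entity each; fused ops share one entity per pair.
    entities_per_op = (1.0 - fused_fraction) + fused_fraction / 2.0
    return 1.0 / entities_per_op


# 55% fused: 0.45 single entities + 0.275 pair entities per RISC-op.
ratio = ops_per_entity(0.55)
```

With 55% of RISC-ops fused, each entity carries about 1.38 RISC-ops on average, which is consistent with the quoted 1.4.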
Other DBT Software Profile
• Of all fused macro-ops:
  • 50% are ALU-ALU pairs
  • 30% are fused condition-test & conditional-branch pairs
  • Others are mostly ALU-MEM op pairs
• Of all fused macro-ops:
  • 70+% are inter-x86-instruction fusions
  • 46% access two distinct source registers
  • Only 15% (6% of all instruction entities) write two distinct destination registers
• Translation overhead profile:
  • About 1000 instructions per translated hotspot instruction
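The parenthetical "6% of all instruction entities" can be checked against the 55% fused-RISC-op figure from the fusing profile. The Python sketch below assumes an "instruction entity" is either one fused pair or one un-fused RISC-op; that definition and the `fused_share_of_entities` helper are assumptions made for illustration.

```python
def fused_share_of_entities(fused_op_fraction: float) -> float:
    """Fraction of pipeline entities that are fused pairs."""
    pairs = fused_op_fraction / 2.0    # two fused RISC-ops form one entity
    singles = 1.0 - fused_op_fraction  # each un-fused RISC-op is one entity
    return pairs / (pairs + singles)


share = fused_share_of_entities(0.55)  # fused pairs as a share of entities
two_dest_share = 0.15 * share          # pairs writing two destination registers
```

Under these assumptions fused pairs make up about 38% of entities, and 15% of that is roughly 5.7%, in line with the 6% quoted above.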
Co-designed x86 Processor Performance
Dual Decoder Front-End
Evaluation: Startup Performance
Activity of HW Assists
Important Research Issues
• Profiling
  • Probe insertion via software translator not feasible
• Multi-core
  • Shared code cache
  • SMT designs
• Memory consistency
  • Stores can be done in-order
  • Re-scheduled loads may be important for performance
• Precise traps
  • Potential HW assist?