An Approach for Implementing Efficient Superscalar CISC Processors • Shiliang Hu, Ilhyun Kim, Mikko Lipasti, James Smith • Presenter: Ilhyun Kim
Processor Design Challenges • CISC challenges: suboptimal internal micro-ops • Complex decoders & obsolete features/instructions • Instruction count expansion: 40% to 50% (instruction management, communication, …) • Redundancy & inefficiency in the cracked micro-ops • Solution: dynamic optimization • Other current challenges (CISC & RISC) • Efficiency: nowadays, less performance gain per transistor • Power consumption has become acute • Solution: novel, efficient microarchitectures HPCA 2006, Austin, TX
Solution: Architecture Innovations • Co-designed VM paradigm: software in the architected ISA (OS, drivers, library code, apps) is mapped by dynamic translation from the architected ISA (e.g., x86) to the implementation ISA (e.g., the fusible ISA) executed by the HW implementation (processors, memory system, I/O devices) • Diagram: a conventional HW design feeds the pipeline through hardware decoders; the VM paradigm feeds it through a software binary translator and a code cache ($) • ISA mapping: • Hardware: simple translation, good for startup performance • Software: dynamic optimization, good for hotspots • Can we combine the advantages of both? • Startup: fast, simple translation • Steady state: intelligent translation/optimization for hotspots
Microarchitecture: Macro-op Execution • Enhanced OoO superscalar microarchitecture • Processes & executes fused macro-ops as single instructions throughout the entire pipeline • Analogy: all lanes car-pool on the highway, reducing congestion while keeping throughput high, AND raising the speed limit from 65 mph to 80 mph • Pipeline diagram: Fetch, Align/Fuse, Decode, Rename, Dispatch, Wakeup, Select, RF, EXE, MEM, WB, Retire, with a fuse bit carried from decode, collapsed 3-1 ALUs, and shared cache ports HPCA 2006, Austin, TX
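To make the single-entity idea concrete, below is a minimal C++ data-structure sketch (my own illustration under assumed field layouts, not the authors' design): a fused macro-op bundles a head micro-op and an optional tail yet occupies exactly one slot in window structures, which is where the reduction in instruction traffic and management overhead comes from.

#include <cstdint>
#include <optional>
#include <vector>

struct MicroOp {
    uint16_t opcode;          // fusible-ISA opcode (value is illustrative)
    int8_t   dst;             // destination register, -1 if none
    int8_t   src0, src1;      // source registers, -1 if unused
    bool     singleCycleAlu;  // single-cycle ALU ops may serve as the head
};

struct MacroOp {
    MicroOp head;                 // always present
    std::optional<MicroOp> tail;  // present only when the fusible bit paired it with the head
    bool fused() const { return tail.has_value(); }
};

// One issue-window entry per macro-op: the fused pair is renamed, woken up,
// selected, and retired as a single unit.
struct WindowEntry {
    MacroOp op;
    bool ready;
};

int main() {
    MicroOp addOp = {0x01, /*dst=*/18, /*src=*/7, -1, true};   // head: ADD
    MicroOp andOp = {0x02, /*dst=*/3, /*src=*/18, -1, true};   // tail: AND, consumes the head's result
    std::vector<WindowEntry> window;
    window.push_back({MacroOp{addOp, andOp}, false});
    return window.size() == 1 ? 0 : 1;  // the dependent pair holds exactly one slot
}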
Related Work: x86 processors • AMD K7/K8 microarchitecture • Macro-Operations • High performance, efficient pipeline • Intel Pentium M • Micro-op fusion. • Stack manager. • High performance, low power. • Transmeta x86 processors • Co-Designed x86 VM • VLIW engine + code morphing software. HPCA 2006, Austin, TX
Related Work • Co-designed VM: IBM DAISY, BOA • Full system translator on tree regions + VLIW engine • Other research projects: e.g. DBT for ILDP • Macro-op execution • ILDP, Dynamic Strands, Dataflow Mini-graph, CCG. • Fill Unit, SCISM, rePLay, PARROT. • Dynamic Binary Translation / Optimization • SW based: (Often user mode only) UQBT, Dynamo (RIO), IA-32 EL. Java and .NET HLL VM runtime systems • HW based: Trace cache fill units, rePLay, PARROT, etc HPCA 2006, Austin, TX
Co-designed x86 processor architecture • Diagram: x86 code reaches the pipeline along two paths: (1) vertically, through the hardware x86 decoder (cracking into micro-/macro-ops), and (2) horizontally, through the VM translation/optimization software, which fills a code cache of macro-ops in the concealed part of the memory hierarchy; either path feeds the I-$, rename/dispatch, issue buffer, and EXE backend • Co-designed virtual machine paradigm • Startup: simple hardware decode/crack for fast translation • Steady state: dynamic software translation/optimization for hotspots HPCA 2006, Austin, TX
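The startup/steady-state split can be pictured as a small control loop around a hotspot counter. The sketch below is an assumption-laden illustration (the threshold value, the profiling structure, and the function names are all invented), not the authors' VMM code:

#include <cstdint>
#include <iostream>
#include <unordered_map>

constexpr unsigned kHotThreshold = 50;   // assumed value; the real threshold is a design parameter

enum class Mode { HardwareDecode, CodeCache };

struct BlockProfile { unsigned execCount = 0; bool translated = false; };

std::unordered_map<uint64_t, BlockProfile> profile;   // keyed by x86 block address

Mode dispatch(uint64_t blockPc) {
    BlockProfile& bp = profile[blockPc];
    if (bp.translated)
        return Mode::CodeCache;          // steady state: run optimized fused macro-ops
    if (++bp.execCount >= kHotThreshold) {
        // Here the VMM would form a superblock and run the DBT/fusing pipeline;
        // this placeholder just flips the flag.
        bp.translated = true;
    }
    return Mode::HardwareDecode;         // startup: simple hardware decode/crack
}

int main() {
    for (int i = 0; i < 60; ++i) {
        Mode m = dispatch(0x080b8000);   // hypothetical block address
        if (i == 0 || m == Mode::CodeCache)
            std::cout << "iteration " << i << ": "
                      << (m == Mode::CodeCache ? "code cache" : "hardware decode") << "\n";
    }
    return 0;
}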
Fusible Instruction Set • RISC-ops with unique features: • A fusible bit per instruction for fusing • Dense encoding, 16/32-bit ISA • Special features to support x86: • Condition codes • Addressing modes • Aware of long immediate values • Core 32-bit instruction formats (each with a fusible bit F): 10b opcode + 21-bit immediate/displacement; 10b opcode + 5b Rds + 16-bit immediate/displacement; 10b opcode + 5b Rds + 5b Rsrc + 11b immediate/displacement; 16-bit opcode + 5b Rds + 5b Rsrc • Add-on 16-bit instruction formats for code density: 5b opcode + 10b immediate/displacement; 5b opcode + 5b Rds + 5b immediate; 5b opcode + 5b Rds + 5b Rsrc
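As a concrete illustration of the 32-bit core format with a fusible bit, a 10-bit opcode, two 5-bit registers, and an 11-bit immediate, here is a small encoder sketch; the slide only gives field widths, so the bit positions, opcode value, and register numbers below are my assumptions:

#include <cassert>
#include <cstdint>

// Assumed packing: [31] fusible bit | [30:21] 10-bit opcode | [20:16] Rds | [15:11] Rsrc | [10:0] imm/disp
constexpr uint32_t encode32(bool fusible, uint32_t opcode10,
                            uint32_t rds5, uint32_t rsrc5, uint32_t imm11) {
    return (uint32_t(fusible) << 31) |
           ((opcode10 & 0x3FFu) << 21) |
           ((rds5     & 0x1Fu)  << 16) |
           ((rsrc5    & 0x1Fu)  << 11) |
           (imm11     & 0x7FFu);
}

int main() {
    // A fusible load such as "LD R17, mem[Reax + 0x7c]" from the later example,
    // with invented opcode and register numbers.
    uint32_t word = encode32(true, 0x0A3, /*Rds=*/17, /*Rsrc=*/1, /*disp=*/0x7C);
    assert((word >> 31) == 1u);                 // fusible bit survives the round trip
    assert(((word >> 21) & 0x3FFu) == 0x0A3u);  // opcode round-trips
    assert((word & 0x7FFu) == 0x7Cu);           // displacement round-trips
    return 0;
}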
Macro-op Fusing Algorithm • Objectives: • Maximize fused dependent pairs • Keep it simple & fast • Heuristics: • Pipelined scheduler: only single-cycle ALU ops can be a head; minimize non-fused single-cycle ALU ops • Criticality: fuse instructions that are “close” in the original sequence; ALU-op criticality is easier to estimate • Simplicity: 2 or fewer distinct register operands per fused pair • Solution: a two-pass fusing algorithm (see the sketch below): • The 1st pass scans forward and prioritizes ALU ops, i.e., for each ALU-op tail candidate it looks backward in the scan for its head • The 2nd pass considers all kinds of RISC-ops as tail candidates
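A simplified reconstruction of the two-pass scan is sketched below; it is not the authors' implementation, and the operand-count rule is interpreted here as "at most two distinct source registers read from outside the pair". The small main() replays the example from the next slide and arrives at the same two fused pairs.

#include <cstddef>
#include <set>
#include <vector>

struct Op {
    int dst;                 // destination register, -1 if none
    std::vector<int> srcs;   // source registers
    bool singleCycleAlu;     // eligible as a macro-op head
    int  fusedWith = -1;     // index of the partner op, -1 if unfused
};

static bool dependsOn(const Op& tail, const Op& head) {
    for (int s : tail.srcs)
        if (head.dst != -1 && s == head.dst) return true;
    return false;
}

static bool canFuse(const Op& head, const Op& tail) {
    if (!head.singleCycleAlu) return false;            // pipelined-scheduler rule: head is a 1-cycle ALU op
    std::set<int> regs(tail.srcs.begin(), tail.srcs.end());
    regs.insert(head.srcs.begin(), head.srcs.end());
    regs.erase(head.dst);                               // head's value is consumed inside the pair
    return regs.size() <= 2;                            // at most 2 distinct external source registers
}

static void fusePass(std::vector<Op>& ops, bool aluTailsOnly) {
    for (std::size_t t = 0; t < ops.size(); ++t) {
        if (ops[t].fusedWith != -1) continue;
        if (aluTailsOnly && !ops[t].singleCycleAlu) continue;
        for (int h = int(t) - 1; h >= 0; --h) {         // backward pairing: nearest head first (criticality)
            if (ops[h].fusedWith != -1) continue;
            if (dependsOn(ops[t], ops[h]) && canFuse(ops[h], ops[t])) {
                ops[h].fusedWith = int(t);
                ops[t].fusedWith = h;
                break;
            }
        }
    }
}

void fuseSuperblock(std::vector<Op>& ops) {
    fusePass(ops, /*aluTailsOnly=*/true);    // pass 1: prioritize ALU-ALU pairs
    fusePass(ops, /*aluTailsOnly=*/false);   // pass 2: allow MEM/branch tails
}

int main() {
    // Registers numbered arbitrarily: eax=1, edi=2, ebx=3, ebp=4, ecx=5, esi=6, edx=7, R17=17, R22=22.
    std::vector<Op> ops = {
        {1,  {2},     true},   // ADD  eax, edi, 1
        {-1, {1, 22}, false},  // ST   eax -> mem[R22]
        {3,  {4, 5},  false},  // LD.zx ebx <- mem[ebp + ecx<<1]
        {1,  {1},     true},   // AND  eax, 0x7f
        {17, {1, 6},  true},   // ADD  R17, eax, esi
        {7,  {17},    false},  // LD   edx <- mem[R17 + 0x7c]
    };
    fuseSuperblock(ops);
    // Expect the pairs (ADD::AND) and (ADD::LD), as on the example slide.
    return (ops[0].fusedWith == 3 && ops[4].fusedWith == 5) ? 0 : 1;
}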
Fusing Algorithm: Example
x86 asm:
1. lea eax, DS:[edi + 01]
2. mov [DS:080b8658], eax
3. movzx ebx, SS:[ebp + ecx << 1]
4. and eax, 0000007f
5. mov edx, DS:[eax + esi << 0 + 0x7c]
RISC-ops:
1. ADD Reax, Redi, 1
2. ST Reax, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND Reax, 0000007f
5. ADD R17, Reax, Resi
6. LD Redx, mem[R17 + 0x7c]
After fusing: Macro-ops
1. ADD R18, Redi, 1 :: AND Reax, R18, 007f
2. ST R18, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD R17, Reax, Resi :: LD Redx, mem[R17 + 0x7c]
HPCA 2006, Austin, TX
Instruction Fusing Profile • More than 55% of RISC-ops are fused, increasing effective ILP by a factor of about 1.4 • Only 6% of single-cycle ALU ops are left un-fused HPCA 2006, Austin, TX
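One hedged way to connect the two numbers (my back-of-the-envelope reading, not a calculation from the paper): if 55% of RISC-ops are folded into pairs, every fused pair occupies one macro-op slot, so the same pipeline width carries roughly 1.4x as many RISC-ops:

#include <iostream>

int main() {
    double fusedFraction = 0.55;                        // fraction of RISC-ops that end up fused (slide number)
    double entitiesPerOp = 1.0 - fusedFraction / 2.0;   // each fused pair of 2 ops becomes 1 macro-op
    std::cout << "RISC-ops carried per macro-op slot: " << 1.0 / entitiesPerOp << "\n";  // ~1.38x
    return 0;
}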
Processor Pipeline • Baseline x86 pipeline: Fetch, Align, Decode1, Decode2, Decode3, Rename, Dispatch, Wakeup, Select, Payload, RF, EXE, WB, Retire • Macro-op pipeline: Fetch, Align/Fuse, Decode, Rename, Dispatch, Wakeup, Select, Payload, RF, EXE, WB, Retire • Pipelined 2-cycle issue logic; reduced instruction traffic throughout; reduced forwarding; pipelined scheduler • Macro-op pipeline for efficient hotspot execution • Executes macro-ops: higher IPC and higher clock speed potential • Shorter pipeline front-end
Co-designed x86 pipeline front-end • Diagram: a 16-byte fetch block flows through Align/Fuse, Decode, Rename, and Dispatch; dependent RISC-ops in the fetched group (e.g., ops 1-6) are paired during Align/Fuse, so each fused pair occupies a single slot (slot 0, slot 1, slot 2) from decode onward HPCA 2006, Austin, TX
Co-designed x86 pipeline back-end • Diagram of the 2-cycle macro-op scheduler: Wakeup then Select across three issue ports (lanes 0-2); per lane, a dual-entry payload RAM and 2 register-file read ports; the EXE stage uses collapsed 3-1 ALUs (ALU0-ALU2); the WB/Mem stage has two memory ports (Mem Port 0, Mem Port 1) HPCA 2006, Austin, TX
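The benefit of pairing a pipelined 2-cycle scheduler with fusing can be seen with a toy latency count (my illustration, not the paper's simulator): on a chain of dependent single-cycle ALU ops, fusing restores the back-to-back throughput that a naive 2-cycle scheduler would otherwise lose.

#include <iostream>

int main() {
    const int n = 8;                      // chain of n dependent single-cycle ALU ops
    int atomic1cycle  = n;                // atomic 1-cycle scheduler: one op issued per cycle
    int pipelined2cyc = 2 * n;            // naive pipelined 2-cycle scheduler: dependent wakeup->select spans 2 cycles
    int fused2cyc     = 2 * (n / 2);      // fused pairs: each 2-cycle selection issues two dependent ops
    std::cout << "atomic 1-cycle scheduler  : " << atomic1cycle  << " cycles\n"
              << "pipelined 2-cycle, unfused: " << pipelined2cyc << " cycles\n"
              << "pipelined 2-cycle, fused  : " << fused2cyc     << " cycles\n";
    return 0;
}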
Experimental Evaluation • x86vm: experimental framework for exploring the co-designed x86 virtual machine paradigm • Proposed co-designed x86 processor: a specific instantiation of the framework • Software components: VMM – DBT, code caches, VM runtime control and resource management system (some source code extracted from BOCHS 2.2) • Hardware components: microarchitecture timing simulators – baseline OoO superscalar, macro-op execution, etc. • Benchmarks: SPEC2000 integer HPCA 2006, Austin, TX
Performance Contributors • Many factors contribute to the IPC improvement: • Code straightening • Macro-op fusing and execution • Shorter pipeline front-end (reduced branch penalty) • Collapsed 3-1 ALUs (branches & addresses resolve sooner) • Besides the baseline and macro-op models, we model three intermediate configurations: • M0: baseline + code cache • M1: M0 + macro-op fusing • M2: M1 + shorter pipeline front-end (macro-op mode) • Macro-op: M2 + collapsed 3-1 ALUs HPCA 2006, Austin, TX
Performance Contributors: SPEC2000 HPCA 2006, Austin, TX
Conclusions • Architecture enhancement • The hardware/software co-designed paradigm enables novel designs & more desirable system features • Fusing dependent instruction pairs collapses the dataflow graph and increases ILP • Complexity effectiveness • Pipelined 2-cycle instruction scheduler • Significantly reduced ALU value forwarding network • DBT software reduces hardware complexity • Power consumption implications • Reduced pipeline width • Reduced inter-instruction communication and instruction management HPCA 2006, Austin, TX
Finale – Questions & Answers • Suggestions and comments are welcome. Thank you! HPCA 2006, Austin, TX
Outline • Motivation & Introduction • Processor Microarchitecture Details • Evaluation & Conclusions HPCA 2006, Austin, TX
Performance Simulation Configuration HPCA 2006, Austin, TX
Fuse Macro-ops: An Illustrative Example HPCA 2006, Austin, TX
Translation Framework Dynamic binary translation framework (a skeleton of this flow is sketched below): 1. Form a hotspot superblock; crack x86 instructions into RISC-style micro-ops 2. Perform cluster analysis of embedded long immediate values and assign them to registers if necessary 3. Generate RISC-ops (IR form) in the implementation ISA 4. Construct the DDG (data dependence graph) for the superblock 5. Fusing algorithm: scan for dependent pairs to fuse; forward scan, backward pairing; two passes to prioritize ALU ops 6. Assign registers; reorder fused dependent pairs together, extend live ranges for precise traps, use consistent state mapping at superblock exits 7. Generate code into the code cache HPCA 2006, Austin, TX
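The skeleton below mirrors the seven stages as a sequence of stub functions; every name and signature is invented for illustration, and only the ordering follows the list above.

#include <vector>

struct X86Inst {};  struct MicroOp {};  struct MacroOp {};
struct Superblock { std::vector<X86Inst> x86; };
struct DDG {};

std::vector<MicroOp> crack(const Superblock&)                     { return {}; }  // step 1
void clusterLongImmediates(std::vector<MicroOp>&)                 {}              // step 2
void lowerToFusibleISA(std::vector<MicroOp>&)                     {}              // step 3
DDG  buildDDG(const std::vector<MicroOp>&)                        { return {}; }  // step 4
std::vector<MacroOp> fusePairs(std::vector<MicroOp>&, const DDG&) { return {}; }  // step 5
void allocateAndReorder(std::vector<MacroOp>&)                    {}              // step 6
void emitToCodeCache(const std::vector<MacroOp>&)                 {}              // step 7

void translateHotspot(const Superblock& sb) {
    auto uops = crack(sb);                // crack x86 into RISC-style micro-ops
    clusterLongImmediates(uops);          // pull long immediates into registers if needed
    lowerToFusibleISA(uops);              // generate fusible-ISA IR
    DDG ddg = buildDDG(uops);             // data-dependence graph for the superblock
    auto mops = fusePairs(uops, ddg);     // two-pass forward scan, backward pairing
    allocateAndReorder(mops);             // registers, live ranges, precise-trap state mapping
    emitToCodeCache(mops);                // place translated macro-ops in the code cache
}

int main() {
    translateHotspot(Superblock{});       // trivial driver; real input would be a hot superblock
    return 0;
}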
Other DBT Software Profile • Of all fused macro-ops: • 50% are ALU-ALU pairs • 30% are fused condition-test & conditional-branch pairs • The rest are mostly ALU-MEM pairs • Of all fused macro-ops: • 70+% fuse ops from different x86 instructions • 46% access two distinct source registers • Only 15% (6% of all instruction entities) write two distinct destination registers • Translation overhead profile: roughly 1000+ instructions per translated hotspot instruction HPCA 2006, Austin, TX
Dependence Cycle Detection • All cases generalize to case (c) because of the anti-scan fusing heuristic HPCA 2006, Austin, TX
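For readers without the figure, the hazard the heuristic guards against can be stated in code: fusing head i with tail j must be rejected if some intermediate op k both consumes i's result and feeds j, since collapsing i and j into one node would create a cycle the scheduler cannot order. The check below is my reconstruction, not the authors' exact test:

#include <vector>

// dep[k] lists the ops that op k directly depends on (its producers).
using DepGraph = std::vector<std::vector<int>>;

static bool dependsOn(const DepGraph& dep, int consumer, int producer) {
    for (int p : dep[consumer])
        if (p == producer || dependsOn(dep, p, producer)) return true;
    return false;
}

bool wouldCreateCycle(const DepGraph& dep, int head, int tail) {
    for (int k = head + 1; k < tail; ++k)
        if (dependsOn(dep, k, head) && dependsOn(dep, tail, k)) return true;
    return false;
}

int main() {
    // Ops 0..2: op1 depends on op0; op2 depends on op0 and op1.
    DepGraph dep = { {}, {0}, {0, 1} };
    // Fusing head 0 with tail 2 would collapse a node that op1 both reads from
    // and feeds, creating a cycle, so the pair must be rejected.
    return wouldCreateCycle(dep, 0, 2) ? 0 : 1;
}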
HST Back-end Profile • Light-weight optimizations (ProcLongImm, DDG setup, encode): tens of instructions each • Heavy-weight optimizations (micro-op translation, fusing, code generation): no single one dominates • Chart: translation overhead per x86 instruction, including the initial load from disk HPCA 2006, Austin, TX
Hotspot Coverage vs. runs HPCA 2006, Austin, TX
Hotspots Detected vs. runs HPCA 2006, Austin, TX
Performance Evaluation: SPEC2000 HPCA 2006, Austin, TX
Performance evaluation (WSB2004) HPCA 2006, Austin, TX
Performance Contributors (WSB2004) HPCA 2006, Austin, TX
Future Directions • Co-designed virtual machine technology: • Confidence: more realistic benchmark studies, important for whole-workload behavior such as hotspot behavior and the impact of context switches • Enhancement: more synergistic, complexity-effective HW/SW co-design techniques • Application: specific enabling techniques for specific novel computer architectures of the future • Example co-designed x86 processor design: • Confidence study as above • Enhancement: HW microarchitecture: reduce register write ports; VMM: more dynamic optimizations in HST, e.g., CSE, a software stack manager, SIMDification HPCA 2006, Austin, TX