KILO-INSTRUCTION PROCESSORS

Arzucan Özgür
Department of Computer Engineering, Boğaziçi University
15.12.2005, Cmpe 511

Introduction

Memory Wall
• Performance improvements of high-frequency microprocessors are seriously limited by main memory access latencies
[Figure: CPU vs. RAM performance over time, 1980-2000, on a logarithmic performance axis. CPU performance (“Moore’s Law”) grows 60%/yr., RAM performance 7%/yr.; the processor-memory performance gap grows 50%/year.]

Reducing Memory Latency

[Figure: processor pipeline (Next IP, Fetch, Drive, Alloc., Rename, Queue, Schedule, Dispatch, Reg. Read, Execute, Flags, Br. chk, Drive) connected to L1 Instr., L1 Data, L2, and Memory, with the branch misprediction path marked.]
• Cache memory hierarchies
  • First level (L1) cache is built into the processor core and takes 1-3 processor clock cycles to access
  • If there is a miss in the L1 cache → the on-chip L2 cache is accessed, on the order of 10 processor cycles
  • Accessing main memory takes at least on the order of 100 processor cycles
• Prefetching data from memory to the cache
  • Prefetch addresses are hard to predict
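The latencies above can be combined into an average memory access time. The following is a minimal sketch, not part of the original slides: the latencies (2, 10, and 100 cycles) follow the orders of magnitude given above, while the miss rates (5% in L1, 20% in L2) and the function name `average_access_time` are assumptions chosen only for illustration.

```python
def average_access_time(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    """Average memory access time (cycles) for a two-level cache hierarchy."""
    # Cost of an L1 miss: the L2 access, plus main memory if L2 also misses.
    l1_miss_penalty = l2_hit + l2_miss_rate * mem_latency
    return l1_hit + l1_miss_rate * l1_miss_penalty

# Latencies follow the slide's orders of magnitude; miss rates are assumed.
amat = average_access_time(l1_hit=2, l1_miss_rate=0.05,
                           l2_hit=10, l2_miss_rate=0.20,
                           mem_latency=100)
print(f"Average access time: {amat:.2f} cycles")
```

Even with these modest assumed miss rates, the rare accesses that reach main memory account for a large share of the average, which is why the long-latency misses are the ones worth hiding.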

Out-of-order superscalar processors

Sequence of instructions containing data cache misses

Kilo-Instruction Processors

Definition
• An out-of-order superscalar processor that supports thousands of “in-flight instructions”
• Intelligent use of resources

Scalability
• Thousands of in-flight instructions and in-order commit make designs impractical:
  • ROB: needs to maintain a copy of every in-flight instruction
  • IQs: instructions depending on long-latency instructions remain in these queues for a long time
  • LSQs: instructions remain in the queue until commit
  • Registers: a new physical register for each instruction producing a new value
• We would like to get the IPC of thousands of in-flight instructions without drastically increasing resource requirements
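A back-of-the-envelope estimate shows where the "thousands" come from. With in-order commit, a load that misses in the cache blocks the head of the ROB, so the processor can keep issuing only while the window still has room. The sketch below is an illustrative assumption, not a formula from the slides: it multiplies a target IPC by the miss latency to get the window size needed to stay busy for the whole miss. The IPC of 4 and the 500-cycle latency are hypothetical values.

```python
def window_needed(target_ipc, miss_latency_cycles):
    """Instructions that must be in flight to keep issuing during one miss."""
    return target_ipc * miss_latency_cycles

# target_ipc=4 and the 500-cycle case are assumed values, not from the slides.
for latency in (100, 500):
    size = window_needed(4, latency)
    print(f"{latency}-cycle miss -> about {size} in-flight instructions")
```

Under these assumptions the window grows from hundreds to thousands of entries as memory latency grows, and every ROB, queue, and register structure listed above would have to scale with it in a conventional design.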

Efficient Kilo-Instruction Processor Design
• Multi-Checkpointing the ROB
• Out-of-Order Commit
• Early Release of Resources
• Ephemeral Registers
• Load Queues

Checkpointing

Checkpointing
• The ROB allows restoration of the correct state at any instruction (this per-instruction granularity is not necessary)
• Checkpoint: a snapshot of the processor state taken at a specific instruction of the program being executed (processor state is checkpointed for only a subset of instructions)
• With this snapshot, the processor can restore state to that point in case of an exception or misprediction

Design Decisions
• How many in-flight checkpoints should be maintained by the processor?
  • A large number of checkpoints reduces the penalty of the recovery process
  • A large number of checkpoints increases the implementation cost
• What kind of instructions should be checkpointed?
  • A checkpoint can be taken at any instruction
  • Some instructions are better candidates (e.g., some current processors take checkpoints at branch instructions in order to minimize the branch misprediction penalty)
• How much information should be kept by each checkpoint?
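The checkpoint and recovery idea can be sketched in a few lines. This is a simplified model under stated assumptions, not the design from the cited papers: a checkpoint here stores only a copy of the register rename map, and the class name `CheckpointedRenameMap` and its methods are hypothetical.

```python
class CheckpointedRenameMap:
    """Rename map with a bounded number of in-flight checkpoints (sketch)."""

    def __init__(self, num_arch_regs, max_checkpoints):
        self.map = list(range(num_arch_regs))  # architected -> physical register
        self.max_checkpoints = max_checkpoints
        self.checkpoints = []                  # (instruction number, saved map)

    def take_checkpoint(self, seq):
        """Snapshot the map at instruction seq; fails if no checkpoint is free."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints.append((seq, list(self.map)))
        return True

    def rename(self, arch_reg, phys_reg):
        self.map[arch_reg] = phys_reg

    def rollback(self, faulting_seq):
        """Restore the youngest checkpoint at or before faulting_seq.

        Returns the instruction number from which execution must restart.
        """
        while self.checkpoints and self.checkpoints[-1][0] > faulting_seq:
            self.checkpoints.pop()             # discard younger checkpoints
        if not self.checkpoints:
            raise RuntimeError("no checkpoint covers this instruction")
        seq, saved = self.checkpoints[-1]
        self.map = list(saved)
        return seq

rm = CheckpointedRenameMap(num_arch_regs=4, max_checkpoints=2)
rm.take_checkpoint(0)
rm.rename(1, 10)            # instruction 1 writes architected register 1
rm.take_checkpoint(5)
rm.rename(2, 11)            # instruction 6 writes architected register 2
restart = rm.rollback(7)    # exception at instruction 7
print(restart, rm.map)      # restart from 5 with the map as it was there
```

In this example the processor re-executes instructions 5 to 7 after recovery. With only the checkpoint at instruction 0 it would re-execute 0 to 7, which illustrates the trade-off between the number of checkpoints and the recovery penalty.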

Multicheckpointing

Selective Checkpointing
• Replace ROB → Pseudo-ROB
• The processor removes instructions that reach the pseudo-ROB’s head at a fixed rate
• Processor state is recoverable for any instruction in the pseudo-ROB
• A checkpoint is taken when an incomplete instruction leaves the pseudo-ROB
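The checkpoint trigger described above can be sketched as a small FIFO. This is a simplified illustration under assumptions: the class `PseudoROB` is hypothetical, removal "at a fixed rate" is approximated by pushing out the oldest entry whenever the queue is full, and an instruction's completion status is fixed when it is inserted.

```python
from collections import deque

class PseudoROB:
    """FIFO that checkpoints when an incomplete instruction leaves (sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()    # (instruction number, completed flag)
        self.checkpoints = []     # instruction numbers where a checkpoint was taken

    def insert(self, seq, completed):
        if len(self.entries) == self.capacity:
            self.retire_head()    # the oldest instruction leaves to make room
        self.entries.append((seq, completed))

    def retire_head(self):
        seq, completed = self.entries.popleft()
        if not completed:
            # State can no longer be recovered from the pseudo-ROB itself,
            # so a checkpoint is taken at this instruction.
            self.checkpoints.append(seq)

rob = PseudoROB(capacity=2)
rob.insert(0, completed=True)
rob.insert(1, completed=False)   # e.g. a load that misses in the cache
rob.insert(2, completed=True)    # pushes out instruction 0: no checkpoint
rob.insert(3, completed=True)    # pushes out instruction 1: checkpoint taken
print(rob.checkpoints)           # [1]
```

Completed instructions leave without any cost, so checkpoints are spent only on the instructions that may still need recovery after they have left the pseudo-ROB.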

Instruction Queue Management

Bi-level Issue Queue
• The processor detects instructions that will occupy an issue queue entry for a long time
• It removes these instructions from the primary issue queue
• It offloads them to a slow-lane instruction queue → larger, slower, less complex
• The same principle is applied to the load-store queue
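A minimal sketch of the steering decision follows. It is an assumption-laden model rather than the actual hardware: an instruction goes to the slow lane when one of its source registers depends, directly or through a chain, on a load that missed in the cache. The class `BiLevelIssueQueue`, its methods, and the step that moves instructions back to the primary queue when the data arrives are hypothetical and not stated in the slides. Issue and execution are not modeled.

```python
class BiLevelIssueQueue:
    """Steers long-waiting instructions to a slow-lane queue (sketch)."""

    def __init__(self):
        self.fast = []           # primary issue queue: small and fast
        self.slow = []           # slow-lane queue: larger, slower, simpler
        self.slow_regs = set()   # registers waiting on a long-latency producer

    def dispatch(self, name, dest, sources, misses_cache=False):
        waits = any(src in self.slow_regs for src in sources)
        if misses_cache or waits:
            self.slow_regs.add(dest)   # consumers of dest will wait as well
        target = self.slow if waits else self.fast
        target.append((name, dest, sources))

    def data_arrived(self, reg):
        """Assumed behavior: move instructions that no longer wait back to fast."""
        self.slow_regs.discard(reg)
        still_slow = []
        for name, dest, sources in self.slow:   # program order resolves chains
            if any(src in self.slow_regs for src in sources):
                still_slow.append((name, dest, sources))
            else:
                self.slow_regs.discard(dest)
                self.fast.append((name, dest, sources))
        self.slow = still_slow

q = BiLevelIssueQueue()
q.dispatch("load", "r1", ["r0"], misses_cache=True)
q.dispatch("add", "r2", ["r1", "r3"])   # depends on the missing load
q.dispatch("sub", "r4", ["r5"])         # independent: stays in the fast queue
q.dispatch("mul", "r6", ["r2"])         # depends on the load through add
print([i[0] for i in q.fast], [i[0] for i in q.slow])
q.data_arrived("r1")
print([i[0] for i in q.fast], [i[0] for i in q.slow])
```

The independent `sub` never competes with the waiting `add` and `mul` for primary queue entries, which is the point of the two-level organization.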

Physical Register File

Ephemeral Registers
• A conventional superscalar processor assigns physical registers to architected registers when an instruction enters the issue queue
• An instruction reserves a physical register for its entire flight time
• A physical register is not written until much later → until then, its primary function is tracking data dependencies
• Use virtual registers → late register allocation
• Release a register once no other instruction reads its data → early release
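Late allocation and early release can be illustrated with a toy register file. This sketch makes strong simplifying assumptions: the number of readers of each value is known when the producer is renamed, which real hardware does not know in advance, and the class `LateAllocRegisterFile` and its methods are hypothetical names.

```python
class LateAllocRegisterFile:
    """Virtual tags at rename, physical registers only while a value lives."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))  # free physical registers
        self.next_tag = 0
        self.phys_of = {}    # virtual tag -> physical register holding the value
        self.readers = {}    # virtual tag -> readers that have not read yet

    def rename(self, num_readers):
        """Give the producer a virtual tag; no physical register is reserved."""
        if num_readers < 1:
            raise ValueError("sketch assumes every value has a reader")
        tag = self.next_tag
        self.next_tag += 1
        self.readers[tag] = num_readers
        return tag

    def write_back(self, tag):
        """Late allocation: take a physical register when the value is produced."""
        if not self.free:
            raise RuntimeError("no free physical register")
        self.phys_of[tag] = self.free.pop()
        return self.phys_of[tag]

    def read(self, tag):
        """Early release: free the register after its last reader."""
        self.readers[tag] -= 1
        if self.readers[tag] == 0:
            self.free.append(self.phys_of.pop(tag))
            del self.readers[tag]

rf = LateAllocRegisterFile(num_physical=1)
t0 = rf.rename(num_readers=1)
t1 = rf.rename(num_readers=1)   # two values in flight, one physical register
p0 = rf.write_back(t0)
rf.read(t0)                     # last reader: the register is released
p1 = rf.write_back(t1)          # the same physical register is reused
print(p0, p1)
```

Both in-flight values are tracked by tags from rename onward, yet they share a single physical register because each holds it only between write-back and its last read.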

Performance Evaluation

Kilo-Instruction Multiprocessors

Ideal Network


Thank you!
