KILO-INSTRUCTION PROCESSORS Arzucan Özgür Department of Computer Engineering Boğaziçi University 15.12.2005 Cmpe 511
Memory Wall • Performance improvements of high-frequency microprocessors are seriously limited by main memory access latencies • [Figure: processor vs. RAM performance on a log scale (1 to 1000) over time, 1980 to 2000. CPU (“Moore’s Law”) improves 60%/yr, RAM improves 7%/yr; the processor-memory performance gap grows about 50%/yr]
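The growth rates quoted on this slide can be checked with a little arithmetic. The following is a minimal sketch, assuming both rates compound annually and both curves start from the same value; the function name `performance_gap` is hypothetical and only serves the illustration.

```python
def performance_gap(years, cpu_rate=0.60, ram_rate=0.07):
    """Ratio of CPU to RAM performance after `years` of compounding growth.

    The default rates are the 60%/yr and 7%/yr figures from the slide.
    """
    return ((1 + cpu_rate) / (1 + ram_rate)) ** years


# Yearly growth of the gap: 1.60 / 1.07 - 1, which rounds to 50%.
yearly_gap_growth = performance_gap(1) - 1
print(f"gap grows {yearly_gap_growth:.0%} per year")
print(f"gap after 20 years: {performance_gap(20):.0f}x")
```

The ratio of the two growth factors, not their difference, is what gives the "grows 50% / year" figure.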
Cache memory hierarchies • First level (L1) cache is built into the processor core • Takes 1-3 processor clock cycles to access • On a miss in the L1 cache, the on-chip L2 cache is accessed in the order of 10 processor cycles • Accessing main memory takes at least in the order of 100 processor cycles • Prefetching data from memory to the cache • Prefetch addresses are hard to predict • [Figure: processor pipeline (Next IP, Fetch, Drive, Alloc., Rename, Queue, Schedule, Dispatch, Reg. Read, Execute, Flags, Br. chk, Drive) connected to the L1 instruction cache, L1 data cache, L2 and memory, with the branch misprediction path marked]
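To see how these latencies combine, the usual average memory access time formula can be applied. This is a sketch only: the latencies are taken from the slide (2 cycles is picked from the 1-3 range), while the miss rates are assumptions chosen for illustration and do not come from the presentation.

```python
def average_access_cycles(l1_hit=2, l2_hit=10, memory=100,
                          l1_miss_rate=0.05, l2_miss_rate=0.20):
    """Average cycles per memory access for a two-level cache hierarchy.

    Every access pays the L1 latency; L1 misses also pay the L2 latency;
    L2 misses additionally pay the main memory latency.
    """
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * memory)


print(f"average: {average_access_cycles():.1f} cycles")
# A single access that misses in both caches pays the full chain.
print(f"worst case: {2 + 10 + 100} cycles")
```

Even with a low average, each individual access that goes to main memory stalls dependent instructions for over a hundred cycles, which is the latency a kilo-instruction processor tries to hide.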
Definition • An out-of-order superscalar processor that supports thousands of “in-flight instructions” • Achieves this through intelligent use of resources rather than by drastically enlarging every structure
Scalability • Thousands of in-flight instructions combined with in-order commit make conventional designs impractical: • ROB (reorder buffer): needs to maintain a copy of every in-flight instruction • IQs (issue queues): instructions that depend on long-latency instructions remain in these queues for a long time • LSQs (load/store queues): instructions remain in the queue until commit • Registers: a new physical register is needed for each instruction that produces a new value • We would like to get the IPC of thousands of in-flight instructions without drastically increasing resource requirements
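The effect of in-order commit on a small ROB can be illustrated with a toy model. This is a sketch under strong assumptions that are not in the slides: all instructions are independent, each has a fixed latency, up to `width` instructions are dispatched and committed per cycle, and a cache miss costs 200 cycles. The function name `cycles_to_commit` is hypothetical.

```python
from collections import deque


def cycles_to_commit(latencies, rob_size, width=4):
    """Cycles needed to commit all instructions with in-order commit."""
    rob = deque()        # completion cycle of each in-flight instruction, in program order
    next_instr = 0
    committed = 0
    cycle = 0
    while committed < len(latencies):
        # Commit from the head only: an unfinished head blocks everything behind it.
        slots = width
        while rob and slots and rob[0] <= cycle:
            rob.popleft()
            committed += 1
            slots -= 1
        # Dispatch new instructions while the ROB has free entries.
        slots = width
        while next_instr < len(latencies) and len(rob) < rob_size and slots:
            rob.append(cycle + latencies[next_instr])
            next_instr += 1
            slots -= 1
        cycle += 1
    return cycle


# One 200-cycle miss followed by 49 single-cycle instructions, repeated 8 times.
workload = ([200] + [1] * 49) * 8
print("32-entry ROB:  ", cycles_to_commit(workload, rob_size=32))
print("1024-entry ROB:", cycles_to_commit(workload, rob_size=1024))
```

With 32 entries the ROB fills behind each miss, so the misses are served one after another; with 1024 entries all eight misses are in flight at once and overlap. The large ROB is exactly the resource the remaining slides try to avoid building.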
Efficient Kilo-Instruction Processor Design • Multi-Checkpointing the ROB • Out-of-Order Commit • Early Release of Resources • Ephemeral Registers • Load Queues
Checkpointing • The ROB allows the correct state to be restored at any instruction, which is more than is necessary • Checkpoint: a snapshot of the processor state taken at a specific instruction of the program being executed (processor state is checkpointed only for a subset of instructions) • With this snapshot the processor can restore its state to that point in case of an exception or misprediction
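The snapshot-and-restore idea can be sketched in a few lines. What a checkpoint actually contains is a design decision the slides leave open; the sketch below assumes it holds only the program counter and the register rename map, and the class name `CheckpointedState` is hypothetical.

```python
class CheckpointedState:
    """Toy processor state with a stack of checkpoints (assumed contents)."""

    def __init__(self):
        self.pc = 0
        self.rename_map = {}   # architected register -> physical register
        self.checkpoints = []  # oldest first

    def take_checkpoint(self):
        # Copy the map so later renames do not alter the snapshot.
        self.checkpoints.append((self.pc, dict(self.rename_map)))

    def rollback(self):
        """Restore the most recent checkpoint after an exception or misprediction."""
        self.pc, self.rename_map = self.checkpoints.pop()


state = CheckpointedState()
state.pc, state.rename_map["r1"] = 100, "p7"
state.take_checkpoint()
state.pc, state.rename_map["r1"] = 140, "p9"   # speculative work past the checkpoint
state.rollback()
print(state.pc, state.rename_map)
```

Everything executed after the checkpoint is discarded on rollback, so execution resumes from the checkpointed instruction rather than from the exact faulting one.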
Design Decisions • How many in-flight checkpoints should be maintained by the processor? • A large number of checkpoints reduces the penalty of the recovery process • A large number of checkpoints increases the implementation cost • What kind of instructions should be checkpointed? • A checkpoint can be taken at any instruction • Some instructions are better candidates (e.g. some current processors take checkpoints at branch instructions in order to minimize the branch misprediction penalty) • How much information should be kept by each checkpoint?
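The first trade-off can be made concrete by measuring the recovery penalty as the distance from the faulting instruction back to the nearest earlier checkpoint. This is a simplified sketch: the penalty metric, the checkpoint spacings, and the function name `reexecuted_instructions` are all assumptions made for illustration.

```python
from bisect import bisect_right


def reexecuted_instructions(checkpoints, faulting_instruction):
    """Instructions between the nearest earlier checkpoint and the fault.

    `checkpoints` is a sorted list of instruction indices where a
    checkpoint was taken.
    """
    index = bisect_right(checkpoints, faulting_instruction) - 1
    if index < 0:
        raise ValueError("no checkpoint at or before the faulting instruction")
    return faulting_instruction - checkpoints[index]


sparse = [0, 256, 512]               # 3 checkpoints: cheap, coarse recovery
dense = list(range(0, 513, 64))      # 9 checkpoints: costlier, finer recovery
print(reexecuted_instructions(sparse, 250))  # 250
print(reexecuted_instructions(dense, 250))   # 58
```

More checkpoints shorten the rollback distance, but each one is a stored snapshot, which is the implementation cost the slide refers to.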
Selective Checkpointing • Replace the ROB with a pseudo-ROB • The processor removes instructions that reach the pseudo-ROB’s head at a fixed rate • Processor state is recoverable for any instruction in the pseudo-ROB • A checkpoint is taken when an incomplete instruction leaves the pseudo-ROB
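The pseudo-ROB behaviour described above can be sketched as a FIFO that is drained at a fixed rate whether or not its head has finished. This is a minimal sketch, assuming an instruction is described only by its address and a completion flag; `Instr` and `drain_pseudo_rob` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Instr:
    pc: int
    done: bool


def drain_pseudo_rob(pseudo_rob, rate):
    """Remove up to `rate` instructions from the head in one cycle.

    Returns the addresses at which a checkpoint is taken, i.e. the
    instructions that left the pseudo-ROB before completing.
    """
    checkpoints = []
    for _ in range(min(rate, len(pseudo_rob))):
        instr = pseudo_rob.popleft()
        if not instr.done:
            checkpoints.append(instr.pc)
    return checkpoints


rob = deque([Instr(0, True), Instr(4, False), Instr(8, True)])
print(drain_pseudo_rob(rob, rate=2))  # [4]
print(len(rob))                       # 1
```

Unlike a conventional ROB, the head never blocks: a pending instruction simply leaves and a checkpoint covers it from then on.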
Bi-level Issue Queue • The processor detects instructions that will occupy an issue queue entry for a long time • Removes these instructions from the primary issue queue • Offloads them to a slow-lane instruction queue, which is larger, slower, and less complex • The same principle is applied to the load-store queue
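One way to picture the steering decision is to mark every instruction that depends, directly or transitively, on a load that missed in the cache. The slides do not say how detection is done, so the dependence-tracking rule below is an assumption, and `steer` is a hypothetical name.

```python
def steer(instrs, miss_dests):
    """Split instructions between the primary queue and the slow lane.

    `instrs` is a list of (dest, srcs) tuples in program order;
    `miss_dests` holds registers written by loads that missed.
    """
    slow_regs = set(miss_dests)
    fast, slow = [], []
    for dest, srcs in instrs:
        if any(src in slow_regs for src in srcs):
            slow.append((dest, srcs))
            slow_regs.add(dest)       # the result is late too
        else:
            fast.append((dest, srcs))
            slow_regs.discard(dest)   # register overwritten by a timely value
    return fast, slow


program = [("r2", ("r1",)), ("r3", ("r4",)), ("r5", ("r2", "r3"))]
fast, slow = steer(program, miss_dests={"r1"})
print("primary queue:", fast)
print("slow lane:    ", slow)
```

Here `r1` comes from a missing load, so the instructions producing `r2` and `r5` go to the slow lane while the independent one stays in the small, fast primary queue.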
Ephemeral Registers • A conventional superscalar processor assigns physical registers to architected registers when an instruction enters the issue queue • An instruction reserves a physical register for its entire flight time • A physical register is not written with a value until much later; until then its primary function is tracking data dependencies • Use virtual registers: late register allocation • Release a register once no other instruction will read its data: early release
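Late allocation and early release can be sketched together. The model below is a deliberate simplification: it assumes the number of readers of each value is known when the producer is renamed and that reads happen after writeback, neither of which comes from the slides. The class name `EphemeralRegisterFile` is hypothetical.

```python
class EphemeralRegisterFile:
    """Virtual tags track dependencies; physical registers hold values only."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))
        self.next_tag = 0
        self.readers = {}   # virtual tag -> reads still pending
        self.phys_of = {}   # virtual tag -> physical register

    def rename(self, num_readers):
        """Hand out a virtual tag; no physical register is reserved yet."""
        tag = self.next_tag
        self.next_tag += 1
        self.readers[tag] = num_readers
        return tag

    def writeback(self, tag):
        """Late allocation: take a physical register only when the value exists."""
        if self.readers[tag] == 0:
            return None                 # nobody will read it, nothing to store
        self.phys_of[tag] = self.free.pop()
        return self.phys_of[tag]

    def read(self, tag):
        """Early release: free the register after its last reader."""
        self.readers[tag] -= 1
        if self.readers[tag] == 0:
            self.free.append(self.phys_of.pop(tag))


rf = EphemeralRegisterFile(num_physical=2)
tags = [rf.rename(num_readers=1) for _ in range(3)]  # 3 in flight, 2 registers
rf.writeback(tags[0])
rf.read(tags[0])             # last reader: register returns to the free list
print(len(rf.free))          # 2
```

Three instructions are in flight with only two physical registers, because a register is held only between writeback and the last read instead of for the whole flight time.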