ELEC 5200-001/6200-001 Computer Architecture and Design Spring 2014 Instruction-Level Parallelism

ELEC 5200-001/6200-001 Computer Architecture and Design, Spring 2014
Instruction-Level Parallelism
Vishwani D. Agrawal, James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu
ELEC 5200-001/6200-001 Lecture 12

A Computer System
[Figure: block diagram showing a processor (with cache, receiving interrupts), a memory–I/O bus, main memory, and three I/O controllers serving two disks, graphics output, and a network.]

Advanced Architectures – ILP
• Instruction-level parallelism (ILP): multiple instructions fetched and executed simultaneously.
• ILP is used in addition to pipelining.
• Processors with ILP are called multiple-issue processors: multiple instructions are launched in one clock cycle. Two ways:
  • MIMD: Multiple Instructions Multiple Data
    • Superpipeline
    • Superscalar – dynamic multiple issue
    • Very long instruction word (VLIW) – static multiple issue
  • SIMD: Single Instruction Multiple Data
    • Vector processor

Superpipeline and Superscalar
[Figure: timing diagrams over system clock cycles 0 to 8; every instruction passes through the stages IF, ID, EX, MEM, WB.]
• Pipeline: 1 instruction/cycle
• Superpipeline (pipeline clock is twice as fast as the system clock): 2 instructions/cycle
• Superscalar: 2 (or more) instructions/cycle
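The three throughput figures can be compared with a small back-of-the-envelope model. This is only a sketch under idealized assumptions that are not stated on the slide: no hazards or stalls, a five-stage pipeline, and a superpipeline in which each instruction still takes five system clock cycles but a new instruction enters every half cycle. The function names are made up for this illustration.

```python
import math

STAGES = 5  # IF, ID, EX, MEM, WB


def pipeline_cycles(n, stages=STAGES):
    """System clock cycles to finish n instructions, one issued per cycle."""
    return stages + (n - 1)


def superpipeline_cycles(n, stages=STAGES, degree=2):
    """Assumes latency stays at `stages` system cycles while a new
    instruction enters every 1/degree of a system cycle."""
    return stages + (n - 1) / degree


def superscalar_cycles(n, stages=STAGES, width=2):
    """Assumes `width` independent instructions are issued together each cycle."""
    return stages + math.ceil(n / width) - 1


for n in (4, 100):
    print(n, pipeline_cycles(n), superpipeline_cycles(n), superscalar_cycles(n))
```

For long instruction streams both multiple-issue variants approach half the cycle count of the plain pipeline, which is the "2 instructions/cycle" figure in the ideal case.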

Superscalar: Dynamic Scheduling and Out-of-Order Execution
[Figure: block diagram, top to bottom: instruction fetch and decode unit; four reservation stations; functional units (integer, integer, floating point, load/store); commit unit. Labels: in-order issue, out-of-order issue, out-of-order execution, in-order commit.]

Superscalar Instruction Issue
• Rules:
  • RAW dependence – If any operand is being written, do not issue.
  • WAR dependence – If the result register is being read, do not issue.
  • WAW dependence – If the result register is being written, do not issue.
• Scoreboard:
  • Cycle-by-cycle record of registers and execution units, showing how many instructions are using them.
• Example 1: In-order issue (next 2 slides).
• Example 2: Out-of-order issue (3rd slide).
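The three issue rules can be written as a check against per-register counters. The sketch below assumes the scoreboard is kept as two counters per register (number of in-flight readers and writers); the function name `hazards` and the register state used in the example are illustrative, not taken from the slides.

```python
from collections import Counter


def hazards(dest, srcs, reading, writing):
    """Return the hazards that prevent issuing an instruction that
    writes `dest` and reads the registers in `srcs`."""
    found = []
    if any(writing[s] > 0 for s in srcs):
        found.append("RAW")  # an operand is still being written
    if reading[dest] > 0:
        found.append("WAR")  # the result register is still being read
    if writing[dest] > 0:
        found.append("WAW")  # the result register is still being written
    return found


# Assumed scoreboard state: one in-flight instruction R4 = R0 + R2.
reading = Counter({"R0": 1, "R2": 1})
writing = Counter({"R4": 1})

raw_case = hazards("R6", ["R1", "R4"], reading, writing)   # reads R4
free_case = hazards("R7", ["R1", "R2"], reading, writing)  # no conflict
war_case = hazards("R0", ["R1", "R2"], reading, writing)   # writes R0
waw_case = hazards("R4", ["R1", "R1"], reading, writing)   # writes R4
print(raw_case, free_case, war_case, waw_case)
```

An instruction may issue only when the returned list is empty.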

Example
• Consider an example:
  • First with in-order issue
  • Then with out-of-order issue
• Assume:
  • Up to two instructions are fetched in a cycle.
  • The instruction register can hold two instructions.
  • An instruction is issued in its decode cycle, or must wait until there is no RAW, WAR or WAW dependence.
  • An instruction can retire two or three cycles after it is issued.
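The difference between the two issue policies can be tried out with a toy simulator. This is a simplified sketch, not a reproduction of the scoreboard tables in the slides: it assumes additions hold their registers for two cycles and multiplications for three, releases registers at the start of the cycle in which an instruction retires, ignores in-order retirement and the fetch limit, and uses a made-up five-instruction program (`i1` to `i5`), so its cycle numbers need not match the lecture example.

```python
from collections import Counter, namedtuple

Instr = namedtuple("Instr", "name dest srcs latency")


def blocked_by_scoreboard(ins, reading, writing):
    raw = any(writing[s] > 0 for s in ins.srcs)
    return raw or reading[ins.dest] > 0 or writing[ins.dest] > 0


def blocked_by_waiting(ins, older):
    # An instruction may not bypass an older, still-waiting instruction
    # if that would violate a RAW, WAR or WAW dependence between them.
    return (older.dest in ins.srcs or ins.dest in older.srcs
            or ins.dest == older.dest)


def simulate_issue(program, width=2, in_order=True):
    """Return {instruction name: issue cycle}."""
    reading, writing = Counter(), Counter()
    in_flight = []            # (cycle in which registers are released, instr)
    pending = list(program)
    issue_cycle = {}
    cycle = 0
    while pending:
        cycle += 1
        for done, ins in in_flight:      # retire finished instructions
            if done <= cycle:
                writing[ins.dest] -= 1
                for s in ins.srcs:
                    reading[s] -= 1
        in_flight = [(d, i) for d, i in in_flight if d > cycle]
        issued, waiting = 0, []
        for ins in list(pending):
            if issued == width:
                break
            if blocked_by_scoreboard(ins, reading, writing) or any(
                    blocked_by_waiting(ins, old) for old in waiting):
                if in_order:
                    break                # nothing younger may issue
                waiting.append(ins)      # out-of-order: try younger ones
                continue
            writing[ins.dest] += 1
            for s in ins.srcs:
                reading[s] += 1
            in_flight.append((cycle + ins.latency, ins))
            issue_cycle[ins.name] = cycle
            pending.remove(ins)
            issued += 1
    return issue_cycle


program = [
    Instr("i1", "R3", ("R0", "R1"), 3),  # R3 = R0 * R1
    Instr("i2", "R4", ("R0", "R2"), 2),  # R4 = R0 + R2
    Instr("i3", "R5", ("R0", "R1"), 2),  # R5 = R0 + R1
    Instr("i4", "R6", ("R1", "R4"), 2),  # R6 = R1 + R4 (needs R4 from i2)
    Instr("i5", "R7", ("R1", "R2"), 3),  # R7 = R1 * R2 (independent)
]
in_order_result = simulate_issue(program, in_order=True)
out_of_order_result = simulate_issue(program, in_order=False)
print(in_order_result)
print(out_of_order_result)
```

In this toy run `i4` waits for `R4` under both policies, while `i5` is held up behind it only when issue is in order.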


In-order Issue Scoreboard (Continued)
Out-of-order Scoreboard (Next 2 Slides)

Questions?
• RAW dependence: Inst# 4 (R6 = R1 + R4) could not be issued until cycle 5. Should Inst# 5 (R7 = R1 * R2) wait in the queue?
  • Answer: No. Inst# 5 can be issued in cycle 3 because it has no register conflict (out-of-order issue).
• WAR dependence: Must the issue of Inst# 6 (R1 = R0 – R2) wait until cycle 9, when all instructions reading R1 have retired?
  • Answer: No, provided the new result of Inst# 6 does not affect the R1 value being used by previous instructions (register renaming).
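Register renaming can be illustrated with a mapping table from architectural to physical registers. The sketch below is a minimal version under assumptions not made in the slides: eight architectural registers, an unlimited supply of physical registers named `P0`, `P1`, ..., and no free list or recovery logic. Only Inst# 4 to 6 from the questions above are used.

```python
def rename(program, num_arch=8):
    """Rename destinations so that every write gets a fresh physical register.

    program: list of (dest, op, src1, src2) using names R0..R{num_arch-1}.
    """
    mapping = {f"R{i}": f"P{i}" for i in range(num_arch)}
    next_phys = num_arch
    renamed = []
    for dest, op, a, b in program:
        src_a, src_b = mapping[a], mapping[b]  # read the current mappings first
        mapping[dest] = f"P{next_phys}"        # then allocate a new destination
        next_phys += 1
        renamed.append((mapping[dest], op, src_a, src_b))
    return renamed


program = [
    ("R6", "+", "R1", "R4"),  # Inst# 4
    ("R7", "*", "R1", "R2"),  # Inst# 5
    ("R1", "-", "R0", "R2"),  # Inst# 6
]
renamed = rename(program)
for line in renamed:
    print(line)
```

After renaming, Inst# 6 writes `P10` while Inst# 4 and 5 still read `P1`, so the WAR dependence on R1 no longer delays its issue.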


References
• Previous example is from:
  • A. S. Tanenbaum, Structured Computer Organization, Fifth Edition, Prentice-Hall, 2006, pp. 304-309, Section 4.5.3.
• Further reading:
  • D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, “The IBM 360 Model 91: Processor Philosophy and Instruction Handling,” IBM J. Res. & Dev., vol. 11, no. 1, pp. 8-24, Jan. 1967.

Power Reduction by Slack Scheduling
• Application: superscalar, out-of-order execution:
  • An instruction is executed as soon as the required data and resources become available.
  • A commit unit reorders the results.
• Delay the completion of instructions whose result is not immediately needed.
• Example of RISC instructions:
  • add r0, r1, r2; (A)
  • sub r3, r4, r5; (B)
  • and r9, r1, r9; (C)
  • or r5, r9, r10; (D)
  • xor r2, r10, r11; (E)
J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
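One way to see which of the instructions A to E can be delayed is to compute an as-soon-as-possible and an as-late-as-possible start time for each and take the difference. This is a generic ASAP/ALAP sketch, not necessarily the method of the cited paper. It assumes that the first operand is the destination, that every instruction takes one time step, and that only true (RAW) dependences constrain the order, as they would after register renaming.

```python
def slack(program):
    """program: list of (name, dest, src1, src2) in program order.
    Returns {name: number of time steps the instruction can be delayed}."""
    preds = {name: [] for name, *_ in program}
    last_writer = {}
    for name, dest, *srcs in program:
        for s in srcs:
            if s in last_writer:               # RAW dependence on an earlier write
                preds[name].append(last_writer[s])
        last_writer[dest] = name

    asap = {}
    for name, *_ in program:                   # earliest possible start step
        asap[name] = max((asap[p] + 1 for p in preds[name]), default=0)

    length = max(asap.values())
    alap = {name: length for name in asap}
    for name, *_ in reversed(program):         # latest start that keeps the length
        for p in preds[name]:
            alap[p] = min(alap[p], alap[name] - 1)

    return {name: alap[name] - asap[name] for name in asap}


program = [
    ("A", "r0", "r1", "r2"),    # add
    ("B", "r3", "r4", "r5"),    # sub
    ("C", "r9", "r1", "r9"),    # and
    ("D", "r5", "r9", "r10"),   # or, reads r9 produced by C
    ("E", "r2", "r10", "r11"),  # xor
]
result = slack(program)
print(result)
```

Under these assumptions C and D form the critical chain and have no slack, while A, B and E could each complete one step later, which makes them candidates for slower, lower-power execution.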

Slack Scheduling Example

Slack Scheduling
[Figure labels: re-order buffer, slack bit, scheduling logic, low-power execution units (reduced voltage).]

Superscalar Design of P4 (CISC)
• CISC shell:
  • Processor fetches instructions from memory in the order of the static program.
  • Each instruction is translated into one or more fixed-length RISC instructions, known as micro-operations (micro-ops).
• RISC core:
  • Micro-ops are executed out of order in a dynamically scheduled pipeline.
  • Processor commits the result of each micro-op execution to the register file in the order of the original program flow.
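The translation step in the CISC shell can be pictured with a toy decoder. The instruction format, the mnemonics and the temporary register `tmp` below are invented for illustration and do not describe Intel's actual micro-op encoding; the sketch only shows how one memory-operand instruction can expand into several simple register operations.

```python
def to_micro_ops(instr):
    """Translate a toy two-operand instruction (op, dst, src), meaning
    dst = dst op src, into load/compute/store micro-ops.
    Operands in square brackets denote memory locations."""
    op, dst, src = instr
    if dst.startswith("["):                    # memory destination: read-modify-write
        addr = dst.strip("[]")
        return [("load", "tmp", addr), (op, "tmp", src), ("store", addr, "tmp")]
    if src.startswith("["):                    # memory source: load, then compute
        return [("load", "tmp", src.strip("[]")), (op, dst, "tmp")]
    return [(op, dst, src)]                    # register-only form is one micro-op


mem_dest = to_micro_ops(("add", "[x]", "r1"))
mem_src = to_micro_ops(("add", "r2", "[y]"))
reg_only = to_micro_ops(("add", "r2", "r1"))
print(mem_dest, mem_src, reg_only, sep="\n")
```

The resulting micro-ops are what the RISC core would schedule out of order, with results committed in the original program order.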

Superscalars
• 3 or more instruction issues per clock:
  • Intel P6
  • AMD K5
  • Sun UltraSPARC
  • Alpha 21164
  • MIPS R10000
  • PowerPC 604/620
  • HP 8000
• References:
  • D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, “The IBM 360 Model 91: Processor Philosophy and Instruction Handling,” IBM J. Res. & Dev., vol. 11, no. 1, pp. 8-24, Jan. 1967.
  • T. Agerwala and J. Cocke, “Reduced Instruction Set Processors,” Technical Report RC12434 (#55845), Yorktown Heights, NY: IBM T. J. Watson Research Center, Jan. 1987.

VLIW: Very Long Instruction Word
• Static multiple issue, ILP determined by compiler.
• Datapath contains multiple execution units.
• Compiler groups instructions that have no data or resource conflicts for parallel execution.
• Grouped instructions are packed in very long words of a wide instruction memory.
• Speedup benefit of VLIW is highly program dependent.
• Ref: J. A. Fisher, “Very Long Instruction Word Architecture and ELI-512,” Proc. 10th Symp. on Computer Architecture, Stockholm, June 1983, pp. 478-490.
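The compiler's grouping step can be sketched as a greedy packer that fills one long word at a time. This is a deliberately simple illustration, not a real VLIW scheduler: it keeps program order, treats any RAW, WAR or WAW relation inside a word as a conflict, models the resource limit only as a fixed number of slots per word (`width`, an assumed parameter), and reuses the five RISC instructions A to E from the slack-scheduling slide with the first operand taken as the destination.

```python
def pack_bundles(program, width=4):
    """Greedily pack (name, dest, src1, src2) tuples into long words.
    A new word is started when the next instruction conflicts with one
    already in the current word or when the word is full."""
    bundles, current = [], []
    for ins in program:
        _, dest, *srcs = ins
        conflict = any(
            d == dest        # WAW: both write the same register
            or d in srcs     # RAW: reads a value produced in this word
            or dest in s     # WAR: overwrites a register read in this word
            for _, d, *s in current
        )
        if conflict or len(current) == width:
            bundles.append(current)
            current = []
        current.append(ins)
    if current:
        bundles.append(current)
    return bundles


program = [
    ("A", "r0", "r1", "r2"),
    ("B", "r3", "r4", "r5"),
    ("C", "r9", "r1", "r9"),
    ("D", "r5", "r9", "r10"),
    ("E", "r2", "r10", "r11"),
]
bundles = pack_bundles(program)
names = [[ins[0] for ins in word] for word in bundles]
print(names)
```

With four slots per word the five instructions fit in two words; a program with longer dependence chains would leave more slots empty, which is one reason the speedup is highly program dependent.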

Topics in Computer Architecture
• Instruction set
• Program execution through register transfer
• See Lectures 13-14. Computer arithmetic (2’s complement, IEEE 754 floating point standard, addition, multiplication)
• Datapaths (single-cycle, multicycle, pipeline)
• Control (combinational logic, FSM, microcode)
• Pipelining (throughput, hazards, forwarding, stall, branch prediction)
• Memory organization (cache, virtual memory)
• Performance (benchmarks, energy efficiency, Amdahl’s law)
• Advanced architectures (ILP, OOE, superscalar, etc.)
• Not discussed in this course:
  • Multiprocessors
  • Compiler and software techniques – loop unrolling, trace execution, etc.
  • Input and output
  • Power management

One who claims to know much about computer architecture speaks from ignorance . . . because a lot is going to happen in the future, which is . . . http://www.youtube.com/watch?v=xZbKHDPPrrc
