Chapter 4 The Processor
A Basic MIPS Implementation
• The memory-reference instructions: load word (lw) and store word (sw)
• The arithmetic-logical instructions: add, sub, AND, OR, and slt
• The branch instructions: branch equal (beq) and jump (j)
CPU Overview
Multiplexers
• Can't just join wires together
• Use multiplexers to select among multiple sources
Logic Design Conventions
• Datapath elements
  • Combinational elements: outputs depend only on the current inputs
  • State (sequential) elements: at least two inputs (the data value and the clock) and one output
• Control signals are either asserted or deasserted
Combinational Elements
• AND gate: Y = A & B
• Adder: Y = A + B
• Arithmetic/Logic Unit (ALU): Y = F(A, B)
• Multiplexer: Y = S ? I1 : I0
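To make the behavior concrete, here is a minimal behavioral sketch in Python (not from the original slides); the function names are my own, and the ALU operation set is the one listed above.

  def and_gate(a: int, b: int) -> int:
      return a & b                            # Y = A & B

  def adder(a: int, b: int, width: int = 32) -> int:
      return (a + b) & ((1 << width) - 1)     # Y = A + B, truncated to the datapath width

  def mux(s: int, i0: int, i1: int) -> int:
      return i1 if s else i0                  # Y = S ? I1 : I0

  def alu(f: str, a: int, b: int) -> int:
      # F selects the operation; the operation names are illustrative.
      ops = {"add": adder(a, b), "sub": adder(a, -b), "and": a & b,
             "or": a | b, "slt": int(a < b)}
      return ops[f]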
Sequential Elements
• Register: stores data in a circuit
• Uses a clock signal to determine when to update the stored value
• Edge-triggered: update when Clk changes from 0 to 1
Sequential Elements
• Register with write control
  • Only updates on the clock edge when the write-control input is 1
  • Used when the stored value is required later
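A minimal sketch of the write-controlled, edge-triggered register in Python; the class and method names are illustrative, not from the slides.

  class Register:
      def __init__(self, width: int = 32):
          self.value = 0
          self.width = width

      def clock_edge(self, d: int, write: int) -> None:
          # On the rising clock edge, the stored value updates only
          # when the write-control input is asserted.
          if write:
              self.value = d & ((1 << self.width) - 1)

      def read(self) -> int:
          return self.value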
Logic Design Conventions
• Clocking methodology: edge-triggered clocking, so state elements are updated only on the active clock edge
Building a Datapath
• What are the major components?
  • Program counter (PC) register
  • An adder to increment the PC by 4
  • Instruction memory (fetch sketched below)
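A minimal fetch sketch in Python, assuming a word-addressed instruction memory held as a list; the names are my own.

  def fetch(instruction_memory: list[int], pc: int) -> tuple[int, int]:
      instr = instruction_memory[pc // 4]     # read the instruction at PC
      return instr, (pc + 4) & 0xFFFFFFFF     # next sequential PC = PC + 4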
Building a Datapath
• Register file (for R-format instructions); a sketch follows this list
  • Read two registers: two read-address inputs, two data outputs
  • Write one register: one write-address input, one write-data input, one write-control signal
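The register file above can be sketched behaviorally like this (Python, illustrative names); the rule that register $zero always reads as 0 is the standard MIPS convention.

  class RegisterFile:
      def __init__(self, num_regs: int = 32):
          self.regs = [0] * num_regs

      def read(self, addr1: int, addr2: int):
          # Two read addresses, two data outputs (reads are combinational).
          return self.regs[addr1], self.regs[addr2]

      def write(self, addr: int, data: int, reg_write: int) -> None:
          # One write address, one write-data input, one write-control signal.
          # Register $zero (number 0) is never overwritten.
          if reg_write and addr != 0:
              self.regs[addr] = data & 0xFFFFFFFF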
Building a Datapath
• Data memory, needed by loads and stores
  • Example: lw $t1, offset_value($t2) and sw $t1, offset_value($t2)
  • lw reads from memory, sw saves to memory; both use the ALU and the register file to form the address
  • A sign-extend unit extends the 16-bit offset to 32 bits
  • Data memory supports read and write, with address, read-data, and write-data ports plus control signals (sketched below)
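A behavioral sketch of the data memory unit in Python; the class name, word-addressed layout, and port names are assumptions for illustration.

  class DataMemory:
      def __init__(self, size_words: int = 1024):
          self.mem = [0] * size_words

      def access(self, address: int, write_data: int,
                 mem_read: bool, mem_write: bool):
          index = address // 4                # word-aligned access
          if mem_write:
              self.mem[index] = write_data & 0xFFFFFFFF
          if mem_read:
              return self.mem[index]
          return None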
Building a Datapath
• Example: beq $t1, $t2, offset
  • Branch target address = (PC + 4) + (sign-extended offset shifted left by 2)
  • The register comparison decides whether the branch is taken or not
• Jump instruction: the lower 28 bits of the PC are replaced by the 26-bit target field shifted left by 2 (see the sketch below)
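The two next-PC computations can be sketched as follows (Python, illustrative helper names), assuming a 32-bit PC.

  MASK32 = 0xFFFFFFFF

  def sign_extend16(imm16: int) -> int:
      # Extend a 16-bit immediate to a signed 32-bit value.
      return imm16 - 0x10000 if imm16 & 0x8000 else imm16

  def branch_target(pc: int, imm16: int) -> int:
      # Target = (PC + 4) + (sign-extended offset << 2)
      return ((pc + 4) + (sign_extend16(imm16) << 2)) & MASK32

  def jump_target(pc: int, target26: int) -> int:
      # Lower 28 bits come from the 26-bit field << 2; upper 4 bits from PC + 4.
      return ((pc + 4) & 0xF0000000) | ((target26 << 2) & 0x0FFFFFFF)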
R-Type/Load/Store Datapath
A Simple Implementation Scheme
• The ALU operation is selected by the ALUOp control field (and, for R-type, the funct field); a sketch follows
  • For load and store: add (to compute the memory address)
  • For R-type instructions: AND, OR, subtract, add, slt
  • For branch equal: subtract (to compare the two registers)
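A sketch of the ALU control implied above (Python). The ALUOp encodings (00 = add, 01 = subtract, 10 = use funct) and the funct codes follow the usual MIPS convention; treat the textbook's tables as authoritative.

  FUNCT_TO_OP = {
      0b100000: "add", 0b100010: "sub",
      0b100100: "and", 0b100101: "or",
      0b101010: "slt",
  }

  def alu_control(alu_op: int, funct: int) -> str:
      if alu_op == 0b00:          # lw / sw: compute the address
          return "add"
      if alu_op == 0b01:          # beq: compare by subtracting
          return "sub"
      return FUNCT_TO_OP[funct]   # R-type: operation comes from funct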
Designing the Main Control Unit
Operation of the Datapath (R-format): add $t1, $t2, $t3
Operation of the Datapath (load) lw $t1, offset($t2)
Operation of the Datapath (branching) beq $t1, $t2, offset
Performance Issues
• Longest delay determines the clock period
• Critical path: the load instruction
  • Instruction memory → register file → ALU → data memory → register file
• Not feasible to vary the period for different instructions
• Violates the design principle: make the common case fast
• We will improve performance by pipelining
An Overview of Pipelining
• The laundry analogy:
  1. Place one dirty load of clothes in the washer.
  2. Place the wet load in the dryer.
  3. Place the dry load on a table and fold.
  4. Ask your roommate to put the clothes away.
An Overview of Pipelining
• Pipelining improves the throughput of our laundry system
• The speed-up due to pipelining equals the number of pipeline stages (if the stages are perfectly balanced)
• The MIPS pipeline has five stages, one step per stage:
  • IF: instruction fetch from memory
  • ID: instruction decode & register read
  • EX: execute operation or calculate address
  • MEM: access memory operand
  • WB: write result back to register
Pipeline Performance
• Assume the time for the stages is:
  • 100 ps for register read or write
  • 200 ps for the other stages
• Compare the pipelined datapath with the single-cycle datapath
Pipeline Performance
• Single-cycle datapath: Tc = 800 ps
• Pipelined datapath: Tc = 200 ps
An Overview of Pipelining
• For the last example (3 instructions):
  • Non-pipelined: 2400 ps
  • Pipelined: 1400 ps
• For more instructions (1,000,003), see the calculation below:
  • Non-pipelined: 1,000,000 × 800 ps + 2400 ps = 800,002,400 ps
  • Pipelined: 1,000,000 × 200 ps + 1400 ps = 200,001,400 ps
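The numbers above can be checked with a short calculation (Python), assuming the stage times from the earlier slide and a 5-stage pipeline.

  stage_ps = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

  single_cycle_tc = sum(stage_ps.values())    # 800 ps: lw exercises every stage
  pipelined_tc = max(stage_ps.values())       # 200 ps: the longest stage

  def total_time(n_instructions: int):
      non_pipelined = n_instructions * single_cycle_tc
      # First instruction takes 5 cycles; each additional one adds 1 cycle.
      pipelined = (5 + n_instructions - 1) * pipelined_tc
      return non_pipelined, pipelined

  print(total_time(3))            # (2400, 1400)
  print(total_time(1_000_003))    # (800002400, 200001400)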
An Overview of Pipelining
• Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction
• Instruction throughput is the important metric because real programs execute billions of instructions
Designing Instruction Sets for Pipelining
• In MIPS, all instructions are the same length (vs. x86), which makes them easier to fetch and decode
• MIPS instruction formats are symmetric: the register fields are located in the same place in every format
• Memory operands appear only in loads and stores, so the execute stage can calculate the address and the memory access can follow in the next stage
• Operands must be aligned in memory, so a data transfer requires only one data memory access
Structural Hazards
• A conflict for the use of a resource
• In a MIPS pipeline with a single memory:
  • A load/store requires a data access
  • An instruction fetch would have to stall for that cycle, causing a pipeline "bubble"
• Hence, pipelined datapaths require separate instruction/data memories (or separate instruction/data caches)
Data Hazards
• An instruction depends on completion of a data access by a previous instruction
  add $s0, $t0, $t1
  sub $t2, $s0, $t3
Forwarding (Bypassing)
• Use the result when it is computed
• Don't wait for it to be stored in a register
• Requires extra connections in the datapath
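Forwarding is usually implemented by comparing destination fields held in later pipeline registers with the source fields of the instruction in EX. A simplified sketch for the first ALU operand (Python, illustrative field names):

  def forward_a(ex_mem_reg_write: bool, ex_mem_rd: int,
                mem_wb_reg_write: bool, mem_wb_rd: int,
                id_ex_rs: int) -> str:
      if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
          return "EX/MEM"     # forward the ALU result of the previous instruction
      if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
          return "MEM/WB"     # forward the result from two instructions earlier
      return "REGFILE"        # no hazard: use the value read from the register file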
Load-Use Data Hazard
• Can't always avoid stalls by forwarding
  • If the value is not yet computed when it is needed
  • Can't forward backward in time!
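The load-use case is usually detected in ID with a check like the following sketch (Python, illustrative field names); when it fires, the pipeline stalls one cycle and inserts a bubble.

  def load_use_hazard(id_ex_mem_read: bool, id_ex_rt: int,
                      if_id_rs: int, if_id_rt: int) -> bool:
      # Stall when the instruction in EX is a load whose destination register
      # is a source of the instruction currently being decoded.
      return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

  # Example: lw $s0, 20($t1) followed by sub $t2, $s0, $t3 must stall one cycle.
  print(load_use_hazard(True, 16, 16, 19))   # True -> insert a bubble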
Reordering Code to Avoid Pipeline Stalls
• Consider the following code segment in C:
  a = b + e;
  c = b + f;
• Find the hazards in the corresponding assembly code on the next slide
Code Scheduling to Avoid Stalls
• Reorder code to avoid use of a load result in the next instruction
• C code: A = B + E; C = B + F;

Original order (13 cycles, two stalls):
  lw   $t1, 0($t0)
  lw   $t2, 4($t0)
  add  $t3, $t1, $t2    # stall: waits for $t2
  sw   $t3, 12($t0)
  lw   $t4, 8($t0)
  add  $t5, $t1, $t4    # stall: waits for $t4
  sw   $t5, 16($t0)

Scheduled order (11 cycles, no stalls):
  lw   $t1, 0($t0)
  lw   $t2, 4($t0)
  lw   $t4, 8($t0)
  add  $t3, $t1, $t2
  sw   $t3, 12($t0)
  add  $t5, $t1, $t4
  sw   $t5, 16($t0)
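A quick check of the cycle counts (Python), assuming a 5-stage pipeline where each stall adds one cycle:

  def pipeline_cycles(n_instructions: int, n_stalls: int, stages: int = 5) -> int:
      # First instruction fills the pipeline; each later one adds a cycle,
      # plus one extra cycle per stall.
      return stages + (n_instructions - 1) + n_stalls

  print(pipeline_cycles(7, 2))   # 13 cycles, original order (two load-use stalls)
  print(pipeline_cycles(7, 0))   # 11 cycles, scheduled order (no stalls)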
Control Hazards
• A branch determines the flow of control
  • Fetching the next instruction depends on the branch outcome
  • The pipeline can't always fetch the correct instruction: it is still working on the ID stage of the branch
• In the MIPS pipeline:
  • Need to compare the registers and compute the target early in the pipeline
  • Add hardware to do this in the ID stage