CprE 381

CprE 381 Fall 2013 Course Review Zhao Zhang

Performance Summary • Define Performance = 1/Execution Time • “X is n times faster than Y” means Performance(X) / Performance(Y) = Execution time(Y) / Execution time(X) = n Chapter 1 — Computer Abstractions and Technology — 2
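As a quick illustration of the definition on this slide, the sketch below computes relative performance from two execution times. The function name and the 10 s and 15 s timings are made up for the example.

```python
def times_faster(time_x, time_y):
    """Return n such that X is n times faster than Y.

    Performance = 1 / execution time, so
    n = performance_x / performance_y = time_y / time_x.
    """
    return time_y / time_x

# Hypothetical timings: the same program takes 10 s on X and 15 s on Y.
print(times_faster(10.0, 15.0))  # 1.5
```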

Amdahl’s Law (§1.8 Fallacies and Pitfalls) • Pitfall: improving one aspect of a computer and expecting a proportional improvement in overall performance • Improvements should be balanced • Over-optimizing one aspect is not a good idea
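The pitfall can be made concrete with the usual statement of Amdahl's Law: improved time = affected time / improvement factor + unaffected time. The sketch below assumes a hypothetical program that spends 80 s of its 100 s in one operation; the function name and numbers are illustrative only.

```python
def improved_time(t_affected, t_unaffected, improvement_factor):
    """Execution time after speeding up only the affected part."""
    return t_affected / improvement_factor + t_unaffected

# Hypothetical program: 80 s in the optimized operation, 20 s elsewhere.
TOTAL = 100.0
for factor in (2, 5, 1000):
    t = improved_time(80.0, 20.0, factor)
    print(f"{factor}x faster operation: {t:.2f} s total, "
          f"overall speedup {TOTAL / t:.2f}x")
```

However large the improvement factor, the total can never drop below the 20 s unaffected part, so the overall speedup stays under 5x in this example.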

MIPS Instruction Encoding • Three instruction encoding formats (bit positions in parentheses)
R-type: op (31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)
I-type: op (31:26) | rs (25:21) | rt (20:16) | immediate (15:0)
J-type: op (31:26) | address (25:0)
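The three formats can be sketched as bit-packing helpers. The function names below are hypothetical; the field positions follow the formats on this slide, and the register numbers in the examples use the standard MIPS numbering ($t0 = 8, $s1 = 17, $s2 = 18, $s3 = 19).

```python
def encode_r(rs, rt, rd, shamt, funct, op=0):
    """R-type: op | rs | rt | rd | shamt | funct."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(op, rs, rt, imm):
    """I-type: op | rs | rt | 16-bit immediate."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def encode_j(op, address):
    """J-type: op | 26-bit address."""
    return (op << 26) | (address & 0x03FFFFFF)

# add $t0, $s1, $s2  (funct 0x20)
print(hex(encode_r(rs=17, rt=18, rd=8, shamt=0, funct=0x20)))  # 0x2324020
# lw $t0, 32($s3)    (opcode 35)
print(hex(encode_i(op=35, rs=19, rt=8, imm=32)))               # 0x8e680020
```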

Register Name and Call Convention

IEEE Floating-Point Format • Fields: S (1 bit) | Exponent (single: 8 bits, double: 11 bits) | Fraction (single: 23 bits, double: 52 bits) • S: sign bit (0 ⇒ non-negative, 1 ⇒ negative) • Normalized significand: 1.0 ≤ |significand| < 2.0 • Always has a leading pre-binary-point 1 bit, so there is no need to represent it explicitly (hidden bit) • Significand is the Fraction with the “1.” restored • Exponent: excess representation: actual exponent + Bias • Ensures the stored exponent is unsigned • Single: Bias = 127; Double: Bias = 1023 Chapter 3 — Arithmetic for Computers — 6
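To make the field layout concrete, here is a minimal sketch that splits a single-precision value into S, Exponent, and Fraction and then rebuilds the value as (-1)^S × (1 + Fraction) × 2^(Exponent − Bias). The helper names are hypothetical, and the rebuild handles normalized numbers only (no zeros, denormals, infinities, or NaNs).

```python
import struct

def decode_single(x):
    """Return (sign, exponent, fraction) fields of x as an IEEE single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent
    fraction = bits & 0x7FFFFF       # 23-bit fraction
    return sign, exponent, fraction

def value_from_fields(sign, exponent, fraction, bias=127, frac_bits=23):
    """Rebuild a normalized value; the hidden leading 1 is restored here."""
    significand = 1 + fraction / (1 << frac_bits)
    return (-1) ** sign * significand * 2.0 ** (exponent - bias)

s, e, f = decode_single(-0.75)       # -0.75 = -1.1 (binary) x 2^-1
print(s, e, hex(f))                  # 1 126 0x400000
print(value_from_fields(s, e, f))    # -0.75
```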

Processor Implementation §4.1 Introduction • We have examined two MIPS implementations • A simplified version • A more realistic pipelined version • Simple subset, shows most aspects • Memory reference: lw, sw • Arithmetic/logical: add, sub, and, or, slt • Control transfer: beq, j Chapter 4 — The Processor — 7

Single-Cycle Processor

Control Signal Setting • What are the control signal values for each instruction or instruction type? Note: “R-” means R-format
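The table from this slide did not survive in the transcript. As a sketch, the settings usually given for the textbook single-cycle datapath can be written as a lookup table; this assumes the Patterson and Hennessy signal names, and `None` marks a don't-care value.

```python
# Control settings per instruction type (None = don't care).
# Signal names assume the textbook single-cycle datapath.
CONTROL = {
    "R-format": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                     MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    "lw":       dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                     MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    "sw":       dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                     MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    "beq":      dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                     MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}

for name, signals in CONTROL.items():
    print(name, signals)
```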

ALU Control • Assume 2-bit ALUOp derived from opcode • Combinational logic derives ALU control
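The combinational mapping can be sketched as a small function. It assumes the textbook encoding of the 4-bit ALU control (0000 AND, 0001 OR, 0010 add, 0110 subtract, 0111 set-on-less-than) and the standard MIPS funct codes; the function name is hypothetical.

```python
# funct field -> ALU control, used only when ALUOp = 10 (R-format)
FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}

def alu_control(alu_op, funct=None):
    """Derive the ALU control code from ALUOp and, for R-format, funct."""
    if alu_op == 0b00:      # lw / sw: add to form the address
        return 0b0010
    if alu_op == 0b01:      # beq: subtract to compare
        return 0b0110
    return FUNCT_TO_ALU[funct]  # R-format: decided by the funct field

print(format(alu_control(0b10, 0b101010), "04b"))  # 0111
```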

MIPS Pipeline • Five stages, one step per stage • IF: Instruction fetch • ID: Instruction decode & register read • EX: Execute operation or calculate address • MEM: Access memory operand • WB: Write result back to register

Pipelined Implementation

Data Hazards from ALU Instructions • An instruction depends on completion of data access by a previous instruction
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
• Consider this sequence:
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Forwarding Paths
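The forwarding decision for one ALU operand can be sketched as follows. This assumes the textbook forwarding-unit conditions (forward from EX/MEM first, then from MEM/WB, never from register $0); the function and parameter names are hypothetical stand-ins for pipeline-register fields.

```python
def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd,
                   mem_wb_regwrite, mem_wb_rd):
    """Return the forwarding mux select for one source register in EX.

    0b10: take the ALU result from EX/MEM (most recent producer)
    0b01: take the value from MEM/WB
    0b00: no hazard, use the value read from the register file
    """
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return 0b10
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return 0b01
    return 0b00

# sub $2,$1,$3 followed by and $12,$2,$5: sub is in MEM while and is in EX.
print(forward_select(2, True, 2, False, 0))   # 2 (0b10)
# or $13,$6,$2: and ($12) is in MEM, sub ($2) is in WB.
print(forward_select(2, True, 12, True, 2))   # 1 (0b01)
```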

Load-Use Data Hazard • Stall for one cycle whenever a load is followed immediately by an instruction that uses the load result

How to Stall the Pipeline • Force control values in the ID/EX register to 0 • Effectively inserts a NOP (bubble) • Prevent the write to the PC • Holds the instruction at IF • Prevent the write to the IF/ID register • Holds the instruction at ID
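A minimal sketch of the hazard detection unit, assuming the textbook condition: the instruction in EX is a load (ID/EX.MemRead) and its destination (ID/EX.rt) matches either source register of the instruction in ID. The function name and the returned signal names are hypothetical.

```python
def hazard_detect(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Return the stall actions for the current cycle as a dict."""
    stall = bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "pc_write": not stall,      # hold the instruction at IF
        "if_id_write": not stall,   # hold the instruction at ID
        "insert_bubble": stall,     # zero the control values in ID/EX
    }

# lw $2, 20($1) followed immediately by and $4, $2, $5
print(hazard_detect(True, 2, 2, 5))
```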

Datapath with Hazard Detection

Code Scheduling to Avoid Stalls • Reorder code to avoid load-use stalls • Separate the load and its “use” by at least one instruction
Original order (13 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    (stall)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    (stall)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
Reordered (11 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
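The 13-cycle and 11-cycle totals can be checked with a small counting sketch. It assumes a five-stage pipeline with full forwarding, so the only stalls are one-cycle load-use stalls, and it counts instructions + 4 fill cycles + stalls. The tuple encoding of instructions is a made-up representation for this example.

```python
PIPELINE_FILL = 4  # a 5-stage pipeline needs 4 extra cycles to drain

def count_cycles(program):
    """program: list of (opcode, dest, sources) tuples in issue order."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        # Load-use hazard: a load immediately followed by a consumer.
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return len(program) + PIPELINE_FILL + stalls

original = [
    ("lw", "$t1", ("$t0",)), ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")), ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)), ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]

print(count_cycles(original), count_cycles(reordered))  # 13 11
```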

Control Hazards • The IF stage works continuously • Every cycle, it has to fetch an instruction to keep the pipeline flowing • Branch outcome is known at the MEM stage in the previous design • Flush IF, ID, EX for each taken branch • Optimize: Move branch comparison to the ID stage • Flush IF for each taken branch • The MIPS ISA only supports BNE, BEQ, and branches that compare a register to zero

Branch Comparison at ID • Need more forwarding paths to the ID stage (not shown)

Data Hazards for Branches • However, more data hazards on branches
One stall cycle:
    lw  $1, addr          IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    beq stalled                   IF  ID
    beq $1, $4, target                    ID  EX  MEM WB
Two stall cycles:
    lw  $1, addr          IF  ID  EX  MEM WB
    beq stalled               IF  ID
    beq stalled                       ID
    beq $1, $0, target                    ID  EX  MEM WB

Delayed Branch • A delayed branch may remove the one-cycle stall • The instruction right after the beq is executed whether or not the branch is taken • Put another way, the effect of the beq is delayed by one cycle
    Before:                  After:
    sub $10, $4, $8          beq $1, $3, 7
    beq $1, $3, 7       =>   sub $10, $4, $8
    and $12, $2, $5          and $12, $2, $5
• Must find an independent instruction; otherwise • May have to fill in a nop instruction, or • Need two variants of beq, delayed and not delayed

Memory Technology §5.1 Introduction • Static RAM (SRAM) • 0.5ns – 2.5ns, $2000 – $5000 per GB • Dynamic RAM (DRAM) • 50ns – 70ns, $20 – $75 per GB • Magnetic disk • 5ms – 20ms, $0.20 – $2 per GB • Ideal memory • Access time of SRAM • Capacity and cost/GB of disk Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 23

Principle of Locality • Programs access a small proportion of their address space at any time • Temporal locality • Items accessed recently are likely to be accessed again soon • e.g., instructions in a loop, induction variables • Spatial locality • Items near those accessed recently are likely to be accessed soon • e.g., sequential instruction access, array data

Example • How many cache misses occur for a cache with a 64-byte block size?
    extern int X[256];
    int sum, i;
    for (sum = 0, i = 0; i < 256; i++)
        sum = sum + X[i];
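One way to work this example is to count the distinct blocks the loop touches. The sketch below assumes 4-byte `int`s, an initially empty cache, no evictions during the loop, and that only the accesses to `X` count (`sum` and `i` are assumed to live in registers). The base address is a free parameter, because the answer depends on whether `X` starts on a block boundary.

```python
def count_misses(base_addr=0, n_elems=256, elem_size=4, block_size=64):
    """Count cold misses for one sequential pass over an array."""
    cached_blocks = set()
    misses = 0
    for i in range(n_elems):
        block = (base_addr + i * elem_size) // block_size
        if block not in cached_blocks:   # first touch of this block
            misses += 1
            cached_blocks.add(block)
    return misses

print(count_misses(base_addr=0))    # 16: 1024 bytes / 64 bytes per block
print(count_misses(base_addr=32))   # 17: array straddles one extra block
```

Each miss brings in 16 consecutive elements, so the other 15 accesses to that block hit; this is spatial locality at work.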

Direct Mapped Cache • Location determined by address • Direct mapped: only one choice • (Block address) modulo (#Blocks in cache) • #Blocks is a power of 2 • Use low-order address bits
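The mapping rule can be sketched as an address split. The function name is hypothetical, and the 64-block, 16-byte-block cache in the usage lines is an assumed configuration chosen for illustration.

```python
def split_address(addr, block_size, num_blocks):
    """Split a byte address into (tag, index, offset) for a direct-mapped cache."""
    block_addr = addr // block_size
    index = block_addr % num_blocks     # (block address) modulo (#blocks)
    tag = block_addr // num_blocks      # remaining high-order bits
    offset = addr % block_size
    return tag, index, offset

# Assumed cache: 64 blocks of 16 bytes. Byte address 1200 is block address 75.
print(split_address(1200, block_size=16, num_blocks=64))  # (1, 11, 0)

# Because #blocks is a power of 2, the modulo is just the low-order bits:
assert (1200 >> 4) & 63 == 11
```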

Set Associative Cache Organization

Example: Intrinsity FastMATH

Measuring Cache Performance (§5.3 Measuring and Improving Cache Performance) • With simplifying assumptions: Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty = (Instructions / Program) × (Misses / Instruction) × Miss penalty
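A small worked example of memory stall cycles per instruction, computed as accesses per instruction × miss rate × miss penalty. The configuration is assumed for illustration: base CPI of 2, 2% instruction-cache miss rate, 4% data-cache miss rate, a 100-cycle miss penalty, and loads and stores making up 36% of instructions.

```python
def stall_cycles_per_instr(accesses_per_instr, miss_rate, miss_penalty):
    """Memory stall cycles contributed per instruction."""
    return accesses_per_instr * miss_rate * miss_penalty

base_cpi = 2.0
icache = stall_cycles_per_instr(1.0, 0.02, 100)    # every instruction is fetched
dcache = stall_cycles_per_instr(0.36, 0.04, 100)   # only loads/stores access data
actual_cpi = base_cpi + icache + dcache

print(f"I-cache stalls {icache:.2f}, D-cache stalls {dcache:.2f}, "
      f"CPI {actual_cpi:.2f}")
```

Under these assumptions the memory stalls more than double the CPI, which is why the miss rate and miss penalty matter as much as the base pipeline design.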

Average Access Time • Average memory access time (AMAT) • AMAT = Hit time + Miss rate × Miss penalty • A simple formula that integrates three major performance factors of cache design • Not always consistent with memory stall cycles
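The AMAT formula is easy to evaluate directly. The numbers below are assumed for illustration: a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed cache: 1-cycle hit, 5% miss rate, 20-cycle miss penalty.
cycles = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)
print(f"AMAT = {cycles:.2f} cycles")
```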

Replacement Policy • Direct mapped: no choice • Set associative • Least-recently used (LRU): Choose the one unused for the longest time • Random: Gives approximately the same performance as LRU for high associativity

Cache Write Policy • Cache hit policy • Write through: Update both the cache block and memory • Incurs very high write traffic (not commonly used) • Write back: Update the cache block only • Needs a dirty bit per cache block • The write to memory happens when the block is replaced • Cache miss policy • Write allocate: Load the memory block into the cache • Write around (non-allocate): Do not load the block into the cache

Multilevel Caches • Primary cache attached to CPU • Level-2 cache services misses from primary cache • Larger, slower, but still faster than main memory • Optional L3 cache • Main memory services level-2 cache misses

Virtual Memory • Map virtual memory (VM) pages to physical memory (PM) pages and disk

Translation Using a Page Table

Fast Translation Using a TLB
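Since only the slide titles survive here, the following is a minimal sketch of the idea rather than the slides' own figure: split the virtual address into a virtual page number and a page offset, try the TLB first, and fall back to the page table on a TLB miss. The 4 KiB page size, the dictionary-based tables, and all names are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages (12-bit page offset)

def translate(vaddr, tlb, page_table):
    """Return (physical address, tlb_hit) for a virtual address.

    tlb and page_table map virtual page numbers to physical page numbers.
    A missing page-table entry stands in for a page fault.
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:
        ppn, hit = tlb[vpn], True
    elif vpn in page_table:
        ppn, hit = page_table[vpn], False
        tlb[vpn] = ppn              # refill the TLB after the miss
    else:
        raise LookupError(f"page fault on virtual page {vpn}")
    return ppn * PAGE_SIZE + offset, hit

page_table = {0: 5, 1: 9}
tlb = {}
print(translate(0x1234, tlb, page_table))  # (37428, False): TLB miss, 0x9234
print(translate(0x1238, tlb, page_table))  # (37432, True): TLB hit, 0x9238
```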

TLB and Cache Interaction • If cache tag uses physical address • Need to translate before cache lookup • Alternative: use virtual address tag • Complications due to aliasing • Different virtual addresses for shared physical address

Other Topics • Storage and other I/O will be covered by conceptual questions • Multicore and multiprocessors will not be covered
