CprE 381 Fall 2013 Course Review Zhao Zhang
Performance Summary • Define Performance = 1/Execution Time • “X is n times faster than Y” means Performance_X / Performance_Y = Execution time_Y / Execution time_X = n Chapter 1 — Computer Abstractions and Technology — 2
Amdahl’s Law (§1.8 Fallacies and Pitfalls) • Pitfall: improving one aspect of a computer and expecting a proportional improvement in overall performance • Improvements should be balanced • Over-optimizing a single aspect is not a good idea Chapter 1 — Computer Abstractions and Technology — 3
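The pitfall can be made concrete with the usual statement of Amdahl's Law: improved time = affected time / improvement factor + unaffected time. The sketch below is illustrative only; the 80 s / 20 s split and the function names are assumptions, not figures from the slides.

```python
def improved_time(t_affected, t_unaffected, factor):
    """Execution time after speeding up only the affected part by `factor`."""
    return t_affected / factor + t_unaffected


def overall_speedup(t_affected, t_unaffected, factor):
    """Ratio of the old total execution time to the new one."""
    old_total = t_affected + t_unaffected
    return old_total / improved_time(t_affected, t_unaffected, factor)


# Hypothetical workload: 80 s of a 100 s run benefits from the improvement.
for factor in (2, 4, 16, 1_000_000):
    print(factor, round(overall_speedup(80, 20, factor), 3))
```

However large the improvement factor becomes, the overall speedup in this example stays below 100/20 = 5, because the unaffected 20 s remains.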
MIPS Instruction Encoding
• Three instruction encoding formats
R-type: op-code [31:26] | rs [25:21] | rt [20:16] | rd [15:11] | shamt [10:6] | funct [5:0]
I-type: op-code [31:26] | rs [25:21] | rt [20:16] | immediate [15:0]
J-type: op-code [31:26] | address [25:0]
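As a rough illustration of the three formats, the following sketch packs fields into 32-bit words using the bit positions listed on the slide. The helper names are hypothetical; the register numbers and opcodes in the examples are the standard MIPS values.

```python
def encode_r(rs, rt, rd, shamt, funct, op=0):
    """R-type: op[31:26] rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0]."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm):
    """I-type: op[31:26] rs[25:21] rt[20:16] immediate[15:0]."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(op, address):
    """J-type: op[31:26] address[25:0]."""
    return (op << 26) | (address & 0x3FFFFFF)


# add $t0, $s1, $s2   ($t0 = 8, $s1 = 17, $s2 = 18, funct = 0x20)
add_word = encode_r(rs=17, rt=18, rd=8, shamt=0, funct=0x20)
# lw $t0, 32($s3)     (op-code = 0x23, $s3 = 19)
lw_word = encode_i(op=0x23, rs=19, rt=8, imm=32)
print(f"{add_word:#010x} {lw_word:#010x}")  # 0x02324020 0x8e680020
```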
Register Name and Call Convention Chapter 1 — Computer Abstractions and Technology — 5
IEEE Floating-Point Format
Fields: S (1 bit) | Exponent (single: 8 bits, double: 11 bits) | Fraction (single: 23 bits, double: 52 bits)
• S: sign bit (0 non-negative, 1 negative)
• Normalized significand: 1.0 ≤ |significand| < 2.0
• Always has a leading pre-binary-point 1 bit, so there is no need to represent it explicitly (hidden bit)
• Significand is the Fraction with the “1.” restored
• Exponent: excess representation, stored exponent = actual exponent + Bias
• Ensures the stored exponent is unsigned
• Single: Bias = 127; Double: Bias = 1023
Chapter 3 — Arithmetic for Computers — 6
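A small sketch of the single-precision layout, assuming normalized values only (denormals, infinities, and NaNs are ignored). It splits a float into S, Exponent, and Fraction and rebuilds the value as (-1)^S × (1 + Fraction) × 2^(Exponent - Bias); the function names are made up for illustration.

```python
import struct


def decode_single(value):
    """Return (sign, stored exponent, fraction bits) of a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def rebuild_single(sign, exponent, fraction, bias=127):
    """Rebuild a normalized value; the hidden leading 1 is restored here."""
    significand = 1 + fraction / 2**23
    return (-1) ** sign * significand * 2.0 ** (exponent - bias)


sign, exponent, fraction = decode_single(-0.75)
print(sign, exponent, hex(fraction))              # 1 126 0x400000
print(rebuild_single(sign, exponent, fraction))   # -0.75
```

Here -0.75 is -1.1 (binary) × 2^-1, so the stored exponent is -1 + 127 = 126.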
Processor Implementation §4.1 Introduction • We have examined two MIPS implementations • A simplified version • A more realistic pipelined version • Simple subset, shows most aspects • Memory reference: lw, sw • Arithmetic/logical: add, sub, and, or, slt • Control transfer: beq, j Chapter 4 — The Processor — 7
Single-Cycle Processor Chapter 4 — The Processor — 8
Control Signal Setting • What are the control signal values for each instruction or instruction type? Note: “R-” means R-format Chapter 1 — Computer Abstractions and Technology — 9
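The slide's table is not preserved in this transcript. As a stand-in, the sketch below records the settings commonly given for the textbook single-cycle MIPS datapath; treat the signal names and the table layout as assumptions about what the slide showed. `None` marks a don't-care.

```python
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
           "MemRead", "MemWrite", "Branch", "ALUOp")

# One row per instruction type, in SIGNALS order (None = don't care).
CONTROL = {
    "R-format": (1,    0, 0,    1, 0, 0, 0, 0b10),
    "lw":       (0,    1, 1,    1, 1, 0, 0, 0b00),
    "sw":       (None, 1, None, 0, 0, 1, 0, 0b00),
    "beq":      (None, 0, None, 0, 0, 0, 1, 0b01),
}


def control_signals(kind):
    """Return a name -> value mapping for one instruction type."""
    return dict(zip(SIGNALS, CONTROL[kind]))


for kind in CONTROL:
    print(kind, control_signals(kind))
```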
ALU Control • Assume 2-bit ALUOp derived from opcode • Combinational logic derives ALU control Chapter 4 — The Processor — 10
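A minimal sketch of that combinational logic, assuming the standard textbook encodings (ALUOp 00 for lw/sw, 01 for beq, 10 for R-format, and 4-bit ALU control codes). The constant and function names are hypothetical.

```python
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# funct field -> ALU control, used only for R-format instructions
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,  # add
    0b100010: ALU_SUB,  # sub
    0b100100: ALU_AND,  # and
    0b100101: ALU_OR,   # or
    0b101010: ALU_SLT,  # slt
}


def alu_control(alu_op, funct=0):
    """Derive the ALU control code from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:      # lw / sw: add to form the address
        return ALU_ADD
    if alu_op == 0b01:      # beq: subtract to compare
        return ALU_SUB
    if alu_op == 0b10:      # R-format: decided by funct
        return FUNCT_TO_ALU[funct]
    raise ValueError(f"unsupported ALUOp {alu_op:#04b}")


print(alu_control(0b10, 0b101010) == ALU_SLT)  # True
```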
MIPS Pipeline • Five stages, one step per stage • IF: Instruction fetch • ID: Instruction decode & register read • EX: Execute operation or calculate address • MEM: Access memory operand • WB: Write result back to register Chapter 4 — The Processor — 11
Pipelined Implementation Chapter 4 — The Processor — 12
Data Hazards from ALU Instructions
• An instruction depends on completion of data access by a previous instruction
add $s0, $t0, $t1
sub $t2, $s0, $t3
• Consider this sequence:
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Chapter 4 — The Processor — 13
Forwarding Paths Chapter 4 — The Processor — 14
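The forwarding figure is not preserved here. The sketch below shows the usual forwarding-unit conditions for one ALU operand, assuming the textbook mux encoding (10 = forward from EX/MEM, 01 = forward from MEM/WB, 00 = use the register file). The dictionary keys are illustrative names, not signals taken from the slide.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Pick the forwarding mux select for one ALU source register.

    ex_mem / mem_wb: dicts with 'reg_write' (bool) and 'rd' (register number).
    """
    # EX hazard: the newest result wins, so it is checked first.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return 0b10
    # MEM hazard: only used when there is no EX hazard on the same register.
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return 0b01
    return 0b00


# "and $12, $2, $5" in EX while "sub $2, $1, $3" sits in EX/MEM.
sub_in_ex_mem = {"reg_write": True, "rd": 2}
nothing = {"reg_write": False, "rd": 0}
print(forward_select(2, sub_in_ex_mem, nothing))  # 2 (binary 10)

# "or $13, $6, $2" in EX: "and" is in EX/MEM, "sub" has moved to MEM/WB.
and_in_ex_mem = {"reg_write": True, "rd": 12}
sub_in_mem_wb = {"reg_write": True, "rd": 2}
print(forward_select(2, and_in_ex_mem, sub_in_mem_wb))  # 1 (binary 01)
```

The `rd != 0` test reflects that register $0 is hard-wired to zero and must never be forwarded.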
Load-Use Data Hazard • Stall for one cycle whenever a load is followed immediately by an instruction that uses the load result Chapter 4 — The Processor — 15
How to Stall the Pipeline • Force control values in the ID/EX register to 0 • Effectively inserts a NOP (bubble) • Prevent register write of PC • Holds the instruction at IF • Prevent register write of the IF/ID register • Holds the instruction at ID Chapter 4 — The Processor — 16
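These three actions can be sketched as one hazard-detection function. It assumes the usual load-use test (the instruction in EX is a load whose destination `rt` matches a source register of the instruction in ID); the argument and key names are hypothetical.

```python
def hazard_unit(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Return the stall controls for one cycle."""
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "pc_write": not stall,        # hold the instruction at IF
        "if_id_write": not stall,     # hold the instruction at ID
        "zero_id_ex_control": stall,  # insert a bubble into EX
    }


# lw $2, 20($1) in EX, and $4, $2, $5 in ID: must stall.
print(hazard_unit(True, 2, 2, 5))
# Same pair, but the instruction in EX is not a load: no stall.
print(hazard_unit(False, 2, 2, 5))
```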
Datapath with Hazard Detection Chapter 4 — The Processor — 17
Code Scheduling to Avoid Stalls
• Reorder code to avoid load-use stall
• Separate the load and “use” by one instruction
Original order (13 cycles):
lw $t1, 0($t0)
lw $t2, 4($t0)
(stall)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
(stall)
add $t5, $t1, $t4
sw $t5, 16($t0)
Reordered (11 cycles):
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Chapter 4 — The Processor — 18
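The two cycle counts can be checked with a small model. It assumes a five-stage pipeline with full forwarding, so the only stalls are load-use stalls between adjacent instructions, and total cycles = instructions + 4 fill cycles + stalls. The tuple representation of instructions is invented for this sketch.

```python
def load_use_stalls(program):
    """program: list of (op, dest, sources); dest is None for sw."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        # A load followed immediately by a consumer of its result costs one stall.
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls


def total_cycles(program, stages=5):
    return len(program) + (stages - 1) + load_use_stalls(program)


original = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]
# Move the third load up so that neither add directly follows its load.
reordered = [original[i] for i in (0, 1, 4, 2, 3, 5, 6)]

print(total_cycles(original), total_cycles(reordered))  # 13 11
```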
Control Hazards • The IF stage works continuously • Every cycle, it has to fetch an instruction to keep the pipeline flowing • Branch outcome is known at the MEM stage in the previous design • Flush IF, ID, EX for each taken branch • Optimize: Move branch comparison to ID stage • Flush IF for each taken branch • MIPS ISA only supports BNE, BEQ, and branches that compare a register to zero Chapter 4 — The Processor — 19
Branch Comparison at ID Need more forwarding paths to ID stage (not shown) Chapter 4 — The Processor — 20
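Data Hazards for Branches
• However, more data hazards on branches
Comparison register produced by the immediately preceding ALU instruction (one stall cycle):
lw  $1, addr           IF  ID  EX  MEM WB
add $4, $5, $6             IF  ID  EX  MEM WB
beq stalled                    IF  ID
beq $1, $4, target                 ID  EX  MEM WB
Comparison register produced by the immediately preceding load (two stall cycles):
lw  $1, addr           IF  ID  EX  MEM WB
beq stalled                IF  ID
beq stalled                    ID
beq $1, $0, target                 ID  EX  MEM WB
Chapter 4 — The Processor — 21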
Delayed Branch
Delayed branch may remove the one-cycle stall
• The instruction right after the beq is executed whether or not the branch is taken
• Put another way, the execution of beq is delayed by one cycle
Before:
sub $10, $4, $8
beq $1, $3, 7
and $12, $2, $5
After (sub moved into the delay slot):
beq $1, $3, 7
sub $10, $4, $8
and $12, $2, $5
Must find an independent instruction; otherwise
• May have to fill in a nop instruction, or
• Need two variants of beq, delayed and not delayed
Chapter 1 — Computer Abstractions and Technology — 22
Memory Technology §5.1 Introduction • Static RAM (SRAM) • 0.5ns – 2.5ns, $2000 – $5000 per GB • Dynamic RAM (DRAM) • 50ns – 70ns, $20 – $75 per GB • Magnetic disk • 5ms – 20ms, $0.20 – $2 per GB • Ideal memory • Access time of SRAM • Capacity and cost/GB of disk Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 23
Principle of Locality • Programs access a small proportion of their address space at any time • Temporal locality • Items accessed recently are likely to be accessed again soon • e.g., instructions in a loop, induction variables • Spatial locality • Items near those accessed recently are likely to be accessed soon • E.g., sequential instruction access, array data Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 24
Example
• How many cache misses for a cache of 64-byte block size?
extern int X[256];
for (sum = 0, i = 0; i < 256; i++)
    sum = sum + X[i];
Chapter 1 — Computer Abstractions and Technology — 25
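One way to check an answer is to count the distinct blocks the loop touches. The sketch assumes 4-byte `int` elements, a cold cache, `sum` and `i` kept in registers, and a cache large enough that no block is evicted before the loop leaves it; the base addresses are hypothetical.

```python
def count_block_misses(base, n_elems, elem_bytes=4, block_bytes=64):
    """Count compulsory misses for one sequential pass over an array."""
    seen = set()
    misses = 0
    for i in range(n_elems):
        block = (base + i * elem_bytes) // block_bytes
        if block not in seen:   # first touch of this block: a miss
            seen.add(block)
            misses += 1
    return misses


print(count_block_misses(base=0, n_elems=256))   # 16: X starts on a block boundary
print(count_block_misses(base=32, n_elems=256))  # 17: X straddles one extra block
```

With these assumptions, the 1024-byte array spans 1024 / 64 = 16 blocks when aligned, so 16 of the 256 accesses miss; spatial locality turns the other 240 into hits.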
Direct Mapped Cache • Location determined by address • Direct mapped: only one choice • (Block address) modulo (#Blocks in cache) • #Blocks is a power of 2 • Use low-order address bits Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 26
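A short sketch of the mapping, splitting a byte address into tag, index, and block offset. The cache geometry in the example (64 blocks of 16 bytes) and the function name are assumptions chosen for illustration.

```python
def split_address(addr, block_bytes, num_blocks):
    """Return (tag, index, offset) for a direct-mapped cache.

    block_bytes and num_blocks are assumed to be powers of 2.
    """
    offset = addr % block_bytes
    block_addr = addr // block_bytes
    index = block_addr % num_blocks   # (block address) modulo (#blocks)
    tag = block_addr // num_blocks
    return tag, index, offset


tag, index, offset = split_address(1200, block_bytes=16, num_blocks=64)
print(tag, index, offset)  # 1 11 0

# Because #blocks is a power of 2, the index is just the low-order
# bits of the block address: here 6 bits above the 4 offset bits.
print(index == (1200 >> 4) & 0x3F)  # True
```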
Set Associative Cache Organization Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 27
Example: Intrinsity FastMATH Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 28
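Measuring Cache Performance (§5.3 Measuring and Improving Cache Performance) • With simplifying assumptions: Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty = (Instructions / Program) × (Misses / Instruction) × Miss penalty Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 29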
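A worked example of the usual stall-cycle accounting, in which stall cycles per instruction equal accesses per instruction times miss rate times miss penalty. All numbers below (base CPI, miss rates, penalty, load/store fraction) are illustrative assumptions, not values from the slides.

```python
def stall_cpi(miss_rate, miss_penalty, accesses_per_instr=1.0):
    """Memory stall cycles added per instruction."""
    return accesses_per_instr * miss_rate * miss_penalty


base_cpi = 2.0
# Every instruction is fetched once: 2% instruction-cache miss rate.
icache_stalls = stall_cpi(0.02, 100)
# 36% of instructions are loads/stores: 4% data-cache miss rate.
dcache_stalls = stall_cpi(0.04, 100, accesses_per_instr=0.36)

actual_cpi = base_cpi + icache_stalls + dcache_stalls
print(round(actual_cpi, 2))             # 5.44
print(round(actual_cpi / base_cpi, 2))  # 2.72: slowdown versus a perfect cache
```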
Average Access Time • Average memory access time (AMAT) • AMAT = Hit time + Miss rate × Miss penalty • A simple formula that integrates three major performance factors of cache design • Not always consistent with memory stall cycles Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 30
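The formula is simple enough to evaluate directly. The parameters below (1-cycle hit, 5% miss rate, 20-cycle penalty, 1 ns clock) are assumed for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as the inputs."""
    return hit_time + miss_rate * miss_penalty


cycles = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)
clock_ns = 1.0
print(round(cycles, 3), "cycles =", round(cycles * clock_ns, 3), "ns")  # 2.0 cycles = 2.0 ns
```

Note that AMAT averages over memory accesses, while memory stall cycles are usually counted per instruction; this is one reason the two metrics can rank designs differently.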
Replacement Policy • Direct mapped: no choice • Set associative • Least-recently used (LRU): Choose the one unused for the longest time • Random: Gives approximately the same performance as LRU for high associativity Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 31
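A minimal software model of LRU replacement within one set, using an ordered dictionary to track recency. This is a sketch of the policy only; real hardware tracks recency with a few status bits per set, and the class name is hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with `ways` blocks and LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)   # evict the least recently used
        self.blocks[tag] = None
        return False


# Block addresses 0, 8, 0, 6, 8 all competing for the same 2-way set.
lru_set = LRUSet(ways=2)
hits = [lru_set.access(tag) for tag in (0, 8, 0, 6, 8)]
print(hits)  # [False, False, True, False, False]
```

In this trace the access to 6 evicts 8 (the block unused for the longest time), so the final access to 8 misses again.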
Cache Write Policy Cache hit policy • Write through: Update both the cache block and memory • Incurs very high write traffic (not commonly used) • Write back: Update cache block only • Needs a dirty bit per cache block • Write to memory happens on replacement of the block Cache miss policy • Write allocate: Load the memory block to cache • Write around (non-allocate): Do not load the block to cache Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 32
Multilevel Caches • Primary cache attached to CPU • Level-2 cache services misses from primary cache • Larger, slower, but still faster than main memory • Optional L3 cache • Main memory services L-2 cache misses Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 33
Virtual Memory • Map VM pages to PM pages and disk Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 34
Translation Using a Page Table Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 35
Fast Translation Using a TLB Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 36
TLB and Cache Interaction • If cache tag uses physical address • Need to translate before cache lookup • Alternative: use virtual address tag • Complications due to aliasing • Different virtual addresses for shared physical address Chapter 5 — Large and Fast: Exploiting Memory Hierarchy — 37
Other Topics • Storage and other I/O will be covered by conceptual questions • Multicore and multiprocessors will not be covered Chapter 1 — Computer Abstractions and Technology — 38