

  1. CprE 381 Fall 2013 Course Review Zhao Zhang

  2. Performance Summary • Define Performance = 1/Execution Time • “X is n times faster than Y” means Performance_X / Performance_Y = n
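  A quick worked example (numbers illustrative, not from the slide): if machine X runs a program in 10 s and machine Y takes 15 s, then Performance_X / Performance_Y = Time_Y / Time_X = 15 / 10 = 1.5, so X is 1.5 times faster than Y.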

  3. Amdahl’s Law (§1.8 Fallacies and Pitfalls) • Pitfall: improving one aspect of a computer and expecting a proportional improvement in overall performance • Improvements should be balanced • Not a good idea to over-optimize one aspect
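  The law in formula form, with an illustrative example: T_improved = T_affected / improvement factor + T_unaffected. If multiplies account for 80 s of a 100 s run and are made 4× faster, T_improved = 80/4 + 20 = 40 s, an overall speedup of only 2.5×; even an infinite multiply speedup could not beat 100/20 = 5×.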

  4. MIPS Instruction Encoding • Three instruction encoding formats:
      R-type: op-code (31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)
      I-type: op-code (31:26) | rs (25:21) | rt (20:16) | immediate (15:0)
      J-type: op-code (31:26) | address (25:0)
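  A minimal C sketch of how the R-type fields pack into a 32-bit word (the instruction chosen is illustrative, not from the slide):

      #include <stdio.h>
      #include <stdint.h>

      int main(void) {
          /* add $t0, $s1, $s2  ->  op=0, rs=$s1(17), rt=$s2(18),
             rd=$t0(8), shamt=0, funct=0x20 */
          uint32_t op = 0, rs = 17, rt = 18, rd = 8, shamt = 0, funct = 0x20;
          uint32_t word = (op << 26) | (rs << 21) | (rt << 16)
                        | (rd << 11) | (shamt << 6) | funct;
          printf("0x%08X\n", word);   /* prints 0x02324020 */
          return 0;
      }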

  5. Register Name and Call Convention
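  For reference, the standard MIPS register naming and calling convention: $zero (0) constant 0; $at (1) assembler temporary; $v0–$v1 (2–3) function results; $a0–$a3 (4–7) arguments; $t0–$t7 (8–15) caller-saved temporaries; $s0–$s7 (16–23) callee-saved; $t8–$t9 (24–25) more temporaries; $k0–$k1 (26–27) reserved for the OS; $gp (28) global pointer; $sp (29) stack pointer; $fp (30) frame pointer; $ra (31) return address.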

  6. IEEE Floating-Point Format • Layout: S | Exponent (single: 8 bits, double: 11 bits) | Fraction (single: 23 bits, double: 52 bits) • S: sign bit (0 => non-negative, 1 => negative) • Normalized significand: 1.0 ≤ |significand| < 2.0 • Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit) • Significand is the Fraction with the “1.” restored • Exponent: excess representation: actual exponent + Bias • Ensures the stored exponent is unsigned • Single: Bias = 127; Double: Bias = 1023
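  A minimal C sketch of unpacking the three fields from a single-precision value (the example value is chosen arbitrarily):

      #include <stdio.h>
      #include <string.h>
      #include <stdint.h>

      int main(void) {
          float f = -0.75f;                    /* -0.75 = (-1)^1 x 1.5 x 2^-1 */
          uint32_t bits;
          memcpy(&bits, &f, sizeof bits);      /* reinterpret the 32 bits */
          uint32_t s    = bits >> 31;          /* 1-bit sign */
          uint32_t exp  = (bits >> 23) & 0xFF; /* 8-bit biased exponent */
          uint32_t frac = bits & 0x7FFFFF;     /* 23-bit fraction */
          printf("S=%u Exponent=%u (actual %d) Fraction=0x%06X\n",
                 s, exp, (int)exp - 127, frac);
          return 0;   /* prints S=1 Exponent=126 (actual -1) Fraction=0x400000 */
      }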

  7. Processor Implementation (§4.1 Introduction) • We have examined two MIPS implementations • A simplified single-cycle version • A more realistic pipelined version • Both implement a simple subset that shows most aspects • Memory reference: lw, sw • Arithmetic/logical: add, sub, and, or, slt • Control transfer: beq, j

  8. Single-Cycle Processor

  9. Control Signal Setting • What are the control signal values for each instruction or instruction type? Note: “R-” means R-format
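  For reference, the standard single-cycle control values from the textbook (X = don't care):

      Instr     RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp
      R-format    1      0       0        1        0       0       0     10
      lw          0      1       1        1        1       0       0     00
      sw          X      1       X        0        0       1       0     00
      beq         X      0       X        0        0       0       1     01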

  10. ALU Control • Assume 2-bit ALUOp derived from opcode • Combinational logic derives ALU control
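  A C sketch of that combinational logic, using the textbook's 4-bit ALU control encoding (the encoding is assumed, not stated on the slide):

      /* Map 2-bit ALUOp plus the funct field to a 4-bit ALU control value. */
      unsigned alu_control(unsigned aluop, unsigned funct) {
          if (aluop == 0) return 0x2;          /* lw/sw: add */
          if (aluop == 1) return 0x6;          /* beq: subtract */
          switch (funct & 0x3F) {              /* R-type: decode funct */
              case 0x20: return 0x2;           /* add */
              case 0x22: return 0x6;           /* sub */
              case 0x24: return 0x0;           /* and */
              case 0x25: return 0x1;           /* or  */
              case 0x2A: return 0x7;           /* slt */
          }
          return 0xF;                          /* unused/undefined */
      }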

  11. MIPS Pipeline • Five stages, one step per stage • IF: Instruction fetch • ID: Instruction decode & register read • EX: Execute operation or calculate address • MEM: Access memory operand • WB: Write result back to register

  12. Pipelined Implementation

  13. Data Hazards from ALU Instructions • An instruction depends on completion of data access by a previous instruction:
      add $s0, $t0, $t1
      sub $t2, $s0, $t3
  • Consider this sequence:
      sub $2, $1, $3
      and $12, $2, $5
      or  $13, $6, $2
      add $14, $2, $2
      sw  $15, 100($2)

  14. Forwarding Paths
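  For reference, the standard forwarding conditions from the textbook (shown for ForwardA, i.e. the Rs operand; ForwardB/Rt is symmetric):

      EX hazard:  if (EX/MEM.RegWrite and EX/MEM.Rd != 0
                      and EX/MEM.Rd == ID/EX.Rs)  ForwardA = 10
      MEM hazard: if (MEM/WB.RegWrite and MEM/WB.Rd != 0
                      and not (EX hazard applies)
                      and MEM/WB.Rd == ID/EX.Rs)  ForwardA = 01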

  15. Load-Use Data Hazard • Stall for one cycle whenever a load is followed immediately by an instruction that uses the load result

  16. How to Stall the Pipeline • Force control values in the ID/EX register to 0 • Effectively inserts a NOP (bubble) • Prevent write of the PC • Holds the instruction at IF • Prevent write of the IF/ID register • Holds the instruction at ID
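  The load-use detection condition itself (standard textbook form): stall one cycle when

      ID/EX.MemRead and
      (ID/EX.Rt == IF/ID.Rs or ID/EX.Rt == IF/ID.Rt)

  i.e., the instruction in ID needs the register that the load currently in EX will write.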

  17. Datapath with Hazard Detection

  18. Code Scheduling to Avoid Stalls • Reorder code to avoid load-use stalls • Separate the load and its “use” by at least one instruction
      Original (two load-use stalls, 13 cycles):
          lw  $t1, 0($t0)
          lw  $t2, 4($t0)
          add $t3, $t1, $t2    <- stall
          sw  $t3, 12($t0)
          lw  $t4, 8($t0)
          add $t5, $t1, $t4    <- stall
          sw  $t5, 16($t0)
      Reordered (no stalls, 11 cycles):
          lw  $t1, 0($t0)
          lw  $t2, 4($t0)
          lw  $t4, 8($t0)
          add $t3, $t1, $t2
          sw  $t3, 12($t0)
          add $t5, $t1, $t4
          sw  $t5, 16($t0)

  19. Control Hazards • The IF stage works continuously • Every cycle it has to fetch an instruction to keep the pipeline flowing • Branch outcome is known at the MEM stage in the previous design • Flush IF, ID, EX for each taken branch • Optimization: move the branch comparison to the ID stage • Flush only IF for each taken branch • The MIPS ISA only supports BNE, BEQ, and branches that compare a register to zero

  20. Branch Comparison at ID • Needs more forwarding paths to the ID stage (not shown)

  21. Data Hazards for Branches • However, moving the comparison to ID creates more data hazards on branches • One stall cycle when a compared register is written by the immediately preceding ALU instruction (or by a load two instructions back):
      lw  $1, addr            IF ID EX MEM WB
      add $4, $5, $6             IF ID EX MEM WB
      beq $1, $4, target            IF (stall) ID EX MEM WB
  • Two stall cycles when it is written by the immediately preceding load:
      lw  $1, addr            IF ID EX MEM WB
      beq $1, $0, target         IF (stall) (stall) ID EX MEM WB

  22. Delayed Branch • A delayed branch may remove the one-cycle stall • The instruction right after the beq is executed whether or not the branch is taken • In other words, the execution of beq is delayed by one cycle
      Before:                After (sub fills the delay slot):
      sub $10, $4, $8        beq $1, $3, 7
      beq $1, $3, 7          sub $10, $4, $8
      and $12, $2, $5        and $12, $2, $5
  • Must find an independent instruction to fill the slot, otherwise: • May have to fill in a nop instruction, or • Need two variants of beq, delayed and not delayed

  23. Memory Technology (§5.1 Introduction) • Static RAM (SRAM) • 0.5ns – 2.5ns, $2000 – $5000 per GB • Dynamic RAM (DRAM) • 50ns – 70ns, $20 – $75 per GB • Magnetic disk • 5ms – 20ms, $0.20 – $2 per GB • Ideal memory • Access time of SRAM • Capacity and cost/GB of disk

  24. Principle of Locality • Programs access a small proportion of their address space at any time • Temporal locality • Items accessed recently are likely to be accessed again soon • e.g., instructions in a loop, induction variables • Spatial locality • Items near those accessed recently are likely to be accessed soon • e.g., sequential instruction access, array data

  25. Example • How many cache misses for a cache with 64-byte blocks?
      extern int X[256];
      for (sum = 0, i = 0; i < 256; i++)
          sum = sum + X[i];
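  A sketch of the answer, assuming a cold cache and X aligned to a block boundary: the loop reads 256 × 4 B = 1024 B sequentially, which spans 1024 / 64 = 16 blocks; the first access to each block misses and the following 15 hit, so 16 misses.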

  26. Direct Mapped Cache • Location determined by address • Direct mapped: only one choice • (Block address) modulo (#Blocks in cache) • #Blocks is a power of 2 • Use low-order address bits
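  A small C sketch of the address split for a direct-mapped cache (the parameters are illustrative, not from the slide: 64-byte blocks, 128 blocks, 32-bit addresses):

      #include <stdint.h>

      /* 6 offset bits (64-byte block), 7 index bits (128 blocks), rest is tag */
      uint32_t block_offset(uint32_t addr) { return addr & 0x3F; }        /* addr[5:0]   */
      uint32_t cache_index(uint32_t addr)  { return (addr >> 6) & 0x7F; } /* addr[12:6]  */
      uint32_t cache_tag(uint32_t addr)    { return addr >> 13; }         /* addr[31:13] */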

  27. Set Associative Cache Organization

  28. Example: Intrinsity FastMATH

  29. Measuring Cache Performance (§5.3 Measuring and Improving Cache Performance) • With simplifying assumptions:
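  For reference, the standard form of the formula: Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty = (Instructions / Program) × (Misses / Instruction) × Miss penalty, and CPU time = (CPU execution cycles + Memory stall cycles) × Clock cycle time.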

  30. Average Access Time • Average memory access time (AMAT) • AMAT = Hit time + Miss rate × Miss penalty • A simple formula that integrates three major performance factors of cache design • Not always consistent with memory stall cycles
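  A worked example with illustrative numbers: with a 1-cycle hit time, 5% miss rate, and 20-cycle miss penalty, AMAT = 1 + 0.05 × 20 = 2 cycles.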

  31. Replacement Policy • Direct mapped: no choice • Set associative • Least-recently used (LRU): Choose the one unused for the longest time • Random: Gives approximately the same performance as LRU for high associativity

  32. Cache Write Policy • Cache hit policy • Write through: Update both the cache block and memory • Incurs very high write traffic (not commonly used) • Write back: Update the cache block only • Needs a dirty bit per cache block • The memory write happens on replacement of the block • Cache miss policy • Write allocate: Load the memory block into the cache • Write around (non-allocate): Do not load the block into the cache

  33. Multilevel Caches • Primary cache attached to CPU • Level-2 cache services misses from the primary cache • Larger, slower, but still faster than main memory • Optional L3 cache • Main memory services L2 cache misses
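  An illustrative two-level AMAT calculation (all numbers assumed): with L1 hit time 1 cycle and miss rate 5%, L2 access time 10 cycles and local miss rate 20%, and a 100-cycle main-memory penalty, AMAT = 1 + 0.05 × (10 + 0.20 × 100) = 2.5 cycles.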

  34. Virtual Memory • Map virtual-memory (VM) pages to physical-memory (PM) pages and disk

  35. Translation Using a Page Table
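  A minimal C sketch of the lookup, assuming 4 KiB pages and a flat table indexed by virtual page number (valid/dirty/reference bits omitted):

      #include <stdint.h>

      #define PAGE_BITS 12   /* 4 KiB pages -> 12-bit page offset */

      uint32_t translate(const uint32_t *page_table, uint32_t vaddr) {
          uint32_t vpn    = vaddr >> PAGE_BITS;              /* virtual page number */
          uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1); /* page offset */
          uint32_t ppn    = page_table[vpn];                 /* physical page number */
          return (ppn << PAGE_BITS) | offset;                /* physical address */
      }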

  36. Fast Translation Using a TLB

  37. TLB and Cache Interaction • If the cache tag uses the physical address • Need to translate before cache lookup • Alternative: use the virtual address as the tag • Complications due to aliasing • Different virtual addresses for the same shared physical address

  38. Other Topics • Storage and other I/O will be covered by conceptual questions • Multicore and multiprocessors will not be covered
