
Final Exam Review

Get ready for the final exam covering cache design, memory hierarchy, processor speeds, and more. Includes key concepts, performance factors, and practical examples.


Presentation Transcript


  1. Final Exam Review

  2. Exam Format
  • It will cover material after the mid-term (cache to multiprocessors).
  • It is similar in style to the mid-term exam.
  • There will be 6-7 questions in the exam:
    • One true/false or short-answer question covering general topics.
    • 5-6 other questions requiring calculation.

  3. Memory Systems

  4. Memory Hierarchy - the Big Picture
  • Problem: memory is too slow and/or too small.
  • Solution: a memory hierarchy.
  • [Diagram: Registers → L1 On-Chip Cache → L2 Off-Chip Cache → Main Memory (DRAM) → Secondary Storage (Disk). Moving away from the processor, capacity grows from smallest to biggest, speed falls from fastest to slowest, and cost per byte falls from highest to lowest.]

  5. Why Hierarchy Works
  • The principle of locality: programs access a relatively small portion of the address space at any instant of time.
    • Temporal locality: recently accessed instructions/data are likely to be used again.
    • Spatial locality: instructions/data near recently accessed instructions/data are likely to be used soon.
  • Result: the illusion of large, fast memory.
  • [Figure: probability of reference plotted over the address space 0 to 2^n - 1.]

  6. Cache Design & Operation Issues
  • Q1: Where can a block be placed in the cache? (Block placement strategy & cache organization)
    • Fully associative, set associative, direct mapped.
  • Q2: How is a block found if it is in the cache? (Block identification)
    • Tag/block.
  • Q3: Which block should be replaced on a miss? (Block replacement)
    • Random, LRU.
  • Q4: What happens on a write? (Cache write policy)
    • Write through, write back.

  7. Q1: Block Placement
  • Where can a block be placed in the cache?
  • In one predetermined place - direct-mapped:
    • Use a fragment of the address to calculate the block location in the cache.
    • Compare the cache block's tag with the address tag to test if the block is present.
  • Anywhere in the cache - fully associative:
    • Compare the tag to every block in the cache.
  • In a limited set of places - set-associative:
    • Use an address fragment to calculate the set, then place the block in any way of that set.
    • Compare the tag to every block in the set.
    • A hybrid of direct mapped and fully associative.

  8. Q2: Block Identification
  • Every cache block has an address tag and index that identify its location in memory.
  • Hit when the tag and index of the desired word match (the comparison is done by hardware).
  • Q: What happens when a cache block is empty? A: Mark this condition with a valid bit.

    Valid | Tag       | Data
    1     | 0x00001C0 | 0xff083c2d
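
  A minimal sketch of this lookup in Python (the parameters are illustrative assumptions, not from the slides: 32-bit addresses, a direct-mapped cache with 64 sets and 16-byte blocks). The index selects a cache line; the stored tag and valid bit decide hit or miss:

    OFFSET_BITS = 4                      # log2(16-byte block)
    INDEX_BITS = 6                       # log2(64 sets)

    def split_address(addr):
        offset = addr & ((1 << OFFSET_BITS) - 1)
        index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        return tag, index, offset

    # One line per set: (valid, tag, data); valid = 0 marks an empty block.
    cache = [(0, None, None)] * (1 << INDEX_BITS)

    def lookup(addr):
        tag, index, _ = split_address(addr)
        valid, stored_tag, data = cache[index]
        return data if (valid and stored_tag == tag) else None   # None = miss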

  9. Cache Replacement Policy
  • Random: replace a randomly chosen line.
  • LRU (Least Recently Used): replace the least recently used line.
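
  A minimal LRU sketch in Python for a single cache set (the names and structure are illustrative assumptions, not a prescribed implementation):

    from collections import OrderedDict

    class LRUSet:
        """One set of a set-associative cache with LRU replacement."""
        def __init__(self, ways):
            self.ways = ways
            self.lines = OrderedDict()          # tag -> data, least recent first

        def access(self, tag, data=None):
            if tag in self.lines:               # hit: mark as most recently used
                self.lines.move_to_end(tag)
                return True
            if len(self.lines) >= self.ways:    # miss in a full set:
                self.lines.popitem(last=False)  #   evict the least recently used
            self.lines[tag] = data              # fill the line
            return False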

  10. Write-through Policy
  [Diagram: the processor overwrites 0x1234 with 0x5678; the write updates the cache and main memory together, so both always hold the same value.]

  11. Write-back Policy
  [Diagram: successive processor writes (0x5678, then 0x9ABC) update only the cache; memory still holds the old value until the modified block is written back on eviction.]
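
  A small Python sketch contrasting the two policies (a dict-based toy model; the helper names are assumptions for illustration):

    def write_through(cache, memory, addr, value):
        cache[addr] = value
        memory[addr] = value                # every write also updates memory

    def write_back(cache, dirty, addr, value):
        cache[addr] = value
        dirty.add(addr)                     # memory is NOT updated yet

    def evict(cache, dirty, memory, addr):
        if addr in dirty:                   # dirty block: write it back now
            memory[addr] = cache[addr]
            dirty.discard(addr)
        cache.pop(addr, None)

    cache, memory, dirty = {}, {0x10: 0x1234}, set()
    write_back(cache, dirty, 0x10, 0x5678)  # memory still holds 0x1234
    evict(cache, dirty, memory, 0x10)       # now memory holds 0x5678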

  12. Cache Performance: Average Memory Access Time (AMAT), Memory Stall Cycles
  • Average Memory Access Time (AMAT): the number of cycles required to complete an average memory access request by the CPU.
  • Memory stall cycles per memory access: the number of stall cycles added to CPU execution cycles for one memory access.
  • For an ideal memory, AMAT = 1 cycle, which results in zero memory stall cycles.
  • Memory stall cycles per average memory access = AMAT - 1
  • Memory stall cycles per average instruction
    = memory stall cycles per average memory access x number of memory accesses per instruction
    = (AMAT - 1) x (1 + fraction of loads/stores)
    (the 1 counts the instruction fetch; the fraction of loads/stores counts the data accesses)
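
  A quick numeric sketch of these formulas in Python (the rates and penalty are assumed example values):

    def amat(miss_rate, miss_penalty, hit_time=1):
        return hit_time + miss_rate * miss_penalty

    def stalls_per_instruction(amat_cycles, frac_loads_stores):
        # (AMAT - 1) x (1 + fraction of loads/stores)
        return (amat_cycles - 1) * (1 + frac_loads_stores)

    a = amat(0.05, 20)                     # 1 + 0.05 x 20 = 2.0 cycles
    print(stalls_per_instruction(a, 0.3))  # (2 - 1) x 1.3 = 1.3 stalls/instruction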

  13. Cache Performance
  • Unified cache: for a CPU with a single level (L1) of cache for both instructions and data and no stalls for cache hits:
    CPUtime = IC x (CPIexecution + memory stall cycles per instruction) x clock cycle time
    CPUtime = IC x [CPIexecution + memory accesses per instruction x miss rate x miss penalty] x clock cycle time
  • Split cache: for a CPU with separate (split) level-one (L1) caches for instructions and data and no stalls for cache hits:
    CPUtime = IC x (CPIexecution + memory stall cycles per instruction) x clock cycle time
    Memory stall cycles per instruction = instruction fetch miss rate x miss penalty + data memory accesses per instruction x data miss rate x miss penalty
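
  A sketch of the unified-cache equation in Python (all input values are assumed examples):

    def cpu_time(ic, cpi_exec, accesses_per_instr, miss_rate, miss_penalty, cycle):
        stall_cycles = accesses_per_instr * miss_rate * miss_penalty
        return ic * (cpi_exec + stall_cycles) * cycle

    # 1M instructions, CPIexecution = 1.1, 1.3 accesses/instruction,
    # 2% miss rate, 50-cycle miss penalty, 1 ns clock cycle:
    print(cpu_time(1e6, 1.1, 1.3, 0.02, 50, 1e-9))   # 0.0024 seconds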

  14. Memory Access Tree for Unified Level 1 Cache
  CPU memory access:
  • L1 hit: % = hit rate = H1; access time = 1; stalls = H1 x 0 = 0 (no stall).
  • L1 miss: % = miss rate = (1 - H1); access time = M + 1; stall cycles per access = M x (1 - H1).
  AMAT = H1 x 1 + (1 - H1) x (M + 1) = 1 + M x (1 - H1)
  Stall cycles per access = AMAT - 1 = M x (1 - H1)
  where M = miss penalty, H1 = level-1 hit rate, 1 - H1 = level-1 miss rate.

  15. Memory Access Tree for Separate Level 1 Caches
  CPU memory access splits into instruction and data accesses:
  • Instruction L1 hit: access time = 1; stalls = 0.
  • Instruction L1 miss: access time = M + 1; stalls per access = % instructions x (1 - instruction H1) x M.
  • Data L1 hit: access time = 1; stalls = 0.
  • Data L1 miss: access time = M + 1; stalls per access = % data x (1 - data H1) x M.
  Stall cycles per access = % instructions x (1 - instruction H1) x M + % data x (1 - data H1) x M
  AMAT = 1 + stall cycles per access
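
  The same tree as a one-line Python helper (the instruction/data split, hit rates, and penalty are assumed example numbers):

    def split_l1_stalls(frac_instr, h1_instr, frac_data, h1_data, m):
        return frac_instr * (1 - h1_instr) * m + frac_data * (1 - h1_data) * m

    s = split_l1_stalls(0.75, 0.98, 0.25, 0.95, 50)
    print(s, 1 + s)      # 1.375 stall cycles per access, AMAT = 2.375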

  16. Cache Performance (Various Factors)
  • Cache impact on performance:
    • With and without a cache.
    • Processor clock rate.
  • Which one performs better, unified or split (assuming the same size)?
  • What is the effect of cache organization on cache performance: 1-way vs. 8-way set associative?
  • Tradeoffs between hit time and hit rate.

  17. Cache Performance (Various Factors)
  • What is the effect of write policy on cache performance: write back or write through, write allocate vs. no-write allocate?
    • Write through: stall cycles per memory access = % reads x (1 - H1) x M + % writes x M
    • Write back: stall cycles per memory access = (1 - H1) x (M x % clean + 2M x % dirty)
  • What is the effect of cache levels on performance? (See the sketch below.)
    • Two levels: stall cycles per memory access = (1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x M
    • Three levels: stall cycles per memory access = (1 - H1) x H2 x T2 + (1 - H1) x (1 - H2) x H3 x T3 + (1 - H1) x (1 - H2) x (1 - H3) x M
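
  A numeric check of the two-level formula in Python (the hit rates and times are assumed examples; H2 and T2 are the L2 hit rate and access time for accesses that miss in L1):

    def stalls_two_level(h1, h2, t2, m):
        return (1 - h1) * h2 * t2 + (1 - h1) * (1 - h2) * m

    print(stalls_two_level(0.95, 0.80, 10, 100))   # 0.4 + 1.0 = 1.4 cycles/access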

  18. Performance Equation
  To reduce CPUtime, we need to reduce the cache miss rate.

  19. Reducing Misses (3 Cs)
  • Classifying cache misses: the 3 Cs.
    • Compulsory - misses that occur even in an infinite-size cache.
    • Capacity - misses due to the size of the cache.
    • Conflict - misses due to the associativity and size of the cache.
  • How to reduce the 3 Cs (miss rate):
    • Increase block size.
    • Increase associativity.
    • Use a victim cache.
    • Use a pseudo-associative cache.
    • Use a prefetching technique.

  20. Performance Equation
  To reduce CPUtime, we need to reduce the cache miss penalty.

  21. Memory Interleaving - Reduce Miss Penalty
  • Default (no interleaving): must finish accessing one word before starting the next access.
    • Fetching a 4-word block takes (1 + 25 + 1) x 4 = 108 cycles.
  • Interleaving: begin accessing one word and, while waiting, start accessing the other three words (pipelining).
    • Fetching the same block takes 1 + 25 + 4 x 1 = 30 cycles.
  • Requires 4 separate memories, each 1/4 size; addresses are spread out among the memories.
  • Interleaving works perfectly with caches.
  • [Diagram: CPU - cache - memory bus; without interleaving, a single memory; with interleaving, four banks Memory0-Memory3 on the bus. Per word: 1 cycle to send the address, 25 cycles access time, 1 cycle to return the word.]

  22. Memory Interleaving: An Example
  Given the following system parameters with a single cache level L1:
  • Block size = 1 word; memory bus width = 1 word; miss rate = 3%.
  • Miss penalty = 27 cycles (1 cycle to send the address, 25 cycles access time per word, 1 cycle to send a word).
  • Memory accesses per instruction = 1.2; ideal CPI (ignoring cache misses) = 2.
  • Miss rate (block size = 2 words) = 2%; miss rate (block size = 4 words) = 1%.
  The CPI of the base machine with 1-word blocks = 2 + (1.2 x 0.03 x 27) = 2.97.
  Increasing the block size to two words gives the following CPI:
  • 32-bit bus and memory, no interleaving = 2 + (1.2 x 0.02 x 2 x 27) = 3.30
  • 32-bit bus and memory, interleaved = 2 + (1.2 x 0.02 x 28) = 2.67
  Increasing the block size to four words gives the following CPI:
  • 32-bit bus and memory, no interleaving = 2 + (1.2 x 0.01 x 4 x 27) = 3.30
  • 32-bit bus and memory, interleaved = 2 + (1.2 x 0.01 x 30) = 2.36
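
  The example's arithmetic, reproduced in a few lines of Python:

    ideal_cpi, accesses = 2, 1.2

    def cpi(miss_rate, miss_penalty):
        return ideal_cpi + accesses * miss_rate * miss_penalty

    print(cpi(0.03, 27))           # 1-word blocks:            2.97
    print(cpi(0.02, 2 * 27))       # 2 words, no interleaving: 3.30
    print(cpi(0.02, 1 + 25 + 2))   # 2 words, interleaved:     2.67
    print(cpi(0.01, 4 * 27))       # 4 words, no interleaving: 3.30
    print(cpi(0.01, 1 + 25 + 4))   # 4 words, interleaved:     2.36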

  23. Cache vs. Virtual Memory
  • Motivation for virtual memory: physical memory size, multiprogramming.
  • The concept behind VM is almost identical to the concept behind caches, but with different terminology:
    • Cache: block          VM: page
    • Cache: cache miss     VM: page fault
  • Caches are implemented completely in hardware; VM is implemented in software, with hardware support from the CPU.
  • Cache speeds up main memory access, while main memory speeds up VM access.
  • Translation Look-Aside Buffer (TLB).
  • How to calculate the size of page tables for a given memory system, and the size of pages given the size of the page table (a worked sketch follows below).
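
  A worked page-table size sketch in Python (the parameters are assumed for illustration: 32-bit virtual addresses, 4 KB pages, 4-byte page-table entries):

    VA_BITS, PAGE_BYTES, PTE_BYTES = 32, 4 * 1024, 4

    entries = 2**VA_BITS // PAGE_BYTES    # 2^32 / 2^12 = 2^20 pages
    table_bytes = entries * PTE_BYTES     # 2^20 x 4 B = 4 MB per page table
    print(entries, table_bytes)           # 1048576  4194304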

  24. Virtual Memory: Definitions
  • Key idea: simulate a larger physical memory than is actually available.
  • General approach:
    • Break the address space up into pages.
    • Each program accesses a working set of pages.
    • Store pages in physical memory as space permits, and on disk when no space is left in physical memory.
    • Access pages using virtual addresses.
  • [Diagram: a memory map translating individual pages of the virtual address space to physical memory or disk.]

  25. I/O Systems

  27. I/O Concepts
  • Disk performance:
    • Disk latency = average seek time + average rotational delay + transfer time + controller overhead
  • Interrupt-driven I/O.
  • Memory-mapped I/O.
  • I/O channels: DMA (Direct Memory Access).
  • I/O communication protocols: daisy chaining, polling.
  • I/O buses: synchronous vs. asynchronous.
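
  The disk latency formula as a small Python helper (all drive parameters are assumed example values; average rotational delay is taken as half a revolution):

    def disk_latency_ms(seek_ms, rpm, transfer_kb, mb_per_s, ctrl_ms):
        rotational_ms = 0.5 * 60_000 / rpm           # average: half a revolution
        transfer_ms = transfer_kb / 1024 / mb_per_s * 1000
        return seek_ms + rotational_ms + transfer_ms + ctrl_ms

    # 5 ms average seek, 7200 RPM, 4 KB transfer at 100 MB/s, 0.2 ms controller:
    print(disk_latency_ms(5, 7200, 4, 100, 0.2))     # ~9.41 ms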

  28. RAID Systems
  Examined various RAID architectures (RAID 0 - RAID 5): cost, performance (bandwidth, I/O request rate).
  • RAID 0: no redundancy
  • RAID 1: mirroring
  • RAID 2: memory-style ECC
  • RAID 3: bit-interleaved parity
  • RAID 4: block-interleaved parity
  • RAID 5: block-interleaved distributed parity

  29. Storage Architectures
  Examined various storage architectures (pros and cons):
  • DAS - Directly-Attached Storage
  • NAS - Network-Attached Storage
  • SAN - Storage Area Network

  30. Multiprocessors

  31. Motivation
  • Application needs.
  • Amdahl's law: T(n) = 1 / (s + p/n), where s is the serial fraction, p the parallel fraction (s + p = 1), and n the number of processors.
    • As n → ∞, T(n) → 1/s: speedup is bounded by the serial fraction.
  • Gustafson's law: T'(n) = s + n x p.
    • As n → ∞, T'(n) → ∞: scaled speedup keeps growing.
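
  Both laws in a few lines of Python (the 5% serial fraction is an assumed example value):

    def amdahl_speedup(s, n):
        return 1 / (s + (1 - s) / n)       # bounded above by 1/s

    def gustafson_speedup(s, n):
        return s + n * (1 - s)             # grows without bound in n

    for n in (10, 100, 1000):
        print(n, round(amdahl_speedup(0.05, n), 2), gustafson_speedup(0.05, n))
    # n=10: 6.9 vs 9.55;  n=100: 16.81 vs 95.05;  n=1000: 19.63 vs 950.05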

  32. Flynn's Taxonomy of Computing
  • SISD (Single Instruction, Single Data):
    • Typical uniprocessor systems that we've studied throughout this course.
  • SIMD (Single Instruction, Multiple Data):
    • Multiple processors simultaneously executing the same instruction on different data.
    • Specialized applications (e.g., image processing).
  • MIMD (Multiple Instruction, Multiple Data):
    • Multiple processors autonomously executing different instructions on different data.

  33. Shared Memory Multiprocessors
  [Diagram: processor/cache (P/C) nodes, each with a memory bus (MB) and network interface circuitry (NIC), connected over a bus or custom-designed network to a shared memory.]

  34. MPP (Massively Parallel Processing): Distributed Memory Multiprocessors
  [Diagram: nodes of processor/cache (P/C) with local memory (LM), memory bus (MB), and network interface circuitry (NIC), connected by a custom-designed network.]

  35. Cluster
  [Diagram: nodes with processor/cache (P/C), memory (M), and a bridge to an I/O bus (IOB) holding a local disk (LD) and NIC, connected over a commodity network (Ethernet, ATM, Myrinet).]

  36. Grid
  [Diagram: sites of processor (P/C) nodes with IOCs, local disks (LD), and shared memory (SM) on a hub/LAN, connected through NICs to the Internet.]

  37. Multiprocessor Concepts
  • SIMD applications (image processing).
  • MIMD:
    • Shared memory: cache coherence problems, bus scalability problems.
    • Distributed memory: interconnection networks, clusters of workstations.

  38. Preparation Strategy
  • Read this review to focus your preparation:
    • 1 general question and 5-6 other questions.
    • Around 50% on memory systems; around 50% on I/O and multiprocessors.
  • Go through the lecture notes.
  • Go through the "training problems".
  • We will have more office hours for help.
  • Good luck!
