Memory Hierarchy: Faster Access, Lower Cost
Principle of Locality • Programs access small portions of their address space at any instant of time. • Two types • Temporal locality • Item referenced will be referenced again soon • Spatial locality • Items near the last referenced item will be referenced soon
Memory Hierarchy • Takes advantage of the principle of locality • Memory technologies • SRAM – fast but costly • DRAM – slower but not as costly • Magnetic disk – much slower but very cheap • Idea: construct a hierarchy of these memories, increasing in size with distance from the processor
Cache Memory (Two Level) • Block – Smallest unit of data transferred between levels • Hit rate – Fraction of memory accesses found in the cache • Miss rate – (1 – hit rate) • Hit time – Time to access a level of memory, including the time to determine hit or miss • Miss penalty – Time required to fetch a block from the lower memory level • [Figure: processor backed by a cache in a two-level hierarchy]
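These terms combine into the usual average-memory-access-time relation; a minimal sketch with illustrative numbers (not from the slide):

```c
#include <stdio.h>

int main(void) {
    /* Illustrative numbers, not from the slide */
    double hit_time     = 1.0;    /* cycles to access the cache */
    double miss_rate    = 0.05;   /* 1 - hit rate */
    double miss_penalty = 100.0;  /* cycles to fetch a block from lower memory */

    /* Average memory access time = hit time + miss rate x miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);   /* prints 6.0 cycles */
    return 0;
}
```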
Direct Mapped Cache • How do you map a block from the larger memory space to the cache? • Simplest method: assign one cache location for each memory location • Function: • (block addr) mod (# cache blocks) • If # cache blocks is 2^n, the cache index for block address A is A mod 2^n • Note this is just the lower n bits of A
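A quick sketch of the index calculation with illustrative values, showing that the mod view and the lower-n-bits view agree:

```c
#include <stdio.h>

int main(void) {
    unsigned num_blocks = 8;      /* 2^3 blocks, so n = 3 */
    unsigned addr = 0x16;         /* block address 10110 in binary */

    unsigned by_mod  = addr % num_blocks;        /* (block addr) mod (# blocks) */
    unsigned by_bits = addr & (num_blocks - 1);  /* lower n bits: same result */

    printf("index: %u (mod) == %u (mask)\n", by_mod, by_bits);  /* both 6 */
    return 0;
}
```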
Accessing A Cache • References (5-bit block addresses): 10110 – miss, 11010 – miss, 10110 – hit, 11010 – hit, 10000 – miss, 00011 – miss, 10000 – hit, 10010 – miss
Updated Cache • [Figure: cache contents after processing the reference sequence above]
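A minimal simulation of this trace, assuming an 8-block direct-mapped cache (3 index bits, 2 tag bits); it reproduces the hit/miss pattern shown above:

```c
#include <stdio.h>

#define NUM_BLOCKS 8    /* 2^3 blocks -> 3 index bits */

int main(void) {
    int valid[NUM_BLOCKS] = {0};      /* valid bit per cache block */
    unsigned tag[NUM_BLOCKS] = {0};   /* tag per cache block */

    /* The 5-bit block addresses from the slide:
       10110 11010 10110 11010 10000 00011 10000 10010 */
    unsigned refs[] = {22, 26, 22, 26, 16, 3, 16, 18};
    int n = sizeof refs / sizeof refs[0];

    for (int i = 0; i < n; i++) {
        unsigned index = refs[i] % NUM_BLOCKS;  /* lower 3 bits */
        unsigned t     = refs[i] / NUM_BLOCKS;  /* upper 2 bits */
        if (valid[index] && tag[index] == t) {
            printf("addr %2u -> hit\n", refs[i]);
        } else {
            printf("addr %2u -> miss\n", refs[i]);
            valid[index] = 1;                   /* fill the block on a miss */
            tag[index]   = t;
        }
    }
    return 0;
}
```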
Handling Cache Misses • The control unit must be modified to stall the processor when a miss occurs • Consider an instruction memory miss • Algorithm • Send the original PC value (current PC – 4) to memory • Read memory and wait for the result • Write the cache entry (data, tag, valid bit) • Restart the instruction fetch, which will now hit in the cache
Handling Writes • Goal: avoid inconsistency between cache and memory • Two approaches • Write-through • Write-back
Write-Through • Idea: write data into both the cache and memory • Simple solution • Problematic: the write to memory takes much longer than the write to cache (perhaps 100 times longer) • Can use a write buffer • What problems arise from using a write buffer?
Write-Back • Write only to the cache • Mark cache blocks that have been written to as “dirty” • If a block is dirty, it must be written to memory when it is replaced • What problems can arise from this strategy?
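A minimal sketch of the write-back bookkeeping, assuming a direct-mapped cache with a dirty bit per block; all sizes and names are illustrative. The eviction check is where the deferred write to memory happens:

```c
#include <stdio.h>

#define NUM_BLOCKS  8
#define BLOCK_WORDS 4
#define MEM_WORDS   1024

unsigned memory[MEM_WORDS];           /* toy main memory */

typedef struct {
    int valid, dirty;                 /* dirty = written since it was fetched */
    unsigned tag;
    unsigned data[BLOCK_WORDS];
} CacheBlock;

CacheBlock cache[NUM_BLOCKS];

static void copy_block(unsigned *dst, const unsigned *src) {
    for (int i = 0; i < BLOCK_WORDS; i++) dst[i] = src[i];
}

void write_word(unsigned addr, unsigned value) {
    unsigned block = addr / BLOCK_WORDS;
    unsigned index = block % NUM_BLOCKS;
    unsigned tag   = block / NUM_BLOCKS;
    CacheBlock *b  = &cache[index];

    if (!b->valid || b->tag != tag) {                 /* write miss */
        if (b->valid && b->dirty) {                   /* evict a dirty block */
            unsigned old = (b->tag * NUM_BLOCKS + index) * BLOCK_WORDS;
            copy_block(&memory[old], b->data);        /* write back to memory */
        }
        copy_block(b->data, &memory[block * BLOCK_WORDS]);  /* fetch new block */
        b->valid = 1; b->tag = tag; b->dirty = 0;
    }
    b->data[addr % BLOCK_WORDS] = value;  /* update the cache only */
    b->dirty = 1;                         /* memory is now stale */
}

int main(void) {
    write_word(5, 42);                            /* miss: fetch, then write */
    write_word(5 + NUM_BLOCKS * BLOCK_WORDS, 7);  /* conflict: evicts dirty block */
    printf("memory[5] = %u\n", memory[5]);        /* 42, written back on eviction */
    return 0;
}
```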
Memory Design to Support Caches • Assume (one-word-wide memory organization): • 1 memory bus clock cycle to send the address • 15 memory bus clock cycles per DRAM access • 1 memory bus clock cycle to send one word of data • For a 4-word block transfer: • 1 + 4×15 + 4×1 = 65 bus clock cycles • Miss penalty is high • Bytes transferred per clock cycle: (4×4)/65 ≈ 0.25
Memory Designs • [Figure: (a) one-word-wide memory, (b) wider memory and bus, (c) interleaved memory banks] • How do designs (b) and (c) increase the bytes-per-clock-cycle transfer rate?
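A small calculation comparing the three organizations, assuming (b) is a four-word-wide memory and bus and (c) is four interleaved banks on a one-word bus (the usual textbook variants of this figure):

```c
#include <stdio.h>

int main(void) {
    int send_addr = 1, dram = 15, xfer = 1, words = 4;

    /* (a) one-word-wide memory: each word pays a DRAM access + a transfer */
    int a = send_addr + words * dram + words * xfer;   /* 65 cycles */

    /* (b) four-word-wide memory and bus: one access, one transfer */
    int b = send_addr + dram + xfer;                   /* 17 cycles */

    /* (c) four interleaved banks, one-word bus: accesses overlap,
       transfers remain sequential */
    int c = send_addr + dram + words * xfer;           /* 20 cycles */

    printf("miss penalty: a=%d b=%d c=%d cycles\n", a, b, c);
    printf("bytes/cycle:  a=%.2f b=%.2f c=%.2f\n",
           16.0 / a, 16.0 / b, 16.0 / c);
    return 0;
}
```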
Bits In Cache • Block size is larger than a word – say 2^m words • Cache has 2^n blocks • Tag bits: 32 – (n + m + 2) (the 2 covers the byte offset) • Total size: 2^n × (2^m × 32 + (32 – n – m – 2) + 1) bits (data + tag + valid bit)
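A quick check of the formula with assumed example parameters (2^10 blocks, 4-word blocks, 32-bit addresses):

```c
#include <stdio.h>

int main(void) {
    /* Assumed parameters: 2^n = 1024 blocks, 2^m = 4 words per block */
    int n = 10, m = 2;

    int block_bits = (1 << m) * 32;          /* data bits per block: 128 */
    int tag_bits   = 32 - (n + m + 2);       /* 18; 2 bits for byte offset */
    int total_bits = (1 << n) * (block_bits + tag_bits + 1);  /* +1 valid bit */

    printf("tag = %d bits, total = %d bits (%.1f Kibit)\n",
           tag_bits, total_bits, total_bits / 1024.0);  /* 147.0 Kibit */
    return 0;
}
```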
Analysis of Block Size • Larger blocks exploit spatial locality • Therefore, the miss rate is lowered • What happens as block size continues to get larger? • Cache size is fixed • Number of cache blocks is reduced • Contention for block space in the cache increases • Miss rate goes back up
Measuring Cache Performance • CPU time = (CPU execution cycles + memory-stall cycles) × clock cycle time • Read-stall cycles = Reads/Program × Read miss rate × Read miss penalty • Writes are harder to model because of write-buffer stalls
Measuring Cache Performance: Simplifications • Assume a write-through scheme • Assume a well-designed system, so write-buffer stalls can be ignored • Assume read and write miss penalties are the same • Memory-stall clock cycles = Instructions/Program × Misses/Instruction × Miss penalty
Example • Assume • Instruction cache miss rate: 2% • Data cache miss rate: 4% • CPI (cycles per instruction): 2 • Miss penalty: 100 clock cycles • SPECint2000 benchmark: 36% load & store instructions • Clock cycle time: 1 ns (1×10^-9 s) • Find the CPU execution time • How much faster would a perfect cache be?
Solution • Instruction miss cycles: I × 2% × 100 = 2I • Data miss cycles: I × 36% × 4% × 100 = 1.44I • Memory-stall cycles: 2I + 1.44I = 3.44I • CPI (with memory stalls): 2 + 3.44 = 5.44 • CPU execution time = 5.44I × 1 ns • A perfect cache would be 5.44/2 = 2.72 times faster
Types of Cache Mappings • Direct mapped • Each block has exactly one place in the cache • (block number) mod (# cache blocks) • Set associative • Each block maps to one set and can be placed in any of the n ways within that set • (block number) mod (# sets in cache) • Fully associative • A block can be placed anywhere in the cache
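A minimal set-associative lookup sketch, assuming a 2-way cache with 4 sets and a naive fill policy instead of LRU; parameters and names are illustrative:

```c
#include <stdio.h>

#define NUM_SETS 4
#define WAYS     2    /* 2-way set associative */

typedef struct { int valid; unsigned tag; } Line;

Line cache[NUM_SETS][WAYS];

/* Returns 1 on hit, 0 on miss (filling an empty way, else evicting way 0). */
int access_block(unsigned block) {
    unsigned set = block % NUM_SETS;   /* (block number) mod (# sets) */
    unsigned tag = block / NUM_SETS;

    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return 1;                  /* hit in one of the ways */

    int victim = 0;                    /* naive choice; real caches use LRU */
    for (int w = 0; w < WAYS; w++)
        if (!cache[set][w].valid) { victim = w; break; }
    cache[set][victim].valid = 1;
    cache[set][victim].tag   = tag;
    return 0;
}

int main(void) {
    /* Same block addresses as the earlier direct-mapped trace */
    unsigned refs[] = {22, 26, 22, 26, 16, 3, 16, 18};
    for (int i = 0; i < 8; i++)
        printf("block %2u -> %s\n", refs[i],
               access_block(refs[i]) ? "hit" : "miss");
    return 0;
}
```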
Virtual Memory: The Concept • Use main memory as a cache for magnetic disk • Motivations • Safe and efficient sharing of main memory among programs • Remove the programmer's burden of handling a small, limited amount of memory • Invented in the 1960s
Virtual Memory: Sharing Memory • Programs must be well behaved • Main concept: each program has its own address space • Virtual memory: an address in the program is translated to a physical address • Protection • Protect one process from another • A set of mechanisms for ensuring this
Virtual Memory: Small Memories • Without virtual memory, the programmer must make a large program fit in a small memory space • The solution was the use of overlays • Even with our relatively large main memories, we would still have to do this today without virtual memory!
Virtual Memory: Terminology • Page – virtual memory's term for a cache block • Page fault – virtual memory's term for a cache miss • Virtual address • An address within the program's address space • Translated to a physical address by a combination of hardware & software • This process is called address translation
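A minimal address-translation sketch, assuming 4 KB pages and a toy one-level page table; the names and the -1 fault convention are illustrative:

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS 12    /* 4 KB pages: 12 offset bits */
#define NUM_PAGES 16    /* toy page table size */

/* Hypothetical page table: virtual page number -> physical page number.
   -1 marks a page that is not resident (a page fault). */
int page_table[NUM_PAGES] = {3, 7, -1, -1, -1, -1, -1, -1,
                             -1, -1, -1, -1, -1, -1, -1, -1};

int64_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;           /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (vpn >= NUM_PAGES || page_table[vpn] < 0)
        return -1;                                  /* page fault */
    /* physical address = physical page number concatenated with offset */
    return ((int64_t)page_table[vpn] << PAGE_BITS) | offset;
}

int main(void) {
    printf("0x%04x -> 0x%llx\n", 0x1ABC, (long long)translate(0x1ABC));
    printf("0x%04x -> %lld (fault)\n", 0x2ABC, (long long)translate(0x2ABC));
    return 0;
}
```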
Virtual Memory: Page Faults • Main memory is approximately 100,000 times faster than disk • A page fault is enormously costly • Key design decisions: • Page size – 4 KB to 16 KB • Techniques that reduce page faults are attractive • Page faults can be handled in software • Only write-back can be used (write-through to disk would be far too slow)
Virtual Memory: Placing & Finding a Page • Each process has its own page table • [Figure: page table mapping virtual page numbers to physical page numbers]
Virtual Memory: Swap Space • [Figure: swap space – the area on disk that backs the pages of virtual memory]