Virtual Memory
Topics
• Virtual Memory Access
• Page Table, TLB
• Programming for locality
• Memory Mountain Revisited
Memory Hierarchy
Smaller, faster, costlier per byte (top) to larger, slower, cheaper per byte (bottom):
• registers
• on-chip L1 cache (SRAM)
• on-chip L2 cache (SRAM)
• main memory (DRAM)
• local secondary storage (local disks)
• remote secondary storage (tapes, distributed file systems, Web servers)
Why Caches Work
Temporal locality:
• Recently referenced items are likely to be referenced again in the near future
Spatial locality:
• Items with nearby addresses tend to be referenced close together in time
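A minimal sketch of both kinds of locality (the function and array sizes are illustrative): the running total is reused on every iteration (temporal), and the row-major traversal touches adjacent bytes in order (spatial).

#define M 64
#define N 64

/* Row-major traversal: a[i][0..N-1] are adjacent in memory in C,
   so the inner loop walks through each cache block before moving on. */
int sum_array(int a[M][N]) {
    int sum = 0;                      /* reused every iteration: temporal locality */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];           /* stride-1 access: spatial locality */
    return sum;
}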
Cache (L1 and L2) Performance Metrics
• Miss Rate
  • Fraction of memory references not found in cache (misses / accesses) = 1 – hit rate
  • Typical numbers (in percentages):
    • 3-10% for L1
    • can be quite small (e.g., < 1%) for L2, depending on size, etc.
• Hit Time
  • Time to deliver a block in the cache to the processor
  • includes time to determine whether the line is in the cache
  • Typical numbers:
    • 1-3 clock cycles for L1
    • 5-20 clock cycles for L2
• Miss Penalty
  • Additional time required because of a miss
  • typically 50-400 cycles for main memory
Let's think about those numbers
Huge difference between a hit and a miss
• Could be 100x, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
• Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
• Average access time:
  • 0.97 * 1 cycle + 0.03 * 100 cycles = 3.97 cycles
  • 0.99 * 1 cycle + 0.01 * 100 cycles = 1.99 cycles
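A small sketch of that arithmetic (the function name avg_access_time is illustrative; it uses the slide's simplified model in which a miss costs the full penalty):

#include <stdio.h>

/* Simplified model from the slide: a hit costs hit_time,
   a miss costs the full miss_penalty. */
double avg_access_time(double hit_rate, double hit_time, double miss_penalty) {
    return hit_rate * hit_time + (1.0 - hit_rate) * miss_penalty;
}

int main(void) {
    printf("97%% hits: %.2f cycles\n", avg_access_time(0.97, 1, 100)); /* 3.97 */
    printf("99%% hits: %.2f cycles\n", avg_access_time(0.99, 1, 100)); /* 1.99 */
    return 0;
}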
Types of Cache Misses
Cold (compulsory) miss
• Occurs on first access to a block
• Spatial locality of access helps (also prefetching---more later)
Conflict miss
• Multiple data objects all map to the same slot (like in hashing)
• e.g., block i must be placed in cache entry/slot i mod 8
• replacing the block already in that slot
• referencing blocks 0, 8, 0, 8, ... would miss every time
• Conflict misses are less of a problem these days
• Set-associative caches (e.g., 8 or 16 lines per set) help
Capacity miss
• Occurs when the set of active cache blocks (working set) is larger than the cache
• This is where to focus nowadays
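A minimal sketch of the direct-mapped placement rule above (8 slots is the slide's example; the function name is illustrative):

#define NUM_SLOTS 8

/* Direct-mapped placement: block i can only live in slot i mod 8. */
int slot_for_block(int block) {
    return block % NUM_SLOTS;
}
/* slot_for_block(0) == 0 and slot_for_block(8) == 0: alternating
   references to blocks 0 and 8 evict each other and miss every time. */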
What about writes?
Multiple copies of data exist:
• L1, L2, main memory, disk
What to do on a write-hit?
• Write-back (defer write to memory until replacement of line)
• Need a dirty bit (line different from memory or not)
What to do on a write-miss?
• Write-allocate (load into cache, update line in cache)
Typical
• Write-back + Write-allocate (sketched below)
Rare
• Write-through (write immediately to memory, usually for I/O)
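A hedged sketch of the typical write-back + write-allocate policy, reduced to a single direct-mapped line; the cache_line struct and the flush/fetch helpers are hypothetical, not from any particular CPU:

typedef struct {
    int valid;
    int dirty;                /* line differs from memory? */
    unsigned long tag;
    char data[64];
} cache_line;

void flush_to_memory(cache_line *line);                       /* hypothetical helper */
void fetch_from_memory(cache_line *line, unsigned long tag);  /* hypothetical helper */

void cache_write(cache_line *line, unsigned long tag, int offset, char byte) {
    if (!line->valid || line->tag != tag) {      /* write miss */
        if (line->valid && line->dirty)
            flush_to_memory(line);               /* write back old contents */
        fetch_from_memory(line, tag);            /* write-allocate: load the line */
    }
    line->data[offset] = byte;                   /* update line in cache only */
    line->dirty = 1;                             /* defer write to memory */
}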
Main Memory is something like a Cache (for Disk)
Driven by enormous miss penalty:
• Disk is about 10,000x slower than DRAM
DRAM design:
• Large page (block) size: typically 4KB
Virtual Memory
Programs refer to virtual memory addresses
• Conceptually a very large array of bytes (4GB for IA32, 16 exabytes for 64 bits)
• Each byte has its own address
• System provides an address space private to each process
Allocation: compiler and run-time system
• All allocation within a single virtual address space
Virtual Addressing
• MMU = Memory Management Unit
• MMU keeps the mapping of VAs -> PAs in a "page table"
[Figure: the CPU issues a virtual address (VA); the MMU on the CPU chip translates it to a physical address (PA), which is used to access main memory; the data word is returned to the CPU]
MMU Needs a Table of Translations
• MMU keeps the mapping of VAs -> PAs in a "page table"
[Figure: the same CPU / MMU / main-memory picture, now showing the page table used by the MMU]
Where is the page table kept?
• In main memory – can be cached, e.g., in L2 (like data)
[Figure: the same picture, with the page table stored in main memory]
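A minimal sketch of what the MMU's lookup computes, assuming 4KB pages and a hypothetical flat page_table array indexed by virtual page number (real page tables are multi-level):

#define PAGE_SHIFT 12                      /* 4KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

unsigned long page_table[1 << 20];         /* hypothetical flat table: VPN -> PPN */

/* Translate a virtual address: look up the virtual page number,
   then reattach the unchanged byte offset within the page. */
unsigned long translate(unsigned long va) {
    unsigned long vpn    = va >> PAGE_SHIFT;        /* virtual page number */
    unsigned long offset = va & (PAGE_SIZE - 1);    /* byte within the page */
    unsigned long ppn    = page_table[vpn];         /* one memory access */
    return (ppn << PAGE_SHIFT) | offset;
}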
Speeding up Translation with a TLB
Translation Lookaside Buffer (TLB)
• Small hardware cache for the page table in the MMU
• Caches page table entries for a number of pages (e.g., 256 entries)
TLB Hit
[Figure: (1) CPU sends the VA to the MMU; (2) MMU looks up the VA in the TLB; (3) TLB returns the PTE; (4) MMU sends the PA to main memory; (5) memory returns the data word]
A TLB hit saves you from accessing memory for the page table
TLB Miss
[Figure: (1) CPU sends the VA to the MMU; (2) the TLB lookup misses; (3) MMU sends a PTE request to the page table in main memory; (4) memory returns the PTE, which is installed in the TLB; (5) MMU sends the PA to memory; (6) memory returns the data word]
A TLB miss incurs an additional memory access (the PT)
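A hedged sketch of the hit/miss logic above, assuming a hypothetical direct-mapped 256-entry TLB in front of the flat page table from the earlier sketch (real TLBs are set-associative):

#define TLB_ENTRIES 256
#define PAGE_SHIFT  12

typedef struct { int valid; unsigned long vpn, ppn; } tlb_entry;
tlb_entry tlb[TLB_ENTRIES];

extern unsigned long page_table[];           /* page table lives in main memory */

unsigned long translate_with_tlb(unsigned long va) {
    unsigned long vpn = va >> PAGE_SHIFT;
    tlb_entry *e = &tlb[vpn % TLB_ENTRIES];  /* index the TLB */
    if (!e->valid || e->vpn != vpn) {        /* TLB miss */
        e->vpn = vpn;
        e->ppn = page_table[vpn];            /* the extra memory access */
        e->valid = 1;
    }
    /* TLB hit (or freshly refilled entry): no page-table access needed */
    return (e->ppn << PAGE_SHIFT) | (va & ((1UL << PAGE_SHIFT) - 1));
}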
How to Program for Virtual Memory
At any point in time, programs tend to access a set of active virtual pages called the working set
• Programs with better temporal locality will have smaller working sets
If ((working set size) > main mem size)
• Thrashing: performance meltdown where pages are swapped (copied) in and out continuously
If ((# working set pages) > # TLB entries)
• Will suffer TLB misses
• Not as bad as page thrashing, but still worth avoiding
More on TLBs
• Assume a 256-entry TLB, and each page is 4KB
• Can only have TLB hits for 1MB of data (256 * 4KB = 1MB)
• This is called the "TLB reach"---the amount of memory the TLB can cover
• Typical L2 cache is 6MB
• Hence should consider TLB size before L2 size when tiling?
• Real CPUs have second-level TLBs (like an L2 for the TLB)
• This is getting complicated to reason about!
• Likely have to experiment to find the best tile size (see the sketch below)
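A hedged illustration of tiling with the TLB in mind: a classic blocked matrix multiply. The matrix size N and tile size B are placeholders; in practice you would tune B experimentally, as the slide suggests.

#define N 1024
#define B 32    /* tile size: tune experimentally against TLB reach and cache size */

/* Blocked matrix multiply (c must start zeroed): each B x B tile of b is
   reused for many rows of a while its pages are still hot in the TLB and
   cache, shrinking the active set of pages touched at any one time. */
void matmul_tiled(double a[N][N], double b[N][N], double c[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double sum = c[i][j];
                    for (int k = kk; k < kk + B; k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] = sum;
                }
}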
Memory Optimization: Summary
Caches
• Conflict misses:
  • Not much of a concern (set-associative caches)
• Cache capacity:
  • Keep working set within on-chip cache capacity
  • Fit in L1 or L2 depending on working-set size
Virtual memory
• Page misses:
  • Keep page-level working set within main memory capacity
• TLB misses: may want to keep working set #pages < TLB #entries
IA32 Linux Memory Layout
Stack
• Runtime stack (8MB limit)
Data
• Statically allocated data
• E.g., arrays & strings declared in code
Heap
• Dynamically allocated storage
• When you call malloc(), calloc(), or new (C++)
Text
• Executable machine instructions
• Read-only
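A small sketch you can run to see where each segment lives (the variable names are illustrative, and the exact addresses will vary from run to run):

#include <stdio.h>
#include <stdlib.h>

int global_data = 1;                   /* data segment: statically allocated */

int main(void) {                       /* code for main lives in the text segment */
    int local = 2;                     /* stack: automatic storage */
    int *heap = malloc(sizeof *heap);  /* heap: dynamically allocated */

    printf("text:  %p\n", (void *)main);
    printf("data:  %p\n", (void *)&global_data);
    printf("heap:  %p\n", (void *)heap);
    printf("stack: %p\n", (void *)&local);

    free(heap);
    return 0;
}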