Today's Agenda • Memory Hierarchy • Locality of Reference • Cache Memory Organization • Virtual Memory
Bus Structure for a Computer System [Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to main memory, and over the I/O bus to a USB controller (mouse, keyboard), a graphics adapter (monitor), and a disk controller (disk).]
The CPU-Memory Gap The gap widens between DRAM, disk, and CPU speeds. [Figure: access-time trends over time for disk, SSD, DRAM, and CPU.]
Locality to the Rescue! The key to bridging this CPU-memory gap is a fundamental property of computer programs known as locality.
Locality of Reference • Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently • Temporal locality: Recently referenced items are likely to be referenced again in the near future • Spatial locality: Items with nearby addresses tend to be referenced close together in time
Locality Example

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

• Data references • Reference array elements in succession (stride-1 reference pattern): spatial locality • Reference variable sum each iteration: temporal locality • Instruction references • Reference instructions in sequence: spatial locality • Cycle through the loop repeatedly: temporal locality
Memory Hierarchies • Some fundamental and enduring properties of hardware and software: • Fast storage technologies cost more per byte, have less capacity, and require more power (heat!). • The gap between CPU and main memory speed is widening. • Well-written programs tend to exhibit good locality. • Together, these properties suggest an approach for organizing memory and storage systems known as a memory hierarchy.
An Example Memory Hierarchy (smaller, faster, costlier per byte toward the top; larger, slower, cheaper per byte toward the bottom)
L0: CPU registers, which hold words retrieved from the L1 cache
L1: L1 cache (SRAM), which holds cache lines retrieved from the L2 cache
L2: L2 cache (SRAM), which holds cache lines retrieved from main memory
L3: Main memory (DRAM), which holds disk blocks retrieved from local disks
L4: Local secondary storage (local disks), which holds files retrieved from disks on remote network servers
L5: Remote secondary storage (tapes, distributed file systems, Web servers)
Caches • Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device. • Fundamental idea of a memory hierarchy: • For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1. • Why do memory hierarchies work? • Because of locality, programs tend to access the data at level k more often than they access the data at level k+1. • Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit. • Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
General Cache Concepts [Figure: a small cache holding blocks 8, 9, 14, and 3 sits above a larger memory partitioned into blocks 0-15; blocks 4 and 10 are shown being copied from memory into the cache.] • Smaller, faster, more expensive memory caches a subset of the blocks • Data is copied in block-sized transfer units • Larger, slower, cheaper memory is viewed as partitioned into "blocks"
General Cache Concepts: Hit • Data in block b is needed (Request: 14) • Block b is in the cache: hit! [Figure: the cache, holding blocks 8, 9, 14, and 3, serves block 14 directly, without touching memory.]
General Cache Concepts: Miss • Data in block b is needed (Request: 12) • Block b is not in the cache: miss! • Block b is fetched from memory and stored in the cache • Placement policy: determines where b goes • Replacement policy: determines which block gets evicted (the victim) [Figure: the cache holds blocks 8, 9, 14, and 3; block 12 is fetched from memory and placed in the cache, evicting a resident block.]
General Caching Concepts: Types of Cache Misses • Cold (compulsory) miss • Cold misses occur because the cache is empty. • Conflict miss • Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k. • E.g. Block i at level k+1 must be placed in block (i mod 4) at level k. • Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. • E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time. • Capacity miss • Occurs when the set of active cache blocks (working set) is larger than the cache.
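To make the conflict-miss example concrete, here is a minimal C sketch (not from the original slides; the variable names are illustrative) of a 4-line direct-mapped cache with the (i mod 4) placement rule. Running it on the reference pattern 0, 8, 0, 8, ... shows one cold miss followed by a conflict miss on every subsequent reference, since blocks 0 and 8 both map to line 0:

    #include <stdio.h>

    #define NLINES 4   /* level-k cache holds 4 blocks; block i maps to line i mod 4 */

    int main(void)
    {
        int line[NLINES] = { -1, -1, -1, -1 };   /* -1 marks an empty (cold) line */
        int refs[] = { 0, 8, 0, 8, 0, 8 };       /* reference pattern from the slide */
        int n = sizeof refs / sizeof refs[0];

        for (int i = 0; i < n; i++) {
            int b = refs[i];
            int l = b % NLINES;                  /* direct-mapped placement policy */
            if (line[l] == b) {
                printf("block %d: hit\n", b);
            } else {
                printf("block %d: %s miss\n", b,
                       line[l] == -1 ? "cold" : "conflict");
                line[l] = b;                     /* fetch block, replacing the victim */
            }
        }
        return 0;
    }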
Examples of Caching in the Hierarchy

Cache Type | What is Cached? | Where is it Cached? | Latency (cycles) | Managed By
Registers | 4-8 byte words | CPU core | 0 | Compiler
TLB | Address translations | On-chip TLB | 0 | Hardware
L1 cache | 64-byte blocks | On-chip L1 | 1 | Hardware
L2 cache | 64-byte blocks | On/off-chip L2 | 10 | Hardware
Virtual memory | 4-KB pages | Main memory | 100 | Hardware + OS
Buffer cache | Parts of files | Main memory | 100 | OS
Disk cache | Disk sectors | Disk controller | 100,000 | Disk firmware
Network buffer cache | Parts of files | Local disk | 10,000,000 | AFS/NFS client
Browser cache | Web pages | Local disk | 10,000,000 | Web browser
Web cache | Web pages | Remote server disks | 1,000,000,000 | Web proxy server
Putting it all together • The speed gap between CPU, memory, and mass storage continues to widen. • Well-written programs exhibit a property called locality. • Memory hierarchies based on caching close the gap by exploiting locality.
Cache/Main Memory Structure • The CPU requests data from a memory location • Check the cache for this data • If present, deliver it from the cache (fast) • If not present, read the required block from main memory into the cache, then deliver it from the cache to the CPU • The cache includes tags to identify which block of main memory is in each cache slot (a minimal sketch of this read path follows)
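The read path above can be sketched in C as follows. This is an illustrative direct-mapped variant, not the slides' definitive design; the names cache_read and main_memory, and the 16-slot / 64-byte geometry, are assumptions for the sketch:

    #include <stdint.h>
    #include <string.h>

    #define NSLOTS     16
    #define BLOCK_SIZE 64

    /* One cache slot: a tag identifying which memory block it holds, plus data. */
    struct slot {
        int      valid;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    };

    static struct slot cache[NSLOTS];
    extern uint8_t main_memory[];    /* backing store (assumed defined elsewhere) */

    /* Read one byte, going through the cache. */
    uint8_t cache_read(uint32_t addr)
    {
        uint32_t block  = addr / BLOCK_SIZE;   /* which memory block            */
        uint32_t offset = addr % BLOCK_SIZE;   /* byte within the block         */
        uint32_t slot   = block % NSLOTS;      /* direct-mapped slot            */
        uint32_t tag    = block / NSLOTS;      /* identifies block in this slot */

        if (!cache[slot].valid || cache[slot].tag != tag) {
            /* Miss: read the required block from main memory into the cache. */
            memcpy(cache[slot].data, &main_memory[block * BLOCK_SIZE], BLOCK_SIZE);
            cache[slot].tag   = tag;
            cache[slot].valid = 1;
        }
        /* Deliver from the cache (fast path on a hit). */
        return cache[slot].data[offset];
    }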
Cache Design Issues • Size • Mapping Function • Replacement Algorithm • Write Policy • Block Size • Number of Caches

Size does matter • Cost: more cache is expensive • Speed: more cache is faster (up to a point), but checking the cache for data takes time
Mapping Function • Example cache: 64 KB, with cache blocks of 4 bytes, i.e. the cache has 16K (2^14) lines of 4 bytes each • Main memory: 16 MB, so a 24-bit address (2^24 = 16M) • Direct Mapping • Each block of main memory maps to only one cache line, i.e. if a block is in the cache, it must be in one specific place • The address is in two parts: the least significant w bits identify a unique word, and the most significant s bits specify one memory block • The s MSBs are further split into a cache line field of r bits and a tag of s-r bits (the most significant)
Direct Mapping: Address Structure

| Tag: s-r = 8 bits | Line or slot: r = 14 bits | Word: w = 2 bits |

• 24-bit address • 2-bit word identifier (4-byte block) • 22-bit block identifier • 8-bit tag (= 22 - 14) • 14-bit slot or line • No two blocks that map to the same line have the same tag field • Check the contents of the cache by finding the line and checking the tag
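As a concrete illustration of this address split, the short C program below extracts the tag, line, and word fields for the 64 KB / 4-byte-block example above (a sketch only; the sample address is arbitrary):

    #include <stdio.h>
    #include <stdint.h>

    /* Direct-mapped example: 64 KB cache, 4-byte blocks, 24-bit addresses. */
    #define WORD_BITS 2    /* w: 2^2 = 4 bytes per block */
    #define LINE_BITS 14   /* r: 2^14 = 16K cache lines  */

    int main(void)
    {
        uint32_t addr = 0x16339C;                       /* arbitrary 24-bit address */
        uint32_t word = addr & ((1u << WORD_BITS) - 1);
        uint32_t line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1);
        uint32_t tag  = addr >> (WORD_BITS + LINE_BITS);  /* top s-r = 8 bits */

        printf("addr=0x%06X -> tag=0x%02X line=0x%04X word=%u\n",
               addr, tag, line, word);
        return 0;
    }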
Direct Mapped Cache • Simple and inexpensive • Fixed location for a given block • If a program repeatedly accesses 2 blocks that map to the same line, cache misses are very high • Address length = (s + w) bits • Number of addressable units = 2^(s+w) words or bytes • Block size = line size = 2^w words or bytes • Number of blocks in main memory = 2^(s+w) / 2^w = 2^s • Number of lines in cache = m = 2^r • Size of tag = (s - r) bits
Fully Associative Mapping • A main memory block can be loaded into any line of the cache • The memory address is interpreted as a tag and a word • The tag uniquely identifies the block of memory • Every line's tag is examined for a match • Cache searching gets expensive
Fully Associative Mapping: Address Structure

| Tag: 22 bits | Word: 2 bits |

• The 22-bit tag is stored with each 32-bit block of data • Compare the address's tag field with every tag entry in the cache to check for a hit • The least significant 2 bits of the address identify which byte is required from the 32-bit data block
Set Associative Mapping • The cache is divided into a number of sets • Each set contains a number of lines • A given block maps to any line in one particular set • e.g. block B can be in any line of set i • e.g. with 2 lines per set: 2-way set associative mapping, where a given block can be in one of 2 lines in exactly one set
Two-Way Set Associative Mapping: Address Structure

| Tag: 9 bits | Set: 13 bits | Word: 2 bits |

• Use the set field to determine which cache set to look in • Compare the tag field against each line in that set to see if we have a hit
Set Associative Mapping Summary • Address length = (s + w) bits • Number of addressable units = 2^(s+w) words or bytes • Block size = line size = 2^w words or bytes • Number of blocks in main memory = 2^s • Number of lines per set = k • Number of sets = v = 2^d • Number of lines in cache = kv = k × 2^d • Size of tag = (s - d) bits
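A minimal C sketch of a set-associative lookup, using the 2-way layout from the address-structure slide (9-bit tag, 13-bit set, 2-bit word); the function name lookup and the bookkeeping are illustrative assumptions, not the slides' design:

    #include <stdint.h>

    #define WORD_BITS 2    /* w: 4-byte blocks     */
    #define SET_BITS  13   /* d: 2^13 = 8K sets    */
    #define K         2    /* lines per set, 2-way */

    struct line {
        int      valid;
        uint32_t tag;      /* s - d = 9 bits here  */
    };

    static struct line sets[1 << SET_BITS][K];

    /* Return 1 on hit, 0 on miss, for the 2-way layout above. */
    int lookup(uint32_t addr)
    {
        uint32_t set = (addr >> WORD_BITS) & ((1u << SET_BITS) - 1);
        uint32_t tag = addr >> (WORD_BITS + SET_BITS);

        for (int i = 0; i < K; i++)          /* check both lines in the set */
            if (sets[set][i].valid && sets[set][i].tag == tag)
                return 1;                    /* hit  */
        return 0;                            /* miss */
    }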
Replacement Algorithms • There is no choice in direct mapping, because each block maps to only one line, so replacement algorithms apply only to the associative mapping functions • Implemented in hardware, for speed • Least recently used (LRU) • e.g. in a 2-way set associative cache: which of the 2 blocks is LRU? (see the sketch below) • First in first out (FIFO): replace the block that has been in the cache longest • Least frequently used (LFU): replace the block which has had the fewest hits • Random
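For the 2-way case, LRU is especially cheap to implement: one bit per set suffices. The C sketch below (illustrative, not from the slides; names are assumptions) records, for each set, which line is the next victim:

    /* For a 2-way set, LRU needs only one bit per set: it names the line
     * that was NOT used most recently, i.e. the next eviction victim. */
    static int lru_victim[1 << 13];   /* one bit per set; 2^13 sets as above */

    /* Call on every access that hits (or fills) line `used` of set `set`. */
    void touch(int set, int used)
    {
        lru_victim[set] = 1 - used;   /* the other line is now the LRU one */
    }

    /* On a miss, evict the least recently used line of the set. */
    int choose_victim(int set)
    {
        return lru_victim[set];
    }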
What about writes? • Multiple copies of data exist: • L1, L2, Main Memory, Disk • What to do on a write-hit? • Write-through (write immediately to memory) • Write-back (defer write to memory until replacement of line) • Need a dirty bit (line different from memory or not) • What to do on a write-miss? • Write-allocate (load into cache, update line in cache) • Good if more writes to the location follow • No-write-allocate (writes immediately to memory) • Typical • Write-through + No-write-allocate • Write-back + Write-allocate
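The typical write-back + write-allocate pairing can be sketched as follows. This extends the direct-mapped read-path sketch from earlier with a dirty bit; all names and the geometry are illustrative assumptions:

    #include <stdint.h>
    #include <string.h>

    #define NSLOTS     16
    #define BLOCK_SIZE 64

    struct slot {
        int      valid, dirty;        /* dirty: line differs from memory */
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    };

    static struct slot cache[NSLOTS];
    extern uint8_t main_memory[];     /* backing store (assumed defined elsewhere) */

    /* Write-back + write-allocate: the common pairing from the slide. */
    void cache_write(uint32_t addr, uint8_t value)
    {
        uint32_t block  = addr / BLOCK_SIZE;
        uint32_t offset = addr % BLOCK_SIZE;
        struct slot *s  = &cache[block % NSLOTS];
        uint32_t tag    = block / NSLOTS;

        if (!s->valid || s->tag != tag) {          /* write miss */
            if (s->valid && s->dirty)              /* flush the dirty victim first */
                memcpy(&main_memory[(s->tag * NSLOTS + block % NSLOTS) * BLOCK_SIZE],
                       s->data, BLOCK_SIZE);
            /* Write-allocate: load the missed block into the cache. */
            memcpy(s->data, &main_memory[block * BLOCK_SIZE], BLOCK_SIZE);
            s->tag   = tag;
            s->valid = 1;
            s->dirty = 0;
        }
        s->data[offset] = value;                   /* write-hit path        */
        s->dirty = 1;                              /* defer the memory update */
    }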
Intel Core i7 Cache Hierarchy [Figure: the processor package contains Cores 0-3; each core has its own registers, L1 i-cache and d-cache, and unified L2 cache; an L3 unified cache shared by all cores sits between the cores and main memory.] • L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles • L2 unified cache: 256 KB, 8-way, access: 11 cycles • L3 unified cache: 8 MB, 16-way, access: 30-40 cycles • Block size: 64 bytes for all caches
Cache Performance Metrics • Miss rate • Fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate • Typical numbers (in percentages): 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc. • Hit time • Time to deliver a line in the cache to the processor, including the time to determine whether the line is in the cache • Typical numbers: 1-2 clock cycles for L1; 5-20 clock cycles for L2 • Miss penalty • Additional time required because of a miss; typically 50-200 cycles for main memory (trend: increasing!)
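A standard way to combine these metrics (not on the original slide, but widely used in the literature) is the average memory access time: AMAT = hit time + miss rate × miss penalty. For example, with a 1-cycle L1 hit time, a 5% miss rate, and a 100-cycle miss penalty, AMAT = 1 + 0.05 × 100 = 6 cycles, which shows why even small reductions in miss rate pay off.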
Writing Cache Friendly Code • Make the common case go fast • Focus on the inner loops of the core functions • Minimize the misses in the inner loops • Repeated references to variables are good (temporal locality) • Stride-1 reference patterns are good (spatial locality) Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.
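To make these rules concrete, here is a standard illustration (a sketch, assuming C's row-major array layout; the function names are ours). Both functions compute the same sum, but sum_rows scans with stride 1 while sum_cols strides across whole rows, so sum_rows exhibits far better spatial locality on typical caches:

    #define M 1024
    #define N 1024

    /* Good: inner loop scans a row, stride-1 in row-major layout. */
    int sum_rows(int a[M][N])
    {
        int sum = 0;
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Bad: inner loop scans a column, stride-N; once the array is larger
     * than the cache, roughly every access misses. */
    int sum_cols(int a[M][N])
    {
        int sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }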
Virtual Memory • Demand paging • Do not require all pages of a process in memory • Bring in pages as required • Page fault • Required page is not in memory • Operating System must swap in required page • May need to swap out a page to make space • Select page to throw out based on recent history
Thrashing • Too many processes in too little memory • The Operating System spends all its time swapping • Little or no real work is done • The disk light is on all the time • Solutions • Better page replacement algorithms • Reduce the number of processes running • Add more memory
Advantages • We do not need all of a process in memory for it to run • We can swap pages in as required • So we can now run processes that are bigger than the total memory available! • Main memory is called real memory • The user/programmer sees a much bigger memory: virtual memory
Translation Lookaside Buffer (TLB) • Page tables vary in size: larger processes require larger page tables • It is not possible to store them in registers, so they must be stored in main memory • Every virtual memory reference then causes two physical memory accesses: • Fetch the page table entry • Fetch the data • This slows down the system • To overcome this problem, a special cache memory is used to store page table entries: the TLB (a minimal sketch follows)
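As a closing sketch (illustrative; the direct-mapped TLB organization, its size, and the names translate and page_table_walk are assumptions, not from the slides), here is a TLB lookup that avoids the extra page-table access on a hit and falls back to a page-table walk on a miss:

    #include <stdint.h>

    #define PAGE_BITS 12                 /* 4-KB pages, as on the VM slide */
    #define TLB_SLOTS 64                 /* illustrative TLB size          */

    struct tlb_entry {
        int      valid;
        uint32_t vpn;                    /* virtual page number   */
        uint32_t pfn;                    /* physical frame number */
    };

    static struct tlb_entry tlb[TLB_SLOTS];

    extern uint32_t page_table_walk(uint32_t vpn);  /* assumed: reads the page
                                                       table in main memory  */

    /* Translate a virtual address, consulting the TLB first. */
    uint32_t translate(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_BITS;
        uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
        struct tlb_entry *e = &tlb[vpn % TLB_SLOTS];

        if (!e->valid || e->vpn != vpn) {    /* TLB miss: extra memory access */
            e->pfn   = page_table_walk(vpn); /* fetch the page table entry    */
            e->vpn   = vpn;
            e->valid = 1;
        }
        return (e->pfn << PAGE_BITS) | offset;  /* TLB hit: translate directly */
    }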