Computer Systems Principles: Architecture • Emery Berger and Mark Corner • University of Massachusetts Amherst
The Memory Hierarchy • Registers • Caches • Associativity • Misses • “Locality” • [diagram: registers → L1 → L2 → RAM]
Registers • Register = dedicated name for one word of memory managed by the CPU • General-purpose: “AX”, “BX”, “CX” on x86 • Special-purpose: • “SP” = stack pointer • “FP” = frame pointer • “PC” = program counter • Changing processes = save current registers & load saved registers = context switch • [figure: stack frame, with FP and SP bracketing the arguments]
Caches • Access to main memory: “expensive” • ~ 100 cycles (slow, but relatively cheap ($)) • Caches: small, fast, expensive memory • Hold recently-accessed data (D$) or instructions (I$) • Different sizes & locations • Level 1 (L1) – on-chip, smallish • Level 2 (L2) – on or next to chip, larger • Level 3 (L3) – pretty large, on bus • Manages lines of memory (32-128 bytes)
Memory Hierarchy • Higher = small, fast, more $, lower latency • Lower = large, slow, less $, higher latency • Registers: 1-cycle latency • L1 (separate D$ and I$): 2-cycle latency • L2 (unified D$ and I$): 7-cycle latency • RAM: 100-cycle latency • Disk: 40,000,000-cycle latency • Network: 200,000,000+ cycle latency
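These numbers compound: every L1 miss pays the L2 latency, every L2 miss pays the RAM latency, and so on. A minimal sketch of the standard average-memory-access-time calculation, using the cycle counts above (the miss rates are made up purely for illustration):

```cpp
#include <iostream>

// Average Memory Access Time (AMAT): each level's latency is weighted
// by how often the levels above it miss.
//   AMAT = L1 + missL1 * (L2 + missL2 * RAM)
int main() {
    const double l1 = 2, l2 = 7, ram = 100;    // cycle latencies (from the slide)
    const double missL1 = 0.05, missL2 = 0.20; // illustrative miss rates

    double amat = l1 + missL1 * (l2 + missL2 * ram);
    std::cout << "AMAT = " << amat << " cycles\n"; // 2 + 0.05 * (7 + 20) = 3.35
    return 0;
}
```

Even a 5% L1 miss rate pushes the average access from 2 cycles to 3.35, which is why the locality discussion below matters.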
Orders of Magnitude (cycles) • 10^0: registers, L1 • 10^1: L2 • 10^2: RAM • 10^3-10^6: (nothing; the gap between RAM and disk) • 10^7: Disk • 10^8-10^9: Network
Cache Jargon • Cache starts out cold • Initial accesses miss • Fetch from the lower level in the hierarchy • Bring the line into the cache (populate the cache) • Next access: hit • Warmed up = cache holds the most-frequently-used data • Context switch implications? • LRU: Least Recently Used • Uses the past as a predictor of the future
Cache Details • An ideal cache would be fully associative • That is, an LRU (least-recently-used) queue • Generally too expensive • Instead, partition memory addresses into separate bins, each divided into ways • 1-way = direct-mapped • 2-way = 2 entries per bin • 4-way = 4 entries per bin, etc.
Associativity Example • Hash memory addresses to different indices in the cache, as in the sketch below
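A minimal sketch of that hashing step in C++ (the 64-byte line size and 4-set layout here are assumptions for illustration, not values from the slides):

```cpp
#include <cstdint>
#include <iostream>

// Map an address to a cache set: drop the offset bits within a line,
// then take the line number modulo the number of sets (bins).
constexpr uint64_t LINE_SIZE = 64; // bytes per line (assumed)
constexpr uint64_t NUM_SETS  = 4;  // bins; each holds 'ways' entries

uint64_t set_index(uint64_t addr) {
    return (addr / LINE_SIZE) % NUM_SETS;
}

int main() {
    // 0x0 and 0x100 both map to set 0; in a direct-mapped (1-way)
    // cache they evict each other, in a 2-way cache they coexist.
    for (uint64_t a : {0x0ULL, 0x40ULL, 0x100ULL})
        std::cout << std::hex << a << " -> set " << set_index(a) << "\n";
    return 0;
}
```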
Miss Classification • First access = compulsory miss • Unavoidable without prefetching • Too many items mapping to one bin = conflict miss • Avoidable with higher associativity • No space left in cache = capacity miss • Avoidable with a larger cache • Invalidated = coherence miss • Avoidable if the cache were unshared
Quick Activity • Cache with 8 slots, 2-way associativity • Assume hash(x) = x % 4 (modulo) • For the trace on the slide, how many misses? • # compulsory misses? 10 • # conflict misses? 2 • # capacity misses? 0 • (Try it with the simulator sketch below)
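A minimal sketch of a 2-way set-associative cache with LRU replacement inside each set, matching the activity's parameters (the trace below is a placeholder, since the slide's trace was in a figure):

```cpp
#include <deque>
#include <iostream>
#include <vector>

// 2-way set-associative cache: 4 sets x 2 ways = 8 slots,
// hash(x) = x % 4, LRU replacement within each set.
int main() {
    constexpr int NUM_SETS = 4, WAYS = 2;
    std::vector<std::deque<int>> sets(NUM_SETS); // front = most recently used

    std::vector<int> trace = {1, 2, 3, 4, 5, 1, 2, 3, 4, 5}; // placeholder

    int hits = 0, misses = 0;
    for (int x : trace) {
        auto& set = sets[x % NUM_SETS];
        bool hit = false;
        for (auto it = set.begin(); it != set.end(); ++it) {
            if (*it == x) {                  // hit: promote to MRU
                set.erase(it);
                hit = true;
                break;
            }
        }
        if (hit) ++hits;
        else {
            ++misses;
            if ((int)set.size() == WAYS) set.pop_back(); // evict LRU way
        }
        set.push_front(x);
    }
    std::cout << hits << " hits, " << misses << " misses\n";
    return 0;
}
```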
Locality • Locality = re-use of recently-used items • Temporal locality: re-use in time • Spatial locality: use of nearby items • In the same cache line, or the same page (4K chunk) • Intuitively, greater locality = fewer misses • # misses depends on cache layout, # of levels, associativity… • Machine-specific; the loop-order sketch below shows the effect
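A classic way to see spatial locality in action (a sketch in C++; the matrix size is arbitrary): summing a row-major matrix row-by-row walks consecutive addresses and uses each cache line fully, while column-by-column jumps a whole row between loads and misses far more often.

```cpp
#include <chrono>
#include <iostream>
#include <vector>

// Sum a row-major N x N matrix in two loop orders: stride-1 (good
// spatial locality) vs stride-N (poor spatial locality).
int main() {
    const int N = 4096;
    std::vector<int> m(N * N, 1);
    using clk = std::chrono::steady_clock;

    long long sum = 0;
    auto t0 = clk::now();
    for (int i = 0; i < N; ++i)              // row order: stride 1
        for (int j = 0; j < N; ++j) sum += m[i * N + j];
    auto t1 = clk::now();
    for (int j = 0; j < N; ++j)              // column order: stride N
        for (int i = 0; i < N; ++i) sum += m[i * N + j];
    auto t2 = clk::now();

    std::cout << "sum=" << sum
              << "  row order: " << std::chrono::duration<double>(t1 - t0).count() << "s"
              << "  column order: " << std::chrono::duration<double>(t2 - t1).count() << "s\n";
    return 0;
}
```

On most machines the column-order pass is several times slower, even though it performs exactly the same arithmetic.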
Quantifying Locality • Instead of counting misses, compute the hit curve from the LRU histogram • Assume a perfect LRU cache • Ignore compulsory misses • [figure: LRU reuse-distance histogram built up bar by bar over distances 1-6]
Quantifying Locality • Start with the total misses on the right-hand side • Subtract histogram values moving left • Normalize • [figure: running totals 1 1 3 3 3 3 over cache sizes 1-6, normalized to .3 .3 1 1 1 1] • A code sketch of this computation follows below
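A minimal sketch of the whole pipeline in C++ (the trace is a placeholder): record each access's LRU reuse distance in a histogram; the hit rate at cache size c is then the cumulative fraction of reuses with distance at most c, ignoring compulsory (first-touch) misses as the slide says.

```cpp
#include <algorithm>
#include <deque>
#include <iostream>
#include <vector>

// Build an LRU reuse-distance histogram for a trace, then emit the
// hit curve: hit_rate(c) = fraction of reuses with distance <= c.
int main() {
    std::vector<int> trace = {1, 2, 2, 2, 3, 3, 4, 5, 6}; // placeholder
    std::deque<int> lru;                 // front = most recently used
    std::vector<int> hist;               // hist[d-1] = # reuses at distance d

    int reuses = 0;
    for (int x : trace) {
        auto it = std::find(lru.begin(), lru.end(), x);
        if (it != lru.end()) {           // reuse: distance = 1-based position
            int d = (int)(it - lru.begin()) + 1;
            if ((int)hist.size() < d) hist.resize(d, 0);
            ++hist[d - 1];
            ++reuses;
            lru.erase(it);
        }                                // first touch = compulsory, ignored
        lru.push_front(x);
    }

    double cum = 0;                      // cumulative sum -> hit curve
    for (size_t c = 0; c < hist.size(); ++c) {
        cum += hist[c];
        std::cout << "cache size " << c + 1
                  << ": hit rate " << cum / reuses << "\n";
    }
    return 0;
}
```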
Hit Curve Exercise • Derive the hit curve for the following trace (positions 1-9): 1 2 2 2 3 3 4 5 6
What can we do with this? • What would the hit rate be with a cache size of 4 or 9?
Simple cache simulator • Only argument is N, the length of the LRU queue • Read addresses (ints) from cin • Output hits & misses to cout • Use a deque<int> (std::queue lacks the operations below) • push_front(v) = put v on the front of the queue • pop_back() = remove the back element • erase(i) = erase the element at iterator i • size() = number of elements • for (deque<int>::iterator i = q.begin(); i != q.end(); ++i) cout << *i << endl; • A full sketch follows below
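Putting the spec together, a minimal sketch of the simulator (one reasonable reading of the assignment, not the official solution):

```cpp
#include <algorithm>
#include <cstdlib>
#include <deque>
#include <iostream>

// LRU cache simulator: argv[1] is N, the LRU queue length (cache size).
// Reads integer addresses from cin; prints hits and misses to cout.
int main(int argc, char* argv[]) {
    if (argc != 2 || std::atoi(argv[1]) <= 0) {
        std::cerr << "usage: " << argv[0] << " N\n";
        return 1;
    }
    const int n = std::atoi(argv[1]);
    std::deque<int> q;                   // front = most recently used

    int hits = 0, misses = 0, addr;
    while (std::cin >> addr) {
        auto it = std::find(q.begin(), q.end(), addr);
        if (it != q.end()) {             // hit: promote to front
            ++hits;
            q.erase(it);
        } else {                         // miss: evict LRU entry if full
            ++misses;
            if ((int)q.size() == n) q.pop_back();
        }
        q.push_front(addr);
    }
    std::cout << "hits: " << hits << ", misses: " << misses << "\n";
    return 0;
}
```

Running it over the exercise trace at several values of N traces out the hit curve point by point.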
Important CPU Internals • Other issues that affect performance • Pipelining • Branches & prediction • System calls (kernel crossings)
Scalar architecture • Straight-up sequential execution • Fetch an instruction • Decode it • Execute it • Problem: an I-cache or D-cache miss • Result: a stall, where everything stops • How long do we wait for a miss? • A long time, compared to CPU speed; the sketch below makes the stall visible
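One way to make that stall visible (a sketch; the array size is arbitrary, chosen to be much larger than any cache): chase pointers through a random cycle so that every load depends on the previous one and almost every load misses; the CPU can do nothing but wait.

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Pointer chasing: each load's address comes from the previous load,
// so a cache miss stalls the whole chain and exposes memory latency.
int main() {
    const size_t N = 1 << 24;            // ~16M entries, far larger than cache
    std::vector<size_t> next(N);
    std::iota(next.begin(), next.end(), 0);

    // Sattolo's algorithm: turn the identity into one big random cycle,
    // so the chase visits every slot before repeating.
    std::mt19937_64 rng{42};
    for (size_t k = N - 1; k > 0; --k) {
        std::uniform_int_distribution<size_t> d(0, k - 1);
        std::swap(next[k], next[d(rng)]);
    }

    size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t step = 0; step < N; ++step) i = next[i]; // dependent loads
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
    std::cout << "~" << ns << " ns per dependent load (i = " << i << ")\n";
    return 0;
}
```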