Memory Design Principles • Principle of locality dominates design • Smaller = faster • Hierarchy goal: total memory system almost as cheap as the cheapest component, almost as fast as the fastest component
Memory Hierarchy Design • Chapter 5 covers the memory hierarchy, mainly the caches and main memory • Later in the course, in chapter 7, we discuss the lowest level – the I/O system • Several performance examples will be studied, along with a continuing example: the HP AlphaServer ES40 using the Alpha 21264 microprocessor
Remarks • Most of this chapter is aimed at performance issues surrounding the transfer of data and instructions between the CPU and the cache • Some main memory material is presented in the chapter too
Outline • Four memory questions • Block placement • Block ID • Block replacement • Write strategy • HP AlphaServer ES40 • Cache performance • Three examples • Out-of-order processors
Outline - continued • Improving cache performance • Reducing miss penalty • Reducing miss rate • Use of parallelism • Reducing the hit time • Main memory – improving performance • Real world application of concepts – the AlphaServer with the 21264 memory hierarchy
The ABC’s of Caches • We are not going to review much, so be sure you know the material in section 5.2
Four Memory Hierarchy Questions • Block placement • Three categories of cache organization: • Direct mapped • Fully associative • Set associative • Make sure you understand all these categories • Today direct mapped, 2-way and 4-way set associative organizations dominate the market
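A worked placement example with assumed numbers (illustrative, not from the text): a 512-block cache organized as 2-way set associative has 256 sets, so a block with block address 3000 can go only in set 184. Direct mapped is the special case of one block per set; fully associative is the case of a single set, where the block may go anywhere.

```latex
\text{Set index} = \text{Block address} \bmod \text{Number of sets}
                 = 3000 \bmod 256 = 184
```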
How is a Block Found in Cache? Addresses have three fields: • Tag • Index • Block offset • Tags are searched in parallel to save time • Block offsets don't need to be searched, and neither do indexes (explanation on page 399)
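A minimal sketch of how the three fields are extracted from an address, assuming an illustrative cache with 64-byte blocks and 512 sets (parameters chosen for the example, not taken from the text):

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative parameters: 64-byte blocks, 512 sets */
#define BLOCK_OFFSET_BITS 6   /* log2(64)  */
#define INDEX_BITS        9   /* log2(512) */

int main(void)
{
    uint64_t addr = 0x7ffe1234;

    uint64_t offset = addr & ((1u << BLOCK_OFFSET_BITS) - 1);
    uint64_t index  = (addr >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS);

    /* The index selects a set, the tag is compared against the tags stored
       in that set, and the offset selects the byte within the block. */
    printf("tag=%#llx index=%llu offset=%llu\n",
           (unsigned long long)tag,
           (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```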
Which Block should be Replaced on a Cache Miss? • Situation: a CPU reference misses the cache, so a block must be brought in from main memory – but the cache is full – which block currently in cache should be replaced and sent back to main memory to allow space for the new block? • Strategies: • LRU • FIFO • Random
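A minimal LRU sketch for one set, assuming 4 ways and a counter-based age scheme (illustrative only; real hardware usually approximates LRU):

```c
#include <stdint.h>

#define WAYS 4

/* One cache set: per-way tags, valid bits, and an age counter for LRU. */
struct set {
    uint64_t tag[WAYS];
    int      valid[WAYS];
    unsigned age[WAYS];     /* larger = more recently used */
    unsigned clock;         /* per-set access counter */
};

/* Pick the victim way: an invalid way if one exists, else the least
   recently used (smallest age).  FIFO would track fill order instead,
   and Random would simply pick a pseudo-random way. */
int choose_victim(struct set *s)
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w])
            return w;                 /* empty way: no replacement needed */
        if (s->age[w] < s->age[victim])
            victim = w;
    }
    return victim;
}

/* Called on every hit or fill so that age reflects recency of use. */
void touch(struct set *s, int way)
{
    s->age[way] = ++s->clock;
}
```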
What Happens on a Write? • Writes are more complicated than reads, and they get really complicated when the multiprocessor problem is studied – more on that in chapter 6. • Be sure you fully understand write back, write through and dirty bits
What is the Big Problem with Writes? • On a read, the block in cache can be read at the same time that the tag is being read and compared. If the read is a hit, fine; if it is a miss, just ignore the data read – no benefit, but no harm. • On a write, modifying a block cannot begin until the tag is checked to see if the address is a hit. Because tag checking cannot occur in parallel with the modification, writes normally take longer than reads.
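A minimal sketch contrasting the two write policies on a write hit (after the tag check has already confirmed the hit); the line layout and the `write_to_memory` helper are assumptions for illustration, not the text's code:

```c
#include <stdint.h>

#define BLOCK_SIZE 64

struct line {
    uint64_t tag;
    int      valid;
    int      dirty;                 /* meaningful only for write back */
    uint8_t  data[BLOCK_SIZE];
};

/* Hypothetical stand-in for the next level of the memory hierarchy. */
static void write_to_memory(uint64_t addr, uint8_t byte)
{
    (void)addr; (void)byte;         /* stub */
}

/* Write through: update the cache line and send the write to memory now;
   memory stays up to date, so no dirty bit is needed. */
void write_through(struct line *l, uint64_t addr, unsigned offset, uint8_t byte)
{
    l->data[offset] = byte;
    write_to_memory(addr, byte);
}

/* Write back: update only the cache line and mark it dirty; the whole block
   is written to memory later, when the line is replaced. */
void write_back(struct line *l, unsigned offset, uint8_t byte)
{
    l->data[offset] = byte;
    l->dirty = 1;
}
```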
Reads vs. Writes - Frequency • This is only one study, but it is representative: • Page 400, quoting from figure 2.32, has writes composing about 7% of overall memory traffic
Real World Example: The Alpha 21264 Microprocessor • http://h18002.www1.hp.com/alphaserver/es40/ • Up to 4 processors, midlevel server family • 64KB instruction cache, 64KB data cache (on chip) • The ES40 uses an 8MB direct-mapped second-level cache (see pg. 485); 1-16MB across the family of servers • Benchmark results listed on web page
The Alpha 21264 Microprocessor - continued • Cache (most of these facts on pg. 404-5) • Two-way set associative • FIFO replacement • 64-byte blocks • Write back • 44-bit physical address – not all of the 64-bit virtual space, but the designers did not think anyone needed that large a virtual address space yet.
Cache Performance • Section 5.3 has several interesting and instructive examples. • Minimizing average memory access time is our usual goal for this chapter • However, the final goal is to reduce CPU execution time and these two goals can actually give different results (example on page 409) • Key parameters: miss rate, miss penalty, and hit time
Cache Performance Equations Average memory access time = Hit time + Miss rate × Miss penalty • The main equation • Miss rate is often divided up into two separate miss rates: instruction miss rate and data miss rate
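A worked instance of the equation with assumed numbers (illustrative only, not from the text): a 1-cycle hit time, a 2% miss rate, and a 100-cycle miss penalty, plus the split form weighting instruction and data references by an assumed 75%/25% mix of memory accesses:

```latex
\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
            = 1 + 0.02 \times 100 = 3 \text{ clock cycles}

% Split instruction/data form, weighted by the (assumed) fraction of accesses:
\text{AMAT} = 0.75\,\bigl(1 + \text{MR}_{\text{instr}} \times 100\bigr)
            + 0.25\,\bigl(1 + \text{MR}_{\text{data}}  \times 100\bigr)
```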
Cache Performance Equations - continued • Design parameters that affect the equation parameters: separate or unified cache, direct-mapped vs. associative cache, required time to find a block in cache, and others. Average memory access time = Hit time + Miss rate × Miss penalty
Out-of-Order Execution • Out-of-order execution changes our definitions • No single correct definition • Read pages 182-184 for discussion of out-of-order processing – we will get back to the topic later in the course
Reducing Cache Miss Penalty • Multilevel caches – the first level is small enough to match the clock cycle time of a fast CPU, the second level is large enough to capture many accesses that otherwise would go to main memory and pay a large penalty • Check out the definitions of local miss rate and global miss rate • Study the two diagrams 5.10 and 5.11 for simulation studies of the Alpha 21264.
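The two-level relationships in worked form, with an assumed 4% L1 miss rate and 50% local L2 miss rate for illustration (numbers not from the text):

```latex
\text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1}
              \times \bigl(\text{Hit time}_{L2}
              + \text{Local miss rate}_{L2} \times \text{Miss penalty}_{L2}\bigr)

% Global L2 miss rate = fraction of all CPU references that miss both levels:
\text{Global miss rate}_{L2} = \text{Miss rate}_{L1} \times \text{Local miss rate}_{L2}
                             = 0.04 \times 0.50 = 0.02
```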
More Methods for Reducing Cache Miss Penalty • Critical word first and early restart • Give priority to read misses over write misses • Merging write buffers • Victim caches • Remember what was discarded in case it is needed again • Alpha 21264 uses this concept • Study the next diagram
Reducing Miss Rate • Larger block size • Larger blocks take more advantage of spatial locality • But larger blocks increase the miss penalty • Study the following diagram • Larger caches • Obvious technique – but a drawback is longer hit time and higher cost
Reducing Miss Rate - continued • Higher associativity • Rules of thumb apply • Check them out on page 429 • Way prediction • Predicting the way or block within the set of the next cache access
Reducing Miss Rate - continued • Compiler optimization • Observations • Code can be rearranged without affecting correctness • Reordering of instructions can maximize use of data in a cache block before it is discarded • Very important – widely used, especially in scientific or DSP software, which makes heavy use of matrices, iterated loops and very predictable code • Study the examples that go with the next two diagrams, pages 432-435 (a loop-interchange sketch follows below)
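A minimal sketch of one such optimization, loop interchange: the original loop strides through memory column by column, while the interchanged version walks each row sequentially so a fetched cache block is fully used before it is discarded (array size assumed for illustration):

```c
#define N 1024

double x[N][N];

/* Before: column-major traversal of a row-major array.  Consecutive
   iterations touch elements N*8 bytes apart, so most of each cache
   block goes unused before it is evicted. */
void scale_poor(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2.0 * x[i][j];
}

/* After loop interchange: the inner loop walks a row sequentially,
   exploiting spatial locality within each cache block. */
void scale_better(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}
```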
Reducing Hit Time • Key observation: a time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address. • Hence smaller = faster!
Summary of Techniques • Study over the (large) table on page 449
Virtual Memory • Note the relative penalty for missing the cache versus the relative penalty for missing main memory; hence a lower miss rate is always the preferred goal. (See table 5.32 on page 426)
Main Memory • Revisit the four questions asked about the memory hierarchy • Pages 463-5
Summary of Virtual Memory and Caches • “With virtual memory, first-level caches and second-level caches all mapping portions of the virtual and physical address space, it can get confusing what bits go where.” (page 467) • Study over the following diagram (figure 5.37) and review the related concepts if there is any confusion in your mind.
The Cache Coherence Problem • There is a short section (page 480-2) about this. Read it over – it is easy enough. We need to greatly expand upon this idea later for multiprocessors in chapter 6.
The Alpha 21264 Memory Hierarchy • Note the location of the following components: ITLB, DTLB, victim buffer • Note also that the 21264 is an out-of-order execution processor that fetches up to four instructions per clock cycle • The ES40 has an 8MB direct-mapped L2 cache • Way prediction is used in the instruction cache
Alpha 21264 Performance • Look over the benchmark results in table 5.45 and the comments on pages 487-8 • Comments – the SPEC95 programs do not tax the 21264 memory hierarchy, but the database benchmarks do. The text suggests that microprocessors designed for use in servers may see much heavier demands on their memory systems than those designed for desktops.