Advanced Computer Architecture, Lecture 9
Rohit Khokher, Department of Computer Science, Sharda University, Greater Noida, India
Cache
• Small amount of fast memory
• Sits between normal main memory and the CPU
• May be located on the CPU chip or module
Cache operation – overview
• CPU requests the contents of a memory location
• Check the cache for this data
• If present, get it from the cache (fast)
• If not present, read the required block from main memory into the cache
• Then deliver the word from the cache to the CPU
• The cache includes tags to identify which block of main memory is in each cache slot
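As a rough illustration, the following Python sketch models this flow with the cache as a dictionary from block number to block data. The block size, the toy memory contents, and the function name are illustrative assumptions, not part of the lecture.

```python
BLOCK_SIZE = 4                      # words per block (assumed)
memory = list(range(64))            # toy main memory, one value per word
cache = {}                          # block number -> list of cached words

def read(address):
    block = address // BLOCK_SIZE   # which block the word belongs to
    offset = address % BLOCK_SIZE   # word position within the block
    if block not in cache:          # miss: fetch the whole block from memory
        start = block * BLOCK_SIZE
        cache[block] = memory[start:start + BLOCK_SIZE]
    return cache[block][offset]     # deliver the word from the cache

print(read(13))  # first access: miss, block 3 is loaded into the cache
print(read(14))  # same block: hit, served from the cache
```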
Cache Design
• Size
• Mapping Function
• Replacement Algorithm
• Write Policy
• Block Size
• Number of Caches
Size
• Cost
  – More cache is more expensive
• Speed
  – More cache is faster (up to a point)
  – Checking the cache for data takes time
Comparison of Cache Sizes

Processor        Type                            Year  L1 cache       L2 cache        L3 cache
IBM 360/85       Mainframe                       1968  16 to 32 KB    —               —
PDP-11/70        Minicomputer                    1975  1 KB           —               —
VAX 11/780       Minicomputer                    1978  16 KB          —               —
IBM 3033         Mainframe                       1978  64 KB          —               —
IBM 3090         Mainframe                       1985  128 to 256 KB  —               —
Intel 80486      PC                              1989  8 KB           —               —
Pentium          PC                              1993  8 KB/8 KB      256 to 512 KB   —
PowerPC 601      PC                              1993  32 KB          —               —
PowerPC 620      PC                              1996  32 KB/32 KB    —               —
PowerPC G4       PC/server                       1999  32 KB/32 KB    256 KB to 1 MB  2 MB
IBM S/390 G4     Mainframe                       1997  32 KB          256 KB          2 MB
IBM S/390 G6     Mainframe                       1999  256 KB         8 MB            —
Pentium 4        PC/server                       2000  8 KB/8 KB      256 KB          —
IBM SP           High-end server/supercomputer   2000  64 KB/32 KB    8 MB            —
CRAY MTA         Supercomputer                   2000  8 KB           2 MB            —
Itanium          PC/server                       2001  16 KB/16 KB    96 KB           4 MB
SGI Origin 2001  High-end server                 2001  32 KB/32 KB    4 MB            —
Itanium 2        PC/server                       2002  32 KB          256 KB          6 MB
IBM POWER5       High-end server                 2003  64 KB          1.9 MB          36 MB
CRAY XD-1        Supercomputer                   2004  64 KB/64 KB    1 MB            —
Mapping Function
• Mapping functions decide which main memory block occupies which line of the cache. Because there are fewer cache lines than main memory blocks, an algorithm is needed to make this decision.
• There are three cache mapping functions, i.e., methods of addressing used to locate data within a cache:
  – Direct
  – Fully Associative
  – Set Associative
• Each of these depends on the two concepts described next.
First Concept
RAM is divided into blocks of memory locations. In other words, memory locations are grouped into blocks of 2^n locations, where n is the number of bits used to identify a word within a block. These n bits are found at the least-significant end of the physical address. For example, with n = 2 there are 2^2 = 4 memory locations in each block.
Therefore, for this example, the two least-significant bits of an address indicate the word within a block, while the remaining bits indicate the block number. With a 20-bit address and four words per block, each group of four words shares the same block-identification bits while the word bits take on each of the four possible 2-bit values. The sketch below illustrates this split.
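A minimal sketch of the split, assuming a 20-bit address and n = 2 word bits as in the example; the specific address value is made up.

```python
n = 2                                 # 2 word bits -> 4 words per block
address = 0b10110111010111001101     # an arbitrary 20-bit address (assumed)

word  = address & ((1 << n) - 1)      # the n least-significant bits select the word
block = address >> n                  # the remaining bits identify the block

print(f"block = {block:018b}, word = {word:02b}")
```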
Second Concept
The cache is organized into lines, each of which contains enough space to store exactly one block of data plus a tag uniquely identifying where that block came from in main memory.
Direct Mapping
This is the simplest form of mapping. A block from main memory maps to only one possible line of cache memory. Because there are more blocks in main memory than lines in the cache, many blocks in main memory map to the same cache line. The mapping is given by the formula

    α = β % γ

where α is the cache line number, β is the block number in main memory, γ is the total number of lines in cache memory, and % is the modulus operator.
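A minimal sketch of this formula in Python; the number of cache lines is an assumed value. It also shows how distinct blocks contend for the same line, which is the source of the disadvantage discussed below.

```python
num_lines = 128                      # γ: total lines in the cache (assumed)

def cache_line(block_number):
    """α = β % γ: the one line this block may occupy."""
    return block_number % num_lines

# Blocks 5, 133, and 261 all map to (and contend for) line 5:
print(cache_line(5), cache_line(133), cache_line(261))   # -> 5 5 5
```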
The memory address is broken into three parts:
• (s − r) MSBs: the tag to be stored in a cache line alongside the block held in that line;
• r middle bits: identify the line in which the block is always stored; and
• w LSBs: identify each word within the block.

This means that:
• Number of addressable units = 2^(s+w) words or bytes
• Block size (cache line width, not including the tag) = 2^w words or bytes
• Number of blocks in main memory = 2^s (i.e., all the bits that are not in w)
• Number of lines in cache = m = 2^r
• Size of the tag stored in each line of the cache = (s − r) bits
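The three fields can be extracted with shifts and masks. A sketch with assumed field widths s = 18, r = 7, w = 2, i.e., a 20-bit address with an 11-bit tag:

```python
s, r, w = 18, 7, 2                    # assumed widths: tag = s - r = 11 bits

def split_address(addr):
    word = addr & ((1 << w) - 1)          # w LSBs: word within the block
    line = (addr >> w) & ((1 << r) - 1)   # next r bits: cache line number
    tag  = addr >> (w + r)                # remaining s - r bits: the tag
    return tag, line, word

tag, line, word = split_address(0b10110111010111001101)
print(tag, line, word)
```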
Disadvantage
There is a fixed cache location for any given block in main memory. If two blocks that share the same cache line are referenced repeatedly, cache misses occur and the two blocks are continually swapped in and out, slowing memory access because of the time taken to reach main memory.
Associative Mapping
• A main memory block can be loaded into any line of the cache
• The memory address is interpreted as a tag and a word
• The tag uniquely identifies a block of memory
• Every line's tag is examined for a match
• Cache searching gets expensive
• Address layout in this mapping:  Tag | Word
• Number of addressable units = 2^(s+w) words or bytes
• Block size (cache line width, not including the tag) = 2^w words or bytes
• Number of blocks in main memory = 2^s (i.e., all the bits that are not in w)
• The number of lines in the cache is not determined by any part of the memory address
• Size of the tag stored in each line of the cache = s bits
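A sketch of the associative lookup with a small toy cache; the line structure and sizes are assumptions. In hardware all tag comparisons happen in parallel, whereas this sketch loops over the lines.

```python
w = 2                                   # word bits; the tag is all remaining bits
lines = [{"tag": None, "data": None} for _ in range(8)]   # toy 8-line cache

def lookup(address):
    tag  = address >> w                 # tag: everything above the word bits
    word = address & ((1 << w) - 1)
    for line in lines:                  # hardware compares all tags at once
        if line["tag"] == tag:
            return line["data"][word]   # hit
    return None                         # miss: caller must fetch the block
```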
Set Associative
• The cache is divided into a number of sets
• Each set contains a number of lines
• A given block maps to any line in a given set, e.g., block B can be in any line of set i
• With 2 lines per set:
  – 2-way associative mapping
  – A given block can be in one of 2 lines in only one set
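A sketch of a 2-way set-associative lookup, with assumed sizes: the block number selects the set, and only that set's lines are searched.

```python
WAYS, NUM_SETS, w = 2, 64, 2            # assumed geometry
sets = [[{"tag": None, "data": None} for _ in range(WAYS)]
        for _ in range(NUM_SETS)]

def lookup(address):
    block = address >> w
    set_index = block % NUM_SETS        # which set the block maps to
    tag = block // NUM_SETS             # remaining bits form the tag
    for line in sets[set_index]:        # search only this set's lines
        if line["tag"] == tag:
            return line["data"][address & ((1 << w) - 1)]
    return None                         # miss
```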
Replacement Algorithms
For direct mapping, where there is only one possible line for a given block of memory, no replacement algorithm is needed. For associative and set-associative mapping, however, an algorithm is required, and for maximum speed it is implemented in hardware. Four of the most common algorithms are:

Least recently used (LRU)
Replaces the candidate line that has been in the cache longest with no reference to it.

First in first out (FIFO)
Replaces the candidate line that has been in the cache longest.

Least frequently used (LFU)
Replaces the candidate line that has had the fewest references.

Random replacement
Randomly chooses a line to replace from among the candidate lines. Studies have shown that this yields only slightly inferior performance to the other algorithms.
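These policies can be sketched as victim-selection functions; the data structures (a timestamp map for LRU, a queue for FIFO) are illustrative assumptions, not hardware descriptions.

```python
import random
from collections import deque

def lru_victim(lines, last_used):
    # Least recently used: evict the line whose last reference is oldest.
    return min(lines, key=lambda line: last_used[line])

def fifo_victim(fill_order: deque):
    # First in first out: evict the line that was filled earliest.
    return fill_order.popleft()

def random_victim(lines):
    # Random replacement: pick any candidate line.
    return random.choice(lines)
```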
Write Policy
A write policy is important because, if changes are made to a line in the cache, the corresponding changes must be made to the block in main memory before the line is removed from the cache. A further complication is that more than one device may have access to main memory (e.g., I/O modules). If several processors on the same bus each have their own cache, the problem becomes more complex still: a change in either a cache or main memory could invalidate the others.

WRITE THROUGH
The simplest technique is called "write through". Both main memory and the cache are written to whenever a write operation is performed, ensuring that main memory is always valid. The main disadvantage of this technique is that it may generate substantial main memory traffic, causing a bottleneck and decreasing performance.
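A minimal write-through sketch, reusing the dictionary-style cache of the earlier examples; the structures are assumptions.

```python
BLOCK_SIZE = 4                          # words per block (assumed)

def write_through(cache, memory, address, value):
    block, offset = address // BLOCK_SIZE, address % BLOCK_SIZE
    if block in cache:
        cache[block][offset] = value    # keep any cached copy consistent
    memory[address] = value             # memory is always written, so it stays valid
```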
WRITE BACK
An alternative technique, known as "write back", minimizes main memory writes. Updates are made only in the cache, and an update bit associated with the line is set. Main memory is updated only when the line is replaced in the cache and its update bit has been set. The problem with this technique is that all accesses to main memory must go through the cache so that parts of main memory are not invalidated, which can itself become a bottleneck.
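A matching write-back sketch, where a dirty flag plays the role of the update bit and memory is touched only on eviction; again the structures are assumptions.

```python
BLOCK_SIZE = 4                          # words per block (assumed)

def write_back(cache, address, value):
    block, offset = address // BLOCK_SIZE, address % BLOCK_SIZE
    line = cache[block]                 # assume the block is already cached
    line["data"][offset] = value
    line["dirty"] = True                # set the update bit

def evict(cache, memory, block):
    line = cache.pop(block)
    if line["dirty"]:                   # write back only if the line was modified
        start = block * BLOCK_SIZE
        memory[start:start + BLOCK_SIZE] = line["data"]
```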
Line Size
When a block of data is retrieved from main memory and placed in the cache, the desired word and a number of adjacent words are retrieved with it. As the block size increases from a very small size, the hit ratio at first increases because of the principle of locality of reference: words in the vicinity of a referenced word are likely to be referenced in the near future. As the block size grows further, however, the hit ratio begins to decrease, because the probability of using the newly fetched information becomes lower than the probability of reusing the information it replaced.
Number of Caches
Two aspects of this are:

Multilevel
Due to increased logic density, it has become possible to place a cache on the same chip as the processor. This reduces execution time, since less activity over an external bus is needed. Even with an on-chip cache, it is typically desirable to have an off-chip cache as well: if a miss occurs in the level 1 (on-chip) cache, the data may be retrieved from the level 2 cache, which, although slower than the level 1 cache, is still appreciably faster than main memory. More recently, level 2 caches have also been placed on-chip, with a level 3 cache implemented off-chip.
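A sketch of the two-level lookup just described. For simplicity it caches individual words in plain dictionaries rather than whole blocks; that simplification is an assumption of the sketch, not how the hardware works.

```python
def read(address, l1, l2, memory):
    if address in l1:                   # L1 hit: fastest case
        return l1[address]
    if address in l2:                   # L1 miss, L2 hit: still avoids main memory
        l1[address] = l2[address]       # promote the word into L1
        return l1[address]
    value = memory[address]             # miss in both levels: go to main memory
    l2[address] = value                 # fill both cache levels on the way back
    l1[address] = value
    return value
```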
Unified/Split
• Two types of words are stored in a cache: instructions and data. It has become common to split the cache in two to separate them.
• Two potential advantages of a unified cache are:
  – A higher hit rate than split caches, because the load between instruction and data fetches is balanced automatically.
  – Only one cache needs to be designed and implemented.
• The key advantage of the split cache design is that it eliminates contention for the cache between the instruction fetch/decode unit and the execution unit. This is important for designs that rely on instruction pipelining.