Memory Organization 2: Cache Memories CE 140 A1/A2 30 July 2003
Required Reading • Ch 5, Hamacher
Memory Hierarchy Increasing Speed and Cost Per Bit Increasing Size Registers Caches Main Memory Magnetic Disk Optical Storage Tape
Principle of Locality of Reference • Programs tend to reuse data and instructions they have used recently • Instructions in localized areas are executed repeatedly • 90% of execution time is spent in only 10% of the code • “Make the common case fast”: favor accesses to such data • Keep recently accessed data in the fastest memory
Temporal Locality • A recently executed instruction is likely to be executed again very soon
Spatial Locality • Instructions in close proximity to a recently executed instruction are likely to be executed soon
Memory Hierarchy • Provide a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level • All data in one level is also found in the level below
Memory Hierarchy • Importance increased with advances in performance of processors • 1980: most processors without caches • 1995: two levels of caches • Bridge the processor-memory performance gap
[Figure: Processor-Memory Performance Gap — from 1980 to 2000, CPU performance (log scale, 1 to 1000) pulls steadily away from memory performance. Source: Computer Architecture: A Quantitative Approach by Patterson/Hennessy]
Cache • Small, fast storage used to improve speed of access to slower, larger memory • Exploits spatial and temporal locality
Cache • Temporal Locality: Whenever an item is first needed, it is first brought to the cache, where it will hopefully remain until it is needed again. Also influences choice on which item to discard when cache is full • Spatial Locality: Instead of fetching just one item into the cache, fetch several adjacent data items as well (block/cache line)
Memory Hierarchy Design • Block placement: Where can a block be placed in the upper level? • Block identification: How is a block found if it is in the upper level? • Block replacement: Which block should be replaced on a miss? • Write strategy: What happens on a write?
Where can a block be placed in a cache? • Mapping function determines how a block is placed in the cache
Mapping Functions • Three Types • Direct Mapping • Associative Mapping • Set-Associative Mapping • Examples assume 64K (4K x 16 words) main memory and 2K (128 x 16 words) cache • 1 Block consists of 16 words
Where can a block be placed in a cache? How is a block found? [Figure: a mapping function places main memory blocks 0–4095 into cache blocks 0–127, each cache block holding a tag] 16-bit address: Block (12 bits) | Word (4 bits)
Direct Mapping • Simplest • Block j of main memory maps onto block (j modulo 128) of the cache. • Example: Block 2103 of main memory maps to block (2103 mod 128) = block 55 • Each main memory block has only one place in cache • More than one block contends for only one cache position • Block Address MOD Number of Blocks in Cache
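The placement rule above can be sketched as a small helper that splits a 16-bit word address into its fields. This is an illustrative model, not hardware; the field widths come from the slides' example cache (16-word blocks, 128 cache blocks):

```python
def direct_map(address):
    """Split a 16-bit word address into (tag, block, word) fields.

    Field widths follow the slides' example cache: 16 words per block
    (4-bit word field), 128 cache blocks (7-bit block field), 5-bit tag.
    """
    word = address & 0xF             # lower 4 bits: word within the block
    block = (address >> 4) & 0x7F    # middle 7 bits: cache block position
    tag = address >> 11              # upper 5 bits: tag
    return tag, block, word

# Main memory block 2103 starts at word address 2103 * 16 and maps to
# cache block 2103 mod 128 = 55, with tag 2103 // 128 = 16.
assert direct_map(2103 * 16) == (16, 55, 0)
```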
Direct Mapping • 16-bit address (64K words) • 16 words per block: lower 4 bits select the word within the block • Cache block position: middle 7 bits • 32 main memory blocks map to the same cache block position • Higher 5 bits tell which of the 32 blocks is currently mapped • Higher 5 bits are stored in the 5 tag bits associated with the cache location
How is a block found if it is in the cache? Direct Mapping • Middle 7 bits determine which location in the cache is used • Higher-order 5 bits are matched with the tag bits in the cache to check if the desired block is the one stored in the cache
Direct Mapping [Figure: main memory blocks 0–4095 map to fixed cache blocks 0–127, each cache block holding a tag] 16-bit address: Tag (5 bits) | Block (7 bits) | Word (4 bits)
Associative Mapping • A block can be mapped to any available cache location • Higher 12 bits are stored in tag bits
How is a block found if it is in the cache? Associative Mapping • Tag bits (Higher-order 12 bits) of an address are compared with tag bits of each block to check if desired block is present • Higher cost than direct mapping due to need to search all 128 tags • Tags must be searched in parallel for performance reasons
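The lookup described above can be modeled with a toy class (names are illustrative). A real cache compares the 12-bit tag against all 128 stored tags in parallel; the membership test below is a sequential stand-in for that comparison:

```python
class AssociativeCache:
    """Toy model of a fully associative cache with 128 one-block slots.

    A lookup compares the 12-bit block-address tag against every stored
    tag; hardware does this in parallel, this model does it with a scan.
    """
    def __init__(self, num_blocks=128):
        self.tags = [None] * num_blocks

    def lookup(self, address):
        tag = address >> 4          # upper 12 bits of the 16-bit address
        return tag in self.tags     # hit if any of the 128 tags match

    def fill(self, address, slot):
        self.tags[slot] = address >> 4

cache = AssociativeCache()
cache.fill(0x7A00, slot=3)          # load the block containing 7A00h
assert cache.lookup(0x7A0F)         # same block, different word: hit
assert not cache.lookup(0x7B00)     # different block: miss
```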
Associative Mapping [Figure: any main memory block 0–4095 can occupy any cache block 0–127, each cache block holding a tag] 16-bit address: Tag (12 bits) | Word (4 bits)
Set-Associative Mapping • Cache blocks are grouped into sets • A main memory block can reside in any block of a specific set • Less contention than direct mapping • Less cost than associative mapping • Set = (Block Address) MOD (Number of Sets in Cache) • k-way set associative cache: k blocks per set
How is a block found if it is in the cache? Set-Associative Mapping • Example: Cache groups two blocks per set, giving 64 sets (6-bit set field) • 64 main memory blocks can be mapped onto one set • Tag bits in each cache block store the upper 6 bits of the address to tell which of the 64 blocks is currently in the cache
Set-Associative Mapping [Figure: main memory blocks 0–4095 map to sets 0–63, two cache blocks per set, each with a tag] 16-bit address: Tag (6 bits) | Set (6 bits) | Word (4 bits)
Levels of Set Associativity • Direct Mapping: 1 block per set, 128 sets • Fully Associative Mapping: 128 blocks per set, 1 set • Set-Associative Mapping is in between Direct and Fully Associative • Different mappings are just different degrees of set associativity
Which block should be replaced on a cache miss? • Replacement Algorithm • Determines which block in the cache is to be replaced when a cache miss occurs and the cache is full • Trivial for direct-mapped caches
Which block should be replaced on a cache miss? • Replacement algorithms • Random Replacement • First-In First-Out (FIFO) • Optimal Algorithm • Least Recently Used (LRU) • Least Frequently Used • Most Frequently Used
Example of Replacement Algorithms • Assume • fully associative cache • Reference string • Sequence of block requests • Example • 3 2 3 6 7 3 3 5 4
Random Replacement • Simplest algorithm • Replaces elements at random • Spreads allocation uniformly • Quite effective in some cases
First-In First-Out (2-block cache) 17 Cache Misses
First-In First-Out (3-block cache) 14 Cache Misses
First-In First-Out (4-block cache) 15 Cache Misses
Belady’s Anomaly • Increasing the number of blocks does not always decrease the number of cache misses • For some replacement algorithms (such as FIFO), the number of cache misses may increase as the number of blocks increases
Optimal Algorithm • Replace the block that will not be used for the longest period of time • Guarantees the lowest miss rate for a fixed number of blocks • Needs prior knowledge of the reference string
Least Recently Used (LRU) • Overwrite the block that has gone the longest time without being referenced • Cache controller tracks references through counters • Inefficient when accessing sequential elements of a large array
Least-Recently Used (4-block cache) 12 Cache Misses
Least Frequently Used • Has a counter for the number of references that have been made to a block • Block with least frequency is replaced • FIFO is used as a tie breaker • Rationale: A block that is frequently accessed will be accessed again
Most Frequently Used • Replace the block with the highest reference count • Rationale: the block with the highest count will no longer be used, while a block with a low count was brought in recently and has yet to be used
What happens on a write? • Write policies • Write-through • Write-back
Write-Through • Cache location and main memory location are updated simultaneously • Simpler but results in unnecessary write operations if word is updated many times during its cache residency • Requires only valid bit
Valid Bit • Indicates if the block stored in the cache is still valid • Set to 1 when the block is initially loaded into the cache • Transfers from disk to main memory use DMA and bypass the cache • When a main memory block is updated by a source that bypasses the cache, and the block is also in the cache, its valid bit is set to 0
Write-Back • Update only the cache location and mark it updated using a dirty bit/modified bit • Main memory location is updated later, when the block is replaced • Writes at the speed of the cache • Also results in unnecessary writes because the whole block is written back to memory even if only one word was updated • Requires valid bit and dirty bit
Dirty Bit • Tells whether block in cache has been modified/has newer data than main memory block • Problem: Transfer from main memory to disk bypassing the cache • Solution: Flush the cache (write back all dirty blocks) before DMA transfer begins
What happens on a write miss? • No-write allocate: Data is written directly to main memory • Write allocate: Block is first loaded from main memory into the cache, then the cache block is written to
Write Buffer • Used as temporary holding location for data to be written to memory • Processor need not wait for write to finish • Data in write buffer will be written when memory is available for writing • Works for both write-through and write-back caches
Example of Mapping Techniques • Consider data cache with 8 blocks of data • Each block of data consists of only one word • These are greatly simplified parameters • Consider 4 x 10 array of numbers, arranged in column order • 40 elements = 28h stored from 7A00h to 7A27h
Example of Mapping Techniques • Tag field sizes for the 16-bit address: Direct Mapped: 13 bits; Set-Associative: 15 bits; Associative: 16 bits
Example of Mapping Techniques • Consider the following algorithm • It computes the average of the elements of the first row (row 0), then stores each element of that row divided by the average

SUM := 0
for j := 0 to 9 do
    SUM := SUM + A(0,j)
end
AVE := SUM / 10
for i := 9 downto 0 do
    A(0,i) := A(0,i) / AVE
end
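The pseudocode above can be rendered directly in Python. The array contents below are made-up sample data (the slides only specify a 4 x 10 array stored in column order from 7A00h):

```python
# Average row 0 of a 4 x 10 array A, then divide each element of row 0
# by that average; the second loop runs downto 0, as in the pseudocode.
A = [[float(i * 10 + j) for j in range(10)] for i in range(4)]  # sample data

total = 0.0
for j in range(10):
    total += A[0][j]
ave = total / 10

for i in range(9, -1, -1):
    A[0][i] = A[0][i] / ave

# After normalization, row 0 averages to exactly 1.
assert abs(sum(A[0]) / 10 - 1.0) < 1e-9
```

The access pattern matters for the cache: because the array is stored in column order, the ten elements of row 0 are 4 words apart in memory, which is what makes the three mapping techniques behave differently in this example.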