What do we want from our computers?
• correct results: we assume this feature, but consider... who defines what is correct?
• fast: fast at what? (easy answer: fast at my programs)
[graph: performance (slow to fast) plotted against price (¢ to $$$)]
Architectural features
Ways of increasing speed generally fall into 2 categories:
• parallelism
• memory hierarchies
parallelism
Suppose we have 3 tasks: t1, t2, and t3, and that they are independent.
A serial implementation on 1 computer runs them one after another: t1, then t2, then t3.
A parallel implementation (given that we have 3 computers) runs t1, t2, and t3 at the same time.
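A minimal sketch of this difference, using POSIX threads to stand in for the 3 computers (the task bodies are hypothetical placeholders):

    #include <pthread.h>
    #include <stdio.h>

    /* three hypothetical independent tasks */
    void *t1(void *arg) { puts("t1 done"); return NULL; }
    void *t2(void *arg) { puts("t2 done"); return NULL; }
    void *t3(void *arg) { puts("t3 done"); return NULL; }

    int main(void) {
        /* serial: one computer runs the tasks back to back */
        t1(NULL); t2(NULL); t3(NULL);

        /* parallel: one thread per task, all three may run at once */
        pthread_t a, b, c;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_create(&c, NULL, t3, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        pthread_join(c, NULL);
        return 0;
    }

If the tasks really are independent, the parallel version's elapsed time approaches that of the single longest task.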
memory woes:
• P and M physically separate: makes memory accesses SLOW!
• P and M co-located? Very expensive! Or the memory is too small!
A HW design technique to make some memory accesses complete faster is the implementation of hierarchical memory (also known as caching).
Recall the fetch and execute cycle:
• fetch instruction (requires a memory access)
• PC update
• decode
• get operands (a memory access, for a load)
• do operation
• store result (a memory access, for a store)
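As a toy illustration of where those memory accesses happen, here is a sketch of the cycle for a made-up machine (the opcodes and encoding are invented, not any real ISA):

    #include <stdint.h>

    /* made-up machine: a handful of invented opcodes */
    enum { OP_LOAD, OP_ADD, OP_STORE, OP_HALT };
    typedef struct { uint8_t op; uint32_t arg; } Instr;

    uint32_t run(const Instr *prog, uint32_t *mem) {
        uint32_t pc = 0, acc = 0;
        for (;;) {
            Instr i = prog[pc];   /* fetch instruction: a memory access */
            pc = pc + 1;          /* PC update */
            switch (i.op) {       /* decode */
            case OP_LOAD:  acc = mem[i.arg];  break; /* get operand: a memory access  */
            case OP_ADD:   acc = acc + i.arg; break; /* do operation                  */
            case OP_STORE: mem[i.arg] = acc;  break; /* store result: a memory access */
            case OP_HALT:  return acc;
            }
        }
    }

    int main(void) {
        uint32_t mem[2] = {7, 0};
        Instr prog[] = { {OP_LOAD, 0}, {OP_ADD, 5}, {OP_STORE, 1}, {OP_HALT, 0} };
        return (int)run(prog, mem);  /* mem[1] ends up holding 12 */
    }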
Now look at the memory access patterns of lots of programs. In general, memory access patterns are not random. They exhibit locality:
1. temporal
2. spatial
temporal locality
Recently referenced memory locations are likely to be referenced again (soon!)

    loop: instr1   @ A1
          instr2   @ A2
          instr3   @ A3
          b loop   @ A4

Instruction stream references: A1 A2 A3 A4 A1 A2 A3 A4 A1 A2 A3 ...
Note that the same memory locations are repeatedly read (for the fetch).
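The same pattern shows up in data accesses; a minimal C example (mine, not the slides'):

    #include <stdio.h>

    int main(void) {
        int sum = 0;
        /* temporal locality: sum and i are referenced on every iteration,
           and the loop's few instructions are fetched over and over */
        for (int i = 0; i < 1000; i++)
            sum += i;
        printf("%d\n", sum);
        return 0;
    }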
spatial locality
Memory locations near to referenced locations are likely to also be referenced.
Example: code must do something to each element of an array, so it must load each element; the elements sit next to each other in memory.
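A minimal C example of that array pattern (again mine, for illustration):

    #include <stdio.h>

    #define N 1024

    int main(void) {
        int a[N];
        long total = 0;
        /* spatial locality: a[i] and a[i+1] are neighbors in memory,
           so consecutive iterations reference nearby locations */
        for (int i = 0; i < N; i++)
            a[i] = i;
        for (int i = 0; i < N; i++)
            total += a[i];
        printf("%ld\n", total);
        return 0;
    }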
The fetch of the code also exhibits a high degree of spatial locality: I2 is next to I1 in memory. If these instructions are not branches, then we fetch I1, I2, I3, I4, I5, ... in order.
A cache is designed to hold copies of a subset of memory locations. • smaller (in terms of bytes) than main memory • faster than main memory • co-located: processor and cache are on the same chip
P sends a memory request to C.
• hit: the requested location's copy is in the C
• miss: the requested location's copy is NOT in the C. So, send the memory access on to M.
Needed terminology:

    miss ratio = (# of misses) / (total # of accesses)
    hit ratio  = (# of hits)   / (total # of accesses)  =  1 - miss ratio

You already assumed that total # of accesses = # of misses + # of hits.
So, when designing a cache, keep the bytes likely to be referenced (again), and their neighbors, in the cache... So, what is in the cache is different for each different program. On average, for a given program:

    Average Memory Access Time (AMAT) = Tc + (miss ratio)(Tm)

where Tc is the cache access time and Tm is the main memory access time.
For example: Tc = 1 nsec, Tm = 20 nsec. A specific program has 98% hits...

    AMAT = 1 + (.02)(20) = 1.4 nsec

Each individual memory access takes 1 nsec (a hit) or 21 nsec (a miss).
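A minimal sketch of that arithmetic in C (the function name is mine):

    #include <stdio.h>

    /* AMAT = Tc + miss_ratio * Tm   (times in nsec) */
    static double amat(double tc, double tm, double miss_ratio) {
        return tc + miss_ratio * tm;
    }

    int main(void) {
        /* the example above: Tc = 1 nsec, Tm = 20 nsec, 98% hits */
        printf("AMAT = %.1f nsec\n", amat(1.0, 20.0, 0.02));  /* prints 1.4 */
        return 0;
    }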
Divide all of memory up into fixed-size blocks. Copy an entire block into the cache... Make the block size greater than 1 word.
An unrealistic cache, with 4 block frames:

    block frame 00
    block frame 01
    block frame 10
    block frame 11
Each main memory block maps to a specific block frame (00, 01, 10, or 11): 2 bits of the address define this mapping...
Take advantage of spatial locality by making the block size greater than 1 word. On a miss, copy the entire block into the cache, and then keep it there as long as possible. (Why?) How the cache uses the address to do a lookup: the index # selects which block frame, and the remaining low bits select the byte/word within the block.
The "which block frame" field is known as the index # or (sometimes) the line #.
• But, many main memory blocks map to the same cache block frame... only one may be in the frame at a time!
• We must distinguish which one is in the frame right now.
tag • most significant bits of the block's address • to distinguish which main memory block is in the cache block frame • tag is kept in the cache together with its data block
How the address is utilized by the cache (so far): address = [ tag | index # | byte w/i block ]. The cache holds a tag alongside the data block in each of the frames 00, 01, 10, 11.
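A minimal C sketch of splitting an address into those fields (the 4-frame, 16-byte-block geometry is an assumption, chosen to match the toy cache above):

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 4   /* assumed: 16-byte blocks */
    #define INDEX_BITS  2   /* assumed: 4 block frames */

    int main(void) {
        uint32_t addr   = 0x12345678;
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                  /* byte w/i block */
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* index #        */
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);                /* tag            */
        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }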
Still missing... we must distinguish block frames that have nothing in them from ones that hold a block from main memory (consider power up for a computer system: nothing is in the cache).
• We need 1 bit per block frame, most often called a valid bit (sometimes called a present bit).
cache access (or cache lookup)
• index # is used to find the correct block frame
• Is the block frame valid?
  NO: MISS
  YES: compare the address tag to the block frame's tag:
    match: HIT
    no match: MISS
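A minimal C sketch of that lookup for the direct-mapped case (the struct layout and geometry are assumptions for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_BITS 4                 /* assumed: 16-byte blocks */
    #define INDEX_BITS  2                 /* assumed: 4 block frames */
    #define NFRAMES     (1u << INDEX_BITS)

    typedef struct {
        bool     valid;                   /* the valid bit */
        uint32_t tag;
        uint8_t  data[1u << OFFSET_BITS];
    } Frame;

    static Frame cache[NFRAMES];

    /* returns true on a HIT, false on a MISS */
    bool lookup(uint32_t addr) {
        uint32_t index = (addr >> OFFSET_BITS) & (NFRAMES - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        Frame *f = &cache[index];          /* index # finds the frame  */
        return f->valid && f->tag == tag;  /* valid? then compare tags */
    }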
Completed diagram of the cache: address = [ tag | index # | byte w/i block ]; each of the frames 00, 01, 10, 11 holds a valid bit, a tag, and a data block.
This cache is called direct mapped, or 1-way set associative (that is, set associative with a set size of 1). Each index # maps to exactly 1 block frame.
[diagram: a direct-mapped cache (one V/Tag/Data array, 3 bits for index #) next to a 2-way set associative cache holding the same amount of data (two V/Tag/Data arrays, 2 bits for index #)]
How about 4-way set associative, or 8-way set associative? For a fixed number of block frames:
• larger set size tends to lead to higher hit ratios
• larger set size means that the amount of HW (circuitry) goes up, and Tc increases
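A minimal sketch of an N-way lookup (geometry again assumed): the index # now selects a set, and the tag is compared against every frame in the set:

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_BITS 4                 /* assumed: 16-byte blocks */
    #define INDEX_BITS  2                 /* assumed: 4 sets         */
    #define NSETS       (1u << INDEX_BITS)
    #define WAYS        2                 /* 2-way set associative   */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[1u << OFFSET_BITS];
    } Frame;

    static Frame cache[NSETS][WAYS];

    bool lookup(uint32_t addr) {
        uint32_t index = (addr >> OFFSET_BITS) & (NSETS - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        /* the HW compares against all ways of the set in parallel;
           a loop stands in for that here */
        for (int way = 0; way < WAYS; way++) {
            Frame *f = &cache[index][way];
            if (f->valid && f->tag == tag)
                return true;   /* HIT in this way   */
        }
        return false;          /* MISS in every way */
    }

The extra comparators (one per way) are part of the circuitry cost mentioned above.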
Implementing writes
1. write through: change the data in the cache, and also send the write to main memory. Slow, but very little circuitry.
2. write back
• at first, change the data only in the cache
• write to memory only when necessary
A dirty bit, kept per frame alongside V and Tag, is set on a write, to identify blocks to be written back to memory. When a program completes, all dirty blocks must be written to memory...
2. write back (continued)
• faster: multiple stores to the same location result in only 1 main memory access
• more circuitry
• must maintain the dirty bit
• dirty miss: a miss caused by a read or write to a block not in the cache, where the required block frame has its dirty bit set. So, there is a write of the dirty block, followed by a read of the requested block.
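A minimal sketch of a write-back store, extending the earlier direct-mapped frame with a dirty bit (writeback() and fill() are hypothetical stand-ins for the main memory traffic):

    #include <stdbool.h>
    #include <stdint.h>

    #define OFFSET_BITS 4
    #define INDEX_BITS  2
    #define NFRAMES     (1u << INDEX_BITS)

    typedef struct {
        bool     valid, dirty;
        uint32_t tag;
        uint8_t  data[1u << OFFSET_BITS];
    } Frame;

    static Frame cache[NFRAMES];

    /* hypothetical main memory traffic */
    static void writeback(const Frame *f) { (void)f; /* write dirty block to M */ }
    static void fill(Frame *f, uint32_t tag) {       /* read block from M      */
        f->valid = true; f->dirty = false; f->tag = tag;
    }

    void store_byte(uint32_t addr, uint8_t value) {
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & (NFRAMES - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        Frame *f = &cache[index];
        if (!(f->valid && f->tag == tag)) {  /* MISS */
            if (f->valid && f->dirty)
                writeback(f);                /* the dirty miss case: write first */
            fill(f, tag);                    /* then read the requested block    */
        }
        f->data[offset] = value;  /* change data in the cache only  */
        f->dirty = true;          /* mark it for a later write back */
    }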
How about 2 separate caches?
• I-cache, for instructions only: can be rather small, and still have excellent performance
• D-cache, for data only: needs to be fairly large
We can send memory accesses to the 2 caches independently... (increased parallelism): P sends instruction fetches to the I-cache and loads/stores to the D-cache, and both caches connect to M.
Called an L1 cache (level 1). This hierarchy works so well that most systems have 2 levels of cache: P connects to L1, L1 to L2, and L2 to M.
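The AMAT formula from earlier extends naturally: an L1 miss goes to L2, and only an L2 miss goes to M. A minimal sketch of that arithmetic (all the times and miss ratios below are assumed example numbers, not the slides'):

    #include <stdio.h>

    /* AMAT = T_L1 + (L1 miss ratio) * (T_L2 + (L2 miss ratio) * Tm) */
    int main(void) {
        double t_l1 = 1.0, t_l2 = 5.0, t_m = 20.0;  /* nsec, assumed  */
        double m1 = 0.02, m2 = 0.25;                /* assumed ratios */
        double amat = t_l1 + m1 * (t_l2 + m2 * t_m);
        printf("AMAT = %.2f nsec\n", amat);         /* 1 + .02*(5+5) = 1.20 */
        return 0;
    }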