Processor support devices
Part 2: Caches and the MESI protocol
dr.ir. A.C. Verschueren
Eindhoven University of Technology
Section of Digital Information Systems
The memory speed ‘gap’
• High-performance processors are much too fast for the main memory they are connected to
• Processors running at 1000 MHz would like a memory read/write cycle time of 1 nanosecond
• Large memories built from (relatively) cheap RAMs have cycle times on the order of 100 nanoseconds
• That is 100 times slower, and this speed gap continues to grow...
Wide words and memory banking
• The gap can be closed IF the processor tolerates a long delay between the start and end of a cycle
[Figure: two ways to read words 0..7. 1) Wide memory words: 4 words in parallel (read 0..3, then 4..7). 2) Multiple memory 'banks': 4 accesses in parallel. Annotations on the slide: "Complex timing", "Lots of pins"]
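The two layouts on this slide can be sketched as address arithmetic. This is only an illustration: the slide does not say how words are assigned to banks, so the low-order interleaving below (consecutive words in consecutive banks) and the function names are assumptions.

```python
NUM_BANKS = 4        # the slide shows 4 accesses in parallel
WORDS_PER_ROW = 4    # the slide shows 4 words read in parallel


def bank_of(word_address):
    """Banked memory (assumed low-order interleaving).

    Returns (bank, row inside that bank) for a word address.
    """
    return word_address % NUM_BANKS, word_address // NUM_BANKS


def wide_row_of(word_address):
    """Wide memory words: returns (wide row, position inside the row)."""
    return word_address // WORDS_PER_ROW, word_address % WORDS_PER_ROW


# Words 0..3 land in four different banks, so their accesses can overlap.
banks_for_first_four = [bank_of(w)[0] for w in range(4)]
```

Both schemes only help when the next addresses are known early enough to start the slow accesses in advance.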
The big IF in closing the gap
• Long memory access delays can be tolerated IF addresses are known in advance
  • True for sequential instruction reads
  • NOT true for most of the other read operations
• Memory reading MUST become quicker!
• Not interested in (timing of) write operations
  • Data & address to memory, then forget about it...
Small-scale virtual memory: the cache
‘Cache’ is French: ‘secret hiding place’
• A 'cache' is a small but very fast memory which contains the 'most active' memory words
    IF a requested memory word is in the cache
    THEN supply the word from the cache {very fast}
    ELSE supply the word from main memory {rather slow}
         and place it in the cache for later references
         (throwing out unused words when needed)
• An ideal cache knows which words will be used soon
• A good cache takes the THEN branch (a 'hit') 95% of the time and the ELSE branch (a 'miss') only 5%
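The IF/THEN/ELSE rule and the 95%/5% figure can be tried out with a minimal sketch. Several things here are assumptions and not from the slides: the class and function names, the "throw out the oldest loaded word" replacement rule, and the 1 ns cache time (taken from the cycle time the processor would like to see). The 100 ns memory time comes from the earlier slide.

```python
CACHE_NS = 1.0     # assumed: cache as fast as the processor wants
MEMORY_NS = 100.0  # main memory cycle time from the 'gap' slide


def average_access_time(hit_rate, cache_ns=CACHE_NS, memory_ns=MEMORY_NS):
    """Average read time when a fraction hit_rate of reads hits the cache."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


class TinyCache:
    """Word cache following the IF/THEN/ELSE rule on the slide."""

    def __init__(self, memory, capacity):
        self.memory = memory      # backing store, indexed by address
        self.capacity = capacity  # number of words the cache can hold
        self.words = {}           # address -> data, in loading order
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.words:          # THEN: supply from the cache
            self.hits += 1
            return self.words[address]
        self.misses += 1                   # ELSE: supply from main memory
        data = self.memory[address]
        if len(self.words) >= self.capacity:
            oldest = next(iter(self.words))  # assumed replacement rule
            del self.words[oldest]
        self.words[address] = data         # keep it for later references
        return data


memory = list(range(100, 116))
cache = TinyCache(memory, capacity=4)
values = [cache.read(a) for a in (0, 1, 0, 1, 2, 3, 4, 0)]
good_cache_ns = average_access_time(0.95)  # roughly 5.95 ns
```

Under these assumptions a 95% hit rate brings the average read time from 100 ns down to about 6 ns, which is why the remaining 5% of misses still dominates the total.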
Keeping the cache hidden
• The cache must keep a copy of memory words
• Memory mapped I/O ports are problematic
  • These can spontaneously change their value!
  • Have to be made 'non-cacheable' at all times
• Shared memory is problematic too
  • Make it non-cacheable (from all sides), or better
  • Inform all attached caches of changes (write actions)
Cache writing policies
• 'write-through': written data is copied into memory
  • Option: write to cache only if the word is already present
  • The amount of data in the cache can be reduced
  • Read after non-cached write requires a true memory read
• 'posted write': writes are buffered until the bus is free
  • Gives priority to reads, allows high speed write bursts
  • More hardware, delay between CPU and memory write
• 'late write': write to memory only to make free space in the cache
  • Drastically reduces the number of memory write cycles
  • Complex cache control, especially with shared memory!
[Slide label: Pentium]
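The difference in memory traffic between 'write-through' and 'late write' can be counted with a small sketch. The function names, the oldest-first eviction rule and the final flush (added so that both policies leave memory up to date) are assumptions for illustration only.

```python
def memory_writes_write_through(write_addresses):
    """Write-through: every CPU write is also copied into main memory."""
    return len(write_addresses)


def memory_writes_late_write(write_addresses, capacity):
    """Late write: memory is written only when a modified word is thrown out.

    Only modified words are tracked, so the cache here holds at most
    `capacity` modified words (a simplification).
    """
    modified = {}        # address -> True, in loading order
    memory_writes = 0
    for address in write_addresses:
        if address in modified:
            continue                       # write hit: cache only
        if len(modified) >= capacity:
            victim = next(iter(modified))  # assumed: evict the oldest word
            del modified[victim]
            memory_writes += 1             # write the victim to memory
        modified[address] = True
    return memory_writes + len(modified)   # final flush of what is left


loop_counter_writes = [7] * 1000           # one word written 1000 times
through = memory_writes_write_through(loop_counter_writes)
late = memory_writes_late_write(loop_counter_writes, capacity=4)
```

For a word that is rewritten many times, write-through pays for every write while late write pays once, which is the drastic reduction the slide refers to.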
An example of a cache
[Figure: block diagram. Blocks: CPU (80386), cache controller (82385) with administration memory, data cache memory, data bus switch, main memory. Buses: CPU bus, system bus. Signals: address, data, control.]
• To reduce the amount of administration memory, a single cache 'line' administrates a block of 8 words
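Intel 82385 'direct mapped' cache mode
[Figure: the 32-bit address is split into a 17-bit 'tag', a 10-bit line select, a 3-bit word select and a 2-bit byte select. The line select picks one of 1024 lines; each line holds a 17-bit tag, a 'line valid' bit and 8 words (word #0 .. word #7) of 32-bit data, each with its own 'word valid' bit. The output is a 'hit' signal.]
• Also known as '1-way set associative'
• Prone to ‘tag clashing’!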
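The address split in the direct mapped figure (17-bit tag, 10-bit line, 3-bit word, 2-bit byte) can be written out as bit operations. The function names and the example addresses are illustrative assumptions; the field widths are the ones on the slide.

```python
BYTE_BITS, WORD_BITS, LINE_BITS, TAG_BITS = 2, 3, 10, 17  # 32 bits in total


def split_address(address):
    """Split a 32-bit address into (tag, line, word, byte)."""
    byte = address & ((1 << BYTE_BITS) - 1)
    word = (address >> BYTE_BITS) & ((1 << WORD_BITS) - 1)
    line = (address >> (BYTE_BITS + WORD_BITS)) & ((1 << LINE_BITS) - 1)
    tag = (address >> (BYTE_BITS + WORD_BITS + LINE_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, line, word, byte


def tags_clash(address_a, address_b):
    """True when both addresses need the same line but carry different tags."""
    tag_a, line_a, _, _ = split_address(address_a)
    tag_b, line_b, _, _ = split_address(address_b)
    return line_a == line_b and tag_a != tag_b


# 1024 lines x 8 words x 4 bytes of data
CACHE_BYTES = (1 << LINE_BITS) * (1 << WORD_BITS) * (1 << BYTE_BITS)

first = 0x00000024
second = first + CACHE_BYTES   # same line, next tag
```

Two addresses that lie a whole cache size apart select the same line, so a program alternating between them keeps evicting its own data. That is the 'tag clashing' the slide warns about.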
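Intel 82385 '2-way set associative' mode
[Figure: the 32-bit address is split into an 18-bit 'tag', a 9-bit line select, a 3-bit word select and a 2-bit byte select. Each of the two sets has 512 lines with 18-bit tags, a 'line valid' bit and 8 words (word #0 .. word #7) of 32-bit data with 'word valid' bits. Hit logic combines the 'hit' signals of the two sets, and LRU bits are kept per line. The direct mapped values (17-bit tag, 10-bit line select, 1024 lines) also appear on the slide.]
• 'Least Recently Used' bits indicate which set in each line has been used last (the other is the replacement target)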
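A minimal sketch of the 2-way lookup with one LRU bit per line follows. It uses the field widths from the slide (18-bit tag, 9-bit line) but leaves out the data words and the 'word valid' bits; the class name, the method name and the power-on choice of replacement target are assumptions.

```python
class TwoWayCache:
    """Tag store of a 2-way set associative cache with one LRU bit per line."""

    LINES = 512

    def __init__(self):
        # tags[s][line] is the tag held by set s, or None when the line is invalid
        self.tags = [[None] * self.LINES for _ in range(2)]
        # LRU bit per line: which of the two sets was used last
        self.used_last = [0] * self.LINES

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        line = (address >> 5) & 0x1FF   # skip 2 byte bits and 3 word bits
        tag = address >> 14             # remaining 18 bits
        for s in (0, 1):
            if self.tags[s][line] == tag:
                self.used_last[line] = s
                return True
        victim = 1 - self.used_last[line]  # the set NOT used last is replaced
        self.tags[victim][line] = tag
        self.used_last[line] = victim
        return False


cache = TwoWayCache()
a, b, c = 0x0000, 0x4000, 0x8000   # three addresses that select the same line
results = [cache.access(x) for x in (a, b, a, b, c, b, a)]
```

Addresses `a` and `b` can now live side by side, so alternating between them hits after the first two loads. A third address on the same line pushes out whichever of the two was not used last.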
The MESI protocol
• Late write and shared memory combine badly
• The 'MESI' protocol solves this with four states for each of the cache words (or lines)
  Modified: cached data differs from the main memory and is only located in this cache
  Exclusive: cached data is the same as main memory and is only located in this cache
  Shared: cached data is the same as main memory and also located in one or more other caches
  Invalid: cache word/line not loaded with memory data
State changes in the MESI protocol
• Induced by processor read/write actions and actions of other cache controllers
• Caches keep track of other read/write actions
  • Uses 'bus snooping': monitoring the address and control buses when they are driven by someone else
• During a memory access, other cache controllers indicate if one of them contains the accessed location
  • Needed to decide between the Shared/Exclusive states!
Intel 82496 CPU accesses
[Slide label: Pentium]
• A read hit reads the cache, does not change state
• A read miss reads memory, other controllers check if they also contain the address read
• A write hit: handling depends on the state
  • If Shared, write is done in main memory too
  • If Exclusive or Modified, write is only done in cache
• A write miss writes to memory, but not the cache (Normal MESI: write cache too)
• Other caches may change their state!
Intel 82496 state diagram
[Figure: state diagram with the four states Invalid, Shared, Exclusive and Modified. Arrow labels: read hit; write miss; read miss & somewhere else; read miss, only here; write hit (write to memory); write hit (setup for late write); read/write hit; snoop read; snoop read (*); snoop write; any snoop]
(*): This controller copies local data to memory immediately
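The state changes can be sketched as a small transition table for a single cache line. The arrows of the diagram are not recoverable from the flattened slide, so the targets below are inferred from the state definitions and the access rules on the previous slides. In particular, moving from Shared to Exclusive on a write hit is an assumption (the other caches see the snooped write and drop their copy). All names are illustrative.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def cpu_read(state, somewhere_else):
    """Own CPU reads. somewhere_else: another cache reports holding the address."""
    if state is State.INVALID:   # read miss: load from memory
        return State.SHARED if somewhere_else else State.EXCLUSIVE
    return state                 # read hit: no state change


def cpu_write(state):
    """Own CPU writes."""
    if state is State.INVALID:
        return State.INVALID     # write miss: memory only, cache not loaded
    if state is State.SHARED:
        return State.EXCLUSIVE   # ASSUMPTION: written through, others invalidate
    return State.MODIFIED        # Exclusive/Modified: cache only (late write)


def snoop_read(state):
    """Another controller reads this address."""
    if state in (State.MODIFIED, State.EXCLUSIVE):
        return State.SHARED      # Modified copies its data to memory first (*)
    return state


def snoop_write(state):
    """Another controller writes this address: the local copy is stale."""
    return State.INVALID


# Two caches, A and B, sharing one address.
a = cpu_read(State.INVALID, somewhere_else=False)   # A loads the line alone
a = cpu_write(a)                                    # A modifies it in cache only
a_after_b_reads = snoop_read(a)                     # B reads: A writes back
b = cpu_read(State.INVALID, somewhere_else=True)    # B gets a shared copy
b_after_write = cpu_write(b)                        # B writes through
a_final = snoop_write(a_after_b_reads)              # A sees B's write
```

The 'somewhere else' flag is the answer the other cache controllers give during the memory access, which is why snooping is needed to choose between Shared and Exclusive.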
Final remarks on caches (1)
• High performance processors rely on caches
  • Main memory must be accessed in a single clock cycle
  • At 1 GHz, the cache must be on the CPU chip
  • But a large & fast cache takes a lot of chip space!
[Figure: CPU chip containing the CPU and the on-chip cache (small & fast, first level cache); off-chip cache (large(r) & slow(er), second level cache); main memory (huge & very slow)]
Final remarks on caches (2)
• The off-chip cache becomes as slow as main memory was some time ago...
  • Second level cache is placed on the CPU chip too
  • Examples: PowerPC, Crusoe (both > 256 kilobytes!)
  • The external cache becomes a third-level cache
• Data transfer between on-chip caches can move a complete cache line in parallel: a huge speedup