SOFTENG 363 Computer Architecture
Cache
John Morris
ECE/CS, The University of Auckland
[Photo: Iolanthe I at 13 knots on Cockburn Sound, WA]
Cache
• Small, fast memory
  • Typically ~64 kbytes (Level 1, 1998+)
  • 2 cycle access time
  • Same die as processor
• ‘High-end’ CPUs
  • 2 levels of cache on main die
  • L1: 64 kbytes, L2: ~1 Mbyte (but slower!)
• “Off-chip” cache possible
  • Custom cache chip closely coupled to processor
• Use fast static RAM (SRAM) rather than slower dynamic RAM
• 2nd level of the memory hierarchy
• “Caches” most recently used memory locations “closer” to the processor
  • closer = closer in time
Cache
• Etymology
  • cacher (French) = “to hide”
• Transparent to a program
  • Programs simply run slower without it
• Modern processors rely on it
  • Reduces the cost of main memory access
  • Enables instruction/cycle throughput
• Typical program
  • ~25% memory accesses
Cache
• Relies upon locality of reference
• Programs continually use, and re-use, the same locations
• Instructions
  • loops
  • common subroutines
• Data
  • look-up tables
  • “working” data sets
Cache - operation
• Memory requests checked in cache first
• If the word sought is in the cache, it’s read from cache (or updated in cache)
  • Cache hit
• If not, request is passed to main memory and data is read (written) there
  • Cache miss
[Figure: block diagram of CPU, MMU, Cache and Main Mem, with virtual address (VA) and physical address (PA) paths and data or instruction (D or I) return paths]
Cache - operation
• Hit rates of 96% are usual
  • Cache: 64 kbytes
• Effective Memory Access Time
  • Cache: 2 cycles
  • Main memory: 100 cycles
  • Average access: 0.96*2 + 0.04*100 = 5.92 cycles
• In general, if there are n levels of memory,
  $$t_{avg} = \sum_{j=1}^{n} f_j \, t_j^{acc}$$
  • where $f_j$ = fraction of accesses to memory level $j$
  • $t_j^{acc}$ = access time for memory level $j$
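The weighted sum above can be checked with a few lines of Python. This is only a sketch; the function name `average_access_time` is made up for illustration, and the figures are the hit rate and cycle counts quoted on the slide.

```python
def average_access_time(fractions, access_times):
    """Weighted mean access time: sum of f_j * t_j over all memory levels."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(f * t for f, t in zip(fractions, access_times))

# Slide figures: 96% of accesses hit the cache (2 cycles),
# the remaining 4% go to main memory (100 cycles).
avg = average_access_time([0.96, 0.04], [2, 100])
print(f"{avg:.2f} cycles")  # 5.92 cycles
```

Adding a third level (for example an L2 cache) only means appending one more fraction and one more access time to the two lists.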
Cache - organisation
• Direct-mapped cache
  • Each word in the cache has a tag
• Assume
  • cache size: $2^k$ words
  • machine words: p bits
  • byte-addressed memory
  • $m = \log_2(p/8)$ bits not used to address words
  • m = 2 for 32-bit machines
• Address format (p bits)
  tag (p-k-m bits) | cache address (k bits) | byte address (m bits)
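As a rough illustration of the address format, the three fields can be pulled out of an address with shifts and masks. The helper name `split_address` and the sample address are invented for this sketch, and k = 14 (a 16 kword cache) is an assumed value chosen only to make the example concrete.

```python
def split_address(addr, k, m):
    """Split an address into (tag, cache address, byte address) fields."""
    byte_address = addr & ((1 << m) - 1)          # lowest m bits
    cache_address = (addr >> m) & ((1 << k) - 1)  # next k bits select the line
    tag = addr >> (k + m)                         # remaining high bits
    return tag, cache_address, byte_address

# Assumed example: 32-bit machine (m = 2) with a 2**14-word cache (k = 14).
tag, cache_address, byte_address = split_address(0x12345678, k=14, m=2)
print(hex(tag), hex(cache_address), byte_address)  # 0x1234 0x159e 0
```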
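Cache - organisation
• Direct-mapped cache
[Figure: a direct-mapped cache of $2^k$ lines. Each cache line holds a tag (p-k-m bits) and data (p bits). The memory address from the CPU is split into tag (p-k-m bits), cache address (k bits) and byte address (m bits); the stored tag is compared with the address tag to produce "Hit?", and data is exchanged with memory.]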
Cache - Direct Mapped
• Conflicts
  • Two addresses separated by $2^{k+m}$ bytes will hit the same cache location
  • 32-bit machine, 64 kbyte (16 kword) cache
    • m = 2, k = 14
  • Any program or data set larger than 64 kbytes will generate conflicts
• On a conflict, the ‘old’ word is flushed
  • Unmodified word (program, constant data)
    • overwritten by the new data from memory
  • Modified data needs to be written back to memory before being overwritten
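A toy model makes the conflict behaviour visible. The class below is a minimal sketch, not a model of any real processor: `DirectMappedCache` and its `access` method are hypothetical names, only tags are stored (no data), and one word per line is assumed as on the slides.

```python
class DirectMappedCache:
    """Tag store of a direct-mapped cache with 2**k one-word lines."""

    def __init__(self, k, m):
        self.k, self.m = k, m
        self.tags = [None] * (1 << k)  # None = nothing loaded yet

    def access(self, addr):
        """Return True on a hit; on a miss, replace the line's old word."""
        index = (addr >> self.m) & ((1 << self.k) - 1)
        tag = addr >> (self.k + self.m)
        hit = self.tags[index] == tag
        self.tags[index] = tag
        return hit

a = 0x1000
b = a + (1 << (14 + 2))  # 2**(k+m) bytes away: same line, different tag
c = a + 4                # next word: a different line

cache = DirectMappedCache(k=14, m=2)
thrash = [cache.access(x) for x in (a, b, a, b)]
print(thrash)      # [False, False, False, False]

cache = DirectMappedCache(k=14, m=2)
neighbours = [cache.access(x) for x in (a, c, a, c)]
print(neighbours)  # [False, False, True, True]
```

The first trace never hits because `a` and `b` keep evicting each other; the second trace hits as soon as both words have been loaded.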
Cache - Conflicts
• Modified or dirty words
  • When a word is modified in cache
• Write-back cache
  • Only writes data back when needed
  • Misses
    • Two memory accesses
    • Write modified word back
    • Read new word
• Write-through cache
  • Low priority write to main memory is queued
  • Processor is delayed by read only
  • Memory write occurs in parallel with other work
  • Instruction and necessary data fetches take priority
Cache - Write-through or write-back?
• Write-through
  • Allows an intelligent bus interface unit to make efficient use of a serious bottleneck: the processor-memory interface (main memory bus)
  • Reads (instruction and data) need priority!
    • They stall the processor
  • Writes can be delayed
    • At least until the location is needed!
• More on intelligent system interface units later
• but ...
Cache - Write-through or write-back?
• Write-through
  • Seems a good idea!
• but ...
  • Multiple writes to the same location waste memory bus bandwidth
  • Typical programs run better with write-back caches
• however
  • Often you can easily predict which will be best
  • Some processors (e.g. PowerPC) allow you to classify memory regions as write-back or write-through
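The bandwidth argument can be sketched by counting main-memory writes for a stream of stores. This is a deliberately crude model under stated assumptions: the cache has a single line, only stores are traced, and the function name `memory_writes` is made up for this example.

```python
def memory_writes(store_addresses, policy):
    """Count main-memory writes for a one-line cache under the given policy."""
    if policy not in ("write-through", "write-back"):
        raise ValueError("unknown policy")
    writes = 0
    cached = None   # address currently held in the single cache line
    dirty = False
    for addr in store_addresses:
        if policy == "write-through":
            writes += 1  # every store is queued to main memory
        else:
            if dirty and cached != addr:
                writes += 1  # dirty word is written back before replacement
            dirty = True
        cached = addr
    if policy == "write-back" and dirty:
        writes += 1  # the last dirty word is eventually written back too
    return writes

# Ten stores to one location (e.g. a loop counter), then one store elsewhere.
trace = [0x100] * 10 + [0x200]
through = memory_writes(trace, "write-through")
back = memory_writes(trace, "write-back")
print(through, back)  # 11 2
```

With repeated stores to the same location, write-back needs one memory write per replaced word, while write-through issues one per store.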
Cache - more bits
• Cache lines need some status bits
  • Tag bits + ..
• Valid
  • All set to false on power up
  • Set to true as words are loaded into cache
• Dirty
  • Needed by write-back cache
  • Write-through cache always queues the write, so lines are never ‘dirty’
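One possible way to picture the status bits is as fields of a small record per cache line. The names `CacheLine`, `write` and `replace` below are hypothetical, and the sketch assumes a write-back cache.

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    tag: int = 0
    valid: bool = False  # false on power up
    dirty: bool = False  # only needed by a write-back cache

def write(line):
    """A store that hits this line modifies it in cache only."""
    line.dirty = True

def replace(line, new_tag):
    """Load a new word; return True if the old word had to be written back."""
    needs_write_back = line.valid and line.dirty
    line.tag, line.valid, line.dirty = new_tag, True, False
    return needs_write_back

line = CacheLine()
first = replace(line, 5)   # line was invalid: nothing to write back
write(line)                # word is now dirty
second = replace(line, 9)  # dirty word must go back to memory first
third = replace(line, 3)   # clean word is simply overwritten
print(first, second, third)  # False True False
```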
Cache - Improving Performance
• Conflicts (addresses $2^{k+m}$ bytes apart)
  • Degrade cache performance
  • Lower hit rate
• Murphy’s Law operates
  • Addresses are never random!
  • Some locations ‘thrash’ in cache
    • Continually replaced and restored
Cache - Fully Associative
• All tags are compared at the same time
• Words can use any cache line
Cache - Fully Associative
• Associative
  • Each tag is compared at the same time
  • Any match is a hit
  • Avoids ‘unnecessary’ flushing
• Replacement
  • Least Recently Used - LRU
  • Needs extra status bits
    • Cycles since last accessed
• Hardware cost high
  • Extra comparators
  • Wider tags
    • p-m bits vs p-k-m bits
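A fully associative cache with LRU replacement can be sketched in software with an ordered dictionary. This is only an illustration: real hardware compares all tags in parallel, whereas this model uses a dictionary lookup, and the class name `FullyAssociativeCache` is invented here. The whole word address serves as the tag, matching the wider p-m bit tags noted above.

```python
from collections import OrderedDict

class FullyAssociativeCache:
    """Any word can occupy any line; the least recently used word is replaced."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # word address -> None, oldest first

    def access(self, word_addr):
        if word_addr in self.lines:
            self.lines.move_to_end(word_addr)  # mark as most recently used
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)     # evict least recently used
        self.lines[word_addr] = None
        return False

cache = FullyAssociativeCache(num_lines=2)
results = [cache.access(w) for w in (1, 2, 1, 3, 1)]
print(results)  # [False, False, True, False, True]
```

When word 3 arrives, word 2 is evicted rather than word 1, because word 1 was used more recently.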
Cache - Set Associative
• 2-way set associative
• Each line: two words
• Two comparators only
Cache - Set Associative
• n-way set associative caches
  • n can be small: 2, 4, 8
  • Best performance
  • Reasonable hardware cost
  • Most high performance processors
• Replacement policy
  • LRU choice from n
  • Reasonable LRU approximation
    • 1 or 2 bits
    • Set on access
    • Cleared / decremented by timer
    • Choose cleared word for replacement
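The 1-bit LRU approximation can be sketched as follows. Several details are assumptions made for this example rather than anything stated on the slide: the class name `SetAssociativeCache`, the `tick` method standing in for the timer, the preference for empty ways, and the fallback to way 0 when every bit in the set is still set.

```python
class SetAssociativeCache:
    """n-way set associative tag store with a 1-bit 'recently used' flag."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        # Each way is [tag, referenced]; tag None means the way is empty.
        self.sets = [[[None, False] for _ in range(ways)]
                     for _ in range(num_sets)]

    def tick(self):
        """Timer event: clear every referenced bit."""
        for ways in self.sets:
            for way in ways:
                way[1] = False

    def access(self, line_addr):
        index = line_addr % self.num_sets
        tag = line_addr // self.num_sets
        ways = self.sets[index]
        for way in ways:
            if way[0] == tag:
                way[1] = True  # bit is set on access
                return True
        # Miss: use an empty way, else a way whose bit is cleared, else way 0.
        victim = next((w for w in ways if w[0] is None), None)
        if victim is None:
            victim = next((w for w in ways if not w[1]), ways[0])
        victim[0], victim[1] = tag, True
        return False

cache = SetAssociativeCache(num_sets=4, ways=2)
results = [cache.access(0), cache.access(4)]  # both map to set 0
cache.tick()                                  # timer clears the bits
results += [cache.access(4), cache.access(8), cache.access(4)]
print(results)  # [False, False, True, False, True]
```

Line 8 replaces line 0 rather than line 4, because line 4 was touched after the last timer tick and line 0 was not.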
Cache - Locality of Reference
• Temporal Locality
  • Same location will be referenced again soon
  • Access same data again
  • Program loops - access same instruction again
  • Caches described so far exploit temporal locality
• Spatial Locality
  • Nearby locations will be referenced soon
  • Next element of an array
  • Next instruction of a program
Cache - Line Length
• Spatial Locality
  • Use very long cache lines
  • Fetch one datum
    • Neighbours fetched also
• PowerPC 601 (Motorola/Apple/IBM), first of the single chip Power processors
  • 64 sets
  • 8-way set associative
  • 32 bytes per line
  • 32 bytes (8 instructions) fetched into instruction buffer in one cycle
  • 64 x 8 x 32 = 16 kbytes total
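Two small calculations illustrate the slide: the capacity product, and the hit rate that long lines give a sequential scan. Both function names are invented for this sketch, and `sequential_hit_rate` assumes a cold cache, a line-aligned array and a single pass over it.

```python
def cache_capacity(sets, ways, line_bytes):
    """Total data capacity in bytes."""
    return sets * ways * line_bytes

def sequential_hit_rate(num_words, word_bytes, line_bytes):
    """Hit rate for one pass over an array when each miss fetches a whole line."""
    words_per_line = line_bytes // word_bytes
    misses = -(-num_words // words_per_line)  # ceiling division
    return (num_words - misses) / num_words

# Geometry quoted on the slide: 64 sets, 8 ways, 32 bytes per line.
capacity = cache_capacity(sets=64, ways=8, line_bytes=32)
print(capacity)  # 16384

# 1024 four-byte words with 32-byte lines: one miss per 8 words.
rate = sequential_hit_rate(num_words=1024, word_bytes=4, line_bytes=32)
print(rate)  # 0.875
```

With one-word lines the same scan would miss on every access, which is why longer lines help code that walks through arrays or straight-line instructions.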
Cache - Separate I- and D-caches
• Unified cache
  • Instructions and Data in same cache
• Two caches
  • Instructions
  • Data
  • Increases total bandwidth
• MIPS R10000
  • 32 Kbyte Instruction; 32 Kbyte Data
  • Instruction cache is pre-decoded! (32 → 36 bits)
  • Data
    • 8-word (64 byte) line, 2-way set associative
    • 256 sets
    • Replacement policy?