Improving Cache Performance • Four categories of optimisation: • Reduce miss rate • Reduce miss penalty • Reduce miss rate or miss penalty using parallelism • Reduce hit time AMAT = Hit time + Miss rate × Miss penalty
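As a quick illustration of the AMAT formula, the following C sketch plugs in made-up figures (1-cycle hit time, 2% miss rate, 100-cycle miss penalty); the numbers are assumptions for the example, not measurements from any machine.

#include <stdio.h>

/* Average Memory Access Time: AMAT = hit_time + miss_rate * miss_penalty.
 * The figures used below are illustrative only. */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* e.g. 1-cycle hit, 2% miss rate, 100-cycle miss penalty -> 3.0 cycles */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.02, 100.0));
    return 0;
}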
5.5. Reducing Miss Rate • Three sources of misses: • Compulsory • “cold start misses” • Capacity • Cache is full • Conflict • Set is full/block is occupied • Remedies: increase block size, increase size of cache, increase degree of associativity
Larger Block Size • Bigger blocks reduce compulsory misses • Spatial locality • BUT: • Increased miss penalty • More data to transfer • Possibly increased overall miss rate • More conflict and capacity misses as there are fewer blocks
Effect of Block Size • [Figure: access and transfer components of miss penalty, miss rate, and AMAT each plotted against block size]
Larger Caches • Reduces capacity misses • Increases hit time and cost
Higher Associativity • Miss rates improve with higher associativity • Two rules of thumb: • 8-way set associative caches are almost as effective as fully associative • But much simpler! • 2:1 cache rule • A direct mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2
Way Prediction • Set-associative cache predicts which block will be needed on next access to the set • Only one tag check is done • If mispredicted the whole set must be checked • E.g. Alpha 21264 instruction cache • Prediction rate > 85% • Correct prediction: 1 cycle hit • Misprediction: 3 cycles
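A rough C sketch of the idea, using invented structure and function names (cache_set, lookup) and borrowing the 1-cycle/3-cycle latencies quoted for the Alpha 21264 above; it models a 2-way set-associative cache that checks the predicted way first and falls back to checking the rest of the set.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256
#define NUM_WAYS 2

/* Hypothetical cache model: one tag per way plus a predicted way per set. */
struct cache_set {
    uint32_t tag[NUM_WAYS];
    bool     valid[NUM_WAYS];
    unsigned predicted_way;          /* checked first on the next access */
};

static struct cache_set sets[NUM_SETS];

/* Returns the hit latency in cycles: 1 on a correct prediction,
 * 3 when the prediction was wrong, 0 meaning "miss". */
static int lookup(uint32_t set_index, uint32_t tag)
{
    struct cache_set *s = &sets[set_index];
    unsigned p = s->predicted_way;

    if (s->valid[p] && s->tag[p] == tag)
        return 1;                    /* fast hit: only one tag compared */

    for (unsigned w = 0; w < NUM_WAYS; w++) {   /* slow path: check the rest */
        if (w != p && s->valid[w] && s->tag[w] == tag) {
            s->predicted_way = w;    /* retrain the predictor */
            return 3;
        }
    }
    return 0;                        /* miss: handled elsewhere */
}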
Pseudo-Associative Caches • Check a direct mapped cache for a hit as usual • If it misses, check a second block • Invert MSB of index • One fast and one slow hit time
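A minimal sketch of how the second block can be located, assuming a hypothetical cache with 8 index bits; the slow second probe simply inverts the most-significant bit of the index.

#include <stdint.h>

#define INDEX_BITS 8                 /* hypothetical: 256-set direct-mapped cache */

/* On a miss in the primary block, a pseudo-associative cache probes one
 * alternative block, found by inverting the MSB of the index. */
static uint32_t alternate_index(uint32_t index)
{
    return index ^ (1u << (INDEX_BITS - 1));
}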
Compiler Optimisations • Compilers can optimise code to minimise miss rates: • Reordering procedures • Aligning basic blocks with cache blocks • Reorganising array element accesses
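One common example of reorganising array accesses is loop interchange. The sketch below assumes C's row-major array layout: the column-order loop touches a different cache block on nearly every access, while the interchanged version walks each row sequentially and exploits spatial locality.

#define N 1024

/* Poor locality: the inner loop strides through x column by column,
 * touching a new cache block on almost every iteration. */
void sum_by_column(double x[N][N], double *sum)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            *sum += x[i][j];
}

/* Interchanged loops: the inner loop now walks each row sequentially,
 * so consecutive accesses fall in the same cache block. */
void sum_by_row(double x[N][N], double *sum)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            *sum += x[i][j];
}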
5.6. Reduce Miss Rate or Miss Penalty via Parallelism • Three techniques that overlap instruction execution with memory access
Nonblocking caches • Dynamic scheduling allows CPU to continue with other instructions while waiting for data • Nonblocking cache allows other cache accesses to continue while waiting for data
Hardware Prefetching • Fetch data/instructions before they are requested by the processor • Either into cache or another buffer • Particularly useful for instructions • High degree of spatial locality • UltraSPARC III • Special prefetch cache for data • Increases effectiveness by about four times
Compiler Prefetching • Compiler inserts “prefetch” instructions • Two types: • Prefetch register value • Prefetch data cache block • Can be faulting or non-faulting • Cache continues as normal while data is prefetched
SPARC V9 • Prefetch instruction: prefetch [%rs1 + %rs2], fcn or prefetch [%rs1 + imm13], fcn • fcn = prefetch function: 0 = prefetch for several reads, 1 = prefetch for one read, 2 = prefetch for several writes, 3 = prefetch for one write, 4 = prefetch page
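A hedged C illustration of compiler-inserted prefetching: it uses the GCC/Clang __builtin_prefetch intrinsic rather than writing the SPARC V9 instruction directly (the compiler can lower the builtin to the target's non-faulting prefetch), and the prefetch distance of 16 elements is an arbitrary choice for the example, not a tuned value.

/* Software prefetching sketch: fetch b[i+16] while working on b[i]. */
void scale(double *a, const double *b, int n, double k)
{
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&b[i + 16], 0 /* read */, 3 /* high locality */);
        a[i] = k * b[i];
    }
}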
5.7. Reducing Hit Time • Critical • Often affects CPU clock cycle time
Small, simple caches • Small usually equals fast in hardware • A small cache may reside on the processor chip • Decreases communication • Compromise: tags on chip, data separate • Direct mapped • Data can be read in parallel with tag checking
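For a direct-mapped cache the tag, index, and block-offset fields fall out of the address with simple shifts and masks, which is part of why lookups are fast. The sketch assumes a hypothetical 16 kB cache with 32-byte blocks (5 offset bits, 512 sets, 9 index bits).

#include <stdint.h>

#define OFFSET_BITS 5    /* 32-byte blocks           */
#define INDEX_BITS  9    /* 512 sets (16 kB / 32 B)  */

static inline uint32_t block_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
static inline uint32_t cache_index(uint32_t addr)  { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static inline uint32_t cache_tag(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }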
Avoiding address translation • Physical caches • Use physical addresses • Address translation must happen before cache lookup • Virtual caches • Use virtual addresses • Protection issues • High context switching overhead
Virtual caches • Minimising context switch overhead: • Add process-identifier tag to cache • Multiple virtual addresses may refer to a single physical address (aliases) • Hardware enforces anti-aliasing • Software can require the least-significant address bits of aliases to be the same (page colouring)
Avoiding address translation (cont.) • [Figure: virtually indexed, physically tagged lookup — the page offset supplies the cache index and block offset while the page number is translated] • Choice of page size: • Bigger than cache index + offset • Address translation and tag lookup can happen in parallel (see the sketch below)
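A small sanity check of that constraint, with assumed sizes (8 kB pages, 8 kB direct-mapped cache, 32-byte blocks): as long as the cache index and block offset fit within the page offset, the cache can be indexed with untranslated bits while the TLB translates the page number, and only the tag compare needs the physical address.

#include <assert.h>

#define PAGE_SIZE   8192u   /* 8 kB page -> 13 page-offset bits        */
#define CACHE_SIZE  8192u   /* 8 kB direct-mapped cache                */
#define BLOCK_SIZE  32u     /* 256 sets: 8 index bits + 5 offset bits  */

int main(void)
{
    /* For a direct-mapped cache, index + offset bits cover log2(CACHE_SIZE),
     * so the parallel-lookup constraint reduces to CACHE_SIZE <= PAGE_SIZE. */
    assert(CACHE_SIZE <= PAGE_SIZE);
    return 0;
}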
Pipelining cache access • Split cache access into several stages • Increases branch misprediction and load-use delay penalties
Trace caches • Blocks follow program flow rather than spatial locality! • Branch prediction is taken into account by cache • Intel NetBurst microarchitecture • Complicates address mapping • Minimises wasted space within blocks
Cache Optimisation Summary • Cache optimisation is very complex • Improving one factor may have a negative impact on another
5.8. Main Memory • Latency and bandwidth are both important • Latency is composed of two factors: • Access time • Cycle time • Two main technologies: • DRAM • SRAM
5.10. Virtual Memory • Physical memory is divided into blocks • Allocated to processes • Provides protection • Allows swapping to disk • Simplifies loading • Historically: • Overlays • Programmer controlled swapping
Terminology • Block: • Page • Segment • Miss: • Page fault • Address fault • Memory mapping (address translation) • Virtual address → physical address
Characteristics • Block size • 4kB – 64kB • Hit time • 50 – 150 cycles • Miss penalty • 1 000 000 – 10 000 000 cycles • Miss Rate • 0.000 01 – 0.001%
Categorising VM Systems • Fixed block size • Pages • Variable block size • Segments • Difficult replacement • Hybrid approaches • Paged segments • Multiple page sizes (2^n × smallest)
Q1: Block placement? • Anywhere in memory • “Fully associative” • Minimises miss rate
Q2: Block identification? • Page/segment number gives the physical page address • Paging: offset concatenated • Segments: offset added • Uses a page table • One entry per page in the virtual address space • Save space: inverted page table • One entry per page of physical memory
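A minimal sketch of paged translation, assuming hypothetical 8 kB pages and a small virtual address space: the virtual page number indexes the page table, and the unchanged page offset is concatenated onto the physical page frame number.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_OFFSET_BITS 13          /* hypothetical 8 kB pages */
#define NUM_VIRT_PAGES   (1u << 19)  /* small, illustrative virtual space */

/* One page-table entry per virtual page. */
struct pte {
    bool     valid;                  /* page present in physical memory? */
    uint32_t phys_page;              /* physical page frame number       */
};

static struct pte page_table[NUM_VIRT_PAGES];

/* Translate by indexing the page table with the virtual page number and
 * concatenating the page offset. Returns false on a page fault
 * (OS handling not shown). */
static bool translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);

    if (vpn >= NUM_VIRT_PAGES || !page_table[vpn].valid)
        return false;                /* page fault */

    *paddr = (page_table[vpn].phys_page << PAGE_OFFSET_BITS) | offset;
    return true;
}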
Q3: Block replacement? • Least-recently used (LRU) • Minimises miss rate • Hardware provides a use bit or reference bit
Q4: Write strategy? • Write back • With a dirty bit • (You won’t become famous by being the first to try write through!)
Fast Address Translation • Page tables are big • Stored in memory themselves • Two memory accesses for every datum! • Principle of locality • Cache recent translations • Translation look-aside buffer (TLB), or translation buffer (TB)
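A sketch of a fully associative TLB lookup in C, with an invented entry count and field names; on a hit the cached translation is returned immediately, and only on a miss would the page table in memory be walked.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64               /* e.g. a small fully associative TLB */

/* Hypothetical TLB entry: one cached virtual-to-physical translation. */
struct tlb_entry {
    bool     valid;
    uint32_t vpn;                    /* virtual page number  */
    uint32_t ppn;                    /* physical page number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Fully associative lookup: compare the VPN against every entry.
 * On a miss, the page table would be walked and the TLB refilled
 * (not shown). */
static bool tlb_lookup(uint32_t vpn, uint32_t *ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;       /* hit: no page-table access needed */
            return true;
        }
    }
    return false;                    /* miss: fall back to the page table */
}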
Selecting a Page Size • Big • Smaller page table • Allows parallel cache access • Efficient disk transfers • Reduces TLB misses • Small • Less memory wastage (internal fragmentation) • Quicker process startup
Putting it ALL Together! SPARC Revisited
Two SPARCs • SuperSPARC • 1992 • 32-bit superscalar design • UltraSPARC • Late 1990’s • 64-bit design • Graphics support (VIS)
UltraSPARC • Four-way superscalar execution • Two integer ALUs • FP unit • Five functional units • Graphics unit
Pipeline • 9 stages: • Fetch • Decode • Grouping • Execution • Cache access • Load miss • Integer pipe wait (for FP/graphics pipelines) • Trap resolution • Writeback
Branch Handling • Dynamic branch prediction • Two bit scheme • Every second instruction in cache has prediction bits (predicts up to 2048 branches) • 88% success rate (integer) • Target prediction • Fetches from predicted path
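The two-bit scheme can be modelled as a saturating counter, as in the sketch below; the initial state chosen here is an arbitrary assumption. Two consecutive mispredictions are needed before the predicted direction flips.

#include <stdbool.h>

/* Two-bit saturating counter, one per prediction entry:
 * 0,1 -> predict not taken; 2,3 -> predict taken. */
static unsigned char counter = 2;    /* start weakly taken (illustrative) */

static bool predict(void)
{
    return counter >= 2;
}

static void update(bool taken)
{
    if (taken  && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}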
FPU • Five functional units: • Add • Multiply • Divide/square root • Two graphics units (add and multiply) • Mostly fully pipelined (latency 3 cycles) • Except divide and square root (not pipelined, latency is 22 cycles for 64-bit)
Memory Hierarchy • On-chip instruction and data caches • Data: • 16kB direct-mapped, write-through • Instructions: • 16kB 2-way set associative • Both virtually addressed • External cache • Up to 4MB
Virtual Memory • 64-bit virtual addresses translated to 44-bit physical addresses • TLB • 64-entry, fully-associative cache
Multimedia Support (VIS) • Integrated with FPU • Partitioned operations • Multiple smaller values in 64-bits • Video compression instructions • E.g. motion estimation instruction replaces 48 simple instructions for MPEG compression
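A plain-C model of what a partitioned add does, assuming four 16-bit lanes packed into a 64-bit word; VIS performs the equivalent in a single instruction on the FP register file, whereas this scalar loop only illustrates the semantics.

#include <stdint.h>

/* Partitioned add: treat a 64-bit word as four independent 16-bit lanes,
 * each lane wrapping around independently on overflow. */
static uint64_t padd16(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        result |= (uint64_t)(uint16_t)(x + y) << (16 * lane);
    }
    return result;
}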