Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching

By: Josefin Hallberg, Tuva Palm and Mats Brorsson. Presented by: Lena Salman

Introduction • Pointer-based data structures are usually allocated at effectively random locations in memory • As a result they rarely achieve good locality and suffer higher miss rates • Software approach versus hardware approach

Software approach • Two techniques: • Cache-conscious allocation – by far the most efficient • Software prefetch – better suited to automation and easier to implement in compilers • Combining cache-conscious allocation with software prefetch does not add significantly to performance

Hardware approach • Calculating and prefetching pointers • Calculating pointer dependencies • Effectively predicting what to evict from the cache • General HW prefetch – more likely to pollute the cache • Problem! All the hardware strategies take advantage of the increased locality of cache-consciously allocated data.

Prefetching and cache-conscious allocation • Should complement each other’s weaknesses: • Cache-conscious allocation should reduce the prefetch overhead of fetching blocks with partially unwanted data • Prefetching should reduce the cache misses and miss latencies between the nodes

Cache-conscious allocation • Excellent improvement in execution time • Can be adapted to specific needs by choosing the cache-conscious block size (cc-block size) • Attempts to co-allocate data in the same cache line, so that nodes referenced one after another sit on the same cache line

Allocation to improve locality

Cache-conscious allocation • Attempts to allocate the data in the same cache line • Better locality can be achieved • Improved cache performance through a reduction in misses

ccmalloc() • Performs the cache-conscious allocation of memory • Takes an extra argument: a pointer to a data structure that is likely to be referenced close in time
    #ifdef CCMALLOC
        child = ccmalloc(sizeof(struct node), parent);
    #else
        child = malloc(sizeof(struct node));
    #endif

ccmalloc() • Takes a pointer to data that is likely to be referenced close (in time) to the newly allocated structure • Calls the standard malloc(): • When allocating a new cc-block • When the data is larger than a cc-block • Otherwise: allocates in an empty slot of the existing cc-block
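The allocation policy on this slide can be sketched as a toy allocator. This is only an illustration under stated assumptions, not the authors' ccmalloc(): the names `cc_alloc`, `cc_new_block`, `struct cc_header` and `CC_BLOCK_SIZE` are hypothetical, cc-blocks are assumed to be 256 B and 256 B aligned, the slot search is a simple bump offset kept in a header, and the hint pointer is assumed to point into a cc-block previously returned by `cc_alloc`. Oversized requests fall back to malloc(); fresh blocks come from C11 `aligned_alloc()` standing in for the standard allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CC_BLOCK_SIZE 256u /* assumed cc-block size (256 B, as in the study) */

/* Hypothetical header at the start of every cc-block, padded to 16 bytes. */
struct cc_header {
    _Alignas(16) size_t used; /* bytes consumed so far, header included */
};

/* Start a fresh, aligned cc-block and place the first object in it. */
static void *cc_new_block(size_t size)
{
    struct cc_header *h = aligned_alloc(CC_BLOCK_SIZE, CC_BLOCK_SIZE);
    if (h == NULL)
        return NULL;
    h->used = sizeof *h + size;
    return (char *)h + sizeof *h;
}

/*
 * Toy cache-conscious allocation: try to place the new object in the
 * cc-block that already holds `neighbor`.
 * Assumption: `neighbor` is NULL or lies inside a cc-block created here.
 */
void *cc_alloc(size_t size, const void *neighbor)
{
    if (size > CC_BLOCK_SIZE - sizeof(struct cc_header))
        return malloc(size);            /* data larger than a cc-block */
    if (size == 0)
        size = 1;
    size = (size + 15u) & ~(size_t)15u; /* keep slots 16-byte aligned */

    if (neighbor == NULL)
        return cc_new_block(size);      /* nothing to co-locate with */

    /* Recover the block start by masking off the low address bits. */
    struct cc_header *h = (struct cc_header *)
        ((uintptr_t)neighbor & ~(uintptr_t)(CC_BLOCK_SIZE - 1));

    if (h->used + size <= CC_BLOCK_SIZE) {
        void *slot = (char *)h + h->used; /* empty slot in the same block */
        h->used += size;
        return slot;
    }
    return cc_new_block(size);          /* block full: start a new one */
}
```

With these assumptions, a child allocated with its parent as the hint lands in the parent's 256 B block until that block fills up, after which a new block is started. The sketch never frees memory and does no bookkeeping for the malloc() fallback.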

cc-blocks • = Cache-conscious blocks • Require blocks large enough to contain more than one pointer structure • The bigger the blocks, the lower the miss rate, provided allocation is smart • The size can be set dynamically in software, independently of the HW cache line size • In the study: cc-block size 256 B; hardware cache line size 16 B to 256 B

Prefetch • Prefetching reduces the cost of a cache miss • Can be controlled by software and/or hardware • Software control costs extra instructions • Hardware control adds hardware complexity

Software controlled prefetch • Implemented by including a prefetch instruction in the instruction set • Prefetches should be inserted well ahead of the reference, according to a prefetch algorithm • This study uses the greedy algorithm by Mowry et al.

Software prefetch – Greedy algorithm • When a node is referenced, all children of that node are prefetched • Without extra calculation this can only be done for children, not for grandchildren • Easier to control and optimize • The risk of polluting the cache decreases (since only needed lines are prefetched)
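As a rough illustration of the greedy scheme, the sketch below sums a binary tree and issues a prefetch hint for both children each time a node is visited. It assumes a GCC or Clang compiler, whose `__builtin_prefetch` emits a non-binding prefetch; on other compilers the hint compiles to nothing. The names `tree_node`, `prefetch_hint` and `tree_sum_greedy` are hypothetical and are not taken from the paper or its benchmarks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical binary-tree node, in the spirit of the treeadd benchmark. */
struct tree_node {
    long value;
    struct tree_node *left;
    struct tree_node *right;
};

/* Issue a non-binding prefetch hint; a no-op on unsupported compilers. */
static void prefetch_hint(const void *addr)
{
#if defined(__GNUC__) || defined(__clang__)
    if (addr != NULL)
        __builtin_prefetch(addr);
#else
    (void)addr;
#endif
}

/* Recursive sum that greedily prefetches both children of each node. */
long tree_sum_greedy(const struct tree_node *node)
{
    if (node == NULL)
        return 0;
    /* Only the children's addresses are known here; reaching the
       grandchildren would require loading the children first. */
    prefetch_hint(node->left);
    prefetch_hint(node->right);
    return node->value
         + tree_sum_greedy(node->left)
         + tree_sum_greedy(node->right);
}
```

The left child is consumed almost immediately after its prefetch, so most of the benefit in a traversal like this is likely to come from the right child, whose prefetch overlaps with the work done in the left subtree.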

Software greedy prefetch

Hardware Controlled Prefetch • Depending on the algorithm used, prefetching can occur when a miss occurs • Or when the programmer gives a hint through an instruction • Or always, on certain types of data

Hardware prefetch • Techniques used: • Prefetch-on-miss • Tagged prefetch • Both attempt to exploit spatial locality • Neither analyzes data access patterns

Prefetch-on-Miss • Prefetches the next sequential line, i+1, when a miss is detected on line i. (Figure: line i-1; line i: miss; line i+1: will be prefetched)
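A toy model can make the behaviour concrete. The sketch below is not the simulator used in the study: it assumes an unbounded cache over a small address space of `POM_LINES` lines, ignores timing and eviction, and uses hypothetical names (`pom_cache`, `pom_init`, `pom_access`). It only counts demand misses and issued prefetches.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define POM_LINES 64 /* hypothetical number of lines in the toy address space */

struct pom_cache {
    bool present[POM_LINES]; /* true if the line is currently cached */
    unsigned misses;         /* demand misses seen so far */
    unsigned prefetches;     /* prefetches issued so far */
};

void pom_init(struct pom_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Reference line i; on a miss, fetch line i and prefetch line i+1. */
void pom_access(struct pom_cache *c, unsigned i)
{
    assert(i < POM_LINES);
    if (c->present[i])
        return;               /* hit: prefetch-on-miss does nothing */
    c->misses++;
    c->present[i] = true;     /* demand fetch of line i */
    if (i + 1 < POM_LINES && !c->present[i + 1]) {
        c->present[i + 1] = true; /* next-line prefetch */
        c->prefetches++;
    }
}
```

In this model a purely sequential scan misses on every second line, because a hit on a prefetched line does not trigger a further prefetch.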

Tagged Prefetch • Each prefetched line is marked with a tag • When a prefetched line i is referenced for the first time, line i+1 is prefetched (even though no miss has occurred) • Has been shown to be efficient when memory accesses are fairly sequential
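The tagging rule can be illustrated with a toy model. The sketch assumes the classic formulation in which a demand miss also prefetches the next line, which the slide does not state explicitly; it also assumes an unbounded cache over `TP_LINES` lines with no timing or eviction. All names (`tp_cache`, `tp_init`, `tp_prefetch`, `tp_access`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TP_LINES 64 /* hypothetical number of lines in the toy address space */

struct tp_cache {
    bool present[TP_LINES]; /* true if the line is currently cached */
    bool tagged[TP_LINES];  /* prefetched but not yet referenced */
    unsigned misses;
    unsigned prefetches;
};

void tp_init(struct tp_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Bring line i in as a prefetch and tag it, if it is not cached yet. */
static void tp_prefetch(struct tp_cache *c, unsigned i)
{
    if (i < TP_LINES && !c->present[i]) {
        c->present[i] = true;
        c->tagged[i] = true;
        c->prefetches++;
    }
}

/* Reference line i under tagged prefetch. */
void tp_access(struct tp_cache *c, unsigned i)
{
    assert(i < TP_LINES);
    if (!c->present[i]) {
        c->misses++;
        c->present[i] = true;   /* demand fetch */
        tp_prefetch(c, i + 1);  /* assumed: a miss also prefetches i+1 */
    } else if (c->tagged[i]) {
        c->tagged[i] = false;   /* first use of a prefetched line */
        tp_prefetch(c, i + 1);  /* keep running ahead without a miss */
    }
}
```

Under these assumptions a sequential scan takes a single miss on the first line, since every first touch of a prefetched line pulls in its successor.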

Prefetch-on-miss for ccmalloc() • HW prefetch can be combined with ccmalloc() by introducing a hint that carries the address of the beginning of a cc-block.

Prefetch-one-cc on miss • Prefetches the next line after detecting a cache miss on a cache-consciously allocated block.

Prefetch-all-cc on miss • Decides dynamically how many lines to prefetch • The number depends on where in the cc-block the missing cache line is located • Prefetches all the cache lines in the cc-block, starting from the address that caused the miss
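One way to read the dynamic decision is as simple address arithmetic: count the cache lines between the missing line and the end of its cc-block. The sketch below assumes cc-blocks are aligned to their own size and that the cc-block size is a multiple of the line size; the function name `cc_lines_to_prefetch` and its parameters are hypothetical, and the paper's hardware may compute this differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of cache lines that follow the missing line inside its cc-block.
 * Assumes cc-blocks are aligned to cc_block_size and that cc_block_size
 * is a non-zero multiple of line_size.
 */
unsigned cc_lines_to_prefetch(uintptr_t miss_addr,
                              unsigned cc_block_size,
                              unsigned line_size)
{
    assert(line_size > 0 && cc_block_size >= line_size);
    assert(cc_block_size % line_size == 0);

    unsigned offset = (unsigned)(miss_addr % cc_block_size); /* byte offset in block */
    unsigned miss_line = offset / line_size;                 /* index of missing line */
    unsigned lines_per_block = cc_block_size / line_size;

    return lines_per_block - 1 - miss_line;
}
```

For a 256 B cc-block with 64 B lines, a miss on the first line of the block would prefetch three more lines, while a miss on the last line would prefetch none. When the line size equals the cc-block size, nothing is left to prefetch.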

Experimental Framework • MIPS-like, out-of-order processor simulator • Memory latency equal to 50 ns random access time • Benchmarks: • health – simulates the Colombian health care system • mst – creates a graph and calculates its minimum spanning tree • perimeter – calculates the perimeter of an image • treeadd – calculates a recursive sum of values

More about benchmarks: • health – elements are moved between lists during execution, and there is more calculation between data accesses • mst – originally used a locality optimization procedure, which made the effect of ccmalloc() unnoticeable • perimeter – data is allocated in an order similar to the access order, which already acts as a locality optimization • treeadd – has calculation between nodes in a balanced binary tree

Results: Execution time

Stalls: • Memory stall – a cycle is lost because the oldest instruction waiting to be retired is a load/store instruction • FU stall – the oldest instruction waiting to be retired is not a load/store instruction • Fetch stall – there is no instruction waiting to be retired • Prefetching is most likely to have an effect when memory stalls are dominant
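The three definitions amount to a small decision rule applied to each cycle in which nothing retires. The sketch below just encodes that rule; the enum and function names are hypothetical, and the two boolean inputs are assumed to be supplied by whatever tracks the retirement queue in a simulator.

```c
#include <assert.h>
#include <stdbool.h>

enum stall_kind { STALL_FETCH, STALL_MEMORY, STALL_FU };

/* Classify one stalled cycle according to the definitions on the slide. */
enum stall_kind classify_stall(bool retire_queue_empty,
                               bool oldest_is_load_store)
{
    if (retire_queue_empty)
        return STALL_FETCH;   /* no instruction is waiting to be retired */
    if (oldest_is_load_store)
        return STALL_MEMORY;  /* oldest instruction is a load/store */
    return STALL_FU;          /* oldest instruction is something else */
}
```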

Graphs:

Cache performance – SW • Miss rates are improved by most strategies • Increased spatial locality with ccmalloc() reduces cache misses (less pollution) • Software prefetch shows some decrease in misses, but prefetches a lot of unused data • The combination of software techniques achieves the lowest miss rates

Cache performance – cache lines • The larger the cache lines, the more effective ccmalloc() is • HW prefetch alone, however, tends to pollute the cache with unwanted data • SW prefetch alone tends to fetch data that is already in the cache

Cache performance: • SW prefetch achieves higher precision • HW prefetch strategies on their own perform poorly • HW prefetch is more sensitive to cache line size than SW prefetch

Cache performance – SW prefetch with ccmalloc() • Results in a larger share of used cache lines among the prefetched lines • This is caused by the increased spatial locality • However, it also results in attempts to prefetch lines that are already in the cache

Cache performance – HW prefetch with ccmalloc() • HW prefetch strategies show greater improvement with cache-conscious allocation than on their own • Prefetch-on-miss and tagged prefetch both show the same results • Still: a large amount of unused prefetched lines • Unused lines decrease with larger cache lines, due to spatial locality and less need to prefetch

Conclusions: • The best approach remains cache-conscious allocation – ccmalloc() • It is efficient at overcoming the drawbacks of large cache lines • It creates the locality necessary for prefetching • The larger the cache line, the less the choice of prefetch strategy matters

Conclusions 2: • Adding HW prefetch to cache-conscious allocation brings little gain, and it seems that ccmalloc() alone is enough • However, ccmalloc() can be used to overcome the negative effect of next-line prefetch • HW prefetch is better than SW prefetch

Conclusions 3: • When a compiler can use profiling information to optimize memory allocation in a cache-conscious manner, that is preferable • However, when profiling is too expensive, applications will likely benefit from general prefetch support

The end! You can tell me, I can take it. What’s up, doc?

Lena Salman, 28.06.2004
