
Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching


Presentation Transcript


  1. Cache Conscious Allocation of Pointer Based Data Structures, Revisited with HW/SW Prefetching. By: Josefin Hallberg, Tuva Palm and Mats Brorsson. Presented by: Lena Salman

  2. Introduction • Pointer-based data structures are usually allocated at arbitrary locations in memory • They will therefore usually not achieve good locality, leading to higher miss rates • This work weighs a software approach against a hardware approach

  3. Software approach • Two techniques: • Cache-conscious allocation: by far the most efficient • Software prefetch: better suited for automation, and better for implementation in compilers • Combining cache-conscious allocation with software prefetch does not add significantly to performance

  4. Hardware approach • Calculating and prefetching pointers • Calculating pointer dependencies • Effects of effectively predicting what to evict from the cache • General HW prefetch: more likely to pollute the cache • Problem: all the hardware strategies take advantage of the increased locality of cache-consciously allocated data

  5. Prefetching and cache-conscious allocation • Should complement each other's weaknesses: • Cache-conscious allocation reduces the prefetch overhead of fetching blocks with partially unwanted data • Prefetching reduces the cache misses and miss latencies between the nodes

  6. Cache-conscious allocation • Excellent improvement in execution-time performance • Can be adapted to specific needs by choosing the cache-conscious block size (cc-block size) • Attempts to co-allocate data in the same cache line, so that nodes referenced after each other sit on the same cache line

  7. Allocation to improve locality

  8. Cache-conscious allocation • Attempts to allocate related data in the same cache line • Better locality can be achieved • Improved cache performance through a reduction in misses

  9. ccmalloc() • Performs the cache-conscious allocation of memory • Takes an extra argument: a pointer to a data structure that is likely to be referenced together with the new node

      #ifdef CCMALLOC
      child = ccmalloc(sizeof(struct node), parent);
      #else
      child = malloc(sizeof(struct node));
      #endif

  10. ccmalloc() • Takes a pointer to data that is likely to be referenced close (in time) to the newly allocated structure • Invokes calls to the standard malloc(): • When allocating a new cc-block • When the data is larger than a cc-block • Otherwise: allocates in an empty slot of the cc-block, as sketched below
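  A minimal sketch of the allocation logic just described, assuming a bump-pointer cc-block with a toy single-block lookup. This is not the authors' implementation; cc_block_of() and the bookkeeping fields are invented for illustration:

      #include <stdlib.h>

      #define CC_BLOCK_SIZE 256           /* cc-block size used in the study */

      /* One cc-block with simple bump-pointer bookkeeping (illustrative). */
      struct cc_block {
          char   base[CC_BLOCK_SIZE];
          size_t used;
      };

      /* Toy lookup: pretend every 'near' pointer lives in this one block.
       * A real allocator would map addresses to their cc-blocks. */
      static struct cc_block the_block;

      static struct cc_block *cc_block_of(void *near)
      {
          char *p = (char *)near;
          if (p >= the_block.base && p < the_block.base + CC_BLOCK_SIZE)
              return &the_block;
          return NULL;
      }

      void *ccmalloc(size_t size, void *near)
      {
          /* Case 1: data larger than a cc-block -> plain malloc(). */
          if (size > CC_BLOCK_SIZE)
              return malloc(size);

          struct cc_block *blk = near ? cc_block_of(near) : NULL;

          /* Case 2: no block, or block full -> a new cc-block is started
           * via malloc() (simplified here to a plain fallback). */
          if (blk == NULL || blk->used + size > CC_BLOCK_SIZE)
              return malloc(size);

          /* Case 3: co-allocate in the empty slot of the existing cc-block,
           * placing the new node next to the data it will be used with. */
          void *p = blk->base + blk->used;
          blk->used += size;
          return p;
      }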

  11. cc-blocks • = cache-conscious blocks • Demand cache lines large enough to contain more than one pointer structure • The bigger the blocks, the lower the miss rate, if allocation is smart • Can be set dynamically in software, independently of the HW cache line size • In this study: cc-block size 256 B, hardware cache line size 16 B - 256 B
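  As a worked example with the study's numbers: a 256 B cc-block and the smallest 16 B hardware cache line give 256 / 16 = 16 cache lines per cc-block; assuming, purely for illustration, 32 B list or tree nodes, a single cc-block can co-allocate up to 8 related nodes.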

  12. Prefetch • Prefetching reduces the cost of a cache miss • Can be controlled by software and/or hardware • Software prefetch costs extra instructions • Hardware prefetch adds complexity to the hardware

  13. Software-controlled prefetch • Implemented by including a prefetch instruction in the instruction set • Should be inserted well ahead of the reference, according to the prefetch algorithm • In this study: the greedy algorithm by Mowry et al.

  14. Software prefetch: greedy algorithm • When a node is referenced, prefetch all children of that node • Without extra calculation this can only be done for children, not grandchildren • Easier to control and optimize • The risk of polluting the cache decreases, since only needed lines are prefetched (see the sketch below)
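  A minimal sketch of greedy prefetch on a binary tree, in the style of treeadd; the node layout is an assumption, and GCC/Clang's __builtin_prefetch hint stands in for the prefetch instruction:

      /* Recursive tree sum with greedy child prefetching. */
      struct node {
          int          value;
          struct node *left, *right;
      };

      int tree_sum(struct node *n)
      {
          if (n == NULL)
              return 0;

          /* Greedy step: when this node is referenced, prefetch its
           * children (but not grandchildren), hiding part of their
           * miss latency behind the work done at this node. */
          __builtin_prefetch(n->left);
          __builtin_prefetch(n->right);

          return n->value + tree_sum(n->left) + tree_sum(n->right);
      }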

  15. Software greedy prefetch

  16. Hardware-controlled prefetch • Depending on the algorithm used, prefetching can occur when a miss occurs, • when a hint is given by the programmer through an instruction, • or always, on certain types of data

  17. Hardware prefetch • Techniques used: • Prefetch-on-miss • Tagged prefetch • Both attempt to utilize spatial locality • Neither analyzes data access patterns

  18. Prefetch-on-miss • Prefetches the next sequential line i+1 when detecting a miss on line i: Line i-1 | Line i: miss! | Line i+1: will be prefetched

  19. Tagged prefetch • Each prefetched line is marked with a tag • When a tagged prefetched line i is referenced, line i+1 is prefetched (although no miss has occurred) • Efficient when memory access is fairly sequential, and has been shown to work well (both policies are modeled in the sketch below)
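  A toy model of the two hardware policies from slides 18 and 19. This is a sketch, not the paper's simulator: the tiny fully-associative "cache" of line numbers and its round-robin replacement are assumptions made only to show the policy logic:

      #include <stdbool.h>

      #define SLOTS 8
      static unsigned long line_no[SLOTS];
      static bool occupied[SLOTS], tagged[SLOTS];
      static int victim = 0;

      static int find(unsigned long line)
      {
          for (int i = 0; i < SLOTS; i++)
              if (occupied[i] && line_no[i] == line)
                  return i;
          return -1;                        /* miss */
      }

      static void fetch(unsigned long line, bool tag)
      {
          line_no[victim]  = line;          /* round-robin replacement */
          occupied[victim] = true;
          tagged[victim]   = tag;
          victim = (victim + 1) % SLOTS;
      }

      /* Prefetch-on-miss: a miss on line i also brings in line i+1. */
      void access_prefetch_on_miss(unsigned long line)
      {
          if (find(line) < 0) {             /* miss on line i    */
              fetch(line, false);           /* demand fetch      */
              fetch(line + 1, false);       /* prefetch line i+1 */
          }
      }

      /* Tagged prefetch: prefetched lines carry a tag; the first hit on
       * a tagged line i prefetches line i+1, although no miss occurred. */
      void access_tagged_prefetch(unsigned long line)
      {
          int i = find(line);
          if (i < 0) {                      /* miss                 */
              fetch(line, false);
              fetch(line + 1, true);        /* prefetched => tagged */
          } else if (tagged[i]) {           /* hit on a tagged line */
              tagged[i] = false;
              fetch(line + 1, true);        /* continue the stream  */
          }
      }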

  20. Prefetch-on-miss for ccmalloc() • HW prefetch can be combined with ccmalloc() by introducing a hint with the address of the beginning of such a block

  21. Prefetch-one-cc on miss • Prefetch the next line after detecting a cache miss on a cache-consciously allocated block

  22. Prefetch-all-cc on miss • Decides dynamically how many lines to prefetch • Depends on where in the cc-block the missing cache line is located • Prefetches all the cache lines of the cc-block, starting from the address causing the miss (see the sketch below)
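  A sketch of the prefetch-all-cc decision; the assumption that cc-blocks are aligned to their own size is ours, made for illustration:

      #define CC_BLOCK_SIZE 256UL   /* cc-block size used in the study */

      /* Number of cache lines from the missing line to the end of its
       * cc-block: the first is the demand fetch, the rest are the lines
       * this policy prefetches. */
      unsigned long cc_lines_from_miss(unsigned long miss_addr,
                                       unsigned long line_size)
      {
          unsigned long block_end = (miss_addr & ~(CC_BLOCK_SIZE - 1))
                                    + CC_BLOCK_SIZE;
          unsigned long miss_line = miss_addr & ~(line_size - 1);
          return (block_end - miss_line) / line_size;
      }

  With 16 B lines, a miss in the first line of a cc-block would prefetch the remaining 15 lines, while a miss in the last line prefetches nothing beyond the demand fetch.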

  23. Experimental framework • MIPS-like, out-of-order processor simulator • Memory latency equal to 50 ns random access time • Benchmarks: • health: simulates the Colombian health-care system • mst: creates a graph and calculates a minimum spanning tree • perimeter: calculates the perimeter of an image • treeadd: calculates a recursive sum of values

  24. More about the benchmarks: • health: elements are moved between lists during execution, and there is more calculation between data accesses • mst: originally used a locality-optimization procedure, which made the effect of ccmalloc() unnoticeable • perimeter: data is allocated in an order similar to the access order, resulting in locality optimization • treeadd: has calculation between nodes in a balanced binary tree

  25. Results: Execution time

  26. Stalls: • Memory stall: an instruction waits a cycle because the oldest instruction waiting to be retired is a load/store instruction • FU stall: the oldest instruction is not a load/store instruction • Fetch stall: there is no instruction waiting to be retired • Prefetch is likely to help when memory stalls are dominant!

  27. Graphs:

  28. Cache performance: SW • Miss rates are improved by most strategies • The increased spatial locality with ccmalloc() reduces cache misses (less pollution) • Software prefetch shows some decrease in misses, but prefetches a lot of unused data • A combination of the software techniques achieves the lowest miss rates

  29. Cache performance: cache lines • The larger the cache lines, the more effective ccmalloc() is • HW prefetch alone, however, tends to pollute the cache with unwanted data • SW prefetch alone tends to bring in data already present in the cache

  30. Cache performance: • SW prefetch achieves higher precision • HW prefetch alone is no good • HW prefetch is more sensitive to cache line size than SW prefetch

  31. Cache performance: SW prefetch with ccmalloc() • Results in an increased share of used cache lines among the prefetched lines • This is caused by the increased spatial locality • However, it also results in trying to prefetch lines already in the cache

  32. Cache performance: HW prefetch with ccmalloc() • HW strategies show greater improvement with cache-conscious allocation than on their own • Prefetch-on-miss and tagged prefetch both show the same results • Still: a large amount of unused prefetched lines • Unused lines decrease with larger cache lines, due to spatial locality and less need to prefetch

  33. Conclusions: • The best single technique remains cache-conscious allocation, ccmalloc() • It is efficient at overcoming the drawbacks of large cache lines • It creates the locality necessary for prefetch • The larger the cache line, the less prominent the prefetch strategy

  34. Conclusions 2: • Cache-conscious allocation with HW prefetch is not a prominent improvement; ccmalloc() alone seems to be enough • However, ccmalloc() can be used to overcome the negative effect of next-line prefetch • HW prefetch is better than SW prefetch

  35. Conclusions 3: • When a compiler can use profiling information and optimize memory allocation in a cache-conscious manner, that is preferable! • However, when profiling is too expensive, applications will likely benefit from general prefetch support

  36. The end! You can tell me, I can take it... What's up, doc?

  37. Lena Salman, 28.06.2004
