Mini-Project Presentation: Prefetching
TDT4260 Computer Architecture
Stefano Nichele, Angelo Spalluto
Department of Computer and Information Science
April 15th, 2011
Agenda
• Moore’s law – Memory wall
• Related work
• Fixed Sequential Prefetching
• Sequential Aggressive Prefetching (M-Adaptive, DM-Adaptive)
• DCPT, DCPT-P
• WA-DCPT and SA-DCPT
• Results
• Conclusion
• References
Moore's Law vs. the Memory Wall
• Spatial locality
• Temporal locality
Prefetching
1. Predicting: which data will be needed by the next instructions?
2. Fetching: deliver it into the cache before it is referenced.
• Sequential
• RPT
• PC/DC
• DCPT
• Adaptive
Fixed Sequential Prefetching
• Sequential algorithm
  • The prefetcher issues N requests after a miss occurs
  • The window size is constant for the whole execution of the program
• Sequential benchmarks
  • wupwise
  • applu
  • galgel
• Non-sequential benchmarks
  • ammp
  • art110
  • art470
[Figure: speedup per benchmark with a fixed-size window]
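The fixed sequential scheme can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `sequential_prefetch` and the 64-byte block size are assumptions, since the slides do not state them.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed; not stated in the slides)


def sequential_prefetch(miss_addr, degree, block_size=BLOCK_SIZE):
    """Return the addresses of the `degree` blocks that follow a missing block.

    `degree` plays the role of the fixed window N: it never changes
    while the program runs.
    """
    # Align the miss address to the start of its cache block.
    base = miss_addr - (miss_addr % block_size)
    # Prefetch the next N contiguous blocks.
    return [base + i * block_size for i in range(1, degree + 1)]


# A miss at 0x1004 with N = 3 requests the three blocks after 0x1000.
print([hex(a) for a in sequential_prefetch(0x1004, 3)])
```

With sequential benchmarks most of these blocks are used soon after; with non-sequential ones they mostly waste bandwidth, which is what motivates an adaptive N.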
Sequential Aggressive Adaptive Prefetcher
The adaptive prefetcher dynamically adjusts the degree of prefetching (N).
Adaptive window parameters
• Window: the number N of contiguous blocks issued by the prefetcher
• Accuracy: the number of good prefetches within a window
• Threshold: the number of good prefetches required to increase the window (Accuracy >= Threshold)
• Lock window (L): the number of times the window size is held before it may grow
• Listening state: the period in which the prefetcher counts the number of good prefetches
Prefetcher algorithm
1. The prefetcher initialises Window, Threshold and Lock window.
2. Upon a request issued by the CPU, the prefetcher issues N prefetches.
3. It waits for N accesses (listening state).
4. At the N-th access it checks whether Accuracy >= Threshold.
5. If the condition holds, it keeps the same window for another L-1 rounds. Otherwise it decreases the window, issues N requests, and goes back to step 3.
6. If step 4 succeeds L times, the prefetcher increases the window and issues another N requests.
7. Go back to step 3.
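The window update at the end of each listening state can be sketched as a small state machine. This is one reading of the steps above, under stated assumptions: the class name `AdaptiveWindow`, the step size of one block, the starting values and the lower and upper bounds are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveWindow:
    window: int = 4       # N: contiguous blocks prefetched per round (assumed start)
    threshold: int = 2    # good prefetches needed for a round to count as a success
    lock: int = 2         # L: successful rounds required before the window grows
    min_window: int = 1   # assumed lower bound
    max_window: int = 16  # assumed upper bound
    successes: int = 0    # consecutive successful rounds seen so far

    def end_of_round(self, good_prefetches):
        """Update N after a listening state and return the new window."""
        if good_prefetches >= self.threshold:
            # Accuracy >= Threshold: hold the window until L successes.
            self.successes += 1
            if self.successes >= self.lock:
                self.window = min(self.window + 1, self.max_window)
                self.successes = 0
        else:
            # Accuracy below the threshold: shrink and start counting again.
            self.window = max(self.window - 1, self.min_window)
            self.successes = 0
        return self.window


w = AdaptiveWindow(window=4, threshold=2, lock=2)
for good in (3, 2, 1):
    print(w.end_of_round(good))
```

Requiring L successes before growing makes the window slow to expand and quick to shrink, which limits wasted prefetches when a sequential phase ends.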
Different listening states
Sequential Aggressive
• Prefetching occurs immediately after the last element of the window has been checked, whether it is a miss or a hit
• Each window is composed of P elements = #hits + #misses
Miss-Adaptive (M-Adaptive)
• M-Adaptive issues prefetches (starts a new window) only at the first miss that occurs after the whole window has been checked; hits do not trigger prefetching
• Each window is composed of P elements = #hits + #misses
Discard Miss-Adaptive (DM-Adaptive)
• DM-Adaptive issues prefetches immediately after the first miss that occurs inside the window
• Each window is composed of P elements = #hits
DCPT and DCPT-P
• No last prefetched
• Test if in cache before prefetching
• Maybe in the queue
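For readers unfamiliar with DCPT, the sketch below illustrates delta correlation as described in the Grannaes, Jahre and Natvig papers listed in the references, not the code behind these slides. All names (`DcptEntry`, `dcpt_candidates`, `issue`, `mask_bits`) are hypothetical. Partial matching is modelled as ignoring the low `mask_bits` bits of each delta, which is an assumption about how the mask is applied. Following the bullets above, the entry keeps no last-prefetch field, and candidates are checked against the cache and the queue before being issued.

```python
from collections import deque


class DcptEntry:
    """Per-PC table entry: last address plus a circular buffer of deltas."""

    def __init__(self, addr, num_deltas=16):
        self.last_addr = addr
        self.deltas = deque(maxlen=num_deltas)


def _same(a, b, mask_bits):
    # mask_bits = 0 is an exact match (DCPT); > 0 ignores low bits (DCPT-P).
    return (a >> mask_bits) == (b >> mask_bits)


def dcpt_candidates(table, pc, addr, mask_bits=0):
    """Record an access and return prefetch candidates from delta correlation."""
    entry = table.get(pc)
    if entry is None:
        table[pc] = DcptEntry(addr)
        return []
    delta = addr - entry.last_addr
    if delta == 0:
        return []
    entry.deltas.append(delta)
    entry.last_addr = addr

    d = list(entry.deltas)
    # Search backwards for an earlier occurrence of the two newest deltas.
    for i in range(len(d) - 3, -1, -1):
        if _same(d[i], d[-2], mask_bits) and _same(d[i + 1], d[-1], mask_bits):
            # Replay the deltas that followed the match, starting at addr.
            out, nxt = [], addr
            for step in d[i + 2:]:
                nxt += step
                out.append(nxt)
            return out
    return []


def issue(candidates, cache, queue):
    """Queue only candidates that are neither cached nor already queued."""
    for c in candidates:
        if c not in cache and c not in queue:
            queue.append(c)


table = {}
for a in (10, 11, 20, 21, 30):
    found = dcpt_candidates(table, pc=0x400, addr=a)
print(found)
```

In this trace the deltas are 1, 9, 1, 9, so the newest pair (1, 9) matches the start of the buffer and the following deltas are replayed from address 30.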
Aggressive Adaptive - DCPT
"Stefano, Aggressive Adaptive works pretty well with sequential benchmarks. What about DCPT?"
"Great! DCPT works very well with non-sequential benchmarks. Let's try to combine them together!"
"Ja ja, we may achieve better results!"
[Diagram: Aggressive Adaptive + DCPT = Aggressive Adaptive-DCPT (SA-DCPT, WA-DCPT)]
WA-DCPT and SA-DCPT
WA-DCPT
• WA-DCPT adds the concept of a window to DCPT
• When DCPT issues a prefetch for a specific PC, it also fetches all subsequent blocks according to its window size
• WA-DCPT is more memory demanding than DCPT because it uses a larger data structure
SA-DCPT
• At runtime it selects the better-performing algorithm between DCPT and Aggressive Sequential
• The switch threshold is the major concern
• The best switch threshold is 4
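One possible reading of the WA-DCPT window is sketched below: every address proposed by DCPT is extended with the contiguous blocks that follow it. The function name `wa_dcpt_expand`, the 64-byte block size and the choice to apply the window to each candidate (rather than only the last one) are assumptions, since the slides do not give these details.

```python
def wa_dcpt_expand(candidates, window, block_size=64):
    """Extend each DCPT candidate with the next `window` contiguous blocks.

    Duplicates are dropped so that overlapping windows do not issue
    the same block twice; the original candidate order is preserved.
    """
    seen, out = set(), []
    for addr in candidates:
        base = addr - (addr % block_size)  # align to the block start
        for i in range(window + 1):        # i = 0 is the candidate itself
            block = base + i * block_size
            if block not in seen:
                seen.add(block)
                out.append(block)
    return out


print(wa_dcpt_expand([256, 1024], window=2))
```

A window of 0 reduces this to plain DCPT candidates (block-aligned), so the window size controls how much sequential behaviour is layered on top of delta correlation.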
Adaptive results
• Aggressive Adaptive
  • In some benchmarks (galgel, applu, wupwise) the window grows to sizes between 13 and 15
  • Using a window greater than 12 does not improve performance
  • Low sequentiality for ammp, art110 and art470
• M-Adaptive and DM-Adaptive
  • The results of M-Adaptive and DM-Adaptive are not better than those of Aggressive Adaptive
  • As expected, they produce fewer "misses" and "prefetches issued"
DCPT results
• DCPT and DCPT-P
  • As expected, DCPT-P is slightly better than DCPT
  • For ammp, DCPT-P performs almost twice as well as the adaptive prefetcher
  • A table with 16 deltas and 97 PCs is the best configuration (smaller than 8 KB)
  • DCPT-P uses an 8-bit mask
  • In our tests a 12-bit mask gives no improvement
Adaptive DCPT results
• WA-DCPT
  • WA-DCPT has a different data structure from DCPT (window data)
  • Best results are achieved using 14 deltas
• SA-DCPT
  • SA-DCPT has the same data structure as DCPT
  • Tuning of the switching threshold
  • The best switching factor is 4
  • SA-DCPT behaves like DCPT for switching factors greater than 4
Developed and Literature Prefetchers
• Developed prefetchers
  • DCPT obtains the best performance
  • SA-DCPT is a good compromise when the type of benchmark is not known
• Literature vs. developed
  • Our DCPT-P implementation outperforms the reference DCPT-P
  • Likely because they use different data structures
Coverage Analysis
• Coverage
  • Benchmarks with low sequentiality (ammp, art110 and art470) have higher coverage with DCPT-P
  • Benchmarks with high sequentiality (except applu) have better coverage with SA-DCPT
• Coverage vs. speedup
  • Coverage is not directly proportional to speedup
  • If the algorithm spends too much time finding the next element to prefetch, it may increase execution time
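The slides do not define coverage, so the sketch below assumes the common textbook definition: the fraction of would-be misses removed by useful prefetches. Accuracy is included for contrast. The function names and the counts in the example are made up for illustration and are not results from the project.

```python
def coverage(useful_prefetches, remaining_misses):
    """Fraction of would-be misses that useful prefetches eliminated."""
    total = useful_prefetches + remaining_misses
    return useful_prefetches / total if total else 0.0


def accuracy(useful_prefetches, issued_prefetches):
    """Fraction of issued prefetches that were actually used."""
    return useful_prefetches / issued_prefetches if issued_prefetches else 0.0


# Illustrative counts only: high coverage can coexist with low accuracy.
print(coverage(75, 25))   # 0.75
print(accuracy(75, 300))  # 0.25
```

Neither ratio accounts for when a prefetch arrives or how much work the prefetcher does to produce it, which is consistent with coverage not tracking speedup directly.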
Conclusion
• A prefetcher matters: it can really improve performance
• Contribution: 3 new prefetcher variants
  • Adaptive window (aggressive technique)
  • DCPT-based with bit masking
  • Combination: delta correlation with adaptive window
• Importance of parameter tuning
• DCPT-P has the best performance overall
• It is difficult to combine two different (opposite) algorithms to exploit the best properties of each
References
• G. E. Moore, Cramming more Components onto Integrated Circuits, Electronics, 38(8), April 19, 1965.
• W. A. Wulf and S. A. McKee, Hitting the Memory Wall: Implications of the Obvious, Computer Architecture News, vol. 23, no. 1, Mar. 1995, pp. 20–24.
• A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik, O. O. Storaasli, State-of-the-art in heterogeneous computing, Sci. Program., vol. 18, January 2010, pp. 1-33.
• M. Jahre, Managing Shared Resources in Chip Multiprocessor Memory Systems, NTNU, 2010 (ISBN 978-82-471-2287-7), 238 pp., Doktoravhandlinger ved NTNU (159).
• M. Grannaes, Reducing Memory Latency by Improving Resource Utilization, NTNU, 2010 (ISBN 978-82-471-2177-8), 242 pp., Doktoravhandlinger ved NTNU (106).
• A. J. Smith, Cache memories, ACM Comput. Surv., vol. 14, no. 3, 1982, pp. 473–530.
• F. Dahlgren, M. Dubois, and P. Stenstrom, Fixed and adaptive sequential prefetching in shared memory multiprocessors, International Conference on Parallel Processing (ICPP 1993), vol. 1, Aug. 1993, pp. 56-63.
• M. Grannaes, M. Jahre and L. Natvig, Multi-level Hardware Prefetching Using Low Complexity Delta Correlating Prediction Tables with Partial Matching, High Performance Embedded Architectures and Compilers, LNCS vol. 5952, 2010, pp. 247-261.
• M. Grannaes, M. Jahre and L. Natvig, Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables, Data Prefetching Championship, 2009.