Biswabandan Panda, Shankar Balachandran {biswa,shankar}@cse.iitm.ac

XSTREAM: Cross-core Spatial Streaming based MLC Prefetchers for Parallel Applications in CMPs. Biswabandan Panda, Shankar Balachandran {biswa,shankar}@cse.iitm.ac.in, Indian Institute of Technology Madras, India. PACT 2014.


Presentation Transcript


  1. XSTREAM: Cross-core Spatial Streaming based MLC Prefetchers for Parallel Applications in CMPs. Biswabandan Panda, Shankar Balachandran {biswa,shankar}@cse.iitm.ac.in, Indian Institute of Technology Madras, India. PACT 2014

  2. Quick Summary • Problem: Per-core spatial-streaming MLC prefetchers are oblivious to cross-core (XCORE) streams. • Opportunity: Spatial streams spread across the cores. • Our contribution: XSTREAM (XCORE Stream Prefetching), which communicates streams from one prefetcher to another Just In Time. • Improves performance with negligible storage: 11.3% (9%) speedup in 4-core (8-core) CMPs with 50KB of additional storage.

  3. Background on Streams • Stream: a sequence of cache-line-aligned memory addresses. • Temporal streams: sequences of temporally correlated addresses, exploited by TMS [Wenisch, ISCA '05]. • Spatial streams: streams whose addresses are correlated in space, exploited by SMS [Somogyi, ISCA '06]. • Spatio-temporal streams: temporal correlation among spatial regions and spatial correlation within a region, exploited by STeMS [Somogyi, ISCA '09].

  4. Streams With Examples • [Figure: the address sequence A, A+2, B, C, A+3, A+7, shown once for spatial streams and once for temporal streams.] • Spatial streams: scans over fixed data layouts. • Temporal streams: pointer-chasing codes. • Focus of this talk: spatial streaming.

  5. SMS101: Per Core Training • Divides the memory space into fixed-size regions, indexed by a signature (PC/offset). • Each signature contains a bit vector. • Each bit in the bit vector corresponds to a cache line. • [Figure: the Active Generation Table (AGT), made up of a Filter Table (FT; fields Tag, PC/Offset) and an Accumulation Table (AT; fields Tag, PC/Offset, Bit Vector), feeds the Pattern History Table (PHT; fields Sig, Bit Vector). Example for region A with signature PC/1: (1) miss to A+1; (2) miss to A+3, bit vector 0101; (3) miss to A+2, bit vector 0111. On eviction/invalidation, the PHT stores 0111 under PC/1.]
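The training flow on this slide can be sketched in Python. This is only an illustration under stated assumptions, not the authors' hardware: the class and method names, the 64-byte line size, and the 4-line region (chosen to match the 4-bit vectors in the slide's example) are all hypothetical, and the filter and accumulation tables are folded into one dictionary.

```python
from dataclasses import dataclass

LINE_BYTES = 64        # assumed cache-line size
LINES_PER_REGION = 4   # assumed; matches the 4-bit vectors in the slide example


@dataclass
class Generation:
    signature: tuple   # (PC, offset of the trigger miss within the region)
    bits: list         # one entry per cache line in the region


class SmsTrainer:
    """Hypothetical per-core trainer; FT and AT are merged into `active`."""

    def __init__(self):
        self.active = {}   # region number -> Generation (stand-in for the AGT)
        self.pht = {}      # signature -> bit-vector string

    def miss(self, pc, addr):
        region, offset = divmod(addr // LINE_BYTES, LINES_PER_REGION)
        gen = self.active.get(region)
        if gen is None:
            # The first miss to a region starts a generation and fixes its signature.
            gen = Generation((pc, offset), [0] * LINES_PER_REGION)
            self.active[region] = gen
        gen.bits[offset] = 1

    def end_generation(self, addr):
        """Called when a line of the region is evicted or invalidated."""
        region = addr // LINE_BYTES // LINES_PER_REGION
        gen = self.active.pop(region, None)
        if gen is not None:
            self.pht[gen.signature] = "".join(str(b) for b in gen.bits)


A = 0x1000                   # hypothetical region base
trainer = SmsTrainer()
for line in (1, 3, 2):       # misses to A+1, A+3, A+2, in the slide's order
    trainer.miss(pc=0x400, addr=A + line * LINE_BYTES)
trainer.end_generation(A)
print(trainer.pht)
```

The leftmost character of the bit-vector string stands for line offset 0, so the three misses leave the pattern 0111 under the signature (PC, 1), mirroring the PC/1 entry in the slide.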

  6. Baseline Organization • [Figure: Core 0 to Core N-1, each with a private L1 and a private L2 with an SMS prefetcher, connected through a crossbar (XBar) to a shared L3 and an IMC driving DDR3 channels 0 to K-1.]

  7. XCORE Spatial Streams: An Example • [Figure: demand misses over time at Core 0 and Core 1 for streams A to H, with the lost opportunity marked.] • ICORE (intra-core) streams recur within a core: E and H at core 0, F and G at core 1. • XCORE streams spread and recur across multiple cores: A, B, C, and D.

  8. Prior Works and XSTREAM

  9. Spatial Signatures in a 4-core CMP 80%

  10. Distribution of Responses on an MLC miss • [Figure annotation: 90%] • Solution: increase the size of the L1s and L2s. • L1: 32KB to 128KB gives a 1.8% improvement in execution time. • L2: 256KB to 1MB gives an 8% improvement in execution time. • XSTREAM: 11.3% with 50KB of additional hardware.

  11. XCORE Signatures - Observations • Observation 1: 80% of the spatial signatures recur in 2 or more cores (XCORE signatures). • Observation 2: At most 4068 XCORE signatures in PARSEC recur at least 4 times. • Observation 3: Recurrences are separated by 32K cycles on average, which is enough time to communicate the streams from one prefetcher to another.

  12. XSTREAM in a Nutshell • Cross-core spatial prefetching framework, based on spatial streams. • An MLC prefetcher (the master prefetcher) communicates spatial streams to other MLC prefetchers (the worker prefetchers). • Communication happens Just In Time. • XSTREAM is • a data forwarding framework. • an inter-core prefetching framework.

  13. Ideal XSTREAM 23% & 19%

  14. Working Steps of XSTREAM • STEP I: XCORE Training (XSTREAM Detection). Identifies the XCORE signatures and the corresponding master/worker prefetcher. • STEP II: XCORE Timeliness. Finds and stores the difference in time between recurrences of XCORE streams. • STEP III: XCORE Communication (XCORE Comm.). JIT (based on STEP II) communication of the trained streams from the master to the worker prefetcher.

  15. Shared XSTREAM Detector • [Figure: a table with one row per signature (Sig 0 to Sig S-1); each row holds Entry 0 to Entry K-1, and each entry has the fields Done, Master, Time, and BV. An example entry for signature PC1 inserted at time t is shown.] • BV: bit vector. • Time (when): the time at which the BV is inserted. • Master (who): core ID of the prefetcher that sent the BV. • Done: whether the entry has already participated in the XSTREAM detection step.

  16. Enhanced Per Core PHT • [Figure: the baseline Pattern History Table (PHT0) has the fields Sig and BV, for example PC/1 with 0011. The XSTREAM PHT0 adds the fields Worker, Time, and Init, for example PC/1 with BV 0011, Worker 1, Time δ, Init 0.] • Worker: core ID of the prefetcher in which the signature will recur. • Time: the time gap after which the signature will recur at the worker. • Init: whether the master has initiated the communication.

  17. Baseline + XSTREAM • [Figure: the baseline organization with a private XBuffer (XBuff-P) next to each core's SMS prefetcher and L2, and a shared XBuffer (XBuff-S) between the XBar and the XSTREAM detector, which sits beside the shared L3. The cores are marked (f0, Vdd); the XSTREAM detector and shared L3 are marked (f1, Vdd).] • XBuff-S (shared XBuffer): an interface between SMS and the XSTREAM detector. Stores the BV and the core ID of the prefetcher. • XBuff-P (private XBuffer, per core): buffers the BV sent by the master prefetcher and the predicted worker that will use the BV in the future. • XSTREAM uses the clock domain that drives the LLC. This clock domain is oblivious to P(C) states.

  18. XSTREAM Detection in a 2-core CMP • [Figure, steps 1 and 2: PHT0 at Core 0 holds PC/1 with BV0 = 0111 and Init 0; PHT1 at Core 1 holds PC/1 with BV1 = 0011 and Init 0. AGT0 and AGT1 send BV0 and BV1 through XBuff-S to the XSTREAM detector, which records two entries for PC/1: (Done 0, Master 0, Time t, BV 0111) and (Done 0, Master 1, Time t+δ, BV 0011).]

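  19. XSTREAM Detection - 2 • [Figure, steps 3 and 4: the XSTREAM detector compares the entries stored for PC/1 (BV 0111 from master 0 at time t, BV 0011 from master 1 at time t+δ). If #1s >= CCth (Y), the PHT0 entry for PC/1 at Core 0 (BV 0111) is updated with Worker 1 and Time δ; Init stays 0.]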

  20. XCORE Communication • [Figure, steps 1 to 5: PHT0 at Core 0 (master) holds PC/1 with BV0 = 1001, Worker 1, Time δ, Init 1; PHT1 at Core 1 (worker) holds PC/1 with BV1 = 0110. At time t+δ, BV0 travels from Core 0 through the XBar to Core 1's XBuff-P; the worker forms BV1 ∪ BV0 = 1111 and passes it to its PFQ. The XSTREAM detector, reached through XBuff-S, still holds its entry for PC/1.]
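A rough sketch of what the worker might do with a forwarded bit vector, based only on the "BV1 ∪ BV0" and PFQ labels in the figure. The 64-byte line size, the convention that the leftmost bit is line offset 0, and both function names are assumptions made for illustration.

```python
LINE_BYTES = 64   # assumed cache-line size


def union(bv_a, bv_b):
    """Bitwise OR of two equal-length bit-vector strings."""
    if len(bv_a) != len(bv_b):
        raise ValueError("bit vectors must have the same length")
    return "".join("1" if "1" in pair else "0" for pair in zip(bv_a, bv_b))


def prefetch_addresses(region_base, bv):
    """Addresses to enqueue in the PFQ; leftmost bit is assumed to be offset 0."""
    return [region_base + i * LINE_BYTES for i, bit in enumerate(bv) if bit == "1"]


merged = union("0110", "1001")               # BV1 at the worker, BV0 from the master
queue = prefetch_addresses(0x1000, merged)   # 0x1000 is a hypothetical region base
print(merged, [hex(a) for a in queue])
```

The values 0110 and 1001 are the ones shown in the slide, and their union covers all four lines of the region.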

  21. XCORE Timeliness - Implementation • XCORE communication depends on the accurate prediction of the time difference (δ). • Local/global registers at the PHTs/XSTREAM detector store a 4-bit encoded cycle value [cycle/4000]. • δ estimates are tuned to minimize the noise from the interconnect. • On average, these estimates are accurate 67% of the time.
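One way to read the 4-bit [cycle/4000] encoding is sketched below. The slide gives only the field width and the divisor; truncating division and saturation at the largest 4-bit value are assumptions, as are the function names.

```python
QUANTUM = 4000      # cycles per step, from the slide's [cycle/4000]
FIELD_BITS = 4      # width of the encoded value, from the slide


def encode_delta(cycles):
    """Quantize a cycle gap into the 4-bit field (assumed to saturate)."""
    return min(cycles // QUANTUM, (1 << FIELD_BITS) - 1)


def decode_delta(code):
    """Recover the start of the cycle window that the code stands for."""
    return code * QUANTUM


code = encode_delta(32_000)   # the average gap reported in Observation 3
print(code, decode_delta(code))
```

Under these assumptions, the field covers gaps up to 60,000 cycles in 4,000-cycle steps, so the 32K-cycle average gap from Observation 3 falls in the middle of the representable range.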

  22. Interconnect Support (XSTREAM Transactions) • X-Request: between an MLC prefetcher and the shared XSTREAM detector. Contents: BV with the core ID of the master. • X-Response: between the XSTREAM detector and the master prefetcher. Contents: Time field with the core ID of the worker. • X-Comm: between the master and the worker. Contents: BV and the worker.
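The three transaction types can be written down as plain records. The BV, core ID, and Time contents come from the slide; the `signature` field and the `build_comm` helper are assumptions added so the messages can be matched up in this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XRequest:        # MLC prefetcher -> shared XSTREAM detector
    signature: str     # assumed field, not listed on the slide
    bv: int
    master: int        # core ID of the sending prefetcher


@dataclass(frozen=True)
class XResponse:       # XSTREAM detector -> master prefetcher
    signature: str     # assumed field, not listed on the slide
    time: int          # encoded time gap until the stream recurs
    worker: int        # core ID of the predicted worker


@dataclass(frozen=True)
class XComm:           # master prefetcher -> worker prefetcher
    signature: str     # assumed field, not listed on the slide
    bv: int
    worker: int


def build_comm(request, response):
    """Hypothetical helper: pair a master's BV with the worker named in the response."""
    if request.signature != response.signature:
        raise ValueError("request and response refer to different signatures")
    return XComm(request.signature, request.bv, response.worker)


comm = build_comm(XRequest("PC/1", 0b0111, master=0), XResponse("PC/1", time=8, worker=1))
print(comm)
```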

  23. Simulation Methodology

  24. Parameters Specific to XSTREAM

  25. Speedup in a 4-core CMP • [Figure: benchmarks grouped by speedup: <= 4%; > 4% and <= 13%; > 13% and <= 29%. Overall: 11.3%.]

  26. Speedup in an 8-core CMP • [Figure: benchmarks grouped by speedup: < 2%; > 2% and <= 10%; > 10% and < 30%. Overall: 9.0%.]

  27. 4-core to 8-core • dedup and ferret spawn 2+3n and 2+4n threads, respectively, in an n-core system. • fluidanimate, freqmine, and vips: the slowest thread limits the performance. • x264 and streamcluster: the degree of XCORE communication increases with the core count.

  28. Storage Overhead (4-core CMP) • XSTREAM incurs an overhead that is a little more than 1/6th of a single MLC (256KB).

  29. Summary of Evaluations • XSTREAM consumes • 18% of the spare interconnect bandwidth (which is 57% of the theoretical limit). • 7.23GB/sec (average) of DRAM bandwidth.

  30. Other Results in the Paper • Sensitivity study with various cache sizes of L1/L2/L3. • Scalability • Scales well for 16-core CMP too (10% Improvement). • Quantitative Analysis of each PARSEC benchmark. • Special Cases and specific issues. • Analysis of Bandwidth/DRAM Traffic. • Detailed Calculation of the Hardware Overhead • 2.3% increase in the L2 cache area.

  31. Conclusion • A new spatial streaming mechanism for private MLCs. • Key idea: communication of spatial streams from one prefetcher to another. • Key properties: • Low hardware cost • Simple and practical hardware implementation • Just In Time communication • Improves execution time • Outperforms the state-of-the-art spatial streaming technique by 11.3% (9%) in 4-core (8-core) CMPs.

  32. Thank You This work is supported by IBM India Shared University Research (SUR) Grant and a Ph.D. Fellowship from Tata Consultancy Services.

  33. Backup Slides

  34. Prefetch Metrics – 4-core Acc, Cov – Higher the better PF Traffic – Lower the better

  35. Prefetch Metrics – 8-core Acc, Cov – Higher the better PF Traffic – Lower the better

  36. Reduction in Demand Miss Rate (4-core)

  37. Reduction in Demand Miss Rate (8-core)
