CSE 661 PAPER PRESENTATION PERFORMANCE AND ENERGY IMPLICATIONS OF MANY-CORE CACHES FOR THROUGHPUT COMPUTING By C. J. Hughes et al. Presented by SALAMI, Hamza Onoruoiza g201002240
OUTLINE OF PRESENTATION • Throughput Computing • Benchmarks Used • Degree of Sharing of L2 Caches in Benchmarks • Cache Designs Considered • Experimental Setup • Results (Performance and Energy) • Possible Improvements • Final Results • Conclusion, Comments and Questions
THROUGHPUT COMPUTING • Performing a huge number of computations with large amounts of parallelism • Closely associated with GPGPU (general-purpose computing on GPUs)
BENCHMARKS USED • Working set: 64KB – 2MB • 64 threads, each with a private 32KB L1 cache • 256KB L2 cache per tile • L2 capacity below 2MB may result in bad performance • [Figure: L1 miss rate without prefetching]
BENCHMARKS USED (2) [Figure: L1 miss rate without prefetching]
DEGREE OF SHARING OF L2 CACHE IN BENCHMARKS • Spatial sharing degree: fraction of cache lines that are shared. Most data is private, except in svm • Temporal sharing degree: fraction of accesses that go to shared lines. Accesses to shared data are prevalent, e.g. in pcg the 0.1% of lines involved in global reads/writes receive 19.2% of L2 cache accesses (both metrics are sketched below)
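To make the two sharing metrics concrete, here is a minimal C sketch that computes them from a per-line trace summary. The data layout (a sharer bitmask and an access count per line) and all names are illustrative assumptions, not from the paper.

```c
/* Hypothetical sketch: computing the two sharing metrics from a trace
 * summary. Per line we record a bitmask of the threads that touched it
 * and the number of accesses it received. */
#include <stdio.h>
#include <stdint.h>

#define NUM_LINES 4096   /* illustrative trace size */

static int popcount64(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

void sharing_degree(const uint64_t sharers[], const uint64_t accesses[]) {
    uint64_t shared_lines = 0, shared_accesses = 0, total_accesses = 0;
    for (int i = 0; i < NUM_LINES; i++) {
        total_accesses += accesses[i];
        if (popcount64(sharers[i]) > 1) {   /* line touched by >1 thread */
            shared_lines++;
            shared_accesses += accesses[i];
        }
    }
    /* Spatial: fraction of lines that are shared.
     * Temporal: fraction of accesses that go to shared lines. */
    printf("spatial:  %.1f%% of lines shared\n",
           100.0 * shared_lines / NUM_LINES);
    printf("temporal: %.1f%% of accesses to shared lines\n",
           total_accesses ? 100.0 * shared_accesses / total_accesses : 0.0);
}
```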
CACHE DESIGNS CONSIDERED ASSUMPTIONS • Two-level caching (private L1, varying L2); inclusive caches • Directory-based coherence • Tiled design (tile = core + private caches + switch; a minimal sketch follows)
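As a reading aid, a minimal sketch of the assumed tiled organization; the struct layout and field names are hypothetical, chosen only to mirror the slide's tile = core + private caches + switch breakdown.

```c
/* Hypothetical sketch of one tile: core + private caches + switch.
 * Sizes follow the benchmark slide (32KB L1, 256KB L2 slice). */
#include <stdint.h>

struct cache { uint32_t size_bytes; uint32_t line_bytes; };

struct tile {
    int          core_id;
    struct cache l1;          /* private 32KB L1 */
    struct cache l2_slice;    /* 256KB L2; its role varies by design */
    int          switch_port; /* link to the on-die interconnect */
};
```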
CACHE DESIGNS CONSIDERED (2) 1) PRIVATE LLC • An LLC slice is private to each tile • Most flexible design (replicas of a cache line can exist in all LLCs simultaneously) • Fewer unique cache lines => more LLC misses • Each tile contains a tag directory • Hash function(cache block address) = home tile (sketch below) • The home tile provides info on which LLC(s) hold the required data • A cache-to-cache transfer then takes place
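A small sketch of the home-tile mapping mentioned above. The slides only say a hash of the cache block address selects the home tile; the modulo hash, 64 tiles, and 64-byte lines below are assumptions for illustration.

```c
/* Sketch of home-tile selection for the private-LLC design. */
#include <stdint.h>

#define NUM_TILES  64
#define BLOCK_BITS 6    /* assumed 64-byte cache lines */

int home_tile(uint64_t paddr) {
    uint64_t block = paddr >> BLOCK_BITS;  /* cache block address */
    return (int)(block % NUM_TILES);       /* assumed hash function */
}
/* On an L2 miss, the requester queries the home tile's tag directory,
 * which names the LLC(s) holding the line; a cache-to-cache transfer
 * then supplies the data. */
```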
CACHE DESIGNS CONSIDERED (3) 2) UNCONTROLLED REPLICATION • Similar to private LLC • Tries to increase the no. of unique lines • Evicting a cache block with one sharer? Move the block to its home tile. • Already in its home tile? Evict from the chip. (Decision rule sketched below.)
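The eviction rule above reduces to a small decision function. This is a sketch of the described policy with assumed types; the DROP_REPLICA case (silently dropping a copy when other sharers remain) is an inference from the design's goal of keeping unique lines on-die.

```c
/* Sketch of the uncontrolled-replication eviction rule. */
typedef enum { DROP_REPLICA, MOVE_TO_HOME, EVICT_FROM_CHIP } evict_action;

evict_action on_llc_eviction(int num_sharers, int tile, int home_tile) {
    if (num_sharers > 1)
        return DROP_REPLICA;     /* other on-die copies remain */
    if (tile != home_tile)
        return MOVE_TO_HOME;     /* last copy: migrate to its home LLC */
    return EVICT_FROM_CHIP;      /* last copy is already at home */
}
```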
CACHE DESIGNS CONSIDERED (4) 3) CONTROLLED REPLICATION • Builds on uncontrolled replication • Tries to further increase the no. of unique lines • Each block has a reference bit • Reference bit = 1 => likely part of the working set • Duplicate copies of cache blocks not in active use are favored for LRU eviction (sketch below)
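A sketch of the biased victim selection implied by the last bullet: inactive duplicates are preferred as victims, with LRU age as the tie-breaker. The per-line metadata layout and scoring are assumptions; the paper's exact replacement mechanism may differ.

```c
/* Sketch of victim selection biased against inactive duplicates. */
#include <stdint.h>

struct llc_line {
    uint64_t tag;
    uint8_t  ref_bit;     /* 1 => likely part of the working set */
    uint8_t  is_replica;  /* another on-die LLC also holds this line */
    uint32_t lru_age;     /* higher = older */
};

int pick_victim(const struct llc_line set[], int ways) {
    int victim = 0;
    for (int i = 1; i < ways; i++) {
        /* prefer replicas whose reference bit is clear; tie-break by age */
        int cand = set[i].is_replica && !set[i].ref_bit;
        int best = set[victim].is_replica && !set[victim].ref_bit;
        if (cand > best ||
            (cand == best && set[i].lru_age > set[victim].lru_age))
            victim = i;
    }
    return victim;
}
```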
CACHE DESIGNS CONSIDERED (5) 4) NO REPLICATION • Limited flexibility • Cache lines reside in at most one LLC at a time • Shared lines are held in the line's home tile's LLC (=> easy accessibility) • Private lines are held in the user's LLC (the RDP points to the line's location) • Eviction of a private line, or an increase in the number of sharers, returns the block to its home LLC (placement rule sketched below)
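The placement rule reduces to a small function; this sketch assumes the directory tracks a sharer count and, for private lines, the owning tile via the RDP. Names are illustrative.

```c
/* Sketch of the no-replication placement rule. The home tile's RDP
 * records where a private line currently lives. */
int line_location(int num_sharers, int home_tile, int owner_tile) {
    if (num_sharers > 1)
        return home_tile;   /* shared lines always sit in the home LLC */
    return owner_tile;      /* private lines sit in the user's LLC */
}
```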
CACHE DESIGNS CONSIDERED (6) 5) SHARED • Least flexibility • All cache lines reside in their home tile's LLC • Lines are easy to find • Increased average access latency and on-die traffic for private lines
CACHE DESIGNS CONSIDERED (7) SUMMARY • Designs, ordered from most to least flexible: Private, Uncontrolled Replication, Controlled Replication, No Replication, Shared • Effective cache capacity (no. of unique blocks): increases from Private to Shared • Flexibility: decreases from Private to Shared • Reduction in on-die bandwidth usage: greatest for the more flexible (replicating) designs • Reduction in off-die bandwidth usage: greatest for the less flexible designs
EXPERIMENTAL SETUP • A simulator is used • L1 has a hardware stride prefetcher • Energy consumption is modeled from four components (accounting sketched below): • Storage energy: tag and cache-line accesses to the LLC, tag directory and RDP • On-die data messages • On-die coherence messages • Off-die accesses
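The energy model amounts to a weighted sum over the four event classes listed above. A sketch, with hypothetical field names and no real per-event energy values (the paper's numbers are not reproduced here):

```c
/* Sketch of the energy accounting implied above: total energy is a
 * weighted sum of the four event classes. */
struct energy_model {
    double e_storage;   /* per tag/line access to LLC, tag directory, RDP */
    double e_data_msg;  /* per on-die data message */
    double e_coh_msg;   /* per on-die coherence message */
    double e_offdie;    /* per off-die access */
};

double total_energy(const struct energy_model *m,
                    long storage, long data_msgs, long coh_msgs, long offdie) {
    return m->e_storage  * storage
         + m->e_data_msg * data_msgs
         + m->e_coh_msg  * coh_msgs
         + m->e_offdie   * offdie;
}
```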
RESULTS (PERFORMANCE) • The least flexible designs offer better performance! • Why the least flexible designs win: • High throughput to heavily read/written lines (on a miss, the home tile responds directly; no acknowledgement is needed) • A single write invalidates all readers (less impact for centralized-data designs, worse for flexible designs) • Why the flexible designs lose: • No centralized data storage • No overlapped cache-to-cache transfers; the directory must receive an acknowledgement from the sending tile before processing another request
RESULTS (ENERGY) • Flexible designs consume significantly less energy than the other designs! • Flexible designs minimize on-die traffic because of replication • Off-die traffic increases (fewer unique lines), but most lines have few sharers; see Figure 1 • On-die traffic for No Replication is better than for Shared due to data migration • Off-die traffic decreases as we move from Private to Uncontrolled Replication to Controlled Replication (each step keeps more unique lines on-die)
RESULTS SO FAR… • Flexible designs are more energy efficient • Less flexible designs offer better performance • Controlled Replication uses the least energy • Can we improve its parallelism for handling multiple reads of the same cache line?
POSSIBLE IMPROVEMENTS • Tag Directory Buffer • Small, fully associative buffer added to the tag directory to hold clean lines with at least 3 readers (similar in spirit to the Shared design) • Tag Directory Buffer All • Similar to Tag Directory Buffer, but all read-shared lines are placed in the buffer • A four-entry buffer of 256 bytes is used (insertion check sketched below)
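A sketch of the insertion check for the Tag Directory Buffer: only clean lines with at least three readers are promoted. The sizing (4 entries x 64B = 256 bytes) follows the slide; the round-robin replacement and structure layout are assumptions.

```c
/* Sketch of the Tag Directory Buffer insertion check. */
#include <stdint.h>
#include <string.h>

#define TDB_ENTRIES 4
#define LINE_BYTES  64

struct tdb_entry { uint64_t tag; uint8_t valid; uint8_t data[LINE_BYTES]; };
struct tdb { struct tdb_entry e[TDB_ENTRIES]; int next; };

void tdb_maybe_insert(struct tdb *b, uint64_t tag,
                      const uint8_t *line, int readers, int dirty) {
    if (dirty || readers < 3)
        return;                        /* only clean, widely read lines */
    struct tdb_entry *slot = &b->e[b->next];
    slot->tag   = tag;
    slot->valid = 1;
    memcpy(slot->data, line, LINE_BYTES);
    b->next = (b->next + 1) % TDB_ENTRIES;  /* assumed replacement */
}
```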
POSSIBLE IMPROVEMENTS (2) • Sharing Migration • Similar to Tag Directory Buffer, but uses the home tile's LLC instead of a buffer • Sharing Migration All • Similar to Tag Directory Buffer All, but uses the home tile's LLC instead of a buffer • Parallel Reads • Allows simultaneous (overlapped) cache-to-cache transfers of the same cache line for reads (sketch below)
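To make the Parallel Reads idea concrete, a sketch contrasting the baseline's serialized handling (the directory waits for the sender's acknowledgement) with overlapped read transfers; the directory-entry fields are hypothetical.

```c
/* Sketch contrasting serialized and parallel read handling at the
 * directory. */
#include <stdbool.h>

struct dir_entry {
    bool transfer_in_flight;  /* baseline: an un-acked transfer blocks */
    int  pending_reads;       /* parallel reads: overlapping transfers */
};

bool can_serve_read(const struct dir_entry *e, bool parallel_reads) {
    if (parallel_reads)
        return true;                   /* reads of the same line overlap */
    return !e->transfer_in_flight;     /* wait for the sender's ack */
}
```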
FINAL RESULTS • Tag Directory Buffer provides the highest performance and close to the lowest energy consumption. See also Figure 3