Cell GC: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor
Chen-Yong Cher • Michael Gschwind
Automatic Dynamic Garbage Collection
• Offload to a coprocessor
  • performance benefit
  • host processor keeps running independently
• Boehm-Demers-Weiser (BDW) implementation
  • offload the mark phase to an SPE
  • prefetching to avoid cache misses
Contributions
• Local store (LS) management and synchronization for offloading the mark phase
• coherence of liveness information between host (PPE) and SPU
• MFC-based software cache (SW$) to improve performance
• hybrid caching schemes for different data types
• extending the SW$ and Capitulative Loads using MFC DMA
Usefulness
• Application space is not the right place for such a job
• Explore an LS-based system
  • low latency
  • no overhead from cache coherence protocols
BDW garbage collection
• Mark / lazy-sweep collector
  • ambiguous roots
  • DFS over all reachable pointers on the heap (mark loop sketched below)
  • unmarked objects are deallocated
• Lazy sweep: avoids touching large amounts of cold data
  • preferable caching behavior
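To make the traversal concrete, here is a minimal sketch of a mark phase driven by an explicit mark stack, assuming a simplified object layout; the names and structure are illustrative, not BDW's actual code.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical object layout: a mark bit plus outgoing references. */
typedef struct object {
    bool           marked;
    size_t         num_refs;
    struct object *refs[];          /* outgoing pointers */
} object_t;

#define MARK_STACK_MAX 4096
static object_t *mark_stack[MARK_STACK_MAX];
static size_t    mark_top;          /* overflow handling omitted */

static void mark_from_roots(object_t **roots, size_t num_roots)
{
    /* Seed the stack with the (possibly ambiguous) roots. */
    for (size_t i = 0; i < num_roots; i++)
        if (roots[i] != NULL && !roots[i]->marked)
            mark_stack[mark_top++] = roots[i];

    /* Depth-first traversal of everything reachable; whatever stays
     * unmarked is reclaimed later by the lazy sweep. */
    while (mark_top > 0) {
        object_t *obj = mark_stack[--mark_top];
        if (obj->marked)
            continue;
        obj->marked = true;
        for (size_t i = 0; i < obj->num_refs; i++)
            if (obj->refs[i] != NULL && !obj->refs[i]->marked)
                mark_stack[mark_top++] = obj->refs[i];
    }
}
```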
Initialization
• PPE
  • offloads the marking phase to the SPE
  • sends the effective address to the SPE (via mailbox)
• SPE
  • signals that it is ready to receive data
  • synchronizes on completion of the PPE transfer
• MFC translates effective addresses to real addresses (segment tables, page tables)
• SPE-initiated DMA transfer (preferable)
  • moves data between system memory and the SPE local store (handshake sketched below)
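A minimal sketch of the SPE side of this handshake, assuming the standard Cell SDK intrinsics from `<spu_mfcio.h>`; the READY code, the buffer size, and the two-mailbox-word address protocol are assumptions for illustration.

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define BUF_SIZE 16384                       /* assumed transfer size */
static volatile uint8_t buf[BUF_SIZE] __attribute__((aligned(128)));

int main(void)
{
    /* Signal readiness, then block until the PPE mails the 64-bit
     * effective address as two 32-bit words. */
    spu_write_out_mbox(1);                   /* hypothetical READY code */
    uint32_t hi = spu_read_in_mbox();
    uint32_t lo = spu_read_in_mbox();
    uint64_t ea = ((uint64_t)hi << 32) | lo;

    /* The MFC translates the effective address through the segment
     * and page tables; the get pulls the data into the local store. */
    mfc_get((void *)buf, ea, BUF_SIZE, 0 /* tag */, 0, 0);
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();               /* wait for DMA completion */

    /* ... mark phase now runs out of the local store ... */
    return 0;
}
```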
Use of the Local Store (256 KB total)
• 20 KB instruction image (GC code)
• 128 KB software cache
• 40 KB header cache
• 32 KB local mark stack
• small activation-record stack
Porting the GC
• Porting BDW to the SPE
  • application heap traversal (the bulk of execution)
  • synchronization between PPE and SPE via mailbox (small data)
• BDW data structures
  • mark stack
  • heap blocks
  • header blocks
• Traverse only live references (locality optimization)
Porting the GC (cont.)
• Pointer chasing
  • poor locality defeats hardware caches and prefetchers
  • operand buffers improve locality: fetch entire blocks, not just the referenced word (sketched below)
• Object cache (for header blocks)
  • temporary store of records
  • indexed by hashing the system-memory effective address
• Naive implementation serves as the performance baseline
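A minimal sketch of the operand-buffer idea under the same SDK assumptions: rather than chasing one reference at a time, DMA the whole heap block into the local store and scan it there. `HBLKSIZE` and the helper name are illustrative.

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define HBLKSIZE 4096   /* assumed heap-block size */
static volatile uint8_t opb[HBLKSIZE] __attribute__((aligned(128)));

void *fetch_block(uint64_t block_ea)
{
    /* One bulk transfer amortizes the DMA cost over every pointer
     * scanned in the block, instead of paying a miss per load. */
    mfc_get((void *)opb, block_ea, HBLKSIZE, 1 /* tag */, 0, 0);
    mfc_write_tag_mask(1 << 1);
    mfc_read_tag_status_all();
    return (void *)opb;
}
```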
Software caches
• For non-homogeneous and non-contiguous data structures
  • exploit temporal and spatial locality
• Significant overheads on the SPU
  • access latency to compute and compare tags
  • on a miss, locating and filling the data buffer
• Poor match for regular, predictable access patterns
• Useful for large data sets
  • statistical locality
  • adjustable in size
• (lookup path sketched below)
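A minimal sketch of a direct-mapped software cache in the local store, which makes the tag-compare overhead visible; the line count matches the slide's 128 KB figure, but the layout itself is an assumption.

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define LINE_SIZE  128
#define LINE_SHIFT 7
#define NUM_LINES  1024                      /* 1024 x 128 B = 128 KB */

static volatile uint8_t swc_data[NUM_LINES][LINE_SIZE]
        __attribute__((aligned(128)));
static uint64_t swc_tag[NUM_LINES];          /* EA of cached line, 0 = empty */

void *swc_lookup(uint64_t ea)
{
    uint64_t line_ea = ea & ~(uint64_t)(LINE_SIZE - 1);
    uint32_t idx     = (uint32_t)(line_ea >> LINE_SHIFT) & (NUM_LINES - 1);

    if (swc_tag[idx] != line_ea) {           /* software tag compare */
        /* Miss: locate the buffer and fill it from system memory. */
        mfc_get((void *)swc_data[idx], line_ea, LINE_SIZE, 2, 0, 0);
        mfc_write_tag_mask(1 << 2);
        mfc_read_tag_status_all();
        swc_tag[idx] = line_ea;
    }
    return (void *)&swc_data[idx][ea & (LINE_SIZE - 1)];
}
```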
Software caches (cont.)
• Hybrid caching strategy: SW$ + operand buffer (OPB)
  • partition blocks between SW$ and OPB (dispatch sketched below)
  • use the SW$ for small heap blocks
    • reduces hit latency
    • routes references with dense spatial locality away from the cache
• Intelligent tuning of code generation
  • take advantage of Cell SPE ISA features
  • ILP/DLP (not covered)
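The hybrid strategy then reduces to a dispatch on block size, routing small blocks through the software cache and large, densely scanned ones through the operand buffer; the cutoff below is an assumed value, and `swc_lookup`/`fetch_block` are the sketches above.

```c
#include <stdint.h>

void *swc_lookup(uint64_t ea);         /* SW$ sketch above */
void *fetch_block(uint64_t block_ea);  /* operand-buffer sketch above */

#define SMALL_BLOCK_MAX 256            /* assumed cutoff */

static void *access_block(uint64_t ea, uint32_t block_size)
{
    if (block_size <= SMALL_BLOCK_MAX)
        return swc_lookup(ea);         /* low hit latency, reuse-friendly */
    return fetch_block(ea);            /* bulk DMA, one transfer per block */
}
```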
Prefetching
• Large data sets with poor locality
• Techniques to hide memory latency
  • Boehm: uses the mark stack for prefetching
  • CHV and Capitulative Loads: use 8-16-entry FIFO buffers
  • CHV: exploits parallel branches, departing from strict DFS traversal
  • Capitulative Loads: change access ordering; a demand load on a cache hit, a prefetch on a miss (sketched below)
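A minimal sketch of the capitulative-load policy: process an object immediately if it is resident, otherwise issue a prefetch and defer it to a small FIFO, revisiting it once the data has had time to arrive. The residency check and prefetch hook are hypothetical stand-ins for whatever the platform provides.

```c
#include <stdbool.h>

#define FIFO_SIZE 16                 /* 8-16 entries, per the slide */
static void *fifo[FIFO_SIZE];
static unsigned head, tail, count;

bool is_resident(void *p);           /* hypothetical residency check */
void prefetch(void *p);              /* hypothetical prefetch hook */
void mark_object(void *p);           /* demand-process a resident object */

static void capitulative_mark(void *obj)
{
    if (is_resident(obj)) {
        mark_object(obj);            /* hit: plain demand load */
        return;
    }
    prefetch(obj);                   /* miss: capitulate and defer */
    if (count == FIFO_SIZE) {
        /* FIFO full: the oldest entry's prefetch has had the longest
         * time to complete, so drain it first. */
        mark_object(fifo[head]);
        head = (head + 1) % FIFO_SIZE;
        count--;
    }
    fifo[tail] = obj;
    tail = (tail + 1) % FIFO_SIZE;
    count++;
    /* remaining entries are drained when marking winds down (omitted) */
}
```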
Prefetching (cont.)
• Advantage of Cell: prefetch is initiated under application control by the DMA engine
  • efficient for regular, dense data sets (e.g., tiled matrices)
  • GC, with irregular data patterns and unpredictable locality, is the antithesis
• Prefetch heap blocks for pointer scanning
  • locality: addresses fall within heap-block bounds
Prefetching (cont.)
• Differences between prefetching on conventional processors and processors with a local store
  • addresses are binding
  • granularity of prefetching
  • cost of misspeculation
  • cost of DMA
• Early/late arrival, replacement
  • not enough work to overlap and hide the miss latency
  • suffers from pollution effects and overheads
• Result: a virtual tie
Data coherence and consistency management
• Synchronization between PPE and SPE is necessary
  • application and control code run on the PPE
  • the bulk of the mark phase runs on the SPE
• Data are copied, so consistency must be maintained on updates
  • header blocks
  • lookup indices
  • application heap
Data coherence and consistency management (cont.)
• Scheme based on data usage
  • SPE -> PPE transfers happen only when the SPE has completed its work
  • data synchronization is necessary for everything except the mark stack
• Coherence and synchronization handled via the MFC mailbox
  • to schedule a mark operation, the PPE sends a descriptor with part of the mark stack and parameters
  • the SPE sends the parameters back via DMA (as an ACK) and the mark phase begins (sketched below)
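A minimal sketch of the SPE side of this scheduling handshake, again assuming the Cell SDK intrinsics; the descriptor layout is an illustrative assumption, padded to a DMA-legal size.

```c
#include <spu_mfcio.h>
#include <stdint.h>

/* Illustrative descriptor layout, not the paper's actual format. */
typedef struct {
    uint64_t mark_stack_ea;   /* EA of the donated mark-stack slice */
    uint32_t num_entries;
    uint32_t params[4];
    uint32_t pad;             /* pad to a DMA-legal 32-byte size */
} gc_descriptor_t;

static volatile gc_descriptor_t desc __attribute__((aligned(128)));

static void receive_work(void)
{
    /* The PPE mails the 64-bit EA of the descriptor as two words. */
    uint32_t hi = spu_read_in_mbox();
    uint32_t lo = spu_read_in_mbox();
    uint64_t desc_ea = ((uint64_t)hi << 32) | lo;

    /* Pull the descriptor (mark-stack slice EA plus parameters). */
    mfc_get((void *)&desc, desc_ea, sizeof(desc), 3, 0, 0);
    mfc_write_tag_mask(1 << 3);
    mfc_read_tag_status_all();

    /* Echo the parameters back via DMA as the acknowledgment, then
     * the mark phase begins. */
    mfc_put((void *)&desc, desc_ea, sizeof(desc), 3, 0, 0);
    mfc_write_tag_mask(1 << 3);
    mfc_read_tag_status_all();
}
```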
Conclusions
• Data-reference-optimized strategy
  • performance gain of 400%-600%
  • local-store-based data prefetch
  • exploits the local-store environment
• Software-controlled cache for garbage collectors
• A viable and competitive solution
  • offload to a coprocessor, increase utilization
A Reactive Unobtrusive Prefetcher for Multicore and Manycore Architectures
Jason Mars • Daniel Williams • Dan Upton • Sudeep Ghosh • Kim Hazelwood
Software dynamic prefetchers
• Can identify complex patterns, but at high application overhead
• Unobtrusive prefetcher
  • takes advantage of a neighbor's underutilized cores
• Snooping, profiling, pattern detection, prefetching
URPref
• A neighboring idle core observes miss patterns
• Analyzes miss patterns
  • continuously profiles and adapts
• Uses Sequitur for pattern detection
• Performs prefetches
  • first identifies the prefix of a hot stream
• Reactive to high cache miss rates
• Targets pointer chasing (complex and difficult for hardware prefetchers)
URPref (cont.)
• Contributions
  • profile cache misses, detect patterns, prefetch
  • proposed hardware extensions (snooping, etc.)
  • pattern-based approach to detect miss patterns and adapt to phases
URPref support
• FIFO snoopy buffer
• Profiling on a separate core
• Two additional ISA instructions
• The OS must be aware of URPref
  • it can be resumed, easily halted, suspended, made to wait, or started late
• Linear-time pattern-detection algorithm over cache-miss streams
  • Sequitur
Detecting Hot Streams with Sequitur
• Hot cache-miss streams are characterized by
  • the length of the miss stream
  • its frequency in the input sample window
  • a small portion of streams often accounts for a large percentage of misses
• The snoopy buffer sends each new miss to Sequitur
  • determine whether it forms a prefix
  • on a match, fetch the remainder of the hot stream into the neighboring L1 cache
Detecting Hot Streams with Sequitur (cont.)
• Profile window: a series of cache misses
• Data miss stream: a sequence of data-cache miss addresses that repeats within a profile window
• Hot stream: a stream covering a given percentage of the profiling window
• Sequitur builds a context-free grammar over the cache-miss patterns
  • each production represents some sequence repeated more than once
Sequitur
• The hottest streams are used for prefetching
  • hotness = length x cold uses
  • cold uses: # of times the rule is used in the grammar
  • length: # of terminal symbols
• Detecting patterns over
  • actual cache addresses
  • deltas
Using Sequitur
• Determine deltas between adjacent misses
• hotness = length x cold uses (worked example below)
  • length: sum of the # of terminals under a given non-terminal
  • cold uses: # of times the non-terminal appears in the grammar
• Prefix: the first few symbols of a hot stream
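A tiny worked example of the metric, with illustrative names: a non-terminal that expands to 6 miss addresses and appears 5 times in the grammar scores 30, outranking one covering 10 misses seen only twice (20).

```c
#include <stdint.h>

typedef struct {
    uint32_t length;      /* # of terminals under this non-terminal */
    uint32_t cold_uses;   /* # of times it appears in the grammar   */
} rule_t;

static uint32_t hotness(const rule_t *r)
{
    return r->length * r->cold_uses;   /* hotness = length x cold uses */
}
/* e.g. {6, 5} -> 30 beats {10, 2} -> 20 when ranking streams. */
```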
Using Sequitur (cont.)
• Fill a window of misses from the snoopy buffer
  • dynamic window size -> normalizes the # of hot streams
• Keep track of the last 4 misses
• Active hot-stream table
  • receive misses from the snoopy buffer
  • create the prefix window
  • search the active hot-stream table
  • if a match is found, prefetch the stream (sketched below)
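A minimal sketch of the matching step under assumed table and stream sizes: slide each new miss into a 4-entry prefix window, compare it against the active hot-stream table, and prefetch the remainder on a match. `prefetch_line` is a hypothetical hook.

```c
#include <stdint.h>
#include <string.h>

#define PREFIX_LEN     4      /* last 4 misses, per the slide */
#define MAX_STREAMS    32     /* assumed table size */
#define MAX_STREAM_LEN 64     /* assumed stream-length cap */

typedef struct {
    uint64_t addrs[MAX_STREAM_LEN];
    uint32_t len;
} hot_stream_t;

static hot_stream_t table[MAX_STREAMS];
static uint32_t     num_streams;
static uint64_t     prefix[PREFIX_LEN];   /* oldest miss first */

void prefetch_line(uint64_t addr);        /* hypothetical prefetch hook */

static void on_miss(uint64_t addr)
{
    /* Slide the new miss into the prefix window. */
    memmove(prefix, prefix + 1, (PREFIX_LEN - 1) * sizeof prefix[0]);
    prefix[PREFIX_LEN - 1] = addr;

    /* Search the active hot-stream table for a matching prefix. */
    for (uint32_t s = 0; s < num_streams; s++) {
        if (table[s].len > PREFIX_LEN &&
            memcmp(table[s].addrs, prefix, sizeof prefix) == 0) {
            /* Match: prefetch the remainder of the hot stream. */
            for (uint32_t i = PREFIX_LEN; i < table[s].len; i++)
                prefetch_line(table[s].addrs[i]);
            break;
        }
    }
}
```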
Prefetching
• New hot streams are added to the table
  • hot streams that are no longer active are retired
• No hot streams found: the dynamic prefetcher remains dormant
• Hot streams become cold: the dynamic prefetcher stops
• Avoids a conservative approach
• Application code does not change
  • no prefetch instructions are added, so no overhead
• No synchronization between the prefetcher and the application core is needed
  • no latency constraint when the tail of a stream is prefetched
Prefetcher: Adaptive Response
• Phase changes
  • avoid cache pollution and incorrect prefetching
  • achieve maximum performance
• Continuous profiling (one window of latency)
  • prefetches use prefixes from the previous window
  • e.g., the stream ABCDEF in one window may shift to ABCGHI in the next