Cell GC: Using the Cell Synergistic Processor as a Garbage Collection Coprocessor
Chen-Yong Cher • Michael Gschwind
Automatic Dynamic Garbage Collection
• Offload to a coprocessor
  • performance benefit
  • host processor keeps running independently
• Boehm-Demers-Weiser (BDW) implementation
  • offload the mark phase to an SPE
  • prefetching to avoid cache misses
Contributions
• Local store (LS) management and synchronization for offloading the mark phase
• coherence of liveness information between host (PPE) and SPU
• MFC-based software cache (SW$) to improve performance
• hybrid caching schemes for different data types
• extending the SW$ and Capitulative Loads using MFC DMA
Usefulness
• Application space is not the right place for such a job
• Explore an LS-based system
  • low latency
  • no overhead from cache coherence protocols
BDW garbage collection
• Mark / lazy-sweep collector
  • ambiguous roots
  • DFS over all reachable pointers on the heap (mark loop sketched below)
  • unmarked objects are deallocated
• Lazy sweep: avoids touching large amounts of cold data
  • preferable caching behavior
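To make the traversal concrete, here is a minimal sketch of a mark phase driven by an explicit mark stack, assuming a simplified object layout; the names and structure are illustrative, not BDW's actual code.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical object layout: a mark bit plus outgoing references. */
typedef struct object {
    bool           marked;
    size_t         num_refs;
    struct object *refs[];          /* outgoing pointers */
} object_t;

#define MARK_STACK_MAX 4096
static object_t *mark_stack[MARK_STACK_MAX];
static size_t    mark_top;          /* overflow handling omitted */

static void mark_from_roots(object_t **roots, size_t num_roots)
{
    /* Seed the stack with the (possibly ambiguous) roots. */
    for (size_t i = 0; i < num_roots; i++)
        if (roots[i] != NULL && !roots[i]->marked)
            mark_stack[mark_top++] = roots[i];

    /* Depth-first traversal of everything reachable; whatever stays
     * unmarked is reclaimed later by the lazy sweep. */
    while (mark_top > 0) {
        object_t *obj = mark_stack[--mark_top];
        if (obj->marked)
            continue;
        obj->marked = true;
        for (size_t i = 0; i < obj->num_refs; i++)
            if (obj->refs[i] != NULL && !obj->refs[i]->marked)
                mark_stack[mark_top++] = obj->refs[i];
    }
}
```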
Initialization
• PPE
  • offloads the marking phase to the SPE
  • sends the effective address to the SPE (via mailbox)
• SPE
  • signals that it is ready to receive data
  • synchronizes on completion of the PPE transfer
• MFC translates effective addresses to real addresses (segment tables, page tables)
• SPE-initiated DMA transfer (preferable)
  • moves data between system memory and the SPE local store (handshake sketched below)
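A minimal sketch of the SPE side of this handshake, assuming the standard Cell SDK intrinsics from `<spu_mfcio.h>`; the READY code, the buffer size, and the two-mailbox-word address protocol are assumptions for illustration.

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define BUF_SIZE 16384                       /* assumed transfer size */
static volatile uint8_t buf[BUF_SIZE] __attribute__((aligned(128)));

int main(void)
{
    /* Signal readiness, then block until the PPE mails the 64-bit
     * effective address as two 32-bit words. */
    spu_write_out_mbox(1);                   /* hypothetical READY code */
    uint32_t hi = spu_read_in_mbox();
    uint32_t lo = spu_read_in_mbox();
    uint64_t ea = ((uint64_t)hi << 32) | lo;

    /* The MFC translates the effective address through the segment
     * and page tables; the get pulls the data into the local store. */
    mfc_get((void *)buf, ea, BUF_SIZE, 0 /* tag */, 0, 0);
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();               /* wait for DMA completion */

    /* ... mark phase now runs out of the local store ... */
    return 0;
}
```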
Use of the Local Store (256 KB total)
• 20 KB instruction image (GC code)
• 128 KB software cache
• 40 KB header cache
• 32 KB local mark stack
• small activation-record stack
Porting the GC
• Porting BDW to the SPE
  • application heap traversal (the bulk of execution)
  • synchronization between PPE and SPE via mailbox (small data)
• BDW data structures
  • mark stack
  • heap blocks
  • header blocks
• Traverse only live references (locality optimization)
Porting the GC (cont.)
• Pointer chasing
  • poor locality defeats hardware caches and prefetchers
  • operand buffers improve locality: fetch entire blocks, not just the referenced word (sketched below)
• Object cache (for header blocks)
  • temporary store of records
  • indexed by hashing the system-memory effective address
• Naive implementation serves as the performance baseline
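A minimal sketch of the operand-buffer idea under the same SDK assumptions: rather than chasing one reference at a time, DMA the whole heap block into the local store and scan it there. `HBLKSIZE` and the helper name are illustrative.

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define HBLKSIZE 4096   /* assumed heap-block size */
static volatile uint8_t opb[HBLKSIZE] __attribute__((aligned(128)));

void *fetch_block(uint64_t block_ea)
{
    /* One bulk transfer amortizes the DMA cost over every pointer
     * scanned in the block, instead of paying a miss per load. */
    mfc_get((void *)opb, block_ea, HBLKSIZE, 1 /* tag */, 0, 0);
    mfc_write_tag_mask(1 << 1);
    mfc_read_tag_status_all();
    return (void *)opb;
}
```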
Software caches
• For non-homogeneous and non-contiguous data structures
  • exploit temporal and spatial locality
• Significant overheads on the SPU
  • access latency to compute and compare tags
  • on a miss, locating and filling the data buffer
• Poor match for regular, predictable access patterns
• Useful for large data sets
  • statistical locality
  • adjustable in size
• (lookup path sketched below)
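A minimal sketch of a direct-mapped software cache in the local store, which makes the tag-compare overhead visible; the line count matches the slide's 128 KB figure, but the layout itself is an assumption.

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define LINE_SIZE  128
#define LINE_SHIFT 7
#define NUM_LINES  1024                      /* 1024 x 128 B = 128 KB */

static volatile uint8_t swc_data[NUM_LINES][LINE_SIZE]
        __attribute__((aligned(128)));
static uint64_t swc_tag[NUM_LINES];          /* EA of cached line, 0 = empty */

void *swc_lookup(uint64_t ea)
{
    uint64_t line_ea = ea & ~(uint64_t)(LINE_SIZE - 1);
    uint32_t idx     = (uint32_t)(line_ea >> LINE_SHIFT) & (NUM_LINES - 1);

    if (swc_tag[idx] != line_ea) {           /* software tag compare */
        /* Miss: locate the buffer and fill it from system memory. */
        mfc_get((void *)swc_data[idx], line_ea, LINE_SIZE, 2, 0, 0);
        mfc_write_tag_mask(1 << 2);
        mfc_read_tag_status_all();
        swc_tag[idx] = line_ea;
    }
    return (void *)&swc_data[idx][ea & (LINE_SIZE - 1)];
}
```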
Software caches (cont.)
• Hybrid caching strategy: SW$ + operand buffer (OPB)
  • partition blocks between SW$ and OPB (dispatch sketched below)
  • use the SW$ for small heap blocks
    • reduces hit latency
    • routes references with dense spatial locality away from the cache
• Intelligent tuning of code generation
  • take advantage of Cell SPE ISA features
  • ILP/DLP (not covered)
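The hybrid strategy then reduces to a dispatch on block size, routing small blocks through the software cache and large, densely scanned ones through the operand buffer; the cutoff below is an assumed value, and `swc_lookup`/`fetch_block` are the sketches above.

```c
#include <stdint.h>

void *swc_lookup(uint64_t ea);         /* SW$ sketch above */
void *fetch_block(uint64_t block_ea);  /* operand-buffer sketch above */

#define SMALL_BLOCK_MAX 256            /* assumed cutoff */

static void *access_block(uint64_t ea, uint32_t block_size)
{
    if (block_size <= SMALL_BLOCK_MAX)
        return swc_lookup(ea);         /* low hit latency, reuse-friendly */
    return fetch_block(ea);            /* bulk DMA, one transfer per block */
}
```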
Prefetching
• Large data sets with poor locality
• Techniques to hide memory latency
  • Boehm: uses the mark stack for prefetching
  • CHV and Capitulative Loads: use 8-16-entry FIFO buffers
  • CHV: exploits parallel branches, departing from strict DFS traversal
  • Capitulative Loads: change access ordering; a demand load on a cache hit, a prefetch on a miss (sketched below)
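A minimal sketch of the capitulative-load policy: process an object immediately if it is resident, otherwise issue a prefetch and defer it to a small FIFO, revisiting it once the data has had time to arrive. The residency check and prefetch hook are hypothetical stand-ins for whatever the platform provides.

```c
#include <stdbool.h>

#define FIFO_SIZE 16                 /* 8-16 entries, per the slide */
static void *fifo[FIFO_SIZE];
static unsigned head, tail, count;

bool is_resident(void *p);           /* hypothetical residency check */
void prefetch(void *p);              /* hypothetical prefetch hook */
void mark_object(void *p);           /* demand-process a resident object */

static void capitulative_mark(void *obj)
{
    if (is_resident(obj)) {
        mark_object(obj);            /* hit: plain demand load */
        return;
    }
    prefetch(obj);                   /* miss: capitulate and defer */
    if (count == FIFO_SIZE) {
        /* FIFO full: the oldest entry's prefetch has had the longest
         * time to complete, so drain it first. */
        mark_object(fifo[head]);
        head = (head + 1) % FIFO_SIZE;
        count--;
    }
    fifo[tail] = obj;
    tail = (tail + 1) % FIFO_SIZE;
    count++;
    /* remaining entries are drained when marking winds down (omitted) */
}
```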
Prefetching (cont.)
• Advantage of Cell: prefetch is initiated under application control by the DMA engine
  • efficient for regular, dense data sets (e.g., tiled matrices)
  • GC, with irregular data patterns and unpredictable locality, is the antithesis
• Prefetch heap blocks for pointer scanning
  • locality: addresses fall within heap-block bounds
Prefetching (cont.)
• Differences between prefetching on conventional processors and processors with a local store
  • addresses are binding
  • granularity of prefetching
  • cost of misspeculation
  • cost of DMA
• Early/late arrival, replacement
  • not enough work to overlap and hide the miss latency
  • suffers from pollution effects and overheads
• Result: a virtual tie
Data coherence and consistency management
• Synchronization between PPE and SPE is necessary
  • application and control code run on the PPE
  • the bulk of the mark phase runs on the SPE
• Data are copied, so consistency must be maintained on updates
  • header blocks
  • lookup indices
  • application heap
Data coherence and consistency management (cont.)
• Scheme based on data usage
  • SPE -> PPE transfers happen only when the SPE has completed its work
  • data synchronization is necessary for everything except the mark stack
• Coherence and synchronization handled via the MFC mailbox
  • to schedule a mark operation, the PPE sends a descriptor with part of the mark stack and parameters
  • the SPE sends the parameters back via DMA (as an ACK) and the mark phase begins (sketched below)
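A minimal sketch of the SPE side of this scheduling handshake, again assuming the Cell SDK intrinsics; the descriptor layout is an illustrative assumption, padded to a DMA-legal size.

```c
#include <spu_mfcio.h>
#include <stdint.h>

/* Illustrative descriptor layout, not the paper's actual format. */
typedef struct {
    uint64_t mark_stack_ea;   /* EA of the donated mark-stack slice */
    uint32_t num_entries;
    uint32_t params[4];
    uint32_t pad;             /* pad to a DMA-legal 32-byte size */
} gc_descriptor_t;

static volatile gc_descriptor_t desc __attribute__((aligned(128)));

static void receive_work(void)
{
    /* The PPE mails the 64-bit EA of the descriptor as two words. */
    uint32_t hi = spu_read_in_mbox();
    uint32_t lo = spu_read_in_mbox();
    uint64_t desc_ea = ((uint64_t)hi << 32) | lo;

    /* Pull the descriptor (mark-stack slice EA plus parameters). */
    mfc_get((void *)&desc, desc_ea, sizeof(desc), 3, 0, 0);
    mfc_write_tag_mask(1 << 3);
    mfc_read_tag_status_all();

    /* Echo the parameters back via DMA as the acknowledgment, then
     * the mark phase begins. */
    mfc_put((void *)&desc, desc_ea, sizeof(desc), 3, 0, 0);
    mfc_write_tag_mask(1 << 3);
    mfc_read_tag_status_all();
}
```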
Conclusions
• Data-reference-optimized strategy
  • performance gain of 400%-600%
  • local-store-based data prefetch
  • exploits the local-store environment
• Software-controlled cache for garbage collectors
• A viable and competitive solution
  • offload to a coprocessor, increase utilization
A Reactive Unobtrusive Prefetcher for Multicore and Manycore Architectures
Jason Mars • Daniel Williams • Dan Upton • Sudeep Ghosh • Kim Hazelwood
Software dynamic prefetchers
• Can identify complex patterns, but at high application overhead
• Unobtrusive prefetcher
  • takes advantage of a neighbor's underutilized cores
• Snooping, profiling, pattern detection, prefetching
URPref
• A neighboring idle core observes miss patterns
• Analyzes miss patterns
  • continuously profiles and adapts
• Uses Sequitur for pattern detection
• Performs prefetches
  • first identifies the prefix of a hot stream
• Reactive to high cache miss rates
• Targets pointer chasing (complex and difficult for hardware prefetchers)
URPref (cont.)
• Contributions
  • profile cache misses, detect patterns, prefetch
  • proposed hardware extensions (snooping, etc.)
  • pattern-based approach to detect miss patterns and adapt to phases
URPref support
• FIFO snoopy buffer
• Profiling on a separate core
• Two additional ISA instructions
• The OS must be aware of URPref
  • it can be resumed, easily halted, suspended, made to wait, or started late
• Linear-time pattern-detection algorithm over cache-miss streams
  • Sequitur
Detecting Hot Streams with Sequitur
• Hot cache-miss streams are characterized by
  • the length of the miss stream
  • its frequency in the input sample window
  • a small portion of streams often accounts for a large percentage of misses
• The snoopy buffer sends each new miss to Sequitur
  • determine whether it forms a prefix
  • on a match, fetch the remainder of the hot stream into the neighboring L1 cache
Detecting Hot Streams with Sequitur (cont.)
• Profile window: a series of cache misses
• Data miss stream: a sequence of data-cache miss addresses that repeats within a profile window
• Hot stream: a stream covering a given percentage of the profiling window
• Sequitur builds a context-free grammar over the cache-miss patterns
  • each production represents some sequence repeated more than once
Sequitur
• The hottest streams are used for prefetching
  • hotness = length x cold uses
  • cold uses: # of times the rule is used in the grammar
  • length: # of terminal symbols
• Detecting patterns over
  • actual cache addresses
  • deltas
Using Sequitur
• Determine deltas between adjacent misses
• hotness = length x cold uses (worked example below)
  • length: sum of the # of terminals under a given non-terminal
  • cold uses: # of times the non-terminal appears in the grammar
• Prefix: the first few symbols of a hot stream
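A tiny worked example of the metric, with illustrative names: a non-terminal that expands to 6 miss addresses and appears 5 times in the grammar scores 30, outranking one covering 10 misses seen only twice (20).

```c
#include <stdint.h>

typedef struct {
    uint32_t length;      /* # of terminals under this non-terminal */
    uint32_t cold_uses;   /* # of times it appears in the grammar   */
} rule_t;

static uint32_t hotness(const rule_t *r)
{
    return r->length * r->cold_uses;   /* hotness = length x cold uses */
}
/* e.g. {6, 5} -> 30 beats {10, 2} -> 20 when ranking streams. */
```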
Using Sequitur (cont.)
• Fill a window of misses from the snoopy buffer
  • dynamic window size -> normalizes the # of hot streams
• Keep track of the last 4 misses
• Active hot-stream table
  • receive misses from the snoopy buffer
  • create the prefix window
  • search the active hot-stream table
  • if a match is found, prefetch the stream (sketched below)
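A minimal sketch of the matching step under assumed table and stream sizes: slide each new miss into a 4-entry prefix window, compare it against the active hot-stream table, and prefetch the remainder on a match. `prefetch_line` is a hypothetical hook.

```c
#include <stdint.h>
#include <string.h>

#define PREFIX_LEN     4      /* last 4 misses, per the slide */
#define MAX_STREAMS    32     /* assumed table size */
#define MAX_STREAM_LEN 64     /* assumed stream-length cap */

typedef struct {
    uint64_t addrs[MAX_STREAM_LEN];
    uint32_t len;
} hot_stream_t;

static hot_stream_t table[MAX_STREAMS];
static uint32_t     num_streams;
static uint64_t     prefix[PREFIX_LEN];   /* oldest miss first */

void prefetch_line(uint64_t addr);        /* hypothetical prefetch hook */

static void on_miss(uint64_t addr)
{
    /* Slide the new miss into the prefix window. */
    memmove(prefix, prefix + 1, (PREFIX_LEN - 1) * sizeof prefix[0]);
    prefix[PREFIX_LEN - 1] = addr;

    /* Search the active hot-stream table for a matching prefix. */
    for (uint32_t s = 0; s < num_streams; s++) {
        if (table[s].len > PREFIX_LEN &&
            memcmp(table[s].addrs, prefix, sizeof prefix) == 0) {
            /* Match: prefetch the remainder of the hot stream. */
            for (uint32_t i = PREFIX_LEN; i < table[s].len; i++)
                prefetch_line(table[s].addrs[i]);
            break;
        }
    }
}
```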
Prefetching
• New hot streams are added to the table
  • hot streams that are no longer active are retired
• No hot streams found: the dynamic prefetcher remains dormant
• Hot streams become cold: the dynamic prefetcher stops
• Avoids a conservative approach
• Application code does not change
  • no prefetch instructions are added, so no overhead
• No synchronization between the prefetcher and the application core is needed
  • no latency constraint when the tail of a stream is prefetched
Prefetcher: Adaptive Response
• Phase changes
  • avoid cache pollution and incorrect prefetching
  • achieve maximum performance
• Continuous profiling (one window of latency)
  • prefetches use prefixes from the previous window
  • e.g., the stream ABCDEF in one window may shift to ABCGHI in the next