
Presentation Transcript


  1. Stampede: A Cluster Programming Middleware for Interactive Stream-oriented Applications
  Umakishore Ramachandran, Rishiyur Nikhil, James Matthew Rehg, Yavor Angelov, Arnab Paul, Sameer Adhikari, Kenneth Mackenzie, Nissim Harel, Kathleen Knobe
  IEEE Transactions on Parallel and Distributed Systems, November 2003

  2. Introduction
  New application domains: interactive vision, multimedia collaboration, animation
  • Interactive
  • Process temporal data
  • High computational requirements
  • Exhibit task & data parallelism
  • Dynamic – unpredictable at compile time
  Stampede: programming system to enable execution on SMPs/clusters
  • Support for task, data parallelism
  • Temporal data handling, buffer management
  • High-level data sharing: space-time memory

  3. Example: Smart Kiosk
  Public device for providing information, entertainment
  • Interacts with multiple people
  • Capable of initiating interaction
  • I/O: video cameras, microphones, touch screens, infrared, speakers, …

  4. Kiosk application characteristics
  Tasks have different computational requirements
  • Higher-level tasks may be more expensive
  • May not run as often – data dependent
  Multiple (heterogeneous) time-correlated data sets
  Tasks have different priorities, e.g., interacting with a customer vs. looking for new customers
  Input may not be accessed in strict order
  • e.g., skip all but the most recent data
  • May need to re-analyze earlier data
  Claim: streams and lists are not expressive enough

  5. Space-time memory
  Distributed shared data structures for temporal data
  • STM channel: random access
  • STM queue: FIFO access
  • STM register: cluster-wide shared variable
  Unique system-wide names
  Threads attach and detach dynamically
  Threads communicate only via STM

  6. STM channels

  7. STM channel API
  • Channels support bounded/unbounded size
  • Separate API for typed access; hooks for marshalling/unmarshalling
  • Timestamp wildcards: request the newest/oldest item in the channel, or the newest value not previously read
  • Get/put: blocking or nonblocking operation
  • Timestamps can be out of order
  • Copy-in, copy-out semantics
  • Get can be called on an item 0 to #connections times
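
The channel semantics above can be sketched as follows. This is a minimal Python simulation with hypothetical names; the real Stampede API is a C library and differs in detail.

```python
class Channel:
    """STM-channel-like semantics: items are keyed by timestamp, may
    arrive out of order, and may be read any number of times."""

    def __init__(self):
        self.items = {}  # timestamp -> value (copy-in, copy-out)

    def put(self, ts, value):
        self.items[ts] = value  # timestamps need not arrive in order

    def get(self, ts=None, newest=False, oldest=False):
        # Timestamp "wildcards": a specific timestamp, or the
        # newest/oldest item currently in the channel.
        if newest:
            ts = max(self.items)
        elif oldest:
            ts = min(self.items)
        return self.items[ts]

ch = Channel()
ch.put(3, "frame-3")
ch.put(1, "frame-1")          # out-of-order put is allowed
print(ch.get(newest=True))    # frame-3
print(ch.get(oldest=True))    # frame-1
```

Unlike a queue, a get does not remove the item, which is what lets multiple connections skip around or re-read earlier data.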

  8. STM queue
  Supports data parallelism
  Get/put behave as enqueue/dequeue
  • Get: items retrieved exactly once
  • Put: multiple items with the same timestamp can be added
  • Used for partitioning data items (e.g., regions in a frame)
  • Runtime adds a ticket for a unique id
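
The ticket mechanism can be illustrated with a short Python sketch (hypothetical names, not the actual C API): two items sharing a timestamp stay distinguishable because the runtime pairs the timestamp with a ticket.

```python
from collections import deque

class Queue:
    """STM-queue-like semantics: FIFO, exactly-once retrieval; the
    runtime appends a ticket so items with equal timestamps get
    unique ids."""

    def __init__(self):
        self.q = deque()
        self.ticket = 0

    def put(self, ts, value):
        self.q.append(((ts, self.ticket), value))  # (timestamp, ticket) is unique
        self.ticket += 1

    def get(self):
        return self.q.popleft()  # dequeue: each item retrieved exactly once

q = Queue()
q.put(7, "region-A")   # two regions of the same frame share timestamp 7
q.put(7, "region-B")
print(q.get())         # ((7, 0), 'region-A')
print(q.get())         # ((7, 1), 'region-B')
```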

  9. Garbage collection
  How to determine whether an STM item is no longer needed? The consume API call indicates this for a connection.
  Queues
  • Items have an implicit reference count of 1
  • GC after consume
  Channels
  • Number of consumers is unknown
  • Threads can skip items
  • New connections can be created dynamically
  • Reachability is determined via timestamps
  • GC an item if it cannot be accessed by any current or future connection
  • System: an item is not GCed until marked consumed by all connections
  • Application: must mark each item consumed (can mark timestamp ranges)
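
The channel-side consume rule can be sketched in a few lines of Python (hypothetical names; illustrative only): an item becomes collectible only once every attached connection has marked it consumed.

```python
class ChannelItem:
    """An STM channel item becomes garbage only after every attached
    connection has issued consume on it."""

    def __init__(self, value, num_connections):
        self.value = value
        self.unconsumed = set(range(num_connections))

    def consume(self, conn_id):
        self.unconsumed.discard(conn_id)

    def collectible(self):
        return not self.unconsumed  # consumed on every connection

item = ChannelItem("frame-5", num_connections=2)
item.consume(0)
print(item.collectible())  # False: connection 1 has not consumed it yet
item.consume(1)
print(item.collectible())  # True: safe to garbage collect
```

A queue item is the degenerate case with a single implicit consumer, so one get/consume makes it collectible.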

  10. GC and timestamps
  Threads propagate input timestamps to output; threads at a data source (e.g., a camera) generate timestamps
  Virtual time: per thread, application specific (e.g., frame number)
  Visibility: per thread, the minimum of its virtual time and the item timestamps from all of its connections
  • Put: item timestamp >= visibility
  • Create thread: child virtual time >= visibility
  • Attach: items < visibility are implicitly consumed
  • Set virtual time: any value >= visibility; may be infinity, or the thread must guarantee advancement
  Global minimum timestamp, ts_min, is the minimum of:
  • Virtual time of all threads
  • Timestamps of items on all queues
  • Timestamps of unconsumed items on all input connections of all channels
  Items with timestamps < ts_min can be garbage collected
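
The ts_min rule above reduces to a global minimum over three sets of timestamps. A Python sketch with illustrative values (not the actual runtime's data structures):

```python
def ts_min(thread_virtual_times, queue_item_timestamps,
           unconsumed_channel_timestamps):
    """Global minimum timestamp: items with timestamp < ts_min are
    unreachable by any current or future connection and can be GCed."""
    return min(thread_virtual_times
               + queue_item_timestamps
               + unconsumed_channel_timestamps)

# Two threads at virtual times 10 and 12, a queue item at 11, and
# unconsumed channel items at 9 and 14:
print(ts_min([10, 12], [11], [9, 14]))  # 9 -> items with ts < 9 are garbage
```

The visibility rules on the slide (puts, thread creation, attaches all bounded below by visibility) are what guarantee ts_min only advances, so the collector never reclaims an item a thread could still reach.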

  11. Code samples

  12. Application: color-based tracking
  People tracker for the Smart Kiosk
  • Track multiple moving targets based on color
  • Goals: low latency, keep up with frame rate
  [Figure: tracker task graphs, Model 1 and Model 2]

  13. Mapping to Stampede
  Expected bottleneck: target detection
  Data parallelize by color models and frame regions (horizontal stripes)
  Placement on cluster
  • 1 node: all threads except the inner DPS
  • N nodes: 1 inner DPS each

  14. Color tracking results
  Setup: 17-node cluster (Dell 8450s)
  • 8 CPUs/node: 550 MHz P3 Xeon
  • 4 GB memory/node, 2 MB L2 cache/CPU
  • Gigabit Ethernet; OS: Linux
  • Stampede used CLF messaging
  Data: 1 MB/frame @ 30 fps, 8 models
  Bottleneck was the histogram thread

  15. Application: video textures
  Batch video processing: generate a video loop from a set of frames
  • Randomly transition between computed cut points, or create a loop of specified length
  • Calculate the best places to cut – pairwise frame comparison
  • Comparisons are independent – lots of parallelism
  • Problem: data distribution – don't send every frame everywhere

  16. Mapping to Stampede
  [Figure: mapping of threads to cluster nodes]

  17. Decentralized data distribution
  "Tiling with chaining" fetches a subset of the images and reuses them, whereas single-source distribution fetches all images.

  18. Stripe size experiment
  Tune image comparison for the L2 cache size
  • Compare image regions rather than whole images
  • Find the stripe size (#rows) such that comparisons fit in cache
  • Measure single-node speedup as a function of stripe size and number of worker threads
  Setup: cluster as before
  Data: 316 frames, 640x480, 24-bit color (~900 KB/frame)
  Comparisons = N(N-1)/2 = 49770
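
The comparison count follows directly from the pairwise formula; a quick check:

```python
# Pairwise frame comparisons for N frames: every unordered pair is
# compared once, giving N(N-1)/2 comparisons.
N = 316
comparisons = N * (N - 1) // 2
print(comparisons)  # 49770, matching the slide
```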

  19. Stripe size results
  [Chart: whole-image comparison time (seconds); memory bottleneck]

  20. Data distribution experiment
  Single-source vs. decentralized data distribution
  • Measure speedup as a function of nodes and threads/node
  • Tile size varies with the number of nodes
  • Larger tiles: better compute/communication ratio
  • Smaller tiles: better load balancing
  • Compare to algorithm-limited speedup (no communication costs; shows the effect of load imbalances)
  Setup: as before; full image comparisons

  21. Data distribution results
  Single source becomes a bottleneck – as the number of nodes increases, communication time exceeds computation time
  1-thread vs. 8-thread performance: communication for the initial tile fetch has no computation overlap
