
Presentation Transcript


  1. Stampede: A Cluster Programming Middleware for Interactive Stream-oriented Applications
  Umakishore Ramachandran, Rishiyur Nikhil, James Matthew Rehg, Yavor Angelov, Arnab Paul, Sameer Adhikari, Kenneth Mackenzie, Nissim Harel, Kathleen Knobe
  IEEE Transactions on Parallel and Distributed Systems, November 2003

  2. Introduction
  New application domains: interactive vision, multimedia collaboration, animation
  • Interactive
  • Process temporal data
  • High computational requirements
  • Exhibit task & data parallelism
  • Dynamic – unpredictable at compile time
  Stampede: programming system to enable execution on SMPs/clusters
  • Support for task, data parallelism
  • Temporal data handling, buffer management
  • High-level data sharing: space-time memory

  3. Example: Smart Kiosk
  Public device for providing information, entertainment
  • Interacts with multiple people
  • Capable of initiating interaction
  • I/O: video cameras, microphones, touch screens, infrared, speakers, …

  4. Kiosk application characteristics
  Tasks have different computational requirements
  • Higher-level tasks may be more expensive
  • May not run as often – data dependent
  Multiple (heterogeneous) time-correlated data sets
  Tasks have different priorities, e.g., interacting with a customer vs. looking for new customers
  Input may not be accessed in strict order
  • e.g., skip all but the most recent data
  • May need to re-analyze earlier data
  Claim: streams and lists are not expressive enough

  5. Space-time memory
  Distributed shared data structures for temporal data
  • STM channel: random access
  • STM queue: FIFO access
  • STM register: cluster-wide shared variable
  Unique system-wide names
  Threads attach and detach dynamically
  Threads communicate only via STM

  6. STM channels

  7. STM channel API
  • Channels support bounded/unbounded size
  • Separate API for typed access; hooks for marshalling/unmarshalling
  • Timestamp wildcards: request the newest/oldest item in the channel, or the newest value not previously read
  • Get/put: blocking or nonblocking operation
  • Timestamps can be out of order
  • Copy-in, copy-out semantics
  • Get can be called on an item 0 to #connections times
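
The channel semantics above can be sketched as follows. This is a minimal Python simulation with hypothetical names; the real Stampede API is a C library and differs in detail.

```python
class Channel:
    """STM-channel-like semantics: items are keyed by timestamp, may
    arrive out of order, and may be read any number of times."""

    def __init__(self):
        self.items = {}  # timestamp -> value (copy-in, copy-out)

    def put(self, ts, value):
        self.items[ts] = value  # timestamps need not arrive in order

    def get(self, ts=None, newest=False, oldest=False):
        # Timestamp "wildcards": a specific timestamp, or the
        # newest/oldest item currently in the channel.
        if newest:
            ts = max(self.items)
        elif oldest:
            ts = min(self.items)
        return self.items[ts]

ch = Channel()
ch.put(3, "frame-3")
ch.put(1, "frame-1")          # out-of-order put is allowed
print(ch.get(newest=True))    # frame-3
print(ch.get(oldest=True))    # frame-1
```

Unlike a queue, a get does not remove the item, which is what lets multiple connections skip around or re-read earlier data.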

  8. STM queue
  Supports data parallelism
  Get/put behave as enqueue/dequeue
  • Get: items retrieved exactly once
  • Put: multiple items with the same timestamp can be added
  • Used for partitioning data items (e.g., regions in a frame)
  • Runtime adds a ticket for a unique id
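
The ticket mechanism can be illustrated with a short Python sketch (hypothetical names, not the actual C API): two items sharing a timestamp stay distinguishable because the runtime pairs the timestamp with a ticket.

```python
from collections import deque

class Queue:
    """STM-queue-like semantics: FIFO, exactly-once retrieval; the
    runtime appends a ticket so items with equal timestamps get
    unique ids."""

    def __init__(self):
        self.q = deque()
        self.ticket = 0

    def put(self, ts, value):
        self.q.append(((ts, self.ticket), value))  # (timestamp, ticket) is unique
        self.ticket += 1

    def get(self):
        return self.q.popleft()  # dequeue: each item retrieved exactly once

q = Queue()
q.put(7, "region-A")   # two regions of the same frame share timestamp 7
q.put(7, "region-B")
print(q.get())         # ((7, 0), 'region-A')
print(q.get())         # ((7, 1), 'region-B')
```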

  9. Garbage collection
  How to determine whether an STM item is no longer needed? The consume API call indicates this for a connection.
  Queues
  • Items have an implicit reference count of 1
  • GC after consume
  Channels
  • Number of consumers is unknown
  • Threads can skip items
  • New connections can be created dynamically
  • Reachability is determined via timestamps
  • GC an item if it cannot be accessed by any current or future connection
  • System: an item is not GCed until marked consumed by all connections
  • Application: must mark each item consumed (can mark timestamp ranges)
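
The channel-side consume rule can be sketched in a few lines of Python (hypothetical names; illustrative only): an item becomes collectible only once every attached connection has marked it consumed.

```python
class ChannelItem:
    """An STM channel item becomes garbage only after every attached
    connection has issued consume on it."""

    def __init__(self, value, num_connections):
        self.value = value
        self.unconsumed = set(range(num_connections))

    def consume(self, conn_id):
        self.unconsumed.discard(conn_id)

    def collectible(self):
        return not self.unconsumed  # consumed on every connection

item = ChannelItem("frame-5", num_connections=2)
item.consume(0)
print(item.collectible())  # False: connection 1 has not consumed it yet
item.consume(1)
print(item.collectible())  # True: safe to garbage collect
```

A queue item is the degenerate case with a single implicit consumer, so one get/consume makes it collectible.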

  10. GC and timestamps
  Threads propagate input timestamps to output; threads at a data source (e.g., a camera) generate timestamps
  Virtual time: per thread, application specific (e.g., frame number)
  Visibility: per thread, the minimum of its virtual time and the item timestamps from all of its connections
  • Put: item timestamp >= visibility
  • Create thread: child virtual time >= visibility
  • Attach: items < visibility are implicitly consumed
  • Set virtual time: any value >= visibility; may be infinity, or the thread must guarantee advancement
  Global minimum timestamp, ts_min, is the minimum of:
  • Virtual time of all threads
  • Timestamps of items on all queues
  • Timestamps of unconsumed items on all input connections of all channels
  Items with timestamps < ts_min can be garbage collected
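
The ts_min rule above reduces to a global minimum over three sets of timestamps. A Python sketch with illustrative values (not the actual runtime's data structures):

```python
def ts_min(thread_virtual_times, queue_item_timestamps,
           unconsumed_channel_timestamps):
    """Global minimum timestamp: items with timestamp < ts_min are
    unreachable by any current or future connection and can be GCed."""
    return min(thread_virtual_times
               + queue_item_timestamps
               + unconsumed_channel_timestamps)

# Two threads at virtual times 10 and 12, a queue item at 11, and
# unconsumed channel items at 9 and 14:
print(ts_min([10, 12], [11], [9, 14]))  # 9 -> items with ts < 9 are garbage
```

The visibility rules on the slide (puts, thread creation, attaches all bounded below by visibility) are what guarantee ts_min only advances, so the collector never reclaims an item a thread could still reach.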

  11. Code samples

  12. Application: color-based tracking
  People tracker for the Smart Kiosk
  • Track multiple moving targets based on color
  • Goals: low latency, keep up with frame rate
  [Figure: tracker task graphs, Model 1 and Model 2]

  13. Mapping to Stampede
  Expected bottleneck: target detection
  Data parallelize by color models and frame regions (horizontal stripes)
  Placement on cluster
  • 1 node: all threads except the inner DPS
  • N nodes: 1 inner DPS each

  14. Color tracking results
  Setup: 17-node cluster (Dell 8450s)
  • 8 CPUs/node: 550 MHz P3 Xeon
  • 4 GB memory/node, 2 MB L2 cache/CPU
  • Gigabit Ethernet; OS: Linux
  • Stampede used CLF messaging
  Data: 1 MB/frame @ 30 fps, 8 models
  Bottleneck was the histogram thread

  15. Application: video textures
  Batch video processing: generate a video loop from a set of frames
  • Randomly transition between computed cut points, or create a loop of specified length
  • Calculate the best places to cut – pairwise frame comparison
  • Comparisons are independent – lots of parallelism
  • Problem: data distribution – don't send every frame everywhere

  16. Mapping to Stampede
  [Figure: mapping of threads to cluster nodes]

  17. Decentralized data distribution
  "Tiling with chaining" fetches a subset of the images and reuses them, whereas single-source distribution fetches all images.

  18. Stripe size experiment
  Tune image comparison for the L2 cache size
  • Compare image regions rather than whole images
  • Find the stripe size (#rows) such that comparisons fit in cache
  • Measure single-node speedup as a function of stripe size and number of worker threads
  Setup: cluster as before
  Data: 316 frames, 640x480, 24-bit color (~900 KB/frame)
  Comparisons = N(N-1)/2 = 49770
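
The comparison count follows directly from the pairwise formula; a quick check:

```python
# Pairwise frame comparisons for N frames: every unordered pair is
# compared once, giving N(N-1)/2 comparisons.
N = 316
comparisons = N * (N - 1) // 2
print(comparisons)  # 49770, matching the slide
```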

  19. Stripe size results
  [Chart: whole-image comparison time (seconds); memory bottleneck]

  20. Data distribution experiment
  Single-source vs. decentralized data distribution
  • Measure speedup as a function of nodes and threads/node
  • Tile size varies with the number of nodes
  • Larger tiles: better compute/communication ratio
  • Smaller tiles: better load balancing
  • Compare to algorithm-limited speedup (no communication costs; shows the effect of load imbalances)
  Setup: as before; full image comparisons

  21. Data distribution results
  Single source becomes a bottleneck – as the number of nodes increases, communication time exceeds computation time
  1-thread vs. 8-thread performance: communication for the initial tile fetch has no computation overlap
