Scheduling Streaming Computations Kunal Agrawal
The Streaming Model • Computation is represented by a directed graph: • Nodes: Computation Modules. • Edges: FIFO Channels between nodes. • Infinite input stream. • We only consider acyclic graphs (dags). • When modules fire, they consume data from incoming channels and produce data on outgoing channels.
Cache-Conscious Scheduling of Streaming Applications • Goal: Schedule the computation to minimize the number of cache misses on a sequential machine. with Jeremy T. Fineman, Jordan Krage, Charles E. Leiserson, and Sivan Toledo
Disk Access Model [Diagram: a CPU backed by a cache of M/B blocks, each of size B, in front of slow memory.] • The cache has M/B blocks, each of size B. • Cost = number of cache misses. • If the CPU accesses data in cache, the cost is 0. • If the CPU accesses data not in cache, there is a cache miss of cost 1, and the block containing the requested data is read into cache. • If the cache is full, some block is evicted from cache to make room for the new block.
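To make the cost model concrete, here is a minimal sketch (all names hypothetical) of a cache-miss counter in this model, using LRU as one plausible eviction policy (the model itself only says that some block is evicted):

```python
from collections import OrderedDict

def count_misses(accesses, M, B):
    """Count cache misses in the disk-access model.

    accesses: iterable of memory addresses touched by the CPU.
    M: cache size in words; B: block size in words.
    Uses LRU eviction (one plausible policy; the model leaves
    the replacement policy unspecified).
    """
    cache = OrderedDict()              # block id -> None, in LRU order
    capacity = M // B                  # the cache holds M/B blocks
    misses = 0
    for addr in accesses:
        block = addr // B              # block containing the address
        if block in cache:
            cache.move_to_end(block)   # hit: cost 0, refresh LRU order
        else:
            misses += 1                # miss: cost 1, fetch the block
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least-recently-used block
            cache[block] = None
    return misses
```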
Contributions • The problem of minimizing cache misses is reduced to a problem of graph partitioning. • Theorem: If the optimal algorithm has X cache misses given a cache of size M, there exists a partitioned schedule that incurs O(X) cache misses given a cache of size O(M). • In other words, some partitioned schedule is O(1) competitive given O(1) memory augmentation.
Outline • Cache Conscious Scheduling • Streaming Application Model • The Sources of Cache Misses and Intuition Behind Partitioning • Proof Intuition • Thoughts • Deadlock Avoidance • Model and Source of Deadlocks • Deadlock Avoidance Using Dummy Items. • Thoughts
Streaming Applications [Diagram: an example dag of four modules a, b, c, d with state sizes s:60, s:20, s:40, s:35; each edge is annotated with its consumption rate i and production rate o (e.g., i:4, o:2).] When a module v fires, it • must load its state s(v), • consumes i(u,v) items from each incoming edge (u,v), and • produces o(v,w) items on each outgoing edge (v,w). • Assumptions: • All items are unit sized. • The source consumes 1 item each time it fires. • Input/output rates and state sizes are known. • The state size of each module is at most M.
Definition: Gain [Diagram: the same example dag, annotated with vertex gains such as gain 1/2 and gain 1.] • Vertex Gain: The number of firings of vertex u per source firing: gain(u) = ∏_{(x,y)∈p} o(x,y)/i(x,y), where p is a path from the source s to u. • Edge Gain: The number of items produced on edge (u,v) per source firing: gain(u,v) = gain(u) · o(u,v). • A graph is well-formed iff all gains are well-defined, i.e., independent of the chosen path.
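As an illustration, the gains can be computed by propagating each edge's o/i ratio outward from the source. A minimal sketch (hypothetical names; assumes the dag is well-formed, so any path gives the same answer):

```python
from collections import defaultdict

def compute_gains(edges, source):
    """Compute vertex and edge gains of a well-formed streaming dag.

    edges: dict mapping (u, v) -> (i, o), the consumption rate i(u,v)
           and production rate o(u,v) of channel (u, v).
    Returns (vertex_gain, edge_gain).
    """
    succs = defaultdict(list)
    for (u, v) in edges:
        succs[u].append(v)

    vertex_gain = {source: 1.0}    # the source fires once per input item
    stack = [source]
    while stack:
        u = stack.pop()
        for v in succs[u]:
            i, o = edges[(u, v)]
            g = vertex_gain[u] * o / i   # v fires o/i times per firing of u
            if v not in vertex_gain:     # well-formed: any path agrees
                vertex_gain[v] = g
                stack.append(v)

    edge_gain = {(u, v): vertex_gain[u] * edges[(u, v)][1]
                 for (u, v) in edges}
    return vertex_gain, edge_gain
```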
Outline • Cache Conscious Scheduling • Streaming Application Model • The Sources of Cache Misses and Intuition Behind Partitioning • Proof Intuition • Thoughts • Deadlock Avoidance • Model and Source of Deadlocks • Deadlock Avoidance Using Dummy Items. • Thoughts
Cache Misses Due to State Load [Diagram: the example graph with state sizes 60, 20, 40, 35 and a cache with B:1, M:100.] • Strategy: Push each item all the way through the graph. • Cost per input item: the sum of the state sizes (60 + 20 + 40 + 35 = 155 in the example), since every module's state is reloaded for every item. • Idea: Reuse the state once it is loaded.
Cache Misses Due to Data Items [Diagram: the same example graph with large buffers inserted on every channel; B:1, M:100.] • Strategy: Once a module's state is loaded, execute the module many times by adding large buffers between modules. • Cost per input item: the total number of items produced on all channels per input item, i.e., Σ_e gain(e), since every buffered item is written to and later read back from slow memory.
Partitioning: Reduce Cache Misses [Diagram: the example graph split into segments that fit in cache, with buffers only on the edge crossing the cut; B:1, M:100.] • Strategy: Partition into segments that fit in cache, and only add buffers on cross edges C --- edges that go between partitions. • Cost per input item: proportional to Σ_{e∈C} gain(e), since only items crossing the cut go through slow memory.
Which Partition? [Diagram: two alternative cuts of the example graph; B:1, M:100.] • Strategy: Partition into segments that fit in cache, and only add buffers on cross edges C --- edges that go between partitions. • Cost per input item: Σ_{e∈C} gain(e), which depends on which edges are cut. • Lesson: Cut small-gain edges.
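Following this cost model, choosing a partition amounts to comparing the total gain of the cut edges. A minimal sketch (hypothetical names; edge_gain as computed in the sketch above):

```python
def partition_cost_per_item(edge_gain, cross_edges):
    """Cost per input item of a partitioned schedule in the model above:
    only items crossing the cut are written to and read back from slow
    memory (state reloads amortize against the large buffers)."""
    return sum(edge_gain[e] for e in cross_edges)

def best_cut(edge_gain, candidate_cuts):
    """Pick the candidate cut (a set of cross edges) with minimum cost."""
    return min(candidate_cuts,
               key=lambda C: partition_cost_per_item(edge_gain, C))
```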
Outline • Cache Conscious Scheduling • Streaming Application Model • The Sources of Cache Misses and Intuition Behind Partitioning • Proof Intuition • Thoughts • Deadlock Avoidance • Model and Source of Deadlocks • Deadlock Avoidance Using Dummy Items. • Thoughts
Is Partitioning Good? • Show that the optimal scheduler cannot do much better than the best partitioned scheduler. • Theorem: On processing T items, if the optimal algorithm given an M-sized cache has X cache misses, then some partitioning algorithm given an O(M) cache has at most O(X) cache misses. • The number of cache misses of a partitioned scheduler is O(T (1 + Σ_{e∈C} gain(e))), so the best partitioned scheduler should minimize Σ_{e∈C} gain(e). • We must prove the matching lower bound on the optimal scheduler's cache misses.
Optimal Scheduler With Cache M [Diagram: a segment S containing nodes u and v joined by edge e.] • S: a segment with state size at least 2M. • e = gm(S): the edge with the minimum gain within S. • Consider a window in which u fires X times. • Case 1: At least 1 item produced by u during the window is processed by v. Since the state of S does not fit in cache, cost ≥ M. • Case 2: All items are buffered within S. The cheapest place to buffer is at e, the minimum-gain edge; X firings of u push X · gain(e)/gain(u) items across e, so cost ≥ X · gain(e)/gain(u). • If X ≥ M · gain(u)/gain(e), this is again cost ≥ M. • In both cases, taking windows of X = M · gain(u)/gain(e) firings, the cost per firing of u is Ω(M/X) = Ω(gain(e)/gain(u)).
Lower Bound [Diagram: the pipeline divided into segments; segment S_i runs from u_i to v_i and contains edges e_1, …, e_k, among them its minimum-gain edge e_i = gm(S_i).] • Divide the pipeline into segments of state size between 2M and 3M. • The source node fires T times. • Consider the optimal scheduler with an M cache. • Number of firings of u_i: T · gain(u_i). • Cost due to S_i per firing of u_i: Ω(gain(e_i)/gain(u_i)). • Total cost due to S_i: Ω(T · gain(e_i)). • Total cost over all segments: Ω(T · Σ_i gain(e_i)).
Matching Upper Bound [Diagram: the same segmented pipeline.] • Divide the pipeline into segments of state size between 2M and 3M. • The source node fires T times. • Cost of the optimal scheduler with an M cache: Ω(T · Σ_i gain(e_i)). • Consider the partitioned schedule that cuts every minimum-gain edge e_i. • Each resulting component has size at most 6M. • The total cost of that schedule is O(T · Σ_i gain(e_i) + T). • Therefore, given constant-factor memory augmentation, this partitioned schedule is constant-competitive in the number of cache misses.
Generalization to DAG • Say we partition a DAG such that • Each component has size at most O(M). • When contracted, the components form a dag.
Generalization to DAG • Say we partition a DAG such that • each component has size at most O(M), • when contracted, the components form a dag, and • Σ_{e∈C} gain(e), where C is the set of cross edges, is minimized over all such partitions. • The optimal schedule has cost/item Ω(Σ_{e∈C} gain(e)). • Given constant-factor memory augmentation, a partitioned schedule has cost/item O(1 + Σ_{e∈C} gain(e)).
When B ≠ 1 • Lower Bound: The optimal algorithm has cost Ω((T/B)(1 + Σ_{e∈C} gain(e))). • Upper Bound: With constant-factor memory augmentation, a partitioned schedule has cost O((T/B)(1 + Σ_{e∈C} gain(e))) in the following cases: • Pipelines: The upper bound matches the lower bound. • DAGs: The upper bound matches the lower bound as long as each component of the partition has O(M/B) incident cross edges.
Finding A Good Partition • For pipelines, we can find a good-enough partition greedily, and the best partition using dynamic programming (see the sketch below). • For general DAGs, finding the best partition is NP-complete. • Our proof is approximation-preserving: an approximation algorithm for the partitioning problem also works for our problem.
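For pipelines, the dynamic program is short. A minimal sketch (hypothetical names; assumes the cost model above with unit block size, and that each segment's total state must fit in a cache of size M):

```python
import math

def best_pipeline_partition(state, edge_gain, M):
    """Optimal partition of a pipeline v_0 -> v_1 -> ... -> v_{n-1}.

    state[j]:     state size of module v_j.
    edge_gain[j]: gain of edge (v_{j-1}, v_j), for j >= 1
                  (edge_gain[0] is unused).
    Minimizes the total gain of cut edges subject to every segment's
    state fitting in M. Returns (cost, cut_positions), where a cut at
    position k means edge (v_{k-1}, v_k) is a cross edge. O(n^2) time.
    """
    n = len(state)
    # best[j] = (min total cut gain for prefix v_0..v_{j-1}, cuts used)
    best = [(math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for j in range(1, n + 1):
        seg = 0.0
        # The last segment is v_k..v_{j-1}; grow it leftward from v_{j-1}.
        for k in range(j - 1, -1, -1):
            seg += state[k]
            if seg > M:
                break                        # segment no longer fits in cache
            cut_cost = edge_gain[k] if k > 0 else 0.0
            cand = best[k][0] + cut_cost     # pay for cutting edge into v_k
            if cand < best[j][0]:
                best[j] = (cand, best[k][1] + ([k] if k > 0 else []))
    return best[n]
```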
Conclusions and Future Work • We can reduce the problem of minimizing cache misses to the problem of calculating the best partition. • Solving the partitioning problem: • Approximation algorithms. • Exact solution for special cases such as SP-DAGs. • Space bounds: Bound the buffer sizes on cross edges. • Cache-conscious scheduling for multicores.
Deadlock Avoidance for Streaming Computations with Filtering • Goal: Devise mechanisms to avoid deadlocks in applications with filtering and finite buffers. with Peng Li, Jeremy Buhler, and Roger D. Chamberlain
Outline • Cache Conscious Scheduling • Streaming Application Model • The Sources of Cache Misses and Intuition Behind Partitioning • Proof Intuition • Thoughts • Deadlock Avoidance • Model and Source of Deadlocks • Deadlock Avoidance Using Dummy Items. • Thoughts
Filtering Applications Model [Diagram: a dag of nodes U, A, B, C, X, Y whose channels carry items tagged with indices 1, 2, 3; the legend marks 'an item with index 1'.] • Data-dependent filtering: The number of items produced depends on the data. • When a node fires, it • has a compute index (CI), which monotonically increases, • consumes/produces 0 or 1 items from/on its input/output channels, and • all of its input/output items must have index = CI. • A node cannot proceed until it is sure that it has received all items of its current CI. • Channels can have unbounded delays.
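To make the firing rule concrete, here is a minimal sketch (the names and the channel API are hypothetical) of a node's firing loop in this model. The key point is that a node must be able to distinguish "the item for my CI was filtered upstream" from "the item for my CI has not arrived yet":

```python
def fire(node):
    """One firing of a node at its current compute index (CI).

    Hypothetical API: each input channel exposes peek_index(), the
    index of the next item it will deliver. Because indices on a
    channel only increase, seeing an index greater than CI proves
    that the item for CI was filtered and will never arrive.
    """
    ci = node.compute_index
    inputs = []
    for chan in node.in_channels:
        while chan.peek_index() < ci:
            chan.wait()                    # item for ci may still arrive
        if chan.peek_index() == ci:
            inputs.append(chan.pop())      # consume the item for index ci
        # else: index > ci, so the item for ci was filtered upstream
    outputs = node.compute(ci, inputs)     # data-dependent: may filter
    for chan, item in outputs:
        chan.push(ci, item)                # 0 or 1 items per output channel
    node.compute_index = ci + 1            # CI monotonically increases
```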
A Deadlock Demo • Filtering can cause deadlocks due to finite buffers. [Diagram: a deadlock example on the diamond u → v → x, u → w → x with channel buffer size 3: the channels along the v branch are full while the channels along the w branch are empty, so no node can make progress.]
Contributions • A deadlock-avoidance mechanism using dummy (heartbeat) messages sent at regular intervals: • Provably correct --- guarantees deadlock freedom. • No global synchronization. • No dynamic buffer resizing. • Efficient algorithms to compute dummy intervals for structured DAGs such as series-parallel DAGs and CS4 DAGs.
Outline • Cache Conscious Scheduling • Streaming Application Model • The Sources of Cache Misses and Intuition Behind Partitioning • Proof Intuition • Thoughts • Deadlock Avoidance • Model and Source of Deadlocks • Deadlock Avoidance Using Dummy Items. • Thoughts
The Naïve Algorithm • Filtering Theorem: If no node ever filters any token, then the system cannot deadlock. • The Naïve Algorithm: • sends a dummy in place of every filtered item, and thereby • changes a filtering system into a non-filtering system. [Diagram: node u sends each output channel either a token or a dummy at every index; the legend marks 'a token with index 1' and 'a dummy with index 1'.]
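A minimal sketch (hypothetical names) of the naïve rule: every output channel receives either a token or a dummy at every compute index, so no filtering is ever visible downstream:

```python
def emit_naive(node, ci, inputs):
    """Naive dummy rule: a dummy replaces every filtered item, so every
    channel carries exactly one token or dummy per compute index."""
    produced = dict(node.compute(ci, inputs))   # channel -> item, if any
    for chan in node.out_channels:
        if chan in produced:
            chan.push(ci, produced[chan])       # real token for index ci
        else:
            chan.push_dummy(ci)                 # filtered: send a dummy
```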
Comments on the Naïve Algorithm • Pros: • Easy to schedule dummy items. • Cons: • Doesn't utilize channel buffer sizes. • Sends many unnecessary dummy items, wasting both computation and bandwidth. • Next step: reduce the number of dummy items.
The Propagation Algorithm • Computes a static dummy schedule: dummies are sent periodically, based on precomputed per-channel dummy intervals. • Dummy items must be propagated to all downstream nodes. [Diagram: the diamond example; each channel is annotated with a pair (dummy interval, channel buffer size), such as (3, 8), (3, ∞), (4, 6), (4, ∞). One node is at compute index 6 and last sent a dummy at index 0; since 6 − 0 ≥ 6, it sends a dummy.]
Comments on the Propagation Algorithm • Pros: • Takes advantage of channel buffer sizes. • Greatly reduces the number of dummy items compared to the Naïve Algorithm. • Cons: • Does not utilize filtering history. • Dummy items must be propagated. • Next step: eliminate propagation by • using shorter dummy intervals, and • using filtering history for dummy scheduling.
The Non-Propagation Algorithm • Sends dummy items based on filtering history. • Dummy items do not propagate. • Rule: if (index of filtered item − index of previous token/dummy) ≥ dummy interval, send a dummy. [Diagram: the diamond example with per-channel (dummy interval, buffer size) annotations such as (3, 4) and (4, 3). On a channel with dummy interval 3, data is filtered at index 3 and the previous token/dummy had index 0; since 3 − 0 ≥ 3, a dummy is sent.]
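A minimal sketch (hypothetical names) of the non-propagation rule as per-channel state. In contrast to the propagation algorithm, a dummy is sent only when filtering opens a gap of at least the dummy interval, and it is never forwarded downstream:

```python
class OutChannel:
    """Per-channel state for the non-propagation dummy rule (sketch)."""

    def __init__(self, interval):
        self.interval = interval      # precomputed dummy interval
        self.last_index = 0           # index of previous token or dummy

    def send(self, index, item):
        """Send a real token with the given compute index."""
        # ... enqueue (index, item) on the channel ...
        self.last_index = index

    def filtered(self, index):
        """Called when the item for `index` is filtered on this channel."""
        if index - self.last_index >= self.interval:
            # ... enqueue a dummy carrying `index`; the receiver consumes
            # it like a token but does not forward it downstream ...
            self.last_index = index
```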
Comparison of the Algorithms • Performance measure: the number of dummies sent; fewer dummies are better. • The Non-Propagation Algorithm is expected to be the best in most cases. • Experimental data: Mercury BLASTN (a biological application), 787 billion input elements.
How Do We Compute These Intervals? • Exponential-time algorithms for general DAGs, since we have to enumerate cycles. • Can we do better for structured DAGs? Yes: • polynomial-time algorithms for SP (series-parallel) DAGs, and • polynomial-time algorithms for CS4 DAGs --- a class of DAGs in which every undirected cycle has a single source and a single sink.
Conclusions and Future Work • Designed efficient deadlock-avoidance algorithms using dummy messages. • Future work: • Find polynomial-time algorithms to compute dummy intervals for general DAGs. • Consider more general models: multiple outputs per input, and feedback loops. • The reverse problem: compute efficient buffer sizes from given dummy intervals.