
Scheduling Streaming Computations


Presentation Transcript


  1. Scheduling Streaming Computations Kunal Agrawal

  2. The Streaming Model • Computation is represented by a directed graph: • Nodes: Computation Modules. • Edges: FIFO Channels between nodes. • Infinite input stream. • We only consider acyclic graphs (dags). • When modules fire, they consume data from incoming channels and produce data on outgoing channels.
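A minimal Python sketch of this model (all class and function names are hypothetical, not from the talk): a module fires when every incoming FIFO channel has data, consuming one item from each input channel and producing one item on each output channel.

    from collections import deque

    class Channel:
        """FIFO channel between two modules."""
        def __init__(self):
            self.items = deque()

    class Module:
        """A computation module: firing consumes one item from each
        incoming channel and produces one item on each outgoing channel."""
        def __init__(self, name, inputs, outputs, work):
            self.name = name
            self.inputs = inputs    # incoming Channels
            self.outputs = outputs  # outgoing Channels
            self.work = work        # consumed items -> produced item

        def can_fire(self):
            return all(ch.items for ch in self.inputs)

        def fire(self):
            consumed = [ch.items.popleft() for ch in self.inputs]
            produced = self.work(consumed)
            for ch in self.outputs:
                ch.items.append(produced)

    # A two-channel pipeline: source -> double -> sink.
    c_in, c_out = Channel(), Channel()
    double = Module("double", [c_in], [c_out], lambda xs: 2 * xs[0])
    for x in range(5):              # stand-in for the infinite input stream
        c_in.items.append(x)
        while double.can_fire():
            double.fire()
    print(list(c_out.items))        # [0, 2, 4, 6, 8]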

  3. Cache-Conscious Scheduling of Streaming Applications (with Jeremy T. Fineman, Jordan Krage, Charles E. Leiserson, and Sivan Toledo) • Goal: Schedule the computation to minimize the number of cache misses on a sequential machine.

  4. Disk Access Model [Figure: CPU connected to a cache of M/B blocks, each of size B, backed by slow memory] • The cache has M/B blocks, each of size B. • Cost = number of cache misses. • If the CPU accesses data in cache, the cost is 0. • If the CPU accesses data not in cache, there is a cache miss of cost 1, and the block containing the requested data is read into cache. • If the cache is full, some block is evicted from the cache to make room for the new block.
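The cost model is easy to instrument. The slides leave the replacement policy abstract; the sketch below assumes LRU, a standard stand-in that is constant-competitive with the optimal policy under constant-factor memory augmentation. All names are hypothetical.

    from collections import OrderedDict

    def count_cache_misses(addresses, M, B):
        """Simulate a cache in the disk-access model: M/B blocks of
        size B, LRU eviction, cost 1 per miss and 0 per hit."""
        cache = OrderedDict()            # block id -> None, in LRU order
        capacity = M // B
        misses = 0
        for addr in addresses:
            block = addr // B
            if block in cache:
                cache.move_to_end(block)        # hit: cost 0
            else:
                misses += 1                     # miss: read block into cache
                cache[block] = None
                if len(cache) > capacity:
                    cache.popitem(last=False)   # evict least recently used
        return misses

    # Scanning 100 consecutive addresses touches 25 blocks of size 4.
    print(count_cache_misses(range(100), M=32, B=4))   # 25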

  5. Contributions • The problem of minimizing cache misses is reduced to a problem of graph partitioning. • Theorem: If the optimal algorithm has X cache misses given a cache of size M, there exists a partitioned schedule that incurs O(X) cache misses given a cache of size O(M). • In other words, some partitioned schedule is O(1) competitive given O(1) memory augmentation.

  6. Outline • Cache Conscious Scheduling • Streaming Application Model • The Sources of Cache Misses and Intuition Behind Partitioning • Proof Intuition • Thoughts • Deadlock Avoidance • Model and Source of Deadlocks • Deadlock Avoidance Using Dummy Items. • Thoughts

  7. Streaming Applications [Figure: a dag of modules a, b, c, d annotated with state sizes (s:20, s:35, s:40, s:60) and per-edge rates such as i:4, o:2] When a module v fires, it • must load s(v) state, • consumes i(u,v) items from each incoming edge (u,v), and • produces o(v,w) items on each outgoing edge (v,w). • Assumptions: • All items are unit sized. • The source consumes 1 item each time it fires. • Input/output rates and state sizes are known. • The state size of each module is at most M.

  8. Definition: Gain [Figure: the same dag annotated with gains, e.g., gain 1/2 for b and gain 1 for c and d] Vertex Gain: the number of firings of vertex u per source firing: g(u) = ∏_{(x,y)∈p} o(x,y)/i(x,y), where p is a path from the source s to u. Edge Gain: the number of items produced on edge (u,v) per source firing: g(u,v) = g(u)·o(u,v). A graph is well-formed iff all gains are well-defined, i.e., the product is the same along every path p.
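Assuming the product-of-rates formula above, gains are computed by walking any path from the source. The sketch below (a hypothetical two-edge pipeline s → a → b) uses exact rationals so that fractional gains like 1/2 stay exact.

    from fractions import Fraction

    def vertex_gain(path, i, o):
        """Gain of the vertex at the end of `path` (a list of edges from
        the source): the product of o(e)/i(e) over the path's edges."""
        g = Fraction(1)
        for e in path:
            g *= Fraction(o[e], i[e])
        return g

    i = {("s", "a"): 1, ("a", "b"): 4}    # items consumed per firing
    o = {("s", "a"): 2, ("a", "b"): 1}    # items produced per firing
    g_a = vertex_gain([("s", "a")], i, o)               # 2
    g_b = vertex_gain([("s", "a"), ("a", "b")], i, o)   # 1/2
    gain_ab = g_a * o[("a", "b")]   # edge gain of (a, b): 2 items per source firing
    print(g_a, g_b, gain_ab)        # 2 1/2 2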

  9. Outline • Cache Conscious Scheduling • Streaming Application Model • The Sources of Cache Misses and Intuition Behind Partitioning • Proof Intuition • Thoughts • Deadlock Avoidance • Model and Source of Deadlocks • Deadlock Avoidance Using Dummy Items. • Thoughts

  10. Cache Misses Due to State Load [Figure: a four-module pipeline with state sizes 60, 20, 40, 35 and per-edge rates; B:1, M:100] Strategy: Push each item through the entire pipeline. Cost per input item: the sum of the state sizes, since every module's state is reloaded for every item. Idea: Reuse the state once it is loaded.

  11. Cache Misses Due to Data Items [Figure: the same pipeline with state sizes s1:60, s2:20, s3:40, s4:35; B:1, M:100] Strategy: Once a module's state is loaded, execute the module many times by adding large buffers between modules. Cost per input item: the total number of items produced on all channels per input item, since the large buffers do not fit in cache and every item is written out and read back.

  12. Partitioning: Reduce Cache Misses [Figure: the pipeline (s1:60, s2:20, s3:40, s4:35) split into segments that each fit in the cache; B:1, M:100] Strategy: Partition into segments that fit in cache and add buffers only on cross edges C --- the edges that go between segments. Cost per input item: Σ_{e∈C} g(e), the number of items crossing the cuts per input item, since only those items are buffered.

  13. Which Partition? [Figure: two candidate cuts of the same pipeline; B:1, M:100] Strategy: Partition into segments that fit in cache and add buffers only on cross edges C. Cost per input item: Σ_{e∈C} g(e), so the choice of cut edges determines the cost. Lesson: Cut small-gain edges.
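The lesson can be checked by brute force on a pipeline: enumerate cut sets, keep only partitions whose segments' total state fits in M, and minimize the total gain of the cut edges. The state sizes below follow the slides' running example; the edge gains are hypothetical stand-ins.

    from itertools import combinations

    def best_pipeline_partition(state, gain, M):
        """state[k]: state size of module k; gain[k]: gain of edge (k, k+1).
        Returns the minimum total cut-edge gain and the cut set achieving it,
        over partitions into contiguous segments of total state <= M."""
        n = len(state)
        best_cost, best_cuts = float("inf"), None
        for r in range(n):
            for cuts in combinations(range(n - 1), r):
                bounds = (-1,) + cuts + (n - 1,)
                if all(sum(state[a + 1:b + 1]) <= M
                       for a, b in zip(bounds, bounds[1:])):
                    cost = sum(gain[c] for c in cuts)
                    if cost < best_cost:
                        best_cost, best_cuts = cost, cuts
        return best_cost, best_cuts

    # Cutting the small-gain middle edge wins: cost 0.5 with cut set (1,).
    print(best_pipeline_partition([60, 20, 40, 35], gain=[2, 0.5, 2], M=100))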

  14. Outline • Cache Conscious Scheduling • Streaming Application Model • The Sources of Cache Misses and Intuition Behind Partitioning • Proof Intuition • Thoughts • Deadlock Avoidance • Model and Source of Deadlocks • Deadlock Avoidance Using Dummy Items. • Thoughts

  15. Is Partitioning Good? • Show that the optimal scheduler cannot do much better than the best partitioned scheduler. • Theorem: On processing T items, if the optimal algorithm given an M-sized cache has X cache misses, then some partitioning algorithm given an O(M)-sized cache has at most O(X) cache misses. • The number of cache misses of a partitioned scheduler with cross edges C is O(T·Σ_{e∈C} g(e)), so the best partitioned scheduler should minimize Σ_{e∈C} g(e). • We must prove the matching lower bound on the optimal scheduler's cache misses.

  16. Optimal Scheduler With Cache M [Figure: a segment S containing the minimum-gain edge e = (u, v)] • S: a segment with state size at least 2M. • e = gm(S): the edge with the minimum gain within S. • Consider X consecutive firings of u. • Case 1: At least 1 item produced by u is processed by v. Then S's state, at least 2M, must pass through the M-sized cache: cost ≥ M. • Case 2: All items are buffered within S. The cheapest place to buffer is at e. If X·o(u,v) ≥ 2M, the buffer outgrows the cache: cost ≥ M. • In both cases, taking X = 2M/o(u,v), the cost per firing of u is at least M/X = o(u,v)/2.

  17. Lower Bound [Figure: the pipeline divided into segments S_i; e_i is the minimum-gain edge inside S_i, with endpoints u_i and v_i] • Divide the pipeline into segments of size between 2M and 3M. • The source node fires T times. • Consider the optimal scheduler with an M-sized cache. • Number of firings of u_i: T·g(u_i). • Cost due to S_i per firing of u_i: Ω(o(u_i, v_i)) = Ω(g(e_i)/g(u_i)). • Total cost due to S_i: Ω(T·g(e_i)). • Total cost over all segments: Ω(T·Σ_i g(e_i)).

  18. Matching Upper Bound [Figure: the same segments S_i and cut edges e_i] • Divide the pipeline into segments of size between 2M and 3M. • The source node fires T times. • Cost of the optimal scheduler with an M-sized cache: Ω(T·Σ_i g(e_i)). • Consider the partitioned schedule that cuts all the e_i. • Each of its segments has size at most 6M, since it spans parts of two adjacent 3M segments. • The total cost of that schedule is O(T·Σ_i g(e_i)). • Therefore, given constant-factor memory augmentation, this partitioned schedule is constant-competitive in the number of cache misses.

  19. Generalization to DAG • Say we partition a DAG such that • Each component has size at most O(M). • When contracted, the components form a dag.

  20. Generalization to DAG • Say we partition a DAG such that: • Each component has size at most O(M). • When contracted, the components form a dag. • If C is the set of cross edges, Σ_{e∈C} g(e) is minimized over all such partitions. • The optimal schedule has cost/item Ω(Σ_{e∈C} g(e)). • Given constant-factor memory augmentation, a partitioned schedule has cost/item O(Σ_{e∈C} g(e)).
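A sketch of the second condition (hypothetical names): contract each component to a single node and check that the cross edges leave no cycle, via a colored depth-first search. The size condition is checked separately.

    def contracts_to_dag(edges, component):
        """component[v]: the component id of vertex v. True iff the
        contracted graph over components is acyclic."""
        contracted = {}
        for u, v in edges:
            cu, cv = component[u], component[v]
            if cu != cv:                       # cross edge
                contracted.setdefault(cu, set()).add(cv)
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {c: WHITE for c in set(component.values())}

        def acyclic_from(c):
            color[c] = GRAY
            for d in contracted.get(c, ()):
                if color[d] == GRAY:           # back edge: a cycle
                    return False
                if color[d] == WHITE and not acyclic_from(d):
                    return False
            color[c] = BLACK
            return True

        return all(acyclic_from(c) for c in list(color) if color[c] == WHITE)

    edges = [("a", "b"), ("b", "c"), ("c", "d")]
    print(contracts_to_dag(edges, {"a": 0, "b": 0, "c": 1, "d": 1}))  # True
    print(contracts_to_dag(edges, {"a": 0, "b": 1, "c": 0, "d": 1}))  # False: 0 -> 1 -> 0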

  21. When B ≠ 1 • Lower Bound: The optimal algorithm has cost/item Ω((1/B)·Σ_{e∈C} g(e)). • Upper Bound: With constant-factor memory augmentation: • Pipelines: the upper bound matches the lower bound. • DAGs: the upper bound matches the lower bound as long as each component of the partition has O(M/B) incident cross edges.

  22. Finding A Good Partition • For pipelines, we can find a good-enough partition greedily, and the best partition using dynamic programming. • For general DAGs, finding the best partition is NP-complete. • Our proof is approximation-preserving: an approximation algorithm for the partitioning problem yields an approximation algorithm for our scheduling problem.
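One way the pipeline dynamic program might look (a sketch, with hypothetical names): best[j] is the minimum total cut-edge gain over partitions of the first j modules into contiguous segments that each fit in an M-sized cache. It agrees with the brute-force search sketched earlier.

    def min_cut_gain_dp(state, gain, M):
        """O(n^2) dynamic program over a pipeline. state[k]: state size of
        module k; gain[k]: gain of edge (k, k+1)."""
        n = len(state)
        best = [float("inf")] * (n + 1)
        best[0] = 0.0
        for j in range(1, n + 1):
            seg = 0
            for k in range(j - 1, -1, -1):     # last segment is modules k..j-1
                seg += state[k]
                if seg > M:                    # segment no longer fits in cache
                    break
                cut = gain[k - 1] if k > 0 else 0.0  # edge cut before the segment
                best[j] = min(best[j], best[k] + cut)
        return best[n]

    print(min_cut_gain_dp([60, 20, 40, 35], gain=[2, 0.5, 2], M=100))   # 0.5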

  23. Conclusions and Future Work • We can reduce the problem of minimizing cache misses to the problem of calculating the best partition. • Solving the partitioning problem: • Approximation algorithms. • Exact solution for special cases such as SP-DAGs. • Space bounds: Bound the buffer sizes on cross edges. • Cache-conscious scheduling for multicores.

  24. Deadlock Avoidance for Streaming Computations with Filtering (with Peng Li, Jeremy Buhler, and Roger D. Chamberlain) • Goal: Devise mechanisms to avoid deadlocks in applications with filtering and finite buffers.

  25. Outline • Cache Conscious Scheduling • Streaming Application Model • The Sources of Cache Misses and Intuition Behind Partitioning • Proof Intuition • Thoughts • Deadlock Avoidance • Model and Source of Deadlocks • Deadlock Avoidance Using Dummy Items. • Thoughts

  26. Filtering Applications Model [Figure: nodes A, B, C feeding X and Y; channels carry items labeled with indices 1, 2, 3, and an item with index 1 is highlighted at the current compute index] • Data-dependent filtering: the number of items produced depends on the data. • When a node fires, it: • has a compute index (CI), which monotonically increases, • consumes/produces 0 or 1 items from/to each input/output channel, and • input/output items must have index = CI. • A node cannot proceed until it is sure that it has received all items of its current CI. • Channels can have unbounded delays.

  27. A Deadlock Demo • Filtering can cause deadlocks due to finite buffers. [Figure: nodes u, v, w, x with channel buffers of size 3; along one path to x the buffers are full, along the other they are empty, so no node can make progress] • A deadlock example (channel buffer size is 3).

  28. Contributions • A deadlock-avoidance mechanism using dummy (or heartbeat) messages sent at regular intervals: • Provably correct --- guarantees deadlock freedom. • No global synchronization. • No dynamic buffer resizing. • Efficient algorithms to compute dummy intervals for structured DAGs such as series-parallel DAGs and CS4 DAGs.

  29. Outline • Cache Conscious Scheduling • Streaming Application Model • The Sources of Cache Misses and Intuition Behind Partitioning • Proof Intuition • Thoughts • Deadlock Avoidance • Model and Source of Deadlocks • Deadlock Avoidance Using Dummy Items. • Thoughts

  30. The Naïve Algorithm • Filtering Theorem: If no node ever filters any token, then the system cannot deadlock. • The Naïve Algorithm: • Sends a dummy in place of every filtered item. • Changes a filtering system into a non-filtering system. [Figure: node u sends a token with index 1 on one channel and a dummy with index 1 on the other]
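A sketch of the Naïve Algorithm's sending rule (hypothetical names): whenever the node filters the item for the current index, a dummy carrying that index goes out in its place, so downstream nodes see every index.

    from collections import deque

    class Chan:
        def __init__(self):
            self.q = deque()
        def send(self, msg):
            self.q.append(msg)

    def fire_naive(produce, index, out):
        """If the item for this compute index is filtered (produce returns
        None), send a dummy with the same index instead."""
        item = produce(index)
        if item is not None:
            out.send(("data", index, item))
        else:
            out.send(("dummy", index))   # dummy replaces the filtered item

    out = Chan()
    keep_even = lambda i: i if i % 2 == 0 else None   # filters odd indices
    for idx in range(1, 5):
        fire_naive(keep_even, idx, out)
    print(list(out.q))
    # [('dummy', 1), ('data', 2, 2), ('dummy', 3), ('data', 4, 4)]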

  31. Comments on the Naïve Algorithm • Pros: • Easy to schedule dummy items. • Cons: • Doesn't utilize channel buffer sizes. • Sends many unnecessary dummy items, wasting both computation and bandwidth. • Next step: reduce the number of dummy items.

  32. The Propagation Algorithm • Computes a static dummy schedule. • Sends dummies periodically, based on per-channel dummy intervals. • Dummy items must be propagated to all downstream nodes. [Figure: nodes u, v, w, x; each channel is labeled with its dummy interval and buffer size (e.g., interval 3 with buffer 8, interval 4 with buffer 6, or interval ∞); at compute index 6 with the last dummy at index 0, 6 − 0 >= 6, so a dummy is sent]
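A sketch of the per-channel rule (hypothetical names), reproducing the slide's "6 − 0 >= 6" example: a dummy goes out whenever the compute index runs a full dummy interval past the last token or dummy sent on the channel; in the full algorithm, receivers then forward such dummies to all downstream nodes.

    class PropagatingSender:
        """Tracks one outgoing channel with a static dummy interval."""
        def __init__(self, interval):
            self.interval = interval
            self.last_sent = 0      # index of the last token/dummy sent

        def advance(self, ci, sent_data):
            """Call when the node's compute index reaches ci; returns True
            when a dummy must be sent (and propagated downstream)."""
            if sent_data:
                self.last_sent = ci
                return False
            if ci - self.last_sent >= self.interval:
                self.last_sent = ci
                return True
            return False

    s = PropagatingSender(interval=6)
    print(s.advance(6, sent_data=False))   # True: 6 - 0 >= 6, send a dummy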

  33. Comments on the Propagation Algorithm • Pros: • Takes advantage of channel buffer sizes. • Greatly reduces the number of dummy items compared to the Naïve Algorithm. • Cons: • Does not utilize filtering history. • Dummy items must be propagated. • Next step: eliminate propagation by • using shorter dummy intervals, and • using filtering history for dummy scheduling.

  34. The Non-Propagation Algorithm • Sends dummy items based on filtering history. • Dummy items do not propagate. • If (index of filtered item − index of previous token/dummy) >= dummy interval, send a dummy. [Figure: the same graph with shorter dummy intervals (e.g., 3 and 4) and buffer sizes; data is filtered at current index 3, the last token/dummy had index 0, and 3 − 0 >= 3, so a dummy is sent]
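A sketch of the non-propagation rule (hypothetical names), reproducing the slide's "3 − 0 >= 3" example; here the trigger is a filtered item, and the dummy is consumed downstream rather than forwarded.

    class NonPropagatingSender:
        """Schedules dummies from the channel's filtering history."""
        def __init__(self, interval):
            self.interval = interval
            self.last = 0                # index of previous token or dummy

        def on_token(self, index):       # a real token goes out
            self.last = index

        def on_filter(self, index):
            """The item for `index` was filtered: send a dummy only if the
            gap since the last token/dummy reaches the dummy interval."""
            if index - self.last >= self.interval:
                self.last = index
                return True              # dummy sent; it does not propagate
            return False

    s = NonPropagatingSender(interval=3)
    print(s.on_filter(3))   # True: 3 - 0 >= 3, send a dummy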

  35. Comparison of the Algorithms • Performance measure: the number of dummies sent (fewer dummies are better). • The Non-Propagation Algorithm is expected to send the fewest in most cases. • Experimental data: • Mercury BLASTN (a computational-biology application). • 787 billion input elements.

  36. How Do We Compute These Intervals? • Exponential-time algorithms for general DAGs, since we have to enumerate cycles. • Can we do better for structured DAGs? Yes: • Polynomial-time algorithms for SP (series-parallel) DAGs. • Polynomial-time algorithms for CS4 DAGs --- the class of DAGs in which every undirected cycle has a single source and a single sink.

  37. Conclusions and Future Work • Designed efficient deadlock-avoidance algorithms using dummy messages. • Find polynomial-time algorithms to compute dummy intervals for general DAGs. • Consider more general models: multiple outputs from one input, and feedback loops. • The reverse problem: computing efficient buffer sizes from dummy intervals.
