Idempotent Work Stealing

Maged M. Michael, Martin T. Vechev, Vijay A. Saraswat. PPoPP’09.



  1. Idempotent Work Stealing Maged M. Michael, Martin T. Vechev, Vijay A. Saraswat PPoPP’09

  2. Outline • Memory Operations Reordering • Problem Definition – Idempotent Work-Stealing • The algorithms • Comparison to Previous Work • Summary

  3. Memory Operations Reordering • Some architectures reorder memory accesses to achieve faster execution • A good optimization for uniprocessors… • But it may be dangerous on multiprocessors • Example: the program order read(a), read(b), write(a,1), write(b,2) may execute as read(a), write(b,2), write(a,1), read(b)

  4. Memory Operations Reordering • Memory: a = 0; b = 0; • P1: L1: if (read(a) = 0) goto L1; print(read(b)) • P2: write(b, 7); write(a, 1) • What is the expected output of P1? • What happens if the order of the memory stores of P2 changes?
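
The two questions on this slide can be explored with a small model. The sketch below is not concurrent code, and Python cannot expose hardware reordering; it only enumerates, for a given order in which the stores of P2 become visible, every value P1 could print. The function `possible_outputs` and the two commit-order lists are hypothetical names introduced for illustration.

```python
def possible_outputs(p2_commit_order):
    """Values P1 may print, given the order in which P2's stores become visible."""
    outputs = set()
    n = len(p2_commit_order)
    for seen_a in range(n + 1):              # stores visible when P1 reads a
        for seen_b in range(seen_a, n + 1):  # stores visible when P1 later reads b
            mem = {"a": 0, "b": 0}
            for var, val in p2_commit_order[:seen_a]:
                mem[var] = val
            if mem["a"] == 0:
                continue                     # P1 is still spinning at L1
            for var, val in p2_commit_order[seen_a:seen_b]:
                mem[var] = val
            outputs.add(mem["b"])
    return outputs


in_order = [("b", 7), ("a", 1)]    # stores become visible in program order
reordered = [("a", 1), ("b", 7)]   # the two stores are swapped

print(sorted(possible_outputs(in_order)))    # [7]
print(sorted(possible_outputs(reordered)))   # [0, 7]
```

With the stores in program order P1 can only print 7; once they are swapped, P1 may leave the loop before b is written and print 0.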

  5. Memory Fences • Operations that synchronize memory accesses • X-Y fence: all previous operations of type X must commit before any following operation of type Y starts • Example: store-load (and store-store?) • Illustration: read1, write1, [store-load fence], write2, read2 (write1 must commit before read2 starts)

  6. Memory Operations Reordering – With Memory Fences • Memory: a = 0; b = 0; • P1: L1: if (read(a) = 0) goto L1; print(read(b)) • P2: write(b, 1); [store-store fence]; write(a, 7)

  7. Sequential Consistency • A model in which: • All processors see all memory operations in the same order • That order must adhere to the program order of each thread • On architectures that reorder accesses, memory operations are not sequentially consistent, which makes program verification non-trivial

  8. Sequential Consistency vs. Linearizability • Linearizability is stronger than sequential consistency: it respects real-time order across threads, not only the program order of a single thread • If operation A completes before operation B starts (in real time), then A precedes B in the order

  9. Outline • Memory Operations Reordering • Problem Definition – Idempotent Work-Stealing • The algorithms • Comparison to Previous Work • Summary

  10. Problem Definition - Idempotence • Idempotence – the property of certain operations that they can be applied multiple times without changing the result (Wikipedia) • In other words: f(f(x)) = f(x) • Examples: • The absolute value function: abs(abs(x)) = abs(x) • The number 1 is idempotent under multiplication: 1 * 1 = 1 • A read-only SQL query (no updates)
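
The definition f(f(x)) = f(x) can be spot-checked on sample inputs. This is a minimal sketch; the helper `is_idempotent` is a hypothetical name, and passing the check on a few samples is evidence, not a proof.

```python
def is_idempotent(f, samples):
    """Check f(f(x)) == f(x) on the given sample inputs only (not a proof)."""
    return all(f(f(x)) == f(x) for x in samples)


samples = [-3, -1, 0, 2, 5]
print(is_idempotent(abs, samples))               # True: abs(abs(x)) == abs(x)
print(is_idempotent(lambda x: x + 1, samples))   # False: incrementing twice differs
```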

  11. Problem Definition – Work Stealing • A policy for dividing procedure executions (jobs/tasks) efficiently among multiple processors • Each processor has a deque (double-ended queue) of jobs • [Diagram: processors P1, P2, …, Pk, each with its own deque of jobs]

  12. Problem Definition – Work Stealing • Each processor can put a new job in its own queue • Each processor can take a job from its own queue • [Diagram: P1, P2, …, Pk adding jobs to and removing jobs from their own deques]

  13. Problem Definition – Work Stealing • A processor without work can steal jobs from another processor • [Diagram: an idle processor taking a job from the deque of another processor]

  14. Work Stealing - Example • Fibonacci numbers – fib(7) • P1 – take() -> fib(7) • P1 – put(fib(6)), put(fib(5)) • P1 – take() -> fib(6) • P2 – steal(P1) • P2 – take() -> fib(5) • P1 – put(fib(5)), put(fib(4)) • P2 – put(fib(4)), put(fib(3)) • P1 – take() -> fib(5) • P3 – steal(P1) • P3 – take() -> fib(4) • P2 – take() -> fib(4) • … • [Diagram: the deques of P1, P2, and P3 holding the pending fib tasks]
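
The trace above can be imitated with a sequential, round-robin simulation. This is a sketch under assumptions: the workers take turns instead of running in parallel, the owner works at the tail and thieves take from the head of the first non-empty victim, and the name `run_fib` is hypothetical. The extraction order may therefore differ from the slide, but the sum of the base cases must still equal fib(7).

```python
from collections import deque


def run_fib(n, num_workers=3):
    """Round-robin simulation of work stealing; returns (fib(n), number of steals)."""
    queues = [deque() for _ in range(num_workers)]
    queues[0].append(n)          # P1 starts with the root task fib(n)
    total = 0                    # the sum of the base cases equals fib(n)
    steals = 0
    while any(queues):           # an empty deque is falsy
        for q in queues:
            if q:
                k = q.pop()                  # take(): the owner works at the tail
            else:
                victims = [v for v in queues if v]
                if not victims:
                    continue                 # nothing to steal in this turn
                k = victims[0].popleft()     # steal(): the thief takes from the head
                steals += 1
            if k < 2:
                total += k                   # base case: fib(0) = 0, fib(1) = 1
            else:
                q.append(k - 1)              # put(fib(k-1))
                q.append(k - 2)              # put(fib(k-2))
    return total, steals


result, steals = run_fib(7)
print(result)   # 13
```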

  15. Well… • Work stealing seems like a good idea… • But it can be expensive… • Because previous work-stealing algorithms use strong synchronization primitives: • Locks • Atomic read-modify-write operations • Memory ordering fences • Can work-stealing algorithms for idempotent tasks avoid using synchronization primitives?

  16. The answer • Not exactly… • Our goal: make work stealing cheap when jobs are idempotent • How? Make the owner’s operations (“put”, “take”) cheap, while “steal” remains expensive

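  17. The Chase-Lev algorithm • A snippet of the Chase-Lev algorithm; take() needs a store-load fence between the write to bottom (line 4) and the read of top (line 5)
      Task take() {
        1. b := bottom;
        2. CircularArray a = activeArray;
        3. b = b - 1;
        4. bottom = b;
           (store-load fence)
        5. t = top;
        …
      }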

  18. Outline • Memory Operations Reordering • Problem Definition – Idempotent Work-Stealing • The algorithms • Comparison to Previous Work • Summary

  19. The algorithms • We will see 3 algorithms • All algorithms insert (put) jobs at the tail • Idempotent LIFO – extracts tasks (take/steal) from the tail • Idempotent FIFO – extracts tasks (take/steal) from the head • Idempotent double-ended – the owner takes tasks from the tail, and the others steal from the head

  20. 1) Idempotent LIFO • Insert at the tail; take/steal from the tail • Each processor has: • Dynamic array of tasks • A capacity variable • An anchor (tail index) • [Diagram: P1 with an empty tasks array, capacity = 7, anchor = 0]

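  21. Idempotent LIFO – put(task) • Fence: store-store, ordering the task write (line 3) before the anchor write (line 4) • [Diagram: with capacity = 7, putting task1 moves anchor from 0 to 1]
      void put(Task task) {
        1. t := anchor;
        2. if (t = capacity) { expand(); goto 1; }
        3. tasks[t] := task;
        4. anchor := t + 1;
      }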

  22. Idempotent LIFO – take() • [Diagram: tasks = task1, task2, task3, capacity = 7; taking task3 moves anchor from 3 to 2]
      Task take() {
        1. t := anchor;
        2. if (t = 0) return EMPTY;
        3. task := tasks[t - 1];
        4. anchor := t - 1;
        5. return task;
      }

  23. Idempotent LIFO – steal() • Fences: load-load, load-CAS • Why must tasks be idempotent? • [Diagram: tasks = task1, task2, task3, capacity = 7; stealing task3 moves anchor from 3 to 2]
      Task steal() {
        1. t := anchor;
        2. if (t = 0) return EMPTY;
        3. a := tasks;
        4. task := a[t - 1];
        5. if !CAS(anchor, t, t-1) goto 1;
        6. return task;
      }

  24. Idempotent tasks • take() (slide 22) and steal() (slide 23) can interleave: with tasks = task1, task2, task3 and anchor = 3, the owner and a thief both read t = 3 and task = task3 • The thief’s CAS moves anchor from 3 to 2, and the owner’s plain write at line 4 of take() also sets anchor to 2 • Both operations return task3, so the same task may be executed more than once; this is why tasks must be idempotent
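
The duplicate extraction can be replayed step by step. The sketch below is a single-threaded model of the LIFO pseudocode, not a concurrent implementation: Python has no memory fences, `cas_anchor` only stands in for an atomic CAS, and `expand()` is assumed to double the array. The class and method names are hypothetical. The interleaving is forced by hand by running the lines of take() around a complete steal().

```python
EMPTY = object()   # sentinel returned when there is nothing to extract


class IdempotentLifo:
    def __init__(self, capacity=7):
        self.tasks = [None] * capacity
        self.capacity = capacity
        self.anchor = 0                      # tail index: one past the last task

    def put(self, task):                     # owner only
        t = self.anchor
        if t == self.capacity:               # expand(): assumed to double the array
            self.tasks = self.tasks + [None] * self.capacity
            self.capacity *= 2
        self.tasks[t] = task
        # real hardware needs a store-store fence here
        self.anchor = t + 1

    def take(self):                          # owner only, no CAS
        t = self.anchor
        if t == 0:
            return EMPTY
        task = self.tasks[t - 1]
        self.anchor = t - 1
        return task

    def cas_anchor(self, expected, new):
        """Stand-in for an atomic CAS; atomic here only because the model is single-threaded."""
        if self.anchor != expected:
            return False
        self.anchor = new
        return True

    def steal(self):                         # thieves
        while True:                          # "goto 1" on CAS failure
            t = self.anchor
            if t == 0:
                return EMPTY
            a = self.tasks
            task = a[t - 1]
            if self.cas_anchor(t, t - 1):
                return task


q = IdempotentLifo()
for name in ("task1", "task2", "task3"):
    q.put(name)

t_owner = q.anchor                    # owner: take() line 1
task_owner = q.tasks[t_owner - 1]     # owner: take() line 3
task_thief = q.steal()                # thief: a whole steal() runs in between
q.anchor = t_owner - 1                # owner: take() line 4 (plain store)

print(task_owner, task_thief, q.anchor)   # task3 task3 2
```

Both sides end up holding task3 while anchor correctly ends at 2: no task is lost, but one is extracted twice.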

  25. Preventing ABA • How is ABA possible? • Start with tasks = task1, task2, task3, capacity = 7, anchor = 3 • A thief runs lines 1-4 of steal() (slide 23): t = 3, task = task3 • The owner runs take() (anchor = 2) and then put(taskX) (anchor = 3 again) • The thief’s CAS(anchor, 3, 2) at line 5 succeeds because anchor holds 3 again; anchor becomes 2 and the thief returns task3 • The owner runs put(taskY), which overwrites the slot that holds taskX • taskX is lost!

  26. Preventing ABA • How can we prevent it? Pair the tail index with a tag that put() increments
      anchor: <integer, integer>; // <tail, tag>
      void put(Task task) {
        1. <t,tag> := anchor;
        2. if (t = capacity) { expand(); goto 1; }
        3. tasks[t] := task;
        4. anchor := <t + 1, tag + 1>;
      }
      Task steal() {
        1. <t,tag> := anchor;
        2. if (t = 0) return EMPTY;
        3. a := tasks;
        4. task := a[t - 1];
        5. if !CAS(anchor, <t,tag>, <t-1,tag>) goto 1;
        6. return task;
      }
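
The ABA scenario and the effect of the tag can be replayed in a single-threaded model. This is a sketch under assumptions: `TaggedLifo` and `aba_scenario` are hypothetical names, the CAS is emulated, `expand()` is left out because the scenario never fills the array, and take() is assumed to leave the tag unchanged. With `use_tag=False` the model behaves like the untagged LIFO queue.

```python
class TaggedLifo:
    def __init__(self, use_tag, capacity=7):
        self.tasks = [None] * capacity
        self.use_tag = use_tag
        self.anchor = (0, 0)                 # <tail, tag>

    def put(self, task):                     # expand() omitted in this sketch
        t, tag = self.anchor
        self.tasks[t] = task
        self.anchor = (t + 1, tag + 1 if self.use_tag else tag)

    def take(self):
        t, tag = self.anchor
        if t == 0:
            return None
        task = self.tasks[t - 1]
        self.anchor = (t - 1, tag)           # assumption: take() keeps the tag
        return task

    def cas_anchor(self, expected, new):
        """Stand-in for an atomic CAS on the whole <tail, tag> pair."""
        if self.anchor != expected:
            return False
        self.anchor = new
        return True


def aba_scenario(use_tag):
    q = TaggedLifo(use_tag)
    for name in ("task1", "task2", "task3"):
        q.put(name)
    t, tag = q.anchor                        # thief: steal() line 1
    stolen = q.tasks[t - 1]                  # thief: steal() line 4 reads task3
    q.take()                                 # owner removes task3
    q.put("taskX")                           # owner refills the same slot
    cas_ok = q.cas_anchor((t, tag), (t - 1, tag))   # thief: steal() line 5
    q.put("taskY")                           # owner continues
    remaining = q.tasks[:q.anchor[0]]
    return stolen, cas_ok, remaining


print(aba_scenario(use_tag=False))   # ('task3', True, ['task1', 'task2', 'taskY'])
print(aba_scenario(use_tag=True))    # ('task3', False, ['task1', 'task2', 'taskX', 'taskY'])
```

Without the tag the stale CAS succeeds and taskX disappears; with the tag the CAS fails, so a real steal() would go back to line 1 and retry.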

  27. 2) Idempotent FIFO • Insert at the tail; take/steal from the head • Each processor has: • Dynamic cyclic array of tasks • A capacity variable • Head index (always increasing) • Tail index (always increasing) • [Diagram: P1 with tasks = task2, task3, task4, capacity = 7, head = 1, tail = 4]

  28. Idempotent FIFO – put(task) • Fence: store-store, ordering the task write (line 4) before the tail write (line 5) • [Diagram: with capacity = 7 and head = 1, putting task5 moves tail from 4 to 5]
      void put(Task task) {
        1. h := head;
        2. t := tail;
        3. if (t = h + tasks.capacity) { expand(); goto 1; }
        4. tasks.array[t % tasks.capacity] := task;
        5. tail := t + 1;
      }

  29. Idempotent FIFO – take() • [Diagram: tasks = task2, task3, task4, task5, capacity = 7; taking task2 moves head from 1 to 2]
      Task take() {
        1. h := head;
        2. t := tail;
        3. if (h = t) return EMPTY;
        4. task := tasks.array[h % tasks.capacity];
        5. head := h + 1;
        6. return task;
      }

  30. Idempotent FIFO – steal() • Fences: load-load, load-load, load-CAS • [Diagram: tasks = task2, task3, task4, task5, capacity = 7; stealing task2 moves head from 1 to 2]
      Task steal() {
        1. h := head;
        2. t := tail;
        3. if (h = t) return EMPTY;
        4. a := tasks;
        5. task := a.array[h % a.capacity];
        6. if !CAS(head, h, h+1) goto 1;
        7. return task;
      }
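
The three FIFO operations fit together as in the following single-threaded model. It is a sketch under assumptions, not a concurrent implementation: fences are absent, `cas_head` emulates the CAS, `_expand` is assumed to double the cyclic array, and the array and capacity are separate fields instead of the single `tasks` snapshot that steal() reads on the slide. All names are hypothetical.

```python
EMPTY = object()   # sentinel returned when the queue is empty


class IdempotentFifo:
    def __init__(self, capacity=7):
        self.array = [None] * capacity
        self.capacity = capacity
        self.head = 0                        # always increasing
        self.tail = 0                        # always increasing

    def _expand(self, h, t):
        """Assumed expand(): double the array and re-place the live tasks."""
        new_capacity = 2 * self.capacity
        new_array = [None] * new_capacity
        for i in range(h, t):
            new_array[i % new_capacity] = self.array[i % self.capacity]
        self.array, self.capacity = new_array, new_capacity

    def put(self, task):                     # owner only
        h, t = self.head, self.tail
        if t == h + self.capacity:
            self._expand(h, t)
        self.array[t % self.capacity] = task
        # real hardware needs a store-store fence here
        self.tail = t + 1

    def take(self):                          # owner only, no CAS
        h, t = self.head, self.tail
        if h == t:
            return EMPTY
        task = self.array[h % self.capacity]
        self.head = h + 1
        return task

    def cas_head(self, expected, new):
        """Stand-in for an atomic CAS on head."""
        if self.head != expected:
            return False
        self.head = new
        return True

    def steal(self):                         # thieves
        while True:                          # "goto 1" on CAS failure
            h, t = self.head, self.tail
            if h == t:
                return EMPTY
            task = self.array[h % self.capacity]
            if self.cas_head(h, h + 1):
                return task


fifo = IdempotentFifo(capacity=2)
for name in ("task1", "task2", "task3"):     # the third put forces an expand
    fifo.put(name)
order = [fifo.take(), fifo.steal(), fifo.take()]
print(order)                                 # ['task1', 'task2', 'task3']
```

Owner and thieves both extract from the head, so tasks leave in insertion order regardless of who extracts them.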

  31. 3) Idempotent double-ended • Insert at the tail; take from the tail; steal from the head • Each processor has: • Dynamic cyclic array of tasks • A capacity variable • An anchor <head, size> • [Diagram: P1 with tasks = task2, task3, task4, capacity = 7, anchor = <1, 3>]

  32. Idempotent double-ended – put(task) • Fence: store-store, ordering the task write (line 3) before the anchor write (line 4) • [Diagram: with capacity = 7, putting task5 moves anchor from <1, 3> to <1, 4>]
      void put(Task task) {
        1. <h, s> := anchor;
        2. if (s = tasks.capacity) { expand(); goto 1; }
        3. tasks.array[(h+s) % tasks.capacity] := task;
        4. anchor := <h, s + 1>;
      }

  33. Idempotent double-ended – take() • [Diagram: tasks = task2, task3, task4, task5, capacity = 7; taking task5 moves anchor from <1, 4> to <1, 3>]
      Task take() {
        1. <h, s> := anchor;
        2. if (s = 0) return EMPTY;
        3. task := tasks.array[(h+s-1) % tasks.capacity];
        4. anchor := <h, s - 1>;
        5. return task;
      }

  34. Idempotent double-ended – steal() • Fences: load-load, load-CAS • [Diagram: tasks = task2, task3, task4, task5, capacity = 7; stealing task2 moves anchor from <1, 4> to <2, 3>]
      Task steal() {
        1. <h, s> := anchor;
        2. if (s = 0) return EMPTY;
        3. a := tasks;
        4. task := a.array[h % a.capacity];
        5. h2 := (h + 1) % a.capacity;
        6. if !CAS(anchor, <h,s>, <h2,s-1>) goto 1;
        7. return task;
      }
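
The double-ended variant can be modeled the same way. The sketch below is single-threaded and makes several assumptions: the packed <head, size> pair is called `anchor` as on slides 31-33, the CAS is emulated, fences are absent, `expand()` is replaced by an error, and no tag is modeled because the slides show none. All names are hypothetical.

```python
EMPTY = object()   # sentinel returned when the deque is empty


class IdempotentDeque:
    def __init__(self, capacity=7):
        self.array = [None] * capacity
        self.capacity = capacity
        self.anchor = (0, 0)                 # <head, size>

    def put(self, task):                     # owner only, inserts at the tail
        h, s = self.anchor
        if s == self.capacity:
            raise OverflowError("expand() is omitted in this sketch")
        self.array[(h + s) % self.capacity] = task
        # real hardware needs a store-store fence here
        self.anchor = (h, s + 1)

    def take(self):                          # owner only, extracts from the tail
        h, s = self.anchor
        if s == 0:
            return EMPTY
        task = self.array[(h + s - 1) % self.capacity]
        self.anchor = (h, s - 1)
        return task

    def cas_anchor(self, expected, new):
        """Stand-in for an atomic CAS on the whole <head, size> pair."""
        if self.anchor != expected:
            return False
        self.anchor = new
        return True

    def steal(self):                         # thieves, extract from the head
        while True:                          # "goto 1" on CAS failure
            h, s = self.anchor
            if s == 0:
                return EMPTY
            task = self.array[h % self.capacity]
            h2 = (h + 1) % self.capacity
            if self.cas_anchor((h, s), (h2, s - 1)):
                return task


d = IdempotentDeque()
for name in ("task1", "task2", "task3", "task4"):
    d.put(name)
taken = d.take()      # owner gets the newest task
stolen = d.steal()    # thief gets the oldest task
print(taken, stolen, d.anchor)   # task4 task1 (1, 2)
```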

  35. Outline • Memory Operations Reordering • Problem Definition – Idempotent Work-Stealing • The algorithms • Comparison to Previous Work • Summary

  36. Experimental evaluation • Compared against “Chase-Lev” and “Cilk THE” algorithms (after adding memory fences) • Benchmarks: • Micro – the common case – take() and put() • Irregular Graph Applications

  37. Micro-benchmarks • 2 Scenarios: • Both puts and takes (10^6 ops of each type) • Only takes (10^6 ops) – with pre-populated work queues

  38. Micro-benchmarks • 2 Scenarios: • Both puts and takes (10^6 ops of each type) • Only takes (10^6 ops) – with pre-populated work queues

  39. Irregular Graph Applications • Based on SIMPLE framework • 2D Torus Graph: • Vertices – on the torus • Each vertex connected to its 4 neighbors • Build a spanning tree
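
A sequential sketch can show why repeated task execution is harmless for this workload. It does not use the SIMPLE framework and is not the benchmark itself: `torus_neighbors`, `spanning_tree`, and the `duplicate_every` knob (which re-executes some tasks to mimic a task extracted by both owner and thief) are hypothetical names and assumptions. A vertex task claims every neighbor that has no parent yet, so running it a second time changes nothing.

```python
def torus_neighbors(v, n):
    """The 4 neighbors of vertex v = (i, j) on an n x n torus (wrap-around grid)."""
    i, j = v
    return [((i + 1) % n, j), ((i - 1) % n, j), (i, (j + 1) % n), (i, (j - 1) % n)]


def spanning_tree(n, duplicate_every=0):
    """Build a spanning tree from a work list; returns (parent map, task executions).

    If duplicate_every > 0, every duplicate_every-th extracted task is executed twice.
    """
    root = (0, 0)
    parent = {root: root}        # the root is its own parent
    work = [root]
    extracted = 0                # tasks removed from the work list
    executions = 0               # task executions, including repeated ones
    while work:
        v = work.pop()
        extracted += 1
        repeat = duplicate_every and extracted % duplicate_every == 0
        for _ in range(2 if repeat else 1):
            executions += 1
            for u in torus_neighbors(v, n):
                if u not in parent:          # claim each vertex at most once
                    parent[u] = v
                    work.append(u)
    return parent, executions


tree, runs = spanning_tree(4)
tree_dup, runs_dup = spanning_tree(4, duplicate_every=4)
print(len(tree), runs, runs_dup)   # 16 16 20
print(tree == tree_dup)            # True
```

The duplicated run performs more executions but produces the same tree, which is the kind of redundant yet harmless work that idempotent work stealing tolerates.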

  40. 2D-Torus • Up to 6% redundant work

  41. Outline • Memory Operations Reordering • Problem Definition – Idempotent Work-Stealing • The algorithms • Comparison to Previous Work • Summary

  42. Summary • Memory operations reordering improves execution times • Use with care in multi-processors • “Idempotent Work-Stealing” useful for some workloads • Idempotent-LIFO gives good results for all benchmarks

  43. Thank You! Questions?
