Idempotent Work Stealing Maged M. Michael, Martin T. Vechev, Vijay A. Saraswat PPoPP’09
Outline • Memory Operations Reordering • Problem Definition – Idempotent Work-Stealing • The algorithms • Comparison to Previous Work • Summary
Memory Operations Reordering
• Some architectures reorder memory accesses to achieve faster execution
• A good optimization for uniprocessors…
• But it may be dangerous on multiprocessors
Program order:         read(a); read(b); write(a,1); write(b,2)
A possible reordering: read(a); write(b,2); write(a,1); read(b)
Memory Operations Reordering
Memory: a = 0; b = 0;
P1:
  L1: if (read(a) = 0) goto L1
      print(read(b))
P2:
  write(b, 7)
  write(a, 1)
• What is the expected output of P1?
• What happens if P2 changes the order of its memory stores?
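The two questions on this slide can be explored with a small, single-threaded Python sketch that enumerates every interleaving of P1 and P2. It is only a model under stated assumptions: P1 is reduced to the final iteration of its spin loop (a read of a that sees a non-zero value, followed by the read of b), and the helper names `interleavings` and `printed_values` are hypothetical, not from the slides.

```python
from itertools import combinations

def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps each list's own order."""
    n, m = len(p1), len(p2)
    for slots in combinations(range(n + m), n):
        it1, it2 = iter(p1), iter(p2)
        yield [next(it1) if i in slots else next(it2) for i in range(n + m)]

def printed_values(p2_stores):
    """Values P1 can print, given the order in which P2's stores reach memory."""
    p1 = [("read", "a"), ("read", "b")]   # last loop iteration, then the print
    results = set()
    for schedule in interleavings(p1, p2_stores):
        mem = {"a": 0, "b": 0}
        seen_a = 0
        for op in schedule:
            if op[0] == "write":
                mem[op[1]] = op[2]
            elif op[1] == "a":
                seen_a = mem["a"]
            elif seen_a != 0:             # P1 has left the spin loop
                results.add(mem["b"])
    return results

program_order = [("write", "b", 7), ("write", "a", 1)]
reordered     = [("write", "a", 1), ("write", "b", 7)]

in_order_outputs  = printed_values(program_order)
reordered_outputs = printed_values(reordered)
```

In this model, `in_order_outputs` contains only 7, while `reordered_outputs` also contains 0: once the store to a overtakes the store to b, P1 can leave the loop and read b before it has been written.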
Memory Fences
• Operations that synchronize memory accesses
• X-Y fence: all previous operations of type X must commit before any following operation of type Y starts
• Example: store-load (and store-store?)
  read1; write1; [store-load fence]; write2; read2
Memory Operations Reordering – With Memory Fences
Memory: a = 0; b = 0;
P1:
  L1: if (read(a) = 0) goto L1
      print(read(b))
P2:
  write(b, 1)
  [store-store fence]
  write(a, 7)
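The effect of the store-store fence on P2 can be sketched by listing the orders in which P2's stores may reach memory. This is a toy model, assuming stores may be reordered freely except across a fence; `FENCE` and `allowed_commit_orders` are illustrative names, not a real API.

```python
from itertools import permutations

FENCE = "store-store"   # marker for a fence in the toy program

def allowed_commit_orders(program):
    """All orders in which the stores of `program` may reach memory."""
    # Tag each store with the number of fences that precede it.
    epoch, stores = 0, []
    for op in program:
        if op == FENCE:
            epoch += 1
        else:
            stores.append((epoch, op))
    orders = []
    for perm in permutations(stores):
        epochs = [e for e, _ in perm]
        if epochs == sorted(epochs):      # a store never crosses a fence
            orders.append([op for _, op in perm])
    return orders

unfenced = [("b", 1), ("a", 7)]
fenced   = [("b", 1), FENCE, ("a", 7)]

unfenced_orders = allowed_commit_orders(unfenced)
fenced_orders   = allowed_commit_orders(fenced)
```

Without the fence both commit orders are possible; with it, only the program order survives, so in this model a processor that observes a = 7 can no longer find b still unwritten.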
Sequential Consistency
• A model where:
  • All processors see all memory operations in the same order
  • That order must adhere to the program order of each thread
• Memory operations on real hardware are, in general, not sequentially consistent
• This makes program verification a non-trivial task
Sequential Consistency vs. Linearizability
• Linearizability is stronger than sequential consistency: it preserves real-time order across all threads, not only the program order within a single thread
• If operation A completes before operation B begins (in real time), then A precedes B in the order
Outline • Memory Operations Reordering • Problem Definition – Idempotent Work-Stealing • The algorithms • Comparison to Previous Work • Summary
Problem Definition - Idempotence
• Idempotence: the property of certain operations that they can be applied multiple times without changing the result (Wikipedia)
• In other words: f(f(x)) = f(x)
• Examples:
  • The absolute value function: abs(abs(x)) = abs(x)
  • The number 1 is idempotent under multiplication: 1 * 1 = 1
  • A SQL query (without updates)
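The definition f(f(x)) = f(x) can be checked mechanically on a few sample inputs. The sketch below is illustrative only: `is_idempotent` and `mark_visited` are hypothetical helpers, and testing on samples shows the idea without proving idempotence in general.

```python
def is_idempotent(f, samples):
    """True if f(f(x)) == f(x) for every sample x."""
    return all(f(f(x)) == f(x) for x in samples)

samples = [-3, -1, 0, 2, 5]
abs_ok = is_idempotent(abs, samples)                 # absolute value
inc_ok = is_idempotent(lambda x: x + 1, samples)     # increment is not idempotent

# A task can be idempotent in the same sense: running it twice leaves the
# same state as running it once (here, marking a graph vertex as visited).
def mark_visited(visited, vertex):
    visited.add(vertex)
    return visited

once  = mark_visited(set(), 42)
twice = mark_visited(mark_visited(set(), 42), 42)
```

Here `abs_ok` is true and `inc_ok` is false, and `once` and `twice` hold the same set, which is the kind of task for which a duplicated execution is harmless.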
Problem Definition – Work Stealing
• A policy to divide procedure executions (jobs/tasks) efficiently among multiple processors
• Each processor has a deque (double-ended queue) of jobs
[Diagram: processors P1, P2, …, Pk, each with its own queue of jobs]
Problem Definition – Work Stealing
• Each processor can put a new job in its own queue
• Each processor can take a job from its own queue
Problem Definition – Work Stealing
• A processor without work can steal jobs from another processor
Work Stealing - Example
• Fibonacci numbers – fib(7)
  • P1 – take() -> fib(7)
  • P1 – put(fib(6)), put(fib(5))
  • P1 – take() -> fib(6)
  • P2 – steal(P1)
  • P2 – take() -> fib(5)
  • P1 – put(fib(5)), put(fib(4))
  • P2 – put(fib(4)), put(fib(3))
  • P1 – take() -> fib(5)
  • P3 – steal(P1)
  • P3 – take() -> fib(4)
  • P2 – take() -> fib(4)
  • …
[Diagram: the queues of P1, P2 and P3 holding the pending fib(…) jobs]
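The put/take/steal discipline in this example can be sketched as a deterministic, single-threaded simulation. The assumptions are loud ones: workers run round-robin rather than in parallel, the owner takes from the tail and a thief steals from the head of the first non-empty victim (so the exact trace may differ from the slide), and `run_fib` is a hypothetical name.

```python
from collections import deque

def run_fib(n, num_workers=3):
    """Compute fib(n) by spawning fib(n-1) and fib(n-2) jobs across workers."""
    queues = [deque() for _ in range(num_workers)]
    queues[0].append(n)                    # P1 starts with the job fib(n)
    total, steals = 0, 0
    while any(queues):                     # some queue still holds a job
        for me, q in enumerate(queues):
            if q:
                job = q.pop()              # take(): owner works at its tail
            else:
                victims = [v for i, v in enumerate(queues) if i != me and v]
                if not victims:
                    continue               # nothing to steal this round
                job = victims[0].popleft() # steal(): thief takes from the head
                steals += 1
            if job < 2:
                total += job               # fib(0) = 0, fib(1) = 1
            else:
                q.extend([job - 1, job - 2])   # put() the two sub-jobs
    return total, steals

fib7, steal_count = run_fib(7)
```

The leaves fib(1) and fib(0) sum to fib(7) = 13 regardless of which worker executes which job; the steal count only reflects how the load happened to be balanced in this schedule.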
Well…
• Work stealing seems like a good idea…
• But it can be expensive, because it relies on:
  • Locks
  • Atomic read-modify-write operations
  • Memory ordering fences
• Previous work-stealing algorithms use strong synchronization primitives
• Can work-stealing algorithms for idempotent tasks avoid using synchronization primitives?
The answer • Not exactly… • Our goal: • Making Work-stealing cheap when jobs are idempotent • How? • Making the owner’s operations (“put”, “take”) cheap, but “steal” remains expensive
The Chase-Lev algorithm
• A snippet of the Chase-Lev algorithm:
Task take() {
  1. b := bottom;
  2. CircularArray a = activeArray;
  3. b = b - 1;
  4. bottom = b;
     (store-load fence between lines 4 and 5)
  5. t = top;
  …
}
Outline • Memory Operations Reordering • Problem Definition – Idempotent Work-Stealing • The algorithms • Comparison to Previous Work • Summary
The algorithms
• We will see 3 algorithms
• All algorithms insert (put) jobs at the tail
• Idempotent LIFO – extracts tasks (take/steal) from the tail
• Idempotent FIFO – extracts tasks (take/steal) from the head
• Idempotent double-ended – the owner takes tasks from the tail, and the others steal from the head
1) Idempotent LIFO: insert at the tail, take/steal from the tail
• Each processor has:
  • A dynamic array of tasks
  • A capacity variable
  • An anchor (tail index)
[Diagram: P1 with an empty tasks array, capacity = 7, anchor = 0]
Idempotent LIFO – put(task)
void put(Task task) {
  1. t := anchor;
  2. if (t = capacity) { expand(); goto 1; }
  3. tasks[t] := task;
     (store-store fence between lines 3 and 4)
  4. anchor := t + 1;
}
[Diagram: task1 is written to the first slot; capacity = 7; anchor goes from 0 to 1]
Idempotent LIFO – take()
Task take() {
  1. t := anchor;
  2. if (t = 0) return EMPTY;
  3. task := tasks[t - 1];
  4. anchor := t - 1;
  5. return task;
}
[Diagram: tasks = task1, task2, task3; capacity = 7; anchor goes from 3 to 2]
Idempotent LIFO – steal()
Task steal() {
  1. t := anchor;
  2. if (t = 0) return EMPTY;
  3. a := tasks;
  4. task := a[t - 1];
  5. if !CAS(anchor, t, t-1) goto 1;
  6. return task;
}
Fences: load-load (between the loads) and load-CAS (before the CAS on line 5)
• Why must tasks be idempotent?
[Diagram: tasks = task1, task2, task3; capacity = 7; anchor goes from 3 to 2]
Idempotent tasks
Owner:
Task take() {
  1. t := anchor;
  2. if (t = 0) return EMPTY;
  3. task := tasks[t - 1];
  4. anchor := t - 1;
  5. return task;
}
Thief:
Task steal() {
  1. t := anchor;
  2. if (t = 0) return EMPTY;
  3. a := tasks;
  4. task := a[t - 1];
  5. if !CAS(anchor, t, t-1) goto 1;
  6. return task;
}
[Diagram: tasks = task1, task2, task3; capacity = 7; owner and thief both read t = 3 and task = task3; anchor goes from 3 to 2, so task3 is returned twice]
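The duplicate extraction on this slide can be replayed step by step in a single-threaded Python sketch. It is a simulation under assumptions, not a concurrent implementation: `cas_anchor` stands in for a hardware compare-and-swap, fences are omitted because nothing is reordered here, and the class name `IdempotentLIFO` is illustrative.

```python
class IdempotentLIFO:
    def __init__(self, capacity=7):
        self.tasks = [None] * capacity
        self.anchor = 0                       # tail index

    def cas_anchor(self, expected, new):
        """Stand-in for an atomic compare-and-swap on anchor."""
        if self.anchor == expected:
            self.anchor = new
            return True
        return False

    def put(self, task):
        t = self.anchor
        if t == len(self.tasks):              # expand(): double the array
            self.tasks.extend([None] * len(self.tasks))
        self.tasks[t] = task
        self.anchor = t + 1                   # plain store, no CAS

q = IdempotentLIFO()
for name in ("task1", "task2", "task3"):
    q.put(name)

# Owner starts take(): lines 1-3 (reads anchor, then reads the task).
t_owner = q.anchor
owner_task = q.tasks[t_owner - 1]

# Thief runs a complete steal() before the owner updates anchor.
t_thief = q.anchor
thief_task = q.tasks[t_thief - 1]
thief_cas_ok = q.cas_anchor(t_thief, t_thief - 1)

# Owner finishes take(): line 4 is a plain store of t - 1.
q.anchor = t_owner - 1
```

Both `owner_task` and `thief_task` end up as task3 and the anchor ends at 2. The owner avoids a CAS at the price of this possible duplicate, which is acceptable only because the tasks are idempotent.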
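Preventing ABA
• How is ABA possible?
Thief:
Task steal() {
  1. t := anchor;
  2. if (t = 0) return EMPTY;
  3. a := tasks;
  4. task := a[t - 1];
  5. if !CAS(anchor, t, t-1) goto 1;
  6. return task;
}
Owner: take(); put(taskX); … put(taskY);
[Diagram: tasks = task1, task2, task3; capacity = 7; the thief reads t = 3 and task = task3; the owner's take() and put(taskX) bring anchor back to 3 with taskX in the last slot; the thief's CAS then succeeds and sets anchor to 2]
• taskX is lost!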
Preventing ABA
• How can we prevent it?
anchor: <integer, integer>; // <tail, tag>
void put(Task task) {
  1. <t,tag> := anchor;
  2. if (t = capacity) { expand(); goto 1; }
  3. tasks[t] := task;
  4. anchor := <t + 1, tag + 1>;
}
Task steal() {
  1. <t,tag> := anchor;
  2. if (t = 0) return EMPTY;
  3. a := tasks;
  4. task := a[t - 1];
  5. if !CAS(anchor, <t,tag>, <t-1,tag>) goto 1;
  6. return task;
}
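The ABA interleaving and the effect of the tag can be compared in one small, single-threaded Python sketch. It replays a fixed schedule (thief reads, owner runs take() and put(taskX), thief attempts its CAS); the function name `aba_scenario` and the `use_tag` switch are hypothetical and exist only for the comparison.

```python
def aba_scenario(use_tag):
    """Replay the schedule; return (did the thief's CAS succeed, tasks still in the stack)."""
    tasks = ["task1", "task2", "task3", None, None, None, None]
    anchor = (3, 0)                           # <tail, tag>

    # Thief: steal() lines 1-4 (reads anchor, then the top task).
    seen = anchor
    stolen = tasks[seen[0] - 1]               # task3

    # Owner: take() removes task3 ...
    t, tag = anchor
    anchor = (t - 1, tag)
    # ... then put(taskX) reuses the same slot and restores tail = 3.
    t, tag = anchor
    tasks[t] = "taskX"
    anchor = (t + 1, tag + 1 if use_tag else tag)

    # Thief: line 5, CAS(anchor, seen, <tail - 1, tag>).
    cas_ok = anchor == seen
    if cas_ok:
        anchor = (seen[0] - 1, seen[1])

    return cas_ok, tasks[:anchor[0]]

without_tag = aba_scenario(use_tag=False)
with_tag    = aba_scenario(use_tag=True)
```

Without the tag the CAS succeeds against a stale view and taskX drops out of the stack; with the tag, put() changes the anchor value, the CAS fails, and the thief retries from line 1 with taskX still present.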
2) Idempotent FIFO: insert at the tail, take/steal from the head
• Each processor has:
  • A dynamic cyclic array of tasks
  • A capacity variable
  • A head index (always increasing)
  • A tail index (always increasing)
[Diagram: P1 with tasks = task2, task3, task4; capacity = 7; head = 1; tail = 4]
Idempotent FIFO – put(task)
void put(Task task) {
  1. h := head;
  2. t := tail;
  3. if (t = h + tasks.capacity) { expand(); goto 1; }
  4. tasks.array[t%tasks.capacity] := task;
     (store-store fence between lines 4 and 5)
  5. tail := t + 1;
}
[Diagram: task5 is appended after task2, task3, task4; capacity = 7; head = 1; tail goes from 4 to 5]
Idempotent FIFO – take()
Task take() {
  1. h := head;
  2. t := tail;
  3. if (h = t) return EMPTY;
  4. task := tasks.array[h%tasks.capacity];
  5. head := h + 1;
  6. return task;
}
[Diagram: tasks = task2, task3, task4, task5; capacity = 7; head goes from 1 to 2]
Idempotent FIFO – steal()
Task steal() {
  1. h := head;
  2. t := tail;
  3. if (h = t) return EMPTY;
  4. a := tasks;
  5. task := a.array[h%a.capacity];
  6. if !CAS(head, h, h+1) goto 1;
  7. return task;
}
Fences: two load-load fences (between the loads) and one load-CAS fence (before the CAS on line 6)
[Diagram: tasks = task2, task3, task4, task5; capacity = 7; head goes from 1 to 2]
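The three FIFO operations can be put together in a single-threaded Python sketch to show the cyclic indexing and expansion. Assumptions: `_cas_head` only imitates an atomic compare-and-swap, fences are left as comments, expand() is taken to double the array, and EMPTY is represented by `None`.

```python
class IdempotentFIFO:
    def __init__(self, capacity=4):
        self.array = [None] * capacity
        self.head = 0                          # only ever increases
        self.tail = 0                          # only ever increases

    def _cas_head(self, expected, new):
        """Stand-in for an atomic compare-and-swap on head."""
        if self.head == expected:
            self.head = new
            return True
        return False

    def _expand(self):
        old, cap = self.array, len(self.array)
        new = [None] * (2 * cap)
        for i in range(self.head, self.tail):  # re-place live tasks modulo the new size
            new[i % (2 * cap)] = old[i % cap]
        self.array = new

    def put(self, task):
        h, t = self.head, self.tail
        if t == h + len(self.array):
            self._expand()
        self.array[t % len(self.array)] = task
        # a store-store fence would go here on real hardware
        self.tail = t + 1

    def take(self):
        h, t = self.head, self.tail
        if h == t:
            return None                        # EMPTY
        task = self.array[h % len(self.array)]
        self.head = h + 1                      # plain store, no CAS
        return task

    def steal(self):
        while True:                            # "goto 1" on CAS failure
            h, t = self.head, self.tail
            if h == t:
                return None                    # EMPTY
            a = self.array
            task = a[h % len(a)]
            if self._cas_head(h, h + 1):
                return task

q = IdempotentFIFO(capacity=4)
for name in "abcd":
    q.put(name)
first_taken = q.take()      # owner takes from the head
q.put("e")                  # wraps around into the freed slot
first_stolen = q.steal()    # thief also extracts from the head
q.put("f")
q.put("g")                  # queue is full here, so this put expands first
drained = [q.take() for _ in range(5)]
```

Because head and tail only grow and are reduced modulo the capacity on access, the order of extraction stays first-in, first-out across both the wrap-around and the expansion.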
3) Idempotent double-ended: insert at the tail, take from the tail, steal from the head
• Each processor has:
  • A dynamic cyclic array of tasks
  • A capacity variable
  • An anchor <head, size>
[Diagram: P1 with tasks = task2, task3, task4; capacity = 7; anchor = <1, 3>]
Idempotent double-ended – put(task)
void put(Task task) {
  1. <h, s> := anchor;
  2. if (s = tasks.capacity) { expand(); goto 1; }
  3. tasks.array[(h+s)%tasks.capacity] := task;
     (store-store fence between lines 3 and 4)
  4. anchor := <h, s + 1>;
}
[Diagram: task5 is appended after task2, task3, task4; capacity = 7; anchor goes from <1, 3> to <1, 4>]
Idempotent double-ended – take()
Task take() {
  1. <h, s> := anchor;
  2. if (s = 0) return EMPTY;
  3. task := tasks.array[(h+s-1)%tasks.capacity];
  4. anchor := <h, s - 1>;
  5. return task;
}
[Diagram: tasks = task2, task3, task4, task5; capacity = 7; anchor goes from <1, 4> to <1, 3>]
Idempotent double-ended – steal()
Task steal() {
  1. <h, s> := anchor;
  2. if (s = 0) return EMPTY;
  3. a := tasks;
  4. task := a.array[h%a.capacity];
  5. h2 := (h + 1) % a.capacity;
  6. if !CAS(anchor, <h,s>, <h2,s-1>) goto 1;
  7. return task;
}
Fences: load-load (between the loads) and load-CAS (before the CAS on line 6)
[Diagram: tasks = task2, task3, task4, task5; capacity = 7; anchor goes from <1, 4> to <2, 3>]
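The double-ended variant can be sketched the same way, with the anchor held as a `(head, size)` pair so that a single compare-and-swap covers both fields. This is a single-threaded illustration under assumptions: `_cas_anchor` imitates the atomic CAS, expand() is taken to double the array, fences are omitted, and the slide's two-field anchor is used as shown.

```python
class IdempotentDeque:
    def __init__(self, capacity=4):
        self.array = [None] * capacity
        self.anchor = (0, 0)                   # <head, size>

    def _cas_anchor(self, expected, new):
        """Stand-in for an atomic compare-and-swap on the whole anchor."""
        if self.anchor == expected:
            self.anchor = new
            return True
        return False

    def _expand(self):
        h, s = self.anchor
        old, cap = self.array, len(self.array)
        new = [None] * (2 * cap)
        for i in range(s):                     # keep each task at head + i
            new[(h + i) % (2 * cap)] = old[(h + i) % cap]
        self.array = new

    def put(self, task):
        h, s = self.anchor
        if s == len(self.array):
            self._expand()
        self.array[(h + s) % len(self.array)] = task
        self.anchor = (h, s + 1)               # plain store by the owner

    def take(self):
        h, s = self.anchor
        if s == 0:
            return None                        # EMPTY
        task = self.array[(h + s - 1) % len(self.array)]
        self.anchor = (h, s - 1)               # owner removes from the tail
        return task

    def steal(self):
        while True:                            # "goto 1" on CAS failure
            h, s = self.anchor
            if s == 0:
                return None                    # EMPTY
            a = self.array
            task = a[h % len(a)]
            h2 = (h + 1) % len(a)
            if self._cas_anchor((h, s), (h2, s - 1)):
                return task                    # thief removes from the head

d = IdempotentDeque(capacity=4)
for name in ("task1", "task2", "task3", "task4"):
    d.put(name)
taken  = d.take()       # owner: newest task, from the tail
stolen = d.steal()      # thief: oldest task, from the head
for name in ("task5", "task6", "task7"):
    d.put(name)         # the last put finds the array full and expands it
anchor_after_puts = d.anchor
```

The owner sees LIFO order at the tail while thieves see FIFO order at the head, and only steal() pays for a CAS.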
Outline • Memory Operations Reordering • Problem Definition – Idempotent Work-Stealing • The algorithms • Comparison to Previous Work • Summary
Experimental evaluation • Compared against “Chase-Lev” and “Cilk THE” algorithms (after adding memory fences) • Benchmarks: • Micro – the common case – take() and put() • Irregular Graph Applications
Micro-benchmarks
• 2 scenarios:
  • Both puts and takes (10^6 ops of each type)
  • Only takes (10^6 ops), with the work-queues pre-populated
Irregular Graph Applications • Based on SIMPLE framework • 2D Torus Graph: • Vertices – on the torus • Each vertex connected to its 4 neighbors • Build a spanning tree
2D-Torus Up to 6% redundant work
Outline • Memory Operations Reordering • Problem Definition – Idempotent Work-Stealing • The algorithms • Comparison to Previous Work • Summary
Summary • Memory operations reordering improves execution times • Use with care in multi-processors • “Idempotent Work-Stealing” useful for some workloads • Idempotent-LIFO gives good results for all benchmarks
Thank You! Questions?