
A Methodology for Implementing Highly Concurrent Data Objects Maurice Herlihy October 1991




Presentation Transcript


  1. A Methodology for Implementing Highly Concurrent Data Objects Maurice Herlihy October 1991 Presented by Tina Swenson April 15, 2010

  2. Agenda • Introduction • Small Objects • Non-Blocking Transformation • Wait-free Transformation • Large Objects • Non-Blocking Transformation • Conclusion

  3. Introduction

  4. Key Words • Critical Section – In the author’s context, CS refers to blocking code. • Non-blocking (NB) – some process will complete its operation after a finite number of steps. • Wait-free (a.k.a. starvation-free) (WF) – all processes will complete their operations after a finite number of steps.

  5. Motivation • Conventional Techniques – The use of a critical section (by the author’s definition) means only one process at a time has access to the data. • Implementing NB/WF – We cannot use a critical section, since it could cause a process to block forever (thus violating the definitions of NB and WF). • Practical issues addressed: • Reasoning about concurrency is hard. • Fault tolerance is costly.

  6. Automatic Transformations • Allow the programmer to reason and program sequentially. • The sequential code is converted into concurrent objects. • The author doesn’t specify what performs this transformation! • Access to the concurrent object is protected via atomic instructions.

  7. Atomics Used • Load_linked • Copies the value of the shared variable to a local. • Watches the memory for any other processor accessing it. • Store_conditional • Stores the new version to the shared variable, returning success or failure. • If another process has accessed the memory since the load_linked, the store_conditional fails.
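Portable C does not expose load_linked/store_conditional directly, so the sketch below emulates the pair with a C11 compare-and-swap; the function names mirror the slides and are assumptions, not a real API. Note the emulation is weaker than true LL/SC: CAS compares values, so it can suffer ABA problems, one reason the next slide calls CAS inadequate.

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic int shared = 0;

/* "load_linked": just an atomic read; the caller remembers the
 * observed value, which stands in for the hardware link. */
static int load_linked(_Atomic int *addr) {
    return atomic_load(addr);
}

/* "store_conditional": succeeds only if *addr still holds the value
 * the caller observed at load_linked time (value comparison, so ABA
 * is possible, unlike real LL/SC which tracks the address). */
static int store_conditional(_Atomic int *addr, int observed, int desired) {
    return atomic_compare_exchange_strong(addr, &observed, desired);
}

/* Typical LL/SC usage loop: retry until the store succeeds. */
int atomic_increment(void) {
    int old;
    do {
        old = load_linked(&shared);
    } while (!store_conditional(&shared, old, old + 1));
    return old + 1;
}
```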

  8. Atomics Used • 3 reasons for LL and SC: • Efficient implementation on cache-coherent architectures. • The CAS instruction is inadequate: less efficient and more complex. • LL and SC are easy to use (compared to CAS-based code).

  9. Correctness • Linearizability. • Used as the basic correctness condition for the concurrent objects created by the automatic transformation. • Is this claim really strong enough? • What about this quote from p18? • “...as long as the store_conditional has no spurious failures, each operation will complete after at most 2 loop iterations.”

  10. Priority Queues • The author implements a priority queue to test his new coding paradigm. • Dequeue Sequential Code:

      int pqueue_deq(pqueue_type *p) {
          int best;
          if (!p->size) return PQUEUE_EMPTY;
          best = p->element[0];
          p->element[0] = p->element[--p->size];
          pqueue_heapify(p, 0);
          return best;
      }

  Notice: No code to protect the shared data!
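The slide shows only the dequeue; pqueue_heapify and an enqueue are elsewhere in the paper. A minimal self-contained version of the sequential object, with hypothetical pqueue_heapify and pqueue_enq helpers filled in as a max-heap (an assumption; the paper's exact helpers are not shown here), might look like:

```c
#include <assert.h>

#define PQUEUE_EMPTY (-1)
#define MAX_ELEMS 64

typedef struct {
    int size;
    int element[MAX_ELEMS];
} pqueue_type;

/* Sift element i down until the max-heap property holds again. */
static void pqueue_heapify(pqueue_type *p, int i) {
    for (;;) {
        int l = 2 * i + 1, r = 2 * i + 2, best = i, tmp;
        if (l < p->size && p->element[l] > p->element[best]) best = l;
        if (r < p->size && p->element[r] > p->element[best]) best = r;
        if (best == i) return;
        tmp = p->element[i];
        p->element[i] = p->element[best];
        p->element[best] = tmp;
        i = best;
    }
}

/* The dequeue from the slide: remove and return the maximum. */
int pqueue_deq(pqueue_type *p) {
    int best;
    if (!p->size) return PQUEUE_EMPTY;
    best = p->element[0];
    p->element[0] = p->element[--p->size];
    pqueue_heapify(p, 0);
    return best;
}

/* Hypothetical sift-up insert, added so the example is usable. */
void pqueue_enq(pqueue_type *p, int v) {
    int i = p->size++;
    p->element[i] = v;
    while (i > 0 && p->element[(i - 1) / 2] < p->element[i]) {
        int parent = (i - 1) / 2, tmp = p->element[i];
        p->element[i] = p->element[parent];
        p->element[parent] = tmp;
        i = parent;
    }
}
```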

  11. Hardware & Software Used • Encore Multimax with 18 National Semiconductor NS32532 processors • Code implemented in C

  12. Small Objects

  13. Key Words • Small Object – An object small enough to be copied in one instruction. • Sequential Object – A data structure that occupies a fixed-size, contiguous region of memory (here, the heap). • Concurrent Object – A shared variable that holds a pointer to a structure with 2 fields: • version – the heap • check[2]

  14. Small Objects Non-Blocking Transformations

  15. Non-Blocking Transformation • Transforming a sequential object into a non-blocking concurrent object. • Our sequential program code must: • have no side effects. • be total (return a legal result in every state, e.g. PQUEUE_EMPTY for a dequeue on an empty queue).

  16. Race Condition • Processes X and Y read the pointer to block b. • Y replaces b with b’. • X copies b while Y is copying b’ back to b. • X’s copy may not be a valid state of the sequential object. • Solution (code example coming!): a consistency check after copying the old version and before applying the sequential operation.

  17. The Code: Non-Blocking

      typedef struct {
          pqueue_type version;
          unsigned check[2];
      } Pqueue_type;
      ...

  • We’ve converted our sequential object (the heap) into a concurrent object! • version is our original heap. • check helps us detect race conditions.

  18. The Code: Non-Blocking

      static Pqueue_type *new_pqueue;

      int Pqueue_deq(Pqueue_type **Q) {
          Pqueue_type *old_pqueue;
          pqueue_type *old_version, *new_version;
          int result;
          unsigned first, last;
          ...

  • Local copies of pointers: • old_pqueue = the concurrent object • old_version = the heap • result is the priority-queue value removed by this Pqueue_deq operation. • first, last help us detect a race condition. More later.

  19. The Code: Non-Blocking Use our atomic primitive load_linked to read the pointer to the concurrent object (it loads the pointer into a register and starts watching that memory for any other processor trying to write it). Then take pointers to the old and new versions (heaps).

      int Pqueue_deq(Pqueue_type **Q) {
          ...
          while (1) {
              old_pqueue = load_linked(Q);
              old_version = &old_pqueue->version;
              new_version = &new_pqueue->version;
              first = old_pqueue->check[1];
              copy(old_version, new_version);
              last = old_pqueue->check[0];
              if (first == last) {
                  result = pqueue_deq(new_version);
                  if (store_conditional(Q, new_pqueue))
                      break;
              }
          }
          new_pqueue = old_pqueue;
          return result;
      }

  20. The Code: Non-Blocking Preventing the race condition! Read check[1], copy the old version into the new one, then read check[0]. If the two check values do not match, a writer was active during the copy: we failed, so loop again.

      first = old_pqueue->check[1];
      copy(old_version, new_version);
      last = old_pqueue->check[0];
      if (first == last) { ... }

  21. The Code: Non-Blocking If the check values DO match, we can perform our dequeue on the private copy! Then try to publish the new version via store_conditional, which can fail and send us around the loop again. Lastly, recycle the old concurrent object as the next operation’s scratch copy, and return our priority-queue result.

      if (first == last) {
          result = pqueue_deq(new_version);
          if (store_conditional(Q, new_pqueue))
              break;
      }
      ...
      new_pqueue = old_pqueue;
      return result;

  22. Experimental Results Small Object, Non-Blocking (naive) • Ugh! That’s terrible! • Bus contention • Starvation • Wasted Parallelism!

  23. Exponential Backoff When the consistency check or the store_conditional fails, back off for a random amount of time before retrying!

      ...
          if (first == last) {
              result = pqueue_deq(new_version);
              if (store_conditional(Q, new_pqueue))
                  break;
          }
          if (max_delay < DELAY_LIMIT)
              max_delay = 2 * max_delay;
          delay = random() % max_delay;
          for (i = 0; i < delay; i++);
      } /* end while */
      new_pqueue = old_pqueue;
      return result;
  }
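The backoff arithmetic above can be pulled out into a tiny helper. The slides never give DELAY_LIMIT's value, so the 1024 below is an assumption, and rand() stands in for the slide's random() for portability:

```c
#include <assert.h>
#include <stdlib.h>

#define DELAY_LIMIT 1024   /* assumed cap; not specified on the slides */

/* One retry's worth of randomized exponential backoff, as on the
 * slide: double the cap up to DELAY_LIMIT, then busy-wait a random
 * number of iterations below the cap.  Returns the new cap so the
 * caller can carry it into the next retry. */
int backoff(int max_delay) {
    int i, delay;
    if (max_delay < DELAY_LIMIT)
        max_delay = 2 * max_delay;
    delay = rand() % max_delay;       /* rand() as stand-in for random() */
    for (i = 0; i < delay; i++)
        ;                             /* spin */
    return max_delay;
}
```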

  24. Experimental Results Small Object, Non-Blocking (back-off) Better, but NB is still not as fast as spin-locks (w/ backoff). Wasted Parallelism!

  25. Small Objects Wait-Free Transformations

  26. Key Words • Operational Combining – • Process starts an operation. • Record the call in Invocation. • Upon completion of the operation, record the result in Result.

  27. Wait-Free Protocol • Based on non-blocking and applying operational combining. • Record an operation in Invocation. • Invocation structure: • operation name • argument value • toggle bit

  28. Wait-Free Protocol • Concurrent object: • version • check[2] • responses[n] – new to our concurrent object! The pth element is the result of process p’s last completed operation. • All the processes share an announce array to record invocations.

  29. Wait-Free Protocol • When an operation starts, record the operation name and argument in announce[p] • When a process records a new invocation, flip the toggle bit inside the invocation struct! • Flipping the bit distinguishes old invocations from new invocations.
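The toggle-bit convention above can be sketched in a few lines. The inv_type/resp_type records below are hypothetical stand-ins named after the slides, not the paper's exact declarations:

```c
#include <assert.h>

/* Hypothetical records: an invocation carries the operation name,
 * its argument, and a toggle bit; a response carries the result
 * value and the toggle of the invocation it answers. */
typedef struct { int op_name; int arg; unsigned toggle; } inv_type;
typedef struct { int value; unsigned toggle; } resp_type;

/* A process announces a new invocation by recording the operation
 * and flipping its toggle bit. */
void announce_op(inv_type *inv, int op, int arg) {
    inv->op_name = op;
    inv->arg = arg;
    inv->toggle = !inv->toggle;
}

/* An invocation is pending exactly when its toggle differs from the
 * toggle stored with that process's last completed response. */
int is_pending(const inv_type *inv, const resp_type *resp) {
    return inv->toggle != resp->toggle;
}
```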

  30. Wait-Free Protocol • New function: apply() • Does the work of any waiting processes before its caller’s own work is done the same way. For ALL processes, do ALL the outstanding work!

      void apply(inv_type announce[MAX_PROCS], Pqueue_type *object) {
          int i;
          for (i = 0; i < MAX_PROCS; i++) {
              if (announce[i].toggle != object->responses[i].toggle) {
                  switch (announce[i].op_name) {
                  case ENQ_CODE:
                      object->responses[i].value =
                          pqueue_enq(&object->version, announce[i].arg);
                      break;
                  case DEQ_CODE:
                      object->responses[i].value =
                          pqueue_deq(&object->version);
                      break;
                  default:
                      fprintf(stderr, "Unknown operation code\n");
                      exit(1);
                  }
                  object->responses[i].toggle = announce[i].toggle;
              }
          }
      }

  31. The Code: Wait-Free responses is new to the concurrent object; the pth element holds the result of process p’s last completed operation. The shared announce array tracks all processes’ invocations!

      typedef struct {
          pqueue_type version;
          unsigned check[2];
          response_type responses[MAX_PROCS]; /* value + toggle per process */
      } Pqueue_type;

      static Pqueue_type *new_pqueue;
      static int max_delay;
      static invocation announce[MAX_PROCS];
      static int P; /* current process ID */
      ...

  32. The Code: Wait-Free Record the operation name and flip the toggle bit. Halve the backoff delay on entry.

      int Pqueue_deq(Pqueue_type **Q) {
          Pqueue_type *old_pqueue;
          pqueue_type *old_version, *new_version;
          int i, delay, result, new_toggle;
          unsigned first, last;
          announce[P].op_name = DEQ_CODE;
          new_toggle = announce[P].toggle = !announce[P].toggle;
          if (max_delay > 1)
              max_delay = max_delay >> 1;
          ...

  33. Check the toggle bit TWICE! The author claims it avoids a race condition???

      ...
          while ((*Q)->responses[P].toggle != new_toggle ||
                 (*Q)->responses[P].toggle != new_toggle) {
              old_pqueue = load_linked(Q);
              old_version = &old_pqueue->version;
              new_version = &new_pqueue->version;
              first = old_pqueue->check[1];
              memcpy(new_version, old_version, sizeof(pqueue_type));
              last = old_pqueue->check[0];
              if (first == last) {
                  apply(announce, new_pqueue); /* performs all announced ops, ours included */
                  if (store_conditional(Q, new_pqueue))
                      break;
              }
              if (max_delay < DELAY_LIMIT)
                  max_delay = 2 * max_delay;
              delay = random() % max_delay;
              for (i = 0; i < delay; i++);
          }
          new_pqueue = old_pqueue;
          return (*Q)->responses[P].value;
      }

  34. Same as before: load_linked the object pointer, copy the old version into the new one, and compare the check values (code as on slide 33).

  35. Pretty much the same as before: if the consistency check or the store_conditional fails, back off for a random delay and loop again.

  36. apply() applies all pending operations, including our own, to the NEW version before store_conditional tries to install it.

  37. Same loop as slide 33: recycle the old object pointer and return our result.

  38. Race Condition • P reads a pointer to version v (our heap). • Q replaces v with v’. • Q starts another operation. • Q checks the announce array, applies P’s operation to its new copy, and stores the result in that copy’s response array! • P sees the toggle bits match and returns. • Q then fails to install its copy as the next version, so P has returned a result that was never installed: the wrong result. • Solution: • Check the value of the toggle bit twice. • What?

  39. Experimental Results Wasted Parallelism!

  40. Large Objects

  41. Key Words • Large Objects - • Objects that are too large to be copied at once. • Represented by a set of blocks linked by pointers. • Logically Distinct – • An operation creates and returns a new object based on the old one. The old and new version may share a lot of memory.

  42. Memory Management • Per-process pool of memory • 3 states: committed, allocated and freed • Operations: • set_alloc moves block from committed (freed?) to allocated and returns address • set_free moves block to freed • set_prepare marks blocks in allocated as consistent • set_commit sets committed to union of freed and committed • set_abort sets freed and allocated to the empty set
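One possible reading of these set operations, modeling the per-process pool as three bitmask sets over block indices. This is a toy interpretation, not the paper's code: the paper manages real memory blocks, set_prepare is omitted here, and the abort behavior (allocated blocks returning to committed) is my reading of "undoing" an operation:

```c
#include <assert.h>

#define POOL_BLOCKS 8

/* Each block index is in exactly one of three sets. */
typedef struct {
    unsigned committed, allocated, freed;
} pool_type;

/* committed -> allocated; returns the block index, or -1 if empty. */
int set_alloc(pool_type *p) {
    for (int b = 0; b < POOL_BLOCKS; b++) {
        unsigned bit = 1u << b;
        if (p->committed & bit) {
            p->committed &= ~bit;
            p->allocated |= bit;
            return b;
        }
    }
    return -1;
}

/* Mark a block (typically one from the old version) as freed. */
void set_free(pool_type *p, int b) {
    unsigned bit = 1u << b;
    p->allocated &= ~bit;
    p->freed |= bit;
}

/* Commit: committed = committed U freed; allocated blocks now belong
 * to the installed version, so they simply leave the pool. */
void set_commit(pool_type *p) {
    p->committed |= p->freed;
    p->freed = 0;
    p->allocated = 0;
}

/* Abort: undo the operation.  Allocated blocks return to the pool;
 * "freed" blocks stay part of the still-current object, so the freed
 * and allocated sets both become empty. */
void set_abort(pool_type *p) {
    p->committed |= p->allocated;
    p->allocated = 0;
    p->freed = 0;
}
```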

  43. Performance Improvements • Skew Heap • Approximately balanced binary tree. • Easier to maintain, thus better performance. • The update process doesn’t touch most of the tree.
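The skew-heap idea can be sketched with the standard top-down skew-heap merge (shown here as a min-heap; not the paper's exact code): every operation walks a single merge path, swapping children as it goes, so most of the tree's nodes are untouched and can be shared between the old and new versions:

```c
#include <assert.h>
#include <stddef.h>

typedef struct node { int key; struct node *left, *right; } node;

/* Standard top-down skew-heap merge: keep the smaller root, merge
 * the other heap into its right subtree, then swap children.  The
 * swap is what keeps the tree approximately balanced on average. */
node *skew_merge(node *a, node *b) {
    if (!a) return b;
    if (!b) return a;
    if (b->key < a->key) { node *t = a; a = b; b = t; }
    node *merged = skew_merge(a->right, b);
    a->right = a->left;
    a->left = merged;
    return a;
}
```

Inserting is just merging with a one-node heap, and deleting the minimum is merging the root's two children.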

  44. Experimental Results

  45. Conclusion

  46. Transforming Data • Transforming data from sequential to concurrent. • Let the programmer write sequential code without worrying about concurrency. • Let some mechanism (e.g. a compiler) do the transformation to concurrent automatically. • Key instructions: • Load_Linked • Store_Conditional

  47. General Observation • Is it really worth all the extra work and wasted parallelism just to avoid starvation? Just to ensure fault tolerance? • “We propose extremely simple and efficient memory management techniques...” Is this true? It doesn’t seem simple to me!

  48. Going Forward • Resulting Research? • Are we in the wrong paradigm?

  49. Thank You
