A Methodology for Implementing Highly Concurrent Data Objects Maurice Herlihy October 1991 Presented by Tina Swenson April 15, 2010
Agenda • Introduction • Small Objects • Non-Blocking Transformation • Wait-free Transformation • Large Objects • Non-Blocking Transformation • Conclusion
Key Words • Critical Section – in the author's context, a critical section means blocking code. • Non-blocking (NB) – some process will complete its operation after a finite number of steps. • Wait-free (a.k.a. starvation-free) (WF) – every process will complete its operation after a finite number of steps.
Motivation • Conventional Techniques – using a critical section (by the author's definition) means only one process at a time has access to the data. • Implementing NB/WF – we cannot use a critical section, since a process halting inside it could block others forever (violating the definitions of NB and WF). • Practical issues addressed: • Reasoning about concurrent code is hard. • Fault tolerance is costly.
Automatic Transformations • Allow the programmer to reason and program sequentially. • The sequential code is converted into concurrent objects. • The author doesn’t specify what performs this transformation! • Access to the concurrent object is protected via atomic instructions.
Atomics Used • Load_linked • Copies the value of the shared variable into a local variable. • Watches the memory location for writes by any other processor. • Store_conditional • Stores the new version into the shared variable, returning success or failure. • If LL detects that some other process wrote the location, SC will fail.
Atomics Used • 3 reasons for LL and SC: • Efficient implementation on cache-coherent architectures. • The CAS instruction is inadequate: less efficient and more complex. • LL and SC are easy to use (compared to equivalent CAS code).
Correctness • Linearizability. • Used as the basic correctness condition for the concurrent objects created by the automatic transformation. • Is this claim really strong enough? • What about this quote from p18? • “...as long as the store_conditional has no spurious failures, each operation will complete after at most 2 loop iterations.”
Priority Queues • The author implements a priority queue to test his new coding paradigm. • Dequeue sequential code:

int pqueue_deq(pqueue_type *p) {
    int best;
    if (!p->size) return PQUEUE_EMPTY;
    best = p->element[0];
    p->element[0] = p->element[--p->size];
    pqueue_heapify(p, 0);
    return best;
}

Notice: no code to protect the shared data!
Hardware & Software Used • 18 processors • Encore Multimax with National Semiconductor NS32532 processors • Code implemented in C
Key Words • Small Object – an object small enough to be copied in one instruction. • Sequential Object – a data structure that occupies a fixed-size, contiguous region of memory (the heap). • Concurrent Object – a shared variable that holds a pointer to a structure with 2 fields: • version – the heap • check[2]
Small Objects Non-Blocking Transformations
Non-Blocking Transformation • Transforming a sequential object into a non-blocking concurrent object. • Our sequential program code must: • have no side effects. • be total (return a well-defined result for every state, e.g. PQUEUE_EMPTY on an empty queue, rather than blocking).
Race Condition • Processes X and Y read the pointer to block b. • Y replaces b with b'. • X copies b while Y is copying b' back into b. • X's copy may not be a valid state of the sequential object. • Solution (code example coming!): a consistency check after copying the old version and before applying the sequential operation.
The Code: Non-Blocking

typedef struct {
    pqueue_type version;
    unsigned check[2];
} Pqueue_type;
...

• We've converted our sequential object (the heap) into a concurrent object! • version is our original heap. • check helps us detect race conditions.
The Code: Non-Blocking

...
static Pqueue_type *new_pqueue;

int Pqueue_deq(Pqueue_type **Q) {
    Pqueue_type *old_pqueue;
    pqueue_type *old_version, *new_version;
    int result;
    unsigned first, last;
...

• Local copies of pointers: • old_pqueue = the concurrent object. • old_version = the heap. • result is the priority-queue value removed by this Pqueue_deq operation. • first, last help us detect a race condition. More later.
The Code: Non-Blocking

Use our atomic primitive load_linked to read the concurrent object's pointer (and start watching that memory for any other processor writing it). Then take pointers to the old and new heap versions.

int Pqueue_deq(Pqueue_type **Q) {
    ...
    while (1) {
        old_pqueue = load_linked(Q);
        old_version = &old_pqueue->version;
        new_version = &new_pqueue->version;
        first = old_pqueue->check[1];
        copy(old_version, new_version);
        last = old_pqueue->check[0];
        if (first == last) {
            result = pqueue_deq(new_version);
            if (store_conditional(Q, new_version)) break;
        }
    }
    new_pqueue = old_pqueue;
    return result;
}
The Code: Non-Blocking

Preventing the race condition! Read check[1], copy the old version into the new, then read check[0]. If the two check values do not match, the copy may be inconsistent — we failed, so loop again.

int Pqueue_deq(Pqueue_type **Q) {
    ...
    while (1) {
        old_pqueue = load_linked(Q);
        old_version = &old_pqueue->version;
        new_version = &new_pqueue->version;
        first = old_pqueue->check[1];
        copy(old_version, new_version);
        last = old_pqueue->check[0];
        if (first == last) {
            result = pqueue_deq(new_version);
            if (store_conditional(Q, new_version)) break;
        }
    }
    new_pqueue = old_pqueue;
    return result;
}
The Code: Non-Blocking

If the check values DO match, we can perform our dequeue operation on the private copy. Then try to publicize the new heap via store_conditional, which can fail and send us around the loop again. Lastly, recycle the old version as the next operation's scratch copy and return the dequeued value.

int Pqueue_deq(Pqueue_type **Q) {
    ...
    while (1) {
        old_pqueue = load_linked(Q);
        old_version = &old_pqueue->version;
        new_version = &new_pqueue->version;
        first = old_pqueue->check[1];
        copy(old_version, new_version);
        last = old_pqueue->check[0];
        if (first == last) {
            result = pqueue_deq(new_version);
            if (store_conditional(Q, new_version)) break;
        }
    }
    new_pqueue = old_pqueue;
    return result;
}
Experimental Results Small Object, Non-Blocking (naive) • Ugh! • That’s terrible! • Bus contention • Starvation Wasted Parallelism!
Exponential Backoff

    ...
    if (first == last) {
        result = pqueue_deq(new_version);
        if (store_conditional(Q, new_version)) break;
    }
    if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
    delay = random() % max_delay;
    for (i = 0; i < delay; i++);
} /* end while */
new_pqueue = old_pqueue;
return result;
}

When the consistency check or the store_conditional fails, back off for a random amount of time before retrying!
Experimental Results Small Object, Non-Blocking (back-off) Better, but NB is still not as fast as spin-locks (w/ backoff). Wasted Parallelism!
Small Objects Wait-Free Transformations
Key Words • Operation Combining – • A process starts an operation. • The call is recorded in an invocation structure. • Upon completion of the operation, the result is recorded in a response structure.
Wait-Free Protocol • The non-blocking transformation plus operation combining. • Record an operation in an invocation structure: • operation name • argument value • toggle bit
Wait-Free Protocol • Concurrent object: • version • check[2] • responses[n] — new to our concurrent object! The pth element holds the result of process p's last completed operation. • All the processes share an announce array used to publish invocations.
Wait-Free Protocol • When an operation starts, record the operation name and argument in announce[p]. • When a process records a new invocation, it flips the toggle bit inside the invocation struct! • Flipping the bit distinguishes old invocations from new ones.
Wait-Free Protocol • New function: apply() • Does the work of all pending announced operations — other processes' included — before the caller's own completes.

void apply(inv_type announce[MAX_PROCS], Pqueue_type *object) {
    int i;
    for (i = 0; i < MAX_PROCS; i++) {
        if (announce[i].toggle != object->responses[i].toggle) {
            switch (announce[i].op_name) {
            case ENQ_CODE:
                object->responses[i].value =
                    pqueue_enq(&object->version, announce[i].arg);
                break;
            case DEQ_CODE:
                object->responses[i].value =
                    pqueue_deq(&object->version);
                break;
            default:
                fprintf(stderr, "Unknown operation code\n");
                exit(1);
            }
            object->responses[i].toggle = announce[i].toggle;
        }
    }
}

For ALL processes, do ALL the outstanding work!
The Code: Wait-Free

responses is new to the concurrent object: the pth element is the result of process p's last completed operation. announce[] tracks every process's invocation!

typedef struct {
    pqueue_type version;
    unsigned check[2];
    resp_type responses[MAX_PROCS];
} Pqueue_type;

static Pqueue_type *new_pqueue;
static int max_delay;
static inv_type announce[MAX_PROCS];
static int P; /* current process ID */
...
The Code: Wait-Free

int Pqueue_deq(Pqueue_type **Q) {
    Pqueue_type *old_pqueue;
    pqueue_type *old_version, *new_version;
    int i, delay, result, new_toggle;
    unsigned first, last;
    announce[P].op_name = DEQ_CODE;
    new_toggle = announce[P].toggle = !announce[P].toggle;
    if (max_delay > 1) max_delay = max_delay >> 1;

Record the operation name. Flip the toggle bit. (Each successful operation also halves the backoff window.)
Check the toggle bit TWICE! The author claims the duplicated test in the while condition avoids a race condition (examined shortly).

...
while (((*Q)->responses[P].toggle != new_toggle) ||
       ((*Q)->responses[P].toggle != new_toggle)) {
    old_pqueue = load_linked(Q);
    old_version = &old_pqueue->version;
    new_version = &new_pqueue->version;
    first = old_pqueue->check[1];
    memcpy(new_version, old_version, sizeof(pqueue_type));
    last = old_pqueue->check[0];
    if (first == last) {
        apply(announce, new_pqueue);
        if (store_conditional(Q, new_pqueue)) break;
    }
    if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
    delay = random() % max_delay;
    for (i = 0; i < delay; i++);
}
new_pqueue = old_pqueue;
return (*Q)->responses[P].value;
}
Same as before: load_linked the shared pointer, and bracket the copy with the check[] consistency reads.

...
while (((*Q)->responses[P].toggle != new_toggle) ||
       ((*Q)->responses[P].toggle != new_toggle)) {
    old_pqueue = load_linked(Q);
    old_version = &old_pqueue->version;
    new_version = &new_pqueue->version;
    first = old_pqueue->check[1];
    memcpy(new_version, old_version, sizeof(pqueue_type));
    last = old_pqueue->check[0];
    if (first == last) {
        apply(announce, new_pqueue);
        if (store_conditional(Q, new_pqueue)) break;
    }
    if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
    delay = random() % max_delay;
    for (i = 0; i < delay; i++);
}
new_pqueue = old_pqueue;
return (*Q)->responses[P].value;
}
Pretty much the same as before.

...
while (((*Q)->responses[P].toggle != new_toggle) ||
       ((*Q)->responses[P].toggle != new_toggle)) {
    old_pqueue = load_linked(Q);
    old_version = &old_pqueue->version;
    new_version = &new_pqueue->version;
    first = old_pqueue->check[1];
    memcpy(new_version, old_version, sizeof(pqueue_type));
    last = old_pqueue->check[0];
    if (first == last) {
        apply(announce, new_pqueue);
        if (store_conditional(Q, new_pqueue)) break;
    }
    if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
    delay = random() % max_delay;
    for (i = 0; i < delay; i++);
}
new_pqueue = old_pqueue;
return (*Q)->responses[P].value;
}
apply() performs all announced pending operations — including our own — on the NEW version, recording each result in that version's responses array.

...
while (((*Q)->responses[P].toggle != new_toggle) ||
       ((*Q)->responses[P].toggle != new_toggle)) {
    old_pqueue = load_linked(Q);
    old_version = &old_pqueue->version;
    new_version = &new_pqueue->version;
    first = old_pqueue->check[1];
    memcpy(new_version, old_version, sizeof(pqueue_type));
    last = old_pqueue->check[0];
    if (first == last) {
        apply(announce, new_pqueue);
        if (store_conditional(Q, new_pqueue)) break;
    }
    if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
    delay = random() % max_delay;
    for (i = 0; i < delay; i++);
}
new_pqueue = old_pqueue;
return (*Q)->responses[P].value;
}
Same backoff as before; the result is read from the installed version's responses array.

...
while (((*Q)->responses[P].toggle != new_toggle) ||
       ((*Q)->responses[P].toggle != new_toggle)) {
    old_pqueue = load_linked(Q);
    old_version = &old_pqueue->version;
    new_version = &new_pqueue->version;
    first = old_pqueue->check[1];
    memcpy(new_version, old_version, sizeof(pqueue_type));
    last = old_pqueue->check[0];
    if (first == last) {
        apply(announce, new_pqueue);
        if (store_conditional(Q, new_pqueue)) break;
    }
    if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
    delay = random() % max_delay;
    for (i = 0; i < delay; i++);
}
new_pqueue = old_pqueue;
return (*Q)->responses[P].value;
}
Race Condition • P reads a pointer to version v (our heap). • Q replaces v with v'. • Q starts another operation. • Q checks the announce array, applies P's operation to its new copy, and records the result in that copy's response array! • P sees the toggle bits match and returns. • Q's store_conditional then fails to install that copy, so P has returned a result from a version that was never installed. • Solution: • Check the value of the toggle bit twice — hence the duplicated test in the while condition. • What? (The paper's argument here is hard to follow.)
Experimental Results Wasted Parallelism!
Key Words • Large Objects – • Objects too large to be copied in one instruction. • Represented as a set of blocks linked by pointers. • Logically Distinct – • An operation creates and returns a new object based on the old one; the old and new versions may share much of their memory.
Memory Management • Per-process pool of memory blocks. • 3 states: committed, allocated, and freed. • Operations: • set_alloc moves a block from committed to allocated and returns its address. • set_free moves a block to freed. • set_prepare marks blocks in allocated as consistent. • set_commit sets committed to the union of freed and committed. • set_abort sets freed and allocated to the empty set.
Performance Improvements • Skew Heap • An approximately balanced binary tree. • Easier to maintain, thus better performance. • An update doesn't touch most of the tree, so old and new versions can share most of their blocks.
Transforming Data • Transforming data from sequential to concurrent: • Let the programmer write sequential code without thought to concurrency. • Let some mechanism (e.g., a compiler) perform the transformation to concurrent code automatically. • Key instructions: • load_linked • store_conditional
General Observation • Is it really worth all the extra work and wasted parallelism just to avoid starvation? Just to ensure fault tolerance? • "We propose extremely simple and efficient memory management techniques..." Is this true? It doesn't seem simple to me!
Going Forward • Resulting Research? • Are we in the wrong paradigm?