Practical non-blocking data structures
Tim Harris (tim.harris@cl.cam.ac.uk), Computer Laboratory
Overview • Introduction • Lock-free data structures • Correctness requirements • Linked lists using CAS • Multi-word CAS • Conclusions
Introduction • What can go wrong here?

  class Counter {
    int next = 0;
    int getNumber () {
      int t;
      t = next;
      next = t + 1;
      return t;
    }
  }

Thread 1 and Thread 2 both call getNumber(). Both may read t = 0 before either writes back, so both return result = 0 and next ends at 1: the same number has been handed out twice.
Introduction (2) • What about now?

  class Counter {
    int next = 0;
    synchronized int getNumber () {
      int t;
      t = next;
      next = t + 1;
      return t;
    }
  }

Now each call holds the lock from the read of next to the write back: Thread 1 acquires the lock and returns result = 0, then Thread 2 returns result = 1, and next ends at 2.
Introduction (3) • Now the problem is liveness

  class Counter {
    int next = 0;
    synchronized int getNumber () {
      int t;
      t = next;
      next = t + 1;
      return t;
    }
  }

• Priority inversion: Thread 1 is low priority and Thread 2 is high priority, but some other thread 3 (of medium priority) prevents Thread 1 from making any progress while it holds the lock
• Sharing: suppose that these operations may be invoked both in ordinary code and in interrupt handlers…
• Failure: what if Thread 1 fails while holding the lock? The lock’s still held and the state may be inconsistent
Introduction (4) • In this case a non-blocking design is easy:

  class Counter {
    int next = 0;
    int getNumber () {
      int t;
      do {
        t = next;
      } while (CAS (&next, t, t + 1) != t);
      return t;
    }
  }

CAS (location, expected value, new value) is an atomic compare and swap: it returns the value the location held and stores the new value only if that value matched the expected one. Here the location is &next, the expected value is t and the new value is t + 1.
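The same non-blocking counter can be written in real Java using java.util.concurrent.atomic.AtomicInteger, whose compareAndSet plays the role of CAS (it returns a boolean rather than the old value, so the retry test inverts). The CasCounter wrapper class is illustrative; the AtomicInteger API is standard JDK:

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasCounter {
    private final AtomicInteger next = new AtomicInteger(0);

    int getNumber() {
        int t;
        do {
            t = next.get();                      // read the current value
        } while (!next.compareAndSet(t, t + 1)); // retry if another thread got in first
        return t;
    }
}
```

AtomicInteger also provides getAndIncrement(), which performs exactly this read–CAS–retry loop internally.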
Correctness • Safety: we usually want a ‘linearizable’ implementation (Herlihy & Wing 1990) • The data structure is only accessed through a well-defined interface • Operations on the data structure appear to occur atomically at some point between invocation and response • Liveness: usually one of two requirements • A ‘wait-free’ implementation guarantees per-thread progress • A ‘non-blocking’ implementation guarantees only system-wide progress
Overview • Introduction • Linked lists using CAS • Basic list operations • Alternative implementations • Extensions • Multi-word CAS • Conclusions
Lists using CAS • Insert 20 into H → 10 → 30 → T: create the new node with its next pointer already set to 30, then a single CAS swings 10’s next pointer from 30 to the new node.
Lists using CAS (2) • Insert 20, but another thread inserts 25 first: 10’s next pointer no longer holds the expected value (it now points to 25), so the CAS fails and the insert retries against the new list H → 10 → 25 → 30 → T.
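The prepare-then-CAS-then-retry pattern on the last two slides can be sketched in Java with AtomicReference next pointers. This is a simplified sorted list supporting insert only (no deletion, so the lost-update problem of the later slides does not yet arise); the class and method names are illustrative:

```java
import java.util.concurrent.atomic.AtomicReference;

class CasList {
    static final class Node {
        final int key;
        final AtomicReference<Node> next;
        Node(int key, Node succ) { this.key = key; this.next = new AtomicReference<>(succ); }
    }

    private final Node tail = new Node(Integer.MAX_VALUE, null); // T sentinel
    private final Node head = new Node(Integer.MIN_VALUE, tail); // H sentinel

    void insert(int key) {
        while (true) {
            // find pred and succ with pred.key < key <= succ.key
            Node pred = head, succ = pred.next.get();
            while (succ.key < key) { pred = succ; succ = succ.next.get(); }
            Node node = new Node(key, succ);         // prepare the new node first
            if (pred.next.compareAndSet(succ, node)) // one CAS swings pred's next pointer
                return;
            // CAS failed: another thread changed the list here; retry from a fresh search
        }
    }

    boolean contains(int key) {
        Node n = head.next.get();
        while (n.key < key) n = n.next.get();
        return n.key == key;
    }
}
```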
Lists using CAS (3) • Delete 10 from H → 10 → 30 → T: a single CAS swings H’s next pointer from 10 to 30, unlinking the node.
Lists using CAS (4) • Delete 10 and insert 20 concurrently: the delete CASes H’s next pointer from 10 to 30, while the insert CASes 10’s next pointer from 30 to the new node 20. Both CASes succeed, but 20 is linked behind the now-unlinked node 10 and is lost from the list.
Logical vs physical deletion • Use a ‘spare’ bit in the next pointer to indicate logically deleted nodes (shown as 30X): the delete first sets the bit with one CAS, marking the node as deleted so that no new node can be linked after it, and then a second CAS physically unlinks it.
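The JDK exposes this ‘spare bit’ pattern directly as java.util.concurrent.atomic.AtomicMarkableReference, which pairs a reference with a boolean mark that are CASed together. A minimal sketch of the two-step delete, assuming the caller has already located the victim and its successor (illustrative names, not a full Harris list):

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

class MarkedNode {
    final int key;
    // successor reference plus the logical-deletion mark, updated in one CAS
    final AtomicMarkableReference<MarkedNode> next;

    MarkedNode(int key, MarkedNode succ) {
        this.key = key;
        this.next = new AtomicMarkableReference<>(succ, false);
    }

    // Two-step delete of this node's successor 'victim' (whose successor is 'after'):
    // 1. mark victim's next pointer, so no insert can link a node after it;
    // 2. physically unlink it with a second CAS on this node's next pointer.
    boolean deleteSuccessor(MarkedNode victim, MarkedNode after) {
        if (!victim.next.compareAndSet(after, after, false, true))
            return false;    // someone else marked or changed it first: retry at the caller
        next.compareAndSet(victim, after, false, false); // physical unlink (may be helped)
        return true;
    }
}
```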
Implementation problems • Also need to consider visibility of updates: a write barrier is needed between initialising the new node’s fields and the CAS that links it into the list, so that no thread can follow the new pointer and see a partially initialised node.
Implementation problems (2) • …and the ordering of reads too: without ordering guarantees between the reads below, the traversal can reach a freshly inserted node (20) and observe val = ??? before the node’s initialising writes become visible

  while (val < seek) {
    p = p->next;
    val = p->val;
  }
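In Java these two problems are commonly handled with final-field and volatile semantics rather than explicit barriers: a thread that reads a reference to a properly constructed object through a volatile (or CASed) pointer is guaranteed to see its final fields initialised. A hedged sketch of the same traversal under that assumption (names are illustrative):

```java
import java.util.concurrent.atomic.AtomicReference;

class SafeNode {
    final int val;                        // final: visible once the node is published
    final AtomicReference<SafeNode> next; // get() is a volatile read, ordering the pointer loads

    SafeNode(int val, SafeNode succ) {
        this.val = val;
        this.next = new AtomicReference<>(succ);
    }

    // Each next.get() is a volatile read, so a node reached through a
    // successfully CASed pointer is seen with val fully initialised.
    static SafeNode seek(SafeNode p, int seek) {
        while (p.val < seek)
            p = p.next.get();
        return p;
    }
}
```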
Overview • Introduction • Linked lists using CAS • Multi-word CAS • Design • Results • Conclusions
Multi-word CAS • Atomic read-modify-write to a set of locations • A useful building block: • Many existing designs (queues, stacks, etc) use CAS2 directly (e.g. Detlefs ’00) • More generally it can be used to move a structure between consistent states • We’d like it to be non-blocking, disjoint-access parallel, linearizable, and efficient with natural data
Previous work • Lots of designs… but none of them practicable. (The comparison table listed, for each design, whether it is disjoint-access parallel, what it requires, and how many reserved bits it needs, expressed in terms of p processors, word size w, max n locations and max a addresses.)
Design • A CASn runs in phases, driven by a descriptor. In the example, a two-word CASn updates a list H → 10 → 20 → T laid out at addresses 0x100–0x11C, with the descriptor holding status, the location count (locations=2), and per-location triples a1=0x10C, o1=0x110, n1=0x118 and a2=0x114, o2=0x118, n2=<null>.

• Build descriptor: record status=UNDECIDED, locations=2 and the (address, old value, new value) triples
• Acquire locations: DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor) and DCSS (&status, UNDECIDED, 0x114, 0x118, &descriptor) install pointers to the descriptor in place of the old values, but only while status is still UNDECIDED
• Decide outcome: CAS (&status, UNDECIDED, SUCCEEDED)
• Release locations: CAS (0x10C, &descriptor, 0x118) and CAS (0x114, &descriptor, null) replace the descriptor pointers with the new values
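The phase structure can be sketched in Java, with each location an AtomicReference that holds either a plain value or a pointer to an operation's descriptor. This simplified sketch replaces DCSS with a plain CAS during acquisition, which is not fully correct under concurrent helping (a helper may re-acquire a location after the outcome is decided; that race is precisely why DCSS is needed), so it is intended only to illustrate the build/acquire/decide/release phases; all names are illustrative:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class MCasSketch {
    enum Status { UNDECIDED, SUCCEEDED, FAILED }

    static final class Descriptor {
        final AtomicReference<Status> status = new AtomicReference<>(Status.UNDECIDED);
        final List<AtomicReference<Object>> addrs;
        final List<Object> expected, newVals;
        Descriptor(List<AtomicReference<Object>> a, List<Object> o, List<Object> n) {
            addrs = a; expected = o; newVals = n;
        }
    }

    static boolean mcas(Descriptor d) {
        Status decided = Status.SUCCEEDED;
        acquire:                                           // Phase 1: acquire locations
        for (int i = 0; i < d.addrs.size(); i++) {
            while (true) {
                Object v = d.addrs.get(i).get();
                if (v == d) break;                         // already acquired by a helper
                if (v instanceof Descriptor) {             // owned by another operation:
                    mcas((Descriptor) v);                  // help it finish, then retry
                    continue;
                }
                if (!v.equals(d.expected.get(i))) { decided = Status.FAILED; break acquire; }
                if (d.addrs.get(i).compareAndSet(v, d)) break;
            }
        }
        d.status.compareAndSet(Status.UNDECIDED, decided); // Phase 2: decide the outcome
        boolean ok = d.status.get() == Status.SUCCEEDED;
        for (int i = 0; i < d.addrs.size(); i++)           // Phase 3: release locations
            d.addrs.get(i).compareAndSet(d, ok ? d.newVals.get(i) : d.expected.get(i));
        return ok;
    }
}
```

Note that AtomicReference.compareAndSet compares references, matching the pointer-based original; the test below uses small boxed integers, which Java interns.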
Reading • A read must see through descriptor pointers to the value the location logically holds:

  word_t read (addr_t a) {
    word_t val = *a;
    if (!isDescriptor (val))
      return val;
    else if (val->status == SUCCEEDED)
      return the new value for a from the descriptor;
    else
      return the old value for a from the descriptor;
  }
Whither DCSS? • Now we need DCSS from CAS: • Easier than full CAS2: the locations used for the ‘control’ and ‘update’ addresses must not overlap, only the ‘update’ address may be changed, and we don’t need the result • DCSS itself uses a small descriptor; for DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor) the fields are ac=0x200 (control address), oc=0 (UNDECIDED), au=0x10C, ou=0x110, nu=0x200:

  CAS (0x10C, 0x110, &DCSSDescriptor);    /* install the DCSS descriptor */
  if (*0x200 == 0)                        /* control value still UNDECIDED? */
    CAS (0x10C, &DCSSDescriptor, 0x200);  /* yes: write the new value */
  else
    CAS (0x10C, &DCSSDescriptor, 0x110);  /* no: restore the old value */
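The same descriptor trick one level down can be sketched in Java. This is hedged in the same way as the CASn sketch: it shows the install/check/resolve steps, but a real implementation must also make a second thread that encounters the DCSS descriptor help it to completion rather than spin; all names are illustrative:

```java
import java.util.concurrent.atomic.AtomicReference;

class DcssSketch {
    static final class DcssDescriptor {
        final AtomicReference<Object> control;  // ac: control address
        final Object expectedControl;           // oc: expected control value
        final AtomicReference<Object> update;   // au: update address
        final Object expectedUpdate, newValue;  // ou, nu

        DcssDescriptor(AtomicReference<Object> ac, Object oc,
                       AtomicReference<Object> au, Object ou, Object nu) {
            control = ac; expectedControl = oc;
            update = au; expectedUpdate = ou; newValue = nu;
        }
    }

    static boolean dcss(DcssDescriptor d) {
        // 1. Install the descriptor at the update address (reference-equality CAS,
        //    as with pointers in the original)
        if (!d.update.compareAndSet(d.expectedUpdate, d)) return false;
        // 2. Check the control address; 3. replace the descriptor with the new
        //    value if the control matched, or restore the old value if not
        boolean ok = d.control.get().equals(d.expectedControl);
        d.update.compareAndSet(d, ok ? d.newValue : d.expectedUpdate);
        return ok;
    }
}
```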
Evaluation: method • Attempt to permute elements in a vector (e.g. 60 54 76 43 6 45 23). Can control: • Level of concurrency • Length of the vector • Number of elements being permuted • Padding between elements • Management of descriptors
Evaluation: small systems • gargantubrain.cl: 4-processor IA-64 (Itanium) • Vector=1024, Width=2–64, No padding • (The graph plotted µs per successful update against CASn width, i.e. words permuted per update, with one line per algorithm used)
Evaluation: large systems • hodgkin.hpcf: 64-processor Origin-2000, MIPS R12000 • Vector=1024, Width=2 • One element per cache line • (The graph plotted ms per successful update against number of processors, with lines for the MCS, IR and HF-RC algorithms)
Overview • Introduction • Linked lists using CAS • Multi-word CAS • Conclusions
Conclusions • Some general techniques • The descriptor pointers serve two purposes: • They allow ‘helpers’ to find out the information needed to complete their work. • They indicate ownership of locations • Correctness seems clearest when thinking about the state of the shared memory, not the state of individual threads • Unlike previous work we need only a small and constant number of reserved bits (e.g. 2 to identify descriptor pointers if there’s no type information available at run time)
Conclusions (2) • Our scheme is the first practical one: • Can operate on general pointer-based data structures • Competitive with lock-based schemes • Can operate on highly parallel systems • Disjoint-access parallel, non-blocking, linearizable http://www.cl.cam.ac.uk/~tlh20/papers/hfp-casn-submitted.pdf