
Practical non-blocking data structures Tim Harris tim.harris@cl.cam.ac.uk



  1. Practical non-blocking data structures Tim Harris tim.harris@cl.cam.ac.uk Computer Laboratory

  2. Overview • Introduction • Lock-free data structures • Correctness requirements • Linked lists using CAS • Multi-word CAS • Conclusions

  3. Introduction • What can go wrong here? Thread1: getNumber()  Thread2: getNumber()

       class Counter {
         int next = 0;
         int getNumber () {
           int t;
           t = next;
           next = t + 1;
           return t;
         }
       }

     [Diagram: both threads read t = 0, both return result = 0, and next is left at 1, so one update is lost.]

  4. Introduction (2) • What about now? Thread1: getNumber()  Thread2: getNumber()

       class Counter {
         int next = 0;
         synchronized int getNumber () {
           int t;
           t = next;
           next = t + 1;
           return t;
         }
       }

     [Diagram: Thread1 acquires the lock, reads t = 0 and releases the lock with next = 1; Thread2 then runs, so the results are 0 and 1 and next ends at 2.]

  5. Introduction (3) • Now the problem is liveness. Thread1: getNumber()  Thread2: getNumber()

       class Counter {
         int next = 0;
         synchronized int getNumber () {
           int t;
           t = next;
           next = t + 1;
           return t;
         }
       }

     • Priority inversion: thread 1 is low priority and thread 2 is high priority, but some other thread 3 (of medium priority) prevents 1 from making any progress
     • Sharing: suppose that these operations may be invoked both in ordinary code and in interrupt handlers…
     • Failure: what if thread 1 fails while holding the lock? The lock is still held and the state may be inconsistent

  6. Introduction (4) • In this case a non-blocking design is easy, using an atomic compare and swap, CAS (location, expected value, new value), which returns the value previously held at the location:

       class Counter {
         int next = 0;
         int getNumber () {
           int t;
           do {
             t = next;
           } while (CAS (&next, t, t + 1) != t);
           return t;
         }
       }
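
     The same retry loop can be written directly in Java with java.util.concurrent.atomic.AtomicInteger; a minimal sketch (not part of the original slides; note that compareAndSet returns a boolean rather than the old value):

       import java.util.concurrent.atomic.AtomicInteger;

       class AtomicCounter {
         private final AtomicInteger next = new AtomicInteger(0);

         int getNumber() {
           int t;
           do {
             t = next.get();                        // read the current value
           } while (!next.compareAndSet(t, t + 1)); // retry if another thread raced ahead
           return t;
         }
       }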

  7. Correctness • Safety: we usually want a ‘linearizable’ implementation (Herlihy 1990) • The data structure is only accessed through a well-defined interface • Operations on the data structure appear to occur atomically at some point between invocation and response • Liveness: usually one of two requirements • A ‘wait free’ implementation guarantees per-thread progress • A ‘non-blocking’ implementation guarantees only system-wide progress

  8. Overview • Introduction • Linked lists using CAS • Basic list operations • Alternative implementations • Extensions • Multi-word CAS • Conclusions

  9. Lists using CAS • Insert 20: [Diagram: list H → 10 → 30 → T; a new node 20 is created with its next pointer already set to 30, and a single CAS then swings 10's next pointer from 30 to 20.]

  10. Lists using CAS (2) • Insert 20: [Diagram: a concurrent insertion of 25 has changed 10's successor, so the CAS that expects 10's next pointer to be 30 fails and the insertion of 20 must retry.]
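
     A minimal Java sketch of this CAS-based insertion into a sorted singly linked list (an illustration only: it ignores concurrent deletion, which is handled later with the logical-deletion bit, and the class and field names are assumptions rather than the paper's code):

       import java.util.concurrent.atomic.AtomicReference;

       class LockFreeList {
         static final class Node {
           final int key;
           final AtomicReference<Node> next;
           Node(int key, Node next) { this.key = key; this.next = new AtomicReference<>(next); }
         }

         private final Node tail = new Node(Integer.MAX_VALUE, null);
         private final Node head = new Node(Integer.MIN_VALUE, tail);

         void insert(int key) {
           while (true) {
             // Find pred and curr such that pred.key < key <= curr.key.
             Node pred = head, curr = pred.next.get();
             while (curr.key < key) { pred = curr; curr = curr.next.get(); }
             Node node = new Node(key, curr);          // new node already points at curr
             if (pred.next.compareAndSet(curr, node))  // one CAS publishes the node
               return;
             // The CAS failed because another update changed pred.next; retry.
           }
         }
       }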

  11. Lists using CAS (3) • Delete 10: [Diagram: list H → 10 → 30 → T; a single CAS swings H's next pointer from 10 to 30, unlinking node 10.]

  12. Lists using CAS (4) • Delete 10 & insert 20: [Diagram: one thread deletes 10 with a CAS on H's next pointer (10 → 30) while another inserts 20 after 10 with a CAS on 10's next pointer (30 → 20); both CASes succeed, but 20 now hangs off the unlinked node 10 and is lost from the list.]

  13. Logical vs physical deletion • Use a 'spare' bit in a node's next pointer to indicate logically deleted nodes: deletion first sets the bit with one CAS (logical deletion, shown as 30X), and a second CAS then unlinks the node (physical deletion). A marked pointer can never match the expected value of an insert's CAS, so the lost insertion from the previous slide cannot happen. [Diagram: node 10's next pointer is marked, then 10 is unlinked from H.]
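
     In Java the 'spare' bit maps naturally onto java.util.concurrent.atomic.AtomicMarkableReference; a sketch of the two-step deletion under that assumption (illustrative only; a complete list also needs traversal code that helps unlink nodes it finds marked):

       import java.util.concurrent.atomic.AtomicMarkableReference;

       class MarkedNode {
         final int key;
         final AtomicMarkableReference<MarkedNode> next;
         MarkedNode(int key, MarkedNode next) {
           this.key = key;
           this.next = new AtomicMarkableReference<>(next, false);
         }
       }

       class Deletion {
         // Delete curr, the node that currently follows pred.
         static boolean delete(MarkedNode pred, MarkedNode curr) {
           MarkedNode succ = curr.next.getReference();
           // Step 1, logical: set the mark bit on curr's next pointer so that no
           // insert can succeed against it once the node is dead.
           if (!curr.next.compareAndSet(succ, succ, false, true))
             return false;   // pointer changed or already marked; the caller retries
           // Step 2, physical: unlink curr.  If this CAS fails, a later traversal
           // removes the marked node instead.
           pred.next.compareAndSet(curr, succ, false, false);
           return true;
         }
       }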

  14. Implementation problems • Also need to consider visibility of updates: the new node's contents must be visible to other processors before the CAS that links it in, which may require a write barrier between initialising the node and publishing it. [Diagram: inserting 20 after 10, with a write barrier before the CAS.]

  15. Implementation problems (2) • …and the ordering of reads too:

       while (val < seek) {
         p = p->next;
         val = p->val;
       }

     Without ordering between the read of p->next and the read of p->val, a traversal may reach a freshly inserted node 20 and still see a stale value (val = ???). [Diagram: traversal of the list reaching the newly inserted node 20.]

  16. Overview • Introduction • Linked lists using CAS • Multi-word CAS • Design • Results • Conclusions

  17. Multi-word CAS • Atomic read-modify-write to a set of locations • A useful building block: • Many existing designs (queues, stacks, etc) use CAS2 directly (e.g. Detlefs ’00) • More generally it can be used to move a structure between consistent states • We’d like it to be non-blocking, disjoint-access parallel, linearizable, and efficient with natural data
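
     As an interface, the operation might look like the following Java sketch (names and types are illustrative assumptions, not the paper's API): it succeeds, and installs all the new values, only if every location still holds its expected value.

       import java.util.concurrent.atomic.AtomicReference;

       interface MultiWordCAS {
         // Atomically: if addrs[i].get() == expected[i] for every i,
         // set each addrs[i] to newValues[i] and return true;
         // otherwise change nothing and return false.
         boolean mcas(AtomicReference<Object>[] addrs,
                      Object[] expected,
                      Object[] newValues);
       }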

  18. Previous work • Lots of designs… none of them practicable. [Table: for each previous design, whether it is disjoint-access parallel and how many reserved bits it requires, expressed in terms of p processors, word size w, at most n locations and at most a addresses.]

  19. Design • The operation is driven by a descriptor holding its status and, for each location, the address, old value and new value (here status=UNDECIDED, locations=2, a1=0x10C, o1=0x110, n1=0x118, a2=0x114, o2=0x118, n2=<null>):
     ① Build descriptor
     ② Acquire locations: DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor); DCSS (&status, UNDECIDED, 0x114, 0x118, &descriptor); each DCSS installs a pointer to the descriptor in place of the old value
     ③ Decide outcome: CAS (&status, UNDECIDED, SUCCEEDED)
     ④ Release locations: CAS (0x10C, &descriptor, 0x118); CAS (0x114, &descriptor, null)
     [Diagram: the heap at addresses 0x100–0x11C holding the list nodes and the descriptor, with status moving from UNDECIDED to SUCCEEDED.]
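
     An illustrative Java rendering of that descriptor and phase order (a sketch under assumed names, not the paper's code; the real algorithm installs the descriptor with DCSS so that the status check and the install happen atomically):

       import java.util.concurrent.atomic.AtomicInteger;
       import java.util.concurrent.atomic.AtomicReference;

       final class MCASDescriptor {
         static final int UNDECIDED = 0, FAILED = 1, SUCCEEDED = 2;

         final AtomicInteger status = new AtomicInteger(UNDECIDED);  // decided once, by CAS
         final AtomicReference<Object>[] addrs;   // a1 .. an
         final Object[] olds;                     // o1 .. on
         final Object[] news;                     // n1 .. nn

         MCASDescriptor(AtomicReference<Object>[] addrs, Object[] olds, Object[] news) {
           this.addrs = addrs; this.olds = olds; this.news = news;
         }
       }

       // Phase 1: build an MCASDescriptor.
       // Phase 2: for each i, DCSS(&status, UNDECIDED, addrs[i], olds[i], descriptor)
       //          replaces olds[i] with a pointer to the descriptor ("acquire").
       // Phase 3: CAS(&status, UNDECIDED, all acquired ? SUCCEEDED : FAILED).
       // Phase 4: for each i, CAS(addrs[i], descriptor,
       //          status == SUCCEEDED ? news[i] : olds[i]) ("release").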

  20. Reading • A read must look through descriptor pointers:

       word_t read (addr_t a) {
         word_t val = *a;
         if (!isDescriptor(val))
           return val;
         else
           // the descriptor's status decides what the location logically holds:
           // SUCCEEDED => return the new value recorded for a in the descriptor
           // otherwise  => return the old value recorded for a in the descriptor
       }

     [Diagram: the heap from the previous slide while status=UNDECIDED.]

  21. Whither DCSS? • Now we need DCSS from CAS: • Easier than full CAS2: the locations used for the 'control' and 'update' addresses must not overlap, only the 'update' address may be changed, and we don't need the result • DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor) proceeds as:

       CAS (0x10C, 0x110, &DCSSDescriptor);      // install a DCSS descriptor at the update address
       if (*0x200 == 0)                          // control word: is status still UNDECIDED?
         CAS (0x10C, &DCSSDescriptor, 0x200);    // yes: leave the MCAS descriptor pointer behind
       else
         CAS (0x10C, &DCSSDescriptor, 0x110);    // no: restore the old value

     [Diagram: the DCSS descriptor with ac=0x200, oc=0, au=0x10C, ou=0x110, nu=0x200.]
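
     A self-contained Java sketch of building DCSS from single-word CAS along these lines, using an AtomicReference<Object> as the update location so that a descriptor object can temporarily stand in for the real value (names and the return convention are assumptions; descriptor reuse and reclamation are omitted):

       import java.util.concurrent.atomic.AtomicReference;

       final class DCSS {
         static final class Descriptor {
           final AtomicReference<Object> control, update;
           final Object expectedControl, expectedUpdate, newUpdate;
           Descriptor(AtomicReference<Object> control, Object expectedControl,
                      AtomicReference<Object> update, Object expectedUpdate, Object newUpdate) {
             this.control = control; this.expectedControl = expectedControl;
             this.update = update; this.expectedUpdate = expectedUpdate; this.newUpdate = newUpdate;
           }
         }

         // If *control == expectedControl and *update == expectedUpdate, set *update to
         // newUpdate.  Returns the value seen at *update (comparisons use identity).
         static Object dcss(AtomicReference<Object> control, Object expectedControl,
                            AtomicReference<Object> update, Object expectedUpdate, Object newUpdate) {
           Descriptor d = new Descriptor(control, expectedControl, update, expectedUpdate, newUpdate);
           while (true) {
             Object seen = update.get();
             if (seen instanceof Descriptor) {
               complete((Descriptor) seen);       // another DCSS is in flight: help it finish
             } else if (seen != expectedUpdate) {
               return seen;                       // the update word no longer matches: fail
             } else if (update.compareAndSet(expectedUpdate, d)) {
               complete(d);                       // descriptor installed: resolve it
               return expectedUpdate;
             }
             // otherwise the CAS lost a race; loop and try again
           }
         }

         // Resolve an installed descriptor: check the control word, then replace the
         // descriptor with either the new or the original update value.
         static void complete(Descriptor d) {
           Object result = (d.control.get() == d.expectedControl) ? d.newUpdate : d.expectedUpdate;
           d.update.compareAndSet(d, result);
         }
       }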

  22. Evaluation: method • Attempt to permute elements in a vector. Can control: • Level of concurrency • Length of the vector • Number of elements being permuted • Padding between elements • Management of descriptors [Diagram: example vector 60 54 76 43 6 45 23]

  23. Evaluation: small systems • gargantubrain.cl: 4-processor IA-64 (Itanium) • Vector=1024, Width=2-64, No padding • [Graph: µs per successful update against CASn width (words permuted per update), with one curve per algorithm used.]

  24. Evaluation: large systems • hodgkin.hpcf: 64-processor Origin-2000, MIPS R12000 • Vector=1024, Width=2 • One element per cache line • [Graph: ms per successful update against number of processors, comparing the MCS, IR and HF-RC algorithms.]

  25. Overview • Introduction • Linked lists using CAS • Multi-word CAS • Conclusions

  26. Conclusions • Some general techniques • The descriptor pointers serve two purposes: • They allow ‘helpers’ to find out the information needed to complete their work. • They indicate ownership of locations • Correctness seems clearest when thinking about the state of the shared memory, not the state of individual threads • Unlike previous work we need only a small and constant number of reserved bits (e.g. 2 to identify descriptor pointers if there’s no type information available at run time)

  27. Conclusions (2) • Our scheme is the first practical one: • Can operate on general pointer-based data structures • Competitive with lock-based schemes • Can operate on highly parallel systems • Disjoint-access parallel, non-blocking, linearizable http://www.cl.cam.ac.uk/~tlh20/papers/hfp-casn-submitted.pdf
