1. From the GDC web site:
Overview: High-performance multi-threaded code sometimes requires communicating between threads without using locks. Lockless programming forces developers to consider reordering of reads and writes and subtle compiler behaviors in an area not covered by the C and C++ standards. This talk explains the pitfalls and how to dodge them for maximum performance, including real-world examples.
2. Lockless Programming in Games Bruce Dawson
Principal Software Design Engineer
Microsoft
Windows Client Performance
3. Agenda Locks and their problems
Lockless programming – a different set of problems!
Portable lockless programming
Lockless algorithms that work
Conclusions
Focus is on improving intuition on the reordering aspects of lockless programming. Here’s the rough agenda. However, the main goal is to give attendees a better intuition about what reordering ‘feels’ like, so that you can recognize when it might happen and know what to do.
4. Cell phones Please turn off all cell phones, pagers, alarm clocks, crying babies, internal combustion engines, leaf blowers, etc.
5. Mandatory Multi-core Xbox 360: six hardware threads
PS3: nine hardware threads
Windows: quad-core PCs for $500
Multi-threading is mandatory if you want to harness the available power
Luckily it's easy
As long as there is no sharing of non-constant data
Sharing data is tricky
Easiest and safest way is to use OS features such as locks and semaphores. If all access to shared data is done while you hold a lock, then everything behaves in a predictable way.
6. Simple Job Queue Assigning work:
EnterCriticalSection( &workItemsLock );
workItems.push( workItem );
LeaveCriticalSection( &workItemsLock );
Worker threads:
EnterCriticalSection( &workItemsLock );
WorkItem workItem = workItems.front();
workItems.pop();
LeaveCriticalSection( &workItemsLock );
DoWork( workItem );
The critical section is necessary to maintain the integrity of the queue as the worker threads and one or more threads assigning work all manipulate the shared workItems queue.
More sophisticated systems with multiple job queues and work stealing are certainly possible and worthwhile.
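A hedged sketch of this locked queue as a self-contained unit (WorkItem and the helper functions are illustrative; the critical section must be initialized once at startup). Unlike the slide code, this version also checks for an empty queue:

#include &lt;windows.h&gt;
#include &lt;queue&gt;

struct WorkItem { int jobId; };        // Illustrative payload.

std::queue&lt;WorkItem&gt; workItems;
CRITICAL_SECTION workItemsLock;        // InitializeCriticalSection() at startup.

// Assigning work – any thread:
void AddWork( const WorkItem&amp; workItem )
{
    EnterCriticalSection( &amp;workItemsLock );
    workItems.push( workItem );
    LeaveCriticalSection( &amp;workItemsLock );
}

// Worker threads – returns false if no work was available:
bool GetWork( WorkItem* workItem )
{
    EnterCriticalSection( &amp;workItemsLock );
    bool avail = !workItems.empty();
    if( avail )
    {
        *workItem = workItems.front();
        workItems.pop();
    }
    LeaveCriticalSection( &amp;workItemsLock );
    return avail;
}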
7. The Problem With Locks… Overhead – acquiring and releasing locks takes time
So don’t acquire locks too often
Deadlocks – lock acquisition order must be consistent to avoid these
So don’t have very many locks, or only acquire one at a time
Contention – sometimes somebody else has the lock
So never hold locks for too long – contradicts point 1
So have lots of little locks – contradicts point 2
Priority inversions – if a thread is swapped out while holding a lock, progress may stall
Changing thread priorities can lead to this
Xbox 360 system threads can briefly cause this Note that the performance of locks is improving – critical sections have had their performance improved by orders of magnitude in the last ~ten years. There are cases where code has been “optimized” by eliminating locks only to end up running slower.
8. Sensible Reaction Use locks carefully
Don't lock too frequently
Don't lock for too long
Don't use too many locks
Don't have one central lock
Or, try lockless
9. Lockless Programming Techniques for safe multi-threaded data sharing without locks
Pros:
May have lower overhead
Avoids deadlocks
May reduce contention
Avoids priority inversions
Cons
Very limited abilities
Extremely tricky to get right
Generally non-portable Are you sure you really need lockless programming?
Limited abilities because your most powerful primitive is an atomic CompareAndSwap, which is subtle and dangerous to use.
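A minimal sketch of what building on that primitive looks like – a lock-free fetch-and-add implemented as a compare-and-swap retry loop. The function name is illustrative; InterlockedCompareExchange is the real Win32 call:

#include &lt;windows.h&gt;

LONG LocklessAdd( volatile LONG* value, LONG delta )
{
    for( ;; )
    {
        LONG oldValue = *value;               // Read the current value.
        LONG newValue = oldValue + delta;     // Compute the desired value.
        // Publish newValue only if nobody changed *value in the meantime;
        // otherwise another thread won the race, so loop and try again.
        if( InterlockedCompareExchange( value, newValue, oldValue ) == oldValue )
            return newValue;
    }
}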
10. Job Queue Again Assigning work:
EnterCriticalSection( &workItemsLock );
workItems.push( workItem );
LeaveCriticalSection( &workItemsLock );
Worker threads:
EnterCriticalSection( &workItemsLock );
WorkItem workItem = workItems.front();
workItems.pop();
LeaveCriticalSection( &workItemsLock );
DoWork( workItem );
Here’s our old friend the job queue with locks.
11. Lockless Job Queue #1 Assigning work:
EnterCriticalSection( &workItemsLock );
InterlockedPushEntrySList( workItem );
LeaveCriticalSection( &workItemsLock );
Worker threads:
EnterCriticalSection( &workItemsLock );
WorkItem workItem =
InterlockedPopEntrySList();
LeaveCriticalSection( &workItemsLock );
DoWork( workItem ); First we swap the STL collection for an SList, which is a lockless, thread-safe singly-linked list that is part of Windows.
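The slide code abbreviates the SList API; a sketch of the real calls (WorkItem is illustrative). Note that SList entries must be 16-byte aligned, so allocate items with _aligned_malloc( sizeof(WorkItem), MEMORY_ALLOCATION_ALIGNMENT ) or similar:

#include &lt;windows.h&gt;

struct WorkItem
{
    SLIST_ENTRY entry;   // Must be the first member for this simple cast.
    int         payload;
};

SLIST_HEADER g_workItems;

void InitQueue()
{
    InitializeSListHead( &amp;g_workItems );
}

void PushWork( WorkItem* item )
{
    InterlockedPushEntrySList( &amp;g_workItems, &amp;item-&gt;entry );
}

WorkItem* PopWork()
{
    // Returns NULL when the list is empty.
    return (WorkItem*)InterlockedPopEntrySList( &amp;g_workItems );
}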
12. Lockless Job Stack #1 Assigning work:
InterlockedPushEntrySList( workItem );
Worker threads:
WorkItem workItem =
InterlockedPopEntrySList();
DoWork( workItem ); Then once we are using a lockless list, we can get rid of the critical section. Lockless!
So, are we done? No. It is suboptimal because we no longer have a queue, we have a stack, so our work items are no longer processed in order. Implementing data structures in a lockless manner is hard.
And, on Xbox 360 this is broken for reasons I’ll get to later. It should really be InterlockedPushEntrySListRelease and InterlockedPopEntrySListAcquire to make this safe on all platforms; otherwise it is broken on Xbox 360.
What if we want a queue, and something lower overhead?
13. Lockless Job Queue #2 Assigning work – one writer only:
if( RoomAvail( readPt, writePt ) ) {
CircWorkList[ writePt ] = workItem;
writePt = WRAP( writePt + 1 );
Worker thread – one reader only:
if( DataAvail( writePt, readPt ) ) {
WorkItem workItem =
CircWorkList[ readPt ];
readPt = WRAP( readPt + 1 );
DoWork( workItem ); Version #1 of our lockless job queue had the limitation of being a stack rather than a queue.
Version #2 of our lockless job queue has the limitation of having a single reader and single writer. It can be extended to multiple readers and writers by using interlocked operations, but this extension is tricky, reduces performance, and will not be shown here.
Welcome to the limited world of lockless programming.
And, to top it all off, it’s broken. It is not reliable on any computer I know of, because of compiler and CPU reordering.
Lockless programming is perilous. CPUs and compilers will do anything to make your code run faster, even if that changes the meaning of your code. This is always hidden from you when you write single-threaded code, or when you write multi-threaded code using locks, but as soon as you write lockless code you are in a scary world.
This scary world is very unintuitive. CPUs will rearrange reads and writes in ways that seem very unintuitive. However, there is a method to their madness, so the first step is to visualize what is going on, and why.
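The slide leaves RoomAvail, DataAvail, and WRAP undefined; here is one plausible definition, assuming a power-of-two capacity so that WRAP is a cheap mask (these helpers are illustrative, not from the talk):

const unsigned kQueueSize = 256;              // Must be a power of two.
WorkItem CircWorkList[ kQueueSize ];
unsigned readPt = 0, writePt = 0;

#define WRAP( n ) ( (n) &amp; ( kQueueSize - 1 ) )

// Room to write if advancing writePt would not collide with readPt
// (one slot is deliberately left empty to distinguish full from empty).
bool RoomAvail( unsigned readPt, unsigned writePt )
{
    return WRAP( writePt + 1 ) != readPt;
}

// Data to read if the two indices differ.
bool DataAvail( unsigned writePt, unsigned readPt )
{
    return readPt != writePt;
}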
14. Simple CPU/Compiler Model Read pC
Write pA
Write pB
Read pD
Write pC Simple model is that reads and writes happen instantaneously, one after another. This is how computers worked in the old days. 6502, 68000, these machines were so slow that memory could keep up and loads/stores happened exactly when you thought they did, with minimal pipelining.
But actually, reads and writes take a really long time – too long.
Note that ‘pass’ (as in reads pass writes) means a newer operation (read) passes an older operation (write). Note also that this refers to reads and writes from a single CPU core passing each other. Each processor has an independent stream of events
Point out that a write doesn’t take effect for other processors until it reaches shared storage.
15. Alternate CPU Model Write pA
Write pB
Write pC
Visible order:
Write pA
Write pC
Write pB
Reads and writes take a really long time – too long. So instructions don’t wait for them.
Note that all of the instructions complete before any of the writes reach shared storage (L2), and notice that the writes don’t reach shared storage in order.
A write doesn’t take effect for other processors until it reaches shared storage.
16. Alternate CPU – Reads Pass Reads Read A1
Read A2
Read A1
Visible order:
Read A1
Read A1
Read A2
Point out that a read takes effect when the data leaves shared storage. The time that the data arrives at the processor core is not the relevant metric.
17. Alternate CPU – Writes Pass Reads Read A1
Write A2
Visible order:
Write A2
Read A1
18. Alternate CPU – Reads Pass Writes Read A1
Write A2
Read A2
Read A1
Visible order:
Read A1
Read A1
Write A2
Read A2
Note that at this point the CPU that wrote to A2 will see the new value, but no other CPU will!!!
Reads passing writes is the most expensive type of reordering for CPU designers to avoid.
19. Memory Models "Pass" means "visible before"
Memory models are actually more complex than this
May vary for cacheable/non-cacheable, etc.
This only affects multi-threaded lock-free code!!!
* Only stores to different addresses can pass each other
** Loads to a previously stored address will load that value These are the four basic types of reordering. Actual causes may vary. We’ll see how they can all cause problems.
It is the job of the compiler and of the CPU to only do rearrangements which are safe for single-threaded code. If they rearrange your reads and writes in such a way that it breaks your single-threaded code (with the exception of device drivers) then they are broken.
This also doesn’t affect code that uses locks. Locks are always also memory barriers that prevent reordering so if you use, for instance, critical sections you can ignore these reordering problems.
20. Improbable CPU – Reads Don’t Pass Writes Read A1
Write A2
Read A1 Note the painfully slow read/probe of A1 that is not allowed to pass the write of A2. Such a CPU would be so slow that it would be impractical.
21. Reads Must Pass Writes! Reads not passing writes would mean L1 cache is frequently disabled
Every read that follows a write would stall for shared storage latency
Huge performance impact
Therefore, on x86 and x64 (on all modern CPUs) reads can pass writes
22. Reordering Implications Publisher/Subscriber model
Thread A:
g_data = data;
g_dataReady = true;
Thread B:
if( g_dataReady )
process( g_data );
Is it safe? This algorithm is particularly nice because it has the lowest possible communication overhead. Interlocked operations are usually needed for lockless algorithms and they can be expensive, especially as processor counts rise. This algorithm can be thread safe while just using normal reads and writes. But, as written above it is not safe.
Common question: can threads see each other’s data? Yes, of course.
Follow-up: when? Certainly not instantly. 50-100 cycles later on my current machine. But that’s not useful.
More useful question is: how can I tell when my data is visible to another thread?
Common implementation: write some data, then set a flag.
Concrete question: if I write some data and then set a flag, will any threads be able to see the flag being set without seeing the data?
Now we’re getting somewhere.
Question is still too hard. Let’s break it into two parts:
1) If I write some data and then set a flag, is it possible for the flag to be set in shared memory before the data gets there?
2) If I read a flag and then if it is set read some data, is it possible for the data to have come from shared memory before the flag came from shared memory? (harder to think about)
Short answer to both: yes.
23. Publisher/Subscriber on PowerPC Proc 1:
Write g_data
Write g_dataReady
Proc 2:
Read g_dataReady
Read g_data
Writes may reach L2 out of order Proc 1 only does writes, and proc 2 only does reads, so write/read and read/write reordering is irrelevant. What happens with read/read and write/write reordering? Both can cause problems.
The writes to g_data and g_dataReady can be reordered, which can allow stale data to be read.
24. Publisher/Subscriber on PowerPC Proc 1:
Write g_data
MyExportBarrier();
Write g_dataReady
Proc 2:
Read g_dataReady
Read g_data
Writes now reach L2 in order
Inserting an export barrier avoids this problem – writes now reach L2/shared storage in order.
25. Publisher/Subscriber on PowerPC Proc 1:
Write g_data
MyExportBarrier();
Write g_dataReady
Proc 2:
Read g_dataReady
Read g_data
Reads may leave L2 out of order – g_data may be stale However, reads can also be reordered, even on in-order machines. The second read may be satisfied from L1, meaning that it may end up with stale data. More precisely, it may end up with data that is staler than the first read. Stale data is unavoidable because the speed of light is finite, but inconsistently stale data is a problem – g_dataReady may be fresher than g_data.
What has happened is that the reads have left L2 out of order – g_data left L2 a long time ago, before this piece of code even executed. Other types of read-reordering may be possible.
26. Publisher/Subscriber on PowerPC Proc 1:
Write g_data
MyExportBarrier();
Write g_dataReady
Proc 2:
Read g_dataReady
MyImportBarrier();
Read g_data
It's all good! Inserting an import barrier prevents read reordering. The actual implementation is not documented, but the effect is to make sure that reads following the barrier are at least as fresh as reads before the barrier. One possible implementation would be to send a message down to L2 that takes long enough to return that any pending invalidates are guaranteed to have time to come back from L2.
The crucial takeaways are:
Reordering can happen both on the publisher and subscriber sides of data sharing, so barriers are generally needed in both places.
Barrier positioning is crucial – an export barrier prevents new writes from passing older writes. An import barrier ensures that newer reads are at least as fresh as older reads. The positioning of the barrier determines which reads/writes are ‘new’ and which are ‘old’.
This reordering can happen on in-order CPUs. Even though instructions are executed in order the load/store subsystem may have significant reordering built-in. It depends on the CPU.
x86/x64 FTW!!!
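Putting slides 23–26 together, the complete pattern looks roughly like this, using the MyExportBarrier/MyImportBarrier names defined on the upcoming slides (the wrapper functions are illustrative; process() and the globals are from slide 22):

extern void process( int data );   // Assumed consumer, as on slide 22.

int  g_data;
bool g_dataReady = false;

// Thread A – publisher:
void Publish( int data )
{
    g_data = data;           // Write the payload first...
    MyExportBarrier();       // ...make sure it reaches shared storage first...
    g_dataReady = true;      // ...then publish the flag.
}

// Thread B – subscriber:
void TryConsume()
{
    if( g_dataReady )        // Read the flag first...
    {
        MyImportBarrier();   // ...ensure g_data is at least as fresh...
        process( g_data );   // ...then read the payload.
    }
}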
27. x86/x64 FTW!!! Not so fast…
Compilers are just as evil as processors
Compilers will rearrange your code as much as legally possible
And compilers assume your code is single threaded
Compiler and CPU reordering barriers needed x86/x64 CPUs do not let reads pass reads or writes pass writes. But, this same problem can happen even on x86/x64 systems, because the compiler may do these rearrangements for you.
Compilers will do all four types of rearrangement (maybe more)
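A hypothetical but representative example of the compiler doing exactly the write reordering the CPU discussion warned about:

// What you wrote:
g_data = data;
g_dataReady = true;

// What the optimizer may legally emit, since the two stores are
// independent as far as single-threaded semantics are concerned:
g_dataReady = true;
g_data = data;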
28. MyExportBarrier Prevents reordering of writes by compiler or CPU
Used when handing out access to data
x86/x64: _ReadWriteBarrier();
Compiler intrinsic, prevents compiler reordering
PowerPC: __lwsync();
Hardware barrier, prevents CPU write reordering
ARM: __dmb(); // Full hardware barrier
IA64: __mf(); // Full hardware barrier
Positioning is crucial!
Write the data, MyExportBarrier, write the control value
Export-barrier followed by write is known as write-release semantics Positioning is crucial! We need to make sure that g_data is written before g_dataReady, so the export barrier has to go between them.
It’s possible that we could use _WriteBarrier instead of _ReadWriteBarrier on x86/x64 but it’s not worth the risk.
__lwsync also works to prevent the compiler from reordering across it, so it is both a compiler and a CPU memory barrier.
29. MyImportBarrier(); Prevents reordering of reads by compiler or CPU
Used when gaining access to data
x86/x64: _ReadWriteBarrier();
Compiler intrinsic, prevents compiler reordering
PowerPC: __lwsync(); or isync();
Hardware barrier, prevents CPU read reordering
ARM: __dmb(); // Full hardware barrier
IA64: __mf(); // Full hardware barrier
Positioning is crucial!
Read the control value, MyImportBarrier, read the data
Read followed by import-barrier is known as read-acquire semantics Positioning is again crucial! We need to make sure that g_dataReady is read before g_data, so we place the import barrier between them.
isync can also be used to prevent read reordering, but lwsync is recommended on Xbox 360.
It’s possible that we could use _ReadBarrier instead of _ReadWriteBarrier on x86/x64 but it’s not worth the risk.
__lwsync prevents read passing read, write passing write, and write passing read
Does not prevent read passing write
Adding __lwsync after every instruction gives you the x86/x64 memory model
__sync is full barrier
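Collecting the per-platform choices from slides 28 and 29, the two macros might be defined like this (a sketch; the platform-detection macros are illustrative and the intrinsics are the ones the slides name):

#if defined(_M_IX86) || defined(_M_X64)
    // x86/x64 hardware keeps reads in order and writes in order;
    // only the compiler needs restraining.
    #define MyExportBarrier() _ReadWriteBarrier()
    #define MyImportBarrier() _ReadWriteBarrier()
#elif defined(_XBOX) // Xbox 360 / PowerPC
    // lwsync is both a compiler and a CPU barrier; it prevents read/read,
    // write/write, and write/read reordering (but not read passing write).
    #define MyExportBarrier() __lwsync()
    #define MyImportBarrier() __lwsync()
#elif defined(_M_ARM)
    #define MyExportBarrier() __dmb()   // Full hardware barrier.
    #define MyImportBarrier() __dmb()
#elif defined(_M_IA64)
    #define MyExportBarrier() __mf()    // Full hardware barrier.
    #define MyImportBarrier() __mf()
#endif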
30. Fixed Job Queue #2 Assigning work – one writer only:
if( RoomAvail( readPt, writePt ) ) {
MyImportBarrier();
CircWorkList[ writePt ] = workItem;
MyExportBarrier();
writePt = WRAP( writePt + 1 );
Worker thread – one reader only:
if( DataAvail( writePt, readPt ) ) {
MyImportBarrier();
WorkItem workItem =
CircWorkList[ readPt ];
MyExportBarrier();
readPt = WRAP( readPt + 1 );
DoWork( workItem ); I have a nagging feeling that we actually need four barriers. As a general rule, each thread in a lockless algorithm needs an import and an export barrier. The additional export barrier on the worker thread makes sure that the code doesn’t “free” the buffer space before it finishes reading from it (which can happen due to compiler reordering or writes passing reads). The additional import barrier on the writer thread avoids a possibly theoretical problem where writes to the buffer are moved ahead of the check. This problem seems hard to imagine, and the barrier was put in partly because it makes for greater symmetry, which often implies correctness; it is probably not needed on any processor except possibly the pathologically weird Alpha.
A variant of this job queue, called ATG::LockFreePipe, ships in the Xbox 360 SDK and the DirectX SDK. It initially was missing the second two barriers. It has been used by several titles as a low-overhead way of sending commands between pairs of threads, with two pipes (one in each direction) for each pair of threads.
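The summary later recommends encapsulating lockless algorithms; as a hedged illustration (in the spirit of, but not identical to, ATG::LockFreePipe), the fixed queue could be wrapped like this, reusing the barrier macros sketched after slide 29:

template &lt;typename WorkItem, unsigned kSize&gt;  // kSize must be a power of two.
class LocklessPipe
{
public:
    LocklessPipe() : m_readPt( 0 ), m_writePt( 0 ) {}

    // Writer thread only.
    bool TryPush( const WorkItem&amp; item )
    {
        unsigned next = ( m_writePt + 1 ) &amp; ( kSize - 1 );
        if( next == m_readPt )             // No room.
            return false;
        MyImportBarrier();                 // Don't write the slot before the check.
        m_items[ m_writePt ] = item;
        MyExportBarrier();                 // Slot reaches memory before the index...
        m_writePt = next;                  // ...which is what publishes it.
        return true;
    }

    // Reader thread only.
    bool TryPop( WorkItem* item )
    {
        if( m_readPt == m_writePt )        // No data.
            return false;
        MyImportBarrier();                 // Slot at least as fresh as the index.
        *item = m_items[ m_readPt ];
        MyExportBarrier();                 // Finish reading before freeing the slot.
        m_readPt = ( m_readPt + 1 ) &amp; ( kSize - 1 );
        return true;
    }

private:
    WorkItem          m_items[ kSize ];
    volatile unsigned m_readPt;            // Advanced only by the reader.
    volatile unsigned m_writePt;           // Advanced only by the writer.
};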
31. Dekker’s/Peterson’s Algorithm int T1 = 0, T2 = 0;
Proc 1:
void LockForT1() {
T1 = 1;
if( T2 != 0 ) {
…
Proc 2:
void LockForT2() {
T2 = 1;
if( T1 != 0 ) {
…
}
For this code to work we need to be guaranteed that the reads and writes happen in exact program order
x86/x64 reordering in this code could cause it not to work as expected
32. Dekker’s/Peterson’s Animation Proc 1:
Write T1
Read T2
Proc 2:
Write T2
Read T1
Epic fail! (on x86/x64 also) Reads passing writes. The fourth (and hardest to avoid) type of reordering.
33. Dekker’s/Peterson’s Animation Proc 1:
Write T1
MemoryBarrier();
Read T2
Proc 2:
Write T2
MemoryBarrier();
Read T1
It's all good!
34. Full Memory Barrier MemoryBarrier();
x86: __asm xchg Barrier, eax
x64: __faststorefence();
Xbox 360: __sync();
ARM: __dmb();
IA64: __mf();
Needed for Dekker's algorithm, implementing locks, etc.
Prevents all reordering – including preventing reads passing writes
Most expensive barrier type If you need a full memory barrier then you’re in luck, there is an existing macro which gives you this on x86, x64, and Xbox 360
The regular memory barriers are not sufficient for this. MyExportBarrier and MyImportBarrier don’t prevent reads from passing writes, on any platform.
Note that entering and exiting a lock such as a critical section does a full memory barrier.
35. Dekker’s/Peterson’s Fixed int T1 = 0, T2 = 0;
Proc 1:
void LockForT1() {
T1 = 1;
MemoryBarrier();
if( T2 != 0 ) {
…
Proc 2:
void LockForT2() {
T2 = 1;
MemoryBarrier();
if( T1 != 0 ) {
…
}
36. Dekker’s/Peterson’s Still Broken int T1 = 0, T2 = 0;
Proc 1:
void LockForT1() {
T1 = 1;
MyExportBarrier();
if( T2 != 0 ) {
…
Proc 2:
void LockForT2() {
T2 = 1;
MyExportBarrier();
if( T1 != 0 ) {
…
}
Only a full barrier (expensive as it is) can fix Dekker’s/Peterson’s algorithm. A read/import barrier, a write/export barrier, or both, will not fix it. Using Intel’s lfence or sfence or both (load fence or store fence or both) also doesn’t help – these are not full barriers and do not prevent reads from passing writes.
37. Dekker’s/Peterson’s Still Broken int T1 = 0, T2 = 0;
Proc 1:
void LockForT1() {
T1 = 1;
MyImportBarrier();
if( T2 != 0 ) {
…
Proc 2:
void LockForT2() {
T2 = 1;
MyImportBarrier();
if( T1 != 0 ) {
…
}
38. Dekker’s/Peterson’s Still Broken int T1 = 0, T2 = 0;
Proc 1:
void LockForT1() {
T1 = 1;
MyExportBarrier(); MyImportBarrier();
if( T2 != 0 ) {
…
Proc 2:
void LockForT2() {
T2 = 1;
MyExportBarrier(); MyImportBarrier();
if( T1 != 0 ) {
…
}
39. What About Volatile? Standard volatile semantics not designed for multi-threading
Compiler can move normal reads/writes past volatile reads/writes
Also, doesn’t prevent CPU reordering
VC++ 2005+ volatile is better…
Acts as read-acquire/write-release on x86/x64 and Itanium
Doesn’t prevent hardware reordering on Xbox 360
Watch for atomic<T> in C++0x
Sequentially consistent by default but can choose from four memory models Note that on Itanium VC++ 2005+ will add the necessary hardware memory barriers. On x86/x64 it doesn’t need to. On these platforms marking g_dataReady as volatile is sufficient for publisher/subscriber.
On Xbox 360 it needs to add hardware memory barriers, but doesn’t, so it’s almost useless.
So, don’t use volatile for multi-threading. Its behavior is far too inconsistent, and it is not what volatile is intended for.
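For comparison, here is the publisher/subscriber pattern in C++0x atomic&lt;T&gt; form, using explicit acquire/release orderings rather than the sequentially-consistent default (a sketch; compiler support was still emerging at the time of this talk):

#include &lt;atomic&gt;

extern void process( int data );   // Assumed consumer, as on slide 22.

int g_data;
std::atomic&lt;bool&gt; g_dataReady( false );

// Thread A – publisher:
void Publish( int data )
{
    g_data = data;
    // Release: no earlier write may be reordered past this store.
    g_dataReady.store( true, std::memory_order_release );
}

// Thread B – subscriber:
void TryConsume()
{
    // Acquire: no later read may be reordered before this load.
    if( g_dataReady.load( std::memory_order_acquire ) )
        process( g_data );
}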
40. Double Checked Locking Foo* GetFoo() {
static Foo* volatile s_pFoo;
Foo* tmp = s_pFoo;
if( !tmp ) {
EnterCriticalSection( &initLock );
tmp = s_pFoo; // Reload inside lock
if( !tmp ) {
tmp = new Foo();
s_pFoo = tmp;
}
LeaveCriticalSection( &initLock );
}
return tmp; }
This is broken on many systems C++-compliant compilers are allowed to reorder the instructions from the Foo() constructor and place them after the assignment of tmp to s_pFoo, thus making a partially constructed object visible.
CPU reordering can also cause problems here.
Additionally, CPU reordering can mean that the first CPU (other than the creator) to use the newly created object may see the s_pFoo pointer but not yet see the contents of the object. Thus, two memory barriers are needed – one between construction and publication, the other between reading the pointer and using it.
41. Possible Compiler Rewrite Foo* GetFoo() {
static Foo* volatile s_pFoo;
Foo* tmp = s_pFoo;
if( !tmp ) {
EnterCriticalSection( &initLock );
tmp = s_pFoo; // Reload inside lock
if( !tmp ) {
s_pFoo = (Foo*)new char[sizeof(Foo)];
new(s_pFoo) Foo; tmp = s_pFoo;
}
LeaveCriticalSection( &initLock );
}
return tmp; } Here is an example of what a compiler rewrite of the initialization might look like. The compiler can allocate the memory, assign the pointer, and then construct the object. This is perfectly legal because it makes no difference to a single-threaded program.
42. Double Checked Locking Foo* GetFoo() {
static Foo* volatile s_pFoo;
Foo* tmp = s_pFoo; MyImportBarrier();
if( !tmp ) {
EnterCriticalSection( &initLock );
tmp = s_pFoo; // Reload inside lock
if( !tmp ) {
tmp = new Foo();
MyExportBarrier(); s_pFoo = tmp;
}
LeaveCriticalSection( &initLock );
}
return tmp; }
Fixed Note that VC++ 2005 and later on Windows makes this code work perfectly without barriers because of the volatile semantics – except on Xbox 360 where no barriers are inserted even though they are needed. Sigh…
The import barrier is frustrating because it is a small cost (__lwsync on PowerPC, no real cost on x86/x64) that must be paid every time GetFoo() is called. Many PowerPC programmers wish that there were a type of export barrier that would sync all the way to other cores, instead of just to shared storage, so that an import barrier would not be needed. Unfortunately this would require halting all other cores and then waiting for all writes and invalidates to propagate before restarting them. There typically are operating system methods for doing this, but they aren’t exposed and they are hugely expensive. So, don’t call GetFoo() too frequently – cache the result.
Also, if I’m reading the PowerPC documentation correctly, reading a pointer and then dereferencing the pointer gives you a guaranteed read order – the dereferenced reads are guaranteed to be fresher than the pointer, so the import barrier should be omittable, as long as the compiler can’t rearrange the reads. This is very subtle, will fail on some processors (Alpha for one), and is probably not worth it. Just cache the pointer.
43. InterlockedXxx Necessary to extend lockless algorithms to greater than two threads
A whole separate talk…
InterlockedXxx is a full barrier on Windows for x86, x64, and Itanium
Not a barrier at all on Xbox 360
Oops. Still atomic, just not a barrier
InterlockedXxx Acquire and Release are portable across all platforms
Same guarantees everywhere
Safer than regular InterlockedXxx on Xbox 360
No difference on x86/x64
Recommended Using InterlockedXxx unfortunately means that your code is not portable to Xbox 360. You should always (with extremely rare exceptions) use InterlockedXxx Acquire or Release – whichever is appropriate – if you need Xbox 360 portability.
InterlockedIncrement for reference counting is the most common time for wanting to avoid the cost of the Acquire and Release modifiers and the lwsync that they imply. If in doubt use InterlockedIncrementAcquire (because you are gaining access to the object and want to see other CPUs changes) and use InterlockedDecrementRelease (because you are giving up access to the object and want your changes to be visible). For immutable objects this isn’t necessary, but even immutable objects are modified at creation time, so be careful.
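A sketch of the reference-counting advice above; InterlockedIncrementAcquire and InterlockedDecrementRelease are real Win32 calls, while the class itself is illustrative:

#include &lt;windows.h&gt;

class RefCounted
{
public:
    RefCounted() : m_refCount( 1 ) {}

    void AddRef()
    {
        // Acquire: we are gaining access to the object and want to
        // see other CPUs' changes to it.
        InterlockedIncrementAcquire( &amp;m_refCount );
    }

    void Release()
    {
        // Release: we are giving up access and want our changes to the
        // object to be visible before the count drops.
        if( InterlockedDecrementRelease( &amp;m_refCount ) == 0 )
            delete this;
    }

private:
    ~RefCounted() {}                 // Heap-only; destroyed via Release().
    volatile LONG m_refCount;
};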
44. Practical Lockless Uses Reference counts
Setting a flag to tell a thread to exit
Publisher/Subscriber with one reader and one writer – lockless pipe
SLists
XMCore on Xbox 360
Double checked locking Note that LockFreePipe ships in the Xbox 360 SDK and the DirectX SDK.
45. Barrier Summary MyExportBarrier when publishing data, to prevent write reordering
MyImportBarrier when acquiring data, to prevent read reordering
MemoryBarrier to stop all reordering, including reads passing writes
Identify where you are publishing/releasing and where you are subscribing/acquiring. These three barriers should be portable and sufficient for all of your reordering-prevention needs.
46. Summary Prefer using locks – they are full barriers
Acquiring and releasing a lock is a memory barrier
Use lockless only when costs of locks are shown to be too high
Use pre-built lockless algorithms if possible
Encapsulate lockless algorithms to make them safe to use
Volatile is not a portable solution
Remember that InterlockedXxx is a full barrier on Windows, but not on Xbox 360
47. References Intel memory model documentation in Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide
http://download.intel.com/design/processor/manuals/253668.pdf
AMD "Multiprocessor Memory Access Ordering"
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
PPC memory model explanation
http://www.ibm.com/developerworks/eserver/articles/powerpc.html
Lockless Programming Considerations for Xbox 360 and Microsoft Windows
http://msdn.microsoft.com/en-us/library/bb310595.aspx
Perils of Double Checked Locking
http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
Java Memory Model Cookbook
http://g.oswego.edu/dl/jmm/cookbook.html AMD’s Multiprocessor Memory Access Ordering is currently in section 7.2 of their documentation.
Intel’s is also in section 7.2 of their documentation.
48. Questions? bdawson@microsoft.com
49. Feedback forms Please fill out feedback forms