1. From the GDC web site:
Overview: High-performance multi-threaded code sometimes requires communicating between threads without using locks. Lockless programming forces developers to consider reordering of reads and writes and subtle compiler behaviors in an area not covered by the C and C++ standards. This talk explains the pitfalls and how to dodge them for maximum performance, including real-world examples.
2. Lockless Programming in Games Bruce Dawson
Principal Software Design Engineer
Microsoft
Windows Client Performance
3. Agenda Locks and their problems
Lockless programming – a different set of problems!
Portable lockless programming
Lockless algorithms that work
Conclusions
Focus is on improving intuition on the reordering aspects of lockless programming. Here’s the rough agenda. However, the main goal is to give attendees a better intuition about what reordering ‘feels’ like, so that you can recognize when it might happen and know what to do.
4. Cell phones Please turn off all cell phones, pagers, alarm clocks, crying babies, internal combustion engines, leaf blowers, etc.
5. Mandatory Multi-core Xbox 360: six hardware threads
PS3: nine hardware threads
Windows: quad-core PCs for $500
Multi-threading is mandatory if you want to harness the available power
Luckily it's easy
As long as there is no sharing of non-constant data
Sharing data is tricky
Easiest and safest way is to use OS features such as locks and semaphores. If all access to shared data is done while you hold a lock, then everything behaves in a predictable way.
6. Simple Job Queue Assigning work:
EnterCriticalSection( &workItemsLock );
workItems.push( workItem );
LeaveCriticalSection( &workItemsLock );
Worker threads:
EnterCriticalSection( &workItemsLock );
WorkItem workItem = workItems.front();
workItems.pop();
LeaveCriticalSection( &workItemsLock );
DoWork( workItem );
The critical section is necessary to maintain the integrity of the queue as the worker threads and one or more threads assigning work all manipulate the shared workItems queue.
More sophisticated systems with multiple job queues and work stealing are certainly possible and worthwhile.
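A hedged sketch of this locked queue as a self-contained unit (WorkItem and the helper functions are illustrative; the critical section must be initialized once at startup). Unlike the slide code, this version also checks for an empty queue:

#include &lt;windows.h&gt;
#include &lt;queue&gt;

struct WorkItem { int jobId; };        // Illustrative payload.

std::queue&lt;WorkItem&gt; workItems;
CRITICAL_SECTION workItemsLock;        // InitializeCriticalSection() at startup.

// Assigning work – any thread:
void AddWork( const WorkItem&amp; workItem )
{
    EnterCriticalSection( &amp;workItemsLock );
    workItems.push( workItem );
    LeaveCriticalSection( &amp;workItemsLock );
}

// Worker threads – returns false if no work was available:
bool GetWork( WorkItem* workItem )
{
    EnterCriticalSection( &amp;workItemsLock );
    bool avail = !workItems.empty();
    if( avail )
    {
        *workItem = workItems.front();
        workItems.pop();
    }
    LeaveCriticalSection( &amp;workItemsLock );
    return avail;
}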
7. The Problem With Locks… Overhead – acquiring and releasing locks takes time
So don’t acquire locks too often
Deadlocks – lock acquisition order must be consistent to avoid these
So don’t have very many locks, or only acquire one at a time
Contention – sometimes somebody else has the lock
So never hold locks for too long – contradicts point 1
So have lots of little locks – contradicts point 2
Priority inversions – if a thread is swapped out while holding a lock, progress may stall
Changing thread priorities can lead to this
Xbox 360 system threads can briefly cause this Note that the performance of locks is improving – critical sections have had their performance improved by orders of magnitude in the last ~ten years. There are cases where code has been “optimized” by eliminating locks only to end up running slower.
8. Sensible Reaction Use locks carefully
Don't lock too frequently
Don't lock for too long
Don't use too many locks
Don't have one central lock
Or, try lockless
9. Lockless Programming Techniques for safe multi-threaded data sharing without locks
Pros:
May have lower overhead
Avoids deadlocks
May reduce contention
Avoids priority inversions
Cons
Very limited abilities
Extremely tricky to get right
Generally non-portable Are you sure you really need lockless programming?
Limited abilities because your most powerful primitive is an atomic CompareAndSwap, which is subtle and dangerous to use.
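A minimal sketch of what building on that primitive looks like – a lock-free fetch-and-add implemented as a compare-and-swap retry loop. The function name is illustrative; InterlockedCompareExchange is the real Win32 call:

#include &lt;windows.h&gt;

LONG LocklessAdd( volatile LONG* value, LONG delta )
{
    for( ;; )
    {
        LONG oldValue = *value;               // Read the current value.
        LONG newValue = oldValue + delta;     // Compute the desired value.
        // Publish newValue only if nobody changed *value in the meantime;
        // otherwise another thread won the race, so loop and try again.
        if( InterlockedCompareExchange( value, newValue, oldValue ) == oldValue )
            return newValue;
    }
}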
10. Job Queue Again Assigning work:
EnterCriticalSection( &workItemsLock );
workItems.push( workItem );
LeaveCriticalSection( &workItemsLock );
Worker threads:
EnterCriticalSection( &workItemsLock );
WorkItem workItem = workItems.front();
workItems.pop();
LeaveCriticalSection( &workItemsLock );
DoWork( workItem );
Here’s our old friend the job queue with locks.
11. Lockless Job Queue #1 Assigning work:
EnterCriticalSection( &workItemsLock );
InterlockedPushEntrySList( workItem );
LeaveCriticalSection( &workItemsLock );
Worker threads:
EnterCriticalSection( &workItemsLock );
WorkItem workItem =
InterlockedPopEntrySList();
LeaveCriticalSection( &workItemsLock );
DoWork( workItem ); First we swap the STL collection for an SList, which is a lockless, thread-safe singly-linked list that is part of Windows.
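The slide code abbreviates the SList API; a sketch of the real calls (WorkItem is illustrative). Note that SList entries must be 16-byte aligned, so allocate items with _aligned_malloc( sizeof(WorkItem), MEMORY_ALLOCATION_ALIGNMENT ) or similar:

#include &lt;windows.h&gt;

struct WorkItem
{
    SLIST_ENTRY entry;   // Must be the first member for this simple cast.
    int         payload;
};

SLIST_HEADER g_workItems;

void InitQueue()
{
    InitializeSListHead( &amp;g_workItems );
}

void PushWork( WorkItem* item )
{
    InterlockedPushEntrySList( &amp;g_workItems, &amp;item-&gt;entry );
}

WorkItem* PopWork()
{
    // Returns NULL when the list is empty.
    return (WorkItem*)InterlockedPopEntrySList( &amp;g_workItems );
}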
12. Lockless Job Stack #1 Assigning work:
InterlockedPushEntrySList( workItem );
Worker threads:
WorkItem workItem =
InterlockedPopEntrySList();
DoWork( workItem ); Then once we are using a lockless list, we can get rid of the critical section. Lockless!
So, are we done? No. It is suboptimal because we no longer have a queue, we have a stack, so our work items are no longer processed in order. Implementing data structures in a lockless manner is hard.
And, on Xbox 360 this is broken for reasons I’ll get to later. It should really be InterlockedPushEntrySListRelease and InterlockedPopEntrySListAcquire to make this safe on all platforms; otherwise it is broken on Xbox 360.
What if we want a queue, and something lower overhead?
13. Lockless Job Queue #2 Assigning work – one writer only:
if( RoomAvail( readPt, writePt ) ) {
CircWorkList[ writePt ] = workItem;
writePt = WRAP( writePt + 1 );
Worker thread – one reader only:
if( DataAvail( writePt, readPt ) ) {
WorkItem workItem =
CircWorkList[ readPt ];
readPt = WRAP( readPt + 1 );
DoWork( workItem ); Version #1 of our lockless job queue had the limitation of being a stack rather than a queue.
Version #2 of our lockless job queue has the limitation of having a single reader and single writer. It can be extended to multiple readers and writers by using interlocked operations, but this extension is tricky, reduces performance, and will not be shown here.
Welcome to the limited world of lockless programming.
And, to top it all off, it’s broken. It is not reliable on any computer I know of, because of compiler and CPU reordering.
Lockless programming is perilous. CPUs and compilers will do anything to make your code run faster, even if that changes the meaning of your code. This is always hidden from you when you write single-threaded code, or when you write multi-threaded code using locks, but as soon as you write lockless code you are in a scary world.
This scary world is very unintuitive. CPUs will rearrange reads and writes in ways that seem very unintuitive. However, there is a method to their madness, so the first step is to visualize what is going on, and why.
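The slide leaves RoomAvail, DataAvail, and WRAP undefined; here is one plausible definition, assuming a power-of-two capacity so that WRAP is a cheap mask (these helpers are illustrative, not from the talk):

const unsigned kQueueSize = 256;              // Must be a power of two.
WorkItem CircWorkList[ kQueueSize ];
unsigned readPt = 0, writePt = 0;

#define WRAP( n ) ( (n) &amp; ( kQueueSize - 1 ) )

// Room to write if advancing writePt would not collide with readPt
// (one slot is deliberately left empty to distinguish full from empty).
bool RoomAvail( unsigned readPt, unsigned writePt )
{
    return WRAP( writePt + 1 ) != readPt;
}

// Data to read if the two indices differ.
bool DataAvail( unsigned writePt, unsigned readPt )
{
    return readPt != writePt;
}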
14. Simple CPU/Compiler Model Read pC
Write pA
Write pB
Read pD
Write pC Simple model is that reads and writes happen instantaneously, one after another. This is how computers worked in the old days. 6502, 68000, these machines were so slow that memory could keep up and loads/stores happened exactly when you thought they did, with minimal pipelining.
But actually, reads and writes take a really long time – too long.
Note that ‘pass’ (as in reads pass writes) means a newer operation (read) passes an older operation (write). Note also that this refers to reads and writes from a single CPU core passing each other. Each processor has an independent stream of events
Point out that a write doesn’t take effect for other processors until it reaches shared storage.
15. Alternate CPU Model Write pA
Write pB
Write pC
Visible order:
Write pA
Write pC
Write pB
Reads and writes take a really long time – too long. So instructions don’t wait for them.
Note that all of the instructions complete before any of the writes reach shared storage (L2), and notice that the writes don’t reach shared storage in order.
A write doesn’t take effect for other processors until it reaches shared storage.
16. Alternate CPU – Reads Pass Reads Read A1
Read A2
Read A1
Visible order:
Read A1
Read A1
Read A2
Point out that a read takes effect when the data leaves shared storage. The time that the data arrives at the processor core is not the relevant metric.
17. Alternate CPU – Writes Pass Reads Read A1
Write A2
Visible order:
Write A2
Read A1
18. Alternate CPU – Reads Pass Writes Read A1
Write A2
Read A2
Read A1
Visible order:
Read A1
Read A1
Write A2
Read A2
Note that at this point the CPU that wrote to A2 will see the new value, but no other CPU will!!!
Reads passing writes is the most expensive type of reordering for CPU designers to avoid.
19. Memory Models "Pass" means "visible before"
Memory models are actually more complex than this
May vary for cacheable/non-cacheable, etc.
This only affects multi-threaded lock-free code!!!
* Only stores to different addresses can pass each other
** Loads to a previously stored address will load that value These are the four basic types of reordering. Actual causes may vary. We’ll see how they can all cause problems.
It is the job of the compiler and of the CPU to only do rearrangements which are safe for single-threaded code. If they rearrange your reads and writes in such a way that it breaks your single-threaded code (with the exception of device drivers) then they are broken.
This also doesn’t affect code that uses locks. Locks are always also memory barriers that prevent reordering so if you use, for instance, critical sections you can ignore these reordering problems.
20. Improbable CPU – Reads Don’t Pass Writes Read A1
Write A2
Read A1 Note the painfully slow read/probe of A1 that is not allowed to pass the write of A2. Such a CPU would be so slow that it would be impractical.
21. Reads Must Pass Writes! Reads not passing writes would mean L1 cache is frequently disabled
Every read that follows a write would stall for shared storage latency
Huge performance impact
Therefore, on x86 and x64 (on all modern CPUs) reads can pass writes
22. Reordering Implications Publisher/Subscriber model
Thread A:
g_data = data;
g_dataReady = true;
Thread B:
if( g_dataReady )
process( g_data );
Is it safe? This algorithm is particularly nice because it has the lowest possible communication overhead. Interlocked operations are usually needed for lockless algorithms and they can be expensive, especially as processor counts rise. This algorithm can be thread safe while just using normal reads and writes. But, as written above it is not safe.
Common question: can threads see each other’s data? Yes, of course.
Follow-up: when? Certainly not instantly. 50-100 cycles later on my current machine. But that’s not useful.
More useful question is: how can I tell when my data is visible to another thread?
Common implementation: write some data, then set a flag.
Concrete question: if I write some data and then set a flag, will any threads be able to see the flag being set without seeing the data?
Now we’re getting somewhere.
Question is still too hard. Let’s break it into two parts:
1) If I write some data and then set a flag, is it possible for the flag to be set in shared memory before the data gets there?
2) If I read a flag and then if it is set read some data, is it possible for the data to have come from shared memory before the flag came from shared memory? (harder to think about)
Short answer to both: yes.
23. Publisher/Subscriber on PowerPC Proc 1:
Write g_data
Write g_dataReady
Proc 2:
Read g_dataReady
Read g_data
Writes may reach L2 out of order Proc 1 only does writes, and proc 2 only does reads, so write/read and read/write reordering is irrelevant. What happens with read/read and write/write reordering? Both can cause problems.
The writes to g_data and g_dataReady can be reordered, which can allow stale data to be read.
24. Publisher/Subscriber on PowerPC Proc 1:
Write g_data
MyExportBarrier();
Write g_dataReady
Proc 2:
Read g_dataReady
Read g_data
Writes now reach L2 in order
Inserting an export barrier avoids this problem – writes now reach L2/shared storage in order.
25. Publisher/Subscriber on PowerPC Proc 1:
Write g_data
MyExportBarrier();
Write g_dataReady
Proc 2:
Read g_dataReady
Read g_data
Reads may leave L2 out of order – g_data may be stale However, reads can also be reordered, even on in-order machines. The second read may be satisfied from L1, meaning that it may end up with stale data. More precisely, it may end up with data that is staler than the first read. Stale data is unavoidable because the speed of light is finite, but inconsistently stale data is a problem – g_dataReady may be fresher than g_data.
What has happened is that the reads have left L2 out of order – g_data left L2 a long time ago, before this piece of code even executed. Other types of read-reordering may be possible.
26. Publisher/Subscriber on PowerPC Proc 1:
Write g_data
MyExportBarrier();
Write g_dataReady
Proc 2:
Read g_dataReady
MyImportBarrier();
Read g_data
It's all good! Inserting an import barrier prevents read reordering. The actual implementation is not documented, but the effect is to make sure that reads following the barrier are at least as fresh as reads before the barrier. One possible implementation would be to send a message down to L2 that takes long enough to return that any pending invalidates are guaranteed to have time to come back from L2.
The crucial takeaways are:
Reordering can happen both on the publisher and subscriber sides of data sharing, so barriers are generally needed in both places.
Barrier positioning is crucial – an export barrier prevents new writes from passing older writes. An import barrier ensures that newer reads are at least as fresh as older reads. The positioning of the barrier determines which reads/writes are ‘new’ and which are ‘old’.
This reordering can happen on in-order CPUs. Even though instructions are executed in order the load/store subsystem may have significant reordering built-in. It depends on the CPU.
x86/x64 FTW!!!
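Putting slides 23–26 together, the complete pattern looks roughly like this, using the MyExportBarrier/MyImportBarrier names defined on the upcoming slides (the wrapper functions are illustrative; process() and the globals are from slide 22):

extern void process( int data );   // Assumed consumer, as on slide 22.

int  g_data;
bool g_dataReady = false;

// Thread A – publisher:
void Publish( int data )
{
    g_data = data;           // Write the payload first...
    MyExportBarrier();       // ...make sure it reaches shared storage first...
    g_dataReady = true;      // ...then publish the flag.
}

// Thread B – subscriber:
void TryConsume()
{
    if( g_dataReady )        // Read the flag first...
    {
        MyImportBarrier();   // ...ensure g_data is at least as fresh...
        process( g_data );   // ...then read the payload.
    }
}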
27. x86/x64 FTW!!! Not so fast…
Compilers are just as evil as processors
Compilers will rearrange your code as much as legally possible
And compilers assume your code is single threaded
Compiler and CPU reordering barriers needed x86/x64 CPUs do not let reads pass reads or writes pass writes. But, this same problem can happen even on x86/x64 systems, because the compiler may do these rearrangements for you.
Compilers will do all four types of rearrangement (maybe more)
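A hypothetical but representative example of the compiler doing exactly the write reordering the CPU discussion warned about:

// What you wrote:
g_data = data;
g_dataReady = true;

// What the optimizer may legally emit, since the two stores are
// independent as far as single-threaded semantics are concerned:
g_dataReady = true;
g_data = data;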
28. MyExportBarrier Prevents reordering of writes by compiler or CPU
Used when handing out access to data
x86/x64: _ReadWriteBarrier();
Compiler intrinsic, prevents compiler reordering
PowerPC: __lwsync();
Hardware barrier, prevents CPU write reordering
ARM: __dmb(); // Full hardware barrier
IA64: __mf(); // Full hardware barrier
Positioning is crucial!
Write the data, MyExportBarrier, write the control value
Export-barrier followed by write is known as write-release semantics Positioning is crucial! We need to make sure that g_data is written before g_dataReady, so the export barrier has to go between them.
It’s possible that we could use _WriteBarrier instead of _ReadWriteBarrier on x86/x64 but it’s not worth the risk.
__lwsync also works to prevent the compiler from reordering across it, so it is both a compiler and a CPU memory barrier.
29. MyImportBarrier(); Prevents reordering of reads by compiler or CPU
Used when gaining access to data
x86/x64: _ReadWriteBarrier();
Compiler intrinsic, prevents compiler reordering
PowerPC: __lwsync(); or isync();
Hardware barrier, prevents CPU read reordering
ARM: __dmb(); // Full hardware barrier
IA64: __mf(); // Full hardware barrier
Positioning is crucial!
Read the control value, MyImportBarrier, read the data
Read followed by import-barrier is known as read-acquire semantics Positioning is again crucial! We need to make sure that g_dataReady is read before g_data, so we place the import barrier between them.
isync can also be used to prevent read reordering, but lwsync is recommended on Xbox 360.
It’s possible that we could use _ReadBarrier instead of _ReadWriteBarrier on x86/x64 but it’s not worth the risk.
__lwsync prevents read passing read, write passing write, and write passing read
Does not prevent read passing write
Adding __lwsync after every instruction gives you the x86/x64 memory model
__sync is full barrier
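Collecting the per-platform choices from slides 28 and 29, the two macros might be defined like this (a sketch; the platform-detection macros are illustrative and the intrinsics are the ones the slides name):

#if defined(_M_IX86) || defined(_M_X64)
    // x86/x64 hardware keeps reads in order and writes in order;
    // only the compiler needs restraining.
    #define MyExportBarrier() _ReadWriteBarrier()
    #define MyImportBarrier() _ReadWriteBarrier()
#elif defined(_XBOX) // Xbox 360 / PowerPC
    // lwsync is both a compiler and a CPU barrier; it prevents read/read,
    // write/write, and write/read reordering (but not read passing write).
    #define MyExportBarrier() __lwsync()
    #define MyImportBarrier() __lwsync()
#elif defined(_M_ARM)
    #define MyExportBarrier() __dmb()   // Full hardware barrier.
    #define MyImportBarrier() __dmb()
#elif defined(_M_IA64)
    #define MyExportBarrier() __mf()    // Full hardware barrier.
    #define MyImportBarrier() __mf()
#endif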
30. Fixed Job Queue #2 Assigning work – one writer only:
if( RoomAvail( readPt, writePt ) ) {
MyImportBarrier();
CircWorkList[ writePt ] = workItem;
MyExportBarrier();
writePt = WRAP( writePt + 1 );
Worker thread – one reader only:
if( DataAvail( writePt, readPt ) ) {
MyImportBarrier();
WorkItem workItem =
CircWorkList[ readPt ];
MyExportBarrier();
readPt = WRAP( readPt + 1 );
DoWork( workItem ); I have a nagging feeling that we actually need four barriers. As a general rule, each thread in a lockless algorithm needs an import and an export barrier. The additional export barrier on the worker thread makes sure that the code doesn’t “free” the buffer space before it finishes reading from it (which can happen due to compiler reordering or writes passing reads). The additional import barrier on the writer thread avoids a possibly theoretical problem where writes to the buffer are moved ahead of the check. This problem seems hard to imagine, and the barrier was put in partly because it makes for greater symmetry, which often implies correctness; it is probably not needed on any processor except possibly the pathologically weird Alpha.
A variant of this job queue, called ATG::LockFreePipe, ships in the Xbox 360 SDK and the DirectX SDK. It initially was missing the second two barriers. It has been used by several titles as a low-overhead way of sending commands between pairs of threads, with two pipes (one in each direction) for each pair of threads.
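The summary later recommends encapsulating lockless algorithms; as a hedged illustration (in the spirit of, but not identical to, ATG::LockFreePipe), the fixed queue could be wrapped like this, reusing the barrier macros sketched after slide 29:

template &lt;typename WorkItem, unsigned kSize&gt;  // kSize must be a power of two.
class LocklessPipe
{
public:
    LocklessPipe() : m_readPt( 0 ), m_writePt( 0 ) {}

    // Writer thread only.
    bool TryPush( const WorkItem&amp; item )
    {
        unsigned next = ( m_writePt + 1 ) &amp; ( kSize - 1 );
        if( next == m_readPt )             // No room.
            return false;
        MyImportBarrier();                 // Don't write the slot before the check.
        m_items[ m_writePt ] = item;
        MyExportBarrier();                 // Slot reaches memory before the index...
        m_writePt = next;                  // ...which is what publishes it.
        return true;
    }

    // Reader thread only.
    bool TryPop( WorkItem* item )
    {
        if( m_readPt == m_writePt )        // No data.
            return false;
        MyImportBarrier();                 // Slot at least as fresh as the index.
        *item = m_items[ m_readPt ];
        MyExportBarrier();                 // Finish reading before freeing the slot.
        m_readPt = ( m_readPt + 1 ) &amp; ( kSize - 1 );
        return true;
    }

private:
    WorkItem          m_items[ kSize ];
    volatile unsigned m_readPt;            // Advanced only by the reader.
    volatile unsigned m_writePt;           // Advanced only by the writer.
};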
31. Dekker’s/Peterson’s Algorithm int T1 = 0, T2 = 0;
Proc 1:
void LockForT1() {
T1 = 1;
if( T2 != 0 ) {
…
Proc 2:
void LockForT2() {
T2 = 1;
if( T1 != 0 ) {
…
}
For this code to work we need to be guaranteed that the reads and writes happen in exact program order
x86/x64 reordering in this code could cause it not to work as expected
32. Dekker’s/Peterson’s Animation Proc 1:
Write T1
Read T2
Proc 2:
Write T2
Read T1
Epic fail! (on x86/x64 also) Reads passing writes. The fourth (and hardest to avoid) type of reordering.
33. Dekker’s/Peterson’s Animation Proc 1:
Write T1
MemoryBarrier();
Read T2
Proc 2:
Write T2
MemoryBarrier();
Read T1
It's all good!
34. Full Memory Barrier MemoryBarrier();
x86: __asm xchg Barrier, eax
x64: __faststorefence();
Xbox 360: __sync();
ARM: __dmb();
IA64: __mf();
Needed for Dekker's algorithm, implementing locks, etc.
Prevents all reordering – including preventing reads passing writes
Most expensive barrier type If you need a full memory barrier then you’re in luck, there is an existing macro which gives you this on x86, x64, and Xbox 360
The regular memory barriers are not sufficient for this. MyExportBarrier and MyImportBarrier don’t prevent reads from passing writes, on any platform.
Note that entering and exiting a lock such as a critical section does a full memory barrier.
35. Dekker’s/Peterson’s Fixed int T1 = 0, T2 = 0;
Proc 1:
void LockForT1() {
T1 = 1;
MemoryBarrier();
if( T2 != 0 ) {
…
Proc 2:
void LockForT2() {
T2 = 1;
MemoryBarrier();
if( T1 != 0 ) {
…
}
36. Dekker’s/Peterson’s Still Broken int T1 = 0, T2 = 0;
Proc 1:
void LockForT1() {
T1 = 1;
MyExportBarrier();
if( T2 != 0 ) {
…
Proc 2:
void LockForT2() {
T2 = 1;
MyExportBarrier();
if( T1 != 0 ) {
…
}
Only a full barrier (expensive as it is) can fix Dekker’s/Peterson’s algorithm. A read/import barrier, a write/export barrier, or both, will not fix it. Using Intel’s lfence or sfence or both (load fence or store fence or both) also doesn’t help – these are not full barriers and do not prevent reads from passing writes.
37. Dekker’s/Peterson’s Still Broken int T1 = 0, T2 = 0;
Proc 1:
void LockForT1() {
T1 = 1;
MyImportBarrier();
if( T2 != 0 ) {
…
Proc 2:
void LockForT2() {
T2 = 1;
MyImportBarrier();
if( T1 != 0 ) {
…
}
38. Dekker’s/Peterson’s Still Broken int T1 = 0, T2 = 0;
Proc 1:
void LockForT1() {
T1 = 1;
MyExportBarrier(); MyImportBarrier();
if( T2 != 0 ) {
…
Proc 2:
void LockForT2() {
T2 = 1;
MyExportBarrier(); MyImportBarrier();
if( T1 != 0 ) {
…
}
39. What About Volatile? Standard volatile semantics not designed for multi-threading
Compiler can move normal reads/writes past volatile reads/writes
Also, doesn’t prevent CPU reordering
VC++ 2005+ volatile is better…
Acts as read-acquire/write-release on x86/x64 and Itanium
Doesn’t prevent hardware reordering on Xbox 360
Watch for atomic<T> in C++0x
Sequentially consistent by default but can choose from four memory models Note that on Itanium VC++ 2005+ will add the necessary hardware memory barriers. On x86/x64 it doesn’t need to. On these platforms marking g_dataReady as volatile is sufficient for publisher/subscriber.
On Xbox 360 it needs to add hardware memory barriers, but doesn’t, so it’s almost useless.
So, don’t use volatile for multi-threading. Its behavior is far too inconsistent, and it is not what volatile is intended for.
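For comparison, here is the publisher/subscriber pattern in C++0x atomic&lt;T&gt; form, using explicit acquire/release orderings rather than the sequentially-consistent default (a sketch; compiler support was still emerging at the time of this talk):

#include &lt;atomic&gt;

extern void process( int data );   // Assumed consumer, as on slide 22.

int g_data;
std::atomic&lt;bool&gt; g_dataReady( false );

// Thread A – publisher:
void Publish( int data )
{
    g_data = data;
    // Release: no earlier write may be reordered past this store.
    g_dataReady.store( true, std::memory_order_release );
}

// Thread B – subscriber:
void TryConsume()
{
    // Acquire: no later read may be reordered before this load.
    if( g_dataReady.load( std::memory_order_acquire ) )
        process( g_data );
}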
40. Double Checked Locking Foo* GetFoo() {
static Foo* volatile s_pFoo;
Foo* tmp = s_pFoo;
if( !tmp ) {
EnterCriticalSection( &initLock );
tmp = s_pFoo; // Reload inside lock
if( !tmp ) {
tmp = new Foo();
s_pFoo = tmp;
}
LeaveCriticalSection( &initLock );
}
return tmp; }
This is broken on many systems C++-compliant compilers are allowed to reorder the instructions from the Foo() constructor and place them after the assignment of tmp to s_pFoo, thus making a partially constructed object visible.
CPU reordering can also cause problems here.
Additionally, CPU reordering can mean that the first CPU (other than the creator) to use the newly created object may see the s_pFoo pointer but not yet see the contents of the object. Thus, two memory barriers are needed – one between construction and publication, the other between reading the pointer and using it.
41. Possible Compiler Rewrite Foo* GetFoo() {
static Foo* volatile s_pFoo;
Foo* tmp = s_pFoo;
if( !tmp ) {
EnterCriticalSection( &initLock );
tmp = s_pFoo; // Reload inside lock
if( !tmp ) {
s_pFoo = (Foo*)new char[sizeof(Foo)];
new(s_pFoo) Foo; tmp = s_pFoo;
}
LeaveCriticalSection( &initLock );
}
return tmp; } Here is an example of what a compiler rewrite of the initialization might look like. The compiler can allocate the memory, assign the pointer, and then construct the object. This is perfectly legal because it makes no difference to a single-threaded program.
42. Double Checked Locking Foo* GetFoo() {
static Foo* volatile s_pFoo;
Foo* tmp = s_pFoo; MyImportBarrier();
if( !tmp ) {
EnterCriticalSection( &initLock );
tmp = s_pFoo; // Reload inside lock
if( !tmp ) {
tmp = new Foo();
MyExportBarrier(); s_pFoo = tmp;
}
LeaveCriticalSection( &initLock );
}
return tmp; }
Fixed Note that VC++ 2005 and later on Windows makes this code work perfectly without barriers because of the volatile semantics – except on Xbox 360 where no barriers are inserted even though they are needed. Sigh…
The import barrier is frustrating because it is a small cost (__lwsync on PowerPC, no real cost on x86/x64) that must be paid every time GetFoo() is called. Many PowerPC programmers wish that there were a type of export barrier that would sync all the way to other cores, instead of just to shared storage, so that an import barrier would not be needed. Unfortunately this would require halting all other cores and then waiting for all writes and invalidates to propagate before restarting them. There typically are operating system methods for doing this, but they aren’t exposed and they are hugely expensive. So, don’t call GetFoo() too frequently – cache the result.
Also, if I’m reading the PowerPC documentation correctly, reading a pointer and then dereferencing the pointer gives you a guaranteed read order – the dereferenced reads are guaranteed to be fresher than the pointer, so the import barrier should be omittable, as long as the compiler can’t rearrange the reads. This is very subtle, will fail on some processors (Alpha for one), and is probably not worth it. Just cache the pointer.
43. InterlockedXxx Necessary to extend lockless algorithms to greater than two threads
A whole separate talk…
InterlockedXxx is a full barrier on Windows for x86, x64, and Itanium
Not a barrier at all on Xbox 360
Oops. Still atomic, just not a barrier
InterlockedXxx Acquire and Release are portable across all platforms
Same guarantees everywhere
Safer than regular InterlockedXxx on Xbox 360
No difference on x86/x64
Recommended Using InterlockedXxx unfortunately means that your code is not portable to Xbox 360. You should always (with extremely rare exceptions) use InterlockedXxx Acquire or Release – whichever is appropriate – if you need Xbox 360 portability.
InterlockedIncrement for reference counting is the most common time for wanting to avoid the cost of the Acquire and Release modifiers and the lwsync that they imply. If in doubt use InterlockedIncrementAcquire (because you are gaining access to the object and want to see other CPUs changes) and use InterlockedDecrementRelease (because you are giving up access to the object and want your changes to be visible). For immutable objects this isn’t necessary, but even immutable objects are modified at creation time, so be careful.
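A sketch of the reference-counting advice above; InterlockedIncrementAcquire and InterlockedDecrementRelease are real Win32 calls, while the class itself is illustrative:

#include &lt;windows.h&gt;

class RefCounted
{
public:
    RefCounted() : m_refCount( 1 ) {}

    void AddRef()
    {
        // Acquire: we are gaining access to the object and want to
        // see other CPUs' changes to it.
        InterlockedIncrementAcquire( &amp;m_refCount );
    }

    void Release()
    {
        // Release: we are giving up access and want our changes to the
        // object to be visible before the count drops.
        if( InterlockedDecrementRelease( &amp;m_refCount ) == 0 )
            delete this;
    }

private:
    ~RefCounted() {}                 // Heap-only; destroyed via Release().
    volatile LONG m_refCount;
};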
44. Practical Lockless Uses Reference counts
Setting a flag to tell a thread to exit
Publisher/Subscriber with one reader and one writer – lockless pipe
SLists
XMCore on Xbox 360
Double checked locking Note that LockFreePipe ships in the Xbox 360 SDK and the DirectX SDK.
45. Barrier Summary MyExportBarrier when publishing data, to prevent write reordering
MyImportBarrier when acquiring data, to prevent read reordering
MemoryBarrier to stop all reordering, including reads passing writes
Identify where you are publishing/releasing and where you are subscribing/acquiring. These three barriers should be portable and sufficient for all of your reordering-prevention needs.
46. Summary Prefer using locks – they are full barriers
Acquiring and releasing a lock is a memory barrier
Use lockless only when costs of locks are shown to be too high
Use pre-built lockless algorithms if possible
Encapsulate lockless algorithms to make them safe to use
Volatile is not a portable solution
Remember that InterlockedXxx is a full barrier on Windows, but not on Xbox 360
47. References Intel memory model documentation in Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide
http://download.intel.com/design/processor/manuals/253668.pdf
AMD "Multiprocessor Memory Access Ordering"
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
PPC memory model explanation
http://www.ibm.com/developerworks/eserver/articles/powerpc.html
Lockless Programming Considerations for Xbox 360 and Microsoft Windows
http://msdn.microsoft.com/en-us/library/bb310595.aspx
Perils of Double Checked Locking
http://www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf
Java Memory Model Cookbook
http://g.oswego.edu/dl/jmm/cookbook.html AMD’s Multiprocessor Memory Access Ordering is currently in section 7.2 of their documentation.
Intel’s is also in section 7.2 of their documentation.
48. Questions? bdawson@microsoft.com
49. Feedback forms Please fill out feedback forms