Intel’s ‘cmpxchg’ instruction

Intel’s ‘cmpxchg’ instruction How does the Linux kernel’s ‘cmos_lock’ mechanism work?

Review of the i386 registers CS EAX DS EBX ES ECX FS EDX GS ESI SS EDI Segment Registers (16-bits) EBP ESP EIP General Registers (32-bits) EFLAGS Program Control and Status Registers (32 bits)

The x86 ‘system’ registers CR0 DR0 CR1 DR1 CR2 DR2 CR3 DR3 CR4 DR4 CR5 DR5 CR6 DR6 (16-bits) CR7 DR7 LDTR Control Registers (32-bits) Debug Registers (32-bits) TR GDTR means ‘unimplemented’ IDTR (48-bits)

How often is ‘cmpxchg’ used? $ cat vmlinux.asm | grep cmpxchg c01046de: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993c c0105591: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993c c01055d9: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993c c010b895: f0 0f b1 11 lock cmpxchg %edx,(%ecx) c010b949: f0 0f b1 0b lock cmpxchg %ecx,(%ebx) c0129a9f: f0 0f b1 0b lock cmpxchg %ecx,(%ebx) c0129acf: f0 0f b1 0b lock cmpxchg %ecx,(%ebx) c012d377: f0 0f b1 0e lock cmpxchg %ecx,(%esi) c012d41a: f0 0f b1 0e lock cmpxchg %ecx,(%esi) c012d968: f0 0f b1 16 lock cmpxchg %edx,(%esi) c012e568: f0 0f b1 2e lock cmpxchg %ebp,(%esi) c012e57a: f0 0f b1 2e lock cmpxchg %ebp,(%esi) c012e58a: f0 0f b1 2e lock cmpxchg %ebp,(%esi) c012e83f: f0 0f b1 13 lock cmpxchg %edx,(%ebx) c012e931: f0 0f b1 0a lock cmpxchg %ecx,(%edx) c012ea94: f0 0f b1 11 lock cmpxchg %edx,(%ecx) c012ecf4: f0 0f b1 13 lock cmpxchg %edx,(%ebx) c012f08e: f0 0f b1 4b 18 lock cmpxchg %ecx,0x18(%ebx) c012f163: f0 0f b1 11 lock cmpxchg %edx,(%ecx) c013cb60: f0 0f b1 0e lock cmpxchg %ecx,(%esi) c0148b3c: f0 0f b1 29 lock cmpxchg %ebp,(%ecx) c0150d0f: f0 0f b1 3b lock cmpxchg %edi,(%ebx) c0150d87: f0 0f b1 31 lock cmpxchg %esi,(%ecx) c0199c5e: f0 0f b1 0b lock cmpxchg %ecx,(%ebx) c024b06f: f0 0f b1 0b lock cmpxchg %ecx,(%ebx) c024b2fe: f0 0f b1 51 18 lock cmpxchg %edx,0x18(%ecx) c024b321: f0 0f b1 51 18 lock cmpxchg %edx,0x18(%ecx) c024b34b: f0 0f b1 4b 18 lock cmpxchg %ecx,0x18(%ebx) c024b960: f0 0f b1 53 18 lock cmpxchg %edx,0x18(%ebx) Here’s the occurrence that we studied in the ‘rtc_cmos_read()’ kernel-function… …plus 28 other times!

Intel’s documentation • You can find out what any of the Intel x86 instructions does by consulting the official software developer’s manual, online at: http://www.intel.com/products/processor/manuals/index.htm • Our course-webpage has a link to this site that you can just click (under ‘Resources’) • The instruction-set reference is two parts: • Volume 2A: for opcodes A through M • Volume 2B: for opcodes N through Z

Example: ‘cmpxchg’ • Operation of the ‘cmpxchg’ instruction is described (on 3 pages) in Volume 2A • There’s an English-sentence description, and also a description in ‘pseudo-code’ • You probably do not want to print out this complete volume (.pdf) – over 700 pages! • (You could order a printed copy from Intel)

Instruction format • Intel’s assembly language syntax differs from the GNU/Linux syntax (known as ‘AT&T syntax’ with roots in UNIX history) • When AT&T syntax is used, the ‘cmpxchg’ instruction has this layout: [lock] cmpxchg reg, reg/mem mnemonic opcode source operand destination operand optional ‘prefix’ (used for SMP)

‘effects’ and ‘affects’ • According to Intel’s manual, the ‘cmpxchg’ instruction also uses two ‘implicit’ operands (i.e., operands not mentioned in the instruction) • The CPU’s accumulator register • The CPU’s EFLAGS register • The accumulator-register (EAX) is both a source-operand and a destination-operand • The six status-bits in the EFLAGS register will get modified, as a ‘side-effect’ this instruction

‘cmpxchg’ description • This instruction compares the accumulator with the destination-operand (so the ZF-bit in EFLAGS gets assigned accordingly) • Then: • If (accumulator == destination) { ZF  1; destination  source; } • If (accumulator != destination) { ZF  0; accumulator  destination; }

An instruction-instance • In our recent disassembly of Linux’s kernel function ‘rtc_cmos_read()’, this ‘cmpxchg’ instruction-instance was used: lock cmpxchg %edx, cmos_lock prefixopcodesource-operanddestination-operand Note: Keep in mind that the accumulator %eax will affect what happens! So we need to consider this instruction within it’s surrounding context

The complete function c0105574 <rtc_cmos_read>: c0105574: 53 push %ebx c0105575: 9c pushf c0105576: 5b pop %ebx c0105577: fa cli c0105578: 64 8b 15 08 20 30 c0 mov %fs:0xc0302008,%edx c010557f: 0f b6 c8 movzbl %al,%ecx c0105582: 42 inc %edx c0105583: c1 e2 08 shl $0x8,%edx c0105586: 09 ca or %ecx,%edx c0105588: a1 3c 99 30 c0 mov 0xc030993c,%eax c010558d: 85 c0 test %eax,%eax c010558f: 75 f7 jne c0105588 <rtc_cmos_read+0x14> c0105591: f0 0f b1 15 3c 99 30 lock cmpxchg %edx,0xc030993c c0105598: c0 c0105599: 85 c0 test %eax,%eax c010559b: 75 eb jne c0105588 <rtc_cmos_read+0x14> c010559d: 88 c8 mov %cl,%al c010559f: e6 70 out %al,$0x70 c01055a1: e6 80 out %al,$0x80 c01055a3: e4 71 in $0x71,%al c01055a5: e6 80 out %al,$0x80 c01055a7: c7 05 3c 99 30 c0 00 movl $0x0,0xc030993c c01055ae: 00 00 00 c01055b1: 53 push %ebx c01055b2: 9d popf c01055b3: 0f b6 c0 movzbl %al,%eax c01055b6: 5b pop %ebx c01055b7: c3 ret

The ‘preparation’ steps • The instructions that preceed ‘cmpxchg’ will setup register EDX (source operand) and register EAX (the x86 ‘accumulator’) • Several instructions are used to set up a value in EDX, and result in this layout: 31 8 7 0 The current processor’s value for ‘per_cpu__cpu_number’ plus 1 CMOS register’s index EDX: this might be zero… …but this part is guaranteed to be non-zero!

The ‘cmos_lock’ variable • This global variable is initialized to zero, meaning that access to CMOS memory locations is not currently ‘locked’ • If some CPU stores a non-zero value in this variable’s memory-location, it means that access to CMOS memory is ‘locked’ • The kernel needs to insure that only one CPU at a time can set this ‘lock’

The ‘most likely’ senario • One of the CPUs wishes to access CMOS memory – so it needs to test ‘cmos_lock’ to be sure that access is now ‘unlocked’ (i.e., cmos_lock == 0 is true) • The CPU copies the ‘cmos_lock’ variable into the EAX, where it can then be tested using the ‘test %eax, %eax’ instruction • A conditional-jump follows the test

The ‘busy-wait’ loop # Here is a ‘busy-wait’ loop, used to wait for the CMOS access to be ‘unlocked’ spin: mov cmos_lock, %eax # copy lock-variable to accumulator test %eax, %eax # was CMOS access ‘unlocked’? jnz spin # if it wasn’t, then check it again # A CPU will fall through to here if ‘unlocked’ access was detected, # and that CPU will now attempt to set the ‘lock’ – in other words, it # will try to assign a non-zero value to the ‘cmos_lock’ variable. # But there’s a potential ‘race’ here – the ‘cmos_lock’ might have been # zero when it was copied, but it could have been changed by now… # … and that’s why we need to execute ‘lock cmpxchg’ at this point

Busy-waiting will be brief spin: # see if the lock-variable is clear mov cmos_lock, %eax test %eax, %eax jnz spin # ok, now we try to grab the lock lock cmpxchg %edx, cmos_lock # did another CPU grab it first? test %eax, %eax jnz spin If our CPU wins the ‘race’, the (non-zero) value from source-operand EDX will have been stored into the (previously zero) ‘cmos_lock’ memory-location, but the (previously zero) accumulator EAX will not have been modified; hence our CPU will not jump back, but will fall through and execute the ‘critical section’ of code (just a few instructions), then will promptly clear the ‘cmos_lock’ variable.

The ‘less likely’ case spin: # see if the lock-variable is clear mov cmos_lock, %eax test %eax, %eax jnz spin # ok, now we try to grab the lock lock cmpxchg %edx, cmos_lock # did another CPU grab it first? test %eax, %eax jnz spin If our CPU loses the ‘race’, because another CPU changed ‘cmos_lock’ to some non-zero value after we had fetched our copy of it, then the (now non-zero) value from the ‘cmos_lock’ destination-operand will have been copied into EAX, and so the final conditional-jump shown above will take our CPU back into the spin-loop, where it will resume busy-waiting until the ‘winner’ of the race clears ‘cmos_lock’.

flowchart Setup nonzero value in EDX start no EAX is zero? EAX  cmos_lock yes EAX equals cmos_lock ? yes no ZF  1 cmos_lock  EDX ZF  0 EAX  cmos_lock EAX is zero? no yes critical section cmos_lock  0 finish

‘btr’/’bts’ versus ‘cmpxchg’ • In an earlier lesson we used the ‘btr’/’bts’ instructions to achieve ‘mutual exclusion’, whereas Linux uses ‘cmpxchg’ to do that • We think ‘btr’/’bts’ is easier to understand, so why do you think the Linux developers would prefer to use ‘cmpxchg’ instead? <allow some class discussion here>

In-class exercise #1 • Was it really necessary to insert a second ‘test %eax, %eax’ following ‘cmpxchg’? • Can you design a simple LKM that would verify your answer to that question?

EFLAGS • The Intel documentation does not state precisely how other EFLAGS status-bits (besides ZF) are affected by ‘cmpxchg’, only that they reflect the comparison of ‘accumulator’ and ‘destination’ operands • Usually the CPU implements comparison- of-operands by performing a subtraction 31 11 10 9 8 7 6 5 4 3 2 1 0 A C V M R F N T IOPL O F D F I F T F S F Z F 0 A F 0 P F 1 C F

In-class exercise #2 • Can you decide what Intel means by “the comparison operation”, by writing suitable code that examines the effect on EFLAGS of ‘cmpxchg opnd1, opnd2’ and these two plausable alternatives: cmp opnd1, opnd2 cmp opnd2, opnd1

Intel’s ‘cmpxchg’ instruction