
Understanding Intel's cmpxchg Instruction & Linux Kernel's cmos_lock Mechanism

This article reviews the operation and usage of Intel's cmpxchg instruction and explains how the Linux kernel's cmos_lock mechanism works. It examines the i386 and x86 registers and provides examples from the rtc_cmos_read() kernel function.

fcole

Presentation Transcript


  1. Intel’s ‘cmpxchg’ instruction: How does the Linux kernel’s ‘cmos_lock’ mechanism work?

  2. Review of the i386 registers
     • Segment registers (16 bits): CS, DS, ES, FS, GS, SS
     • General registers (32 bits): EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP
     • Program control and status registers (32 bits): EIP, EFLAGS

  3. The x86 ‘system’ registers
     • Control registers (32 bits): CR0, CR1, CR2, CR3, CR4, CR5, CR6, CR7
     • Debug registers (32 bits): DR0, DR1, DR2, DR3, DR4, DR5, DR6, DR7
     • LDTR and TR (16 bits)
     • GDTR and IDTR (48 bits)
     • In the slide’s figure, a marking on some of these registers means ‘unimplemented’

  4. How often is ‘cmpxchg’ used?

     $ cat vmlinux.asm | grep cmpxchg
     c01046de: f0 0f b1 15 3c 99 30    lock cmpxchg %edx,0xc030993c
     c0105591: f0 0f b1 15 3c 99 30    lock cmpxchg %edx,0xc030993c    <-- the occurrence we studied in ‘rtc_cmos_read()’
     c01055d9: f0 0f b1 15 3c 99 30    lock cmpxchg %edx,0xc030993c
     c010b895: f0 0f b1 11             lock cmpxchg %edx,(%ecx)
     c010b949: f0 0f b1 0b             lock cmpxchg %ecx,(%ebx)
     c0129a9f: f0 0f b1 0b             lock cmpxchg %ecx,(%ebx)
     c0129acf: f0 0f b1 0b             lock cmpxchg %ecx,(%ebx)
     c012d377: f0 0f b1 0e             lock cmpxchg %ecx,(%esi)
     c012d41a: f0 0f b1 0e             lock cmpxchg %ecx,(%esi)
     c012d968: f0 0f b1 16             lock cmpxchg %edx,(%esi)
     c012e568: f0 0f b1 2e             lock cmpxchg %ebp,(%esi)
     c012e57a: f0 0f b1 2e             lock cmpxchg %ebp,(%esi)
     c012e58a: f0 0f b1 2e             lock cmpxchg %ebp,(%esi)
     c012e83f: f0 0f b1 13             lock cmpxchg %edx,(%ebx)
     c012e931: f0 0f b1 0a             lock cmpxchg %ecx,(%edx)
     c012ea94: f0 0f b1 11             lock cmpxchg %edx,(%ecx)
     c012ecf4: f0 0f b1 13             lock cmpxchg %edx,(%ebx)
     c012f08e: f0 0f b1 4b 18          lock cmpxchg %ecx,0x18(%ebx)
     c012f163: f0 0f b1 11             lock cmpxchg %edx,(%ecx)
     c013cb60: f0 0f b1 0e             lock cmpxchg %ecx,(%esi)
     c0148b3c: f0 0f b1 29             lock cmpxchg %ebp,(%ecx)
     c0150d0f: f0 0f b1 3b             lock cmpxchg %edi,(%ebx)
     c0150d87: f0 0f b1 31             lock cmpxchg %esi,(%ecx)
     c0199c5e: f0 0f b1 0b             lock cmpxchg %ecx,(%ebx)
     c024b06f: f0 0f b1 0b             lock cmpxchg %ecx,(%ebx)
     c024b2fe: f0 0f b1 51 18          lock cmpxchg %edx,0x18(%ecx)
     c024b321: f0 0f b1 51 18          lock cmpxchg %edx,0x18(%ecx)
     c024b34b: f0 0f b1 4b 18          lock cmpxchg %ecx,0x18(%ebx)
     c024b960: f0 0f b1 53 18          lock cmpxchg %edx,0x18(%ebx)

     The marked line is the occurrence that we studied in the ‘rtc_cmos_read()’ kernel-function… …plus 28 other times!

  5. Intel’s documentation
     • You can find out what any Intel x86 instruction does by consulting the official software developer’s manual, online at: http://www.intel.com/products/processor/manuals/index.htm
     • Our course webpage has a link to this site that you can just click (under ‘Resources’)
     • The instruction-set reference comes in two parts:
       • Volume 2A: for opcodes A through M
       • Volume 2B: for opcodes N through Z

  6. Example: ‘cmpxchg’
     • Operation of the ‘cmpxchg’ instruction is described (on 3 pages) in Volume 2A
     • There’s an English-sentence description, and also a description in ‘pseudo-code’
     • You probably do not want to print out this complete volume (.pdf) – over 700 pages!
     • (You could order a printed copy from Intel)

  7. Instruction format
     • Intel’s assembly language syntax differs from the GNU/Linux syntax (known as ‘AT&T syntax’ with roots in UNIX history)
     • When AT&T syntax is used, the ‘cmpxchg’ instruction has this layout:

       [lock] cmpxchg reg, reg/mem

       • [lock]: optional ‘prefix’ (used for SMP)
       • cmpxchg: mnemonic opcode
       • reg: source operand
       • reg/mem: destination operand

  8. ‘effects’ and ‘affects’
     • According to Intel’s manual, the ‘cmpxchg’ instruction also uses two ‘implicit’ operands (i.e., operands not mentioned in the instruction):
       • The CPU’s accumulator register
       • The CPU’s EFLAGS register
     • The accumulator-register (EAX) is both a source-operand and a destination-operand
     • The six status-bits in the EFLAGS register will get modified, as a ‘side-effect’ of this instruction

  9. ‘cmpxchg’ description
     • This instruction compares the accumulator with the destination-operand (so the ZF-bit in EFLAGS gets assigned accordingly)
     • Then:
       • If (accumulator == destination) { ZF ← 1; destination ← source; }
       • If (accumulator != destination) { ZF ← 0; accumulator ← destination; }
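The two cases above can be sketched as an ordinary C function. This is only a model for reasoning about the instruction, not an atomic operation and not anything taken from the kernel; the names `cmpxchg_state` and `model_cmpxchg` are made up for this illustration, and only ZF is modeled (the other five status-bits are ignored).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the implicit operands that 'cmpxchg' uses. */
struct cmpxchg_state {
    uint32_t accumulator;  /* EAX: both a source and a destination */
    bool     zf;           /* the ZF bit of EFLAGS */
};

/* Model of 'cmpxchg source, destination' (AT&T operand order).
 * Plain, non-atomic C: it mirrors the slide's two cases only. */
void model_cmpxchg(struct cmpxchg_state *cpu,
                   uint32_t source, uint32_t *destination)
{
    if (cpu->accumulator == *destination) {
        cpu->zf = true;                   /* ZF <- 1 */
        *destination = source;            /* destination <- source */
    } else {
        cpu->zf = false;                  /* ZF <- 0 */
        cpu->accumulator = *destination;  /* accumulator <- destination */
    }
}
```

Calling it with a zero accumulator and a zero destination stores the source and leaves the accumulator alone; calling it again with a zero accumulator against the now non-zero destination leaves the destination alone and copies its value into the accumulator.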

  10. An instruction-instance
      • In our recent disassembly of Linux’s kernel function ‘rtc_cmos_read()’, this ‘cmpxchg’ instruction-instance was used:

        lock cmpxchg %edx, cmos_lock

        • lock: prefix
        • cmpxchg: opcode
        • %edx: source-operand
        • cmos_lock: destination-operand
      • Note: Keep in mind that the accumulator %eax will affect what happens! So we need to consider this instruction within its surrounding context

  11. The complete function

      c0105574 <rtc_cmos_read>:
      c0105574: 53                      push %ebx
      c0105575: 9c                      pushf
      c0105576: 5b                      pop %ebx
      c0105577: fa                      cli
      c0105578: 64 8b 15 08 20 30 c0    mov %fs:0xc0302008,%edx
      c010557f: 0f b6 c8                movzbl %al,%ecx
      c0105582: 42                      inc %edx
      c0105583: c1 e2 08                shl $0x8,%edx
      c0105586: 09 ca                   or %ecx,%edx
      c0105588: a1 3c 99 30 c0          mov 0xc030993c,%eax
      c010558d: 85 c0                   test %eax,%eax
      c010558f: 75 f7                   jne c0105588 <rtc_cmos_read+0x14>
      c0105591: f0 0f b1 15 3c 99 30    lock cmpxchg %edx,0xc030993c
      c0105598: c0
      c0105599: 85 c0                   test %eax,%eax
      c010559b: 75 eb                   jne c0105588 <rtc_cmos_read+0x14>
      c010559d: 88 c8                   mov %cl,%al
      c010559f: e6 70                   out %al,$0x70
      c01055a1: e6 80                   out %al,$0x80
      c01055a3: e4 71                   in $0x71,%al
      c01055a5: e6 80                   out %al,$0x80
      c01055a7: c7 05 3c 99 30 c0 00    movl $0x0,0xc030993c
      c01055ae: 00 00 00
      c01055b1: 53                      push %ebx
      c01055b2: 9d                      popf
      c01055b3: 0f b6 c0                movzbl %al,%eax
      c01055b6: 5b                      pop %ebx
      c01055b7: c3                      ret

  12. The ‘preparation’ steps
      • The instructions that precede ‘cmpxchg’ set up register EDX (the source operand) and register EAX (the x86 ‘accumulator’)
      • Several instructions are used to set up a value in EDX, and result in this layout:
        • Bits 31..8: the current processor’s value for ‘per_cpu__cpu_number’ plus 1 (this part is guaranteed to be non-zero!)
        • Bits 7..0: the CMOS register’s index (this might be zero…)
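The arithmetic behind that EDX layout can be sketched in C, following the inc, shl and or instructions of the disassembly step by step. The function name `make_lock_value` and its parameters are assumptions made for this illustration; the kernel computes the value inline in registers.

```c
#include <stdint.h>

/* Sketch of how the value stored into 'cmos_lock' is built.
 * cpu_number: the current processor's number (hypothetical parameter,
 *             standing in for 'per_cpu__cpu_number').
 * cmos_index: the CMOS register index that arrives in AL. */
uint32_t make_lock_value(uint32_t cpu_number, uint8_t cmos_index)
{
    uint32_t edx = cpu_number;  /* mov  %fs:...,%edx */
    edx += 1;                   /* inc  %edx          */
    edx <<= 8;                  /* shl  $0x8,%edx     */
    edx |= cmos_index;          /* or   %ecx,%edx (ECX = zero-extended AL) */
    return edx;
}
```

For processor 0 reading CMOS register 0 the result is 0x100, so the value is non-zero even when both inputs are zero, which is what lets zero serve as the ‘unlocked’ marker.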

  13. The ‘cmos_lock’ variable
      • This global variable is initialized to zero, meaning that access to CMOS memory locations is not currently ‘locked’
      • If some CPU stores a non-zero value in this variable’s memory-location, it means that access to CMOS memory is ‘locked’
      • The kernel needs to ensure that only one CPU at a time can set this ‘lock’

  14. The ‘most likely’ scenario
      • One of the CPUs wishes to access CMOS memory – so it needs to test ‘cmos_lock’ to be sure that access is now ‘unlocked’ (i.e., cmos_lock == 0 is true)
      • The CPU copies the ‘cmos_lock’ variable into EAX, where it can then be tested using the ‘test %eax, %eax’ instruction
      • A conditional-jump follows the test

  15. The ‘busy-wait’ loop

      # Here is a ‘busy-wait’ loop, used to wait for the CMOS access to be ‘unlocked’
      spin:   mov     cmos_lock, %eax     # copy lock-variable to accumulator
              test    %eax, %eax          # was CMOS access ‘unlocked’?
              jnz     spin                # if it wasn’t, then check it again
      # A CPU will fall through to here if ‘unlocked’ access was detected,
      # and that CPU will now attempt to set the ‘lock’ – in other words, it
      # will try to assign a non-zero value to the ‘cmos_lock’ variable.
      # But there’s a potential ‘race’ here – the ‘cmos_lock’ might have been
      # zero when it was copied, but it could have been changed by now…
      # … and that’s why we need to execute ‘lock cmpxchg’ at this point

  16. Busy-waiting will be brief

      spin:   # see if the lock-variable is clear
              mov     cmos_lock, %eax
              test    %eax, %eax
              jnz     spin
              # ok, now we try to grab the lock
              lock cmpxchg %edx, cmos_lock
              # did another CPU grab it first?
              test    %eax, %eax
              jnz     spin

      If our CPU wins the ‘race’, the (non-zero) value from source-operand EDX will have been stored into the (previously zero) ‘cmos_lock’ memory-location, but the (previously zero) accumulator EAX will not have been modified; hence our CPU will not jump back, but will fall through and execute the ‘critical section’ of code (just a few instructions), then will promptly clear the ‘cmos_lock’ variable.

  17. The ‘less likely’ case

      spin:   # see if the lock-variable is clear
              mov     cmos_lock, %eax
              test    %eax, %eax
              jnz     spin
              # ok, now we try to grab the lock
              lock cmpxchg %edx, cmos_lock
              # did another CPU grab it first?
              test    %eax, %eax
              jnz     spin

      If our CPU loses the ‘race’, because another CPU changed ‘cmos_lock’ to some non-zero value after we had fetched our copy of it, then the (now non-zero) value from the ‘cmos_lock’ destination-operand will have been copied into EAX, and so the final conditional-jump shown above will take our CPU back into the spin-loop, where it will resume busy-waiting until the ‘winner’ of the race clears ‘cmos_lock’.
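The same test-then-cmpxchg pattern can be sketched in portable C11, where `atomic_compare_exchange_strong` plays the role of ‘lock cmpxchg’ and a local variable plays the role of EAX. This is a user-space illustration under stated assumptions, not the kernel’s source: the names `cmos_lock_sketch`, `cmos_try_lock`, `cmos_lock_acquire` and `cmos_lock_release` are hypothetical, and the ‘cli’/‘popf’ interrupt handling seen in the disassembly is not modeled.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the kernel's lock variable; zero means 'unlocked'. */
_Atomic uint32_t cmos_lock_sketch = 0;

/* One attempt to take the lock. Returns what 'cmpxchg' would leave in
 * EAX: zero if we won the race, the owner's non-zero value if we lost.
 * new_value must be non-zero. */
uint32_t cmos_try_lock(uint32_t new_value)
{
    uint32_t expected = 0;  /* plays the role of the accumulator */
    /* On failure the current lock value is copied into 'expected'. */
    atomic_compare_exchange_strong(&cmos_lock_sketch, &expected, new_value);
    return expected;
}

/* Spin until the lock looks free, then try to grab it; retry on a loss. */
void cmos_lock_acquire(uint32_t new_value)
{
    for (;;) {
        while (atomic_load(&cmos_lock_sketch) != 0)
            ;  /* busy-wait: mov / test / jnz spin */
        if (cmos_try_lock(new_value) == 0)
            return;  /* accumulator still zero: we own the lock */
    }
}

/* Clear the lock, like 'movl $0x0, cmos_lock'. */
void cmos_lock_release(void)
{
    atomic_store(&cmos_lock_sketch, 0);
}
```

A caller would bracket its critical section with `cmos_lock_acquire(value)` and `cmos_lock_release()`. The plain load in the inner loop keeps waiting CPUs from issuing locked bus operations while the lock is held, which matches the structure of the loop in the slides.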

  18. flowchart
      • start
      • Setup nonzero value in EDX
      • (A) EAX ← cmos_lock
      • EAX is zero? no: go back to (A); yes: continue
      • EAX equals cmos_lock?
        • yes: ZF ← 1; cmos_lock ← EDX
        • no: ZF ← 0; EAX ← cmos_lock
      • EAX is zero? no: go back to (A); yes: continue
      • critical section
      • cmos_lock ← 0
      • finish

  19. ‘btr’/‘bts’ versus ‘cmpxchg’
      • In an earlier lesson we used the ‘btr’/‘bts’ instructions to achieve ‘mutual exclusion’, whereas Linux uses ‘cmpxchg’ to do that
      • We think ‘btr’/‘bts’ is easier to understand, so why do you think the Linux developers would prefer to use ‘cmpxchg’ instead?
      <allow some class discussion here>

  20. In-class exercise #1
      • Was it really necessary to insert a second ‘test %eax, %eax’ following ‘cmpxchg’?
      • Can you design a simple LKM that would verify your answer to that question?

  21. EFLAGS
      • The Intel documentation does not state precisely how other EFLAGS status-bits (besides ZF) are affected by ‘cmpxchg’, only that they reflect the comparison of ‘accumulator’ and ‘destination’ operands
      • Usually the CPU implements comparison-of-operands by performing a subtraction

      Bit:   11  10   9   8   7   6   5   4   3   2   1   0
      Flag:  OF  DF  IF  TF  SF  ZF   0  AF   0  PF   1  CF

      Upper bits (up to bit 31, unnumbered in the figure), left to right: AC, VM, RF, NT, IOPL
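When examining a saved EFLAGS value, the six status-bits can be picked out with masks built from the bit positions in the layout above. The following is a small sketch; the constant and function names are invented for this illustration.

```c
#include <stdint.h>

/* Bit masks for the six status flags, from the EFLAGS layout. */
enum {
    EFLAGS_CF = 1u << 0,   /* carry            */
    EFLAGS_PF = 1u << 2,   /* parity           */
    EFLAGS_AF = 1u << 4,   /* auxiliary carry  */
    EFLAGS_ZF = 1u << 6,   /* zero             */
    EFLAGS_SF = 1u << 7,   /* sign             */
    EFLAGS_OF = 1u << 11   /* overflow         */
};

/* Keep only the six status-bits of a saved EFLAGS image. */
uint32_t eflags_status_bits(uint32_t eflags)
{
    return eflags & (EFLAGS_CF | EFLAGS_PF | EFLAGS_AF |
                     EFLAGS_ZF | EFLAGS_SF | EFLAGS_OF);
}

/* Return 1 if ZF is set in the saved EFLAGS image, else 0. */
int eflags_zf(uint32_t eflags)
{
    return (eflags & EFLAGS_ZF) != 0;
}
```

For instance, the image 0x246 has IF, ZF, PF and the always-one bit 1 set, so `eflags_status_bits(0x246)` leaves only ZF and PF. Comparing two such masked images is one way to see whether two instructions affected the status-bits identically.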

  22. In-class exercise #2
      • Can you decide what Intel means by “the comparison operation”, by writing suitable code that examines the effect on EFLAGS of ‘cmpxchg opnd1, opnd2’ and these two plausible alternatives:

        cmp opnd1, opnd2
        cmp opnd2, opnd1
