IA32 programming for Linux

IA32 programming for Linux Concepts and requirements for writing Linux assembly language programs for Pentium CPUs

A source program’s format • Source-file: a pure ASCII-character textfile • It’s created using a text-editor (such as ‘vi’) • You cannot use a ‘word processor’ (why?) • Program consists of series of ‘statements’ • Every program-statement fits on one line • Program-statements all have same layout • Designed in 1950s for IBM’s punch-cards

Statement Layout (1950s) • Each ‘statement’ was comprised of four ‘fields’ • Fields appear in a prescribed left-to-right order • These four fields were named (in order): -- the ‘label’ field -- the ‘opcode’ field -- the ‘operand’ field -- the ‘comment’ field • In many cases some fields could be left blank • Extreme case (very useful): whole line is blank!

Statement layout • Parsing an assembly language statement • A colon character (‘:’) terminates the label • A white-space character (e.g., blank or tab) separates the opcode from its operand(s) • A hash character (‘#’) begins the comment Label: opcode operand(s) # comment

The ‘as’ program • The ‘assembler’ is a computer program • It accepts a specified text-file as its input • It must be able to ‘parse’ each statement • It can produce onscreen ‘error messages’ • It can generate an ELF-format output file • (That file is known as an ‘object module’) • It can also generate a ‘listing file’ (optional)

The ‘label’ field • A label is a ‘symbol’ followed by a colon (‘:’) • The programmer invents his/her own ‘symbols’ • Symbols can use letters and digits, plus a very small number of ‘special’ characters ( ‘.’, ‘_’, ‘$’ ) • A ‘symbol’ is allowed to be of arbitrarily length • The Linux assembler (‘as’) was designed for translating source-text produced by a high-level language compiler (such as ‘cc’) • But humans can also write such files directly

The ‘opcode’ field • Opcodes are predefined symbols that are recognized by the GNU assembler • There are two categories of ‘opcodes’ (called ‘instructions’ and ‘directives’) • ‘Instructions’ represent operations that the CPU is able to perform (e.g., ‘add’, ‘inc’) • ‘Directives’ are commands that guide the work of the assembler (e.g., ‘.global’, ‘.int’)

Instructions vs Directives • Each ‘instruction’ gets translated by ‘as’ into a machine-language statement that will be fetched and executed by the CPU when the program runs (i.e., at ‘run time’) • Each ‘directive’ modifies the behavior of the assembler (i.e., at ‘assembly time’) • With GNU assembly language, these are easy to distinguish: directives begin with ‘.’

A list of the Pentium opcodes • An ‘official’ list of the instruction codes can be found in Intel’s programmer manuals: http://developer.intel.com • But it’s three volumes, nearly 1000 pages (it describes ‘everything’ about Pentiums) • An ‘unofficial’ list of (most) Intel instruction codes can fit on one sheet, front and back: http://www.jegerlehner/intel/

The AT&T syntax • The GNU assembler uses AT&T syntax (instead of official Intel/Microsoft syntax) so the opcode names differ slightly from names that you will see on ‘official’ lists: Intel-syntax AT&T-syntax --------------- ---------------------- ADD  addb/addw/addl INC  incb/incw/incl CMP  cmpb/cmpw/cmpl

The UNIX culture • Linux is intended to be a version of UNIX (so that UNIX-trained users already know Linux) • UNIX was developed at AT&T (in early 1970s) and AT&T’s computers were built by DEC, thus UNIX users learned DEC’s assembley language • Intel was early ally of DEC’s competitor, IBM, which deliberately used ‘incompatible’ designs • Also: an ‘East Coast’ versus ‘West Coast’ thing (California, versus New York and New Jersey)

Bytes, Words, Longwords • CPU Instructions usually operate on data-items • Only certain sizes of data are supported: BYTE: one byte consists of 8 bits WORD: consists of two bytes (16 bits) LONGWORD: uses four bytes (32 bits) • With AT&T’s syntax, an instruction’s name also incorporates its effective data-size (as a suffix) • With Intel syntax, data-size usually isn’t explicit, but is inferred by context (e.g., from operands)

The ‘operand’ field • Operands can be of several types: -- a CPU register may hold the datum -- a memory location may hold the datum -- an instruction can have ‘built-in’ data -- frequently there are multiple data-items -- and sometimes there are no data-items • An instruction’s operands usually are ‘explicit’, but in a few cases they also could be ‘implicit’

Examples of operands • Some instruction will have two operands: movl %ebx, %ecx addl $4, %esp • Some instructions will have one operand: incl %eax pushl $fmt • An instruction that lacks explicit operands: ret

The ‘comment’ field • An assembly language program often can be hard for a human being to understand • Even a program’s author may not be able to recall his programming idea after awhile • So programmer ‘comments’ can be vital! • A comment begin with the ‘#’ character • The assembler disregards all comments (but they will appear in program listings)

‘Directives’ • Sometimes called ‘pseudo-instructions’ • They tell the assembler what to do • The assembler will recognize them • Their names begin with a dot (‘.’) • Examples: ‘.section’, ‘.int,’ ‘.string’, … • The names of valid directives appears in the table-of-contents of the GNU manual

New program example • Let’s look at a demo program (‘squares.s’) • It will display a mathematical table showing some numbers and their squares • But it doesn’t use any multiplications! • It uses an algorithm based on algebra: (n+1)2 - n2 = n + n + 1 If you already know the square of a given number n , you can get the square of the next number n+1 by just doing additions

Visualizing the algorithm idea n n (n + 1)2 = n2 + 2n + 1

Our program uses a ‘loop’ • Here’s our program idea (expressed in C++) int num = 0, val = 0; do { val += num + num + 1; num += 1; printf( “ %d %d \n”, num, val ); } while ( num != 20 );

Some program ‘directives’ • ‘.equ’ – equates a symbol to a value: .equ MAX, 20 • ‘.string’ – just an alternative for ‘.asciz’: body: .string “ %d %d \n” • ‘.space’ – reserves uninitialized memory num: .space 4 # to hold an ‘int’

Some new ‘instructions’ • ‘inc’ – adds one to the specified operand: incl arg • ‘cmp’ – compares two specified operands: cmpl $MAX, arg • ‘je’ – jumps (to a specified instruction) if the condition ‘equal to’ is currently true: je again

Comparisons can be ‘tricky’ • It’s easy to get confused by AT&T syntax: mov $5, %eax while: add $-1, %eax cmp $5, %eax jge while (e.g., will this loop ever finish executing?) • REMEMBER: ‘compare’ means ‘subtract’

The FLAGS register O F D F I F T F S F Z F 0 A F 0 P F 1 C F Legend: ZF = Zero Flag SF = Sign Flag CF = Carry Flag PF = Parity Flag OF = Overflow Flag AF = Auxiliary Flag

In-class exercise #1 • How would you modify the source-code for the ‘squares’ program so that it prints out a larger table (i.e., more than 20 lines)? • Question: How many lines can you display on the screen before your program starts to show ‘wrong’ entries?

In-class exercise #2 • Can you write an enhanced version of our ‘squares.s’ program that would also show the cube for each number (but using only the ‘add’ and ‘inc’ arithmetic operations)? • HINT: Remember these algebra formulas: (n+1)3 - n3 = 3n2 + 3n + 1 3n = n + n + n

IA32 programming for Linux