CS 201 Computer Systems Programming
Chapter 11: x86 Microsoft Assembler
Herbert G. Mayer, PSU
Status 6/28/2015
Introductory Notes
• CS 200 has been eliminated from the PSU CS curriculum, so assembly language programming is de-emphasized
• Some MS assembly language programming will be covered in CS 201, but the focus is limited to reading and understanding .asm source programs, not writing them
• The main assembler used here is the Microsoft macro assembler, commonly known as masm. A version can be downloaded from Microsoft, but it requires Visual C++ 2005 to be installed
• Assembler mnemonics and symbols in masm differ somewhat from the asm source emitted by gcc compilers; the latter resembles SPARC asm, with % identifying registers, and is also the syntax used in your CS 201 textbook
Introductory Notes
• Find a downloadable masm version 8.0 here: http://www.microsoft.com/en-us/download/details.aspx?id=12654
• Or find references to Microsoft's masm here: http://msdn.microsoft.com/en-us/library/afzk3475.aspx
Introductory Notes
• Assembly language programs bridge the gap between low-level binary machine instructions and the higher-level interface with human programmers. The former are required for execution on a digital computer; the latter are convenient tools of expression for programmers
• Assembly language is a low-level, target-machine-specific interface. But the assembler presents a level of abstraction over the raw HW
• Users do not deal with the target in terms of the bits that represent binary machine instructions. The assembler elevates the user to the level of a textual language, up from the level of binary object code
Introductory Notes
• Common to many architectures is the separation of data space, instruction space, and perhaps other areas of program logic
• The x86 architecture embodies so-called data segments, code segments, and stack segments, with several of each if needed. Each segment is identified at run time by its respective segment register
• For example, the code label next: is interpreted by the HW as seg:offset, where seg is the segment register cs, and offset is the distance in bytes of next from the start of the code segment
• If the offset of next is 248h and the value in cs is 2003h, the segment value is shifted left by 4 bits to 20030h, and adding the offset yields the address 20278h
• This is how, in days of old, a 16-bit computer crafted a 20-bit address range
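The seg:offset translation can be checked with a few lines of arithmetic. The sketch below is a Python model, not masm; the function name linear_address is made up for illustration, and wrapping the result to 20 bits assumes an original 8086-style machine without address bits beyond bit 19.

```python
def linear_address(segment: int, offset: int) -> int:
    """Model real-mode address translation: (segment << 4) + offset."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # Assumption: the sum wraps around at 20 bits, as on the original 8086
    return ((segment << 4) + offset) & 0xFFFFF


# cs = 2003h, offset of the label = 248h
print(hex(linear_address(0x2003, 0x248)))
```

Shifting the segment left by 4 bits is the same as multiplying it by 16, which is why every segment starts on a 16-byte boundary.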
Introductory Notes
• This lecture note introduces complete masm assembler source programs
• Starting with the smallest possible complete assembly program, which does nothing but ask DOS for its assisted suicide, we progress to more sophisticated cases
• The second example emits a single character; the next prints a complete string onto the screen, followed by masm conventions that allow us to communicate with the assembler in an abbreviated way
• We also discuss macros, simple procedures, and loops
• The "Definitions" below are in alphabetical order; we cover them in logical order, to minimize forward referencing
Syllabus
• Motivation
• Definitions
• Null Program
• Print Single Character
• Print Character String
• Assembler Abbreviations
• Assembler Macros
• Assembler Procs
• Loops
• Assemble and Link
• References
Motivation for Assembler
• It is almost impossible to communicate with a machine on the binary level
• The assembler offers a significant level of abstraction from the machine bits, plus relocatability, symbolic names and addresses, and some program reuse
• Symbols permit easy definition and reference of data and code objects
• Microsoft's masm even offers high-level constructs, similar to the statements of high-level languages
• Assembler programming allows the highest level of direct control over the target machine
• And it permits achieving the highest performance, for short code sections
Definitions
• x86 Address: identifying attribute of any distinguishable memory unit. On the old x86 architecture a logical address is a pair seg : offset, translated by hardware into a so-called linear address. Segment and offset are 16 bits long each in real mode. The machine address, called a linear address, is 20 bits long, with the rightmost (low-order) 4 bits of a segment address implied to be 0, as a segment must be 16-byte-aligned
• Alignment: Attribute of an address a, requiring that a lie on a specified boundary; for example, the address a must be even, or be evenly divisible by 4 or 512. The first case is also called modulo-2 alignment, the last modulo-512 alignment. Note that aligned addresses have some of their low-order bits set to 0. Hence, if such addresses are stored in hardware, these 0s can be omitted, i.e. they are implied whenever the complete address is needed
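The alignment definition, and the remark that implied zero bits need not be stored, can be illustrated as follows. This is a Python sketch under simple assumptions; the helper names is_aligned, pack_segment_start, and unpack_segment_start are hypothetical and not part of any assembler.

```python
def is_aligned(address: int, boundary: int) -> bool:
    """True if address lies on the given boundary (modulo-boundary alignment)."""
    return address % boundary == 0


def pack_segment_start(linear: int) -> int:
    """Store a 16-byte-aligned 20-bit address in 16 bits by dropping 4 zero bits."""
    if not is_aligned(linear, 16):
        raise ValueError("a segment must start on a 16-byte boundary")
    return linear >> 4


def unpack_segment_start(segment: int) -> int:
    """Recover the full address by re-appending the 4 implied zero bits."""
    return segment << 4


print(is_aligned(0x20030, 16))            # segment start: aligned
print(is_aligned(0x20278, 16))            # arbitrary label: not aligned
print(hex(pack_segment_start(0x20030)))   # value that fits a segment register
```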
Definitions
• Assembler: source-to-object translator, reading relocatable, abstract, machine-specific source programs and translating them into binary object code. After linking, the binary code is executable
• Binary Object: strings of bits which, when interpreted by the target machine, are legal machine operations plus associated memory references. Jointly, these represent executable programs
• Code Segment: Subsection of an architecture's memory which holds executable instructions with possibly embedded, immediate operands; could reside in ROM
• Data Segment: Subsection of an architecture's memory which holds data being referenced or manipulated. Like any segment, a data segment is identified by a segment register, holding its start address. Such an address must be evenly divisible by 16 on x86 family processors
Definitions
• Offset: Distance of a named object (addressable unit) from the beginning of an area encompassing the name
• Paragraph: Range of contiguous memory addresses that is 16 bytes long, and whose first byte address is evenly divisible by 16; a convention on the old x86 architecture
• Relocation: Ability of digital computer information to be placed in any location of memory. For example, referring to data (or object code) by offsets relative to some start address allows the code to be placed anywhere, as long as the respective start address is always added at execution time
• Segment: A subsection of memory. It is identified by a segment register and holds either code, data, or stack space; usually adheres to some alignment constraint; on x86 this is 16-byte alignment
Definitions
• Stack: Data structure holding data that are accessed only in a particular way, named LIFO (last in, first out). The amount of data varies over time. Increases of data are accomplished through an operation called pushing, decreases via popping (on the x86 architecture). A stack segment register points to the beginning of the stack, the base pointer to the end, and the stack pointer to the current and varying top
• Top of Stack: Select element on the stack that is accessible (visible). There may be other elements in the stack, hidden by the top element. Additional elements are created by pushing, and elements are removed by popping; on x86 the top is accessed via the sp register
• On Intel architecture, the stack conventionally grows downward, i.e. towards lower addresses. On other machines it is the reverse
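A downward-growing LIFO stack can be modeled in a few lines. The Python sketch below is only an illustration of push and pop on 16-bit words; the class name Stack16 and the tiny 16-byte size are assumptions chosen for the example, and the little-endian byte order matches x86.

```python
class Stack16:
    """Toy model of an x86-style stack of 16-bit words growing toward lower addresses."""

    def __init__(self, size_bytes: int = 16):
        self.mem = bytearray(size_bytes)
        self.sp = size_bytes          # empty stack: sp sits just past the highest byte

    def push(self, word: int) -> None:
        if self.sp < 2:
            raise OverflowError("stack overflow")
        self.sp -= 2                  # pushing moves sp DOWN by one word
        self.mem[self.sp:self.sp + 2] = (word & 0xFFFF).to_bytes(2, "little")

    def pop(self) -> int:
        if self.sp > len(self.mem) - 2:
            raise IndexError("stack underflow")
        word = int.from_bytes(self.mem[self.sp:self.sp + 2], "little")
        self.sp += 2                  # popping moves sp back UP
        return word


s = Stack16()
s.push(0x1234)
s.push(0xABCD)
print(s.sp)            # two words pushed, so sp dropped by 4 bytes
print(hex(s.pop()))    # last in, first out
```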
Null Program
• Set up the program's segments: code, data, and stack
• In the sample below there is only a Code Segment
• Note the 'code' string, which identifies that segment as a code segment
• Communicate the intended segment portion of seg:offset in the assume pseudo-instruction
• Define the start address (actually an offset) via a label; here the user-defined label is start:
• A label is a user-defined identifier followed by a colon, in the code segment
• Use DOS services: 4ch in the ah register terminates the program when encountered by DOS
• DOS services are requested via INT 21h
• The specific DOS service is defined in register ah; further parameters, as needed, are specified in other registers
• Return code zero means: no errors occurred
• Note the comments, introduced by ; a comment ends at the end of the line
Null Program
; Source file: out1.asm
; Author:      Herb Mayer
; Purpose:     simple, meaningless program, no data seg, no stack
; Assembler:   Microsoft assembler, command 'masm'; 16-bit version
; difference:  in 16-bit mode registers are ax, bx, not: eax, ebx

code_s  segment 'code'
        assume  cs : code_s     ; communicate implied seg register
start:  mov     al, 00          ; termination code for DOS 21: All OK!
        mov     ah, 4ch         ; tell DOS to terminate, 4ch in ah
        int     21h             ; call DOS routine 21h for help
code_s  ends                    ; end of [code] segment
        end     start           ; end's argument defines start; typical MS
                                ; sounds like Microsoft, say start to stop
Print Single Character
• This example also defines a data and a stack segment, though they remain completely unused
• dw 999 reserves (defines) an int word in the data segment, initialized to 999; it is not used, and just shows how to define data
• dw 100 dup( 0 ) defines 100 words, all initially 0; also not used
• DOS routine 21h is called for help: INT 21h
• Specify to DOS, via the value in ah, which type of help is needed
• E.g. value 2 in ah means: output 1 character, the one in dl
• So DOS routine 2 prints the character found in register dl
• Moving 4c00h into register ax is the same as moving 4ch into register ah and 00 into al; ax is a "word" register in Intel parlance
• ah and al are just the two byte registers that, concatenated, form ax; this call terminates the program without error
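That ax is simply ah and al concatenated can be verified with bit operations. The following Python sketch uses hypothetical helper names (split_word, join_bytes); it models the register layout only and executes no DOS call.

```python
def split_word(ax: int) -> tuple:
    """Return (ah, al): the high and low bytes of a 16-bit word register."""
    return (ax >> 8) & 0xFF, ax & 0xFF


def join_bytes(ah: int, al: int) -> int:
    """Concatenate two byte registers into one word register."""
    return ((ah & 0xFF) << 8) | (al & 0xFF)


ah, al = split_word(0x4C00)
print(hex(ah), hex(al))              # DOS service number and return code
print(hex(join_bytes(0x4C, 0x00)))   # same value as mov ax, 4c00h
```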
Print Single Character
; Purpose:   simple program to output one character
; Assembler: Microsoft assembler, command "masm"; 16-bit

data_s  segment                 ; unused data segment
        dw      999             ; define a word, init to 999
data_s  ends

stack_s segment                 ; unused stack segment
        dw      100 dup( 0 )    ; reserve 100 words, init to 0
stack_s ends

code_s  segment 'code'          ; THE Code Segment
        assume  cs:code_s, ds:data_s
start:  mov     ax, seg data_s  ; initialize ds, indirectly
        mov     ds, ax
        mov     dl, '$'         ; char literal to be output by DOS
        mov     ah, 2h          ; DOS call 2h emits char in dl
        int     21h             ; call DOS routine 21h
        mov     ax, 4c00h       ; we wanna terminate, ah + al
        int     21h             ; terminate finally via DOS call
code_s  ends                    ; repeat segment name at ends
        end     start           ; end says: Where to start
Print Character String
• The Data Segment defines a string of bytes, identified by msg and initialized to a string literal enclosed in double quotes "
• Note the $ character at the end of the string literal; it is the end criterion for DOS output routine 9
• The Stack Segment is still a dummy; it holds 10 strings, each of length 16, also unused, just to show a stack segment to students
• DOS routine 9 emits a character string terminated by '$', whose start address it finds in ds:offset, with the offset communicated in register dx
• Note the built-in function offset applied to a data label
• MS masm also provides the built-in seg function to generate the other part of an address
Print Character String
; Purpose: simple program to output character string

data_s  segment
msg     db      "Hello class$"  ; note '$' termination
data_s  ends

stack_s segment                 ; unused
        db      10 dup( "---S t a c k----" )
stack_s ends                    ; repeat the name; optional!

code_s  segment 'code'
        assume  cs:code_s, ds:data_s
start:  mov     ax, seg data_s
        mov     ds, ax
        mov     dx, offset msg  ; string 2 b output by DOS
        mov     ah, 9h          ; DOS call 9h emits string
        int     21h             ; call DOS
        mov     ax, 4c00h       ; we wanna terminate, ah + al
        int     21h             ; terminate finally via DOS
code_s  ends                    ; end code seg
        end     start           ; start execution here: at 'start'
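What DOS routine 9 does with the '$' terminator can be modeled as a scan over the data segment. This Python sketch is an illustration only; the function name dos_print_string is hypothetical, and unlike real DOS, which would keep emitting bytes until it happens to find a '$', the model raises an error when the terminator is missing.

```python
def dos_print_string(data_segment: bytes, dx: int) -> str:
    """Model of DOS service 9: return the characters from offset dx up to, not including, '$'."""
    end = data_segment.index(b"$", dx)   # raises ValueError if no '$' follows dx
    return data_segment[dx:end].decode("ascii")


data_s = b"Hello class$"                 # same bytes as: msg db "Hello class$"
offset_msg = 0                           # msg is the first object in the segment
print(dos_print_string(data_s, offset_msg))
```

Because '$' is the end criterion, this DOS service cannot print a string that contains a '$' character itself.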
Assembler Abbreviations
• The directive .model small allows for default abbreviations and assumptions
• For example data, code, stack, and @data are predefined, as are the assume statements
• Here another string is printed, "Hello". Note again the $ terminator
• The macro @data is predefined by masm, same as seg data
• Note again the offset function
• Note again DOS routine 9, which outputs the string of characters at the address found in register dx
• A program using the .model small abbreviation is assembled in small 16-bit mode
• .code ends the previous segment, if any (here data), and starts the code segment
• .data ends the previous segment, if any, and starts the data segment
Assembler Abbreviations
; Source file: out4.asm
; note:    16-bit assembler
; Purpose: simpler program to output character string

        .model  small           ; assumes stack data code
        .stack  10h             ; assumes name: stack, but unused
        .data                   ; assumes name: data
hi      db      "Hello$"
        .code                   ; assumes name: code
start:  mov     ax, @data       ; @data predefined macro
        mov     ds, ax          ; now data segment reg set
        mov     dx, offset hi   ; string 2 b output by DOS
        mov     ah, 9h          ; DOS call 9h emits string
        int     21h             ; call DOS
        mov     ax, 4c00h       ; we wanna terminate, ah + al
        int     21h             ; terminate finally
        end     start           ; start here, at "start"!
Assembler Macros
• Tired of writing segment and ends? The .model small directive allows defaults and abbreviations
• And macros make the program source more readable and easier to maintain
• A macro can be defined anywhere in the assembler source
• It is introduced by a user-defined name and the macro keyword
• It is terminated by the endm keyword
• Macros may have 0 or more parameters, to be used and expanded in place inside the macro body
• Note: When you disassemble (C++ compiler option) you do not see any macro; all is expanded
Assembler Macros
start   macro                   ; no parameters
        mov     ax, @data       ; @data predefined macro
        mov     ds, ax          ; now data segment reg set
        endm                    ; end of start macro

Put_Str macro   Str             ; one formal parameter, "Str"
        mov     dx, offset Str  ; string 2 b output by DOS
        mov     ah, 9h          ; DOS call 9h emits string
        int     21h             ; call DOS
        endm                    ; end of Put_Str macro

Done    macro   ret_code        ; formal parameter "ret_code"
        mov     ah, 4ch         ; we wanna terminate, ah = 4c
        mov     al, ret_code    ; communicate return code
        int     21h             ; terminate finally via DOS
        endm                    ; end of macro body of Done

        .model  small           ; predefined assumptions
        .data                   ; assumes segment name: data
hi      db      "Hello$"        ; terminate string with $
        .code                   ; assumes segment name: code
main:   start
        Put_Str hi              ; invoke macro Put_Str, w. hi
        Done    0
        end     main            ; start at: main!
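Macro expansion is essentially textual substitution of actual for formal parameters, done before any instruction is assembled. The Python sketch below imitates that substitution for the Put_Str and Done bodies; the function expand_macro is hypothetical, and it is a deliberate simplification (for instance, it matches names case-sensitively and knows nothing of masm's local labels or operators).

```python
import re


def expand_macro(body, formals, actuals):
    """Return the macro body with each formal parameter replaced by its actual."""
    if len(formals) != len(actuals):
        raise ValueError("wrong number of macro arguments")
    binding = dict(zip(formals, actuals))
    if not binding:
        return list(body)             # parameterless macro: body is copied as-is
    # Replace whole identifiers only, so "Str" inside "Put_Str" is left alone
    pattern = re.compile(r"\b(" + "|".join(re.escape(f) for f in formals) + r")\b")
    return [pattern.sub(lambda m: binding[m.group(1)], line) for line in body]


PUT_STR_BODY = ["mov dx, offset Str", "mov ah, 9h", "int 21h"]
DONE_BODY = ["mov ah, 4ch", "mov al, ret_code", "int 21h"]

for line in expand_macro(PUT_STR_BODY, ["Str"], ["hi"]):
    print(line)                       # what "Put_Str hi" turns into
for line in expand_macro(DONE_BODY, ["ret_code"], ["0"]):
    print(line)                       # what "Done 0" turns into
```

This is also why a disassembly shows no macros: only the expanded instructions reach the object code.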
Assembler Procs
• Procs are the SW engineering tool for x86 assembly language programmers to modularize the SW design
• An assembler procedure is delimited by proc and endp
• A procedure can be called; it provides a syntactic grouping mechanism to form logical modules
• Syntax rule for procedures: the name is not followed by ':', as you saw for code labels
• The return instruction ret ends the procedure body and allows return to the place of call
• Reminiscent of a high-level construct
Assembler Procs
; Purpose: modular macro program to output string

start   macro                   ; no parameters
        mov     ax, @data       ; @data predefined macro
        mov     ds, ax          ; now data segment reg set
        endm                    ; end of "start" macro body

Put_Str macro   Str             ; "Str" must be data label
        . . .                   ; other macros as before
        endm                    ; see earlier def of Put_Str macro

        .data                   ; assumes name: data
hi      db      "Hello$"        ; terminate string with $
        .code                   ; assumes name: code
main    proc                    ; begin of procedure body
        start                   ; invoke "start" macro
        Put_Str hi              ; invoke "Put_Str" w. actual
        Done    0               ; invoke "Done" with actual 0
        ret                     ; unnecessary, unreachable
main    endp
        end     main            ; entry point is "main"
Loops
• Repeated execution is generally needed in SW, since the number of steps may vary with data values
• Special operations are provided by x86 HW to reduce loop overhead and speed up execution
• On the x86 architecture with lamentably few registers, ecx takes on the special role of loop counter (cx in 16-bit mode)
• The loop instruction does the following (in pseudo code): loop next is: if ( --ecx != 0 ) goto next;
• The content of ecx is treated as an unsigned 32-bit int
• An initial value of zero is dangerous: the decrement wraps around to the largest unsigned value, and the loop runs that many more times
• The x86 instruction to test for this initial zero value is jcxz (jecxz for the 32-bit ecx)
Loops
; loop is "countable", since we know # of elements
; b4 start of loop; we know already at assembly time

Char_Out = 2h                   ; magic # for DOS: output char in dl
Num_El   = 10h                  ; 16 elements in Chars array[]

        .model  small           ; 16-bit mode
        .data
Chars   db      "0123456789abcdef"
        .code
main:   start
        mov     ah, Char_Out    ; set up ah for DOS call
        mov     bx, 0           ; initial index off 'Chars'
        mov     cx, Num_El      ; we know # iterations a priori; cx due to small
next:   mov     dl, Chars[bx]   ; find next char, move into dl
        inc     bx              ; increment index register
        int     21h             ; print it
        loop    next            ; try next one; could be 0 after --
        ; fall through here . . .
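The behavior of the loop instruction, including the hazard of a zero initial count, can be simulated directly. The sketch below is a Python model of the decrement-and-branch semantics for the 16-bit cx register used in the program above; the function name loop_iterations is made up for the illustration.

```python
def loop_iterations(initial_cx: int, bits: int = 16) -> int:
    """Count how often a loop body ending in 'loop next' executes."""
    mask = (1 << bits) - 1
    cx = initial_cx & mask
    iterations = 0
    while True:
        iterations += 1           # the loop body runs once before the test
        cx = (cx - 1) & mask      # loop: decrement cx, unsigned wrap-around
        if cx == 0:               # fall through when the counter reaches 0
            break
    return iterations


print(loop_iterations(0x10))      # Num_El = 10h
print(loop_iterations(0))         # zero wraps around to the maximum count
```

With 16 in cx the body runs 16 times, one per character; with 0 in cx it runs 65536 times, which is the case a jcxz test placed before the loop guards against.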
Assemble and Link
• Microsoft's old Macro Assembler masm 5.10 to 8.0
• Borland's Macro Assembler tasm
• Microsoft's newer Macro Assembler ml 6.22
• Again: Microsoft masm assembler 8.0 for 32-bit processors here: http://www.microsoft.com/en-us/download/details.aspx?id=12654
• Microsoft masm for x64 here: http://msdn.microsoft.com/en-us/library/hb5z4sxd.aspx
• Microsoft Linker link
• Borland Linker tlink
Assemble and Link
• The old version of the Microsoft macro assembler (up to about 2003, with .NET 2003) is named masm. The newer assembler product from Microsoft is named ml. This section summarizes the masm command
• Users should consult the on-line help by typing masm /h to get more detailed information
• The masm command, version 5.10 and older, has 4 arguments, separated from one another by commas. These arguments are file names
• An argument is considered omitted if no comma (and thus no file name) is given for it. The assembler prompts you for each omitted argument, so it is generally better to provide them, or at least the commas; otherwise the assembler repeatedly asks for file names, and you have to answer or hit carriage return each time
Assemble Command
If commas without file names are given, then default file names are assumed. The four file names, which are the arguments of the masm command, are, left to right:
• the assembly source program, say source.asm
• the object program generated by the assembler, say source.obj
• the listing generated by the assembler, say source.lst
• the cross-reference file, named source.crf
The suffixes obj, lst, and crf are automatically generated by the assembler if no other names are provided. Some complete masm commands for the assembler file src1.asm would be:
masm src1.asm, src1.obj, src1.lst, src1.crf   ; no prompting
masm src1,src1,src1,src1                      ; no prompting
masm src1,src1.obj,src1,src1.crf              ; no prompting
masm src1,,,;                                 ; no prompting
In the above cases, masm will not prompt you, because all file names are either given or can be assumed. It is smart enough to provide the suffixes (like .lst and .obj) from the respective positions
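The naming rules just described can be written down as a small function. The Python sketch below models only what this section states (four comma-separated positions, default base name taken from the source file, suffix implied by position); the function name masm_file_names is hypothetical, and the model is not a substitute for the behavior of any particular masm version.

```python
import os

DEFAULT_SUFFIXES = (".asm", ".obj", ".lst", ".crf")   # implied by argument position


def masm_file_names(arguments: str) -> list:
    """Return [source, object, listing, cross-reference] for a masm argument string."""
    fields = [f.strip() for f in arguments.rstrip("; \t").split(",")]
    if len(fields) != 4 or not fields[0]:
        raise ValueError("masm would prompt: fewer than 4 positions given")
    base = os.path.splitext(fields[0])[0]              # default name comes from the source
    names = []
    for field, suffix in zip(fields, DEFAULT_SUFFIXES):
        if not field:
            names.append(base + suffix)                # comma only: default name
        elif not os.path.splitext(field)[1]:
            names.append(field + suffix)               # name without suffix
        else:
            names.append(field)                        # explicit name and suffix
    return names


print(masm_file_names("src1,,,;"))
print(masm_file_names("src1,src1.obj,src1,src1.crf"))
```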
Link Command
Link also has 4 arguments: 1 input file and 3 output files. The input is the object to be linked. The object may be a concatenation of multiple object files (typically ending in the .obj suffix), strung together by the + operator. For example:
link mem0 + putdec,,,
creates an executable mem0.exe. The file name mem0 is derived from the first part of the first argument; the suffix .exe is assumed. Also, the object file putdec.obj is used as input, to resolve some of the external names used in mem0.obj. The arguments of the link command, i.e. the 4 file names, are:
• the object file or object files, concatenated by +, with default suffix .obj
• the linked executable, with suffix .exe
• the load map file, whose name ends in .map
• the library
Link Command
If the input file is provided without a suffix, then the suffix .obj is assumed. If the executable file is specified without a suffix, then .exe is assumed; any other file name and explicit suffix is allowable too. The file for the load map should be specified; if none is provided, then the file name nul is generated by the linker. And if no suffix is provided, then the .map suffix is assumed. Similarly, for the library a file name must be specified. The suffix is .lib. The commands below do not cause the linker to prompt you for additional file names, because sufficient information can be assumed:
link mem0 + putdec,,,,        ; mem0.exe, no map, no library
link mem0+putdec,foo.bar,,,   ; generate executable foo.bar
link putdec+mem0,mem0.exe,,,  ; mem0.exe
Link Command
Note that the concatenation operator + may be embedded in any number of blanks. The commas may also be surrounded by blanks. The order of specifying the object files is immaterial, provided that the main entry point is unambiguous. The commands below cause the linker to prompt for some additional information:
link mem0 + putdec    ; ask for executable, map, and library
link mem0+putdec,x.y  ; ask for map and lib
link putdec+mem0,,    ; gen putdec.exe, ask for map and lib
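How the name of the executable follows from the link arguments can be sketched in the same spirit. The Python model below covers only the rules stated in these slides for the first two arguments (objects joined by +, default suffixes .obj and .exe, executable named after the first object); the function name link_inputs_and_exe is hypothetical, and map and library handling is left out.

```python
import os


def link_inputs_and_exe(arguments: str):
    """Return (list of object files, executable name) for a link argument string."""
    fields = [f.strip() for f in arguments.rstrip("; \t").split(",")]
    # Blanks around + are immaterial; a missing suffix defaults to .obj
    objects = [o.strip() for o in fields[0].split("+")]
    objects = [o if os.path.splitext(o)[1] else o + ".obj" for o in objects]
    if len(fields) > 1 and fields[1]:
        exe = fields[1]                                  # explicit executable name
    else:
        exe = os.path.splitext(objects[0])[0]            # derived from the first object
    if not os.path.splitext(exe)[1]:
        exe += ".exe"                                    # default suffix
    return objects, exe


print(link_inputs_and_exe("mem0 + putdec,,,"))
print(link_inputs_and_exe("mem0+putdec,foo.bar,,,"))
print(link_inputs_and_exe("putdec+mem0,,"))
```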
Main Entry Point
• Each assembly unit concludes with an end directive (AKA end statement). This end statement may have an argument, identifying one of the labels or proc names of the program. Such a label specifies the entry point, i.e. the initial value of eip, set by the loader. eip is the 32-bit instruction pointer (ip in 16-bit mode)
• However, if an executable is composed of multiple objects, there can and will be only a single entry point. All other source modules should not specify an argument after their end statement
• If, however, two or more of the object modules to be linked into an executable do have an entry point specified, masm does not complain. Instead, the entry point of the first object listed in the first argument of the link command is taken. And if this is not the intended entry point, program execution will bring surprises
References
• Free masm download: http://cvrce.blog.com/2009/08/28/masm-v611-free-download/
• http://www.emsps.com/oldtools/msasmv.htm
• ML 64-bit: http://msdn.microsoft.com/en-us/library/s0ksfwcf(v=vs.80).aspx