Assembly Code Optimization Techniques for the AMD64 Athlon and Opteron Architectures
David Phillips, Robert Duckles
CSE 520, Spring 2007, Term Project Presentation
Background
• In order to generate efficient code, modern compilers must consider the architecture for which they generate code.
• Therefore, the engineer who implements the compiler must be very familiar with that architecture.
• There is rarely only one way to write code for a given problem; however, some implementations can take better advantage of the architecture's features than others and achieve higher performance.
• Naturally, we always want the most efficient (fastest) solution.
Methodology
• For our project, we performed research on the architecture of the AMD64 Athlon and Opteron processors.
• While gathering information on the AMD64 architecture, we selected a subset of relevant optimization techniques that should, in theory, yield better performance than similar alternative approaches.
• Using the Microsoft Macro Assembler (MASM), we implemented a series of 15 small sample programs, where each program isolates a single optimization technique (or the lack thereof).
Methodology II
• After assembling the test programs, we instrumented and profiled each one on a machine with a single-core AMD64 Athlon processor.
• We used AMD's downloadable CodeAnalyst suite to profile each program's behavior and collect results such as clock events, cache misses, dispatch stalls, and time to completion.
• The goal was to determine which optimization techniques yielded the best performance gain and to validate our assumptions about the system architecture.
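The slides do not reproduce the 15 test programs themselves. For readers unfamiliar with MASM, the following is a minimal sketch of the kind of skeleton such a test program might use; the include paths, array size, and iteration count are assumptions made for illustration, not details taken from the actual project.

    ; Sketch only: a minimal MASM32-style skeleton for one test program.
    .586
    .model flat, stdcall
    option casemap:none
    include \masm32\include\windows.inc
    include \masm32\include\kernel32.inc
    includelib \masm32\lib\kernel32.lib

    .data
    intarray DWORD 1000 DUP (1)          ; data touched by the measured loop

    .code
    main PROC
        mov  ecx, 1000000                ; repeat the measured sequence many times
    OuterLoop:
        ; ...sequence under test goes here (e.g. the load-execute form)...
        dec  ecx
        jnz  OuterLoop
        invoke ExitProcess, 0            ; terminate so the profiler can attribute samples
    main ENDP
    END main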
Optimization Classes Examined
• Decode
• Memory Access
• Arithmetic
• Integer Scheduling
• Loop Instruction Overhead
• Loop Unrolling
• Function Inline
Decode Optimization
- Decoding IA32 instructions is complicated. Using a single complex instruction rather than multiple simpler instructions reduces the number of instructions that must be decoded into micro-operations.
Example:
    add edx, DWORD PTR [eax]
instead of
    mov ebx, DWORD PTR [eax]
    add edx, ebx
This optimization reduces:
- decoder usage.
- register pressure (allocated registers in the register file).
- data-dependent instructions in the pipeline (stalls).
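To show the load-execute form in context, here is a minimal sketch of a loop that sums an array using the single complex instruction. The procedure name, array, and loop structure are illustrative and are not taken from the actual LoadExecuteWithOp program; the sketch assumes a module skeleton like the one shown under Methodology II.

    ; Sketch only: sums intarray using the load-execute form of ADD.
    SumArray PROC
        xor  edx, edx                 ; running sum
        mov  eax, OFFSET intarray     ; pointer to the current element
        mov  ecx, LENGTHOF intarray   ; element count
    SumLoop:
        add  edx, DWORD PTR [eax]     ; load-execute: one instruction instead of a MOV/ADD pair
        add  eax, TYPE intarray       ; advance to the next DWORD
        dec  ecx
        jnz  SumLoop
        mov  eax, edx                 ; return the sum in EAX
        ret
    SumArray ENDP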
Memory Access Optimization
- The L1 data cache is implemented as 8 separate banks, and each line within a bank is 8 bytes wide. If two consecutive load instructions read from different lines that map to the same bank, the second load has to wait for the first to finish. If no bank conflict occurs, both loads can complete during the same cycle.
Example:
    mov  ebx, DWORD PTR [edi]
    mov  edx, DWORD PTR [edi + TYPE intarray]
    imul ebx, ebx
    imul edx, edx
instead of
    mov  ebx, DWORD PTR [edi]
    imul ebx, ebx
    mov  edx, DWORD PTR [edi + TYPE intarray]
    imul edx, edx
Assuming that each array value is 4 bytes in size, a bank conflict cannot occur since both loads read from the same cache line.
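A worked example of the bank arithmetic may make the claim above concrete. It assumes that the bank index is (address / 8) mod 8, which follows from 8 banks of 8-byte lines but is our reading, not a figure stated on the slide.

    ; Illustrative only: suppose EDI = 1000h and TYPE intarray = 4 (a DWORD array).
    ;   First load:  [1000h]     -> 8-byte line 1000h-1007h, bank (1000h / 8) mod 8 = 0
    ;   Second load: [1000h + 4] -> the same line, bank 0
    ; Same line, so the two loads cannot conflict and can safely be paired back to back.
    ; A conflict would need the same bank but different lines, e.g. [1000h] and [1040h]
    ; (64 bytes apart): both map to bank 0, so the second load would have to wait.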
Arithmetic Optimization
- 16-bit integer division is faster than 32-bit integer division.
Example:
    mov  dx, 0
    mov  ax, 65535
    mov  bx, 5
    idiv bx
instead of
    mov  edx, 0
    mov  eax, 65535
    mov  ebx, 5
    idiv ebx
This optimization reduces:
- Time to perform the division.
Integer Scheduling Optimization
- Pushing data onto the stack directly from memory is faster than loading the value into a register first and then pushing the register onto the stack.
Example:
    push DWORD PTR [edi]
instead of
    mov  ebx, DWORD PTR [edi]
    push ebx
This optimization reduces:
- register pressure (allocated registers in the register file).
- data-dependent instructions in the pipeline (stalls).
Integer Scheduling Optimization
- Two writes to different portions of the same register are slower than writes to two different registers.
Example:
    mov bl, 0012h   ; Load the constant into the low byte of EBX.
    mov ah, 0000h   ; Load the constant into AH. No false dependency on the
                    ; completion of the previous instruction, since BL and AH
                    ; belong to different registers.
instead of
    mov bl, 0012h   ; Load the constant into the low byte of EBX.
    mov bh, 0000h   ; Load the constant into BH. This instruction has a false
                    ; dependency on the completion of the previous instruction,
                    ; since BL and BH share EBX.
This optimization reduces:
- Dependent instructions in the pipeline.
Loop Instruction Overhead Optimization
- The LOOP instruction has an 8-cycle latency. It is faster to use other instructions, such as a decrement followed by a conditional jump.
Example:
    mov ecx, LENGTHOF intarray
L1:
    ; ...loop body...
    dec ecx
    jnz L1
instead of
    mov ecx, LENGTHOF intarray
L1:
    ; ...loop body...
    loop L1
This optimization reduces:
- Loop overhead latency.
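For context, here is a minimal sketch of a complete counted loop using the dec/jnz replacement. The lea/push/pop body is borrowed from the loop unrolling slide purely for illustration; the actual IssueWithNoLOOP program is not shown in the slides.

    ; Sketch only: a counted loop ending in DEC/JNZ instead of LOOP.
    mov  edi, OFFSET intarray       ; point EDI at the array
    mov  ecx, LENGTHOF intarray     ; iteration count
L1:
    lea  ebx, [edi]                 ; body: compute the element address
    push ebx                        ;       push it onto the stack
    pop  ebx                        ;       pop it back off
    add  edi, TYPE intarray         ; advance to the next element
    dec  ecx                        ; decrement the counter (simple ALU op)
    jnz  L1                         ; branch while the counter is non-zero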
Loop Unrolling Optimization
- Unrolling the body of a loop reduces the total number of iterations that must be performed, which eliminates a great deal of loop overhead and yields faster overall execution.
Example:
    lea  ebx, [edi]    ; Compute the address of the next element
    push ebx           ; Push it onto the stack
    pop  ebx           ; Pop it back off the stack
    lea  ebx, 2[edi]   ; Second unrolled copy
    push ebx
    pop  ebx
    lea  ebx, 4[edi]   ; Third unrolled copy
    push ebx
    pop  ebx
    lea  ebx, 6[edi]   ; Fourth unrolled copy
    push ebx
    pop  ebx
    lea  ebx, 8[edi]   ; Fifth unrolled copy
    push ebx
    pop  ebx
instead of
    lea  ebx, [edi]    ; Compute the address of the next element
    push ebx           ; Push it onto the stack
    pop  ebx           ; Pop it back off the stack
Function Inline Optimization
- The body of a small function can replace the function call in order to reduce function call overhead.
Example:
    mov  edx, 0      ; Clear EDX, the high half of the dividend
    mov  eax, 65535  ; Load the dividend
    mov  ebx, 5      ; Load the divisor
    idiv ebx         ; Perform the division
instead of
    call DoDivision  ; Perform the division
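The body of DoDivision is not shown in the slides; the following is a hypothetical sketch of what such a procedure could look like, mirroring the inline sequence above.

    ; Hypothetical sketch: the real DoDivision routine is not reproduced in the slides.
    DoDivision PROC
        mov  edx, 0       ; Clear EDX, the high half of the dividend
        mov  eax, 65535   ; Load the dividend
        mov  ebx, 5       ; Load the divisor
        idiv ebx          ; EAX = quotient, EDX = remainder
        ret               ; the CALL/RET pair is the overhead the inline version avoids
    DoDivision ENDP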
Decode Results Discussion
LoadExecuteNoOp vs LoadExecuteWithOp: Confirmed expectations.
• Data cache misses were about the same for both programs.
• The non-optimized program required 273144 / 225221 = 1.21x more cycles than the optimized program.
• The non-optimized program required 2508666 / 2008493 = 1.25x more instructions than the optimized program.
• The optimized program had a 0.095736654 / 0.064443493 = 1.49x higher stall rate than the non-optimized program.
• The non-optimized program took 130687 / 107365 = 1.22x longer to finish than the optimized program.
Even though the optimized program had a higher stall rate, the overall reduction in instructions and cycles produced a net performance gain.
Memory Access Results Discussion
MemAccessNoOp vs MemAccessWithOp: Did not confirm expectations.
• There was no real observable difference between the two programs. Both executed roughly the same number of cycles in the same period of time, with similar stall and cache miss counts.
• We are guessing that the same micro-operations are generated even for the optimized program.
LEA16 vs LEA32: Did not confirm expectations.
• Again, there was no real observable difference: both executed roughly the same number of cycles in the same period of time, with similar stall and cache miss counts.
Arithmetic Results Discussion
DIVIDE32 vs DIVIDE16: Confirmed expectations.
• Data cache misses were about the same for both programs, and both executed roughly the same number of instructions.
• However, DIVIDE32 took 44186 / 28232 = 1.57x as many cycles as DIVIDE16 to finish.
• DIVIDE16 finished 21898 / 13979 = 1.57x faster than DIVIDE32.
• Instructions per cycle decreased by 0.177823038 / 0.113662699 = 1.56x for DIVIDE32.
• The stalls-per-instruction ratio increased slightly in the DIVIDE32 program: 0.039177269 / 0.036802582 = 1.06x.
As expected, 32-bit division ran significantly slower than 16-bit division.
Integer Scheduling Results Discussion
IssueNoOp vs IssueWithOp: Did not confirm expectations.
• The optimized program required 271579 / 256887 = 1.06x more cycles than the non-optimized program.
• The non-optimized program required 2504269 / 2007048 = 1.25x more instructions.
• The optimized program had 315596 / 941 = 335x more cache misses than the non-optimized program.
• The optimized program achieved 0.974852367 / 0.739029159 = 1.32x fewer instructions per cycle than the non-optimized program.
• The optimized program took 131240 / 124485 = 1.05x longer to finish than the non-optimized program.
While the optimization did reduce the instruction count, the cache miss rate increased greatly, which diminished any performance returns to the point that performance actually became worse.
Integer Scheduling Results Discussion
PartialRWNoOp vs PartialRWWithOp: Did not confirm expectations.
• Both programs had very nearly the same performance profile.
• We expected the number of stalls to drop in the optimized program since the false dependencies were eliminated; however, both had about the same number of measured stalls and both finished execution in about the same amount of time.
Loop Instruction Overhead Results Discussion
IssueNoOp vs IssueWithNoLOOP: Confirmed expectations.
• Data cache misses were about the same for both programs.
Replacing the LOOP instruction with dec/jnz had the following effects:
• Number of cycles reduced by 256887 / 134073 = 1.91x
• Instruction count increased by 3005102 / 2504269 = 1.20x
• Stalls increased by 0.041105427 / 0.015574844 = 2.64x
• Total runtime decreased by 124485 / 62244 = 2.0x
While the overall instruction count and stall count increased, the total number of cycles was cut nearly in half, which yielded a large performance gain.
Loop Unrolling Results Discussion
IssueNoOp vs IssueWithLoopUnrolled: Confirmed expectations.
Unrolling the loop body 5 times had the following effects:
• Number of cycles reduced by 256887 / 137075 = 1.87x
• Number of instructions reduced by 2504269 / 1707411 = 1.47x
• Cache misses were slightly lower: 941 / 696 = 1.35x
• Instructions per cycle increased by 1.245603502 / 0.974852367 = 1.28x
• Stalls increased by 0.182267773 / 0.015574844 = 11.7x
• Total runtime decreased by 124485 / 64344 = 1.93x
Even though more stalls were introduced by the optimization, the total number of required cycles decreased significantly. Much of the loop overhead was removed, which is what allowed an overall net performance increase.
Function Inline Results Discussion
DIVIDE32 vs DIVIDE32FuncCall: Confirmed expectations.
• Data cache misses were about the same for both programs.
• DIVIDE32FuncCall required 69730 / 50223 = 1.39x more instructions than the in-lined DIVIDE32.
• Instructions per cycle increased in DIVIDE32FuncCall, but this is most likely a false positive, as the function call overhead introduced more instructions into the pipeline.
• The inline DIVIDE32 program finished 22858 / 21898 = 1.043x faster than the function call implementation.
• The number of stalls increased by 0.043843396 / 0.039177269 = 1.12x in the function call implementation.
• The function call implementation required 45925 / 44186 = 1.04x as many clock cycles as the inline version.
As expected, the added overhead of the function call made a noticeable impact on performance. Note also that we did not pass any parameters to the function; had parameters been passed, the overhead could be expected to increase.
Most Significant Performance Gains
1.) Loop Instruction Overhead (2.0x speedup)
2.) Loop Unrolling (1.93x speedup)
3.) Arithmetic (1.57x speedup)
4.) Decode (1.22x speedup)
5.) Function Inline (1.043x speedup)
6.) Memory Access (no measurable speedup)
7.) Integer Scheduling (1.05x slowdown)
Thank you. Questions?