Finding the Limits of Hardware Optimization through Software De-optimization
De-optimizations ATTACK!!!
Presented By: Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed
Outline • Flashback • Project Structure • Judging de-optimizations • What does a de-op look like? • General Areas of Focus • Instruction Fetching and Decoding • Instruction Scheduling • Instruction Type Usage (e.g. Integer vs. FP) • Branch Prediction • Idiosyncrasies
Outline • Our Methods • Measuring clock cycles • Eliminating noise • Something about the de-ops that didn’t work • Lots and lots of de-ops
Flashback During the research project • We studied de-optimizations • We studied the Opteron For the implementation project • We have chosen de-optimizations to implement • We have chosen algorithms that may best reflect our de-optimizations • We have implemented the de-optimizations • …And, we’re here to report the results
Flashback Judging de-optimizations (de-ops) • Whether a de-op affects scheduling, caching, branching, etc., its impact will be felt in the clock cycles needed to execute an algorithm • So, our metric of choice will be CPU clock cycles What does a de-op look like? • A de-op is a change to an optimal implementation of an algorithm that increases the clock cycles needed to execute the algorithm and that demonstrates some interesting fact about the CPU in question
Our Methods • The CPUs • AMD Opteron (Hydra) • Intel Nehalem (Derek’s Laptop) • Our primary focus was the Opteron • The de-optimizations were designed to affect the Opteron • We also tested them on the Intel in order to give you an idea of how universal a de-optimization is • When we know why something does or doesn’t affect the Intel, we will try to let you know
Our Methods • The code • Most of the de-optimizations are written in C (GCC) • Some of them have a wrapper that is written in C, while the code being de-optimized is written in NASM (assembly) • E.g. • Mod_ten_counter • Factorial_over_array • Typically, if a de-op is written in NASM, then the C wrapper does all of the grunt work prior to calling the de-optimized NASM module
Our Methods • Problem: How do we measure clock cycles? • An obvious answer • CodeAnalyst • Actually, we were getting strange results from CodeAnalyst • …And, it is hard to separate important code sections from unimportant code sections • …And, it is cumbersome to work with
Our Methods • A better answer • Embed code that measures clock cycles for important sections • Ok… but how? Answer: Read the CPU Timestamp Counter

#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo ) | ( ((unsigned long long)hi) << 32 );
}
#endif
Our Methods • CPU Timestamp Counter • In all x86 CPUs since the Pentium • Counts the number of clock cycles since the last reset • It’s a little tricky in multi-core environments • Care must be taken to control the cores that do the relevant processing
Our Methods • CPU Timestamp Counter Windows: Runs the executable on core 3 (of 1 – 4) start /realtime /affinity 4 /b <exe name> <arguments> Linux (Hydra): Runs the executable on node 11, CPU 3 (of 0 – 11) bpsh 11 taskset 0x000000008 <exe name> <arguments> So, by restricting our runs to specific CPUs, we can rely on the CPU timestamp values
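Our Methods • CPU Timestamp Counter • A process can also pin itself from within its own code rather than relying on start /affinity or taskset. This is a minimal sketch of our own (not part of the project code), assuming Linux and glibc's sched_setaffinity():

#define _GNU_SOURCE
#include <sched.h>

/* Restrict the calling process (pid 0 = ourselves) to a single CPU */
static int pin_to_cpu( int cpu )
{
    cpu_set_t mask;
    CPU_ZERO( &mask );
    CPU_SET( cpu, &mask );
    return sched_setaffinity( 0, sizeof( mask ), &mask );
}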
Our Methods • CPU Timestamp Counter • Wrapping code so that clock cycles can be counted

//
// Send the array off to be counted by the assembly code
//
unsigned long long start = rdtsc();
#ifdef _WIN64
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif _WIN32
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif __linux__
    _mod_ten_counter( counts, mod_ten_array, size_of_array );
#endif
printf( "Cycles=%llu\n", ( rdtsc() - start ) );

The important section is wrapped and the number of clock cycles will be the difference between the start and the finish
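Our Methods • CPU Timestamp Counter • One caveat: RDTSC itself can be reordered by an out-of-order core, so instructions from outside the wrapped section can leak into the measurement. A common remedy, sketched below as our own illustration (assuming x86-64 and GCC inline assembly), is to issue a serializing CPUID before reading the counter:

static __inline__ unsigned long long rdtsc_serialized( void )
{
    unsigned hi, lo;
    /* CPUID forces earlier instructions to retire before RDTSC runs; */
    /* it clobbers eax/ebx/ecx/edx, hence the ebx and ecx clobbers    */
    __asm__ __volatile__ ( "xorl %%eax, %%eax\n\t"
                           "cpuid\n\t"
                           "rdtsc"
                           : "=a"( lo ), "=d"( hi )
                           :
                           : "%ebx", "%ecx" );
    return ( (unsigned long long)lo ) | ( ((unsigned long long)hi) << 32 );
}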
Our Methods • Eliminating noisy results • Even with our precautions, there can be some noise in the clock cycles • So, we need lots of iterations that we can use to generate a good average • But, this can be very, very time consuming • How, oh how? Answer: The Version Tester
Our Methods • Eliminating noisy results – The Version Tester • Used to iteratively test executables • Expects each executable to print the number of cycles that need to be counted • Remember this?

//
// Send the array off to be counted by the assembly code
//
unsigned long long start = rdtsc();
#ifdef _WIN64
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif _WIN32
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif __linux__
    _mod_ten_counter( counts, mod_ten_array, size_of_array );
#endif
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Our Methods • Eliminating noisy results – The Version Tester • Runs executables for a specified number of iterations and then averages the number of cycles

Example run on Hydra:
> bpsh 10 taskset 0x000000004 version_tester mtc.hydra-core3.config
Running Optimized for 1000 for 200 iterations
Done running Optimized for 1000 with an average of 19058 cycles
Running De-optimized #1 for 1000 for 200 iterations
Done running De-optimized #1 for 1000 with an average of 21039 cycles
Running Optimized for 10000 for 200 iterations
Done running Optimized for 10000 with an average of 187296 cycles
Running De-optimized #1 for 10000 for 200 iterations
Done running De-optimized #1 for 10000 with an average of 206060 cycles

Runs version_tester.exe on CPU 2 and mod_ten_counter.exe on CPU 3
Our Methods • Eliminating noisy results – The Version Tester • Running

Command Format
version_tester <tester_configuration>

Configuration File (for Hydra)
ITERATIONS=200
__EXECUTABLES__
Optimized for 1000=taskset 0x000000008 ./mod_ten_counter_op 1000
De-optimized #1 for 1000=taskset 0x000000008 ./mod_ten_counter_deop 1000
Optimized for 10000=taskset 0x000000008 ./mod_ten_counter_op 10000
De-optimized #1 for 10000=taskset 0x000000008 ./mod_ten_counter_deop 10000
Optimized for 100000=taskset 0x000000008 ./mod_ten_counter_op 100000
De-optimized #1 for 100000=taskset 0x000000008 ./mod_ten_counter_deop 100000
Optimized for 1000000=taskset 0x000000008 ./mod_ten_counter_op 1000000
De-optimized #1 for 1000000=taskset 0x000000008 ./mod_ten_counter_deop 1000000
Our Methods • Eliminating noisy results – The Version Tester • Running

Configuration File (for Windows):
ITERATIONS=200
__EXECUTABLES__
Optimized for 10=.\mod_ten_counter\mod_ten_counter_op 10
De-optimized #1 for 10=.\mod_ten_counter\mod_ten_counter_deop 10
Optimized for 100=.\mod_ten_counter\mod_ten_counter_op 100
De-optimized #1 for 100=.\mod_ten_counter\mod_ten_counter_deop 100
Optimized for 1000=.\mod_ten_counter\mod_ten_counter_op 1000
De-optimized #1 for 1000=.\mod_ten_counter\mod_ten_counter_deop 1000
Optimized for 10000=.\mod_ten_counter\mod_ten_counter_op 10000
De-optimized #1 for 10000=.\mod_ten_counter\mod_ten_counter_deop 10000
Optimized for 100000=.\mod_ten_counter\mod_ten_counter_op 100000
De-optimized #1 for 100000=.\mod_ten_counter\mod_ten_counter_deop 100000
Optimized for 1000000=.\mod_ten_counter\mod_ten_counter_op 1000000
De-optimized #1 for 1000000=.\mod_ten_counter\mod_ten_counter_deop 1000000
Optimized for 10000000=.\mod_ten_counter\mod_ten_counter_op 10000000
De-optimized #1 for 10000000=.\mod_ten_counter\mod_ten_counter_deop 10000000
Our Methods • Eliminating noisy results – The Version Tester • Therefore, using the Version Tester, we can iterate hundreds or thousands of times in order to obtain a solid average number of cycles • So, we believe our results fairly represent the CPUs in question
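Our Methods • Eliminating noisy results – The Version Tester • To make the averaging concrete, here is an illustrative sketch (ours, not the Version Tester's actual source) of the core loop, assuming a POSIX system and that each executable prints a "Cycles=..." line:

#include <stdio.h>

/* Run cmd `iterations` times, scrape the "Cycles=..." line that each */
/* executable prints, and return the mean cycle count                 */
static unsigned long long average_cycles( const char *cmd, int iterations )
{
    unsigned long long total = 0, cycles;
    char line[256];

    for ( int i = 0; i < iterations; i++ ) {
        FILE *p = popen( cmd, "r" );
        if ( p == NULL ) return 0;
        while ( fgets( line, sizeof( line ), p ) != NULL )
            if ( sscanf( line, "Cycles=%llu", &cycles ) == 1 )
                total += cycles;
        pclose( p );
    }
    return total / iterations;
}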
In what follows… • You are going to see the various de-optimizations that we implemented and the corresponding results • These de-optimizations were tested using the Version Tester, with execution restricted to a single core (CPU)
But first… • …something about the de-optimizations that were less than successful • Branch Patterns • Remember: We wanted to challenge the CPU with branching patterns that could force misses (see the sketch after this slide) • This turned out to be very difficult to do • Random data caused a significant slowdown, but random data will break any branch prediction mechanism • The branch prediction mechanism on the Opteron is very, very good
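But first… • The branch-pattern tests had roughly this shape (a simplified reconstruction of ours, not the project source): the same loop is timed over sorted and over randomly ordered data. Sorted data produces long taken/not-taken runs that the predictor learns; random data defeats any predictor:

/* Sums the large elements; the if() is almost perfectly predictable */
/* on sorted data and ~50% mispredicted on random data               */
long sum_if_large( const int *data, int n )
{
    long sum = 0;
    for ( int i = 0; i < n; i++ )
        if ( data[i] >= 128 )
            sum += data[i];
    return sum;
}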
But first… • Unpredictable Instructions - Recursion • Remember: Writing recursive functions that call other functions near their return (a sketch follows below) • This was supposed to overload the return address buffer and cause mispredictions • It turned out to be very difficult to implement • We never really showed any performance degradation • So, don’t worry about this one
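But first… • The recursion experiment had roughly this shape (again a simplified sketch of ours, not the project source): a recursive function makes an extra call just before returning, in the hope of overflowing the return address stack:

static int noise( int x ) { return x ^ 0x5a; }   /* trivial helper */

static int deep( int n )
{
    if ( n == 0 ) return 0;
    int r = deep( n - 1 );
    return noise( r ) + 1;   /* extra call near the return point */
}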
So, without further ado... The results of De-optimization
Dependency Chain De-Optimization Results Area: Instruction Scheduling
Dependency Chain: Flashback • Description • As we have seen in this class, data dependencies have an impact on ILP • Dynamic scheduling, as we saw, can eliminate WAW and WAR hazards • However, dynamic scheduling can only do so much; once it is overwhelmed, performance suffers, as we will see next • The Opteron • The Opteron, like every other architecture, is strongly affected by data hazards • The point of this de-optimization is to show the impact of a chain of data dependencies on performance
Dependency Chain • dependency_chain.exe • We implemented two versions of a program called ‘dependency_chain’ • The program takes an array size as an argument • It then generates an array of the specified size in which each element is populated with an integer x where 0 <= x <= 20 • The array’s elements are summed, and the output is the number of cycles taken by the program
Dependency Chain • dependency_chain.exe • The optimized version adds the elements of the array by striding through it in four-element chunks, adding elements to four different temporary variables • The four temporary variables are then added together • The advantage is four smaller dependency chains instead of one massive one • In the de-optimized version, however, every element of the array is summed into a single variable • This creates a massive dependency chain, which quickly exhausts the resources of the dynamic scheduler
Dependency Chain • dependency_chain.exe Source

Optimized
for ( i = 0; i < size_of_array; i += 4 ) {
    sum1 += test_array[i];
    sum2 += test_array[i + 1];
    sum3 += test_array[i + 2];
    sum4 += test_array[i + 3];
}
sum = sum1 + sum2 + sum3 + sum4;

De-Optimized
for ( i = 0; i < size_of_array; i++ ) {
    sum += test_array[i];
}
Dependency Chain: Results • dependency_chain.exe • The chart below shows that not breaking up a dependency chain can be extraordinarily costly. On the Opteron, it caused a ~150% slowdown for all array sizes • The scheduling resources of the Opteron become overwhelmed, essentially causing the program to run sequentially, i.e. with no ILP • The Nehalem was impacted by this de-optimization too. Given the lesser impact, one can only surmise that it has more scheduling resources Chart: Difference between Optimized and De-Optimized Versions (in clock cycles)
Dependency Chain: Upshot • Lessons • The code for the de-optimization is so natural that it is a little scary. It is elegant and parsimonious • However, this elegance and parsimony may come at a very high cost • If you don’t get the performance that you expect from a program, then it is definitely worth looking for these types of dependency chains • Break these chains up to give dynamic schedulers more scheduling options
High Instruction Latency De-Optimization Results Area: Instruction Fetching and Decoding
High Instruction Latency: Flashback • Description • CPUs often have instructions that perform almost the same operation • Yet, in spite of their seeming similarity, they can have very different latencies. By choosing the high-latency version when the low-latency version would suffice, code can be de-optimized • The Opteron • The LOOP instruction on the Opteron has a latency of 8 cycles, while a test (like DEC) plus a jump (like JNZ) has a combined latency of less than 4 cycles • Therefore, substituting LOOP instructions for DEC/JNZ combinations is a de-optimization
High Instruction Latency • fib.exe • We implemented a program called ‘fib’ • It takes an array size as an argument • A Fibonacci number is calculated for each element in the array
High Instruction Latency • fib.exe • The Fibonacci numbers are calculated in assembly code • The optimized version uses the DEC and JNZ instructions, which together take up to 4 cycles • The de-optimized version uses the LOOP instruction, which takes 8 cycles
High Instruction Latency • fib.exe Source

Optimized
calculate:
    mov edx, eax
    add ebx, edx
    mov eax, ebx
    mov dword [edi], ebx
    add edi, 4
    dec ecx
    jnz calculate

De-Optimized
calculate:
    mov edx, eax
    add ebx, edx
    mov eax, ebx
    mov dword [edi], ebx
    add edi, 4
    loop calculate
High Instruction Latency • fib.exe Compiled

Optimized
08048481 <calculate>:
 8048481: 89 c2                mov %eax,%edx
 8048483: 01 d3                add %edx,%ebx
 8048485: 89 d8                mov %ebx,%eax
 8048487: 89 1f                mov %ebx,(%edi)
 8048489: 81 c7 04 00 00 00    add $0x4,%edi
 804848f: 49                   dec %ecx
 8048490: 75 ef                jne 8048481 <calculate>

De-Optimized
08048481 <calculate>:
 8048481: 89 c2                mov %eax,%edx
 8048483: 01 d3                add %edx,%ebx
 8048485: 89 d8                mov %ebx,%eax
 8048487: 89 1f                mov %ebx,(%edi)
 8048489: 81 c7 04 00 00 00    add $0x4,%edi
 804848f: e2 f0                loop 8048481 <calculate>
High Instruction Latency: Results • fib.exe • In the chart below, we can see that the optimized version significantly outperforms the de-optimized version. The results on the Nehalem are even more impressive Chart: Difference between Optimized and De-Optimized Versions (in clock cycles)
High Instruction Latency: Upshot • Lessons • As we have seen, the instructions you choose can hurt your program if you don’t choose them carefully • It is important to know which instructions take more cycles so that you can avoid them where possible
Costly Instructions De-Optimization Results Area: Instruction Type Usage
Costly Instructions: Flashback • Description • Some instructions can do the same job as others, but at a higher cost in cycles • The Opteron • Integer division on the Opteron costs 22-47 cycles for signed operands and 17-41 cycles for unsigned • Multiplication, in contrast, takes only 3-8 cycles, whether signed or unsigned
Costly Instructions • mult_vs_div_deop_1.exe & mult_vs_div_op.exe • We implemented two programs, an optimized and a de-optimized version • They take an array size as an argument; the array is initialized randomly with powers of 2 (less than or equal to 2^12) • The de-optimized version divides each element by 2.0; the optimized version multiplies each element by 0.5 • The two versions are functionally equivalent
Costly Instructions • mult_vs_div_deop_1.exe & mult_vs_div_op.exe

Optimized
for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] * 0.5;
}

De-optimized
for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] / 2.0;
}
Costly Instructions: Results • mult_vs_div_deop_1.exe & mult_vs_div_op.exe • The chart below shows that this de-optimization has a huge impact on the Opteron, averaging 23%. It also affects the Nehalem, though not as strongly as the Opteron Chart: Difference between Optimized and De-Optimized Versions (in clock cycles)
Costly Instructions: Upshot • Lessons • Small changes in your code can have a real impact on performance • It is important to know the relative costs of instructions • Seek out discount instructions whenever possible (an example follows below)
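Costly Instructions: Upshot • One classic discount substitution (our illustration, using integers rather than the floats above): division by a power of two can be replaced with a shift. With a constant divisor the compiler performs this strength reduction itself; with a variable divisor that you know is a power of two, it cannot:

/* Equivalent to x / (1u << shift) for unsigned x, but a cheap shift */
/* instead of a 17-41 cycle DIV                                      */
unsigned divide_by_pow2( unsigned x, unsigned shift )
{
    return x >> shift;
}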
Costly Instructions De-Optimization Results Area: Instruction Type Usage
Costly Instructions: Flashback • Description • Some instructions can do the same job as others, but at a higher cost in cycles Example: float f1, f2; if (f1 < f2) This is a common idiom among programmers, yet it can be considered a de-optimization technique • The Opteron • Branches based on floating-point comparisons are often slow
Costly Instructions • Compare_two_floats.exe • We implemented a program called ‘Compare_two_floats’ • It takes a number of iterations as an argument • The program performs repeated comparisons between two floating-point numbers
Costly Instructions • Compare_two_floats_deop.exe & Compare_two_floats_op.exe • In the de-optimized version, we compare the two floats in the common way, as we will see in the next slide • In the optimized version, we instead convert the floats to integers and use an integer comparison as the condition (a hedged sketch follows below) • The condition was deliberately chosen so that the branch is not taken all the time
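Costly Instructions • Since the source slide is not reproduced here, the sketch below is only our hedged guess at the flavor of the optimized version: for non-negative IEEE-754 floats, comparing the raw bit patterns as integers orders them the same way as a floating-point comparison, so the branch can be driven by a cheap integer compare:

#include <string.h>

/* Valid only when f1, f2 >= 0; memcpy() reinterprets the bits safely */
int float_less_than( float f1, float f2 )
{
    int i1, i2;
    memcpy( &i1, &f1, sizeof( i1 ) );
    memcpy( &i2, &f2, sizeof( i2 ) );
    return i1 < i2;
}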