Explore the golden principles of optimization, algorithms, implementations, and hardware performance techniques. Learn tricks, such as the prime number algorithm, and key concepts like parallelization and vectorization.
Optimization: The Art of Computing Intel Challenge experience and other tricks … Mathieu Gravey
Golden principle of Optimizing • Algorithm • Implementation • Hardware Long-term Performance
Example: Prime Number Algorithm
For i=2 to N
  bool isPrime=true;
  For j=2 to N
    If (mod(i,j)==0 and i != j)
      isPrime=false; break;
    end if
  end for
  if (isPrime) add i to the listOfPrimeNumber
End for
Example: Prime Number Algorithm
For i=2 to N
  bool isPrime=true;
  For j=2 to i-1
    If (mod(i,j)==0)
      isPrime=false; break;
    end if
  end for
  if (isPrime) add i to the listOfPrimeNumber
End for
Example: Prime Number Algorithm
For i=2 to N
  bool isPrime=true;
  For j=2 to √i
    If (mod(i,j)==0)
      isPrime=false; break;
    end if
  end for
  if (isPrime) add i to the listOfPrimeNumber
End for
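The √i bound above can be sketched in C++ as follows. This is a minimal illustration, not the author's code: the function name `primes_up_to` is hypothetical, and the bound is written as `j * j <= i` to stay in integer arithmetic. It assumes N is far below the 64-bit limit.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: returns all primes in [2, n] by trial division.
// Assumes n is far below the 64-bit limit so that j * j cannot overflow.
std::vector<std::uint64_t> primes_up_to(std::uint64_t n) {
    std::vector<std::uint64_t> primes;
    for (std::uint64_t i = 2; i <= n; ++i) {
        bool is_prime = true;
        // j * j <= i is the integer form of j <= sqrt(i).
        for (std::uint64_t j = 2; j * j <= i; ++j) {
            if (i % j == 0) {
                is_prime = false;
                break;  // first divisor found, no need to continue
            }
        }
        if (is_prime) primes.push_back(i);
    }
    return primes;
}
```

Calling `primes_up_to(30)` yields the ten primes from 2 to 29.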
Example: Prime Number Algorithm
// parallelize the job
For i=2 to N
  bool isPrime=true;
  For j=2 to √i
    If (mod(i,j)==0)
      isPrime=false; break;
    end if
  end for
  if (isPrime) add i to the listOfPrimeNumber
End for
Example: Prime Number Algorithm
// parallelize the job
For i=2 to N
  bool isPrime=true;
  // vectorize the job
  For j=2 to √i
    isPrime = isPrime && (mod(i,j)!=0);
  end for
  if (isPrime) add i to the listOfPrimeNumber
End for
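A sketch of the branch-free inner loop from this slide, assuming a hypothetical function name `is_prime_branchless`. Removing the `break` gives the loop a fixed trip count, which is what a vectorizer needs. Whether the compiler actually emits SIMD code depends on the target, and integer modulo often limits the gain.

```cpp
#include <cstdint>

// Hypothetical helper: primality test for one candidate i >= 2,
// written without an early exit so the loop has a fixed trip count.
bool is_prime_branchless(std::uint32_t i) {
    // Largest root with root * root <= i, computed once outside the hot loop.
    std::uint32_t root = 0;
    while (static_cast<std::uint64_t>(root + 1) * (root + 1) <= i) ++root;

    bool is_prime = true;
    for (std::uint32_t j = 2; j <= root; ++j) {
        // Accumulate the result instead of branching out of the loop.
        is_prime = is_prime & (i % j != 0);
    }
    return is_prime;
}
```

The trade-off is that every divisor up to √i is tested even after one has been found, so this form only pays off when the loop is actually vectorized.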
Example: Prime Number Algorithm
add 2 to the listOfPrimeNumber
// parallelize the job
For i=3 to N step 2
  bool isPrime=true;
  // vectorize the job
  For j=3 to √i step 2
    isPrime = isPrime && (mod(i,j)!=0);
  end for
  if (isPrime) add i to the listOfPrimeNumber
End for
Example: Prime Number Algorithm
// parallelize the job
For i=2 to N
  bool isPrime=true;
  // vectorize the job
  For j in listOfPrimeNumber and j<=√i
    isPrime = isPrime && (mod(i,j)!=0);
  end for
  if (isPrime) add i to the listOfPrimeNumber, in order
End for
Example: Prime Number Algorithm
add 2 and 3 to the listOfPrimeNumber
// parallelize the job
For i=5 to N with mod(i,6)==1 or mod(i,6)==5
  bool isPrime=true;
  // vectorize the job
  For j in listOfPrimeNumber and j<=√i
    isPrime = isPrime && (mod(i,j)!=0);
  end for
  if (isPrime) add i to the listOfPrimeNumber, in order
End for
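The last two refinements (dividing only by primes already found, and testing only candidates congruent to 1 or 5 modulo 6) can be combined in a serial sketch like the one below. The name `primes_wheel6` is hypothetical. This version keeps the early `break` for simplicity, so it illustrates the algorithmic gain rather than the vectorized form, and it assumes N is far below the 64-bit limit.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: primes in [2, n], testing only candidates of the
// form 6k-1 and 6k+1 and dividing only by primes already in the list.
std::vector<std::uint64_t> primes_wheel6(std::uint64_t n) {
    std::vector<std::uint64_t> primes;
    if (n >= 2) primes.push_back(2);
    if (n >= 3) primes.push_back(3);

    // Candidates 5, 7, 11, 13, 17, ...: the step alternates between 2 and 4.
    std::uint64_t step = 2;
    for (std::uint64_t i = 5; i <= n; i += step, step = 6 - step) {
        bool is_prime = true;
        for (std::uint64_t p : primes) {
            if (p * p > i) break;      // only primes up to sqrt(i) matter
            if (i % p == 0) {
                is_prime = false;
                break;
            }
        }
        // Appending in increasing order keeps the list sorted.
        if (is_prime) primes.push_back(i);
    }
    return primes;
}
```

Because the list is filled in increasing order, each candidate sees every prime up to its square root. A parallel version of the outer loop would have to preserve that property, which is what "in order" on the slide points at.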
Basic principles • Pareto principle • Structure • Parallelization • Vectorization inotes4you.files.wordpress.com
Basic principles • Start with the main issues • Global view critical issue • Monkey development • Start simple, go to complex • Iterative process • Optimizing: start by slowing down • Global picture! http://bestofpicture.com/
Rules Guidelines • Be lazy • Don’t reinvent the wheel • Don’t be idle • Design pattern • Global variables are your enemies • Don’t overgeneralize
Rules Guidelines • Trust the compiler • Simple for you = simple for compiler | computer • Share your knowledge • Compiler
Rules Guidelines • Think different, try, change and try again … • Don’t aim for the Best, but something Good and Better
Concrete trick : Memory • Array vs. List • Prefetch | random access
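To make the array versus list point concrete, here is a small sketch with two hypothetical functions that compute the same sum. `std::vector` stores its elements contiguously, so a linear scan is easy for the hardware prefetcher. `std::list` allocates each node separately, so traversal chases pointers to addresses that may be scattered in memory. No timings are claimed here, since they depend on the machine and the allocator.

```cpp
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

// Contiguous storage: sequential access, prefetch-friendly.
std::int64_t sum_array(const std::vector<std::int32_t>& values) {
    return std::accumulate(values.begin(), values.end(), std::int64_t{0});
}

// Node-based storage: each step follows a pointer to a separate allocation.
std::int64_t sum_list(const std::list<std::int32_t>& values) {
    return std::accumulate(values.begin(), values.end(), std::int64_t{0});
}
```

Both functions return the same value; the difference to measure on a given machine is how long each takes for a large container.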
Concrete trick : First step Optimization • Compiler optimization • icpc myCodeFile -O3 -xHost -o myCompiledProgram • ⚠ -g • const • No-writes • inline • restrict/__restrict__ • No read updates • Loop-unroll • __builtin_expect((x),(y))
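A sketch of how several of these hints look in source code. The function names are hypothetical. `__restrict__` and `__builtin_expect` are compiler extensions (available in GCC, Clang and the Intel compiler), not standard C++, so this assumes one of those compilers.

```cpp
#include <cstddef>

// Small helper marked inline so the compiler can fold it into the loop.
inline float scale(float x, float factor) { return x * factor; }

// y[i] += a * x[i]. const promises no writes through x; __restrict__
// promises that x and y do not overlap, so loads of x need not be
// repeated after each store to y.
void axpy(std::size_t n, const float a,
          const float* __restrict__ x, float* __restrict__ y) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += scale(x[i], a);
    }
}

// Counts negative entries, hinting that negatives are the unlikely case.
std::size_t count_negative(std::size_t n, const float* x) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (__builtin_expect(x[i] < 0.0f, 0)) ++count;
    }
    return count;
}
```

These are hints only: the results are the same with or without them, and passing overlapping arrays to a function that uses `__restrict__` is undefined behavior.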
Concrete trick : OpenMP • Vectorization => SIMD • #pragma omp simd • Multi-operation with one instruction • ⚠ non-aligned data • Multi-Thread • L3 cache-communication • Shared memory • How to use : • #pragma omp parallel for default(none) shared(x,y) firstprivate(array) reduction(max:MaxValue) schedule(static) • for (int i = 0; i < 10000; i++) { something … } • #pragma omp critical • #pragma omp barrier
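The two OpenMP directives above can be sketched as follows. The function names are hypothetical, and an empty input returning 0.0 is a convention chosen only for this example. The code assumes a compiler with OpenMP 3.1 or later for `reduction(max:...)` and OpenMP 4.0 or later for `omp simd`, enabled with a flag such as `-fopenmp` (GCC, Clang) or `-qopenmp` (Intel). Without the flag the pragmas are ignored and the loops run serially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Multi-threaded maximum: each thread keeps a private max_v,
// and OpenMP combines them at the end of the loop.
double max_value(const std::vector<double>& data) {
    if (data.empty()) return 0.0;  // convention for this sketch only
    const double* x = data.data();
    std::ptrdiff_t n = static_cast<std::ptrdiff_t>(data.size());
    double max_v = x[0];
    #pragma omp parallel for default(none) shared(x, n) reduction(max:max_v) schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        if (x[i] > max_v) max_v = x[i];
    }
    return max_v;
}

// SIMD sum of squares: one instruction processes several elements.
double sum_of_squares(const std::vector<double>& data) {
    const double* x = data.data();
    std::ptrdiff_t n = static_cast<std::ptrdiff_t>(data.size());
    double total = 0.0;
    #pragma omp simd reduction(+:total)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        total += x[i] * x[i];
    }
    return total;
}
```

With `default(none)`, every variable used in the parallel region must be listed explicitly, which makes accidental sharing visible at compile time.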
Multi-Chip | Multi-Sockets • NUMA (Non-uniform memory access) • slower than local memory • Position in memory => first touch • Parallelize the initialisation with : schedule(static) • readonly data => copy in each local memory • Thread Affinity
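A sketch of first-touch initialisation, assuming an operating system that places each memory page on the NUMA node of the thread that first writes to it (the default policy on Linux). The function names are hypothetical. The array is allocated without initialisation, then filled in parallel with `schedule(static)`, and later loops use the same schedule so each thread mostly reads pages local to it.

```cpp
#include <cstddef>
#include <memory>

// Allocates count doubles and lets each thread touch its own chunk first.
// new double[count] (no parentheses) leaves elements uninitialized, so no
// page is touched by the allocating thread. A std::vector<double>(count)
// would instead zero every element serially.
std::unique_ptr<double[]> make_first_touch_array(std::size_t count, double value) {
    std::unique_ptr<double[]> data(new double[count]);
    double* p = data.get();
    std::ptrdiff_t n = static_cast<std::ptrdiff_t>(count);
    #pragma omp parallel for default(none) shared(p, n, value) schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        p[i] = value;  // first write decides where the page lives
    }
    return data;
}

// Uses the same static schedule, so chunk boundaries match the initialisation.
double sum_same_schedule(const double* values, std::size_t count) {
    const double* p = values;
    std::ptrdiff_t n = static_cast<std::ptrdiff_t>(count);
    double total = 0.0;
    #pragma omp parallel for default(none) shared(p, n) reduction(+:total) schedule(static)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        total += p[i];
    }
    return total;
}
```

The benefit also relies on threads staying on the same cores between the two loops, which is where thread affinity settings come in.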