MPI vs POSIX Threads: A Comparison
Overview
• MPI allows you to run multiple processes on one host.
• How would running MPI on one host compare with a POSIX threads solution?
• Attempting to compare MPI vs POSIX run times.
• Hardware
  • Dual 6 core (2 threads per core), 12 logical
    http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/AboutRage.txt
  • Intel Xeon CPU E5-2667 (see schematic)
    http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/xeon-e5-v2-datasheet-vol-1.pdf
  • 2.96 GHz
  • 15 MB L3 cache
• All code / output / analysis available here:
  http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/
Specifics
• Compare run times of code written with MPI vs code written using POSIX threads and shared memory.
• Try to make the code as similar as possible so we're comparing apples with oranges and not apples with monkeys.
• Since we are on one machine, the bus carries all of the communication traffic; that should keep the POSIX and MPI versions comparable (i.e. the network doesn't get involved).
  • Only makes sense on one machine.
• Set up a test bed.
  • Try each step individually, check the results, then automate.
• Use the matrix-matrix multiply code we developed over the semester.
  • Everyone is familiar with the code and can make observations.
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/pthread_matrix_21.c
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/matmat_3.c
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/matmat_no_mp.c
• Use square matrices.
  • Vary matrix sizes from 500 x 500 to 10,000 x 10,000 (plus a couple of big ones).
  • Matrix A is filled with 1..n, left to right and top down.
  • Matrix B is the identity matrix.
  • The results are then easy to check, since A*B = A when B is the identity matrix (a sketch of this setup follows this list).
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/mat_500_result.txt
• Ran the whole process (compile / output result / parsing) many times and checked it before writing the final scripts to do the processing.
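For reference, a minimal sketch of the fill-and-verify idea described above; it is an assumption about the setup, not code from the linked matmat sources, and N and the variable names are illustrative only.

/* Hypothetical sketch: fill A with 1..N*N row by row, make B the identity,
 * multiply with a plain triple loop, and verify that C == A. */
#include <stdio.h>
#include <stdlib.h>

#define N 500                                  /* matrix dimension for this example */

int main(void)
{
    double *A = malloc((size_t)N * N * sizeof *A);
    double *B = malloc((size_t)N * N * sizeof *B);
    double *C = calloc((size_t)N * N, sizeof *C);
    if (!A || !B || !C) { perror("alloc"); return 1; }

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i * N + j] = (double)(i * N + j + 1);   /* 1..N*N, left to right, top down */
            B[i * N + j] = (i == j) ? 1.0 : 0.0;      /* identity matrix                 */
        }

    /* C = A * B: the basic structure shared by the POSIX, MPI, and plain C versions */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];

    /* Since B is the identity, C must equal A element for element */
    for (int i = 0; i < N * N; i++)
        if (C[i] != A[i]) { printf("mismatch at %d\n", i); return 1; }
    printf("verified: C == A for %d x %d\n", N, N);
    free(A); free(B); free(C);
    return 0;
}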
Matrix Sizes
[Table of matrix sizes omitted in this extract.] The third column is just the number of calculations inside the loop for computing the matrix elements; for an n x n multiply this grows as n cubed (for example, n = 500 gives 500^3 = 1.25 x 10^8 multiply-adds).
Specifics cont.
• About the runs
  • For each matrix size (500 -> 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10,000):
    • Vary thread count 2-12 (POSIX)
    • Vary process count 2-12 (MPI)
    • Run 10 trials of each and take the average (the machine was mostly idle when not running tests, but averaging smooths spikes in run times caused by the system doing routine tasks).
  • Make observations about anomalies in the run times where appropriate.
• Caveats
  • All initial runs used no optimization, just for testing; but hey, this is a class about performance.
  • A second set of runs had optimization turned on: -O1 (note: -O2 and -O3 made no appreciable difference).
  • First-level optimization made a huge difference: more than a 3x improvement.
  • The GNU optimization options are explained here: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
  • Built with the individual -O1 flags to see if I could catch the "one" making the most difference (nope; the code isn't that complicated, and not all optimizations are flag-controlled).
  • Regardless of whether the code is written in the most efficient fashion (and it's not), the similarity of the versions lets us make runs and observations.
• "Oh no" moment
  • Huge improvement in performance with optimized code. Why?
  • What if the improvement (from compiler optimization) was due to the identity matrix? Maybe the compiler found a clever way to increase the speed because of the simple math, and it wasn't really doing all the calculations I thought it was.
  • Came back and made matrix B non-identity: same performance. Whew.
  • I now believe the main performance improvement came from loop unrolling (see the sketch after this list).
• Ready to make the runs.
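To make the loop-unrolling suspicion concrete, here is an illustrative, manually unrolled version of the inner dot-product loop from the sketch above. This is only a hand-written stand-in for the kind of transformation the compiler may apply; it is not claimed to be what -O1 actually generated for this code.

/* Illustrative only: the inner k loop unrolled by 4, with a remainder loop.
 * A, B, C are row-major N x N arrays of double, as in the earlier sketch. */
void matmul_unrolled(const double *A, const double *B, double *C, int N)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            int k = 0;
            for (; k + 3 < N; k += 4) {               /* 4 multiply-adds per pass */
                sum += A[i * N + k]     * B[k       * N + j];
                sum += A[i * N + k + 1] * B[(k + 1) * N + j];
                sum += A[i * N + k + 2] * B[(k + 2) * N + j];
                sum += A[i * N + k + 3] * B[(k + 3) * N + j];
            }
            for (; k < N; k++)                        /* leftover elements when N % 4 != 0 */
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
    }
}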
Discussion
• Please chime in as questions come up.
• Process explanation (after initial testing and verification):
  http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/process_explanation.txt
• Attempted a 25,000 x 25,000 matrix
  • Compile error for the MPI version (exceeded the 2 GB limit on matrices passed via MPI_Bcast); a sketch of the broadcast step follows this list.
    http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/BadCompileMPI.txt
  • Not an issue for the POSIX threads version (until you run out of memory on the machine and start swapping).
• Settled on 12 processes / threads because of the number of cores available.
  • Do you get enhanced or degraded performance by exceeding that number?
    http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_MANY_THREADS.txt
• Example of the process space / top output (10,000 x 10,000)
  • Early testing, before the runs started; pre-optimization.
    http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/RageTestRun_Debug_CPU_Usage.txt
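For context, a minimal sketch of how a root rank might broadcast a whole matrix with MPI_Bcast. This is not the project's matmat_3.c; N and the variable names are illustrative. At N = 25,000 the matrix of doubles is roughly 5 GB, which is where the 2 GB limit mentioned above comes into play.

/* Hypothetical sketch: rank 0 fills B and broadcasts it to every rank. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;
    const int N = 1000;                      /* kept small for this sketch */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *B = malloc((size_t)N * N * sizeof *B);
    if (!B) { MPI_Abort(MPI_COMM_WORLD, 1); }

    if (rank == 0)                           /* only the root fills B */
        for (long i = 0; i < (long)N * N; i++)
            B[i] = (i % (N + 1) == 0) ? 1.0 : 0.0;   /* identity matrix */

    /* the count argument is a plain int: N*N elements of MPI_DOUBLE from rank 0 */
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d of %d has B[0] = %f\n", rank, nprocs, B[0]);
    free(B);
    MPI_Finalize();
    return 0;
}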
Time Comparison (still boring…)
• In all of these cases the times for 5, 4, 3, and 2 processes are much longer than for 6, so they are left off for comparison.
• POSIX doesn't "catch" back up until 9 processes.
• MPI doesn't "catch" back up until 11 processes.
POSIX Threads vs MPI Processes Run Times: Matrix Sizes 4000 x 4000 to 10,000 x 10,000
1600 x 1600 case
• The straight C version runs long enough to watch the top output (here I can see the memory usage).
• The threaded, MPI, and non-mp code share the same basic structure for calculating the "C" matrix.
• Suspect some kind of boundary issue here, possibly "false sharing"?
• The process fits entirely in the shared L3 cache: 15 MB x 2 = 30 MB.
• Do the same number of calculations, but make the initial array allocations larger (shown below; a sketch of the padded allocation follows the output).

[rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS (1 2 3 4 5)
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.979548 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.980786 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.971891 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.974897 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 22.012967 secs

[rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS (1 2 3 4 5)
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.890815 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.903997 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.881991 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.884655 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.887197 secs
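A hedged sketch of the padding experiment above (not the actual test program): the multiply still touches only 1600 x 1600 elements, but the arrays are allocated with a larger leading dimension so row starts are offset in the cache. LDA, the function names, and the main driver are assumptions for illustration.

/* Illustrative sketch: same 1600x1600 work, padded allocation.
 * Setting LDA back to 1600 reproduces the "slow" run shown above. */
#include <stdlib.h>

#define N   1600          /* logical matrix size used in the multiply        */
#define LDA 1601          /* allocated leading dimension (padded by one row) */

static double *alloc_matrix(void)
{
    /* allocate LDA x LDA even though only N x N is used */
    return calloc((size_t)LDA * LDA, sizeof(double));
}

static void matmul_padded(const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * LDA + k] * B[k * LDA + j];   /* index with LDA */
            C[i * LDA + j] = sum;
        }
}

int main(void)
{
    double *A = alloc_matrix(), *B = alloc_matrix(), *C = alloc_matrix();
    if (!A || !B || !C) return 1;
    for (int i = 0; i < N; i++) B[i * LDA + i] = 1.0;     /* B = identity */
    matmul_padded(A, B, C);
    free(A); free(B); free(C);
    return 0;
}

Changing LDA from 1600 to 1601 is the only difference between the two foreach runs shown above; the roughly 22 s vs 13 s gap is consistent with the boundary / cache-conflict suspicion, though the exact cause is left open here.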
Future Directions
• POSIX threads with network memory? (NFS)
• Combine MPI and POSIX threads? (a sketch of one hybrid layout follows this list)
  • MPI to multiple machines, then POSIX threads?
    http://cdac.in/index.aspx?id=ev_hpc_hegapa12_mode01_multicore_mpi_pthreads
  • POSIX threads that launch MPI?
• Couldn't get MPE running with MPICH (would like to re-investigate why).
• Investigate optimization techniques.
  • Did the compiler figure out how to reduce run times because of the simple matrix multiplies? <- NO
  • Rerun with a non-identity B matrix and compare times <- DONE
• Try different languages, e.g. Chapel.
• Try different algorithms.
• Want to add OpenMP to the mix.
  • Found this paper on OpenMP vs direct POSIX programming (similar tests):
    http://www-polsys.lip6.fr/~safey/Reports/pasco.pdf
• For < 6 processes, look at thread affinity and the assignment of threads to a physical processor.
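As a starting point for the hybrid MPI + POSIX threads idea, here is a minimal skeleton showing one way it might be wired up: one MPI rank per machine, POSIX threads inside each rank. This is an assumption about a possible layout, not anything from the project; the worker function and NUM_THREADS are placeholders.

/* Hypothetical hybrid skeleton: MPI across machines, pthreads within a rank.
 * MPI_THREAD_FUNNELED means only the main thread makes MPI calls. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 12                 /* matches the 12 cores used above */

struct worker_arg { int rank; int tid; };

static void *worker(void *p)
{
    struct worker_arg *a = p;
    /* each thread would compute its block of rows of C here */
    printf("rank %d thread %d running\n", a->rank, a->tid);
    return NULL;
}

int main(int argc, char *argv[])
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pthread_t threads[NUM_THREADS];
    struct worker_arg args[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++) {
        args[t].rank = rank;
        args[t].tid  = t;
        pthread_create(&threads[t], NULL, worker, &args[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    MPI_Finalize();
    return 0;
}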