MPI vs POSIX Threads: A Comparison
Overview
• MPI allows you to run multiple processes on 1 host
• How would running MPI on 1 host compare with a similar POSIX threads solution?
• Attempting to compare MPI vs POSIX run times
• Hardware
  • Dual 6-core (2 threads per core), 12 logical CPUs
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/AboutRage.txt
  • Intel Xeon CPU E5-2667 (show schematic)
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/xeon-e5-v2-datasheet-vol-1.pdf
  • 2.96 GHz
  • 15 MB L3 cache, shared (2.5 MB per core)
• All code / output / analysis available here:
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/
About the Time Trials
• Going to compare run times of code written in MPI vs code written using POSIX threads and shared memory
• Try to make the code as similar as possible so we're comparing apples with oranges and not apples with monkeys
• Since we are on 1 machine the bus is doing all the comm traffic; that should make the POSIX and MPI versions similar (i.e., network latency isn't the weak link)
  • So this analysis only makes sense on 1 machine
• Use the matrix-matrix multiply code we developed over the semester
  • Everyone is familiar with the code and can make observations
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/src/pthread_matrix_21.c
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/src/matmat_3.c
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/src/matmat_no_mp.c
• Use square matrices
  • Not necessary, but it made things more convenient
• Vary matrix sizes from 500 -> 10,000 elements square (plus a couple of bigger ones)
• Matrix A will be filled with 1-n, left to right and top down
• Matrix B will be the identity matrix
  • Can then check our results easily, since A*B = A when B is the identity matrix (see the sketch after this list)
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/mat_500_result.txt
• Ran every step (compile / output result / parsing) many times by hand and checked results before writing the final scripts to do the processing
• Set up test bed
  • Try each step individually, check results, then automate
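A minimal sketch of the fill-and-check idea above (illustrative only, not the actual project source; the size n and array names are placeholders):

/* A is filled 1..n*n row by row, B is the identity, so C = A*B must equal A. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 500;                           /* matrix dimension (placeholder) */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = calloc((size_t)n * n, sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);

    for (int i = 0; i < n * n; i++)        /* A = 1..n*n, left to right, top down */
        A[i] = i + 1;
    for (int i = 0; i < n; i++)            /* B = identity */
        B[i * n + i] = 1.0;

    for (int i = 0; i < n; i++)            /* plain triple-loop multiply */
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];

    int ok = 1;                            /* A*B should reproduce A exactly */
    for (int i = 0; i < n * n; i++)
        if (C[i] != A[i]) { ok = 0; break; }
    printf("check %s\n", ok ? "passed" : "FAILED");

    free(A); free(B); free(C);
    return 0;
}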
Specifics cont.
• About the runs
  • For each matrix size (500 -> 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000)
  • Vary thread count 2-12 (POSIX)
  • Vary process count 2-12 (MPI)
  • Run 10 trials of each and take the average (machine mostly idle when not running tests, but want to smooth spikes in run times caused by the system doing routine tasks)
  • With later runs I ran 12 trials, dropped the high and low, then took the average (see the sketch after this list)
  • Try to make observations about anomalies in the run times where appropriate
• Caveats
  • All initial runs with no optimization, for testing; but hey, this is a class about performance
  • Second set of runs with optimization turned on, -O1 (note: -O2 and -O3 made no appreciable difference)
  • First level of optimization made a huge difference: > 3x improvement
  • GNU optimization explanation can be found here: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
  • Built with the individual -O1 flags to see if I could catch the "one" making the most difference (nope; the code isn't that complicated)
    • Not all optimizations are flag controlled
  • Regardless of whether the code is written in the most efficient fashion (and it's not), because of the similarity between versions we can make some runs and observations
• "Oh no" moment
  • Huge improvement in performance with optimized code; why?
  • I now believe the main performance improvement came from loop unrolling.
  • Maybe the compiler found a clever way to increase the speed because of the simple math, and it's not really doing all the calculations I thought it was?
  • Came back and made matrix B non-identity: same performance. Whew.
• OK, ready to make the runs
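A minimal sketch of the drop-high/low averaging used for the later runs (an illustrative helper with dummy trial values, not measured results):

#include <stdio.h>

/* Average n trial times after discarding one highest and one lowest value. */
double trimmed_mean(const double *t, int n)
{
    double lo = t[0], hi = t[0], sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += t[i];
        if (t[i] < lo) lo = t[i];
        if (t[i] > hi) hi = t[i];
    }
    return (sum - lo - hi) / (n - 2);
}

int main(void)
{
    /* Dummy run times for 12 trials (placeholders, not real data). */
    double trials[12] = { 1.02, 1.01, 1.03, 1.00, 1.02, 1.01,
                          1.04, 0.99, 1.02, 1.01, 1.02, 1.20 };
    printf("trimmed mean: %f secs\n", trimmed_mean(trials, 12));
    return 0;
}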
Discussion
• Please chime in as questions come up.
• Process explanation (after initial testing and verification):
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/process_explanation.txt
  • top -d .1 (press 1 to show the per-CPU list, H to show threads)
• Attempted a 25,000 x 25,000 matrix
  • Compile error for the MPI version (exceeded the MPI_Bcast 2 GB limit on the matrices; see the sketch after this list)
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/BadCompileMPI.txt
  • Not an issue for POSIX threads (until you run out of memory on the machine and hit swap)
• Settled on 12 processes / threads because of the number of cores available
  • Do you get enhanced or degraded performance by exceeding that number?
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_MANY_THREADS.txt
• Example of process space / top output (10,000 x 10,000)
  • Early testing, before runs started; pre-optimization
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/RageTestRun_Debug_CPU_Usage.txt
  • Use top -d t (t in floating-point secs; Linux); hit the "1" key to see the list of cores
• Take a look at some numbers
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_optmized-400-3000_ave.xlsx
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_optimized-4000-10000_ave.xlsx
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/MPI_optmized-400-3000_ave.xlsx
  • http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/MPI_optimized-4000-8000_ave.xlsx
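A minimal sketch of why the 25,000 x 25,000 case runs into the 2 GB MPI_Bcast limit, and one common workaround of broadcasting in row blocks (assumptions: dynamically allocated double matrices and a 1000-row block size; neither is taken from the project code):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    long n = 25000;
    /* 25000 * 25000 doubles is ~5 GB, so a single MPI_Bcast of the whole
     * matrix exceeds the 2 GB limit mentioned above. */
    double *A = malloc(n * n * sizeof *A);
    if (A == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* Filling A on the root rank is omitted in this sketch.
     * Broadcast in blocks of 1000 rows (25000 divides evenly by 1000);
     * each call moves 1000 * 25000 doubles = ~200 MB, well under 2 GB. */
    for (long row = 0; row < n; row += 1000)
        MPI_Bcast(A + row * n, (int)(1000 * n), MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(A);
    MPI_Finalize();
    return 0;
}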
Time Comparison
• In all these cases the times for 5, 4, 3, and 2 processes are much longer than for 6, so they are left off for comparison
• POSIX doesn't "catch" back up until 9 processes
• MPI doesn't "catch" back up until 11 processes
POSIX Threads vs MPI Processes Run Times
Matrix Sizes 4000x4000 – 10,000x10,000
MPI 1500 x 1500 – 1800 x 1800
• Notice MPI didn't exhibit the same problem at size 1600 as the POSIX and no-MP (straight C) cases.
POSIX & No-MP 1600 x 1600 case
• Straight C runs long enough to see top output (here I can see the memory usage)
• Threaded, MPI, and non-MP code share the same basic structure for calculating the "C" matrix
• Suspect some kind of boundary issue here, possibly "false sharing"?
• Process fits entirely in shared L3 cache: 15 MB x 2 = 30 MB
• Do the same number of calculations but make the initial array allocations larger (shown below; a sketch of the padding experiment follows this slide)

[rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS (1 2 3 4 5)
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.979548 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.980786 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.971891 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.974897 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 22.012967 secs
[rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS ( 1 2 3 4 5 )
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.890815 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.903997 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.881991 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.884655 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.887197 secs
[rahnbj@rage ~/SUNY]$
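A minimal sketch of the padding experiment above: the same 1600 x 1600 amount of work, but with the arrays allocated using a leading dimension of either 1600 or 1601 (illustrative code, not the project source; whether the speedup comes from false sharing or some other cache boundary effect is the open question from this slide):

#include <stdlib.h>

/* n is the problem size actually computed; alloc is the leading dimension
 * of the allocated arrays. Padding alloc from 1600 to 1601 changes the row
 * stride from 12,800 to 12,808 bytes, which changes how rows and columns
 * map onto cache lines, without changing the number of calculations. */
void multiply(int n, int alloc)
{
    double *A = calloc((size_t)alloc * alloc, sizeof *A);
    double *B = calloc((size_t)alloc * alloc, sizeof *B);
    double *C = calloc((size_t)alloc * alloc, sizeof *C);

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * alloc + j] += A[i * alloc + k] * B[k * alloc + j];

    free(A); free(B); free(C);
}

int main(void)
{
    multiply(1600, 1600);   /* tight allocation: the ~22 s case above  */
    multiply(1600, 1601);   /* padded allocation: the ~13 s case above */
    return 0;
}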
Notes / Future Directions
• Start the MPI timer after communication. Is communication the sole source of the difference? <- TESTED: NO
• At the boundary conditions the driving force is the amount of memory allocated on the heap
  • Not the number of calculations being performed
• Intel had a nice article about false sharing:
  • https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
  • Links to a product they sell for detecting false sharing on their processors
• Combine MPI and POSIX threads?
  • MPI to multiple machines, then POSIX threads?
  • http://cdac.in/index.aspx?id=ev_hpc_hegapa12_mode01_multicore_mpi_pthreads
• Found this paper on OpenMP vs direct POSIX programming (similar tests)
  • http://www-polsys.lip6.fr/~safey/Reports/pasco.pdf
• Couldn't get MPE running with MPICH (would like to re-investigate why)
• Investigate optimization techniques
  • Did the compiler figure out how to reduce run times because of the simple matrix multiplies? <- NO
  • Rerun with a non-identity B matrix and compare times <- DONE
• Try different languages, e.g. Chapel
• Try different algorithms
• For < 6 processes look at thread affinity and the assignment of threads to a physical processor (see the sketch after this list)
  • There is no guarantee that with 6 or fewer processes they will all reside on the same physical processor
  • Noticed CPU switching occasionally
  • Setting the affinity can mitigate this: the thread is assigned and not "allowed" to move
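A minimal sketch of pinning a POSIX thread to one logical CPU with pthread_setaffinity_np (Linux/glibc specific; compile with -pthread; the CPU number and worker function are placeholders, not taken from the project code):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    /* ... the matrix work for this thread would go here ... */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    cpu_set_t set;

    pthread_create(&tid, NULL, worker, NULL);

    CPU_ZERO(&set);
    CPU_SET(3, &set);                        /* pin this thread to logical CPU 3 */
    if (pthread_setaffinity_np(tid, sizeof set, &set) != 0)
        fprintf(stderr, "could not set affinity\n");

    pthread_join(tid, NULL);
    return 0;
}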
Notes / Future Directions cont.
• Notice the shape of the curves for both the MPI and POSIX solutions. There is definitely a point of diminishing returns: 6, in this particular case?
• Instead of using 12 cores, could we cut the problem set in half and launch 2 independent 6-process solutions by declaring thread affinity?
  • Would this produce better results?
  • How would we merge the 2 process spaces?