Parallel Programming with PVM
Chris Harper
The Choice of PVM
• Existing implementation of the most obvious way to set up a parallel cluster
  • Every node runs a daemon, and any daemon can execute a program on the cluster by messaging the other daemons (see the short sketch below)
• Reasonably well supported and widespread
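To give a sense of that daemon model, a few lines of C are enough for a task to enroll with its local pvmd and ask about the whole virtual machine. This is a generic illustration, not part of the project code:

/* minimal "hello, virtual machine" sketch - not from the project code */
#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    int tid = pvm_mytid();                /* enroll this process with the local pvmd daemon */
    int nhost, narch;
    struct pvmhostinfo *hosts;

    pvm_config(&nhost, &narch, &hosts);   /* ask the daemon about the whole virtual machine */
    printf("task t%x sees %d host(s):\n", tid, nhost);
    for (int i = 0; i < nhost; i++)
        printf("  %s (%s)\n", hosts[i].hi_name, hosts[i].hi_arch);

    pvm_exit();                           /* leave the virtual machine */
    return 0;
}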
Setup Problems
• PVM is difficult to set up
  • Unix novices
• Remote shell (RSH) through PVM was blocked for unknown reasons
• Group schedule and lab access limitations
Solutions
• Use SSH instead of RSH
  • Less convenient, and the installation is harder to roll out
  • But it worked and is more secure (not that security mattered)
• Install Fedora and PVM on a home computer using VMware to increase access to the platform
  • More time to write code and troubleshoot problems
as root (su -):

install pvm:
> yum install pvm.i386

set env vars:
in /etc/profile, append:
  PVM_ROOT=/usr/share/pvm3
  PVM_ARCH=LINUX
  PVM_RSH=/usr/bin/ssh
  export PVM_ROOT PVM_ARCH PVM_RSH
in /root/.bashrc, append:
  PVM_ROOT=/usr/share/pvm3
log out and back in for settings to take effect

test env vars:
> echo $PVM_ROOT

test start pvm:
> pvm

create a public key for ssh (on master machine):
> ssh-keygen -t rsa     (use default settings)

copy the key to slaves:
> scp /root/.ssh/id_rsa.pub root@machine_name:/root/.ssh/authorized_keys
(or if there are multiple masters, scp to a temp file and then cat onto authorized_keys)

might need to restart
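Once SSH and the environment variables are in place, the slave hosts can be added from the PVM console on the master. This is a generic example session; the hostnames are just the lab machines that appear in the later test runs:

> pvm
pvm> add virgo pisces-r leo     (starts a pvmd on each slave over $PVM_RSH)
pvm> conf                       (lists the hosts now in the virtual machine)
pvm> quit                       (leaves the console; the daemons keep running)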
The Program: Parallel Merge/Quicksort Hybrid
• Why use a mergesort to test parallel execution?
  • Simple
  • The size of the data partitions can be controlled (though usually just divided into equal parts)
  • Any kind of sort can be used on the data partitions that the slaves receive (a sketch of such a slave follows this slide)
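A minimal sketch of what such a slave could look like, assuming the master packs the partition length first and then the data as ints. The message tags and program structure are illustrative, not taken from the original code:

/* slave sketch - hypothetical, not the original program */
#include <stdlib.h>
#include <pvm3.h>

#define MSG_DATA   1   /* assumed message tags */
#define MSG_RESULT 2

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int parent = pvm_parent();             /* tid of the master that spawned us */
    int n;

    pvm_recv(parent, MSG_DATA);            /* wait for our partition */
    pvm_upkint(&n, 1, 1);                  /* partition length first ... */
    int *part = malloc(n * sizeof(int));
    pvm_upkint(part, n, 1);                /* ... then the data itself */

    qsort(part, n, sizeof(int), cmp_int);  /* any local sort works here */

    pvm_initsend(PvmDataDefault);          /* send the sorted partition back */
    pvm_pkint(&n, 1, 1);
    pvm_pkint(part, n, 1);
    pvm_send(parent, MSG_RESULT);

    free(part);
    pvm_exit();                            /* leave the virtual machine */
    return 0;
}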
Algorithm
• Random list partitioned into N parts, where N is the greatest power of two less than the number of available (or selected) nodes
• List parts sent to N slaves in sequence
• Each slave does a quicksort on the received part and returns it to the master
• The master takes each pair of list parts and merges them together until the final sorted list is achieved (a sketch of the master side follows)
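A minimal sketch of how the master side of this first attempt could be written with the PVM calls, paired with the slave sketch above. The executable name sortslave, the message tags, and the assumption that the item count divides evenly into the partitions are illustrative, not taken from the original master1:

/* master sketch - pairs with the slave sketch above; names and tags are assumed */
#include <stdio.h>
#include <stdlib.h>
#include <pvm3.h>

#define MSG_DATA   1
#define MSG_RESULT 2

static void merge(int *dst, const int *a, int na, const int *b, int nb)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) dst[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) dst[k++] = a[i++];
    while (j < nb) dst[k++] = b[j++];
}

static void parallel_sort(int *data, int n, int nparts)  /* assumes nparts divides n */
{
    int *tids = malloc(nparts * sizeof(int));
    int part = n / nparts;

    /* spawn one slave per partition anywhere in the virtual machine */
    pvm_spawn("sortslave", NULL, PvmTaskDefault, "", nparts, tids);

    /* send each slave its slice of the random list */
    for (int i = 0; i < nparts; i++) {
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&part, 1, 1);
        pvm_pkint(data + i * part, part, 1);
        pvm_send(tids[i], MSG_DATA);
    }

    /* collect the sorted slices back into place */
    for (int i = 0; i < nparts; i++) {
        int len;
        pvm_recv(tids[i], MSG_RESULT);
        pvm_upkint(&len, 1, 1);
        pvm_upkint(data + i * part, len, 1);
    }

    /* pairwise merges, still serial on the master: 4 parts -> 2 -> 1 */
    int *tmp = malloc(n * sizeof(int));
    for (int width = part; width < n; width *= 2)
        for (int pos = 0; pos + width < n; pos += 2 * width) {
            int right = (pos + 2 * width <= n) ? width : n - pos - width;
            merge(tmp, data + pos, width, data + pos + width, right);
            for (int k = 0; k < width + right; k++) data[pos + k] = tmp[k];
        }
    free(tmp);
    free(tids);
}

int main(void)
{
    int n = 1000000, nparts = 4;             /* the real program reads these from argv */
    int *data = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) data[i] = rand();
    parallel_sort(data, n, nparts);
    printf("first: %d  last: %d\n", data[0], data[n - 1]);
    free(data);
    pvm_exit();
    return 0;
}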
Problems with the Algorithm
• The final merge is still executed serially
  • As the parallel part of the algorithm takes less time, the serial part takes up more of the total
  • As such, this algorithm does not scale up well
• Requires a power-of-two number of nodes
• Ignores multiple processors on a node
  • Unless running on a multiprocessor machine with no other machines in the PVM
[root@gemini LINUX]# ./master1 10000000 4
PARALLEL SORT
Usage: executable [# items [# tasks]]
  # items - length of array to sort or -1 for default
  # tasks - force this many tasks, -1 for default, or 0 for serial
Alloc 10000000 items ... OK
Memory Use: 19.07 MB
Randomizing ... OK
1 nodes available
4 tasks selected
Using 4 parallel parts
part: 0, pos: 0, len: 2500000, left: 7500000
part: 1, pos: 2500000, len: 2500000, left: 5000000
part: 2, pos: 5000000, len: 2500000, left: 2500000
part: 3, pos: 7500000, len: 2500000, left: 0
Spawning 4 worker tasks ... OK
Sending data to slave tasks ... OK
Task 262152 (gemini), Part 0 returned. Took 1.668 secs.
Task 262153 (gemini), Part 1 returned. Took 1.667 secs.
Task 262154 (gemini), Part 2 returned. Took 1.709 secs.
Task 262155 (gemini), Part 3 returned. Took 1.688 secs.
Elapsed Spawn Time: 0.003 secs
Elapsed Tx Overhead Time: 0.328 secs
Elapsed Rx Overhead Time: 0.258 secs
Elapsed Total Comm Overhead Time: 0.585 secs
Elapsed Parallel Time: 2.238 secs
Elapsed Program Time: 2.241 secs
Second Attempt: Recursion in Parallel
• The master makes a list of all the available tasks and their associated nodes
• Every slave divides its list of available tasks (nodes) in half, keeping the first half and giving the second to the first node in the second half
• Similarly, every slave divides its received unsorted list in half, keeping the first half and giving the second half to that same node
  • This repeats until the node pool is depleted
• After each task finishes sorting its half of the list, it waits to receive the second half (if it split its half previously) and merges it with the first half
• The merged and sorted list is returned to the parent task (a sketch of such a slave follows this slide)
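A rough sketch of how such a recursive slave might be written. The message tags, packing order, fixed-size host list, and the executable name sortslave2 are assumptions for illustration, and it splits the data exactly in half, without the "less than half" tuning described on the next slide:

/* recursive-splitting slave sketch - tags, packing order, array limits and
   the executable name "sortslave2" are illustrative assumptions only */
#include <stdlib.h>
#include <pvm3.h>

#define MSG_WORK   1   /* host list + unsorted half sent downward */
#define MSG_RESULT 2   /* sorted half sent back upward */
#define MAXHOSTS   64
#define NAMELEN    128

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    char hosts[MAXHOSTS][NAMELEN];
    int peers[MAXHOSTS], npeers = 0;
    int parent = pvm_parent(), nhosts, n;

    pvm_recv(parent, MSG_WORK);
    pvm_upkint(&nhosts, 1, 1);                 /* hosts handed down to this task */
    for (int i = 0; i < nhosts; i++)
        pvm_upkstr(hosts[i]);
    pvm_upkint(&n, 1, 1);                      /* this task's share of the list */
    int *data = malloc(n * sizeof(int));
    pvm_upkint(data, n, 1);

    /* keep splitting the host list and the data until only this host is left */
    while (nhosts > 1) {
        int split = nhosts / 2, keep = n / 2, give = n - keep, peer;
        pvm_spawn("sortslave2", NULL, PvmTaskHost, hosts[split], 1, &peer);
        pvm_initsend(PvmDataDefault);
        int nrem = nhosts - split;
        pvm_pkint(&nrem, 1, 1);                /* second half of the host list */
        for (int i = split; i < nhosts; i++)
            pvm_pkstr(hosts[i]);
        pvm_pkint(&give, 1, 1);                /* second half of the data */
        pvm_pkint(data + keep, give, 1);
        pvm_send(peer, MSG_WORK);
        peers[npeers++] = peer;                /* remember who owes us a result */
        nhosts = split;
        n = keep;
    }

    qsort(data, n, sizeof(int), cmp_int);      /* sort the part we kept */

    /* wait for each peer's sorted half (most recent first) and merge it in */
    while (npeers > 0) {
        int peer = peers[--npeers], rlen;
        pvm_recv(peer, MSG_RESULT);
        pvm_upkint(&rlen, 1, 1);
        int *right  = malloc(rlen * sizeof(int));
        int *merged = malloc((n + rlen) * sizeof(int));
        pvm_upkint(right, rlen, 1);
        int i = 0, j = 0, k = 0;
        while (i < n && j < rlen) merged[k++] = (data[i] <= right[j]) ? data[i++] : right[j++];
        while (i < n)    merged[k++] = data[i++];
        while (j < rlen) merged[k++] = right[j++];
        free(data); free(right);
        data = merged; n += rlen;
    }

    pvm_initsend(PvmDataDefault);              /* hand the sorted block back up */
    pvm_pkint(&n, 1, 1);
    pvm_pkint(data, n, 1);
    pvm_send(parent, MSG_RESULT);
    free(data);
    pvm_exit();
    return 0;
}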
More Problems
• Latency while one task waits for the second half of the list to return
  • Fixed by giving the other task less than half of the list
  • Unfortunately, this reduces parallelism
• The first recursive task has to duplicate the list in memory; the recursion should start within the spawning (master) task instead
• Still works best with a power-of-two number of nodes
[root@sagitarius LINUX]# ./master2 10000000 4
PARALLEL SORT
Usage: executable [# items [# tasks]]
  # items - length of array to sort or -1 for default
  # tasks - force this many tasks, -1 for default, or 0 for serial
Alloc 10000000 items ... OK
Memory Use: 19.07 MB
Randomizing ... OK
4 nodes available
4 tasks selected
Task 0 Host: sagitarius
Task 1 Host: virgo
Task 2 Host: pisces-r
Task 3 Host: leo
Spawning root slave task ... OK
Sending data to slave ... OK
[sagitarius (0)]: Spawning Task: 2 (pisces-r)
[sagitarius (0)]: Spawning Task: 1 (virgo)
[pisces-r (2)]: Spawning Task: 3 (leo)
[leo (3)]: Done. Returning. Times - Wait: 0.000 Comm: 0.010 Calc: 0.522 Slave: 0.784
[virgo (1)]: Done. Returning. Times - Wait: 0.000 Comm: 0.034 Calc: 1.882 Slave: 3.172
[pisces-r (2)]: Done. Returning. Times - Wait: 0.008 Comm: 0.054 Calc: 1.936 Slave: 3.320
[sagitarius (0)]: Done. Returning. Times - Wait: 0.064 Comm: 0.277 Calc: 6.775 Slave: 7.905
Elapsed Comm Overhead Time: 0.498 secs
Elapsed Wait Overhead Time: 0.064 secs
Elapsed Calculation Time: 8.657 secs
Elapsed Program Time: 8.713 secs
[root@localhost LINUX]# ./master2 25 2
PARALLEL SORT
Usage: executable [# items [# tasks]]
  # items - length of array to sort or -1 for default
  # tasks - force this many tasks, -1 for default, or 0 for serial
Alloc 25 items ... OK
Memory Use: 0.00 MB
Randomizing ... OK
13602 20334 6641 11971 9234 4357 23162 31256 26410 19358 1770 9206 7125 32040 14798 1441 19603 11567 30520 21757 28494 18730 11100 16631 4147
1 nodes available
2 tasks selected
Task 0 Host: localhost
Task 1 Host: localhost
Spawning root slave task ... OK
Sending data to slave ... OK
[localhost (0)]: Spawning Task: 1 (localhost)
[localhost (1)]: Done. Returning. Times - Wait: 0.000 Comm: 0.000 Calc: 0.000 Slave: 0.007
[localhost (0)]: Done. Returning. Times - Wait: 0.008 Comm: 0.000 Calc: 0.000 Slave: 0.025
1441 1770 4147 4357 6641 7125 9206 9234 11100 11567 11971 13602 14798 16631 18730 19358 19603 20334 21757 23162 26410 28494 30520 31256 32040
Elapsed Comm Overhead Time: 0.000 secs
Elapsed Wait Overhead Time: 0.008 secs
Elapsed Calculation Time: 0.000 secs
Elapsed Program Time: 0.050 secs
Future Optimizations
• Implement a method to detect how many processors a single machine has and treat each processor as a viable node (one possible approach is sketched below)
• Use new threads to wait for data returned from spawned tasks
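One generic way to do the processor detection on Linux (not taken from the project) is to ask sysconf for the number of online processors:

/* count online processors on a Linux node - a generic sketch, not project code */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);   /* processors currently online */
    if (ncpu < 1)
        ncpu = 1;                                /* fall back to one if the query fails */
    printf("%ld processors available on this node\n", ncpu);
    return 0;
}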
Conclusion
• Sorting data is not a good use of parallel hardware
• PVM is better suited for:
  • Algorithms that run with almost completely independent parts
  • Algorithms that require a lot of computation for not much data