PSWEEP: A Lightweight Pattern for Distributed Computational Experiments
Christopher Mueller and Andrew Lumsdaine
Open Systems Lab, Indiana University
Introduction
• Parameter sweeps are common cluster applications
• Approaches
  • Scripts (sh, perl: ssh, mpi)
  • Low-level applications (C++, Fortran: MPI)
  • Parameter sweep applications (e.g., Nimrod)
• Problems
  • Custom solutions become tangled quickly
  • Applications are not available on all platforms
How do we use our clusters?

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
882576.aviss.av silin    iq       SL_DBJ014Q  14636   2   4     -- 200:0 R 109:4
890917.aviss.av baikgrp  bg       DA_NPJ001V  27673   1   2     -- 168:0 R 83:32
890932.aviss.av baikgrp  bg       DA_NPJ002V  18006   1   2     -- 168:0 R 87:31
959929.aviss.av rllord   iq       RL1_NCQ02V  11982   1   2     -- 120:0 R 56:27
960044.aviss.av shawnli  bg       Hairy2b     13703   1   1     -- 100:0 R 42:52
960045.aviss.av shawnli  bg       Xxbp1       21294   1   1     -- 100:0 R 42:51
960046.aviss.av shawnli  bg       Foxa1       15908   1   1     -- 100:0 R 42:49
960047.aviss.av shawnli  bg       Foxa2       19881   1   1     -- 100:0 R 42:49
960048.aviss.av shawnli  bg       Foxd3       19073   1   1     -- 100:0 R 42:49
960050.aviss.av shawnli  bg       Gsc         20886   1   1     -- 100:0 R 42:04
960215.aviss.av shawnli  bg       Foxa1mamma  18296   1   1     -- 100:0 R 35:23
960216.aviss.av shawnli  bg       Foxa2mamma  14926   1   1     -- 100:0 R 34:43
960217.aviss.av shawnli  bg       Foxd3mamma  15016   1   1     -- 100:0 R 34:43
960218.aviss.av shawnli  bg       Gata4mamma   7421   1   1     -- 100:0 R 33:11
960220.aviss.av shawnli  bg       Glimammal    7525   1   1     -- 100:0 R 33:11
960221.aviss.av shawnli  bg       Gscmammal   16626   1   1     -- 100:0 R 33:03
960222.aviss.av shawnli  bg       Hairy2bmam  16760   1   1     -- 100:0 R 33:03
960224.aviss.av shawnli  bg       Hoxd1mamma  32101   1   1     -- 100:0 R 33:01
960225.aviss.av shawnli  bg       Mixermamma  27958   1   1     -- 100:0 R 32:09
960279.aviss.av dkberry  mdgrape  run13_07m    5570   1   1     -- 36:00 R 17:04
960283.aviss.av dbaronia iq       batch.sh    23862   3   6     -- 24:00 R 22:41
960426.aviss.av cwillenb bg       CWOA_005    18980   1   1     -- 100:0 R 04:52
960428.aviss.av cwillenb bg       CWOA_006a    1941   1   1     -- 100:0 R 04:52
960429.aviss.av cwillenb bg       CWOA_007       --   1   1     -- 100:0 Q    --
960430.aviss.av cwillenb bg       CWOA_008       --   1   1     -- 100:0 Q    --
960431.aviss.av cwillenb bg       CWOA_009       --   1   1     -- 100:0 Q    --
960432.aviss.av cwillenb bg       CWOA_010       --   1   1     -- 100:0 Q    --
960433.aviss.av cwillenb bg       CWOA_011       --   1   1     -- 100:0 Q    --
960434.aviss.av cwillenb bg       CWOA_012       --   1   1     -- 100:0 Q    --
963115.aviss.av xsong    bg       par.241        --   8  16     -- 24:00 Q    --
963116.aviss.av xsong    bg       par.242        --   8  16     -- 24:00 Q    --
963121.aviss.av xsong    bg       par.53.7       --   8  16     -- 02:00 Q    --
963122.aviss.av xsong    bg       par.53.8       --  16  32     -- 02:00 Q    --
963133.aviss.av honfan   iq       HF_MJ370    23299   3   6     -- 120:0 R 07:13
963167.aviss.av whpitcoc iq       WP_C572_L0  30829   1   2     -- 24:00 R 01:11
963171.aviss.av whpitcoc iq       WP_C572_L0  17995   1   2     -- 24:00 R 01:11
963186.aviss.av whpitcoc iq       WP_C572_TS   5235   1   2     -- 24:00 R 00:08
963187.aviss.av whpitcoc iq       WP_C572_TS  25746   1   2     -- 24:00 R 00:09
963188.aviss.av whpitcoc iq       WP_C572_TS  13846   1   2     -- 24:00 R 00:09
963189.aviss.av whpitcoc iq       WP_C572_TS  26613   1   2     -- 24:00 R 00:08
Anatomy of a Parameter Sweep: Parameters and Enumeration Order

    for i in range(rank, n, size):  # *
        if process: load_image(i)
        elif stats: query_image(i)
        for j in [1, 2, 4, 8]:
            if process: time(i, j)
            for k in ['motion', 'gaussian']:
                if process: process_image(i, j, k)
                elif stats: image_stats(i, j, k)
                else: print 'ssh n%d run %d %d' % (i, j, k)
                if process: clear_process(k)
                elif bgi: clear_temp(k)
        if process: unload_image(i)

* Resource distribution is handled by the execution environment, e.g. mpirun
Anatomy of a Parameter Sweep: Tasks and Experiments
In the same loop nest, each guarded call (load_image, time, process_image, and so on) is a task, and each guard (process, stats) selects an experiment.
Anatomy of a Parameter Sweep: Artifacts and Errors
The same loop nest also shows how artifacts and errors accumulate, e.g. the else branch left over from generating ssh scripts and the bgi guard that no other branch uses.
User's View

Experiments
  process: load_image(), unload_image(), time(), process_image(), clear_process()
  stats: query_image(), image_stats()
  ?
Resources
Parameters
  [0, n]
  [.01, .1, 1.0]
  [10, 12, 14]
script gen
  print … [i, j, k]
  0, 0.01, 10
  0, 0.01, 12
  0, 0.01, 14
  0, 0.1, 10
  0, 0.1, 12
  …
Abstracting the Loops
• Parameter. A Parameter is an iterator or container that supplies the values for a variable in the experiment.
• Enumerator. The Enumerator takes an ordered list of parameters and enumerates every combination of their values in lexicographic order.
• State. The State contains the current value of each parameter, in order.

    i = ['house.jpg', 'lena.jpg']
    j = [1, 2, 4, 8]
    k = ['motion', 'gaussian']
    params = [i, j, k]
    e = enumerator(params)
    for state in e:
        process_image(state)
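The enumerator described above can be sketched in a few lines of standard Python. This is only an illustration, not the PSWEEP implementation: it assumes every parameter is a finite container and borrows itertools.product for the lexicographic ordering. The variable names images, threads and filters are hypothetical stand-ins for i, j and k.

```python
from itertools import product

def enumerator(params):
    """Yield every state in lexicographic order (the last parameter varies fastest)."""
    for values in product(*params):
        yield list(values)

# Hypothetical parameter names; the values are the ones used on the slide.
images = ['house.jpg', 'lena.jpg']
threads = [1, 2, 4, 8]
filters = ['motion', 'gaussian']

states = list(enumerator([images, threads, filters]))
print(len(states))   # 16
print(states[0])     # ['house.jpg', 1, 'motion']
print(states[1])     # ['house.jpg', 1, 'gaussian']
```

Each state lists the parameter values in the order the parameters were given, so a three-deep loop nest collapses to a single loop over states.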
Abstracting the Experiments
• Task. A Task is any unit of work performed when a parameter value changes. A Task is subdivided into setup and cleanup operations, corresponding to the work done at the beginning and end of a block of code in a loop, respectively.
• Experiment. An Experiment is a collection of tasks.

    def PrepareImage(state, img):
        # Setup
        db_load(img, './current.jpg')
        yield  # suspend the function
        # Cleanup
        delete('./current.jpg')

    def ProcessImage(state, alg):
        data = load('./current.jpg')
        img = process(data, alg(value))
        save(img, str(state) + '.jpg')
        return  # no cleanup
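One way to drive tasks written in this style is to treat a generator as a setup/cleanup pair split at its yield. The sketch below is an assumption about how such a driver could look, not the PSWEEP code: start_task, finish_task and the log list are hypothetical names, and the tasks only record what they would do.

```python
import inspect

log = []  # records the order in which the task halves run

def prepare_image(state, img):
    log.append('setup ' + img)     # setup half: runs when the value is entered
    yield                          # suspend until the enclosing block ends
    log.append('cleanup ' + img)   # cleanup half: runs when the value is left

def process_image(state, alg):
    log.append('process ' + alg)   # plain function: setup only, no cleanup

def start_task(task, state, value):
    """Run the setup half of a task; return a handle for its cleanup, or None."""
    result = task(state, value)
    if inspect.isgenerator(result):
        next(result, None)         # execute up to the yield
        return result
    return None

def finish_task(handle):
    """Run the cleanup half of a task, if it has one."""
    if handle is not None:
        next(handle, None)         # resume after the yield; ignore exhaustion

handle = start_task(prepare_image, ['house.jpg'], 'house.jpg')
start_task(process_image, ['house.jpg', 'motion'], 'motion')
finish_task(handle)
print(log)  # ['setup house.jpg', 'process motion', 'cleanup house.jpg']
```

Keeping both halves in one function lets setup and cleanup share local variables, which is the main attraction of the generator form.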
Binding Experiments to State
Bound Task Semantics. Tasks must execute in the same order they would if the parameter sweep were expanded to nested loops.

    for img in images:
        PrepareImage.setup(img)
        for alg in algs:
            ProcessImage.setup(alg)
        PrepareImage.cleanup(img)

    e = enumerator([images, algs])
    e.bind(images, PrepareImage)
    e.bind(algs, ProcessImage)
    for state in e:
        pass

These examples are equivalent.
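The ordering rule can be checked with a small model. The sketch below is not the authors' enumerator: Enumerator, TraceTask and the trace lists are hypothetical, parameters are assumed to be lists, and tasks are assumed to be objects with setup(value) and cleanup(value) methods. When the state changes, it cleans up from the innermost parameter out to the outermost one that changed, then sets up from that parameter inwards.

```python
from itertools import product

class TraceTask:
    """Hypothetical task that records its setup and cleanup calls."""
    def __init__(self, name, trace):
        self.name, self.trace = name, trace
    def setup(self, value):
        self.trace.append((self.name, 'setup', value))
    def cleanup(self, value):
        self.trace.append((self.name, 'cleanup', value))

class Enumerator:
    def __init__(self, params):
        self.params = params
        self.tasks = [None] * len(params)   # at most one task per parameter

    def bind(self, param, task):
        # Locate the parameter by identity, mirroring e.bind(images, PrepareImage).
        index = next(i for i, p in enumerate(self.params) if p is param)
        self.tasks[index] = task

    def _run(self, method, state, positions):
        for d in positions:
            if self.tasks[d] is not None:
                getattr(self.tasks[d], method)(state[d])

    def __iter__(self):
        depth = len(self.params)
        prev_idx = prev = None
        for idx in product(*(range(len(p)) for p in self.params)):
            state = [p[i] for p, i in zip(self.params, idx)]
            # Outermost position that changed since the previous state.
            first = 0 if prev is None else next(
                d for d in range(depth) if idx[d] != prev_idx[d])
            if prev is not None:
                self._run('cleanup', prev, range(depth - 1, first - 1, -1))
            self._run('setup', state, range(first, depth))
            yield state
            prev_idx, prev = idx, state
        if prev is not None:
            self._run('cleanup', prev, range(depth - 1, -1, -1))

images, algs = ['house.jpg', 'lena.jpg'], ['motion', 'gaussian']

# Reference: the sweep written out as nested loops.
loops = []
prep, proc = TraceTask('prepare', loops), TraceTask('process', loops)
for img in images:
    prep.setup(img)
    for alg in algs:
        proc.setup(alg)
        proc.cleanup(alg)
    prep.cleanup(img)

# The same sweep driven by the enumerator.
bound = []
e = Enumerator([images, algs])
e.bind(images, TraceTask('prepare', bound))
e.bind(algs, TraceTask('process', bound))
for state in e:
    pass

print(loops == bound)  # True
```

Cleanup for a state runs when the consumer asks for the next one, so a loop that exits early would skip it; a fuller implementation would need to handle that case.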
Distributing the Workload
DistributedEnumerator. A DistributedEnumerator is an Enumerator that distributes the states to multiple instances across multiple computing resources.

    e = RoundRobin(params)
    for state in e:
        pass

States:
    p1: [house.jpg, 1, motion]    p2: [house.jpg, 1, gaussian]
        [house.jpg, 2, motion]        [house.jpg, 2, gaussian]
        [house.jpg, 4, motion]        [house.jpg, 4, gaussian]
        [lena.jpg, 1, motion]         [lena.jpg, 1, gaussian]
        [lena.jpg, 2, motion]         [lena.jpg, 2, gaussian]
        [lena.jpg, 4, motion]         [lena.jpg, 4, gaussian]

    e = Domain(params, images)
    for state in e:
        pass

States:
    p1: [house.jpg, 1, motion]    p2: [lena.jpg, 1, motion]
        [house.jpg, 1, gaussian]      [lena.jpg, 1, gaussian]
        [house.jpg, 2, motion]        [lena.jpg, 2, motion]
        [house.jpg, 2, gaussian]      [lena.jpg, 2, gaussian]
        [house.jpg, 4, motion]        [lena.jpg, 4, motion]
        [house.jpg, 4, gaussian]      [lena.jpg, 4, gaussian]

    e = MasterWorker(params)
    for state in e:
        pass

States:
    p1: [house.jpg, 1, motion]    p2: [house.jpg, 1, gaussian]
        [house.jpg, 2, motion]        [house.jpg, 2, gaussian]
        [house.jpg, 4, motion]        [house.jpg, 4, gaussian]
        [lena.jpg, 1, motion]         [lena.jpg, 1, gaussian]
        [lena.jpg, 2, motion]         [lena.jpg, 2, gaussian]
        [lena.jpg, 4, motion]         [lena.jpg, 4, gaussian]

The DistributedEnumerators must ensure that bound task semantics are satisfied.
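The two static strategies can be illustrated without a cluster by passing the process rank and the process count as plain arguments. This is a rough sketch, not the pyMPI-based implementation: round_robin and domain are hypothetical functions, and MasterWorker is left out because its assignment depends on run-time scheduling.

```python
from itertools import islice, product

def round_robin(params, rank, size):
    """States whose position in the full enumeration equals rank modulo size."""
    return [list(s) for s in islice(product(*params), rank, None, size)]

def domain(params, index, rank, size):
    """Split the parameter at `index` across processes; sweep the others in full."""
    local = list(params)
    local[index] = params[index][rank::size]
    return [list(s) for s in product(*local)]

images = ['house.jpg', 'lena.jpg']
params = [images, [1, 2, 4], ['motion', 'gaussian']]

rr0, rr1 = round_robin(params, 0, 2), round_robin(params, 1, 2)
d0, d1 = domain(params, 0, 0, 2), domain(params, 0, 1, 2)

print(rr0[:2])  # [['house.jpg', 1, 'motion'], ['house.jpg', 2, 'motion']]
print(d1[:2])   # [['lena.jpg', 1, 'motion'], ['lena.jpg', 1, 'gaussian']]
```

With two processes, round robin gives one process every motion state and the other every gaussian state, while the domain split keeps all states for one image on the same process. The domain split therefore keeps a task bound to the image parameter to one setup and one cleanup per image.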
Implementations
• Python
  • Designed around iterators and generators
  • DistributedEnumerator based on pyMPI
  • Ideal for managing experiments on clusters
• C++
  • Template metaprogramming techniques remove abstraction penalties
  • Ideal for applications with many nested loops
C++ Example
Generate HTML tables for the days of the week, with hours for the rows and minutes for the columns.

Task Classes

    struct table_task {
      void setup(State& state) {
        std::cout << "<table title=\"";
        print_last_param()(state);
        std::cout << "\">\n";
      }
      void cleanup(State&) {
        std::cout << "</table>\n";
      }
    };

    struct table_row_task {
      // As above with <tr>
    };

    struct table_data_task {
      // As above with <td>
    };

Parameter Sweep

    int main() {
      using boost::make_tuple;
      sweep(make_tuple("Sat", "Sun",
              make_tuple(range(24),
                make_tuple(range(0, 60, 10)))),
            empty_state().
              bind<0>(table_task()).
              bind<1>(table_row_task()).
              bind<2>(table_data_task()),
            print_last_param());
      return 0;
    }
Conclusions
• PSWEEP cleanly separates concerns
  • Parameters
  • Tasks
  • Resources
• Modern languages enable flexible and high-performance implementations
Reference
A Lightweight Pattern for Managing Distributed Computational Experiments.
Christopher Mueller, Douglas Gregor, and Andrew Lumsdaine.
Submitted to HPDC 2006.
http://www.osl.iu.edu/~chemuell/new/psweep.php