Learn how research computing can overcome the limitations of typical PCs and laptops, and how Condor can speed up parallel processing for statistical applications. Explore the concepts of High Performance Computing (HPC) and High Throughput Computing (HTC) using Condor at the University of Liverpool.
Introduction to research computing using Condor
Ian C. Smith, Advanced Research Computing, University of Liverpool
What’s special about research computing?
• Often researchers need to tackle problems which are far too demanding for a typical PC or laptop computer
• Programs may take too long to run or …
  • require too much memory or …
  • too much storage (disk space) or …
  • all of these!
• Special computer systems and programming methods can help overcome these barriers
Speeding things up
• Key to reducing run times is parallelism - splitting large problems into smaller tasks which can be tackled at the same time (i.e. “in parallel” or “concurrently”)
• Two main types of parallelism:
  • data parallelism
  • functional parallelism (pipelining)
• Tasks may be independent or inter-dependent (inter-dependence eventually limits the speed-up which can be achieved)
• Fortunately many statistical problems exhibit data parallelism and tasks can be performed independently …
  • this can lead to very significant speed-ups!
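As a minimal sketch of data parallelism, the Python below splits a dataset into chunks, processes each chunk as an independent task, and combines the partial results. The function names (`chunked`, `partial_sum`, `parallel_mean`) are illustrative and not part of any Condor tooling. Threads are used only to keep the example self-contained; it shows how the work decomposes rather than a measured speed-up, and on Condor each chunk would be a separate job.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split values into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partial_sum(chunk):
    """Independent task: it needs only its own chunk of the data."""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_tasks=4):
    """Mean of values, built from independent per-chunk partial sums."""
    chunks = [c for c in chunked(values, n_tasks) if c]
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 1001))
print(parallel_mean(data))
```

Because no task reads another task's chunk, the tasks can run in any order or all at once, which is the property that makes a problem suitable for Condor.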
High Performance Computing (HPC)
• Uses powerful special-purpose systems called HPC clusters
• Contain large numbers of processors acting in parallel
• Each processor may contain multiple processing elements (cores) which can also work in parallel
• Provide lots of memory and large amounts of fast (parallel) disk storage – ideal for data-intensive applications
• Typically run parallel programs containing inter-dependent tasks (e.g. finite element analysis codes) but also suitable for statistics applications
High Throughput Computing (HTC) using Condor
• No dedicated hardware - uses ordinary classroom PCs to run jobs when they would otherwise be idle (usually evenings and weekends)
• Jobs may be interrupted by users logging into Condor PCs – works best for short-running jobs (10-20 minutes ideally, ~ 8 hours max)
• Only suitable for applications which use independent tasks (inter-dependent tasks need HPC)
• No shared storage – all data files must be transferred to/from the Condor PCs
• Limited memory and disk space available since Condor uses only commodity PCs
• However… Condor is well suited to many statistical applications
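One way to act on the short-job guidance is to decide in advance how much work each job should do. The sketch below is a hypothetical helper (`plan_jobs` is not a Condor command) which assumes you have timed one unit of work, such as a single replicate, on a typical PC. The 15-minute default is simply the midpoint of the 10-20 minute range above.

```python
import math


def plan_jobs(total_units, seconds_per_unit, target_job_minutes=15):
    """Return (n_jobs, units_per_job) so each job runs for about the target time."""
    # How many units fit into one job of the target length (at least one).
    units_per_job = max(1, int(target_job_minutes * 60 // seconds_per_unit))
    # Round up so that the final, possibly smaller, job is not dropped.
    n_jobs = math.ceil(total_units / units_per_job)
    return n_jobs, units_per_job


n_jobs, units_per_job = plan_jobs(total_units=100_000, seconds_per_unit=2)
print(n_jobs, units_per_job)
```

Shorter jobs lose less work when a user logs in and interrupts them, at the cost of more files to transfer and combine.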
Condor pool operation (diagram, shown in four stages: Desktop PC, Condor Server, Execute hosts)
• Desktop PC to Condor Server: login and upload input data
• Condor Server to Execute hosts: jobs
• Execute hosts to Condor Server: results
• Condor Server to Desktop PC: download results
University of Liverpool Condor Pool
• Contains over 1000 classroom PCs running the Managed Windows 10 Service
• Each PC can run a maximum of 4 jobs concurrently, giving a theoretical capacity of over 4000 parallel jobs
• Typical spec: 3.3 GHz Intel i5 processor, 8 GB memory, 128 GB disk space
• Tools are available to help in running large numbers of R and MATLAB jobs (other software may work but not commercial packages such as SAS and Stata). Also some Python support.
• Single job submission point for Condor jobs provided by a powerful UNIX server
• Service can also be accessed from a Windows PC/laptop using Desktop Condor (even from off-campus)
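Bootstrap example (diagram)
• Inputs: seed0.dat, seed1.dat, seed2.dat … seed999.dat
• Each input is processed by bootstrap.R
• Outputs: results0.dat, results1.dat, results2.dat … results999.dat
• A combine step produces stats.dat
• samples.dat also appears in the diagram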
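The indexed file names follow a simple pattern: the job index is inserted before the file extension, so job 27 reads seed27.dat and writes results27.dat. The sketch below reproduces that naming in Python and writes one seed file per job. The helper names are hypothetical, and the seed file contents (a single integer) are an assumption, since the slides do not show them.

```python
from pathlib import Path


def indexed_name(base_name, index):
    """Insert a job index before the extension: ('seed.dat', 27) -> 'seed27.dat'."""
    path = Path(base_name)
    return f"{path.stem}{index}{path.suffix}"


def write_seed_files(directory, total_jobs, base_name="seed.dat"):
    """Write one small seed file per job, each holding a distinct integer seed."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for index in range(total_jobs):
        # Assumed file format: one integer on a single line.
        (directory / indexed_name(base_name, index)).write_text(f"{index + 1}\n")


print(indexed_name("seed.dat", 27))
print(indexed_name("results.dat", 999))
```

Calling `write_seed_files("bootstrap", 1000)` would create seed0.dat to seed999.dat in a directory named bootstrap.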
Bootstrap example (terminal session)
$ ls bootstrap.R samples.dat seed*.dat
bootstrap.R  seed27.dat   seed460.dat  seed641.dat  seed822.dat
samples.dat  seed280.dat  seed461.dat  seed642.dat  seed823.dat
seed0.dat    seed281.dat  seed462.dat  seed643.dat  seed824.dat
...
$ cat run_bootstrap
R_script = bootstrap.R
indexed_input_files = seed.dat
indexed_output_files = results.dat
total_jobs = 1000
$ r_submit run_bootstrap
Submitting job(s)...
1000 job(s) submitted to cluster 952.
$ condor_q
-- Schedd: Q1@condor1 : <10.102.32.11:37851?... @ 05/21/19 11:26:56
OWNER    BATCH_NAME              SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
smithic  CMD: run_bootstrap.bat  5/21 10:57    68  130   802   1000  952.4-999
$ ls results*.dat
results0.dat    results281.dat  results462.dat  results643.dat  results824.dat
results100.dat  results282.dat  results463.dat  results644.dat  results825.dat
results101.dat  results283.dat  results464.dat  results645.dat  results826.dat
...
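The slides do not show the contents of bootstrap.R or of the combine step, so the following is only a Python analogue of what one job and the final combine might do. It assumes that the statistic is the sample mean and that each seed file supplies one integer seed; `bootstrap_job`, `combine` and the sample values are all hypothetical.

```python
import random
from statistics import fmean, stdev


def bootstrap_job(samples, seed):
    """One independent job: resample with replacement and return the mean."""
    rng = random.Random(seed)  # the per-job seed gives each job its own random stream
    resample = rng.choices(samples, k=len(samples))
    return fmean(resample)


def combine(results):
    """Combine step: summarise the statistics returned by all jobs."""
    return {"estimate": fmean(results), "std_error": stdev(results)}


# Made-up data standing in for samples.dat.
samples = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]

# On Condor each call would be a separate job; here they run one after another.
results = [bootstrap_job(samples, seed) for seed in range(1000)]
stats = combine(results)
print(stats)
```

Each job depends only on the shared samples and its own seed, so the 1000 jobs can run in any order and on any machine before the results are combined.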
Example application: Bayesian Models using MCMC
• Fitting multivariate mixed models to model longitudinal data
• Two main uses for Condor:
  • compare various similar models to select the best model
  • run simulations - simulate 100 datasets and fit an MCMC model to each one
• Single simulation takes ~ 1 day but Condor can run 100 simulations in parallel
• Simulations take ~ 1 day instead of ~ 3 months
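The arithmetic behind the "1 day instead of 3 months" figure can be sketched as below. This is an idealised estimate with a hypothetical helper name: it assumes every job takes the same time and ignores queueing, interruptions and file transfer.

```python
import math


def ideal_wall_time_days(n_jobs, days_per_job, slots):
    """Wall-clock days if jobs run in waves of `slots` at a time, with no overheads."""
    waves = math.ceil(n_jobs / slots)
    return waves * days_per_job


serial = ideal_wall_time_days(100, 1, slots=1)      # one simulation after another
parallel = ideal_wall_time_days(100, 1, slots=100)  # all simulations at once
print(serial, parallel, serial / parallel)
```

Run one after another, 100 one-day simulations need 100 days, which is a little over 3 months; with 100 slots available at once the same work finishes in about a day.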
Summary
• Parallelism can help speed up the solution of many research computing problems by dividing large problems into many smaller ones which can be tackled at the same time
• Condor High Throughput Computing Service
  • Typically used for large/very large numbers of short-running jobs
  • Limited memory and storage available on Condor PCs
  • Support available for applications using R, MATLAB and Python
  • No UNIX knowledge needed with Desktop Condor
• High Performance Computing clusters
  • Typically used for small numbers of long-running jobs
  • Ideal for applications requiring lots of memory and disk storage space
  • Almost all systems are UNIX-based
Further information
• Condor Service: http://condor.liv.ac.uk
• To request an account on Condor: go to ServiceNow then click: Make a request > Accounts > Application to access high performance/throughput computing facilities
• Background information on HPC clusters: http://clusterinfo.liv.ac.uk
• Information on the Advanced Research Computing (ARC) facilities: http://www.liv.ac.uk/csd/advanced-research-computing, http://arc.liv.ac.uk
• To contact the ARC team email: arc-support@liverpool.ac.uk or contact me: i.c.smith@liverpool.ac.uk
• More presentations ???