Distributed computing at the Facility level: applications and attitudes Tom Griffin STFC ISIS Facility tom.griffin@stfc.ac.uk NOBUGS 2008, Sydney
Spare cycles • Typical PC CPU usage is about 10% • Usage is minimal from 5pm to 8am • Most desktop PCs are really fast • Leaving them idle is a waste of energy • How can we use ("steal"?) unused CPU cycles to solve computational problems?
Types of Application • CPU-intensive • Low to moderate memory use • Not too much file output • Coarse-grained • Command-line / batch driven • Licensing issues?
Distributed computing solutions Lots of choice: Condor, Grid Engine, Grid MP… • Grid MP server hardware • Two dual-Xeon 2.8GHz servers, RAID 10 • Software • Servers run Red Hat Enterprise Linux / DB2 • Unlimited Windows (and other) clients • Programming • Web Services interface – XML, SOAP • Accessed with C++, Java, C# • Management console • Web-browser based • Can manage services, jobs, devices etc. • Large industrial user base • GSK, J&J, Novartis etc.
Installing and Running Grid MP • Server installation: ~2 hours • Client installation: create an MSI or RPM using 'setmsiprop' – ~30 seconds per client • Manual install gives better security on Linux and Macs
Adapting a program for Grid MP • Fairly easy to write • Interface to the grid via Web Services (C++, Java, C#) • Think about how to split your data • Wrap your executable • Write the application service • Pre- and post-processing
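The "split your data" step above is the key design decision for a coarse-grained application. A minimal sketch of that pre-processing step (plain Python, all names hypothetical – the slide does not show ISIS's actual splitting code):

```python
def make_workunits(items, per_package):
    """Split a flat list of inputs into work packages of at most
    per_package items each (hypothetical pre-processing helper)."""
    return [items[i:i + per_package]
            for i in range(0, len(items), per_package)]

# e.g. 10 runs, 4 runs per workunit -> 3 packages of sizes 4, 4, 2
packages = make_workunits(list(range(10)), 4)
```

Each resulting package becomes one workunit; the matching post-processing step then merges the per-package results back into a single output.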
Package your executable A program module bundles the executable together with its DLLs, standard data files and environment variables. The module can be compressed and/or encrypted, and is uploaded to, and resident on, the server.
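The bundling step might be sketched as follows (a sketch only – Grid MP's real package format is produced by its own tooling, and the file names here are hypothetical):

```python
import os
import tarfile

def package_module(archive_path, files):
    """Bundle an executable, its DLLs and standard data files into a
    single compressed archive, ready for upload to the server."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in files:
            # store each file flat, by its base name
            tar.add(f, arcname=os.path.basename(f))
```

Encryption, if wanted, would be a further step applied to the finished archive.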
Create / run a job Client side (over https): upload the datasets (e.g. proteins, molecules), create the job, and generate the cross product of the datasets. Server side: the cross product becomes the workunits (Pkg1 … Pkg4); start the job.
Code examples
// Create a job, then its job step, via the Grid MP web services interface (C#)
Mgsi.Job job = new Mgsi.Job();
job.application_gid = app.application_gid;
job.description = txtJobName.Text.Trim();
job.state_id = 1;
job.job_gid = ud.createJob(auth, job);

Mgsi.JobStep js = new Mgsi.JobStep();
js.job_gid = job.job_gid;
js.state_id = 1;
js.max_concurrent = 1;
js.max_errors = 20;
js.num_results = 1;
js.program_gid = prog.program_gid;
Code examples
Mgsi.DataSet ds = new Mgsi.DataSet();
ds.job_gid = job.job_gid;
ds.data_set_name = job.description + "_ds_" + DateTime.Now.Ticks;
ds.data_set_gid = ud.createDataSet(auth, ds);

// One Data entry per workunit, each pointing at an uploaded file
Mgsi.Data[] datas = new Mgsi.Data[(int)numWorkunits.Value];
for (int i = 1; i <= numWorkunits.Value; i++)
{
    FileTransfer.UploadData uploadD = ft.uploadFile(auth, Application.StartupPath + "\\testdata.tar");
    Mgsi.Data data = new Mgsi.Data();
    data.data_set_gid = ds.data_set_gid;
    data.index = i;
    data.file_hash = uploadD.hash;
    data.file_size = long.Parse(uploadD.size);
    datas[i - 1] = data;
}
ud.createDatas(auth, datas);
ud.createWorkunitsFromDataSetsAsync(auth, js.job_step_gid, new string[] { ds.data_set_gid }, options);
Performance Famotidine form B: 13 degrees of freedom, P21/c, V = 1421 Å³, synchrotron data to 1.64 Å; 1 × 10^7 SA moves per run, 64 runs. • Standard DASH: 2.4GHz Core2 Quad using a single core – job complete in 9 hours • GDASH: submitted to a test grid of 5 in-use PCs (4 × 2.4GHz Core2 Quad, 1 × 2.8GHz Core2 Quad) – job complete in 24 minutes • Speedup = 22.5×
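The quoted speedup follows directly from the two completion times:

```python
serial_minutes = 9 * 60   # standard DASH, single core: 9 hours
grid_minutes = 24         # GDASH on the 5-PC test grid
speedup = serial_minutes / grid_minutes   # 540 / 24 = 22.5
```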
Performance – 999 SA runs, full grid 4 days 18 hours of CPU time in ~40 minutes elapsed time, using 317 cores from 163 devices: • 42 Athlons: 1.6–2.2GHz • 168 Core 2 Duos: 1.8–3GHz • 36 Core 2 Quads: 2.4–2.8GHz • 1 Duron @ 1.2GHz • 42 P4s: 2.4–3.6GHz • 27 Xeons: 2.5–3.6GHz
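A rough sanity check on these figures (rough because the elapsed time is approximate and the cores differ widely in speed): the grid delivered about half of its theoretical core-hour capacity.

```python
cpu_hours = 4 * 24 + 18          # 4 days 18 hours of CPU time delivered
elapsed_hours = 40 / 60          # ~40 minutes wall-clock
core_hours_available = 317 * elapsed_hours
utilization = cpu_hours / core_hours_available   # roughly 0.54
```

That kind of utilisation is quite reasonable for a heterogeneous grid of in-use desktop PCs.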
A particular success – McStas HRPD supermirror guide design: • Complex design; meaningful simulations take a long time • Want to try lots of ideas – many runs of >200 CPU-days • The simpler model turned out to be the best value • Massive improvement in flux • Significant cost savings
Problems • McStas interactions "in the wild" with Symantec Anti-Virus • These did not show up in testing • McStas now restricted to night-time running only
User Attitudes • A wide range: from "theft" and "I'm not having that on my machine"… • The grid is the first thing to get blamed when a PC misbehaves • …to gaining more trust over time • Evangelism by users helps
Flexibility with virtualisation • Request to run the 'GARefl' code – but ISIS is Windows-based, with few Linux PCs • VMware Server is free: 8 hosts gave 26 cores • More cores = more demand: 56 real cores recruited from servers and a 64-core Beowulf cluster, plus 10 Mac cores • Can also run Linux itself as a grid job
The Future • The grid grows in power every day: new machines are added, old ones are still left on • Electricity: an energy-saving drive at STFC means switching machines off • Wake-on-LAN 'magic packets' plus remote hibernation let machines be woken only when the grid needs them • Laptops on the grid: good or bad?
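The Wake-on-LAN 'magic packet' format is simple enough to sketch. This is the standard packet layout, not ISIS's actual deployment script, and the MAC address shown is a placeholder:

```python
import socket

def magic_packet(mac):
    """Build a Wake-on-LAN 'magic packet': 6 bytes of 0xFF followed by
    the target MAC address repeated 16 times (102 bytes in total)."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    return b"\xff" * 6 + mac_bytes * 16

def wake(mac, broadcast="255.255.255.255", port=9):
    """Send the packet as a UDP broadcast; a sleeping machine with WoL
    enabled powers on when its NIC sees its own MAC in the payload."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))
```

Paired with remote hibernation, this lets the grid power machines down overnight and wake them on demand.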
Summary • Distributed computing: perfect for coarse-grained, CPU-intensive, 'disk-lite' applications • Resources: use existing resources; power increases with time, with no need to write off assets • Scalable: not just faster – allows one to try different scenarios • Virtualisation: Linux under Windows, Windows under Linux • Green credentials: the PCs are running anyway, so better to utilise them; they can be powered down and up
Acknowledgements • ISIS Data Analysis Group • Kenneth Shankland • Damian Flannery • STFC FBU IT Service Desk and ISIS Computing Group • Key Users • Richard Ibberson (HRPD) • Stephen Holt (GARefl) • Questions?