Distributed computing at the Facility level: applications and attitudes Tom Griffin STFC ISIS Facility tom.griffin@stfc.ac.uk NOBUGS 2008, Sydney
Spare cycles • Typical PC CPU usage is about 10% • Usage is minimal from 5pm to 8am • Most desktop PCs are really fast • Leaving them idle is a waste of energy • How can we use ("steal"?) unused CPU cycles to solve computational problems?
Types of Application • CPU-intensive • Low to moderate memory use • Not too much file output • Coarse-grained • Command-line / batch driven • Licensing issues?
Distributed computing solutions Lots of choice: Condor, Grid Engine, Grid MP… • Grid MP server hardware • Two dual-Xeon 2.8GHz servers, RAID 10 • Software • Servers run Red Hat Enterprise Linux / DB2 • Unlimited Windows (and other) clients • Programming • Web Services interface – XML, SOAP • Accessed with C++, Java, C# • Management console • Web-browser based • Can manage services, jobs, devices etc. • Large industrial user base • GSK, J&J, Novartis etc.
Installing and Running Grid MP • Server installation: ~2 hours • Client installation: create an MSI or RPM using 'setmsiprop' – ~30 seconds per client • Manual install gives better security on Linux and Macs
Adapting a program for Grid MP • Fairly easy to write • Interface to the grid via Web Services (C++, Java, C#) • Think about how to split your data • Wrap your executable • Write the application service • Pre- and post-processing
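The "split your data" step above is the key design decision for a coarse-grained application. A minimal sketch of that pre-processing step (plain Python, all names hypothetical – the slide does not show ISIS's actual splitting code):

```python
def make_workunits(items, per_package):
    """Split a flat list of inputs into work packages of at most
    per_package items each (hypothetical pre-processing helper)."""
    return [items[i:i + per_package]
            for i in range(0, len(items), per_package)]

# e.g. 10 runs, 4 runs per workunit -> 3 packages of sizes 4, 4, 2
packages = make_workunits(list(range(10)), 4)
```

Each resulting package becomes one workunit; the matching post-processing step then merges the per-package results back into a single output.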
Package your executable A program module bundles the executable together with its DLLs, standard data files and environment variables. The module can be compressed and/or encrypted, and is uploaded to, and resident on, the server.
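The bundling step might be sketched as follows (a sketch only – Grid MP's real package format is produced by its own tooling, and the file names here are hypothetical):

```python
import os
import tarfile

def package_module(archive_path, files):
    """Bundle an executable, its DLLs and standard data files into a
    single compressed archive, ready for upload to the server."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in files:
            # store each file flat, by its base name
            tar.add(f, arcname=os.path.basename(f))
```

Encryption, if wanted, would be a further step applied to the finished archive.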
Create / run a job Client side (over https): upload the datasets (e.g. proteins, molecules), create the job, and generate the cross product of the datasets. Server side: the cross product becomes the workunits (Pkg1 … Pkg4); start the job.
Code examples
// Create a job, then its job step, via the Grid MP web services interface (C#)
Mgsi.Job job = new Mgsi.Job();
job.application_gid = app.application_gid;
job.description = txtJobName.Text.Trim();
job.state_id = 1;
job.job_gid = ud.createJob(auth, job);

Mgsi.JobStep js = new Mgsi.JobStep();
js.job_gid = job.job_gid;
js.state_id = 1;
js.max_concurrent = 1;
js.max_errors = 20;
js.num_results = 1;
js.program_gid = prog.program_gid;
Code examples
Mgsi.DataSet ds = new Mgsi.DataSet();
ds.job_gid = job.job_gid;
ds.data_set_name = job.description + "_ds_" + DateTime.Now.Ticks;
ds.data_set_gid = ud.createDataSet(auth, ds);

// One Data entry per workunit, each pointing at an uploaded file
Mgsi.Data[] datas = new Mgsi.Data[(int)numWorkunits.Value];
for (int i = 1; i <= numWorkunits.Value; i++)
{
    FileTransfer.UploadData uploadD = ft.uploadFile(auth, Application.StartupPath + "\\testdata.tar");
    Mgsi.Data data = new Mgsi.Data();
    data.data_set_gid = ds.data_set_gid;
    data.index = i;
    data.file_hash = uploadD.hash;
    data.file_size = long.Parse(uploadD.size);
    datas[i - 1] = data;
}
ud.createDatas(auth, datas);
ud.createWorkunitsFromDataSetsAsync(auth, js.job_step_gid, new string[] { ds.data_set_gid }, options);
Performance Famotidine form B: 13 degrees of freedom, P21/c, V = 1421 Å³, synchrotron data to 1.64 Å; 1 × 10^7 SA moves per run, 64 runs. • Standard DASH: 2.4GHz Core2 Quad using a single core – job complete in 9 hours • GDASH: submitted to a test grid of 5 in-use PCs (4 × 2.4GHz Core2 Quad, 1 × 2.8GHz Core2 Quad) – job complete in 24 minutes • Speedup = 22.5×
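The quoted speedup follows directly from the two completion times:

```python
serial_minutes = 9 * 60   # standard DASH, single core: 9 hours
grid_minutes = 24         # GDASH on the 5-PC test grid
speedup = serial_minutes / grid_minutes   # 540 / 24 = 22.5
```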
Performance – 999 SA runs, full grid 4 days 18 hours of CPU time in ~40 minutes elapsed time, using 317 cores from 163 devices: • 42 Athlons: 1.6–2.2GHz • 168 Core 2 Duos: 1.8–3GHz • 36 Core 2 Quads: 2.4–2.8GHz • 1 Duron @ 1.2GHz • 42 P4s: 2.4–3.6GHz • 27 Xeons: 2.5–3.6GHz
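A rough sanity check on these figures (rough because the elapsed time is approximate and the cores differ widely in speed): the grid delivered about half of its theoretical core-hour capacity.

```python
cpu_hours = 4 * 24 + 18          # 4 days 18 hours of CPU time delivered
elapsed_hours = 40 / 60          # ~40 minutes wall-clock
core_hours_available = 317 * elapsed_hours
utilization = cpu_hours / core_hours_available   # roughly 0.54
```

That kind of utilisation is quite reasonable for a heterogeneous grid of in-use desktop PCs.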
A particular success – McStas HRPD supermirror guide design: • Complex design; meaningful simulations take a long time • Want to try lots of ideas – many runs of >200 CPU-days • The simpler model turned out to be the best value • Massive improvement in flux • Significant cost savings
Problems • McStas interactions "in the wild" with Symantec Anti-Virus • These did not show up in testing • McStas now restricted to night-time running only
User Attitudes • A wide range: from "theft" and "I'm not having that on my machine"… • The grid is the first thing to get blamed when a PC misbehaves • …to gaining more trust over time • Evangelism by users helps
Flexibility with virtualisation • Request to run the 'GARefl' code – but ISIS is Windows-based, with few Linux PCs • VMware Server is free: 8 hosts gave 26 cores • More cores = more demand: 56 real cores recruited from servers and a 64-core Beowulf cluster, plus 10 Mac cores • Can also run Linux itself as a grid job
The Future • The grid grows in power every day: new machines are added, old ones are still left on • Electricity: an energy-saving drive at STFC means switching machines off • Wake-on-LAN 'magic packets' plus remote hibernation let machines be woken only when the grid needs them • Laptops on the grid: good or bad?
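The Wake-on-LAN 'magic packet' format is simple enough to sketch. This is the standard packet layout, not ISIS's actual deployment script, and the MAC address shown is a placeholder:

```python
import socket

def magic_packet(mac):
    """Build a Wake-on-LAN 'magic packet': 6 bytes of 0xFF followed by
    the target MAC address repeated 16 times (102 bytes in total)."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    return b"\xff" * 6 + mac_bytes * 16

def wake(mac, broadcast="255.255.255.255", port=9):
    """Send the packet as a UDP broadcast; a sleeping machine with WoL
    enabled powers on when its NIC sees its own MAC in the payload."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))
```

Paired with remote hibernation, this lets the grid power machines down overnight and wake them on demand.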
Summary • Distributed computing: perfect for coarse-grained, CPU-intensive, 'disk-lite' applications • Resources: use existing resources; power increases with time, with no need to write off assets • Scalable: not just faster – allows one to try different scenarios • Virtualisation: Linux under Windows, Windows under Linux • Green credentials: the PCs are running anyway, so better to utilise them; they can be powered down and up
Acknowledgements • ISIS Data Analysis Group • Kenneth Shankland • Damian Flannery • STFC FBU IT Service Desk and ISIS Computing Group • Key Users • Richard Ibberson (HRPD) • Stephen Holt (GARefl) • Questions?