Implementing a Central Quill Database in a Large Condor Installation Preston Smith psmith@purdue.edu Condor Week 2008 - April 30, 2008
Overview • Background • BoilerGrid • Motivation • What works well • What has been challenging • What just doesn’t work • Future directions
BoilerGrid • Purdue Condor Grid (BoilerGrid) • Comprised of Linux HPC clusters, student labs, machines from academic departments, and Purdue regional campuses • 8,900 batch slots today… 14,000 batch slots in a few weeks • 2007 - Delivered over 10 million CPU-hours of high-throughput computing to Purdue and the national community through Open Science Grid and TeraGrid
A Central Quill Database • As of Condor 6.9.4, Quill can store information about all the execute machines and daemons in a pool • Quill is now able to store job history and queue contents in a single, central database • Since December 2007, we’ve been working to store the state of BoilerGrid in a Quill installation
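For anyone who hasn’t set this up: a machine is pointed at a central Quill database through a handful of condor_config macros. A minimal sketch, assuming a hypothetical database host and user (check the macro names against the Quill section of the Condor manual for your release):

# Enable Quill and point it at the central Postgres server (host and user are placeholders)
QUILL_ENABLED = TRUE
QUILL_NAME = quill@$(FULL_HOSTNAME)
QUILL_DB_NAME = quill
QUILL_DB_IP_ADDR = quill-db.example.edu:5432
QUILL_DB_USER = quillwriter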
Motivation • Why would we want to do such a thing?? • Research into the state of a large distributed system • Several at Purdue, collaborators at Notre Dame • Failure analysis/prediction, smart scheduling, interesting reporting for machine owners • “events” table useful for user troubleshooting? • And one of our familiar gripes - usage reporting • Structural biologists (see earlier today) like to submit jobs from their desks, too • How can we access that job history to complete the picture of BoilerGrid’s usage?
The Quill Server • Dell 2850 • 2x 2.8GHz Xeons (hyperthreaded) • Postgres on 4-disk Ultra320 SCSI RAID-0 • 5GB RAM
What works well • Getting at usage data!
quill=> select distinct scheddname,owner,cluster_id,proc_id,remotewallclocktime
        from jobs_horizontal_history
        where scheddname LIKE '%bio.purdue.edu%' LIMIT 10;
       scheddname       |  owner  | cluster_id | proc_id | remotewallclocktime
------------------------+---------+------------+---------+--------------------
 epsilon.bio.purdue.edu | jiang12 |     276189 |       0 |                 345
 epsilon.bio.purdue.edu | jiang12 |     280668 |       0 |                4456
 epsilon.bio.purdue.edu | jiang12 |     280707 |       0 |                1209
 epsilon.bio.purdue.edu | jiang12 |     280710 |       0 |                1197
 epsilon.bio.purdue.edu | jiang12 |     280715 |       0 |                1064
 epsilon.bio.purdue.edu | jiang12 |     280717 |       0 |                 567
 epsilon.bio.purdue.edu | jiang12 |     280718 |       0 |                 485
 epsilon.bio.purdue.edu | jiang12 |     280720 |       0 |                 480
 epsilon.bio.purdue.edu | jiang12 |     280721 |       0 |                 509
 epsilon.bio.purdue.edu | jiang12 |     280722 |       0 |                 539
(10 rows)
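With the history in one place, aggregate reporting is a single query away. A rough example over the same table and columns as above, assuming remotewallclocktime is in seconds (per-user wall-clock hours by schedd):

select scheddname, owner, count(*) as jobs,
       round(sum(remotewallclocktime)/3600.0) as wall_hours
from jobs_horizontal_history
group by scheddname, owner
order by wall_hours desc
limit 20;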
What works, but is painful • Thousands of hosts pounding a Postgres database is non-trivial • Be sure to turn down Quill’s polling rate (QUILL_POLLING_PERIOD) • The default is 10s - we increased the period to 1 hour on execute machines • At some level, this is an exercise in tuning your Postgres server • Quick diversion into Postgres tuning 101…
top - 13:45:30 up 23 days, 19:59, 2 users, load average: 563.79, 471.50, 428.
Tasks: 804 total, 670 running, 131 sleeping, 3 stopped, 0 zombie
Cpu(s): 94.6% us, 2.9% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.4% hi, 2.2% si
Mem:  5079368k total, 5042452k used,   36916k free,   10820k buffers
Swap: 4016200k total,   68292k used, 3947908k free, 2857076k cached
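The Quill-side change itself is a single condor_config line on the execute machines; 3600 seconds matches the 1-hour setting mentioned above:

# Execute machines: poll the central database hourly instead of every 10 seconds
QUILL_POLLING_PERIOD = 3600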
Postgres • Assuming that there’s enough disk bandwidth… • In order to support 2500 simultaneous connections, one must turn up max_connections • If you turn up max_connections, you need ~400 bytes of shared memory per slot • Currently we have 2G of shared memory allocated
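As a sketch of what that looks like in practice, using the numbers from this slide (a hypothetical example; exact values depend on your Postgres version, kernel, and workload):

# postgresql.conf: one backend slot per potential Quill connection
max_connections = 2500

# /etc/sysctl.conf: allow a shared memory segment big enough for Postgres
# 2 GB; shmall is the same amount expressed in 4 kB pages
kernel.shmmax = 2147483648
kernel.shmall = 524288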
Postgres • Then you’ll need to turn up shared_buffers • 1G currently • Don’t forget max_fsm_pages…
WARNING:  relation "public.machines_vertical_history" contains more than "max_fsm_pages" pages with useful free space
HINT:  Consider compacting this relation or increasing the configuration parameter "max_fsm_pages".
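The corresponding postgresql.conf lines might look like the following; 1 GB matches the figure above (memory-unit syntax needs Postgres 8.2 or later), and the max_fsm_pages value is purely illustrative - raise it until VACUUM stops printing that warning:

# Requires the kernel shared memory headroom from the previous slide
shared_buffers = 1GB
# Illustrative value; size it to cover pages with reclaimable free space
max_fsm_pages = 2000000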
What works, but is painful • So by now we can withstand the worker nodes reasonably well • Add schedds • condor_history returns history from ALL schedds • Bug fixed in 7.0.2 • The execute machines create enough load that condor_q is sluggish • Added a 2nd quill database server just for job information
What works, but is painful • If your daemons are logging a lot to their sql.log files but not writing to the database… • Database down, etc. • …your database is in a world of hurt while it tries to catch up
What Hasn’t Worked • Many Postgres tuning guides recommend a connection pooler if you need scads of connections • pgpool-II • PgBouncer • Tried both; Quill doesn’t seem to like them • They *did* reduce load… but they often locked up the database (idle in transaction), and we didn’t get anywhere
What can we do about it? • Throw hardware at the database! • Spindle count seems ok • Not I/O bound (any more) • More memory = more connections • 16GB? More? • More, faster CPUs • We appear to be CPU-bound now • Get latest multi-cores
What can we do about it? • Contact Wisconsin and call for rescue “Hey guys.. This is really hard on the old database” “Hmm. Let’s take a look.”
What can Wisconsin do about it? • Todd, Greg, and probably others take a look: • Quill always hits the database, even for unchanged ads • The Postgres backend does not prepare SQL queries before submitting them • Fixes are in the works - Todd is optimistic • We’ll report the results as soon as we have them
Future Directions • Reporting for users • Easy access to statistics about who ran on “my” machines. • Mashups, web portals • Diagnostic tools to help users • Troubleshooting, etc.
The End • Questions?