Presentation Transcript


  1. PCGRID ’08 Workshop, Miami, FL, April 18, 2008. Preston Smith <psmith@purdue.edu>. Implementing an Industrial-Strength Academic Cyberinfrastructure at Purdue University

  2. BoilerGrid • Introduction • Environment • Motivation • Challenges • Infrastructure • Usage Tracking • Storage • Staffing • Future Work • Results

  3. BoilerGrid - Growth • How did we get from here… to here?

  4. BoilerGrid - Rosen Center for Advanced Computing • Research Computing arm of ITaP - Information Technology at Purdue • Clusters in RCAC are arranged in larger “Community Clusters” • One cluster, one configuration, many owners • Leverages economies of scale for purchasing, and provides expertise in systems engineering, user support, and networking

  5. BoilerGrid - Motivation • Early on, we recognized that the diverse owners of the community clusters don’t use their machines at 100% capacity • Community clusters used approximately 70% of capacity • Condor installed on the community clusters to cycle-scavenge from PBS, the primary scheduler • Goal: provide a general-purpose high-throughput computing resource on existing hardware
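
  As a concrete illustration of the cycle-scavenging setup (a minimal sketch, not necessarily the exact mechanism BoilerGrid uses), a periodic startd probe can advertise whether PBS owns the node, and the Condor start/preempt policy can key off that attribute. The probe path and the PBS_Busy attribute name here are hypothetical:

      # condor_config sketch: a STARTD_CRON probe prints "PBS_Busy = True|False"
      STARTD_CRON_JOBLIST = PBSCHECK
      STARTD_CRON_PBSCHECK_EXECUTABLE = /usr/local/libexec/pbs_busy_probe
      STARTD_CRON_PBSCHECK_PERIOD = 2m

      # Start guest jobs only on nodes PBS is not using; evict when PBS comes back
      # (if the probe has not reported yet, PBS_Busy is undefined and START stays False)
      START   = (PBS_Busy =?= False)
      PREEMPT = (PBS_Busy =?= True)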

  6. BoilerGrid - Challenges • In 2005, the Condor deployment at Purdue was unable to scale to the size of the clusters, and ran on an old version of the software • An overhaul of the Condor infrastructure was needed!

  7. BoilerGrid - Keep Condor Up-to-date • Upgrading Condor • In late 2005, we were running Condor version 6.6.5, which was 1.5 years old. • First, we needed to upgrade! • In a large, busy Condor grid, we found it’s usually advantageous to run the development release of Condor • Early access to new features, scalability improvements

  8. BoilerGrid - Pool Design • Use many machines • In 2005, we ran a single Condor pool with ~1800 machines. • In 2005, the largest single Condor pools in existence were ~1000 machines. • We implemented BoilerGrid as a flock of 4 pools, of up to 1200 machines each. • Implementing BoilerGrid today? • Would have looked much different!
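
  For reference, flocking between pools is a small piece of configuration: submit hosts list the peer central managers they may overflow to, and each central manager lists the submit hosts it will accept flocked jobs from. A minimal sketch, with hypothetical host names standing in for the real BoilerGrid machines:

      # Submit-host side: if the local pool cannot run the job, try these pools
      FLOCK_TO = cm-pool2.rcac.purdue.edu, cm-pool3.rcac.purdue.edu, cm-pool4.rcac.purdue.edu

      # Central-manager side: submit hosts allowed to flock in
      # (these hosts also need write access to the collector and startds)
      FLOCK_FROM = submit-a.rcac.purdue.edu, submit-b.rcac.purdue.edu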

  9. BoilerGrid - Submit Hosts • Many submit hosts • In 2005, a single host ran the Condor schedd and could submit jobs • Today, any RCAC machine used for user login, and in many cases end-user desktops, can submit Condor jobs
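
  From any of those submit hosts, work is queued with an ordinary submit description file; a minimal vanilla-universe example (the executable and file names are illustrative):

      # sweep.sub -- queue 100 instances of a serial code
      universe    = vanilla
      executable  = my_analysis
      arguments   = input.$(Process)
      output      = out.$(Process)
      error       = err.$(Process)
      log         = sweep.log
      queue 100

  and submitted with "condor_submit sweep.sub" from a login node or an enabled desktop.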

  10. BoilerGrid - Challenges • Usage Tracking • Tracking job-level accounting with a large Condor pool is difficult • Job history resides on every submit host • Recent versions of Condor’s Quill software allow for a central database holding job (and machine) information • Deploying this on BoilerGrid now
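
  To see why per-host history is awkward: without a central database, an accounting query like the sketch below has to run on every submit host and the output merged by hand, whereas Quill collects the same job ClassAds into one central database. (JobStatus, Owner, and RemoteWallClockTime are standard job attributes; the merging step is omitted.)

      # On each submit host: owner and wall-clock seconds of completed jobs
      condor_history -constraint 'JobStatus == 4' \
                     -format '%s ' Owner -format '%f\n' RemoteWallClockTime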

  11. BoilerGrid - Storage • If your users expect to run jobs using a shared filesystem, a large Condor installation can overwhelm NFS servers. • DAGMan and user logs on NFS can cause problems • The defaults don’t allow this for a reason! • Train users to rely less on the shared filesystem and take advantage of Condor’s ability to transfer files
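
  The submit-file changes involved are small; a sketch of moving a job off NFS and onto Condor’s file transfer (the file names and log path are illustrative):

      # Stage input and output through the execute node's local scratch
      # instead of reading and writing over NFS
      should_transfer_files   = YES
      when_to_transfer_output = ON_EXIT
      transfer_input_files    = params.dat, reference.db
      # Keep the user log (which DAGMan also watches) on local disk, not NFS
      log = /var/tmp/$ENV(USER)/sweep.log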

  12. BoilerGrid - Expansion • Successful use of Condor in clusters led us to identify partners around campus • Student computer labs operated by a sister unit in ITaP (2500 machines and growing) • Library terminals (200 machines) • Other campuses (500+ machines) • Management support is critical! • Purdue’s CIO supports using Condor on many machines run by ITaP, including the one on his own desk

  13. BoilerGrid - Expansion • An even better route of expansion • Condor users adding their own resources • Machines in their own lab • All the machines in their department • With distributed ownership comes new challenges • Regular contact with owners’ system administration staff • Ensure that owners are able to set their own policies
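
  An owner’s policy usually amounts to a few startd expressions; an illustrative sketch (the thresholds are examples, not BoilerGrid’s actual settings):

      # Run guest jobs only when the console has been idle for 15 minutes
      # and the machine is not otherwise busy
      START        = (KeyboardIdle > 15 * $(MINUTE)) && (LoadAvg < 0.3)
      # Do not bother suspending; evaluate PREEMPT instead
      WANT_SUSPEND = False
      # Get out of the way as soon as the owner returns
      PREEMPT      = (KeyboardIdle < $(MINUTE))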

  14. BoilerGrid - Staffing • Implementing BoilerGrid required minimal staff effort • Assuming an IT infrastructure already exists that can operate many machines • .25 FTE ongoing to maintain Condor and coordinate with distributed Condor installations • With success comes more demand, and the end-user support to go along with it • 1.5 science support consultants assist with porting codes and training users to effectively use Condor

  15. BoilerGrid - Future Work • TeraGrid (NSF HPCOPS) - Portal for submission and monitoring of Condor jobs • Centralized Quill database for job and machine state • Excellent source of data for future research in distributed systems

  16. BoilerGrid - Results

  17. BoilerGrid - Results

  18. BoilerGrid - Conclusions • Condor is a powerful tool for getting real science done on otherwise unused hardware • Questions? • http://www.rcac.purdue.edu/boilergrid
