Presentation Transcript


  1. PCGRID ’08 Workshop, Miami, FL, April 18, 2008. Preston Smith <psmith@purdue.edu>. Implementing an Industrial-Strength Academic Cyberinfrastructure at Purdue University

  2. BoilerGrid • Introduction • Environment • Motivation • Challenges • Infrastructure • Usage Tracking • Storage • Staffing • Future Work • Results

  3. BoilerGrid - Growth • How did we get from here… to here?

  4. BoilerGrid - Rosen Center for Advanced Computing • Research Computing arm of ITaP - Information Technology at Purdue • Clusters in RCAC are arranged in larger “Community Clusters” • One cluster, one configuration, many owners • Leverages economies of scale for purchasing, and provides expertise in systems engineering, user support, and networking

  5. BoilerGrid - Motivation • Early on, we recognized that the diverse owners of the community clusters don’t use their machines at 100% capacity • Community clusters used approximately 70% of capacity • Condor installed on the community clusters to cycle-scavenge from PBS, the primary scheduler • Goal: provide a general-purpose high-throughput computing resource on existing hardware
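
  As a concrete illustration of the cycle-scavenging setup (a minimal sketch, not necessarily the exact mechanism BoilerGrid uses), a periodic startd probe can advertise whether PBS owns the node, and the Condor start/preempt policy can key off that attribute. The probe path and the PBS_Busy attribute name here are hypothetical:

      # condor_config sketch: a STARTD_CRON probe prints "PBS_Busy = True|False"
      STARTD_CRON_JOBLIST = PBSCHECK
      STARTD_CRON_PBSCHECK_EXECUTABLE = /usr/local/libexec/pbs_busy_probe
      STARTD_CRON_PBSCHECK_PERIOD = 2m

      # Start guest jobs only on nodes PBS is not using; evict when PBS comes back
      # (if the probe has not reported yet, PBS_Busy is undefined and START stays False)
      START   = (PBS_Busy =?= False)
      PREEMPT = (PBS_Busy =?= True)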

  6. BoilerGrid - Challenges • In 2005, the Condor deployment at Purdue was unable to scale to the size of the clusters, and ran on an old version of the software • An overhaul of the Condor infrastructure was needed!

  7. BoilerGrid - Keep Condor Up-to-date • Upgrading Condor • In late 2005, we were running Condor version 6.6.5, which was 1.5 years old. • First, we needed to upgrade! • In a large, busy Condor grid, we found it’s usually advantageous to run the development release of Condor • Early access to new features, scalability improvements

  8. BoilerGrid - Pool Design • Use many machines • In 2005, we ran a single Condor pool with ~1800 machines. • In 2005, the largest single Condor pools in existence were ~1000 machines. • We implemented BoilerGrid as a flock of 4 pools, of up to 1200 machines each. • Implementing BoilerGrid today? • Would have looked much different!
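
  For reference, flocking between pools is a small piece of configuration: submit hosts list the peer central managers they may overflow to, and each central manager lists the submit hosts it will accept flocked jobs from. A minimal sketch, with hypothetical host names standing in for the real BoilerGrid machines:

      # Submit-host side: if the local pool cannot run the job, try these pools
      FLOCK_TO = cm-pool2.rcac.purdue.edu, cm-pool3.rcac.purdue.edu, cm-pool4.rcac.purdue.edu

      # Central-manager side: submit hosts allowed to flock in
      # (these hosts also need write access to the collector and startds)
      FLOCK_FROM = submit-a.rcac.purdue.edu, submit-b.rcac.purdue.edu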

  9. BoilerGrid - Submit Hosts • Many submit hosts • In 2005, a single host ran the Condor schedd and could submit jobs • Today, any RCAC machine used for user login, and in many cases end-user desktops, can submit Condor jobs
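
  From any of those submit hosts, work is queued with an ordinary submit description file; a minimal vanilla-universe example (the executable and file names are illustrative):

      # sweep.sub -- queue 100 instances of a serial code
      universe    = vanilla
      executable  = my_analysis
      arguments   = input.$(Process)
      output      = out.$(Process)
      error       = err.$(Process)
      log         = sweep.log
      queue 100

  and submitted with "condor_submit sweep.sub" from a login node or an enabled desktop.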

  10. BoilerGrid - Challenges • Usage Tracking • Tracking job-level accounting with a large Condor pool is difficult • Job history resides on every submit host • Recent versions of Condor’s Quill software allow for a central database holding job (and machine) information • Deploying this on BoilerGrid now
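
  To see why per-host history is awkward: without a central database, an accounting query like the sketch below has to run on every submit host and the output merged by hand, whereas Quill collects the same job ClassAds into one central database. (JobStatus, Owner, and RemoteWallClockTime are standard job attributes; the merging step is omitted.)

      # On each submit host: owner and wall-clock seconds of completed jobs
      condor_history -constraint 'JobStatus == 4' \
                     -format '%s ' Owner -format '%f\n' RemoteWallClockTime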

  11. BoilerGrid - Storage • If your users expect to run jobs using a shared filesystem, a large Condor installation can overwhelm NFS servers. • DAGMan and user logs on NFS can cause problems • The defaults don’t allow this for a reason! • Train users to rely less on the shared filesystem and take advantage of Condor’s ability to transfer files
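
  The submit-file changes involved are small; a sketch of moving a job off NFS and onto Condor’s file transfer (the file names and log path are illustrative):

      # Stage input and output through the execute node's local scratch
      # instead of reading and writing over NFS
      should_transfer_files   = YES
      when_to_transfer_output = ON_EXIT
      transfer_input_files    = params.dat, reference.db
      # Keep the user log (which DAGMan also watches) on local disk, not NFS
      log = /var/tmp/$ENV(USER)/sweep.log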

  12. BoilerGrid - Expansion • Successful use of Condor in clusters led us to identify partners around campus • Student computer labs operated by a sister unit in ITaP (2500 machines and growing) • Library terminals (200 machines) • Other campuses (500+ machines) • Management support is critical! • Purdue’s CIO supports using Condor on many machines run by ITaP, including the one on his own desk

  13. BoilerGrid - Expansion • An even better route of expansion • Condor users adding their own resources • Machines in their own lab • All the machines in their department • With distributed ownership comes new challenges • Regular contact with owners’ system administration staff • Ensure that owners are able to set their own policies
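
  An owner’s policy usually amounts to a few startd expressions; an illustrative sketch (the thresholds are examples, not BoilerGrid’s actual settings):

      # Run guest jobs only when the console has been idle for 15 minutes
      # and the machine is not otherwise busy
      START        = (KeyboardIdle > 15 * $(MINUTE)) && (LoadAvg < 0.3)
      # Do not bother suspending; evaluate PREEMPT instead
      WANT_SUSPEND = False
      # Get out of the way as soon as the owner returns
      PREEMPT      = (KeyboardIdle < $(MINUTE))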

  14. BoilerGrid - Staffing • Implementing BoilerGrid required minimal staff effort • Assuming an IT infrastructure already exists that can operate many machines • .25 FTE ongoing to maintain Condor and coordinate with distributed Condor installations • With success comes more demand, and the end-user support to go along with it • 1.5 science support consultants assist with porting codes and training users to effectively use Condor

  15. BoilerGrid - Future Work • TeraGrid (NSF HPCOPS) - Portal for submission and monitoring of Condor jobs • Centralized Quill database for job and machine state • Excellent source of data for future research in distributed systems

  16. BoilerGrid - Results

  17. BoilerGrid - Results

  18. BoilerGrid - Conclusions • Condor is a powerful tool for getting real science done on otherwise unused hardware • Questions? • http://www.rcac.purdue.edu/boilergrid
