
Instrumenting a Campus Grid with Condor Quill Preston Smith Purdue University



Presentation Transcript


  1. Instrumenting a Campus Grid with Condor Quill. Preston Smith, Purdue University

  2. What is Condor Quill?
     • A technology to store a read-only version of a Condor system’s state:
       • Job queue, job history, pool events, and both current and historical views of a pool’s machine state information in a relational database

  3. Why Quill?
     • An operational Condor grid is awash in data
       • Machine information
         • Load
         • State (Claimed, Idle, Owner)
       • Job information
         • History
         • Current queue state
         • File transfers
     • A central DBMS can help capture, organize, manage, archive, and query this data.
     • Offloads query overhead from the condor_schedd
     • A performance boost for the end user
     • Easier to mine pool data into end-user tools and portals

  4. Quill Architecture
     [Diagram. Daemons: Master, …, Startd, Schedd, Quilld. Data stores: job queue log, event logs, RDBMS (queue, history, machine, match, etc.). Arrow labels: "Write events", "Get new events", "Store events".]

  5. BoilerGrid Condor Pool
     • Purdue’s campus grid: 14,000+ cores
     • Rosen Center Linux clusters, student labs (Windows), departments and colleges from around the University
     • Condor is a TeraGrid resource!
     • Current work at the Purdue RP is developing tools for improved access to Condor.
     • A heavily used, distributed resource of this scale presents challenges in managing all of the system’s metadata.

  6. Growth

  7. Quill on BoilerGrid - Outcomes
     • Provide research data for work on the state of large, distributed systems
       • Several groups at Purdue, Notre Dame
       • Failure analysis and prediction, smart scheduling, accounting reporting for machine owners
     • Mining job “events” table for user troubleshooting
     • Centralize usage information
       • Users wish to submit jobs from their desk
       • Condor’s distributed nature makes usage a challenge to track: how can we access the job history to get a complete picture of BoilerGrid’s usage?

  8. Usage Reporting

     quill=> select distinct scheddname,owner,cluster_id,proc_id,remotewallclocktime from jobs_horizontal_history where scheddname LIKE '%bio.purdue.edu%' LIMIT 10;
            scheddname       |  owner  | cluster_id | proc_id | remotewallclocktime
     ------------------------+---------+------------+---------+--------------------
      epsilon.bio.purdue.edu | jiang12 |     276189 |       0 |                 345
      epsilon.bio.purdue.edu | jiang12 |     280668 |       0 |                4456
      epsilon.bio.purdue.edu | jiang12 |     280707 |       0 |                1209
      epsilon.bio.purdue.edu | jiang12 |     280710 |       0 |                1197
      epsilon.bio.purdue.edu | jiang12 |     280715 |       0 |                1064
      epsilon.bio.purdue.edu | jiang12 |     280717 |       0 |                 567
      epsilon.bio.purdue.edu | jiang12 |     280718 |       0 |                 485
      epsilon.bio.purdue.edu | jiang12 |     280720 |       0 |                 480
      epsilon.bio.purdue.edu | jiang12 |     280721 |       0 |                 509
      epsilon.bio.purdue.edu | jiang12 |     280722 |       0 |                 539
     (10 rows)
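A per-job listing like the one on this slide can be rolled up into a per-owner usage summary. The following is a minimal sketch, not the BoilerGrid reporting code: it uses Python's built-in sqlite3 module with an in-memory table as a stand-in for Quill's PostgreSQL database, and it models only the five columns that appear in the slide's query. The function name `usage_by_owner` and the `submit.example.edu` row are hypothetical, added only for illustration. Condor reports RemoteWallClockTime in seconds.

```python
import sqlite3

# Stand-in for Quill's PostgreSQL database: an in-memory SQLite table that
# reuses the table and column names from the slide. Only the columns used
# in the slide's query are modelled (assumption; the real table is wider).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE jobs_horizontal_history (
           scheddname TEXT, owner TEXT, cluster_id INTEGER,
           proc_id INTEGER, remotewallclocktime INTEGER)"""
)
rows = [
    # First three rows of the slide's output.
    ("epsilon.bio.purdue.edu", "jiang12", 276189, 0, 345),
    ("epsilon.bio.purdue.edu", "jiang12", 280668, 0, 4456),
    ("epsilon.bio.purdue.edu", "jiang12", 280707, 0, 1209),
    # Hypothetical row from another submit host, to show the filter working.
    ("submit.example.edu", "alice", 1, 0, 100),
]
conn.executemany(
    "INSERT INTO jobs_horizontal_history VALUES (?, ?, ?, ?, ?)", rows
)


def usage_by_owner(conn, schedd_pattern):
    """Return (scheddname, owner, job_count, total_wallclock_seconds) rows
    for every schedd whose name matches the SQL LIKE pattern."""
    query = """
        SELECT scheddname, owner, COUNT(*), SUM(remotewallclocktime)
        FROM jobs_horizontal_history
        WHERE scheddname LIKE ?
        GROUP BY scheddname, owner
        ORDER BY scheddname, owner
    """
    return conn.execute(query, (schedd_pattern,)).fetchall()


usage = usage_by_owner(conn, "%bio.purdue.edu%")
for scheddname, owner, jobs, seconds in usage:
    print(f"{scheddname} {owner}: {jobs} jobs, {seconds} s wall-clock")
```

Because every schedd's history lands in one database, a single GROUP BY covers jobs submitted from any desk, which is the centralization the previous slide asks for.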

  9. Job Event Reporting Tools

     -bash-3.00$ quill-events.pl 306.91
     Full Jobid                          Event Time
       Description
     ----------------------------------------------------------------
     306.91.tg-condor.rcac.purdue.edu    2008-05-30 06:41:31-04
       Job was suspended (Number of processes actually suspended: 2)
     306.91.tg-condor.rcac.purdue.edu    2008-05-30 06:41:36-04
       Job was unsuspended
     306.91.tg-condor.rcac.purdue.edu    2008-05-30 08:58:24-04
       Job was suspended (Number of processes actually suspended: 2)
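The report on this slide pairs a full job ID (cluster, proc, and schedd name) with each event's time and description. The sketch below shows one way such a report could be assembled in Python once the event rows have been fetched. It is not the source of quill-events.pl, which the slides do not show. The `JobEvent` record layout and all function names are assumptions, and the event for job 306.92 is made up to show the filtering step.

```python
from collections import namedtuple

# Hypothetical record layout; the slides do not show the events table schema.
JobEvent = namedtuple(
    "JobEvent", "scheddname cluster_id proc_id eventtime description"
)


def full_jobid(ev):
    """Build the 'cluster.proc.scheddname' identifier used in the report."""
    return f"{ev.cluster_id}.{ev.proc_id}.{ev.scheddname}"


def events_for_job(events, cluster_id, proc_id):
    """Select one job's events, oldest first.

    Timestamps are compared as strings, which is only correct when they
    share one format and UTC offset, as they do in the slide's output.
    """
    matched = [
        e for e in events
        if e.cluster_id == cluster_id and e.proc_id == proc_id
    ]
    return sorted(matched, key=lambda e: e.eventtime)


def format_report(events):
    """Render two lines per event: job ID with time, then the description."""
    lines = []
    for ev in events:
        lines.append(f"{full_jobid(ev)} {ev.eventtime}")
        lines.append(f"  {ev.description}")
    return "\n".join(lines)


SCHEDD = "tg-condor.rcac.purdue.edu"
SUSPENDED = "Job was suspended (Number of processes actually suspended: 2)"
events = [
    JobEvent(SCHEDD, 306, 91, "2008-05-30 08:58:24-04", SUSPENDED),
    JobEvent(SCHEDD, 306, 91, "2008-05-30 06:41:36-04", "Job was unsuspended"),
    JobEvent(SCHEDD, 306, 91, "2008-05-30 06:41:31-04", SUSPENDED),
    # Made-up event for a different job, which the filter should drop.
    JobEvent(SCHEDD, 306, 92, "2008-05-30 07:00:00-04", "Job was unsuspended"),
]

selected = events_for_job(events, 306, 91)
report = format_report(selected)
print(report)
```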

  10. Web Portals, Detailed Job Information
      [Screenshot. Panel titles: Classad Info, Run Info, Event Info, Match Info, Rejects Info.]

  11. Further Information
      • BoilerGrid: http://www.rcac.purdue.edu/boilergrid
      • Rosen Center for Advanced Computing: http://www.rcac.purdue.edu
      • The Condor Project: http://www.cs.wisc.edu/condor
      • Contact: Preston Smith - psmith@purdue.edu
