Instrumenting a Campus Grid with Condor Quill
Preston Smith
Purdue University
What is Condor Quill?
• A technology to store a read-only version of a Condor system’s state:
  • Job queue, job history, pool events, and both current and historical views of a pool’s machine state information in a relational database
Why Quill?
• An operational Condor grid is awash in data
  • Machine information
    • Load
    • State (Claimed, Idle, Owner)
  • Job information
    • History
    • Current queue state
  • File transfers
• A central DBMS can help capture, organize, manage, archive, and query this data.
• Offloads query overhead from the condor_schedd
• A performance boost for the end user
• Easier to mine pool data into end-user tools and portals
Quill Architecture
[Diagram. Components: Master, …, Startd, Schedd, Quilld, RDBMS (Queue, History, Machine, Match, etc.). Data sources: job queue log, event logs. Arrow labels: write events, get new events, store events.]
BoilerGrid Condor Pool
• Purdue’s campus grid: 14,000+ cores
  • Rosen Center Linux clusters, student labs (Windows), departments and colleges from around the University
• Condor is a TeraGrid resource!
  • Current work at the Purdue RP is developing tools for improved access to Condor.
• A heavily used, distributed resource of this scale presents challenges in managing all of the system’s metadata.
Quill on BoilerGrid - Outcomes
• Provide research data for work on the state of large, distributed systems
  • Several groups at Purdue, Notre Dame
  • Failure analysis and prediction, smart scheduling, accounting reporting for machine owners
• Mining the job “events” table for user troubleshooting
• Centralize usage information
  • Users wish to submit jobs from their desks
  • Condor’s distributed nature makes usage a challenge to track: how can we access that job history to get a complete picture of BoilerGrid’s usage?
Usage Reporting
quill=> select distinct scheddname,owner,cluster_id,proc_id,remotewallclocktime from jobs_horizontal_history where scheddname LIKE '%bio.purdue.edu%' LIMIT 10;
       scheddname       |  owner  | cluster_id | proc_id | remotewallclocktime
------------------------+---------+------------+---------+---------------------
 epsilon.bio.purdue.edu | jiang12 |     276189 |       0 |                 345
 epsilon.bio.purdue.edu | jiang12 |     280668 |       0 |                4456
 epsilon.bio.purdue.edu | jiang12 |     280707 |       0 |                1209
 epsilon.bio.purdue.edu | jiang12 |     280710 |       0 |                1197
 epsilon.bio.purdue.edu | jiang12 |     280715 |       0 |                1064
 epsilon.bio.purdue.edu | jiang12 |     280717 |       0 |                 567
 epsilon.bio.purdue.edu | jiang12 |     280718 |       0 |                 485
 epsilon.bio.purdue.edu | jiang12 |     280720 |       0 |                 480
 epsilon.bio.purdue.edu | jiang12 |     280721 |       0 |                 509
 epsilon.bio.purdue.edu | jiang12 |     280722 |       0 |                 539
(10 rows)
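A query like this one extends naturally to per-owner usage totals. The following is a minimal sketch in Python, not the tooling used on BoilerGrid: it assumes an in-memory SQLite table as a stand-in for Quill's PostgreSQL `jobs_horizontal_history` table, models only the five columns shown in the output, and uses a hypothetical helper name, `usage_by_owner`. The rows are the ten from the query output, and `remotewallclocktime` is treated as seconds.

```python
import sqlite3

# Rows copied from the query output on this slide:
# (scheddname, owner, cluster_id, proc_id, remotewallclocktime in seconds)
ROWS = [
    ("epsilon.bio.purdue.edu", "jiang12", 276189, 0, 345),
    ("epsilon.bio.purdue.edu", "jiang12", 280668, 0, 4456),
    ("epsilon.bio.purdue.edu", "jiang12", 280707, 0, 1209),
    ("epsilon.bio.purdue.edu", "jiang12", 280710, 0, 1197),
    ("epsilon.bio.purdue.edu", "jiang12", 280715, 0, 1064),
    ("epsilon.bio.purdue.edu", "jiang12", 280717, 0, 567),
    ("epsilon.bio.purdue.edu", "jiang12", 280718, 0, 485),
    ("epsilon.bio.purdue.edu", "jiang12", 280720, 0, 480),
    ("epsilon.bio.purdue.edu", "jiang12", 280721, 0, 509),
    ("epsilon.bio.purdue.edu", "jiang12", 280722, 0, 539),
]


def build_demo_db():
    """Create an in-memory stand-in for the Quill history table (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs_horizontal_history ("
        "scheddname TEXT, owner TEXT, cluster_id INTEGER, "
        "proc_id INTEGER, remotewallclocktime INTEGER)"
    )
    conn.executemany(
        "INSERT INTO jobs_horizontal_history VALUES (?, ?, ?, ?, ?)", ROWS
    )
    return conn


def usage_by_owner(conn, schedd_pattern):
    """Return (scheddname, owner, job count, total wall-clock seconds) rows."""
    cur = conn.execute(
        "SELECT scheddname, owner, COUNT(*), SUM(remotewallclocktime) "
        "FROM jobs_horizontal_history "
        "WHERE scheddname LIKE ? "
        "GROUP BY scheddname, owner "
        "ORDER BY scheddname, owner",
        (schedd_pattern,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for schedd, owner, jobs, seconds in usage_by_owner(conn, "%bio.purdue.edu%"):
        print(f"{schedd} {owner}: {jobs} jobs, {seconds / 3600:.2f} wall-clock hours")
```

Passing the pattern as a bound parameter rather than formatting it into the SQL string keeps the sketch safe to reuse with user-supplied schedd names.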
Job Event Reporting Tools
-bash-3.00$ quill-events.pl 306.91
Full Jobid                          Event Time
  Description
----------------------------------------------------------------
306.91.tg-condor.rcac.purdue.edu    2008-05-30 06:41:31-04
  Job was suspended (Number of processes actually suspended: 2)
306.91.tg-condor.rcac.purdue.edu    2008-05-30 06:41:36-04
  Job was unsuspended
306.91.tg-condor.rcac.purdue.edu    2008-05-30 08:58:24-04
  Job was suspended (Number of processes actually suspended: 2)
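Event rows like these can also be post-processed for troubleshooting, for example to total how long a job spent suspended. The following is a minimal sketch in Python under stated assumptions: events arrive as (timestamp, description) pairs in time order, the function name `suspended_time` and the tuple layout are hypothetical, and nothing here reflects how quill-events.pl or the Quill schema actually works. The sample events are the three from the output on this slide, with the `-04` offset dropped because all three share it.

```python
from datetime import datetime, timedelta

# Events for job 306.91, copied from the slide output (timezone offset dropped).
EVENTS = [
    ("2008-05-30 06:41:31",
     "Job was suspended (Number of processes actually suspended: 2)"),
    ("2008-05-30 06:41:36", "Job was unsuspended"),
    ("2008-05-30 08:58:24",
     "Job was suspended (Number of processes actually suspended: 2)"),
]


def suspended_time(events):
    """Sum completed suspend/unsuspend intervals.

    Returns (total suspended time, whether the job is still suspended
    after the last event).
    """
    total = timedelta(0)
    suspended_at = None  # start of the currently open suspension, if any
    for stamp, description in events:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if description.startswith("Job was suspended"):
            if suspended_at is None:
                suspended_at = when
        elif description.startswith("Job was unsuspended"):
            if suspended_at is not None:
                total += when - suspended_at
                suspended_at = None
    return total, suspended_at is not None


if __name__ == "__main__":
    total, still_suspended = suspended_time(EVENTS)
    print(f"Suspended for {total.total_seconds():.0f} s in completed intervals")
    print(f"Still suspended after last event: {still_suspended}")
```

An open suspension at the end of the list is reported separately rather than counted, since its duration is unknown until a matching unsuspend event appears.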
Web Portals, Detailed Job Information
[Screenshot. Portal panels: Classad Info, Run Info, Event Info, Match Info, Rejects Info.]
Further Information
• BoilerGrid
  • http://www.rcac.purdue.edu/boilergrid
• Rosen Center for Advanced Computing
  • http://www.rcac.purdue.edu
• The Condor Project
  • http://www.cs.wisc.edu/condor
• Contact: Preston Smith - psmith@purdue.edu