Quill / Quill++ Tutorial: European Condor Week, June 2006, INFN Milan, Italy
What is Quill? A non-invasive method of storing a read-only copy of the job queue and historical job data in a relational database.
Why Do We Need It? • Presents the job queue information as a set of tables in a relational database (Big Win!) • Fault tolerance • Provides performance enhancements in very large and busy pools
Job Queue Management [Figure: two panels, "Without Quill" and "With Quill". Labels: schedd, job queue, quilld, database.]
Deployment • One Quill daemon per schedd • Quill daemons must be uniquely named • Each Quill daemon uses a unique DB name • Multiple Quill daemons may utilize one database server • Currently uses PostgreSQL • Recommend PostgreSQL 8.1 or later for automatic vacuuming of tables
Condor’s Interface to Quill • Modified two tools to utilize the DB • condor_q • condor_history • Very minor modifications to schedd • Multiple sources for Job Queue & History pose an interesting problem
Job Queue Discovery Sequence (Local Query) [Figure: condor_q, database, quilld, and schedd with its job queue; arrows numbered 1 to 3.]
Job Queue Discovery Sequence (Remote Query) [Figure: condor_q, collector, database, quilld, and schedd with its job queue; arrows numbered 0 to 3.]
A User Perspective: condor_q • condor_q changes • -name takes a ScheddName or QuillName • -avgqueuetime details average time in queue for all jobs
A User Perspective: condor_q
Example: condor_q -name
Linux merlin > condor_q -name psilord_quilld@merlin.cs
-- DB: psilord_quilld@merlin.cs : <merlin.cs.wisc.edu:42999> : psilord_db
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 92.0    psilord   4/21 09:21  0+00:00:00 I  0   9.8  foo
1 jobs; 1 idle, 0 running, 0 held
A User Perspective
Example: condor_q -avgqueuetime
Linux merlin > condor_q -avgqueuetime
-- DB: psilord_quilld@merlin.cs : <merlin.cs.wisc.edu:42999> : psilord_db
Average time in queue for uncompleted jobs (in hh:mm:ss)
00:40:47.011993
Job History Discovery Sequence (Local Query) • The quilld is never queried directly! [Figure: condor_history, database, quilld, job queue, and history file; arrows numbered 1 and 2.]
Job History Discovery (Remote Query) • NEW! • The quilld is never queried directly! [Figure: condor_history, collector, database, quilld, job queue, and history file; arrows numbered 0 and 1.]
A User Perspective: condor_history • condor_history changes • -name takes a Quill Name to retrieve job histories from a remote quill’s database • -completedsince returns all jobs completed since a PostgreSQL formatted date
A User Perspective: condor_history
Example: condor_history -name
Linux merlin > condor_history -name psilord_quilld@merlin.cs
-- DB: psilord_quilld@merlin.cs : <merlin.cs.wisc.edu:42999> : psilord_db
 ID      OWNER     SUBMITTED     RUN_TIME ST COMPLETED   CMD
 91.0    psilord   4/20 14:23  0+00:00:00 X  ???         /scratch/psilor
 92.0    psilord   4/21 09:21  0+00:00:00 X  ???         /scratch/psilor
 93.0    psilord   4/21 10:12  0+00:00:01 C  4/21 10:12  /scratch/psilor
A User Perspective: condor_history
Example: condor_history -completedsince
Linux merlin > condor_history -completedsince "2006-01-01 00:00:01"
-- DB: psilord_quilld@merlin.cs : <merlin.cs.wisc.edu:42999> : psilord_db
 ID      OWNER     SUBMITTED     RUN_TIME ST COMPLETED   CMD
 93.0    psilord   4/21 10:12  0+00:00:01 C  4/21 10:12  /scratch/psilor
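The timestamp argument can be generated rather than typed by hand. A minimal sketch, assuming the "YYYY-MM-DD HH:MM:SS" form used in the example above is what -completedsince expects; the helper name completedsince_args is hypothetical, and the script only builds the argument list without running condor_history.

```python
from datetime import datetime
from typing import List


def completedsince_args(since: datetime) -> List[str]:
    """Build an argument list for `condor_history -completedsince`.

    Hypothetical helper: formats the timestamp like the slide's example.
    """
    stamp = since.strftime("%Y-%m-%d %H:%M:%S")
    return ["condor_history", "-completedsince", stamp]


# Same cutoff as the example on the slide.
print(completedsince_args(datetime(2006, 1, 1, 0, 0, 1)))
```

The list can be handed to subprocess.run on a machine where condor_history is installed.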
Short-Circuiting the Discovery Sequence • Use the -direct option! • Examples • condor_q -direct rdbms • condor_q -direct quilld • condor_q -direct schedd • “rdbms”, “quilld”, and “schedd” are the actual parameters. • Invaluable for debugging!
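The idea of a discovery sequence with a short circuit can be modeled as a fallback loop. This is a toy model, not Condor's code: the slides number the steps but do not spell out which source is tried first, so the order below (rdbms, quilld, schedd) is an assumption taken from the order in which the -direct values are listed, and the function and the stand-in sources are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed order; the slides do not state it explicitly.
DEFAULT_ORDER = ["rdbms", "quilld", "schedd"]

Source = Callable[[], Optional[List[str]]]


def query_job_queue(sources: Dict[str, Source],
                    direct: Optional[str] = None) -> Tuple[str, List[str]]:
    """Return (source_name, rows) from the first source that answers.

    Each source is a zero-argument callable that returns rows, or None when
    it is unavailable. With `direct` set, only that one source is tried.
    """
    order = [direct] if direct else DEFAULT_ORDER
    for name in order:
        rows = sources[name]()
        if rows is not None:
            return name, rows
    raise RuntimeError("no source answered: " + ", ".join(order))


# Stand-in sources: the database and quilld are down, the schedd answers.
demo_sources = {
    "rdbms": lambda: None,
    "quilld": lambda: None,
    "schedd": lambda: ["92.0 psilord"],
}
print(query_job_queue(demo_sources))
```

Passing direct="rdbms" to the same call raises an error instead of falling through to the schedd, which is what makes the option useful for debugging a single source.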
PostgreSQL 8.1 Installation • ./configure • gmake && gmake install • mkdir /path/to/pgsql/data • initdb -D /path/to/pgsql/data • postmaster -D /path/to/pgsql/data • Note: Default port binding is 5432.
PostgreSQL Configuration • Add two special user accounts: quillreader and quillwriter • createuser quillreader --no-createdb --no-adduser --pwprompt • createuser quillwriter --createdb --no-adduser --pwprompt
PostgreSQL Configuration (cont) • Allow TCP/IP connections • Edit file postgresql.conf • Add listen_addresses = '*' • Allow connections from specific hosts • Edit file pg_hba.conf • host all quillreader 128.105.0.0 255.255.0.0 password • host all quillwriter 128.105.0.0 255.255.0.0 password • Note: only use ‘password’ authentication at this time.
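Each pg_hba.conf line above pairs a network address with a netmask. A minimal sketch of which clients such a pair covers, using Python's standard ipaddress module; the helper name host_allowed is hypothetical, and this only reproduces the address arithmetic, not PostgreSQL's full matching rules (database, user, and method are ignored).

```python
import ipaddress


def host_allowed(client_ip: str, address: str, mask: str) -> bool:
    """Return True if client_ip falls inside the address/mask pair."""
    network = ipaddress.ip_network(f"{address}/{mask}")
    return ipaddress.ip_address(client_ip) in network


# The pair from the slide covers every host in 128.105.x.x.
print(host_allowed("128.105.14.2", "128.105.0.0", "255.255.0.0"))
print(host_allowed("10.0.0.1", "128.105.0.0", "255.255.0.0"))
```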
Quill Configuration • User quillwriter needs a write password. • Store it in a file called .quillwritepassword in the $(SPOOL) directory. • Ensure only the condor uid can read it if Condor is running as root
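Creating the password file with restrictive permissions from the start avoids a window in which it is world-readable. A minimal sketch for a POSIX system: the file name comes from the slide, while the helper name, the temporary directory standing in for $(SPOOL), and the example password are assumptions. Changing the owner to the condor uid (os.chown, which needs root) is left out.

```python
import os
import stat
import tempfile


def write_password_file(spool_dir: str, password: str) -> str:
    """Create .quillwritepassword readable and writable by its owner only."""
    path = os.path.join(spool_dir, ".quillwritepassword")
    # O_EXCL refuses to overwrite an existing file; 0o600 is owner read/write.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(password)
    return path


# Demo in a throwaway directory standing in for $(SPOOL).
with tempfile.TemporaryDirectory() as spool:
    created = write_password_file(spool, "example-password")
    print(oct(stat.S_IMODE(os.stat(created).st_mode)))
```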
Quill Configuration (cont) • Condor system specific attributes in file condor_config.local • QUILL = $(SBIN)/condor_quill • QUILL_LOG = $(LOG)/QuillLog • QUILL_ADDRESS_FILE = $(LOG)/.quill_address • DAEMON_LIST = …, QUILL • VALID_SPOOL_FILES = …, .quillwritepassword • DC_DAEMON_LIST = …, QUILL
Quill Configuration (cont) • Quill specific attributes
QUILL_ENABLED = TRUE
# The quill name must be unique across all
# quill daemons AND schedds
QUILL_NAME = psilord_quilld@merlin.cs.wisc.edu
QUILL_DB_NAME = psilord_db
QUILL_DB_IP_ADDR = merlin.cs.wisc.edu:5432
QUILL_POLLING_PERIOD = 10 (seconds)
Quill Configuration (cont) • QUILL_HISTORY_CLEANING_INTERVAL = 24 (hours) • QUILL_HISTORY_DURATION = 30 (days) • QUILL_MANAGE_VACUUM = FALSE • QUILL_IS_REMOTELY_QUERYABLE = TRUE • QUILL_DB_QUERY_PASSWD = xxx
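To make the units on these settings concrete, here is a small sketch that reads a few of the values and converts them into a host, a port, and time spans. It is a simplified reader written for this example, not Condor's configuration parser (it does not expand macros such as $(LOG)); the function names are hypothetical, and the sample omits the "(seconds)", "(hours)", and "(days)" annotations, which are treated here as slide notes rather than part of the values.

```python
from datetime import timedelta

SAMPLE = """\
# Values copied from the slides, without the unit annotations.
QUILL_DB_NAME = psilord_db
QUILL_DB_IP_ADDR = merlin.cs.wisc.edu:5432
QUILL_POLLING_PERIOD = 10
QUILL_HISTORY_CLEANING_INTERVAL = 24
QUILL_HISTORY_DURATION = 30
"""


def parse_settings(text):
    """Collect NAME = value pairs, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def interpret(settings):
    """Apply the units given on the slides: seconds, hours, and days."""
    host, port = settings["QUILL_DB_IP_ADDR"].rsplit(":", 1)
    return {
        "db_host": host,
        "db_port": int(port),
        "polling_period": timedelta(seconds=int(settings["QUILL_POLLING_PERIOD"])),
        "cleaning_interval": timedelta(hours=int(settings["QUILL_HISTORY_CLEANING_INTERVAL"])),
        "history_duration": timedelta(days=int(settings["QUILL_HISTORY_DURATION"])),
    }


print(interpret(parse_settings(SAMPLE)))
```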
DB Storage Method • Schema designed to store and query classads • 4 tables to represent the job queue classads • 2 for history data • 1 for metadata • Some queries are easier than others • Ask more questions at the BOF!
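Why some queries are easier than others can be seen with a toy attribute/value layout. This sketch is an illustration only: the table name, columns, and rows are hypothetical and are not Quill's actual schema, and SQLite (from Python's standard library) stands in for PostgreSQL. It relies on Condor's convention that JobStatus 1 means idle and 2 means running.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical layout: one row per (job, attribute) pair.
    CREATE TABLE job_attrs (cluster_id INTEGER, proc_id INTEGER,
                            attr TEXT, val TEXT);
    INSERT INTO job_attrs VALUES
        (92, 0, 'Owner', 'psilord'), (92, 0, 'JobStatus', '1'),
        (93, 0, 'Owner', 'psilord'), (93, 0, 'JobStatus', '2'),
        (94, 0, 'Owner', 'alice'),   (94, 0, 'JobStatus', '1');
""")

# Easier: a filter on one attribute is a single WHERE clause.
psilord_jobs = conn.execute("""
    SELECT cluster_id, proc_id FROM job_attrs
    WHERE attr = 'Owner' AND val = 'psilord'
    ORDER BY cluster_id
""").fetchall()

# Harder: a filter on two attributes of the same job needs a self-join.
idle_psilord_jobs = conn.execute("""
    SELECT o.cluster_id, o.proc_id
    FROM job_attrs AS o
    JOIN job_attrs AS s
      ON s.cluster_id = o.cluster_id AND s.proc_id = o.proc_id
    WHERE o.attr = 'Owner' AND o.val = 'psilord'
      AND s.attr = 'JobStatus' AND s.val = '1'
    ORDER BY o.cluster_id
""").fetchall()

print(psilord_jobs, idle_psilord_jobs)
```

Each additional attribute in a condition adds another join, which is the trade-off of storing arbitrary classad attributes as rows.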
Quill++ • More comprehensive than Quill (data from all daemons, not just SchedD) • Built on Quill code base • Condor daemons write to SQL logs, Quill daemon reads and inserts in DBMS • Central database serves entire pool • Web-based query GUI
Data Capture in Quill++ • Condor daemons augmented to record important events in a database • Database is in addition to standard daemon logs • Pool will run unaffected even in the absence of a database [Figure: daemons on a machine (schedd, shadow, startd, starter, negotiator) recording to a database.]
Quill++ Architecture [Figure: the master with its daemons (…, startd, schedd) and Quill++. Arrow labels: "Write events" to the job queue log and event logs, "Get new events", and "Store events" into the RDBMS (Queue, History, Machine, Match, etc.).]
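The "get new events, store events" loop in the architecture can be sketched as reading a log from a remembered offset. This is a toy model, not Quill++ code: the tab-separated line format, the events table, and the function names are assumptions, and SQLite stands in for the RDBMS.

```python
import os
import sqlite3
import tempfile


def make_store():
    """Create an in-memory stand-in for the RDBMS with one events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (daemon TEXT, event TEXT)")
    return conn


def load_new_events(log_path, conn, offset):
    """Insert events written after `offset` and return the new offset."""
    with open(log_path, "rb") as log:
        log.seek(offset)
        while True:
            raw = log.readline()
            if not raw.endswith(b"\n"):
                break  # end of file, or a line still being written
            daemon, event = raw.decode("utf-8").rstrip("\n").split("\t", 1)
            conn.execute("INSERT INTO events VALUES (?, ?)", (daemon, event))
            offset = log.tell()
    conn.commit()
    return offset


# Demo: a daemon appends to the log between two polls.
demo_dir = tempfile.mkdtemp()
demo_log = os.path.join(demo_dir, "events.log")
store = make_store()
with open(demo_log, "w") as f:
    f.write("schedd\tjob 92.0 submitted\n")
position = load_new_events(demo_log, store, 0)
with open(demo_log, "a") as f:
    f.write("startd\tclaim accepted\n")
position = load_new_events(demo_log, store, position)
print(store.execute("SELECT COUNT(*) FROM events").fetchone()[0])
```

Because the daemons only append to files, they keep running when the database is unavailable; the reader simply catches up from its last offset later.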
Implementation Details • Quill++: first-class Condor daemon • Managed by Condor Master • Native PostgreSQL API • Can be ported to any platform for which PostgreSQL drivers are available (AIX, BSD, IRIX, HP-UX, Linux, Solaris, Windows, etc.) • Porting Quill++ to other databases involves implementing a database virtual class
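The "database virtual class" approach can be illustrated with an abstract interface and one concrete backend. The slides do not show the real class, so everything here is hypothetical: the interface name, its three methods, and the SQLite backend are stand-ins chosen for a self-contained example, written in Python rather than the daemon's own implementation language.

```python
import sqlite3
from abc import ABC, abstractmethod


class JobDatabase(ABC):
    """Hypothetical interface; the actual Quill++ class is not shown in the slides."""

    @abstractmethod
    def connect(self):
        """Open a connection to the backend."""

    @abstractmethod
    def execute(self, statement, params=()):
        """Run one statement and return any result rows."""

    @abstractmethod
    def disconnect(self):
        """Close the connection."""


class SqliteJobDatabase(JobDatabase):
    """One backend: supporting another database means another subclass."""

    def __init__(self, path=":memory:"):
        self._path = path
        self._conn = None

    def connect(self):
        self._conn = sqlite3.connect(self._path)

    def execute(self, statement, params=()):
        rows = self._conn.execute(statement, params).fetchall()
        self._conn.commit()
        return rows

    def disconnect(self):
        self._conn.close()
        self._conn = None


db = SqliteJobDatabase()
db.connect()
db.execute("CREATE TABLE jobs (cluster_id INTEGER, owner TEXT)")
db.execute("INSERT INTO jobs VALUES (?, ?)", (92, "psilord"))
print(db.execute("SELECT cluster_id, owner FROM jobs"))
db.disconnect()
```

Code written against JobDatabase never names a specific database, so the rest of the daemon stays unchanged when a backend is added.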
Web Interface • Useful for: • User job monitoring • Administrative monitoring over jobs and resources • Debugging
Condordb Admin Screen [Screenshot: panels for jobs in queue, history jobs, machine status, and recency summary.]
Status about a job [Screenshot: sections for classad info, run info, event info, match info, and rejects info.]
Quill++ Present Status • Deployed in testbed • dbc cluster (93 machines) • Has successfully run almost 100,000 jobs. • Planning distribution with early v6.9.x Condor release.
Quill++ Caveats • Web interface to DB • Basic prototype implemented • Needs to be made more robust and user-friendly (!) • Gathers incomplete information in multiple-pool scenarios (flocking, glide-in, Condor-C)