Quill / Quill++ Tutorial European Condor Week June 2006 INFN Milan, Italy

Presentation Transcript


  1. Quill / Quill++ Tutorial, European Condor Week, June 2006, INFN Milan, Italy

  2. What is Quill? A non-invasive method of storing a read-only copy of the job queue and job history data in a relational database.

  3. Why Do We Need It? • Presents the job queue information as a set of tables in a relational database (Big Win!) • Fault tolerance • Provides performance enhancements in very large and busy pools

  4. Job Queue Management [Diagram with two panels: "Without Quill" (schedd, job queue) and "With Quill" (schedd, job queue, quilld, database).]

  5. Deployment • One Quill daemon per schedd • Quill daemons must be uniquely named • Each Quill daemon uses a unique DB name • Multiple Quill daemons may utilize one database server • Currently uses PostgreSQL • Recommend PostgreSQL 8.1 or later for automatic vacuuming of tables
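
The naming rules on this slide can be checked mechanically before a pool is rolled out. The following is a minimal sketch, assuming the planned daemons are available as (Quill name, database name) pairs; the function name `find_conflicts` and the input shape are illustrative and not part of Condor.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_conflicts(daemons: Iterable[Tuple[str, str]]) -> List[str]:
    """Report duplicate Quill names or database names.

    `daemons` holds one (quill_name, db_name) pair per schedd.
    Sharing one database *server* is allowed, so host names are not checked.
    """
    daemons = list(daemons)
    problems: List[str] = []
    checks = (
        ("Quill name", [quill_name for quill_name, _ in daemons]),
        ("database name", [db_name for _, db_name in daemons]),
    )
    for label, values in checks:
        for value, count in Counter(values).items():
            if count > 1:
                problems.append(f"duplicate {label}: {value}")
    return problems
```

An empty list means the plan satisfies both uniqueness rules; each returned string names one value that is used more than once.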

  6. Condor’s Interface to Quill • Modified two tools to utilize the DB • condor_q • condor_history • Very minor modifications to schedd • Multiple sources for Job Queue & History pose an interesting problem

  7. Discovery Sequence (Local Query) [Diagram: condor_q, schedd, quilld, database and job queue, with steps numbered 1 to 3.]

  8. Discovery Sequence (Remote Query) [Diagram: condor_q, collector, schedd, quilld, database and job queue, with steps numbered 0 to 3.]

  9. A User Perspective: condor_q • condor_q changes • -name takes a ScheddName or QuillName • -avgqueuetime details average time in queue for all jobs

  10. A User Perspective: condor_q. Example: condor_q -name
      Linux merlin > condor_q -name psilord_quilld@merlin.cs
      -- DB: psilord_quilld@merlin.cs : <merlin.cs.wisc.edu:42999> : psilord_db
       ID      OWNER    SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
       92.0    psilord  4/21 09:21   0+00:00:00  I   0    9.8   foo
      1 jobs; 1 idle, 0 running, 0 held

  11. A User Perspective. Example: condor_q -avgqueuetime
      Linux merlin > condor_q -avgqueuetime
      -- DB: psilord_quilld@merlin.cs : <merlin.cs.wisc.edu:42999> : psilord_db
      Average time in queue for uncompleted jobs (in hh:mm:ss)
      00:40:47.011993

  12. Job History Discovery Sequence (Local Query) • The quilld is never queried directly! [Diagram: condor_history, quilld, database, job queue and history file, with steps numbered 1 and 2.]

  13. Job History Discovery (Remote Query) NEW! • The quilld is never queried directly! [Diagram: condor_history, collector, quilld, database, job queue and history file, with steps numbered 0 and 1.]

  14. A User Perspective: condor_history • condor_history changes • -name takes a Quill Name to retrieve job histories from a remote quill’s database • -completedsince returns all jobs completed since a PostgreSQL formatted date

  15. A User Perspective: condor_history. Example: condor_history -name
      Linux merlin > condor_history -name psilord_quilld@merlin.cs
      -- DB: psilord_quilld@merlin.cs : <merlin.cs.wisc.edu:42999> : psilord_db
       ID      OWNER    SUBMITTED    RUN_TIME    ST  COMPLETED    CMD
       91.0    psilord  4/20 14:23   0+00:00:00  X   ???          /scratch/psilor
       92.0    psilord  4/21 09:21   0+00:00:00  X   ???          /scratch/psilor
       93.0    psilord  4/21 10:12   0+00:00:01  C   4/21 10:12   /scratch/psilor

  16. A User Perspective: condor_history. Example: condor_history -completedsince
      Linux merlin > condor_history -completedsince "2006-01-01 00:00:01"
      -- DB: psilord_quilld@merlin.cs : <merlin.cs.wisc.edu:42999> : psilord_db
       ID      OWNER    SUBMITTED    RUN_TIME    ST  COMPLETED    CMD
       93.0    psilord  4/21 10:12   0+00:00:01  C   4/21 10:12   /scratch/psilor
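
The date passed to -completedsince in the example has the form "YYYY-MM-DD HH:MM:SS". A small sketch of producing that string from a script follows; the helper name `completedsince_arg` is made up for illustration, and only the timestamp layout is taken from the slide.

```python
from datetime import datetime
from typing import List


def completedsince_arg(moment: datetime) -> str:
    """Format a datetime the way the slide's example date is written."""
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def history_command(moment: datetime) -> List[str]:
    """Build an argument list suitable for subprocess.run (not executed here)."""
    return ["condor_history", "-completedsince", completedsince_arg(moment)]
```

Passing the arguments as a list avoids shell quoting problems with the space inside the timestamp.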

  17. Short Circuiting the Discovery Sequence • Use the -direct option! • Examples • condor_q -direct rdbms • condor_q -direct quilld • condor_q -direct schedd • "rdbms", "quilld", and "schedd" are the actual parameters. • Invaluable for debugging!
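
The difference between the default discovery sequence and -direct can be sketched as a failover loop. This is an illustration only, not Condor's code: the order rdbms, quilld, schedd is an assumption (it follows the order in which the slide lists the -direct values, since the diagram numbering did not survive in this transcript), and the `sources` callables are hypothetical stand-ins for the real queries.

```python
from typing import Callable, Dict, List, Optional

# Assumed failover order; not confirmed by the transcript.
DISCOVERY_ORDER = ["rdbms", "quilld", "schedd"]


def query_queue(
    sources: Dict[str, Callable[[], List[str]]],
    direct: Optional[str] = None,
) -> List[str]:
    """Return job IDs from the first source that answers.

    With `direct` set, only that source is asked and any error propagates,
    which is what makes the option useful for debugging.
    """
    if direct is not None:
        if direct not in DISCOVERY_ORDER:
            raise ValueError(f"unknown source: {direct}")
        return sources[direct]()  # no failover

    last_error: Optional[Exception] = None
    for name in DISCOVERY_ORDER:
        try:
            return sources[name]()
        except ConnectionError as err:
            last_error = err  # try the next source
    raise ConnectionError("no job queue source reachable") from last_error
```

With failover, a broken database is masked by the next source in line; with `direct="rdbms"` the same failure surfaces immediately.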

  18. PostgreSQL 8.1 Installation • ./configure • gmake && gmake install • mkdir /path/to/pgsql/data • initdb -D /path/to/pgsql/data • postmaster -D /path/to/pgsql/data • Note: Default port binding is 5432.

  19. PostgreSQL Configuration • Add two special user accounts: quillreader and quillwriter • createuser quillreader --no-createdb --no-adduser --pwprompt • createuser quillwriter --createdb --no-adduser --pwprompt

  20. PostgreSQL Configuration (cont) • Allow TCP/IP connections • Edit file postgresql.conf • Add listen_addresses = '*' • Allow connections from specific hosts • Edit file pg_hba.conf • host all quillreader 128.105.0.0 255.255.0.0 password • host all quillwriter 128.105.0.0 255.255.0.0 password • Note: only use 'password' authentication at this time.
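
Each pg_hba.conf line above pairs a network address with a netmask, so 128.105.0.0 with 255.255.0.0 admits any client whose address starts with 128.105. A quick way to check which hosts such a rule covers is sketched below with Python's standard ipaddress module; the function name `rule_allows` is illustrative, and the sketch only models the address match, not PostgreSQL's user or database matching.

```python
import ipaddress


def rule_allows(client_ip: str, address: str, netmask: str) -> bool:
    """True if client_ip falls inside the address/netmask range of a rule."""
    network = ipaddress.ip_network(f"{address}/{netmask}")
    return ipaddress.ip_address(client_ip) in network
```

For example, `rule_allows("128.105.14.7", "128.105.0.0", "255.255.0.0")` is true, while an address outside 128.105.x.x is rejected.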

  21. Quill Configuration • User quillwriter needs a write password. • Store it in a file called .quillwritepassword in the $(SPOOL) directory. • Ensure only the condor uid can read it if Condor is running as root
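
A sketch of creating the password file with owner-only permissions follows. It assumes a Unix-like system and that the caller passes the spool directory explicitly; the function name is illustrative, and the file is written with the password alone because the slides do not specify any other format. Changing ownership to the condor uid needs root and is left out.

```python
import os
import stat


def write_password_file(spool_dir: str, password: str) -> str:
    """Write .quillwritepassword in spool_dir, readable by its owner only."""
    path = os.path.join(spool_dir, ".quillwritepassword")
    # Create with mode 0600 so the file is never world-readable, even briefly.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # Enforce the mode even if the file already existed with looser bits.
        os.fchmod(fd, stat.S_IRUSR | stat.S_IWUSR)
        os.write(fd, password.encode())
    finally:
        os.close(fd)
    return path
```

Running this as the condor user, or following it with a chown as root, leaves only the condor uid able to read the file.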

  22. Quill Configuration (cont) • Condor system specific attributes in file condor_config.local • QUILL = $(SBIN)/condor_quill • QUILL_LOG = $(LOG)/QuillLog • QUILL_ADDRESS_FILE = $(LOG)/.quill_address • DAEMON_LIST = …, QUILL • VALID_SPOOL_FILES = …, .quillwritepassword • DC_DAEMON_LIST = …, QUILL

  23. Quill Configuration (cont) • Quill specific attributes
      QUILL_ENABLED = TRUE
      # The quill name must be unique across all
      # quill daemons AND schedds
      QUILL_NAME = psilord_quilld@merlin.cs.wisc.edu
      QUILL_DB_NAME = psilord_db
      QUILL_DB_IP_ADDR = merlin.cs.wisc.edu:5432
      QUILL_POLLING_PERIOD = 10 (seconds)

  24. Quill Configuration (cont) • QUILL_HISTORY_CLEANING_INTERVAL = 24 (hours) • QUILL_HISTORY_DURATION = 30 (days) • QUILL_MANAGE_VACUUM = FALSE • QUILL_IS_REMOTELY_QUERYABLE = TRUE • QUILL_DB_QUERY_PASSWD = xxx

  25. DB Storage Method • Schema designed to store and query classads • 4 tables to represent the job queue classads • 2 for history data • 1 for metadata • Some queries are easier than others • Ask more questions at the BOF!
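
Why some queries are easier than others can be illustrated with a generic attribute-per-row layout, a common way to store ClassAd-like records whose attribute sets vary from job to job. The sketch below is not the Quill schema: the table `job_attrs`, its columns, and the sample rows are invented for illustration, and SQLite stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with one row per (job, attribute)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_attrs "
        "(cluster_id INTEGER, proc_id INTEGER, attr TEXT, val TEXT)"
    )
    rows = [  # illustrative data only
        (92, 0, "Owner", "psilord"), (92, 0, "JobStatus", "1"),
        (93, 0, "Owner", "psilord"), (93, 0, "JobStatus", "4"),
    ]
    conn.executemany("INSERT INTO job_attrs VALUES (?, ?, ?, ?)", rows)
    return conn


def jobs_with(conn: sqlite3.Connection, owner: str, status: str) -> List[Tuple[int, int]]:
    """Filtering on two attributes needs a self-join in this layout."""
    sql = """
        SELECT o.cluster_id, o.proc_id
        FROM job_attrs AS o
        JOIN job_attrs AS s
          ON s.cluster_id = o.cluster_id AND s.proc_id = o.proc_id
        WHERE o.attr = 'Owner' AND o.val = ?
          AND s.attr = 'JobStatus' AND s.val = ?
        ORDER BY o.cluster_id, o.proc_id
    """
    return conn.execute(sql, (owner, status)).fetchall()
```

Looking up a single attribute of one job is a plain filter, but every additional attribute in the condition adds another join, which is the kind of cost difference the slide hints at.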

  26. Quill++ • More comprehensive than Quill (data from all daemons, not just SchedD) • Built on Quill code base • Condor daemons write to SQL logs, Quill daemon reads and inserts in DBMS • Central database serves entire pool • Web-based query GUI

  27. Data Capture in Quill++ • Condor daemons augmented to record important events in a database • Database is in addition to standard daemon logs • Pool will run unaffected even in the absence of a database [Diagram: Schedd, Shadow, Startd, Starter and Negotiator daemons on a machine, and a database.]

  28. Quill++ Architecture [Diagram: Master, Startd, Schedd and other daemons write events to the event logs and the job queue log; Quill++ gets new events and stores them in the RDBMS (queue, history, machine, match, etc.).]

  29. Implementation Details • Quill++: First class condor daemon • Managed by Condor Master • Native PostgreSQL API • Can be ported to any platform for which PostgreSQL drivers are available (AIX, BSD, IRIX, HP-UX, Linux, Solaris, Windows etc.) • Porting Quill++ to other databases involves implementing a database virtual class

  30. Web Interface • Useful for: • User job monitoring • Administrative monitoring over jobs and resources • Debugging

  31. Condordb Admin Screen [Screenshot with panels: jobs in queue, history jobs, machine status, recency summary.]

  32. Job history by owner

  33. Machine Report

  34. Status about a job [Screenshot with panels: ClassAd info, run info, event info, match info, rejects info.]

  35. Recency info for exceptional data sources

  36. Quill++ Present Status • Deployed in testbed • dbc cluster (93 machines) • Has successfully run almost 100,000 jobs. • Planning distribution with early v6.9.x Condor release.

  37. Quill++ Caveats • Web interface to DB • Basic prototype implemented • Needs to be made more robust and user-friendly (!) • Gathers incomplete information in multiple-pool scenarios (flocking, glide-in, Condor-C)

  38. Thank you!
