Learn how to troubleshoot slow or failed jobs, workflows, and data transfers using the CEDPS and NetLogger tools. Understand the concepts, logging best practices, and the architecture of the system. Explore the VDT package and its components, including syslog-ng, the NetLogger pipeline, and an example web page with graphs. Plans for unified logging and automated troubleshooting are also covered.
CEDPS Troubleshooting OSG All-Hands Meeting 2 March 2009
Outline • Introduction • Concepts • VDT Package • Syslog-ng • NetLogger Pipeline • Example web page, graphs • Plans
Central question: Why did my job, workflow, or data transfer slow down, stop, or fail?
End-to-End Problem: The problem could be in the middleware, the network, the end hosts, or the application. We need to correlate information from all of these sources.
Solution • Unified logging • Unified data model / query model • Secure, targeted, automated troubleshooting tools
CEDPS and NetLogger • NetLogger Toolkit is a project that started 10 years ago • Current work in CEDPS is part of the newest incarnation of the NetLogger software • I may use the two terms interchangeably: right now CEDPS Troubleshooting is mostly NetLogger and vice-versa
Logging Best Practices (BP) • Log message for start and end of every “interesting” operation • Simple name=value format • Named event types, timestamps on every message • Additional arbitrary user-defined attributes • Easy to grep, parse
BP Logging Example
  ts=2006-12-08T18:48:27.598448Z event=gridFTP.transfer.start file=myfile.dat src.host=foo.org src.port=4321 dest.host=bar.org dest.port=1234 transfer.id=11123 p.id=15432\n
Here ts is the timestamp, event is the named event type, file and the src.*/dest.* fields are user-defined attributes, and transfer.id/p.id are identifiers. See the "Grid Logging: Best Practices" guide: http://www.cedps.net/index.php/LoggingBestPractices
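To illustrate the "easy to grep, parse" claim, here is a minimal Python sketch (not the NetLogger library's own parser) that turns a BP-format line into a dictionary:

  # Minimal sketch: split a Best-Practices "name=value" log line into a dict.
  def parse_bp_line(line):
      """Split 'key=value key=value ...' into a dict; values contain no spaces."""
      return dict(field.split("=", 1) for field in line.split())

  line = ("ts=2006-12-08T18:48:27.598448Z event=gridFTP.transfer.start "
          "file=myfile.dat src.host=foo.org src.port=4321 "
          "dest.host=bar.org dest.port=1234 transfer.id=11123 p.id=15432")
  record = parse_bp_line(line)
  print(record["event"], record["transfer.id"])   # gridFTP.transfer.start 11123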
A better mousetrap • Current state-of-the-art: • log in to sites, grep and process logs • or collect a subset of the available information centrally • For troubleshooting, we need: • consistent, expressive, remote querying • ...of all the available information
Architecture (diagram): a user submits a job; logs from the Grid scheduler, the local scheduler, and the compute nodes are written to disk and picked up by syslog-ng senders, which forward them to the site's syslog-ng receiver; the receiver spools them to disk, and log normalization and database insertion load them into a shared DB.
Example: PDSF. We are now collecting logs 24/7 on PDSF, gathering all SRM, Grid, and local scheduler (SGE) logs.
VDT Package Overview • Relies on Syslog-ng VDT package: “sender” and “receiver” configurations • Contains: • NetLogger pipeline (logs to DB) • Other utilities for parsing, viewing, analyzing logs • Simple web page (PHP) that queries DB
#foo Documentation • For the most part, you can follow along in the 4.1.1-alpha documentation: • http://acs.lbl.gov/NetLogger-releases/doc/4.1.1a/manual.html • Will indicate relevant section with #section at top-right of page • (even though it’s alpha, it’s about as stable as 4.1.0)
#syslogng Syslog-ng VDT package
$ pacman -get OSG:Syslog-ng
• syslog-ng-sender.conf
• syslog-ng-receiver.conf
• These need to be modified to send to the site's local collector:
  # Destinations
  #destination local_collector { udp("osg-log.uchicago.edu" port(5145) ); };
  destination local_collector { tcp("osp.nersc.gov" port(5145) ); };
• TCP transport is more reliable
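Before pointing the sender configuration at a collector, it can help to confirm that the collector actually accepts TCP connections on the chosen port. A small Python sketch; the host and port are simply the values from the example config above:

  # Check that the site collector accepts TCP connections on the syslog-ng port.
  import socket

  def collector_reachable(host="osp.nersc.gov", port=5145, timeout=5.0):
      """Return True if a TCP connection to the collector succeeds."""
      try:
          sock = socket.create_connection((host, port), timeout=timeout)
          sock.close()
          return True
      except OSError:
          return False

  print("collector reachable:", collector_reachable())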
#nl_pipeline NetLogger Pipeline (diagram): the nl_pipeline manager runs the two pipeline components, nl_parser and nl_loader.
#nl_pipeline Pipeline files: default layout
• etc (site-specific configuration settings)
  • nl_loader.conf
  • nl_parser.conf
• var
  • log (internal log files)
    • nl_pipeline.log
    • nl_loader.log
    • nl_parser.log
  • run (internal persistent state, used for restarts, plus PID files)
    • nl_loader.pid
    • nl_loader.state
    • nl_parser.pid
    • nl_parser.state
    • nl_pipeline.pid
#nl_pipeline nl_pipeline • Simple “manager” program • Forks the nl_loader and nl_parser • Sends periodic signals to both of them to save state and re-read configuration • Kills them, gracefully, when it is killed (gracefully)
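A rough Python sketch of that manager pattern is below. It is illustrative only, not the nl_pipeline source: the choice of SIGUSR1 as the "save state / re-read configuration" signal, the poke interval, and the worker command lines are assumptions.

  # Illustrative manager sketch: fork two workers, poke them periodically,
  # and shut them down gracefully on SIGTERM.
  import signal
  import subprocess
  import time

  children = [
      subprocess.Popen(["nl_parser", "-c", "ROOT_DIR"]),   # placeholder arguments
      subprocess.Popen(["nl_loader", "-c", "ROOT_DIR"]),
  ]

  running = True

  def shutdown(signum, frame):
      # Stop the main loop; children are terminated below.
      global running
      running = False

  signal.signal(signal.SIGTERM, shutdown)
  signal.signal(signal.SIGINT, shutdown)

  POKE_INTERVAL = 60  # seconds between "checkpoint" signals (assumed value)

  while running:
      # Sleep in short steps so a shutdown request is noticed promptly.
      for _ in range(POKE_INTERVAL):
          if not running:
              break
          time.sleep(1)
      if running:
          for child in children:
              if child.poll() is None:                  # child still alive
                  child.send_signal(signal.SIGUSR1)     # assumed "save state / re-read config" signal

  # Graceful shutdown: SIGTERM each child, then wait for it to exit.
  for child in children:
      child.terminate()
  for child in children:
      child.wait()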
#nl_pipeline nl_pipeline example
$ nl_pipeline -c ROOT_DIR
• Looks for nl_loader and nl_parser in the same directory as nl_pipeline (optionally can use $PATH)
• The -n option shows the paths that would be used
• ROOT_DIR contains the etc (nl_loader.conf, nl_parser.conf), var/log, and var/run directories described on the previous slide
#nl_pipeline Configuration files
• Extends the familiar "INI" format:
  # snippet                   <- comment
  @include shared.conf        <- include another file
  [global]                    <- section
  state_file=${D}/var/run     <- variable substitution
  [[example]]                 <- sub-section
  foo=bar
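The nested sections and ${...} substitution resemble what the Python configobj library accepts, so reading such a file might look like the following sketch (configobj is an assumption here, not necessarily what the NetLogger tools use internally, and the NetLogger-specific @include directive is omitted):

  # Sketch: nested INI sections with Template-style ${...} substitution via configobj.
  from configobj import ConfigObj

  lines = [
      "[DEFAULT]",
      "D = /opt/netlogger",           # value substituted into ${D} below (assumed root)
      "[global]",
      "state_file = ${D}/var/run",
      "[[example]]",
      "foo = bar",
  ]
  cfg = ConfigObj(lines, interpolation="Template")
  print(cfg["global"]["state_file"])      # -> /opt/netlogger/var/run
  print(cfg["global"]["example"]["foo"])  # -> bar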
nl_parser • Reads multiple "raw" log files from multiple directories • Parses them and writes a single output file (with roll-over, a series of such files) • Files can be "tailed" in parallel • Alternate command-line mode: parse a single input to a single output
nl_parser & nl_loader (diagram): input files flow into nl_parser, which writes a series of output files with periodic roll-overs (output.M ... output.N); nl_loader follows this series of files.
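The "tailed in parallel" behavior boils down to following files as they grow. A small Python sketch of following one file (nl_parser's real implementation also remembers offsets across restarts, as described on the state slide below):

  # Sketch of following a growing log file, roughly what "tailing" means here.
  import time

  def follow(path, poll_interval=1.0):
      """Yield complete lines appended to `path` (tail -f style)."""
      with open(path, "r") as f:
          buf = ""
          while True:
              chunk = f.readline()
              if not chunk:                     # at end of file: wait for more data
                  time.sleep(poll_interval)
                  continue
              buf += chunk
              if buf.endswith("\n"):            # only yield complete lines
                  yield buf.rstrip("\n")
                  buf = ""

  # Example usage (the path is hypothetical):
  # for line in follow("/var/log/gridftp.log"):
  #     print(line)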
#nl_parser nl_parser configuration • Sets input files • Matches input files to parser "modules" • Sets output file, including roll-over options • Sets program logging level • Sets "throttling" level • Sets "state" file
#nl_parser Choosing parsers (1)
• Whole-file parser:
  [parser_section]               <- arbitrary name
  [[pbs]]                        <- parser module
  files = pbs*.txt               <- files to parse
  [[[parameters]]]               # parameters for the module
  site = pdsf.lbl.gov
  type = FILE~pbs_(.*)\.txt
#nl_parser Choosing parsers (2)
• Per-line parser:
  [myparser]                                    <- arbitrary name
  files = *.log                                 <- files to parse
  pattern = " (?P<level>[A-Z]+)/(?P<app>\w+):"  <- pattern extracted from each line
  [[bestman]]
  [[[match]]]
  app = "bestman"                               <- match to "app" subpattern
  [[pbs]]
  [[[match]]]
  app = "PBS"                                   <- match to "app" subpattern
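Roughly, the shared pattern extracts a named "app" group from each line, and that value selects a handler, as in this sketch (the sample line, pattern details, and handler bodies are made up for illustration):

  # Sketch of the per-line dispatch the config above describes.
  import re

  LINE_RE = re.compile(r" (?P<level>[A-Z]+)/(?P<app>\w+):")

  def parse_bestman(line):
      return {"event": "bestman.raw", "text": line}

  def parse_pbs(line):
      return {"event": "pbs.raw", "text": line}

  HANDLERS = {"bestman": parse_bestman, "PBS": parse_pbs}

  def dispatch(line):
      m = LINE_RE.search(line)
      if not m:
          return None                       # line does not match the shared pattern
      handler = HANDLERS.get(m.group("app"))
      return handler(line) if handler else None

  print(dispatch("2009-03-02 INFO/bestman: request 42 started"))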
#nl_parser nl_parser throttling
• If the parser needs to "catch up" by parsing some large input files, it can use a lot of CPU
• Throttling is a heuristic to limit this:
  [global]
  throttle = 0.2  # max. 20% of 1 CPU
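The heuristic can be pictured as: after each batch of work, sleep long enough that the busy fraction of one CPU stays near the configured limit. A sketch, not nl_parser's actual code:

  # Throttling sketch: keep busy / (busy + sleep) close to `throttle`.
  import time

  def throttled_loop(do_batch, throttle=0.2):
      """Run do_batch() repeatedly, using at most ~`throttle` of one CPU."""
      while True:
          start = time.time()
          more = do_batch()                 # e.g. parse a chunk of input lines
          busy = time.time() - start
          if throttle and busy > 0:
              # Sleep so the busy fraction works out to roughly `throttle`.
              time.sleep(busy * (1.0 - throttle) / throttle)
          if not more:
              break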
#nl_parser nl_parser state • Current offset in all input files is saved periodically • Additional parser-specific state may also be saved • The bestman parser module, for example, maintains state mapping thread IDs to the global request identifier • All state is restored at restart / reconfigure
#nl_parser Parser module descriptions
• nl_parser -m <module> -i
$ nl_parser -m bestman -i
Module     : netlogger.parsers.modules.bestman
Description: Parse logs from Berkeley Storage Manager (BeStMan).
Parameters :
 - version {1,2}: Version 1 is anything before bestman-2.2.1.r3, Version 2 is that version and later ones.
 - transfer_only {True,*False*}: For Version 2, report only events that are necessary to determine transfer performance.
See also http://datagrid.lbl.gov/bestman/
#nl_loader nl_loader • Reads a Best-Practices format file and loads it into MySQL, PostgreSQL, or SQLite • Can follow a series of files • Deletes/moves "finished" files • For MySQL and PostgreSQL, the database must already exist (e.g. createdb), but nl_loader can create the tables
#nl_loader Database schema: the event table stores the event name (including any ".start"/".end" suffix), timestamp, and level; user-defined attributes (e.g. foo=bar) go into the attr table; attributes whose names end in ".id" go into the ident table; long "text" values and "DN" attributes each get their own table.
#nl_loader Populating the DB
  ts=2008-09-16T21:52:16.385281Z event=run.start level=Info job.id=123 DN=mydn user=dang
Here event=run.start populates the event table, DN=mydn the dn table, user=dang the attr table, and job.id=123 the ident table.
Querying the DB
  select e.time, i.value jobid, u.value user, d.value dn
    from event e
    join ident i on i.e_id = e.id
    join attr u on u.e_id = e.id
    join dn d on d.e_id = e.id
   where i.name = 'job' and u.name = 'user'
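A self-contained way to try this schema/query pattern is SQLite from Python. The table and column names below follow the example query above; the schema nl_loader actually creates may differ in detail:

  # Sketch: minimal tables mirroring the slides, one run.start event, one query.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE event (id INTEGER PRIMARY KEY, name TEXT, time TEXT, level TEXT);
      CREATE TABLE ident (e_id INTEGER, name TEXT, value TEXT);
      CREATE TABLE attr  (e_id INTEGER, name TEXT, value TEXT);
      CREATE TABLE dn    (e_id INTEGER, value TEXT);
  """)

  # The run.start example from the "Populating the DB" slide
  cur = conn.execute(
      "INSERT INTO event (name, time, level) VALUES (?, ?, ?)",
      ("run.start", "2008-09-16T21:52:16.385281Z", "Info"))
  e_id = cur.lastrowid
  conn.execute("INSERT INTO ident VALUES (?, 'job', '123')", (e_id,))
  conn.execute("INSERT INTO attr  VALUES (?, 'user', 'dang')", (e_id,))
  conn.execute("INSERT INTO dn    VALUES (?, 'mydn')", (e_id,))

  rows = conn.execute("""
      SELECT e.time, i.value AS jobid, u.value AS user, d.value AS dn
        FROM event e
        JOIN ident i ON i.e_id = e.id
        JOIN attr  u ON u.e_id = e.id
        JOIN dn    d ON d.e_id = e.id
       WHERE i.name = 'job' AND u.name = 'user'
  """).fetchall()
  print(rows)   # [('2008-09-16T21:52:16.385281Z', '123', 'dang', 'mydn')]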
#nl_loader nl_loader state • Current offset in input file saved periodically • All state restored at restart / reconfig
Duplicate data • Duplicates can happen • The hash column in the event table avoids duplicate events • nl_loader checks foreign-key constraints at the application layer, so when a duplicate event is skipped, no other "junk" ends up in the related tables either
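The idea can be sketched with a UNIQUE hash column and an application-level check before touching the related tables (column names and hash choice here are illustrative, not nl_loader's exact scheme):

  # Duplicate-detection sketch: hash the normalized line, INSERT OR IGNORE,
  # and skip the related-table inserts when the event already exists.
  import hashlib
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("""CREATE TABLE event (
      id INTEGER PRIMARY KEY,
      hash TEXT UNIQUE,
      name TEXT,
      time TEXT)""")

  def load_event(line, name, ts):
      digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
      cur = conn.execute(
          "INSERT OR IGNORE INTO event (hash, name, time) VALUES (?, ?, ?)",
          (digest, name, ts))
      # rowcount == 0 means a duplicate was skipped: do NOT insert into the
      # attr/ident/dn tables either (the application-level check on this slide).
      return cur.rowcount == 1

  line = "ts=2008-09-16T21:52:16.385281Z event=run.start job.id=123"
  print(load_event(line, "run.start", "2008-09-16T21:52:16.385281Z"))  # True
  print(load_event(line, "run.start", "2008-09-16T21:52:16.385281Z"))  # False (duplicate)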
Web interface • Simple PHP pages to show how to query database • Can correlate Grid jobs with SGE statistics about those jobs http://localhost/monpage/jobs.php
Analysis • Our tool of choice is “R”, a language for data analysis • Existing analyses: • SRM/BeStMan logs • Pegasus-WMS performance analysis • More to come..
BeStMan graph example: aggregate bandwidth, max ≈ 20 Mb/s; single-stream ≈ 5 Mb/s.
nl_actions • Periodic actions (esp. with DB): • Roll over database • Create derived databases • Clean up old files • How it works: • Python modules, like parser modules • Simple configuration file with “schedule” and parameters
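The pattern nl_actions describes, periodic Python callables driven by a schedule plus parameters, can be sketched with the standard sched module (the action functions and intervals below are made up for illustration and are not the shipped action modules):

  # Sketch of periodic actions driven by a schedule + parameters.
  import sched
  import time

  def rollover_db(keep_days=30):
      print("rolling over database, keeping", keep_days, "days")

  def cleanup_files(directory="var/log"):
      print("cleaning up old files under", directory)

  # Stands in for the "schedule and parameters" configuration file;
  # real intervals would be hourly or daily rather than seconds.
  ACTIONS = [
      (10, rollover_db, {"keep_days": 30}),
      (5, cleanup_files, {"directory": "var/log"}),
  ]

  scheduler = sched.scheduler(time.time, time.sleep)

  def run_action(interval, func, params):
      func(**params)                            # perform the periodic action
      # Re-schedule the next run of this action.
      scheduler.enter(interval, 1, run_action, (interval, func, params))

  for interval, func, params in ACTIONS:
      scheduler.enter(interval, 1, run_action, (interval, func, params))

  scheduler.run()    # blocks; the real tool also handles signals and logging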
When is the VDT package going to be ready? • Hoped to say “now”. Sigh. • New deadline: April 1st • Software can be used now, but packaging is not complete • Some rough edges in documentation and error-reporting • Early adopters welcome! We will help.
The better mousetrap, revisited Use OGSA-DAI to perform authenticated JOINs across multiple collectors
Sharing logs Sites do not need to trust, or share with, a central repository. Permissions to query logs leverage Grid credentials and are orthogonal to user permissions
perfSONAR integration • Provide a perfSONAR interface in front of the NetLogger database • i.e. make a subset of the NetLogger database a perfSONAR Measurement Archive • Or, pull perfSONAR data into the NetLogger database • Use perfSONAR as a way of locating NetLogger databases
References • Troubleshooting wiki area: http://www.cedps.net/index.php/Troubleshooting • NetLogger home page: http://acs.lbl.gov/NetLoggerWiki/