
CEDPS Troubleshooting

Learn how to troubleshoot slow or failed jobs, workflows, and data transfers using the CEDPS and NetLogger tools. Understand the concepts, logging best practices, and architecture of the system. Explore the VDT package and its components, including syslog-ng, the NetLogger pipeline, and an example web page with graphs. See the plans for unified logging and automated troubleshooting.

fernandom


Presentation Transcript


  1. CEDPS Troubleshooting. OSG All-Hands Meeting, 2 March 2009

  2. Outline • Introduction • Concepts • VDT Package • Syslog-ng • NetLogger Pipeline • Example web page, graphs • Plans

  3. Introduction

  4. Central question Why did my job, workflow, or data transfer slow down, stop, or fail?

  5. End to End Problem Problem could be in the middleware, network, end hosts, or application. We need to correlate information from all of these sources.

  6. Solution Unified logging Unified data model / query model Secure, targeted, automated troubleshooting tools

  7. Concepts

  8. CEDPS and NetLogger • NetLogger Toolkit is a project that started 10 years ago • Current work in CEDPS is part of the newest incarnation of the NetLogger software • I may use the two terms interchangeably: right now CEDPS Troubleshooting is mostly NetLogger and vice-versa

  9. Logging Best Practices (BP) • Log message for start and end of every “interesting” operation • Simple name=value format • Named event types, timestamps on every message • Additional arbitrary user-defined attributes • Easy to grep, parse
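
The practices above can be illustrated with a minimal Python sketch that writes a start and an end message for one operation. The helper name `bp_line` and the rule of quoting values that contain spaces are assumptions made for this illustration; they are not taken from the NetLogger API.

```python
from datetime import datetime, timezone


def bp_line(event, attrs=None, ts=None):
    """Format one name=value log line: ts=... event=... then user attributes."""
    ts = ts or datetime.now(timezone.utc)
    fields = ["ts=" + ts.strftime("%Y-%m-%dT%H:%M:%S.%fZ"), "event=" + event]
    for name, value in (attrs or {}).items():
        text = str(value)
        # Assumption: quote values with spaces so the line stays easy to split.
        if " " in text:
            text = '"' + text.replace('"', '\\"') + '"'
        fields.append(name + "=" + text)
    return " ".join(fields)


if __name__ == "__main__":
    # One message at the start and one at the end of an "interesting" operation.
    ids = {"file": "myfile.dat", "transfer.id": 11123}
    print(bp_line("gridFTP.transfer.start", ids))
    print(bp_line("gridFTP.transfer.end", dict(ids, status=0)))
```

Passing the same identifiers to both calls is what lets a later query pair each `.start` with its `.end`.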

  10. BP Logging Example
      ts=2006-12-08T18:48:27.598448Z event=gridFTP.transfer.start file=myfile.dat src.host=foo.org src.port=4321 dest.host=bar.org dest.port=1234 transfer.id=11123 p.id=15432\n
      Callouts: ts is the timestamp; event is the event type; file, src.* and dest.* are user attributes; transfer.id and p.id are identifiers.
      “Grid Logging: Best Practices” guide: http://www.cedps.net/index.php/LoggingBestPractices
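
To show why this format is "easy to grep, parse", here is a small Python sketch that turns such a line into a dictionary. The function name `parse_bp` is hypothetical, and using `shlex` to honor quoted values is an assumption of this sketch, not a description of how NetLogger parses its own logs.

```python
import shlex


def parse_bp(line):
    """Split a name=value log line into a dict of strings."""
    record = {}
    # shlex.split keeps a quoted value such as msg="a b" together as one token.
    for token in shlex.split(line.strip()):
        name, sep, value = token.partition("=")
        if not sep:
            raise ValueError("not a name=value pair: %r" % token)
        record[name] = value
    return record


if __name__ == "__main__":
    sample = ("ts=2006-12-08T18:48:27.598448Z event=gridFTP.transfer.start "
              "file=myfile.dat src.host=foo.org src.port=4321 "
              "dest.host=bar.org dest.port=1234 transfer.id=11123 p.id=15432\n")
    for name, value in parse_bp(sample).items():
        print(name, "->", value)
```

All values come back as strings; converting ports or identifiers to integers is left to the caller.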

  11. A better mousetrap • Current state-of-the-art: • log in to sites, grep and process logs • or collect a subset of the available information centrally • For troubleshooting, we need: • Consistent, expressive, remote querying • ...of all the available information

  12. Architecture (1)

  13. Architecture (2) [Diagram: a user submits a job to a site. The grid scheduler, the local scheduler, and the compute nodes each write logs to disk. Syslog-ng (Sng) senders forward those logs to a syslog-ng receiver, which feeds log normalization and database insertion into a shared DB.]

  14. Example: PDSF We are now collecting logs 24/7 on PDSF. This captures all SRM, Grid, and local scheduler (SGE) logs.

  15. VDT Package

  16. VDT Package Overview • Relies on Syslog-ng VDT package: “sender” and “receiver” configurations • Contains: • NetLogger pipeline (logs to DB) • Other utilities for parsing, viewing, analyzing logs • Simple web page (PHP) that queries DB

  17. #foo Documentation • For the most part, you can follow along in the 4.1.1-alpha documentation: • http://acs.lbl.gov/NetLogger-releases/doc/4.1.1a/manual.html • Will indicate relevant section with #section at top-right of page • (even though it’s alpha, it’s about as stable as 4.1.0)

  18. #syslogng Syslog-ng VDT package
      $ pacman -get OSG:Syslog-ng
      • syslog-ng-sender.conf
      • syslog-ng-receiver.conf
      • These need to be modified to send to the site’s local collector (TCP transport is more reliable):
      # Destinations
      #destination local_collector { udp("osg-log.uchicago.edu" port(5145) ); };
      destination local_collector { tcp("osp.nersc.gov" port(5145) ); };

  19. #nl_pipeline NetLogger Pipeline nl_parser nl_loader nl_pipeline

  20. #nl_pipeline Pipeline files: default layout
      • etc (site-specific configuration settings)
        • nl_loader.conf
        • nl_parser.conf
      • var
        • log (internal log files)
          • nl_pipeline.log
          • nl_loader.log
          • nl_parser.log
        • run (internal persistent state, for restarts, and PID)
          • nl_loader.pid
          • nl_loader.state
          • nl_parser.pid
          • nl_parser.state
          • nl_pipeline.pid

  21. #nl_pipeline nl_pipeline • Simple “manager” program • Forks the nl_loader and nl_parser • Sends periodic signals to both of them to save state and re-read configuration • Kills them, gracefully, when it is killed (gracefully)

  22. #nl_pipeline nl_pipeline example
      $ nl_pipeline -c ROOT_DIR
      • Looks for nl_loader and nl_parser in same dir. as nl_pipeline (optionally can use $PATH)
      • -n option to show paths used
      ROOT_DIR layout (see prev. slide for details):
      • etc
        • nl_loader.conf
        • nl_parser.conf
      • var
        • log
          • ...
        • run
          • ...

  23. #nl_pipeline Configuration files • Extends familiar “INI” format
      # snippet                  (comment)
      @include shared.conf       (include other file)
      [global]                   (section)
      state_file=${D}/var/run    (variable substitution)
      [[example]]                (sub-section)
      foo=bar

  24. nl_parser • Read multiple “raw” log files from multiple directories • Parse them and output a single output file (with roll-over, a series of such files) • Files can be “tailed” in parallel • Alternate command-line mode: parse a single input to a single output

  25. nl_parser & nl_loader [Diagram: input files → nl_parser → output.M ... output.N (periodic file roll-overs) → nl_loader]

  26. #nl_parser nl_parser configuration • Sets input files • Matches input files to parser “modules” • Sets output file • including roll-over options • Sets program logging level • Sets “throttling” level • Sets “state” file

  27. #nl_parser Choosing parsers (1) • Whole-file parser
      [parser_section]             (arbitrary name)
      [[pbs]]                      (parser module)
      files = pbs*.txt             (files to parse)
      [[[parameters]]]             (module params)
      # parameters for module
      site = pdsf.lbl.gov
      type = FILE~pbs_(.*)\.txt

  28. #nl_parser Choosing parsers (2) • Per-line parser
      [myparser]                                       (arbitrary name)
      files = *.log                                    (files to parse)
      pattern = " (?P<level>[A-Z]+)/(?P<app>):"        (extract pattern from each line)
      [[bestman]]
      [[[match]]]
      app = "bestman"                                  (match to “app” subpattern)
      [[pbs]]
      [[[match]]]
      app = "PBS"                                      (match to “app” subpattern)
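
The per-line selection described on this slide can be sketched in Python as follows: extract named groups from each line, then pick the parser section whose `match` value equals the `app` group. This is an illustration only. The character class inside the `app` group, the sample log lines, and the use of exact string comparison are all assumptions; the slide does not show how nl_parser performs the match internally.

```python
import re

# Assumed header shape " LEVEL/app:"; the [A-Za-z]+ class for "app" is a guess.
HEADER = re.compile(r" (?P<level>[A-Z]+)/(?P<app>[A-Za-z]+):")

# Section name -> value required for the "app" group, mirroring the slide.
PARSERS = {"bestman": "bestman", "pbs": "PBS"}


def route(line, parsers=PARSERS):
    """Return the name of the parser section that should handle this line."""
    found = HEADER.search(line)
    if found is None:
        return None
    for section, wanted_app in parsers.items():
        if found.group("app") == wanted_app:
            return section
    return None


if __name__ == "__main__":
    # Hypothetical log lines, invented for this example.
    for text in ("12:00:01 INFO/bestman: request queued",
                 "12:00:02 WARN/PBS: job held",
                 "a line without a header"):
        print(repr(text), "->", route(text))
```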

  29. #nl_parser nl_parser throttling • If the parser needs to “catch up” by parsing some large input files, it can use a lot of CPU • Throttling is a heuristic to limit this: [global] throttle = 0.2 # max. 20% of 1 CPU
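
One simple way to build such a heuristic is a duty cycle: after each chunk of work, sleep long enough that the busy fraction of elapsed time stays at or below the throttle value. The Python sketch below shows that idea under stated assumptions. It is not NetLogger's implementation, the function name `throttled` is hypothetical, and it measures wall-clock time rather than true CPU time.

```python
import time


def throttled(work_items, process, throttle=0.2,
              sleep=time.sleep, clock=time.perf_counter):
    """Call process(item) for each item, idling so busy time <= throttle."""
    if not 0 < throttle <= 1:
        raise ValueError("throttle must be in (0, 1]")
    for item in work_items:
        start = clock()
        process(item)
        busy = clock() - start
        # Solve busy / (busy + idle) == throttle for idle.
        idle = busy * (1.0 - throttle) / throttle
        if idle > 0:
            sleep(idle)


if __name__ == "__main__":
    # Each "parse" takes about 10 ms, so each is followed by about 40 ms idle.
    throttled(range(3), lambda item: time.sleep(0.01), throttle=0.2)
    print("done")
```

The `sleep` and `clock` parameters exist only so the behavior can be checked without real delays.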

  30. #nl_parser nl_parser state • Current offset in all input files saved periodically • Additional parser-specific state may also be saved • The bestman parser module, for example, maintains state mapping thread id’s to the global request identifier • All state restored at restart / reconfig
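
The offset bookkeeping can be illustrated with a short Python sketch: remember a byte offset per input file, consume only complete lines, and write the state atomically so that a restart resumes where the previous run stopped. The JSON state format and all function names here are assumptions for illustration; the slides do not describe the actual contents of the nl_parser.state file.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved {filename: byte offset} map, or {} on first run."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def save_state(path, offsets):
    """Write the state to a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(offsets, handle)
    os.replace(tmp, path)


def read_new_lines(log_path, offsets):
    """Return complete lines added since the saved offset and advance it."""
    start = offsets.get(log_path, 0)
    with open(log_path, "rb") as handle:
        handle.seek(start)
        data = handle.read()
    end = data.rfind(b"\n") + 1  # leave a trailing partial line for later
    offsets[log_path] = start + end
    return data[:end].decode("utf-8", "replace").splitlines()
```

A caller would invoke `save_state` periodically and `load_state` once at startup, which mirrors the "saved periodically, restored at restart" behavior described above.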

  31. #nl_parser Parser module descriptions • nl_parser -m <module> -i
      $ nl_parser -m bestman -i
      Module     : netlogger.parsers.modules.bestman
      Description: Parse logs from Berkeley Storage Manager (BeStMan).
      Parameters :
        - version {1,2}: Version 1 is anything before bestman-2.2.1.r3, Version 2 is that version and later ones.
        - transfer_only {True,*False*}: For Version2, report only events that are necessary to determine transfer performance.
      See also http://datagrid.lbl.gov/bestman/

  32. #nl_loader nl_loader • Reads a Best-Practices format file and loads it into MySQL, PostgreSQL, or SQLite • Can follow a series of files • delete/move “finished” files • For MySQL and PostgreSQL, the database must exist (e.g. createdb), but nl_loader can create the tables

  33. #nl_loader Database schema [Diagram of tables, with callouts] • event: name, timestamp, level, “.start/.end” suffix • attr: user-defined attributes, e.g. foo=bar • ident: attrs ending in “.id” • “text” attr • “DN” attr

  34. #nl_loader Populating the DB
      ts=2008-09-16T21:52:16.385281Z event=run.start level=Info job.id=123 DN=mydn user=dang
      Callouts: ts, event, and level go to the event table; DN goes to dn; user goes to attr; job.id goes to ident.

  35. Querying the DB
      select e.time, i.value jobid, u.value user, d.value dn
      from event e
      join ident i on i.e_id=e.id
      join attr u on u.e_id=e.id
      join dn d on d.e_id=e.id
      where i.name='job' and u.name='user'
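
To make the join concrete, the Python sketch below builds a tiny in-memory SQLite database, loads the single event from the previous slide, and runs the same query. The table definitions are a simplified guess based on the column names used in the query; the real nl_loader schema has more columns. The output alias `username` replaces `user` only because `user` is a reserved word in some databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema: only the columns the query needs.
conn.executescript("""
CREATE TABLE event (id INTEGER PRIMARY KEY, time TEXT, name TEXT);
CREATE TABLE ident (e_id INTEGER REFERENCES event(id), name TEXT, value TEXT);
CREATE TABLE attr  (e_id INTEGER REFERENCES event(id), name TEXT, value TEXT);
CREATE TABLE dn    (e_id INTEGER REFERENCES event(id), value TEXT);
""")

# event=run.start job.id=123 DN=mydn user=dang, split across the four tables.
conn.execute("INSERT INTO event VALUES (1, '2008-09-16T21:52:16.385281Z', 'run.start')")
conn.execute("INSERT INTO ident VALUES (1, 'job', '123')")
conn.execute("INSERT INTO attr VALUES (1, 'user', 'dang')")
conn.execute("INSERT INTO dn VALUES (1, 'mydn')")

rows = conn.execute("""
    SELECT e.time, i.value AS jobid, u.value AS username, d.value AS dn
    FROM event e
    JOIN ident i ON i.e_id = e.id
    JOIN attr  u ON u.e_id = e.id
    JOIN dn    d ON d.e_id = e.id
    WHERE i.name = 'job' AND u.name = 'user'
""").fetchall()

for row in rows:
    print(row)
```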

  36. #nl_loader nl_loader state • Current offset in input file saved periodically • All state restored at restart / reconfig

  37. Duplicate data • Duplicates can happen • The hash column in the event table avoids duplicate events • nl_loader checks foreign-key constraints at the application layer so when a duplicate event is skipped, no other “junk” ends up in the related tables either
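
The duplicate handling described above can be sketched with a unique hash column: the event insert is attempted first, and rows in related tables are written only if it succeeds. This Python and SQLite example is an assumption-laden illustration. The choice of SHA-256 over the whole line, the two-table schema, and the function names are invented here and are not taken from nl_loader.

```python
import hashlib
import sqlite3


def event_hash(line):
    """Hash the raw log line; which fields nl_loader hashes is not shown."""
    return hashlib.sha256(line.strip().encode("utf-8")).hexdigest()


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE event (id INTEGER PRIMARY KEY, hash TEXT UNIQUE, raw TEXT);
    CREATE TABLE attr  (e_id INTEGER, name TEXT, value TEXT);
    """)
    return conn


def load_event(conn, line, attrs):
    """Insert the event and its attributes; return False for a duplicate."""
    try:
        cur = conn.execute("INSERT INTO event (hash, raw) VALUES (?, ?)",
                           (event_hash(line), line))
    except sqlite3.IntegrityError:
        # Duplicate event: stop before touching any related table.
        return False
    conn.executemany("INSERT INTO attr (e_id, name, value) VALUES (?, ?, ?)",
                     [(cur.lastrowid, n, v) for n, v in attrs.items()])
    return True


if __name__ == "__main__":
    db = make_db()
    line = "ts=2008-09-16T21:52:16.385281Z event=run.start user=dang"
    print(load_event(db, line, {"user": "dang"}))  # first load
    print(load_event(db, line, {"user": "dang"}))  # same line again
```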

  38. Web interface • Simple PHP pages to show how to query database • Can correlate Grid jobs with SGE statistics about those jobs http://localhost/monpage/jobs.php

  39. Web interface

  40. Analysis • Our tool of choice is “R”, a language for data analysis • Existing analyses: • SRM/BeStMan logs • Pegasus-WMS performance analysis • More to come..

  41. Bestman graph example [Graph] Aggregate BW: max ≈ 20 Mb/s; single-stream: 5 Mb/s

  42. Pegasus graph example

  43. nl_actions • Periodic actions (esp. with DB): • Roll over database • Create derived databases • Clean up old files • How it works: • Python modules, like parser modules • Simple configuration file with “schedule” and parameters

  44. Plans

  45. When is the VDT package going to be ready? • Hoped to say “now”. Sigh. • New deadline: April 1st • Software can be used now, but packaging is not complete • Some rough edges in documentation and error-reporting • Early adopters welcome! We will help.

  46. The better mousetrap, revisited Use OGSA-DAI to perform authenticated JOINs across multiple collectors

  47. Sharing logs Sites do not need to trust, or share with, a central repository. Permissions to query logs leverage Grid credentials and are orthogonal to user permissions

  48. perfSONAR integration • Provide perfSONAR interface in front of NetLogger database • i.e. make subset of NetLogger database a perfSONAR Measurement Archive • Or, pull perfSONAR data into NetLogger database • Use perfSONAR as a way of locating NetLogger databases

  49. References • Troubleshooting wiki area: http://www.cedps.net/index.php/Troubleshooting • NetLogger home page: http://acs.lbl.gov/NetLoggerWiki/

  50. Questions?
