340 likes | 416 Views
Troubleshooting Data Movement. Dan Gunter LBNL. Background. Work is part of SciDAC CEDPS (Center for Enabling Distributed Petascale Science) Basic question: Why did my transfer (or remote operation) fail? We want to answer this question before the users even ask it!
E N D
Troubleshooting Data Movement Dan Gunter LBNL
Background • Work is part of SciDAC CEDPS (Center for Enabling Distributed Petascale Science) • Basic question: Why did my transfer (or remote operation) fail? • We want to answer this question before the users even ask it! • Instrument: middleware, applications, etc. • Monitor: gather data (in response to problems) • Diagnose: failures and performance problems
Topics • Two broad categories of work: • Gathering and normalizing existing data to allow analysis across sites (e.g. in OSG) • Adding new data through instrumentation of standard Grid middleware (e.g. GridFTP)
Syslog-ng • Features: • Can filter logs based on level and content • Arbitrary number of sources and destinations • Provide remote logging • Can act as a proxy, tunnel thru firewalls • Execute programs • Send email, load database, etc. • Built-in Log rotation • Timezone support
Log Parser • Normalizes unformatted logs to name=value pairs • Plug-in architecture to make it easy to add additional log file formats • We will provide a set of example parsers • But… • Parsers will be hard to write and maintain • If middleware and application developers follow the “logging best practices” document, the parser component will be not necessary
Missing Event Detector • Assumes all ‘start’ events should have a corresponding ‘end’ event • Looks for missing ‘end’ events • Generates a replacement ‘end’ event with an error code • Planning to develop more sophisticated anomaly detection capabilities as MDS trigger services.
Log Filter • Some sites may not want to forward potentially sensitive information to the VO archive • E.g.: usernames, user DN’s, IP addresses • Syslog-ng can filter entire events • But would prefer to just filter out the sensitive fields • Log filter will be able to remove or anonymize specific fields in the event
Database Loader • This component loads normalized logs into an SQL database • Ability to specify mapping of fields to database columns
Logging “Best Practices” Recommendations • Practices • Each logged event must contain a unique event name and an ISO-format timestamp (e.g. 2007-06-12T07:23:22.887332Z) • All system operations that might fail or experience performance variations should be wrapped with start and end events. • All logs from a given execution thread should be tagged with a globally unique ID (or GUID), such as a Universal Unique Identifiers (UUIDs) • Log format • Logs should be lines of ASCII name=value pairs • Example: ts=2006-12-08T18:48:27.598448Z event=org.globus.gridFTP.transfer.start guid=ID file=filename src.host=H1 src.port=P1 dst.host=H2 dst.port=P2
Event Names • Use a '.' as a separator and go from general to specific • Same as Java class names • First part of name should be used as a unique namespace (e.g.: org.globus) • Use start/end suffixes whenever possible • Helps immensely with troubleshooting • Examples • org.globus.gridFTP.start • org.globus.gridFTP.authn.start • org.globus.gridFTP.authn.end • org.globus.gridFTP.transfer.start • org.globus.gridFTP.transfer.end • org.globus.gridFTP.end org.globus.MDS.response.start org.globus.MDS.query.start org.globus.MDS.query.end org.globus.MDS.write.net.start org.globus.MDS.write.net.end org.globus.MDS.response.end
Reporting Errors • Errors should be reported as part of the ‘end’ event if possible • Use ‘status=N’ (>= 0 success) • Not attempting to define other status codes • too hard to get agreement on these • Example: • ts=2006-12-08T18:39:23.114369Z event=org.globus.authz.gridmap.end status=-1 DN=”/O=CEDS/CN=Some User” msg=”Cannot open gridmap file /etc/grid-security/grid-mapfile for reading” guid=F7D64975-069A-4152-A21F-57109AA46DFA level=ERROR
Error Reporting cont. • Depending on how program is structured, it may be hard to propagate the error message to the ‘end’ event • Use ‘error’ event name suffix in this situation • Examples: • event=org.globus.gridFTP.write.error path=“/home/grid” msg=“write error, disk full” • event=myprogram.input.error msg=“invalid input”
Globally Unique IDs • Use the ‘guid’ reserved name to allow correlation of a set of events together event=org.globus.gridFTP.authn.start guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB event=org.globus.gridFTP.authn.end guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB event=org.globus.gridFTP.transfer.start guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB event=org.globus.gridFTP.transfer.end guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB • UUID easy to generate • uuidgen, uuidlib • MD5 hash in hexadecimal • But free-form ‘id’ also allowed (e.g. process ID)
Example: GridFTP ts=2006-12-08T18:39:23.114369Z event=org.globus.gridFTP.start prog=GridFTP-4.0.3 localhost=myhost remoteHost=somehost.gov:56010 serverMode=inetd id=56010 ts=2006-12-08T18:39:23.114567Z event=org.globus.gridFTP.authn.start DN=“/DC=org/DC=doegrids/OU=People/CN=Somebody” id=56010 ts=2006-12-08T18:39:25.514369Z event=org.globus.gridFTP.authn.end DN=“/DC=org/DC=doegrids/OU=People/CN=Somebody” msg=“123456 successfully authorized” localUser=uscmspool381 id=56010 status=0 ts=2006-12-08T18:39:25.864369Z event=org.globus.gridFTP.transfer.start file=/tmp/myfile tcpBufferSize=128KB dataBlockSize=262144 numStreams=1 numStripes=1 destHost=129.79.4.64 id=56010 ts=2006-12-08T18:45:02.214369Z event=org.globus.gridFTP.transfer.end file=/tmp/myfile bytesTransferred=678433 id=56010 status=0 ts=2006-12-08T18:45:02.214386Z event=org.globus.gridFTP.end id=56010 status=226
Logging API • We are not requiring any special library to generate log messages • We assume that programmers use one of the standard logging APIs (syslog, Java log4j, python logger, etc.) • Could also use ‘printf’, custom logging API, etc. • Syslog-ng can be used to forward any newline-delimited ASCII log file
Status • Working with GT4 developers to add this to • GRAM4, GridFTP, MDS4, Java Core, C Core, Delegation Service • Working with OSG on deployment of syslog-ng to gather up logs
Issue • Would like to have detailed I/O logging for performance analysis • But detailed logs can be far too large and intrusive • For example, a trace of the I/O operations performed by a single GridFTP server capable of saturating a 10 Gigabit network will generate • O(20,000) log events / second • over 70 million per hour • Need: • ongoing report of I/O characteristics • negligible perturbation ( < 1% ) and storage
Solution • Summarization • developer can log 1000’s of events/sec • run-time choice of summary granularity (easy to turn off by default) and algorithm • NetLogger Summarization Library • Summarizes logs before they are ever written to disk • Huge reduction of log volume, while retaining important information • Can be used for bottleneck analysis and performance anomaly detection • General-purpose tool can be extended to do different kinds of summarization (currently only does time-based)
How summarization works Log calls Output Code for (i=0; i < N; i++) { nl_write(log, “loop.start”, “id=i”, 0); double v = do_work(); nl_write(log, “loop.end”, “id=i val=d”, 0, v); } 0 sec start/end of each loop Configuration (XML version) <config> <event-name-keys>event</event-name-keys> <function id='f1' type='tsumm' url='/tmp/summarized.log'>" <param name='interval' >1</param> <param name='value' >val</param> </function> <event-group> <consume/> <function id='f1'> <param name='event'>loop.summ</param> </function> <id-keys>id</id-keys> <event>loop.start</event> <event>loop.end</event> </event-group> </config> 1 sec summary events with average time, average value per time, etc. 2 sec
Programmatic configuration • New NetLogger calls (slightly simplified): • Add Events to Summarizer: add_eventpair( “my.event”, my.event.start / my.event.end “nbytes”, value field, e.g. nbytes=131024 “guid”) identifier field • Set summary interval: set_interval(10) 10 second summary interval • Get summary statistics: I = get_stats(“my.event”)
Anomaly Detection • Summarized events can be used for simple anomaly detection • Summarize disk and network throughput every 10 seconds • Generate an alarm if disk or network drops below threshold X for duration Y
Bottleneck Analysis • Can configure summarizer to just output a single summary at the end • Need to collect summary information at both client and server sides • Because the start/end events measure both... • time inside instrumented function (endi - starti) • time between successive calls (starti+1 - endi) • ..there is potential for determining which functions are busy and which are mostly waiting • admittedly, this is somewhat complicated by OS buffering
Summary • Two broad categories of work: • Gathering and normalizing existing data to allow analysis across sites (e.g. in OSG) • syslog-ng, log parser, db loader • missing event detector • anomaly detection • Adding new data through instrumentation of standard Grid middleware • best practices logging recommendations • summarizer
More Information • CEDPS TS home page: • http://www.cedps.net/wiki/index.php/Troubleshooting • Best-practices sub-page: • http://www.cedps.net/wiki/index.php/LoggingBestPractice • CEDPS TS team • Brian Tierney, LBNL (Area lead) • Jen Schopf, ANL • Stu Martin, ANL • Laura Perlman, ISI
NL summarizer performance NL mode Log dest
NL summarizer overhead I/O, compute overlap