High Performance Monitoring

High Performance Monitoring WG on Storage Federations December 6, 2012 Andrew Hanushevsky, SLAC http://xrootd.org

Setting The Context • High Performance Monitoring • Collecting real-time information at statistically significant detail, in a way that works at scale, without impacting client or server performance. • The relevant phrases • Real-time information • Statistically significant • Without impacting performance • At scale

At Scale • 1000s of users • 10,000 or more simultaneous jobs • 100,000 or more active files • Geographically distributed across • Thousands of data servers • Hundreds of millions of files • Hundreds of petabytes of data • Potentially billions of events every second!

Without Impacting Performance • This requires careful collection & reporting • Many trade-offs but generally • Highly encoded data to minimize traffic • Typically implies binary encoding • Offloading information serialization • More on this at the end • Network protocol that is fast and does not block • Typically implies using UDP
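As a rough illustration of the two trade-offs named on this slide (binary encoding and non-blocking UDP), the sketch below packs events into fixed-size binary records and sends a batch in one datagram, dropping the batch rather than waiting if the socket cannot accept it. The record layout, field names, and function names are assumptions made for this example; this is not the xrootd wire format.

```python
import json
import socket
import struct

# Hypothetical record layout (NOT the xrootd format), network byte order:
# 1-byte event code, 4-byte file id, 8-byte offset, 4-byte length.
EVENT = struct.Struct("!BIQI")


def encode_event(code, file_id, offset, length):
    """Pack one event into a fixed-size binary record."""
    return EVENT.pack(code, file_id, offset, length)


def make_sender():
    """Create a UDP socket that never blocks the calling thread."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)
    return sock


def send_batch(sock, addr, events):
    """Send a batch of events in one datagram; on any send error, drop it."""
    payload = b"".join(encode_event(*e) for e in events)
    try:
        sock.sendto(payload, addr)
        return True
    except OSError:  # includes BlockingIOError when the send buffer is full
        return False  # monitoring data is disposable, so the batch is lost


# Size comparison for a single event: binary record vs. a JSON rendering.
binary_size = len(encode_event(1, 7, 4096, 512))
text_size = len(json.dumps({"code": 1, "file_id": 7, "offset": 4096, "length": 512}))
```

The point of the comparison at the end is only that a fixed binary record is several times smaller than a self-describing text record carrying the same fields.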

Statistically Significant I • All events need not be 100% time accurate • No need to time-stamp each event • We can’t as server performance would suffer • So, we can report events in time-windows • Events are statistically post-distributed in the window • Note that events are reported in occurrence order • Any event is disposable • This means we can lose events • Allows use of non-blocking UDP packets for reporting
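One way a collector might post-distribute events inside a reporting window is sketched below. The slide says only that events are statistically distributed within the window and arrive in occurrence order; the even spacing used here, and the function name, are assumptions chosen for simplicity.

```python
def spread_in_window(window_start, window_end, n_events):
    """Assign estimated timestamps to n_events reported for one time window.

    Events are assumed to arrive in occurrence order, so the i-th event gets
    the midpoint of the i-th equal slice of the window.
    """
    if n_events == 0:
        return []
    step = (window_end - window_start) / n_events
    return [window_start + (i + 0.5) * step for i in range(n_events)]


# Five events reported for a 10-second window starting at t = 100 s.
estimated = spread_in_window(100.0, 110.0, 5)
```

The server only records the window boundaries; the per-event estimate is computed entirely on the collector side.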

Statistically Significant II • Statistical significance relies on a large sample • We want the big picture • This is monitoring not accounting! • Build it up using a large number of events • And we can get a large number every second • But we don’t expect to get every event • This helps us achieve high performance • Yet provides a reasonably accurate picture

Real Time Information • Reporting events close to the time they happen • Regulated by the size of the window • Typically a matter of seconds (e.g. 5 or 10, maybe longer) • What information? • Practically anything that might happen... • Logins and logouts • File operations (open, close, remove, etc.) • File I/O (i.e. reads and writes) • Request redirections

A Practical Implementation • xrootd provides a wide range of monitoring data at high performance • Information is broken out into streams • Asynchronous information packets for • Periodic summary data • Summary stream • Low event rate allows for it to be XML based • Real time detail data • F, M, R, T streams • Potentially high event rates necessitate binary format

Why Streams? • Allows one to easily • Group related information together • Independently select the level of detail in each group • Route information to different collectors • These can be specialized for each stream • Control the performance impact of each stream • Streams can be selectively enabled • Makes it easier to handle the raw data

The Summary Stream • Summary data periodically reported • Very large amount of data available • http://xrootd.org/doc/prod/xrd_monitoring.htm • Selectable by category • Centrally collected • Collector merges reporters • Fed into your favorite monitoring system • Ganglia, GRIS, Nagios, MonALISA, etc. • Relatively low amount of traffic – negligible impact

The Real Time Streams • Easily > 50 MB/sec of complex inter-related asynchronous monitoring data • Collector needs to be fast and robust • May need to cross-reference certain streams • Store the data in an easily analyzable format • E.g. MySQL or ROOT files • Condense the information for suitable rendering • Send it to the rendering agent • E.g. via ActiveMQ to the dashboard • High amount of traffic – high impact

The Real Time M Stream • The Map stream • Server, user, and file names mapped to binary IDs • The IDs are used in other streams as backward refs • Allows >100x compression of redundant information • Gross file events • Purges (auto-removals) & stage-ins (auto-transfers) • Client generated event data • Job name, site, and performance data • Selectable detail levels • Typically, less than 1% overhead
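The name-to-ID mapping described on this slide can be pictured as a small dictionary: the full name is sent once in a map record, and later records carry only the short binary ID. The class and field names below are hypothetical, and the 4-byte ID width is an assumption for the example.

```python
import struct


class IdMap:
    """Assign a small integer ID to each name the first time it is seen."""

    def __init__(self):
        self._ids = {}
        self.pending = []  # (id, name) map records still to be reported

    def id_for(self, name):
        ident = self._ids.get(name)
        if ident is None:
            ident = len(self._ids) + 1
            self._ids[name] = ident
            self.pending.append((ident, name))  # the name travels only once
        return ident


mapper = IdMap()
path = "/store/data/run2012/example/dataset/file-000123.root"  # made-up name
first = mapper.id_for(path)
again = mapper.id_for(path)

# Every later reference costs 4 bytes instead of the full path length.
ref_size = len(struct.pack("!I", again))
name_size = len(path.encode("utf-8"))
```

The saving per reference grows with the length of the name and the number of events that refer to it, which is where the compression of redundant information comes from.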

The Real Time F Stream • The File stream • Per-file I/O summary information • Bytes read, written vs method used • Sigma values for byte and operation counts • Per-file I/O progress information • Periodic report on bytes transferred • Selectable detail levels • 1 to 3% overhead
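If the "sigma values" on this slide are read as standard deviations of request sizes (an interpretation, not something the slide spells out), a server can support them cheaply by keeping three running totals per file: the operation count, the byte sum, and the sum of squares. The collector then derives the mean and sigma. The class below is a minimal sketch with hypothetical names.

```python
import math


class SizeStats:
    """Running totals from which mean and sigma can be derived later."""

    def __init__(self):
        self.ops = 0      # number of operations
        self.total = 0    # sum of bytes
        self.sum_sq = 0   # sum of squared byte counts

    def record(self, nbytes):
        self.ops += 1
        self.total += nbytes
        self.sum_sq += nbytes * nbytes

    def mean(self):
        return self.total / self.ops if self.ops else 0.0

    def sigma(self):
        """Population standard deviation computed from the running totals."""
        if not self.ops:
            return 0.0
        m = self.mean()
        return math.sqrt(max(self.sum_sq / self.ops - m * m, 0.0))


reads = SizeStats()
for size in [2, 4, 4, 4, 5, 5, 7, 9]:
    reads.record(size)
```

Updating three integers per operation avoids storing or sending each individual request size, which is the difference between this summary and a full trace.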

The Real Time R Stream • The Redirect stream • Source to destination redirect information • Operation causing the redirect • Generated by any server that redirects clients • No selectable detail levels • Pretty much all of the information is needed • About 1% overhead

The Real Time T Stream • The Trace stream • Per-file I/O information • Offset and bytes read or written for each operation • Identical to a seek trace • Selectable detail levels • 3 to 5% overhead
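Because each trace record carries an offset and a byte count, a collector can analyze the access pattern much as it would a seek trace. The sketch below computes one possible metric, the fraction of operations that begin exactly where the previous one ended; the metric and the function name are illustrative choices, not something the slides prescribe.

```python
def sequential_fraction(trace):
    """Fraction of operations that start where the previous one ended.

    trace is a list of (offset, length) tuples in occurrence order.
    """
    if len(trace) < 2:
        return 1.0
    sequential = sum(
        1
        for (prev_off, prev_len), (off, _) in zip(trace, trace[1:])
        if off == prev_off + prev_len
    )
    return sequential / (len(trace) - 1)


# Two sequential reads, a jump to offset 500, then one more sequential read.
example = [(0, 100), (100, 100), (500, 50), (550, 50)]
fraction = sequential_fraction(example)
```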

Back To Offloading • Recall xrootd monitoring is async multi-stream • This means that the collector must time-order the data as the server does not do this • Each packet has enough information to do this • We do this because serialization is very expensive • Extremely high impact in a multi-threaded application • The hard work is offloaded to another server • Allows the data server to concentrate on delivering user data not monitoring data
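The collector-side ordering step might look like the sketch below. It assumes each packet carries a window start time and a sequence number; those two fields stand in for whatever ordering information the real packets contain, and the packet tuple layout is invented for this example.

```python
import heapq


def packet_key(packet):
    """Order by (window start, sequence number); both fields are assumed."""
    return (packet[0], packet[1])


def time_order(streams):
    """Merge per-stream packet lists into one time-ordered list.

    streams maps a stream name to a list of (window_start, seq, payload)
    tuples that may have arrived out of order.
    """
    ordered = [sorted(packets, key=packet_key) for packets in streams.values()]
    return list(heapq.merge(*ordered, key=packet_key))


received = {
    "f": [(20, 3, "f-close"), (10, 1, "f-open")],
    "t": [(10, 2, "t-read"), (20, 4, "t-read")],
}
timeline = [payload for _, _, payload in time_order(received)]
```

All of the sorting and merging happens on the collector, so the data server never has to serialize its threads around a shared, ordered monitoring buffer.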

Conclusion I • High performance monitoring is hard work • It requires minute attention to detail • Data formats • Workload distribution • Non-blocking internal data structures • Information flow • We estimate that for xrootd it took about four person-years to achieve an extremely low level of server performance impact • Making real-time monitoring practical at scale

Conclusion II • Federations create an extreme scale system • Viewed as a single complex big data system • The outlined information is needed to assess it • Only practical with high performance monitoring • In essence • High performance real-time monitoring is a must to properly track federated storage systems
