Techniques for Monitoring Large Loosely-coupled Cluster Jobs
Brian L. Tierney, Dan Gunter
Distributed Systems Department, Lawrence Berkeley National Laboratory
Tightly Coupled vs. Loosely Coupled
• Cluster applications can be classified as follows:
  • Tightly Coupled: jobs have a large amount of communication between nodes, usually through specialized interfaces such as the Message Passing Interface (MPI)
  • Loosely Coupled: jobs have occasional synchronization points, but are largely independent
  • Uncoupled: jobs have no communication or synchronization points
• An important class of parallel processing jobs on clusters today is workflow-based applications that process large amounts of data in parallel
  • e.g., searching for supernovae or Higgs particles
• In this context we define a workflow as the processing steps required to analyze a unit of data
Uncoupled / Loosely Coupled Jobs
• This type of computing is often I/O or database bound, not CPU bound.
• Performance analysis requires system-wide analysis of competition for resources such as disk arrays and database tables
  • This is very different from traditional parallel processing analysis of CPU usage and explicitly synchronized communication
• A number of performance analysis tools focus on tightly coupled applications.
  • We are focused on uncoupled and loosely coupled applications
Tools for Tightly Coupled Jobs
• Traditional parallel computing performance analysis tools focus on CPU usage, communication, and memory access patterns, e.g.:
  • TAU (http://www.csi.uoregon.edu/nacse/tau/)
  • Paraver (http://www.cepba.upc.edu/paraver/overview.htm)
  • FPMPI (http://www-unix.mcs.anl.gov/fpmpi/WWW/)
  • Intel Trace Collector (http://www.intel.com/software/products/cluster/tcollector/)
• A number of other projects started out mainly targeting tightly coupled applications, and were later extended or adapted to work for loosely coupled systems as well. These include:
  • SvPablo (http://www.renci.unc.edu/Project/SVPablo/SvPabloOverview.htm)
  • Paradyn (http://www.paradyn.org/)
  • Prophesy (http://prophesy.cs.tamu.edu/)
Sample Loosely Coupled Job
• An example of an uncoupled cluster application is the Nearby Supernova Factory (SNfactory) project at LBL
  • Mission: to find and analyze nearby "Type Ia" supernovae
  • http://snfactory.lbl.gov/
• SNfactory jobs are submitted to the PDSF cluster at NERSC, and typically run on 64-128 nodes
• SNfactory jobs produce about one monitoring event per second on each node
  • a total of up to roughly 1,100,000 events per day
• Roughly 1% of jobs were failing for unknown reasons
  • The SNfactory group came to us for help
Sample Distribution of Job Completion Time
Q: What is the cause of the very long tail?
NetLogger Toolkit
• We have developed the NetLogger Toolkit (short for Networked Application Logger), which includes:
  • tools to make it easy for distributed applications to log interesting events at every critical point
  • tools for host and network monitoring
• The approach combines network, host, and application-level monitoring to provide a complete view of the entire system.
• This has proven invaluable for:
  • isolating and correcting performance bottlenecks
  • debugging distributed applications
NetLogger Components
• The NetLogger Toolkit contains the following components:
  • NetLogger message format
  • NetLogger client library (C, Java, Python, Perl)
  • NetLogger visualization tools
  • NetLogger host/network monitoring tools
• Additional critical component for distributed applications:
  • NTP (Network Time Protocol) or a GPS host clock is required to synchronize the clocks of all systems
NetLogger Methodology
• NetLogger is both a methodology for analyzing distributed systems and a set of tools to help implement that methodology.
  • You can use the NetLogger methodology without using any of the LBNL-provided tools.
• The NetLogger methodology consists of the following:
  • All components must be instrumented to produce monitoring data. These components include application software, middleware, operating system, and networks. The more components that are instrumented, the better.
  • All monitoring events must use a common format, a common set of attributes, and a globally synchronized timestamp
  • Log all of the following events: entering and exiting any program or software component, and begin/end of all I/O (disk and network)
  • Collect all log data in a central location
  • Use event correlation and visualization tools to analyze the monitoring event logs
NetLogger Analysis: Key Concepts
• NetLogger visualization tools are based on time-correlated and object-correlated events.
  • precision timestamps (default = microsecond)
• If applications specify an "object ID" for related events, the NetLogger visualization tools can generate an object "lifeline" (see the sketch below)
  • To associate a group of events into a "lifeline", you must assign an "object ID" to each NetLogger event
  • Sample object IDs: file name, block ID, frame ID, Grid Job ID, etc.
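The following is a minimal sketch of how events can be grouped into lifelines by object ID. It is illustrative only: the plain-dict event representation and the field names (DATE, NL.EVNT, OBJ.ID) are assumptions for the example, not the toolkit's internal API.

from collections import defaultdict

def build_lifelines(events, id_field="OBJ.ID"):
    """Group time- and object-correlated events into per-object lifelines.

    `events` is an iterable of dicts, each carrying a synchronized
    timestamp ("DATE"), an event name ("NL.EVNT"), and an object ID.
    """
    lifelines = defaultdict(list)
    for ev in events:
        obj_id = ev.get(id_field)
        if obj_id is None:
            continue  # events without an object ID cannot be correlated
        lifelines[obj_id].append(ev)
    for evs in lifelines.values():
        evs.sort(key=lambda e: e["DATE"])  # order each lifeline by time
    return lifelines

# Example: two events from one lifeline and one from another,
# keyed by a file-name object ID
events = [
    {"DATE": 1.0, "NL.EVNT": "XFER.START", "OBJ.ID": "img_001.fits"},
    {"DATE": 1.2, "NL.EVNT": "XFER.START", "OBJ.ID": "img_002.fits"},
    {"DATE": 3.5, "NL.EVNT": "XFER.END",   "OBJ.ID": "img_001.fits"},
]
print(dict(build_lifelines(events)))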
Sample NetLogger Instrumentation

import netlogger  # NetLogger Python client library

log = netlogger.LogOutputStream("my.log")
done = 0
while not done:
    # mark the start of the monitored task
    log.write("EVENT.START", {"TEST.SIZE": size})
    # perform the task to be monitored
    done = do_something(data, size)
    # mark the end of the monitored task
    log.write("EVENT.END", {})

• Sample Event:
DATE=20000330112320.957943 HOST=gridhost.lbl.gov PROG=gridApp LVL=Info NL.EVNT=WriteData SEND.SZ=49332
Scaling Issues
• Running a large number of workflows on a cluster generates far too much monitoring data to spot problems with the standard NetLogger lifeline visualization techniques.
• Even for a small set of nodes, these plots can be very dense
Anomaly Detection
• To address this problem, we designed and developed a new NetLogger automatic anomaly detection tool, called nlfindmissing
• The basic idea is to identify lifelines that are missing events (see the sketch below).
  • Users define, as a lifeline, the events that make up an important linear sequence within the workflow.
  • The tool then outputs the incomplete lifelines to a data file or stream.
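As a hedged illustration of the basic idea behind nlfindmissing (not its actual implementation), the sketch below checks each lifeline against the user-defined sequence of event names and reports the lifelines with missing events; the function name and event representation are hypothetical.

def find_incomplete(lifelines, required_sequence):
    """Return the lifelines that are missing one or more required events.

    `lifelines` maps an object ID to a list of event dicts, each with an
    "NL.EVNT" name; `required_sequence` lists the event names that a
    complete lifeline must contain.
    """
    required = set(required_sequence)
    incomplete = {}
    for obj_id, evs in lifelines.items():
        seen = {ev["NL.EVNT"] for ev in evs}
        missing = required - seen
        if missing:
            incomplete[obj_id] = sorted(missing)
    return incomplete

# Example: one complete and one incomplete lifeline
lifelines = {
    "img_001.fits": [{"NL.EVNT": "JOB.START"}, {"NL.EVNT": "JOB.END"}],
    "img_002.fits": [{"NL.EVNT": "JOB.START"}],
}
print(find_incomplete(lifelines, ["JOB.START", "JOB.END"]))
# -> {'img_002.fits': ['JOB.END']}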
Lifeline Timeouts
• Issue: given an open-ended dataset that is too large to fit in memory, how do we determine when to give up waiting for a lifeline to complete?
• Our solution (sketched below):
  • approximate the density function of the lifeline latencies by maintaining a histogram with a relatively large number of bins (e.g., 1000)
  • the timeout becomes a user-selected section of the tail of that histogram, e.g., the 99th percentile
• This works well, runs in a fixed memory footprint, is computationally cheap, and does not rely on any assumptions about the distribution of the data
  • additional parameters, such as a minimum and maximum timeout value and how many lifelines to use as a "baseline" for dynamic calculations, make the method more robust to messy real-world data
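A minimal sketch of this histogram-based timeout estimate, assuming completed-lifeline latencies can be bucketed into a fixed number of bins over a known range; the bin count, range, percentile, and clamping values below are illustrative, not the tool's actual defaults.

class LatencyHistogram:
    """Fixed-memory approximation of the lifeline-latency distribution."""

    def __init__(self, max_latency, num_bins=1000):
        self.max_latency = float(max_latency)
        self.bins = [0] * num_bins
        self.count = 0

    def add(self, latency):
        """Record the latency of a completed lifeline."""
        i = int(latency / self.max_latency * (len(self.bins) - 1))
        self.bins[min(max(i, 0), len(self.bins) - 1)] += 1  # clamp out-of-range
        self.count += 1

    def percentile(self, p):
        """Latency below which a fraction p of lifelines completed."""
        target = p * self.count
        running = 0
        for i, n in enumerate(self.bins):
            running += n
            if running >= target:
                return (i + 1) / (len(self.bins) - 1) * self.max_latency
        return self.max_latency

# Timeout = 99th percentile of observed completion times,
# clamped to user-chosen minimum and maximum values.
hist = LatencyHistogram(max_latency=3600.0)
for latency in (10.0, 12.0, 11.5, 600.0):
    hist.add(latency)
timeout = min(max(hist.percentile(0.99), 30.0), 1800.0)
print("give up on lifelines older than %.1f seconds" % timeout)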
Anomalies Only
Monitoring Data Management Issues
• A challenge for application instrumentation on large clusters is sifting through the volume of data that even a modest amount of instrumentation can generate.
• For example, suppose a 24-hour application run produces 50 MB of application and host monitoring data per node:
  • a 32-node cluster is almost manageable (50 MB x 32 nodes = 1.6 GB)
  • but scaled to a 512-node cluster, the amount of data becomes quite unwieldy (50 MB x 512 nodes = 25.6 GB)
Data Collection
nldemux
• The nldemux tool is then used to group monitoring data into manageable pieces (sketched below):
  • the Ganglia data is placed in its own directory, with data from each node written to a separate file; the entire directory is rolled over once per day
  • the workflow data are placed in files named for the observation date at the telescope; this information is carried in each event record
  • data is removed after 3 weeks
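The sketch below illustrates the kind of grouping nldemux performs rather than the tool itself: each event record is routed to a per-node file under a directory named for the observation date carried in the record. The field names (OBS.DATE, HOST) and directory layout are assumptions for the example; rollover and the 3-week retention policy are omitted.

import os

def demux(events, base_dir="monitoring"):
    """Append each event to <base_dir>/<observation date>/<host>.log."""
    for ev in events:
        day_dir = os.path.join(base_dir, ev["OBS.DATE"])
        os.makedirs(day_dir, exist_ok=True)
        path = os.path.join(day_dir, ev["HOST"] + ".log")
        line = " ".join("%s=%s" % (k, v) for k, v in sorted(ev.items()))
        with open(path, "a") as f:
            f.write(line + "\n")

# Example: two nodes reporting events for the same observation date
demux([
    {"OBS.DATE": "20050610", "HOST": "pdsf001", "NL.EVNT": "JOB.START"},
    {"OBS.DATE": "20050610", "HOST": "pdsf002", "NL.EVNT": "JOB.START"},
])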
Grid Workflow Identifiers (GIDs)
• A globally unique key is needed to identify a workflow
• It must be propagated down and across workflow components
  • This is the hard part! (see the sketch below)
• Options:
  • modify application interfaces
  • add a SOAP header
• Acronyms:
  • RFT = Reliable File Transfer service
  • GridFTP = Grid File Transfer Protocol
  • PBS = Portable Batch System
  • HPSS = High Performance Storage System
  • SRM = Storage Resource Manager
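As a hedged sketch of GID propagation (one possible mechanism, not necessarily the one used here), the example below mints a globally unique identifier at submission time, passes it to child components through an environment variable, and attaches it to every monitoring event so that events from different components can later be joined. The variable name NL_GID and the helper functions are hypothetical.

import os
import subprocess
import uuid

GID_ENV_VAR = "NL_GID"  # hypothetical variable name for this example

def get_or_create_gid():
    """Reuse the workflow's GID if one was propagated, else mint one."""
    gid = os.environ.get(GID_ENV_VAR)
    if gid is None:
        gid = str(uuid.uuid4())
        os.environ[GID_ENV_VAR] = gid
    return gid

def run_component(cmd):
    """Launch a workflow component, propagating the GID in its environment."""
    env = dict(os.environ, **{GID_ENV_VAR: get_or_create_gid()})
    return subprocess.call(cmd, env=env)

def annotate(event):
    """Attach the GID to a monitoring event so it can be correlated later."""
    event["GID"] = get_or_create_gid()
    return event

print(annotate({"NL.EVNT": "XFER.START", "HOST": "pdsf001"}))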
Without GIDs
With GIDs
For More Information
• http://dsd.lbl.gov/NetLogger/
• Source code (open source) and publications available