Techniques for Monitoring Large Loosely-coupled Cluster Jobs
Brian L. Tierney, Dan Gunter
Distributed Systems Department, Lawrence Berkeley National Laboratory
Tightly Coupled vs. Loosely Coupled
• Cluster applications can be classified as follows:
  • Tightly Coupled: jobs have a large amount of communication between nodes, usually through specialized interfaces such as the Message Passing Interface (MPI)
  • Loosely Coupled: jobs have occasional synchronization points, but are largely independent
  • Uncoupled: jobs have no communication or synchronization points
• An important class of parallel processing jobs on clusters today is workflow-based applications that process large amounts of data in parallel
  • e.g., searching for supernovae or Higgs particles
• In this context we define a workflow as the processing steps required to analyze a unit of data
Uncoupled / Loosely Coupled Jobs
• This type of computing is often I/O or database bound, not CPU bound.
• Performance analysis requires system-wide analysis of competition for resources such as disk arrays and database tables
  • This is very different from traditional parallel processing analysis of CPU usage and explicitly synchronized communication
• A number of performance analysis tools focus on tightly coupled applications.
  • We are focused on uncoupled and loosely coupled applications
Tools for Tightly Coupled Jobs
• Traditional parallel computing performance analysis tools focus on CPU usage, communication, and memory access patterns, e.g.:
  • TAU (http://www.csi.uoregon.edu/nacse/tau/)
  • Paraver (http://www.cepba.upc.edu/paraver/overview.htm)
  • FPMPI (http://www-unix.mcs.anl.gov/fpmpi/WWW/)
  • Intel Trace Collector (http://www.intel.com/software/products/cluster/tcollector/)
• A number of other projects started out mainly targeting tightly coupled applications, and were later extended or adapted to work for loosely coupled systems as well. These include:
  • SvPablo (http://www.renci.unc.edu/Project/SVPablo/SvPabloOverview.htm)
  • Paradyn (http://www.paradyn.org/)
  • Prophesy (http://prophesy.cs.tamu.edu/)
Sample Loosely Coupled Job
• An example of an uncoupled cluster application is the Nearby Supernova Factory (SNfactory) project at LBL
  • Mission: to find and analyze nearby "Type Ia" supernovae
  • http://snfactory.lbl.gov/
• SNfactory jobs are submitted to the PDSF cluster at NERSC, and typically run on 64-128 nodes
• SNfactory jobs produce about one monitoring event per second on each node
  • a total of up to roughly 1,100,000 events per day
• Roughly 1% of jobs were failing for unknown reasons
  • The SNfactory group came to us for help
Sample Distribution of Job Completion Time
Q: What is the cause of the very long tail?
NetLogger Toolkit
• We have developed the NetLogger Toolkit (short for Networked Application Logger), which includes:
  • tools to make it easy for distributed applications to log interesting events at every critical point
  • tools for host and network monitoring
• The approach combines network, host, and application-level monitoring to provide a complete view of the entire system.
• This has proven invaluable for:
  • isolating and correcting performance bottlenecks
  • debugging distributed applications
NetLogger Components
• The NetLogger Toolkit contains the following components:
  • NetLogger message format
  • NetLogger client library (C, Java, Python, Perl)
  • NetLogger visualization tools
  • NetLogger host/network monitoring tools
• Additional critical component for distributed applications:
  • NTP (Network Time Protocol) or a GPS host clock is required to synchronize the clocks of all systems
NetLogger Methodology
• NetLogger is both a methodology for analyzing distributed systems and a set of tools to help implement that methodology.
  • You can use the NetLogger methodology without using any of the LBNL-provided tools.
• The NetLogger methodology consists of the following:
  • All components must be instrumented to produce monitoring data. These components include application software, middleware, operating system, and networks. The more components that are instrumented, the better.
  • All monitoring events must use a common format, a common set of attributes, and a globally synchronized timestamp
  • Log all of the following events: entering and exiting any program or software component, and begin/end of all I/O (disk and network)
  • Collect all log data in a central location
  • Use event correlation and visualization tools to analyze the monitoring event logs
NetLogger Analysis: Key Concepts
• NetLogger visualization tools are based on time-correlated and object-correlated events.
  • precision timestamps (default = microsecond)
• If applications specify an "object ID" for related events, the NetLogger visualization tools can generate an object "lifeline" (see the sketch below)
  • To associate a group of events into a "lifeline", you must assign an "object ID" to each NetLogger event
  • Sample object IDs: file name, block ID, frame ID, Grid Job ID, etc.
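The following is a minimal sketch of how events can be grouped into lifelines by object ID. It is illustrative only: the plain-dict event representation and the field names (DATE, NL.EVNT, OBJ.ID) are assumptions for the example, not the toolkit's internal API.

from collections import defaultdict

def build_lifelines(events, id_field="OBJ.ID"):
    """Group time- and object-correlated events into per-object lifelines.

    `events` is an iterable of dicts, each carrying a synchronized
    timestamp ("DATE"), an event name ("NL.EVNT"), and an object ID.
    """
    lifelines = defaultdict(list)
    for ev in events:
        obj_id = ev.get(id_field)
        if obj_id is None:
            continue  # events without an object ID cannot be correlated
        lifelines[obj_id].append(ev)
    for evs in lifelines.values():
        evs.sort(key=lambda e: e["DATE"])  # order each lifeline by time
    return lifelines

# Example: two events from one lifeline and one from another,
# keyed by a file-name object ID
events = [
    {"DATE": 1.0, "NL.EVNT": "XFER.START", "OBJ.ID": "img_001.fits"},
    {"DATE": 1.2, "NL.EVNT": "XFER.START", "OBJ.ID": "img_002.fits"},
    {"DATE": 3.5, "NL.EVNT": "XFER.END",   "OBJ.ID": "img_001.fits"},
]
print(dict(build_lifelines(events)))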
Sample NetLogger Instrumentation

import netlogger  # NetLogger Python client library

log = netlogger.LogOutputStream("my.log")
done = 0
while not done:
    # mark the start of the monitored task
    log.write("EVENT.START", {"TEST.SIZE": size})
    # perform the task to be monitored
    done = do_something(data, size)
    # mark the end of the monitored task
    log.write("EVENT.END", {})

• Sample Event:
DATE=20000330112320.957943 HOST=gridhost.lbl.gov PROG=gridApp LVL=Info NL.EVNT=WriteData SEND.SZ=49332
Scaling Issues
• Running a large number of workflows on a cluster generates far too much monitoring data to spot problems with the standard NetLogger lifeline visualization techniques.
• Even for a small set of nodes, these plots can be very dense
Anomaly Detection
• To address this problem, we designed and developed a new NetLogger automatic anomaly detection tool, called nlfindmissing
• The basic idea is to identify lifelines that are missing events (see the sketch below).
  • Users define, as a lifeline, the events that make up an important linear sequence within the workflow.
  • The tool then outputs the incomplete lifelines to a data file or stream.
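As a hedged illustration of the basic idea behind nlfindmissing (not its actual implementation), the sketch below checks each lifeline against the user-defined sequence of event names and reports the lifelines with missing events; the function name and event representation are hypothetical.

def find_incomplete(lifelines, required_sequence):
    """Return the lifelines that are missing one or more required events.

    `lifelines` maps an object ID to a list of event dicts, each with an
    "NL.EVNT" name; `required_sequence` lists the event names that a
    complete lifeline must contain.
    """
    required = set(required_sequence)
    incomplete = {}
    for obj_id, evs in lifelines.items():
        seen = {ev["NL.EVNT"] for ev in evs}
        missing = required - seen
        if missing:
            incomplete[obj_id] = sorted(missing)
    return incomplete

# Example: one complete and one incomplete lifeline
lifelines = {
    "img_001.fits": [{"NL.EVNT": "JOB.START"}, {"NL.EVNT": "JOB.END"}],
    "img_002.fits": [{"NL.EVNT": "JOB.START"}],
}
print(find_incomplete(lifelines, ["JOB.START", "JOB.END"]))
# -> {'img_002.fits': ['JOB.END']}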
Lifeline Timeouts
• Issue: given an open-ended dataset that is too large to fit in memory, how do we determine when to give up waiting for a lifeline to complete?
• Our solution (sketched below):
  • approximate the density function of the lifeline latencies by maintaining a histogram with a relatively large number of bins (e.g., 1000)
  • the timeout becomes a user-selected section of the tail of that histogram, e.g., the 99th percentile
• This works well, runs in a fixed memory footprint, is computationally cheap, and does not rely on any assumptions about the distribution of the data
  • additional parameters, such as a minimum and maximum timeout value and how many lifelines to use as a "baseline" for dynamic calculations, make the method more robust to messy real-world data
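A minimal sketch of this histogram-based timeout estimate, assuming completed-lifeline latencies can be bucketed into a fixed number of bins over a known range; the bin count, range, percentile, and clamping values below are illustrative, not the tool's actual defaults.

class LatencyHistogram:
    """Fixed-memory approximation of the lifeline-latency distribution."""

    def __init__(self, max_latency, num_bins=1000):
        self.max_latency = float(max_latency)
        self.bins = [0] * num_bins
        self.count = 0

    def add(self, latency):
        """Record the latency of a completed lifeline."""
        i = int(latency / self.max_latency * (len(self.bins) - 1))
        self.bins[min(max(i, 0), len(self.bins) - 1)] += 1  # clamp out-of-range
        self.count += 1

    def percentile(self, p):
        """Latency below which a fraction p of lifelines completed."""
        target = p * self.count
        running = 0
        for i, n in enumerate(self.bins):
            running += n
            if running >= target:
                return (i + 1) / (len(self.bins) - 1) * self.max_latency
        return self.max_latency

# Timeout = 99th percentile of observed completion times,
# clamped to user-chosen minimum and maximum values.
hist = LatencyHistogram(max_latency=3600.0)
for latency in (10.0, 12.0, 11.5, 600.0):
    hist.add(latency)
timeout = min(max(hist.percentile(0.99), 30.0), 1800.0)
print("give up on lifelines older than %.1f seconds" % timeout)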
Anomalies Only
Monitoring Data Management Issues
• A challenge for application instrumentation on large clusters is sifting through the volume of data that even a modest amount of instrumentation can generate.
• For example, suppose a 24-hour application run produces 50 MB of application and host monitoring data per node:
  • a 32-node cluster is almost manageable (50 MB x 32 nodes = 1.6 GB)
  • but scaled to a 512-node cluster, the amount of data becomes quite unwieldy (50 MB x 512 nodes = 25.6 GB)
Data Collection
nldemux
• The nldemux tool is then used to group monitoring data into manageable pieces (sketched below):
  • the Ganglia data is placed in its own directory, with data from each node written to a separate file; the entire directory is rolled over once per day
  • the workflow data are placed in files named for the observation date at the telescope; this information is carried in each event record
  • data is removed after 3 weeks
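The sketch below illustrates the kind of grouping nldemux performs rather than the tool itself: each event record is routed to a per-node file under a directory named for the observation date carried in the record. The field names (OBS.DATE, HOST) and directory layout are assumptions for the example; rollover and the 3-week retention policy are omitted.

import os

def demux(events, base_dir="monitoring"):
    """Append each event to <base_dir>/<observation date>/<host>.log."""
    for ev in events:
        day_dir = os.path.join(base_dir, ev["OBS.DATE"])
        os.makedirs(day_dir, exist_ok=True)
        path = os.path.join(day_dir, ev["HOST"] + ".log")
        line = " ".join("%s=%s" % (k, v) for k, v in sorted(ev.items()))
        with open(path, "a") as f:
            f.write(line + "\n")

# Example: two nodes reporting events for the same observation date
demux([
    {"OBS.DATE": "20050610", "HOST": "pdsf001", "NL.EVNT": "JOB.START"},
    {"OBS.DATE": "20050610", "HOST": "pdsf002", "NL.EVNT": "JOB.START"},
])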
Grid Workflow Identifiers (GIDs)
• A globally unique key is needed to identify a workflow
• It must be propagated down and across workflow components
  • This is the hard part! (see the sketch below)
• Options:
  • modify application interfaces
  • add a SOAP header
• Acronyms:
  • RFT = Reliable File Transfer service
  • GridFTP = Grid File Transfer Protocol
  • PBS = Portable Batch System
  • HPSS = High Performance Storage System
  • SRM = Storage Resource Manager
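As a hedged sketch of GID propagation (one possible mechanism, not necessarily the one used here), the example below mints a globally unique identifier at submission time, passes it to child components through an environment variable, and attaches it to every monitoring event so that events from different components can later be joined. The variable name NL_GID and the helper functions are hypothetical.

import os
import subprocess
import uuid

GID_ENV_VAR = "NL_GID"  # hypothetical variable name for this example

def get_or_create_gid():
    """Reuse the workflow's GID if one was propagated, else mint one."""
    gid = os.environ.get(GID_ENV_VAR)
    if gid is None:
        gid = str(uuid.uuid4())
        os.environ[GID_ENV_VAR] = gid
    return gid

def run_component(cmd):
    """Launch a workflow component, propagating the GID in its environment."""
    env = dict(os.environ, **{GID_ENV_VAR: get_or_create_gid()})
    return subprocess.call(cmd, env=env)

def annotate(event):
    """Attach the GID to a monitoring event so it can be correlated later."""
    event["GID"] = get_or_create_gid()
    return event

print(annotate({"NL.EVNT": "XFER.START", "HOST": "pdsf001"}))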
Without GIDs
With GIDs
For More Information
• http://dsd.lbl.gov/NetLogger/
• Source code (open source) and publications available