Detecting, Managing, and Diagnosing Failures with FUSE

John Dunagan, Juhan Lee (MSN), Alec Wolman
WIP

Goals & Target Environment
• Improve the ability of large internet portals to gain insight into failures
• Non-goals:
  • masking failures
  • using machine learning to infer abnormal behavior

MSN Background
• Messenger, www.msn.com, Hotmail, Search, many other “properties”
• Large (> 100 million users)
• Sources of complexity:
  • multiple data centers
  • large number of machines
  • complex internal network topology
  • diversity of applications and software infrastructure

The Plan
• Detecting, managing, and diagnosing failures
• Review MSN’s current approaches
• Describe our solution at a high level

Detecting Failures
• Monitor system availability with heartbeats
• Monitor application availability and quality of service using synthetic requests
• Customer complaints
  • Telephone, email
Problems:
• These approaches provide limited coverage: failures that don’t affect every request are harder to catch
• Data on detected failures often lacks the detail needed to suggest a remedy:
  • Which front end is flaky?
  • Which app component caused the end-user failure?

Managing Failures
Definition:
• Ability to prioritize failures
• Detect component service degradation
• Characterize app stability
• Capacity planning
• When server “x” fails, what is the impact of this failure?
• Better use of ops and engineering resources
• Current approach: no systematic attempt to provide this functionality

Our solution (in 2 steps)
Detecting and Managing Failures
• Step 1: Instrument applications to track user requests across the “service chain”
  • Each request is tagged with a unique id
  • Service chain is composed on the fly with the help of app instrumentation
  • For each request:
    • Collect per-hop performance information
    • Collect per-request failure status
  • Centralized data collection
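One way to picture Step 1 is the sketch below. It is not the MSN implementation: HopRecord, Collector, and Tracing.Traced are hypothetical names invented for illustration, and the collector is an in-process stand-in for the centralized data collection. Each hop runs its work under the request's unique id, times it, and reports success or failure.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical record of one hop in the service chain.
public sealed record HopRecord(Guid RequestId, string Hop, double ElapsedMs, bool Failed);

// Hypothetical in-process stand-in for the centralized collector.
public sealed class Collector
{
    private readonly List<HopRecord> _records = new();
    private readonly object _gate = new();

    public void Report(HopRecord record)
    {
        lock (_gate) _records.Add(record);
    }

    // All hop records for one request, in the order they were reported.
    public IReadOnlyList<HopRecord> ForRequest(Guid requestId)
    {
        lock (_gate) return _records.FindAll(r => r.RequestId == requestId);
    }
}

public static class Tracing
{
    // Runs one hop's work under the request's id, recording elapsed
    // time and whether the work threw an exception.
    public static T Traced<T>(Collector collector, Guid requestId, string hop, Func<T> work)
    {
        var timer = Stopwatch.StartNew();
        bool failed = true;
        try
        {
            T result = work();
            failed = false;
            return result;
        }
        finally
        {
            collector.Report(new HopRecord(
                requestId, hop, timer.Elapsed.TotalMilliseconds, failed));
        }
    }
}
```

A front end would create the id once with Guid.NewGuid() and pass it to every downstream call, so that ForRequest(id) later returns the hops the request reached and how long each took.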

What kinds of failures?
We can handle:
• Machine failures
• Network connectivity problems
Most:
• Misconfiguration
• Application bugs
But not all:
• Application errors where the app itself doesn’t detect that there is a problem

Diagnosing Failures
• Assigning responsibility to a specific hw or sw component
• Insight into internals of a component
• Cross-component interactions
• Current approach: instrument applications
  • App-specific log messages
• Problems
  • High request rates => log rollover
  • Perceived overhead => detailed logging enabled during testing, disabled in production

FUSE Background
• FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred
• Lack of a positive ack => failure
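The "lack of a positive ack => failure" rule can be illustrated with a small timeout monitor. This is not the FUSE protocol itself, only a sketch of the rule that silence is treated as failure; AckMonitor and its members are hypothetical names, and the timeout value is an assumption. Time is passed in explicitly so the behavior is deterministic.

```csharp
using System;

// Hypothetical monitor for one peer; illustrates only the rule
// "no positive ack within the timeout means failure".
public sealed class AckMonitor
{
    private readonly TimeSpan _timeout;
    private DateTime _lastAck;

    public AckMonitor(TimeSpan timeout, DateTime now)
    {
        _timeout = timeout;
        _lastAck = now;
    }

    // Record that the peer positively acknowledged at time 'now'.
    public void OnPositiveAck(DateTime now) => _lastAck = now;

    // True once more than the timeout has elapsed since the last ack.
    public bool HasFailed(DateTime now) => now - _lastAck > _timeout;
}
```

With a 5-second timeout and an ack at second 3, HasFailed is false at second 7 and true at second 9. The peer never has to report its own failure, which is why crashes and network partitions are covered by the same rule.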

Step 2: Conditional Logging
• Implement “conditional logging” to significantly reduce the overhead of collecting detailed logs across different machines in the service chain
• Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain
• While fate is undecided: detailed log messages are stored in main memory
  • Common-case overhead of logging is vastly reduced
• Once the fate of the service chain is decided, we discard app logs for successful requests and save logs for failures
  • Quantity of data generated is manageable when most requests are successful
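The buffering behavior described in Step 2 might look like the following sketch. ConditionalLog, Log, and Resolve are hypothetical names, not the deck's API, and the persist callback stands in for whatever durable log store is used. The sketch is single-threaded; a production version would need synchronization and a bound on memory use.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-machine buffer for conditional logging (not thread-safe).
public sealed class ConditionalLog
{
    private readonly Dictionary<Guid, List<string>> _pending = new();
    private readonly Action<Guid, IReadOnlyList<string>> _persist;

    // 'persist' is an assumed callback that writes a failed request's log durably.
    public ConditionalLog(Action<Guid, IReadOnlyList<string>> persist) => _persist = persist;

    // While the request's fate is undecided, messages stay in main memory.
    public void Log(Guid requestId, string message)
    {
        if (!_pending.TryGetValue(requestId, out var lines))
        {
            lines = new List<string>();
            _pending[requestId] = lines;
        }
        lines.Add(message);
    }

    // Called once the fate of the chain is known:
    // persist the buffered lines on failure, drop them on success.
    public void Resolve(Guid requestId, bool failed)
    {
        if (_pending.Remove(requestId, out var lines) && failed)
            _persist(requestId, lines);
    }

    public int PendingRequests => _pending.Count;
}
```

In the common case a request succeeds and Resolve(id, false) simply frees the buffer, so no detailed log data leaves main memory.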

Example
[Diagram: Client, Server1, Server2, Server3, with an X marking a failure]
Benefits:
• FUSE allows monitoring of real transactions.
  • All transactions, or a sampled subset to control overhead.
• When a request fails, FUSE provides an audit trail
  • How far did it get?
  • How long did each step take?
  • Any additional application-specific context.
• FUSE can be deployed incrementally.

Issues
• Overload policy: need to handle bursts of failures without inducing more failures
• How much effort to make apps FUSE-enabled?
• Are the right components FUSE-enabled?
• Identifying and filtering false positives
• Tracking request flow is non-trivial with network load balancers

Status
• We’ve implemented FUSE for MSN, integrated with the ASP.NET rendering engine
• Testing in progress
• Roll-out at end of summer

Backups

FUSE is Easy to Integrate
Example current code on Front End:
    ReceiveRequestFromClient(…) {
        …
        SendRequestToBackEnd(…);
    }
Example code on Front End using FUSE:
    ReceiveRequestFromClient(…, FUSEinfo f) {  // default value of f = null
        if ( f != null )
            JoinFUSEGroup( f );
        …
        SendRequestToBackEnd(…, f );
    }
Current implementation is in C# and consists of 2400 LOC
