Detecting, Managing, and Diagnosing Failures with FUSE John Dunagan, Juhan Lee (MSN), Alec Wolman WIP
Goals & Target Environment • Improve the ability of large internet portals to gain insight into failures • Non-goals: • masking failures • using machine learning to infer abnormal behavior
MSN Background • Messenger, www.msn.com, Hotmail, Search, many other “properties” • Large (> 100 million users) • Sources of Complexity: • multiple data-centers • large # of machines • complex internal network topology • diversity of applications and software infrastructure
The Plan • Detecting, managing, and diagnosing failures • Review MSN’s current approaches • Describe our solution at a high level
Detecting Failures • Monitor system availability with heartbeats • Monitor application availability & quality of service using synthetic requests • Customer complaints (telephone, email) • Problems: • These approaches provide limited coverage: it is harder to catch failures that don’t affect every request • Data on detected failures often lacks the detail needed to suggest a remedy: • which front end is flaky? • which app component caused the end-user failure?
Managing Failures Definition: • Ability to prioritize failures • Detect component service degradation • Characterizing app-stability • Capacity planning • When server “x” fails, what is the impact of this failure? • Better use of ops and engineering resources • Current approach: no systematic attempt to provide this functionality
Our solution (in 2 steps) Detecting and Managing Failures • Step 1: Instrument applications to track user requests across the “service chain” • Each request is tagged with a unique id • Service chain is composed on-the-fly with help of app instrumentation • For each request: • Collect per-hop performance information • Collect per-request failure status • Centralized data collection
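As a rough illustration of Step 1, the sketch below tags a request with a unique id and records per-hop timing and failure status before handing the trace to a central collector. It is written in Python for brevity rather than the C# of the actual implementation, and every name in it (`RequestTrace`, `HopRecord`, `Collector`) is hypothetical, not part of FUSE or the MSN code.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class HopRecord:
    hop: str           # name of the server that handled this step
    elapsed_ms: float  # time spent at this hop
    ok: bool           # per-hop failure status


@dataclass
class RequestTrace:
    # Hypothetical per-request record; the id is generated once at the front end
    # and would be passed along with the request to every later hop.
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    hops: list = field(default_factory=list)

    def record(self, hop, handler):
        """Run one hop's handler and append its timing and status to the trace."""
        start = time.perf_counter()
        ok = True
        try:
            handler()
        except Exception:
            ok = False
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self.hops.append(HopRecord(hop, elapsed_ms, ok))
        return ok

    @property
    def failed(self):
        # The request counts as failed if any hop reported a failure.
        return any(not h.ok for h in self.hops)


class Collector:
    """In-memory stand-in for centralized data collection (an assumption)."""

    def __init__(self):
        self.traces = {}

    def submit(self, trace):
        self.traces[trace.request_id] = trace


def failing_backend():
    raise RuntimeError("backend down")


trace = RequestTrace()
trace.record("front_end", lambda: None)
trace.record("back_end", failing_backend)
collector = Collector()
collector.submit(trace)
```

Here the service chain is simply the ordered list of hops that touched the request, which mirrors the idea of composing the chain on the fly instead of declaring it up front.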
What kinds of failures? • We can handle: • Machine failures • Network connectivity problems • We can handle most: • Misconfiguration • Application bugs • But not all: • Application errors where the app itself doesn’t detect that there is a problem
Diagnosing Failures • Assigning responsibility to a specific hw or sw component • Insight into internals of a component • Cross component interactions • Current approach: instrument applications • App-specific log messages • Problems • High request rates => log rollover • Perceived overhead => detailed logging enabled during testing, disabled in production
FUSE Background • FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred • Lack of a positive ack => failure
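The rule "lack of a positive ack => failure" can be sketched as a small single-process toy. This is not the FUSE protocol itself, which runs across machines; the class name `FailureGroup`, the explicit `now` arguments, and the timeout value are all assumptions made for illustration.

```python
class FailureGroup:
    """Toy model: every member must keep acking, and silence counts as failure."""

    def __init__(self, members, timeout_s, now):
        self.timeout_s = timeout_s
        # Each member is treated as having acked at creation time.
        self.last_ack = {m: now for m in members}
        self.failed = False

    def ack(self, member, now):
        # Acks are ignored once the group has failed: the decision is one-way.
        if not self.failed and member in self.last_ack:
            self.last_ack[member] = now

    def signal_failure(self):
        # Any participant may explicitly declare the group failed.
        self.failed = True

    def has_failed(self, now):
        # No positive ack within the timeout is treated as a failure.
        if not self.failed:
            self.failed = any(
                now - t > self.timeout_s for t in self.last_ack.values()
            )
        return self.failed


group = FailureGroup(["server1", "server2"], timeout_s=1.0, now=0.0)
group.ack("server1", now=0.5)
group.ack("server2", now=0.5)
healthy_at_1s = group.has_failed(now=1.0)   # both acked recently
group.ack("server1", now=1.2)               # server2 goes silent
failed_at_2s = group.has_failed(now=2.0)
group.ack("server2", now=2.0)               # too late, the group stays failed
still_failed = group.has_failed(now=2.1)
```

The point of the toy is that participants only ever agree on one bit, failed or not, and that the bit never flips back.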
Step 2: Conditional Logging • Implement “conditional logging” to significantly reduce the overhead of collecting detailed logs across the different machines in the service chain • Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain • While the fate of a request is undecided, detailed log messages are stored in main memory • Common-case overhead of logging is vastly reduced • Once the fate of the service chain is decided, we discard app logs for successful requests and save logs for failures • The quantity of data generated is manageable when most requests are successful
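A minimal sketch of conditional logging on one machine, assuming each log call carries the request id from Step 1 and that some outside mechanism (FUSE in the slides) reports the final fate of the request. The class `ConditionalLogger` and its `saved` dictionary, which stands in for durable storage, are hypothetical.

```python
from collections import defaultdict


class ConditionalLogger:
    def __init__(self):
        # request_id -> detailed messages held in main memory while fate is undecided
        self._pending = defaultdict(list)
        # Stand-in for durable storage; only failed requests end up here.
        self.saved = {}

    def log(self, request_id, message):
        # Cheap in the common case: append to an in-memory list, no I/O.
        self._pending[request_id].append(message)

    def decide(self, request_id, failed):
        """Called once the fate of the service chain is known."""
        messages = self._pending.pop(request_id, [])
        if failed:
            self.saved[request_id] = messages
        # Successful requests fall through and their messages are discarded.
        return len(messages)


logger = ConditionalLogger()
logger.log("req-a", "parsed query")
logger.log("req-b", "parsed query")
logger.log("req-b", "backend timeout after 3 retries")
logger.decide("req-a", failed=False)
logger.decide("req-b", failed=True)
```

After both decisions, only the messages for `req-b` remain, which is why the data volume stays manageable when most requests succeed.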
Example [Diagram: Client, Server1, Server2, and Server3 in a service chain, with an X marking a failure] Benefits: • FUSE allows monitoring of real transactions. • All transactions, or a sampled subset to control overhead. • When a request fails, FUSE provides an audit trail • How far did it get? • How long did each step take? • Any additional application-specific context. • FUSE can be deployed incrementally.
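One way to monitor "a sampled subset" is to derive the sampling decision from the request id, so that every server in the chain reaches the same answer without coordination. The slides do not say how sampling is done; the hash-based scheme and the function name `is_sampled` below are assumptions.

```python
import hashlib


def is_sampled(request_id: str, rate: float) -> bool:
    """Return True for roughly `rate` of all request ids, deterministically."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare against the threshold in integer space to avoid float rounding.
    return bucket < int(rate * 2**64)


# Every hop that sees the same id makes the same decision.
same_decision = is_sampled("req-123", 0.1) == is_sampled("req-123", 0.1)
sampled_count = sum(is_sampled(f"id-{i}", 0.1) for i in range(10000))
```

Because the decision depends only on the id, a sampled request is traced along the whole chain and an unsampled one is skipped everywhere.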
Issues • Overload policy: need to handle bursts of failures without inducing more failures • How much effort to make apps FUSE enabled? • Are the right components FUSE enabled? • Identifying and filtering false positives • Tracking request flow is non-trivial with network load balancers
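The overload issue is left open in the slides. Purely as one possible policy, and not the authors' design, the sketch below caps how many failure logs are saved per time window and counts the rest as dropped, so a burst of failures cannot turn log writing into a new source of load. `FailureLogBudget` and its parameters are hypothetical.

```python
class FailureLogBudget:
    """Fixed-window cap on the number of failure logs saved (an assumed policy)."""

    def __init__(self, max_per_window, window_s):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.window_start = None
        self.used = 0
        self.dropped = 0

    def allow(self, now):
        # Start a fresh window on first use or when the current one has expired.
        if self.window_start is None or now - self.window_start >= self.window_s:
            self.window_start = now
            self.used = 0
        if self.used < self.max_per_window:
            self.used += 1
            return True
        # Over budget: skip the expensive save but remember that we did.
        self.dropped += 1
        return False


budget = FailureLogBudget(max_per_window=2, window_s=1.0)
burst = [budget.allow(0.0), budget.allow(0.1), budget.allow(0.2)]
after_window = budget.allow(1.5)
```

Keeping a `dropped` count means operators can still see that a burst happened even when the detailed logs for part of it were not saved.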
Status • We’ve implemented FUSE for MSN, integrated with the ASP.NET rendering engine • Testing in progress • Roll-out at end of summer
FUSE is Easy to Integrate
Example current code on Front End:
    ReceiveRequestFromClient(…) {
        …
        SendRequestToBackEnd(…);
    }
Example code on Front End using FUSE:
    ReceiveRequestFromClient(…, FUSEinfo f) { // default value of f = null
        if ( f != null )
            JoinFUSEGroup( f );
        …
        SendRequestToBackEnd(…, f );
    }
Current implementation is in C#, and consists of 2400 LOC