This research paper investigates the use of HTTP access logs to quickly detect and localize application-level failures in Internet services. The study introduces online algorithms for anomaly detection and demonstrates a GUI tool for real-time detection. The paper also explores future work and presents case studies from Ebates.com.
Using HTTP Access Logs To Detect Application-Level Failures In Internet Services
Peter Bodík, UC Berkeley
Greg Friedman, Lukas Biewald, Stanford University
HT Levine, Ebates.com
George Candea, Stanford University
Motivation
• problem:
  • takes weeks/months to detect some failures in Internet services
• assumption:
  • users change their behavior in response to failures
  • e.g., can’t follow a link from /shopping_cart to /checkout
• goal:
  • quickly detect changes/anomalies in users’ access patterns
  • localize the cause of the change:
    • which page is causing problems?
    • did the page transitions change?
Outline
• online algorithms for anomaly detection
• demo of a GUI tool for real-time detection
• questions we have
• future work
Anomalies in user access patterns
• why this approach to failure detection?
  • leverages the aggregate intelligence of people using the site
  • identifying page access patterns can help localize failures
  • doesn’t need any instrumentation
• types of anomalies
  • unexpected: signify failures/problems
  • expected: verify the changes/updates to the website
• what types of user patterns we can observe
  • frequencies of individual pages
  • page transitions
  • user sessions
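The three pattern types above can all be extracted from plain HTTP access logs. A minimal sketch, assuming Common Log Format lines and using the client address as a stand-in for a session key (the talk does not specify the parsing details):

```python
# Hypothetical sketch: extract page frequencies, page transitions, and
# per-user sessions from Common Log Format access lines. The field
# positions (0 = client, 6 = request path) are assumptions.
from collections import Counter, defaultdict

def parse_logs(lines):
    """Return per-page hit counts, transition counts, and user sessions."""
    page_hits = Counter()
    transitions = Counter()
    sessions = defaultdict(list)          # client -> ordered list of pages
    for line in lines:
        fields = line.split()
        client, page = fields[0], fields[6]
        page_hits[page] += 1
        if sessions[client]:              # record previous-page -> page
            transitions[(sessions[client][-1], page)] += 1
        sessions[client].append(page)
    return page_hits, transitions, sessions
```

No server-side instrumentation is needed; everything the detectors consume comes out of this pass over the logs.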
Real-world failures from Ebates.com
• Ebates.com
  • mid-sized e-commerce site
  • provided 5 sets of HTTP logs (1–2 week periods)
  • we have access to emails and chat logs from periods of problems
• each data set contains one or more failures
  • mostly site crashes
• examples
  • problem with survey pages
  • broken signup page
  • bad DB query
[Figure: normal traffic, 11am – 3am; per-page hit counts on a log scale (1, 10, 100 hits per 5 minutes)]
[Figure: anomaly, 7am – 1pm; per-page hit counts on a log scale (1, 10, 100 hits per 2 minutes)]
Online detection of anomalies
• assign an anomaly score to the current time interval
• handling anomalous intervals in the past:
  • use all intervals
  • put less weight on the anomalous intervals
  • ignore anomalous intervals
• localization of problems
  • most anomalous pages
  • changes in page transitions
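The online loop above can be sketched as follows. This illustrates the third policy listed (ignore anomalous intervals): each interval is scored against a sliding baseline of recent intervals, and only non-anomalous intervals are folded back in. The z-score statistic, window size, and threshold are assumptions, not the talk's exact choices:

```python
# Minimal sketch of online interval scoring with anomalous intervals
# excluded from the baseline. Window and threshold are assumptions.
from collections import deque
from statistics import mean, stdev

def score_interval(baseline, hits):
    """Anomaly score = |z-score| of the interval's hit count vs. the baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(hits - mu) / sigma if sigma > 0 else 0.0

def online_detect(hit_counts, window=72, threshold=3.0):
    baseline = deque(maxlen=window)       # e.g. 72 five-minute intervals
    scores = []
    for hits in hit_counts:
        s = score_interval(baseline, hits) if len(baseline) >= 2 else 0.0
        scores.append(s)
        if s < threshold:                 # ignore anomalous intervals
            baseline.append(hits)
    return scores
```

Keeping anomalous intervals out of the baseline prevents a long-running failure from redefining "normal" and silencing its own alarm.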
Two algorithms
• chi-square test
  • count hits to the top 40 pages in the past 6 hours and the past 10 minutes
  • compare relative frequencies using the chi-square test
  • more sensitive to frequent pages
  • compare page transitions before and during the anomaly
• Naive Bayes anomaly detection
  • assume that page frequencies are independent
  • model the frequency of each page as a Gaussian
  • learn mean and variance from the past
  • anomaly score = 1 − Prob(current interval is normal)
  • more sensitive to infrequent pages
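Both detectors can be sketched directly from the bullets above. These are illustrative reconstructions, not the paper's exact code: the chi-square statistic is computed from raw counts, and the Naive Bayes score combines per-page two-sided Gaussian tail probabilities under the stated independence assumption:

```python
# Hedged sketches of the two detectors. Window contents (top-40 pages,
# 6 hours vs. 10 minutes) are supplied by the caller as count dicts.
import math

def chi_square_score(long_counts, short_counts):
    """Chi-square statistic comparing short-window page frequencies
    (e.g. past 10 minutes) against long-window ones (e.g. past 6 hours)."""
    n_long = sum(long_counts.values())
    n_short = sum(short_counts.values())
    stat = 0.0
    for page, c_long in long_counts.items():
        expected = n_short * c_long / n_long   # expected short-window hits
        observed = short_counts.get(page, 0)
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat

def naive_bayes_score(means, stds, current):
    """anomaly score = 1 - Prob(current interval is normal), where
    P(normal) is the product of per-page two-sided Gaussian tail
    probabilities (page frequencies assumed independent)."""
    p_normal = 1.0
    for page, x in current.items():
        mu, sigma = means[page], stds[page]
        if sigma > 0:
            z = abs(x - mu) / sigma
            p_normal *= math.erfc(z / math.sqrt(2))  # two-sided tail prob.
    return 1.0 - p_normal
```

The sensitivity difference falls out of the math: a rare page contributes little to the chi-square sum (small expected count) but can collapse the Naive Bayes product on its own.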
[Figure: two anomalies; number of hits to the top 10 pages over time (hours), on a log scale (1, 10, 100 hits per 5 minutes), with the anomaly score and anomaly threshold overlaid]
GUI tool for real-time detection
• why a GUI tool?
  • builds the trust of the operators
    • why should the operator believe the algorithm?
    • “a picture is worth a thousand words”
  • supports manual monitoring/inspection of traffic by operators
  • makes SLT usable in real life
• report 1 warning instead of an anomaly every minute
• compare:

warning #3:
  detection time: Sun Nov 16 19:27:00 PST 2003
  start: Sun Nov 16 19:24:00 PST 2003
  end:   Sun Nov 16 21:05:00 PST 2003
  significance = 7.05
  Most anomalous pages:
    /landing.jsp           19.55
    /landing_merchant.jsp  19.50
    /mall_ctrl.jsp          3.69
    /malltop.go             2.63
    /mall.go                2.18
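The "1 warning instead of an anomaly every minute" idea amounts to merging consecutive above-threshold intervals into a single warning with a start, end, and peak significance, as in the sample warning above. A minimal sketch, with the data shapes as assumptions:

```python
# Sketch: merge runs of consecutive anomalous intervals into warnings.
def merge_into_warnings(scores, threshold):
    """scores: list of (timestamp, anomaly_score), in time order.
    Returns one warning dict per contiguous above-threshold run."""
    warnings, current = [], None
    for t, s in scores:
        if s >= threshold:
            if current is None:           # a new run starts here
                current = {"start": t, "end": t, "significance": s}
            else:                         # extend the open run
                current["end"] = t
                current["significance"] = max(current["significance"], s)
        elif current is not None:         # run ended: emit one warning
            warnings.append(current)
            current = None
    if current is not None:
        warnings.append(current)
    return warnings
```

An operator then sees one entry per incident rather than a page of per-minute alarms, which is what makes the tool usable during a real failure.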
Summary of successful results
• october 2003 – broken signup page:
  • noticed the problem 7 days earlier + correctly localized!
• november 2003 – account page problem:
  • 1st warning: 16 hours earlier
  • 2nd warning: 1 hour earlier + correctly localized the bad page!
• july 2001 – landing looping problem:
  • warning 2 days earlier + correctly localized
• detected a failure they didn’t tell us about
• detected three other significant anomalies
  • feedback: “these might have been important, but we didn’t know about them. definitely useful if detected in real-time.”
How to evaluate?
• information from HT Levine:
  • time + root cause of major failures (site down, ...)
  • time of minor problems (DB alarm, ...)
  • harmless updates (code push, page update)
• scenario:
  • page A pushed at 3:30pm, Monday
  • anomaly on page A at 6pm, Monday
  • mostly ok for next 48 hours
  • site down at 6pm, Wednesday
  • would detecting the anomaly on Monday help?
What is a true/false positive?
• detected a major/minor problem: GREAT
• detected a regular site update: ???
• detected a significant anomaly, BUT
  • Ebates knows nothing about it
  • no major problems at that time
  • ???
• detected anomalies almost every night
  • certainly a false positive
Build a simulator?
• site: PetStore, RUBiS?
• failures: try failures from Ebates
• user simulator: based on real logs from Ebates
• cons:
  • less realistic (how to build a realistic simulator of users?)
• pros:
  • know exactly what happened in the system (measure TTD)
  • try many different failures
  • use for evaluating TCQ-based preprocessing
Localization
• Naive Bayes is better at localization
  • likely reason: more sensitive to infrequent pages
Future work
• develop better quantitative measures for analysis
• GUI tool
  • deploy at Ebates
  • make available as open source
    • could help convince other companies to provide failure data
• detect more subtle problems
  • harder to detect using current methods
• explore HCI aspects of our approach
Conclusions
• very simple algorithms can detect serious failures
• visualization helps understand the anomaly
• we have an almost-perfect source of failure data
  • complete HTTP logs
  • operators willing to cooperate
  • emails, chat logs from periods of problems
• still hard to evaluate