This research paper investigates the use of HTTP access logs to quickly detect and localize application-level failures in Internet services. The study introduces online algorithms for anomaly detection and demonstrates a GUI tool for real-time detection. The paper also explores future work and presents case studies from Ebates.com.
Using HTTP Access Logs To Detect Application-Level Failures In Internet Services • Peter Bodík, UC Berkeley • Greg Friedman, Lukas Biewald, Stanford University • HT Levine, Ebates.com • George Candea, Stanford University
Motivation • problem: • takes weeks/months to detect some failures in Internet services • assumption: • users change their behavior in response to failures • e.g., can’t follow a link from /shopping_cart to /checkout • goal: • quickly detect changes/anomalies in users’ access patterns • localize the cause of the change: • which page is causing problems? • did the page transitions change?
Outline • online algorithms for anomaly detection • demo of a GUI tool for real-time detection • questions we have • future work
Anomalies in user access patterns • why this approach to failure detection? • leverages aggregate intelligence of people using the site • identifying page access patterns can help localize failures • don’t need any instrumentation • types of anomalies • unexpected: signify failures/problems • expected: verify the changes/updates to the website • what types of user patterns can we observe? • frequencies of individual pages • page transitions • user sessions
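The first two kinds of user patterns listed above (page frequencies and page transitions) can be counted directly from log records. The following is a minimal sketch, not the authors' code. It assumes each log line has already been reduced to a hypothetical `(session_id, path)` pair in time order; how sessions are identified in the Ebates logs is not stated in the slides.

```python
from collections import Counter


def page_counts(records):
    """Count hits per page. `records` is an iterable of (session_id, path)."""
    return Counter(path for _, path in records)


def transition_counts(records):
    """Count page-to-page transitions within each session.

    Assumes records are in time order; a transition is two consecutive
    requests by the same session.
    """
    last_page = {}          # session_id -> most recent path seen
    transitions = Counter()
    for session_id, path in records:
        previous = last_page.get(session_id)
        if previous is not None:
            transitions[(previous, path)] += 1
        last_page[session_id] = path
    return transitions


# Toy data; the paths are made up for illustration.
records = [
    ("a", "/home"),
    ("b", "/home"),
    ("a", "/shopping_cart"),
    ("a", "/checkout"),
    ("b", "/shopping_cart"),
]
hits = page_counts(records)
moves = transition_counts(records)
```

A failure such as the broken /shopping_cart to /checkout link mentioned in the motivation would show up here as a drop in the count for that transition pair.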
Real-world failures from Ebates.com • Ebates.com • mid-sized e-commerce site • provided 5 sets of HTTP logs (1-2 week period) • have access to emails, chat logs from periods of problems • each data set contains one or more failures • mostly site crashes • examples • problem with survey pages • broken signup page • bad DB query
[Figure: Normal traffic, 11am – 3am. Hits per page in 5-minute intervals; scale marks at 1, 10, and 100 hits / 5 minutes.]
[Figure: Anomaly, 7am – 1pm. Hits per page in 2-minute intervals; scale marks at 1, 10, and 100 hits / 2 minutes.]
Online detection of anomalies • assign anomaly score to the current time interval • handling anomalous intervals in the past • use all intervals • less weight on the anomalous intervals • ignore anomalous intervals • localization of problems • most anomalous pages • changes in page transitions
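The three options for handling past anomalous intervals can be expressed as weights applied when learning "normal" statistics from history. This is a sketch under assumptions: the policy names, the `reduced_weight` value of 0.1, and the use of a weighted mean and variance are illustrative choices, not details given in the slides.

```python
def interval_weights(past_scores, threshold, policy, reduced_weight=0.1):
    """Return one weight per past interval.

    policy is one of:
      "use_all"     - every interval counts fully
      "less_weight" - anomalous intervals get `reduced_weight` (hypothetical value)
      "ignore"      - anomalous intervals are dropped (weight 0)
    """
    if policy not in ("use_all", "less_weight", "ignore"):
        raise ValueError(f"unknown policy: {policy}")
    weights = []
    for score in past_scores:
        if score <= threshold or policy == "use_all":
            weights.append(1.0)
        elif policy == "less_weight":
            weights.append(reduced_weight)
        else:  # "ignore"
            weights.append(0.0)
    return weights


def weighted_mean_and_variance(values, weights):
    """Weighted mean and (population) variance of per-interval hit counts."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all past intervals were excluded")
    mean = sum(w * v for w, v in zip(weights, values)) / total
    variance = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean, variance


# Three past intervals; the middle one was flagged as anomalous.
past_scores = [0.1, 0.9, 0.2]
past_hits = [10, 100, 12]
w = interval_weights(past_scores, threshold=0.5, policy="ignore")
mean, variance = weighted_mean_and_variance(past_hits, w)
```

With "ignore", the spike of 100 hits does not inflate the learned mean and variance, so a later spike of the same kind is still scored as anomalous; with "use_all", repeated anomalies gradually become part of the baseline.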
Two algorithms • chi-square test • count hits to top 40 pages in the past 6 hours and the past 10 minutes • compare relative frequencies using the chi-square test • more sensitive to frequent pages • compare page transitions before and during the anomaly • Naive Bayes anomaly detection • assume that page frequencies are independent • model frequency of each page as a Gaussian • learn mean and variance from the past • anomaly score = 1 - Prob(current interval is normal) • more sensitive to infrequent pages
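Both detectors can be sketched in a few lines of standard-library Python. The slides do not give the exact formulas, so the choices below are assumptions: the chi-square statistic is computed as a two-sample test of homogeneity over a 2 x K table (history window vs. current window), and Prob(current interval is normal) is taken to be the product of per-page two-sided Gaussian tail probabilities. The variance floor `min_variance` is also a hypothetical parameter.

```python
import math
from statistics import mean, pvariance


def chi_square_statistic(history, current):
    """Chi-square statistic comparing page-hit distributions in two windows.

    `history` and `current` map page -> hit count (e.g. past 6 hours and
    past 10 minutes). Larger values mean the relative frequencies differ more.
    """
    pages = set(history) | set(current)
    n_hist = sum(history.values())
    n_cur = sum(current.values())
    total = n_hist + n_cur
    stat = 0.0
    for page in pages:
        column = history.get(page, 0) + current.get(page, 0)
        for observed, row_total in ((history.get(page, 0), n_hist),
                                    (current.get(page, 0), n_cur)):
            expected = row_total * column / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def fit_gaussians(past_intervals, min_variance=1.0):
    """Learn (mean, variance) of hits per page from past intervals."""
    pages = set().union(*past_intervals)
    model = {}
    for page in pages:
        xs = [interval.get(page, 0) for interval in past_intervals]
        model[page] = (mean(xs), max(pvariance(xs), min_variance))
    return model


def naive_bayes_score(model, current):
    """Anomaly score = 1 - Prob(normal), pages treated as independent."""
    log_prob = 0.0
    for page, (mu, var) in model.items():
        z = abs(current.get(page, 0) - mu) / math.sqrt(var)
        tail = math.erfc(z / math.sqrt(2))   # two-sided Gaussian tail
        log_prob += math.log(max(tail, 1e-300))
    return 1.0 - math.exp(log_prob)


# Toy example with made-up pages and counts.
history = {"/a": 600, "/b": 400}
same_mix = chi_square_statistic(history, {"/a": 60, "/b": 40})
shifted_mix = chi_square_statistic(history, {"/a": 10, "/b": 90})

past = [{"/a": 100, "/b": 10}, {"/a": 110, "/b": 12}, {"/a": 90, "/b": 8}]
model = fit_gaussians(past)
score_normal = naive_bayes_score(model, {"/a": 100, "/b": 10})
score_anomalous = naive_bayes_score(model, {"/a": 100, "/b": 60})
```

In practice the chi-square statistic would be compared against a critical value for K - 1 degrees of freedom, and the product of tail probabilities over 40 pages is small even for ordinary traffic, so a real threshold would need calibration on historical data.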
[Figure: Two Anomalies. Top: anomaly score and anomaly threshold over time (hours). Bottom: number of hits to the top 10 pages; scale marks at 1, 10, and 100 hits / 5 minutes.]
GUI tool for real-time detection • why do we need a GUI tool? • build trust of the operators • why should the operator believe the algorithm? • “a picture is worth a thousand words” • manual monitoring/inspection of traffic by operators • make SLT usable in real life • report 1 warning instead of an anomaly every minute • compare:
warning #3:
detection time: Sun Nov 16 19:27:00 PST 2003
start: Sun Nov 16 19:24:00 PST 2003
end: Sun Nov 16 21:05:00 PST 2003
significance = 7.05
Most anomalous pages:
/landing.jsp 19.55
/landing_merchant.jsp 19.50
/mall_ctrl.jsp 3.69
/malltop.go 2.63
/mall.go 2.18
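Reporting one warning instead of an anomaly every minute amounts to grouping consecutive anomalous minutes into a single interval with a start and an end. A minimal sketch follows, assuming per-minute anomaly scores and a hypothetical `max_gap` parameter that lets a warning survive a few quiet minutes; the tool's actual grouping rule and its "significance" value are not described in the slides.

```python
def group_warnings(scores, threshold, max_gap=2):
    """Merge anomalous minutes into warnings.

    `scores[i]` is the anomaly score of minute i. Anomalous minutes that are
    separated by at most `max_gap` non-anomalous minutes join the same warning.
    Returns a list of dicts with start, end (minute indices) and peak score.
    """
    warnings = []
    for minute, score in enumerate(scores):
        if score <= threshold:
            continue
        last = warnings[-1] if warnings else None
        if last is not None and minute - last["end"] - 1 <= max_gap:
            # Close enough to the previous anomaly: extend that warning.
            last["end"] = minute
            last["peak"] = max(last["peak"], score)
        else:
            warnings.append({"start": minute, "end": minute, "peak": score})
    return warnings


# Ten minutes of made-up scores: anomalies at minutes 1, 2, 4 and 9.
scores = [0.1, 0.9, 0.95, 0.2, 0.8, 0.1, 0.1, 0.1, 0.1, 0.99]
warnings = group_warnings(scores, threshold=0.5)
```

Here four anomalous minutes collapse into two warnings, which is the kind of reduction that keeps an operator from being paged every minute during one long incident.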
Summary of successful results • October 2003 – broken signup page: • noticed the problem 7 days earlier + correctly localized! • November 2003 – account page problem: • 1st warning: 16 hours earlier • 2nd warning: 1 hour earlier + correctly localized the bad page! • July 2001 – landing looping problem: • warning 2 days earlier + correctly localized • detected a failure they didn’t tell us about • detected three other significant anomalies • feedback: “these might have been important, but we didn’t know about them. definitely useful if detected in real-time.”
How to evaluate? • information from HT Levine: • time + root cause of major failures (site down, ...) • time of minor problems (DB alarm, ...) • harmless updates (code push, page update) • scenario: • page A pushed at 3:30pm, Monday • anomaly on page A at 6pm, Monday • mostly ok for next 48 hours • site down at 6pm, Wednesday • would detecting the anomaly on Monday help??
What is a true/false positive? • detected a major/minor problem: GREAT • detected a regular site update: ??? • detected a significant anomaly, BUT • Ebates knows nothing about it • no major problems at that time • ??? • detected anomalies almost every night • certainly a false positive
Build a simulator? • site: PetStore, Rubis? • failures: try failures from Ebates • user simulator: based on real logs from Ebates • cons: • less realistic (how to build a realistic simulator of users?) • pros: • know exactly what happened in the system (measure TTD) • try many different failures • use for evaluating TCQ-based preprocessing
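With a simulator, the failure injection time is known, so time to detect (TTD) reduces to a subtraction. The helper below is a sketch; the function name and the convention of ignoring warnings raised before the injected failure are assumptions.

```python
def time_to_detect(failure_start, detection_times):
    """Return the delay between an injected failure and the first warning.

    Times are in any consistent unit (e.g. minutes since the start of the run).
    Warnings raised before the failure are ignored. Returns None if the
    failure was never detected.
    """
    later = [t for t in detection_times if t >= failure_start]
    if not later:
        return None
    return min(later) - failure_start


# Failure injected at minute 100; warnings raised at minutes 40, 130 and 250.
ttd = time_to_detect(100, [40, 130, 250])
missed = time_to_detect(100, [40])
```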
Localization • Naive Bayes better at localization • likely reason: more sensitive to infrequent pages
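One plausible reading of why a per-page Gaussian model localizes well: each page is scored relative to its own historical variance, so a small absolute change on an infrequent page can outrank a larger change on a busy page. The sketch below ranks pages by absolute z-score. This scoring rule and the `min_variance` floor are assumptions; the slides do not say how the per-page scores in the warnings are computed.

```python
import math
from statistics import mean, pvariance


def most_anomalous_pages(past_intervals, current, top_n=5, min_variance=1.0):
    """Rank pages by how far the current hit count is from its history.

    `past_intervals` is a non-empty list of dicts (page -> hits), `current`
    is a dict for the interval being examined. Returns (page, |z|) pairs,
    most anomalous first.
    """
    pages = set(current).union(*past_intervals)
    ranked = []
    for page in pages:
        history = [interval.get(page, 0) for interval in past_intervals]
        mu = mean(history)
        sigma = math.sqrt(max(pvariance(history), min_variance))
        ranked.append((page, abs(current.get(page, 0) - mu) / sigma))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Made-up data: /home is busy and noisy, /signup is rare but steady.
past = [
    {"/home": 1000, "/signup": 10},
    {"/home": 1040, "/signup": 12},
    {"/home": 960, "/signup": 8},
]
current = {"/home": 1020, "/signup": 0}
ranking = most_anomalous_pages(past, current)
```

In this toy case /signup loses only 10 hits while /home gains 20, yet /signup ranks first because 10 hits is a large deviation relative to its usual variation.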
Future work • develop better quantitative measures for analysis • GUI tool • deploy at Ebates • make available as open source • could help convince other companies to provide failure data • detect more subtle problems • harder to detect using current methods • explore HCI aspects of our approach
Conclusions • very simple algorithms can detect serious failures • visualization helps understand the anomaly • have almost-perfect source of failure data • complete HTTP logs • operators willing to cooperate • emails, chat logs from periods of problems • still hard to evaluate