
Using HTTP Access Logs To Detect Application-Level Failures In Internet Services

This presentation investigates the use of HTTP access logs to quickly detect and localize application-level failures in Internet services. It introduces online algorithms for anomaly detection, demonstrates a GUI tool for real-time detection, presents case studies from Ebates.com, and outlines future work.


Presentation Transcript


  1. Using HTTP Access Logs To Detect Application-Level Failures In Internet Services
     • Peter Bodík, UC Berkeley
     • Greg Friedman, Lukas Biewald, Stanford University
     • HT Levine, Ebates.com
     • George Candea, Stanford University

  2. Motivation
     • problem:
       • takes weeks/months to detect some failures in Internet services
     • assumption:
       • users change their behavior in response to failures
       • e.g., can’t follow a link from /shopping_cart to /checkout
     • goal:
       • quickly detect changes/anomalies in users’ access patterns
       • localize the cause of the change:
         • which page is causing problems?
         • did the page transitions change?

  3. Outline
     • online algorithms for anomaly detection
     • demo of a GUI tool for real-time detection
     • questions we have
     • future work

  4. Anomalies in user access patterns
     • why this approach to failure detection?
       • leverages aggregate intelligence of people using the site
       • identifying page access patterns can help localize failures
       • don’t need any instrumentation
     • types of anomalies
       • unexpected: signify failures/problems
       • expected: verify the changes/updates to the website
     • what types of user patterns can we observe?
       • frequencies of individual pages
       • page transitions
       • user sessions
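The first two observable patterns listed on this slide, per-page hit frequencies and page transitions, can be derived from parsed log records roughly as follows. This is a minimal sketch, not the authors' code: it assumes each log line has already been parsed into a hypothetical `(unix_timestamp, session_id, path)` tuple, and the 5-minute bucket width is likewise an assumption.

```python
from collections import Counter, defaultdict


def page_counts_per_interval(hits, interval_seconds=300):
    """Count hits per page in fixed-width time buckets.

    hits: iterable of (unix_timestamp, session_id, path) tuples (assumed format).
    Returns {bucket_start: Counter({path: hit_count})}.
    """
    buckets = defaultdict(Counter)
    for ts, _session, path in hits:
        bucket_start = int(ts) - int(ts) % interval_seconds
        buckets[bucket_start][path] += 1
    return dict(buckets)


def page_transitions(hits):
    """Count (from_page, to_page) pairs between consecutive hits of a session."""
    last_page = {}
    transitions = Counter()
    for _ts, session, path in sorted(hits):  # sort by timestamp first
        if session in last_page:
            transitions[(last_page[session], path)] += 1
        last_page[session] = path
    return transitions


# Tiny made-up example: two sessions, one of which reaches /checkout.
hits = [
    (0, "s1", "/shopping_cart"),
    (20, "s2", "/shopping_cart"),
    (40, "s1", "/checkout"),
    (310, "s2", "/shopping_cart"),
]
counts = page_counts_per_interval(hits)
transitions = page_transitions(hits)
```

In this toy data, a failure that blocks the link from /shopping_cart to /checkout would show up as a missing or shrinking `("/shopping_cart", "/checkout")` entry in `transitions`.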

  5. Real-world failures from Ebates.com
     • Ebates.com
       • mid-sized e-commerce site
       • provided 5 sets of HTTP logs (1-2 week period)
       • have access to emails, chat logs from periods of problems
     • each data set contains one or more failures
       • mostly site crashes
     • examples
       • problem with survey pages
       • broken signup page
       • bad DB query

  6. [Figure] Normal traffic: 11am – 3am (legend: 1, 10, and 100 hits / 5 minutes)

  7. [Figure] Anomaly: 7am – 1pm (legend: 1, 10, and 100 hits / 2 minutes)

  8. Online detection of anomalies
     • assign anomaly score to the current time interval
     • handling anomalous intervals in the past
       • use all intervals
       • less weight on the anomalous intervals
       • ignore anomalous intervals
     • localization of problems
       • most anomalous pages
       • changes in page transitions
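The three options for handling past anomalous intervals can be expressed as a choice of per-interval weights when estimating the "normal" baseline. The sketch below is illustrative only: the strategy names, the `reduced_weight` value of 0.1, and the sample numbers are assumptions, since the slide does not specify them.

```python
def interval_weights(anomaly_flags, strategy, reduced_weight=0.1):
    """Return one weight per past interval for the chosen strategy.

    anomaly_flags: list of bools, True where a past interval was flagged.
    reduced_weight is an assumed value; the talk does not give one.
    """
    if strategy == "use_all":
        return [1.0 for _ in anomaly_flags]
    if strategy == "down_weight":
        return [reduced_weight if flagged else 1.0 for flagged in anomaly_flags]
    if strategy == "ignore":
        return [0.0 if flagged else 1.0 for flagged in anomaly_flags]
    raise ValueError(f"unknown strategy: {strategy}")


def weighted_mean(values, weights):
    """Weighted average of past hit counts; the baseline for one page."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("no usable intervals")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Made-up history for one page: the third interval was anomalous.
past_hits = [100, 100, 10, 100]
flags = [False, False, True, False]
baselines = {
    strategy: weighted_mean(past_hits, interval_weights(flags, strategy))
    for strategy in ("use_all", "down_weight", "ignore")
}
```

With these numbers, using all intervals drags the baseline down to 77.5 hits, while ignoring the flagged interval keeps it at 100; down-weighting lands in between, close to the unaffected value.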

  9. Two algorithms
     • chi-square test
       • count hits to top 40 pages in the past 6 hours and the past 10 minutes
       • compare relative frequencies using the chi-square test
       • more sensitive to frequent pages
       • compare page transitions before and during the anomaly
     • Naive Bayes anomaly detection
       • assume that page frequencies are independent
       • model frequency of each page as a Gaussian
       • learn mean and variance from the past
       • anomaly score = 1 - Prob(current interval is normal)
       • more sensitive to infrequent pages
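The chi-square comparison of a long reference window against a short current window might look like the following. This is a sketch under assumptions: the slide does not say which chi-square formulation was used, so the code computes the standard Pearson statistic for a 2 × k contingency table (two windows, k pages), and the function names and sample counts are hypothetical.

```python
from collections import Counter


def top_pages(reference_counts, n=40):
    """Pick the n most frequently hit pages in the reference window."""
    return [page for page, _ in Counter(reference_counts).most_common(n)]


def chi_square_statistic(reference_counts, current_counts, pages):
    """Pearson chi-square statistic for a 2 x len(pages) contingency table.

    Rows are the two time windows, columns are pages. A larger value means
    the relative page frequencies differ more between the windows.
    """
    ref = [reference_counts.get(p, 0) for p in pages]
    cur = [current_counts.get(p, 0) for p in pages]
    ref_total, cur_total = sum(ref), sum(cur)
    if ref_total == 0 or cur_total == 0:
        return 0.0  # nothing to compare
    grand_total = ref_total + cur_total
    statistic = 0.0
    for r, c in zip(ref, cur):
        column_total = r + c
        if column_total == 0:
            continue  # page absent from both windows
        for observed, row_total in ((r, ref_total), (c, cur_total)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up counts: a 6-hour reference window and two 10-minute windows.
reference = {"/a": 600, "/b": 400}
pages = top_pages(reference)
same_mix = chi_square_statistic(reference, {"/a": 6, "/b": 4}, pages)
shifted_mix = chi_square_statistic(reference, {"/a": 1, "/b": 9}, pages)
```

A window with the same page mix as the reference scores near zero, while the shifted window scores much higher. Turning the statistic into a p-value would need the chi-square distribution with k - 1 degrees of freedom, which is left out here to keep the sketch dependency-free.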

  10. [Figure] Two anomalies: number of hits to the top 10 pages, with anomaly score and anomaly threshold, over time (hours) (legend: 1, 10, and 100 hits / 5 minutes)

  11. GUI tool for real-time detection
     • why do we need a GUI tool?
       • build trust of the operators
         • why should the operator believe the algorithm?
         • “picture is worth a thousand words”
       • manual monitoring/inspection of traffic by operators
     • make SLT usable in real life
       • report 1 warning instead of anomalies every minute
       • compare:
           warning #3:
           detection time: Sun Nov 16 19:27:00 PST 2003
           start: Sun Nov 16 19:24:00 PST 2003
           end: Sun Nov 16 21:05:00 PST 2003
           significance = 7.05
           Most anomalous pages:
             /landing.jsp 19.55
             /landing_merchant.jsp 19.50
             /mall_ctrl.jsp 3.69
             /malltop.go 2.63
             /mall.go 2.18
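Reporting one warning instead of an anomaly every minute amounts to merging consecutive anomalous minutes into a single record with a start and an end. The sketch below shows one possible way to do that; the merging rule, the `max_gap` parameter, the threshold, and the sample scores are all assumptions, and it does not model the separate detection time shown in the sample warning.

```python
def merge_into_warnings(scores, threshold, max_gap=1):
    """Collapse per-minute anomaly scores into warnings.

    scores: list of (minute, anomaly_score) pairs sorted by minute.
    Minutes scoring at or above the threshold that lie within max_gap
    minutes of the previous anomalous minute join the same warning.
    """
    warnings = []
    for minute, score in scores:
        if score < threshold:
            continue
        if warnings and minute - warnings[-1]["end"] <= max_gap:
            warnings[-1]["end"] = minute
            warnings[-1]["peak"] = max(warnings[-1]["peak"], score)
        else:
            warnings.append({"start": minute, "end": minute, "peak": score})
    return warnings


# Made-up per-minute scores: one three-minute episode and one isolated spike.
scores = [(0, 0.1), (1, 0.9), (2, 0.95), (3, 0.8), (4, 0.2), (5, 0.1), (6, 0.97)]
warnings = merge_into_warnings(scores, threshold=0.5)
```

Here four anomalous minutes become two warnings, which is the kind of reduction an operator would see instead of four separate alerts.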

  12. Summary of successful results
     • October 2003 – broken signup page:
       • noticed the problem 7 days earlier + correctly localized!
     • November 2003 – account page problem:
       • 1st warning: 16 hours earlier
       • 2nd warning: 1 hour earlier + correctly localized the bad page!
     • July 2001 – landing looping problem:
       • warning 2 days earlier + correctly localized
     • detected a failure they didn’t tell us about
     • detected three other significant anomalies
       • feedback: “these might have been important, but we didn’t know about them. definitely useful if detected in real-time.”

  13. Oct 2003 – broken signup page (1)

  14. Oct 2003 – broken signup page (2)

  15. Oct 2003 – broken signup page (3)

  16. Oct 2003 – broken signup page (4)

  17. Nov 2003 – account page problem (1)

  18. Nov 2003 – account page problem (2) [figure, 9am – 1pm]

  19. How to evaluate?
     • information from HT Levine:
       • time + root cause of major failures (site down, ...)
       • time of minor problems (DB alarm, ...)
       • harmless updates (code push, page update)
     • scenario:
       • page A pushed at 3:30pm, Monday
       • anomaly on page A at 6pm, Monday
       • mostly ok for next 48 hours
       • site down at 6pm, Wednesday
       • would detecting the anomaly on Monday help?

  20. What is a true/false positive?
     • detected a major/minor problem: GREAT
     • detected a regular site update: ???
     • detected a significant anomaly, BUT
       • Ebates knows nothing about it
       • no major problems at that time
       • ???
     • detected anomalies almost every night
       • certainly a false positive

  21. Build a simulator?
     • site: PetStore, RUBiS?
     • failures: try failures from Ebates
     • user simulator: based on real logs from Ebates
     • cons:
       • less realistic (how to build a realistic simulator of users?)
     • pros:
       • know exactly what happened in the system (measure TTD)
       • try many different failures
       • use for evaluating TCQ-based preprocessing

  22. Localization
     • Naive Bayes better at localization
       • likely reason: more sensitive to infrequent pages
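One way to see why an independent per-page Gaussian model is sensitive to infrequent pages is to rank pages by how unlikely their current hit count is under the learned mean and variance. The sketch below is an assumption-laden illustration, not the authors' scoring: the talk does not say how Prob(current interval is normal) was computed, so the code ranks pages by Gaussian negative log-likelihood, and `min_variance` plus all sample numbers are made up.

```python
import math
from statistics import mean, pvariance


def fit_page_models(history, min_variance=1.0):
    """Learn a (mean, variance) pair per page from past interval counts.

    history: {page: [hit counts in past intervals]}.
    min_variance is an assumed floor that avoids division by zero.
    """
    return {
        page: (mean(counts), max(pvariance(counts), min_variance))
        for page, counts in history.items()
    }


def page_neg_log_likelihood(count, mu, var):
    """Negative log density of count under a Gaussian with mean mu, variance var."""
    return 0.5 * math.log(2 * math.pi * var) + (count - mu) ** 2 / (2 * var)


def rank_pages(models, current):
    """Sort pages from most to least anomalous in the current interval.

    Because pages are treated as independent, the interval's total negative
    log-likelihood is the sum of these per-page terms.
    """
    scores = {
        page: page_neg_log_likelihood(current.get(page, 0), mu, var)
        for page, (mu, var) in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up data: both pages lose 10 hits, but /signup normally gets only 10.
history = {"/home": [1000, 1020, 980, 1000], "/signup": [10, 12, 8, 10]}
models = fit_page_models(history)
ranked = rank_pages(models, {"/home": 990, "/signup": 0})
```

Both pages drop by the same absolute amount, yet /signup ranks first because its variance is small, so a rarely visited page that breaks stands out instead of being drowned out by the high-traffic pages.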

  23. Future work
     • develop better quantitative measures for analysis
     • GUI tool
       • deploy at Ebates
       • make available as open source
         • could help convince other companies to provide failure data
     • detect more subtle problems
       • harder to detect using current methods
     • explore HCI aspects of our approach

  24. Conclusions
     • very simple algorithms can detect serious failures
     • visualization helps understand the anomaly
     • have almost-perfect source of failure data
       • complete HTTP logs
       • operators willing to cooperate
       • emails, chat logs from periods of problems
     • still hard to evaluate
