
Keeping our websites running - troubleshooting with AppDynamics


Presentation Transcript


  1. Keeping our websites running - troubleshooting with AppDynamics • Benoit Villaumie, Lead Architect • Guillaume Postaire, Infrastructure Manager

  2. Introduction • Our Company: Karavel • Founded 2001 • #1 package travel website in France • 4 million unique visitors a month • Mainly B2C, but also B2B • 15 brands, 10 white labels • One M&A every year

  3. Our Application History • 2008 – Monolithic Years • Tomcat, MySQL • Expensive to maintain & scale • ‘Too Big To Fail’ • 2009 – Distributed SOA • Tomcat, Web Services & MySQL • Easier to maintain & scale • Became incredibly complex to manage • Design for failure

  4. Managing this Complexity • History of Architecture Issues • Slow SQL queries, timeouts & pool exhaustion • Slow 3rd-party web services • Open source framework bugs • Resource & memory leaks • Long and Painful Firefighting • Plenty of log files on multiple servers • Thread & heap dumps • A few JMX metrics, but never the one we needed • Lack of historical data

  5. Our AppDynamics Experience – Who? • Today 50+ people at Karavel use AppDynamics: • Product Owners • Developers • Architects • Ops

  6. Our AppDynamics Experience – Root Causes • Memory (leaks, over-consumption) • Performance Regression • Application Bugs • Architectural Changes • Infrastructure Changes

  7. Our AppDynamics Experience – Methodology • Quickly discard wrong hypotheses => wide-spectrum investigation • Investigate the interesting ones more deeply • Once under control, create alerts and dashboards • Communicate the methodology to the team

  8. Common Issues

  9. Common Issues : Response Time

  10. Common Issues : Response Time

  11. Common Issues : Response Time

  12. Common Issues : Response Time • Analyze functionality on cluster / node (figure: node mean response time vs. cluster mean response time)
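
The node-versus-cluster comparison on this slide can be sketched outside any monitoring tool. The following is a minimal illustration, not AppDynamics code: the node names, the sample values, and the `tolerance` factor are assumptions, and the cluster mean is taken as the plain average of the node means (a real cluster mean would normally be weighted by call count).

```python
from statistics import mean


def slow_nodes(node_means_ms, tolerance=1.5):
    """Return the nodes whose mean response time exceeds
    `tolerance` times the cluster mean (hypothetical rule)."""
    # Assumption: unweighted average of per-node means stands in for the cluster mean.
    cluster_mean = mean(node_means_ms.values())
    return sorted(
        node for node, ms in node_means_ms.items()
        if ms > tolerance * cluster_mean
    )


# Hypothetical per-node mean response times in milliseconds.
samples = {"node-1": 120.0, "node-2": 130.0, "node-3": 410.0}
print(slow_nodes(samples))  # prints ['node-3']
```

A node that stands out this way points to a node-local cause (host, JVM, deployment), while a shift of the whole cluster points to shared code or a shared backend.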

  13. Common Issues : Response Time

  14. Common Issues : Response Time • Analyze functionality by Business Transaction (BT) (figure: BT mean response time vs. mean response time of all BTs)

  15. Common Issues : Response Time

  16. Common Issues : Response Time • Related to a resource consumed by the application (databases, web services, …) • Related to a performance regression in the implementation • Use request snapshots & drill-down functionality

  17. Common Issues : Response Time

  18. Common Issues : Response Time • Analyze functionality on CPU • GC time spent per minute (ms) vs. CPU time spent per minute (ms) (figure: CPU ms/min, about ×100 but depends on your code, vs. GC CPU ms/min)
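
One way to read this slide is that CPU time per minute normally sits around 100 times GC time per minute, so GC is suspect when its share grows well beyond that. The sketch below encodes that reading only; the `baseline_ratio` of 0.01 comes from the "×100" note, while the `factor` of 5 and the function names are assumptions to tune against your own application.

```python
def gc_overhead(gc_ms_per_min, cpu_ms_per_min):
    """Share of CPU time spent in garbage collection (0.0 when no CPU was used)."""
    if cpu_ms_per_min <= 0:
        return 0.0
    return gc_ms_per_min / cpu_ms_per_min


def is_gc_heavy(gc_ms_per_min, cpu_ms_per_min, baseline_ratio=0.01, factor=5.0):
    """Flag a minute whose GC share exceeds `factor` times the usual baseline.

    baseline_ratio=0.01 mirrors the slide's "x100" note; factor is a
    hypothetical tolerance, since the slide warns it depends on your code.
    """
    return gc_overhead(gc_ms_per_min, cpu_ms_per_min) > baseline_ratio * factor


print(is_gc_heavy(300, 30_000))    # GC share 1%: within the assumed baseline
print(is_gc_heavy(6_000, 30_000))  # GC share 20%: flagged
```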

  19. Common Issues : Response Time

  20. Common Issues : Response Time • Related to garbage collection over-activity /!\ memory problem • Analyze functionality on GC time spent per minute (ms) (figure: memory used vs. GC time spent per minute, in ms)

  21. Common Issues : Response Time

  22. Common Issues : Response Time • Related to a resource leak (CPU, FD, …) • Related to a selfish process that dries up server resources (CPU, threads, FD) • Analyze functionality • Then find the class/method with a thread dump • Or use ps, vmstat, top (figure: number of threads)

  23. Common Issues : Errors

  24. Common Issues : Errors • /!\ Errors do not necessarily mean a broken user experience • Example: weather (meteo) is broken

  25. Common Issues : Errors • Identify the error kind and the business transactions • Troubleshoot > Error rates, then choose the error class that has a drop in number

  26. Common Issues : Errors • Identify the error kind and the business transactions • Troubleshoot > Error rates > details

  27. Common Issues : Memory

  28. Common Issues : Memory • Memory problem • Monitor > Application Infrastructure > Memory

  29. Common Issues : Memory • Memory leak: look at Tenured Gen behavior
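
What "look at Tenured Gen behavior" usually means in practice is checking whether the amount of memory still occupied after each full collection keeps rising. The sketch below is an illustration under that assumption, not a feature of any tool: it fits a least-squares slope to post-GC tenured usage samples, and the `min_growth_mb` threshold and sample values are hypothetical.

```python
def slope(values):
    """Least-squares slope of `values` against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def looks_like_leak(post_gc_tenured_mb, min_growth_mb=1.0):
    """True when tenured usage after full GCs grows by at least
    `min_growth_mb` per collection on average (hypothetical threshold)."""
    return slope(post_gc_tenured_mb) >= min_growth_mb


# Hypothetical tenured-generation usage (MB) measured right after each full GC.
print(looks_like_leak([100, 110, 120, 130]))  # steadily rising floor
print(looks_like_leak([200, 198, 201, 199]))  # stable floor
```

A rising floor only says that something is being retained; finding out what is retained is the job of the object instance tracking step on the next slide.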

  30. Common Issues : Memory • Then investigate Object Instance Tracking

  31. Common Issues : Memory • Memory over-consumption: look at Eden Space

  32. Common Issues : Memory • Then investigate Object Instance Tracking (again)

  33. Common Issues : Memory • But sometimes your VM simply needs more memory • Why? Ask the developers. They should know (?)

  34. Common Issues : Backend • C process • MySQL backend

  35. Common Issues : Backend

  36. Common Issues : Backend

  37. Common Issues : Backend • How to monitor a legacy C socket process? • Get minimal info and set alerts from the consumer process
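
Measuring a backend from the consumer side, as this slide suggests, amounts to timing every call the consumer makes and counting timeouts, which yields the mean and max response times shown on the following slides. The sketch below illustrates the idea in plain Python; `BackendStats` and `timed_call` are hypothetical names, and the commented-out socket call stands in for whatever the real consumer does.

```python
import socket
import time
from contextlib import contextmanager


class BackendStats:
    """Collects call durations and timeouts as seen by the consumer process."""

    def __init__(self):
        self.durations_ms = []
        self.timeouts = 0

    @contextmanager
    def timed_call(self):
        """Time the enclosed backend call; count it as a timeout if one is raised."""
        start = time.monotonic()
        try:
            yield
        except (socket.timeout, TimeoutError):
            self.timeouts += 1
            raise
        finally:
            # Recorded for successful and failed calls alike.
            self.durations_ms.append((time.monotonic() - start) * 1000.0)

    def mean_ms(self):
        return sum(self.durations_ms) / len(self.durations_ms) if self.durations_ms else 0.0

    def max_ms(self):
        return max(self.durations_ms, default=0.0)


stats = BackendStats()
with stats.timed_call():
    pass  # e.g. sock.sendall(request); sock.recv(4096) in the real consumer
print(len(stats.durations_ms), stats.timeouts)
```

Alert thresholds can then be set on the mean, the max, and the timeout count without touching the legacy process itself.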

  38. Common Issues : Backend • We have a problem (figure: mean response time)

  39. Common Issues : Backend • Timeouts are not normal behavior • Contact the vendor (figure: max response time and mean response time)

  40. Common Issues : Backend • Another version • New version • The vendor forces us to stop monitoring (figure: mean response time)

  41. Alerts & Dashboards

  42. Alerts & Dashboards : proactive detection • Reduce Mean Time to Detection • NOC Dashboard > Health status on critical Business Transactions (figure: NOC Dashboard)

  43. Alerts & Dashboards : proactive detection • Alerts (ops & devs): • on response time • on errors/min • on stalls (figure: Application Health Alerts Criteria)
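
The three alert criteria on this slide can be expressed as a simple per-minute check. This is only a sketch of the logic, not how AppDynamics health rules are configured: the `MinuteStats` and `Limits` types and every threshold value are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinuteStats:
    """One minute of metrics for a business transaction."""
    mean_response_ms: float
    errors: int
    stalls: int


@dataclass(frozen=True)
class Limits:
    """Hypothetical thresholds; real values depend on the application."""
    max_response_ms: float = 2000.0
    max_errors_per_min: int = 10
    max_stalls_per_min: int = 0


def violations(stats, limits=Limits()):
    """Return the names of the criteria that this minute breaches."""
    found = []
    if stats.mean_response_ms > limits.max_response_ms:
        found.append("response time")
    if stats.errors > limits.max_errors_per_min:
        found.append("errors/min")
    if stats.stalls > limits.max_stalls_per_min:
        found.append("stalls")
    return found


print(violations(MinuteStats(mean_response_ms=3500.0, errors=25, stalls=1)))
```

Returning the list of breached criteria, rather than a single boolean, lets the same check feed both the ops alert and the message sent to developers.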

  44. Alerts & Dashboards : simplify resolution • Reduce Mean Time to Resolution • Application Health Dashboard • cluster response time • node response time • node error rate • node call count (figure: Application Health Dashboard)

  45. Alerts & Dashboards : simplify resolution • Reduce Mean Time to Resolution • Infrastructure Health Dashboard • node memory usage • node CPU usage • node thread count (figure: Infrastructure Health Dashboard)

  46. Weekly Review • Alerting is fine, BUT some regressions may go undetected (figure: response time degradation over 4 weeks)
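
A slow regression escapes alerting because no single step is large enough to trip a threshold, while the cumulative drift is. The sketch below shows that distinction on weekly mean response times; the 10% per-week and 20% cumulative thresholds and the sample values are assumptions chosen for illustration.

```python
def cumulative_drift(weekly_means_ms):
    """Relative change between the first and last weekly mean."""
    first, last = weekly_means_ms[0], weekly_means_ms[-1]
    return (last - first) / first


def is_slow_regression(weekly_means_ms, max_weekly_step=0.10, max_drift=0.20):
    """True when every week-over-week increase stays under `max_weekly_step`
    (so a per-week alert would stay silent) but the total drift exceeds `max_drift`."""
    steps = [(b - a) / a for a, b in zip(weekly_means_ms, weekly_means_ms[1:])]
    quiet = all(step <= max_weekly_step for step in steps)
    return quiet and cumulative_drift(weekly_means_ms) > max_drift


# Hypothetical weekly mean response times (ms) over four weeks.
print(is_slow_regression([400.0, 430.0, 465.0, 500.0]))  # creeping degradation
print(is_slow_regression([400.0, 405.0, 398.0, 402.0]))  # stable
```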

  47. Weekly Review • Our dashboard safety belt • Weekly Performance Review • Weekly Error Review (coming soon) (figure: Weekly Performance Dashboard)

  48. Capacity planning • How to ease: • software tuning • hardware renewal • event planning

  49. Capacity planning

  50. Capacity planning
