Benoit Villaumie, Lead Architect
Guillaume Postaire, Infrastructure Manager
Keeping our websites running: troubleshooting with AppDynamics
Introduction
• Our company: Karavel
  • Founded 2001
  • #1 package travel website in France
  • 4 million unique visitors a month
  • Mainly B2C, but also B2B
  • 15 brands, 10 white labels
  • One M&A every year
Our Application History
• 2008 – Monolithic years
  • Tomcat, MySQL
  • Expensive to maintain and scale
  • "Too big to fail"
• 2009 – Distributed SOA
  • Tomcat, web services and MySQL
  • Easier to maintain and scale
  • Became incredibly complex to manage
  • Design for failure
Managing this Complexity
• History of architecture issues
  • Slow SQL queries, timeouts and pool exhaustion
  • Slow third-party web services
  • Open source framework bugs
  • Resource and memory leaks
• Long and painful firefighting
  • Plenty of log files on multiple servers
  • Thread and heap dumps
  • A few JMX metrics, but never the one we needed
  • Lack of historical data
Our AppDynamics Experience – Who?
• Today 50+ people at Karavel use AppDynamics:
  • Product owners
  • Developers
  • Architects
  • Ops
Our AppDynamics Experience – Root Causes
• Memory
  • Leaks
  • Overconsumption
• Performance regression
• Application bugs
• Architectural changes
• Infrastructure changes
Our AppDynamics Experience – Methodology
• Quickly discard wrong hypotheses => wide-spectrum investigation
• Investigate the interesting ones in more depth
• Once under control, create alerts and dashboards
• Communicate the methodology to the team
Common Issues: Response Time
• Analyze functionality on cluster / node response time
[Chart: node mean response time vs. cluster mean response time]
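
The node-versus-cluster comparison on this slide can be approximated outside the tool. What follows is a minimal sketch, assuming per-node mean response times have been exported into a plain dictionary. The node names, the 1.5 ratio threshold and the function name `slow_nodes` are hypothetical and are not part of AppDynamics.

```python
from statistics import mean

def slow_nodes(node_means_ms, ratio_threshold=1.5):
    """Return nodes whose mean response time exceeds the cluster mean
    by more than ratio_threshold (a hypothetical default of 1.5x).

    The cluster mean here is the unweighted mean of the node means,
    which ignores differences in call volume between nodes.
    """
    if not node_means_ms:
        return []
    cluster_mean = mean(node_means_ms.values())
    if cluster_mean == 0:
        return []
    return sorted(
        node for node, value in node_means_ms.items()
        if value / cluster_mean > ratio_threshold
    )

# Hypothetical sample: three healthy nodes and one slow node.
sample = {"node-1": 120.0, "node-2": 130.0, "node-3": 125.0, "node-4": 400.0}
print(slow_nodes(sample))
```

A node that stands out against the cluster points at something local (host, JVM, deployment), while a cluster-wide rise points at shared code or a shared backend.
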
Common Issues: Response Time
• Analyze functionality by Business Transaction (BT)
[Chart: BT mean response time vs. mean response time of all BTs]
Common Issues: Response Time
• Related to a resource consumed by the application (databases, web services, …)
• Related to a performance regression in the implementation
• Request snapshot and drill-down functionality
Common Issues: Response Time
• Analyze functionality on CPU: GC time spent per minute (ms) vs. CPU time spent per minute (ms)
[Chart annotations: CPU ms/min, ×100 (but depends on your code), GC CPU ms/min]
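
One possible reading of the "x100" note on this slide is that CPU time per minute is normally about 100 times the GC time per minute, so GC takes roughly 1% of CPU time. That reading is an assumption, and the slide itself warns that the ratio depends on your code. Under that assumption, a minimal sketch of the comparison could look like this (the function names and the 5% threshold are hypothetical):

```python
def gc_cpu_share(gc_ms_per_min, cpu_ms_per_min):
    """Fraction of CPU time spent in garbage collection for one minute."""
    if cpu_ms_per_min <= 0:
        return 0.0
    return gc_ms_per_min / cpu_ms_per_min

def gc_overactive(gc_ms_per_min, cpu_ms_per_min, max_share=0.05):
    """True when GC uses more than max_share of the CPU time.

    max_share=0.05 is a hypothetical threshold; tune it for your code.
    """
    return gc_cpu_share(gc_ms_per_min, cpu_ms_per_min) > max_share

# Hypothetical minute where GC takes 1/100 of the CPU time.
print(gc_cpu_share(300, 30000))
# Hypothetical minute where GC takes 1/5 of the CPU time.
print(gc_overactive(6000, 30000))
```
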
Common Issues: Response Time
• Related to garbage collection overactivity /!\ memory problem
• Analyze functionality on GC time spent per minute (ms)
[Chart: memory used vs. GC time spent per minute (ms)]
Common Issues: Response Time
• Related to a resource leak (CPU, FD, …)
• Related to a selfish process that drains server resources (CPU, threads, FD)
• Analyze functionality
• Then find the class/method with a thread dump
• Or use ps, vmstat, top
[Chart: number of threads]
Common Issues: Errors
/!\ Errors do not necessarily mean a broken user experience
[Screenshot caption: weather (meteo) is broken]
Common Issues: Errors
• Identify the kind of error and the business transactions affected
• Troubleshoot > Error rates, then choose the error class that shows a drop in count
Common Issues: Errors
• Identify the kind of error and the business transactions affected
• Troubleshoot > Error rates > details
Common Issues: Memory
• Memory problem: Monitor > Application Infrastructure > Memory
Common Issues: Memory
• Memory leak: look at Tenured Gen behavior
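
A leak usually shows up as tenured-generation usage that keeps climbing even when measured right after each major collection. The sketch below illustrates that heuristic on exported samples. It is not how AppDynamics detects leaks; the function names, the sample values and the 1 MB-per-sample threshold are hypothetical.

```python
def slope(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def looks_like_leak(post_gc_used_mb, min_slope_mb=1.0):
    """Heuristic: tenured usage sampled after each major GC keeps rising.

    min_slope_mb is a hypothetical threshold in MB per sample.
    """
    return slope(post_gc_used_mb) > min_slope_mb

# Hypothetical post-GC tenured usage in MB.
leaking = [200, 215, 230, 246, 260, 275]
stable = [200, 205, 198, 202, 199, 201]
print(looks_like_leak(leaking), looks_like_leak(stable))
```
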
Common Issues: Memory
• Then investigate with Object Instance Tracking
Common Issues: Memory
• Memory overconsumption: look at Eden Space
Common Issues: Memory
• Then investigate with Object Instance Tracking (again)
Common Issues: Memory
• But sometimes your VM simply needs more memory
• Why? Ask the developers. They should know (?)
Common Issues: Backend
• C process
• MySQL backend
Common Issues: Backend
How to monitor a legacy C socket process?
• Get minimal information and set alerts from the consumer process
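
Measuring from the consumer side means timing each call to the backend and counting timeouts, since the legacy process itself cannot be instrumented. The sketch below shows the idea with a generic wrapper. The names `timed_call` and `summarize` are hypothetical, and the `call` argument stands in for whatever client code talks to the socket.

```python
import socket
import time

def timed_call(call):
    """Run call() and return (elapsed_ms, outcome).

    outcome is "ok", "timeout" (socket.timeout raised) or "error"
    (any other OSError, e.g. connection refused).
    """
    start = time.perf_counter()
    try:
        call()
        outcome = "ok"
    except socket.timeout:
        outcome = "timeout"
    except OSError:
        outcome = "error"
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, outcome

def summarize(samples):
    """Mean and max response time plus timeout count for a list of samples."""
    if not samples:
        return {"mean_ms": 0.0, "max_ms": 0.0, "timeouts": 0}
    times = [ms for ms, _ in samples]
    return {
        "mean_ms": sum(times) / len(times),
        "max_ms": max(times),
        "timeouts": sum(1 for _, outcome in samples if outcome == "timeout"),
    }
```

The mean, the max and the timeout count are the minimal signals to alert on from the consumer.
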
Common Issues: Backend
We have a problem
[Chart: mean response time]
Common Issues: Backend
Timeouts are not normal behavior: contact the vendor
[Chart: max response time and mean response time]
Common Issues: Backend
The vendor forces us to stop monitoring
[Chart: mean response time, annotated "Another version" and "New version"]
Alerts & Dashboards: Proactive Detection
• Reduce mean time to detection
• NOC Dashboard > Health status on critical Business Transactions
[Screenshot: NOC Dashboard]
Alerts & Dashboards: Proactive Detection
• Alerts (ops & devs):
  • on response time
  • on errors/min
  • on stalls
[Screenshot: Application Health Alerts Criteria]
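
The three alert criteria on this slide amount to simple threshold checks on per-minute data. The sketch below is an illustration only, not the AppDynamics health rule format. The `MinuteStats` type, the `evaluate` function and every threshold value are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MinuteStats:
    mean_response_ms: float  # mean response time over the minute
    errors: int              # errors counted in the minute
    stalls: int              # stalled requests counted in the minute

def evaluate(stats, max_response_ms=2000.0, max_errors=10, max_stalls=0):
    """Return the list of violated criteria for one minute of data.

    All thresholds are hypothetical defaults and should be set per
    business transaction.
    """
    violations = []
    if stats.mean_response_ms > max_response_ms:
        violations.append("response_time")
    if stats.errors > max_errors:
        violations.append("errors_per_min")
    if stats.stalls > max_stalls:
        violations.append("stalls")
    return violations

print(evaluate(MinuteStats(mean_response_ms=2500.0, errors=3, stalls=1)))
```
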
Alerts & Dashboards: Simplify Resolution
• Reduce mean time to resolution
• Application Health Dashboard:
  • cluster response time
  • node response time
  • node error rate
  • node call count
[Screenshot: Application Health Dashboard]
Alerts & Dashboards: Simplify Resolution
• Reduce mean time to resolution
• Infrastructure Health Dashboard:
  • node memory usage
  • node CPU usage
  • node thread count
[Screenshot: Infrastructure Health Dashboard]
Weekly Review
Alerting is fine, BUT some regressions may not be detected
[Chart: response time degradation over 4 weeks]
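
A slow degradation can stay under every per-minute alert threshold and still add up over a month. The sketch below compares weekly mean response times to catch that kind of drift. The function names, the sample values and the 20% limit are hypothetical.

```python
def weekly_drift(weekly_means_ms):
    """Relative change between the first and last weekly mean response time."""
    if len(weekly_means_ms) < 2 or weekly_means_ms[0] <= 0:
        return 0.0
    first, last = weekly_means_ms[0], weekly_means_ms[-1]
    return (last - first) / first

def slow_regression(weekly_means_ms, max_drift=0.20):
    """True when response time grew by more than max_drift over the period.

    max_drift=0.20 (20%) is a hypothetical limit.
    """
    return weekly_drift(weekly_means_ms) > max_drift

# Hypothetical four weeks: each weekly step is small, the total is not.
weeks = [400.0, 430.0, 470.0, 520.0]
print(weekly_drift(weeks), slow_regression(weeks))
```
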
Weekly Review
Our dashboard safety belt:
• Weekly Performance Review
• Weekly Error Review (coming soon)
[Screenshot: Weekly Performance Dashboard]
Capacity Planning
How to ease:
• software tuning
• hardware renewal
• event planning