Nagios in the Agile/DevOps World at IMVU, Inc: Insights & Best Practices

Learn how IMVU is achieving Continuous Deployment success with Nagios, embracing DevOps culture and sharing strategies for monitoring and alerting. Discover key practices, tools, and the importance of automated monitoring to drive business decisions and ensure a smooth operational environment. Dive into Nagios testing strategies and the benefits of decoupling for increased scalability and efficiency. Explore the significance of embracing change, running blameless postmortems, and the human element of monitoring and alerting. Join Nagios World Conference for valuable insights and best practices.

Presentation Transcript


  1. Nagios in the Agile / DevOps / Continuous Deployment World • Kishore Jalleda, Director of Operations, IMVU, Inc • kjalleda@imvu.com

  2. About IMVU

  3. About IMVU • Avatar based Social Entertainment destination • $50+ Million Annual Revenue • 100+ Million Registered Users • 10+ Million Items in Virtual Catalog

  4. IMVU Engineering and Continuous Deployment • Doing the Impossible 50 times a day • Continuous deployment (CD) is real • IMVU has been one of the pioneers of CD • DevOps culture is big • No approval needed to ship to 1% of customers • Check out our engineering blog: http://engineering.imvu.com/

  5. What does this mean? • Things change quickly • New features add up instantly • Things can break frequently • Failures can cascade rapidly • Things can fall through the cracks • Many things change at the same time • Etc.

  6. Nagios World Conference Insights into Nagios @IMVU

  7. Overview • Nagios Core 3.2.0 • 800+ Hosts • 18000+ Service Checks • Single Nagios Instance • 8 cores, 8GB RAM

  8. Server Lifecycle Management

  9. [ Operations ] Continuous Integration and Deployment

  10. IMVU Asset Database (AssetDB) • Built internally by IMVU • Simple but powerful concept • Source of truth for everything asset related • Has information on: Class (mysql, standard-http-server, redis); Role (customer shard, clientdynweb); Tag (available, no-update); Attributes (cpu-cores, memory-size, mysql-role); Much more …

  11. Auto generation of Nagios configuration files • # generate_nagios_conf.pl (most configurations are auto generated from AssetDB)
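
The talk does not show the internals of generate_nagios_conf.pl, so the following is only a minimal Python sketch of the idea: rendering Nagios host and hostgroup object definitions from asset records. The record fields (`name`, `address`, `asset_class`), the sample data, and the `generic-host` template name are assumptions made for illustration, not AssetDB's real schema.

```python
from collections import defaultdict

# Hypothetical AssetDB export; the real schema is not shown in the talk.
ASSETS = [
    {"name": "db01", "address": "10.0.0.11", "asset_class": "mysql"},
    {"name": "db02", "address": "10.0.0.12", "asset_class": "mysql"},
    {"name": "web01", "address": "10.0.0.21", "asset_class": "standard-http-server"},
]


def render_hosts(assets):
    """Render one 'define host' block per asset, sorted by host name."""
    blocks = []
    for asset in sorted(assets, key=lambda a: a["name"]):
        blocks.append(
            "define host {\n"
            "    use        generic-host\n"  # assumed template name
            f"    host_name  {asset['name']}\n"
            f"    address    {asset['address']}\n"
            "}\n"
        )
    return "\n".join(blocks)


def render_hostgroups(assets):
    """Render one 'define hostgroup' block per asset class."""
    groups = defaultdict(list)
    for asset in assets:
        groups[asset["asset_class"]].append(asset["name"])
    blocks = []
    for group in sorted(groups):
        members = ",".join(sorted(groups[group]))
        blocks.append(
            "define hostgroup {\n"
            f"    hostgroup_name  {group}\n"
            f"    members         {members}\n"
            "}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(render_hosts(ASSETS))
    print(render_hostgroups(ASSETS))
```

Sorting hosts and members keeps the generated files stable between runs, so a commit of hosts.cfg and hostgroups.cfg only shows real changes.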

  12. Ops Buildbot ( builds, builders/buildslaves ) # svn commit hosts.cfg hostgroups.cfg

  13. Opspush (Operations Push System) • # opspush --comment “xxxxxx” --role nagios • [Flowchart: opspush → check status of “last build” → green: run “cfagent -v” on the box; red: --oncall-override? → yes: push proceeds; no: exit. The option --use-last-green-rev also appears in the diagram.]
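
One possible reading of the gating logic on this slide, sketched in Python. The function name and the string return values are hypothetical. Only the green/red build status and the on-call override come from the slide, and the `--use-last-green-rev` option is left out because the transcript does not make its place in the flow clear.

```python
def push_decision(last_build_status, oncall_override=False):
    """Decide whether a requested push may proceed.

    A green last build allows the push. A red build allows it only
    when the on-call engineer explicitly overrides. Anything else exits.
    """
    if last_build_status == "green":
        return "push"
    if last_build_status == "red" and oncall_override:
        return "push"
    return "exit"
```

For example, `push_decision("red")` returns `"exit"`, while `push_decision("red", oncall_override=True)` returns `"push"`.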

  14. Product Development

  15. Tech Designs & New Nagios Alert Requests

  16. Nagios Alert Request Template

  17. Big Data / De-Sharding • Data freshness is critical to help make the right business decisions • Nagios used for ETL/DW status and error checking • Nagios and Ops embeds can help empower your Data Infrastructure team

  18. Things will FAIL

  19. How we try to prevent and catch failures

  20. Cluster Immune System • Automated push monitoring and rollback! • [Flowchart: Push to X% of servers → Monitor Critical Metrics → Good: Push to rest → Monitor Critical Metrics → Good: “w00t!, my change is Live”. Bad at either monitoring step: Auto Rollback.]
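
The staged push described on this slide can be sketched roughly as follows. This is a simplified illustration, not IMVU's implementation: `deploy`, `rollback`, and `metrics_ok` are hypothetical callables supplied by the caller, and `canary_fraction` stands in for the "X%" on the slide.

```python
def staged_push(servers, deploy, rollback, metrics_ok, canary_fraction=0.1):
    """Deploy to a fraction of servers, check metrics, then deploy to the rest.

    If the critical metrics look bad after either stage, every server
    deployed so far is rolled back in reverse order.
    """
    canary_count = max(1, int(len(servers) * canary_fraction))
    canary, rest = servers[:canary_count], servers[canary_count:]
    deployed = []
    for stage in (canary, rest):
        for server in stage:
            deploy(server)
            deployed.append(server)
        if not metrics_ok():
            for server in reversed(deployed):
                rollback(server)
            return "rolled back"
    return "live"
```

Passing the metric check in as a callable keeps the push logic independent of where the metrics come from, whether that is Nagios, a graphing system, or business-level counters.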

  21. Don’t just rely on Standard Metrics

  22. Demystifying P1s (Priority 1) • P1: Priority 1 issue impacting live operations • Phases: Identification (Nagios); Communication and Declaration; Resolution; Postmortem / 5 Whys / Root Cause Analysis; P1 follow up

  23. 5 Why / Postmortem (PM) / Root Cause Analysis • 5 Why process • Amazing culture of running blameless postmortems • New Nagios checks are the most common action items • A lot of monitoring and alerting on business and application level metrics was originally the outcome of PMs

  24. Example “5 Whys” Process

  25. Monitor Business & Application Level Metrics

  26. Monitor Response Times • Load Average is a meaningless number

  27. Continuous Monitoring (Istatd) • Developed by IMVU • Sub 10 sec resolution of data • API to get average, SD, min, max, and sample count for each data point in a graph • Ability to stack multiple graphs on the fly • Long retention times • Releasing as open source this week !!! https://github.com/imvu-open/istatd/wiki
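
To illustrate what "average, SD, min, max, sample count for each data point" means, here is a small Python sketch that groups raw samples into 10-second buckets and summarizes each one. It is not istatd's code or API. The function name, the input format, and the use of the population standard deviation are assumptions.

```python
import math
from collections import defaultdict


def bucket_stats(samples, bucket_seconds=10):
    """Summarize (timestamp, value) samples per fixed-width time bucket.

    Returns a dict keyed by bucket start time. Each entry holds the
    average, population standard deviation, min, max, and sample count.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[start].append(value)

    stats = {}
    for start in sorted(buckets):
        values = buckets[start]
        count = len(values)
        avg = sum(values) / count
        variance = sum((v - avg) ** 2 for v in values) / count
        stats[start] = {
            "avg": avg,
            "sd": math.sqrt(variance),
            "min": min(values),
            "max": max(values),
            "count": count,
        }
    return stats
```

Keeping the count and spread alongside the average makes it visible when a data point in a graph is backed by a single sample or by a very noisy set of samples.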

  28. Istatd: 10 Second Resolution of Data

  29. Istatd: Stacking graphs on the fly

  30. Nagios World Conference Have a “Strategy” for Monitoring and Alerting

  31. Our (Nagios) Strategy • Human element of Monitoring and Alerting (Nagios) • Nagios & Test Driven Development (TDD) • Decouple (Nagios) • Aggregated Checks

  32. Human Element of Monitoring and Alerting • Have zero tolerance towards False Positives. You do not want your ops staff to walk into the office the next morning looking like zombies ;) • Do not let people develop immunity to pages, as very soon real issues will be ignored • “All pages are actionable” policy: if there is no action, it should not be paging • Automatic re-enabling of alerting/notifications that were improperly silenced • Ownership and accountability of issues/alerts
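
The slide does not say what counts as "improperly silenced", so the Python sketch below assumes two illustrative rules: a silence with no named owner, or one that has lasted longer than a maximum duration. It only builds the lines that would be written to the Nagios external command file, using the standard `ENABLE_SVC_NOTIFICATIONS` external command. The record fields and the four-hour limit are hypothetical.

```python
def reenable_commands(silenced, now, max_silence_seconds=4 * 3600):
    """Build Nagios external commands that re-enable service notifications.

    A silence is treated as improper (an assumption for this sketch)
    when it has no owner or has outlived max_silence_seconds.
    """
    commands = []
    for entry in silenced:
        expired = now - entry["silenced_at"] > max_silence_seconds
        unowned = not entry.get("owner")
        if expired or unowned:
            commands.append(
                f"[{int(now)}] ENABLE_SVC_NOTIFICATIONS;"
                f"{entry['host']};{entry['service']}"
            )
    return commands
```

Running something like this on a schedule means a forgotten silence cannot hide a real problem indefinitely.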

  33. Daily Triage of Nagios Alerts and Interrupts

  34. Nagios & Test Driven Development (TDD) • Write tests for your Nagios Infrastructure • Adopted heavily by Ops (important to keep pace with engineering; DevOps culture is awesome) • High degree of confidence in pushing changes • Things will eventually change (OS, libraries, logic, people, Nagios version, etc.). Tests will make the change much smoother • Functional testing can still be a challenge
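
As a rough illustration of writing tests for Nagios checks, the Python sketch below tests the threshold logic of a small response-time check with `unittest`. The check itself, its thresholds, and its message format are hypothetical. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
import unittest

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_response_time(seconds, warn=0.5, crit=2.0):
    """Map a measured response time to a (status code, message) pair."""
    if seconds is None or seconds < 0:
        return UNKNOWN, "UNKNOWN - no valid measurement"
    if seconds >= crit:
        return CRITICAL, f"CRITICAL - response time {seconds:.2f}s"
    if seconds >= warn:
        return WARNING, f"WARNING - response time {seconds:.2f}s"
    return OK, f"OK - response time {seconds:.2f}s"


class CheckResponseTimeTest(unittest.TestCase):
    def test_fast_response_is_ok(self):
        self.assertEqual(check_response_time(0.1)[0], OK)

    def test_threshold_values_are_inclusive(self):
        self.assertEqual(check_response_time(0.5)[0], WARNING)
        self.assertEqual(check_response_time(2.0)[0], CRITICAL)

    def test_missing_measurement_is_unknown(self):
        self.assertEqual(check_response_time(None)[0], UNKNOWN)
```

Saved to a file, the tests run with `python -m unittest`. Keeping the threshold logic in a plain function, separate from the code that takes the measurement, is what makes it testable without a live service.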

  35. Sample Nagios Test Output

  36. Decouple Nagios • We do it using the “Fact, Worker, Reporter & Aggregator” Model • [Diagram: Worker, Redis, Reporter, and Aggregator, connected by arrows labeled “fact” and “fact status”.]

  37. Why Decouple? • For scalability and efficiency • Our model was higher performing compared to NRPE • Lets you make changes (like thresholds) in one place instead of on something like 1000 machines (if using NRPE) • Lets you do aggregated checks, which is again a very simple but powerful concept to reduce paging levels by a ton
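
The slides name the Fact, Worker, Reporter, and Aggregator roles but do not spell out their responsibilities, so the Python sketch below is one interpretation: workers publish raw facts to a shared store, a reporter applies centrally held thresholds, and an aggregator turns many per-host statuses into one cluster-level status. A plain dict stands in for Redis, and all function names and thresholds are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def worker_publish(store, host, fact_name, value):
    """Worker: record a raw fact (no thresholds live on the host)."""
    store[(host, fact_name)] = value


def reporter_status(store, host, fact_name, warn, crit):
    """Reporter: apply central thresholds to a stored fact."""
    value = store.get((host, fact_name))
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def aggregate_status(statuses, max_critical_fraction=0.25):
    """Aggregator: go CRITICAL only when too many hosts are critical."""
    if not statuses:
        return UNKNOWN
    critical = sum(1 for status in statuses if status == CRITICAL)
    if critical / len(statuses) > max_critical_fraction:
        return CRITICAL
    if critical > 0:
        return WARNING
    return OK
```

Because thresholds live only in the reporter, changing one means editing a single place, and because paging can hang off the aggregated status, one bad host in a large pool need not wake anyone up.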

  38. Nagios World Conference Closing Remarks

  39. Closing Remarks • Monitoring and Alerting (M&A) is mission critical for any business; invest properly and smartly in it • Don’t limit the usage of Nagios to just Ops. The secret to widespread adoption is to make things frictionless • Bathroom breaks can take 5-10 minutes, so don’t fret too much about Nagios performance • Build some form of predictive monitoring and alerting to catch and alert on changes in trends • Invest in configuration automation, validation and compliance • Finally, Nagios has been like a Honda, very reliable !!!
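
The talk does not describe how trend-based alerting was built, so the following Python sketch shows only one simple approach: flag the latest value when it deviates from the recent history by more than a set number of standard deviations. The function name and the three-sigma default are assumptions.

```python
import statistics


def trend_alert(history, latest, max_sigma=3.0):
    """Return True when 'latest' deviates sharply from recent history.

    Uses the sample standard deviation of the history. With fewer than
    two historical points there is no baseline, so nothing is flagged.
    """
    if len(history) < 2:
        return False
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    if sd == 0:
        return latest != mean
    return abs(latest - mean) / sd > max_sigma
```

With a history of `[100, 102, 98, 101, 99]`, a new value of 101 is not flagged, while 120 is.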

  40. Nagios World Conference Questions ?

  41. Thank You !!! kjalleda@imvu.com We are Hiring: imvu.com/jobs Engineering Blog: http://engineering.imvu.com/
