A Deep Dive into Nagios Analytics. Alexis Lê-Quôc (@alq) http://datadoghq.com. @alq Dev & Ops Nagios user since 2008 Datadog co-founder. A little survey. Top 3 failed checks. That I responded to last week. That woke me up. That most of my team responded to at least once.
A Deep Dive into Nagios Analytics • Alexis Lê-Quôc (@alq) • http://datadoghq.com
@alq Dev & Ops Nagios user since 2008 Datadog co-founder
Top 3 failed checks • That I responded to last week • That woke me up • That most of my team responded to at least once • That impacts our business the most? • That I responded to 5 weeks ago
Using memory to prioritize remediation... • At best, finding local optima • At worst, Brownian motion
Performance Metrics Nagios Traffic Other Sources In the “Cloud”
Almost 13,000 Nagios “events” over the past week
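A count like the one above can be reproduced from a raw Nagios log. The following is a minimal sketch, not the method used in the talk: it assumes the usual `nagios.log` line layout of `[epoch] EVENT TYPE: field;field;...`, and the names `parse_line`, `count_by_type` and the `SAMPLE` lines are hypothetical, made up for illustration.

```python
import re
from collections import Counter

# Assumed layout of a nagios.log line: "[epoch] EVENT TYPE: field;field;..."
LINE_RE = re.compile(r"^\[(\d+)\] ([A-Z ]+): (.*)$")


def parse_line(line):
    """Return (timestamp, event_type, fields), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp, event_type, rest = match.groups()
    return int(timestamp), event_type, rest.split(";")


def count_by_type(lines):
    """Count parsed events per event type, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            counts[parsed[1]] += 1
    return counts


# Hypothetical sample lines, invented for this sketch.
SAMPLE = [
    "[1354213030] SERVICE ALERT: web01;HTTP;CRITICAL;HARD;3;Connection refused",
    "[1354213031] SERVICE NOTIFICATION: oncall;web01;HTTP;CRITICAL;notify-by-email;Connection refused",
    "[1354213090] HOST ALERT: db01;DOWN;SOFT;1;PING CRITICAL",
    "[1354213100] SERVICE ALERT: web01;HTTP;OK;HARD;1;HTTP OK",
    "Not a log line",
]

if __name__ == "__main__":
    for event_type, n in count_by_type(SAMPLE).most_common():
        print(event_type, n)
```

Restricting the input to one week of log lines and summing the counter gives the weekly total.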
Not a scientific study • A dialog with data
Population • 25%: 20 • 50%: 93 • 75%: 322 • 100%: 904
Weekly count per host split by quartile • Outliers • Sick hosts, silenced checks
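Splitting hosts by quartile of their weekly event count can be sketched as follows. This is an illustration under stated assumptions, not the analysis from the talk (which lists Awk, Postgres, R and d3 as its tools): the function name `split_by_quartile` and the host names are hypothetical, and the cut points come from Python's `statistics.quantiles` with the "inclusive" method, which may differ slightly from other quantile definitions.

```python
import bisect
import statistics
from collections import Counter


def split_by_quartile(counts):
    """Map each host to a quartile (1..4) based on its weekly event count.

    `counts` maps host name -> number of events in the week.
    Returns (host -> quartile, list of the three cut points).
    """
    # Three cut points that separate the four quartiles.
    cuts = statistics.quantiles(counts.values(), n=4, method="inclusive")
    # A count equal to a cut point is placed in the lower quartile.
    groups = {host: bisect.bisect_left(cuts, n) + 1 for host, n in counts.items()}
    return groups, cuts


if __name__ == "__main__":
    # Hypothetical per-host weekly counts, invented for this sketch.
    weekly = Counter({"web01": 10, "web02": 20, "db01": 30, "db02": 40, "cache01": 50})
    groups, cuts = split_by_quartile(weekly)
    print("cut points:", cuts)
    for host, quartile in sorted(groups.items(), key=lambda item: item[1]):
        print(host, weekly[host], "-> quartile", quartile)
```

Hosts far above the top cut point are the candidates for the outlier categories named on the slide.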
Notifications • 1-3% of alerts notify • Little difference per quartile
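The share of alerts that end up notifying someone can be estimated from event-type counts. The sketch below assumes that "alerts" means `SERVICE ALERT` and `HOST ALERT` log entries and that "notify" means `SERVICE NOTIFICATION` and `HOST NOTIFICATION` entries. The slides do not spell out the exact definition, so this mapping and the function name `notify_ratio` are assumptions.

```python
from collections import Counter

# Assumed mapping of Nagios log event types to the two categories.
ALERTS = {"SERVICE ALERT", "HOST ALERT"}
NOTIFICATIONS = {"SERVICE NOTIFICATION", "HOST NOTIFICATION"}


def notify_ratio(event_types):
    """Return notifications divided by alerts, or 0.0 when there are no alerts."""
    counts = Counter(event_types)
    alerts = sum(counts[t] for t in ALERTS)
    notifications = sum(counts[t] for t in NOTIFICATIONS)
    return notifications / alerts if alerts else 0.0


if __name__ == "__main__":
    # Hypothetical week: 100 alerts, 2 of which produced a notification.
    week = ["SERVICE ALERT"] * 98 + ["HOST ALERT"] * 2 + ["SERVICE NOTIFICATION"] * 2
    print(f"{notify_ratio(week):.1%} of alerts notified")
```

Running the same function on the events of each host quartile separately would show whether the ratio differs between quartiles.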
Mean about the same across quartiles • Time-based deviation?
Young • Old • Seldom happens • Happens often
Happen once in a while • Occur often, for a long time • Tolerated
Awk • Postgres • R • d3 • Find out tomorrow!
Take-aways • Don’t rely on your memory • Your Nagios logs are a treasure trove • Have a dialog with your data • Presentation matters