Introduction to Time Series Database for Network Engineer. CHINOG, May 23rd 2019. Damien Garros, Network Reliability Engineer / Roblox. Twitter @damgarros, GitHub @dgarros.
Introduction to Time Series Database for Network Engineer
CHINOG, May 23rd 2019
Damien Garros, Network Reliability Engineer / Roblox
Twitter @damgarros, GitHub @dgarros
What is Roblox?
• Educational platform for young software developers
• Gaming and social platform
• Core player audience is kids aged 9-12
• 2 million active developers
• 90+ million monthly active users
Agenda
• Why do we need to learn that?
• Where are we today?
• Introduction to Time Series Database
• Monitoring Stack @ Roblox
• Q & A
1 Why do we need to learn that?
DATA IS THE NEW OIL
FIND it
EXTRACT it
REFINE it
MONETIZE it
As network engineers, we are sitting on a lot of data, but we don’t have the right tools.
EXTRACT & REFINE
• How much data are you currently extracting from your network?
• How fast can you extract new data?
“MONETIZE” network data
• Reduce time to root cause for everyone in the organization (no, it’s not the network)
• Improve reliability
• Enable new applications, systems, and use cases
• Increase the value of the network to the organization
2 Where are we today?
Legacy Network Monitoring Solution
[Diagram labels: Devices, Logs, SNMP (Pull), NMS, Server (All in One)]
RRD Tools
• Introduced in 1999
• Storage
• Aggregation
• Visualization
No query engine. Data retention is poor.
RRD Tools - Down Sampling
[Figure panels: 1 Day, 1 Week, 1 Month]
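As a rough illustration of why down sampling loses detail, the sketch below averages raw samples into fixed-width time buckets, which is similar in spirit to RRDtool's AVERAGE consolidation. The function name `downsample` and the sample data are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the bucket it falls into.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical 5-minute samples collapsed into 1-hour averages.
raw = [(0, 10.0), (300, 20.0), (600, 30.0), (3600, 40.0), (3900, 60.0)]
hourly = downsample(raw, 3600)
```

Once the raw points are replaced by their hourly average, short spikes inside the hour can no longer be recovered, which is the retention problem the slide refers to.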
Telemetry has been a hot topic in the network industry
[Slide keywords: Telemetry Streaming, Kill SNMP, OpenConfig, gNMI]
... Network Monitoring Solution
[Diagram labels: Devices, Transport, Telemetry Streaming, ???]
PULL: SNMP
PUSH: Streaming
What are others doing outside of the network industry?
Push and pull.
Each component can scale out independently.
Storage and visualization are decoupled: store once, visualize as required.
[Diagram labels: Agent, Metrics, Logs, Store, Visualize, Alert]
Datastore specialized by data format
Metrics / Time Series
• Numeric value evolving over time
• Constant interval
• Examples: counters, CPU, number of peers
Logs / Events
• Mostly text data
• Unpredictable interval
Structured Data
• Routing/forwarding table
• Configuration
Open source projects: Monitoring / Alerting
[Diagram columns: Collector / Agent, Time Series Database, Visualization, Alerting (Kapacitor, Elastalert)]
Telegraf - The Swiss Army Knife
• Plugin-driven, extensible agent
• Out-of-the-box support
• Over 80 input plugins
• Most databases (output)
• Data manipulation
• SNMP input plugin
• Juniper / OpenConfig
Cloud Based Solutions
[Diagram labels: Agent, Metrics, Store, Visualize, Alert]
Reuse the same components for network devices
[Diagram: several collection paths feed the same Store, Visualize, and Alert components]
• Linux-based NOS: Agent
• Legacy NOS: SNMP via Agent
• Legacy NOS: Netconf / eAPI via Custom Collector
• Streaming-telemetry-enabled NOS: Collector
3 Time Series Database
Modern Time Series Database
• New generation of databases optimized for time series data
• Started around 2013, mainstream since 2016
• Powerful query engine
• Decouples storage and visualization
Introduction to Modern TSDB
interface_output_bytes{device="spine1",interface="et-0/0/4"} 4569765412
• Measurement name: what is it?
• Tags/Labels: contextual information
• Value
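To make the three parts of a sample concrete, here is a minimal sketch that splits the line shown on the slide into measurement name, tags, and value. The `Sample` type and `parse_sample` helper are hypothetical names for illustration, and the regular expressions only cover this simple form (no escaped quotes, no timestamp).

```python
import re
from typing import NamedTuple


class Sample(NamedTuple):
    measurement: str  # what is being measured
    tags: dict        # contextual information
    value: float      # numeric value at collection time


LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<tags>.*)\}\s+(?P<value>\S+)$')
TAG_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str) -> Sample:
    """Split 'name{k="v",...} value' into its three parts."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sample: {line!r}")
    tags = dict(TAG_RE.findall(match.group("tags")))
    return Sample(match.group("name"), tags, float(match.group("value")))


sample = parse_sample(
    'interface_output_bytes{device="spine1",interface="et-0/0/4"} 4569765412'
)
```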
Introduction to Modern TSDB interface_output_bytes{device="br1-fra1"}
Introduction to Modern TSDB deriv(interface_output_bytes{device="br1-fra1"}[5m])*8
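In PromQL, `deriv()` estimates a per-second rate of change with a simple linear regression over the samples in the window, and multiplying by 8 converts bytes to bits. The sketch below reproduces that arithmetic on made-up data; the function name `deriv` mirrors the query for readability, and the sample window is hypothetical.

```python
def deriv(samples):
    """Least-squares slope (value units per second) of (timestamp, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must not all share one timestamp")
    return covariance / variance


# Hypothetical byte counter sampled every 60 seconds over a 5-minute window.
window = [
    (0, 1_000_000),
    (60, 1_750_000),
    (120, 2_500_000),
    (180, 3_250_000),
    (240, 4_000_000),
]
bits_per_second = deriv(window) * 8
```

Here the counter grows by 750,000 bytes per minute, so the slope is 12,500 bytes per second, or 100,000 bits per second.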
Introduction to Modern TSDB
sum by(device)(deriv(interface_output_bytes{device=~"br.*"}[5m]))
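The aggregation step can be pictured as grouping series by one tag and adding up their values. This is a minimal sketch with a hypothetical `sum_by` helper and made-up per-interface rates, not the database's actual implementation.

```python
from collections import defaultdict


def sum_by(series, label):
    """Sum values of (tags, value) series that share the same value for `label`."""
    totals = defaultdict(float)
    for tags, value in series:
        # Series without the label are grouped under an empty string.
        totals[tags.get(label, "")] += value
    return dict(totals)


# Hypothetical per-interface rates, as produced by the inner deriv() step.
rates = [
    ({"device": "br1-fra1", "interface": "et-0/0/1"}, 1200.0),
    ({"device": "br1-fra1", "interface": "et-0/0/2"}, 800.0),
    ({"device": "br2-fra1", "interface": "et-0/0/1"}, 500.0),
]
per_device = sum_by(rates, "device")
```

Because the grouping key is just a tag, the same helper answers "per device", "per site", or "per provider" without changing how the data is stored.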
Introduction to Modern TSDB deriv(interface_output_bytes{device="br1-fra1"} [5m]) / interface_speed{device="br1-fra1"}
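Dividing one metric by another only makes sense for series that describe the same object, so the two sides are paired up by their tags. The sketch below assumes both metrics carry identical tag sets and that `interface_speed` is expressed in the same unit as the rate (bytes per second here); the `divide` helper and the data are hypothetical.

```python
def divide(left, right):
    """Divide (tags, value) series pairwise where the full tag sets are identical."""
    right_index = {frozenset(tags.items()): value for tags, value in right}
    result = []
    for tags, value in left:
        key = frozenset(tags.items())
        # Simplification: unmatched series and zero denominators are skipped.
        if key in right_index and right_index[key] != 0:
            result.append((tags, value / right_index[key]))
    return result


byte_rates = [
    ({"device": "br1-fra1", "interface": "et-0/0/1"}, 2_500_000.0),
    ({"device": "br1-fra1", "interface": "et-0/0/2"}, 500_000.0),
]
speeds = [
    ({"device": "br1-fra1", "interface": "et-0/0/1"}, 10_000_000.0),
]
utilization = divide(byte_rates, speeds)
```

Only `et-0/0/1` has a matching speed series, so it is the only interface in the result, with a utilization of 0.25.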
Introduction to Modern TSDB
interface_output_bytes{device="spine1",interface="et-0/0/4"} 4569765412
interface_output_bytes{device="spine1",interface="et-0/0/4",role="leaf",site="fra1",provider="level3",intf_role="uplink"}
Introduction to Modern TSDB
sum by(provider)(deriv(interface_output_bytes{device="br1-fra1"}[5m]))
4 Monitoring @ Roblox
Network Monitoring / Alerting @ Roblox
• Created a Collector based on Netconf
• Created an Alert Manager
[Diagram labels: Collect (Netconf), Visualize, Alert]
Collector - py-metric-collector
• Dynamic Inventory
• Dynamic Tagging
• Sharding
• Enum for State
• Support Junos and F5
• Support Multiple database
https://github.com/dgarros/py-metric-collector
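One plausible reading of "Enum for State" is that textual states are mapped to numbers so they can be stored and graphed like any other metric. The sketch below illustrates that idea only; it is an assumption and not taken from py-metric-collector. The `BgpState` enum, its numeric values, and `state_to_value` are hypothetical.

```python
from enum import IntEnum


class BgpState(IntEnum):
    """BGP session states with illustrative numeric values (hypothetical mapping)."""
    IDLE = 1
    CONNECT = 2
    ACTIVE = 3
    OPENSENT = 4
    OPENCONFIRM = 5
    ESTABLISHED = 6


def state_to_value(state: str) -> int:
    """Convert a state string reported by a device into a storable number."""
    return int(BgpState[state.strip().upper()])


value = state_to_value("Established")
```

With a numeric value per state, an alert can be written as a simple threshold on the metric instead of a string comparison.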
Network Monitoring / Alerting Stack
[Diagram labels: Collect (Netconf), Visualize, Alert, Source of Truth]
Provided by the Source of Truth:
• Get list of devices
• Get contextual information (role, site, etc.)
• Topology information
• Device status
• IP to interface mapping
REFINE - Add more contextual data at runtime
All
• Device Role
• Site
• Service Group
• Junos Version
Interfaces
• peer_role
• Interface_role
• Provider
• circuit_id
• geo_type
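Adding context at runtime can be sketched as merging tags looked up in an inventory into each metric before it is written. The `enrich` helper and the in-memory `inventory` dictionary below are hypothetical stand-ins for a lookup against the source of truth; the tag values reuse the example shown earlier in the deck.

```python
def enrich(tags, inventory):
    """Merge device-level context from the inventory into a metric's tags."""
    context = inventory.get(tags.get("device"), {})
    # Tags already present on the metric take precedence over inventory values.
    return {**context, **tags}


# Hypothetical inventory keyed by device name.
inventory = {"spine1": {"role": "leaf", "site": "fra1"}}

enriched = enrich({"device": "spine1", "interface": "et-0/0/4"}, inventory)
unknown = enrich({"device": "not-in-inventory"}, inventory)
```

A device missing from the inventory keeps its original tags unchanged, so collection does not fail when the source of truth is incomplete.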
Alert Manager
• Modular system to ingest alerts from any source
• Advanced suppression rules
• Integration with external data sources
• Group alerts (interface, bgp) based on topology information
https://github.com/mayuresh82/alert_manager