Performance Management with Free and Bundled Tools Adrian Cockcroft Netflix Inc. acockcroft@netflix.com (Co-authored with Mario Jauvin MFJ Associates mario@mfjassociates.net) 9 June, 2014
Agenda • Overview of Capacity Planning Requirements and Data Sources • Performance Data Collection • Free Network Monitoring Tools • Free System Monitoring Tools • Free Load Generation and Modelling Tools • Licences and References Adrian Cockcroft and Mario Jauvin
FREE!! What are we talking about? • QA load generation with Grinder or SLAMD, modelling with PDQ and R • Network monitoring with WireShark, MRTG, BigSister, Cacti, Nagios, OpenNMS, Zenoss, Openxtra, ntop • Application tier monitoring with Orca, Cacti, BigSister, Ganglia, XEtoolkit • Database tier monitoring with SEtoolkit, Orca, XEtoolkit
Capacity Planning Requirements and Data Sources
Definitions • Capacity • Resource utilization and headroom • Planning • Predicting future needs by analyzing historical data and modeling future scenarios • Performance Monitoring • Collecting and reporting on performance data • Free Tools • Bundled with the OS or available for no $$$
Capacity Planning Requirements • We care about CPU, Memory, Network and Disk resources, and Application response times • We need to know how much of each resource we are using now, and will use in the future • We need to know how much headroom we have to handle higher loads • We want to understand how headroom varies, and how it relates to application response times and throughput
CPU Capacity Measurements • CPU capacity is defined by CPU type and clock rate, or a benchmark rating like SPECint_rate2000 • CPU utilization is defined as busy time divided by elapsed time for each CPU • CPU load average measures the average number of jobs running and ready to run
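The utilization definition on this slide can be written out as a small calculation. The sketch below is illustrative only: the function names, the sample busy times, and the choice to define headroom as the unused fraction averaged over all CPUs are assumptions, not something the slides prescribe.

```python
def cpu_utilization(busy_seconds, elapsed_seconds):
    """Utilization of one CPU: busy time divided by elapsed time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return busy_seconds / elapsed_seconds


def headroom(utilizations):
    """Unused fraction of total CPU capacity (assumed definition: 1 - mean utilization)."""
    return 1.0 - sum(utilizations) / len(utilizations)


# Hypothetical busy seconds for four CPUs over one 60 second interval.
per_cpu_busy = [45.0, 15.0, 30.0, 30.0]
interval = 60.0

utils = [cpu_utilization(b, interval) for b in per_cpu_busy]
print(utils)            # per-CPU utilization as a fraction of the interval
print(headroom(utils))  # fraction of total CPU capacity still unused
```

Load average is a separate measure; on Unix-like systems Python exposes it through `os.getloadavg()`, which returns the 1, 5 and 15 minute averages.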
Memory Capacity Measurements • Physical Memory Capacity Utilization and Limits • Kernel memory • Shared Memory segment • Executable code, stack and heap • File system cache usage • Unused free memory • Virtual Memory Capacity - Swap Space • Memory Throughput • Page in and page out rates
Network Capacity Measurements • Network Interface Throughput • Byte and packet rates input and output • TCP Protocol Specific Throughput • TCP connection count and connection rates • TCP byte rates input and output • NFS/SMB Protocol Specific Throughput • Byte rates read and write • NFS/SMB service response times • HTTP Protocol Specific Throughput • HTTP operation rates • Get and put payload byte rates and size distribution
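Byte and packet rates are usually derived from cumulative counters sampled at two points in time. The following is a minimal sketch of that arithmetic, assuming a 32-bit counter that wraps to zero (as SNMP Counter32 values do) and at most one wrap between samples; the function name and sample values are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # assumed counter width: 32 bits


def counter_rate(previous, current, interval_seconds, modulus=COUNTER32_MODULUS):
    """Per-second rate from two readings of a wrapping cumulative counter.

    Assumes the counter wrapped at most once between the two readings.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = (current - previous) % modulus  # modulo arithmetic absorbs one wrap
    return delta / interval_seconds


# Normal case: 60,000 bytes in 60 seconds.
print(counter_rate(1_000, 61_000, 60))

# Wrapped case: the counter passed 2**32 between samples.
print(counter_rate(2 ** 32 - 500, 500, 10))
```

With a fast interface and a long polling interval a 32-bit counter can wrap more than once, in which case this calculation under-reports the rate.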
Disk Capacity Measurements • Detailed metrics vary by platform • Easy for the simple disk cases • Hard for cached RAID subsystems • Almost impossible for shared disk subsystems and SANs • Another system or volume can share a backend spindle; when it gets busy, your own volume can saturate even though your own workload did not change
Capacity Planning Challenges • Constantly changing infrastructure • Limited attention span from staff • Horizontally scaled commodity systems • Per node software licencing costs too much • Too many tools, too many agents per node • Too much data, not enough analysis • Non-linear and non-intuitive scalability • Lack of tools and metrics for virtualized resources
Observability • Four different viewpoints • Management • Engineering • QA Testing • Operations • Each needs very different information • Ideal would be different views of the same performance database • Reality is a mess of disjoint tools
Management Viewpoint • Daily summary of status and problems • Business oriented metrics • Future scenario planning • Marketing and management input • Concise report with dashboard style status indicators • Free tools: R, Spreadsheet and Web based displays, no good summarization tools
Engineering Viewpoint • Large volumes of detailed data at several different time scales • Input to tuning, reconfiguring and future product development • Low level problem diagnosis • Detailed reports with drill down and correlation analysis • Free tools: XE/SE Toolkit, Orca, Ganglia, Cacti, R
QA Test Viewpoint • Workload specification tools • Load generation frameworks • Testing for functionality and performance • Regression tools to compare releases • Modelling difference between test configuration and production configuration • Free Tools: The Grinder, SLAMD, R, PDQ
Operations Viewpoint • Immediate timeframe • Real time display, updated in seconds • Alert based monitoring • High level problem diagnosis • Simple high level graphs and views • Free tools: BigSister, Nagios, OpenNMS, MRTG, Cacti, Ganglia, WireShark, ntop
Measurement Data Interfaces • Several generic raw access methods • Read the kernel directly (not a good idea) • Structured system data (Solaris kstat, Linux /proc) • Process data • Network data • Accounting data • Application data • Command based data interfaces • Scrape data from vmstat, iostat, netstat, sar, ps • Higher overhead, lower resolution, missing metrics • Data available is platform specific either way • Much more detail on this topic in the Solaris/Linux Performance Measurement and Tuning Class
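As an illustration of the structured system data route, the sketch below parses the aggregate `cpu` line of Linux `/proc/stat` and compares two snapshots. It works on sample strings so it runs anywhere; the sample values are invented, and treating `idle` plus `iowait` as non-busy time is an assumption made for this example rather than a rule from the slides.

```python
# Field order of the aggregate "cpu" line in Linux /proc/stat (clock ticks).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text):
    """Return the aggregate CPU tick counters from /proc/stat style text."""
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Older kernels report fewer columns; pad the missing ones with zero.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def busy_fraction(before, after):
    """Fraction of ticks between two snapshots that were not idle or iowait."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    not_busy = deltas["idle"] + deltas["iowait"]
    return (total - not_busy) / total if total else 0.0


# Invented snapshots standing in for two reads of /proc/stat.
SAMPLE_1 = "cpu  1000 0 500 8000 500 0 0 0 0 0\ncpu0 1000 0 500 8000 500 0 0 0 0 0\n"
SAMPLE_2 = "cpu  1300 0 600 8500 600 0 0 0 0 0\ncpu0 1300 0 600 8500 600 0 0 0 0 0\n"

first = parse_proc_stat(SAMPLE_1)
second = parse_proc_stat(SAMPLE_2)
print(busy_fraction(first, second))
```

On a Linux host the same function could be fed the real file contents, for example `parse_proc_stat(open("/proc/stat").read())`, sampled twice a few seconds apart.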
Free Network Monitoring Tools
SNMP • Simple Network Management Protocol • UDP-based protocol; agents listen on port 161 • Client/server like • Client is called a management application entity • Server is called an agent entity • Agent entity is designed to be implemented on network hardware: routers, switches, etc.
SNMP – MIBs • Management Information Base • Defines the structure and semantics of the information that can be reported on • Most commonly used is MIB-II, which defines a set of standard networking attributes • Interface tables • System level information • Routing tables • Specified using ASN.1 (Abstract Syntax Notation One)
SNMP – commands • Called PDUs (protocol data units) • GET • GETNEXT • GETBULK • SET • Encoded using BER (Basic Encoding Rules)
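To give a feel for what BER encoding looks like on the wire, here is a minimal sketch that encodes two of the ASN.1 types carried inside SNMP PDUs: a non-negative INTEGER and an OBJECT IDENTIFIER. It is not a full SNMP encoder; the function names are made up for this example, and only short-form lengths (under 128 bytes) and non-negative integers are handled.

```python
def _encode_length(n):
    """BER short-form length; long-form lengths are left out of this sketch."""
    if n >= 128:
        raise ValueError("long-form lengths are not handled in this sketch")
    return bytes([n])


def _base128(number):
    """Base-128 encoding with the high bit set on every byte except the last."""
    out = [number & 0x7F]
    number >>= 7
    while number:
        out.append((number & 0x7F) | 0x80)
        number >>= 7
    return bytes(reversed(out))


def encode_integer(value):
    """BER INTEGER (tag 0x02) for a non-negative value."""
    if value < 0:
        raise ValueError("negative integers are not handled in this sketch")
    # The extra leading byte keeps the sign bit clear when the top bit is set.
    body = value.to_bytes(value.bit_length() // 8 + 1, "big")
    return b"\x02" + _encode_length(len(body)) + body


def encode_oid(dotted):
    """BER OBJECT IDENTIFIER (tag 0x06) from dotted notation."""
    arcs = [int(a) for a in dotted.split(".")]
    if len(arcs) < 2 or arcs[0] > 2 or (arcs[0] < 2 and arcs[1] > 39):
        raise ValueError("invalid OID: " + dotted)
    body = _base128(40 * arcs[0] + arcs[1])  # first two arcs share one value
    for arc in arcs[2:]:
        body += _base128(arc)
    return b"\x06" + _encode_length(len(body)) + body


# sysDescr.0 from MIB-II, and the agent port number as an INTEGER.
print(encode_oid("1.3.6.1.2.1.1.1.0").hex())
print(encode_integer(161).hex())
```

Each encoded value is a tag byte, a length, then the content bytes; a real PDU nests many such values inside SEQUENCE wrappers.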
Versions • Version 1, the original version, specified in May 1990 • Version 2, around 1993. Failed because the IETF credo of “rough consensus and running code” could not be met on securing SNMP • Turned into v2c, which uses community string security (like v1) • Version 3, added security and complexity in 1998
SNMP tools • Too numerous to name them all, but… • OpenNMS • Nagios • Cacti • MRTG • Net-SNMP • See www.snmplink.org
SNMP tools • snmpwalk – reports all data under a specified MIB subtree • GetIf – reports data about interfaces and includes a built-in MIB browser • snmptable – reports tabular data from MIB tables
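A walk is built from repeated GETNEXT requests: each reply names the next OID in lexicographic order, and the walk stops when the reply leaves the requested subtree. The sketch below simulates that loop against an in-memory table rather than a real agent. `ToyAgent`, its contents, and the `walk` helper are all hypothetical and do not use any SNMP library.

```python
import bisect


def parse_oid(dotted):
    """Turn '1.3.6.1' into (1, 3, 6, 1); tuples compare lexicographically."""
    return tuple(int(part) for part in dotted.split("."))


class ToyAgent:
    """In-memory stand-in for an SNMP agent (not a real network agent)."""

    def __init__(self, values):
        self._values = {parse_oid(k): v for k, v in values.items()}
        self._oids = sorted(self._values)

    def get_next(self, oid):
        """Return (oid, value) for the first OID after `oid`, or None at the end."""
        index = bisect.bisect_right(self._oids, oid)
        if index == len(self._oids):
            return None
        found = self._oids[index]
        return found, self._values[found]


def walk(agent, root):
    """Collect every (oid, value) under `root` using repeated GETNEXT calls."""
    root_oid = parse_oid(root)
    current = root_oid
    results = []
    while True:
        reply = agent.get_next(current)
        if reply is None:
            break
        oid, value = reply
        if oid[:len(root_oid)] != root_oid:  # left the requested subtree
            break
        results.append((".".join(map(str, oid)), value))
        current = oid
    return results


agent = ToyAgent({
    "1.3.6.1.2.1.1.1.0": "toy router",   # sysDescr.0
    "1.3.6.1.2.1.1.5.0": "rtr1",         # sysName.0
    "1.3.6.1.2.1.2.2.1.2.1": "eth0",     # ifDescr.1
    "1.3.6.1.2.1.2.2.1.2.2": "eth1",     # ifDescr.2
    "1.3.6.1.2.1.2.2.1.10.1": 12345,     # ifInOctets.1
})

for oid, value in walk(agent, "1.3.6.1.2.1.2"):
    print(oid, value)
```

Note that the interface table comes back column by column (all ifDescr rows, then ifInOctets), which is why table-oriented tools regroup the walk output into rows.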
OpenNMS • Well… it’s not that portable • 95% Java is not 100% Java • Requires about 20-30 different platform-specific packages (PostgreSQL, Perl, RRDtool, Tomcat 4, etc.) • Difficult to install • Easy auto discovery • Web-based interface
OpenNMS • Main screen shot
OpenNMS • Node screen shot
Nagios • Easy to build/compile (on Solaris 10) • Easy to install • Quick response from CGI • Configuration is manual and a pain • 13 configuration files with all kinds of interrelated entries • Tedious and error prone • Requires plugins to do anything
Nagios • Main screen shot
Nagios • Host detail screen shot
ntop • Similar to the familiar UNIX top tool for processes, but for network traffic • Provides a huge selection of real-time data • Can be found at http://www.openxtra.co.uk/
ntop – Active Sessions
ntop – Hosts
ntop – Network Load
ntop – Network Throughput
ntop – Port Distribution
ntop – Protocol Distribution
ntop – Protocols
Zenoss • Open source monitoring and management of IT infrastructure • Zenoss Core is free • Other editions are for a fee • Get it from http://www.zenoss.com/download/
Zenoss – Architecture
Zenoss – Dash Config
Zenoss – Google
Zenoss – Google Alerts
Zenoss – Graphs
Zenoss – Topology
MRTG • Really simple to install and configure • Requires manual config file creation • Only plots MIB-II interface data out of the box • Graphing is not flexible (axes, time ranges, etc.)
MRTG • Interface screen shot
MRTG • Other CPU screen shot
RRDtool • Software to store, retrieve and graph numerical time series data • Uses a round robin algorithm • Data files are a fixed size • Don’t grow • Don’t require maintenance
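The fixed-size, round robin idea can be sketched as a ring buffer in which each new sample overwrites the oldest one. This is an illustration of the storage principle only, not RRDtool's file format or API; the class name and slot count are invented, and the consolidation of samples into coarser averages that RRDtool also performs is left out.

```python
class RoundRobinArchive:
    """Fixed-size store where new samples overwrite the oldest ones."""

    def __init__(self, slots):
        self._slots = [None] * slots  # allocated once; never grows
        self._next = 0                # index the next sample is written to
        self._count = 0               # number of slots holding real data

    def update(self, value):
        """Store one sample, overwriting the oldest when the archive is full."""
        self._slots[self._next] = value
        self._next = (self._next + 1) % len(self._slots)
        self._count = min(self._count + 1, len(self._slots))

    def values(self):
        """Stored samples ordered from oldest to newest."""
        size = len(self._slots)
        start = (self._next - self._count) % size
        return [self._slots[(start + i) % size] for i in range(self._count)]


archive = RoundRobinArchive(slots=3)
for sample in [1, 2, 3, 4, 5]:
    archive.update(sample)

print(archive.values())  # only the three most recent samples remain
```

Because storage is allocated up front and reused, the archive needs no pruning or cleanup jobs, which is the "don't grow, don't require maintenance" property listed on the slide.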