Performance and Capacity with Analytics

Dan Kimball, Cloud Infrastructure Architect, VMware

Agenda
• Introduction
• What is Analytics?
• Real-world examples
• 3rd generation monitoring with analytics
• Success stories
• Bringing it all together
• Closing remarks and Q&A

What is Analytics?
Analytics is the application of computer technology, operational research, and statistics to solve problems in business and industry. A simple definition of analytics is "the science of analysis". A practical definition, however, would be that analytics is the process of developing optimal or realistic decision recommendations based on insights derived through the application of statistical models and analysis against existing and/or simulated future data.
Source: Wikipedia, http://en.wikipedia.org/wiki/Analytics

Real-world examples of Analytics
• Clinical decision support systems
  • Experts use predictive analysis in health care primarily to determine which patients are at risk of developing certain conditions, such as diabetes, asthma, heart disease, and other lifetime illnesses.
• Customer retention
  • With so many competing services available, businesses need to focus on maintaining continuous consumer satisfaction, rewarding consumer loyalty, and minimizing customer attrition.
• Fraud detection
  • Fraud is a big problem for many businesses and takes various forms: inaccurate credit applications, fraudulent transactions (both offline and online), identity theft, and false insurance claims.
• Risk management
  • Risk management techniques aim to predict a future scenario and benefit from it. The capital asset pricing model (CAPM) "predicts" the best portfolio to maximize return.
• Underwriting
  • Many businesses have to account for risk exposure across their different services and determine the cost needed to cover that risk. For example, auto insurance providers need to accurately determine the premium to charge to cover each automobile and driver.

“1st Generation” Tools: Up/down… Floods of alerts
1st Generation - Event-Centric, Hard-Threshold Based
[Slide graphic: alert console listing, repeated on the slide; unique rows shown once below, followed by four inputs labeled DATA FEEDS]
3/4/08 16:45 | Host 1 | processingTimeServ | The Processing Time Service Level on process… | n/a | n/a | n/a
3/4/08 16:45 | Host 1 | Processor_Table 0 | Processor 0 is at 87.0%. A CPU Bottleneck is….. | n/a | 0 | Windows_System
3/4/08 16:44 | Host 2 | System_Table | The number of hardware interrupts per second… | n/a | 0 | Windows_System
3/4/08 16:30 | Host 2 | Processor_Table 1 | Processor 1 is at 84.0%. A CPU Bottleneck is …. | n/a | 0 | Windows_System
3/4/08 16:25 | n/a | responseTimeServ… | The Response Time Service Level on Toadwor.. | n/a | n/a | n/a
3/4/08 16:20 | n/a | processingTimeServ.. | The Processing Time Service Level on Prospec.. | n/a | n/a | n/a
3/4/08 16:08 | Host 1 | Ora_Sql_Hogs_Alert | Oracle: SFPRD A CPU Hog has been detected | n/a | OraSF | Oracle
3/4/08 16:08 | Host 1 | Ora_Sql_Hogs_Alert | Oracle: SFPRD SQL with high I/O has been de.. | n/a | OraSF | Oracle
3/4/08 14:40 | n/a | responseTimeServ… | The Response Time Service Level on Siebel Sa.. | n/a | n/a | n/a
3/4/08 14:20 | n/a | processingTimeServ.. | The Processing Time Service Level on Siebel S. | n/a | n/a | n/a
3/4/08 14:39 | Host 3 | Top_CPU_Table | Process ‘siebsh.exe(svc-siebel, 6780)’: is cons.. | n/a | 0 | Windows_System
3/4/08 14:39 | Host 3 | Top_CPU_Table | Process ‘siebsh.exe(svc-siebel, 7940)’: is cons.. | n/a | 0 | Windows_System
3/4/08 14:15 | n/a | responseTimeServ… | The Response Time Service Level on Toadwor.. | n/a | n/a | n/a
3/4/08 14:15 | n/a | processingTimeServ.. | The Processing Time Service Level on Prospec.. | n/a | n/a | n/a
3/4/08 13:55 | Host 1 | Ora_Sql_Hogs_Alert | Oracle: SFPRD A CPU Hog has been detected | n/a | OraSF | Oracle

“2nd Generation” Tools: don’t handle change, leading to false positives
2nd Generation - Rudimentary Baselining, Rules/Templates, Charting
[Slide graphic: the same alert console listing shown on the 1st-generation slide]

3rd generation monitoring with analytics: It’s here!
Dan Kimball, Cloud Infrastructure Architect, COE, VMware

Real-Time Performance Management
3rd Generation: Holistic, Real-Time Analytics
• Flexible INTEGRATION to many data sources
• Enterprise SCALABILITY
• Patented performance ANALYTICS
• Powerful information DASHBOARDS
I can put all my monitoring tools to good use and get better performance analytics.

Smart Alert™ - Using Analytics to understand abnormalities across the application
Monitored object: Business Application
Inputs:
• App Data (e.g., Hyperic, SCOM)
• User Experience (e.g., HP RUM, etc.)
• vCenter (Private/Public Cloud)
• Network Data (e.g., Ionix IPAM/PM, etc.)
• Storage (EMC, NetApp, IBM)
Output: Smart Alert Generation (“When”), shown as "! SMART ALERT"
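The slide does not say how a Smart Alert is actually generated. As a rough illustration of the idea of looking at abnormalities across tiers rather than one metric at a time, here is a minimal sketch. The function name `smart_alerts`, the tuple layout, and the "at least two tiers" rule are all assumptions made for this example, not the product's method.

```python
from collections import defaultdict


def smart_alerts(anomalies, min_tiers=2):
    """Return the intervals in which several tiers look abnormal at once.

    anomalies: iterable of (interval, tier, metric) tuples, one per metric
    that was already flagged as abnormal (hypothetical input format).
    min_tiers: assumed rule; how many distinct tiers must be abnormal in
    the same interval before a single combined alert is raised.
    """
    tiers_by_interval = defaultdict(set)
    for interval, tier, _metric in anomalies:
        tiers_by_interval[interval].add(tier)
    return sorted(
        interval
        for interval, tiers in tiers_by_interval.items()
        if len(tiers) >= min_tiers
    )


# Illustrative data only: three tiers are abnormal at 16:05, one at 16:10.
anomalies = [
    ("16:05", "storage", "latency_ms"),
    ("16:05", "app", "db_connections"),
    ("16:05", "user_experience", "response_time"),
    ("16:10", "network", "packet_loss"),
]
alerts = smart_alerts(anomalies, min_tiers=2)
print(alerts)
```

Grouping by interval first means three related symptoms produce one alert instead of three, which is the contrast with the event floods of the earlier generations.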

Future State: Evolution of Learning and Predictive Analysis
Monitoring layers:
• Server O/S Metrics: CPU, RAM, Disk, I/O, etc.
• App Layer Metrics: JVM, DB Connections, etc.
• UserEx Metrics
• Business Metrics
Analogy (body systems and vitals: Muscular, Skeletal, Cardiovascular, Nervous, Respiration, Heart Rate, Temperature):
• My brain is understanding the health of my body. Should I do anything?
• Your Brain Understands Context:
  • If my heart rate and temperature are increasing I should go to the hospital
  • If I’m tired, rest more
  • If I tire easily, start exercising!
In the enterprise:
• vCenter Operations is understanding the health of my enterprise by analyzing millions of measurements. Should I do anything?
• vCenter Operations Understands Context:
  • Act based on urgency of emerging problems
  • Act based on real-time performance dashboards
  • Act based on long term correlations and trends

Data Agnostic Approach to Data Collection
• Accepts any time series data (examples)
  • Server OS
  • Server App layer (e.g., IIS, Oracle, WebSphere, etc.)
  • Network
  • Storage
  • User Experience
  • Transactional
  • Business Data
  • Change Events
• Minimal Required Fields (4)
  • Object Name, Metric Name, Value, Timestamp
• Data Extraction - *not* an analytic question
• No rules/templates to write and maintain
• No thresholds or KPIs to figure out
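The four required fields can be pictured as a single record type that every source is reduced to. The sketch below is only an illustration: the `Sample` class, the comma-separated input format, the metric names, and the choice to treat timestamps without a zone as UTC are assumptions for this example, not a documented vCenter Operations interface.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Sample:
    """One time-series data point with the four required fields."""
    object_name: str
    metric_name: str
    value: float
    timestamp: datetime


def parse_sample(line: str) -> Sample:
    """Parse 'object,metric,value,ISO-8601 timestamp' (hypothetical format)."""
    object_name, metric_name, value, timestamp = (
        part.strip() for part in line.split(",")
    )
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Assumption for this sketch: naive timestamps are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return Sample(object_name, metric_name, float(value), parsed)


# Illustrative rows from three different "silos", all in the same shape.
samples = [
    parse_sample("host1,cpu.usage_pct,87.0,2008-03-04T16:45:00"),
    parse_sample("orders-db,db.connections,42,2008-03-04T16:45:00"),
    parse_sample("checkout,business.orders_per_min,310,2008-03-04T16:45:00"),
]
for sample in samples:
    print(sample)
```

Because server, application, and business data all arrive in the same four-field shape, the analysis step needs no per-source rules or templates.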

Learn Normal Behavior and Identify Abnormalities
• Doesn’t assume IT data has a normal bell-shaped distribution
• Sophisticated Analytics: 9 different algorithms working together
• Learns your dynamic ranges of “Normal” without templates
• Learns patterns of behavior and identifies abnormalities
Chart legend:
• GRAY BAR: Upper and Lower band of Dynamic Threshold (“Normal”)
• BLUE LINE: Metric’s Current Value
• RED BAR: Breached Dynamic Threshold (“Abnormal”)
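To make the idea of a dynamic threshold concrete, here is a minimal sketch that learns a "normal" band from a metric's own history using percentiles, so no bell-shaped distribution is assumed. This is a simple stand-in technique chosen for illustration. It is not one of the nine algorithms mentioned above, whose details the slides do not give. The function names and the 5th/95th percentile defaults are assumptions.

```python
from statistics import quantiles


def dynamic_band(history, low_pct=5, high_pct=95):
    """Return (lower, upper) bounds of "normal" learned from past values.

    Uses empirical percentiles, so skewed or multi-modal data is handled
    without assuming a normal distribution.
    """
    cuts = quantiles(history, n=100, method="inclusive")  # 99 cut points
    return cuts[low_pct - 1], cuts[high_pct - 1]


def is_abnormal(history, current, low_pct=5, high_pct=95):
    """True when the current value falls outside the learned band."""
    lower, upper = dynamic_band(history, low_pct, high_pct)
    return current < lower or current > upper


# Illustrative, skewed history: mostly 12-16 with one spike to 45.
history = [12, 14, 13, 15, 14, 13, 12, 16, 15, 14,
           13, 45, 14, 15, 13, 12, 14, 16, 15, 13]
lower, upper = dynamic_band(history)
print(f"normal band: {lower} to {upper}")
print("14 abnormal?", is_abnormal(history, 14))
print("30 abnormal?", is_abnormal(history, 30))
```

In terms of the chart legend, `dynamic_band` plays the role of the gray bar and `is_abnormal` marks the points that would be drawn in red. A fuller version would learn separate bands per hour of day or day of week to capture patterns of behavior.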

Understanding Progressive Change
Diagram labels: Actual Build, Standard Build, New Build; 80,000 CIs
• Type: Unplanned, Uncontrolled
  • User Changes
  • Unapproved Admin Change
  • Exploits
  • Shadow IT
  • Origin: End Users, Developers, Suppliers
• Type: Planned, Controlled
  • Updates and fixes
  • Infrastructure changes
  • Component patches

Use Cases

The Role of Operations Management
• Ensure and Restore Service Levels
• Optimize for Efficiency and Cost
Diagram labels: Utilization / forecast, Slow performance, Problem, Maintenance, Reclaim capacity, Rollback change, Config issue, Orchestrate changes
Scale: Reactive, Proactive

Business benefits delivered by 3rd generation monitoring
vCenter Operations Management Suite: Comprehensive Visibility, Intelligent Automation, Proactive Management
• Higher QoS
• Fewer Incidents
• Tool Consolidation
• Compliance
• Faster MTTR
• Improved Collaboration
• Resource Utilization
• …
Customer quotes:
• “Troubleshooting time reduced by 50%”
• “Notified the storage team before they were even aware of an issue.”
• “We’ll be able to reduce our monitoring tools from over 300 to about 30.”
Customers named: TUI Infotec, Maximus, Kaiser Permanente

Customer Success: IT Operations
Solve performance issues before end-users are affected and reduce total alerts
• Before
  • 400 critical alerts/hour
  • End-user complaints alerted IT to the problem
  • End-users impacted (avg. 2 hours/outage)
  • 12 Level-2 engineers on bridge call to address problem
• After
  • 20 alerts/MONTH
  • 3 hours' advance warning of slowdown, with root cause
  • NO end-user impact
  • 1 Level-2 Engineer and 1 DBA to address problems
Capabilities used: Learn Normal, Smart Alerting, Root Cause
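The before and after alert figures use different units (per hour and per month). The short calculation below puts them on the same footing. It assumes the 400 alerts/hour rate was sustained around the clock for a 30-day month, which the slide does not state, so the result is best read as an upper bound on the reduction.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, alerts around the clock

alerts_before_per_month = 400 * HOURS_PER_MONTH  # 400 critical alerts/hour
alerts_after_per_month = 20                      # 20 alerts/month

reduction = 1 - alerts_after_per_month / alerts_before_per_month

# Staff needed per incident: 12 Level-2 engineers before,
# 1 Level-2 engineer plus 1 DBA after.
staff_before, staff_after = 12, 2

print(f"alerts per month before: {alerts_before_per_month}")
print(f"alert reduction: {reduction:.3%}")
print(f"staff per incident: {staff_before} -> {staff_after}")
```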

Bringing it all together

Focused Solutions
• Performance and Capacity analytics with root cause analysis
• Configuration, Change, Compliance Management with Patching
• Application Dependency Mapping

Change Events Correlated with Health and Performance
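One simple way to picture this correlation is to ask which change events happened shortly before a health or performance anomaly. The sketch below is a hypothetical illustration. The function name, the one-hour look-back window, and the sample events are assumptions, and the slides do not describe how the product performs the correlation.

```python
from datetime import datetime, timedelta


def changes_before(anomaly_time, change_events, window=timedelta(hours=1)):
    """Return change events that occurred within `window` before the anomaly.

    change_events: list of (timestamp, description) tuples (hypothetical).
    """
    return [
        (when, what)
        for when, what in change_events
        if timedelta(0) <= anomaly_time - when <= window
    ]


# Illustrative data only.
changes = [
    (datetime(2008, 3, 4, 15, 40), "patch applied to Host 1"),
    (datetime(2008, 3, 4, 9, 0), "config update on Host 3"),
]
anomaly = datetime(2008, 3, 4, 16, 8)

suspects = changes_before(anomaly, changes)
for when, what in suspects:
    print(when.isoformat(), what)
```

Only the patch applied 28 minutes before the anomaly falls inside the window. The morning config update is ignored.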

Deeper performance and capacity management for the Cloud
Audience: Service Owner
• Overview
  • Gain performance and capacity management across the Enterprise
  • Cover every silo of the environment
  • Break down the silos in the organization
  • Reduce overall MTTR/MTTI
  • Keep an eye on your cloud service providers
  • Reclaim precious compute resources
  • Gain unprecedented visibility into how your infrastructure behaves

Performance and Capacity for VDI
• Overview
  • End-to-end monitoring of infrastructure
  • Includes PCoIP performance monitoring
  • Desktop, Pool and User Contexts
  • Self-Learning performance analytics
  • Automated alerts
  • Remediation guidance
• Benefits
  • Get to root cause quickly; reduce MTTI
  • Respond proactively before support calls
  • Remediate quickly and accurately
  • Improve resource utilization by identifying over-provisioned hardware and tracking down bottlenecks

Thank you for your time!
Additional reading material:
• Quantifying Information Data Loss through Data Aggregation: http://www.vmware.com/files/pdf/vcenter/VMware-vCenter-Operations-Quantifying-Information-Loss-Data-Aggregation-WP-EN.pdf
• How Normal is Your Data: http://www.vmware.com/files/pdf/vcenter/VMware-vCenter-Operations-How-Normal-Is-Your-Data-WP-EN.pdf
