vCenter Operations. Technical Discussion, May 2011. VCAP-DCD. Iwan ‘e1’ Rahabok Senior Systems Consultant e1@vmware.com | virtual-red-dot.blogspot.com | 9119-9226. Introduction. Application Management App Release + Performance. 5. IT Service Management
vCenter Operations • Technical Discussion, May 2011 VCAP-DCD Iwan ‘e1’ Rahabok Senior Systems Consultant e1@vmware.com | virtual-red-dot.blogspot.com | 9119-9226
5. Application Management • App Release + Performance • 4. IT Service Management • Problem, incident, change, config • 3. Security & Compliance • vShield + VCM: Operational and Regulatory Compliance Including physical management • 2. Private Cloud Self-Service • Solution Bundle: IaaS • 1. Infrastructure & Operations • Performance, Capacity, Configuration
>< Automation Orchestration More Engineering More Management
Management Challenges • Performance problems often occur with no real warning • Many times end users are the first to notice problems • Root cause determination is difficult and time-consuming • Solving problems requires all-hands-on-deck bridge calls • Real-time understanding of performance is lacking • No reliable understanding of the health of IT infrastructure makes IT too reactive • Siloed monitoring tools do not allow a common “truth” • No correlation across IT silos • Optimizing IT infrastructure is difficult if not impossible • Understanding the abnormal metric behaviors that lead to degradation of Key Performance Indicators is not possible with current tools • Understanding the abnormal behaviors that define your worst performing devices is not possible with current tools • Heavy reliance on “Tribal Knowledge” of a few application experts
What If You Could… • Automate • Eliminate time-consuming problem resolution processes • Correlate and Accelerate • “One Click” to root cause of emerging performance problems to reduce MTTI/MTTR • Get Proactive • Avert end user and business impact of building performance problems • Collaborate • Aggregate and correlate data from monitoring landscape to create a single “truth” • Optimize • Tune components to deliver optimal performance for application transactions
vCenter Operations Enterprise • + Configuration & Compliance Management (vCenter Configuration Manager) • + Other VMware & 3rd Party Integrations (View, management, servers, storage) • vCenter Operations Advanced • Capacity Management • vCenter Operations Standard • Performance Management (up to 1500 VM) • vCenter • VMware Cloud / vCenter • Non-VMware (incl. physical) environments
Patented Performance Analytics • Self-learning of “normal” performance conditions • Service health baseline and trending • Smart alerts of impending performance degradation • Purpose Built Capacity Planning & Analysis • Integrated capacity analysis and forecasting • Decision support & automation via views, alerts, reports • VM right sizing and capacity reclamation • Automated Configuration & Compliance • Automated Patching and Provisioning • Comprehensive change tracking to isolate root cause • Single-click rollback to remediate and return to normal
Comparing the Editions Scope Function
Demo • Familiarisation of UI • Infrastructure and Analysis • Concepts • Workload • Health • Capacity
vCenter Environment - Workload • Workload Measures • Demand for resources vs. Resources currently used • Result is a percentage of Workload • Low number is Good – Object has the resources it needs • Can go above 100% - Object is “Starving” • Workload summarized across critical resources • CPU • Storage • Network • Memory • Workload Details View • View the state of the Peer and Parent Objects and troubleshoot • Am I a victim or a villain? • Is this a population problem?
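The Workload idea above can be sketched in a few lines. The slide does not give the exact formula, so this is only an illustration under two assumptions: workload is demand divided by the capacity available to the object, and the object-level score is the worst score across its resources. The function names and sample figures are hypothetical.

```python
from __future__ import annotations


def workload_percent(demand: float, capacity: float) -> float:
    """Demand as a percentage of available capacity (assumed formula)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * demand / capacity


def object_workload(resources: dict[str, tuple[float, float]]) -> tuple[str, float]:
    """Return the most constrained resource and its workload percentage.

    `resources` maps a resource name to a (demand, capacity) pair.
    Taking the maximum across resources is an assumption, not the product's rule.
    """
    scores = {name: workload_percent(d, c) for name, (d, c) in resources.items()}
    worst = max(scores, key=scores.get)
    return worst, scores[worst]


# Hypothetical VM: memory demand exceeds what is available, so it is "starving".
vm = {
    "cpu": (1800.0, 2000.0),    # MHz demanded, MHz available
    "memory": (5.0, 4.0),       # GB demanded, GB available
    "network": (20.0, 1000.0),  # Mbit/s
    "storage": (150.0, 500.0),  # IOPS
}
print(object_workload(vm))
```

A score above 100 here simply means the object wants more of a resource than it can get, which matches the "Starving" wording on the slide.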
vCenter Environment - Health • Health Measures • How normally is this object behaving? • 0-100 (Higher is Healthier or Normal) • Learns dynamic ranges of “Normal” for each metric • Learns patterns of behavior and identifies metric abnormalities • Healthy = no abnormalities • Health and Workload together • Health High and Workload High – Normal Behavior for this timeframe • Health High and Workload Low – Normal Behavior for this timeframe • Health Low and Workload High – Something is amiss! Performance spike • Health Low and Workload Low – Something is amiss. Demand drops • Important Note: Low Health does not imply a problem. It tells you that the object is acting differently than normal.
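The four Health and Workload combinations can be read as a small decision table. The sketch below is illustrative only: the cut-offs that separate "high" from "low" (76 for Health, 85 for Workload) are assumptions, and `interpret` is a hypothetical helper name.

```python
def interpret(health: int, workload: int,
              health_cutoff: int = 76, workload_cutoff: int = 85) -> str:
    """Map a (health, workload) pair to the reading given on the slide.

    The cut-off values are assumptions chosen for illustration.
    """
    healthy = health >= health_cutoff
    busy = workload >= workload_cutoff
    if healthy:
        # High health means the object behaves as usual, busy or idle.
        return "normal behavior for this timeframe"
    if busy:
        return "abnormal: possible performance spike"
    return "abnormal: demand may have dropped"


print(interpret(health=90, workload=95))
print(interpret(health=20, workload=110))
print(interpret(health=20, workload=10))
```

Note that a low Health result is reported as "abnormal", not as a fault, in line with the slide's note that low Health only means the object is acting differently than normal.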
Learn Normal Behavior and Identify Abnormalities • Doesn’t assume IT data has a normal bell-shaped distribution • Sophisticated Analytics – 8 different algorithms • Learns your dynamic ranges of “Normal” without templates • Learns patterns of behavior and identifies Abnormalities GRAY BAR Upper and Lower band of Dynamic Threshold - “Normal” BLUE LINE Metric’s Current Value RED BAR Breached Dynamic Threshold – “Abnormal”
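To make the gray band and red breach concrete, here is a minimal sketch of one distribution-free way to learn a band: take empirical percentiles of past values instead of assuming a bell curve. This is a stand-in technique chosen for illustration; the slides do not describe the product's eight algorithms, and the function names and the 5th/95th percentile choice are assumptions.

```python
import statistics


def learn_band(history, lower_pct=5, upper_pct=95):
    """Return (lower, upper) from empirical percentiles of past values."""
    cuts = statistics.quantiles(history, n=100)  # 99 cut points: 1st..99th percentile
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def is_abnormal(value, band):
    """True when the current value falls outside the learned band."""
    lower, upper = band
    return value < lower or value > upper


history = list(range(1, 101))  # hypothetical past samples of one metric
band = learn_band(history)
print(band, is_abnormal(50, band), is_abnormal(200, band))
```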
vCenter Environment - Capacity • Capacity • How much time before Capacity runs out? • 0-100: Higher number, longer time. • Thresholds User Configurable • 30 Days Left = RED • 60 Days Left = Orange • Etc. • Unlike Workload, Capacity is long-term. • Capacity measured for critical resources • CPU, RAM, Storage, Network • Capacity Details View • Shows the chart and trend for each of the above resources • Denotes current state • Projected breach point and days left
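A projected breach point and "days left" can be illustrated with a straight-line trend. This is only a sketch under the assumption of linear growth fitted by ordinary least squares; the slides do not say how the product forecasts, and `days_until_full` is a hypothetical name.

```python
from __future__ import annotations


def days_until_full(daily_usage: list[float], capacity: float) -> float | None:
    """Estimate days until usage reaches capacity from a least-squares line.

    `daily_usage` holds one sample per day, oldest first.
    Returns None when usage is flat or shrinking (no projected breach).
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    breach_day = (capacity - intercept) / slope  # day index where the line hits capacity
    return max(0.0, breach_day - (n - 1))        # days remaining after the latest sample


# Hypothetical datastore growing 10 GB per day toward a 200 GB limit.
print(days_until_full([100, 110, 120, 130, 140], capacity=200))
```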
Health (Deviation) • Green square: 76–100. • The health of the object is normal. No attention required. • Yellow square: 51–75. • The object is experiencing some level of issues. You must check and take appropriate action. • Orange square: 26–50. • The object might have serious issues. You must check and take appropriate action as soon as possible. • Red square: 0–25. • The object is either not functioning properly or will stop functioning soon. You must take an action immediately. • Blue square: • No data is available for any of the metrics for the time period. • Gray square: • The object is offline.
Workload • Green circle: 0–84. • There is no excessive workload on the object. No attention required. • Yellow circle: 85–94. • The object is experiencing some high resource workloads. • Orange circle: 95–99. • Workload on the object is approaching its capacity in at least one area. • Red circle: 100 or more. • Workload on the object is at or over its capacity in one or more areas. • The numbers 85 and 95 are shown as Green and Yellow lines in the Events chart.
Capacity • Green cube: 26–100. • The object is not expected to reach its capacity limits within the next 120 days. • Yellow cube: 16–25. • In 60 - 120 days. • Orange cube: 6–15. • In 30 - 60 days. • Red cube: 0–5. • In < 30 days. • The numbers 5, 15 and 25 are shown as colored lines in the Events chart.
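The three badge scales (Health, Workload, Capacity) share the same shape: a score is mapped to a color by fixed cut-offs. The sketch below encodes the ranges listed on the slides as one lookup table. It assumes integer scores and leaves out the blue (no data) and gray (offline) Health states; `BADGES` and `badge_color` are hypothetical names.

```python
from bisect import bisect_right

# For each badge: the lower bounds of the 2nd, 3rd and 4th band,
# and the colors of the four bands from lowest score to highest.
BADGES = {
    "health":   ([26, 51, 76],  ["red", "orange", "yellow", "green"]),
    "workload": ([85, 95, 100], ["green", "yellow", "orange", "red"]),
    "capacity": ([6, 16, 26],   ["red", "orange", "yellow", "green"]),
}


def badge_color(badge: str, score: int) -> str:
    """Return the color band for a score on the given badge."""
    cutoffs, colors = BADGES[badge]
    return colors[bisect_right(cutoffs, score)]


print(badge_color("health", 80), badge_color("workload", 97), badge_color("capacity", 4))
```

Note the direction differs per badge: a high Health or Capacity score is good, while a high Workload score is bad.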
Performance Visibility Across the Virtualized Datacenter Aggregates 100s of metrics into 1 intelligent score Full visibility up and down the datacenter stack Drill into ESX server for further details
Intuitive, Web RIA-based user-friendly interface Search and filter Breadcrumbs to track object hierarchy Context sensitive object hierarchy
Continuous, automatic learning of normal behavior for key metrics Workload issue correlated to net I/O constraints Quickly show Reservation vs Demand vs Usage
Drilldown to track changes Diagnostics relative to parent, peer and child objects Detailed display of events and health score changes
Visibility into Disk and Network IO performance Network statistics for every NIC Disk subsystem performance details by datastores and LUNs Quiz: what’s the difference between Total & Host?
Quickly identify “suspect” performance metric KPI history with timestamp to indicate root cause
Capacity • Estimating the number of days left • Score is 0-100. Non-linear. 10 does not mean 10 days left. • CapacityIQ value add: • What-If analysis • Discovery of over-allocated and under-allocated VM • Reporting • A Capacity-centric dashboard
Health tree with topology mapping Top-down visibility into health changes Time-series charts for individual metric
Individual performance metric details Single view that correlates multiple metrics Detailed list of all metrics indicating smart alerts
Visualisation quickly pinpoints hotspots Single click drill down for further details
Storage • Since all the datastores are on the same array, how do we quickly tell the relative workload generated by every one of them? • For each of these datastores, how do we know the relative workload generated by the VM? • For every VM, how do we know the latency is within reasonable number? • How do we show all the above data in “one chart”, without the need to show a lot of numbers?
vCenter Operations Standard Architecture • Four Main Services: Collector, Analytics, Web, ActiveMQ • Bundled DB: • PostgreSQL DB • File-based DB (FSDB) for raw metric storage • Single Collector for vCenter. Embedded in appliance
vCenter Operations Standard Processing • 1a: vCenter Collector collects metrics, topology & change events from vCenter (ongoing) • 1b: Data stored in FSDB • 2a: Analytics runs daily to determine hour-by-hour Dynamic Thresholds for the next 24 hours • 2b: Full FSDB is scanned by the analytic algorithms to determine, per metric, the best match for the next 24-hour period • 2c: Metric Dynamic Threshold data stored in PostgreSQL DB • 3: Incoming data points are tested against Dynamic Threshold bands and used to calculate Health, Workload and Capacity • 4: Results provided to UI: Update “Badges”, provide Root Cause for Health scores, etc.
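The daily threshold job and the per-point test can be sketched as two small functions. This is a simplified illustration, not the appliance's implementation: the band for each hour of the day is just the minimum and maximum seen in history at that hour (a stand-in for the undisclosed algorithms), and in-memory lists stand in for FSDB and the threshold store. All names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def hourly_bands(history):
    """Daily job: learn one (low, high) band per hour of day.

    `history` is an iterable of (datetime, value) pairs for a single metric.
    """
    by_hour = defaultdict(list)
    for ts, value in history:
        by_hour[ts.hour].append(value)
    return {hour: (min(vals), max(vals)) for hour, vals in by_hour.items()}


def breaches(point, bands):
    """Per-point test: True when an incoming value is outside its hour's band."""
    ts, value = point
    band = bands.get(ts.hour)
    if band is None:
        return False  # no band learned for this hour yet
    low, high = band
    return not (low <= value <= high)


# Hypothetical history: busy at 09:00, quiet at 02:00, over two days.
history = [
    (datetime(2011, 5, 1, 9), 40.0), (datetime(2011, 5, 2, 9), 50.0),
    (datetime(2011, 5, 1, 2), 5.0),  (datetime(2011, 5, 2, 2), 7.0),
]
bands = hourly_bands(history)
print(breaches((datetime(2011, 5, 3, 9), 90.0), bands))
print(breaches((datetime(2011, 5, 3, 2), 6.0), bands))
```

Keeping a separate band per hour is what lets the same value count as normal at 09:00 and abnormal at 02:00.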
vCenter Operations Data Agnostic Approach to Data Collection • Accepts any time series data (examples) • Server OS • Server App layer (eg, IIS, Oracle, WebSphere, etc) • Network • Storage • User Experience • Transactional • Business Data • Change Events • Minimal Required Fields (4) • Object Name, Metric Name, Value, Timestamp • Data Extraction - *not* an analytic question • No rules/templates to Write and Maintain • vCenter Operations Analytics do all of the “Work”
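The four minimal fields map naturally onto a small record type. The sketch below is illustrative: the class name, the field names and the comma-separated input layout are assumptions, not a documented vCenter Operations format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DataPoint:
    """One time-series sample with the four minimal fields."""
    object_name: str
    metric_name: str
    value: float
    timestamp: datetime


def parse_line(line: str) -> DataPoint:
    """Parse 'object,metric,value,epoch_seconds' (a hypothetical layout)."""
    obj, metric, value, epoch = line.strip().split(",")
    ts = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return DataPoint(obj, metric, float(value), ts)


point = parse_line("db01,cpu_usage,73.5,1304208000")
print(point)
```

Because the record carries no domain-specific fields, the same shape works for server, network, storage or business metrics.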
Learn Normal Behavior and Identify Abnormalities • Doesn’t assume IT data has a normal bell-shaped distribution • Sophisticated Analytics – 8 different algorithms • Learns your dynamic ranges of “Normal” without templates • Learns patterns of behavior and identifies Abnormalities GRAY BAR Learned Upper and Lower band of Dynamic Threshold - “Normal” BLUE LINE Metric’s Measured Value RED Zone Breached Dynamic Threshold – “Abnormal”
Dynamic Threshold Algorithms Dynamic Thresholds are the Cornerstone to all other forms of vC Ops Ent - Stand Alone Analytics • Understand the normal behavior of any time-series metric • Eight (8) distinct algorithms each determine an upper and lower ‘band’ – results of each algorithm compete to ‘win’ to represent the ‘best choice’ • vC Ops Ent - Stand Alone detects metric-level abnormalities for use in: • Generation of Smart Alerts • Visualizing real-time ‘Health’ • Revealing hidden relationships • etc. * Figure shows a performance metric (blue line), its normal behavior (gray zone), and when it’s behaving abnormally (red area)
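The idea of several algorithms competing to supply the band can be sketched as follows. Everything specific here is an assumption made for illustration: the three candidate methods are simple stand-ins rather than the product's eight algorithms, and the "winner" rule (the narrowest band that still covers at least 95% of held-out points) is one plausible scoring choice, not the documented one.

```python
import statistics


def stdev_band(train, k=3.0):
    """Mean plus or minus k population standard deviations."""
    mu = statistics.fmean(train)
    sd = statistics.pstdev(train)
    return mu - k * sd, mu + k * sd


def percentile_band(train):
    """1st to 99th empirical percentile."""
    cuts = statistics.quantiles(train, n=100)
    return cuts[0], cuts[-1]


def minmax_band(train):
    """Smallest and largest value seen."""
    return min(train), max(train)


CANDIDATES = {"stdev": stdev_band, "percentile": percentile_band, "minmax": minmax_band}


def coverage(band, points):
    """Fraction of points that fall inside the band."""
    low, high = band
    return sum(low <= p <= high for p in points) / len(points)


def pick_band(train, holdout, min_coverage=0.95):
    """Return (name, band) of the narrowest candidate with enough coverage, or None."""
    scored = []
    for name, fn in CANDIDATES.items():
        band = fn(train)
        if coverage(band, holdout) >= min_coverage:
            scored.append((band[1] - band[0], name, band))
    if not scored:
        return None
    _, name, band = min(scored)  # narrowest band wins; ties broken by name
    return name, band


train = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]  # hypothetical metric history
holdout = [11, 12, 12, 10, 13]                    # later samples used for scoring
print(pick_band(train, holdout))
```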
Proactive Alerting – Smart Alerts App Data (eg, Wily, etc.) User Experience (eg, RUM, etc.) Business Application Smart Alert Generation (“When”) ! SMART ALERT Database Silo (eg, Quest, etc.) Network Data (e.g., Ionix IPPM, etc.) Business Data (eg, Finance)
Smart Alert Trigger • vC Ops Ent - Stand Alone tracks aggregate amount of abnormality and alerts when “explosion” is detected, or when a ‘high water mark’ is detected • Intrinsically observed that performance problems are first seen at the metric level when metrics begin to behave abnormally • Blue shaded region represents the number of metrics for an application (represented by a set of servers/devices) that are at any given time measured abnormally • The Red line represents an Analytically determined ever-changing level at which vC Ops Ent - Stand Alone determines a performance warning is warranted – a Smart Alert is triggered
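The trigger described above, a count of abnormal metrics compared against an ever-changing line, can be sketched like this. The way the line is computed here (mean plus three standard deviations of recent counts) is an assumption for illustration; the slides only say the level is determined analytically. `SmartAlertTrigger` is a hypothetical name.

```python
import statistics
from collections import deque


class SmartAlertTrigger:
    """Alert when the number of abnormal metrics jumps above a moving noise line."""

    def __init__(self, window=24, k=3.0):
        self.counts = deque(maxlen=window)  # recent abnormal-metric counts
        self.k = k

    def noise_line(self):
        """Current alert level, or None until enough history exists."""
        if len(self.counts) < 2:
            return None
        return statistics.fmean(self.counts) + self.k * statistics.pstdev(self.counts)

    def observe(self, abnormal_count):
        """Record one interval's count; return True if it crosses the noise line."""
        line = self.noise_line()  # compare against history before this sample
        self.counts.append(abnormal_count)
        return line is not None and abnormal_count > line


trigger = SmartAlertTrigger(window=10)
samples = [2, 3, 2, 3, 2, 3, 2, 3, 40]  # hypothetical counts; the last is an "explosion"
results = [trigger.observe(c) for c in samples]
print(results)
```

Comparing each new count with the line computed from earlier samples keeps a sudden spike from raising its own threshold.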
Smart Alert Summary (“What”) Root cause technology tier is the DB Metric-level root cause symptoms - START HERE Impact analysis shows the health of the application as well as the health of the tiers that comprise the application Root-Cause ranks the tiers in order of priority and within those tiers shows the most affected metrics and resources
Drill down to the Root Cause Smart Alert Summary (“What”) Early Warning SMART ALERT Noise Line Crossed
Drill down to the Root Cause Smart Alert Summary (“What”) Impact to application health Impact to health of each technology tier No major impact to application key Performance Indicators (KPIs)…yet.
Drill down to the Root Cause Smart Alert Summary (“What”) See the effect of change and other external events on application health with this “mash up” view
Tracking disparate “Resources” from various technology silos Alert only when applications need attention Learning behaviour analytically Determine performance Health
Performance Visibility Across the Virtualized Datacenter DB is Root Cause tier START HERE! Symptoms Proactive Alert Impact to health of each technology tier Application Health KPIs are outside normal levels but have not breached SLAs
Performance Visibility Across the Entire Datacenter Application Owner View - Application health view with active alerts and tier health
Dynamic Performance Dashboards – Application Owner Views Application health view with active alerts and tier health Heat Maps allow you to see the Health of hundreds of objects at once. Health and Alerts broken down by Tier and Objects