How to Manage Data Collection in a Large Environment
Paul K Merline & Mike Badaczewski
November 15, 2011
Which is greater…the average attendance at Busch Stadium or the number of servers we collect data on every night?
Answer:
• AT&T systems collected nightly = 38,353
• Busch Stadium average nightly attendance = 38,196 (source: ESPN.com, as of 10-18-11)
Data Collection Goals
• Provide consistent, standard and meaningful resource usage data for all servers to support Capacity Planning.
• Establish and maintain an environment capable of supporting data collection for 40,000 servers with the existing staff.
• Have the previous day's data available by 08:00 local time.
Data Collection Overview
• The number of metrics collected and their retention are based on the criticality of the server (service levels).
• Data collection is separated by platform, e.g. UNIX, Windows, etc.
• The workload is spread across several centralized data collection servers (Consoles).
• Data collection is staggered across time zones.
• Analyzed data output is sent to the database server for Visualizer database loads.
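To make the "stagger across time zones" point concrete, here is a minimal sketch of how per-region start times could be derived. The 01:00 local start hour, the function name, and the fixed standard-time offsets are assumptions for illustration only; the deck does not say how the actual schedule is built.

```python
from datetime import date, datetime, time, timedelta, timezone

# Standard-time UTC offsets in hours (assumption: daylight saving time is ignored).
REGION_UTC_OFFSET = {
    "East": -5,
    "Central": -6,
    "Mountain": -7,
    "Pacific": -8,
    "Alaska": -9,
    "Hawaii": -10,
}

LOCAL_START = time(1, 0)  # hypothetical local start hour, not from the deck


def utc_start_times(day, local_start=LOCAL_START):
    """Return {region: UTC start datetime}, ordered from earliest to latest."""
    starts = {}
    for region, offset in REGION_UTC_OFFSET.items():
        tz = timezone(timedelta(hours=offset))
        local_dt = datetime.combine(day, local_start, tzinfo=tz)
        starts[region] = local_dt.astimezone(timezone.utc)
    return dict(sorted(starts.items(), key=lambda item: item[1]))


if __name__ == "__main__":
    for region, start in utc_start_times(date(2011, 11, 15)).items():
        print(f"{region:9s} starts at {start:%H:%M} UTC")
```

Because every region starts at the same local hour, the runs are naturally spread over several hours of wall-clock time on the shared Consoles.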
Data Collection Strategy
Collect and retain only the metrics necessary based on the criticality of the server.
• Tier and Tier Level are assigned based on:
  • server criticality (MCA, normal production)
  • status (production, test, development)
  • in-service indicator
• Service Level is assigned based on Tier and Tier Level, and determines:
  • metrics collected
  • retention period of metrics
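The strategy above amounts to a lookup from server attributes to a service level, and from the service level to a metric set and retention period. The sketch below shows that shape only. The mapping rules, metric names, and retention values are all hypothetical placeholders; the deck does not give the real ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    name: str
    metrics: tuple
    retention_days: int


# Hypothetical metric sets and retention periods, for illustration only.
LEVELS = {
    "Gold": ServiceLevel("Gold", ("cpu", "memory", "disk", "network", "process"), 395),
    "Silver": ServiceLevel("Silver", ("cpu", "memory", "disk"), 180),
    "Bronze": ServiceLevel("Bronze", ("cpu", "memory"), 90),
}


def service_level_for(criticality, status, in_service):
    """Pick a service level from the three inputs named on the slide.

    The rules below are assumptions, not the presenters' actual policy.
    """
    if not in_service:
        return None  # servers not in service are not collected in this sketch
    if status == "production" and criticality == "MCA":
        return LEVELS["Gold"]
    if status == "production":
        return LEVELS["Silver"]
    return LEVELS["Bronze"]


if __name__ == "__main__":
    level = service_level_for("MCA", "production", True)
    print(level.name, level.metrics, level.retention_days)
```

Keeping the policy in one table means a change in retention or metric scope is a data change, not a change to the collection jobs.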
Data Collection Service Levels BRONZE SILVER PLATINUM GOLD
Data Collection Process
Servers are grouped into collection domains based on:
• Service Level: Gold, Silver, Bronze
• Region: East, Central, Mountain, Pacific, Alaska, Hawaii
• Platform: UNIX, Windows, VMWare
• Frame: Frames, Non-Frames
(target is 25 servers per domain for performance reasons)
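The grouping described above can be sketched as a two-step operation: bucket servers by the four attributes, then split each bucket into chunks of at most 25. The record field names and the function name are assumptions for illustration; the deck does not describe the actual tooling.

```python
from collections import defaultdict

DOMAIN_TARGET = 25  # target servers per domain, from the slide


def build_domains(servers, target=DOMAIN_TARGET):
    """Group server records into domains of at most `target` servers.

    Each record is a dict with hypothetical keys:
    name, service_level, region, platform, on_frame.
    """
    groups = defaultdict(list)
    for server in servers:
        key = (
            server["service_level"],
            server["region"],
            server["platform"],
            server["on_frame"],
        )
        groups[key].append(server["name"])

    domains = []
    for key, names in sorted(groups.items()):
        # Split each attribute group into consecutive chunks of `target` names.
        for start in range(0, len(names), target):
            domains.append((key, names[start:start + target]))
    return domains


if __name__ == "__main__":
    sample = [
        {"name": f"ux{i:03d}", "service_level": "Gold", "region": "East",
         "platform": "UNIX", "on_frame": False}
        for i in range(60)
    ]
    for key, names in build_domains(sample):
        print(key, len(names))
```

For scale, the nightly totals reported later in the deck (38,353 servers in 1,925 domains) work out to roughly 19.9 servers per domain, under the target of 25.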
Data Collection Tools
• The BMC Performance Assurance product family offers a complete solution for performance management of UNIX and Windows systems.
• It delivers the following critical functions for managing distributed systems:
  • Real-time monitoring
  • Modeling and predicting
  • Graphical performance analysis
BMC Performance Assurance (architecture diagram)
• Stages: Collect, Analyze
• Hosts: Servers (nodes), Console Servers, Database Server, Analyst Console
• BMC components: BMC Visualizer (detailed analysis), BMC Perceiver (web-based report viewing), BMC Investigate (real-time analysis), BMC Predict (modeling)
• Reporting, including ATT Developed components: Exception Reporting, CPDB Application Reporter, CPDB Reporting, Forecasting (bi-annual planning)
• FACT Metric Tables: Hourly, Summarized
BMC Consoles and Visualizer Database
• Visualizer database is 2.3 TB in size and divided into 92 schemas by:
  • Platform
  • Time Zone
  • Service Level
  (limit of 1,000 servers per schema for performance)
Visualizer Database Schemas (regions: East, Central, Pacific; each platform has three regional groups plus All Other)
• UNIX, 62 schemas: Gold 6 / Silver 8 / Bronze 6; Gold 6 / Silver 8 / Bronze 7; Gold 5 / Silver 7 / Bronze 4; All Other 5
• Windows, 26 schemas: Gold 1 / Silver 5 / Bronze 2; Gold 1 / Silver 3 / Bronze 1; Gold 1 / Silver 6 / Bronze 2; All Other 4
• VMWare, 4 schemas: Gold 1; Gold 1; Gold 1; All Other 1
Number of Servers Collected from Nightly Automation (four Consoles: A, B, C, D)
• 8,566 servers, 476 domains
• 8,743 servers, 475 domains
• 11,970 servers, 485 domains
• 9,074 servers, 489 domains
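The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only sums numbers that appear in the deck; the variable names are illustrative.

```python
# Schema counts per platform, as stated on the slide.
schemas = {"UNIX": 62, "Windows": 26, "VMWare": 4}

# (servers, domains) handled by each of the four Consoles, as stated on the slide.
console_loads = [(8_566, 476), (8_743, 475), (11_970, 485), (9_074, 489)]

total_schemas = sum(schemas.values())
total_servers = sum(servers for servers, _ in console_loads)
total_domains = sum(domains for _, domains in console_loads)

# Average load per schema, to compare with the 1,000-server limit.
servers_per_schema = total_servers / total_schemas

if __name__ == "__main__":
    print(f"schemas: {total_schemas}")
    print(f"servers: {total_servers}")
    print(f"domains: {total_domains}")
    print(f"average servers per schema: {servers_per_schema:.1f}")
```

The totals agree with the 92 schemas stated here and with the 38,353 servers and 1,925 domains in the nightly statistics. The average schema holds well under half of the 1,000-server limit.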
Data Collection Process
• Perform binaries are laid down with the Patrol installation on the server (node).
• A collector runs on each server (node) and writes data to disk periodically (currently set to 15 minutes).
• The data is pulled by the Perform Console and processed nightly (hourly summarization), creating "vis" files.
• Nightly automation consists of 3 processes:
  • Retrieve
  • Analyze
  • Populate
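To illustrate what "hourly summarization" of 15-minute samples means, here is a minimal sketch that averages samples per clock hour. Using a plain mean is an assumption; the deck does not say how the BMC Analyze step actually aggregates, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def summarize_hourly(samples):
    """Average (timestamp, value) samples into one value per clock hour."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


if __name__ == "__main__":
    # Made-up CPU utilization samples at 15-minute intervals.
    cpu = [
        (datetime(2011, 11, 14, 0, 0), 10.0),
        (datetime(2011, 11, 14, 0, 15), 20.0),
        (datetime(2011, 11, 14, 0, 30), 30.0),
        (datetime(2011, 11, 14, 0, 45), 40.0),
        (datetime(2011, 11, 14, 1, 0), 50.0),
    ]
    for hour, value in summarize_hourly(cpu).items():
        print(hour, value)
```

With a 15-minute interval, each hourly value is built from four samples, so a day of data shrinks from 96 points per metric to 24.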
Monitoring Environment Results
• Nightly automation stats:
  • 7 time zones
  • 39 states
  • 256 cities
  • 1,925 domains
  • 1,947 VIS files
  • 38,353 servers
  • 621,615 UDR files
• 13.5 new servers added per day over the last year (4,947 total)
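The growth figure can be reproduced, and projected against the 40,000-server goal stated earlier, with simple arithmetic. The projection assumes growth stays linear at last year's rate, which is an assumption and not a claim made in the deck.

```python
servers_added_last_year = 4_947
current_servers = 38_353
capacity_goal = 40_000  # from the Data Collection Goals slide

# Average daily growth over a 365-day year.
added_per_day = servers_added_last_year / 365

# Days until the goal is reached if growth stays linear (assumption).
days_to_goal = (capacity_goal - current_servers) / added_per_day

if __name__ == "__main__":
    print(f"added per day: {added_per_day:.2f}")
    print(f"days until {capacity_goal:,} servers: {days_to_goal:.0f}")
```

At that rate the environment would reach its design target in about four months, which is why capacity of the collection infrastructure itself needs planning.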
Bonus Material BMC 7.5
Performance Assurance Release 7.5 New Features and Functionality
New Virtualization Support
• SUN Solaris Logical Domains (LDoms)
• SUN Chip Multi-Threading (CMT) technology
• IBM AIX Live Partition Mobility
• IBM AIX Workload Partitions (WPARs)
• IBM PowerVM
• HP Integrity Virtual Machines (IVM)
• Microsoft 2008 Virtualization Server (Hyper-V)
Enhanced VMware Virtualization Support
• Cluster, resource pool, disk and datastore metrics
• Info on relationships between servers, virtual machines, pools
• Perceiver support for cluster, resource pool and disk views
• Improvements to proxy data collector
• Complete re-design of Visualizer tables and relationships
Performance Assurance Release 7.5 New Features and Functionality (cont)
Console Operations
• Improvements to Manager for recovery and reprocessing of data
• Manager exception reports
• Officially supported Service Levels
• New General Manager web application to manage Perform and Perceiver (daily operation and exceptions)
• UDR Transfer Utility
• Changes to management of Hardware table for performance ratings
• Changes to the Visualizer database structures
• Problem resolutions and enhancement implementations
7.5 Migration Issues
• Some Visualizer tables have been re-designed to accommodate metrics for virtual servers (current metrics may have moved to new tables).
• The changes in Visualizer require migrating all data from the old 7.4 schemas to the new 7.5 schemas.
• If multiple Consoles update the same Visualizer schema, all of those Consoles must be migrated to release 7.5 at the same time.
• The Visualizer database migrations must be done at the same time the Consoles are migrated to release 7.5.
• Therefore, in our environment, all Consoles and all Visualizer databases must be migrated to 7.5 simultaneously.
• Per BMC, very large Visualizer schemas may take longer than a day to migrate to 7.5 (we have 90+ Visualizer schemas).
• Per BMC, the most significant problems they have seen with the new release involve database migrations.
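The "migrate together" constraint can be reasoned about as a grouping problem: any two Consoles that write to the same schema land in the same migration group, and the rule chains through shared schemas. The sketch below computes those groups with a small union-find. The Console-to-schema mapping and schema names are hypothetical; the deck does not list which Console updates which schema.

```python
def migration_groups(console_schemas):
    """Group Consoles that must migrate together because they share a schema.

    `console_schemas` maps a Console name to the set of schemas it updates.
    Returns a sorted list of sorted Console-name lists.
    """
    parent = {console: console for console in console_schemas}

    def find(console):
        # Follow parent links to the group representative, compressing the path.
        while parent[console] != console:
            parent[console] = parent[parent[console]]
            console = parent[console]
        return console

    first_writer = {}
    for console, schemas in console_schemas.items():
        for schema in schemas:
            if schema in first_writer:
                # Shared schema: merge this Console's group with the earlier writer's.
                parent[find(console)] = find(first_writer[schema])
            else:
                first_writer[schema] = console

    groups = {}
    for console in console_schemas:
        groups.setdefault(find(console), set()).add(console)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Hypothetical mapping, for illustration only.
    example = {
        "A": {"unix_east_gold", "unix_east_silver"},
        "B": {"unix_east_silver", "win_central_gold"},
        "C": {"win_central_gold"},
        "D": {"vmware_pacific_gold"},
    }
    print(migration_groups(example))
```

In this made-up example, A, B and C form one group through chained sharing, while D could migrate on its own. The presenters' conclusion that everything must move simultaneously corresponds to the case where all Consoles fall into a single group.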