Seeing the Forest AND the Trees: Capacity Planning for a Large Number of Servers Linwood Merritt Capital One Services, Inc. linwood.merritt@capitalone.com page 1
Introduction: Environment • Capital One • 4th largest card issuer in the United States • Added to the S&P 500 in 1998 • Fortune 500 company starting in 2000 • Managed loans at $71.2 billion as of Q4 2003 • Accounts at 47.0 million as of Q4 2003 • CIO 100 Award “Master of the Customer Connection” • Information Week “Innovation 100” Award Winner • ComputerWorld “Top 100 places to work in IT” page 2
Categories of Issues • Acquiring and using business data • Exception detection • Platform types and operating systems • Data capture analysis • Organization and reporting of server structure • Bulk Capacity Planning • Business driver based forecasting • Visualization techniques page 3
Acquiring Business Data • Chaney, Bob, “The Capacity Performance Council, Start Yours Today,” CMG1999 Proceedings • Chaney, Bob, “Divide and Conquer: Implementing the Capacity Performance Council in Pieces,” CMG2001 Proceedings • Merritt, Linwood, “A Capacity Planning Partnership with the Business,” CMG2002 Proceedings and UKCMG 2003 page 4
Business Data Inputs • “Pull” • Capacity impact forms • Interview meetings • Phone calls • e-mails • “Push” • Capacity Councils page 5
Capacity Councils • Cross-organizational structure • Purpose: Bring together business and technical views of applications and supporting technologies. • Evolution: Single Capacity Council, multiple Capacity Councils along business lines, multiple Technical Councils along types of platforms (mainframe “MVS,” Unix, NT, etc.) page 6
Capacity Council Deliverables • Monthly meeting • Evaluation of Business Area Capacity Status (“stoplight” green/yellow/red color) • Business driver mapping to servers • Monthly report page 7
“Stoplight” Status • Green: Little concern about hitting capacity constraints in the next 6 months • Yellow: Concern about capacity in the next 4-6 months • Red: Very high concern about not meeting capacity needs in the next three months page 8
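The stoplight rule can be sketched in a few lines of Python. This is only an illustration: the slide describes levels of concern rather than a computed value, so the `months_to_constraint` input is an assumption, as is treating the unstated 3 to 4 month gap as yellow.

```python
def stoplight(months_to_constraint: float) -> str:
    """Map an estimated number of months until a capacity constraint
    to a stoplight color (hypothetical helper, not from the slides)."""
    if months_to_constraint <= 3:
        return "red"      # very high concern: next three months
    if months_to_constraint <= 6:
        return "yellow"   # concern: roughly the next 4-6 months
    return "green"        # little concern within the next 6 months


# Hypothetical business areas and estimates.
estimates = {"AreaA": 2, "AreaB": 5, "AreaC": 12}
status = {area: stoplight(months) for area, months in estimates.items()}
```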
Capacity Planners’ Responsibilities [Figure: matrix of Capacity Planners (PlannerA through PlannerF) against Capacity Councils (Clubhouse Mgmt, Fairway shots, Mashies, Chip Mgmt, Putt Mgmt, Foursome, Pitch Mgmt, Tee shot)] page 9
Business Drivers • Capacity Councils: Business units responsible for capacity planning of “demand” side • Capacity Planners: Build “supply side” projections based on business drivers and historical trending page 10
Business Driver Based Forecasting • Map business drivers to servers. • Use historical data to correlate. • Use business driver projections to build forecast. • Technical approaches (Spreadsheet and SAS) • Exponential forecast comparison • Multivariate regression page 11
Business Driver and Resource Regression =FORECAST(Cx,$B$3:$B$26, $C$3:$C$26) page 12
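For readers without the spreadsheet, the same calculation can be sketched in Python. Excel's FORECAST(x, known_y's, known_x's) fits a least-squares line through the known points and evaluates it at x; here the resource (column B) is the dependent variable and the business driver (column C) is the independent variable. The sample numbers are invented for illustration.

```python
def forecast(x, known_ys, known_xs):
    """Least-squares linear prediction at x, mirroring the argument
    order of the spreadsheet FORECAST function."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in known_xs)
    sxy = sum((xi - mean_x) * (yi - mean_y)
              for xi, yi in zip(known_xs, known_ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * x


# Hypothetical history: business driver volume and measured CPU %.
driver_history = [100, 110, 120, 130]
cpu_history = [20.0, 22.0, 24.0, 26.0]

# Projected CPU % if the business driver reaches 150.
projected_cpu = forecast(150, cpu_history, driver_history)
```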
Combination of Business Driver and Resource Date-Based Projections page 13
Forecast Using Date and Business Driver =FORECAST(A27,$B$7:$B$26,$A$7:$A$26)*C27/FORECAST(A27,$C$7:$C$26,$A$7:$A$26) page 14
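Read literally, the formula takes the date-based trend of the resource and scales it by the ratio of the projected business driver (C27) to the date-based trend of that driver. A Python sketch of that reading follows; the column roles (A = date, B = resource, C = business driver) are inferred from the formula, and the data are invented.

```python
def forecast(x, known_ys, known_xs):
    """Least-squares linear prediction at x (same argument order as
    the spreadsheet FORECAST function)."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in known_xs)
    sxy = sum((xi - mean_x) * (yi - mean_y)
              for xi, yi in zip(known_xs, known_ys))
    slope = sxy / sxx
    return mean_y + slope * (x - mean_x)


def combined_forecast(date, projected_driver, dates, resource, driver):
    """Date-trended resource, scaled by how far the projected driver
    departs from its own date trend."""
    resource_trend = forecast(date, resource, dates)
    driver_trend = forecast(date, driver, dates)
    return resource_trend * projected_driver / driver_trend


# Hypothetical monthly history (month index, CPU %, driver volume).
months = [1, 2, 3, 4]
cpu = [20.0, 22.0, 24.0, 26.0]
volume = [100, 110, 120, 130]

# The business projects a volume of 165 in month 6, above the trend of 150.
result = combined_forecast(6, 165, months, cpu, volume)
```

When the projected driver equals its own trend value, the ratio is 1 and the result reduces to the plain date-based forecast.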
SAS Regression
proc reg noprint data=forecast outest=regdata tableout;
by machine shift;
model cpureg = %CtReg / selection=rsquare noint;
page 15
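The idea behind this step, fitting CPU against subsets of business drivers with no intercept and ranking the fits by R-square, can be approximated in plain Python. This is a loose sketch rather than a port of PROC REG: the driver names and data are invented, the `%CtReg` macro is not reproduced, and R-square is computed in its uncorrected form on the assumption that this is appropriate for a no-intercept model.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting (assumes a is non-singular)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_noint(columns, y):
    """Least-squares fit of y on the given columns without an
    intercept; returns (coefficients, uncorrected R-square)."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(c * col[t] for c, col in zip(coefs, columns))
              for t in range(len(y))]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return coefs, 1 - sse / sum(yi ** 2 for yi in y)


def best_subsets(drivers, y):
    """Fit every non-empty subset of drivers; best R-square first."""
    results = []
    for size in range(1, len(drivers) + 1):
        for subset in combinations(drivers, size):
            coefs, r2 = fit_noint([drivers[name] for name in subset], y)
            results.append((r2, subset, coefs))
    return sorted(results, key=lambda item: item[0], reverse=True)


# Hypothetical drivers; cpu was built as 2*accounts + 3*calls.
drivers = {"accounts": [1, 2, 3, 4], "calls": [2, 1, 4, 3]}
cpu = [8, 7, 18, 17]
ranking = best_subsets(drivers, cpu)
```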
Actual vs. Projected page 16
Aggregate Actual vs. Projected (Average Actual vs. Average Projected) page 20
Exception Detection • Statistical Analysis (standard deviations from mean) • Exception Detection System developed by Igor Trubin, Ph.D., of Capital One • “Exception Detection System, Based on the Statistical Process Control Concept,” CMG2001 • “Global and Application Level Exception Detection System, Based on MASF Technique,” CMG2002 and UKCMG 2003 • Reporting: E-mail, web reports page 22
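The "standard deviations from mean" test can be illustrated with a short Python sketch. It is not the Exception Detection System described in the cited papers; it only shows the basic control-limit idea, with the three-sigma limit, the server names, and the data all chosen as assumptions for the example.

```python
from statistics import mean, stdev


def find_exceptions(history, latest, n_sigma=3.0):
    """Return servers whose latest value falls outside
    mean +/- n_sigma standard deviations of their own history."""
    exceptions = []
    for server, values in history.items():
        if len(values) < 2 or server not in latest:
            continue  # insufficient or missing data: not evaluated here
        mu, sigma = mean(values), stdev(values)
        lower, upper = mu - n_sigma * sigma, mu + n_sigma * sigma
        if not lower <= latest[server] <= upper:
            exceptions.append(server)
    return sorted(exceptions)


# Hypothetical daily CPU utilization (%) for the same weekday and hour.
history = {
    "ServerA": [40, 42, 41, 39, 43],
    "ServerB": [30, 31, 29, 30, 32],
}
latest = {"ServerA": 70, "ServerB": 31}
flagged = find_exceptions(history, latest)
```

Comparing each server against its own history, rather than a fixed threshold, is what lets a single rule cover a large number of servers with very different normal utilization levels.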
Statistical Process Control page 23
Exception Detection E-Mail
Exception Detection Report for 05/02
_____________________________________________________
CPU_Utilization exception unix/unisys/tandem/MVS 5 boxes list: ServerA ServerC ServerE ServerL ServerZ
_____________________________________________________
CPU_Utilization NULL DATA unix/unisys/tandem/MVS 0 boxes list:
_____________________________________________________
CPU_Utilization insufficient DATA unix/unisys/tandem/MVS 1 boxes list: ServerG
===============================================
CPU utilization was greater than 50% yesterday for: ServerA ServerD
page 24
Exception Reporting page 25
Platform Types • “MVS” mainframe • “Flavors” of Unix • NT servers • Non- “standard” such as Unisys, Tandem, HP3000, native commands, etc. • Different data formats and locations page 26
Level of Detail: Mainframe • Global (by Sysplex) • Partition (LPAR) • Service Class • SMF (use job names and account codes) • Hardware may be shared among workloads and business units. page 27
Level of Detail: Distributed • Global (by server) • Application or Workload • Assigned within data collection product • Assigned within Capacity/Performance database code • Processes by descending CPU% • Overall view of utilization (for consolidation opportunities) page 28
Integrated Products • Products with integrated reporting capabilities • Extract data, port to the existing Performance Database system. • Interface directly with the product databases (e.g. with SAS ODBC). • Link web HTML pages to product graphs. page 29
[Figure: data collection architecture diagram. Recoverable labels: Multiple Platform Types; Non-Standard Platforms; “MVS” Mainframe; Remote Server with Product PDBs; Other Remote Platforms; Native Commands; SNMP; 5 or 10 Min Samples; Hourly Summaries; Hourly CPU trace (if available); Product Database; Product Extract; Sequential Files; NRJE; ftp; Capacity Bridge; Capacity Mainframe; Capacity Programs; Performance Data Reports; Html and Graphics Files; Web Files; Workstation Graphics; LAN / WAN] page 30
Data Capture Analysis • Operational side to Capacity Planning: Automation of data collection, performance database population, and report creation • Automated process to check the successful and timely completion of each step of the process • Included in exception analysis mechanism • Ongoing tuning effort as complexity and volume increase page 31
Organization and Reporting of Server Structure • Database of server characteristics and assignments • Business unit classifications • Applications • Capacity Planner assignments • Configuration details • Status color codes page 32
Server Database page 33
Use of Server Database • Central repository of capacity information • Data source to build browser pages on the company’s Intranet • Color-coded view of servers by business area and application. page 34
Matrix-Based Reporting page 35
Bulk Capacity Planning • Analyze a large number of servers in a single pass. • Import measured and trended CPU utilization of each server. • Assign servers to business areas. • Allow assignment of business drivers (with growth rates) to each server. • Allow assignment of upgrade thresholds to each server. page 37
“Bulk Capacity Planning” Projections • For each server, calculate the month where projected CPU utilization crosses the upgrade threshold, for three growth rates. • Use “conditional formatting” to flag server upgrade dates as red or yellow if the date is of concern. page 38
Calculation of #Years Before Upgrade
Threshold = Base * (1+AnnualGrowth)^(#Years)
Log(Threshold) = Log(Base) + (#Years) * Log(1+AnnualGrowth)
(#Years) = ( Log(Threshold) - Log(Base) ) / Log(1+AnnualGrowth)
page 39
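The closed-form result lends itself to a small function applied to every server and growth rate. The Python sketch below assumes compound annual growth as in the formula; the handling of servers already over threshold and of non-positive growth, along with the three sample growth rates, are assumptions added for the example.

```python
import math


def years_to_threshold(base, threshold, annual_growth):
    """Years until base * (1 + annual_growth) ** years reaches threshold."""
    if base >= threshold:
        return 0.0            # already at or above the upgrade threshold
    if annual_growth <= 0:
        return math.inf       # flat or shrinking load never reaches it
    return ((math.log(threshold) - math.log(base))
            / math.log(1 + annual_growth))


# Hypothetical server at 40% CPU with an 80% upgrade threshold,
# evaluated for three growth rates as in the bulk planning sheet.
growth_rates = {"low": 0.10, "expected": 0.25, "high": 1.00}
years = {name: years_to_threshold(40.0, 80.0, rate)
         for name, rate in growth_rates.items()}
```

Multiplying the result by 12 gives the month count that the conditional formatting would flag as red or yellow.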
Visualization Techniques • Different views of the same data • Web-based (HTML and Java) reports • “Stoplight” (green/yellow/red) coded status • Overlay presentation of trends and forecasts • “Thumbnail” charts with drilldown capabilities page 41
Visualization Techniques (Continued) • Automatic generation of static HTML • Dynamic HTML (CGI bin, web portal) • Representation of servers as color-coded rectangles on a single page, where the area of each rectangle represents its capacity rating. page 42
Different Views of Same Data • Servers can appear more than once (multiple applications and assignments). • “Production” vs. “All” • “All Departments” (no duplicates) vs. “By Department” page 43
HTML with Hyperlinks <I><P ALIGN="CENTER">Table of Contents</P> </I><B><DIR> <DIR><FONT SIZE=3> <P><A HREF="#Intro">Introduction</A></P> ………………………………………………………………. <P><A HREF="#DASD">Storage Analysis</A></P> ………………………………………………………………. <A NAME="DASD"></A> <FONT FACE="Arial" SIZE=5><B><P ALIGN="CENTER">Storage Analysis</P> </B></FONT><FONT FACE="Arial" SIZE=3> <P ALIGN="JUSTIFY">Presented below are workload-based DASD space projections. Additional detail can be found in the <A HREF="#DASDPRF">DASD Workload Profile</A>.</P> <P ALIGN=CENTER><IMG SRC="dasd99.gif" usemap="#Objmap" BORDER=1></P> </FONT><FONT FACE="Arial" SIZE=3><I> <P ALIGN="CENTER">Figure 4 - DASD Space Analysis by Workload</P></I> </FONT> page 44
Web Graphs from SAS
TITLE2 F=SIMPLEX C=RED J=C H=1.3 "ServerA CPU BY DAY";
PROC GCHART GOUT=GOUT.DATE3;
WHERE ( SYSTEM =: 'AVG_ServerA' );
VBAR DATE / SUMVAR=CPU SPACE = 0 SUBGROUP=WKLD DISCRETE CAXIS=BLUE CTEXT=RED NAME="DATE" DESCRIPTION="DATE GRAPH";
/****************************************/
GOPTIONS DEVICE=GIF GSFNAME=GIFOUT GPROTOCOL=SASGPASC CBACK=BWH BORDER HSIZE=6 VSIZE=4 GSFMODE=REPLACE GSFLEN=128;
PROC GREPLAY IGOUT=GOUT.DATE2 NOFS;
REPLAY 1;
page 45
Indexed Report page 46
Anchored Links [Figure: anchored links between Web Page 1 and Web Page 2] page 47
Thumbnail Graphs page 48
Automatic Generation of HTML • Driven by server database • SAS or Visual Basic code - builds web pages and hyperlinks page 49
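The slides name SAS or Visual Basic for this step. Purely as an illustration of the idea, the Python sketch below builds a static, color-coded index page with one hyperlink per server from a list of server records. The record fields, color values, and per-server page naming are all hypothetical.

```python
from html import escape

# Hypothetical mapping of stoplight status to cell background color.
STATUS_COLORS = {"green": "#00a000", "yellow": "#ffd000", "red": "#d00000"}


def build_index_page(servers):
    """Return static HTML listing each server as a color-coded table
    row that links to that server's own report page."""
    rows = []
    for server in servers:
        name = escape(server["name"])
        area = escape(server["business_area"])
        color = STATUS_COLORS[server["status"]]
        rows.append(
            f'<tr><td bgcolor="{color}">'
            f'<a href="{name}.html">{name}</a></td>'
            f'<td>{area}</td></tr>'
        )
    return ("<html><body><table>\n"
            + "\n".join(rows)
            + "\n</table></body></html>")


# Hypothetical extract from the server database.
servers = [
    {"name": "ServerA", "business_area": "AreaA", "status": "red"},
    {"name": "ServerB", "business_area": "AreaB", "status": "green"},
]
page = build_index_page(servers)
```

Because the page is regenerated from the server database on each run, reassigning a server or changing its status needs no manual HTML editing.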
Color-Coded Rectangles “Treemap” paper by Ben Shneiderman, University of Maryland, http://www.cs.umd.edu/hcil/treemaps page 50
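A minimal sketch of the rectangle layout follows, using a simple two-level slice-and-dice scheme: business areas become columns whose widths are proportional to their total capacity rating, and servers become rows within each column. It is not taken from the presentation or from any particular treemap tool; the business area names, server ratings, and page size are invented, and color coding is left out.

```python
def slice_rect(items, x, y, w, h, side_by_side):
    """Split the rectangle (x, y, w, h) among (name, size) items in
    proportion to size, along one axis."""
    total = sum(size for _, size in items)
    rects, pos = {}, 0.0
    for name, size in items:
        frac = size / total
        if side_by_side:      # columns across the width
            rects[name] = (x + pos * w, y, frac * w, h)
        else:                 # rows down the height
            rects[name] = (x, y + pos * h, w, frac * h)
        pos += frac
    return rects


def treemap(areas, width, height):
    """Two-level layout: one column per business area, one row per
    server. Returns {server: (x, y, w, h)}."""
    area_sizes = [(area, sum(servers.values()))
                  for area, servers in areas.items()]
    layout = {}
    columns = slice_rect(area_sizes, 0.0, 0.0, width, height, True)
    for area, (ax, ay, aw, ah) in columns.items():
        layout.update(
            slice_rect(list(areas[area].items()), ax, ay, aw, ah, False))
    return layout


# Hypothetical capacity ratings by business area and server.
areas = {
    "AreaA": {"ServerA": 30, "ServerB": 10},
    "AreaB": {"ServerC": 60},
}
layout = treemap(areas, 100.0, 50.0)
```

Each rectangle's area is proportional to the server's capacity rating, so large servers dominate the page while small ones remain visible.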