Closing the WTF – NTF Gap
A whimsical yet highly practical guide to getting to ‘yes’ in the distributed performance arena
David Halbig
Thursday, Oct 1, 2009
Agenda
• Non-Distributed Model
• Distributed Model (traditional)
• Case Studies
• Distributed Model (proposed)
• Summary
Non-Distributed Diagnostic Model
• One Riot / One Ranger (some multi-disciplinary team investigations)
• Response-time or deadline-centric objectives, instrumentation, and tools
Midrange Diagnostic Model (traditional)
• Assemble a large team with (at least):
  • One representative per environment component (applications, network, DB, server (one per class))
  • Project manager(s)
  • Misc. HLEs, depending on the severity of the slowdown/outage
• Request status from each environmental component
• Begin (in no particular order):
  • Recycling servers
  • Removing components from the environment
  • Blamestorming / RoT / requesting diagnostics from components
  • (rinse, repeat)
Key Characteristics of This Model
• Fragmented instrumentation
• Little primary information sharing
• At ‘jurisdictional boundaries’:
  • No agreed-on SLAs
  • No agreed-on metrics
  • No common tool use
• Focus on utilization, not delivered service
Case Study #1 – VDI Environment
• 450+ virtual desktop instances (Windows XP on VMware ESX 3.5)
• Geographically dispersed user community
• Varied workload characteristics, from CSR support to code development
• Intermittent severe response-time problems across a random selection of VDIs
• All major components reported back ‘NTF’
Case Study #1 – VDI Environment
[Chart: C:\ drive seconds/read]
Case Study #1 – VDI Environment
• Perfmon analysis indicated intermittent severe I/O response-time problems
• Utilization-centric reporting from the SAN layer reported no severe problems
• VMware layer reported utilization, but no response-time data, for the SAN layer
• esxtop data, at a 30-second reporting interval, showed an intermittent severe I/O response-time problem @ HBA (see the sketch below)
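As an aside not in the original deck: a minimal Python sketch, under stated assumptions, of the kind of post-processing used here. It scans an esxtop batch-mode capture (e.g. esxtop -b -d 30 -n 120 > esxtop.csv, matching the 30-second interval above) for samples where a per-adapter latency column exceeds a threshold. The file name, column-name substring, and 25 ms threshold are all assumptions for illustration.

    import csv

    CAPTURE = "esxtop.csv"                 # hypothetical batch-mode capture
    LATENCY_COL_HINT = "MilliSec/Command"  # assumed substring of per-HBA latency columns
    THRESHOLD_MS = 25.0                    # assumed alerting threshold

    with open(CAPTURE, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Pick out the per-adapter latency columns by substring match.
        latency_cols = [i for i, name in enumerate(header) if LATENCY_COL_HINT in name]
        for row in reader:
            if not row:
                continue
            timestamp = row[0]
            for i in latency_cols:
                try:
                    value = float(row[i])
                except (ValueError, IndexError):
                    continue
                if value > THRESHOLD_MS:
                    print(f"{timestamp}  {header[i]}  {value:.1f} ms")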
Case Study #1 – VDI Environment
• SAN reporting did not include all layers
• SAN reporting was at too coarse a granularity (15 minutes)
• SAN reporting upgraded to report response time, and to report regularly at finer granularity (30 seconds)
• (soooo… what was the problem?)
Case Study #1 – VDI Environment
[Chart: RTVSCAN – DIO]
Case Study #2 – eCustomerService
• Moderate-volume web-based application
• Facing the retail card holder population
• Intermittent response-time delays
Case Study #2 – eCustomerService
• Network trace shows a ‘declining TCP/IP window size’ (see the sketch below)
• OS team reports ‘NTF’
• Response-time decomposition tool reports delay between the web and app layers
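Not part of the original slides: a minimal scapy sketch of how the ‘declining TCP/IP window size’ symptom could be confirmed directly from the packet trace, by walking the capture and printing each drop in the receive window advertised by the app-tier host. The pcap file name and IP address are placeholders.

    from scapy.all import rdpcap, IP, TCP  # pip install scapy

    PCAP = "ecustomerservice.pcap"  # hypothetical trace file
    APP_HOST = "10.0.0.42"          # hypothetical app-tier IP

    prev_window = None
    for pkt in rdpcap(PCAP):
        if IP in pkt and TCP in pkt and pkt[IP].src == APP_HOST:
            window = pkt[TCP].window
            # A window that keeps shrinking (toward a zero-window stall)
            # implicates the receiving host, not the network, as the bottleneck.
            if prev_window is not None and window < prev_window:
                print(f"{float(pkt.time):.3f}s  window {prev_window} -> {window}")
            prev_window = window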
Case Study #2 – eCustomerService
• Conflict between ‘I know’ and ‘I believe’ resolved by management intervention
• OS vendor engaged for a deep dive into the TCP/IP stack and web application
Midrange Diagnostic Model (proposed)
• End-to-end transaction monitoring
• Explicit response-time decomposition (see the sketch below)
• For crucial subsystems (example: I/O), full chain-of-custody instrumentation
• At ‘jurisdictional boundaries’:
  • Agreed-on metrics (response time)
  • Agreed-on instrumentation (tool/interval)
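To make ‘explicit response-time decomposition’ concrete, here is a minimal sketch (the class name and tiers are illustrative, not from the deck) of a transaction record that each tier stamps with its own span, so every team argues from the same numbers:

    import time
    from contextlib import contextmanager

    class TransactionTimer:
        """One record per end-to-end transaction; each tier appends its own span."""

        def __init__(self, txn_id):
            self.txn_id = txn_id
            self.spans = {}  # tier name -> elapsed seconds

        @contextmanager
        def tier(self, name):
            start = time.perf_counter()
            try:
                yield
            finally:
                self.spans[name] = time.perf_counter() - start

        def report(self):
            total = sum(self.spans.values())
            print(f"txn {self.txn_id}: {total * 1000:.1f} ms total")
            for name, secs in self.spans.items():
                print(f"  {name:<8}{secs * 1000:8.1f} ms  ({secs / total:5.1%})")

    # Usage sketch: each layer wraps its own work.
    t = TransactionTimer("demo-001")
    with t.tier("web"):
        time.sleep(0.02)  # stand-in for web-tier work
    with t.tier("app"):
        time.sleep(0.05)  # stand-in for app-tier work
    with t.tier("db"):
        time.sleep(0.01)  # stand-in for DB call
    t.report()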
Midrange Diagnostic Model (proposed)
• Train to a common end-to-end tool
• Train to common component tools
• START with response-time / delivered-service metrics and tools; END with utilization-centric metrics and tools
• Approach other teams with probable cause only
• Staffing/authority model(s):
  • Trained performance analysts with ‘hot pursuit’ authority
  • Trained performance analysts with advisory authority (only)