
Closing the WTF – NTF Gap

A whimsical but yet highly practical guide to getting to ‘yes’ in the Distributed Performance Arena. David Halbig. Thursday, Oct 1, 2009. Agenda: Non-Distributed Model; Distributed Model (traditional); Case Studies; Distributed Model (proposed); Summary.


Presentation Transcript


  1. Closing the WTF – NTF Gap: A whimsical but yet highly practical guide to getting to ‘yes’ in the Distributed Performance Arena. David Halbig. Thursday, Oct 1, 2009.

  2. Agenda
     • Non-Distributed Model
     • Distributed Model (traditional)
     • Case Studies
     • Distributed Model (proposed)
     • Summary

  3. Non-Distributed Diagnostic Model
     • One Riot / One Ranger (some multi-disciplinary team investigations)
     • Response time or deadline-centric objectives, instrumentation, and tools

  4. Midrange Diagnostic Model (traditional)
     • Assemble large team with (at least)
        • One representative from each environment component (applications, network, DB, server (one per class))
        • Project Manager(s)
        • Misc HLEs, depending on the severity of the slowdown/outage
     • Request status from each environment component
     • Begin (in no particular order)
        • Recycling servers
        • Removing components from the environment
        • Blamestorming / RoT / Requesting diagnostics from components
     • (rinse, repeat)

  5. Key Characteristics of This Model
     • Fragmented instrumentation
     • Little primary information sharing
     • At ‘jurisdictional boundaries’
        • No agreed-on SLAs
        • No agreed-on metrics
        • No common tool use
     • Focus on utilization, not delivered service

  6. Case Study #1 – VDI environment
     • 450+ Virtual Desktop Instances (WinXP on VMware ESX 3.5)
     • Geographically dispersed user community
     • Varied workload characteristics, from CSR support to code development
     • Intermittent severe response time problems across a random selection of VDIs
     • All major components reported back ‘NTF’

  7. Case Study #1 – VDI environment (chart: C:\ drive, seconds/read)

  8. Case Study #1 – VDI environment
     • Perfmon analysis indicated intermittent severe I/O response time problems
     • Utilization-centric reporting from the SAN layer reported no severe problems
     • VMware layer reported utilization, but no response time data for the SAN layer
     • esxtop data, with a 30-second reporting interval, showed an intermittent severe I/O response time problem at the HBA

  9. Case Study #1 – VDI environment
     • SAN reporting did not include all layers
     • SAN reporting granularity was too coarse (15 minutes)
     • SAN reporting upgraded to report response time, and to report regularly at a finer granularity (30 seconds)
     • (soooo… what was the problem?)

  10. Case Study #1 – VDI Environment RTVSCAN - DIO
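The granularity point in Case Study #1 can be illustrated with a small calculation. The sketch below is not from the presentation: the latency values, the stall length, and the 100 ms threshold are all hypothetical, chosen only to show how a short I/O stall that is obvious at 30-second intervals can average out of sight in a 15-minute report.

```python
from statistics import mean


def rollup(samples, window):
    """Average consecutive groups of `window` samples into one value each."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical data: one hour of 30-second read-latency samples, in ms.
# Baseline is 5 ms, with a single 2-minute stall at 400 ms.
samples_30s = [5.0] * 120
samples_30s[40:44] = [400.0] * 4

fine = samples_30s                 # 30-second reporting
coarse = rollup(samples_30s, 30)   # 15-minute reporting (30 samples of 30 s)

THRESHOLD_MS = 100.0               # hypothetical "severe" threshold
fine_alerts = [v for v in fine if v > THRESHOLD_MS]
coarse_alerts = [v for v in coarse if v > THRESHOLD_MS]

print(f"30 s view: {len(fine_alerts)} samples over threshold")
print(f"15 min view: {len(coarse_alerts)} samples over threshold, "
      f"worst average {max(coarse):.1f} ms")
```

With these made-up numbers the fine-grained view flags four samples, while the worst 15-minute average stays under the threshold, which is consistent with a coarse report coming back ‘NTF’.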

  11. Case Study #2 – eCustomerService
     • Moderate-volume web-based application
     • Facing retail card holder population
     • Intermittent response time delays

  12. Case Study #2 – eCustomerService
     • Network trace shows ‘declining TCP/IP window size’
     • OS team reports ‘NTF’
     • Response time decomposition tool reports a delay between the web and app layers

  13. Case Study #2 - eCustomerService

  14. Case Study #2 – eCustomerService

  15. Case Study #2 – eCustomerService
     • Conflict between ‘I know’ and ‘I believe’ resolved by management intervention
     • OS vendor engaged for a deep dive into the TCP/IP stack and web application
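The ‘declining TCP/IP window size’ observation in Case Study #2 is the kind of evidence that can be checked mechanically once a trace is exported. The sketch below is an illustration only: it assumes the advertised window sizes have already been extracted from a capture into a plain list, and the values and the function name `longest_declining_run` are hypothetical.

```python
def longest_declining_run(windows):
    """Length of the longest run of strictly decreasing consecutive values."""
    best = run = 1 if windows else 0
    for prev, cur in zip(windows, windows[1:]):
        # Extend the run while the window keeps shrinking, otherwise restart.
        run = run + 1 if cur < prev else 1
        best = max(best, run)
    return best


# Hypothetical advertised window sizes (bytes) from one side of a connection.
windows = [65535, 65535, 49152, 32768, 16384, 8192, 8192, 65535]

run_length = longest_declining_run(windows)
print(f"longest declining run: {run_length} segments")
```

A long declining run is a measurement that can be handed to another team, rather than a belief to be argued over.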

  16. Midrange Diagnostic Model (proposed)
     • End-to-end transaction monitoring
     • Explicit response time decomposition
     • For crucial subsystems (example: I/O), full chain-of-custody instrumentation
     • At ‘jurisdictional boundaries’:
        • Agreed-on metrics (response time)
        • Agreed-on instrumentation (tool/interval)
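Explicit response time decomposition, as proposed above, amounts to splitting one end-to-end time into segments that each belong to a single component or boundary. The sketch below is a minimal illustration, not the tool used in the case studies: the checkpoint names and timestamps are hypothetical, and it assumes all timestamps come from synchronized clocks.

```python
def decompose(checkpoints):
    """Turn ordered (name, timestamp_ms) checkpoints into per-segment durations."""
    segments = []
    for (name_a, t_a), (name_b, t_b) in zip(checkpoints, checkpoints[1:]):
        segments.append((f"{name_a} -> {name_b}", t_b - t_a))
    return segments


# Hypothetical checkpoints for one transaction, in milliseconds.
checkpoints = [
    ("client request sent", 0.0),
    ("web tier received", 4.0),
    ("web tier called app tier", 10.0),
    ("app tier received", 910.0),
    ("app tier called DB", 925.0),
    ("DB replied", 940.0),
    ("app tier replied", 950.0),
    ("web tier replied", 958.0),
    ("client response received", 962.0),
]

segments = decompose(checkpoints)
total = sum(duration for _, duration in segments)
worst = max(segments, key=lambda seg: seg[1])

for name, duration in segments:
    print(f"{duration:7.1f} ms  {name}")
print(f"total {total:.1f} ms, largest segment: {worst[0]}")
```

In this made-up transaction nearly all of the time sits between the web tier's call and the app tier's receipt. That gives a probable cause that can be taken to a specific team.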

  17. Midrange Diagnostic Model (proposed)
     • Train to common end-to-end tool
     • Train to common component tools
     • START with response time/delivered service metrics/tools, END with utilization-centric metrics/tools
     • Approach other teams with probable cause only
     • Staffing/authority model(s):
        • Trained performance analysts with ‘hot pursuit’ authority
        • Trained performance analysts with advisory authority (only)
