
Magpie: Distributed request tracking for realistic performance modelling

Rebecca Isaacs, Paul Barham, Richard Mortier, Dushyanth Narayanan (Microsoft Research Cambridge) and James Bulpin (University of Cambridge).


Presentation Transcript


  1. Magpie: Distributed request tracking for realistic performance modelling • Rebecca Isaacs, Paul Barham, Richard Mortier, Dushyanth Narayanan (Microsoft Research Cambridge) • James Bulpin (University of Cambridge)

  2. Performance in distributed systems • Faults in distributed systems are notoriously hard to diagnose • Performance problems are even more subtle to debug • Often transient or affect only a subset of requests / users • Frequently involve complex interactions between multiple machines • Aggregate statistics (e.g. utilization) may look perfectly normal

  3. Magpie Approach • Track individual requests end to end • Observe control flow (causality) • Monitor resource consumption: CPU, bandwidth, disk • Debug performance “in the small” • Build a probabilistic workload model from the aggregate requests • Cluster similar requests according to their observed behaviour • Debug performance “in the large”

  4. How do we use this information? • Performance debugging • Why did this request take much longer than that request? • Fault detection • Configuration and management • Performance prediction • Realistic workload models for capacity planning • Obtain automatically on a “live” system

  5. Magpie components • Instrumentation • System activity recorded to logs • Generic request parser • Extract individual requests from logs according to an event schema • Model construction • Behavioural clusters • Probabilistic state machine

  6. Outline • Introduction • What is a request? • Instrumentation • Request extraction • Modelling • Current status

  7. What is a request? • System activity which takes place in response to an action initiated by the application being traced • HTTP request • Database query • File open request • We describe a request as • The sequence of application components involved in its processing • The resource consumed at each stage • CPU, bandwidth, disk transfer size, (latency)

  8. A typical e-commerce site (1) [Diagram labels: Internet, Web Front Ends, SQL Servers, Storage]

  9. A typical e-commerce site (2) [Diagram: software stacks of the Web Server and the SQL Server. Labels: http.sys, IIS, Filter, Static Content, ASP.NET, CLR, Application Logic, ADO.NET, Stored procedures, Data, WinSock2 API, Kernel]

  10. HTTP request: detailed view [Diagram: per-thread timelines for WEB.eec, WEB.398 and SQL.9c4 with Disk, Net RX and Net TX rows, time axis marked at 10.051s, 10.100s and 10.155s. Annotations: HTTP request packets • IIS worker thread picks up request from http.sys • ASP.NET worker thread takes over • ASP.NET thread blocks after RPC to database • Sync WinSock send to SQL Server • TDS request and reply packets sent and received • SQL thread unblocks • IIS worker thread wakes up to write log • HTTP response packet sent back to client. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other]

  11. Why is request tracking hard? • Many components, multiple machines • Must track control flow across machines • No globally unique request ID • Components are developed independently • Multiple thread pools • Many threads participate in processing a request • Asynchronous communication • Must match sends and receives between threads and machines • Hand-rolled synchronization primitives • SQL Server has a user-mode scheduler

  12. Outline • Introduction • What is a request? • Instrumentation • Request extraction • Modelling • Current status

  13. Event Tracing for Windows • Low-overhead event mechanism • Events timestamped with cycle counter • Global ordering on events on a single machine • Can enable/disable sets of events at runtime • Using ETW in Magpie • Each instrumentation point posts an event • Events are logged to disk • Logs are post-processed to extract requests • Can also consume events in real time

  14. Instrumentation points • Existing ETW event providers • IIS, kernel • App-specific hooks • IIS, ASP.NET, SQL Server • Detours • Wrap dlls to trap Win32 and WinSock2 calls • WinPcap • Capture packets on the wire

  15. CPU usage from kernel events • The ETW kernel logger records every context switch • How do we know which cycles are used for which request? • We can attribute cycles to a request by • An application-specific event which occurs within a delimited section of CPU time, or • The current context of execution, e.g. thread id
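
The second attribution rule (current context of execution) can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the event tuples, the `attribute_cycles` helper and the fixed thread-to-request mapping are hypothetical and are not Magpie's or ETW's actual formats. Each context-switch record says which thread starts running at a given cycle count, and the cycles up to the next switch are charged to whatever request that thread is working on.

```python
from collections import defaultdict

def attribute_cycles(events, thread_to_request):
    """Sum CPU cycles per request from context-switch events on one CPU.

    events: list of (cycle_count, new_thread_id), sorted by cycle_count;
            each entry means new_thread_id starts running at cycle_count.
    thread_to_request: hypothetical static mapping of thread id -> request id.
    """
    usage = defaultdict(int)
    # Pair each switch with the next one; the gap is the time slice length.
    for (start, tid), (end, _) in zip(events, events[1:]):
        request = thread_to_request.get(tid)
        if request is not None:  # slices of unrelated threads are ignored
            usage[request] += end - start
    return dict(usage)

# Illustrative data only: thread 0xa serves request-1, thread 0xb request-2.
switches = [(100, 0xa), (250, 0xb), (400, 0xa), (450, 0x0)]
owners = {0xa: "request-1", 0xb: "request-2"}
print(attribute_cycles(switches, owners))
```

In a real trace the thread-to-request association changes over time, so the mapping would have to be updated by other events rather than fixed up front.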

  16. Example: protocol processing in a DPC [Diagram: event timeline with labels cswitch, DPC start, pkt recv, DPC end, cswitch; the cycle count over time is split between Request 1 and Request 2]

  17. Application and middleware events • Cover points where flow of control moves between components • Cover points where resources are multiplexed and demultiplexed • E.g. user-level scheduling primitives • Propagation of a global request id is not required! • Magpie used to do this, but no longer does

  18. Instrumenting a web service [Diagram: the Web Server and SQL Server stacks from slide 9 (http.sys, IIS, Filter, Static Content, ASP.NET, CLR, Application Logic, ADO.NET, Stored procedures, Data, WinSock2 API, Kernel) annotated with instrumentation points: Wrappers, ISAPI Filter, HTTPModule, CLR profiler, Extended SPs, Intercept (at each WinSock2 API), Event Tracing for Windows (at each Kernel), Packet capture]

  19. Outline • Introduction • What is a request? • Instrumentation • Request extraction • Modelling • Current status

  20. Generic request extraction • No inbuilt assumptions about the system or the application • No common unique identifier • Schema specifies semantics of events • Easy to add new event types • Parser stitches events into requests based on event semantics

  21. Terminology • Namespace • Event parameter which references an entity in the system, e.g. thread id • Timeline • Instantiation of a namespace with a unique value, e.g. thread id = 0xa • Events bind or unbind requests to timelines • Bindings capture the semantics of each event for a particular request type

  22. Example: connecting events [Diagram: event labels cswitch, DPC start, TCP pkt, DPC end, cswitch, Enter Recv, Recv returns, joined through the timelines Cpuid=0, Tid=0xa, Tid=0xb and Connid=0xd into Request 1 and Request 2]
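
To make the bind/unbind terminology concrete, here is a minimal Python sketch of a schema-driven stitcher. Everything in it is an assumption made for illustration: the schema keys (`join`, `bind`, `unbind`, `starts`), the event names and the `stitch` function are hypothetical and do not reproduce Magpie's real event schema. A timeline is represented as a `(namespace, value)` pair, and an event joins whichever request is currently bound to one of the timelines it mentions.

```python
def stitch(events, schema):
    """Group events into requests by following timeline bindings.

    events: list of (event_name, params), in timestamp order.
    schema: event_name -> dict with optional keys
            "join"   namespaces used to look up the owning request,
            "bind"   namespaces to bind to that request afterwards,
            "unbind" namespaces to release afterwards,
            "starts" True if the event may begin a new request.
    """
    bindings = {}   # timeline (namespace, value) -> request id
    requests = {}   # request id -> list of event names
    next_id = 0
    for name, params in events:
        rule = schema[name]
        timelines = [(ns, params[ns]) for ns in rule.get("join", [])]
        request = next((bindings[t] for t in timelines if t in bindings), None)
        if request is None:
            if not rule.get("starts"):
                continue            # event belongs to no traced request
            request, next_id = next_id, next_id + 1
            requests[request] = []
        requests[request].append(name)
        for ns in rule.get("bind", []):
            bindings[(ns, params[ns])] = request
        for ns in rule.get("unbind", []):
            bindings.pop((ns, params[ns]), None)
    return requests

# Hypothetical schema and trace: a packet arrives on connection 0xd,
# thread 0xa receives it, works on it, and replies.
schema = {
    "pkt_recv":    {"join": ["conn"], "bind": ["conn"], "starts": True},
    "recv_return": {"join": ["conn"], "bind": ["tid"]},
    "app_work":    {"join": ["tid"]},
    "send":        {"join": ["tid"], "unbind": ["tid", "conn"]},
}
trace = [
    ("pkt_recv",    {"cpu": 0, "conn": 0xd}),
    ("recv_return", {"tid": 0xa, "conn": 0xd}),
    ("app_work",    {"tid": 0xa}),
    ("send",        {"tid": 0xa, "conn": 0xd}),
    ("app_work",    {"tid": 0xa}),   # thread is unbound again, so ignored
]
print(stitch(trace, schema))
```

Note that no request identifier appears in any event: the request is recovered purely from the connection and thread timelines, which is the point made on slides 17 and 20.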

  23. End-to-end request extraction • An instance of the request parser runs on each machine in the distributed system • Online or offline mode • Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier
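
The offline joining step could look roughly like the following Python sketch, which merges per-machine fragments that share any identifier from a globally unique namespace. The fragment tuples and the `join_fragments` helper are hypothetical, and union-find is simply one convenient way to do the grouping, not necessarily what Magpie does.

```python
from collections import defaultdict

def join_fragments(fragments):
    """Group request fragments that share at least one packet identifier.

    fragments: list of (machine, fragment_id, set_of_packet_ids).
    Returns a list of sets of (machine, fragment_id), one per joined request.
    """
    parent = list(range(len(fragments)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # packet id -> index of the first fragment that saw it
    for i, (_, _, packet_ids) in enumerate(fragments):
        for pid in packet_ids:
            if pid in first_seen:
                parent[find(i)] = find(first_seen[pid])
            else:
                first_seen[pid] = i

    groups = defaultdict(set)
    for i, (machine, frag_id, _) in enumerate(fragments):
        groups[find(i)].add((machine, frag_id))
    return sorted(groups.values(), key=sorted)

# Illustrative data: the web and SQL fragments both saw packet 0x1a2b.
fragments = [
    ("WEB", 1, {0x1A2B}),
    ("SQL", 7, {0x1A2B, 0x1A2C}),
    ("WEB", 2, {0x3C4D}),
]
print(join_fragments(fragments))
```

The IPv4 identification field is only 16 bits wide and wraps around, so a practical key would presumably need to combine it with addresses or a time window; the sketch ignores that.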

  24. Outline • Introduction • What is a request? • Instrumentation • Request extraction • Modelling • Current status

  25. Clustering for workload generation • Target the Indy performance modelling tool • Calculates throughput, bottlenecks • Needs transaction mix, resource consumption • Previously: microbenchmark approach • Run 10000 of each “transaction type” (URL) • Divide aggregate resource usage by 10000 • Aim: provide realistic workload models • From real, mixed workloads • Derive transaction “types” automatically

  26. Single request: cartoon view • Partial ordering of events • Annotated with resource usage [Diagram: resource lanes labelled Network, Disk, SQL Server CPU, ASP.NET CPU, IIS CPU]

  27. Behavioural clustering of requests • Represent requests as event strings • “Flatten” out any concurrency • Use Levenshtein string edit distance • Modified to factor in resource usage vectors • Cluster requests based on this distance • Linear-time algorithm • Each cluster is a request “type” • Select representative from near centroid
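
A rough Python sketch of this pipeline follows. The slide does not say how the edit distance is modified or which linear-time algorithm is used, so both choices here are stand-ins: the distance adds a weighted L1 gap between resource vectors to the plain Levenshtein distance, and the clustering is single-pass leader clustering (linear in the number of requests when the number of clusters stays bounded). The function names, `weight` and `threshold` are all hypothetical.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def request_distance(r1, r2, weight=1.0):
    """Edit distance of the event strings plus a weighted L1 resource gap.

    Each request is (event_string, resource_vector). The way resources are
    folded into the distance is an assumption, not Magpie's definition.
    """
    (s1, v1), (s2, v2) = r1, r2
    resource_gap = sum(abs(x - y) for x, y in zip(v1, v2))
    return edit_distance(s1, s2) + weight * resource_gap

def cluster(requests, threshold):
    """Leader clustering: join the nearest leader within threshold, else lead."""
    clusters = []  # list of (leader, members)
    for r in requests:
        best = min(clusters, key=lambda c: request_distance(r, c[0]), default=None)
        if best is not None and request_distance(r, best[0]) <= threshold:
            best[1].append(r)
        else:
            clusters.append((r, [r]))
    return clusters

# Illustrative requests: one letter per event (I = IIS, A = ASP.NET,
# S = SQL, D = disk) and a (CPU ms, KB transferred) resource vector.
requests = [
    ("IAS", (1.0, 0.2)),
    ("IAS", (1.1, 0.2)),
    ("IASSD", (3.0, 2.0)),
    ("IAD", (1.0, 0.3)),
]
for leader, members in cluster(requests, threshold=1.5):
    print(leader[0], len(members))
```

Here the first request to open a cluster acts as its representative; picking a member near the centroid, as the slide describes, would be a refinement on top of this.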

  28. Build a workload model by clustering similar requests • Requests in the same cluster often have different URLs, and one URL may appear in many clusters [Diagram: five clusters labelled A, B, C, D, E; shares shown: 7%, 10%, 63%, 15%, 5%]

  29. Taking it further: work-in-progress • Online and incremental modelling: • Detect component failure • Detect sudden shifts in workload • More sophisticated models • Learn the probabilistic state machine for each request • cf. flowcharts annotated with performance information • “Bayesian watchdogs” • Compute the likelihood of a request’s behaviour as it moves through the system • Deal with “unlikely” requests appropriately
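
One simple way to picture a probabilistic state machine with a likelihood check is a first-order Markov chain over event names, sketched below in Python. The slides describe this as work in progress and give no details, so the model form, the `<start>`/`<end>` markers, the `floor` probability for unseen transitions and both function names are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict

def learn_transitions(event_strings):
    """Estimate transition probabilities from observed request event lists."""
    counts = defaultdict(Counter)
    for events in event_strings:
        path = ["<start>", *events, "<end>"]
        for cur, nxt in zip(path, path[1:]):
            counts[cur][nxt] += 1
    return {
        state: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for state, nxts in counts.items()
    }

def log_likelihood(model, events, floor=1e-6):
    """Log-probability of a request's path; unseen transitions get `floor`."""
    path = ["<start>", *events, "<end>"]
    return sum(
        math.log(model.get(cur, {}).get(nxt, floor))
        for cur, nxt in zip(path, path[1:])
    )

# Illustrative training data: mostly dynamic pages, one static page.
training = [["iis", "asp", "sql", "asp", "iis"]] * 9 + [["iis", "static", "iis"]]
model = learn_transitions(training)

common = log_likelihood(model, ["iis", "asp", "sql", "asp", "iis"])
odd = log_likelihood(model, ["iis", "sql", "iis"])  # IIS calling SQL directly
print(common > odd)
```

A watchdog in this spirit would evaluate the running log-likelihood as events arrive and flag a request once it drops below some chosen threshold.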

  30. Outline • Introduction • What is a request? • Instrumentation • Request extraction • Modelling • Current status

  31. Current status • Recent focus has been developing a generic request extraction scheme • Prototype for 2-machine e-commerce site • TPC-W style workload • Prototype for single machine SQL Server 2000 • Challenge is user mode scheduler • TPC-C workload • Other applications on the way • Large-scale • “Real” systems with “real” performance problems

  32. Conclusion • Magpie is a tool for performance analysis in a distributed system • Bottom up, per-request approach • Complementary to existing techniques: • Performance counters • Program profiling • Feeds into performance debugging and prediction tools
