Magpie: Distributed request tracking for realistic performance modelling

Rebecca Isaacs, Paul Barham, Richard Mortier, Dushyanth Narayanan (Microsoft Research Cambridge); James Bulpin (University of Cambridge)

Performance in distributed systems • Faults in distributed systems are notoriously hard to diagnose • Performance problems are even more subtle to debug • Often transient or affect only a subset of requests / users • Frequently involve complex interactions between multiple machines • Aggregate statistics (e.g. utilization) may look perfectly normal

Magpie Approach • Track individual requests end to end • Observe control flow (causality) • Monitor resource consumption: CPU, bandwidth, disk • Debug performance “in the small” • Build a probabilistic workload model from the aggregate requests • Cluster similar requests according to their observed behaviour • Debug performance “in the large”

How do we use this information? • Performance debugging • Why did this request take much longer than that request? • Fault detection • Configuration and management • Performance prediction • Realistic workload models for capacity planning • Obtain automatically on a “live” system

Magpie components • Instrumentation • System activity recorded to logs • Generic request parser • Extract individual requests from logs according to an event schema • Model construction • Behavioural clusters • Probabilistic state machine

Outline • Introduction • What is a request? • Instrumentation • Request extraction • Modelling • Current status

What is a request? • System activity which takes place in response to an action initiated by the application being traced • HTTP request • Database query • File open request • We describe a request as • The sequence of application components involved in its processing • The resources consumed at each stage • CPU, bandwidth, disk transfer size, (latency)

A typical e-commerce site (1) [Figure: the Internet, web front ends, SQL servers and storage]

A typical e-commerce site (2) [Figure: software stacks of the two machines. Web server: http.sys, IIS, filter, static content, CLR, ASP.NET, application logic, ADO.NET, WinSock2 API, kernel. SQL Server: stored procedures, data, WinSock2 API, kernel]

HTTP request: detailed view [Figure: timeline of a single HTTP request between about 10.051s and 10.155s, with rows for threads WEB.eec, WEB.398 and SQL.9c4 and for disk, network RX and network TX activity on each machine. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other. Callouts: HTTP request packet received; IIS worker thread picks up request from http.sys; ASP.NET worker thread takes over; ASP.NET thread blocks after sync WinSock send (RPC to database); TDS request and reply packets sent and received; SQL thread unblocks; HTTP response packets sent back to client; IIS worker thread wakes up to write log]

Why is request tracking hard? • Many components, multiple machines • Must track control flow across machines • No globally unique request ID • Components are developed independently • Multiple thread pools • Many threads participate in processing a request • Asynchronous communication • Must match send/recvs between threads/machines • Hand-rolled synchronization primitives • SQL Server has a user-mode scheduler

Event Tracing for Windows • Low-overhead event mechanism • Events timestamped with cycle counter • Global ordering on events on a single machine • Can enable/disable sets of events at runtime • Using ETW in Magpie • Each instrumentation point posts an event • Events are logged to disk • Logs are post-processed to extract requests • Can also consume events in real time

Instrumentation points • Existing ETW event providers • IIS, kernel • App-specific hooks • IIS, ASP.NET, SQL Server • Detours • Wrap dlls to trap Win32 and WinSock2 calls • WinPcap • Capture packets on the wire

CPU usage from kernel events • The ETW kernel logger records every context switch • How do we know which cycles are used for which request? • We can attribute cycles to a request by • An application-specific event which occurs within a delimited sector of CPU time, or • The current context of execution, e.g. thread id
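
The second attribution rule above (charging cycles to whichever request owns the running thread) can be sketched in a few lines. This is only an illustration under stated assumptions: the tuple layout `(cycle_count, cpu, new_tid)`, the function name `attribute_cycles` and the fixed `thread_to_request` mapping are hypothetical and are not Magpie's or ETW's actual event format. In a real trace the thread-to-request binding changes over time.

```python
from collections import defaultdict

def attribute_cycles(cswitches, thread_to_request):
    """Charge the cycles between consecutive context switches on each CPU
    to the request that owns the thread which was running.

    cswitches: (cycle_count, cpu, new_tid) tuples sorted by cycle_count
               (hypothetical layout, assumed for this sketch).
    thread_to_request: dict tid -> request id (assumed static here).
    """
    usage = defaultdict(int)
    running = {}  # cpu -> (tid, cycle count at which it was switched in)
    for cycles, cpu, new_tid in cswitches:
        if cpu in running:
            tid, start = running[cpu]
            request = thread_to_request.get(tid)
            if request is not None:  # idle or untracked threads are skipped
                usage[request] += cycles - start
        running[cpu] = (new_tid, cycles)
    return dict(usage)

# Thread 0xa works for request 1, thread 0xb for request 2, tid 0 is idle.
trace = [(100, 0, 0xA), (250, 0, 0xB), (400, 0, 0xA), (450, 0, 0x0)]
usage = attribute_cycles(trace, {0xA: "req1", 0xB: "req2"})
```

The first rule (an application-specific event inside a delimited stretch of CPU time, such as a DPC) would need the start and end events of that stretch as well, which this sketch leaves out.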

Example: protocol processing in a DPC [Figure: event timeline showing cswitch, DPC start, pkt recv, DPC end and cswitch events against time, with the cycle counts attributed to Request 1 and Request 2]

Application and middleware events • Cover points where flow of control moves between components • Cover points where resources are multiplexed and demultiplexed • E.g. user-level scheduling primitives • Propagation of a global request id is not required! • Magpie used to do this but not any more

Instrumenting a web service [Figure: the web server and SQL Server stacks annotated with instrumentation points: wrappers, ISAPI filter, HTTPModule, CLR profiler, extended SPs, WinSock2 API intercepts, Event Tracing for Windows in each kernel, and packet capture]

Generic request extraction • No inbuilt assumptions about the system or the application • No common unique identifier • Schema specifies semantics of events • Easy to add new event types • Parser stitches events into requests based on event semantics

Terminology • Namespace • Event parameter which references an entity in the system, e.g. thread id • Timeline • Instantiation of a namespace with a unique value, e.g. thread id = 0xa • Events bind or unbind requests to timelines • Bindings capture the semantics of each event for a particular request type

Example: connecting events [Figure: TCP pkt, DPC start, DPC end, cswitch, Enter Recv and Recv returns events placed on the timelines Cpuid=0, Tid=0xa, Tid=0xb and Connid=0xd, and joined into Request 1 and Request 2]
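
To make the bind/unbind idea concrete, here is a minimal sketch of a schema-driven parser. Everything in it is assumed for illustration: the event names (`tcp_pkt`, `recv_return`, `app_work`, `send_done`), the three actions `start`, `bind` and `unbind`, and the dictionary layout are hypothetical and are not Magpie's real schema language. The point is only that requests can be stitched together from timelines without a shared request id.

```python
def extract_requests(events, schema):
    """Stitch events into requests using per-event-type binding rules.

    A timeline is a (namespace, value) pair, e.g. ("tid", 0xa).
    schema maps an event type to a list of actions (hypothetical format):
      ("start", ns)        start a new request on the event's ns timeline
      ("bind", src, dst)   the request on the src timeline also owns dst
      ("unbind", ns)       the ns timeline stops belonging to its request
    """
    bound = {}     # timeline -> request id
    requests = {}  # request id -> list of event types, in log order
    next_id = 0
    for ev in events:
        actions = schema.get(ev["type"], [])
        for action, *ns in actions:
            if action == "start":
                next_id += 1
                requests[next_id] = []
                bound[(ns[0], ev[ns[0]])] = next_id
            elif action == "bind":
                rid = bound.get((ns[0], ev[ns[0]]))
                if rid is not None:
                    bound[(ns[1], ev[ns[1]])] = rid
        # The event belongs to the request bound to any timeline it names.
        owner = next((bound[(k, v)] for k, v in ev.items()
                      if k != "type" and (k, v) in bound), None)
        if owner is not None:
            requests[owner].append(ev["type"])
        for action, *ns in actions:
            if action == "unbind":
                bound.pop((ns[0], ev[ns[0]]), None)
    return requests

SCHEMA = {
    "tcp_pkt": [("start", "connid")],
    "recv_return": [("bind", "connid", "tid")],
    "send_done": [("unbind", "tid")],
}
LOG = [
    {"type": "tcp_pkt", "connid": 0xD},
    {"type": "tcp_pkt", "connid": 0xE},
    {"type": "recv_return", "tid": 0xA, "connid": 0xD},
    {"type": "recv_return", "tid": 0xB, "connid": 0xE},
    {"type": "app_work", "tid": 0xA},
    {"type": "app_work", "tid": 0xB},
    {"type": "send_done", "tid": 0xA},
    {"type": "app_work", "tid": 0xA},  # thread is unbound, so not attributed
]
stitched = extract_requests(LOG, SCHEMA)
```

Adding a new event type in this sketch means adding one schema entry, which mirrors the "easy to add new event types" claim without touching the parser loop.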

End-to-end request extraction • An instance of the request parser runs on each machine in the distributed system • Online or offline mode • Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier
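
The offline join step can be pictured as grouping fragments that share a value in the global namespace. The sketch below assumes each per-machine fragment carries the set of packet identifiers it sent or received; the tuple layout and the name `join_fragments` are hypothetical. It uses a small union-find so that fragments linked through a chain of shared packets end up in one request.

```python
def join_fragments(fragments):
    """Group per-machine request fragments that share a packet identifier.

    fragments: list of (machine, local_request_id, set_of_packet_ids)
               (hypothetical layout, assumed for this sketch).
    Returns a sorted list of groups, each a sorted list of
    (machine, local_request_id) pairs.
    """
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # packet id -> index of first fragment that has it
    for i, (_, _, packet_ids) in enumerate(fragments):
        for pid in packet_ids:
            if pid in first_seen:
                parent[find(i)] = find(first_seen[pid])
            else:
                first_seen[pid] = i

    groups = {}
    for i, (machine, local_id, _) in enumerate(fragments):
        groups.setdefault(find(i), []).append((machine, local_id))
    return sorted(sorted(group) for group in groups.values())

joined = join_fragments([
    ("WEB", 1, {100, 101}),
    ("SQL", 7, {101, 102}),
    ("WEB", 2, {200}),
    ("SQL", 8, {200}),
])
```

One caveat for anyone adapting this: the IPv4 identification field is only 16 bits wide and wraps, so a practical key would also need the endpoint addresses and a time window. The sketch ignores that.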

Clustering for workload generation • Target the Indy performance modelling tool • Calculates throughput, bottlenecks • Needs transaction mix, resource consumption • Previously: microbenchmark approach • Run 10000 of each “transaction type” (URL) • Divide aggregate resource usage by 10000 • Aim: provide realistic workload models • From real, mixed workloads • Derive transaction “types” automatically

Single request: cartoon view • Partial ordering of events • Annotated with resource usage [Figure: resource lanes for IIS CPU, ASP.NET CPU, SQL Server CPU, network and disk]

Behavioural clustering of requests • Represent requests as event strings • “Flatten” out any concurrency • Use Levenshtein string edit distance • Modified to factor in resource usage vectors • Cluster requests based on this distance • Linear-time algorithm • Each cluster is a request “type” • Select representative from near centroid
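
The slide does not say how the edit distance is modified or which clustering algorithm is used, so the following is a sketch under assumptions rather than Magpie's method. It assumes a request is a list of `(event_symbol, resource_usage)` pairs, makes the substitution cost depend on the relative difference in resource usage when the symbols match, and uses single-pass leader clustering as a stand-in for the linear-time algorithm. The names `sub_cost`, `distance` and `cluster` are hypothetical.

```python
def sub_cost(a, b):
    """Cost of aligning two (symbol, resource) steps, in [0, 1]."""
    (sym_a, res_a), (sym_b, res_b) = a, b
    if sym_a != sym_b:
        return 1.0
    top = max(res_a, res_b)
    return 0.0 if top == 0 else abs(res_a - res_b) / top

def distance(x, y):
    """Levenshtein distance with a resource-aware substitution cost."""
    prev = [float(j) for j in range(len(y) + 1)]
    for i, a in enumerate(x, 1):
        cur = [float(i)]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1.0,        # delete a
                           cur[j - 1] + 1.0,     # insert b
                           prev[j - 1] + sub_cost(a, b)))
        prev = cur
    return prev[-1]

def cluster(requests, threshold):
    """One pass: join the first cluster whose leader is close enough,
    otherwise start a new cluster with this request as leader."""
    leaders, members = [], []
    for req in requests:
        for k, leader in enumerate(leaders):
            if distance(req, leader) <= threshold:
                members[k].append(req)
                break
        else:
            leaders.append(req)
            members.append([req])
    return members

# I = IIS, A = ASP.NET, S = SQL; numbers are made-up CPU costs.
dynamic_1 = [("I", 10), ("A", 100), ("S", 50)]
dynamic_2 = [("I", 10), ("A", 80), ("S", 50)]
static_1 = [("I", 10)]
clusters = cluster([dynamic_1, dynamic_2, static_1], threshold=0.5)
```

With a bounded number of clusters the pass is linear in the number of requests, although each comparison is quadratic in the length of the event strings. The first request in a cluster serves as the leader here, whereas the slide picks a representative near the centroid.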

Build a workload model by clustering similar requests • Requests in the same cluster often have different URLs, and one URL may appear in many clusters [Figure: clusters A to E with their shares of the workload; the shares shown are 7%, 10%, 63%, 15% and 5%]

Taking it further: work-in-progress • Online and incremental modelling: • Detect component failure • Detect sudden shifts in workload • More sophisticated models • Learn the probabilistic state machine for each request • cf. flowcharts annotated with performance information • “Bayesian watchdogs” • Compute the likelihood of a request’s behaviour as it moves through the system • Deal with “unlikely” requests appropriately
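
Since this part is described as work in progress, the model itself is not specified. One simple way to picture a "likelihood of a request's behaviour" is a first-order Markov chain over event symbols, which is what the sketch below assumes. The names `learn` and `log_likelihood`, the add-one smoothing and the single-character event alphabet are all choices made for illustration, not the authors' design.

```python
import math
from collections import defaultdict

START, END = "^", "$"

def learn(strings, alphabet):
    """Estimate transition probabilities from observed event strings,
    with add-one smoothing so unseen transitions stay possible."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in strings:
        path = [START] + list(s) + [END]
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    targets = list(alphabet) + [END]
    model = {}
    for a in [START] + list(alphabet):
        total = sum(counts[a].values()) + len(targets)
        model[a] = {b: (counts[a][b] + 1) / total for b in targets}
    return model

def log_likelihood(model, s):
    """Log-probability of one request's event string under the model."""
    path = [START] + list(s) + [END]
    return sum(math.log(model[a][b]) for a, b in zip(path, path[1:]))

# I = IIS, A = ASP.NET, S = SQL (made-up training strings).
model = learn(["IAS", "IAS", "IAS", "IS"], alphabet="IAS")
usual = log_likelihood(model, "IAS")
odd = log_likelihood(model, "SAI")
```

Because the score is a sum over transitions, it can be updated one event at a time as a request moves through the system, and a request whose running score falls below a chosen threshold could be flagged as "unlikely". Choosing that threshold is outside this sketch.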

Current status • Recent focus has been developing a generic request extraction scheme • Prototype for 2-machine e-commerce site • TPC-W style workload • Prototype for single machine SQL Server 2000 • Challenge is user mode scheduler • TPC-C workload • Other applications on the way • Large-scale • “Real” systems with “real” performance problems

Conclusion • Magpie is a tool for performance analysis in a distributed system • Bottom up, per-request approach • Complementary to existing techniques: • Performance counters • Program profiling • Feeds into performance debugging and prediction tools
