Performance by Design using the .NET Framework Mark Friedman, Rico Mariani, Vance Morrison Architect Developer Division Microsoft Corporation
Agenda • Teaching Performance Culture – RicoM • Managed Code – RicoM • CPU Optimization – VanceM • Memory – VanceM • Threading – VanceM • Web application scalability – MarkFr • Web application responsiveness – MarkFr
Performance By Design Rico Mariani Architect Microsoft Corporation
Introduction • Part 1 – Teaching Performance Culture • Part 2 – General Topics about Managed Code
Rule #1 • Measure • Just thinking about what to measure will help you do a good job • Performance will happen • If you don’t measure you can be sure it will be slow, big, or whatever else you don’t want • If you haven’t measured, your job’s not finished
Rule #2 • Do your homework • Good engineering requires you to understand your raw materials • What are the key properties of your Framework? Your processor? Your target system?
No more rules • Very few absolutes in the performance biz • Performance work is plagued with powerful secondary and tertiary effects that often dwarf what we think are the primary effects • Whenever considering advice you must remember Rule #1 • Don’t let nifty sounding quotes keep you from great performance
Performance Culture • Budget • an exercise to assess the value of a new feature and the cost you’d be willing to pay • Plan • validate your design against the budget, this is a risk assessment • Verify • measure the final results, discard failures without remorse or penalty, don’t make us live with them
Budget • Begin by thinking about how the customer thinks about performance • Responsiveness • Capacity • Throughput • Cost of Entry • Identify the resource the customer views as critical to this system • Choose the level of performance we want to deliver (do we need an “A+” or is a “D” good enough) • Convert this into what resource usage needs to be to succeed • Don’t think about the code, think about the customer
Plan • You can’t plan without a budget, so get one • Use best practices to select candidate algorithms • Understand their costs in terms of the critical resource • Identify your dependencies and understand their costs • Compare these projected costs against the budgets • If you are close to budget you will need much greater detail in your plans • Identify verification steps and places to abort if it goes badly • Proceed when you are comfortable with the risk
Verify • The budget and the plan drive verification steps • Performance that cannot be verified does not exist • Don’t be afraid to cancel features that are not meeting their budgets – we expect to lose some bets • Don’t inflict bad performance on the world
What Goes Wrong (I) • Programs take dependencies that they fundamentally cannot afford • E.g. Hash and comparison functions that call string splitting functions • Solution • Understand the costs of your dependencies in terms of your critical resource • Use dependencies in the context they were intended
What Goes Wrong (II) • Programs use an algorithm that is fundamentally unsuitable • E.g. mostly sorted data passed to a quicksort • Solution • Model your real algorithm with real data before you assume it’s OK
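To make the quicksort example concrete, here is a minimal sketch that counts comparisons for a deliberately naive quicksort (first element as pivot). It is an illustration written for this point, not the Framework's Array.Sort, and it uses fully sorted input as the extreme case of "mostly sorted". The class name, the array size, and the seed are all assumptions.

```csharp
using System;

static class QuickSortDemo
{
    // Returns the number of element comparisons a naive quicksort
    // (first element as pivot) performs on a copy of the input.
    public static long CountComparisons(int[] data)
    {
        int[] copy = (int[])data.Clone();
        long comparisons = 0;
        Sort(copy, 0, copy.Length - 1, ref comparisons);
        return comparisons;
    }

    static void Sort(int[] a, int lo, int hi, ref long comparisons)
    {
        if (lo >= hi) return;
        int pivot = a[lo];
        int last = lo; // last index holding a value smaller than the pivot
        for (int j = lo + 1; j <= hi; j++)
        {
            comparisons++;
            if (a[j] < pivot) { last++; Swap(a, last, j); }
        }
        Swap(a, lo, last); // move the pivot into its final position
        Sort(a, lo, last - 1, ref comparisons);
        Sort(a, last + 1, hi, ref comparisons);
    }

    static void Swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

    static void Main()
    {
        const int n = 2000; // hypothetical size, small enough to keep recursion shallow
        int[] sorted = new int[n];
        for (int k = 0; k < n; k++) sorted[k] = k;

        // Fisher-Yates shuffle with a fixed seed so runs are repeatable.
        int[] shuffled = (int[])sorted.Clone();
        Random rng = new Random(12345);
        for (int k = n - 1; k > 0; k--) Swap(shuffled, k, rng.Next(k + 1));

        Console.WriteLine("sorted input:   {0} comparisons", CountComparisons(sorted));
        Console.WriteLine("shuffled input: {0} comparisons", CountComparisons(shuffled));
    }
}
```

On sorted input this pivot choice peels off one element per call, so the count is n(n-1)/2; the shuffled run should come out far lower. Measuring with the data you really have is the point of the slide.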
What Goes Wrong (III) • Programs do a lot of work that doesn’t constitute “forward progress” • E.g. converting from one format to another in multiple stages each of which re-copies the data • Solution • Score your algorithms relative to the minimum work required to get the job done • This is the number one reason code is slower than it could be
What Goes Wrong (IV) • Programs are designed to do more than they need to do • E.g. arbitrary extensibility even to the point where it is too complicated to be useful • Solution • Focus only on your customer’s needs • Usable first, then re-usable
Agenda • Teaching Performance Culture • Managed Code • CPU Optimization • Memory • Threading • Web application scalability • Web application responsiveness
Selected Topics (I) : Object Lifetime • Good lifetime looks like single digit time in the collector • Bad lifetime looks like Mid-life Crisis in the Datacenter
Selected Topics (II) : JIT, Ngen, GAC • Many consequences, no perfect answer • Jitting implies private pages • Why is that bad? • Why can it be good? • Ngen is designed to combat these effects • But it’s a mixed blessing too • Shareable code has its costs/benefits • GAC increases the usefulness, also at cost
Selected Topics (III) : Measurement • Choose the right tool for the right problem • Identify sources of consumption (perfmon) • Consider • CPU profilers (like the one in Visual Studio Team System) • Memory profilers, like CLRProfiler • Resource trackers like filemon, regmon • .NET Stopwatch, and others… more on this later today • ETW gives you the best of everything, especially on Vista – but the tooling is still immature
Selected Topics (IV): Collection Classes • The most glaring performance problems (e.g. enumeration of ArrayList) were addressed by the generic collections, which were a chance for a redo • Beware of using Collections as your “flagship storage”; they are not the most frugal
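As a rough illustration of the ArrayList point, the sketch below sums the same integers from an ArrayList and from a List<int>. ArrayList stores object references, so each int is boxed on insertion and unboxed on enumeration, while List<int> stores the values directly. The class and method names are hypothetical.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

static class CollectionDemo
{
    // Each element comes back as object and must be unboxed.
    public static long SumArrayList(ArrayList list)
    {
        long sum = 0;
        foreach (object o in list) sum += (int)o;
        return sum;
    }

    // Elements are stored as int, so no boxing or unboxing happens here.
    public static long SumList(List<int> list)
    {
        long sum = 0;
        foreach (int v in list) sum += v;
        return sum;
    }

    static void Main()
    {
        ArrayList boxed = new ArrayList();
        List<int> typed = new List<int>();
        for (int i = 0; i < 1000; i++)
        {
            boxed.Add(i); // boxes i into a heap object
            typed.Add(i); // stores i inline in the backing array
        }
        Console.WriteLine("ArrayList sum = {0}", SumArrayList(boxed));
        Console.WriteLine("List<int> sum = {0}", SumList(typed));
    }
}
```

Both methods return the same total; the difference is in allocations and per-element work, which is best confirmed with a profiler rather than assumed.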
Selected Topics (V): Exceptions • Managed code pays less for the presence of exceptions, but pays more for the throws • rich state capture, complex state examination • Use of exceptions for anything unexceptional can easily torpedo your performance • Exceptions access “cold” memory
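One way to see the cost of throws for an unexceptional case is to compare int.Parse inside a try/catch with int.TryParse on input that is expected to be bad. This is a sketch under stated assumptions: the class name, the -1 fallback value, and the iteration count are hypothetical, and absolute timings will vary by machine and by whether a debugger is attached.

```csharp
using System;
using System.Diagnostics;

static class ExceptionCostDemo
{
    // Uses an exception to report a routine failure.
    public static int ParseWithException(string s)
    {
        try { return int.Parse(s); }
        catch (FormatException) { return -1; }
    }

    // Reports the same failure through a return value, with no throw.
    public static int ParseWithTry(string s)
    {
        int value;
        return int.TryParse(s, out value) ? value : -1;
    }

    static void Main()
    {
        const int iterations = 10000; // hypothetical count
        const string bad = "abc";

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) ParseWithException(bad);
        sw.Stop();
        Console.WriteLine("Parse + catch: {0:f3} MSec", sw.Elapsed.TotalMilliseconds);

        sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) ParseWithTry(bad);
        sw.Stop();
        Console.WriteLine("TryParse:      {0:f3} MSec", sw.Elapsed.TotalMilliseconds);
    }
}
```

Both helpers return the same results; only the failure path differs, which is where the throw cost shows up.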
Final words • Understand your goals • Understand the costs of what you use • Be ruthless about measuring so that you’ve done the full job • Keep reading and experimenting so you can learn aspects of the system that are most relevant to you • Share your wisdom with your friends • Insist on performance culture in your group • Don’t forget Rule #1 and Rule #2 !!!
References • Rico Mariani’s Performance Tidbits • http://blogs.msdn.com/ricom • Patterns and Practices Performance References • http://msdn.microsoft.com/en-us/library/aa338212.aspx • “Maoni’s Weblog” • http://blogs.msdn.com/maoni/ • “If broken it is, fix it you should” • http://blogs.msdn.com/tess/
CPU Optimization for .NET Applications Vance Morrison Performance Architect Microsoft Corporation
Overview: How to Measure • Along the way • A Little Theory • Pitfalls to avoid • Tricks of the trade • Low Tech: Stopwatch • Medium Tech: MeasureIt • Higher Tech: Sample Based Profiling (CPU) • Future: Instrumenting your code for perf
Measure, Measure, Measure • This talk is about exactly how (Demos!) • It’s all about TIME • We virtualize most other resources, so everything is in the currency of ‘time’ • Measuring Time • Low Tech: System.Diagnostics.Stopwatch • Medium Tech: MeasureIt (Automates Stopwatch) • Medium Tech: Use ETW (Event Tracing for Windows) • Higher Tech: Sample Based Profiling. • The Key is to make it EASY so you will DO it
Low Tech: System.Diagnostics.Stopwatch • This technique is surprisingly useful • Stopwatch is a high resolution Timer • Very Straightforward
    using System;
    using System.Diagnostics;

    Stopwatch sw = Stopwatch.StartNew();
    // Something being measured
    sw.Stop();
    Console.WriteLine("Time = {0:f3} MSec", sw.Elapsed.TotalMilliseconds);
• Pitfalls • Measuring very small time (< 1 usec) • Clock skew on multiprocessors (Each CPU has a clock) • CPU throttling • Variance in measurements (Noise) • Dubious extrapolations.
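Two of the pitfalls above (very small times and noise) are commonly handled by timing many iterations per batch and repeating the batch several times. The following is a minimal sketch of that idea, not the MeasureIt implementation; the helper name, its parameters, and the measured workload are assumptions.

```csharp
using System;
using System.Diagnostics;

static class StopwatchDemo
{
    // Runs 'action' in batches and returns the best per-call time in microseconds.
    // The loop and delegate-call overhead are included in the result.
    public static double MeasureMicroseconds(Action action, int iterationsPerBatch, int batches)
    {
        action(); // warm-up call so JIT compilation is not part of the timing

        double best = double.MaxValue;
        for (int b = 0; b < batches; b++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iterationsPerBatch; i++) action();
            sw.Stop();

            double perCall = sw.Elapsed.TotalMilliseconds * 1000.0 / iterationsPerBatch;
            if (perCall < best) best = perCall; // the minimum is least affected by noise
        }
        return best;
    }

    static void Main()
    {
        // Hypothetical workload: formatting an integer as a string.
        double usec = MeasureMicroseconds(delegate { 12345.ToString(); }, 100000, 5);
        Console.WriteLine("Time = {0:f3} usec per call", usec);
    }
}
```

Batching keeps each timed interval well above the timer resolution, and reporting the minimum of several batches is one simple way to reduce the effect of variance. Extrapolating the result to other workloads is still dubious.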
For CPU, Run CPU at Max Clock • If you don’t do this, CPU is typically throttled to save power • Throttling causes less consistent measurements 1) Go to Control panel’s Power Options 2) Set to High Performance
demo What Runtime Primitives Cost (MeasureIt) Vance Morrison
Sample Based CPU Profiling • Next Easiest Technique • Visual Studio uses this by default • Theory • At periodic intervals (e.g. 1 msec) the CPU is halted • A complete stack trace is taken • Afterward, samples with the same stack are combined • If the sampling is statistically independent, then the number of samples is proportional to the time spent at that location • Good attributes • Low overhead (typically < 10%) • No change to program needed • Less likely to perturb the perf behavior of the program • Limitation: does not measure ‘blocked’ time
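The "combine samples with the same stack" step can be sketched as a simple count per distinct stack. This is a toy model of the theory above, not how the Visual Studio profiler is implemented; the stack strings and names are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

static class SampleAggregator
{
    // Groups identical stack traces and counts how many samples landed on each.
    public static Dictionary<string, int> CountByStack(IEnumerable<string> samples)
    {
        Dictionary<string, int> counts = new Dictionary<string, int>();
        foreach (string stack in samples)
        {
            int seen;
            counts.TryGetValue(stack, out seen); // seen stays 0 for a new stack
            counts[stack] = seen + 1;
        }
        return counts;
    }

    static void Main()
    {
        // Hypothetical samples, one stack trace per timer tick.
        string[] samples =
        {
            "Main>Parse>Split", "Main>Parse>Split", "Main>Render",
            "Main>Parse>Split", "Main>Render", "Main>Parse>Hash"
        };

        Dictionary<string, int> counts = CountByStack(samples);
        foreach (KeyValuePair<string, int> entry in counts)
        {
            // Share of samples approximates share of CPU time at that stack.
            Console.WriteLine("{0,-20} {1,5:f1}%", entry.Key, 100.0 * entry.Value / samples.Length);
        }
    }
}
```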
Pitfalls of Sampling • Samples are related to CPU time, NOT REAL time • Samples are not taken when the process is not running • Thus only useful if CPU is the bottleneck! • VS Profiler does not sample while in the OS, thus system time is also excluded • You need enough samples to have high accuracy • In general, error is proportional to 1/SQRT(n); the figures below correspond to about 2/SQRT(n) • 10 samples has 64% error • 100 samples has 20% error • 1000 samples has 6% error
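The sample-count figures can be reproduced with a one-line formula. The slide's numbers are close to 2/SQRT(n), so the sketch below takes the proportionality constant as a parameter; treating that constant as 2 is an inference from the listed values, not something the slides state.

```csharp
using System;

static class SamplingError
{
    // Relative error for n samples, assuming error = k / sqrt(n).
    public static double RelativeError(int samples, double k)
    {
        return k / Math.Sqrt(samples);
    }

    static void Main()
    {
        int[] counts = { 10, 100, 1000 };
        foreach (int n in counts)
        {
            Console.WriteLine("{0,5} samples: 1/sqrt(n) = {1,5:f1}%   2/sqrt(n) = {2,5:f1}%",
                n, 100 * RelativeError(n, 1.0), 100 * RelativeError(n, 2.0));
        }
    }
}
```

The practical reading is that a tenfold reduction in error needs roughly a hundredfold increase in samples.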
demo Sample Based Profiling in Visual Studio Vance Morrison
Future: Event Tracing For Windows • Windows already has fast, accurate logging • XPERF tool will display the logged data • The Framework already exposes ETW logging • System.Diagnostics.Eventing.EventProvider • However it is not easy to use End-to-End • We are working on it • We will have more offerings in next release • It is a complete talk just by itself • If you need logging NOW you CAN use EventProvider, xperf • If you can wait a year, it will be significantly nicer. • If there is interest, we can have an ‘Open Space’ discussion
Resources (Keywords for Web Search) • Measure Early, Measure Often for Performance (MeasureIt) • Visual Studio Profiler Team Blog • CLR Performance Team’s Blog • Instructions on investigating suspicious DLL Loads. • Improving .NET Application Performance and Scalability • Lutz Roeder .NET Reflector for inspecting code. • Xperf (Pigs can Fly) • Vance Morrison’s Blog • Rico Mariani’s Blog
Memory Optimization for .NET Applications Vance Morrison Performance Architect Microsoft Corporation
Outline • When Should You Care about Memory? • Theory • Memory Usage Breakdown of a process • The Garbage Collected Heap • Characteristics of the GC • Practice (Tools) • Process Level (Task Manager / PerfMon) • Module Level (VaDump) • Object level (ClrProfiler)
When Memory Affects Time • If your computation is not local, memory matters! • Cold Startup: At first all CODE is from Disk • Disk can transfer at best 50Meg/Sec, worst 1Meg/Sec • However OS caches Disk (thus only ‘New’ Code is bad). • Sluggishness when App Switching • Start App1 • User switches to App2 • App2 ‘steals’ physical memory of App1 (paged out) • User switches back to App1, memory must be paged in • Servers are constantly ‘App Switching’
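The disk transfer figures translate directly into a cold startup estimate. The sketch below works through the arithmetic for a hypothetical 10 Meg of code that is not yet in the OS disk cache; the 10 Meg figure and the names are assumptions, and only the 50 Meg/Sec and 1 Meg/Sec rates come from the slide.

```csharp
using System;

static class ColdStartupEstimate
{
    // Time to pull 'codeMeg' megabytes from disk at the given transfer rate.
    public static double SecondsToRead(double codeMeg, double diskMegPerSec)
    {
        return codeMeg / diskMegPerSec;
    }

    static void Main()
    {
        const double codeMeg = 10.0; // hypothetical amount of 'new' code touched at startup

        Console.WriteLine("Best case  (50 Meg/Sec): {0:f1} sec", SecondsToRead(codeMeg, 50.0));
        Console.WriteLine("Worst case ( 1 Meg/Sec): {0:f1} sec", SecondsToRead(codeMeg, 1.0));
    }
}
```

The spread between the two rates is a factor of 50, so a real measurement on the target machine is still needed.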
Memory Breakdown of a .NET App • Memory Mapped Files (Mostly Shared) • Loaded DLLs • Code / Read-Only Data (Can be shared) • Read/Write data (Private to one process) • Other Mapped Files (Fonts, Registry) • Dynamic Data (Not Shared) • Unmanaged Heaps • Stack • System support (VM Page Entries, TEB) • Direct VirtualAlloc calls • Runtime’s Garbage Collected (GC) Heap
Viewing an App’s Memory Breakdown • Task Manager (Built In) • Working Set, and Private Working set are the interesting columns • Commit Size is NOT Interesting • Can’t distinguish between shareable and shared. • PerfMon (Built In) • Can also get the information Task Manager shows • Can display it as an ongoing Graph • Can get details on the .NET GC Heap • VaDump (Free Download) • Shows breakdown within a single process • Shows down to DLL granularity • Shows Shareable vs Shared distinction.
Tools for viewing Memory Use • Task Manager • Win-R -> taskMgr • Select Columns • Working Set tends to overestimate memory impact (it includes shared OS files), but it does not account for files that are read • Private Working Set underestimates impact • Working Set: Small < 20 Meg, Medium ~ 50 Meg, Large > 100 Meg • Private Working Set: Small < 5 Meg, Medium ~ 20 Meg, Large > 50 Meg
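The total working set can also be read from inside a process using System.Diagnostics.Process. The sketch below does that and applies the Working Set size bands from this slide. The exact cut-offs between bands are an interpretation (the slide gives only "< 20", "~ 50" and "> 100"), and the class name is hypothetical. It reads only the total working set, not the private working set.

```csharp
using System;
using System.Diagnostics;

static class WorkingSetDemo
{
    // Maps a working set size in Meg onto the rough bands from the slide.
    // Treating everything between 20 and 100 Meg as "Medium" is an assumption.
    public static string Classify(double workingSetMeg)
    {
        if (workingSetMeg < 20) return "Small";
        if (workingSetMeg > 100) return "Large";
        return "Medium";
    }

    static void Main()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            double meg = current.WorkingSet64 / (1024.0 * 1024.0);
            Console.WriteLine("Working Set = {0:f1} Meg ({1})", meg, Classify(meg));
        }
    }
}
```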
Getting Overview of GC Memory • PerfMon (Performance Monitor) • Win-R -> PerfMon
Performance Monitor : Add Counters Adding New Counters
Performance Monitor: GC Counters 1) Open .NET CLR Memory 2) Select Counters 3) Select Process 4) Add 5) OK
Performance Monitor: GC Info • Set Display to Report • Counters of interest: GC Heap Size, % Time in GC • In this example, 7.3 Meg of the 8.6 Meg of private working set is the GC Heap • % Time in GC is ideally less than 10% • Ratios of GC Generations: Gen 0 = 10 x Gen 1, Gen 1 = 10 x Gen 2
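Some of the same information is available in code through the System.GC class. The sketch below prints the per-generation collection counts, their ratios, and an approximate managed heap size. It is a rough stand-in for the PerfMon counters, not an equivalent: GC.GetTotalMemory is an estimate, and % Time in GC is not read here. The class name and the allocation loop are assumptions.

```csharp
using System;

static class GcCounterDemo
{
    static object sink; // keeps each allocation observable so it is not optimized away

    // Ratio of collections in a younger generation to an older one; NaN if undefined.
    public static double Ratio(int younger, int older)
    {
        return older == 0 ? double.NaN : (double)younger / older;
    }

    static void Main()
    {
        // Hypothetical workload: lots of short-lived garbage.
        for (int i = 0; i < 1000000; i++) sink = new byte[100];

        int gen0 = GC.CollectionCount(0);
        int gen1 = GC.CollectionCount(1);
        int gen2 = GC.CollectionCount(2);

        Console.WriteLine("Collections: Gen0={0} Gen1={1} Gen2={2}", gen0, gen1, gen2);
        Console.WriteLine("Gen0/Gen1 = {0:f1}, Gen1/Gen2 = {1:f1}", Ratio(gen0, gen1), Ratio(gen1, gen2));
        Console.WriteLine("GC heap size ~ {0:f1} Meg", GC.GetTotalMemory(false) / (1024.0 * 1024.0));
    }
}
```

A collection of an older generation also collects the younger ones, so the Gen 0 count is never lower than the Gen 1 count.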
Drilling In Further: VADump • VaDump -sop ProcessID • 1) Total WS • 2) Breakdown • 3) DLL Breakdown • GC Heap Here • Only These Affect Cold Startup
Typical Memory Breakdown Results • Memory Is Mostly from DLL Loads • Typically cold startup is very bad (since data must come from disk) • Private + Sharable-but-not-Shared is the metric of interest • Eliminating unnecessary (unshared) DLL loads is the first line of attack • See CLR Performance Team Blog on tracking down DLL loads • Reducing the amount of code touched is next • Memory Is Mostly In GC Heap • Does not affect cold startup much but can affect warm startup • If the GC heap is large (> 10s of Meg), it is probably degrading throughput • GC time is proportional to the number of pointers in surviving data • In Either Case, when Working Set is Large (> 10 Meg) • Throughput is lost due to cache misses • Server workloads are typically cache limited
Fixing Memory Issues: Prevention! • Fixing Memory Issues is HARD • Usually a DESIGN problem: Not Pay for Play • Using every new feature in your app • XML, LINQ, WPF, WCF, Serialization, Winforms, … • Initializing all subsystems at startup • GC Memory Is Your Data Structures • These tend to be designed early • Hard to change later • Thus it Pays to Think about Memory Early!
Some Theory on the .NET GC Heap • Compacting Garbage Collector • Runtime traces all pointers to the GC Heap • Fast: allocations under 85K just ‘bump a pointer’ • When the heap is full, a GC happens and memory is typically compacted • Objects allocated on the same thread are placed together • .NET GC is a Generational Collector (3 Gens: 0, 1, 2) • Gen 0 • All memory is allocated out of Gen 0 • Ideally size of Gen 0 < L2 cache size • GC of Gen 0 takes little time (e.g. 0.2 msec) • Gen 1 • Memory that survived 1 GC. Ideally #Gen 0 = 10 x #Gen 1 • GC of Gen 1 takes longer but is modest (e.g. 1 msec) • Gen 2 • All objects (including large objects). Can be very large. • GC of Gen 2 can take a noticeable amount of time (e.g. 8 msec / Meg) • For a 20 Meg heap: 20 x 8 msec = 160 msec • Time depends on • Amount of memory surviving • Number of GC pointers in surviving memory • Fragmentation of Heap.
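The generation of a live object can be observed with GC.GetGeneration. The sketch below records an object's generation before and after two forced collections, and also reports the generation of an allocation above the 85K threshold. Forcing collections with GC.Collect is done here only to make the promotion visible; the class name and sizes are assumptions, and the exact numbers printed can depend on the runtime and GC settings.

```csharp
using System;

static class GenerationDemo
{
    // Returns the object's generation when passed in, after one GC, and after two GCs.
    public static int[] ObserveGenerations(object obj)
    {
        int[] generations = new int[3];
        generations[0] = GC.GetGeneration(obj);
        GC.Collect(); // forced only for demonstration purposes
        generations[1] = GC.GetGeneration(obj);
        GC.Collect();
        generations[2] = GC.GetGeneration(obj);
        return generations;
    }

    static void Main()
    {
        int[] small = ObserveGenerations(new object());
        Console.WriteLine("Small object: start={0}, after 1 GC={1}, after 2 GCs={2}",
            small[0], small[1], small[2]);

        byte[] large = new byte[100000]; // above the 85K threshold mentioned above
        Console.WriteLine("Large array reported in generation {0}", GC.GetGeneration(large));
        Console.WriteLine("Highest generation = {0}", GC.MaxGeneration);
    }
}
```

An object that stays reachable is typically promoted one generation per collection it survives, which is why long-lived data ends up being paid for at Gen 2 cost.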
More .NET GC Heap Theory • GC heap looks like a Sawtooth • Typical Gen 2 Peak / Trough Ratio ~ 1.6 • Ratio mostly independent of heap size • Keep in mind no other fragmentation