Mysteries of Windows Memory Management Revealed

WCL406: Mysteries of Windows Memory Management Revealed • Mark Russinovich, Technical Fellow, Windows Azure (created jointly with Dave Solomon)

Goals • Deep dive on: • Process virtual and physical memory usage • Operating system virtual and physical memory usage • Crisply define memory-related terminology • Highlight tools that reveal memory usage • Describe ‘dark spots’ in memory analysis counters and tools

Agenda • Virtual Memory • Address Space Usage • Process Commit • System Commit • Physical Memory • Working Sets • Paging Lists • Hard to Track Memory

Tools We’ll Use • Task Manager • Sysinternals Process Explorer • Sysinternals VMMap: process virtual and physical memory usage • Sysinternals RAMMap: system physical memory usage • Sysinternals Testlimit: test program to leak different kinds of memory • Sysinternals tools are free at www.sysinternals.com

Virtual Memory

Memory Management Fundamentals • Windows has demand-paged memory management • Processes “demand” memory as needed • There is no swapping • A page is 4 KB (8 KB on Itanium) • Allocations must align on 64 KB boundaries • Large pages are available for improved TLB usage • x86: 4 MB • x64 and x86 PAE: 2 MB • Itanium: 16 MB • There is NO (well, almost no) connection between virtual memory and physical memory
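The page size and 64 KB alignment rules above can be illustrated with a little arithmetic. This is a minimal sketch, assuming a 4 KB page and an allocation that starts on a 64 KB boundary; the function names are hypothetical and are not part of any Windows API.

```python
PAGE_SIZE = 4 * 1024                 # 4 KB page, per the slide (x86/x64)
ALLOCATION_GRANULARITY = 64 * 1024   # allocations align on 64 KB boundaries


def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def allocation_footprint(requested_bytes: int) -> dict:
    """Describe how a request occupies address space (illustrative only)."""
    usable = round_up(requested_bytes, PAGE_SIZE)            # whole pages
    spanned = round_up(usable, ALLOCATION_GRANULARITY)       # up to the next 64 KB boundary
    return {
        "pages": usable // PAGE_SIZE,
        "usable_bytes": usable,
        # Address space between the end of the allocation and the next
        # 64 KB boundary cannot start another allocation.
        "unusable_bytes": spanned - usable,
    }


if __name__ == "__main__":
    print(allocation_footprint(10_000))
```

Under these assumptions a 10,000-byte request occupies three pages and leaves the rest of its 64 KB slot unavailable to other allocations, which is why many small allocations can waste address space.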

32-bit x86 Address Space • 32 bits = 2^32 = 4 GB • /3GB and /USERVA can extend the process address space up to 3 GB • Process must be marked “large address space aware” to use memory above 2 GB • [Figure: Default layout: 2 GB per-process space, 2 GB system space; 3 GB user space layout: 3 GB per-process space, 1 GB system space]

64-bit Address Spaces • 64 bits = 2^64 = 17,179,869,184 GB • x64 today supports 48 bits virtual = 262,144 GB = 256 TB • IA-64 today supports 50 bits virtual = 1,048,576 GB = 1024 TB • 64-bit Windows supports 44 bits = 16,384 GB = 16 TB • [Figure: x64: 8 TB per-process space, 8 TB system space; 32-bit process on x64: 4 GB per-process space, 8 TB system space]
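The address space sizes quoted above follow directly from the number of address bits. A small sketch to reproduce the figures (the helper name is hypothetical):

```python
GB = 2 ** 30  # bytes per gigabyte


def address_space_gb(bits: int) -> int:
    """Size in GB of an address space with the given number of address bits."""
    return (2 ** bits) // GB


if __name__ == "__main__":
    for label, bits in [
        ("32-bit x86", 32),
        ("64-bit Windows (supported)", 44),
        ("x64 (implemented)", 48),
        ("IA-64 (implemented)", 50),
        ("full 64-bit", 64),
    ]:
        print(f"{label}: {bits} bits = {address_space_gb(bits):,} GB")
```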

Virtual Address Space Components • Committed: in-use • Reserved: reserved for future use • Address space breakdown • Private (e.g. process heap) • Reserved or committed • Shareable (e.g. EXE, DLL, shared memory, other memory mapped files) • Reserved or committed • Free (not yet defined)

Why Reserve Memory? • Reserved memory lets an application lazily commit contiguous memory • Used for stack and heap expansion • [Figure: thread stack growing down, before and after expansion, showing committed, guard, and reserved regions]

Viewing Address Space Breakdown • Task Manager only lets you see private bytes • Before Vista: column called “VM Size” • Vista and later: column called “Commit Size” • Process Explorer shows both virtual size and private bytes • Add 2 columns to process list • Virtual Size • Private Bytes • Run Testlimit twice • Testlimit -r • Testlimit -m • Note: if on 64-bit Windows, 32-bit Testlimit can grow to 4GB

Understanding Process Address Space Usage • Most virtual memory problems are due to a process leaking private committed memory • Heap, GC heap, language heaps (CRT) • Private Bytes only tells part of the story • Doesn’t account for shareable memory that’s not shared (e.g. DLLs loaded only by this process) • Fragmentation can be an issue • Address space can effectively be exhausted prematurely • Basic performance counters don’t provide enough information to troubleshoot • [Figure: Fragmented Address Space]

Viewing Processes with VMMap • VMMap shows detailed breakdown of process address space: • Private process memory • Copy-on-write • Private (VirtualAlloc) • Heap and GC Heap • Stack • Shareable process memory • Image - executables • Shareable – shareable memory • Mapped File – memory mapped files • Page table – page table pages • Unusable – gap between allocation and next allocation boundary • Note that “shareable” types can have private commitment • Read/write pages in shared memory • Copy-on-write pages

Viewing Fragmentation • Fragmentation is visible by selecting Options->Show Free Regions, selecting the Free type, and sorting by size • Largest free block is largest allocation possible • Clickable fragmentation map in View->Fragmentation View • Run testlimit -t on 64-bit Windows • Threads need 256 KB 64-bit stack and 1 MB 32-bit stack

File Mappings • File mapping enables an application to read and write file data through memory operations • File mappings are used for • Image (.EXE and .DLL) loading: “Image” in VMMap • Data file access (e.g. NLS files): “Mapped File” in VMMap • “Pagefile-backed” shared memory: “Shareable” in VMMap • Entire file doesn’t have to be mapped • Allows for “windows” into the file • [Figure: windows of Database.db mapped into the address space]

Tracing File Mapping with Process Monitor • Procmon can trace image loader activity

VMMap Differencing • Press F5 to refresh the view • VMMap keeps all snapshots • Use the timeline to select snapshots to compare

Tracing with VMMap • You can launch a process with profiling • Detours tracks virtual and heap activity

The System Commit Limit • System committed virtual memory must be backed either by physical memory or stored in the paging file • Sum of (most of) physical memory and current paging files • Allocations charged against the system commit limit: • Process private bytes • Pagefile-backed shared memory • Copy-on-write pages • Read/write file pages • System paged and nonpaged code and data • When limit is reached, virtual memory allocations fail • Processes may crash (or corrupt data)

Changing the System Commit Limit • You can increase the system commit limit by adding RAM or increasing the pagefile size • The system commit limit can grow if paging files are configured to expand • So the system commit limit might be the current limit, not the maximum • Default configuration (“System Managed”): • Minimum: 1.5x RAM if RAM < 1 GB; RAM otherwise • Maximum: 3x RAM or 4 GB, whichever is larger • Maximum system commit limit should be based on system commit peak for extreme workload
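The “System Managed” defaults listed above can be written out as a small calculation. This is a sketch only: it approximates the commit limit as all of RAM plus the paging file, whereas the real limit counts only most of physical memory, and the function names are hypothetical.

```python
MB_PER_GB = 1024


def system_managed_pagefile_mb(ram_mb: int) -> tuple:
    """Return (minimum, maximum) paging file size in MB per the slide's rule."""
    minimum = ram_mb * 3 // 2 if ram_mb < MB_PER_GB else ram_mb
    maximum = max(3 * ram_mb, 4 * MB_PER_GB)
    return minimum, maximum


def commit_limit_range_mb(ram_mb: int) -> tuple:
    """Approximate (current, fully expanded) commit limit: RAM + paging file."""
    minimum, maximum = system_managed_pagefile_mb(ram_mb)
    return ram_mb + minimum, ram_mb + maximum


if __name__ == "__main__":
    for ram in (512, 8192):
        print(ram, system_managed_pagefile_mb(ram), commit_limit_range_mb(ram))
```

The two values returned by `commit_limit_range_mb` show why the limit a tool reports may be the current limit rather than the maximum: the paging file can still expand.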

Viewing System Commit Usage • Performance Counters: • Committed Bytes • Commit Limit • Task Manager • XP: commit charge labeled “PF Usage” • Vista: commit charge labeled “Page File” • Win7: commit charge labeled “Commit” • Vista and Win7 show commit limit after slash

Viewing the System Commit Limit • Process Explorer shows commit charge (with history), commit limit, and commit peak • No built-in tool shows peak any more

Exhausting the System Commit Limit • On a 32-bit system, run “Testlimit -m” multiple times until the system commit limit is exhausted • On 64-bit systems, “Testlimit64 -m” will exhaust the system commit limit before its address space

Sizing the Paging File • If you have enough RAM to support your commit needs, why even have one? • System can page out unused, modified private pages vs keeping them in RAM • More RAM available for useful stuff • Many recommendations use a formula based on RAM (1.5x, 2x, etc.) • Actually, the more RAM, the smaller the paging file needed • Should be based on workload usage of committed virtual memory • Look at commit peak after workload has run • Pre-Vista: Task Manager • Vista+: Process Explorer • Apply a formula to that to give buffer (1.5x or 2x) • Make sure it’s big enough to hold a kernel crash dump
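One way to turn the commit-peak guidance above into a number is sketched here. It rests on assumptions not stated on the slide: the commit limit is treated as RAM plus paging file, the buffer factor is applied to the observed commit peak, and the kernel dump size is supplied by the caller. All names and parameters are hypothetical.

```python
def suggested_pagefile_mb(commit_peak_mb: float,
                          ram_mb: float,
                          buffer_factor: float = 1.5,
                          kernel_dump_mb: float = 0) -> float:
    """Estimate a paging file size (MB) from an observed commit peak."""
    # Target commit limit: observed peak plus headroom (1.5x or 2x).
    target_commit_limit = commit_peak_mb * buffer_factor
    # Assumption: commit limit ~= RAM + paging file, so the paging file
    # only has to cover what RAM cannot.
    needed_for_commit = max(0, target_commit_limit - ram_mb)
    # The file must also be large enough to hold a kernel crash dump.
    return max(needed_for_commit, kernel_dump_mb)


if __name__ == "__main__":
    print(suggested_pagefile_mb(commit_peak_mb=12000, ram_mb=8192))
    print(suggested_pagefile_mb(commit_peak_mb=4000, ram_mb=8192, kernel_dump_mb=800))
```

The second call illustrates the point that more RAM means a smaller paging file: the workload fits in RAM, so only the crash dump requirement remains.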

Physical Memory

Working Set List • All the physical pages “owned” by a process • I.e. the pages the process can reference without incurring a page fault • A process always starts with an empty working set • It then incurs page faults when referencing a page that isn’t in its working set • Hard fault: resolved from file on disk (paging file, mapped file) • Soft fault: resolved from memory • [Figure: working set list ordered from newer pages to older pages]

Working Set • Each process has a default working set minimum and maximum • Can change with SetProcessWorkingSetSize • Working set minimum controls maximum number of locked pages (VirtualLock) • Minimum is also reserved from RAM as a guarantee to the process • Working set maximum is ignored • If there’s ample memory, process working set represents all the memory it has referenced (but not freed) • If memory is tight, working sets get trimmed

Working Set Replacement • When the memory manager decides the process is large enough, the process gives up pages to make room for new pages • Local page replacement policy • Means that a single process cannot take over all of physical memory unless other processes aren’t using it • Page replacement algorithm is least recently accessed (pages are aged when available memory is low) • [Figure: pages removed from the working set go to the standby or modified page list]

Working Set Breakdown • Consists of 2 types of pages: • Shareable (of which some may be shared) • Private • Four performance counters available: • Working Set Shareable • Working Set Shared (subset of shareable that are currently shared) • Working Set Private • Working Set Size (total of WS Shareable+Private) • Note: adding this up for each process overcounts shared pages • Caveats: • Working set does not include trimmed memory that is still cached • Shareable working set should be viewed as “private” if it’s not shared
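The overcounting note above is easy to see with a toy model. The sketch below represents each process's working set as sets of page frame numbers (a simplification, with made-up data) and compares the naive sum of Working Set Size against the number of distinct physical pages.

```python
from collections import Counter


def working_set_breakdown(processes: dict) -> dict:
    """processes: {name: {"private": set, "shareable": set}} of page frame numbers."""
    # Count how many processes have each shareable frame in their working set.
    usage = Counter(f for p in processes.values() for f in p["shareable"])
    report = {}
    for name, p in processes.items():
        shared = {f for f in p["shareable"] if usage[f] > 1}
        report[name] = {
            "ws_private": len(p["private"]),
            "ws_shareable": len(p["shareable"]),
            "ws_shared": len(shared),  # subset of shareable currently shared
            "ws_size": len(p["private"]) + len(p["shareable"]),
        }
    return report


def naive_total(report: dict) -> int:
    """Sum of per-process Working Set Size; counts shared pages repeatedly."""
    return sum(r["ws_size"] for r in report.values())


def distinct_total(processes: dict) -> int:
    """Number of distinct physical pages across all working sets."""
    frames = set()
    for p in processes.values():
        frames |= p["private"] | p["shareable"]
    return len(frames)


if __name__ == "__main__":
    procs = {
        "A": {"private": {1, 2}, "shareable": {10, 11, 12}},
        "B": {"private": {3}, "shareable": {11, 12, 13}},
    }
    report = working_set_breakdown(procs)
    print(report)
    print("naive:", naive_total(report), "distinct:", distinct_total(procs))
```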

Viewing Working Set with Task Manager • Displays private working set size • Calls it “Memory (Private Working Set)”

Viewing Working Set with Process Explorer • Process Explorer shows all the performance counters • Virtual Bytes • Private Bytes • WS Shareable Bytes • WS Shared Bytes • WS Private Bytes • Run Testlimit three times: • Testlimit -r 1024 -c 1 • Testlimit -m 1024 -c 1 • Testlimit -d 1024 -c 1 • Note how working set numbers don’t at all represent the process virtual memory usage

Viewing the Working Set with VMMap • VMMap shows working set size of each component of address space • Also shows locked pages • Copy-on-write pages will show up as Private WS in shareable regions

How Copy-On-Write Works: Before • [Figure: two process address spaces whose original data maps to the same physical memory pages: Page 1, Page 2, Page 3]

How Copy-On-Write Works: After • [Figure: after a write, the modified data maps to a copy of Page 2 in physical memory; the original Page 1, Page 2, and Page 3 remain]
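The before and after pictures can be mimicked with a toy model of page tables. This is a sketch of the copy-on-write idea only, not of how the Windows memory manager implements it; the classes and their fields are hypothetical.

```python
class PhysicalMemory:
    """Toy physical memory: frame number -> contents."""

    def __init__(self):
        self.frames = {}
        self.next_frame = 0

    def allocate(self, contents):
        frame = self.next_frame
        self.next_frame += 1
        self.frames[frame] = contents
        return frame


class Process:
    """Toy process whose pages start out shared, copy-on-write."""

    def __init__(self, memory, mapping):
        self.memory = memory
        self.page_table = dict(mapping)  # virtual page -> physical frame
        self.private = set()             # virtual pages already copied

    def read(self, vpage):
        return self.memory.frames[self.page_table[vpage]]

    def write(self, vpage, contents):
        if vpage not in self.private:
            # First write: give this process its own copy of the page.
            self.page_table[vpage] = self.memory.allocate(self.read(vpage))
            self.private.add(vpage)
        self.memory.frames[self.page_table[vpage]] = contents


if __name__ == "__main__":
    mem = PhysicalMemory()
    shared = {1: mem.allocate("orig 1"), 2: mem.allocate("orig 2"), 3: mem.allocate("page 3")}
    a, b = Process(mem, shared), Process(mem, shared)
    a.write(2, "modified 2")
    print(a.read(2), "|", b.read(2), "| frames in use:", len(mem.frames))
```

Only the written page is duplicated; the other pages stay shared, which is why copy-on-write pages show up as private working set inside otherwise shareable regions.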

Managing Physical Memory • System keeps unassigned physical pages on one of several lists • Free page list • Modified page list • Standby page lists (8 as of Vista & later) • Zero page list • ROM page list • Bad page list - pages that failed memory test at system startup • Lists are implemented by entries in the “PFN database” • Maintained as FIFO lists or queues

Paging Dynamics • New pages are allocated to working sets from the top of the free or zero page list • Pages released from the working set due to working set replacement go to the bottom of: • The modified page list (if they were modified while in the working set) • The standby page list (if not modified) • Decision made based on “D” (dirty = modified) bit in page table entry • Association between the process and the physical page is still maintained while the page is on either of these lists

Standby and Modified Page Lists • Modified pages go to modified (dirty) list • Avoids writing pages back to disk too soon • Unmodified pages go to standby (clean) lists • They form a system-wide cache of “pages likely to be needed again” • Pages can be faulted back into a process from the standby and modified page list • These are counted as page faults, but not page reads

Modified Page Writer • When modified list reaches certain size, modified page writer system thread is awoken to write pages out • Also triggered when memory is overcommitted (too few free pages) • Does not flush entire modified page list • Two system threads • One for mapped files, one for the paging file • Pages move from the modified list to the standby list • I.e. can still be soft faulted into a working set
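The movement of pages between the working set, the modified list, and the standby list can be sketched as a toy simulation. It makes simplifying assumptions: a single standby list, pages identified by name, and any page found on neither list is treated as a hard fault. The class and method names are hypothetical.

```python
from collections import deque


class PagingLists:
    """Toy model of the standby and modified page lists (FIFO queues)."""

    def __init__(self):
        self.standby = deque()
        self.modified = deque()
        self.soft_faults = 0
        self.hard_faults = 0

    def trim(self, page, dirty):
        """Working set replacement: the dirty bit picks the destination list."""
        (self.modified if dirty else self.standby).append(page)

    def fault(self, page):
        """Resolve a page fault; pages still on a list come back without a read."""
        for lst in (self.standby, self.modified):
            if page in lst:
                lst.remove(page)
                self.soft_faults += 1
                return "soft"
        self.hard_faults += 1  # assumed here to require a read from disk
        return "hard"

    def write_modified(self, count):
        """Modified page writer: write out up to `count` pages, then keep
        them cached on the standby list instead of discarding them."""
        written = 0
        while self.modified and written < count:
            self.standby.append(self.modified.popleft())
            written += 1
        return written


if __name__ == "__main__":
    lists = PagingLists()
    lists.trim("a", dirty=False)
    lists.trim("b", dirty=True)
    lists.write_modified(1)
    print(lists.fault("b"), lists.fault("c"))
```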

Free and Zero Page Lists • Free Page List • Used for page reads • Private modified pages go here on process exit • Pages contain junk in them (i.e. not zeroed) • On most busy systems, this is empty • Zero Page List • Used to satisfy demand zero page faults • References to private pages that have not been created yet • When free page list has 8 or more pages, a priority zero thread is awoken to zero them • On most busy systems, this is empty too

Paging Dynamics • [Figure: page flow among Working Sets, Modified Page List, Standby Page Lists, Free Page List, Zero Page List, and Bad Page List; labeled transitions: demand zero page faults, page read from disk or kernel allocations (“hard” page faults), “soft” page faults, “global valid” faults, working set replacement, modified page writer, zero page thread, private pages at process exit]

Viewing the Paging Lists with Task Manager • XP/2003: • Available = Standby + Zero + Free • System Cache = Standby + Modified + System Working Set • Vista/Server 2008: • Replaced Available with Free (Free + Zero lists) • System Cache relabeled Cached • Windows 7/Server 2008 R2: • Available put back
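The Task Manager figures above are simple sums over the paging lists. A short sketch with made-up list sizes (in MB) and hypothetical function names:

```python
def task_manager_xp(standby, modified, zero, free, system_working_set):
    """XP/2003 Task Manager values derived from the paging lists."""
    return {
        "Available": standby + zero + free,
        "System Cache": standby + modified + system_working_set,
    }


def task_manager_vista_free(zero, free):
    """Vista/Server 2008 'Free' value: the free and zero lists combined."""
    return free + zero


if __name__ == "__main__":
    print(task_manager_xp(standby=900, modified=50, zero=20, free=30, system_working_set=200))
    print(task_manager_vista_free(zero=20, free=30))
```

With a large standby list, the Vista-style “Free” value is far smaller than “Available”, even though standby pages can be repurposed at any time.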

Viewing the Paging Lists with Process Explorer • Process Explorer shows each paging list • Click View->System Information

Total Process Private Memory Usage • Working Set size does not include: • Private memory on standby or modified lists • Page tables • RAMMap shows this on Processes tab

Viewing Memory Usage with RAMMap • In addition to showing size of paging lists, shows usage breakdown: • Process private • Mapped file • Shared memory • Page tables • Paged pool • Nonpaged pool • System PTE • Session private • Metafile • AWE • Driver locked • Kernel stack

Prioritized Standby Lists • In Vista & later, there are 8 prioritized standby lists • Pages are removed from lowest priority list first • Low memory priority process will keep re-using low priority pages • Higher priority information remains cached • [Figure: pages added to and removed from the prioritized standby lists]
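The removal order described above can be sketched with eight FIFO queues, indexed by priority. This is a toy model under the assumption that each list is a simple queue; the class name and methods are hypothetical.

```python
from collections import deque


class PrioritizedStandby:
    """Toy model of 8 standby lists, priority 0 (lowest) to 7 (highest)."""

    def __init__(self, priorities=8):
        self.lists = [deque() for _ in range(priorities)]

    def add(self, page, priority):
        """Place a page on the standby list matching its memory priority."""
        self.lists[priority].append(page)

    def repurpose(self):
        """Take a page from the lowest-priority non-empty list, oldest first."""
        for lst in self.lists:
            if lst:
                return lst.popleft()
        return None  # nothing cached


if __name__ == "__main__":
    standby = PrioritizedStandby()
    standby.add("normal app page", 5)
    standby.add("background page 1", 1)
    standby.add("pre-trained page", 7)
    standby.add("background page 2", 1)
    print([standby.repurpose() for _ in range(4)])
```

Both background pages are consumed before the normal and pre-trained pages are touched, which is how higher priority information stays cached.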

SuperFetch™ • SuperFetch proactively repopulates RAM with the most useful data • Sets priority of pages to optimal value, based on the page history and other analysis that it performs • Takes into account frequency of page usage, usage of page in context of other pages in memory • Adapts to application launch patterns, in chunks of 8 hours (times a day) and weekend vs weekday • Scenarios SuperFetch improves include • Resume from hibernate and suspend • Fast user switching • Performance after infrequent or low priority tasks execute • Application launch • Windows 7: Disabled if the OS is booted off an SSD

Memory Priority • Each thread has its own memory priority • 5: normal • 1: low • This determines which standby list is used for the page (when/if it arrives on the standby list) • Thread memory priority comes from process memory priority • Can be changed for process or individual thread • SetPriorityClass or SetThreadPriority “background mode”

Standby List Population • Priority 7 pages come from a static set (pre-trained at Microsoft) • Pre-populated at each boot • Includes pages related to user input that requires fast responsiveness (right-click, desktop properties, control panel, start menu, etc.) • Priority 6 pages are those that SuperFetch considers important or useful (will rarely get repurposed) • Priority 5 pages are standard user pages (memory priority 5) • Priority 1 pages are low priority user pages (memory priority 1) • Priority 0-4 pages may be SuperFetch-decayed pages, cache manager read-ahead, and page fault clustering

How Much of the Standby List has Been Consumed? • RAMMap shows the amount of memory repurposed off each standby list since boot
