390 likes | 610 Views
Memory Management in the Kernel. Fred Kuhns Applied Research Laboratory Computer Science and Engineering Washington University in St. Louis. UNIX Memory Management. UNIX uses a demand paged virtual memory architecture may proactively prepage Memory is managed at the page level
E N D
Memory Management in the Kernel Fred Kuhns Applied Research Laboratory Computer Science and Engineering Washington University in St. Louis
UNIX Memory Management • UNIX uses a demand paged virtual memory architecture • may proactively prepage • Memory is managed at the page level • page frame == physical page • virtual page • Basic memory allocation responsibility of the page-level allocator CSE522– Advanced Operating Systems
Page-level Allocation • Kernel maintains a list of free physical pages. • Since kernel and user space programs use virtual memory address, the physical location of a page is not important • Pages are allocated from the free list • Two principal clients: • the paging system • the kernel memory allocator CSE522– Advanced Operating Systems
Memory allocation physical page Page-level allocator Kernel memory Allocator Paging system Network buffers Data structures temp storage process Buffer cache CSE522– Advanced Operating Systems
Kernel Memory management • Memory organization: • permanently mapped to same memory range of all processes • separate address space • up to 1/3 of kernel time may be spent copying data between user and kernel • user space can not write/read directly to kernel space • KMA uses a preallocated pool of physical pages CSE522– Advanced Operating Systems
Kernel Memory Allocation • Typical request is for less than 1 page • Originally, kernel used statically allocated, fixed size tables - what’s limiting about this? • Kernel requires a general purpose allocator for both large and small chunks of memory. • handles memory requests from kernel modules, not user level applications • pathname translation routine, STREAMS or I/O buffers, zombie structures, table table entries (proc structure etc) CSE522– Advanced Operating Systems
Requirements of the KMA • Minimize Waste: • utilization factor = requested/required memory • Useful metric that factors in fragmentation. • 50% considered good • KMA must be fast since extensively used • Simple API similar to malloc and free. • desirable to free portions of allocated space, this is different from typical user space malloc and free interface • Properly aligned allocations: for example 4 byte alignment • Support cyclical and bursty usage patterns • Interaction with paging system – able to borrow pages from paging system if running low CSE522– Advanced Operating Systems
Example KMA’s • Resource Map Allocator • Simple Power-of-Two Free Lists • The McKusick-Karels Allocator • The Buddy System • SVR4 Lazy Buddy Allocator • Mach-OSF/1 Zone Allocator • Solaris Slab Allocator CSE522– Advanced Operating Systems
Resource Map Allocator • Resource map is a set of <base,size> pairs that monitor areas of free memory • Initially, pool described by a single map entry = <pool_starting_address, pool_size> • Allocations result in pool fragmenting with one map entry for each contiguous free region • Entries sorted in order of increasing base address • Requests satisfied using one of three policies: • First fit– Allocates from first free region with sufficient space.Fasted, fragmentation is concern • Best fit – Allocates from smallest that satisfies request. May leave several regions that are too small to be useful • Worst fit - Allocates from largest region unless perfect fit is found. Goal is to leave behind larger regions after allocation CSE522– Advanced Operating Systems
Resource Map Allocator Example Operations: offset_t rmalloc(size), void rmfree(base, size) <0,1024> after: rmalloc(256), rmalloc(320), rmfree(256,128) <256,128> <576,448> after: rmfree(128,128) <128,256> <576,448> after:many more allocation and deallocations <128,32> <288,64> <544,128> <832,32> CSE522– Advanced Operating Systems
Resource Map -Good/Bad • Advantages: • simple, easy to implement • not restricted to memory allocation, any collection of objects that are sequentially ordered and require allocation and freeing in contiguous chunks. • Can allocate exact size within any alignment restrictions. Thus no internal fragmentation. • Client may release portion of allocated memory. • adjacent free regions are coalesced • Disadvantages: • Map may become highly fragmented resulting in low utilization. Poor for performing large requests. • Resource map size increases with fragmentation (i.e. one entry per free region) • static table will overflow • dynamic table needs it’s own allocator • Map must be sorted for free region coalescing. Sorting operations are expensive. • Requires linear search of map to find free region that matches allocation request. • Difficult to return borrowed pages to paging system. CSE522– Advanced Operating Systems
Resource map – final word • Poor performance is cited as the reason for not using resource maps as a general-purpose kernel memory allocator. <X3,Y> <X0,Y0> <X1,Y1> <X2,Y> <X4,Y> <X5,Y5> <X6,Y6> CSE522– Advanced Operating Systems
Simple Power of Twos • has been used to implement malloc() and free() in the user-level C library (libc). • Uses a set of free lists with each list storing a particular size of buffer. Buffer sizes are a power of two. • Each buffer has a one word header • when free, header stores pointer to next free list element • when allocated, header stores pointer to associated free list (where it is returned to when freed). Alternatively, header may contain size of buffer CSE522– Advanced Operating Systems
32 64 128 256 512 1024 Simple Power of Two Free List • Example free list • One word header per buffer (pointer) • malloc(X): size = roundup(X + sizeof(header)) • roundup(Y) = 2^n, where 2^(n-1) < Y <= 2^n • free(buf) must free entire buffer. freelistarr[] CSE522– Advanced Operating Systems
Simple Power of Twos – Pros/Cons • Advantages: • Simple and reasonably fast • eliminates linear searches and fragmentation -> Bounded time for allocations when buffers are available • familiar API • simple to share buffers between kernel modules since free’ing a buffer does not require knowing its size • Disadvantages: • Rounding requests to power of 2 results in wasted memory and poor utilization. Aggravated by requiring buffer headers since it is not unusual for memory requests to already be a power-of-two. • no provision for coalescing free buffers since buffer sizes are generally fixed. • no provision for borrowing pages from paging system although some implementations do this. • no provision for returning unused buffers to page allocator CSE522– Advanced Operating Systems
Simple Power of Two #define PT_BUFHDR_SZ 4 void *malloc (size) { int ndx = 0; /* free list index */ int bufsize = 1 << MINPOWER /* size of smallest buffer */ size += PT_BUFHDR_SZ ; /* Add for header */ assert (size <= MAXBUFSIZE); while (bufsize < size) { ndx++; bufsize <<= 1; } /* ndx is the index on the freelist array * where the buffer will be allocated from */ … } CSE522– Advanced Operating Systems
McKusick-Karels Allocator • Improved power of twos implementation • Managed Memory must be contiguous pages • All buffers within a page must be of equal size • Does not require buffer headers to indicate page size. When freeing memory, free(buff) simply masks of the lower order bits to get the page address which is used as an index into the kmemsizes array. • Add an allocated page usage array (kmemsizes[]) where a page (pg) is in one of three states: • free: kmemsizes[pg] contains pointer to next free page • Dividedinto buffers: kmemsizes[pg] == buffer size • Part of a buffer Spanning multiple pages: kmemsizes[first_pg] = buffer size, where first_page is the first page of the buffer CSE522– Advanced Operating Systems
McKusick-Karels Allocator • Disadvantages: • similar drawbacks to simple power-of-twos allocator • vulnerable to bursty usage patterns since no provision for moving buffers between lists • Advantages: • eliminates space wastage in common case where allocation request is a power-of-two • optimizes round-up computation and eliminates it if size is known at compile time CSE522– Advanced Operating Systems
McKusick-Karels freelistarr[] 32 128 64 512 F 32 F ... Power of Twos 32 64 128 256 512 1024 No buffer header vulnerable to bursty usage memory not returned to paging system kmemsizes[], managed page map Size of blocks allocated from pages, pointer to free pages CSE522– Advanced Operating Systems
McKusick-Karels – ExampleMacros #define NDX(size) \ (size) > 128 ? ((size) > 256 ? 4 : 3) \ : (size) > 64 ? 2 : (size) > 32 ? 1 : 0 #define MALLOC(space, cast, size, flags) \ { \ register struct freelisthdr *flh; \ if(size<=512 && (flh=freelistarr[NDX(size)])!=NULL) {\ space = (cast)flh->next; \ flh->next = *(caddr_t *)space; \ } else \ space = (cast)malloc (size, flags); \ } CSE522– Advanced Operating Systems
Buddy System • Allocation scheme combining Power-of-Two allocator with free buffer coalescing • binary buddy system: simplest and most popular form. Other variants may be used by splitting buffers into four, eight or more pieces. • Approach: create small buffers by repeatedly halving a large buffer (buddy pairs) and coalescing adjacent free buffers when possible. • Requests rounded up to a power of two CSE522– Advanced Operating Systems
Buddy System, Example • Minimum allocation size = 32 Bytes • Initial free memory size is 1024 • use a bitmap to monitor 32 Byte chunks • bit set if chunk is used • bit clear if chunk is free • maintain freelist for each possible buffer size • power of 2 buffer sizes from 32 to 512 • sizes = {32, 64, 128, 256, 512} • Initial one block = entire buffer CSE522– Advanced Operating Systems
1 1 1 1 1 1 1 1 0 0 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 Buddy System in Action In-use Free 32 64 128 256 512 freelist 0 1023 256 384 448 512 640 768 B C D D’ F F’ E’ Bitmap (32 B chunks) allocate(256), allocate(128), allocate(64), allocate(128), release(C, 128) What happens when D is released? release (D, 64) 1 CSE522– Advanced Operating Systems
1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 Buddy System: Releasing D In-use Free 32 64 128 256 512 freelist 0 1023 256 512 640 768 B F F’ E’ B’ 64 128 256 512 When D is freed the system coalesces D with D’ then C with C’ (D & D’). Bitmap (32 B chunks) 1 CSE522– Advanced Operating Systems
Buddy System • Advantages: • good job of coalesces adjacent free buffers • easy exchange of memory with paging system • can allocate new page and split as necessary • when coalescing results in a complete page, it may be returned to the paging system • Disadvantage: • performance • recursive coalescing is expensive with poor worst case performance • back to back allocate and release result in alternate between splitting and coalescing the same memory • poor programming interface: • release needs both buffer and size. • entire buffer must be released CSE522– Advanced Operating Systems
SVR4 Lazy Buddy Algorithm • Addresses the main problem with the buddy system: poor performance due to repetitive coalescing and splitting of buffers. • Under steady state conditions, the number of in-use buffers for each size remains relatively constant. Under these condition, coalescing offers no advantage. • Coalescing is necessary only to deal with bursty conditions where there are large fluctuations in memory demand. CSE522– Advanced Operating Systems
SVR4 Lazy Buddy Algorithm • Coalescing delay – time taken to either coalesce a single buffer with its buddy or determine its buddy is not free • Coalescing is recursive and doesn’t stop until a buffer is found which can not be combined with its buddy. Each release operation results in at least one coalescing delay • Solution: • defer coalescing until it is necessary – results in poor worst-case performance • lazy coalescing – intermediate solution CSE522– Advanced Operating Systems
Lazy Coalescing • Release operation has two steps • place buffer on free list making it available for use. locally free • mark as free in bitmap and attempt to coalesce with adjacent buffers. globally free • Buffers divided into classes. • Assume N buffers in a given class. N = A + L + G, where A = number of active buffers, L = number of locally free buffers and G = number of globally free buffers • Buffer class states defined by slack = N – 2L – G • slack >= 2: lazy –steady state, coalescing not necessary. • slack == 1: reclaiming – borderline consumption, coalescing needed. Buffe is marked as free and coalesced (if possible). • slack == 0: acelerated – non-steady state consumption, must coalesce faster. Buffer marked as free and coalesced. One additional delayed buffer is also coalesced. • Coalesced buffers are passed to he next larger buffer class. CSE522– Advanced Operating Systems
Lazy Coalescing • Improvement over basic buddy system • steady state all lists in lazy state and no time is wasted splitting and coalescing • worst case algorithm limits coalescing to no more than two buffers (two coalescing delays) • shown to have an average latency 10% to 32% better than the simple buddy system • greater variance and poorer worst case performance for the release routine CSE522– Advanced Operating Systems
MACH Zone Allocator • Used in Mach and DEC UNIX • fast memory allocation and garbage collection in the background • Each class of object (proc structure, message headers etc) assigned its own pool of free objects (aka a zone) • set of power-of-two zones for anonymous objects • memory allocated from page-allocator • page can only hold objects in the same zone • struct zone heads free list of objects for each zone • zone structs are allocated from a zone of zones CSE522– Advanced Operating Systems
MACH Zone Allocator • zone initialization: zinit( size, max, alloc, name) • size = object size in bytes • max = max size of zone in bytes • alloc = amount of memory to add when free list empty • name = string describing objects in zone • a zone struct is allocated where parameters are stored • active zone structures are kept on an linked list referenced by first_zone and last_zone. First element on list is the zone of zones. • allocations are very fast since object sizes are fixed. CSE522– Advanced Operating Systems
Mach Zone Garbage Collection • Garbage Collection happens in the background to free unused memory. • free unused pages • expensive, linear search, locks • keeps a zone page map referring to all pages allocated to zones. Each map contains • in_free_list = number of objects from that page on the free list. Calculated during garbage collection. • alloc_count = total number of objects from page assigned to zone. • Zone garbage collector: zone_gc()is invoked by swapper each time it runs. It walks through zone list and for each zone it makes two passes through the free list • first pass scanes free elements and increments in_free_list for each found. If in_free_list equals alloc_count then page can be freed. • second pass all objects in freed pages are removed from the zone’s free list. • after pass two it calls kmem_free() to free the pages CSE522– Advanced Operating Systems
MACH Zone Allocator • Advantages • fast and efficient with a simple API • obj = (void *)zalloc (struct zone* z); • void zfree (struct zone *z, void *obj); • objects are exact size • gc provides for memory reuse • Disadvantages • no provision for releasing part of an object • gc efficiency is an issue: • slow and must complete before other system run • complex and inefficient algorithm • since zone allocator is used by the memory manager • zone of zones statically allocated CSE522– Advanced Operating Systems
struct zone struct zone struct zone struct zone struct Obj X struct Obj X struct Obj Y struct Obj Y struct Obj Y Mach Zones zone of zones first zone last zone Object X zone Object Y zone CSE522– Advanced Operating Systems
Solaris Slab Allocator • Better performance and memory utilization than other approaches studied. Based on Mach’s Zone Allocator. Slab allocator is concerned with object state while the zone allocator is not • Focus on object reuse, HW cache utilization allocator footprint • Object Reuse (i.e. Object Caching!). • Allocator must perform 5 operations: • 1) allocate memory, 2) initialize object, 3) use object, 4) destruct object, 5) free memory • HW Cache Utilization • Due to typical buffer address distributions cache contention may occur resulting in its poor utilization • slab allocator addresses this • Footprint – portion of the hardware cache and TLB that is overwritten by the allocation itself. • large footprint can displace data in cache and TLB • small footprint desirable CSE522– Advanced Operating Systems
vnode vnode vnode vnode mbuf mbuf msgb msgb msgb msgb msgb proc proc proc Design of Slab Allocator cachep = kmem_cache_create (name, size, align, ctor, dtor); page-level allocator back end vnode cache proc cache mbuf cache msgb cache front end Objects in use by the kernel CSE522– Advanced Operating Systems
Slab Allocator Design • Organized as a collection of caches • mechanisms for adding or removing memory from a cache • Create object cache with object ctor and dtor. • Allocating and freeing objects • this interface does not use the ctor and dtor so the kernel should initialize objects before freeing. • on allocating or freeing the in-use count is updated eliminating the need for a two pass GC. • objp = kmem_cache_alloc (cachep, flags); • kmem_cache_free (cachep, objp); • Cache grows by allocating a “slab”, contiguous pages managed as a unit. • part of the memory is used for managing the cace the rest is divided into a set of free objects • each object as the cache’s ctor called for it • When page level allocator needs memory returned it calls mem_cache_reap (cachep). The cache’s dtor is called for freed objects CSE522– Advanced Operating Systems
Slab Organization Slab linked list Unused space Coloring area slab struct free active free active active free 32 Byte kmem_slab free list memory used for free list Coloring area - vary starting offsets, optimize HW cache and bus CSE522– Advanced Operating Systems
Summary • Limited space overhead (kmem_slab, free ptr) • service requests quickly (remove a pre-initialized object) • coloring scheme for better HW utilization • small footprint • simple garbage collection (cost shared across multiple requests – in_use count) • Increased management overhead compared to simpler methods (power of twos) CSE522– Advanced Operating Systems