Storage Objects and Maps Jeff Chase Duke University
Abstraction: files Program A Program B write (“def”) open “/a/b” write (“abc”) open “/a/b” read read Library Library system call trap/return OS kernel
Names and layers User view notes in notebook file Application notefile: fd, byte range* fd File System bytes block# device, block # Disk Subsystem surface, cylinder, sector Add more layers as needed.
VM and files: the story so far Process (running program) Files on “disk” File system calls (e.g., open/read/write) globals Initialized from program file text Thread Program heap register context Anonymous Segments (zero-fill) stack Segments (regions) in Virtual Address Space
Running a program data code (“text”) constants initialized data virtual memory sections segments Mapped file regions Program When a program launches, the OS initializes a virtual memory to store the running program’s code and data. Typically it sets up the segments by “mapping” sections of the executable file.
Linux x86-64 layout: illustrated The details aren’t important. 0x400000 text r-x 0x600000 idata r-- Program 0x601000 data 0x1299000 rw- heap rw- [anon] N high addresses 0x2ba976c30000 lib r-x lib r-x libc.so shared library 0x7fff1373b000 stack 64K rw- [anon] 0x7fff1375c000
BSD/Mach vm_map Inside the kernel, Mac-OSX and BSD Unix represent a virtual address space as a list of segments/regions. Each segment has a backing object, where missing pages can be found on disk. “It’s just a file.” [http://people.engr.ncsu.edu/efg/501/f98/lectures/annotations/a19a.html]
Inside the VAS This is a data structure used in real OS kernels. The triangles represent VM objects (segments). The dots represent pages within segments. A segment may have any number of pages. Text and initialized static data are mapped from the executable file. (heap) “Vnode” refers to the inode for the underlying (backing) file. The stack and heap are zero-filled virtual memory: called anonymous because the backing file has no name (i.e., no links: it is destroyed if the process dies). [http://manrix.sourceforge.net/microkernelservice.htm]
Memory/storage abstractions • We have discussed three abstractions for memory/storage presented to user programs. • Variably-sized storage objects: named sequences of bytes • Heap blocks • Virtual memory segments • Files • How are these three abstractions different?
Memory/storage abstractions • We have discussed three abstractions for memory/storage presented to user programs. • Variably-sized storage objects: named sequences of bytes • Heap blocks • Virtual memory segments • Files • Issues • Naming and access • Storage management: allocation/deallocation • Can they grow? Persist? How big/small?
Files as “virtual storage” • Files have variable size. • They grow (when a process writes more bytes past the end) and they can shrink (e.g., see truncate syscall). • Most files are small, but most data is in large files. • Even though there are not so many large files, some are so large that they hold most of the data. • These “facts” are often true, but environments vary. • Files can be sparse, with huge holes in the middle. • Create a file, seek to location X, write 1 byte. How big is the file? • Files come and go; some live long, some die young. • How to implement diverse files on shared storage?
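The "seek to location X, write 1 byte" question can be tried directly. The following is a minimal sketch, assuming a POSIX system; the helper name `make_sparse_file` and any path passed to it are hypothetical and not part of the original slides.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) the file at `path`, seek to byte offset x, and write
 * one byte there. Returns the resulting logical file size, or -1 on error.
 * `make_sparse_file` is a hypothetical helper for this illustration. */
off_t make_sparse_file(const char *path, off_t x)
{
    assert(path != NULL && x >= 0);
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    off_t size = -1;
    struct stat st;
    /* Seeking past end-of-file is legal; the gap becomes a hole. */
    if (lseek(fd, x, SEEK_SET) == x &&
        write(fd, "!", 1) == 1 &&
        fstat(fd, &st) == 0)
        size = st.st_size;          /* logical size: x + 1 bytes */

    close(fd);
    return size;
}
```

The logical size reported is X + 1 bytes regardless of how much storage is actually allocated. How many blocks really back the hole depends on the file system; the `st_blocks` field of `struct stat` is one way to look.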
Memory/storage allocation and the parking lot analogy What is the resource to be allocated? Let’s be clear: • A heap manager allocates virtual address space for heap blocks within the heap segment. • Address space layout is a similar problem: how to fit variable-size segments within a linear address space? • Allocating storage for files and segments is different. • For files, the problem is to allocate storage for the file data on the “disk” (or other persistent storage device). • For segments, the problem is to allocate machine memory to store the segment data, or at least part of it. • These objects are “logically” contiguous, but their data need not be contiguous on disk or in machine memory: we can allocate storage in “pieces” and cobble them together.
Block maps Large storage objects (e.g., files, segments) may be mapped so they don’t have to be stored contiguously in memory or on disk. object Idea: use a level of indirection through a map to assemble a storage object from “scraps” of storage in different locations. The “scraps” can be fixed-size slots: that makes allocation easy because the slots are interchangeable (fixed partitioning). Fixed-size chunks of data or storage are called blocks or pages. map Examples: page tables that implement a VAS. One issue now is that each access must indirect through the map…
Using block maps File allocation is different from heap allocation. • Blocks allocated from a heap must be contiguous in the virtual address space: we can’t chop them up. • But files are accessed through e.g. read/write syscalls: the kernel can chop them up, allocate space in pieces, and reassemble them. • Allocate in units of fixed-size logical blocks (e.g., 4KB, 8KB). • Each logical block in the object has an address (logical block number or blockID): a block offset within the object. • Use a block map data structure. • Also works for other kinds of storage objects • Page tables, virtual storage volumes Index map with name, e.g., logical blockID #. Read address of the block from map entry.
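The "index the map with the logical block number, read the block address from the entry" step can be sketched as follows. This is an illustration only: the flat `struct block_map`, the `NO_BLOCK` marker, and the 4KB block size are assumptions made for the example, not structures from any particular kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096u          /* assumed logical block size (4KB) */
#define NO_BLOCK   UINT32_MAX     /* hypothetical marker: hole / unallocated */

/* A hypothetical flat block map: entry i holds the storage block number
 * where logical block i of the object lives. */
struct block_map {
    size_t    nblocks;            /* number of map entries */
    uint32_t *entries;            /* entries[logical block #] = storage block # */
};

/* Translate a byte offset within the object into (storage block, offset
 * within block). Returns 0 on success, -1 if the offset is unmapped. */
int block_map_lookup(const struct block_map *m, uint64_t byte_off,
                     uint32_t *storage_block, uint32_t *off_in_block)
{
    assert(m != NULL && storage_block != NULL && off_in_block != NULL);
    uint64_t lbn = byte_off / BLOCK_SIZE;     /* logical block number */
    if (lbn >= m->nblocks || m->entries[lbn] == NO_BLOCK)
        return -1;
    *storage_block = m->entries[lbn];         /* one indirection through the map */
    *off_in_block  = (uint32_t)(byte_off % BLOCK_SIZE);
    return 0;
}
```

Because every block is the same size, any free storage block can back any logical block; the cost is the extra lookup on each access.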
Inodes and file block maps A file’s data blocks could be “anywhere” on disk. The file’s inode maps them. Each entry of the map gives the disk location for the corresponding logical block. A fixed-size inode has a fixed-size block map. How to represent large files that have more logical blocks than can fit in the inode’s map? [Figure: an inode holding attributes and a block map; the map entries point to data blocks on disk that together hold the text “Once upon a time /nin a land far far away, /nlived the wise and sage wizard.”] An inode could be “anywhere” on disk. How to find the inode for a given file? Inodes are uniquely numbered: we can find an inode from its number.
To put it another way • Variable partitioning is a pain. We need it for heaps, and for other cases (e.g., address space layout). • But for files we can break the objects down into “pieces”. • When access to files is through an API, we can add some code behind that API to represent the file contents with a dynamic linked data structure (a map). • If the pieces are fixed-size (called pages or logical blocks), we can use fixed partitioning to allocate the underlying storage, which is efficient and trivial. • With that solution, internal fragmentation is an issue, but only for small objects. (Why?) • That approach can work for VM segments too: we have VM hardware to support it (since the 1970s).
VM page maps Machine This picture is an example of a virtual memory on a 32-bit machine. Details vary. Global data and dynamic (“heap”) memory. A key role of the operating system is to manage the VM abstraction.
Virtual memory 0: 1: CPU N-1: Memory Page Table Virtual Addresses Physical Addresses 0: 1: P-1: Disk VMs (or segments) are storage objects described by maps. A page table is just a block map of one or more VM segments in memory. The hardware hides the indirection from user programs. CMU 15-213
Memory-mapped files In a modern OS, we can create a VM segment as a “window” on an existing file. Then we access the file through the virtual address space as an alternative to using read/write system calls. See the Unix mmap system call. [http://infohost.nmt.edu/~eweiss/222_book/222_book/0201433079/ch14lev1sec9.html]
Man mmap #include <sys/mman.h> void *mmap(void *addr, size_t len, int prot, int flag, int filedes, off_t off); Returns: starting address of mapped region if OK, MAP_FAILED on error The mmap() system call causes the pages starting at addr and continuing for at most len bytes to be mapped from the object described by fd, starting at byte offset offset. If offset or len is not a multiple of the pagesize, the mapped region may extend past the specified range. Any extension beyond the end of the mapped object will be zero-filled. The addr argument is used by the system to determine the starting address of the mapping….
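To make the "window on a file" idea concrete, here is a minimal sketch that reads a file through a mapping instead of through read calls. It assumes a POSIX system; the helper name `count_byte_mmap` is hypothetical, and a private read-only mapping is just one reasonable choice of flags.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count occurrences of byte `c` in the file at `path` by mapping the file
 * and scanning it as ordinary memory. Returns -1 on error.
 * `count_byte_mmap` is a hypothetical helper for this illustration. */
long count_byte_mmap(const char *path, char c)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {            /* mmap rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        /* the mapping survives the close */
    if (p == MAP_FAILED)
        return -1;

    long n = 0;
    for (size_t i = 0; i < len; i++)  /* page faults pull data in on demand */
        if (p[i] == c)
            n++;

    munmap(p, len);
    return n;
}
```

Passing NULL for `addr` lets the system pick the starting address. No read system call appears in the loop: the file's pages are brought in by the VM system as the loop touches them.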
Memory as a cache Programs access storage objects through file APIs and VM abstraction. The OS kernel manages caching of pages (e.g., 4KB) in main memory. virtual address spaces data data files and filesystems, databases, other storage objects disk and other storage network RAM memory (frames) backing storage volumes (pages and blocks) Page read/write accesses
Memory/storage hierarchy Computing happens here, at the tip of the spear. The cores pull data up through the hierarchy into registers, and then push updates back down. small and fast (ns) registers caches L1/L2 In general, each layer is a cache over the layer below. off-core L3 off-chip main memory (RAM) big and slow (ms) You are here. off-module disk, other storage, network RAM Cheap bulk storage
Virtual addressing The machine allows a user process to access memory only by a valid translation in the page table. virtual memory (big?) machine memory (small?) Code running on a core addresses memory through virtual addresses. The machine translates virtual addresses via an in-memory page table. The OS controls the contents of the page table. The page table represents a functional mapping of virtual pages (VPNs) to page frames (PFNs) for resident pages. If a page is not resident in memory, then its page table entry is marked as invalid. The specific mechanisms for virtual address translation are machine-dependent.
Cartoon view of a page table process page table (map) This is an example. Any PFN may be used for any VPN. PFN x PFN y The map itself is just another data structure stored in memory. A protected CPU register holds the machine address of the current map. PFN i PFN i + offset VPN #i offset virtual address physical memory page frames Virtual page: a logical block in a segment. VPN: Virtual Page Number (a logical block number). Pageframe: a physical block in machine memory. PFN: Page Frame Number (a block pointer). PTE: Page Table Entry (an entry in the block map).
Virtual Address Translation Example only: a typical 32-bit architecture with 4KB pages. { 12 0 VPN offset virtual address Virtual address translation maps a virtual page number (VPN) to a page frame number (PFN) in machine memory: the rest is easy. address translation Deliver fault to OS if translation is not valid and accessible in requested mode. { + PFN machine address offset
Virtual Addressing: Under the Hood probe page table MMU access physical memory load TLB start here yes miss probe TLB access valid? raise exception hit no load TLB zero-fill OS no (first reference) page on disk? page fault? (lookup and/or) allocate frame fetch from disk kill yes legal reference illegal reference How to monitor page reference events/frequency along the fast path?
Fall 2014 Note • The remaining slides were not covered in class. • You should understand how to index a linear map: explained in more detail on “making change” slides. • The hierarchical map slides are “illustration only”, at least for now. You should know this, but we might not cover it. • Slides on inode maps and page/block caching will reappear when we do file systems later in the semester. • Please review the “what you should know about caching” slide so you are ready when we get to that material. • The “VM layout” slides are “illustration only”. You will not be tested on specific VM layouts (addresses, etc.).
DIV/MOD is easy (in base ten) • Suppose I have a pocketful of dimes and pennies. • Suppose I count out 58 cents. • How many dimes? • How many pennies? • Think of it this way: • My money is in blocks of 10 units. • The pennies are an offset in the block. • The math is easy: just shift/mask! DIV and MOD are trivial if the divisor is the base of the number system (e.g., 10).
“Making change” in binary or hex • To divide (DIV) by 2^n, shift right by n bits. • Drop off the low-order n bits: C “>>” operator. • The result is your logical block number or VPN. • The MOD (remainder) is just the low-order n bits. • remainder = x AND (2^n – 1): C “&” operator (bitwise AND) • That’s your offset. • Note: ALIGN is also easy, e.g.: • Round up to the nearest dime: “keep the change”. • Add 9 (one less than the block size), then DIV. 4K page = 2^12 bytes virtual address VPN offset 12 bits
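The shift/mask arithmetic can be written out in C as a short sketch. It assumes 32-bit virtual addresses and 4KB pages, as in the slide; the function names are hypothetical. The rounding helper adds one less than the page size before dividing, so that an exact multiple is not rounded up to the next page.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                          /* 4KB page = 2^12 bytes */
#define PAGE_SIZE  ((uint32_t)1 << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1)              /* low-order 12 bits set */

/* DIV: drop the low-order bits to get the virtual page number. */
uint32_t vpn_of(uint32_t vaddr)    { return vaddr >> PAGE_SHIFT; }

/* MOD: keep only the low-order bits to get the offset within the page. */
uint32_t offset_of(uint32_t vaddr) { return vaddr & PAGE_MASK; }

/* ALIGN: round a byte count up to a whole number of pages. */
uint32_t pages_needed(uint32_t nbytes)
{
    assert(nbytes <= UINT32_MAX - PAGE_MASK);   /* avoid wraparound */
    return (nbytes + PAGE_MASK) >> PAGE_SHIFT;
}
```

For the address 0x12345678, the VPN is 0x12345 and the offset is 0x678: in hex, the split falls exactly on a digit boundary because 12 bits is three hex digits.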
Hierarchical block maps • We can extend the block map with one or more levels. • Now indexing is a multi-step process to traverse each level. • But if areas of the map are empty, we don’t need to allocate space for them. • We can leave whole branches of the tree empty. • A hierarchical map can represent objects that are large and/or sparse efficiently. • Used for both files/inodes and page tables.
IA32 [http://www.cs.rutgers.edu/~pxk/416/notes/09a-paging.html] X86-64 From “Porting NetBSD to the AMD x86-64: a case study in OS portability”, Frank van der Linden
Example: Windows/IA32 • Two-level block map (page table) structure reduces the space overhead for block maps in sparse virtual address spaces. • Many process address spaces are small: e.g., a page or two of text, a page or two of stack, a page or two of heap. • Windows provides a simple example of a hierarchical page table: • Each address space has a page directory (“PDIR”) • The PDIR is one page: 4K bytes, 1024 4-byte entries (PTEs) • Each PDIR entry points to a map page, which MS calls a “page table” • Each map page (“page table”) is one page with 1024 PTEs • Each PTE maps one 4K virtual page of the address space • Therefore each map page (page table) maps 4MB of VM: 1024*4K • Therefore one PDIR maps a 4GB address space, max 4MB of tables • Load PDIR base address into a register to activate the VAS
Two-level page table 32-bit virtual address Two 10-bit page table index fields (PT1, PT2) (10 bits represents index values 0-1023) Page table structure for Windows/IA32 2L= second level Step 2. Index 2L page table with PT2 Step 1. Index PDIR with PT1 virtual address 32 bits [from Tanenbaum]
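The two indexing steps can be modeled in software. This is a sketch under stated assumptions, not the IA32 hardware format: the structures hold C pointers rather than physical frame numbers, and the `PTE_VALID` bit position and all names are hypothetical. Only the 10/10/12 split of the virtual address comes from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_VALID 0x1u     /* hypothetical valid bit in the low bits of an entry */

/* Hypothetical software model: a second-level table holds 1024 PTEs, and the
 * page directory holds 1024 pointers to second-level tables (NULL = the
 * whole 4MB region is unmapped, so no table is allocated for it). */
struct page_table { uint32_t pte[1024]; };
struct page_dir   { struct page_table *pt[1024]; };

/* Walk both levels. Returns 0 and stores the machine address on success,
 * -1 where real hardware would raise a page fault. */
int translate(const struct page_dir *pdir, uint32_t vaddr, uint32_t *maddr)
{
    assert(pdir != NULL && maddr != NULL);
    uint32_t pt1    = vaddr >> 22;             /* top 10 bits: PDIR index */
    uint32_t pt2    = (vaddr >> 12) & 0x3FFu;  /* next 10 bits: table index */
    uint32_t offset = vaddr & 0xFFFu;          /* low 12 bits: byte in page */

    const struct page_table *pt = pdir->pt[pt1];   /* Step 1: index PDIR */
    if (pt == NULL)
        return -1;
    uint32_t pte = pt->pte[pt2];                   /* Step 2: index 2L table */
    if (!(pte & PTE_VALID))
        return -1;
    *maddr = (pte & ~0xFFFu) | offset;         /* frame base + offset */
    return 0;
}
```

The space saving shows up in the NULL directory entries: a small process that touches only a few regions needs only a few second-level tables, not all 1024.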
Representing Large Files inode Classical Unix file systems inode == 128 bytes Each inode has 68 bytes of attributes and 15 block map entries that are the root of a tree-structured block map. direct block map indirect block double indirect block Suppose block size = 8KB 12 direct block map entries: map 96KB of data. One indirect block pointer in inode: + 16MB of data. One double indirect pointer in inode: +2K indirects. Maximum file size is 96KB + 16MB + (2K*16MB) + ... indirect blocks The numbers on this slide are for illustration only.
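The arithmetic on this slide can be checked with a short sketch. It uses the slide's illustrative parameters (8KB blocks, 12 direct entries) and assumes 4-byte block pointers, which is what makes an indirect block hold 2K entries. The function names are hypothetical, and the sketch stops at the double indirect level.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative parameters from the slide (not any specific file system). */
#define BLOCK_SIZE   8192u                    /* 8KB blocks */
#define NDIRECT      12u                      /* direct entries in the inode */
#define PTRS_PER_BLK (BLOCK_SIZE / 4u)        /* assumes 4-byte block pointers */

/* How many map blocks must be read (beyond the inode) to reach the given
 * logical block: 0 = direct, 1 = single indirect, 2 = double indirect,
 * -1 = beyond what direct + single + double indirect can map. */
int indirection_levels(uint64_t lbn)
{
    if (lbn < NDIRECT)
        return 0;
    lbn -= NDIRECT;
    if (lbn < PTRS_PER_BLK)
        return 1;
    lbn -= PTRS_PER_BLK;
    if (lbn < (uint64_t)PTRS_PER_BLK * PTRS_PER_BLK)
        return 2;
    return -1;
}

/* Bytes mappable through the direct, single, and double indirect entries. */
uint64_t max_bytes_two_levels(void)
{
    assert(PTRS_PER_BLK == 2048u);            /* the slide's "2K" */
    uint64_t blocks = NDIRECT + PTRS_PER_BLK
                    + (uint64_t)PTRS_PER_BLK * PTRS_PER_BLK;
    return blocks * BLOCK_SIZE;
}
```

With these numbers the first 96KB of a file costs no extra map reads, the next 16MB costs one, and the following 32GB (2K * 16MB) costs two.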
Skewed tree block maps • Inodes are the root of a tree-structured block map. • Like hierarchical page tables, but • These maps are skewed. • Low branching factor at the root. • The higher you go, the bushier they get. • Small files are cheap: just need the inode to map it. • …and most files are small. • Use indirect blocks for large files. • Requires another fetch for another level of map block • But the shift to a high branching factor covers most large files. • Double indirect blocks allow very large files.
Names and maps • Block maps and other indexed maps are a common structure for implementing “machine” name spaces: • sequences of logical blocks, e.g., virtual address spaces, files • process IDs, etc. • For sparse block spaces we may use a tree hierarchy of block maps (e.g., inode maps or 2-level page tables, later). • Storage system software is full of these maps. • Symbolic name spaces use different kinds of maps. • They are sparse and require matching, which is more expensive. • Property list, key/value hash table • Trees of maps create nested namespaces, e.g., the file tree.
Post-note: what to know about maps • What is the space overhead of the maps? Quantify. • Understand how to look up in a block map: logical block + offset addressing, arithmetic to find the map entry. • Understand hierarchical maps as a general concept with multiple instances (inodes, hierarchical page tables). • Design tradeoffs for hierarchical maps. • Pro: less space overhead for sparse spaces. • Con: more space overhead overall, e.g., if space is not sparse. • Con: more complexity, multiple levels of translation. • Skew: why is it better for small files? What tradeoff? • No need to memorize the various parameters for inode maps or hierarchical page tables: concept only. But be sure to understand the concept.
Page/block caching • So now we understand files and segments as sequences of fixed-size pages (or logical blocks). • We can keep any subset of an object’s pages in memory while the object is in active use. • If we need a missing page from a file, we know where to get it: just look it up in the inode block map to find the location on disk. • E.g., read and write system calls operate on copies of file blocks in memory: memory is a writeback cache over the file system. • Similarly, if memory is too small to store all the segments, we store some of the pages in files until we need them again. • Modern operating systems manage memory as a unified cache of blocks/pages from files and segments.
Caching: terms/concepts to know Terms to know • cache index/directory • cache line/entry, associativity • cache hit/miss, hit ratio • spatial locality of reference • temporal locality of reference • eviction / replacement • write-through / writeback • dirty/clean You learned about these in computer architecture, using the hardware caches (L1-L3, TLB) as examples. But caching is very general and appears everywhere in systems. Caching: keep a copy of a selected subset of high-value data in storage or memory that is faster/cheaper to access than wherever we got the data from. Caching improves performance by reducing the need to access the slow/expensive place. We can build caches in software.
x64, x86-64, AMD64: VM Layout VM page map Source: System V Application Binary Interface AMD64 Architecture Processor Supplement 2005
linux11:~/www/cps310/c-samples> ./structs
^Z
Suspended
linux11:~/www/cps310/c-samples> ps x
  PID TTY      STAT   TIME COMMAND
23760 ?        S      0:00 sshd: chase@pts/2
23761 pts/2    Ss     0:00 -tcsh
23866 pts/2    T      0:04 ./structs
linux11:~/www/cps310/c-samples> cd /proc/23866
linux11:/proc/23866> cat maps
00400000-00401000 r-xp 00000000 00:1e 25122468 compsci310/c-samples/structs
00600000-00601000 r--p 00000000 00:1e 25122468 compsci310/c-samples/structs
00601000-00602000 rw-p 00001000 00:1e 25122468 compsci310/c-samples/structs
01299000-012ba000 rw-p 00000000 00:00 0 [heap]
2ba976c30000-2ba976c52000 r-xp 00000000 08:11 1062809 /lib/x86_64-linux-gnu/ld-2.15.so
2ba976c52000-2ba976c55000 rw-p 00000000 00:00 0
2ba976e52000-2ba976e53000 r--p 00022000 08:11 1062809 /lib/x86_64-linux-gnu/ld-2.15.so
2ba976e53000-2ba976e55000 rw-p 00023000 08:11 1062809 /lib/x86_64-linux-gnu/ld-2.15.so
2ba976e55000-2ba97700a000 r-xp 00000000 08:11 1062797 /lib/x86_64-linux-gnu/libc-2.15.so
2ba97700a000-2ba97720a000 ---p 001b5000 08:11 1062797 /lib/x86_64-linux-gnu/libc-2.15.so
2ba97720a000-2ba97720e000 r--p 001b5000 08:11 1062797 /lib/x86_64-linux-gnu/libc-2.15.so
2ba97720e000-2ba977210000 rw-p 001b9000 08:11 1062797 /lib/x86_64-linux-gnu/libc-2.15.so
2ba977210000-2ba977217000 rw-p 00000000 00:00 0
7fff1373b000-7fff1375c000 rw-p 00000000 00:00 0 [stack]
7fff137ef000-7fff137f0000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
linux11:/proc/23866>
If we really want to know what the address space really looks like… [On 64-bit linux]
For your interest: • [http://stackoverflow.com/questions/1401359/understanding-linux-proc-id-maps] • Each row in /proc/$PID/maps describes a region of contiguous virtual memory in a process … • Each row has the following fields: • address - This is the starting and ending address of the region in the process's address space • permissions - This describes how pages in the region can be accessed. There are four different permissions: read, write, execute, and shared. If read/write/execute are disabled, a '-' will appear instead of the 'r'/'w'/'x'. If a region is not shared, it is private, so a 'p' will appear instead of an 's'. If the process attempts to access memory in a way that is not permitted, a segmentation fault is generated. Permissions can be changed using the mprotect system call. • offset - If the region was mapped from a file (using mmap), this is the offset in the file where the mapping begins. If the memory was not mapped from a file, it's just 0. • device - If the region was mapped from a file, this is the major and minor device number (in hex) where the file lives. • inode - If the region was mapped from a file, this is the file number. • pathname - If the region was mapped from a file, this is the name of the file. This field is blank for anonymous mapped regions. There are also special regions with names like [heap], [stack], or [vdso]. [vdso] stands for virtual dynamic shared object. It's used by system calls to switch to kernel mode. [see http://www.trilithium.com/johan/2005/08/linux-gate/] • You might notice a lot of anonymous regions. These are usually created by mmap but are not attached to any file. They are used for a lot of miscellaneous things like shared memory or buffers not allocated on the heap. For instance, ...the pthread library uses anonymous mapped regions as stacks for new threads.
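The field layout described above is regular enough to parse with a single format string. The following is a minimal sketch, assuming a 64-bit system where `unsigned long` can hold the addresses; the struct, its field names, and `parse_map_row` are hypothetical and not taken from any library.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed row of /proc/<pid>/maps. Field names are hypothetical. */
struct map_row {
    unsigned long start, end;     /* address range of the region */
    char perms[5];                /* e.g. "r-xp" */
    unsigned long offset;         /* offset into the backing file */
    unsigned int dev_major, dev_minor;
    unsigned long inode;          /* 0 for anonymous regions */
    char path[256];               /* empty for anonymous regions */
};

/* Parse one line; returns 0 on success, -1 if the line is malformed. */
int parse_map_row(const char *line, struct map_row *r)
{
    assert(line != NULL && r != NULL);
    r->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255[^\n]",
                   &r->start, &r->end, r->perms, &r->offset,
                   &r->dev_major, &r->dev_minor, &r->inode, r->path);
    return n >= 7 ? 0 : -1;       /* the pathname (8th field) is optional */
}
```

A caller could read `/proc/self/maps` line by line with `fgets` and hand each line to this function; subtracting `start` from `end` gives the size of each region.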
“Classic Linux Address Space” N http://duartes.org/gustavo/blog/category/linux