This chapter provides a detailed explanation of various memory management techniques, including swapping, contiguous memory allocation, segmentation, and paging. It also discusses the structure of the page table and gives examples of different architectures. The chapter explores how translation is accomplished and the role of caching in memory management. Additionally, it introduces hashed and inverted page tables and explains how paging and segmentation can be combined for efficient memory allocation.
Chapter 9: Memory Management • Background • Swapping • Contiguous Memory Allocation • Segmentation • Paging • Structure of the Page Table • Example: The Intel 32 and 64-bit Architectures • Example: ARM Architecture
Objectives • To provide a detailed description of various ways of organizing memory hardware • To discuss various memory-management techniques, including paging and segmentation • To provide a detailed description of the Intel Pentium, which supports both pure segmentation and segmentation with paging
How is the translation accomplished? • What, exactly, happens inside the MMU? • One possibility: Hardware Tree Traversal • For each virtual address, takes the page table base pointer and traverses the page table in hardware • Generates a “Page Fault” if it encounters an invalid PTE • Fault handler will decide what to do • More on this next lecture • Pros: Relatively fast (but still many memory accesses!) • Cons: Inflexible, complex hardware • Another possibility: Software • Each traversal done in software • Pros: Very flexible • Cons: Every translation must invoke a fault! • In fact, need a way to cache translations for either case!
Caching Concept • Cache: a repository for copies that can be accessed more quickly than the original • Make frequent case fast and infrequent case less dominant • Caching underlies many of the techniques that are used today to make computers fast • Can cache: memory locations, address translations, pages, file blocks, file names, network routes, etc… • Only good if: • Frequent case frequent enough and • Infrequent case not too expensive • Important measure: Average Access time = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
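As a quick sanity check on the average-access-time formula above, here is a minimal C sketch; the hit rate and latencies are illustrative placeholder numbers, not measurements from any real system:

```c
#include <stdio.h>

/* Average access time = (hit rate x hit time) + (miss rate x miss time).
 * All numbers below are illustrative assumptions, not real hardware data. */
int main(void) {
    double hit_rate  = 0.99;    /* assumed: 99% of accesses hit the cache  */
    double hit_time  = 1.0;     /* assumed: 1 ns to service a hit          */
    double miss_time = 100.0;   /* assumed: 100 ns to go to the next level */

    double miss_rate = 1.0 - hit_rate;
    double avg = hit_rate * hit_time + miss_rate * miss_time;

    printf("Average access time = %.2f ns\n", avg);  /* 0.99*1 + 0.01*100 = 1.99 ns */
    return 0;
}
```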
Why Bother with Caching? Processor-DRAM Memory Gap (latency) • [Chart, 1980–2000: processor performance (“Moore’s Law”, really Joy’s Law) improves ~60%/yr (2x every 1.5 years) while DRAM latency improves ~9%/yr (2x every 10 years); the processor-memory performance gap grows ~50% per year (“Less’ Law?”)]
Hashed Page Tables • Common in address spaces > 32 bits • The virtual page number is hashed into a page table • This page table contains a chain of elements hashing to the same location • Each element contains (1) the virtual page number (2) the value of the mapped page frame (3) a pointer to the next element • Virtual page numbers are compared in this chain searching for a match • If a match is found, the corresponding physical frame is extracted • Variation for 64-bit addresses is clustered page tables • Similar to hashed but each entry refers to several pages (such as 16) rather than 1 • Especially useful for sparse address spaces (where memory references are non-contiguous and scattered)
Hashed Page Table • Pro: • O(1) lookup to do translation, independent of the size of the address space • Requires page table space proportional to how many pages are actually being used, not proportional to the size of the address space – with 64-bit address spaces, this is a big win! • Con: • Overhead of managing hash chains, etc.
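A minimal C sketch of the chained lookup described above; the structure layout, table size, and hash function are illustrative assumptions, not a real implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* One element in a hash chain: (1) virtual page number, (2) mapped frame,
 * (3) pointer to the next element hashing to the same bucket.            */
struct hpt_entry {
    uint64_t vpn;              /* virtual page number               */
    uint64_t frame;            /* physical frame it maps to         */
    struct hpt_entry *next;    /* next entry in this bucket's chain */
};

#define HPT_BUCKETS 4096       /* assumed table size (power of two) */
static struct hpt_entry *hpt[HPT_BUCKETS];

/* Illustrative hash: fold the VPN into a bucket index. */
static size_t hpt_hash(uint64_t vpn) {
    return (size_t)(vpn * 0x9E3779B97F4A7C15ULL) % HPT_BUCKETS;
}

/* Walk the chain for this VPN; return 1 and fill *frame on a match, else 0. */
int hpt_lookup(uint64_t vpn, uint64_t *frame) {
    for (struct hpt_entry *e = hpt[hpt_hash(vpn)]; e != NULL; e = e->next) {
        if (e->vpn == vpn) {       /* compare virtual page numbers along the chain */
            *frame = e->frame;     /* match: extract the physical frame            */
            return 1;
        }
    }
    return 0;                      /* no match: would fall back to a page fault */
}
```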
Inverted Page Table • Rather than each process having a page table and keeping track of all possible logical pages, track all physical pages • One entry for each real (physical) page of memory • Entry consists of the virtual address of the page stored in that real memory location, with information about the process that owns that page • Address-space identifier (ASID) stored in each entry maps logical page for a particular process to the corresponding physical page frame. • Decreases memory needed to store each page table, but increases time needed to search the table when a page reference occurs
Inverted Page Table • Pro: • Decreases memory needed to store each page table • Con: • increases time needed to search the table • Use hash table to limit the search to one — or at most a few — page-table entries • One virtual memory reference requires at least two real memory reads: one for the hash table entry and one for the page table. • Associative registers (TLBs) can be used to improve performance. • But how to implement shared memory? • One mapping of a virtual address to the shared physical address
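A C sketch of the linear search implied by a plain inverted page table (hashing, as noted above, is the usual fix for the search cost); the field names and table size are assumptions for illustration:

```c
#include <stdint.h>

/* One entry per physical frame: which (process, virtual page) currently owns it. */
struct ipt_entry {
    uint32_t asid;      /* address-space identifier of the owning process */
    uint64_t vpn;       /* virtual page number stored in this frame       */
    int      valid;     /* frame currently mapped?                        */
};

#define NUM_FRAMES 1024            /* assumed physical memory size in frames */
static struct ipt_entry ipt[NUM_FRAMES];

/* The frame number is the *index* of the matching entry; returns -1 on a miss.
 * Without a hash table this is O(number of physical frames) per reference.   */
long ipt_lookup(uint32_t asid, uint64_t vpn) {
    for (long frame = 0; frame < NUM_FRAMES; frame++) {
        if (ipt[frame].valid && ipt[frame].asid == asid && ipt[frame].vpn == vpn)
            return frame;
    }
    return -1;                     /* not resident: page fault */
}
```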
Paging + segmentation: best of both? • simple memory allocation, • easy to share memory, and • efficient for sparse address spaces • [Diagram: the virtual address is (virt seg #, virt page #, offset); the segment table entry gives a page-table base and page-table size (a page # beyond the size is an error); the page table gives a phys frame #, which is combined with the offset to form the physical address in physical memory]
Paging + segmentation • Questions: • What must be saved/restored on context switch? • How do we share memory? Can share entire segment, or a single page. • Example: 24-bit virtual addresses = 4 bits of segment #, 8 bits of virtual page #, and 12 bits of offset. • Segment table (page-table base, page-table size): 0x2000/0x14, –/–, 0x1000/0xD, –/– • Portions of the page tables for the segments (in physical memory): at 0x1000: 0x6, 0xb, 0x4, …; at 0x2000: 0x13, 0x2a, 0x3, … • What do the following addresses translate to? 0x002070? 0x201016? 0x14c684? 0x210014? (see the decomposition sketch below)
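To make the address split concrete, here is a small C sketch that decomposes the four example addresses into segment number, virtual page number, and offset; finishing the translation (indexing the segment table and then that segment's page table, as in the diagram above) is left out:

```c
#include <stdio.h>

/* Split a 24-bit virtual address into a 4-bit segment #, an 8-bit virtual
 * page #, and a 12-bit offset, as in the example above.                   */
int main(void) {
    unsigned addrs[] = { 0x002070, 0x201016, 0x14c684, 0x210014 };

    for (int i = 0; i < 4; i++) {
        unsigned va     = addrs[i];
        unsigned seg    = (va >> 20) & 0xF;    /* bits 23..20 */
        unsigned vpage  = (va >> 12) & 0xFF;   /* bits 19..12 */
        unsigned offset =  va        & 0xFFF;  /* bits 11..0  */
        printf("va=0x%06x -> seg=0x%x page=0x%02x offset=0x%03x\n",
               va, seg, vpage, offset);
    }
    return 0;
}
```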
Multilevel translation • What must be saved/restored on context switch? • Contents of top-level segment registers (for this example) • Pointer to top-level table (page table) • Pro: • Only need to allocate as many page table entries as we need. • In other words, sparse address spaces are easy. • Easy memory allocation • Share at segment or page level (need additional reference counting) • Cons: • Pointer per page (typically 4KB - 16KB pages today) • Page tables need to be contiguous • Two (or more, if > 2 levels) lookups per memory reference
Another Major Reason to Deal with Caching • Cannot afford to translate on every access • At least two DRAM accesses per actual DRAM access • Or: perhaps I/O if page table partially on disk! • Even worse: What if we are using caching to make memory access faster than DRAM access??? • Solution? Cache translations! • Translation Cache: TLB (“Translation Look-aside Buffer”)
Why Does Caching Help? Locality! • [Diagram: probability of reference is uneven across the address space (0 to 2^n - 1); blocks X and Y migrate between lower-level and upper-level memory as the processor references them] • Temporal Locality (Locality in Time): • Keep recently accessed data items closer to processor • Spatial Locality (Locality in Space): • Move contiguous blocks to the upper levels
Memory Hierarchy of a Modern Computer System • Take advantage of the principle of locality to: • Present as much memory as in the cheapest technology • Provide access at speed offered by the fastest technology • [Diagram: processor (control, datapath, registers, on-chip cache) → second-level cache (SRAM) → main memory (DRAM) → secondary storage (disk) → tertiary storage (tape); speeds roughly 1s ns → 10s-100s ns → 100s ns → 10,000,000s ns (10s ms) → 10,000,000,000s ns (10s sec); sizes roughly 100s bytes → Ks-Ms → Ms → Gs → Ts]
A Summary on Sources of Cache Misses • Compulsory (cold start): first reference to a block • “Cold” fact of life: not a whole lot you can do about it • Note: When running “billions” of instructions, Compulsory Misses are insignificant • Capacity: • Cache cannot contain all blocks accessed by the program • Solution: increase cache size • Conflict (collision): • Multiple memory locations mapped to same cache location • Solutions: increase cache size, or increase associativity • Two others: • Coherence (Invalidation): other process (e.g., I/O) updates memory • Policy: Due to non-optimal replacement policy
How is a Block found in a Cache? • [Address layout: Block Address (Tag | Index) followed by Block offset; the Index is the set select, the block offset is the data select] • Index Used to Lookup Candidates in Cache • Index identifies the set • Tag used to identify actual copy • If no candidates match, then declare cache miss • Block is minimum quantum of caching • Data select field used to select data within block • Many caching applications don’t have data select field
Review: Direct Mapped Cache • Direct Mapped 2^N byte cache: • The uppermost (32 - N) bits are always the Cache Tag • The lowest M bits are the Byte Select (Block Size = 2^M) • Example: 1 KB Direct Mapped Cache with 32 B Blocks • Index chooses potential block • Tag checked to verify block • Byte select chooses byte within block • [Diagram: 32-bit address split into Cache Tag (bits 31..10, ex: 0x50), Cache Index (bits 9..5, ex: 0x01), and Byte Select (bits 4..0, ex: 0x00); each of the 32 cache entries holds a valid bit, a cache tag, and a 32-byte data block (bytes 0..31, 32..63, …, 992..1023)]
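A small C sketch of the field extraction for this 1 KB direct-mapped cache with 32 B blocks; the address 0x14020 is chosen so that it reproduces the example field values from the slide (tag 0x50, index 0x01, byte 0x00):

```c
#include <stdio.h>

/* 1 KB direct-mapped cache with 32 B blocks:
 *   32 blocks  -> 5 index bits       (bits 9..5)
 *   32 B/block -> 5 byte-select bits (bits 4..0)
 *   tag        -> remaining 22 bits  (bits 31..10)                      */
int main(void) {
    unsigned addr = 0x14020;   /* example: tag 0x50, index 1, byte 0 */

    unsigned byte_sel = addr & 0x1F;          /* byte within the block        */
    unsigned index    = (addr >> 5) & 0x1F;   /* which cache block (set)      */
    unsigned tag      = addr >> 10;           /* compared against stored tag  */

    printf("addr=0x%08x tag=0x%x index=%u byte=%u\n", addr, tag, index, byte_sel);
    return 0;
}
```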
Review: Set Associative Cache • N-way set associative: N entries per Cache Index • N direct mapped caches operate in parallel • Example: Two-way set associative cache • Cache Index selects a “set” from the cache • The two tags in the set are compared to the input in parallel • Data is selected based on the tag result
Set Associative Cache Example • [Diagram: address split into Cache Tag (bits 31..9), Cache Index (bits 8..5), and Byte Select (bits 4..0); the index selects one entry in each of the two ways, both stored tags are compared against the address tag in parallel, the compare results are ORed to produce Hit, and a mux (Sel1/Sel0) picks the matching way’s cache block]
Review: Fully Associative Cache • Fully Associative: any memory block can be placed in any cache entry • Address does not include a cache index • Compare Cache Tags of all Cache Entries in Parallel • Example: Block Size = 32 B blocks • We need N 27-bit comparators • Still have byte select to choose from within block
Fully Associative Cache Example • [Diagram: address split into a 27-bit Cache Tag (bits 31..5) and a Byte Select (bits 4..0, ex: 0x01); every entry holds a valid bit, a cache tag, and a 32-byte data block, and every stored tag is compared against the address tag in parallel]
Where does a Block Get Placed in a Cache? • Example: Block 12 placed in 8 block cache • Direct mapped: block 12 can go only into block 4 (12 mod 8) • Set associative: block 12 can go anywhere in set 0 (12 mod 4) • Fully associative: block 12 can go anywhere • [Diagram: a 32-block address space (blocks 0-31) mapped into an 8-block cache shown three ways: direct mapped (blocks 0-7), 2-way set associative (sets 0-3), and fully associative]
Review: Which block should be replaced on a miss? • Easy for Direct Mapped: Only one possibility • Set Associative or Fully Associative: • Random • LRU (Least Recently Used) • Miss rates by cache size and associativity:

Size     2-way LRU  2-way Random  4-way LRU  4-way Random  8-way LRU  8-way Random
16 KB    5.2%       5.7%          4.7%       5.3%          4.4%       5.0%
64 KB    1.9%       2.0%          1.5%       1.7%          1.4%       1.5%
256 KB   1.15%      1.17%         1.13%      1.13%         1.12%      1.12%
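As one way to see what "LRU" means operationally, here is a minimal C sketch of victim selection within a single set of a 4-way cache, using per-way timestamps; real hardware typically approximates LRU (e.g., with pseudo-LRU bits) rather than storing full timestamps, so this is an assumption-laden model, not a hardware design:

```c
#include <stdint.h>

#define WAYS 4                     /* e.g., a 4-way set-associative cache */

struct way {
    int      valid;
    uint32_t tag;
    uint64_t last_used;            /* timestamp of most recent access     */
};

/* Pick a victim within one set: prefer an invalid (empty) way, otherwise
 * the way with the oldest last_used timestamp (true LRU).                */
int choose_victim(struct way set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                                  /* free slot: use it  */
        if (set[w].last_used < set[victim].last_used)
            victim = w;                                /* older than current */
    }
    return victim;
}
```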
Review: What happens on a write? • Write through: The information is written both to the block in the cache and to the block in the lower-level memory • Write back: The information is written only to the block in the cache. • Modified cache block is written to main memory only when it is replaced • Question: is the block clean or dirty? • Pros and Cons of each? • WT: • PRO: read misses cannot result in writes • CON: Processor held up on writes unless writes are buffered • WB: • PRO: repeated writes not sent to DRAM; processor not held up on writes • CON: More complex; a read miss may require write back of dirty data
Caching Applied to Address Translation • [Diagram: the CPU presents a virtual address to the TLB; if the translation is cached, the physical address goes straight to physical memory; if not, the MMU translates the address, the result is saved in the TLB, and the data read or write then proceeds untranslated] • Question is one of page locality: does it exist? • Instruction accesses spend a lot of time on the same page (since accesses sequential) • Stack accesses have definite locality of reference • Data accesses have less page locality, but still some… • Can we have a TLB hierarchy? • Sure: multiple levels at different sizes/speeds
What Actually Happens on a TLB Miss? • Hardware traversed page tables: • On TLB miss, hardware in MMU looks at current page table to fill TLB (may walk multiple levels) • If PTE valid, hardware fills TLB and processor never knows • If PTE marked as invalid, causes Page Fault, after which the kernel decides what to do • Software traversed page tables (like MIPS) • On TLB miss, processor receives TLB fault • Kernel traverses page table to find PTE • If PTE valid, fills TLB and returns from fault • If PTE marked as invalid, internally calls Page Fault handler • Most chip sets provide hardware traversal • Modern operating systems tend to have more TLB faults since they use translation for many things • Examples: • shared segments • user-level portions of an operating system
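A C sketch of the software-traversed case, in the spirit of MIPS: on a TLB miss the processor traps to the kernel, which looks up the PTE and installs it into the TLB itself. The one-level page table, entry layouts, and round-robin replacement are simplifying assumptions for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified PTE and TLB entry layouts, assumed for this sketch only. */
struct pte       { uint32_t frame; bool valid; };
struct tlb_entry { uint32_t vpn; uint32_t frame; bool valid; };

#define NUM_PAGES   256
#define TLB_ENTRIES 64

static struct pte       page_table[NUM_PAGES];   /* one-level table for brevity */
static struct tlb_entry tlb[TLB_ENTRIES];

/* Stand-in for the real page-fault path; here it just records the VPN. */
static uint32_t last_faulting_vpn;
static void page_fault(uint32_t vpn) { last_faulting_vpn = vpn; }

/* Software-managed TLB refill: the kernel runs this on a TLB miss trap. */
void tlb_miss_handler(uint32_t vpn) {
    if (vpn >= NUM_PAGES || !page_table[vpn].valid) {
        page_fault(vpn);             /* invalid PTE: a genuine page fault */
        return;
    }
    /* Valid PTE: fill a TLB slot (round-robin replacement is an assumption)
     * and return from the trap so the faulting access can be retried.     */
    static unsigned next;
    tlb[next++ % TLB_ENTRIES] =
        (struct tlb_entry){ .vpn = vpn, .frame = page_table[vpn].frame, .valid = true };
}
```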
What happens on a Context Switch? • Need to do something, since TLBs map virtual addresses to physical addresses • Address Space just changed, so TLB entries no longer valid! • Options? • Invalidate TLB: simple but might be expensive • What if switching frequently between processes? • Include ProcessID in TLB • This is an architectural solution: needs hardware • What if translation tables change? • For example, to move page from memory to disk or vice versa… • Must invalidate TLB entry! • Otherwise, might think that page is still in memory!
What TLB organization makes sense? • [Diagram: CPU → TLB → Cache → Memory] • Needs to be really fast • Critical path of memory access • In simplest view: before the cache • Thus, this adds to access time (reducing cache speed) • Seems to argue for Direct Mapped or Low Associativity • However, needs to have very few conflicts! • With a TLB, the Miss Time is extremely high! • This argues that cost of Conflict (Miss Time) is much higher than slightly increased cost of access (Hit Time) • Thrashing: continuous conflicts between accesses • What if use low order bits of page as index into TLB? • First page of code, data, stack may map to same entry • Need 3-way associativity at least? • What if use high order bits as index? • TLB mostly unused for small programs
TLB organization: include protection • How big does TLB actually have to be? • Usually small: 128-512 entries • Not very big, can support higher associativity • TLB usually organized as fully-associative cache • Lookup is by Virtual Address • Returns Physical Address + other info • What happens when fully-associative is too slow? • Put a small (4-16 entry) direct-mapped cache in front • Called a “TLB Slice” • Example for MIPS R3000:

Virtual Address  Physical Address  Dirty  Ref  Valid  Access  ASID
0xFA00           0x0003            Y      N    Y      R/W     34
0x0040           0x0010            N      Y    Y      R       0
0x0041           0x0011            N      Y    Y      R       0
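A C sketch of a fully-associative TLB lookup keyed by virtual page number and ASID, mirroring the fields in the table above; hardware compares all entries in parallel, while this loop is just the software model of that comparison (field widths and the entry count are assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

/* One TLB entry, mirroring the fields in the example table above. */
struct tlb_entry {
    uint32_t vpn, pfn;           /* virtual page / physical frame numbers */
    uint8_t  asid;               /* address-space identifier              */
    bool     valid, dirty, ref;  /* status bits                           */
};

#define TLB_SIZE 128             /* "usually small: 128-512 entries"      */
static struct tlb_entry tlb[TLB_SIZE];

/* Fully associative lookup: compare the VPN (and ASID) against every entry. */
bool tlb_lookup(uint32_t vpn, uint8_t asid, uint32_t *pfn) {
    for (int i = 0; i < TLB_SIZE; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == asid) {
            *pfn = tlb[i].pfn;
            return true;         /* TLB hit                               */
        }
    }
    return false;                /* TLB miss: walk the page table         */
}
```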
Oracle SPARC Solaris • Consider a modern, 64-bit operating system example with tightly integrated HW • Goals are efficiency, low overhead • Based on hashing, but more complex • Two hash tables • One for the kernel and one for all user processes • Each maps memory addresses from virtual to physical memory • Each entry represents a contiguous area of mapped virtual memory • More efficient than having a separate hash-table entry for each page • Each entry has a base address and span (indicating the number of pages the entry represents)
Oracle SPARC Solaris (Cont.) • TLB holds translation table entries (TTEs) for fast hardware lookups • A cache of TTEs resides in a translation storage buffer (TSB) • Includes an entry per recently accessed page • A virtual address reference causes a TLB search • If miss, hardware walks the in-memory TSB looking for the TTE corresponding to the address • If a match is found, the CPU copies the TSB entry into the TLB and translation completes • If no match is found, the kernel is interrupted to search the hash table • The kernel then creates a TTE from the appropriate hash table and stores it in the TSB. The interrupt handler then returns control to the MMU, which completes the address translation.
Example: The Intel 32 and 64-bit Architectures • Dominant industry chips • Pentium CPUs are 32-bit and called IA-32 architecture • Current Intel CPUs are 64-bit and use the x86-64 architecture • Many variations in the chips; the main ideas are covered here
Example: The Intel IA-32 Architecture • Memory management in IA-32 systems is divided into two components—segmentation and paging • The CPU generates logical addresses, which are given to the segmentation unit. • The segmentation unit produces a linear address for each logical address. • The linear address is then given to the paging unit, which in turn generates the physical address in main memory. • The segmentation and paging units form the equivalent of the memory-management unit (MMU).
IA-32 Segmentation • Each segment can be 4 GB • Up to 16 K segments per process • Divided into two partitions • First partition of up to 8 K segments are private to process (kept in local descriptor table (LDT)) • Second partition of up to 8K segments shared among all processes (kept in global descriptor table (GDT)) • Each entry in the LDT and GDT consists of an 8-byte segment descriptor with detailed information about a particular segment, including the base location and limit of that segment.
IA-32 Segmentation • The logical address is a pair (selector, offset), where the selector is a 16-bit number: • s designates the segment number, g indicates whether the segment is in the GDT or LDT, and p deals with protection. • The offset is a 32-bit number specifying the location of the byte within the segment in question. • The machine has six segment registers, allowing six segments to be addressed at any one time by a process. • It also has six 8-byte microprogram registers to hold the corresponding descriptors from either the LDT or the GDT. • This cache lets the Pentium avoid having to read the descriptor from memory for every memory reference.
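A small C sketch of decoding the 16-bit selector described above into its s, g, and p fields (13-bit segment number, 1-bit GDT/LDT indicator, 2-bit protection field); the selector value itself is an illustrative assumption:

```c
#include <stdio.h>
#include <stdint.h>

/* IA-32 segment selector: 13-bit segment number s, 1-bit table indicator g
 * (0 = GDT, 1 = LDT), and a 2-bit protection field p in the low bits.      */
int main(void) {
    uint16_t selector = 0x000F;   /* illustrative value: s = 1, LDT, p = 3 */

    unsigned s = selector >> 3;          /* segment number within GDT/LDT */
    unsigned g = (selector >> 2) & 0x1;  /* which descriptor table        */
    unsigned p = selector & 0x3;         /* protection bits               */

    printf("selector=0x%04x s=%u g=%u p=%u\n", selector, s, g, p);
    return 0;
}
```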
Intel IA-32 Segmentation • The linear address on the IA-32 is 32 bits long • The segment register points to the appropriate entry in the LDT or GDT. • The base and limit information about the segment in question is used to generate a linear address. • First, the limit is used to check for address validity. • If the address is not valid, a memory fault is generated, resulting in a trap to the operating system. • If it is valid, then the value of the offset is added to the value of the base, resulting in a 32-bit linear address.
Logical to Physical Address Translation in IA-32 • The IA-32 architecture allows a page size of either 4 KB or 4 MB. • For 4-KB pages, IA-32 uses a two-level paging scheme in which the division of the 32-bit linear address is as follows: • The 10 high-order bits reference an entry in the outermost page table, called the page directory. • The CR3 register points to the page directory for the current process. • The page directory entry points to an inner page table that is indexed by the contents of the innermost 10 bits in the linear address. • Finally, the low-order bits 0–11 refer to the offset in the 4-KB page pointed to in the page table.
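A C sketch of the 10/10/12 split described above; the linear address value is illustrative, and the actual walk (CR3 → page directory entry → page table entry → frame) is omitted:

```c
#include <stdio.h>

/* 32-bit linear address with 4-KB pages: 10-bit page-directory index,
 * 10-bit page-table index, 12-bit offset.                              */
int main(void) {
    unsigned linear = 0x08049F20;  /* illustrative linear address       */

    unsigned dir    = linear >> 22;            /* bits 31..22: page directory   */
    unsigned table  = (linear >> 12) & 0x3FF;  /* bits 21..12: inner page table */
    unsigned offset = linear & 0xFFF;          /* bits 11..0 : offset in page   */

    printf("linear=0x%08x dir=%u table=%u offset=0x%03x\n",
           linear, dir, table, offset);
    return 0;
}
```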
Intel IA-32 Paging Architecture • Each page-directory entry contains a Page_Size flag, which, if set, indicates that the size of the page frame is 4 MB and not the standard 4 KB. • If this flag is set, the page directory points directly to the 4-MB page frame, bypassing the inner page table, and • the 22 low-order bits in the linear address refer to the offset in the 4-MB page frame.
Intel IA-32 Paging Architecture • To improve the efficiency of physical memory use, IA-32 page tables can be swapped to disk. • In this case, an invalid bit is used in the page directory entry to indicate whether the table to which the entry is pointing is in memory or on disk. • If the table is on disk, the operating system can use the other 31 bits to specify the disk location of the table. • The table can then be brought into memory on demand.
Intel IA-32 Page Address Extensions • The 4-GB limit of 32-bit addressing led Intel to create the page address extension (PAE), allowing 32-bit processors to access a physical address space larger than 4 GB • Paging went to a 3-level scheme • Top two bits refer to a page directory pointer table • Page-directory and page-table entries increased from 32 bits to 64 bits in size, which allowed the base address of page tables and page frames to extend from 20 to 24 bits • Net effect is increasing the physical address space to 36 bits (64 GB of physical memory); operating system support is required to use PAE
Intel x86-64 • Current generation Intel x86 architecture • A 64-bit address space yields an astonishing 2^64 bytes of addressable memory (more than 16 quintillion bytes, or 16 exabytes) • In practice, only 48-bit addressing is implemented • Page sizes of 4 KB, 2 MB, 1 GB • Four levels of paging hierarchy • Can also use PAE so virtual addresses are 48 bits and physical addresses are 52 bits
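A C sketch of splitting a 48-bit x86-64 virtual address (4-KB pages) into its four 9-bit table indices plus a 12-bit offset, following the standard 4-level layout; the address value is an illustrative assumption:

```c
#include <stdio.h>
#include <stdint.h>

/* x86-64 with 4-KB pages and 48-bit virtual addresses: four 9-bit table
 * indices plus a 12-bit page offset.                                     */
int main(void) {
    uint64_t va = 0x00007F1234567ABCULL;   /* illustrative virtual address */

    unsigned lvl4 = (va >> 39) & 0x1FF;    /* level-4 table index     */
    unsigned lvl3 = (va >> 30) & 0x1FF;    /* level-3 table index     */
    unsigned lvl2 = (va >> 21) & 0x1FF;    /* level-2 table index     */
    unsigned lvl1 = (va >> 12) & 0x1FF;    /* level-1 table index     */
    unsigned off  =  va        & 0xFFF;    /* offset within 4-KB page */

    printf("va=0x%012llx -> %u/%u/%u/%u + 0x%03x\n",
           (unsigned long long)va, lvl4, lvl3, lvl2, lvl1, off);
    return 0;
}
```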
64-bit ARMv8 Architecture • The ARMv8 architecture supports three different translation granules: 4 KB, 16 KB, and 64 KB. • Each translation granule provides different page sizes, as well as larger sections of contiguous memory, known as regions. • For 4-KB and 16-KB granules, up to four levels of paging may be used, with up to three levels of paging for 64-KB granules. • The ARMv8 address structure for the 4-KB translation granule uses up to four levels of paging; notice that only 48 bits are currently used.
64-bit ARMv8 Architecture • The four-level hierarchical paging structure for the 4-KB translation granule: • TTBR register is the translation table base register and points to the level 0 table for the current thread • If all four levels are used, the offset (bits 0–11) refers to the offset within a 4-KB page. • Table entries for level 1 and level 2 may refer either to another table or to a 1-GB region (level-1 table) or 2-MB region (level-2 table).
64-bit ARMv8 Architecture • The ARM architecture supports two levels of TLBs • Inner level has two micro TLBs (one data, one instruction) • The micro TLB supports ASIDs as well • At the outer level is single main TLB • Address translation begins at the micro-TLB level. • In the case of a miss, the main TLB is then checked. • If both TLBs yield misses, a page table walk must be performed in hardware.