This lecture discusses case studies for virtual memory and cache hierarchies, including Alpha paged virtual memory, address mapping, cache bandwidth, prefetching, cache power consumption, and the Alpha 21264 instruction hierarchy.
Lecture 17: Case Studies
• Topics: case studies for virtual memory and cache hierarchies (Sections 5.10-5.17)
Alpha Paged Virtual Memory
• Each process has the following virtual memory space: seg0 (reserved for user text and data), kseg (reserved for the kernel), and seg1 (reserved for page tables)
• The Alpha uses separate instruction and data TLBs
• TLB entries can be used to map pages of different sizes
Example Look-Up
• [figure: a virtual page abc in virtual memory is mapped through the TLB/PTEs to physical page xyz in physical memory]
• If each PTE is 8 bytes, the PTE for virtual page abc sits at virtual address abc × 8 = lmn relative to the page-table base (a small sketch of this computation follows below)
• Virtual address lmn is in turn translated to physical address pqr, where the PTE is actually stored
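The arithmetic in this look-up can be written out directly. A minimal sketch, assuming the page table is a linear array of 8-byte PTEs starting at some base virtual address; the function and parameter names are illustrative, not the Alpha's actual layout:

```c
#include <stdint.h>

#define PTE_BYTES 8

/* Virtual address of the PTE for a given virtual page number: the page
   table is treated as an array of 8-byte entries living in virtual memory,
   so the PTE for vpn sits at base + vpn * 8.  That address is itself
   virtual ("lmn" on the slide) and must be translated to a physical
   address ("pqr") before the entry can be read. */
uint64_t pte_vaddr(uint64_t page_table_base, uint64_t vpn)
{
    return page_table_base + vpn * PTE_BYTES;
}
```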
Alpha Address Mapping
• [figure: the 64-bit virtual address splits into 21 unused bits, three 10-bit indices (levels 1-3), and a 13-bit page offset]
• The page table base register points to the L1 page table; the level-1 index selects a PTE pointing to an L2 page table, the level-2 index selects a PTE pointing to an L3 page table, and the level-3 index selects the final PTE
• The final PTE supplies a 32-bit physical page number, which is combined with the 13-bit page offset to form a 45-bit physical address (the walk is sketched below)
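The three-level walk translates into a short routine. This is a minimal sketch under simplifying assumptions – each table is an array of 1024 8-byte entries, the low 32 bits of a PTE are taken to be the next physical page number, and pte_at() is a hypothetical helper that reads a PTE from physical memory – not the actual Alpha PTE format:

```c
#include <stdint.h>

#define LEVEL_BITS   10
#define OFFSET_BITS  13
#define LEVEL_MASK   ((1u << LEVEL_BITS) - 1)
#define OFFSET_MASK  ((1ull << OFFSET_BITS) - 1)

/* Hypothetical helper: reads the 8-byte PTE stored at a physical address
   (e.g. from a simulated memory array). */
uint64_t pte_at(uint64_t phys_addr);

uint64_t translate(uint64_t page_table_base, uint64_t vaddr)
{
    /* Extract the three 10-bit indices above the 13-bit page offset. */
    uint32_t l1 = (vaddr >> (2 * LEVEL_BITS + OFFSET_BITS)) & LEVEL_MASK;
    uint32_t l2 = (vaddr >> (LEVEL_BITS + OFFSET_BITS)) & LEVEL_MASK;
    uint32_t l3 = (vaddr >> OFFSET_BITS) & LEVEL_MASK;

    /* Each level's PTE holds the physical page number of the next table. */
    uint64_t l2_table = (pte_at(page_table_base + 8 * l1) & 0xFFFFFFFFu) << OFFSET_BITS;
    uint64_t l3_table = (pte_at(l2_table + 8 * l2) & 0xFFFFFFFFu) << OFFSET_BITS;
    uint64_t ppn      =  pte_at(l3_table + 8 * l3) & 0xFFFFFFFFu;

    /* 32-bit physical page number + 13-bit offset = 45-bit physical address. */
    return (ppn << OFFSET_BITS) | (vaddr & OFFSET_MASK);
}
```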
Alpha Address Mapping
• Each PTE is 8 bytes – if the page size is 8KB, a page can contain 1024 PTEs – 10 bits to index into each level
• If the page size doubles, we need 47 bits of virtual address (a quick arithmetic check follows below)
• Since a PTE only stores 32 bits of physical page number, the physical memory can be addressed by at most 32 + offset bits
• The first two levels of the page table are kept in physical memory; the third is in virtual memory
• Why the three-level structure? Even a flat structure would need PTEs for the PTEs, and those would have to be stored in physical memory – more levels of indirection make it easier to dynamically allocate pages
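A quick check of the bit counts above, using only the numbers on this slide: with 8KB pages the offset is 13 bits and each page holds 8KB / 8B = 1024 PTEs, so each level consumes 10 index bits and the mapped virtual address uses 3 × 10 + 13 = 43 bits. Doubling the page to 16KB gives a 14-bit offset and 2048 PTEs per page (11 index bits per level), for 3 × 11 + 14 = 47 bits of virtual address. The physical address is likewise bounded by the 32-bit physical page number plus the 13-bit offset, i.e. 45 bits.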
Bandwidth
• Out-of-order superscalar processors can issue 4+ instructions per cycle ⇒ 2+ loads/stores per cycle ⇒ caches must provide low latency and high bandwidth
• With effective caches, memory bandwidth requirements are usually low; unfortunately, memory bandwidth is easier to improve than memory latency
• RDRAM improved memory bandwidth by a factor of eight, but improved performance by less than 2% for most applications and by 15% for some graphics apps
• Bandwidth can help if you prefetch aggressively
Cache Bandwidth
• [figure: compares L1 data-cache organizations – an interleaved cache built from two 1-ported banks (even words in one bank, odd words in the other) versus a design built from multi-ported cells; a 1-ported and a 2-ported L1 D are annotated with 2-cycle and 3-cycle access times]
• The interleaved design has similar area to a 1-ported cache
• It adds complexity in routing addresses/data
• There is a slight penalty when both accesses conflict for the same bank (see the bank-selection sketch below)
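A minimal sketch of the odd/even interleaving described above: the low bit of the word address selects the bank, and two simultaneous accesses conflict only when they pick the same bank. The word size and helper names are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

#define WORD_BYTES 8   /* assume 8-byte words */

/* Even words go to bank 0, odd words to bank 1. */
static int bank_of(uint64_t addr)
{
    return (addr / WORD_BYTES) & 1;
}

/* True if two simultaneous accesses target the same bank -- the
   "slight penalty" case on this slide. */
static bool bank_conflict(uint64_t a, uint64_t b)
{
    return bank_of(a) == bank_of(b);
}
```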
Prefetching
• High memory latency and cache misses are unavoidable
• Prefetching is one of the most effective ways to hide memory latency
• Some programs are hard to prefetch for – unpredictable branches, irregular traversal of arrays, hash tables, pointer-based data structures
• Aggressive prefetching can pollute the cache and can compete for memory bandwidth
• Prefetch design must cover: (i) array accesses, (ii) pointers (a software prefetch sketch for array accesses follows below)
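For regular array accesses, prefetching can also be done in software. A minimal sketch, assuming a GCC/Clang-style compiler that provides the __builtin_prefetch hint; PREFETCH_DISTANCE is an illustrative tuning knob that should roughly cover one memory latency:

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* illustrative: how far ahead to fetch */

double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Hint the hardware to start fetching a later element now, so it
           arrives by the time the loop reaches it. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE]);
        sum += a[i];
    }
    return sum;
}
```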
Stride Prefetching
• Constant strides are relatively easy to detect
• Keep track of the last address fetched by each PC – compare with the current address to confirm a constant stride (sketched below)
• Every access triggers a fetch of the next word – in fact, the prefetcher tries to stay far enough ahead to entirely hide memory latency
• Prefetched words are stored in a buffer to reduce cache pollution
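A minimal sketch of the per-PC stride detection above: a small table indexed by the load's PC remembers the last address and stride, and issues a prefetch once the same stride repeats. The table size, confirmation rule, and issue_prefetch() hook are illustrative assumptions, not the exact hardware scheme:

```c
#include <stdint.h>

#define TABLE_ENTRIES 64

struct stride_entry {
    uint64_t pc;          /* PC of the load being tracked */
    uint64_t last_addr;   /* last address it accessed     */
    int64_t  stride;      /* last observed stride         */
    int      confirmed;   /* same stride seen twice?      */
};

static struct stride_entry table[TABLE_ENTRIES];

void issue_prefetch(uint64_t addr);   /* hypothetical hook into the memory system */

void on_load(uint64_t pc, uint64_t addr)
{
    struct stride_entry *e = &table[pc % TABLE_ENTRIES];

    if (e->pc != pc) {                 /* a new PC takes over the entry */
        e->pc = pc;
        e->last_addr = addr;
        e->stride = 0;
        e->confirmed = 0;
        return;
    }

    int64_t stride = (int64_t)(addr - e->last_addr);
    e->confirmed = (stride != 0 && stride == e->stride);
    e->stride = stride;
    e->last_addr = addr;

    if (e->confirmed)                  /* constant stride confirmed: run ahead */
        issue_prefetch(addr + (uint64_t)stride);
}
```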
Cache Power Consumption
• Instruction caches can save on decode time and power by storing instructions in decoded form (trace caches)
• Memory accesses are power hungry – caches can also help reduce power consumption
Alpha 21264 Instruction Hierarchy
• When powered on, initialization code is read from an external PROM and executed in privileged architecture library (PAL) mode with no virtual memory
• The I-cache is virtually indexed and virtually tagged – this avoids an I-TLB look-up on every access – correctness is not compromised because instructions are not modified
• Each I-cache block saves 11 bits to predict the index of the next set to be accessed and 1 bit to predict the way – line and way prediction (see the sketch below)
• An I-cache miss looks up a prefetch buffer and a 128-entry fully-associative TLB before accessing L2
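A minimal sketch of line and way prediction as described above: each block carries a predicted (next set, next way), fetch follows that pointer directly instead of decoding the next PC and comparing tags, and the predictor is retrained when the actual next block differs. Structure names and sizes are illustrative; only the 11 index bits and 1 way bit come from the slide:

```c
#include <stdint.h>

#define SETS 2048          /* 11 index bits */
#define WAYS 2             /* 1 way bit     */

struct icache_block {
    uint64_t tag;
    uint32_t next_index;   /* predicted set of the next fetch */
    uint32_t next_way;     /* predicted way of the next fetch */
    uint8_t  data[64];
};

static struct icache_block icache[SETS][WAYS];

/* Follow the prediction chain; the fetch unit later verifies that this
   block really matches the next PC and retrains on a mispredict. */
struct icache_block *predicted_fetch(const struct icache_block *current)
{
    return &icache[current->next_index % SETS][current->next_way % WAYS];
}

/* Retrain the prediction when the actual next block differs. */
void retrain(struct icache_block *current, uint32_t actual_index, uint32_t actual_way)
{
    current->next_index = actual_index;
    current->next_way   = actual_way;
}
```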
21264 Cache Hierarchy
• The L2 cache is off-chip and direct-mapped (the 21364 moves L2 on to the chip)
• Every L2 fetch also fetches the next four physical blocks, without exceeding the page boundary (see the sketch below)
• L2 is write-back
• The processor has a 128-bit data path to L2 and a 64-bit data path to memory
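A minimal sketch of the next-four-blocks behaviour above, clamped at the page boundary. The block and page sizes (64B, 8KB) and the issue_l2_prefetch() hook are illustrative assumptions:

```c
#include <stdint.h>

#define BLOCK_SIZE 64
#define PAGE_SIZE  8192

void issue_l2_prefetch(uint64_t phys_addr);   /* hypothetical hook */

void on_l2_fetch(uint64_t phys_addr)
{
    uint64_t block    = phys_addr & ~(uint64_t)(BLOCK_SIZE - 1);
    uint64_t page_end = (phys_addr & ~(uint64_t)(PAGE_SIZE - 1)) + PAGE_SIZE;

    /* Prefetch the next four blocks, stopping at the page boundary. */
    for (int i = 1; i <= 4; i++) {
        uint64_t next = block + (uint64_t)i * BLOCK_SIZE;
        if (next >= page_end)
            break;
        issue_l2_prefetch(next);
    }
}
```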
21264 Data Cache
• The L1 data cache is write-back, virtually indexed, physically tagged, and backed up by a victim buffer
• On a miss, the processor checks the other possible L1 cache locations for a synonym in parallel with the L2 look-up (recall the two alternative techniques to deal with the synonym problem; see the sketch below)
• There is no prefetching for data cache misses
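A minimal sketch of where synonyms come from in a virtually indexed, physically tagged cache: when the set index needs more low-order bits than the page offset provides, those extra bits are virtual, so two aliases of the same physical page can land in different sets – which is why alternative locations must be checked on a miss. The cache geometry here is an illustrative assumption, not the 21264's exact configuration:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_OFFSET_BITS  13   /* 8KB pages                      */
#define BLOCK_OFFSET_BITS  6   /* 64B blocks                     */
#define INDEX_BITS        10   /* e.g. a 64KB direct-mapped L1   */

static uint32_t set_index(uint64_t vaddr)
{
    return (vaddr >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

int main(void)
{
    /* Index bits drawn from above the page offset come from the virtual
       page number; two synonyms may disagree on them and therefore index
       different sets -- those are the alternative L1 locations checked
       in parallel with the L2 look-up. */
    int synonym_bits = BLOCK_OFFSET_BITS + INDEX_BITS - PAGE_OFFSET_BITS;
    printf("index bits taken from the virtual page number: %d\n",
           synonym_bits > 0 ? synonym_bits : 0);
    return 0;
}
```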
21264 Performance
• 21164: 8KB L1 caches and a 96KB L2; 21264: 64KB L1 caches and an off-chip 1MB L2
• The 21264 is out of order and can tolerate L1 misses ⇒ speedup is a function of the 21164 L2 misses that are captured by the 21264's L2
• Commercial database/server applications stress the memory system much more than SPEC/desktop applications
Sun Fire 6800 Server
• Intended for commercial applications ⇒ aggressive memory hierarchy design
• 8 MB off-chip L2
• Wide buses to L2 and memory for bandwidth
• On-chip memory controller to reduce latency
• On-chip L2 tags to save latency on a miss
• ECC and parity bits for all external traffic to provide high reliability
• Large store buffers (write caches) between L1 and L2
• Data prefetch engine that detects strides
• Instruction prefetch that stays one block ahead of decode
• Two parallel TLBs: 128-entry 4-way and 16-entry fully-associative