Impulse Project
DARPA Review – July 2000
University of Utah and University of Massachusetts at Amherst
Technology Trends
• Disturbing trends (for a memory architect):
  • Memory gap widening (CPUs improving 60%/year, DRAM only 7%)
  • Internal CPU parallelism is escalating
  • Emerging applications with poor locality (multimedia, databases, …)
  • Cache sizes growing much faster than TLB reach
  • Ugly CPIs: Perl and Sites, OSDI 1996
• Possible solutions:
  • Bigger, deeper cache hierarchies
  • Better latency-tolerating CPU features (non-blocking caches, OOO, …)
  • Migrate computation to the DRAMs
  • Let software control how data is managed (Impulse)
Simple Example Problem
• Sum of the diagonal elements of a dense matrix:

    for (i = 0; i < n; i++)
        sum += A[i][i];

• Problems:
  • Wasted bus bandwidth
  • Low cache utilization
  • Low cache hit ratio
[Diagram: cache connected to physical memory through the memory bus and memory controller]
The Impulse Idea
• What if software could do the following?

    Create diag[*] corresponding to A[*][*]
    for (i = 0; i < n; i++)
        sum += diag[i];

• Improvements:
  • No wasted bus bandwidth
  • Better cache utilization
  • Higher cache and TLB hit ratios
[Diagram: the memory controller gathers the diagonal from physical memory and returns dense cache lines over the memory bus]
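To make the remapping semantics concrete, here is a minimal software emulation of the diag[*] idea. It copies the diagonal into a dense buffer, which is exactly the overhead Impulse avoids by gathering at the memory controller; emulate_remap_diag is an illustrative helper, not part of the Impulse interface.

    #include <stdio.h>
    #include <stdlib.h>

    /* Software emulation of the remapping semantics only: it copies the
     * diagonal into a dense array.  Real Impulse performs this gather in
     * the memory controller, with no copy and no wasted bus traffic. */
    static double *emulate_remap_diag(const double *A, size_t n)
    {
        double *diag = malloc(n * sizeof *diag);
        for (size_t i = 0; i < n; i++)
            diag[i] = A[i * n + i];   /* elements are n+1 doubles apart */
        return diag;
    }

    int main(void)
    {
        size_t n = 4;
        double *A = malloc(n * n * sizeof *A);
        for (size_t i = 0; i < n * n; i++)
            A[i] = (double)i;

        double *diag = emulate_remap_diag(A, n);
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += diag[i];           /* dense, cache-friendly accesses */

        printf("trace = %g\n", sum);  /* 0 + 5 + 10 + 15 = 30 */
        free(diag);
        free(A);
        return 0;
    }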
How? Add an Extra Level of Mapping
• Shadow address: an "unused" physical address
• The MC maps shadow addresses to real physical addresses
• Applications configure the MC through the OS
[Diagram: virtual space is translated by the MMU/TLB into physical space, which contains both real physical memory and a shadow address space; the Impulse MC translates shadow addresses back into real physical memory]
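The second translation can be pictured as a table lookup inside the controller. The sketch below is schematic only: the shadow base address, table layout, and word granularity are assumptions for illustration, not the prototype's actual design.

    #include <stdint.h>

    #define SHADOW_BASE 0x80000000u  /* assumed start of the unused range */

    /* The CPU's MMU/TLB performs the first translation (virtual ->
     * physical); a remapped virtual page lands in the shadow range.
     * The Impulse MC performs the second: a shadow address indexes a
     * controller-resident table of real physical word addresses. */
    typedef struct {
        uint32_t *target;  /* dense table of real physical addresses */
    } shadow_region_t;

    static uint32_t shadow_to_physical(const shadow_region_t *r,
                                       uint32_t shadow)
    {
        uint32_t index = (shadow - SHADOW_BASE) / sizeof(uint32_t);
        return r->target[index];  /* where the gathered word really lives */
    }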
Address Translations
[Diagram: in a conventional system, the MMU/TLB translates virtual memory directly to physical memory; in an Impulse system, the MMU/TLB translates virtual memory to shadow memory (e.g., the diagonal), which the MC then translates to physical memory, at word or page granularity]
Impulse Features
• Base-stride scatter/gather data
  • Walk columns or diagonals efficiently
  • Remap matrix tiles to contiguous memory without copying
• Indirection-vector accesses (see the sketch after this list)
  • Static vectors (e.g., perform A[index[i]] efficiently)
  • Dynamic cacheline assembly
• Remap pages
  • Create superpages from disjoint base pages
  • No-copy page coloring
• Aggressive controller-based prefetching
  • Prefetch data from DRAMs (sequential and pointer-directed)
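The indirection-vector feature has simple gather semantics, emulated below in plain C. In Impulse the gather happens at the memory controller with no copy; this loop and the helper name are illustrative only.

    #include <stddef.h>

    /* Semantics of indirection-vector remapping: element i of the
     * remapped array equals base[index[i]].  Impulse assembles these
     * values into dense cache lines at the controller; this loop only
     * emulates the observable result. */
    static void emulate_remap_indirect(double *dst, const double *base,
                                       const int *index, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = base[index[i]];
    }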
Exploiting Impulse
• Setup:
  • Application asks the OS to set up a remapping
  • OS allocates a free shadow configuration register
  • OS sets up a dense "page table" that points to the target data
  • OS downloads the address of this page table to the configuration register
  • OS allocates free shadow and virtual address space
  • OS maps application virtual addresses to shadow physical addresses
  • OS returns the virtual address corresponding to the remapped data to the application
• Use:
  • Application accesses the (dense) remapped data
  • TLB translates VA to shadow addresses
  • Remapped addresses pass through the MC-TLB (fine-grained remapping, if any)
  • DRAM scheduler "collects" the data
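Seen from the application, the whole protocol collapses to one setup call followed by ordinary loads. The sketch below reuses the remap_indirect name from the conjugate-gradient slide; its exact signature and trailing arguments are assumptions.

    /* Application-side view of setup and use.  remap_indirect is named
     * on the conjugate-gradient slide; this signature is assumed. */
    extern double *remap_indirect(double *base, int *index, int n, ...);

    double example(double *P, int *Col, int n)
    {
        /* Setup: the OS programs a shadow configuration register and
         * builds the controller-side page table for the remapping. */
        double *Pi = remap_indirect(P, Col, n);

        /* Use: ordinary loads.  Each access is translated VA -> shadow
         * by the CPU TLB, shadow -> physical by the MC-TLB, and the
         * DRAM scheduler gathers the words into dense cache lines. */
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += Pi[i];
        return sum;
    }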
Benchmarks
• Fine-grained remapping benchmarks
  • Conjugate gradient (core of the DARPA vision benchmark)
  • Ray tracing
• Page-grained remapping benchmarks
  • SPEC95 (dynamic superpage promotion)
  • Compress (no-copy page coloring)
• Prefetching benchmarks
  • SPECint95 suite (3-15% performance improvement)
  • Synthetic tree microbenchmarks
Conjugate Gradient
• Store the logical sparse matrix A using the Yale (compressed sparse row) storage scheme:
  • Data stores the non-zero elements (much larger than P)
  • Row[i] indicates where the i-th row begins in Data
  • Column[i] is the column number of Data[i]
[Diagram: sparse matrix-vector product A x P => B, showing the Row, Column, and Data arrays for a small example matrix]
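For concreteness, here is a small matrix in Yale/CSR form; the matrix and its values are illustrative, not taken from the deck.

    /* A 3x3 example in Yale/CSR form (illustrative values):
     *     | 1 0 2 |
     * A = | 0 3 0 |
     *     | 4 0 5 |
     */
    static double Data[]   = { 1, 2, 3, 4, 5 };  /* non-zero elements       */
    static int    Column[] = { 0, 2, 1, 0, 2 };  /* column of each element  */
    static int    Row[]    = { 0, 2, 3, 5 };     /* row i occupies           */
                                                 /* Data[Row[i]..Row[i+1]-1] */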
Optimizing Conjugate Gradient
• Original code:

    for i = 0 to n-1 do
        sum = 0;
        for j = Row[i] to Row[i+1]-1 do
            sum += Data[j] * P[Col[j]];
        b[i] = sum;

• Issues:
  • Data and Col are large streams
  • P is reusable, but is forced out of the cache
  • Poor L1 cache hit rates
  • Interference in the L2 cache
• Optimized code:

    Pi = remap_indirect(P, Col, n, …);
    for i = 0 to n-1 do
        sum = 0;
        for j = Row[i] to Row[i+1]-1 do
            sum += Data[j] * Pi[j];
        b[i] = sum;

• Effects:
  • The indirect access P[Col[j]] is turned into a sequential streaming access
  • No reuse of P now
  • Side effect: accesses to Col are eliminated
  • Significant improvement in hit rates (both L1 and TLB)
Conjugate Gradient Results
• Significant improvement in effective cache locality
Volume Rendering: Ray Tracing
• Problem: ray traversals are "random" memory accesses
• Solution: calculate the addresses along each ray as an indirection vector
  • Access the rays via an Impulse-remapped data structure
Volume Rendering Results
• A: rays follow the natural memory layout (X axis)
• B: rays perpendicular to the natural memory layout (Z axis)
Coarse-Grained Remappings
• Page-grained remapping: aggressive use of synthetic superpages (sketched after this list)
  • modified the kernel TLB miss handler to detect pages responsible for frequent TLB misses
  • create a superpage by page-grained remapping on the memory controller
  • no copying, therefore promotion can be far more aggressive
• No-copy page coloring
  • Problem: conflicts in the physically-indexed L2 cache
  • Normal solution: copy to non-conflicting pages
  • Impulse solution: remap to non-conflicting pages
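A sketch of the promotion path just described. All names and the threshold are assumptions for illustration; the real modification lives in the kernel TLB miss handler.

    /* Sketch of dynamic superpage promotion (names/threshold assumed).
     * Count TLB misses per candidate region; past a threshold, ask the
     * Impulse MC to back one shadow superpage with the existing,
     * scattered base pages; no data is copied. */
    #define PROMOTE_THRESHOLD 64

    struct region {
        unsigned long va_base;     /* region's base virtual address */
        unsigned      tlb_misses;  /* misses charged to this region */
    };

    /* Hypothetical kernel helpers. */
    void mc_build_superpage_table(unsigned long va_base);
    void install_superpage_mapping(unsigned long va_base);

    void on_tlb_miss(struct region *r)
    {
        if (++r->tlb_misses >= PROMOTE_THRESHOLD) {
            mc_build_superpage_table(r->va_base);   /* MC-side remapping  */
            install_superpage_mapping(r->va_base);  /* one large TLB entry */
            r->tlb_misses = 0;
        }
    }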
Shadow-Backed Superpages
[Diagram: contiguous virtual pages 0x00004000-0x00007000 map to a contiguous shadow superpage 0x80240000-0x80243000, which the MC backs with scattered physical pages such as 0x04012000, 0x06155000, 0x12011000, and 0x40138000]
• SPECint95 improves 5-20%
• The MTLB increases the effective reach of the CPU TLB
• Can superpage large and multiple arrays
  • at compile time, at allocation time (cheapest), or dynamically
MMC-Based Prefetching
• Idea: prefetch data out of the DRAMs into SRAM on the MMC
• Misprediction penalties are significantly reduced; wrong prefetches cost
  • no conflict misses due to cache capacity limitations
  • no system bus bandwidth
• Exploits "free" DRAM bandwidth at the MMC level
  • aggregate DRAM bandwidth is higher than cache or bus bandwidth
• Reduces the latency of accesses that hit in the prefetch cache
Pointer-Based Microbenchmarks
• Random walk down a tree with N children per node
  • vary the number of children from 1 (linked list) to 3 (trinary tree)
• Baseline: compiler-directed prefetching
• Impulse: the MMC prefetches the next nodes in the tree (1-ahead)
  • allocate nodes in the shadow region
  • tell the MMC which offsets represent pointers (see the sketch after this list)
[Diagram: a root node with children Child1 … ChildN, each of which has children Child1 … ChildN of its own]
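What "tell the MMC which offsets represent pointers" might look like from the application side. shadow_alloc and mmc_register_pointer_offsets are assumed names for illustration, not the prototype's interface.

    #include <stddef.h>

    /* Tree node for the microbenchmark: up to 3 children (trinary tree). */
    struct node {
        struct node *child[3];  /* pointers the MMC should chase, 1-ahead */
        int          payload;
    };

    /* Assumed interface: allocate nodes in the shadow region and tell
     * the MMC which byte offsets inside a node hold pointers, so it can
     * prefetch a node's children as soon as the node itself is fetched. */
    void *shadow_alloc(size_t bytes);
    void  mmc_register_pointer_offsets(const size_t *offsets, int count);

    void setup_tree_prefetch(void)
    {
        static const size_t offs[] = {
            offsetof(struct node, child[0]),
            offsetof(struct node, child[1]),
            offsetof(struct node, child[2]),
        };
        mmc_register_pointer_offsets(offs, 3);
    }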
Pointer Prefetching Results
• P1(N): singly-linked list, no prefetching
• P3(C): triply-linked list, compiler-directed prefetching
• P#(I): Impulse MMC-directed prefetching
Prototyping Status
• Four-stage prototype strategy:
  • I: Slow conventional MMC
  • II: Fast conventional MMC
  • III: Impulse on an FPGA
  • IV: Impulse in an ASIC
• Current status:
  • Stage I complete (pictured)
  • Stage II imminent (in final testing)
  • Stage III underway (3/01)
  • Stage IV next year (12/01)
Summary
• Impulse benefits:
  • Higher memory bus utilization
  • Higher cache utilization
  • Turns sparse memory operations into dense ones
• Range of optimizations:
  • Fine-grained data remapping
  • Page-grained data remapping
  • Memory-based prefetching
• Impact:
  • Performance increase for a small increase in cost
  • Does not require changes to CPUs, caches, or DRAMs
Questions? http://www.cs.utah.edu/impulse