Architecture And Software Support For Persistent and Vast Memory Swapnil Haria PhD Final Defense 6/10/19
Persistent and Vast Memory is Here! Intel Optane Memory So are its challenges …
Executive Summary Minimally Ordered Durable Datastructures (MOD) • How to build fast, recoverable applications without changing hardware? • Reduce the number of expensive ordering points in software • Perform 60% faster than state-of-the-art STM library Hands-Off Persistence System (HOPS) • How to improve recoverable application performance with hardware changes? • Separate hardware primitives for frequent ordering points and rarer durability events • Performs 24% faster than state-of-the-art STM library Devirtualized Memory (DVM) • How to lower virtual memory overheads in systems with vast memory? • Eliminate expensive address translation on most accesses • Reduces overheads to less than 4% of execution time
Outline Background (8 slides) Minimally Ordered Durable Datastructures for Persistent Memory (New, 20 slides) Hands-Off Persistence System (Prelim, 2 or 10 or 20 slides) Devirtualized Memory for Heterogeneous Systems (Prelim, 2 or 8 or 20 slides)
Many other PM technologies may follow Phase Change Memory Resistive RAM Magnetic RAM Spin Transfer Torque RAM
Many other PM technologies may follow High Capacity Data Persistence Near-DRAM latencies Asymmetric Read/Write Latency
Promise of Persistent Memory
Bigger DRAM and Durable In-memory Datastructures (care about memory): our focus
Faster Filesystems (care about storage): NOT OUR FOCUS
Programming Challenges CPU 0 CPU 1 Private L1 Private L1 Durability Consistency (Failure) Atomicity Shared Last-Level Cache PM Controller DRAM Controller Persistence Domain Volatile Persistent
Background: Software Transactional Memory
BEGIN-TX
value1 = array[index1]
value2 = array[index2]
LOG (index1, value1); FLUSH LOG; FENCE
array[index1] = value2
FLUSH (&array[index1])
LOG (index2, value2); FLUSH LOG; FENCE
array[index2] = value1
FLUSH (&array[index2])
FENCE
END-TX
[Figure: cache and PM state of array and LOG as the transaction executes]
Background: Software Transactional Memory
System crash mid-transaction: use the LOG to clean up the mess
[Figure: after a crash, PM holds a partially updated array plus the intact LOG entries Index1=X, Index2=Y]
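The undo-logging discipline on these two slides can be sketched in a few lines. This is a minimal Python simulation, not the STM library's code: "PM" is a plain dict, the FLUSH/FENCE steps are only marked in comments, and a crash is modeled by returning early.

```python
# Illustrative simulation of STM-style undo logging: log old values
# durably before mutating in place, so recovery can undo a torn update.

def swap_with_undo_log(pm, i, j, crash_after_first_store=False):
    """Swap pm['array'][i] and pm['array'][j] under an undo log."""
    a = pm['array']
    # LOG old values; on real PM: FLUSH LOG; FENCE before any in-place store.
    pm['log'] = [(i, a[i]), (j, a[j])]
    a[i], tmp = a[j], a[i]        # first in-place store (then FLUSH)
    if crash_after_first_store:
        return                    # simulated crash: array is torn
    a[j] = tmp                    # second in-place store (then FLUSH; FENCE)
    pm['log'] = []                # commit: retire the log

def recover(pm):
    """After a crash, replay the undo log to restore consistency."""
    for idx, old in pm['log']:
        pm['array'][idx] = old
    pm['log'] = []

pm = {'array': ['X', 'Y'], 'log': []}
swap_with_undo_log(pm, 0, 1, crash_after_first_store=True)
recover(pm)
assert pm['array'] == ['X', 'Y']   # back to the pre-transaction state
```

Note the two FENCE ordering points per logged location: one to make the log durable before the store, and one at commit. Reducing these is exactly what MOD targets.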
Outline Background (8 slides) Minimally Ordered Durable Datastructures for Persistent Memory (New, 20 slides) - Analysis of Flushing Overheads on Optane - Building Blocks for Functional Shadowing - MOD Datastructures - Evaluation Hands-Off Persistence System (Prelim, 2 or 10 or 20 slides) Devirtualized Memory for Heterogeneous Systems (Prelim, 2 or 8 or 20 slides)
STM performance on Optane [Chart: execution time, self-normalized]
Measuring Flush Overheads on Optane [Slide shows a long run of repeated FLUSH; FENCE pairs, one per ordering point]
MOD Datastructures: Goals Reduce ordering points with NO hardware modifications Library of recoverable datastructures: vector, map, queue, etc.
Eliminating Ordering Points How to provide atomicity and consistency with minimal ordering? Databases, Filesystems: Shadow Paging Key Idea: Avoid overwriting data!
Ordering in STM
BEGIN-TX
value1 = array[index1]
value2 = array[index2]
LOG (index1, value1); FLUSH LOG; FENCE
array[index1] = value2
FLUSH (&array[index1])
LOG (index2, value2); FLUSH LOG; FENCE
array[index2] = value1
FLUSH (&array[index2])
FENCE
END-TX
Background: Shadow Paging
value1 = array[index1]
value2 = array[index2]
shadow = array // Create shadow copy
shadow[index1] = value2
shadow[index2] = value1
FLUSH (shadow)
FENCE
// Application uses shadow subsequently
[Figure: cache and PM state of array and its shadow copy]
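The shadow-paging pattern above collapses to a single ordering point: update a copy, flush it, then publish it with one pointer change. A minimal Python sketch (the FLUSH/FENCE are again only comments; real code would issue them before the pointer swing):

```python
# Illustrative shadow-paging swap: never overwrite live data; publish
# the finished shadow copy with a single atomic pointer update.

def shadow_swap(state, i, j):
    shadow = list(state['array'])            # create shadow copy
    shadow[i], shadow[j] = shadow[j], shadow[i]
    # On real PM: FLUSH(shadow); FENCE -- the one ordering point.
    state['array'] = shadow                  # atomic publish

state = {'array': ['X', 'Y']}
shadow_swap(state, 0, 1)
assert state['array'] == ['Y', 'X']
```

Because no live data is overwritten, a crash before the pointer update simply leaves the old, consistent version in place; no log or recovery replay is needed.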
Déjà vu? Functional Datastructures!
Purely functional datastructures are immutable
Implemented as efficient trees: Hash Array Mapped Trie (HAMT), RRB-Tree
Copying overheads reduced by structural sharing
[Figure: array[8] and updatedArray[8] share all untouched nodes after a swap]
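Structural sharing is the reason the copying stays cheap: an update copies only the tree path it touches. Here is a deliberately tiny Python sketch (a real HAMT/RRB-Tree has wider nodes and more levels; the two-element chunks below are an assumption for illustration):

```python
# Illustrative path copying: an 8-element "array" stored as four
# 2-element chunks. A functional update copies only the touched chunk
# and the root; the other three chunks are shared between versions.

CHUNK = 2

def make(values):
    return [values[k:k + CHUNK] for k in range(0, len(values), CHUNK)]

def update(root, index, value):
    c, off = divmod(index, CHUNK)
    new_chunk = list(root[c])     # copy only the touched leaf
    new_chunk[off] = value
    new_root = list(root)         # copy the root
    new_root[c] = new_chunk
    return new_root               # old version remains intact

old = make(list('ABCDEFGH'))
new = update(old, 5, 'Z')
assert old[2] == ['E', 'F'] and new[2] == ['E', 'Z']
assert new[0] is old[0] and new[3] is old[3]   # structural sharing
```

The old root doubles as the consistent pre-update version, which is exactly what shadow paging needs for free.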
Functional Shadowing = Shadow paging to minimize ordering constraints + Functional datastructures to reduce shadow paging overheads
MOD use cases • One Update to One Datastructure (Atomic, Consistent, Durable) • Multiple Updates to One Datastructure (Atomic, Consistent, Durable) • One Update per Multiple Datastructures (Atomic, Consistent, Durable)
1: One Update to One Datastructure
dsPtrShadow = dsPtr->Update(updateParams)
FENCE
dsPtr = dsPtrShadow
• Example: Update(arrayPtr, index, value)
• Atomic, Durable, Consistent with 1 FENCE
All flushes overlapped
[Figure: update produces shadowArray[8] from array[8]]
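The single-update pattern can be simulated as follows. This is a hedged Python sketch, not the MOD library: `PersistentRoot` and `functional_append` are hypothetical stand-ins for a persistent root pointer and any functional Update, and the FENCE is only a comment.

```python
# Illustrative MOD single-update pattern: Update builds a shadow
# version (all its flushes can overlap), one fence orders them,
# then the persistent root pointer flips atomically.

class PersistentRoot:
    def __init__(self, ds):
        self.ds = ds                       # the durable root pointer

def functional_append(ds, value):
    return ds + (value,)                   # shadow version; ds unchanged

root = PersistentRoot(('X',))
shadow = functional_append(root.ds, 'Y')   # flushes overlap during Update
# FENCE would go here: the single ordering point of the whole update.
root.ds = shadow                           # atomic root-pointer update
assert root.ds == ('X', 'Y')
```

A crash before the pointer flip leaves `root.ds` pointing at the old consistent version, so the update is atomic and consistent with one FENCE.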
2: Multiple Updates to One Datastructure
dsPtrShadow1 = dsPtr->Update1(updateParams)
dsPtrShadow2 = dsPtrShadow1->Update2(updateParams)
FENCE
dsPtr = dsPtrShadow2
Commit (dsPtr, dsPtrShadow1, dsPtrShadow2)
All flushes overlapped
[Figure: Update1 and Update2 chain ds → dsShadow1 → dsShadow2]
3: One Update per Multiple Datastructures
ds1PtrShadow = ds1Ptr->Update1(updateParams1)
ds2PtrShadow = ds2Ptr->Update2(updateParams2)
FENCE
Begin-TX { ds1Ptr = ds1PtrShadow; ds2Ptr = ds2PtrShadow } End-TX
Commit (ds1Ptr, ds1PtrShadow, ds2Ptr, ds2PtrShadow)
All flushes overlapped; more ordering points, but the transaction is short
Evaluation Methodology Used C++ library of functional datastructures: https://github.com/arximboldi/immer Used off-the-shelf persistent memory allocator: https://github.com/hyrise/nvm_malloc.git Compared against Intel PMDK v1.5 (latest, Oct 2018) https://github.com/pmem/pmdk
Performance Comparison on Optane [Charts: execution time normalized to PMDK]
MOD: Summary Minimizes number of ordering points in software Uses Shadow Paging instead of STM Requires changes in the application code
Outline Background (8 slides) Minimally Ordered Durable Datastructures for Persistent Memory (New, 20 slides) Hands-Off Persistence System (Prelim, 2 or 10 or 20 slides) Devirtualized Memory for Heterogeneous Systems (Prelim, 2 or 8 or 20 slides)
MOD: Summary vs. HOPS: Goals
MOD minimizes the number of ordering points in software; HOPS minimizes the cost of ordering points in hardware
MOD uses shadow paging instead of STM; HOPS improves STM performance
MOD requires changes in application code; HOPS requires minimal changes in library code
DFENCE: Durability Fence
Makes the stores preceding DFENCE durable
[Timeline: ST A=1 and ST B=2 precede the DFENCE and reach persistent memory; ST C=1 follows it]
OFENCE: Ordering Fence
Orders stores preceding OFENCE before later stores
[Timeline: ST A=1 precedes the OFENCE, ST B=2 follows it, so A reaches persistent memory no later than B]
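The split between the two primitives can be modeled with a toy persist buffer. This is an illustrative Python sketch of the semantics only (real HOPS implements this in per-core hardware persist buffers, not software):

```python
# Illustrative model: ofence only closes an epoch, constraining the
# order in which buffered stores may drain to PM; dfence actually
# drains everything, making prior stores durable.

class PersistBuffer:
    def __init__(self):
        self.pending = []      # closed epochs, oldest first
        self.open_epoch = []   # stores since the last fence
        self.pm = {}           # simulated persistent memory

    def store(self, addr, val):
        self.open_epoch.append((addr, val))

    def ofence(self):
        # Cheap: close the epoch; nothing is forced to PM yet.
        self.pending.append(self.open_epoch)
        self.open_epoch = []

    def dfence(self):
        # Expensive and rare: drain all epochs, in order, to PM.
        self.ofence()
        for epoch in self.pending:
            for addr, val in epoch:
                self.pm[addr] = val
        self.pending = []

pb = PersistBuffer()
pb.store('A', 1)
pb.ofence()            # A ordered before later stores, not yet durable
pb.store('B', 2)
assert pb.pm == {}     # nothing durable before the dfence
pb.dfence()
assert pb.pm == {'A': 1, 'B': 2}
```

Frequent ordering points thus stay off the critical path; only the rare durability points pay for draining to PM.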
Array Swap: HOPS
// TX-Start
value1 = arrayPtr->at(index1)
value2 = arrayPtr->at(index2)
LOG (index1, value1); ofence   // replaces FLUSH LOG; FENCE
arrayPtr[index1] = value2       // FLUSH (&arrayPtr[index1]) no longer needed
LOG (index2, value2); ofence   // replaces FLUSH LOG; FENCE
arrayPtr[index2] = value1       // FLUSH (&arrayPtr[index2]) no longer needed
dfence                          // replaces the final FENCE
// TX-End
Base System CPU CPU Private L1 Private L1 Shared LLC + Loads Stores Loads + Stores PM Controller DRAM Controller Volatile Persistent
Base System + Persist Buffers CPU CPU Persist Buffer Persist Buffer Private L1 Private L1 Shared LLC + Loads Stores Loads + Stores PM Controller DRAM Controller Volatile Persistent
NVM Analysis → HOPS Design
4% of accesses go to PM, 96% to DRAM → Volatile memory hierarchy (almost) unchanged
5-50 ordering points per transaction → Order writes without flushing
Self-dependencies common → Allow copies of the same cacheline
Cross-dependencies rare → Correct, conservative method based on coherence
Performance Evaluation on gem5 simulator [Chart, lower is better: HOPS vs. Ideal (best performance, unsafe on crash) vs. Baseline (uses Intel instructions)]
HOPS + MOD
MOD reduces ofences to allow greater flush overlap at the dfence
HOPS makes it easier to port functional datastructures to MOD
Outline Background (8 slides) Minimally Ordered Durable Datastructures for Persistent Memory (New, 20 slides) Hands-Off Persistence System (Prelim, 2 or 10 or 20 slides) Devirtualized Memory for Heterogeneous Systems (Prelim, 2 or 8 or 20 slides)
Virtual Memory hardware on CPUs
CPUs have expensive hardware to mitigate VM overheads
L1 Instruction TLB: 8-way SA, 64 entries, 4K pages; FA, 8 entries, 2M pages
L1 Data TLB: 4-way SA, 64 entries, 4K pages; 4-way SA, 32 entries, 2M pages; 4-way SA, 4 entries, 1G pages
L2 TLB: 12-way SA, 1536 entries, 4K/2M pages; 4-way SA, 16 entries, 1G pages
Virtual Memory hardware on Accelerators?
CPUs have expensive hardware to mitigate VM costs
Cannot replicate in power/area-constrained accelerators
[Diagram: accelerator compute logic dwarfed by CPU-class TLB structures (1536-entry L2 TLB, multiple set-associative L1 TLBs)]
3Ps for Memory Management How do we get the best of both worlds?
Devirtualized Memory (DVM) (Usually) Allocate data such that its VA == PA! Exploit PA == VA to mostly skip translation
Memory Accesses with DVM
[Flowchart: every access is validated. GOOD: permissions OK and PA == VA → load directly. UGLY: permissions OK but PA != VA → translate VA, then load. BAD: permission violation → exception]
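The three-way access path on this slide can be sketched in software. A hedged Python illustration (DVM does this check in hardware; the dicts standing in for memory, the page table, and the permission map are assumptions for the example):

```python
# Illustrative DVM access path: validate permissions on every access,
# load directly when PA == VA (the common case after devirtualized
# allocation), and fall back to translation only when PA != VA.

def dvm_load(memory, page_table, perms, va):
    if not perms.get(va, False):
        raise PermissionError('BAD: permission violation')
    pa = page_table.get(va, va)   # identity mapping unless remapped
    if pa == va:
        return memory[pa]         # GOOD: translation skipped
    return memory[pa]             # UGLY: rare translated access

memory = {100: 'direct', 200: 'remapped'}
page_table = {150: 200}           # one page migrated: VA 150 -> PA 200
perms = {100: True, 150: True}
assert dvm_load(memory, page_table, perms, 100) == 'direct'
assert dvm_load(memory, page_table, perms, 150) == 'remapped'
```

Since most allocations satisfy VA == PA, the common-case access needs only the cheap permission validation, which is where the low overheads come from.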
Performance Evaluation on gem5 simulator [Chart, lower is better: DVM adds ~4% overheads vs. unsafe Physical Addresses]
Persistent and Vast Memory is Here! Intel Optane Memory We have some timely answers!