Architecture And Software Support For Persistent and Vast Memory Swapnil Haria PhD Final Defense 6/10/19
Persistent and Vast Memory is Here! Intel Optane Memory So are its challenges …
Executive Summary Minimally Ordered Durable Datastructures (MOD) • How to build fast, recoverable applications without changing hardware? • Reduce the number of expensive ordering points in software • Perform 60% faster than state-of-the-art STM library Hands-Off Persistence System (HOPS) • How to improve recoverable application performance with hardware changes? • Separate hardware primitives for frequent ordering points and rarer durability events • Performs 24% faster than state-of-the-art STM library Devirtualized Memory (DVM) • How to lower virtual memory overheads in systems with vast memory? • Eliminate expensive address translation on most accesses • Reduces overheads to less than 4% of execution time
Outline Background (8 slides) Minimally Ordered Durable Datastructures for Persistent Memory (New, 20 slides) Hands-Off Persistence System (Prelim, 2 or 10 or 20 slides) Devirtualized Memory for Heterogeneous Systems (Prelim, 2 or 8 or 20 slides)
Many other PM technologies may follow Phase Change Memory Resistive RAM Magnetic RAM Spin Transfer Torque RAM
Many other PM technologies may follow High Capacity Data Persistence Near-DRAM latencies Asymmetric Read/Write Latency
Promise of Persistent Memory
Bigger DRAM and Durable In-memory Datastructures (care about memory): our focus
Faster Filesystems (care about storage): NOT OUR FOCUS
Programming Challenges CPU 0 CPU 1 Private L1 Private L1 Durability Consistency (Failure) Atomicity Shared Last-Level Cache PM Controller DRAM Controller Persistence Domain Volatile Persistent
Background: Software Transactional Memory
BEGIN-TX
value1 = array[index1]
value2 = array[index2]
LOG (index1, value1); FLUSH LOG; FENCE
array[index1] = value2
FLUSH (&array[index1])
LOG (index2, value2); FLUSH LOG; FENCE
array[index2] = value1
FLUSH (&array[index2])
FENCE
END-TX
[Figure: cache and PM state of array and LOG as the transaction executes]
Background: Software Transactional Memory
System crash mid-transaction: use the LOG to clean up the mess
[Figure: after a crash, PM holds a partially updated array plus the intact LOG entries Index1=X, Index2=Y]
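The undo-logging discipline on these two slides can be sketched in a few lines. This is a minimal Python simulation, not the STM library's code: "PM" is a plain dict, the FLUSH/FENCE steps are only marked in comments, and a crash is modeled by returning early.

```python
# Illustrative simulation of STM-style undo logging: log old values
# durably before mutating in place, so recovery can undo a torn update.

def swap_with_undo_log(pm, i, j, crash_after_first_store=False):
    """Swap pm['array'][i] and pm['array'][j] under an undo log."""
    a = pm['array']
    # LOG old values; on real PM: FLUSH LOG; FENCE before any in-place store.
    pm['log'] = [(i, a[i]), (j, a[j])]
    a[i], tmp = a[j], a[i]        # first in-place store (then FLUSH)
    if crash_after_first_store:
        return                    # simulated crash: array is torn
    a[j] = tmp                    # second in-place store (then FLUSH; FENCE)
    pm['log'] = []                # commit: retire the log

def recover(pm):
    """After a crash, replay the undo log to restore consistency."""
    for idx, old in pm['log']:
        pm['array'][idx] = old
    pm['log'] = []

pm = {'array': ['X', 'Y'], 'log': []}
swap_with_undo_log(pm, 0, 1, crash_after_first_store=True)
recover(pm)
assert pm['array'] == ['X', 'Y']   # back to the pre-transaction state
```

Note the two FENCE ordering points per logged location: one to make the log durable before the store, and one at commit. Reducing these is exactly what MOD targets.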
Outline Background (8 slides) Minimally Ordered Durable Datastructures for Persistent Memory (New, 20 slides) - Analysis of Flushing Overheads on Optane - Building Blocks for Functional Shadowing - MOD Datastructures - Evaluation Hands-Off Persistence System (Prelim, 2 or 10 or 20 slides) Devirtualized Memory for Heterogeneous Systems (Prelim, 2 or 8 or 20 slides)
STM performance on Optane [Chart: execution time, self-normalized]
Measuring Flush Overheads on Optane [Slide shows a long run of repeated FLUSH; FENCE pairs, one per ordering point]
MOD Datastructures: Goals Reduce ordering points with NO hardware modifications Library of recoverable datastructures: vector, map, queue, etc.
Eliminating Ordering Points How to provide atomicity and consistency with minimal ordering? Databases, Filesystems: Shadow Paging Key Idea: Avoid overwriting data!
Ordering in STM
BEGIN-TX
value1 = array[index1]
value2 = array[index2]
LOG (index1, value1); FLUSH LOG; FENCE
array[index1] = value2
FLUSH (&array[index1])
LOG (index2, value2); FLUSH LOG; FENCE
array[index2] = value1
FLUSH (&array[index2])
FENCE
END-TX
Background: Shadow Paging
value1 = array[index1]
value2 = array[index2]
shadow = array // Create shadow copy
shadow[index1] = value2
shadow[index2] = value1
FLUSH (shadow)
FENCE
// Application uses shadow subsequently
[Figure: cache and PM state of array and its shadow copy]
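The shadow-paging pattern above collapses to a single ordering point: update a copy, flush it, then publish it with one pointer change. A minimal Python sketch (the FLUSH/FENCE are again only comments; real code would issue them before the pointer swing):

```python
# Illustrative shadow-paging swap: never overwrite live data; publish
# the finished shadow copy with a single atomic pointer update.

def shadow_swap(state, i, j):
    shadow = list(state['array'])            # create shadow copy
    shadow[i], shadow[j] = shadow[j], shadow[i]
    # On real PM: FLUSH(shadow); FENCE -- the one ordering point.
    state['array'] = shadow                  # atomic publish

state = {'array': ['X', 'Y']}
shadow_swap(state, 0, 1)
assert state['array'] == ['Y', 'X']
```

Because no live data is overwritten, a crash before the pointer update simply leaves the old, consistent version in place; no log or recovery replay is needed.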
Déjà vu? Functional Datastructures!
Purely functional datastructures are immutable
Implemented as efficient trees: Hash Array Mapped Trie (HAMT), RRB-Tree
Copying overheads reduced by structural sharing
[Figure: array[8] and updatedArray[8] share all untouched nodes after a swap]
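Structural sharing is the reason the copying stays cheap: an update copies only the tree path it touches. Here is a deliberately tiny Python sketch (a real HAMT/RRB-Tree has wider nodes and more levels; the two-element chunks below are an assumption for illustration):

```python
# Illustrative path copying: an 8-element "array" stored as four
# 2-element chunks. A functional update copies only the touched chunk
# and the root; the other three chunks are shared between versions.

CHUNK = 2

def make(values):
    return [values[k:k + CHUNK] for k in range(0, len(values), CHUNK)]

def update(root, index, value):
    c, off = divmod(index, CHUNK)
    new_chunk = list(root[c])     # copy only the touched leaf
    new_chunk[off] = value
    new_root = list(root)         # copy the root
    new_root[c] = new_chunk
    return new_root               # old version remains intact

old = make(list('ABCDEFGH'))
new = update(old, 5, 'Z')
assert old[2] == ['E', 'F'] and new[2] == ['E', 'Z']
assert new[0] is old[0] and new[3] is old[3]   # structural sharing
```

The old root doubles as the consistent pre-update version, which is exactly what shadow paging needs for free.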
Functional Shadowing = Shadow paging to minimize ordering constraints + Functional datastructures to reduce shadow paging overheads
MOD use cases • One Update to One Datastructure (Atomic, Consistent, Durable) • Multiple Updates to One Datastructure (Atomic, Consistent, Durable) • One Update per Multiple Datastructures (Atomic, Consistent, Durable)
1: One Update to One Datastructure
dsPtrShadow = dsPtr->Update(updateParams)
FENCE
dsPtr = dsPtrShadow
• Example: Update(arrayPtr, index, value)
• Atomic, Durable, Consistent with 1 FENCE
All flushes overlapped
[Figure: update produces shadowArray[8] from array[8]]
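The single-update pattern can be simulated as follows. This is a hedged Python sketch, not the MOD library: `PersistentRoot` and `functional_append` are hypothetical stand-ins for a persistent root pointer and any functional Update, and the FENCE is only a comment.

```python
# Illustrative MOD single-update pattern: Update builds a shadow
# version (all its flushes can overlap), one fence orders them,
# then the persistent root pointer flips atomically.

class PersistentRoot:
    def __init__(self, ds):
        self.ds = ds                       # the durable root pointer

def functional_append(ds, value):
    return ds + (value,)                   # shadow version; ds unchanged

root = PersistentRoot(('X',))
shadow = functional_append(root.ds, 'Y')   # flushes overlap during Update
# FENCE would go here: the single ordering point of the whole update.
root.ds = shadow                           # atomic root-pointer update
assert root.ds == ('X', 'Y')
```

A crash before the pointer flip leaves `root.ds` pointing at the old consistent version, so the update is atomic and consistent with one FENCE.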
2: Multiple Updates to One Datastructure
dsPtrShadow1 = dsPtr->Update1(updateParams)
dsPtrShadow2 = dsPtrShadow1->Update2(updateParams)
FENCE
dsPtr = dsPtrShadow2
Commit (dsPtr, dsPtrShadow1, dsPtrShadow2)
All flushes overlapped
[Figure: Update1 and Update2 chain ds → dsShadow1 → dsShadow2]
3: One Update per Multiple Datastructures
ds1PtrShadow = ds1Ptr->Update1(updateParams1)
ds2PtrShadow = ds2Ptr->Update2(updateParams2)
FENCE
Begin-TX { ds1Ptr = ds1PtrShadow; ds2Ptr = ds2PtrShadow } End-TX
Commit (ds1Ptr, ds1PtrShadow, ds2Ptr, ds2PtrShadow)
All flushes overlapped; more ordering points, but the transaction is short
Evaluation Methodology Used C++ library of functional datastructures: https://github.com/arximboldi/immer Used off-the-shelf persistent memory allocator: https://github.com/hyrise/nvm_malloc.git Compared against Intel PMDK v1.5 (latest, Oct 2018) https://github.com/pmem/pmdk
Performance Comparison on Optane [Charts: execution time normalized to PMDK]
MOD: Summary Minimizes number of ordering points in software Uses Shadow Paging instead of STM Requires changes in the application code
Outline Background (8 slides) Minimally Ordered Durable Datastructures for Persistent Memory (New, 20 slides) Hands-Off Persistence System (Prelim, 2 or 10 or 20 slides) Devirtualized Memory for Heterogeneous Systems (Prelim, 2 or 8 or 20 slides)
MOD: Summary vs. HOPS: Goals
MOD minimizes the number of ordering points in software; HOPS minimizes the cost of ordering points in hardware
MOD uses shadow paging instead of STM; HOPS improves STM performance
MOD requires changes in application code; HOPS requires minimal changes in library code
DFENCE: Durability Fence
Makes the stores preceding DFENCE durable
[Timeline: ST A=1 and ST B=2 precede the DFENCE and reach persistent memory; ST C=1 follows it]
OFENCE: Ordering Fence
Orders stores preceding OFENCE before later stores
[Timeline: ST A=1 precedes the OFENCE, ST B=2 follows it, so A reaches persistent memory no later than B]
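The split between the two primitives can be modeled with a toy persist buffer. This is an illustrative Python sketch of the semantics only (real HOPS implements this in per-core hardware persist buffers, not software):

```python
# Illustrative model: ofence only closes an epoch, constraining the
# order in which buffered stores may drain to PM; dfence actually
# drains everything, making prior stores durable.

class PersistBuffer:
    def __init__(self):
        self.pending = []      # closed epochs, oldest first
        self.open_epoch = []   # stores since the last fence
        self.pm = {}           # simulated persistent memory

    def store(self, addr, val):
        self.open_epoch.append((addr, val))

    def ofence(self):
        # Cheap: close the epoch; nothing is forced to PM yet.
        self.pending.append(self.open_epoch)
        self.open_epoch = []

    def dfence(self):
        # Expensive and rare: drain all epochs, in order, to PM.
        self.ofence()
        for epoch in self.pending:
            for addr, val in epoch:
                self.pm[addr] = val
        self.pending = []

pb = PersistBuffer()
pb.store('A', 1)
pb.ofence()            # A ordered before later stores, not yet durable
pb.store('B', 2)
assert pb.pm == {}     # nothing durable before the dfence
pb.dfence()
assert pb.pm == {'A': 1, 'B': 2}
```

Frequent ordering points thus stay off the critical path; only the rare durability points pay for draining to PM.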
Array Swap: HOPS
// TX-Start
value1 = arrayPtr->at(index1)
value2 = arrayPtr->at(index2)
LOG (index1, value1); ofence   // replaces FLUSH LOG; FENCE
arrayPtr[index1] = value2       // FLUSH (&arrayPtr[index1]) no longer needed
LOG (index2, value2); ofence   // replaces FLUSH LOG; FENCE
arrayPtr[index2] = value1       // FLUSH (&arrayPtr[index2]) no longer needed
dfence                          // replaces the final FENCE
// TX-End
Base System CPU CPU Private L1 Private L1 Shared LLC + Loads Stores Loads + Stores PM Controller DRAM Controller Volatile Persistent
Base System + Persist Buffers CPU CPU Persist Buffer Persist Buffer Private L1 Private L1 Shared LLC + Loads Stores Loads + Stores PM Controller DRAM Controller Volatile Persistent
NVM Analysis → HOPS Design
4% of accesses go to PM, 96% to DRAM → Volatile memory hierarchy (almost) unchanged
5-50 ordering points per transaction → Order writes without flushing
Self-dependencies common → Allow copies of the same cacheline
Cross-dependencies rare → Correct, conservative method based on coherence
Performance Evaluation on gem5 simulator [Chart, lower is better: HOPS vs. Ideal (best performance, unsafe on crash) vs. Baseline (uses Intel instructions)]
HOPS + MOD
MOD reduces ofences to allow greater flush overlap at the dfence
HOPS makes it easier to port functional datastructures to MOD
Outline Background (8 slides) Minimally Ordered Durable Datastructures for Persistent Memory (New, 20 slides) Hands-Off Persistence System (Prelim, 2 or 10 or 20 slides) Devirtualized Memory for Heterogeneous Systems (Prelim, 2 or 8 or 20 slides)
Virtual Memory hardware on CPUs
CPUs have expensive hardware to mitigate VM overheads
L1 Instruction TLB: 8-way SA, 64 entries, 4K pages; FA, 8 entries, 2M pages
L1 Data TLB: 4-way SA, 64 entries, 4K pages; 4-way SA, 32 entries, 2M pages; 4-way SA, 4 entries, 1G pages
L2 TLB: 12-way SA, 1536 entries, 4K/2M pages; 4-way SA, 16 entries, 1G pages
Virtual Memory hardware on Accelerators?
CPUs have expensive hardware to mitigate VM costs
Cannot replicate in power/area-constrained accelerators
[Diagram: accelerator compute logic dwarfed by CPU-class TLB structures (1536-entry L2 TLB, multiple set-associative L1 TLBs)]
3Ps for Memory Management How do we get the best of both worlds?
Devirtualized Memory (DVM) (Usually) Allocate data such that its VA == PA! Exploit PA == VA to mostly skip translation
Memory Accesses with DVM
[Flowchart: every access is validated. GOOD: permissions OK and PA == VA → load directly. UGLY: permissions OK but PA != VA → translate VA, then load. BAD: permission violation → exception]
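The three-way access path on this slide can be sketched in software. A hedged Python illustration (DVM does this check in hardware; the dicts standing in for memory, the page table, and the permission map are assumptions for the example):

```python
# Illustrative DVM access path: validate permissions on every access,
# load directly when PA == VA (the common case after devirtualized
# allocation), and fall back to translation only when PA != VA.

def dvm_load(memory, page_table, perms, va):
    if not perms.get(va, False):
        raise PermissionError('BAD: permission violation')
    pa = page_table.get(va, va)   # identity mapping unless remapped
    if pa == va:
        return memory[pa]         # GOOD: translation skipped
    return memory[pa]             # UGLY: rare translated access

memory = {100: 'direct', 200: 'remapped'}
page_table = {150: 200}           # one page migrated: VA 150 -> PA 200
perms = {100: True, 150: True}
assert dvm_load(memory, page_table, perms, 100) == 'direct'
assert dvm_load(memory, page_table, perms, 150) == 'remapped'
```

Since most allocations satisfy VA == PA, the common-case access needs only the cheap permission validation, which is where the low overheads come from.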
Performance Evaluation on gem5 simulator [Chart, lower is better: DVM adds ~4% overheads vs. unsafe Physical Addresses]
Persistent and Vast Memory is Here! Intel Optane Memory We have some timely answers!