SSMalloc: A Low-latency, Locality-conscious Memory Allocator with Stable Performance Scalability
Ran Liu (Fudan Univ., Shanghai Jiaotong Univ.)
Haibo Chen (Shanghai Jiaotong Univ.)
Background
• Many-core era
  • Computers with tens of cores are available
• Many-thread applications
  • Server programs
  • Scientific computation programs
  • …
• The performance of many applications relies heavily on the memory allocator
Allocator performance matters
• Web server throughput with different memory allocators
• *Taken from the Facebook website
Is it a solved problem?
[Figure: scale-up of glibc, SFMalloc (PACT11), jemalloc (BSDCan06), and Streamflow (ISMM06)]
Is it a solved problem?
[Figure: scale-up versus number of cores; scaling is unstable, with regions labeled kernel contention and user-level contention]
The main problems in modern memory allocators
• Unstable scalability
  • Critical path contention
  • Global data structure contention
  • Kernel contention
    • With 64 threads, SFMalloc spends a large share of its time in mmap calls.
• Unstable locality
  • Kernel execution
  • Allocator data structure operations
  • Context switches
• Unstable latency
  • Algorithm complexity
    • jemalloc uses RB trees (O(log N)) internally.
  • Hardware details (pipeline, branch prediction, cache)
This paper
[Figure: scale-up versus number of cores; scaling is stable]
Mechanism for objects of different sizes
• Small objects
  • Closely related to scalability
  • Handled in a private heap
• Large objects
  • Forwarded to the OS via mmap
Small object (<=64KB) management
[Diagram: Thread 1, Thread 2, …, Thread N; Private Heap 1, Private Heap 2, …, Private Heap N; Memory Chunks; Global Pool; OS]
Memory Chunk
• Basic unit of memory management
• Contains multiple objects of the same size class
• Avoids false sharing on allocator data structures
[Diagram: a chunk is a Header followed by Obj 1 … Obj N; the header is split into Private RW, Shared R, and Shared W parts]
Memory Chunk (Cont.)
• Same size
  • Cross size-class reuse
  • Easy metadata locating
• Unaligned size (65536 + 256 bytes)
  • Mitigates cache conflicts on the header
[Diagram: Header (256 bytes) followed by Data Area (65536 bytes), mapped onto the cache]
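The two chunk slides can be illustrated with a minimal C sketch. It is not taken from SSMalloc's source: the field names, the 64-byte cache line, and the `header_cache_set` helper are assumptions made for illustration. Only the 256-byte header, the 65536-byte data area, and the private RW / shared R / shared W split come from the slides. The struct puts each group of fields on its own cache line, and the helper shows why a stride of 65536 + 256 bytes spreads chunk headers over different cache sets while a stride of exactly 65536 bytes would not.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE        64u     /* assumed cache-line size in bytes */
#define CHUNK_HEADER_SIZE 256u    /* header size from the slide */
#define CHUNK_DATA_SIZE   65536u  /* data-area size from the slide */
#define CHUNK_SIZE        (CHUNK_DATA_SIZE + CHUNK_HEADER_SIZE)

/* Hypothetical header layout: every field name here is illustrative. */
typedef struct chunk_header {
    /* Private RW: read and written only by the owning thread. */
    alignas(CACHE_LINE) void *free_list;
    uint32_t free_count;
    /* Shared R: set once, afterwards only read by any thread. */
    alignas(CACHE_LINE) uint32_t object_size;
    uint32_t owner_id;
    /* Shared W: written by other threads, e.g. on a remote free. */
    alignas(CACHE_LINE) void *remote_free_list;
} chunk_header;

/* The header must fit in the 256 bytes reserved for it. */
static_assert(sizeof(chunk_header) <= CHUNK_HEADER_SIZE,
              "header exceeds 256 bytes");
/* Each group of fields starts on its own cache line. */
static_assert(offsetof(chunk_header, object_size) == CACHE_LINE,
              "shared R group is not on its own line");
static_assert(offsetof(chunk_header, remote_free_list) == 2 * CACHE_LINE,
              "shared W group is not on its own line");

/*
 * Cache set that the header of the index-th chunk maps to, assuming chunks
 * are laid out back to back with the given stride and the cache has
 * num_sets sets of CACHE_LINE bytes each (a simplified model).
 */
static unsigned header_cache_set(unsigned index, unsigned stride,
                                 unsigned num_sets)
{
    return (unsigned)(((uint64_t)index * stride / CACHE_LINE) % num_sets);
}
```

In this model, with 512 sets, a 65536-byte stride maps every header to set 0, whereas the 65792-byte stride moves each successive header four sets further along.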
Private Heap
[Diagram: Full Chunks, Foreground Chunks, Background Chunks (LIFO linked list), Local Free Chunks]
Private Heap (Cont.)
[Diagram: Full Chunks, Foreground Chunks, Background Chunks (LIFO linked list), Local Free Chunks, annotated with the labels Hot Chunks and Cold Chunks]
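A minimal sketch of the LIFO background list named on the private heap slides follows. The slides do not say how a foreground chunk is replaced, so `refill_foreground` is one plausible reading, not SSMalloc's actual policy, and all type and function names are hypothetical. The point of LIFO order is that the chunk pushed most recently, which is the one most likely to still be hot in the cache, is the first to be reused.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunk descriptor; only the list link matters here. */
typedef struct chunk {
    struct chunk *next;
} chunk;

/* Hypothetical private heap; owned by one thread, so no synchronization. */
typedef struct private_heap {
    chunk *foreground;  /* chunk currently serving allocations */
    chunk *background;  /* head of the LIFO list of background chunks */
} private_heap;

/* Push a chunk on the background list (most recent first). */
static void background_push(private_heap *heap, chunk *c)
{
    c->next = heap->background;
    heap->background = c;
}

/* Pop the most recently pushed chunk, or NULL if the list is empty. */
static chunk *background_pop(private_heap *heap)
{
    chunk *c = heap->background;
    if (c != NULL) {
        heap->background = c->next;
        c->next = NULL;
    }
    return c;
}

/*
 * Assumed policy: when the foreground chunk is exhausted, replace it with
 * the most recently pushed background chunk.
 */
static chunk *refill_foreground(private_heap *heap)
{
    heap->foreground = background_pop(heap);
    return heap->foreground;
}
```

Because the heap is private to one thread, push and pop are plain pointer updates with no locks or atomic operations.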
Global Pool
[Diagram: Private Heap A and Private Heap B; global reuse (lock-free); new chunk allocation (lock-free) from the Raw Memory Pool]
• The Raw Memory Pool is enlarged exponentially to avoid mmap calls
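The effect of exponential enlargement can be shown with a small accounting model in C. It is only a sketch: the real mmap call is replaced by a counter, the starting region size of one chunk is an assumption, and the model is single-threaded, whereas the slide describes the global pool as lock-free. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the raw memory pool; sizes are counted in chunks. */
typedef struct raw_pool {
    size_t next_region_chunks;  /* size of the next mapping; doubles each time */
    size_t available;           /* chunks left in the current mapping */
    size_t map_calls;           /* stands in for the number of mmap calls */
} raw_pool;

/* Hand out one chunk, growing the pool when the current region is used up. */
static void raw_pool_take_chunk(raw_pool *pool)
{
    if (pool->available == 0) {
        /* A real allocator would mmap next_region_chunks chunks here. */
        pool->available = pool->next_region_chunks;
        pool->next_region_chunks *= 2;
        pool->map_calls++;
    }
    pool->available--;
}

/* Number of modeled mmap calls needed to hand out the given chunk count. */
static size_t raw_pool_map_calls_for(size_t chunks)
{
    raw_pool pool = { 1, 0, 0 };  /* assumed: the first region holds one chunk */
    for (size_t i = 0; i < chunks; i++) {
        raw_pool_take_chunk(&pool);
    }
    return pool.map_calls;
}
```

Under these assumptions, handing out 1000 chunks takes 10 modeled mmap calls, since the regions hold 1, 2, 4, … chunks; mapping one chunk at a time would take 1000.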
Global Pool (Cont.)
• Interaction with the OS
  • SSMalloc: time-directed reclamation
    • Reduces VM management calls
  • Many other allocators: space-directed reclamation
    • Memory pages ping-pong between user and kernel
    • Excessive VM management calls
[Figure: memory amount over time]
How to free an object?
• Problem: determine the size of a memory object
• Textbook solution: per-object header
  • Easy to locate, bad locality
• Modern allocators: centralized metadata
  • Hard to locate (bitmap, hash table, radix tree, …), good locality
How to free an object?
• Problem: determine the size of a memory object
• SSMalloc: unified header for small and large objects
  • Every object's header is at the preceding chunk boundary
  • Easy to locate (align to chunk boundary), good locality
[Diagram: Small Objects, Large Objects]
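Finding the header at the preceding chunk boundary can be sketched in C as below. The slides give only the chunk size of 65536 + 256 bytes, which is not a power of two, so a simple bit mask does not work. The sketch therefore assumes that chunks are laid out back to back from a known pool base address; that layout, the `pool_base` parameter, and the function name are assumptions for illustration, not SSMalloc's documented interface. It covers the small-object case only.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_HEADER_SIZE 256u    /* header size from the slides */
#define CHUNK_DATA_SIZE   65536u  /* data-area size from the slides */
#define CHUNK_SIZE        (CHUNK_DATA_SIZE + CHUNK_HEADER_SIZE)

/*
 * Address of the header of the chunk that contains object_addr.
 * Assumption: chunks are contiguous, starting at pool_base, each occupying
 * CHUNK_SIZE bytes with the header at the start of the chunk.
 */
static uintptr_t header_of(uintptr_t pool_base, uintptr_t object_addr)
{
    assert(object_addr >= pool_base);
    uintptr_t offset = object_addr - pool_base;
    /* Round down to the start of the enclosing chunk. */
    return object_addr - offset % CHUNK_SIZE;
}
```

A free path built on this would read the object size from the header it finds, so no per-object header and no separate lookup structure is needed.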
Design summary
• Scalability
  • Sync-free critical path
  • Local memory reuse
  • Lock-free global data structures
  • Avoidance of excessive VM management calls (mmap, munmap)
  • …
• Latency
  • Wait-free algorithms within the private heap
  • Short critical path
  • Unified header
  • …
• Locality
  • Locality-conscious memory chunk management
  • Allocator false-sharing avoidance
  • …
Evaluation
• Platform
  • 8 six-core (2.4 GHz) AMD x64 system (48 cores in total)
  • 128 GB memory
  • Linux 3.2.10
• Other memory allocators
  • glibc
  • TCMalloc from google-perftools 1.7
  • jemalloc 2.1.2
  • Streamflow
  • SFMalloc
Latency
• Allocation-intensive serial programs
Scalability
• shbench performance
Locality
• Wordcount from Phoenix 2.0: cache misses
Map-reduce performance
• Wordcount from Phoenix 2.0
Conclusion
• Analyzed the performance problems of memory allocators
• Explored the design space of memory allocators for many-thread applications on many-core systems
• A prototype: SSMalloc
  • Low latency
  • Stable scalability
  • Good locality
Thanks!
Why not modify the kernel to improve mmap scalability?
• Parallelizing the VM management operations requires extensive kernel code refactoring
  • The memory manager itself
  • Device drivers
• Applying a new memory allocator is much easier and more practical.