Ran Liu (Fudan Univ., Shanghai Jiaotong Univ.), Haibo Chen (Shanghai Jiaotong Univ.)

SSMalloc: A Low-latency, Locality-conscious Memory Allocator with Stable Performance Scalability

Background
• Many-core era
  • Computers with tens of cores are available
• Many-thread applications
  • Server programs
  • Scientific computation programs
  • …
• The performance of many applications relies heavily on the memory allocator

Allocator performance matters
• [Figure: web server throughput with different memory allocators]
• *Taken from the Facebook website

Is it a solved problem?
[Figure: scale-up of glibc, SFMalloc (PACT11), jemalloc (BSDCan06), and Streamflow (ISMM06)]

Is it a solved problem?
[Figure: scale-up vs. number of cores; unstable scaling, annotated with kernel contention and user-level contention]

The main problems in modern memory allocators
• Unstable scalability
  • Critical path contention
  • Global data structure contention
  • Kernel contention
    • With 64 threads, SFMalloc spends a large share of its time in mmap calls
• Unstable locality
  • Kernel execution
  • Allocator data structure operations
  • Context switches
• Unstable latency
  • Algorithm complexity
    • jemalloc uses red-black trees (O(log N)) internally
  • Hardware details (pipeline, branch prediction, cache)

This paper
[Figure: scale-up vs. number of cores; stable scaling]

Design of SSMalloc

Mechanisms for objects of different sizes
• Small objects
  • Closely related to scalability
  • Handled in the private heap
• Large objects
  • Forwarded to the OS via mmap

Small object (<= 64 KB) management
[Diagram: Thread 1, Thread 2, …, Thread N; Private Heap 1, Private Heap 2, …, Private Heap N; memory chunks; global pool; OS]

Memory Chunk
• Basic unit of memory management
• Contains multiple objects of the same size class
• Avoids false sharing on allocator data structures
[Diagram: a chunk holds a header followed by Obj 1 … Obj N; the header is divided into Private RW, Shared R, and Shared W parts]

Memory Chunk (Cont.)
• Same size
  • Cross size-class reuse
  • Easy metadata locating
• Unaligned size (65536 + 256 bytes)
  • Mitigates cache conflicts on headers
[Diagram: a 256-byte header followed by a 65536-byte data area, mapped onto the cache]
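The cache-conflict point can be checked with a little arithmetic. The sketch below is only an illustration: the header and data sizes come from the slide, while the cache geometry (64-byte lines, 512 sets) and the helper names `cache_set` and `header_set` are assumptions made for the example, not details of SSMalloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes taken from the slide. */
#define HEADER_SIZE 256u
#define DATA_SIZE   65536u
#define CHUNK_SIZE  (HEADER_SIZE + DATA_SIZE)   /* 65792 bytes */

/* Hypothetical cache geometry, chosen only for illustration. */
#define LINE_SIZE 64u
#define NUM_SETS  512u

/* Cache set that a byte address maps to under the assumed geometry. */
static unsigned cache_set(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Set index of the header of chunk number `i` when chunks are packed
 * back to back with the given stride, starting at address 0. */
static unsigned header_set(size_t i, size_t stride)
{
    assert(stride > 0);
    return cache_set((uintptr_t)(i * stride));
}
```

Under these assumptions, a power-of-two stride of 65536 bytes puts every chunk header into set 0, whereas the 65792-byte stride moves each successive header four sets further on, so headers of neighbouring chunks no longer compete for the same cache set.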

Private Heap
[Diagram: full chunks, foreground chunks, background chunks (LIFO linked list), local free chunks]

Private Heap (Cont.)
[Diagram: full chunks, foreground chunks, background chunks (LIFO linked list), local free chunks, annotated with "Hot Chunks" and "Cold Chunks"]
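A LIFO linked list, as used for the background chunks, hands back the most recently inserted chunk first, which is the one most likely to still be warm in the cache. The following is a minimal sketch of such a list for a single owning thread; the type and function names (`chunk_t`, `chunk_list_t`, `list_push`, `list_pop`) are hypothetical and not taken from the SSMalloc sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunk descriptor; only the list link matters here. */
typedef struct chunk {
    struct chunk *next;
    int id;
} chunk_t;

/* A LIFO list owned by one thread, so no synchronization is needed. */
typedef struct {
    chunk_t *head;
} chunk_list_t;

/* Push a chunk onto the front of the list. */
static void list_push(chunk_list_t *l, chunk_t *c)
{
    assert(l != NULL && c != NULL);
    c->next = l->head;
    l->head = c;
}

/* Pop the most recently pushed chunk, or return NULL if the list is empty. */
static chunk_t *list_pop(chunk_list_t *l)
{
    chunk_t *c = l->head;
    if (c != NULL) {
        l->head = c->next;
        c->next = NULL;
    }
    return c;
}
```

Because the list belongs to one private heap, both operations are a handful of loads and stores with no atomic instructions, which fits the short, synchronization-free critical path the design aims for.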

Global Pool
• Global reuse (lock-free)
• Allocation of new chunks (lock-free)
• The raw memory pool is enlarged exponentially to avoid mmap calls
[Diagram: Private Heap A and Private Heap B on top of the global pool and the raw memory pool]
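One way to carve new chunks out of a raw memory pool without a lock is an atomic bump pointer. The sketch below shows that idea with C11 atomics, plus a small helper that counts how many enlargements a doubling pool needs. It is an illustration under assumptions: the names `raw_pool_t`, `pool_init`, `pool_alloc_chunk`, and `growth_steps` are hypothetical, the growth factor of 2 is assumed, and the slides do not say that SSMalloc is implemented this way.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE (65536u + 256u)   /* chunk size from the slides */

/* Hypothetical raw memory pool backed by one contiguous region. */
typedef struct {
    uintptr_t base;       /* start address of the region */
    size_t capacity;      /* size of the region in bytes */
    atomic_size_t next;   /* offset of the first unused byte */
} raw_pool_t;

static void pool_init(raw_pool_t *p, void *mem, size_t capacity)
{
    assert(p != NULL && mem != NULL);
    p->base = (uintptr_t)mem;
    p->capacity = capacity;
    atomic_init(&p->next, 0);
}

/* Claim one chunk with a single atomic add; no lock is taken.
 * Returns NULL when the region is exhausted. A real allocator would
 * enlarge the pool at this point instead of failing. */
static void *pool_alloc_chunk(raw_pool_t *p)
{
    size_t off = atomic_fetch_add(&p->next, CHUNK_SIZE);
    if (off + CHUNK_SIZE > p->capacity)
        return NULL;
    return (void *)(p->base + off);
}

/* Number of doublings needed to grow from `initial` to at least `target`
 * (assumed growth factor of 2). */
static unsigned growth_steps(size_t initial, size_t target)
{
    assert(initial > 0);
    unsigned steps = 0;
    for (size_t size = initial; size < target; size *= 2)
        steps++;
    return steps;
}
```

With doubling, the number of enlargements grows only logarithmically with the amount of memory requested: growing from 1 unit to 1024 units takes 10 steps rather than 1023, which is why exponential enlargement keeps the number of mmap calls low.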

Global Pool (Cont.)
• Interaction with the OS
• SSMalloc (time-directed reclamation)
  • Reduces VM management calls
• Many other allocators (space-directed reclamation)
  • Memory pages ping-pong between user space and the kernel
  • Excessive VM management calls
[Figure: memory amount over time]

How to free an object?
• Problem: determining the size of a memory object
• Textbook solution: per-object header
  • Easy to locate, bad locality
• Modern allocators: centralized metadata
  • Hard to locate (bitmap, hash table, radix tree, …), good locality
[Diagram: memory objects with headers marked "H" and one marked "?"]

How to free an object?
• Problem: determining the size of a memory object
• SSMalloc: unified header for small and large objects
  • Every object's header is at the preceding chunk boundary
  • Easy to locate (align to chunk boundary), good locality
[Diagram: small objects and large objects]
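Locating a header "at the preceding chunk boundary" can be sketched as rounding an object's address down to the start of its chunk. The example assumes that chunks are packed back to back from a known pool base address, so that boundaries fall at multiples of the chunk size; this layout, the `chunk_header_t` type, and the functions `header_of` and `object_size_of` are assumptions for illustration, not SSMalloc's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE 256u
#define DATA_SIZE   65536u
#define CHUNK_SIZE  (HEADER_SIZE + DATA_SIZE)

/* Hypothetical chunk header; a real header would hold more fields. */
typedef struct {
    size_t object_size;   /* size of every object stored in this chunk */
} chunk_header_t;

/* Round the object's address down to the preceding chunk boundary,
 * assuming chunks start at pool_base + k * CHUNK_SIZE. */
static chunk_header_t *header_of(uintptr_t pool_base, const void *obj)
{
    uintptr_t addr = (uintptr_t)obj;
    assert(addr >= pool_base);
    uintptr_t offset = addr - pool_base;
    uintptr_t chunk_start = pool_base + (offset / CHUNK_SIZE) * CHUNK_SIZE;
    return (chunk_header_t *)chunk_start;
}

/* The size needed by free() is read from the chunk header, so no
 * per-object header and no centralized lookup structure is required. */
static size_t object_size_of(uintptr_t pool_base, const void *obj)
{
    return header_of(pool_base, obj)->object_size;
}
```

The lookup is a subtraction, a division, and one memory read, and all objects in a chunk share the same header, which is what gives this scheme both cheap locating and good locality.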

Design summary
• Scalability
  • Sync-free critical path
  • Local memory reuse
  • Lock-free global data structure
  • Avoidance of excessive VM management calls (mmap, munmap)
  • …
• Latency
  • Wait-free algorithm within the private heap
  • Short critical path
  • Unified header
  • …
• Locality
  • Locality-conscious memory chunk management
  • Allocator false-sharing avoidance
  • …

Evaluation

Evaluation
• Platform
  • Eight six-core 2.4 GHz AMD x64 processors (48 cores in total)
  • 128 GB memory
  • Linux 3.2.10
• Other memory allocators
  • glibc
  • TCMalloc from google-perftools 1.7
  • jemalloc 2.1.2
  • Streamflow
  • SFMalloc

Latency
• Allocation-intensive serial programs

Scalability
• shbench performance

Locality
• Wordcount from Phoenix 2.0: cache misses

Map-reduce performance
• Wordcount from Phoenix 2.0

Conclusion
• Analyzed the performance problems of memory allocators
• Explored the design space of memory allocators for many-thread applications on many-core systems
• A prototype: SSMalloc
  • Low latency
  • Stable scalability
  • Good locality
Thanks!

Why not modify the kernel to improve mmap scalability?
• Parallelizing the VM management operations requires a huge refactoring of kernel code
  • The memory manager itself
  • Device drivers
• Adopting a new memory allocator is much easier and more practical
