1 / 21

Hoard: A Scalable Memory Allocator for Multithreaded Applications

Hoard: A Scalable Memory Allocator for Multithreaded Applications. Emery Berger , Kathryn McKinley * , Robert Blumofe, Paul Wilson. Department of Computer Sciences. * Department of Computer Science. Motivation. Parallel multithreaded programs becoming prevalent

jeaniem
Download Presentation

Hoard: A Scalable Memory Allocator for Multithreaded Applications

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Hoard: A Scalable Memory Allocator for Multithreaded Applications Emery Berger, Kathryn McKinley*, Robert Blumofe, Paul Wilson Department of Computer Sciences *Department of Computer Science Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  2. Motivation Parallel multithreaded programs becoming prevalent web servers, search engines, database managers, etc. run on SMP’s for high performance often embarrassingly parallel Memory allocation is a bottleneck prevents scaling with number of processors Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  3. Assessment Criteria for Multiprocessor Allocators Speed competitive with uniprocessor allocators on one processor Scalability performance linear with the number of processors Fragmentation (= max allocated / max in use) competitive with uniprocessor allocators worst-case and average-case Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  4. Uniprocessor Allocators on Multiprocessors Fragmentation: Excellent Very low for most programs [Wilson & Johnstone] Speed & Scalability: Poor Heap contention a single lock protects the heap Can exacerbate false sharing different processors can share cache lines Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  5. Allocators cause false sharing! Cache lines can end up spread across a number of processors Practically all allocators do this Allocator-InducedFalse Sharing A cache line processor 1 processor 2 x1 = malloc(s); x2 = malloc(s); thrash… thrash… Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  6. Existing Multiprocessor Allocators Speed: One concurrent heap (e.g., concurrent B-tree): too expensive too many locks/atomic updates O(log n) cost per memory operation  Fast allocators use multiple heaps Scalability: Allocator-induced false sharing and other bottlenecks Fragmentation: P-fold increase or even unbounded Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  7. Multiprocessor Allocator I:Pure Private Heaps Pure private heaps:one heap per processor. malloc gets memoryfrom the processor's heap or the system free puts memory on the processor's heap Avoids heap contention Examples: STL, ad hoc (e.g., Cilk 4.1) processor 1 processor 2 x1= malloc(s) x2= malloc(s) free(x1) free(x2) x4= malloc(s) x3= malloc(s) = allocated by heap 1 = free, on heap 2 Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  8. How to Break Pure Private Heaps: Fragmentation Pure private heaps: memory consumption can grow without bound! Producer-consumer: processor 1 allocates processor 2 frees processor 1 processor 2 x1= malloc(s) free(x1) x2= malloc(s) free(x2) x3= malloc(s) free(x3) Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  9. Multiprocessor Allocator II:Private Heaps with Ownership Private heaps with ownership:free puts memory back on the originating processor's heap. Avoids unbounded memory consumption Examples: ptmalloc [Gloger], LKmalloc [Larson & Krishnan] processor 1 processor 2 x1= malloc(s) free(x1) x2= malloc(s) free(x2) Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  10. How to Break Private Heaps with Ownership:Fragmentation Private heaps with ownership:memory consumption can blowup by a factor of P. Round-robin producer-consumer: processor i allocates processor i+1 frees This really happens (NDS). processor 1 processor 2 processor 3 x1= malloc(s) free(x1) x2= malloc(s) free(x2) x3=malloc(s) free(x3) Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  11. So What Do We Do Now? Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  12. The Hoard Multiprocessor Memory Allocator Manages memory in page-sized superblocks of same-sized objects - Avoids false sharing by not carving up cache lines - Avoids heap contention - local heaps allocate & free small blocks from their set of superblocks Adds a global heap that is a repository of superblocks When the fraction of free memory exceeds the empty fraction, moves superblocks to the global heap - Avoids blowup in memory consumption Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  13. Hoard Example processor 1 global heap Hoard:one heap per processor+ a global heap malloc gets memory from a superblock on its heap. free returns memory to its superblock. If the heap is “too empty”, it moves a superblock to the global heap. x1= malloc(s) …some mallocs …some frees free(x7) Empty fraction = 1/3 Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  14. Summary of Analytical Results Worst-case memory consumption: O(n log M/m + P) [instead of O(P n log M/m)] n = memory required M = biggest object size m = smallest object size P = number of processors Best possible: O(n log M/m) [Robson] Provably low synchronization in most cases Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  15. Experiments Run on a dedicated 14-processor Sun Enterprise 300 MHz UltraSparc, 1 GB of RAM Solaris 2.7 All programs compiled with g++ version 2.95.1 Allocators: Hoard version 2.0.2 Solaris (system allocator) Ptmalloc (GNU libc – private heaps with ownership) mtmalloc (Sun’s “MT-hot” allocator) Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  16. Performance: threadtest speedup(x,P) = runtime(Solaris allocator, one processor) / runtime(x on P processors) Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  17. Performance: Larson Server-style benchmark with sharing Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  18. Performance: false sharing Each thread reads & writes heap data Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  19. Fragmentation Results On most standard uniprocessor benchmarks,Hoard’s fragmentation was low: p2c (Pascal-to-C): 1.20 espresso: 1.47 LRUsim: 1.05 Ghostscript: 1.15 Within 20% of Lea’s allocator On the multiprocessor benchmarksand other codes: Fragmentation was between 1.02 and 1.24 for all but one anomalous benchmark (shbench: 3.17). Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  20. Hoard Conclusions Speed: Excellent As fast as a uniprocessor allocator on one processor amortized O(1) cost 1 lock for malloc, 2 for free Scalability: Excellent Scales linearly with the number of processors Avoids false sharing Fragmentation: Very good Worst-case is provably close to ideal Actual observed fragmentation is low Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

  21. Hoard Heap Details “Segregated size class” allocator Size classes are logarithmically-spaced Superblocks hold objects of one size class empty superblocks are “recycled” Approximately radix-sorted: Allocate from mostly-full superblocks Fast removal of mostly-empty superblocks 8 40 16 24 48 32 sizeclass bins radix-sorted superblock lists (emptiest to fullest) superblocks Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000

More Related