A non-blocking approach on GPU dynamical memory management

Joy Lee @ NVIDIA

Outline
• Introduction to the buddy memory system
• Our parallel implementation
• Performance comparison
• Discussion

Fixed size memory (memory pool)
• The fastest and simplest memory system
• Free list (item = address)
  • Each item in the free list records an address available for allocation.
  • The free list can be implemented with a queue, stack, list, or any other data structure.
• Allocate
  • Just take one item from the free list.
• Free
  • Just return the address to the free list.
• Performance
  • Constant time for both allocation and free
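The free-list idea above can be sketched on the host in a few lines of C++. This is only an illustration, not the presenter's implementation: the class name `FixedPool` and its methods are hypothetical, the free list is kept as a stack, and no thread safety is attempted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size pool: the free list is a stack of block addresses.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : storage_(block_size * block_count) {
        assert(block_size > 0);
        free_list_.reserve(block_count);
        // Initially every block is available.
        for (std::size_t i = 0; i < block_count; ++i)
            free_list_.push_back(storage_.data() + i * block_size);
    }

    // Allocate: take one item from the free list (nullptr when exhausted).
    void* allocate() {
        if (free_list_.empty()) return nullptr;
        void* p = free_list_.back();
        free_list_.pop_back();
        return p;
    }

    // Free: return the address to the free list.
    void release(void* p) {
        free_list_.push_back(static_cast<unsigned char*>(p));
    }

    std::size_t available() const { return free_list_.size(); }

private:
    std::vector<unsigned char> storage_;     // backing memory for all blocks
    std::vector<unsigned char*> free_list_;  // addresses available to allocate
};
```

Both operations touch only the top of the stack, which is where the constant-time behavior comes from.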

Multi-lists memory
• To manage memory of non-fixed size, a natural extension of fixed size memory is the multi-lists memory system.
• Free list
  • Multiple free lists of fixed size memory, each with a different size (e.g. each list holds blocks twice the size of the previous one)
• Allocate
  • Find the first free list whose size is at least the request size, using an arithmetic operation, for example ceil(log2(size)).
  • Take one element from the target free list.
• Free
  • Find the correct free list for the block.
  • Return the address to the target free list.
• Performance
  • Constant time for both allocation and free, since the suitable free list can be found with an arithmetic operation instead of a linear search.
  • Drawback: wasted memory
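As a rough illustration of the list lookup, the sketch below computes the index of the first free list that can hold a request. The function name `size_class` and the `min_size` parameter (block size of the smallest list) are assumptions made for this example; the slides only mention ceil(log2(size)).

```cpp
#include <cassert>
#include <cstddef>

// Index of the first free list whose block size (min_size << index) is at
// least `request`. Equivalent to ceil(log2(request / min_size)), clamped at 0.
// Assumes `request` is far below the largest std::size_t value.
inline unsigned size_class(std::size_t request, std::size_t min_size) {
    assert(min_size > 0 && (min_size & (min_size - 1)) == 0);  // power of two
    unsigned index = 0;
    std::size_t block = min_size;
    while (block < request) {  // at most one iteration per free list
        block <<= 1;
        ++index;
    }
    return index;
}
```

The loop runs at most once per free list, so it is bounded by a constant. On a GPU the same index could be derived from a count-leading-zeros intrinsic such as `__clz` instead of a loop.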

Buddy memory
• To avoid the wasted memory of the multi-lists system, it is natural to allocate memory from the layer directly above (twice the size) when a free list is empty, instead of pre-allocating memory in all free lists.
• Free list
  • Multiple free lists of fixed size memory, with sizes growing in powers of 2
• Allocate
  • Find the first free list whose size is at least the request size.
  • Take one element from the target free list.
  • If the free list is empty, create a pair from the upper list.
• Free
  • Find the correct free list for the block (using records).
  • Return the address to the target free list.
  • If the buddy is also in the free list, free the merged block to the upper layer.
• Performance
  • Constant time for both allocation and free

Buddy memory
(Figure labels: buddy, this)
• Good internal defragmentation
• The buddy address can be calculated as address XOR size.
• Constant time operation: O(h), where h = log2(max size / min size) is a constant.
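The XOR rule and the height h can be checked with a small worked example. The helper names below (`buddy_of`, `parent_of`, `height`) are hypothetical, and addresses are treated as offsets from the start of the heap so that every block is aligned to its own size.

```cpp
#include <cassert>
#include <cstdint>

// Offset of the buddy of the block at `offset` (relative to the heap base).
inline std::uint64_t buddy_of(std::uint64_t offset, std::uint64_t size) {
    assert(size != 0 && (size & (size - 1)) == 0);  // size is a power of two
    assert(offset % size == 0);                     // block is size-aligned
    return offset ^ size;
}

// Offset of the merged 2x block that contains both buddies.
inline std::uint64_t parent_of(std::uint64_t offset, std::uint64_t size) {
    return offset & ~size;
}

// h = log2(max_size / min_size): the number of layer transitions in the worst case.
inline unsigned height(std::uint64_t max_size, std::uint64_t min_size) {
    assert(min_size != 0 && max_size >= min_size);
    unsigned h = 0;
    for (std::uint64_t s = min_size; s < max_size; s <<= 1) ++h;
    return h;
}
```

For example, with 64-byte blocks the block at offset 128 has its buddy at offset 192, and the two merge into the 128-byte block at offset 128.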

Memory layers
(Figure labels: lower layer, current layer, upper layer)
• Implement just one class for a single layer; the other layers are instances with different sizes.
• Lower layer
  • The memory layer with 1/2 the size of the current layer
• Current layer
  • The layer that receives the allocation request
• Upper layer
  • The memory layer with 2x the size of the current layer

Pair creation
(Figure labels: memory from upper layer, memory to current layer, memory to current layer)
If the current free list is empty, the layer allocates memory from the upper allocator. Since an upper block is 2x the size, this creates a pair of available blocks in the current free list. If N threads simultaneously allocate memory in the current layer while its free list is empty, only N/2 threads need to allocate memory from the upper layer.
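A single-threaded model makes the N/2 claim easy to see. The sketch below is an assumption-laden illustration rather than the parallel implementation: `Layer`, `kNoBlock`, and `upper_requests` are hypothetical names, blocks are identified by heap offsets, and the calls happen one after another instead of simultaneously.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kNoBlock = ~0ull;  // hypothetical "allocation failed" value

// Hypothetical sequential model of one layer of the buddy system.
struct Layer {
    std::uint64_t size;                    // block size of this layer
    Layer* upper;                          // layer with 2x block size, or nullptr
    std::vector<std::uint64_t> free_list;  // offsets of free blocks
    std::size_t upper_requests;            // how often the upper layer was asked

    Layer(std::uint64_t s, Layer* up) : size(s), upper(up), upper_requests(0) {}

    std::uint64_t allocate() {
        if (free_list.empty()) {
            if (upper == nullptr) return kNoBlock;
            ++upper_requests;
            std::uint64_t big = upper->allocate();
            if (big == kNoBlock) return kNoBlock;
            // Pair creation: one 2x block becomes two buddies in this layer.
            free_list.push_back(big + size);
            free_list.push_back(big);
        }
        std::uint64_t offset = free_list.back();
        free_list.pop_back();
        return offset;
    }
};
```

With an empty 64-byte layer backed by two free 128-byte blocks, four allocations trigger only two requests to the upper layer, because every second request is served by the buddy left behind by the previous pair creation.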

Free Queue
• The free list is implemented as a queue whose head can run past its tail.
  • Head < Tail: memory available (allocate directly from this free list)
  • Head = Tail: free list empty
  • Head > Tail: under-available (requires pair creation from the upper layer)
• These states determine which threads must call pair_creation() on the upper layer.
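The three states map directly onto a comparison of the two counters. In the sketch below, `QueueState`, `classify`, and `pairs_needed` are hypothetical names; head and tail are assumed to be monotonically increasing counters (wrap-around is ignored), and the pair count is derived from the fact that each pair creation adds two entries, which the slides imply but do not state as a formula.

```cpp
#include <cassert>

enum class QueueState { Available, Empty, UnderAvailable };

// Classify the free queue from its head and tail counters.
inline QueueState classify(long head, long tail) {
    if (head < tail) return QueueState::Available;   // allocate directly
    if (head == tail) return QueueState::Empty;      // nothing left
    return QueueState::UnderAvailable;               // pair creation required
}

// Number of pair creations needed to cover the deficit head - tail,
// assuming each pair creation contributes two blocks to this layer.
inline long pairs_needed(long head, long tail) {
    assert(head >= 0 && tail >= 0);
    long deficit = head - tail;
    return deficit > 0 ? (deficit + 1) / 2 : 0;
}
```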

Parallel strategy (Alloc)
(Figure labels: threads with allocation requests to this layer; Head; Tail; New Head; available memory in free queue; need pair creation from upper layer)
• Each allocation requester creates a socket on which it listens for an address.
• The sockets are implemented on the free queue.
• atomicAdd(&head,1) creates a socket.
• The output address can come from the current free list or from pair creation in the upper free list.
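The socket reservation can be modeled on the host with `std::atomic`, where `fetch_add` plays the role of the device-side `atomicAdd`. Everything else here is an assumption for illustration: the `FreeQueue` and `Socket` types are hypothetical, the queue is a fixed-capacity ring, and the ordering needed to publish a slot safely to concurrent readers is deliberately left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical ticket for one allocation request.
struct Socket {
    long index;  // position reserved in the free queue
    bool ready;  // true if the slot already holds an address (index < tail)
};

// Hypothetical free queue; safe publication of slots is omitted in this sketch.
struct FreeQueue {
    std::atomic<long> head{0};
    std::atomic<long> tail{0};
    std::vector<std::uint64_t> slots;

    explicit FreeQueue(std::size_t capacity) : slots(capacity) {}

    // Append a free address at the tail.
    void push(std::uint64_t address) {
        long t = tail.fetch_add(1);
        slots[static_cast<std::size_t>(t) % slots.size()] = address;
    }

    // One atomic add reserves a unique socket; no retry loop is needed.
    Socket open_socket() {
        long ticket = head.fetch_add(1);
        return Socket{ticket, ticket < tail.load()};
    }

    std::uint64_t read(const Socket& s) const {
        return slots[static_cast<std::size_t>(s.index) % slots.size()];
    }
};
```

A socket with `ready == false` corresponds to a thread in the under-available region, which has to wait for (or perform) pair creation before its slot contains an address.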

Odd/Even Pair Creation
(Figure labels: threads with allocation requests to this layer; Head; Tail; New Head; pair creations; New Tail)
To avoid the overhead of simultaneous pair creation, the under-available threads perform pair creations in an odd/even loop until new tail >= new head.

Parallel strategy (Free)
• Store the freed address in the free list.
• Calculate the buddy address.
  • XOR(addr, size)
• Check whether the buddy is already in the free list.
  • Use the hand shake algorithm for fast lookup.
• If yes, mark both elements in the free list as N/A, then free the merged memory block to the upper layer.

Hand shake
(Figure labels: record address of memory; record index in free list; memory block)
• Hand shake
  • The freed memory records its index in the free list.
  • The free list records the freed memory address.
• Fast check whether the buddy memory address is in the free list
  • Calculate the buddy memory address (XOR).
  • Read the index stored at this address.
  • Check whether the free list entry at this index equals the buddy memory address.
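The mutual records can be modeled sequentially as follows. This is a sketch under several assumptions: `HandshakeLayer`, `kInvalid`, and the method names are hypothetical, the heap is an array of 64-bit words addressed by offset, the index is stored in the first word of the freed block, and no atomics are used, so it shows only the lookup logic and not the parallel protocol.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kInvalid = ~0ull;  // hypothetical "N/A" marker

// Hypothetical sequential model of one layer using the hand shake records.
struct HandshakeLayer {
    std::uint64_t size;                    // block size, in words
    std::vector<std::uint64_t> memory;     // heap modeled as an array of words
    std::vector<std::uint64_t> free_list;  // entry = offset of a freed block, or kInvalid

    HandshakeLayer(std::uint64_t block_size, std::uint64_t heap_words)
        : size(block_size), memory(heap_words, 0) {}

    // Free: the block records its free-list index, the list records the block.
    void release(std::uint64_t offset) {
        memory[offset] = free_list.size();
        free_list.push_back(offset);
    }

    // A block is free only if both records agree; stale data in memory fails the check.
    bool is_free(std::uint64_t offset) const {
        std::uint64_t index = memory[offset];
        return index < free_list.size() && free_list[index] == offset;
    }

    // If the block and its buddy are both free, mark both entries N/A.
    // The caller would then free the merged block to the upper layer.
    bool try_merge(std::uint64_t offset) {
        std::uint64_t buddy = offset ^ size;
        if (!is_free(offset) || !is_free(buddy)) return false;
        free_list[memory[offset]] = kInvalid;
        free_list[memory[buddy]] = kInvalid;
        return true;
    }
};
```

The check costs two reads and one comparison, independent of the free list length, which is why no search through the list is needed.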

Performance
(Figure settings: gridDim=512, blockDim=512, K20)

Discussion
• Warp level group allocation
• Dynamically expanding free queue

Backup Slides

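Slow atomicCAS() loop
    // Pop the head of a linked free list. Node is a hypothetical node type
    // with a `next` pointer; head is a Node* shared by all threads.
    Node *now, *ret = head;
    do {
        now = ret;  // the value of head we expect to replace
        ret = (Node *)atomicCAS((unsigned long long *)&head,
                                (unsigned long long)now,
                                (unsigned long long)now->next);
    } while (ret != now);  // another thread changed head first: retry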
