
Hardware Support for Dynamic Memory Management


Presentation Transcript


  1. Hardware Support for Dynamic Memory Management
  J. Morris Chang, Witawas Srisa-an, Chia-Tien Dan Lo (Illinois Institute of Technology)
  Edward F. Gehringer (North Carolina State University)

  2. The Problem
  • Object-oriented applications make frequent requests for dynamic memory.
  • C++ programs make an order of magnitude more requests than C programs.
  • Most objects are abandoned quickly.
  • Consequently, much time is spent on memory management: up to 30% of runtime in C programs.
  • Garbage collection has been optimized, but still takes time.

  3. Hardware-Implemented Allocation
  • Makes use of an allocation vector (A-vector) and a bit-flipper.
  • Example of allocating two blocks (a software sketch of the bit flip follows):

    address:                         0 1 2 3 4 5 6 7
    A-vector before the allocation:  1 0 1 1 0 0 1 1

  (a) Combinational logic (the complete binary tree) determines that there is enough free memory to fill the request for two blocks.
  (b) The address of the free block is 100₂.
  (c) The bits at 100₂ and 101₂ are flipped.

    A-vector after the allocation:   1 0 1 1 1 1 1 1
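The bit-flipping step can be modeled in a few lines of C. This is a minimal sketch reproducing the slide's example: the hardware finds the free run combinationally, whereas this stand-in scans linearly, and all names are illustrative rather than taken from the paper's design.

```c
#include <stdio.h>

#define BLOCKS 8

/* Flip n_blocks A-bits at the first free run; return its address,
 * or -1 if no run of n_blocks free blocks exists. 1 = allocated. */
int avec_alloc(unsigned char a[BLOCKS], int n_blocks) {
    for (int start = 0; start + n_blocks <= BLOCKS; start++) {
        int run_free = 1;
        for (int i = start; i < start + n_blocks; i++)
            if (a[i]) { run_free = 0; break; }
        if (run_free) {
            for (int i = start; i < start + n_blocks; i++)
                a[i] = 1;                      /* the "bit flip": 0 -> 1 */
            return start;
        }
    }
    return -1;
}

int main(void) {
    unsigned char a[BLOCKS] = {1,0,1,1,0,0,1,1};   /* slide's A-vector */
    int addr = avec_alloc(a, 2);
    printf("allocated at address %d\n", addr);     /* prints 4 (100 in binary) */
    for (int i = 0; i < BLOCKS; i++) printf("%d", a[i]);
    printf("\n");                                  /* prints 10111111 */
    return 0;
}
```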

  4. The Complete Binary Tree
  • A binary tree of bits is used to locate the first free region combinationally (a software model follows the figure):

    Level 0 (size 2^4):  1
    Level 1 (size 2^3):  1 1
    Level 2 (size 2^2):  1 0 1 0
    Level 3 (size 2^1):  1 1 0 0 0 1 0 0
    Level 4 (size 2^0):  1 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0   (the A-vector)
    address:             0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
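As the slide's tree values show, each internal node is the OR of its children, so a node equal to 0 means its whole subtree is free; a request for 2^k blocks is served by finding a 0 node at the matching level. A minimal software model (the hardware evaluates all of this combinationally; helper names are mine):

```c
#include <stdio.h>

#define N 16                 /* leaf blocks; must be a power of two */

int tree[2 * N];             /* tree[1] = root, leaves at tree[N..2N-1] */

void cbt_build(const int a[N]) {
    for (int i = 0; i < N; i++) tree[N + i] = a[i];
    for (int i = N - 1; i >= 1; i--)
        tree[i] = tree[2 * i] | tree[2 * i + 1];   /* OR reduction */
}

/* Find the leftmost fully free, naturally aligned chunk of 2^k blocks.
 * Returns the starting block address, or -1 if none exists. */
int cbt_find_free(int k) {
    int level_nodes = N >> k;              /* nodes at the target level */
    for (int j = 0; j < level_nodes; j++)
        if (tree[level_nodes + j] == 0)    /* 0 => whole subtree free  */
            return j << k;                 /* address of first block   */
    return -1;
}

int main(void) {
    int a[N] = {1,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0};  /* slide's A-vector */
    cbt_build(a);
    printf("free 2^1 chunk at %d\n", cbt_find_free(1));  /* 4 */
    printf("free 2^2 chunk at %d\n", cbt_find_free(2));  /* 4 */
    return 0;
}
```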

  5. Keeping Track of Object Size
  • Meanwhile, the size bit-vector (S-vector) records the boundaries between objects (a sketch of one possible encoding follows).
  • [Diagram: the complete binary tree sits over the allocation bit-vector (A-vector); the S-Unit (size encoder) sits over the size bit-vector (S-vector).]
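The slide does not spell out the S-Unit's exact encoding, so this sketch assumes one simple convention: the S-bit is set at an object's last block, letting an object's extent be recovered by scanning to the next set bit. Treat the convention itself as an assumption.

```c
#include <stdio.h>

#define BLOCKS 16

/* Record an object's boundary: mark its last block in the S-vector. */
void svec_record(unsigned char s[BLOCKS], int addr, int size) {
    s[addr + size - 1] = 1;
}

/* Recover an object's size from its starting address by scanning
 * for the next boundary bit. */
int svec_size(const unsigned char s[BLOCKS], int addr) {
    int end = addr;
    while (end < BLOCKS && !s[end]) end++;
    return end - addr + 1;
}

int main(void) {
    unsigned char s[BLOCKS] = {0};
    svec_record(s, 4, 2);              /* 2-block object at address 4 */
    svec_record(s, 6, 3);              /* 3-block object at address 6 */
    printf("size at 4 = %d\n", svec_size(s, 4));   /* 2 */
    printf("size at 6 = %d\n", svec_size(s, 6));   /* 3 */
    return 0;
}
```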

  6. Five Hardware-Implemented Instructions
  • h_malloc
  • h_free
  • h_realloc
  • mark
  • sweep
  • All are implemented in the Dynamic Memory Management Unit (DMMU), which manages the heap (a hypothetical C-level view follows).
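The five instructions could be exposed to software roughly as below. The names come from the slides; the signatures and the comments beside them are assumptions for illustration, not the paper's ISA definition.

```c
#include <stddef.h>

void *h_malloc(size_t size);              /* flip A-bits for a free chunk    */
void  h_free(void *ptr);                  /* extent recovered from S-vector  */
void *h_realloc(void *ptr, size_t size);  /* uses the X-vector's status      */
void  mark(void *live_object_ptr);        /* latch liveness into X-vector    */
void  sweep(void);                        /* rebuild A-vector; raises gc_ack */
```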

  7. The DMMU
  • Each entry contains three bit-vectors: the A-vector, the S-vector, and the X-vector.
  • The X-vector is used for reallocation and garbage collection.
  • [Diagram: the CPU sends h_malloc / h_free / h_realloc and mark / sweep requests, with object_pointer and object_size, to the DMMU; the DMMU answers with gc_ack; the O.S. kernel grows the heap via sbrk/brk.]

  8. The ALB
  • Each entry in the DMMU tracks the allocation status of a region of memory.
  • Compare with a TLB, which tracks the location of a region of virtual memory.
  • So, these entries make up the Allocation Lookaside Buffer (ALB); a sketch of an entry's layout follows.
  • Entries can be saved to, and fetched from, the in-memory A-, S-, and X-bitmaps.
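One way to picture an ALB entry is the struct below. The field names and the 512-bit vector width are assumptions (the width anticipates slide 15's roughly 64-byte vectors); the slides only establish that an entry holds the three vectors for one region.

```c
#include <stdint.h>

#define VEC_WORDS 8                 /* 8 x 64 bits = 512-bit vectors, assumed */

struct alb_entry {
    uintptr_t page_base;            /* heap region this entry covers        */
    uint64_t  a_vec[VEC_WORDS];     /* allocation status, 1 bit per block   */
    uint64_t  s_vec[VEC_WORDS];     /* object-boundary (size) bits          */
    uint64_t  x_vec[VEC_WORDS];     /* mark / reallocation scratch bits     */
    unsigned  largest_free;         /* largest available chunk, in blocks   */
    unsigned  valid;                /* entry currently holds a live region  */
};
/* On an ALB miss, the three vectors are written back to the in-memory
 * A-, S-, and X-bitmaps and the entry is refilled, as with a TLB. */
```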

  9. Steps in Allocation
  • Compare the requested size with largest_available_size in each ALB entry.
  • Select an entry and pass the requested size to the CBT (A1).
  • The CBT locates the first available chunk.
  • The chunk is allocated using the buddy system (a sketch of the rounding follows the list); unused words at the end are returned to free memory.
  • The block's address is returned, and its status is changed to allocated (A2).
  • The S-vector is updated accordingly (A3).
  • [Diagram: h_malloc sends the size and address pointer to the Complete Binary Tree (A1); the CBT updates the A-vector (A2); the S-Unit (size encoder) updates the S-vector (A3).]
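A quick model of the buddy-system sizing in this step: the request is rounded up to the next power of two for the CBT search, and the unused tail blocks are handed back by flipping their A-bits to free. Helper names are illustrative.

```c
#include <stdio.h>

/* Smallest power of two >= n. */
unsigned next_pow2(unsigned n) {
    unsigned p = 1;
    while (p < n) p <<= 1;
    return p;
}

int main(void) {
    unsigned request = 5;                      /* blocks wanted            */
    unsigned chunk   = next_pow2(request);     /* 8-block chunk searched   */
    unsigned unused  = chunk - request;        /* 3 tail blocks freed back */
    printf("request %u -> buddy chunk %u, %u blocks returned\n",
           request, chunk, unused);
    return 0;
}
```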

  10. Steps in Deallocation
  • Deallocation is very similar to allocation (a software model follows).
  • [Diagram: h_free sends the address pointer to the Complete Binary Tree (D1); the CBT updates the A-vector (D2); the S-Unit (size encoder) reads the size boundaries from the S-vector (D3).]
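A sketch of h_free under the same assumed boundary convention as the earlier S-vector sketch: given only the address, the S-vector supplies the extent, and the A-bits are flipped back to free.

```c
#include <stdio.h>

#define BLOCKS 8

/* Free the object starting at addr: scan the S-vector for its boundary,
 * then clear its A-bits and boundary bit. */
void h_free_model(unsigned char a[BLOCKS], unsigned char s[BLOCKS], int addr) {
    int end = addr;
    while (end < BLOCKS && !s[end]) end++;   /* find the boundary bit */
    for (int i = addr; i <= end && i < BLOCKS; i++) {
        a[i] = 0;                            /* flip allocated -> free */
        s[i] = 0;                            /* clear the boundary too */
    }
}

int main(void) {
    unsigned char a[BLOCKS] = {1,0,1,1,1,1,1,1};  /* slide 3's "after" state */
    unsigned char s[BLOCKS] = {1,0,0,1,0,1,1,1};  /* 2-block object at 4     */
    h_free_model(a, s, 4);
    for (int i = 0; i < BLOCKS; i++) printf("%d", a[i]);  /* 10110011 */
    printf("\n");
    return 0;
}
```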

  11. Steps in Marking
  • Each live-object pointer is sent to the CBT, one after another.
  • The page number of the object pointer selects a bit-vector.
  • The signal generated by the CBT is latched into the auxiliary bit-vector (X-vector); a single-page software model follows.
  • [Diagram: live-object pointers feed the Complete Binary Tree via mark; the CBT's output is latched in the X-vector.]
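A single-page model of marking. For simplicity it sets the object's X-bits directly from the S-vector boundaries instead of routing the signal through a CBT, and it keeps the boundary convention assumed earlier.

```c
#include <stdio.h>

#define BLOCKS 8

/* Latch liveness: set X-bits from the object's start to its S-boundary. */
void mark_model(const unsigned char s[BLOCKS], unsigned char x[BLOCKS],
                int addr) {
    int i = addr;
    do { x[i] = 1; } while (!s[i++] && i < BLOCKS);
}

int main(void) {
    unsigned char s[BLOCKS] = {1,0,0,1,0,1,1,1};   /* boundary bits */
    unsigned char x[BLOCKS] = {0};
    mark_model(s, x, 2);   /* object at blocks 2-3 is live */
    mark_model(s, x, 6);   /* object at block 6 is live    */
    for (int i = 0; i < BLOCKS; i++) printf("%d", x[i]);   /* 00110010 */
    printf("\n");
    return 0;
}
```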

  12. Steps in Sweeping
  • The bit-sweeper receives the sweep signal (E1).
  • Size information from the S-vector and liveness status from the X-vector generate the new allocation status (E2) and largest_available_size; a software model follows.
  • [Diagram: sweep triggers the Bit-Sweeper/X-Unit, which reads the S- and X-vectors (E1), produces the new A-vector via the S-Unit (E2), and raises GC_ack (E3).]
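A simplified sweep model under the earlier sketches' conventions: it walks the S-vector segments, keeps only segments with a mark bit set, reclaims the rest, and clears the X-vector for the next collection. The real unit also recomputes largest_available_size, which is omitted here.

```c
#include <stdio.h>

#define BLOCKS 8

void sweep_model(unsigned char a[BLOCKS], unsigned char s[BLOCKS],
                 unsigned char x[BLOCKS]) {
    int start = 0;
    for (int i = 0; i < BLOCKS; i++) {
        if (s[i] || i == BLOCKS - 1) {        /* boundary: close a segment */
            int live = 0;
            for (int j = start; j <= i; j++) live |= x[j];
            for (int j = start; j <= i; j++) {
                if (!live) { a[j] = 0; s[j] = 0; }   /* reclaim dead object */
                x[j] = 0;                     /* clear marks for next cycle */
            }
            start = i + 1;
        }
    }
}

int main(void) {
    unsigned char a[BLOCKS] = {1,0,1,1,1,1,1,1};
    unsigned char s[BLOCKS] = {1,0,0,1,0,1,1,1};
    unsigned char x[BLOCKS] = {0,0,1,1,0,0,1,0};   /* live: 2-3 and 6 */
    sweep_model(a, s, x);
    for (int i = 0; i < BLOCKS; i++) printf("%d", a[i]);   /* 00110010 */
    printf("\n");
    return 0;
}
```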

  13. Putting it All Together
  • One datapath serves all five instructions; the labels in the diagram trace each operation through it:
    A. Steps required for allocation
    B. Steps required for deallocation
    C. Steps required for reallocation
    D. Steps required for marking
    E. Steps required for sweeping
  • [Diagram: h_malloc, h_free, and mark drive the Complete Binary Tree (A1, B1, D1), which produces the allocation/deallocation output and updates the A-vector, with the S-Unit (size encoder) tracking size boundaries in the S-vector (A2, B2, E2); h_realloc drives the Reallocation Status (RS-Unit) through a starting_address/ending_address enable-signal generator over the X-vector (C1, C2); sweep drives the Bit-Sweeper/X-Unit, which combines the S- and X-vectors (E1) and raises GC_ack (E3).]

  14. Memory Usage
  • Most schemes encode size information in the objects themselves.
  • That is more efficient for large objects; a bit-vector is more efficient for small objects.
  • If each object carries 8 bytes for size and 1 byte for marking, the bitmap scheme is more efficient when the average object size is below 384 bytes (a worked check follows).
  • The average object size for C++ and Java programs is about 101 bytes.
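The 384-byte break-even can be reproduced under one concrete assumption: the three bit-vectors (A, S, X) cost one bit each per 16-byte allocation block. The block size is an assumption here, not stated on the slide. For an object of average size S bytes:

```latex
\underbrace{8 + 1}_{\text{per-object header bytes}}
  \;=\; \underbrace{\frac{3S}{16 \times 8}}_{\text{bitmap bytes per object}}
\quad\Longrightarrow\quad
S \;=\; \frac{9 \times 128}{3} \;=\; 384 \text{ bytes}.
```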

  15. Performance Gain
  • ALB miss penalty: a bit-vector length of 500 bits (about 64 bytes) gives a 97% hit ratio, so an ALB entry (three bit-vectors) is 192 bytes.
  • A 64-bit, 100 MHz bus gives an 800 MB/s transfer rate, so the miss penalty is 96 cycles (192 × 400 / 800).
  • With an ALB hit, it takes 2 cycles to allocate memory, so the average hardware malloc time is 4.82 cycles (worked out below).
  • Software malloc varies from 51 to 900 cycles, with an average of 192.
  • In an application that spends 30% of its time allocating, the speedup would be 41%.
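The slide's numbers are internally consistent. The expression 192 × 400 / 800 implies a 400 MHz core clock (an inference, not stated on the slide): 192 bytes at 800 MB/s takes 240 ns, which is 96 cycles at 2.5 ns per cycle. The averages then follow directly:

```latex
t_{\text{h\_malloc}} \;=\; 0.97 \times 2 + 0.03 \times 96 \;=\; 4.82 \text{ cycles},
\qquad
\text{speedup} \;=\; \frac{1}{0.70 + 0.30 \times \frac{4.82}{192}}
\;\approx\; \frac{1}{0.7075} \;\approx\; 1.41.
```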

  16. Summary
  • Object-oriented applications spend a lot of their time allocating memory.
  • To allocate in hardware, we use a bit-vector-based approach.
  • Allocation and deallocation are done combinationally, using a complete binary tree on top of the bit-vector.
  • This yields a speedup of more than 40% on memory-intensive programs.
