Smart Memory for Smart Phones Chris Clack University College London clack@cs.ucl.ac.uk
Outline • Target Architecture • Problems • Focus on Fragmentation • Results from UT • A fast allocator (not embedded) • Doug Lea’s Allocator • Can We Do Better? • Overheads • Results
Target Architecture • Small hand-held integrated phone/PDA devices • Soft real-time, “open box”, constrained application heap • Competitive pressure for more, more flexible, and better (larger) applications
Problems (1) • [Figure: heap of live and free blocks below TOP, with a free fragment marked] • To compact: copy when nearly full • Memory overhead • Compaction delay
Problems (2) • [Figure: heap of live and free blocks below TOP] • To compact: do sliding compaction when nearly full • Compaction delay
Problems (3) • [Figure: heap of live blocks, with free blocks on a FREE LIST] • To compact: do sliding compaction when allocation fails • Compaction delay
Focus on Fragmentation • What happens in real programs? • Great paper by Mark Johnstone and Paul Wilson (UT): • “The Memory Fragmentation Problem: Solved?”, M.Johnstone & P.Wilson, 1997 • Fragmentation experiments using real programs running on real data
[Table of results from UT; annotated columns: max live at any time, max Kb at any time, average lifetime of an allocated byte]
Measure of Fragmentation • [Figure: memory-usage plot annotated with measures #3 and #4] • e.g. %frag #4 = (value_at_3 – value_at_2) * 100 / value_at_2
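The formula on this slide can be written out as a small function. This is only a sketch: `value_at_2` and `value_at_3` stand for the values read at points 2 and 3 of the figure, and the numbers in the usage line are made up for illustration, not measurements.

```python
def percent_fragmentation(value_at_2, value_at_3):
    """%frag #4 as given on the slide: excess of point 3 over point 2, in percent."""
    return (value_at_3 - value_at_2) * 100 / value_at_2

# Illustrative (invented) values in Kb: point 2 = 1000, point 3 = 1050.
print(percent_fragmentation(1000, 1050))
```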
Johnstone & Wilson’s conclusion • The best free-list management policy • in terms of fragmentation behaviour • on real programs is BEST-FIT • (Knuth notwithstanding)
A Fast Best-Fit Allocator • IMPLICATION: use Best-fit allocation and we (maybe?) won’t ever need to compact • At least, compaction delays will be minimized • BUT: best-fit allocation is S-L-O-W • Worst-case: have to scan the entire free list • Let’s look at a widely-used best-fit allocator: Doug Lea’s malloc • (arguably) the fastest best-fit allocator
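To see why naive best fit is slow, here is a minimal sketch of a best-fit search over an unsorted free list. It assumes the free list is just a Python list of block sizes (a hypothetical stand-in for a linked list of free blocks); it is not Doug Lea's implementation.

```python
def best_fit(free_sizes, request):
    """Return the index of the smallest free block that fits, or None."""
    best = None
    for i, size in enumerate(free_sizes):   # worst case: every free block is visited
        if size >= request and (best is None or size < free_sizes[best]):
            best = i
            if size == request:
                break                        # an exact fit cannot be beaten
    return best

print(best_fit([64, 16, 32, 24], 20))
```

Unless an exact fit turns up early, the loop has to inspect the whole list before it can be sure it has the tightest fit, which is the O(n) worst case the slide refers to.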
[Figure: block layout in Doug Lea's allocator, with boundary tags used for coalescing]
[Figure: Doug Lea's bins: exact-fit bins, then fixed-width bins with chains sorted by size, then W] • Sorting costs time • Worst case: all free blocks in one bin, which reduces to an O(n) search
Can we do better? • Support boundary tags and coalescing • Simple Idea (1) (of 4): • Probability of fragmentation triggering compaction depends on RANGE of allocatable block sizes • Allocation of a very large block is more likely to fail because of fragmentation • Very small free blocks create fragmentation • (NB if all blocks are the same size, fragmentation is zero!)
Restrict range of allocatable sizes and create an exact-fit table • [Figure: exact-fit table with one bin per size: lb, lb+1, lb+2, lb+3, …, ub-2, ub-1, ub] • No need to sort • Worst case: O(n) search for next highest occupied bin
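A minimal sketch of such an exact-fit table follows. The class name, the use of Python lists as chains, and the integer "addresses" are all assumptions for illustration; block splitting and coalescing are left out, and requests are assumed to lie within lb..ub.

```python
class ExactFitTable:
    def __init__(self, lb, ub):
        self.lb, self.ub = lb, ub
        self.bins = [[] for _ in range(ub - lb + 1)]   # one chain per size lb..ub

    def free(self, addr, size):
        self.bins[size - self.lb].append(addr)          # every block in a bin has the same size: no sorting

    def malloc(self, size):
        # Linear scan upwards for the next occupied bin: O(n) in the worst case.
        for s in range(size, self.ub + 1):
            chain = self.bins[s - self.lb]
            if chain:
                return chain.pop(), s                   # LIFO: most recently freed block first
        return None

table = ExactFitTable(2, 33)
table.free(100, 8)
table.free(300, 5)
print(table.malloc(6))   # no bin of size 6 or 7 is occupied, so the size-8 block is used
```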
[Figure: exact-fit table lb, lb+1, lb+2, lb+3, …, ub-2, ub-1, ub, with occupancy bitmap 00110000000000000000000000000101] • Old idea • Use an occupancy bitmap • If (ub-lb) = 31, bitmap is just one word • To search/allocate: read bitmap; AND with mask; find highest set bit; maybe modify bit and write
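The read / mask / find-highest-set-bit sequence can be sketched as below. The bit layout is an assumption: bin 0 (size lb) is taken to be the most significant bit, so that larger sizes lie to the right and the highest set bit after masking is the tightest fit. The function names are hypothetical.

```python
WORD_BITS = 32

def best_fit_bin(bitmap, request):
    """Smallest occupied bin index >= request (0 = size lb), or None."""
    mask = (1 << (WORD_BITS - request)) - 1       # keep only bins request..31
    candidates = bitmap & mask
    if candidates == 0:
        return None                                # nothing big enough is free
    highest = candidates.bit_length() - 1          # position of the highest set bit
    return (WORD_BITS - 1) - highest

def clear_bin(bitmap, index):
    """Mark a bin as empty (done only when its chain has just become empty)."""
    return bitmap & ~(1 << (WORD_BITS - 1 - index))

bitmap = 0b00110000000000000000000000000101       # the bitmap shown on the slide
print(best_fit_bin(bitmap, 4))
```

With the slide's bitmap, bins 2, 3, 29 and 31 are occupied, so a request for bin 4 is served from bin 29 after a single word read.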
Problem • What if range is very large? • E.g. Nikhil wants to allocate blocks that vary from 2 words to 2^12 words • 2^12 different block sizes • Worst case = linear search of 128 bitmap words (128 reads + …) • Two solutions: • Use more efficient bitmapping • Use unconstrained hybrid scheme (see later)
More efficient bitmapping • Simple Idea (2) • Use a bitmap tree: • Requires 128 + 4 + 1 words • Requires worst case 5 reads, 3 tests for zero, 3 masks, 3 finds of greatest set bit, 3 modify&writes • Generally: O(log_32((ub-lb)/32)) • (Depends what you are counting … but it is fast!) • Ten times faster than any other scheme we know
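One way to realise a bitmap tree is sketched below, under the assumptions that words are 32 bits, that bit 0 of each word is its most significant bit, and that each upper-level bit records whether the word beneath it is non-zero. The class and method names are hypothetical; this is an illustration of the idea, not the authors' code.

```python
WORD_BITS = 32

def first_at_or_after(word, start):
    """Index (0 = most significant bit) of the first set bit at or after start, else None."""
    if start >= WORD_BITS:
        return None
    hits = word & ((1 << (WORD_BITS - start)) - 1)
    return None if hits == 0 else WORD_BITS - hits.bit_length()

class BitmapTree:
    def __init__(self, leaf_words=128):
        self.levels = []                      # levels[0] = leaves; one parent bit per child word
        n = leaf_words
        while True:
            self.levels.append([0] * n)
            if n == 1:
                break
            n = (n + WORD_BITS - 1) // WORD_BITS

    def words(self):
        return sum(len(level) for level in self.levels)

    def set(self, index):
        for level in self.levels:
            word, bit = divmod(index, WORD_BITS)
            level[word] |= 1 << (WORD_BITS - 1 - bit)
            index = word

    def clear(self, index):
        for level in self.levels:
            word, bit = divmod(index, WORD_BITS)
            level[word] &= ~(1 << (WORD_BITS - 1 - bit))
            if level[word] != 0:
                break                         # word still occupied: ancestors stay set
            index = word

    def find(self, request):
        """Smallest occupied bin index >= request, or None."""
        depth, start = 0, request
        while True:                           # climb until a word has a set bit at or after start
            word, bit = divmod(start, WORD_BITS)
            level = self.levels[depth]
            hit = first_at_or_after(level[word], bit) if word < len(level) else None
            if hit is not None:
                break
            if depth == len(self.levels) - 1:
                return None
            depth, start = depth + 1, word + 1
        index = word * WORD_BITS + hit
        while depth > 0:                      # descend, taking the first set bit of each child word
            depth -= 1
            index = index * WORD_BITS + first_at_or_after(self.levels[depth][index], 0)
        return index

tree = BitmapTree()
tree.set(5)
tree.set(4000)
print(tree.words(), tree.find(6))
```

With 128 leaf words the tree has 128 + 4 + 1 = 133 words, and the longest search path reads a leaf, a middle word, the root, then a middle word and a leaf on the way back down, which matches the slide's count of 5 reads.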
LIFO/FIFO? • Simple Idea (3) • Although J&W found no difference between LIFO, FIFO and address-ordered (AO) best fit, this might be different for embedded apps • So far, we can only do LIFO • We can achieve FIFO if we double-link ALL free blocks into one big chain • Drawback: free now takes as long as malloc (but still O(log_32((ub-lb)/32)))
[Figure: bitmap tree over the exact-fit table lb, lb+1, lb+2, lb+3, …, ub-2, ub-1, ub] • Freed blocks placed at heads of chains • If requested size not available, for LIFO: search bitmap tree to the right • Or for FIFO: search bitmap tree to the left, then follow link to next highest free block
Simple Idea (4) • We can trivially also support Worst-fit by adding a pointer that always refers to the biggest block • And this is where we put our wilderness block! • We have no data on fragmentation behaviour of worst-fit • If it turns out to be similar to best fit, it would be preferable because we would have O(1) alloc and O(log_32((ub-lb)/32)) free.
[Figure: bitmap tree over the exact-fit table lb, lb+1, lb+2, lb+3, …, ub-2, ub-1, ub, with the max pointer and the wilderness block W]
Overheads • Dynamic per-block overhead • Depends on (ub-lb), so it can be very small • Example (total 32 bits per live block): • 16-bit signed int for size and availability of current block • 16-bit signed int for size and availability of previous block • Could optimize for live block overhead: 1 bit in header + free blocks also hold size at end of block • But, if 4-byte aligned and ANY overhead per block, can’t do better than this! • Free blocks additionally need to hold two pointers • minimum block size = header + 2 pointers
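The 32-bit header in the example can be sketched with two packed 16-bit signed integers. The sign convention is an assumption (the slide does not say how availability is encoded): here a negative value marks a free block and a positive value a live one. Function names are hypothetical.

```python
import struct

HEADER = struct.Struct("<hh")    # two 16-bit signed ints = 32 bits in total

def pack_header(size, free, prev_size, prev_free):
    # Assumed encoding: negative size = free block, positive size = live block.
    return HEADER.pack(-size if free else size,
                       -prev_size if prev_free else prev_size)

def unpack_header(raw):
    cur, prev = HEADER.unpack(raw)
    return abs(cur), cur < 0, abs(prev), prev < 0

raw = pack_header(12, False, 8, True)
print(len(raw), unpack_header(raw))
```

Holding the previous block's size and availability in the header is what lets a freed block find and merge with a free left-hand neighbour without searching.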
Static overheads • Code • A few registers (e.g. max) • Data structures: • Bitmap tree: 133 words • Table: (ub-lb) words • NOTE • if (ub-lb) equals the heap size then the table is as large as the heap! (same overhead as semi-space) • So we don’t want to use this scheme for large size ranges!!! Instead use a hybrid
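The static data-structure cost can be worked through for the earlier 2-word to 2^12-word example. This is a rough sketch assuming 32-bit bitmap words and taking the slide's table size of (ub-lb) words at face value; the function names are hypothetical.

```python
WORD_BITS = 32

def bitmap_tree_words(bins):
    """Words in a bitmap tree with one leaf bit per bin."""
    total, n = 0, max(bins, 1)
    while True:
        n = (n + WORD_BITS - 1) // WORD_BITS    # words needed at this level
        total += n
        if n == 1:
            return total

def static_words(lb, ub):
    # Table size follows the slide: (ub - lb) words.
    return bitmap_tree_words(ub - lb) + (ub - lb)

print(bitmap_tree_words(2**12), static_words(2, 2**12))
```

The table term grows linearly with the size range while the tree term barely grows at all, which is why the table dominates for large ranges.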
Hybrid scheme • Most used range of block sizes: • Use the bitmap tree and exact-fit bins as described • Bigger block sizes: • These are all kept on the double-linked chain above the biggest exact-fit block • Can use fixed-width bins like Lea, together with a separate bitmap tree • We lose the worst-case property of the primary scheme
RESULTS • Re-run Johnstone and Wilson’s tests, using our allocator on their trace files
Test 1 • [Figure: memory required by gmalloc, memory required by new allocator, and memory requested by the program] • Memory requirement halved! • Roughly 5% fragmentation?
Test 2 • [Figure: memory required by gmalloc, memory required by new allocator, and memory requested by the program]
Test 3 • [Figure: memory required by gmalloc, memory required by new allocator, and memory requested by the program]
Test 4 • [Figure: memory required by gmalloc, memory required by new allocator, and memory requested by the program] • Memory requirements consistently halved! • Fragmentation consistently ~ 5% (?)
Status • Currently working with Symbian to conduct malloc-replacement trials using real smartphone applications