Avoiding Initialization Misses to the Heap
Jarrod Lewis, Bryan Black, and Mikko H. Lipasti
Department of Electrical and Computer Engineering, University of Wisconsin-Madison
Intel Labs
http://www.ece.wisc.edu/~pharm
Motivation
• Memory bandwidth is expensive
  • Shouldn’t be wasted on useless traffic
  • Can be put to better use
    • Multithreading, prefetching, MLP, etc.
• Search and destroy useless traffic
  • Focus of this talk: heap initialization
  • Detect and optimize initialization of newly allocated memory
• 23% of misses in a 2MB cache transfer invalid data
Avoiding Initialization Misses to the Heap – Mikko Lipasti
Dynamically Allocated Memory
• Invalid memory need not be transferred
• Provide interface that expresses this directly?
[Figure: heap space state diagram. malloc(): Unallocated Invalid -> Allocated Invalid; initializing store: Allocated Invalid -> Allocated Valid; load or store: stays Allocated Valid; free(): Allocated Invalid or Allocated Valid -> Unallocated Invalid]
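The state diagram on this slide can be written down as a small transition table. The sketch below is only an illustration of the three states and the events named on the slide, as read from the flattened figure; the names `State`, `step`, and `needs_fetch_on_miss` are hypothetical and are not part of any interface from the talk.

```python
from enum import Enum


class State(Enum):
    UNALLOCATED_INVALID = "unallocated-invalid"
    ALLOCATED_INVALID = "allocated-invalid"
    ALLOCATED_VALID = "allocated-valid"


# (state, event) -> next state, following the arrows on the slide.
TRANSITIONS = {
    (State.UNALLOCATED_INVALID, "malloc"): State.ALLOCATED_INVALID,
    (State.ALLOCATED_INVALID, "store"): State.ALLOCATED_VALID,  # initializing store
    (State.ALLOCATED_INVALID, "free"): State.UNALLOCATED_INVALID,
    (State.ALLOCATED_VALID, "load"): State.ALLOCATED_VALID,
    (State.ALLOCATED_VALID, "store"): State.ALLOCATED_VALID,
    (State.ALLOCATED_VALID, "free"): State.UNALLOCATED_INVALID,
}


def step(state, event):
    """Return the next state, or raise if the slide does not model the event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not modeled from state {state.value}")


def needs_fetch_on_miss(state, event):
    """A store to allocated-invalid memory overwrites data nobody may read,
    so the old contents need not be transferred from memory."""
    return not (state is State.ALLOCATED_INVALID and event == "store")
```

A load from allocated-invalid memory is deliberately left out of the table, since the slide does not show that arc.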
Talk Outline
• Motivation
• Analysis of Heap Behavior
• Detecting Initializing Writes
• Performance Analysis
• Conclusions
Allocation Analysis
• Two main modes
  • Single dominant allocation (up to 100MB), or
  • Numerous moderate allocations
• Initialization of allocations
  • 88% initialized with store miss
  • Little temporal reuse of free’d memory
• Phase behavior
  • Start of program often dominates
  • Even SPEC has counterexamples (gcc, vortex)
Cache Miss Behavior
• Init stores cause up to 60% of misses (avg 23%)
• These are 35% of all compulsory misses
Talk Outline
• Motivation
• Analysis of Heap Behavior
• Detecting Initializing Writes
• Performance Analysis
• Conclusions
Detecting Initializing Writes
• Annotate malloc()
  • Record base, size in allocation range cache
• Key questions
  • What is working set?
  • How are ranges represented?
    • Valid bits? Not scalable for 100M allocation
    • Base + bound
  • How are ranges updated on writes?
    • Split vs. truncate
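To make the representation question concrete, here is a minimal sketch of an allocation range cache that stores one base + bound pair per allocation, next to the storage a per-block valid-bit scheme would need. The class name, the LRU replacement policy, the 8-entry default, and the helper `valid_bit_bytes` are assumptions for illustration; the slides give only the base + bound idea and a small working set. The 64-byte block size is taken from the machine model backup slide.

```python
class AllocationRangeCache:
    """Small table of [base, bound) ranges that are allocated but not yet written."""

    def __init__(self, entries=8):
        self.entries = entries
        self.ranges = []  # [base, bound] pairs, least recently used first

    def on_malloc(self, base, size):
        """Record a new allocation, evicting the LRU range if the table is full.
        Eviction is always safe: it only forfeits the optimization."""
        if len(self.ranges) == self.entries:
            self.ranges.pop(0)
        self.ranges.append([base, base + size])

    def lookup(self, addr):
        """Return the tracked range containing addr, or None."""
        for i, rng in enumerate(self.ranges):
            if rng[0] <= addr < rng[1]:
                self.ranges.append(self.ranges.pop(i))  # mark most recently used
                return rng
        return None


def valid_bit_bytes(alloc_bytes, block_bytes=64):
    """Bytes of state needed to keep one valid bit per cache block."""
    bits = -(-alloc_bytes // block_bytes)  # ceiling division
    return (bits + 7) // 8
```

Under these assumptions a 100 MiB allocation needs `valid_bit_bytes(100 * 2**20)` = 204800 bytes (200 KiB) of valid bits, whereas base + bound needs two addresses regardless of allocation size.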
Allocation Working Set
• 4-8 entries sufficient, except parser needs 64
Sequential Initialization
• Forward sweep captures 90%+ except
  • Bzip, gzip, perl
[Figure: initialization pattern 1 (sequential) vs. tracking scheme 1 (forward sweep) over blocks A-F, written in the order A, B, C, D; legend: allocated-invalid, initialized, unknown]
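A forward sweep can be sketched as a single base pointer that advances whenever a store hits the first untracked block of the range. The handling of out-of-order stores below (truncate the range so that everything from the written block upward is no longer tracked) is an assumption chosen to keep the sketch conservative; the slides only name "truncate" without giving details. `BLOCK` and `forward_sweep` are hypothetical names, and `base` and `bound` are assumed to be block-aligned.

```python
BLOCK = 64  # bytes per cache block, as in the machine model backup slide


def forward_sweep(base, bound, store_addrs):
    """Count stores recognized as initializing writes for one [base, bound) range."""
    captured = 0
    for addr in store_addrs:
        block = addr - addr % BLOCK
        if not (base <= block < bound):
            continue  # outside the tracked range: an ordinary store
        if block == base:
            captured += 1  # initializing write at the sweep pointer
            base += BLOCK
        else:
            bound = block  # out-of-order write: stop tracking from here upward
    return captured
```

With six blocks A-F at addresses 0 to 320, the sequential order A, B, C, D, E, F is fully captured, while the alternating order A, F, B, E, C, D loses the blocks written from the top end.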
Alternating Initialization
• Bidirectional captures 90%+ of perl
• Doesn’t help bzip or gzip
[Figure: initialization pattern 2 (alternating) vs. tracking scheme 2 (bidirectional sweep) over blocks A-F, written in the order A, F, B, E; legend: allocated-invalid, initialized, unknown]
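A bidirectional sweep extends the same idea by letting the range shrink from either end. As before, this is a sketch under assumptions: the treatment of a store to the middle of the range (drop everything from that block upward) is not taken from the slides, `bidirectional_sweep` is a hypothetical name, and `base` and `bound` are assumed to be block-aligned.

```python
BLOCK = 64  # bytes per cache block, as in the machine model backup slide


def bidirectional_sweep(base, bound, store_addrs):
    """Count stores recognized as initializing when either end of the range may move."""
    captured = 0
    for addr in store_addrs:
        block = addr - addr % BLOCK
        if not (base <= block < bound):
            continue  # outside the tracked range: an ordinary store
        if block == base:
            captured += 1  # write at the low end: advance base
            base += BLOCK
        elif block == bound - BLOCK:
            captured += 1  # write at the high end: retreat bound
            bound -= BLOCK
        else:
            bound = block  # middle write: stop tracking from here upward
    return captured
```

The alternating order A, F, B, E, C, D over six blocks is now fully captured, but a stride-2 order such as A, C, E, B, D, F still defeats the scheme after the first two edge writes.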
Striding Initialization
• Interleaving captures 90%+ of gzip
• Still only 60% of bzip
  • Bzip has a large allocation with random initialization
[Figure: initialization pattern 3 (striding) vs. tracking scheme 3 (interleaving) over blocks A-F, written in the order A, C, E, B; legend: allocated-invalid, initialized, unknown]
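The slides do not spell out how interleaving is implemented. One plausible reading, sketched here purely as an assumption, is to treat the range as several interleaved streams of blocks (even and odd blocks for `ways=2`) and run an independent forward sweep on each. The function name, the `ways` parameter, and the per-stream truncation rule are all hypothetical.

```python
BLOCK = 64  # bytes per cache block, as in the machine model backup slide


def interleaved_sweep(base, bound, store_addrs, ways=2):
    """Count initializing writes using one forward sweep per interleaved stream.

    Stream k holds the blocks whose index within the range is k modulo `ways`.
    """
    next_block = [base + k * BLOCK for k in range(ways)]  # sweep pointer per stream
    limit = [bound] * ways                                # tracked limit per stream
    captured = 0
    for addr in store_addrs:
        block = addr - addr % BLOCK
        if not (base <= block < bound):
            continue  # outside the allocation: an ordinary store
        k = ((block - base) // BLOCK) % ways
        if block == next_block[k] and block < limit[k]:
            captured += 1  # in-order write within this stream
            next_block[k] += ways * BLOCK
        elif next_block[k] < block < limit[k]:
            limit[k] = block  # out-of-order write: truncate this stream only
    return captured
```

Under this reading both the stride-2 order A, C, E, B, D, F and the plain sequential order are fully captured, while a random order, as described for bzip, quickly truncates the streams.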
Talk Outline
• Motivation
• Analysis of Heap Behavior
• Detecting Initializing Writes
• Performance Analysis
• Conclusions
PharmSim Overview
• Device simulation, etc. from SimOS-PPC [IBM ARL]
• PharmSim replaces functional simulators
  • Full OOO core model, values in rename registers
  • Supports priv. mode, MMU, TLB, exceptions, interrupts, barriers, flushes, etc.
• Lead developer: Trey Cain (thanks Trey!)
[Figure: block diagram. SimOS-PPC: AIX 4.3.1, disk driver, Ethernet driver. PharmSim: OOO core, Gigaplane. Other labels: Block, Simple, Ethernet]
Operating System Effects
• Widely accepted for SPECINT:
  • Safe to ignore O/S paths
• Most popular tool (SimpleScalar)
  • Intercepts system calls
  • Emulates on host, updates “flat” memory
  • Returns “magically” with cache contents intact
• We have found that [CAECW2002]:
  • Omitting system references leads to dramatic error (5.8x L2 miss rate, 100% IPC in worst case)
  • Specifically, AIX page fault handler eliminates many initializing write misses
• Had we not used PHARMsim?
  • We would have dramatically overstated the performance benefit
AIX Page Installation
• Heap manager calls sbrk
• Malloc returns block < 4KB
• Program writes to block
• First reference causes page fault
• AIX installs entire page using dcbz
[Figure: data segment with unallocated and allocated-valid regions]
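The difference between installing a whole page and installing only the written blocks can be put in numbers with a short sketch. On PowerPC, dcbz zeroes one cache block without fetching it from memory, so a page-mode install touches every block of the page. The function below is an illustration under stated assumptions: 4KB pages as on the slide, 64-byte blocks from the machine model backup slide, a page-aligned allocation, and a program that writes every byte it allocated. The name `blocks_installed` is hypothetical.

```python
PAGE = 4096  # bytes per page, per the "block < 4KB" bullet
BLOCK = 64   # bytes per cache block, as in the machine model backup slide


def blocks_installed(alloc_bytes, mode):
    """Cache blocks established without a memory fetch for a fresh allocation."""
    if mode == "block":
        # Only the blocks the program actually initializes.
        return -(-alloc_bytes // BLOCK)
    if mode == "page":
        # Every block of every page the allocation touches.
        pages = -(-alloc_bytes // PAGE)
        return pages * (PAGE // BLOCK)
    raise ValueError(f"unknown mode: {mode!r}")
```

For a 256-byte allocation, block mode installs 4 blocks while page mode installs 64; the 60 extra blocks are the cache pollution that page installation risks.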
Block vs. Page Installation
• Page installation
  • Practically free as part of page fault
• Shortcomings of page installation
  • Pollutes cache
  • Not scalable to superpages (AIX v5.1)
  • Does not work for heap reuse
    • Our short simulations don’t show this benefit
    • I.e. high overlap between initializing writes and first reference to extended data segment
Integrating the Allocation Range Cache (ARC)
Speedup
• Very aggressive core model
  • Still can’t tolerate all store miss latency
• Block mode slightly better than page mode
  • Page mode suffers cache pollution and less coverage
Program Phase Behavior
• Only benefits initialization program phase
• Some programs initialize throughout execution
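Because the optimization only speeds up the initialization phase, the whole-program gain is bounded in the way Amdahl's law describes. The sketch below applies that standard formula; framing the result this way is an assumption, not something the slides state, and the inputs in the note after the code are illustrative values, not measurements from the talk.

```python
def overall_speedup(init_fraction, init_speedup):
    """Amdahl's law: whole-program speedup when only the initialization
    phase (a fraction of baseline run time) gets faster."""
    if not 0.0 <= init_fraction <= 1.0:
        raise ValueError("init_fraction must be between 0 and 1")
    if init_speedup <= 0.0:
        raise ValueError("init_speedup must be positive")
    return 1.0 / ((1.0 - init_fraction) + init_fraction / init_speedup)
```

For example, if half of the run time were initialization and that phase ran twice as fast, `overall_speedup(0.5, 2.0)` gives about 1.33; with no initialization phase at all the result is 1.0 regardless of how effective the mechanism is.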
Conclusions
• Initializing writes
  • Cause 23% of all misses in 2MB L2
  • Avoid miss with block or page mode install
• Up to 41% performance improvement
  • Subject to initialization:computation ratio
• Tracking allocation ranges
  • Working set very small (4-8 entries; 64 for parser)
  • Forward/bidirectional/interleaved sweep enables range truncation
Acknowledgments
• Originated as course project:
  • Gordie Bell, Trey Cain, Kevin Lepak
• PHARMsim infrastructure
  • Lead developer: Trey Cain
• Financial and equipment support
  • IBM and Intel Corp
  • National Science Foundation
  • University of Wisconsin
Questions?
Backup Slides
Invalid Memory Traffic
• Invalid memory traffic: real data traffic that transfers invalid data
• Initializing store: initial write to a storage location that contains invalid data
Allocation Analysis
• Single dominant allocation vs.
• Numerous moderate allocations
Initialization of Heap
• 88% initialized by store miss
• Relatively little temporal reuse of freed memory
PharmSim Pipeline
Fetch -> Translate -> Decode -> Execute -> Mem -> Commit
• Substantially similar to IBM Power4
  • Some instructions “cracked” (1:2 expansion)
  • Others (e.g. lmw) expand into a microcode stream
• Mem stage
  • Interface to 2-level cache model
  • Sun Gigaplane XB snoopy MP coherence
  • Caches contain values, must remain coherent
• No cheating!
  • No “flat” memory model for reference/redirect
Machine Model
Unrealistically aggressive model to devalue the impact of store misses.
• 8-wide, 6-stage pipeline
• 8K entry combining predictor
• 128 RUU, 64 LSQ entries, 64 write buffers
• 256KB 4-way associative L1D cache
• 64KB 2-way associative L1I
• 2MB 4-way associative L2 unified cache
• All cache blocks are 64 bytes
• L2 latency is 10 cycles
• Memory latency is 70 cycles
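The cache parameters above determine the number of sets in each cache through the usual relation sets = capacity / (associativity x block size). The helper below is a small sketch of that arithmetic; `num_sets` is a hypothetical name, and capacities are assumed to be binary (1KB = 1024 bytes).

```python
def num_sets(size_bytes, ways, block_bytes=64):
    """Number of sets in a set-associative cache of the given geometry."""
    sets, remainder = divmod(size_bytes, ways * block_bytes)
    if remainder:
        raise ValueError("capacity is not a multiple of ways * block size")
    return sets


KB = 1024
MB = 1024 * KB

l1d_sets = num_sets(256 * KB, ways=4)  # 256KB 4-way L1D
l1i_sets = num_sets(64 * KB, ways=2)   # 64KB 2-way L1I
l2_sets = num_sets(2 * MB, ways=4)     # 2MB 4-way unified L2
```

With 64-byte blocks this works out to 1024 sets for the L1D, 512 for the L1I, and 8192 for the L2.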