


Presentation Transcript


  1. Prefetch Injection Based on Hardware Monitoring and Object Metadata • Ali Adl-Tabatabai • Rick Hudson • Mauricio Serrano • Sreenivas Subramoney • StarJIT Compiler Team • Programming Systems Lab

  2. Agenda • Introduction to the problem • Java/CLI on IPF technology overview • The ORP virtual machine • Dynamic profile-guided optimization (DPGO) • Pointer Compression • Mississippi Delta prefetch

  3. The Memory Problem

  4. Agenda • Introduction to the problem • Java/CLI on IPF technology overview • The ORP virtual machine • Dynamic profile-guided optimization (DPGO) • Pointer Compression • Mississippi Delta prefetch

  5. Research Platform SW • ORP research virtual machine for Java and CLI • Flexibility through interfaces: different JITs and GCs can be loaded at runtime • Includes high performance garbage collector • Optimizing StarJIT dynamic compiler • Supports aggressive optimizations, PMU-based dynamic profile-guided optimization, automated prefetching, synchronization optimizations • ORP’s performance is competitive with best commercial VMs

  6. Research Platform HW • Itanium 2 architecture • 64-bit processor and address space, many registers • In-order execution, so cache/TLB misses stall the pipeline • 4-processor commercial Itanium® 2 system • 1.5 GHz • 16 GB memory, 6 MB L3 cache

  7. The ORP GC • Intel’s high-performance research GC infrastructure • Supports many configurations • IA-32 & IPF • Java & CLI • Goals • Research infrastructure • Designed for rapid prototyping and high performance • Hardware research • Interface that supports various GC algorithms • Competitive performance

  8. SPEC* JBB2000 Benchmark • Java enterprise benchmark • Multithreaded application stresses multiprocessor systems • Measures sustained throughput (transactions per minute) over several minutes • Focus of task force to get Java/CLI performance up on IPF

  9. SPEC JBB2000 Performance Over Time

  10. Time Distribution (chart segments: 52.9%, 18.7%, 14%, 10%; segment labels not recoverable)

  11. SPECjbb Cycle Breakdown • Java code (app): 66.1% • Java code (lib): 15.6% • Others: 13.1% • StarJIT: 3.2% • GC: 2.0% • VM: 0.0% • Memset: 1% • Copying strings: 8% • Object allocation: 1.7%

  12. Key Software Technologies for IPF • Aggressive StarJIT optimizations • Dynamic profile-guided optimization (DPGO) • Leverage Itanium® Processor Family’s Performance Monitoring Unit (PMU) hardware to detect hot spots • Compressed pointers • Compress 64-bit pointers to 32 bits to reduce memory footprint • Object prefetching • Use the PMU and the GC to detect frequent cache misses • Use this information to allow the JIT to recompile and add prefetches effectively

  13. DPGO System Architecture • Profile manager • Directs profile collection • Processes & analyzes profile • Decides which methods to recompile & when • Passes processed profile between JIT & GC • StarJIT optimizes relevant methods • GC creates, computes and maintains global object properties • (diagram: VM, Profile Manager, GC, PMU Driver, StarJIT, profile, heap; JIT’ed and instrumented code feed optimized code)

  14. DPGO Compilation Model • 1st compilation: Fast & simple • Opt level 0 – minimal optimization (class initializers) • Opt level 1 – fast optimizations & profiling support • 2nd compilation: profile-guided optimization (opt level 2) • StarJIT IR edge profile annotations • Aggressive profile-guided optimizations • Load maps for hardware data cache miss profile • 3rd compilation: profile-guided data prefetch (opt level 3) • Optimizations from level 2 • Profile-guided object prefetch Profile-guided recompilation concentrates optimizations

  15. Agenda • Introduction to the problem • Java/CLI on IPF technology overview • The ORP virtual machine • Dynamic profile-guided optimization (DPGO) • Pointer Compression • Mississippi Delta prefetch

  16. Improving 64-Bit Java IPF Performance by Compressing Heap References • Ali Adl-Tabatabai • Jay Bharadwaj • Michal Cierniak • Marsha Eng • Jesse Fang • Brian Lewis • Brian Murphy • Jim Stichnoth • Programming Systems Lab, Intel Corporation

  17. Overview: our technique • Reduce memory footprint using pointer compression • A software technique for Java and CLI virtual machines • Uses cache more effectively • Compress 64-bit references to 32 bits • 64-bit references waste space when data fits in ~4GB • 11% performance improvement on SPEC JBB2000 Pointer compression improves performance by reducing cache and TLB misses

  18. How to compress • Pick a category of pointers to compress • Treat pointers as 32-bit offsets from the base of a memory area • Memory area is contiguous • Compress by subtracting the base • Uncompress by adding the base • Modify VM, GC, JIT • Widespread but shallow changes in VM, GC • Sophisticated changes to JIT
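The subtract/add arithmetic above can be sketched as follows (the base address is an illustrative value, not ORP's actual layout):

```python
HEAP_BASE = 0x6000_0000_0000  # illustrative base of a contiguous memory area

def compress(ptr):
    """Compress a 64-bit pointer into a 32-bit offset from the area base."""
    off = ptr - HEAP_BASE
    assert 0 <= off < (1 << 32), "pointer must lie within 4GB of the base"
    return off

def uncompress(off):
    """Restore the raw 64-bit pointer by adding the base back."""
    return HEAP_BASE + off

raw = HEAP_BASE + 0x1234
assert uncompress(compress(raw)) == raw
assert compress(HEAP_BASE) == 0  # an object at the base compresses to offset 0
```

The assert in `compress` is the invariant the VM and GC must guarantee: every pointer in the chosen category stays within 4GB of the base.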

  19. What to compress? • Memory areas (diagram): Heap O(1 GB), VTables O(100 KB), Code O(1 MB), VM misc. O(10 KB) • Pointer kinds: static method ptr, cached vtable ptr, constant string ptr, static reference field • Compress pointers stored in heap for biggest payoff (object references, vtable pointers)

  20. Compressed pointers (diagram: compressed heap and vtable references are decompressed by adding heap_base or vtable_base)

  21. Compressed object headers • Reduce object’s vtable pointer from 64 to 32 bits (vtable offset; upper half was unused) • Reduce synchronization header (thread ID, recursion, hashcode) to 32 bits • Maintains 8-byte alignment of first field
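A minimal sketch of how the two 32-bit halves might share one 64-bit header word; the exact field ordering here is an assumption for illustration, not the documented ORP layout:

```python
VTABLE_MASK = 0xFFFF_FFFF

def pack_header(vtable_off, sync):
    # Assumed layout: 32-bit vtable offset in the low word, 32-bit
    # synchronization word (thread ID / recursion / hashcode bit-fields)
    # in the high word.
    assert 0 <= vtable_off <= VTABLE_MASK and 0 <= sync <= VTABLE_MASK
    return (sync << 32) | vtable_off

def header_vtable_off(header):
    return header & VTABLE_MASK

def header_sync(header):
    return header >> 32

h = pack_header(0x40, 0x7)
assert header_vtable_off(h) == 0x40 and header_sync(h) == 0x7
```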

  22. Headers: VM Support • Ensure vtable space above and within 4GB of vtable_base • Compression: subtract vtable_base before writing the vtable pointer into a new object • Done once per object: the vtable pointer never changes • Decompression: add vtable_base to form the raw vtable pointer before dereferencing

  23. Headers: GC support • Mostly need to uncompress vtable pointers before dereferencing them • Vtable holds ref field offsets used by GC scan • Problem: During a collection, GC hijacks header for use as a forwarding pointer • Common trick in modern GC implementations • Solution: Compress the forwarding pointer

  24. Headers: Complexity • Simple, straightforward changes to VM, GC, and JIT • Excellent payoff: • 5% SPEC JBB2000 improvement with moderate work • Based on removing 8 bytes per object

  25. Compressed fields • Compress object references stored in heap • Reference-type fields in Java objects • Elements of reference arrays • (Optionally) static fields • Subtract heap_base before writing reference to heap • Add heap_base after loading compressed reference

  26. Compressed fields: null • Problem: NULL (i.e., 0) is not contiguous to the heap • Solution: new flavor of null • Managed null (64-bit heap_base) • Used in managed code • 32-bit zero when compressed • Can still use memset to clear large heap blocks • Platform null (64-bit zero) • Used in native C code • Redefine null to make the simple compression scheme work
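The managed-null trick can be sketched like this (base address and helper names are illustrative):

```python
HEAP_BASE = 0x6000_0000_0000      # illustrative base address
MANAGED_NULL = HEAP_BASE          # 64-bit null used in managed code
PLATFORM_NULL = 0                 # 64-bit zero used in native C code

def compress_ref(ptr):
    return ptr - HEAP_BASE        # managed null compresses to 32-bit zero

def uncompress_ref(off):
    return HEAP_BASE + off

def to_platform(ptr):
    # Translation performed on a managed -> native code transfer.
    return PLATFORM_NULL if ptr == MANAGED_NULL else ptr

# memset(0) on a heap block still yields valid compressed nulls:
assert compress_ref(MANAGED_NULL) == 0
assert uncompress_ref(0) == MANAGED_NULL
```

Because the compressed form of managed null is exactly zero, zero-filling freshly allocated heap blocks initializes every reference field to null for free.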

  27. Fields: VM support • Modify all field accesses in VM code • Add code to compress/decompress • Numerous occurrences in ORP • Translate between managed and platform null on managed ↔ native code transfers • Optional: maintain heap_base in a preserved register • Can reduce the number of instructions in the JIT-generated prolog for each method

  28. Fields: GC support • Handle compression • Object scanning code • Object moving code • Enumeration of compressed roots • Ensure heap space above and within 4GB of heap_base • Optionally align at a 4GB boundary to simplify compression • No need to subtract heap_base during compression: just store the low-order 32 bits
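With the heap base aligned on a 4GB boundary, compression degenerates to taking the low 32 bits; a sketch, assuming an example aligned base:

```python
HEAP_BASE = 0x1_0000_0000  # assumed: heap base aligned on a 4GB boundary

def compress_aligned(ptr):
    # No subtraction needed: the low-order 32 bits already are the offset.
    return ptr & 0xFFFF_FFFF

def uncompress_aligned(off):
    # With a 4GB-aligned base, OR-ing restores the full 64-bit address.
    return HEAP_BASE | off

raw = HEAP_BASE + 0x40
assert compress_aligned(raw) == 0x40
assert uncompress_aligned(0x40) == raw
```

Dropping the subtraction matters because the compression code runs on every reference store the GC or mutator performs.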

  29. Fields: JIT support • We added compressed reference to type system • Several optimizations are required to reduce compression and decompression overhead • Needed in both the global optimizer and the backend • Essential for achieving full performance

  30. Fields: complexity • Limited changes to GC • Similar in complexity to those for header compression • Moderate, widespread changes to VM • More complex than for header compression • Sophisticated changes to JIT • Much more complex than for header compression • Good payoff: 6% SPEC JBB2000 improvement

  31. 14% 13% 12% 11% 10% 9% 8% 7% 6% 5% 4% 3% 2% 1% 0% Speedup over baseline SPECJBB 2000 Compressed Compressed Both headers references

  32. Reduction in cycles per transaction (bar chart, thousands of cycles: savings and costs broken into D-cache stalls, DTLB stalls, I-cache and ITLB stalls, branch misprediction stalls, other stalls, and unstalled cycles, for compressed headers, compressed references, and both) • Memory performance gains outweigh de/compression costs

  33. Reduction in heap allocated and GCs (compared to baseline) • Compressed Headers: 14.4% less heap space allocated, 13.2% fewer GCs • Compressed References: 13.5% less heap space allocated, 14.3% fewer GCs • Both: 27.4% less heap space allocated, 25.4% fewer GCs • Less storage allocated → fewer GCs

  34. SPEC JVM98 Performance (bar chart: speedup over base for mtrt, db, compress, jess, mpegaudio, jack, and javac with compressed headers, compressed references, and both) • Header compression almost always helps • Field compression helps if there is enough pointer use

  35. Extensions • Increase the 4GB limit • Object addresses are 8-byte aligned in ORP • Shift by 3 bits when de/compressing → 32GB limit • Compress different kinds of pointers • Method pointers in vtables • Other VM data structures
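The 3-bit shift extension can be sketched as follows (illustrative base; relies on the 8-byte object alignment stated above):

```python
HEAP_BASE = 0x6000_0000_0000  # illustrative base address

def compress_shifted(ptr):
    off = ptr - HEAP_BASE
    assert off % 8 == 0             # objects are 8-byte aligned in ORP
    assert 0 <= off < (1 << 35)     # 32GB addressable after the shift
    return off >> 3                 # the shift packs the offset into 32 bits

def uncompress_shifted(c):
    return HEAP_BASE + (c << 3)

p = HEAP_BASE + 24 * 1024 * 1024 * 1024  # an object 24GB above the base
assert uncompress_shifted(compress_shifted(p)) == p
```

The three always-zero low bits of each aligned offset are traded for three extra high bits of range, raising the limit from 4GB to 32GB.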

  36. Compression Summary • A simple pointer compression scheme • Compress object pointers, vtable pointers in heap • Treat pointers as 32-bit offsets from the base of a memory area • Requires sophisticated JIT optimizations • Special treatment for null values • 11% improvement on SPEC JBB2000 • Compressed pointers reduce memory stalls

  37. Agenda • Introduction to the problem • Java/CLI on IPF technology overview • The ORP virtual machine • Dynamic profile-guided optimization (DPGO) • Pointer Compression • Mississippi Delta prefetch

  38. Mississippi Delta Prefetch • New software prefetch algorithms focusing on linked data structures • Integrates the hardware performance monitor, GC, and JIT to inject prefetches • Abstracts hardware cache miss sampling up to metadata • Leverages the GC to discover and maintain useful global properties • Evolves the GC into a memory hierarchy controller and optimizer • Implemented in a high performance, fully dynamic system and shows a 14% speedup

  39. Traversing Linked Data (diagram: Item → Item → String via brandInfo → Character Array via value, plotted against memory access time) • Data dependencies preclude prefetch

  40. Leverage Object Placement (diagram: the same Item → String → Character Array chain laid out at sequential memory addresses, with deltas D to String and D to Char Array) • Data placement enables prefetching

  41. Leveraging Metadata • Metadata graph • Nodes represent types • Edges represent reference fields or array reference elements connecting nodes • Edge annotations: • Miss latencies • Deltas between objects along edges • Goals • Find paths causing cache misses • Type level summary of linked data structure traversals • Traversals inducing high latency misses • Inject prefetches based on deltas along paths • Avoid reasoning about raw addresses

  42. Metadata Graph (red = cache line delta) • Item (name) → String: 0,1 line • Item (brandInfo) → String: 2,3 lines • String (value) → Char array: 0,1 line • String (value) → Char array: −1 line • SPEC JBB2000 fragment that causes high latency misses
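One possible in-memory representation of such a metadata graph, with edges annotated by miss latency and cache line delta; the class and the example latency values are hypothetical, not taken from ORP:

```python
from collections import defaultdict

class MetadataGraph:
    """Nodes are types; edges are reference fields (or array elements),
    annotated with sampled miss latencies and inter-object cache line deltas."""
    def __init__(self):
        # (src_type, field) -> list of (dst_type, latency_cycles, delta_lines)
        self.edges = defaultdict(list)

    def record_miss(self, src_type, field, dst_type, latency, delta_lines):
        self.edges[(src_type, field)].append((dst_type, latency, delta_lines))

g = MetadataGraph()
g.record_miss("Item", "brandInfo", "String", 200, 2)
g.record_miss("String", "value", "CharArray", 180, 1)
assert g.edges[("Item", "brandInfo")] == [("String", 200, 2)]
```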

  43. Object Prefetch Algorithm • Hardware Performance Monitoring Unit (PMU) samples raw cache misses • Abstract up to metadata • Discover high latency paths • Find deltas along paths • Recompile, inserting prefetches • (pipeline: PMU → Metadata → Paths → Deltas → Inject Prefetch)

  44. Sampling Cache Misses • HW PMU delivers samples: • IP of the load causing the miss • Effective address of the miss • Miss latency • Low overhead • SW instrumentation can’t tell if a load misses

  45. Abstract HW Samples • IP → delinquent load • Miss address → delinquent object • Delinquent objects → delinquent types • Tolerates partial or inaccurate data • Hardware can’t reason at the object or type level

  46. Discovering Delinquent Paths • Delinquent paths abstract high latency linked data structure traversals • Approximate the traversals causing latency • Discover edges along the paths • Piggyback on the GC mark phase traversal • Characterize edges the GC encounters • Glean global properties about edges • Use the characterization to estimate paths • Combine the edges into paths

  47. Edge Characterization • Delinquent Object → Delinquent Object: rare when using sampling, but an indication of a frequently used path • Delinquent Object → Delinquent Type: valuable, an indication of how you reach a delinquent type • Delinquent Type → Delinquent Object: less valuable, since many types can point to a single object • Delinquent Type → Delinquent Type: not useful, too many edges
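The four cases above can be expressed as a simple filter; the returned strings are shorthand for the judgments on the slide:

```python
def characterize(src_is_delinquent_obj, dst_is_delinquent_obj):
    # Endpoints that are not sampled delinquent objects are only known
    # at the coarser granularity of delinquent types.
    if src_is_delinquent_obj and dst_is_delinquent_obj:
        return "rare but strong"   # indication of a frequently used path
    if src_is_delinquent_obj:
        return "valuable"          # shows how a delinquent type is reached
    if dst_is_delinquent_obj:
        return "less valuable"     # many types can point to a single object
    return "not useful"            # type-to-type: too many edges

assert characterize(True, False) == "valuable"
```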

  48. Determining Delinquent Paths • Apply a filter based on edge characterization • Build larger paths recursively • Start with single edge paths • When a child type matches a parent type, combine into a longer path (list) • Combine paths with common bases (tree) • Base is the first type in a path • Example path: Item (brandInfo) → String (value) → Char Array

  49. Finding Deltas Along a Path • Process the set of delinquent objects • If a type matches the base of a delinquent path: • Traverse the path • Summarize deltas between the base and objects along the path • Useful deltas exist even without proactive placement by the GC
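A sketch of the delta summarization pass, with heap objects modeled as dicts carrying an illustrative address field (the field names mirror the SPEC JBB2000 example path):

```python
from collections import defaultdict

def summarize_deltas(base_objs, path_fields):
    """For each base object matching a delinquent path, walk the path's
    fields and record the address delta from the base to each object."""
    deltas = defaultdict(list)
    for base in base_objs:
        obj = base
        for field in path_fields:
            obj = obj.get(field)
            if obj is None:          # traversal ends at a null reference
                break
            deltas[field].append(obj["_addr"] - base["_addr"])
    return deltas

arr  = {"_addr": 0x1080}
s    = {"_addr": 0x1040, "value": arr}
item = {"_addr": 0x1000, "brandInfo": s}
assert summarize_deltas([item], ["brandInfo", "value"]) == \
    {"brandInfo": [0x40], "value": [0x80]}
```

In a real pass the per-field delta lists would be further summarized (e.g. by taking the dominant value) before being handed to the JIT.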

  50. Maintaining Deltas • GC must be delta aware • Allocation order placement • Frontier pointer allocation creates useful deltas • Sliding compaction algorithm maintains allocation order deltas • Various GC algorithms break/alter deltas • Expands the GC’s role into a memory hierarchy controller and optimizer
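What the recompiled code conceptually does with the learned deltas can be sketched as follows; `prefetch` stands in for the prefetch instruction the JIT would actually emit, and the delta values are illustrative:

```python
def inject_prefetches(base_addr, deltas, prefetch):
    # Once the base object's address is known, touch the cache lines the
    # upcoming traversal will need, hiding the dependent-load latency.
    for field, d in deltas.items():
        prefetch(base_addr + d)

issued = []
inject_prefetches(0x1000, {"brandInfo": 0x40, "value": 0x80}, issued.append)
assert issued == [0x1040, 0x1080]
```

The point of the deltas is precisely that these addresses can be computed from the base object alone, before the pointer chain has been followed.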
