330 likes | 343 Views
This study examines the impact of code layout on JVM performance, specifically focusing on object layouts, GC algorithms, heap layouts, and hash codes. Results reveal counter-intuitive findings and possible reasons for the observed performance differences.
E N D
CODE LAYOUT AS A SOURCE OF NOISE IN JVM PERFORMANCE DAYONG GU, CLARK VERBRUGGE AND ETIENNE GAGNON {dgu1,clump}@cs.mcgill.ca egagnon@sablevm.org
Outline • Original motivation • Counter-intuitive result • Possible reasons • Experimental design • The impact of code layout • Conclusions and future work
Original motivation • SableVM • Optimize the garbage collector of SableVM • Three object layouts • Traditional, Bi-Directional and Mixed • Implement different algorithms on these layouts • Performance monitoring unit
Object layouts Traditional Layout Bi-Directional Layout C • Mixed layout is a hybrid layout • For Reference arrays, use Traditional layout • For other objects and arrays, use Bi-Directional layout C B A Header B A Header
Copying GC • The important work is tracing references • We designed different algorithms to do the tracing in different ways
Tr: need offset GC algorithms (Tr Vs Bi) Bi: need check Tr (Traditional) Bi (Bi-Directional) From_space To_space From_space To_space A A Offset table
Tr: need offset GC algorithms (Bi Vs Mixed) Bi: need check Mixed Bi Mixed: better for ref-array To_space To_space Reference Array A Reference Array A
Tr: need offset GC algorithms (Bp) Bi: need check Bp Mixed: better for ref-array To_space From_space Bp: save work for object with many references • The main overhead of Bi is the “reference-header check” • We designed a new algorithm to eliminate this overhead • We call it as Backward- pointer (Bp) algorithm • Other than the “check”s, it also saves many other works: • Some copies • Re-calculation of the skip length • Great for objects with many references • Can be used on both Bi-Directional and Mixed layouts five A
Tr: need offset GC algorithms (RS) Bi: need check Rs • Reference Section (RS) • From a new perspective: • Don’t trace each object • Trace each reference section • Save the start and end address of each section in the end of the to_space, in address decreasing style • Can skip many objects in one jump • Can be used on both Bi-Directional and Mixed layouts Mixed: better for ref-array From_space To_space … … Bp: save work for object with many references RS: save work by skipping all objects without references B … Intuitively: RS, Bp > Mixed > Bi > Tr A
Performance monitoring unit • Advanced processors have a set of registers to measure hardware events • Examples of measurable events • Cache misses • Machine cycles • Instructions • Branch predictors, TLB, System stall … • Many tools have been developed to access performance monitoring unit • We use PCL
GC result • We found the reasons for the counter-intuitive result of GC • Bi costs more instructions than Tr • Mixed costs slightly more instructions than Tr • Bp and RS: • Do save some instructions on our benchmarks • Bp: up to 10% • RS: up to 7% • In general, the number of references in object is small in these benchmarks • Most objects have reference fields in these benchmarks • Cost many more data cache misses than Tr • But there are mysterious performance changes in the mutator • All code changes are in GC part
Possible reasons • Where are the differences at the source code level • Object layout • GC algorithm • Heap layout • Hash code
Heap layout Traditional Bi-Directional • Traditional: • ABC • From SuperSub • Bi-Directional • CBC • From SubSuper B C D A B D C C D B Tro BPR RSR
Possible reasons • Where are the differences at the source code level • Object layout • GC algorithm • Heap layout • Hash code • A series of experiments are designed to test the impact of each of the possible reasons
Hash code cannot be the reason • Immediately, we find the “Hash code” cannot have a significant effect • Most of the benchmarks (except javac) only use a hash code for exactly one object • For javac, only 0.3% objects need a hashcode
Impact of different algorithms • New GC algorithms Vs the simple GC algorithms in Bi_Directional and Mixed object layout
Impact of different heap layout • Same algorithm, same object layout
Other factors? • We cannot get a consistent result, and cannot find which one is a root cause • Noise? • We have made large effort to reduce the possible noise: • We do all measurements on an isolated machine, single user, single application • We run each benchmark 10 times and get the average of the median 4 • We have an extra pre-execution to eliminate the cold-start effect
Other factors? • The measurement of hardware counters is very precise • The standard deviation normalized to the average value is no more than 0.0008 for all measured events • The result is stable and reproducible • Clearly, other unexpected sources of noise exist! • We turn to study the binary level differences
Change code from RSRSR CODE of RS and RSR0: curInd = RefIndHead = end_of_heap; … Find a reference section S ; RefIndHead-- = S.Begin_Address; RefIndHead-- = S.End_Address; … While ( to_head < to_tail) { if ( to_tail < RefIndHead ) { to_head = *curInd --; While (to_head < *curInd ) { *to_head++ = copy_obj ( to_head, &to_tail); } … } Normal Algorithm; } CODE of RSR1: curInd = RefIndHead = end_of_heap; … Find a reference section S ; RefIndHead-- = S.End_Address; RefIndHead-- = S.Begin_Address; … While ( to_head < to_tail) { if ( to_tail < RefIndHead ) { to_head = *curInd --; While (to_head < *curInd ) { *to_head++ = copy_obj ( to_head, &to_tail); } … } Normal Algorithm; } CODE of RSR2: curInd = RefIndHead = end_of_heap; … Find a reference section S ; RefIndHead-- = S.End_Address; RefIndHead-- = S.Begin_Address; … While ( to_head < to_tail) { if ( to_tail < RefIndHead ) { to_head = *curInd --; While (to_head>= *curInd ) { *to_head-- = copy_obj ( to_head, &to_tail); } … } Normal Algorithm; } CODE of RSR3: curInd = RefIndHead = end_of_heap; … Find a reference section S ; RefIndHead-- = S.End_Address; RefIndHead-- = S.Begin_Address; … While ( to_head < to_tail) { Define a local sec_start; if ( to_tail < RefIndHead ) { to_head = *curInd --; While (to_head>= *curInd ) { *to_head-- = copy_obj ( to_head, &to_tail); } … } Normal Algorithm; } CODE of RSR4: curInd = RefIndHead = end_of_heap; … Find a reference section S ; RefIndHead-- = S.End_Address; RefIndHead-- = S.Begin_Address; … While ( to_head < to_tail) { Define a local sec_start; if ( to_tail < RefIndHead ) { to_head = *curInd --; sec_start = to_head; While (to_head>= *curInd ) { *to_head-- = copy_obj ( to_head, &to_tail); } … } Normal Algorithm; } CODE of RSR5 and real RSR: curInd = RefIndHead = end_of_heap; … Find a reference section S ; RefIndHead-- = S.End_Address; RefIndHead-- = S.Begin_Address; … While ( to_head < to_tail) { Define a local sec_start; if ( to_tail < RefIndHead ) { to_head = *curInd --; sec_start = to_head; While (to_head>= *curInd ) { *to_head-- = copy_obj ( to_head, &to_tail); } … } to_head = sec_start; Normal Algorithm; } Obtain sections • RSR is a variation of RS, the only differences is the scan order which may cause different heap layout after a GC • We change code from RS to RSR step by step and get a series of similar versions • Test on mpegaudio (no GC) • The differences are trivial • The code is not executed! Scan references • Since there is no GC, the heap layout is not changed at all • All the versions should give same performance • But, that is not the case!
Result of RSRSR • In general, the performance is different • But some variations share similar performance • We check the code layout (offside of each method in the binaries of sablevm executable and sablevm library ) • We find there are close relations between the non-trivial hardware performance changes and the trivial code layout difference • The differences in code layout must impact performance • How much it could be? • Would it make the measurement of performance for other techniques less credible?
Experiment for code layout • Pick the RS version, not change any actual meaningful code • Only shift the code layout • From the base version, insert extra string in the beginning part • Force the code to shift with offset from 2 bytes to 128 bytes
Experiment for code layout • Use the benchmark mpegaudio • No GC • All the execution is in mutator part • Only the code layout has chance to change the performance here • Test on Pentium III processor • 32 bytes long instruction cache line • Fetch two lines in one time
Code layout result (I-Cache miss) Misses 37% 64 = 32 *2 Δ -Offset
Code layout result (machine cycle) Cycles 2.7% 64 Δ -Offset
Code layout result (D-Cache miss) Misses 20% Δ -Offset
Result • Trivial changes in code layout can contribute a large difference in performance, from our data: • up to 2.7% for machine cycles, with a recurring pattern • up to 37% for instruction cache misses, with a very clear recurring pattern • up to 20% for data cache misses • The changes are correlated to the hardware structure
Conclusions • We have shown the obvious impact of code layout effects on benchmarks • These effects are non-trivial and can perturb intended measurements • Should be taken into account in order to achieve accurate benchmarking • A better code layout can potential improve performance
Future work • In order to make precise measurements of particular techniques, we should remove the impact of code layout • A binary editor is needed • Or, a flexible, smart linker • We are exploring techniques to take advantage of the impact of code layout • Compile time, link time and runtime optimization techniques can be used • Some link time optimizations do exist, but are not commonly used
References: • SableVM : • http://www.sablevm.org • PCL: • http://www.fz-juelich.de/zam/PCL/