
CODE LAYOUT AS A SOURCE OF NOISE IN JVM PERFORMANCE

This study examines the impact of code layout on JVM performance, specifically focusing on object layouts, GC algorithms, heap layouts, and hash codes. Results reveal counter-intuitive findings and possible reasons for the observed performance differences.



Presentation Transcript


  1. CODE LAYOUT AS A SOURCE OF NOISE IN JVM PERFORMANCE DAYONG GU, CLARK VERBRUGGE AND ETIENNE GAGNON {dgu1,clump}@cs.mcgill.ca egagnon@sablevm.org

  2. Outline • Original motivation • Counter-intuitive result • Possible reasons • Experimental design • The impact of code layout • Conclusions and future work

  3. Original motivation • SableVM • Optimize the garbage collector of SableVM • Three object layouts • Traditional, Bi-Directional and Mixed • Implement different algorithms on these layouts • Performance monitoring unit

  4. Object layouts • Traditional layout and Bi-Directional layout • Mixed layout is a hybrid layout • For reference arrays, use Traditional layout • For other objects and arrays, use Bi-Directional layout • (Figure: Traditional and Bi-Directional layouts, each showing the Header and fields A, B, C)

  5. Copying GC • The important work is tracing references • We designed different algorithms to do the tracing in different ways

  6. GC algorithms (Tr vs. Bi) • Tr: needs offset • Bi: needs check • (Figure: From_space and To_space for Tr (Traditional), with an offset table, and for Bi (Bi-Directional))

  7. GC algorithms (Bi vs. Mixed) • Tr: needs offset • Bi: needs check • Mixed: better for ref-arrays • (Figure: To_space for Bi and for Mixed, each holding a reference array)

  8. GC algorithms (Bp) • The main overhead of Bi is the “reference-header check” • We designed a new algorithm to eliminate this overhead • We call it the Backward-pointer (Bp) algorithm • Besides the checks, it also saves other work: • Some copies • Re-calculation of the skip length • Great for objects with many references • Can be used on both Bi-Directional and Mixed layouts • Summary: Tr: needs offset • Bi: needs check • Mixed: better for ref-arrays • Bp: saves work for objects with many references • (Figure: Bp, From_space and To_space)

  9. GC algorithms (RS) • Reference Section (RS) • From a new perspective: • Don’t trace each object • Trace each reference section • Save the start and end address of each section at the end of the to_space, in decreasing address order • Can skip many objects in one jump • Can be used on both Bi-Directional and Mixed layouts • Summary: Tr: needs offset • Bi: needs check • Mixed: better for ref-arrays • Bp: saves work for objects with many references • RS: saves work by skipping all objects without references • Intuitively: RS, Bp > Mixed > Bi > Tr • (Figure: RS, From_space and To_space)
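The bookkeeping described for RS (section bounds saved at the end of the to_space, growing toward lower addresses) can be illustrated with a toy model. This is only a sketch under stated assumptions, not SableVM's implementation: the `ToSpace` struct, `push_section`, and `scan_sections` are hypothetical names, the "heap" is a plain array of words, bounds are stored as word offsets instead of addresses, and the scan merely counts words instead of copying objects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a to_space whose high end doubles as a
 * section index. Not taken from SableVM. */
typedef struct {
    size_t *words;   /* backing storage for the toy heap */
    size_t  nwords;  /* total number of words in the heap */
    size_t  used;    /* index slots consumed at the high end */
} ToSpace;

/* Record one reference section [begin, end). The two bounds go into the
 * next two free slots, moving toward lower addresses. */
void push_section(ToSpace *ts, size_t begin, size_t end) {
    assert(begin < end && ts->used + 2 <= ts->nwords);
    ts->words[ts->nwords - 1 - ts->used++] = begin;
    ts->words[ts->nwords - 1 - ts->used++] = end;
}

/* Walk the index in the order sections were recorded and return the
 * number of words a scanner would visit. Everything outside the
 * recorded sections is skipped without being inspected. */
size_t scan_sections(const ToSpace *ts) {
    size_t scanned = 0;
    for (size_t i = 0; i < ts->used; i += 2) {
        size_t begin = ts->words[ts->nwords - 1 - i];
        size_t end   = ts->words[ts->nwords - 2 - i];
        scanned += end - begin;
    }
    return scanned;
}
```

In this model a 16-word heap with sections [2, 5) and [9, 10) costs 4 visited words instead of 16, which is the "skip many objects in one jump" idea in miniature. A real collector would also have to keep the index from colliding with newly copied objects, which the sketch does not model.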

  10. Performance monitoring unit • Advanced processors have a set of registers to measure hardware events • Examples of measurable events • Cache misses • Machine cycles • Instructions • Branch predictors, TLB, System stall … • Many tools have been developed to access performance monitoring unit • We use PCL

  11. Counter-intuitive result for GC • (Chart not reproduced; annotation on the chart: “NO GC!”)

  12. GC result • We found the reasons for the counter-intuitive GC result • Bi costs more instructions than Tr • Mixed costs slightly more instructions than Tr • Bp and RS: • Do save some instructions on our benchmarks • Bp: up to 10% • RS: up to 7% • In general, the number of references per object is small in these benchmarks • Most objects have reference fields in these benchmarks • Cost many more data cache misses than Tr • But there are mysterious performance changes in the mutator • All code changes are in the GC part

  13. Mutator changes

  14. Possible reasons • Where are the differences at the source code level? • Object layout • GC algorithm • Heap layout • Hash code

  15. Heap layout • Traditional: • ABC • From Super to Sub • Bi-Directional: • CBC • From Sub to Super • (Figure: Traditional and Bi-Directional heap layouts with objects A, B, C, D; variants labeled Tro, BPR, RSR)

  16. Possible reasons • Where are the differences at the source code level? • Object layout • GC algorithm • Heap layout • Hash code • A series of experiments is designed to test the impact of each possible reason

  17. Hash code cannot be the reason • We find immediately that the “Hash code” cannot have a significant effect • Most of the benchmarks (except javac) use a hash code for exactly one object • For javac, only 0.3% of objects need a hash code

  18. Impact of different algorithms • New GC algorithms vs. the simple GC algorithms on the Bi-Directional and Mixed object layouts

  19. Impact of different object layouts

  20. Impact of different heap layouts • Same algorithm, same object layout

  21. Other factors? • We cannot get a consistent result and cannot identify a root cause • Noise? • We have made a large effort to reduce possible noise: • We do all measurements on an isolated machine, single user, single application • We run each benchmark 10 times and take the average of the median (middle) 4 results • We add an extra pre-execution to eliminate the cold-start effect
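One plausible reading of "the average of the median 4" is a trimmed mean: sort the 10 runs, drop the 3 lowest and 3 highest, and average the 4 in the middle. The sketch below assumes that reading; `middle_average` is a hypothetical helper, not something from the slides, and the 64-sample cap is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort callback: ascending order of doubles. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Average of the `keep` middle values among `n` samples.
 * Assumes keep <= n <= 64 and that (n - keep) is even, so the same
 * number of low and high outliers is discarded. The input array is
 * left untouched; a sorted copy is used instead. */
double middle_average(const double *samples, size_t n, size_t keep) {
    double sorted[64];
    assert(n <= 64 && keep > 0 && keep <= n && (n - keep) % 2 == 0);
    memcpy(sorted, samples, n * sizeof sorted[0]);
    qsort(sorted, n, sizeof sorted[0], cmp_double);

    size_t skip = (n - keep) / 2;   /* outliers dropped on each side */
    double sum = 0.0;
    for (size_t i = skip; i < skip + keep; i++) {
        sum += sorted[i];
    }
    return sum / (double)keep;
}
```

With made-up run times {105, 99, 101, 250, 100, 98, 102, 100, 103, 10}, the two extreme values 250 and 10 are among those discarded, and `middle_average(runs, 10, 4)` averages 100, 100, 101 and 102.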

  22. Other factors? • The measurement of hardware counters is very precise • The standard deviation normalized to the average value is no more than 0.0008 for all measured events • The result is stable and reproducible • Clearly, other unexpected sources of noise exist! • We turn to study the binary level differences
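The stability criterion above (standard deviation normalized to the average, at most 0.0008) can be checked as follows. This is a sketch under assumptions: the slides do not say whether the population or the sample standard deviation was used, so the population form (divide by n) is assumed here, and `is_stable` is a hypothetical name. Squared quantities are compared so that no square root is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when stddev(x) / mean(x) <= max_ratio, otherwise 0.
 * Assumption: population variance (divide by n). The mean must be
 * positive, which holds for event counts. Both sides of the
 * comparison are squared, so sqrt() is not required. */
int is_stable(const double *x, size_t n, double max_ratio) {
    assert(n > 0 && max_ratio >= 0.0);

    double mean = 0.0;
    for (size_t i = 0; i < n; i++) {
        mean += x[i];
    }
    mean /= (double)n;
    assert(mean > 0.0);

    double var = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - mean;
        var += d * d;
    }
    var /= (double)n;

    double limit = max_ratio * mean;   /* largest acceptable stddev */
    return var <= limit * limit;
}
```

For example, counts of {1000, 1001, 999, 1000} have a variance of 0.5 against a squared limit of about 0.64, so they pass at 0.0008, while {1000, 1010, 990, 1000} do not.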

  23. Change code from RS to RSR • RSR is a variation of RS; the only difference is the scan order, which may cause a different heap layout after a GC • We change the code from RS to RSR step by step and get a series of similar versions • Test on mpegaudio (no GC) • The differences are trivial • The code is not executed! • Since there is no GC, the heap layout is not changed at all • All the versions should give the same performance • But that is not the case! • Code annotations on the slide: “Obtain sections”, “Scan references”

  CODE of RS and RSR0:

    curInd = RefIndHead = end_of_heap;
    …
    Find a reference section S;
    *RefIndHead-- = S.Begin_Address;
    *RefIndHead-- = S.End_Address;
    …
    while (to_head < to_tail) {
      if (to_tail < RefIndHead) {
        to_head = *curInd--;
        while (to_head < *curInd) {
          *to_head++ = copy_obj(to_head, &to_tail);
        }
        …
      }
      Normal Algorithm;
    }

  CODE of RSR1:

    curInd = RefIndHead = end_of_heap;
    …
    Find a reference section S;
    *RefIndHead-- = S.End_Address;    /* changed: end address stored first */
    *RefIndHead-- = S.Begin_Address;
    …
    while (to_head < to_tail) {
      if (to_tail < RefIndHead) {
        to_head = *curInd--;
        while (to_head < *curInd) {
          *to_head++ = copy_obj(to_head, &to_tail);
        }
        …
      }
      Normal Algorithm;
    }

  CODE of RSR2:

    curInd = RefIndHead = end_of_heap;
    …
    Find a reference section S;
    *RefIndHead-- = S.End_Address;
    *RefIndHead-- = S.Begin_Address;
    …
    while (to_head < to_tail) {
      if (to_tail < RefIndHead) {
        to_head = *curInd--;
        while (to_head >= *curInd) {    /* changed: scan downward */
          *to_head-- = copy_obj(to_head, &to_tail);
        }
        …
      }
      Normal Algorithm;
    }

  CODE of RSR3:

    curInd = RefIndHead = end_of_heap;
    …
    Find a reference section S;
    *RefIndHead-- = S.End_Address;
    *RefIndHead-- = S.Begin_Address;
    …
    while (to_head < to_tail) {
      Define a local sec_start;         /* changed: new local, not yet used */
      if (to_tail < RefIndHead) {
        to_head = *curInd--;
        while (to_head >= *curInd) {
          *to_head-- = copy_obj(to_head, &to_tail);
        }
        …
      }
      Normal Algorithm;
    }

  CODE of RSR4:

    curInd = RefIndHead = end_of_heap;
    …
    Find a reference section S;
    *RefIndHead-- = S.End_Address;
    *RefIndHead-- = S.Begin_Address;
    …
    while (to_head < to_tail) {
      Define a local sec_start;
      if (to_tail < RefIndHead) {
        to_head = *curInd--;
        sec_start = to_head;            /* changed: remember section start */
        while (to_head >= *curInd) {
          *to_head-- = copy_obj(to_head, &to_tail);
        }
        …
      }
      Normal Algorithm;
    }

  CODE of RSR5 and real RSR:

    curInd = RefIndHead = end_of_heap;
    …
    Find a reference section S;
    *RefIndHead-- = S.End_Address;
    *RefIndHead-- = S.Begin_Address;
    …
    while (to_head < to_tail) {
      Define a local sec_start;
      if (to_tail < RefIndHead) {
        to_head = *curInd--;
        sec_start = to_head;
        while (to_head >= *curInd) {
          *to_head-- = copy_obj(to_head, &to_tail);
        }
        …
      }
      to_head = sec_start;              /* changed: restore to_head */
      Normal Algorithm;
    }

  24. Result of RSRSR • In general, the performance is different • But some variations share similar performance • We check the code layout (offside of each method in the binaries of sablevm executable and sablevm library ) • We find there are close relations between the non-trivial hardware performance changes and the trivial code layout difference • The differences in code layout must impact performance • How much it could be? • Would it make the measurement of performance for other techniques less credible?

  25. Experiment for code layout • Pick the RS version and do not change any meaningful code • Only shift the code layout • Starting from the base version, insert an extra string near the beginning • Force the code to shift by offsets from 2 bytes to 128 bytes

  26. Experiment for code layout • Use the benchmark mpegaudio • No GC • All execution is in the mutator • Only the code layout can change the performance here • Test on a Pentium III processor • 32-byte instruction cache lines • Fetches two lines at a time
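To see why a shift of a few bytes could matter on such hardware, it helps to count how many aligned blocks a fixed-size piece of code touches as its start address moves. The sketch below is a deliberately simplified model and an assumption of this note, not the authors' analysis: it ignores associativity, prefetching and everything else about the real cache. The address 0x1000 and the 64-byte region are made up, and `blocks_touched` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes taken from the slide: 32-byte lines, two lines fetched at a time. */
enum { LINE_BYTES = 32, FETCH_BYTES = 2 * LINE_BYTES };

/* Number of aligned blocks of `block` bytes that the byte range
 * [start, start + len) overlaps. */
size_t blocks_touched(uintptr_t start, size_t len, size_t block) {
    assert(len > 0 && block > 0);
    uintptr_t first = start / block;              /* block holding the first byte */
    uintptr_t last  = (start + len - 1) / block;  /* block holding the last byte */
    return (size_t)(last - first + 1);
}
```

In this model a 64-byte region starting at 0x1000 sits in one 64-byte fetch block, the same region shifted by 2 bytes straddles two, and a shift of exactly 64 bytes restores the original alignment. That is consistent with a pattern that repeats every 64 = 32 × 2 bytes of offset, although the model alone does not establish the cause.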

  27. Code layout result (I-Cache miss) • (Plot: misses vs. Δ-offset; annotations: 37%, 64 = 32 × 2)

  28. Code layout result (machine cycle) • (Plot: cycles vs. Δ-offset; annotations: 2.7%, 64)

  29. Code layout result (D-Cache miss) • (Plot: misses vs. Δ-offset; annotation: 20%)

  30. Result • Trivial changes in code layout can contribute a large difference in performance, from our data: • up to 2.7% for machine cycles, with a recurring pattern • up to 37% for instruction cache misses, with a very clear recurring pattern • up to 20% for data cache misses • The changes are correlated to the hardware structure

  31. Conclusions • We have shown the clear impact of code layout on benchmarks • These effects are non-trivial and can perturb intended measurements • They should be taken into account in order to achieve accurate benchmarking • A better code layout can potentially improve performance

  32. Future work • In order to make precise measurements of particular techniques, we should remove the impact of code layout • A binary editor is needed • Or, a flexible, smart linker • We are exploring techniques to take advantage of the impact of code layout • Compile time, link time and runtime optimization techniques can be used • Some link time optimizations do exist, but are not commonly used

  33. References • SableVM: http://www.sablevm.org • PCL: http://www.fz-juelich.de/zam/PCL/
